Unit 1
1.1 Basic definitions
Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.
In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which work
on our instructions. But can a machine also learn from experiences or past data like a human
does? So here comes the role of Machine Learning.
Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own.
With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together to create predictive models. It constructs or uses algorithms that learn from historical data: the more information we provide, the better the performance.
A machine has the ability to learn if it can improve its performance by gaining more
data.
A Machine Learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data: a large amount of data helps to build a better model that predicts the output more accurately.
Suppose we have a complex problem in which we need to perform some predictions. Instead of writing code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms the machine builds the logic based on the data and predicts the output. Machine learning has changed our way of thinking about such problems.
The need for machine learning is increasing day by day. The reason is that machine learning is capable of doing tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot process huge amounts of data manually, so we need computer systems, and this is where machine learning makes things easy for us.
We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct the models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.
The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestion by Facebook, and so on. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interest and recommend products accordingly.
Machine learning can be broadly divided into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
The system creates a model using labeled data to understand the datasets and learn about each data point. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output or not.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, in the same way that a student learns under the supervision of a teacher. An example of supervised learning is spam filtering.
Supervised learning can be further divided into two categories of algorithms:
o Classification
o Regression
2) Unsupervised Learning
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group
of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which an agent learns by performing actions and receiving rewards for good actions and penalties for bad ones. A robotic dog that automatically learns the movement of its arms is an example of reinforcement learning.
Example: Let's understand the idea of a hypothesis with a common example. Some scientists claim that ultraviolet (UV) light can damage the eyes, and that it may therefore also cause blindness.
In this example, the scientists claim only that UV rays are harmful to the eyes, but we assume that they may cause blindness. This may or may not turn out to be true. Such assumptions are called hypotheses.
The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is
specifically used in Supervised Machine learning, where an ML model learns a function that
best maps the input to corresponding outputs with the help of an available dataset.
In supervised learning techniques, the main aim is to determine the possible hypothesis out of
hypothesis space that best maps input to the corresponding or correct outputs.
There are some common methods given to find out the possible hypothesis from the
Hypothesis space, where hypothesis space is represented by uppercase-h (H) and hypothesis
by lowercase-h (h). These are defined as follows:
Hypothesis space (H):
Hypothesis space is defined as a set of all possible legal hypotheses; hence it is also known
as a hypothesis set. It is used by supervised machine learning algorithms to determine the
best possible hypothesis to describe the target function or best maps input to output.
It is often constrained by choice of the framing of the problem, the choice of model, and the
choice of model configuration.
Hypothesis (h):
It is defined as the approximate function that best describes the target in supervised machine
learning algorithms. It is primarily based on data as well as bias and restrictions applied to
data.
Hence hypothesis (h) can be concluded as a single hypothesis that maps input to proper
output and can be evaluated as well as used to make predictions.
The hypothesis (h) can be formulated in machine learning as follows:
Y = mx + b
Where,
Y: the predicted output (range)
m: the slope of the line that divides the data, i.e. the change in y divided by the change in x
x: the input (domain)
b: the intercept
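As a small illustration, a single hypothesis from this space can be written as a plain Python function; the slope and intercept values below are hypothetical, chosen only for the example:

```python
# One hypothesis h from the hypothesis space H of all lines y = m*x + b.
# The particular values of m and b below are hypothetical, for illustration only.

def make_hypothesis(m: float, b: float):
    """Return a hypothesis h(x) = m*x + b for the given slope and intercept."""
    def h(x: float) -> float:
        return m * x + b
    return h

# A single concrete hypothesis out of the (infinite) space of all lines.
h = make_hypothesis(m=2.0, b=1.0)
print(h(3.0))  # predicted output for input x = 3.0 -> 7.0
```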
Example: Let's understand the hypothesis (h) and hypothesis space (H) with a two-dimensional coordinate plane showing a distribution of data points. Given some test data, an ML algorithm predicts the output for each input by dividing the coordinate plane in a way that best separates the outputs. However, based on the data, the algorithm, and the constraints, the coordinate plane can be divided in several different valid ways.
Hypothesis space (H) is the collection of all legal possible ways to divide the coordinate plane so that inputs are best mapped to the proper outputs. Each individual possible way (each individual dividing boundary) is called a hypothesis (h).
Hypothesis in Statistics
Unlike in machine learning, a hypothesis in statistics cannot simply be accepted: it is a proposed, as-yet-unconfirmed result that is assessed using probability. Before starting work on an experiment, we must be aware of two important types of hypotheses:
o Null Hypothesis: A null hypothesis is a type of statistical hypothesis which states that no statistically significant effect exists in the given set of observations. It is also known as a conjecture and is used in quantitative analysis to test theories about markets, investment, and finance to decide whether an idea is true or false.
o Alternative Hypothesis: An alternative hypothesis is a direct contradiction of the null hypothesis, which means that if one of the two hypotheses is true, then the other must be false. In other words, an alternative hypothesis is a type of statistical hypothesis which states that some significant effect exists in the given set of observations.
Significance level
The significance level is the primary thing that must be set before starting an experiment. It defines the tolerance for error, i.e. the level at which an effect can be considered significant. In practice a 95% confidence level is commonly used, which corresponds to a 5% significance level; effects this unlikely under the null hypothesis are treated as significant. The significance level also gives the critical or threshold value: for example, if the confidence level is set to 98%, then the significance level (the critical value for the p-value) is 0.02.
P-value
The p-value in statistics is defined as the evidence against a null hypothesis. In other words, the p-value is the probability that random chance alone would generate data at least as extreme as that observed, assuming the null hypothesis is true.
The smaller the p-value, the stronger the evidence against the null hypothesis, which means the null hypothesis can be rejected in testing. It is always represented in decimal form, such as 0.035.
Whenever a statistical test is carried out on a population or sample, the resulting p-value is compared with the critical value (the significance level). If the p-value is less than the critical value, the effect is significant and the null hypothesis can be rejected. If it is higher than the critical value, there is no significant effect and we fail to reject the null hypothesis.
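A minimal sketch of this decision rule, assuming SciPy is available; the samples and the 5% significance level are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two made-up samples; under the null hypothesis their means are equal.
sample_a = rng.normal(loc=10.0, scale=2.0, size=50)
sample_b = rng.normal(loc=10.8, scale=2.0, size=50)

alpha = 0.05  # significance level (5%), i.e. a 95% confidence level
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```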
Inductive Bias
In this section, we'll have a look at what inductive bias is and how it helps a machine make better decisions.
We also need to check whether the hypothesis we obtained from the algorithm is actually correct or not, and to make decisions such as which training examples the machine should learn from next.
Let’s have a look at what is Inductive and Deductive learning to understand more
about Inductive Bias.
Inductive Learning:
This basically means learning from examples, learning on the go.
We are given input samples (x) and output samples (f(x)) in the context of inductive
learning, and the objective is to estimate the function (f). The goal is to generalize
from the samples and map such that the output may be estimated for fresh samples in
the future.
In practice, estimating the function exactly is nearly always too difficult, so we settle for very good approximations of the function.
Deductive Learning:
Learners are initially exposed to concepts and generalizations, followed by particular
examples and exercises to aid learning.
Already existing rules are applied to the training examples.
If all candidate generalizations consistent with the training examples were treated equally, that is, without any bias in the sense of a preference for certain forms of generalization (representing prior information about the target function to be learned), predictions for new scenarios could not be formed.
The idea of inductive bias is to let the learner generalize beyond the observed training examples so that it can infer outputs for new examples.
‘ > ’ -> Inductively inferred from.
For example,
x > y means y is inductively deduced from x.
Maximum margin: While creating a border between two classes, try to make the boundary
as wide as possible. In support vector machines, this is the bias. The idea is that distinct
classes are usually separated by large gaps.
Minimum hypothesis description length: When constructing a hypothesis, try to keep the
description as short as possible. Simpler theories are seen to be more likely to be correct.
Occam’s razor does not suggest this. Simpler models are easier to test, not necessarily “more
likely to be true.” See the principle of Occam’s Razor.
Minimum features: features should be removed unless there is strong evidence that they are
helpful. Feature selection methods are based on this premise.
Nearest neighbors: Assume that the majority of the examples in a small neighborhood in feature space belong to the same class.
If the class of a case is unknown, assume that it belongs to the same class as the majority of the examples in its immediate neighborhood. The k-nearest neighbors algorithm employs this bias: cases that are close to each other are assumed to belong to the same class.
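A minimal sketch of this bias in practice, assuming scikit-learn and a handful of made-up 2-D points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D points: two loose clusters labelled 0 and 1.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# k-NN assumes a new case belongs to the same class as the majority of its neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[2, 2], [9, 9]]))  # expected: [0 1]
```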
1.4 Evaluation
Machine Learning Model Evaluation
Model evaluation is the process of using metrics to analyze the performance of a model. Model development is a multi-step process, and a check should be kept on how well the model generalizes to future predictions. Evaluating a model therefore plays a vital role in judging its performance.
Evaluation also helps to analyze a model's key weaknesses. There are many metrics, such as Accuracy, Precision, Recall, F1 score, Area Under Curve, Confusion Matrix, and Mean Squared Error. Cross-validation is a technique that is followed during the training phase, and it is a model evaluation technique as well.
Cross Validation and Holdout
Cross-validation is a method in which we do not use the whole dataset for training; some part of the dataset is reserved for testing the model. There are many types of cross-validation, of which K-Fold Cross-Validation is most widely used. In K-Fold Cross-Validation the original dataset is divided into k subsets, known as folds. The procedure is repeated k times, where in each iteration one fold is used for testing and the remaining k-1 folds are used for training the model. So each data point acts as a test subject for the model as well as a training subject. This technique tends to generalize the model well and reduces the error rate.
Holdout is the simplest approach. It is used with neural networks as well as with many other classifiers. In this technique, the dataset is divided into train and test sets, usually in a ratio such as 70:30 or 80:20: a large portion of the data is used for training the model and a small portion is used for testing it.
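A minimal holdout sketch, assuming scikit-learn and an 80:20 split on the iris data used elsewhere in this unit:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing; the remaining 80% is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```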
Evaluation Metrics for Classification Task
As an example, consider the iris dataset, which has features such as the length and width of sepals and petals; the target values are Iris setosa, Iris virginica, and Iris versicolor. After importing the dataset, we divide it into train and test sets in the ratio 80:20, train a decision tree, perform the prediction, and then calculate the accuracy score, precision, recall, and F1 score, along with the confusion matrix.
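The original code is not reproduced in this unit; the following is a minimal sketch of the steps described above, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Load the iris dataset (sepal/petal length and width; three target species).
X, y = load_iris(return_X_y=True)

# 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a decision tree and predict on the held-out test set.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluation metrics (macro-averaged, since there are three classes).
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1 score :", f1_score(y_test, y_pred, average="macro"))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```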
Confusion Matrix
A confusion matrix is an N x N matrix, where N is the number of target classes. It compares the actual target values with the values predicted by the model. Some terminologies in the matrix are as follows:
True Positives: It is also known as TP. It is the output in which the actual and the
predicted values are YES.
True Negatives: It is also known as TN. It is the output in which the actual and the
predicted values are NO.
False Positives: It is also known as FP. It is the output in which the actual value is NO
but the predicted value is YES.
False Negatives: It is also known as FN. It is the output in which the actual value is
YES but the predicted value is NO.
Precision and Recall
Precision is the ratio of true positives to the summation of true positives and false positives.
It basically analyses the positive predictions.
Precision = TP/(TP+FP)
The drawback of Precision is that it does not consider the True Negatives and False
Negatives.
Recall is the ratio of true positives to the summation of true positives and false negatives. It
basically analyses the number of correct positive samples.
Recall = TP/(TP+FN)
The drawback of Recall is that it does not consider the False Positives, so optimizing for recall alone can lead to a higher false positive rate.
F1 score
The F1 score is the harmonic mean of precision and recall. It is seen that during the
precision-recall trade-off if we increase the precision, recall decreases and vice versa. The
goal of the F1 score is to combine precision and recall.
F1 score = (2×Precision×Recall)/(Precision+Recall)
Accuracy
Accuracy is defined as the ratio of the number of correct predictions to the total number of
predictions. This is the most fundamental metric used to evaluate the model. The formula is
given by
Accuracy = (TP+TN)/(TP+TN+FP+FN)
However, Accuracy has a drawback: it does not perform well on an imbalanced dataset. Suppose a model simply predicts the majority class for most of the data; it yields high accuracy, but in general such a model cannot classify the minority class labels and has poor performance.
AUC-ROC Curve
AUC (Area Under Curve) is an evaluation metric that is used to analyze the classification
model at different threshold values. The Receiver Operating Characteristic(ROC) curve is a
probabilistic curve used to highlight the model’s performance. The curve has two
parameters:
TPR: It stands for True Positive Rate. It follows the formula of Recall, i.e. TPR = TP/(TP+FN).
FPR: It stands for False Positive Rate. It is defined as the ratio of false positives to the summation of false positives and true negatives, i.e. FPR = FP/(FP+TN).
This curve is useful as it helps us to determine the model’s capacity to distinguish between
different classes.
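A minimal sketch of computing the ROC curve and AUC, assuming scikit-learn; the binary dataset and the logistic-regression classifier below are illustrative choices, not the ones used elsewhere in this unit:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score, roc_curve

# Binary classification dataset (ROC-AUC is defined for binary targets).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Predicted probabilities of the positive class are needed for the ROC curve.
probs = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))
```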
Evaluation Metrics for Regression Task
Regression is used to determine continuous values. It is mostly used to find a relation
between a dependent and an independent variable. For classification, we use a confusion
matrix, accuracy, f1 score, etc. But for regression analysis, since we are predicting a
numerical value it may differ from the actual output. So we consider the error calculation
as it helps to summarize how close the prediction is to the actual value. There are many
metrics available for evaluating the regression model.
As an example, consider a simple regression model built from a Mumbai weather CSV file comprising Day, Hour, Temperature, Relative Humidity, Wind Speed, and Wind Direction. The link for the dataset is here.
We are interested in finding a relationship between Temperature and Relative Humidity. Here Relative Humidity is the dependent variable and Temperature is the independent variable. We perform Linear Regression and use the metrics to evaluate the performance of the model; to calculate the metrics we make extensive use of the sklearn library.
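A minimal sketch of these steps, assuming pandas and scikit-learn; the file name mumbai_weather.csv and the exact column names are assumptions and should be adjusted to the actual dataset:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; adjust to match the actual weather CSV.
df = pd.read_csv("mumbai_weather.csv")

X = df[["Temperature"]]          # independent variable
y = df["Relative Humidity"]      # dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("Slope:", reg.coef_[0], "Intercept:", reg.intercept_)
```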
Mean Absolute Error(MAE)
This is the simplest metric used to analyze the loss over the whole dataset. As we all know
the error is basically the difference between the predicted and actual values.
Therefore MAE is defined as the average of the errors calculated. Here we calculate the
modulus of the error, perform the summation and then divide the result by the number of
data points. It is a positive quantity and is not concerned about the direction. The formula
of MAE is given by
MAE = ∑|ypred-yactual| / N
Mean Squared Error(MSE)
The most commonly used metric is Mean Squared Error, or MSE. It is a function used to calculate the loss: we find the difference between the predicted values and the true values, square the result, and then average over the whole dataset. MSE is always positive as we square the values, and the smaller the MSE, the better the performance of our model. The formula of MSE is given by:
MSE = ∑(ypred - yactual)² / N
Root Mean Squared Error(RMSE)
RMSE is a popular method and is the extended version of MSE(Mean Squared Error). This
method is basically used to evaluate the performance of our model. It indicates how much
the data points are spread around the best line. It is the standard deviation of the Mean
squared error. A lower value means that the data point lies closer to the best fit line.
RMSE = √( ∑(ypred - yactual)² / N )
Mean Absolute Percentage Error (MAPE)
MAPE is used to express the error in terms of a percentage. For each data point, the absolute difference between the actual and predicted value is divided by the actual value; these ratios are then summed and averaged, and the result is expressed as a percentage. The smaller the percentage, the better the performance of the model. The formula is given by
MAPE = ∑(|ypred - yactual| / yactual) / N × 100 %
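A minimal sketch of computing these four metrics, assuming scikit-learn and NumPy and using a small made-up set of actual and predicted values (MAPE is computed manually here rather than relying on a particular scikit-learn version):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Small made-up example: actual vs predicted values.
y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_pred   = np.array([2.5, 5.0, 4.0, 8.0])

mae  = mean_absolute_error(y_actual, y_pred)
mse  = mean_squared_error(y_actual, y_pred)
rmse = np.sqrt(mse)
# MAPE: average absolute percentage error over all points.
mape = np.mean(np.abs((y_actual - y_pred) / y_actual)) * 100

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  MAPE={mape:.2f}%")
```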
1.5 Cross Validation
Cross validation is a technique used in machine learning to evaluate the performance of a
model on unseen data. It involves dividing the available data into multiple folds or subsets,
using one of these folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the validation
set. Finally, the results from each validation step are averaged to produce a more robust
estimate of the model’s performance.
The main purpose of cross validation is to prevent overfitting, which occurs when a model
is trained too well on the training data and performs poorly on new, unseen data. By
evaluating the model on multiple validation sets, cross validation provides a more realistic
estimate of the model’s generalization performance, i.e., its ability to perform well on new,
unseen data.
There are several types of cross validation techniques, including k-fold cross validation,
leave-one-out cross validation, and stratified cross validation. The choice of technique
depends on the size and nature of the data, as well as the specific requirements of the
modeling problem.
Cross-Validation
Cross-validation is a technique in which we train our model using the subset of the data-set
and then evaluate using the complementary subset of the data-set. The three steps involved
in cross-validation are as follows :
1. Reserve some portion of sample data-set.
2. Using the rest data-set train the model.
3. Test the model using the reserve portion of the data-set.
Methods of Cross Validation
Validation: In this method, we perform training on 50% of the given dataset and the remaining 50% is used for testing. The major drawback of this method is that, since we train on only 50% of the dataset, the remaining 50% may contain important information that the model never sees while training, which leads to higher bias.
LOOCV (Leave One Out Cross Validation): In this method, we train on the whole dataset but leave out a single data point at a time, and we iterate over every data point. It has advantages as well as disadvantages. An advantage is that we make use of all data points, so the bias is low. The major drawback is that it leads to higher variation in testing, as we test against a single data point each time; if that data point is an outlier, it can lead to high variation. Another drawback is that it takes a lot of execution time, since it iterates as many times as there are data points.
K-Fold Cross Validation: In this method, we split the dataset into k subsets (known as folds), then train on k-1 of the subsets and leave one subset out for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
Example: The following describes the training and evaluation subsets generated in k-fold cross-validation with k = 5. Here, we have a total of 25 instances. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for testing and [6-25] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets for training (instances [6-10] for testing and [1-5 and 11-25] for training), and so on.
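A minimal sketch of this 5-fold split on 25 instances, assuming scikit-learn (note that the printed indices are 0-based, whereas the example above counts instances from 1):

```python
import numpy as np
from sklearn.model_selection import KFold

# 25 instances, as in the example above, split into 5 folds of 5 instances each.
X = np.arange(25).reshape(25, 1)

kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: test instances {test_idx.tolist()}")
```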
Advantages of train/test split:
1. It runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times.
2. It is simpler to examine the detailed results of the testing process.
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both training and testing.
Advantages of Cross Validation:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting by providing a
more robust estimate of the model’s performance on unseen data.
2. Model Selection: Cross validation can be used to compare different models and select the one that performs the best on average (see the sketch after this list).
3. Hyperparameter tuning: Cross validation can be used to optimize the hyperparameters
of a model, such as the regularization parameter, by selecting the values that result in
the best performance on the validation set.
4. Data Efficient: Cross validation allows the use of all the available data for both training
and validation, making it a more data-efficient method compared to traditional
validation techniques.
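A minimal sketch of using cross validation for model selection, assuming scikit-learn; the two candidate models and the iris data are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare two candidate models by their average 5-fold cross-validation accuracy.
for name, model in [("decision tree", DecisionTreeClassifier(random_state=0)),
                    ("k-NN (k=5)", KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```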
1.6 Linear Regression
Linear Regression is the supervised Machine Learning model in which the model finds the
best fit linear line between the independent and dependent variable i.e it finds the linear
relationship between the dependent and independent variable.
Linear Regression is of two types: Simple and Multiple. In Simple Linear Regression only one independent variable is present and the model has to find its linear relationship with the dependent variable, whereas in Multiple Linear Regression there is more than one independent variable for the model to relate to the dependent variable.
A Linear Regression model's main aim is to find the best fit linear line and the optimal values of the intercept and coefficients such that the error is minimized. Error is the difference between the actual value and the predicted value, and the goal is to reduce this difference.
Introduction
Before diving deeper into Linear Regression, the first regression algorithm most practitioners learn, it helps to recall the basic terminology. Machine Learning is the study of algorithms that learn from data and improve their accuracy over time without being explicitly programmed to do so.
Types of Machine Learning:
Supervised Machine Learning: Labeled data, i.e. the output variable, is provided in these types of problems. Here, the models find the mapping function to map the input variables to the output variable or the labels.
Unsupervised Machine Learning: It is the technique where models are not provided with
the labeled data and they have to find the patterns and structure in the data to know about the
data.
x is our independent variable, which is plotted on the x-axis, and y is the dependent variable, which is plotted on the y-axis.
Black dots are the data points i.e the actual values.
The blue line is the best fit line predicted by the model i.e the predicted values lie on the blue
line.
The vertical distance between a data point and the regression line is known as the error or residual. Each data point has one residual, and the sum of all these differences is known as the Sum of Residuals/Errors.
Mathematical Approach:
Residual/Error = Actual value – Predicted value
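A minimal sketch of fitting a best fit line and computing the residuals, assuming scikit-learn and made-up data with a roughly linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up data: y is roughly 3x + 5 plus noise.
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=50)

reg = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = reg.predict(x.reshape(-1, 1))

# Residual = actual value - predicted value, one per data point.
residuals = y - y_pred
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("sum of squared residuals:", np.sum(residuals ** 2))
```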
1. Linearity: It states that the dependent variable Y should be linearly related to independent
variables. This assumption can be checked by plotting a scatter plot between both variables.
2. Normality: The X and Y variables should be normally distributed. Histograms, KDE plots, and Q-Q plots can be used to check whether this assumption holds.
3. Homoscedasticity: The variance of the error terms should be constant, i.e. the spread of the residuals should be constant for all values of X. This assumption can be checked by plotting a residual plot: if the assumption is violated, the points will form a funnel shape; otherwise they will be scattered with a roughly constant spread.
4. No Multicollinearity: There should be no correlation between the independent variables. To check this assumption, we can use a correlation matrix or the VIF score. If the VIF score is greater than 5, multicollinearity is present and the assumption is violated.
5. Normality of errors: The error terms should be normally distributed. Q-Q plots and histograms can be used to check this.
6. No Autocorrelation: The error terms should not be correlated with one another. Autocorrelation can be tested using the Durbin-Watson test. The null hypothesis assumes that there is no autocorrelation. The value of the test statistic lies between 0 and 4: a value close to 2 indicates no autocorrelation, while values toward 0 or 4 indicate positive or negative autocorrelation respectively.
If these assumptions are violated, the model becomes unreliable; for example, if the independence assumption is violated, then the relationship between the independent and dependent variables cannot be estimated correctly. There are various methods and techniques available to deal with such violations (a short sketch of checking some of these assumptions follows below).
To treat violations of normality, we can transform the variables toward a normal distribution using transformation functions such as the log, square root, or reciprocal transformation.
To treat multicollinearity, we can derive a new feature by linearly combining the independent variables, such as adding them together, or use dimensionality-reduction techniques such as principal components analysis.
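A minimal sketch of checking the multicollinearity and autocorrelation assumptions, assuming statsmodels and made-up data; the variable names and thresholds are illustrative:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# Made-up data: two independent variables, the second mildly correlated with the first.
x1 = rng.normal(size=100)
x2 = 0.5 * x1 + rng.normal(scale=0.9, size=100)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))  # add an intercept column
model = sm.OLS(y, X).fit()

# VIF > 5 for a predictor suggests multicollinearity (index 0 is the constant column).
for i, name in enumerate(["const", "x1", "x2"]):
    print(name, "VIF:", variance_inflation_factor(X, i))

# Durbin-Watson statistic near 2 suggests no autocorrelation in the residuals.
print("Durbin-Watson:", durbin_watson(model.resid))
```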
After building the model, evaluating it is necessary. Some of the evaluation metrics used for regression analysis are:
1. R squared or Coefficient of Determination: The most commonly used metric for model evaluation; it is defined as the ratio of Explained Variation to Total Variation. The value of R squared lies between 0 and 1, and the closer the value is to 1, the better the model.
R² = 1 − (SSRES / SSTOT), where SSRES is the Residual Sum of Squares and SSTOT is the Total Sum of Squares.
2. Adjusted R squared: The drawback of R² is that as the number of features increases, the value of R² also increases, which gives the illusion of a good model. Adjusted R² solves this drawback: it only rewards the features that actually improve the model and therefore shows the real improvement of the model.
3. Mean Squared Error (MSE): Another Common metric for evaluation is Mean squared
error which is the mean of the squared difference of actual vs predicted values.
4. Root Mean Squared Error (RMSE): It is the square root of MSE, i.e. the root of the mean squared difference between the actual and predicted values. Because RMSE is in the same units as the target variable, it is easier to interpret than MSE, and like MSE it penalizes large errors heavily.
1.7 Overfitting
Overfitting and Underfitting are the two main problems that occur in machine learning and
degrade the performance of the machine learning models.
Before understanding overfitting and underfitting, let's understand some basic terms that will help in understanding this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than required, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. An overfitted model has low bias and high variance.
In such a case, the model tries to cover every data point present in the scatter plot. It may look efficient, but in reality it is not, because the goal of the regression model is to find the general best fit line; here the model has not found a true best fit, so it will generate prediction errors on unseen data.
Both overfitting and underfitting degrade the performance of a machine learning model, but the more common cause is overfitting, and there are several ways to reduce its occurrence in our model (a short sketch follows the list below):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
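A minimal sketch of two of these remedies together, cross-validation and regularization, assuming scikit-learn; the degree-12 polynomial and the alpha value are illustrative, and the exact scores will vary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)

# A high-degree polynomial without regularization tends to overfit; adding an L2
# penalty (Ridge) constrains it. Cross-validation scores typically reveal the gap.
plain = make_pipeline(PolynomialFeatures(degree=12), StandardScaler(), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree=12), StandardScaler(), Ridge(alpha=1.0))

for name, model in [("no regularization", plain), ("ridge (alpha=1.0)", ridge)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```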
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data may be stopped at too early a stage, due to which the model may not learn enough from the training data. As a result, it may fail to capture the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and
hence it reduces the accuracy and produces unreliable predictions.
Example: We can understand underfitting from the output of a linear regression model fit to data with a more complex trend: the fitted straight line is unable to capture the data points in the plot.
The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning
models to achieve the goodness of fit. In statistics modeling, it defines how closely the result
or predicted values match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and ideally, it
makes predictions with 0 errors, but in practice, it is difficult to achieve it.
As when we train our model for a time, the errors in the training data go down, and the same
happens with test data. But if we train the model for a long duration, then the performance of
the model may decrease due to the overfitting, as the model also learn the noise present in the
dataset. The errors in the test dataset start increasing, so the point, just before the raising of
errors, is the good point, and we can stop here for achieving a good model.
There are two other methods by which we can get a good point for our model, which are
the resampling method to estimate model accuracy and validation dataset.
Important points:
In this type of scenario, we can see that User 1 and User 2 give nearly similar ratings to the movies, so we can conclude that Movie 3 will also be liked about averagely by User 1, while Movie 4 will be a good recommendation for User 2. We can also see that there are users with different tastes: User 1 and User 3 are opposite to each other, whereas User 3 and User 4 have a common interest in the movies, on which basis we can say that Movie 4 is also going to be disliked by User 4. This is Collaborative Filtering: we recommend to users the items that are liked by users with similar interests.
Cosine Distance: We can also use the cosine of the angle between two users' rating vectors to find users with similar interests: a larger cosine implies a smaller angle between the two users, and hence more similar interests. We can apply this between any two users (rows) in the utility matrix, giving a value of zero to all the unfilled entries to make the calculation easy. If the cosine is smaller, the angle (and hence the distance) between the users is larger; if the cosine is larger, the angle between the users is small, and we can recommend them similar things.
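A minimal sketch of the cosine computation on a small, made-up utility matrix (the ratings below are hypothetical, with 0 standing for an unrated movie):

```python
import numpy as np

# Rows are users, columns are movies; 0 marks an unrated (unfilled) entry.
ratings = np.array([
    [5, 4, 0, 1],   # User 1
    [4, 5, 3, 0],   # User 2
    [1, 0, 4, 5],   # User 3
    [0, 1, 5, 4],   # User 4
], dtype=float)

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (larger = more similar)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print("User1-User2:", cosine_similarity(ratings[0], ratings[1]))  # large: similar tastes
print("User1-User3:", cosine_similarity(ratings[0], ratings[2]))  # small: different tastes
```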
Rounding the Data: In collaborative filtering we can round off the ratings to compare them more easily; for example, we can map ratings below 3 to 0 and ratings of 3 and above to 1.
Applying this rounding to the previous example makes the data much more readable: we can see that User 1 and User 2 are more similar, and User 3 and User 4 are more alike.
Normalizing Ratings: In normalization we take the average rating of a user and subtract it from all of that user's given ratings, so we get either positive or negative values as ratings, which can then be grouped into similar clusters. By normalizing the data we can form clusters of users who give similar ratings to similar items and then use these clusters to recommend items to users.
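A minimal sketch of this normalization on the same hypothetical matrix, using NaN for unrated entries so that they do not distort a user's average:

```python
import numpy as np

# Hypothetical utility matrix; NaN marks an unrated entry.
ratings = np.array([
    [5, 4, np.nan, 1],
    [4, 5, 3, np.nan],
    [1, np.nan, 4, 5],
    [np.nan, 1, 5, 4],
])

# Subtract each user's average rating from their given ratings:
# positive values mean "liked more than usual", negative mean "liked less than usual".
user_means = np.nanmean(ratings, axis=1, keepdims=True)
normalized = ratings - user_means
print(normalized)
```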
1.11 Decision trees
It is a tool that has applications spanning several different areas. Decision trees can be used
for classification as well as regression problems. The name itself suggests that it uses a
flowchart like a tree structure to show the predictions that result from a series of feature-
based splits. It starts with a root node and ends with a decision made by leaves.
Before learning more about decision trees let’s get familiar with some of the terminologies.
Root Node – the node present at the beginning of a decision tree; from this node the population starts dividing according to various features.
Decision Nodes – the nodes we get after splitting the root node are called decision nodes.
Leaf Nodes – the nodes where further splitting is not possible are called leaf nodes or terminal nodes.
Sub-tree – just as a small portion of a graph is called a sub-graph, a sub-section of this decision tree is called a sub-tree.
Decision trees are drawn upside down, which means the root is at the top, and the root is then split into several nodes. In layman's terms, decision trees are nothing but a bunch of if-else statements: the tree checks whether a condition is true, and if it is, it moves on to the next node attached to that decision.
For example, the tree first asks about the weather: is it sunny, cloudy, or rainy? Depending on the answer, it moves on to the next feature, such as humidity or wind. It then checks whether the wind is strong or weak; if the wind is weak and the weather is rainy, the person may go and play.
Notice that in this example, if the weather is cloudy, the decision is always to go and play. Why didn't that branch split more? Why did it stop there?
To answer this question, we need to know about a few more concepts like entropy, information gain, and the Gini index. But in simple terms, the output for the training dataset is always "yes" for cloudy weather; since there is no disorderliness there, we don't need to split that node further.
The goal of machine learning is to decrease the uncertainty or disorder in the dataset, and for this we use decision trees.
Now you must be wondering: how do I know what should be the root node? What should be a decision node? When should I stop splitting? To decide this, there is a metric called "Entropy".
Entropy
Entropy is nothing but the uncertainty in our dataset, a measure of disorder. For a node whose examples belong to the "yes" class with probability p(yes) and the "no" class with probability p(no), the entropy is
E(S) = -p(yes) log2 p(yes) - p(no) log2 p(no)
Let me try to explain this with an example. Suppose you have a group of friends deciding which movie to watch together on Sunday. There are 2 choices of movies, "Lucy" and "Titanic", and everyone has to state their choice. After everyone answers, we see that "Lucy" gets 4 votes and "Titanic" gets 5 votes. Which movie do we watch now? Isn't it hard to choose one movie, because the votes for both movies are roughly equal?
This is exactly what we call disorder: there is an almost equal number of votes for both movies, and we can't really decide which movie to watch. It would have been much easier if the votes for "Lucy" were 8 and for "Titanic" were 2. Here we could easily say that the majority of votes are for "Lucy", so everyone will watch this movie.
Now we know what entropy is and what its formula looks like. Next, we need to know how it is used: entropy measures the impurity of a node, and impurity is the degree of randomness; it tells how random our data is. A pure sub-split means that you should be getting either all "yes" or all "no".
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
We see here the split is not pure. Why? Because we can still see some negative classes in both nodes. In order to build a decision tree, we need to calculate the impurity of each split, and we check the impurity of each of these child nodes with the help of the entropy formula.
We can clearly see that the left node has lower entropy, or more purity, than the right node, since the left node has a greater share of "yes" answers and it is easier to make a decision there.
Always remember: the higher the entropy, the lower the purity and the higher the impurity.
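A minimal sketch of this calculation in Python, using the 5/2 and 3/2 split from the example above:

```python
import math

def entropy(n_yes: int, n_no: int) -> float:
    """Binary entropy of a node given its class counts."""
    total = n_yes + n_no
    result = 0.0
    for count in (n_yes, n_no):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

# Left node: 5 "yes", 2 "no"; right node: 3 "yes", 2 "no" (example above).
print("E(left)  =", round(entropy(5, 2), 3))  # ~0.863 (purer)
print("E(right) =", round(entropy(3, 2), 3))  # ~0.971 (less pure)
```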
As mentioned earlier, the goal of machine learning is to decrease the uncertainty or impurity in the dataset. Using entropy we get the impurity of a particular node, but we don't yet know whether the entropy has decreased relative to the parent node. For this, we bring in a new metric called "Information Gain", which tells us how much the parent entropy has decreased after splitting on some feature.
Information Gain
Information gain measures the reduction of uncertainty given some feature and it is also a
deciding factor for which attribute should be selected as a decision node or root node.
It is simply the entropy of the full dataset minus the entropy of the dataset given some feature: Information Gain = E(Parent) − E(Parent | Feature), where E(Parent | Feature) is the weighted average entropy of the child nodes produced by splitting on that feature.
Suppose our entire population has a total of 30 instances, and the dataset is used to predict whether a person will go to the gym or not. Say 16 people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not: Feature 1 is "Energy", which takes two values, "high" and "low", and Feature 2 is "Motivation", which takes three values, "No motivation", "Neutral", and "Highly motivated".
Let’s see how our decision tree will be made using these 2 features. We’ll use information
gain to decide which feature should be the root node and which feature should be placed after
the split.
Let's calculate the entropy. For this example, the parent entropy comes out to be approximately 0.99. Now that we have the values of E(Parent) and E(Parent|Energy), the information gain is:
Information Gain = E(Parent) − E(Parent|Energy) ≈ 0.37
Our parent entropy was near 0.99, and after looking at this value of information gain we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.
Similarly, we do the same with the other feature, "Motivation", and calculate its information gain:
Information Gain = E(Parent) − E(Parent|Motivation), which comes out smaller than 0.37.
We now see that the "Energy" feature gives a larger reduction (0.37) than the "Motivation" feature. Hence we select the feature with the highest information gain and split the node on that feature.
In this example, "Energy" will be our root node, and we do the same for the sub-nodes. Here we can see that when the energy is "high" the entropy is low, and hence we can say a person will very likely go to the gym if he has high energy; but what if the energy is "low"? In that case we split that node again using the next feature, "Motivation".
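A minimal sketch of the information-gain calculation. The exact per-branch counts from the original worked example are not reproduced in the text, so the "Energy" split below uses hypothetical counts; the function itself shows how E(Parent), E(Parent|Feature), and the gain combine:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts_list):
    """IG = E(parent) - weighted average entropy of the child nodes."""
    total = sum(parent_counts)
    weighted_children = sum(
        (sum(child) / total) * entropy(child) for child in child_counts_list
    )
    return entropy(parent_counts) - weighted_children

# Parent node: 16 people go to the gym, 14 don't (entropy close to the 0.99 in the text).
parent = [16, 14]
# Hypothetical "Energy" split into high/low branches; the real counts were in a figure.
energy_split = [[12, 1], [4, 13]]
print("E(parent)  =", round(entropy(parent), 3))
print("IG(Energy) =", round(information_gain(parent, energy_split), 3))
```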
You must be asking yourself: when do we stop growing the tree? Real-world datasets usually have a large number of features, which results in a large number of splits and in turn a huge tree. Such trees take time to build and can lead to overfitting: the tree will give very good accuracy on the training dataset but poor accuracy on unseen test data.
There are many ways to tackle this problem through hyperparameter tuning. We can set the maximum depth of our decision tree using the max_depth parameter. The larger the value of max_depth, the more complex the tree will be. The training error will of course decrease if we increase max_depth, but when our test data comes into the picture we may get very bad accuracy. Hence we need a value that neither overfits nor underfits the data, and for this we can use GridSearchCV.
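A minimal sketch of such a search, assuming scikit-learn; the parameter grid and the iris data are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Search over max_depth (and min_samples_split) with 5-fold cross-validation
# to find a tree that neither overfits nor underfits.
param_grid = {"max_depth": [2, 3, 4, 5, None],
              "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```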
Another way is to set the minimum number of samples required for each split. It is denoted by min_samples_split. For example, we can require a minimum of 10 samples to make a split: if a node has fewer than 10 samples, then, using this parameter, we stop further splitting of that node. The larger we make this number, the stronger the constraint and the greater the possibility of underfitting.
max_features – it helps us decide what number of features to consider when looking for the
best split.
Pruning
Pruning is another method that can help us avoid overfitting. It improves the performance of the tree by cutting the nodes or sub-nodes which are not significant, i.e. it removes the branches that rely on features of low importance. There are two main approaches:
(i) Pre-pruning – we stop growing the tree earlier, which means we can prune or remove a node while growing the tree if it appears insignificant.
(ii) Post-pruning – once our tree is built to its full depth, we start pruning nodes based on their significance.
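scikit-learn implements post-pruning through cost-complexity pruning (the ccp_alpha parameter); a minimal sketch, with an illustrative alpha value and dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree (grown to full depth) versus a post-pruned tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0 enables cost-complexity (post-)pruning: larger values remove
# more of the weakest (least significant) nodes. 0.01 is an arbitrary example value.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("Unpruned test accuracy:", round(full_tree.score(X_test, y_test), 3))
print("Pruned   test accuracy:", round(pruned_tree.score(X_test, y_test), 3))
print("Nodes: unpruned =", full_tree.tree_.node_count,
      ", pruned =", pruned_tree.tree_.node_count)
```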