
Unit 1.

Introduction to Machine learning

1.1 Basic definitions

Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.

Machine learning is a growing technology which enables computers to learn automatically


from past data. Machine learning uses various algorithms for building mathematical models
and making predictions using historical data or information. Currently, it is being used
for various tasks such as image recognition, speech recognition, email filtering, Facebook
auto-tagging, recommender systems, and many more.

What is Machine Learning

In the real world, we are surrounded by humans who can learn from their experiences, and by computers or machines that simply follow our instructions. But can a machine also learn from experiences or past data the way a human does? This is where Machine Learning comes in.

Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experience on its own.

With the help of sample historical data, which is known as training data, machine learning
algorithms build a mathematical model that helps in making predictions or decisions
without being explicitly programmed. Machine learning brings computer science and
statistics together for creating predictive models. Machine learning constructs or uses the
algorithms that learn from historical data. The more information we provide, the better the performance.

A machine has the ability to learn if it can improve its performance by gaining more
data.

How does Machine Learning work

A Machine Learning system learns from historical data, builds prediction models, and whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a larger amount of data helps to build a better model which predicts the output more accurately.

Suppose we have a complex problem in which we need to make some predictions. Instead of writing code for it, we just need to feed the data to generic algorithms; with the help of these algorithms, the machine builds the logic from the data and predicts the output. Machine learning has changed the way we think about such problems.

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.


o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as it also deals with huge amounts of data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that machine learning is capable of doing tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot process huge amounts of data manually, so we need computer systems, and machine learning makes this easy for us.

We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct the models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestion by Facebook, and so on. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interests and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data


o Solving complex problems, which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data.

1.2 Types of learning

Machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample


labeled data to the machine learning system in order to train it, and on that basis, it
predicts the output.

The system creates a model using labeled data to understand the dataset and learn about each example; once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output. The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, much as a student learns under the supervision of a teacher. An example of supervised learning is spam filtering.

Supervised learning can be further grouped into two categories of algorithms:

o Classification
o Regression
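A minimal sketch of supervised learning in Python, assuming scikit-learn is available; the tiny dataset and the choice of logistic regression are purely illustrative:

# Supervised learning: the model is trained on labeled examples and then
# predicts labels for new, unseen inputs.
from sklearn.linear_model import LogisticRegression

X_train = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]  # feature values
y_train = [0, 0, 0, 1, 1, 1]                             # known labels (the supervision)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[2.5], [11.5]]))                      # predicted labels for unseen inputs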

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any


supervision.

The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group
of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:

o Clustering
o Association
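A minimal sketch of unsupervised learning, assuming scikit-learn; the data and the number of clusters are illustrative:

# Unsupervised learning: k-means groups unlabeled points into clusters
# based only on the structure of the data itself.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]   # no labels provided
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                                      # cluster assignments discovered from the data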

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a


reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to get the most reward points, and in doing so it improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.

1.3 Hypothesis space and inductive bias

A hypothesis is defined as a supposition or proposed explanation based on insufficient evidence or assumptions. It is just a guess based on some known facts that has not yet been proven. A good hypothesis is testable and turns out to be either true or false.

Example: Let's understand the hypothesis with a common example. Suppose a scientist claims that because ultraviolet (UV) light can damage the eyes, it may also cause blindness. Here the scientist knows that UV rays are harmful to the eyes, but the claim that they may cause blindness is an assumption that may or may not turn out to be true. Such assumptions are called hypotheses.

Hypothesis in Machine Learning (ML)

The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is
specifically used in Supervised Machine learning, where an ML model learns a function that
best maps the input to corresponding outputs with the help of an available dataset.

In supervised learning techniques, the main aim is to determine the possible hypothesis out of
hypothesis space that best maps input to the corresponding or correct outputs.

There are some common methods for finding a possible hypothesis from the hypothesis space, where the hypothesis space is represented by uppercase H and a hypothesis by lowercase h. These are defined as follows:

Hypothesis space (H):

Hypothesis space is defined as a set of all possible legal hypotheses; hence it is also known
as a hypothesis set. It is used by supervised machine learning algorithms to determine the
best possible hypothesis to describe the target function or best maps input to output.

It is often constrained by choice of the framing of the problem, the choice of model, and the
choice of model configuration.

Hypothesis (h):

It is defined as the approximate function that best describes the target in supervised machine
learning algorithms. It is primarily based on data as well as bias and restrictions applied to
data.

Hence hypothesis (h) can be concluded as a single hypothesis that maps input to proper
output and can be evaluated as well as used to make predictions.
The hypothesis (h) can be formulated in machine learning as follows:

y = mx + b

Where,

y: range (the output variable)

m: slope of the line, i.e. the change in y divided by the change in x

x: domain (the input variable)

b: intercept

Example: Let's understand the hypothesis (h) and hypothesis space (H) with a two-
dimensional coordinate plane showing the distribution of data as follows:

Now, assume we have some test data for which the ML algorithm must predict the outputs. If we divide this coordinate plane in such a way that it helps us predict the output or result, then, based on the given test data, the dividing line (or boundary) gives the predicted outputs. However, depending on the data, algorithm, and constraints, the coordinate plane can also be divided in several other ways.

With the above example, we can conclude that;

Hypothesis space (H) is the set of all legal possible ways to divide the coordinate plane so that it best maps inputs to the proper outputs.

Further, each individual possible way is called a hypothesis (h).
Hypothesis in Statistics

Similar to the hypothesis in machine learning, it is also considered an assumption about the output. However, it is falsifiable, which means it can fail in the presence of sufficient evidence.

Unlike machine learning, we cannot simply accept any hypothesis in statistics, because it is just an assumed result based on probability. Before starting work on an experiment, we must be aware of two important types of hypotheses:

o Null Hypothesis: A null hypothesis is a type of statistical hypothesis which states that no statistically significant effect exists in the given set of observations. It is also known as a conjecture and is used in quantitative analysis to test theories about markets, investment, and finance to decide whether an idea is true or false.
o Alternative Hypothesis: An alternative hypothesis is a direct contradiction of the null hypothesis, which means that if one of the two hypotheses is true, the other must be false. In other words, an alternative hypothesis is a type of statistical hypothesis which states that some significant effect exists in the given set of observations.

Significance level

The significance level is the primary thing that must be set before starting an experiment. It defines the tolerance for error, i.e. the level at which an effect can be considered statistically significant. In practice, a 95% confidence level is commonly used, which corresponds to a 5% significance level (the remaining 5% is the tolerated error). The significance level also determines the critical or threshold value: for example, if the confidence level is set to 98%, then the significance level, and hence the critical threshold for the p-value, is 0.02.

P-value
The p-value in statistics is defined as the evidence against a null hypothesis. In other words, the p-value is the probability of obtaining data at least as extreme as the observed data purely by random chance, assuming the null hypothesis is true.

The smaller the p-value, the stronger the evidence against the null hypothesis, and the more readily the null hypothesis can be rejected in testing. It is always represented in decimal form, such as 0.035.

Whenever a statistical test is carried out on the population and sample to find out P-value,
then it always depends upon the critical value. If the p-value is less than the critical value,
then it shows the effect is significant, and the null hypothesis can be rejected. Further, if it is
higher than the critical value, it shows that there is no significant effect and hence fails to
reject the Null Hypothesis.
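A small sketch of comparing a p-value with a chosen threshold, assuming SciPy is available; the sample data, the hypothesized mean, and the alpha value are illustrative:

# One-sample t-test: H0 says the population mean is 3.0.
from scipy import stats

sample = [2.9, 3.1, 3.4, 2.8, 3.2, 3.3, 3.0, 3.5]
t_stat, p_value = stats.ttest_1samp(sample, popmean=3.0)

alpha = 0.05                      # tolerated probability of a false rejection
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")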

Inductive Bias

The phrase “inductive bias” refers to a collection of (explicit or implicit) assumptions


made by a learning algorithm in order to conduct induction, or generalize a limited set
of observations (training data) into a general model of the domain.

In this section, we'll look at what inductive bias is and how it helps the machine make better decisions.

Why Inductive Bias?


In the Candidate-Elimination algorithm, we end up with two boundary hypotheses, one specific and one general, as the final solution.

We also need to check whether the hypothesis obtained from the algorithm is actually correct, and make decisions such as which training examples the machine should learn from next.

Some of the fundamental questions for inductive inference are:


 What happens if the target concept isn’t in the hypothesis space?
 Is it possible to avoid this problem by adopting a hypothesis space that contains all potential
hypotheses?
 What effect does the size of the hypothesis space have on the algorithm’s capacity to
generalize to unseen instances?
 What effect does the size of the hypothesis space have on the number of training instances
required?

Let's look at what inductive and deductive learning are to understand more about inductive bias.

Inductive Learning:
This basically means learning from examples, learning on the go.

We are given input samples (x) and output samples (f(x)) in the context of inductive learning, and the objective is to estimate the function f. The goal is to generalize from the samples so that the output can be estimated for fresh samples in the future.

In practice, estimating the function exactly is nearly always too difficult, so we instead seek very good approximations of it.

The following are some instances of induction in practice:


Assessment of credit risk:
The x represents the customer's properties.
The f(x) is whether or not the customer is accepted for credit.

The diagnosis of disease:


The x represents the patient’s characteristics.
The f(x) is the illness they are afflicted with.

Face recognition:

Bitmaps of people's faces make up the x.
The f(x) assigns a name to the face.

Deductive Learning:
Learners are initially exposed to concepts and generalizations, followed by particular
examples and exercises to aid learning.
Already existing rules are applied to the training examples.

Biased Hypothesis Space:


A biased hypothesis space cannot represent every possible target concept. The issue is that we have restricted the learner to consider only, say, conjunctive hypotheses, so some concepts cannot be expressed. In this case, a more expressive hypothesis space is required.

Unbiased Hypothesis Space:


The obvious answer to the challenge of ensuring that the target concept is representable in hypothesis space H is to use a hypothesis space that can represent every teachable concept.

What is Inductive Bias?


As discussed in the introduction, Inductive bias refers to a set of assumptions made by
a learning algorithm in order to conduct induction or generalize a limited set of
observations (training data) into a general model of the domain.

Induction would be impossible without such a bias, because observations may


generally be extended in a variety of ways.

Predictions for new scenarios could not be formed if all of these options were treated
equally, that is, without any bias in the sense of a preference for certain forms of
generalization (representing previous information about the target function to be
learned).

The idea of inductive bias is to let the learner generalize beyond the observed training
examples to deduce new examples.
The notation x > y is read as "y is inductively inferred from x".

Types of Inductive Bias:


 Maximum conditional independence: It aims to maximize conditional independence if the
hypothesis can be put in a Bayesian framework. The Naive Bayes classifier employs this bias.
 Minimum cross-validation error: Select the hypothesis with the lowest cross-validation
error when deciding between hypotheses. Despite the fact that cross-validation appears to be
bias-free, the “no free lunch” theorems prove that cross-validation is biased.

 Maximum margin: While creating a border between two classes, try to make the boundary
as wide as possible. In support vector machines, this is the bias. The idea is that distinct
classes are usually separated by large gaps.

 Minimum hypothesis description length: When constructing a hypothesis, try to keep its description as short as possible. Simpler hypotheses are generally preferred (in the spirit of Occam's razor), although strictly speaking simpler models are easier to test, not necessarily "more likely to be true."

 Minimum features: features should be removed unless there is strong evidence that they are
helpful. Feature selection methods are based on this premise.

 Nearest neighbors: Assume that the majority of the examples in a local neighborhood in
feature space are from the same class.

If the class of a case is unknown, assume that it belongs to the same class as the majority of the examples in its immediate neighborhood. The k-nearest neighbors algorithm employs this bias: cases that are close to each other are assumed to belong to the same class.

1.4 Evaluation
Machine Learning Model Evaluation
Model evaluation is the process of using metrics to analyze the performance of a model. Model development is a multi-step process, and a check should be kept on how well the model generalizes to future predictions. Evaluating a model therefore plays a vital role in judging its performance.
The evaluation also helps to analyze a model’s key weaknesses. There are many metrics
like Accuracy, Precision, Recall, F1 score, Area under Curve, Confusion Matrix, and Mean
Square Error. Cross Validation is one technique that is followed during the training phase
and it is a model evaluation technique as well.
Cross Validation and Holdout
Cross Validation is a method in which we do not use the whole dataset for training. In this
technique, some part of the dataset is reserved for testing the model. There are many types
of cross-validation, of which K-Fold Cross Validation is the most commonly used. In K-Fold Cross Validation the original dataset is divided into k subsets, known as folds. The process is repeated k times; in each iteration one fold is used for testing and the remaining k-1 folds are used for training the model. Thus each data point acts as a test subject for the model as well as a training subject. This technique is seen to generalize the model well and reduce the error rate.
Holdout is the simplest approach. It is used in neural networks as well as in many
classifiers. In this technique, the dataset is divided into train and test datasets. The dataset
is usually divided into ratios like 70:30 or 80:20. Normally a large percentage of data is
used for training the model and a small portion of the dataset is used for testing the model.
Evaluation Metrics for Classification Task
In the Python code below, we import the iris dataset, which has features such as the length and width of sepals and petals. The target values are Iris setosa, Iris virginica, and Iris versicolor. After importing the dataset, we split it into train and test sets in the ratio 80:20, train a decision tree, and then perform prediction and calculate the accuracy score, precision, recall, and F1 score, along with the confusion matrix.
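The original listing is not reproduced in this text, so the following is a minimal sketch of the described pipeline, assuming scikit-learn; the random_state values and macro averaging are assumptions:

# Iris dataset, 80:20 split, decision tree, and the classification metrics.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))  # macro-averaged over the 3 classes
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1 score :", f1_score(y_test, y_pred, average="macro"))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))          # printed rather than plotted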
Confusion Matrix
A confusion matrix is an N x N matrix where N is the number of target classes. It
represents the number of actual outputs and the predicted outputs. Some terminologies in
the matrix are as follows:

 True Positives: It is also known as TP. It is the output in which the actual and the
predicted values are YES.
 True Negatives: It is also known as TN. It is the output in which the actual and the
predicted values are NO.
 False Positives: It is also known as FP. It is the output in which the actual value is NO
but the predicted value is YES.
 False Negatives: It is also known as FN. It is the output in which the actual value is
YES but the predicted value is NO.
Precision and Recall
Precision is the ratio of true positives to the summation of true positives and false positives.
It basically analyses the positive predictions.
Precision = TP/(TP+FP)
The drawback of Precision is that it does not consider the True Negatives and False
Negatives.

Recall is the ratio of true positives to the summation of true positives and false negatives. It
basically analyses the number of correct positive samples.

Recall = TP/(TP+FN)
The drawback of Recall is that it does not consider False Positives, so optimizing for recall alone can lead to a higher false positive rate.
F1 score
The F1 score is the harmonic mean of precision and recall. It is seen that during the
precision-recall trade-off if we increase the precision, recall decreases and vice versa. The
goal of the F1 score is to combine precision and recall.
F1 score = (2×Precision×Recall)/(Precision+Recall)

Accuracy
Accuracy is defined as the ratio of the number of correct predictions to the total number of
predictions. This is the most fundamental metric used to evaluate the model. The formula is
given by
Accuracy = (TP+TN)/(TP+TN+FP+FN)
However, Accuracy has a drawback: it is not informative on an imbalanced dataset. A model that simply assigns every sample to the majority class label yields high accuracy, yet it cannot classify the minority class labels and performs poorly in practice.
AUC-ROC Curve
AUC (Area Under Curve) is an evaluation metric that is used to analyze the classification
model at different threshold values. The Receiver Operating Characteristic(ROC) curve is a
probabilistic curve used to highlight the model’s performance. The curve has two
parameters:
 TPR: It stands for True positive rate. It basically follows the formula of Recall.
 FPR: It stands for False Positive rate. It is defined as the ratio of False positives to the
summation of false positives and True negatives.
This curve is useful as it helps us to determine the model’s capacity to distinguish between
different classes.
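A minimal ROC-AUC sketch for a binary classifier, assuming scikit-learn; the labels and predicted scores below are illustrative:

# roc_curve gives the TPR/FPR at each threshold; roc_auc_score summarizes them.
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]   # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))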
Evaluation Metrics for Regression Task
Regression is used to determine continuous values. It is mostly used to find a relation
between a dependent and an independent variable. For classification, we use a confusion
matrix, accuracy, f1 score, etc. But for regression analysis, since we are predicting a
numerical value it may differ from the actual output. So we consider the error calculation
as it helps to summarize how close the prediction is to the actual value. There are many
metrics available for evaluating the regression model.
In the Python code below, we implement a simple regression model using a Mumbai weather CSV file. This file comprises Day, Hour, Temperature, Relative Humidity, Wind Speed, and Wind Direction. We are interested in finding a relationship between Temperature and Relative Humidity: here Relative Humidity is the dependent variable and Temperature is the independent variable. We perform linear regression and use the metrics described below to evaluate the performance of the model, making extensive use of the sklearn library.
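Since the original listing and the CSV file are not included here, the following is a rough sketch under the assumption that the file is named mumbai_weather.csv and has "Temperature" and "Relative Humidity" columns; both names are hypothetical:

# Simple linear regression of Relative Humidity on Temperature, plus error metrics.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

df = pd.read_csv("mumbai_weather.csv")                 # hypothetical file name
X = df[["Temperature"]]                                # independent variable
y = df["Relative Humidity"]                            # dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mse)
print("RMSE:", mse ** 0.5)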
Mean Absolute Error(MAE)
This is the simplest metric used to analyze the loss over the whole dataset. As we all know
the error is basically the difference between the predicted and actual values.
Therefore MAE is defined as the average of the errors calculated. Here we calculate the
modulus of the error, perform the summation and then divide the result by the number of
data points. It is a positive quantity and is not concerned about the direction. The formula
of MAE is given by
MAE = ∑|ypred-yactual| / N
Mean Squared Error(MSE)
The most commonly used metric is Mean Square error or MSE. It is a function used to
calculate the loss. We find the difference between the predicted values and the truth
variable, square the result and then find the average over the whole dataset. MSE is always positive as we square the values. The smaller the MSE, the better the performance of our model. The formula of MSE is given by
MSE = ∑(ypred − yactual)² / N
Root Mean Squared Error(RMSE)
RMSE is a popular method and is the extended version of MSE(Mean Squared Error). This
method is basically used to evaluate the performance of our model. It indicates how much
the data points are spread around the best fit line. It is the square root of the mean squared error and can be interpreted as the standard deviation of the residuals. A lower value means that the data points lie closer to the best fit line.

RMSE = √(∑(ypred − yactual)² / N)
Mean Absolute Percentage Error (MAPE)
MAPE is used to express the error in terms of a percentage. For each data point, the absolute difference between the actual and predicted value is divided by the actual value; the results are then summed and averaged. The smaller the percentage, the better the performance of the model. The formula is given by
MAPE = ( ∑ ( |ypred − yactual| / yactual ) / N ) × 100%
1.5 Cross Validation
Cross validation is a technique used in machine learning to evaluate the performance of a
model on unseen data. It involves dividing the available data into multiple folds or subsets,
using one of these folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the validation
set. Finally, the results from each validation step are averaged to produce a more robust
estimate of the model’s performance.
The main purpose of cross validation is to prevent overfitting, which occurs when a model
is trained too well on the training data and performs poorly on new, unseen data. By
evaluating the model on multiple validation sets, cross validation provides a more realistic
estimate of the model’s generalization performance, i.e., its ability to perform well on new,
unseen data.
There are several types of cross validation techniques, including k-fold cross validation,
leave-one-out cross validation, and stratified cross validation. The choice of technique
depends on the size and nature of the data, as well as the specific requirements of the
modeling problem.
Cross-Validation
Cross-validation is a technique in which we train our model using the subset of the data-set
and then evaluate using the complementary subset of the data-set. The three steps involved
in cross-validation are as follows :
1. Reserve some portion of sample data-set.
2. Using the rest data-set train the model.
3. Test the model using the reserve portion of the data-set.
Methods of Cross Validation
Validation: In this method, we perform training on 50% of the given data-set and the remaining 50% is used for testing. The major drawback is that, since we train on only 50% of the dataset, the remaining 50% may contain important information that we leave out while training the model, i.e. higher bias.

LOOCV (Leave One Out Cross Validation): In this method, we perform training on the whole data-set while leaving out only one data-point at a time, and we iterate over each data-point. An advantage of this method is that we make use of all data points, hence low bias. The major drawback is that it leads to higher variation in the testing of the model, as we are testing against a single data point; if that data point is an outlier, it can lead to higher variation. Another drawback is that it takes a lot of execution time, as it iterates as many times as there are data points.

K-Fold Cross Validation: In this method, we split the data-set into k subsets (known as folds), then we perform training on k-1 of the subsets and leave one subset for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
Example: Consider the training and evaluation subsets generated in k-fold cross-validation. Here we have a total of 25 instances. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training ([1-5] testing and [6-25] training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining data for training ([6-10] testing and [1-5, 11-25] training), and so on.
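A minimal k-fold sketch, assuming scikit-learn; the dataset, the classifier, and k = 5 are illustrative:

# Each of the 5 folds serves as the test set exactly once; scores are then averaged.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())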
Advantages of train/test split:
1. It runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times.
2. It is simpler to examine the detailed results of the testing process.
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both training and testing.
Advantages of Cross Validation:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting by providing a
more robust estimate of the model’s performance on unseen data.
2. Model Selection: Cross validation can be used to compare different models and select
the one that performs the best on average.
3. Hyperparameter tuning: Cross validation can be used to optimize the hyperparameters
of a model, such as the regularization parameter, by selecting the values that result in
the best performance on the validation set.
4. Data Efficient: Cross validation allows the use of all the available data for both training
and validation, making it a more data-efficient method compared to traditional
validation techniques.

Disadvantages of Cross Validation:

1. Computationally Expensive: Cross validation can be computationally expensive,


especially when the number of folds is large or when the model is complex and requires
a long time to train.
2. Time-Consuming: Cross validation can be time-consuming, especially when there are
many hyperparameters to tune or when multiple models need to be compared.
3. Bias-Variance Tradeoff: The choice of the number of folds in cross validation can impact the bias-variance tradeoff: too few folds may result in high bias, while too many folds may result in high variance.

1.6 Linear Regression
Linear Regression is a supervised machine learning model in which the model finds the best fit straight line between the independent and dependent variables, i.e. it finds the linear relationship between the dependent and independent variables.

Linear Regression is of two types: Simple and Multiple. In Simple Linear Regression only one independent variable is present, and the model has to find its linear relationship with the dependent variable.

In Multiple Linear Regression, by contrast, there is more than one independent variable for the model to use in finding the relationship.

Equation of Simple Linear Regression: y = b0 + b1x, where b0 is the intercept, b1 is the coefficient or slope, x is the independent variable and y is the dependent variable.

Equation of Multiple Linear Regression: y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn, where b0 is the intercept, b1, b2, b3, …, bn are the coefficients or slopes of the independent variables x1, x2, x3, …, xn and y is the dependent variable.

A Linear Regression model’s main aim is to find the best fit linear line and the optimal

values of intercept and coefficients such that the error is minimized.

Error is the difference between the actual value and Predicted value and the goal is to reduce

this difference.

Let’s understand this with the help of a diagram.


[Figure: scatter plot of the data points (black dots) with the best fit regression line (blue). Image source: Statistical tools for high-throughput data analysis]

In the above diagram,

 x is the independent variable, which is plotted on the x-axis, and y is the dependent variable, which is plotted on the y-axis.

 Black dots are the data points i.e the actual values.

 b0 is the intercept, which here is 10, and b1 is the slope of the x variable.

 The blue line is the best fit line predicted by the model i.e the predicted values lie on the blue

line.

The vertical distance between the data point and the regression line is known as error

or residual. Each data point has one residual and the sum of all the differences is known

as the Sum of Residuals/Errors.

Mathematical Approach:

Residual/Error = Actual value − Predicted value

Sum of Residuals/Errors = ∑(Actual − Predicted)

Sum of Squared Residuals/Errors = ∑(Actual − Predicted)²
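A small sketch of finding the intercept and slope that minimize the sum of squared residuals, assuming NumPy; the data points are illustrative:

# Least-squares fit of y = b0 + b1*x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([12.1, 13.9, 16.2, 18.0, 20.1])

A = np.column_stack([np.ones_like(x), x])         # design matrix [1, x]
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes sum((y - (b0 + b1*x))**2)

residuals = y - (b0 + b1 * x)
print("intercept b0:", b0, "slope b1:", b1)
print("sum of squared residuals:", np.sum(residuals ** 2))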


Assumptions of Linear Regression

The basic assumptions of Linear Regression are as follows:

1. Linearity: It states that the dependent variable Y should be linearly related to independent

variables. This assumption can be checked by plotting a scatter plot between both variables.
2. Normality: The X and Y variables should be normally distributed. Histograms, KDE plots,

Q-Q plots can be used to check the Normality assumption.


3. Homoscedasticity: The variance of the error terms should be constant i.e the spread of

residuals should be constant for all values of X. This assumption can be checked by plotting a

residual plot. If the assumption is violated then the points will form a funnel shape otherwise

they will be constant.



4. Independence/No Multicollinearity: The variables should be independent of each other

i.e no correlation should be there between the independent variables. To check the

assumption, we can use a correlation matrix or VIF score. If the VIF score is greater than 5

then the variables are highly correlated.

For example, a correlation matrix might show a high correlation between two variables such as x5 and x6.

5. The error terms should be normally distributed. Q-Q plots and Histograms can be used

to check the distribution of error terms.



6. No Autocorrelation: The error terms should be independent of each other.

Autocorrelation can be tested using the Durbin Watson test. The null hypothesis assumes that

there is no autocorrelation. The value of the test statistic lies between 0 and 4; a value close to 2 indicates that there is no autocorrelation.
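A rough sketch of two of the assumption checks mentioned above, assuming statsmodels and NumPy are available; the synthetic data is illustrative:

# VIF for multicollinearity and the Durbin-Watson statistic for autocorrelation.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

X_const = sm.add_constant(X)                          # add the intercept column
model = sm.OLS(y, X_const).fit()

vif = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
print("VIF per feature:", vif)                        # values above 5 suggest high collinearity
print("Durbin-Watson :", durbin_watson(model.resid))  # a value near 2 means little autocorrelation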

How to deal with the Violation of any of the Assumptions


Violation of the assumptions leads to a decrease in the accuracy of the model; the predictions become less accurate and the error increases.

For example, if the Independence assumption is violated then the relationship between the

independent and dependent variable can not be determined precisely.

There are various methods and techniques available to deal with violations of the assumptions. Let's discuss some of them below.

Violation of Normality assumption of variables or error terms

To treat this problem, we can transform the variables to the normal distribution using various

transformation functions such as log transformation, Reciprocal, or Box-Cox Transformation.


Violation of the Multicollinearity Assumption

It can be dealt with by:

 Doing nothing (if there is no major difference in the accuracy)

 Removing some of the highly correlated independent variables.

 Deriving a new feature by linearly combining the independent variables, such as adding them

together or performing some mathematical operation.

 Performing an analysis designed for highly correlated variables, such as principal

components analysis.

Evaluation Metrics for Regression Analysis

To understand the performance of a regression model, performing model evaluation is necessary. Some of the evaluation metrics used for regression analysis are:
1. R squared or Coefficient of Determination: The most commonly used metric for model evaluation in regression analysis is R squared. It can be defined as the ratio of the variation explained by the model to the total variation. The value of R squared lies between 0 and 1; the closer the value is to 1, the better the model.

R² = 1 − (SSRES / SSTOT)

where SSRES is the Residual Sum of squares and SSTOT is the Total Sum of squares

2. Adjusted R squared: It is the improvement to R squared. The problem/drawback with R2

is that as the features increase, the value of R2 also increases which gives the illusion of a

good model. So the Adjusted R2 solves the drawback of R2. It only considers the features

which are important for the model and shows the real improvement of the model.

Adjusted R2 is always lower than R2.

Adjusted R² = 1 − [ (1 − R²)(n − 1) / (n − k − 1) ], where n is the number of observations and k is the number of independent variables.

3. Mean Squared Error (MSE): Another Common metric for evaluation is Mean squared

error which is the mean of the squared difference of actual vs predicted values.
MSE = ∑(yactual − ypred)² / N

4. Root Mean Squared Error (RMSE): It is the square root of MSE, i.e. the root of the mean squared difference between actual and predicted values. Like MSE, it penalizes large errors, but it is expressed in the same units as the target variable.

1.7 Overfitting

Overfitting and Underfitting are the two main problems that occur in machine learning and
degrade the performance of the machine learning models.

The main goal of each machine learning model is to generalize well.


Here generalization is the ability of an ML model to provide suitable output for a given set of previously unseen inputs. It means that after being trained on the dataset, the model can produce reliable and accurate output. Hence, underfitting and overfitting are the two terms that need to be checked to judge whether the model is generalizing well or not.

Before understanding overfitting and underfitting, let's understand some basic terms that will help in understanding this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.

Overfitting

Overfitting occurs when our machine learning model tries to cover all the data points or more
than the required data points present in the given dataset. Because of this, the model starts
caching noise and inaccurate values present in the dataset, and all these factors reduce the
efficiency and accuracy of the model. The overfitted model has low bias and high variance.

In the corresponding scatter plot, an overfitted model tries to pass through every data point. It may look efficient, but in reality it is not: the goal of the regression model is to find the best fit line, and such a curve is not a true best fit, so it will generate prediction errors on new data.

How to avoid the Overfitting in Model

Both overfitting and underfitting cause the degraded performance of the machine learning
model. But the main cause is overfitting, so there are some ways by which we can reduce the
occurrence of overfitting in our model.
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling

Underfitting

Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, but as a result the model may not learn enough from the training data and may fail to find the best fit of the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training data, and
hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand underfitting from the output of a linear regression model that underfits: the fitted line is unable to capture the data points present in the plot.

How to avoid underfitting:

o By increasing the training time of the model.


o By increasing the number of features.
Goodness of Fit

The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning
models to achieve the goodness of fit. In statistics modeling, it defines how closely the result
or predicted values match the true values of the dataset.

The model with a good fit is between the underfitted and overfitted model, and ideally, it
makes predictions with 0 errors, but in practice, it is difficult to achieve it.

As we train our model over time, the errors on the training data go down, and initially the same happens with the test data. But if we train the model for too long, its performance may decrease due to overfitting, as the model also learns the noise present in the dataset. The errors on the test dataset then start increasing, so the point just before the test error starts rising is the good point, and we can stop there to achieve a good model.

There are two other methods by which we can get a good point for our model, which are
the resampling method to estimate model accuracy and validation dataset.

1.8 Instance based learning


The machine learning systems categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure. It is called instance-based because it builds its hypotheses from the training instances. It is also known as memory-based learning or lazy learning (because processing is delayed until a new instance must be classified). The time complexity of this approach depends on the size of the training data. Each time a new query is encountered, the previously stored data is examined and a target function value is assigned to the new instance.
The worst-case time complexity of this algorithm is O (n), where n is the number of
training instances. For example, If we were to create a spam filter with an instance-based
learning algorithm, instead of just flagging emails that are already marked as spam emails,
our spam filter would be programmed to also flag emails that are very similar to them. This
requires a measure of resemblance between two emails. A similarity measure between two
emails could be the same sender or the repetitive use of the same keywords or something
else.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the
target function.
2. This algorithm can adapt to new data easily, as data is collected incrementally over time.
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query involves starting the
identification of a local model from scratch.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
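A minimal instance-based learning sketch using k-nearest neighbours, assuming scikit-learn; the tiny dataset and k = 3 are illustrative:

# KNN simply stores the training instances and classifies a new query by
# looking at the classes of its nearest stored neighbours.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # "training" = storing the instances
print(knn.predict([[1.5, 1.5], [8.5, 8.5]]))                     # classified by similarity to stored points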
1.9 Feature Reduction (Dimensionality Reduction)
What is Predictive Modeling: Predictive modeling is a probabilistic process that allows us
to forecast outcomes, on the basis of some predictors. These predictors are basically
features that come into play when deciding the final result, i.e. the outcome of the model.
Dimensionality reduction is the process of reducing the number of features (or dimensions)
in a dataset while retaining as much information as possible. This can be done for a variety
of reasons, such as to reduce the complexity of a model, to improve the performance of a
learning algorithm, or to make it easier to visualize the data. There are several techniques
for dimensionality reduction, including principal component analysis (PCA), singular value
decomposition (SVD), and linear discriminant analysis (LDA). Each technique uses a
different method to project the data onto a lower-dimensional space while preserving
important information.
What is Dimensionality Reduction?
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features.
The higher the number of features, the harder it gets to visualize the training set and then
work on it. Sometimes, most of these features are correlated, and hence redundant. This is
where dimensionality reduction algorithms come into play. Dimensionality reduction is the
process of reducing the number of random variables under consideration, by obtaining a set
of principal variables. It can be divided into feature selection and feature extraction.
Why is Dimensionality Reduction important in Machine Learning and Predictive
Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This
can involve a large number of features, such as whether or not the e-mail has a generic title,
the content of the e-mail, whether the e-mail uses a template, etc. However, some of these
features may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since both of the
aforementioned are correlated to a high degree. Hence, we can reduce the number of
features in such problems. A 3-D classification problem can be hard to visualize, whereas a
2-D one can be mapped to a simple 2 dimensional space, and a 1-D problem to a simple
line. The below figure illustrates this concept, where a 3-D feature space is split into two 2-
D feature spaces, and later, if found to be correlated, the number of features can be reduced
even further.
Components of Dimensionality Reduction
There are two components of dimensionality reduction:
 Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually
involves three ways:
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high dimensional space to a lower
dimension space, i.e. a space with lesser no. of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear or non-linear, depending upon the method
used. The prime linear method, called Principal Component Analysis, or PCA, is discussed
below.
Principal Component Analysis
This method was introduced by Karl Pearson. It works on the condition that while the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum. It involves the following steps:
 Construct the covariance matrix of the data.
 Compute the eigenvectors of this matrix.
 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been some
data loss in the process. But, the most important variances should be retained by the
remaining eigenvectors.
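A small NumPy sketch of the steps listed above (centre the data, build the covariance matrix, take the top eigenvectors, project); the random 3-D data and the choice of 2 components are illustrative:

# PCA via the covariance matrix and its eigen-decomposition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                         # centre the data

cov = np.cov(X, rowvar=False)                  # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)         # eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]              # largest eigenvalues first
components = eigvecs[:, order[:2]]             # keep the top 2 principal directions
X_reduced = X @ components                     # project the 3-D data down to 2-D
print(X_reduced.shape)                         # (100, 2)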
Advantages of Dimensionality Reduction
 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes
undesirable.
 PCA fails in cases where mean and covariance are not enough to define datasets.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.

Important points:

 Dimensionality reduction is the process of reducing the number of features in a dataset while retaining as much information as possible.
 This can be done to reduce the complexity of a model, improve the performance of a learning algorithm, or make it easier to visualize the data.
 Techniques for dimensionality reduction include: principal component analysis (PCA),
singular value decomposition (SVD), and linear discriminant analysis (LDA).
 Each technique projects the data onto a lower-dimensional space while preserving
important information.
 Dimensionality reduction is performed during the pre-processing stage, before building a model, to improve performance.
 It is important to note that dimensionality reduction can also discard useful information,
so care must be taken when applying these techniques.
1.10 Collaborative filtering based recommendation
In Collaborative Filtering, we tend to find similar users and recommend what similar users
like. In this type of recommendation system, we don’t use the features of the item to
recommend it, rather we classify the users into the clusters of similar types, and recommend
each user according to the preference of its cluster.
Measuring Similarity: A simple example of a movie recommendation system will help us in explaining this:

In this type of scenario, we can see that User 1 and User 2 give nearly similar ratings to the movies they have both rated, so we can conclude that Movie 3 is also likely to be moderately liked by User 1, while Movie 4 would be a good recommendation for User 2. We can also see that there are users with different tastes: User 1 and User 3 are opposite to each other. User 3 and User 4 have a common interest in the movies, so on that basis we can say that Movie 4 is also likely to be disliked by User 4. This is Collaborative Filtering: we recommend to users the items liked by other users with similar interests.

Cosine Distance: We can also use the cosine similarity between users to find users with similar interests; a larger cosine implies a smaller angle between two users, and hence more similar interests. We can apply the cosine measure between two users in the utility matrix, giving a value of zero to all the unfilled entries to make the calculation easy. If the cosine is smaller, the distance between the users is larger; if the cosine is larger, the angle between the users is small and we can recommend them similar things.
Rounding the Data: In collaborative filtering we round off the data to compare it more easily; for example, we can map ratings below 3 to 0 and ratings of 3 and above to 1. Applying this rounding to the previous example makes the data much more readable, and we can see that User 1 and User 2 are more similar, while User 3 and User 4 are more alike.
Normalizing Rating: In the process of normalizing, we take the average rating of a user and subtract it from all of that user's given ratings, so we get either positive or negative values as ratings, which can then be used to classify users into similar groups. By normalizing the data we can form clusters of users who give similar ratings to similar items, and then use these clusters to recommend items to the users.
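A small sketch of user-user similarity on a utility matrix, assuming NumPy; the 4x4 rating matrix (rows = users, columns = movies, 0 = unrated) is illustrative:

# Cosine similarity between users: a larger value means more similar tastes.
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],    # User 1
    [4, 5, 1, 0],    # User 2
    [1, 0, 5, 4],    # User 3
    [0, 1, 4, 5],    # User 4
], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print("sim(User 1, User 2):", cosine(ratings[0], ratings[1]))   # high: similar interests
print("sim(User 1, User 3):", cosine(ratings[0], ratings[2]))   # low: opposite interests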
1.11 Decision Trees

What is a Decision Tree?

It is a tool that has applications spanning several different areas. Decision trees can be used
for classification as well as regression problems. The name itself suggests that it uses a
flowchart like a tree structure to show the predictions that result from a series of feature-
based splits. It starts with a root node and ends with a decision made by leaves.


Before learning more about decision trees let’s get familiar with some of the terminologies.

Root Node – It is the node present at the beginning of a decision tree; from this node, the population starts dividing according to various features.

Decision Nodes – the nodes we get after splitting the root node are called decision nodes.

Leaf Nodes – the nodes where further splitting is not possible are called leaf nodes or terminal nodes.

Sub-tree – just as a small portion of a graph is called a sub-graph, a sub-section of a decision tree is called a sub-tree.

Pruning – cutting down some nodes to stop overfitting.


Example of a decision tree

Let’s understand decision trees with the help of an example.



Decision trees are drawn upside down, which means the root is at the top and this root then splits into several nodes. In layman's terms, decision trees are nothing but a bunch of if-else statements: the tree checks whether a condition is true and, if it is, moves on to the next node attached to that decision.

In the diagram below, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it moves on to the next feature, such as humidity or wind. It then checks, for instance, whether the wind is strong or weak; if the wind is weak and the weather is rainy, the person may go out and play.
(Image: decision tree for deciding whether to play, using weather, humidity, and wind.)
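Since a decision tree is essentially a set of nested if-else statements, the weather example above could be hand-written roughly like this (the exact humidity condition is an assumption borrowed from the classic play-tennis example):

def will_play(outlook, humidity=None, wind=None):
    """A hand-written decision 'tree' for the weather example."""
    if outlook == "cloudy":
        return "play"                                   # pure leaf: always play
    if outlook == "sunny":
        return "play" if humidity == "normal" else "don't play"
    if outlook == "rainy":
        return "play" if wind == "weak" else "don't play"
    return "unknown"

print(will_play("rainy", wind="weak"))                  # -> play
print(will_play("cloudy"))                              # -> play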

Did you notice anything in the above flowchart? We see that if the weather is cloudy the answer is simply to go and play. Why didn't that branch split further? Why did it stop there?

To answer this question we need to know about a few more concepts like entropy, information gain, and the Gini index. In simple terms, the output in the training dataset is always "yes" for cloudy weather; since there is no disorder there, we don't need to split that node further.

The goal of machine learning here is to decrease uncertainty or disorder in the dataset, and for this we use decision trees.

Now you must be wondering: how do I know what the root node should be? What should be a decision node? When should I stop splitting? To decide this, there is a metric called "entropy", which measures the amount of uncertainty in the dataset.

Entropy

Entropy is nothing but the uncertainty in our dataset, a measure of disorder. Let me try to explain this with the help of an example.

Suppose a group of friends has to decide which movie to watch together on Sunday. There are two choices, "Lucy" and "Titanic", and everyone has to state their preference. After everyone gives their answer, we see that "Lucy" gets 4 votes and "Titanic" gets 5 votes. Which movie do we watch now? It is hard to choose one movie, because the votes for the two movies are almost equal.

This is exactly what we call disorder: there is a near-equal number of votes for both movies, and we can't really decide which one to watch. It would have been much easier if the votes for "Lucy" were 8 and for "Titanic" only 2. Here we could easily say that the majority of votes are for "Lucy", so everyone will be watching that movie.

In a decision tree, the output is mostly "yes" or "no".

The formula for entropy is:

E(S) = −p+ log2(p+) − p− log2(p−)

where p+ is the probability of the positive class, p− is the probability of the negative class, and S is the subset of training examples.
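As a minimal sketch, the entropy formula above can be written as a small Python function and checked against the movie-vote example (the function name is just for illustration):

import math

def entropy(p_pos, p_neg):
    """Entropy of a node from the fractions of positive and negative examples."""
    e = 0.0
    for p in (p_pos, p_neg):
        if p > 0:                        # treat 0 * log(0) as 0
            e -= p * math.log2(p)
    return e

# 4 votes vs 5 votes: nearly 50/50, so entropy is close to its maximum of 1.
print(round(entropy(4 / 9, 5 / 9), 3))   # ~0.991
# 8 votes vs 2 votes: much easier to decide, so entropy is lower.
print(round(entropy(8 / 10, 2 / 10), 3)) # ~0.722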

How do Decision Trees use Entropy?

Now that we know what entropy is and what its formula looks like, we need to see how exactly it works in this algorithm.

Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells us how random our data is. A pure sub-split means that a node contains either only "yes" or only "no".

Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".

We see here that the split is not pure. Why? Because we can still see some negative classes in both nodes. In order to build a decision tree, we need to calculate the impurity of each split, and when the purity is 100% we make the node a leaf node.

To check the impurity of the two child nodes we take the help of the entropy formula.

For the left node (5 "yes", 2 "no"):
E = −(5/7) log2(5/7) − (2/7) log2(2/7) ≈ 0.863

For the right node (3 "yes", 2 "no"):
E = −(3/5) log2(3/5) − (2/5) log2(2/5) ≈ 0.971

We can clearly see that the left node has lower entropy, i.e. more purity, than the right node, since the left node has a greater share of "yes" answers and it is easier to make a decision there.

Always remember: the higher the entropy, the lower the purity and the higher the impurity.

As mentioned earlier, the goal of machine learning is to decrease the uncertainty or impurity in the dataset. Using entropy we get the impurity of a particular node, but we still don't know whether the entropy of the parent node has decreased after splitting. For this we bring in a new metric called "information gain", which tells us how much the parent entropy has decreased after splitting on some feature.

Information Gain

Information gain measures the reduction in uncertainty given some feature, and it is also the deciding factor for which attribute should be selected as a decision node or root node.

It is simply the entropy of the full dataset minus the entropy of the dataset given some feature:

Information Gain = E(Parent) − E(Parent | Feature)

To understand this better, let's consider an example. Suppose our entire population has a total of 30 instances, and the dataset is used to predict whether a person will go to the gym or not. Say 16 people go to the gym and 14 people don't.

Now we have two features with which to predict whether a person will go to the gym or not:

Feature 1 is "Energy", which takes two values, "high" and "low".

Feature 2 is "Motivation", which takes three values, "No motivation", "Neutral", and "Highly motivated".

Let's see how our decision tree is built using these two features. We'll use information gain to decide which feature should be the root node and which feature should be placed after the split.
Let's calculate the parent entropy first:

E(Parent) = −(16/30) log2(16/30) − (14/30) log2(14/30) ≈ 0.99

Next, for the split on "Energy", we take the weighted average of the entropy of each child node:

E(Parent | Energy) = Σv (|Sv| / |S|) × E(Sv),  where v runs over the values "high" and "low".

Now that we have E(Parent) and E(Parent | Energy), the information gain is:

Information Gain = E(Parent) − E(Parent | Energy) ≈ 0.37

Our parent entropy was near 0.99, and from this value of information gain we can say that the entropy of the dataset will decrease by about 0.37 if we make "Energy" our root node.
Similarly, we do the same with the other feature, "Motivation": we calculate the entropy of each of its three child nodes, take their weighted average to get E(Parent | Motivation), and subtract it from E(Parent) to obtain its information gain.
We now see that the "Energy" feature gives a larger reduction in entropy (0.37) than the "Motivation" feature. Hence we select the feature with the highest information gain and split the node based on that feature.

In this example "Energy" will be our root node, and we apply the same procedure to the sub-nodes. Here we can see that when the energy is "high" the entropy is low, so we can say that a person will very likely go to the gym when they have high energy. But what if the energy is "low"? We then split that node again based on the next feature, which is "Motivation".
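A short sketch of the information-gain computation: since the per-node counts for "Energy" and "Motivation" are not reproduced here, the example below reuses the earlier 8-"yes"/4-"no" split with child nodes of (5, 2) and (3, 2).

import math

def entropy(counts):
    """Entropy of a node from its class counts, e.g. (yes, no)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """E(Parent) minus the weighted average entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Parent: 8 "yes" / 4 "no"; children after the split: (5, 2) and (3, 2).
print(round(information_gain((8, 4), [(5, 2), (3, 2)]), 3))   # ~0.01, a weak split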

When to stop splitting?

You must be asking yourself: when do we stop growing the tree? Real-world datasets usually have a large number of features, which results in a large number of splits, which in turn gives a huge tree. Such trees take time to build and can lead to overfitting: the tree will give very good accuracy on the training dataset but bad accuracy on the test data.

There are several ways to tackle this problem through hyperparameter tuning. We can set the maximum depth of our decision tree using the max_depth parameter. The larger the value of max_depth, the more complex the tree. The training error will of course decrease as we increase max_depth, but on test data the accuracy can become very poor. Hence we need a value that neither overfits nor underfits the data, and for this we can use GridSearchCV.

Another way is to set the minimum number of samples required for each split, denoted by min_samples_split. For example, we can require a minimum of 10 samples to make a split: if a node has fewer than 10 samples, this parameter stops further splitting of that node and makes it a leaf node.

There are more hyperparameters, such as:

min_samples_leaf – the minimum number of samples required to be in a leaf node. Increasing this number makes the tree more conservative; setting it very low allows splits that isolate only a few samples and so increases the possibility of overfitting.

max_features – the number of features to consider when looking for the best split.

A minimal usage sketch of these hyperparameters together with GridSearchCV is shown below.
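This sketch assumes scikit-learn and uses a synthetic placeholder dataset just to show the mechanics of tuning these hyperparameters with GridSearchCV:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset; any classification dataset would do.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Search over a few of the hyperparameters discussed above.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))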


Pruning

Pruning is another method that can help us avoid overfitting. It improves the performance of the tree by cutting off the nodes or sub-nodes which are not significant, i.e. it removes the branches that have very low importance.

There are mainly two ways of pruning:

(i) Pre-pruning – we stop growing the tree earlier, which means we prune/remove/cut a node if it has low importance while the tree is being grown.

(ii) Post-pruning – once the tree is built to its full depth, we start pruning the nodes based on their significance.
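A brief sketch, assuming scikit-learn: pre-pruning via max_depth and post-pruning via cost-complexity pruning (ccp_alpha), again on a synthetic placeholder dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop the tree early with a depth limit.
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then refit with a chosen alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # larger alpha = more aggressive pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

print("pre-pruned  test accuracy:", pre_pruned.score(X_test, y_test))
print("post-pruned test accuracy:", post_pruned.score(X_test, y_test))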
