OE-ML Unit - 3

Linear regression is a fundamental machine learning algorithm used for predictive analysis of continuous variables, establishing a linear relationship between dependent and independent variables. It can be categorized into simple and multiple linear regression, with the goal of minimizing prediction error through methods like gradient descent and cost functions. Polynomial regression extends linear regression by modeling non-linear relationships through polynomial equations, enhancing accuracy for datasets with non-linear patterns.


Linear Regression in Machine Learning

Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence it is called linear
regression. Since linear regression shows a linear relationship, it finds how
the value of the dependent variable changes according to the value of the
independent variable.

The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here,

y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values of the x and y variables form the training dataset for the Linear
Regression model.
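To make this concrete, here is a minimal sketch (assuming scikit-learn and a tiny made-up dataset) of fitting a simple linear regression and reading off a0 and a1:

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset: x = position level, y = salary (in thousands)
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 42, 48, 55])

model = LinearRegression()
model.fit(x, y)
print(model.intercept_)  # a0, the intercept
print(model.coef_[0])    # a1, the slope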

Types of Linear Regression


Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called
Multiple Linear Regression.

Linear Regression Line


A linear line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:

o Positive Linear Relationship:


If the dependent variable increases on the Y-axis as the independent variable
increases on the X-axis, then such a relationship is termed a Positive linear
relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable
increases on the X-axis, then such a relationship is called a Negative linear
relationship.

Finding the best fit line:


When working with linear regression, our main goal is to find the best-fit line,
which means the error between the predicted values and actual values should be
minimized. The best-fit line will have the least error.
Different values of the weights or line coefficients (a0, a1) give different
regression lines, so we need to calculate the best values for a0 and a1 to find
the best-fit line; to calculate these, we use the cost function.

Cost function-
o Different values of the weights or line coefficients (a0, a1) give different
regression lines, and the cost function is used to estimate the coefficient
values for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures
how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function is
also known as the Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values
and actual values. It can be written as:

For the above linear equation, MSE can be calculated as:

MSE = (1/N) ∑i (yi − (a1xi + a0))²

Where,

N = total number of observations
yi = actual value
(a1xi + a0) = predicted value
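As a quick check on the formula, here is a minimal numpy sketch (with made-up values) that computes the MSE between actual and predicted values:

import numpy as np

y_actual = np.array([30.0, 35.0, 42.0, 48.0, 55.0])     # yi
y_predicted = np.array([31.0, 36.0, 41.0, 47.0, 56.0])  # a1*xi + a0

mse = np.mean((y_actual - y_predicted) ** 2)  # average of squared errors
print(mse)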

Residuals: The distance between the actual values and predicted values is called
the residual. If the observed points are far from the regression line, the residuals
will be high, and so the cost function will be high. If the scatter points are close
to the regression line, the residuals will be small, and hence so will the cost
function.

Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.
o A regression model uses gradient descent to update the coefficients of the line
by reducing the cost function.
o This is done by randomly selecting initial values for the coefficients and then
iteratively updating them to reach the minimum of the cost function, as in the
sketch below.
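A minimal gradient descent sketch for simple linear regression (illustrative data and an assumed learning rate, not the document's dataset):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])

a0, a1 = 0.0, 0.0  # arbitrary starting coefficients
lr = 0.01          # learning rate (assumed hyperparameter)

for _ in range(10000):
    error = (a1 * x + a0) - y
    # Gradients of the MSE cost with respect to a0 and a1
    a0 -= lr * 2 * error.mean()
    a1 -= lr * 2 * (error * x).mean()

print(a0, a1)  # approaches the least-squares coefficients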
Model Performance:
The goodness of fit determines how well the regression line fits the set of
observations. The process of finding the best model out of various models is
called optimization. It can be achieved by the method below:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.


o It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the
predicted values and actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
o It can be calculated from the below formula:

R-squared = Explained variation / Total variation
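In practice, scikit-learn's r2_score computes this directly; a minimal sketch with made-up values:

from sklearn.metrics import r2_score

y_actual = [30, 35, 42, 48, 55]
y_predicted = [31, 36, 41, 47, 56]

print(r2_score(y_actual, y_predicted))  # closer to 1.0 means a better fit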

Assumptions of Linear Regression


Below are some important assumptions of Linear Regression. These are formal
checks to make while building a Linear Regression model, which ensure we get
the best possible result from the given dataset.

o Linear relationship between the features and target:


Linear regression assumes the linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables.
Due to multicollinearity, it may be difficult to find the true relationship between
the predictors and the target variable; that is, it is difficult to determine
which predictor variable is affecting the target variable and which is not. So, the
model assumes either little or no multicollinearity between the features or
independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is the situation in which the error term is the same for all
values of the independent variables. With homoscedasticity, there should be no
clear pattern in the distribution of data points in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal
distribution pattern. If error terms are not normally distributed, then confidence
intervals will become either too wide or too narrow, which may cause difficulties
in finding coefficients.
This can be checked using a q-q plot: if the plot shows a straight line without
major deviations, the errors are normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If
there is any correlation in the error terms, it will drastically reduce the
accuracy of the model. Autocorrelation usually occurs if there is a dependency
between residual errors.

ML Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship
between a dependent variable (y) and an independent variable (x) as an
nth-degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x1 + b2x1² + b3x1³ + ...... + bnx1ⁿ

o It is also called a special case of Multiple Linear Regression in ML, because we
add some polynomial terms to the Multiple Linear Regression equation to
convert it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a
linear model."

Need for Polynomial Regression:


The need of Polynomial Regression in ML can be understood in the below
points:
o If we apply a linear model to a linear dataset, it gives a good result,
as we have seen in Simple Linear Regression, but if we apply the same model
without any modification to a non-linear dataset, it will produce drastically
wrong output, due to which the loss function will increase, the error rate will
be high, and accuracy will decrease.
o So for such cases, where data points are arranged in a non-linear fashion,
we need the Polynomial Regression model. We can understand it in a better
way using the below comparison diagram of the linear dataset and non-linear
dataset.

o In the above image, we have taken a dataset which is arranged non-linearly. So
if we try to cover it with a linear model, we can clearly see that it hardly
covers any data point. On the other hand, a curve is suitable to cover most of
the data points; that curve comes from the Polynomial model.
o Hence, if the datasets are arranged in a non-linear fashion, then we should use
the Polynomial Regression model instead of Simple Linear Regression.

Note: A Polynomial Regression algorithm is also called Polynomial Linear
Regression because it is linear not in the variables but in the
coefficients, which are arranged in a linear fashion.

Equation of the Polynomial Regression Model:


Simple Linear Regression equation: y = b0 + b1x .........(a)

Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)

Polynomial Regression equation: y = b0 + b1x + b2x² + b3x³ + .... + bnxⁿ ..........(c)

When we compare the above three equations, we can clearly see that all three
are polynomial equations that differ in the degree of the variables. The
Simple and Multiple Linear equations are polynomial equations of degree one,
and the Polynomial Regression equation is a linear equation of degree n. So if
we raise the degree of our linear equation, it is converted into a Polynomial
Linear equation.

Note: To better understand Polynomial Regression, you must have


knowledge of Simple Linear Regression.

Implementation of Polynomial Regression using


Python:
Here we will implement the Polynomial Regression using Python. We will
understand it by comparing Polynomial Regression model with the Simple
Linear Regression model. So first, let's understand the problem for which we are
going to build the model.

Problem Description: There is a Human Resource company, which is going to


hire a new candidate. The candidate has stated that his previous salary was 160K
per annum, and HR has to check whether he is telling the truth or bluffing. To
identify this, they only have a dataset of his previous company in which the
salaries of the top 10 positions are mentioned with their levels. By checking the
dataset available, we have found that there is a non-linear relationship
between the Position levels and the salaries. Our goal is to build a Bluffing
detector regression model, so HR can hire an honest candidate. Below are the
steps to build such a model.
Steps for Polynomial Regression:

The main steps involved in Polynomial Regression are given below:

o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.

Note: Here, we will build the Linear Regression model as well as the
Polynomial Regression model to compare their predictions; the
Linear Regression model serves as a reference.
Data Pre-processing Step:

The data pre-processing step will remain the same as in previous regression
models, except for some changes. In the Polynomial Regression model, we will
not use feature scaling, and we will not split our dataset into training and
test sets. There are two reasons for this:

o The dataset contains very few records, so it is not suitable to divide into
training and test sets; otherwise our model will not be able to find the
correlations between the salaries and levels.
o In this model, we want very accurate predictions for salary, so the model should
have as much information as possible.
The code for the pre-processing step is given below:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('Position_Salaries.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, 1:2].values
y = data_set.iloc[:, 2].values
Explanation:

o In the above lines of code, we have imported the important Python libraries to
import the dataset and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains
three columns (Position, Level, and Salary), but we will consider only two
columns (Level and Salary).
o After that, we have extracted the dependent variable (y) and independent
variable (x) from the dataset. For the x variable, we have used the index [:, 1:2],
because we want column index 1 (Levels), and included :2 so that x is a matrix
(2-D array) rather than a vector.
Output:

By executing the above code, we can read our dataset as:


As we can see in the above output, there are three columns present (Position,
Level, and Salary). But we are only considering two columns, because the Position
column is equivalent to the Level column: Levels can be seen as the encoded form
of Positions.

Here we will predict the output for level 6.5, because the candidate has 4+ years'
experience as a regional manager, so he must be somewhere between levels 6
and 7.

Building the Linear regression model:

Now, we will build and fit the Linear regression model to the dataset. In building
polynomial regression, we will take the Linear regression model as reference
and compare both the results. The code is given below:

# Fitting the Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_regs = LinearRegression()
lin_regs.fit(x, y)
In the above code, we have created a Simple Linear model
using the lin_regs object of the LinearRegression class and fitted it to the
dataset variables (x and y).
Output:

Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


Building the Polynomial regression model:

Now we will build the Polynomial Regression model, but it will be a little
different from the Simple Linear model, because here we will
use the PolynomialFeatures class of the preprocessing library. We are using this
class to add extra polynomial features to our dataset.

# Fitting the Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_regs = PolynomialFeatures(degree=2)
x_poly = poly_regs.fit_transform(x)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(x_poly, y)
In the above lines of code, we have used poly_regs.fit_transform(x), because
first we convert our feature matrix into a polynomial feature matrix, and
then fit it to the Polynomial Regression model. The parameter value (degree=2)
is our choice; we can pick it according to the polynomial features we want.

After executing the code, we get another matrix, x_poly, which can be seen
under the variable explorer option.
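As a quick illustration of what PolynomialFeatures produces (a hypothetical toy input, not the actual dataset), degree=2 turns each value x into the row [1, x, x²]:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

demo = np.array([[2.0], [3.0]])  # toy inputs for illustration only
print(PolynomialFeatures(degree=2).fit_transform(demo))
# [[1. 2. 4.]
#  [1. 3. 9.]]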
Next, we have used another LinearRegression object, namely lin_reg_2, to fit
our x_poly matrix to the linear model.

Output:

Out[11]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


Visualizing the result for Linear regression:

Now we will visualize the result for Linear regression model as we did in Simple
Linear Regression. Below is the code for it:

# Visualizing the result for the Linear Regression model
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_regs.predict(x), color="red")
mtp.title("Bluff detection model (Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
Output:
In the above output image, we can clearly see that the regression line is far
from the data points. Predictions are shown as a red straight line, and the blue
points are the actual values. If we use this output to predict the salary of the
CEO, it gives a salary of approximately $600,000, which is far from the real value.

So we need a curved model to fit the dataset other than a straight line.

Visualizing the result for Polynomial Regression

Here we will visualize the result of the Polynomial Regression model, the code
for which is a little different from the above model.

Code for this is given below:

# Visualizing the result for Polynomial Regression
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model (Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
In the above code, we have used lin_reg_2.predict(poly_regs.fit_transform(x))
instead of x_poly, because we want the linear regressor object to predict from
the polynomial feature matrix.
Output:

As we can see in the above output image, the predictions are close to the real
values. The above plot will vary as we will change the degree.

For degree = 3:

If we change the degree to 3, we get a more accurate plot, as shown in
the below image.
As we can see in the above output image, the predicted salary for level
6.5 is near 170K$-190K$, which suggests the future employee is telling the truth
about his salary.

Degree = 4: Let's again change the degree to 4; now we get the most
accurate plot. Hence we can get more accurate results by increasing the degree
of the polynomial.
Predicting the final result with the Linear Regression model:

Now, we will predict the final output using the Linear regression model to see
whether an employee is saying truth or bluff. So, for this, we will use
the predict() method and will pass the value 6.5. Below is the code for it:

lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)
Output:

[330378.78787879]
Predicting the final result with the Polynomial Regression model:

Now, we will predict the final output using the Polynomial Regression model to
compare with Linear model. Below is the code for it:

poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
print(poly_pred)
Output:

[158862.45265153]
As we can see, the predicted output for the Polynomial Regression is
[158862.45265153], which is much closer to the real value; hence, we can say
that the future employee is telling the truth.

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for predicting
the categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be
Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0
and 1, it gives probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how
they are used. Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such
as whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify the observations using different
types of data and can easily determine the most effective variables used for the
classification. The below image is showing the logistic function:

Note: Logistic regression uses the concept of predictive modeling as
regression does; therefore, it is called logistic regression. But because it is
used to classify samples, it falls under the classification algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted


values to probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines
the probability of either 0 or 1: values above the threshold tend
to 1, and values below the threshold tend to 0.
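A minimal numpy sketch of the sigmoid and the threshold rule (the 0.5 threshold is an assumed, commonly used default):

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)  # threshold at 0.5
print(probs)   # [0.047... 0.5 0.952...]
print(labels)  # [0 1 1]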

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variable should not have multi-collinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are
given below:

o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the
above equation by (1 − y):

y / (1 − y); 0 for y = 0, and infinity for y = 1

o But we need a range between −[infinity] and +[infinity]; taking the logarithm
of the equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.
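A minimal usage sketch with scikit-learn (a tiny made-up binary dataset, for illustration only):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied vs. pass (1) / fail (0)
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(x, y)
print(clf.predict([[3.5]]))        # predicted class (0 or 1)
print(clf.predict_proba([[3.5]]))  # probabilities for each class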

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three
types:

o Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dog", or
"sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple
branches, whereas Leaf nodes are the output of those decisions and do not
contain any further branches.
o The decisions or the test are performed on the basis of features of the given
dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it
further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as
numeric data.

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best


algorithm for the given dataset and problem is the main point to remember
while creating a machine learning model. Below are the two reasons for using
the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so
it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a
tree-like structure.

Decision Tree Terminologies

 Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be split
further after reaching a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
 Branch/Sub Tree: A subtree formed by splitting the tree.
 Pruning: Pruning is the process of removing unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and the other nodes
are called child nodes.
How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given dataset, the algorithm
starts from the root node of the tree. The algorithm compares the value of the
root attribute with the record's (real dataset's) attribute and, based on the
comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further. It continues the process until it reaches a
leaf node of the tree. The complete process can be better understood using the
below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best
attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where you
cannot classify the nodes further; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or not. To solve this problem, the
decision tree starts with the root node (the Salary attribute, chosen by ASM).
The root node splits further into the next decision node (distance from the
office) and one leaf node based on the corresponding labels. The next decision
node further gets split into one decision node (cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offer and
Declined offer). Consider the below diagram:
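A minimal scikit-learn sketch of this kind of classifier (the features and encoding below are hypothetical, chosen to mirror the example):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy encoding: [salary_ok, distance_ok, cab_facility] -> accept (1) / decline (0)
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(criterion="entropy")  # split by information gain
tree.fit(X, y)
print(tree.predict([[1, 0, 1]]))  # predict accept/decline for a new offer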
Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select
the best attribute for the root node and for the sub-nodes. To solve such
problems there is a technique called the Attribute Selection Measure, or ASM.
With this measure, we can easily select the best attribute for the nodes of the
tree. There are two popular techniques for ASM:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It
specifies randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)


Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree
in the CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the
high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − ∑j Pj²
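A minimal sketch computing both impurity measures for a two-class node (the probabilities are illustrative):

import numpy as np

def entropy(p_yes):
    # Entropy(S) = -P(yes)*log2(P(yes)) - P(no)*log2(P(no))
    p_no = 1 - p_yes
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

def gini(p_yes):
    # Gini Index = 1 - sum_j Pj^2
    p_no = 1 - p_yes
    return 1 - (p_yes ** 2 + p_no ** 2)

print(entropy(0.5), gini(0.5))  # maximum impurity: 1.0 and 0.5
print(entropy(1.0), gini(1.0))  # pure node: 0.0 and 0.0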

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get
the optimal decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not
capture all the important features of the dataset. A technique that
decreases the size of the learned tree without reducing accuracy is known as
pruning. There are mainly two types of tree pruning techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.
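For reference, scikit-learn exposes cost complexity pruning through the ccp_alpha parameter of DecisionTreeClassifier; a minimal sketch (toy data assumed):

from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# Larger ccp_alpha values prune more aggressively, yielding a smaller tree
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01)
pruned_tree.fit(X, y)
print(pruned_tree.get_depth())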

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follow
while making any decision in real-life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
o For more class labels, the computational complexity of the decision tree may
increase.

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both Classification and
Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number
of decision trees on various subsets of the given dataset and takes the
average to improve the predictive accuracy of that dataset." Instead of
relying on one decision tree, the random forest takes the prediction from each
tree and, based on the majority vote of predictions, predicts the final
output.

The greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:

Note: To better understand the Random Forest Algorithm, you should


have knowledge of the Decision Tree Algorithm.

Assumptions for Random Forest


Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct output,
while others may not. But together, all the trees predict the correct output.
Therefore, below are two assumptions for a better Random forest classifier:

o There should be some actual values in the feature variable of the dataset so that
the classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?


Below are some points that explain why we should use the Random Forest
algorithm:

<="" li="" style="box-sizing: border-box;">


o It takes less training time compared to other algorithms.
o It predicts output with high accuracy, and it runs efficiently even on large
datasets.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions using each
tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points
(Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the prediction of each decision tree, and
assign the new data point to the category that wins the majority vote.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into
subsets and given to each decision tree. During the training phase, each
decision tree produces a prediction result, and when a new data point occurs,
then based on the majority of results, the Random Forest classifier predicts the
final decision. Consider the below image:
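A minimal scikit-learn sketch of this idea (the feature vectors below are made-up stand-ins for the fruit images):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: hypothetical 2-feature descriptions of fruits, labels 0/1
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 2], [0, 2]])
y = np.array([0, 0, 1, 1, 0, 1])

forest = RandomForestClassifier(n_estimators=10)  # N = 10 decision trees
forest.fit(X, y)
print(forest.predict([[1, 1]]))  # majority vote across the trees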
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification
of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest


o Random Forest is capable of performing both Classification and Regression
tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
o Although Random Forest can be used for both classification and regression
tasks, it is less suitable for regression tasks.
