ML Unit-2 Material WORD
(IV IT – I SEM.)
UNIT – II
Supervised Learning
Learning a Class from Examples, Linear, Non-linear, Multi-class and Multi-label classification, Decision
Trees: ID3, Classification and Regression Trees (CART), Regression: Linear Regression, Multiple Linear Regression
UNIT – II
Supervised Learning
Decision Tree
Introduction

Decision Trees are a type of supervised machine learning (that is, you specify what the input is and what the corresponding output is in the training data) in which the data is continuously split according to a certain parameter. A tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or final outcomes, and the decision nodes are where the data is split.
An example of a decision tree can be explained using a binary tree. Let's say you want to predict whether a person is fit given information such as age, eating habits, and physical activity. The decision nodes here are questions like 'What is the age?', 'Does he exercise?', and 'Does he eat a lot of pizza?', and the leaves are outcomes such as 'fit' or 'unfit'. In this case it is a binary classification problem (a yes/no type of problem). There are two main types of Decision Trees:

Classification Trees: What we have seen above is an example of a classification tree, where the outcome is a categorical variable such as 'fit' or 'unfit'.

Regression Trees: Here the decision or outcome variable is continuous, e.g. a number like 123.

Working

Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms that construct Decision Trees, but one of the best known is the ID3 algorithm. ID3 stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we'll go through a few definitions.

Entropy

Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is a measure of the amount of uncertainty or randomness in data. Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest possible, since there is no way of determining what the outcome might be. Alternatively,
consider a coin which has heads on both sides; the outcome of such an event can be predicted perfectly, since we know beforehand that it will always be heads. In other words, this event has no randomness, hence its entropy is zero. In general, lower entropy values imply less uncertainty while higher values imply more uncertainty.

Information Gain

Information gain, also called Kullback-Leibler divergence and denoted by IG(S, A) for a set S, is the effective change in entropy after deciding on a particular attribute A. It measures the relative change in entropy with respect to the independent variables.
Information Gain Formula:

IG(S, A) = H(S) − Σ P(x) · H(x)

where IG(S, A) is the information gain obtained by applying feature A, H(S) is the entropy of the entire set, and the second term is the entropy remaining after applying feature A, with the sum running over the possible values x of attribute A and P(x) the probability of value x.

Let's understand this with the help of an example. Consider a piece of data collected over the course of 14 days, where the features are Outlook, Temperature, Humidity and Wind, and the outcome variable is whether Golf was played on the day. Our job is to build a predictive model which takes the above 4 parameters and predicts whether Golf will be played on the day. We'll build a decision tree to do that using the ID3 algorithm.
ID3

The ID3 algorithm performs the following tasks recursively:
1. Calculate the entropy H(S) of the current set of examples.
2. For each remaining attribute, calculate the information gain obtained by splitting on that attribute.
3. Select the attribute with the highest information gain and split the set on it.
4. Repeat on each resulting subset with the remaining attributes, until every subset is pure (all examples have the same class) or no attributes remain.
Now we'll go ahead and grow the decision tree. The initial step is to calculate H(S), the entropy of the current state. In the above example, we can see that in total there are 5 No's and 9 Yes's:

Yes | No | Total
 9  |  5 |  14

H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
Next we calculate the second term of the information gain formula, where x ranges over the possible values of an attribute. Here, the attribute 'Wind' takes two possible values in the sample data, hence x = {Weak, Strong}, and we'll have to calculate the entropy of each subset.

Amongst all 14 examples we have 8 places where the wind is Weak and 6 where the wind is Strong.

Now, out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and 2 of them were 'No' for Play Golf. So we have:

H(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) ≈ 0.811

Similarly, out of the 6 Strong examples, we have 3 examples where the outcome was 'Yes' for Play Golf and 3 where we had 'No' for Play Golf:

H(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
Remember, here half the items belong to one class while the other half belong to the other class, hence we have perfect randomness (entropy of 1). Now we have all the pieces required to calculate the Information Gain:

IG(S, Wind) = H(S) − (8/14) · H(S_weak) − (6/14) · H(S_strong) = 0.940 − (8/14)(0.811) − (6/14)(1.0) ≈ 0.048

This tells us that considering 'Wind' as the feature gives an information gain of 0.048. Now we must similarly calculate the Information Gain for all the features.
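As a quick check, the same arithmetic can be scripted. The short Python sketch below is only an illustration (the helper names entropy and info_gain are our own); the counts come from the worked example above:

from math import log2

def entropy(counts):
    # Shannon entropy of a class distribution given as a list of counts.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, subsets):
    # IG(S, A) = H(S) - sum over values of A of P(value) * H(subset).
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - remainder

print(round(entropy([9, 5]), 3))                      # H(S) = 0.94
print(round(info_gain([9, 5], [[6, 2], [3, 3]]), 3))  # IG(S, Wind) = 0.048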
We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we choose the Outlook attribute as the root node. At this point, the decision tree looks like this:
Here we observe that whenever the outlook is Overcast, Play Golf is always 'Yes'. This is no coincidence: the simple tree results because the attribute Outlook gives the highest information gain. Now how do we proceed from this point? We simply apply recursion; you might want to look back at the algorithm steps described earlier. Now that we've used Outlook, we have three attributes remaining: Humidity, Temperature, and Wind. And we had three possible values of Outlook: Sunny, Overcast, and Rain. The Overcast node has already ended up as the leaf node 'Yes', so we're left with two subtrees to compute: Sunny and Rain.

For the Sunny subtree, the highest Information Gain is given by Humidity. Proceeding in the same way with the Rain subtree gives us Wind as the attribute with the highest information gain. The final Decision Tree looks something like this.
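The recursion itself can then be sketched on top of the entropy and info_gain helpers from the previous snippet. This is a minimal illustration, not the exact implementation behind the example; the dictionary-of-rows representation and the names class_counts and id3 are our own choices:

from collections import Counter

def class_counts(rows, target):
    # Turn a list of example dictionaries into class counts for entropy().
    return list(Counter(r[target] for r in rows).values())

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # pure subset: make a leaf
        return labels[0]
    if not attributes:                        # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute whose split yields the highest information gain.
    def gain(attr):
        subsets = [[r for r in rows if r[attr] == v] for v in {r[attr] for r in rows}]
        return info_gain(class_counts(rows, target),
                         [class_counts(s, target) for s in subsets])
    best = max(attributes, key=gain)
    tree = {best: {}}
    for value in {r[best] for r in rows}:     # one branch per value of the best attribute
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree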
Classification and Regression Trees (CART)

In a regression tree, a regression model is fit to the target variable using each of the independent variables. After this, the data is split at several points for each independent variable. At each such point, the error between the predicted values and the actual values is squared to get the Sum of Squared Errors (SSE). The SSE is compared across the variables, and the variable or point which has the lowest SSE is chosen as the split point. This process is continued recursively.
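To make the SSE-based split search concrete, here is a minimal Python sketch. The toy data and the helper names sse and best_split are invented purely for illustration:

def sse(values):
    # Sum of squared errors of the values around their mean.
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    # Return the split point on feature x with the lowest total SSE of y.
    best_point, best_err = None, float("inf")
    for point in sorted(set(x)):
        left  = [yi for xi, yi in zip(x, y) if xi <= point]
        right = [yi for xi, yi in zip(x, y) if xi > point]
        err = sse(left) + sse(right)
        if err < best_err:
            best_point, best_err = point, err
    return best_point, best_err

# Toy data (made up only for illustration).
x = [1, 2, 3, 4, 5, 6]
y = [5, 6, 5, 20, 21, 22]
print(best_split(x, y))   # picks the split between x = 3 and x = 4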
Classification and regression tree tutorials, as well as classification and regression tree ppts, exist in
abundance. This is a testament to the popularity of these decision trees and how frequently they are used.
However, these decision trees are not without their disadvantages.
There are many classification and regression tree examples where the use of a decision tree has not led to the optimal result. Here are some of the limitations of classification and regression trees.
(i) Overfitting
Overfitting occurs when the tree takes into account a lot of noise that exists in the data
and comes up with an inaccurate result.
(ii) High variance
In this case, a small variance in the data can lead to a very high variance in the
prediction, thereby affecting the stability of the outcome.
(iii) Low bias
A decision tree that is very complex usually has a low bias. This makes it very difficult
for the model to incorporate any new data.
Regression
Regression Analysis in Machine learning
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to an independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A which runs various advertisements every year and gets sales in return. The below list shows the advertisements made by the company in the last 5 years and the corresponding sales. Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the predicted sales for this year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modelling, and determining the causal-effect relationship between variables.

In regression, we plot a graph between the variables which best fits the given data points. Using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether a model has captured a strong relationship or not.
[Type text]
value in comparison to other observed values. An outlier may hamper the result, so it
[Type text]
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, then this condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variable.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with the test dataset, then such a problem is called overfitting. And if our algorithm does not perform well even with the training dataset, then such a problem is called underfitting.
Types of Regression

There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all the regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence it is called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear regression. And if there is more than one input variable, then such linear regression is called multiple linear regression.
o The relationship between the variables in the linear regression model can be explained using the below image, where we are predicting the salary of an employee on the basis of years of experience.
o A popular application of linear regression is salary forecasting.
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve classification problems. In classification problems, we have a dependent variable in a binary or discrete format such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function or logistic function, which is a complex cost function. This sigmoid function is used to model the data in logistic regression. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value.

When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
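A brief Python sketch of the sigmoid and the threshold rule described above (the function names and the 0.5 threshold are our own illustrative choices, since the text does not fix a particular threshold):

from math import exp

def sigmoid(x):
    # Logistic function: maps any real number into the range (0, 1).
    return 1.0 / (1.0 + exp(-x))

def classify(x, threshold=0.5):
    # Values above the threshold map to class 1, the rest to class 0.
    return 1 if sigmoid(x) > threshold else 0

print(sigmoid(0))       # 0.5, the midpoint of the S-curve
print(classify(2.0))    # 1
print(classify(-2.0))   # 0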
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε

Here,
y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value)
ε = Random error

The values for the x and y variables are training datasets used for the linear regression model representation.
Linear regression can be further divided into two types of algorithms:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

A regression line can show two types of relationship:

o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
The different values for the weights or coefficients of the line (a0, a1) give different lines of regression, so we need to calculate the best values for a0 and a1 to find the best fit line. To calculate this we use a cost function.

Cost function:
o The different values for the weights or coefficients of the line (a0, a1) give different lines of regression, and the cost function is used to estimate the values of the coefficients for the best fit line.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:

MSE = (1/N) Σ (Yi − (a1xi + a0))²

Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value
Residuals: The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, then the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, then the residuals will be small, and hence the cost function will be small as well.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
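The following Python sketch shows one way gradient descent could update a0 and a1 to reduce the MSE. The learning rate, iteration count, and toy data are assumptions made only for illustration:

# Gradient descent for y = a0 + a1*x, minimizing the MSE cost function.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]          # toy data that actually follows y = 1 + 2x
a0, a1 = 0.0, 0.0
lr = 0.01                     # learning rate
n = len(x)

for _ in range(5000):
    pred = [a0 + a1 * xi for xi in x]
    # Partial derivatives of the MSE with respect to a0 and a1.
    d_a0 = (-2 / n) * sum(yi - pi for yi, pi in zip(y, pred))
    d_a1 = (-2 / n) * sum((yi - pi) * xi for yi, pi, xi in zip(y, pred, x))
    a0 -= lr * d_a0
    a1 -= lr * d_a1

print(round(a0, 2), round(a1, 2))   # approaches the true values 1.0 and 2.0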
Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the below method:

1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o A high value of R-squared indicates less difference between the predicted values and the actual values and hence represents a good model.
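A small Python sketch of the R-squared calculation (the toy values are made up; the formula is the standard R-squared = 1 − SS_res / SS_tot):

def r_squared(actual, predicted):
    # R-squared = 1 - (sum of squared residuals) / (total sum of squares).
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual    = [3, 5, 7, 9, 11]
predicted = [2.8, 5.1, 7.0, 9.2, 10.9]
print(round(r_squared(actual, predicted), 3))   # close to 1, i.e. a good fit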
[Type text]
multicollinearity, it may difficult to find the true relationship between the predictors and
target variables. Or we can say, it is difficult to determine which predictor variable is
affecting the target variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation where the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of the data in the scatter plot.

o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs if there is a dependency between the residual errors.
Simple Linear Regression

Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence it is called Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values.
The Simple Linear Regression model can be represented using the below equation:

y = a0 + a1x + ε

Where,
a0 = The intercept of the regression line (can be obtained by putting x = 0)
a1 = The slope of the regression line, which tells whether the line is increasing or decreasing
ε = The error term (for a good model it will be negligible)
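Under the least-squares criterion, a0 and a1 for Simple Linear Regression can also be computed in closed form. The Python sketch below uses made-up experience/salary pairs purely for illustration:

def fit_simple_linear_regression(x, y):
    # Closed-form least squares: a1 = cov(x, y) / var(x), a0 = mean(y) - a1 * mean(x).
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    a1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
         sum((xi - mean_x) ** 2 for xi in x)
    a0 = mean_y - a1 * mean_x
    return a0, a1

years_experience = [1, 2, 3, 4, 5]
salary           = [30, 35, 41, 44, 50]    # toy salary figures (in thousands)
a0, a1 = fit_simple_linear_regression(years_experience, salary)
print(round(a0, 2), round(a1, 2))          # intercept and slope of the best fit line
print(round(a0 + a1 * 6, 2))               # predicted salary for 6 years of experience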
Multiple Linear Regression

Multiple Linear Regression is an extension of Simple Linear Regression in which more than one independent variable is used to predict the dependent variable.

Example: Prediction of CO2 emission based on engine size and number of cylinders in a car.
o Each feature variable must model a linear relationship with the dependent variable.

MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same idea is applied to the multiple linear regression equation, which becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Where,
Y = Output/response variable
b0, b1, b2, b3, ..., bn = Coefficients of the model
x1, x2, x3, ..., xn = The independent/feature variables
Assumptions for Multiple Linear Regression:
o A linear relationship should exist between the target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
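As a rough sketch of how the MLR equation above can be fit in practice, the snippet below uses NumPy's least-squares solver on made-up engine-size and cylinder data (all values invented for illustration):

import numpy as np

# Toy data: engine size (litres), number of cylinders, and CO2 emission (g/km).
engine_size = [1.0, 1.6, 2.0, 2.4, 3.0]
cylinders   = [4,   4,   4,   6,   6]
co2         = [99, 115, 135, 182, 198]

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(len(engine_size)), engine_size, cylinders])
y = np.array(co2)

# Solve the least-squares problem X @ b = y for the coefficients b0, b1, b2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 2))                 # [b0, b1, b2]
print(round(float(X[0] @ b), 1))      # fitted CO2 emission for the first car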