
Module 2: Supervised Learning- I

Contents

 Linear and Non-Linear
 Multi-Class and Multi-Label Classification
 Linear Regression
 Multilinear Regression
 Naïve Bayes Classifier
 Decision Tree
   ID3
   CART
Linear and Non-Linear

Linear relationship
A linear relationship is a statistical term describing a straight-line
relationship between two variables. Linear relationships can be expressed
either graphically or as a mathematical equation of the form
y = mx + b
Linear and Non-Linear

Non-Linear relationship

A non-linear model is a statistical model that does NOT assume a linear
relationship between the dependent variable and the independent variables.
It can capture more complex relationships, often from larger datasets, by
using curves, exponentials, logarithms, or interaction terms.
Linear Separable Classes

Example: two different point sets in 2D that are linearly separable.

[Figure: Class 1 and Class 2 separated by a straight line - linearly separable]
XOR Problem: Not Linearly Separable

We can, however, construct multiple layers of perceptrons to get around
this problem.
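As a sketch of this idea (scikit-learn is assumed; the hidden layer size and solver are illustrative choices, not from the slides), a small multi-layer perceptron can learn the four XOR points that no single straight line separates:

import numpy as np
from sklearn.neural_network import MLPClassifier

# The four XOR points: class 1 iff exactly one input is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# A hidden layer lets the network compose several linear boundaries,
# which is enough to carve out the XOR regions; lbfgs suits tiny datasets.
clf = MLPClassifier(hidden_layer_sizes=(4,), solver="lbfgs",
                    max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict(X))  # ideally [0 1 1 0]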
Multi-Class and Multi-Label Classification

Multiclass classification is a machine learning task where the goal is to assign
instances to one of multiple predefined classes or categories, where each instance
belongs to exactly one class.

In a multi-class setting, the classes are mutually exclusive, which means a single
instance of data can belong to one and only one class.

While binary classification involves distinguishing between only two classes
(for example, an email is either SPAM or NOT SPAM), multiclass classification
expands this scope to distinguishing between more than two classes.


Multi-Label Classification

In multi-label classification tasks, a single instance of data can simultaneously belong to
two or more classes of the target variable.

The predicted classes are not mutually exclusive.

Image Classification with Multiple Labels: identifying and labeling multiple objects
or features within an image, like recognizing both "cat" and "outdoor" in a photograph.
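To make the two target formats concrete, here is a minimal sketch (the labels are invented for illustration; scikit-learn's MultiLabelBinarizer is assumed to be available):

from sklearn.preprocessing import MultiLabelBinarizer

# Multi-class: each instance carries exactly one label.
multiclass_y = ["cat", "dog", "cat", "bird"]

# Multi-label: each instance may carry several labels at once.
multilabel_y = [{"cat", "outdoor"}, {"dog"}, {"cat", "indoor"}]

# The binarizer turns label sets into a 0/1 indicator matrix:
# one row per instance, one column per distinct label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(multilabel_y)
print(mlb.classes_)  # ['cat' 'dog' 'indoor' 'outdoor']
print(Y)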
Three types of Classification
Supervised Machine Learning
 Supervised learning is the type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data machines predict
the output.

 Labelled data means the input data is already tagged with the correct output.

 In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly.

 It applies the same concept as a student learning under the supervision of a teacher.

 Supervised learning is a process of providing input data as well as correct output data to
the machine learning model.
 The aim of a supervised learning algorithm is to find a mapping function to map
the input variable(x) with the output variable(y).

 In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
Steps Involved in Supervised Learning:

• First determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into training, test, and validation sets.
• Determine the input features of the training dataset, which should carry enough
information for the model to accurately predict the output.
• Determine a suitable algorithm for the model, such as support vector machine,
decision tree, etc.
• Execute the algorithm on the training dataset. Sometimes we need validation sets
as control parameters; these are subsets of the training dataset.
• Evaluate the accuracy of the model on the test set. If the model predicts the
correct outputs, the model is accurate. (A minimal scikit-learn version of this
workflow is sketched below.)
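The steps above map directly onto a typical scikit-learn workflow. A minimal sketch (the synthetic data and the choice of a decision tree are illustrative, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 1-2: a labelled dataset (synthetic here, standing in for collected data).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Step 3: split into training and test sets (a validation split can be
# carved out of the training portion the same way).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 4-6: choose an algorithm and execute it on the training data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 7: evaluate accuracy on the held-out test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))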
Regression
Regression algorithms are used when there is a relationship between the input
variable and the output variable. They are used for the prediction of continuous
variables, such as weather forecasting, market trends, etc.
Real Time Regression Applications

• The Wage data involves predicting a continuous or quantitative output value.
• This is often referred to as a regression problem.
Real Time Regression Applications- Advertising Data

The Advertising data set consists of the sales of a product in 200 different
cities, along with advertising budgets for three different media: TV, radio,
and newspaper.

The goal is to develop an accurate model that can be used to predict sales on
the basis of the three media budgets.
Real Time Regression Applications- Advertising Data

• Input variables: advertising budgets
• Input variables are denoted by X
  • X1 – TV budget
  • X2 – Radio budget
  • X3 – Newspaper budget
• Input variables are also called
  • Predictors
  • Independent variables
  • Features
  • Variables
Real Time Regression Applications- Advertising Data

• Output variable: sales
• Output variables are denoted by Y
• Output variables are also called
  • Responses
  • Dependent variables
Real Time Regression Applications- Advertising Data

There is some relationship between Y and X = (X1, X2, ..., Xp).
The general form of the relationship is

Y = f(X) + ε

where f is some fixed but unknown function of X1, ..., Xp, and ε is a
random error term, which is independent of X and has mean zero.
Real Time Regression Applications

• The black lines represent the error associated with each observation.
• Some errors are positive (if an observation lies above the blue curve) and
some are negative (if an observation lies below the curve).
• Overall, these errors have approximately mean zero.

[Figure: observed values of income and years of education for 30 individuals]
Real Time Regression Applications

Statistical learning refers to a set of approaches for estimating f in the
equation

Y = f(X) + ε

Reasons to estimate f:
• Prediction
• Inference
Real Time Regression Applications

• Linear regression is a statistical procedure that determines the equation
of the straight line that best fits a specific set of data.
• This is a very simple approach to supervised learning.
• In particular, it is a useful tool for predicting a quantitative response.
Real Time Regression Applications- Advertising Data

On the basis of the given advertising data:
• A marketing plan for next year can be made.
• To develop the marketing plan, some information is required:
  • Is there a relationship between advertising budget and sales?
  • Is the relationship linear?
  • Predicting sales with a high level of accuracy requires a strong relationship.
  • If the media budgets have a combined (joint) effect on sales beyond their
individual contributions, in marketing this is known as a synergy effect,
while in statistics it is called an interaction effect.
Real Time Regression Applications- Advertising Data

The important questions are:
 Which media contribute more to sales?
 Do all three contribute to sales, or do just one or two?
 What is the individual effect of each medium on sales?
 For every dollar spent on advertising on TV, radio, or newspaper, by what
amount will sales increase?
 How accurately can we predict this amount of increase?

Linear regression can be used to answer each of these questions.
Linear Regression

Imagine we need to predict distance travelled (y) from speed (x).

The linear regression model representation for this problem would be

y = m*x + c
or
distance = m*speed + c

where
m = slope of the line
c = y-intercept
Linear Regression

As the speed increases, the distance also increases. The variables have a
positive relationship.
What is the best Fit Line?
 Our primary objective while using
linear regression is to locate the best-fit
line, which implies that the error
between the predicted and actual
values should be kept to a minimum.
 There will be the least error in the best-
fit line.
 The best Fit Line equation provides a
straight line that represents the
relationship between the dependent
and independent variables.
 The slope of the line indicates how
much the dependent variable changes
for a unit change in the independent
variable(s).
 The regression line is the best-fit line
for our model.
Linear Regression Line
 Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable
increases on the X-axis, the relationship is termed a positive linear
relationship.

 Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable
increases on the X-axis, the relationship is called a negative linear
relationship.
Linear Regression Types
• Types:
 Based on the number of independent variables, there are two types of linear
regression:
  Simple Linear Regression
  Multiple Linear Regression
• Mathematically, the linear relationship is approximately modeled as
y = β0 + β1x
β0 – Intercept
β1 – Slope
β0 and β1 – Model coefficients
Linear Regression: Explanation with Example

Consider a simple dataset,

x y
1 3
2 2
3 2
4 4
5 3
Linear Regression: Explanation with Example

The model coefficients are estimated as
β1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
β0 = ȳ − β1x̄
Here x̄ = 3 and ȳ = 2.8.

x    y    x − x̄   y − ȳ   (x − x̄)²   (x − x̄)(y − ȳ)
1    3    -2       0.2     4           -0.4
2    2    -1      -0.8     1            0.8
3    2     0      -0.8     0            0
4    4     1       1.2     1            1.2
5    3     2       0.2     4            0.4
                           Total = 10   Total = 2
Linear Regression: Explanation with Example

Substituting the totals: β1 = 2/10 = 0.2 and β0 = 2.8 − 0.2 × 3 = 2.2,
so the fitted line is yp = 0.2x + 2.2.
Predicting yp using x and plotting the points.

yp=(0.2*x)+2.2

yp=(0.2*1)+2.2=2.4

yp=(0.2*2)+2.2=2.6

yp=(0.2*3)+2.2=2.8

yp=(0.2*4)+2.2=3.0

yp=(0.2*5)+2.2=3.2
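The same calculation can be reproduced in a few lines of NumPy; a sketch using the five points from the example:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 2, 2, 4, 3])

# Slope and intercept from the deviation sums used in the example.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)   # 0.2, 2.2

yp = b1 * x + b0
print(yp)       # [2.4 2.6 2.8 3.0 3.2]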
Linear Regression
 In linear regression we predict a continuous variable y from a given
input variable x.
 It is a linear model: x is called the independent variable and y the
dependent variable.
 When there is a single input variable (x), the method is referred to
as simple linear regression.
 When there are multiple input variables, the statistics literature
often refers to the method as multiple linear regression.

 The model assumes a linear relationship between x and y of the form
y = β0 + β1x
Linear Regression: Explanation with Example

We keep adjusting the line through the data points until the sum of squared
distances between the data points and the regression line is minimized.
Evaluation metrics for Regression

A loss function in machine learning is a measure of how accurately your ML
model is able to predict the expected outcome. In regression, the following
loss functions and metrics are used (a worked code example follows the R²
section):

• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• R-squared (R²) Score
Mean Absolute Error (MAE)

MAE = (1/n) Σ |yᵢ − ŷᵢ|

We aim for a minimum MAE because it is a loss.

Mean Squared Error (MSE)

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

The lower the value the better; 0 means the model is perfect.
Root Mean Squared Error (RMSE)

RMSE = √MSE

RMSE is measured in the same units as the target, and lower values indicate a
better-fitting model. Note that RMSE is not in general bounded between 0 and 1;
only when the target is normalized does the rule of thumb apply that values
between 0.2 and 0.5 indicate a model that predicts the data relatively
accurately.
R-squared (R²) Score

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

R-squared, otherwise known as R², typically has a value in the range of 0
through 1. The closer the R² value is to 1, the better the fit. The numerator
sums the squared differences between the original and predicted samples.

R² is also known as the Coefficient of Determination, or sometimes as
Goodness of Fit.
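A minimal sketch of all four metrics with scikit-learn (the actual values are the example's y, and the predictions are the fitted yp = 0.2x + 2.2 values from above):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 2.0, 2.0, 4.0, 3.0])   # y from the worked example
y_pred = np.array([2.4, 2.6, 2.8, 3.0, 3.2])   # yp = 0.2x + 2.2

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)             # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)   # coefficient of determination

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")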
Simple Linear Regression
Question-1:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b, parameterized by
(a, b). Using squared error as the loss function, which of the following
parameters would you use to model this function?
(a) (4, 3)
(b) (5, 3)
(c) (5, 1)
(d) (1, 5)
Simple Linear Regression
Solution:
1. Calculate Ypredicted for the given X using the given (a, b) values.
2. For each (a, b) value, calculate the RSS (Residual Sum of Squares).
3. The best set of parameters is the one that gives minimum RSS.

To calculate RSS, use the following formula:

RSS = Σᵢ (Yᵢ − Ypredicted,ᵢ)², where Ypredicted,ᵢ = a·Xᵢ + b
Simple Linear Regression
Solution: For a = 4 and b = 3

X Y Ypredicted (Y-YPredicted)2
2 12.8978 11 3.6016
3 17.7586 15 7.6099
4 23.3192 19 18.6555
5 28.3129 23 28.2269
6 32.1351 27 26.3693
RSS 84.4632
Simple Linear Regression
Solution: For a = 5 and b = 3

X Y Ypredicted (Y-YPredicted)2
2 12.8978 13 0.0104
3 17.7586 18 0.0583
4 23.3192 23 0.1019
5 28.3129 28 0.0979
6 32.1351 33 0.7481
RSS 1.0166
Simple Linear Regression
Solution: For a = 5 and b = 1

X Y Ypredicted (Y-YPredicted)2
2 12.8978 11 3.6016
3 17.7586 16 3.0927
4 23.3192 21 5.3787
5 28.3129 26 5.3495
6 32.1351 31 1.2885
RSS 18.7110
Simple Linear Regression
Solution: For a = 1 and b = 5

X Y Ypredicted (Y-YPredicted)2
2 12.8978 7 34.7840
3 17.7586 8 95.2303
4 23.3192 9 205.0395
5 28.3129 10 335.3623
6 32.1351 11 446.6925
RSS 1117.1086

Answer: (b). The parameters (5, 3) give the least RSS (1.0166), hence (5, 3)
is used to model this function. The sketch below scripts the same comparison.
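import numpy as np

X = np.array([2, 3, 4, 5, 6])
Y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])

# Candidate (a, b) pairs from the question.
candidates = [(4, 3), (5, 3), (5, 1), (1, 5)]

for a, b in candidates:
    rss = np.sum((Y - (a * X + b)) ** 2)   # residual sum of squares
    print(f"(a={a}, b={b}): RSS = {rss:.4f}")
# (5, 3) gives the smallest RSS, so it is the best of the four.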
Simple Linear Regression
Question-2:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
(a)Find the best linear fit
(b)Determine the minimum RSS
(c) Draw the residual plot for the best linear fit and comment on the
suitability of the linear model to this training data.
Simple Linear Regression
Solution:
(a) To find the best fit, calculate the model coefficients using the formulas

β1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
β0 = Ȳ − β1X̄
Simple Linear Regression
Solution:

X     Y          X−Xmean  Y−Ymean  (X−Xmean)(Y−Ymean)  (X−Xmean)²
2     12.8978    -2       -9.9869  19.9738             4
3     17.7586    -1       -5.1261   5.1261             1
4     23.3192     0        0.4345   0.0000             0
5     28.3129     1        5.4282   5.4282             1
6     32.1351     2        9.2504  18.5008             4
Sum   20  114.4236  0      0.0000  49.0289             10
Mean  4   22.88472

Substituting in the formulas: β1 = 49.0289/10 = 4.9029 and
β0 = 22.88472 − 4.9029 × 4 = 3.2732.
The best linear fit is Y = 4.9029X + 3.2732.
Simple Linear Regression
Solution:

[Figure: scatter plot of the data with the best fit line Y = 4.9029X + 3.2732]
Simple Linear Regression
Solution:
(b) To determine RSS:

X     Y          Ypredicted  (Y−Ypredicted)²
2     12.8978    13.0789     0.0328
3     17.7586    17.9818     0.0498
4     23.3192    22.8847     0.1888
5     28.3129    27.7876     0.2759
6     32.1351    32.6905     0.3085
                 RSS         0.8558

Ypredicted is calculated using the best linear fit Y = 4.9029X + 3.2732.
Simple Linear Regression
Solution: Residual Plot
(c) Residual plot for the best linear fit:

X     Y          Ypredicted  Residual (Y−Ypredicted)
2     12.8978    13.0789     -0.1811
3     17.7586    17.9818     -0.2232
4     23.3192    22.8847      0.4345
5     28.3129    27.7876      0.5253
6     32.1351    32.6905     -0.5554

The random pattern of the residuals is an indication that a linear model is
suitable for this data.
Methods to find best fit line in Linear Regression

1. Statistical
2. Gradient Descent
3. Randomly drawing straight lines (trial and error)
Advantages of Linear Regression
• Linear regression is a relatively simple algorithm, making it easy to understand
and implement. The coefficients of the linear regression model can be interpreted
as the change in the dependent variable for a one-unit change in the independent
variable, providing insights into the relationships between variables.
• Linear regression is computationally efficient and can handle large datasets
effectively. It can be trained quickly on large datasets, making it suitable for real-
time applications.
• Linear regression's simplicity makes the influence of outliers easy to
diagnose; note, however, that ordinary least squares itself is sensitive to
outliers, since squared error amplifies large residuals.
• Linear regression often serves as a good baseline model for comparison with
more complex machine learning algorithms.
• Linear regression is a well-established algorithm with a rich history and is widely
available in various machine learning libraries and software packages.
Disadvantages of Linear Regression
• Linear regression assumes a linear relationship between the dependent and
independent variables. If the relationship is not linear, the model may not perform
well.
• Linear regression is sensitive to multicollinearity, which occurs when there is a high
correlation between independent variables. Multicollinearity can inflate the variance
of the coefficients and lead to unstable model predictions.
• Linear regression assumes that the features are already in a suitable form for the
model. Feature engineering may be required to transform features into a format
that can be effectively used by the model.
• Linear regression is susceptible to both overfitting and underfitting. Overfitting
occurs when the model learns the training data too well and fails to generalize to
unseen data. Underfitting occurs when the model is too simple to capture the
underlying relationships in the data.
• Linear regression provides limited explanatory power for complex relationships
between variables. More advanced machine learning techniques may be necessary for
deeper insights.
Multicollinearity

Multicollinearity occurs when two or more independent variables have a high correlation
with one another in a regression model, which makes it difficult to determine the
individual effect of each independent variable on the dependent variable.
The presence of multicollinearity can lead to unstable and unreliable coefficient
estimates, making it challenging to interpret the results and draw meaningful conclusions
from the model.
To detect multicollinearity
Calculate the variance inflation factor (VIF) for each independent variable;
a VIF value greater than about 5 (some practitioners use 10) is commonly taken
to indicate problematic multicollinearity.
To fix multicollinearity
Remove one of the highly correlated variables, combine them into a single
variable, or use a dimensionality reduction technique such as principal
component analysis to reduce the number of variables while retaining most
of the information. (A VIF check is sketched below.)
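A sketch of the VIF check using statsmodels (the data here are synthetic, built so that x2 is nearly a multiple of x1; the function itself is statsmodels' variance_inflation_factor):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Illustrative predictors; x2 is deliberately close to a multiple of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)   # highly collinear with x1
x3 = rng.normal(size=100)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
# x1 and x2 will show very large VIFs, flagging the multicollinearity.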
Correlation
If the correlation between two variables is +/- 1.0, then the variables are
said to be perfectly collinear.

If there is no relationship at all between two variables, then the correlation
coefficient will be 0.

When the correlation coefficient is positive, an increase in one variable
accompanies an increase in the other.

When the correlation coefficient is negative, the changes in the two variables
are in opposite directions.
Overfitting and Underfitting

Overfitting occurs when our machine learning model tries to cover all the data
points, or more than the required data points, present in the given dataset.
Because of this, the model starts capturing noise and inaccurate values present
in the dataset, and all these factors reduce the efficiency and accuracy of the
model. An overfitted model has low bias and high variance.

Note:
• Bias: the difference between the predicted values and the actual values.
• Variance: if the machine learning model performs well with the training
dataset but does not perform well with the test dataset, then variance is high.

Underfitting represents the inability of the model to learn the training data
effectively, resulting in poor performance on both the training and testing
data. In simple terms, an underfit model is inaccurate, especially when applied
to new, unseen examples.
Variance
Variance is a measure of how data points vary from the mean:
σ² = Σ(x − x̄)² / n

Standard Deviation
Standard deviation is the square root of variance: σ = √σ²
1. Implement Simple Linear Regression using the statistical approach for the
given dataset sample.

2. Implement Simple Linear Regression using the scikit-learn machine learning
library and the corresponding built-in functions for the given dataset sample.
(A starting-point sketch follows the table.)

area   price
2600   550000
3000   565000
3200   610000
3600   680000
4000   725000
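A starting-point sketch for both exercises (the values come from the table above; treat it as one possible answer, not the official one):

import numpy as np
from sklearn.linear_model import LinearRegression

area = np.array([2600, 3000, 3200, 3600, 4000], dtype=float)
price = np.array([550000, 565000, 610000, 680000, 725000], dtype=float)

# 1. Statistical approach: closed-form least-squares coefficients.
b1 = np.sum((area - area.mean()) * (price - price.mean())) \
     / np.sum((area - area.mean()) ** 2)
b0 = price.mean() - b1 * area.mean()
print("statistical fit: price = %.2f * area + %.2f" % (b1, b0))

# 2. scikit-learn approach: the same model via the built-in estimator.
model = LinearRegression()
model.fit(area.reshape(-1, 1), price)   # sklearn expects a 2D feature array
print("sklearn fit:", model.coef_[0], model.intercept_)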
Multilinear Regression
 One of the most common types of predictive analysis is multiple linear
regression.
 This type of analysis allows you to understand the relationship between
a continuous dependent variable and two or more independent variables.
 The independent variables can be either continuous (like age and height)
or categorical (like gender and occupation).
 It's important to note that categorical independent variables should be
dummy coded before running the analysis.

Here: Y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn
Y = dependent variable and x1, x2, x3, ..., xn = multiple independent variables
Assumptions of Multiple Linear Regression

• A linear relationship should exist between the target and predictor variables.

• The regression residuals (errors) must be normally distributed.

• MLR assumes little or no multicollinearity (correlation between the
independent variables) in the data.

• Homoscedasticity: constant variance of the errors should be maintained.
Example of Multiple Linear Regression

Consider a dataset containing information about houses.

• Number of Bedrooms: the count of bedrooms in the house.
• Square Footage: the total area of the house in square feet.
• Age of the House: the number of years since the house was built.
• Price: the target variable, representing the price of the house in dollars.

The task is to build a multiple linear regression model to predict the house
price based on the first three features.
Multiple Linear Regression Solved Example

Problem 1:Evaluate the following dataset with one response variable y and two
predictor variables X1 and X2 to fit a multiple linear regression model.
Multiple Linear Regression Solved Example: Formulae used

The estimated linear regression equation is: y = b0 + b1*x1 + b2*x2

where, writing Σx1², Σx2², Σx1y, Σx2y, and Σx1x2 for the regression sums of
squared deviations and cross-products about the means:

b1 = [(Σx2²)(Σx1y) − (Σx1x2)(Σx2y)] / [(Σx1²)(Σx2²) − (Σx1x2)²]
b2 = [(Σx1²)(Σx2y) − (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) − (Σx1x2)²]
b0 = ȳ − b1x̄1 − b2x̄2
Multiple Linear Regression Solved Example: Solution

To calculate b0, b1, and b2, first compute the regression sums from the data:
Σx1² = 263.875, Σx2² = 194.875, Σx1y = 1162.5, Σx2y = −953.5,
Σx1x2 = −200.375, with means x̄1 = 69.375, x̄2 = 18.125, ȳ = 181.5.

b1 = [(194.875)(1162.5) − (−200.375)(−953.5)] /
[(263.875)(194.875) − (−200.375)²] = 3.148
Multiple Linear Regression Solved Example: Solution

b2 = [(263.875)(−953.5) − (−200.375)(1162.5)] /
[(263.875)(194.875) − (−200.375)²] = -1.656

The formula to calculate b0 is:

b0 = 181.5 – 3.148(69.375) – (-1.656)(18.125) = -6.867

Hence the estimated linear regression equation is: y = b0 + b1*x1 + b2*x2

y = -6.867 + 3.148x1 – 1.656x2


Multiple Linear Regression Steps

Data Preparation

•Cleaning the Data: Remove or impute missing values to ensure the dataset is
complete.
•Encoding Categorical Variables: If any categorical variables are present, convert
them into numerical form. In this example, our dataset is purely numerical.
•Feature Scaling: Standardise or normalise the features if they vary significantly in
scale. This step ensures that all features contribute equally to the model.
Multiple Linear Regression Steps

Model Training

Splitting the Dataset: Divide the dataset into training and testing sets. A split of
80% training and 20% testing is typically used.

Fitting the Model: Use the training data to fit the multiple linear regression
model.

Evaluation of the Model

After training the model, evaluate its performance using the test data
Predicting: Use the model to make predictions on the test set

Metrics: Assess the model’s performance using metrics such as Mean Absolute Error
(MAE), Mean Squared Error (MSE), and R-squared score.
1. Implement Multilinear Regression using the statistical approach for the
dataset sample solved in theory.

2. Implement Multilinear Regression using the scikit-learn machine learning
library and the corresponding built-in functions for the given dataset sample
(note the missing bedrooms value in the third row):

area   bedrooms   age   price
2600   3          20    550000
3000   4          15    565000
3200              18    610000
3600   3          30    595000
4000   5           8    760000

3. Implement Multilinear Regression using the scikit-learn machine learning
library and the corresponding built-in functions for fetch_california_housing
from scikit-learn. (A starting-point sketch for exercises 2 and 3 follows.)
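A starting-point sketch for exercises 2 and 3 (filling the missing bedrooms value with the column median is one reasonable choice, not the only one; fetch_california_housing downloads the data on first use):

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "area":     [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3, 4, None, 3, 5],      # one value is missing in the table
    "age":      [20, 15, 18, 30, 8],
    "price":    [550000, 565000, 610000, 595000, 760000],
})
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

model = LinearRegression().fit(df[["area", "bedrooms", "age"]], df["price"])
print(model.coef_, model.intercept_)

# Exercise 3: the same pattern on the California housing data.
housing = fetch_california_housing(as_frame=True)
reg = LinearRegression().fit(housing.data, housing.target)
print("R^2 on training data:", reg.score(housing.data, housing.target))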
Decision Tree
Decision Trees
Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly preferred for solving
classification problems. It is a tree-structured classifier, where internal
nodes represent the features of a dataset, branches represent the decision
rules, and each leaf node represents the outcome.

In a decision tree, there are two types of nodes: the Decision Node and the
Leaf Node. Decision nodes are used to make decisions and have multiple
branches, whereas leaf nodes are the outputs of those decisions.
Decision Trees

A tree can be “learned” by splitting the source set into subsets based on an
attribute value test. This process is repeated on each derived subset in a
recursive manner called recursive partitioning. The recursion is
completed with leaf nodes.

Decision trees are of two main types −


•Classification tree − when the response is a nominal variable, for example if an
email is spam or not.
•Regression tree − when the predicted outcome can be considered a real number (e.g.
the salary of a worker).
Example of a Decision Tree

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree, with Refund as the first splitting attribute):

Refund?
├─ Yes → NO
└─ No → MarSt?
    ├─ Single, Divorced → TaxInc?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO
Apply Model to Test Data

Start from the root of the tree.

Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Following the tree: Refund = No → MarSt = Married → assign Cheat to "No".
Decision Tree: An Example
Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for sub-nodes. To solve such problems there is
a technique called the Attribute Selection Measure, or ASM. With this
measurement, we can easily select the best attribute for the nodes of the tree.
The two popular techniques for ASM are:

• Information Gain, used by ID3

• Gini Index, used by CART


Information Gain
Information gain is the measurement of the change in entropy after the
segmentation of a dataset based on an attribute.

To calculate entropy, use

Info(D) = − Σᵢ₌₁ᵐ pᵢ log₂(pᵢ)

Example: for the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of a: 3
Instances of b: 5
Entropy = −(3/8) log₂(3/8) − (5/8) log₂(5/8) ≈ 0.954
Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain.

 Let pᵢ be the probability that an arbitrary tuple in D belongs to class Cᵢ,
estimated by |Cᵢ,D| / |D|.

 Expected information (entropy) needed to classify a tuple in D:

Info(D) = − Σᵢ₌₁ᵐ pᵢ log₂(pᵢ)

 Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = Σⱼ₌₁ᵛ (|Dⱼ| / |D|) × Info(Dⱼ)

 Information gained by branching on attribute A:

Gain(A) = Info(D) − Info_A(D)

In simple terms:

• To build a decision tree we need to calculate two types of entropy:

• one for the whole dataset based on the class label,

• another for each attribute with respect to the class label.

• Then the information gain for each attribute is calculated.

• The attribute with the highest gain is chosen as the splitting attribute.
(These steps are sketched in code below.)
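These steps translate directly into a few lines of Python; a sketch assuming the data sit in a pandas DataFrame with a categorical target column:

import numpy as np
import pandas as pd

def entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the class proportions.
    p = labels.value_counts(normalize=True)
    return float(-np.sum(p * np.log2(p)))

def information_gain(df, attribute, target):
    # Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over the values of A.
    total = entropy(df[target])
    weighted = sum(len(g) / len(df) * entropy(g[target])
                   for _, g in df.groupby(attribute))
    return total - weighted

Applied to the buys_computer dataset introduced below, information_gain(df, 'age', 'buys_computer') should come out near 0.246, matching the hand calculation.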
Decision Tree- ID3

Key Characteristics

• Uses Entropy and Information Gain: ID3 uses entropy as a measure of impurity
and splits on the attribute with the largest information gain at each step.

• Categorical Data: ID3 is designed to work with categorical data. Continuous
data needs to be discretized before using ID3.

• Greedy Approach: the algorithm uses a top-down, recursive approach, selecting
the best attribute based on the highest information gain at each step.
Problem 1: Apply Decision Tree for the given Dataset

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Decision Tree Steps to Solve
Training Phase
To find the splitting attribute for the whole dataset

1. Calculate the entropy for the overall dataset, Info(D):

Info(D) = − Σᵢ₌₁ᵐ pᵢ log₂(pᵢ)

 Class P: buys_computer = "yes" (9 tuples)

 Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940
2. The entropy for each attribute in the dataset is calculated using the same
formulae.
2.1. Entropy and gain for age: Info_age(D) and Gain(age)

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) − Info_age(D) = 0.246


2.2. Similarly calculate the entropy and gain for income, student, and
credit_rating.

The gain for all the attributes:

Gain(age) = 0.246

Gain(income) = 0.029

Gain(student) = 0.152

Gain(credit_rating) = 0.048


3. Choose the attribute with the highest gain as the splitting attribute:
Gain(age) = 0.246, so age becomes the root.

Building the tree choosing age as the root: the age <= 30 branch leads to
subset T1 and the age > 40 branch to subset T2, while the 31…40 branch is
pure (all "yes") and becomes a leaf.

4. Repeat steps 1 to 3 for T1 and T2 until the leaf nodes give the required
outcome.
To find the splitting attribute for T1 (the age <= 30 subset):

Choose the attribute with the highest gain as the splitting attribute:
Gain(student) = 0.971

To find the splitting attribute for T2 (the age > 40 subset):

Choose the attribute with the highest gain as the splitting attribute:
Gain(credit_rating) = 0.971
Final Decision Tree
Test case:

IF age <= 30, student = yes, credit_rating = fair, income = medium,
will this person buy a computer or not?

Tracing the tree: age <= 30 → student = yes → buys_computer = yes.

Inference: yes, he will buy.

Decision Tree: Problem 2
Decision Tree-CART

CART stands for Classification and Regression Trees and is a more advanced
algorithm that can handle both classification and regression tasks. It was
introduced by Breiman et al. in 1984.

Key Characteristics

• Uses Gini Impurity or MSE: for classification tasks, CART decides splits
using Gini impurity, choosing the split with the largest impurity reduction;
for regression tasks, it uses mean squared error (MSE).
• Handles Numerical and Categorical Data: unlike ID3, CART can handle both
categorical and continuous data without requiring prior discretization.
• Binary Trees: CART produces binary trees, meaning each internal node splits
the data into two child nodes.
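A sketch of the Gini computation that the following slides perform by hand (pandas is assumed; the weighted form is what CART compares across candidate splits):

import pandas as pd

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions.
    p = labels.value_counts(normalize=True)
    return float(1 - (p ** 2).sum())

def gini_for_split(df, attribute, target):
    # Weighted Gini of the partitions induced by an attribute;
    # the attribute with the LOWEST weighted Gini is chosen.
    return sum(len(g) / len(df) * gini(g[target])
               for _, g in df.groupby(attribute))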
Decision Tree-CART Problem

Given the following weather data set


Decision Tree-CART Problem

For outlook feature,


Decision Tree-CART Problem

For temperature feature,


Decision Tree-CART Problem

For humidity feature,


Decision Tree-CART Problem

For wind feature,


Decision Tree-CART Problem
Decision Tree-CART Problem

From the table, you can see that the Gini index for the outlook feature is the
lowest, so outlook becomes the root node.
Decision Tree-CART Problem

Let's focus on the sub-data for the sunny outlook. We need to find the Gini
index for the temperature, humidity, and wind features respectively.
Decision Tree-CART Problem
Gini index for temperature on sunny
outlook
Decision Tree-CART Problem

Gini Index for humidity on sunny outlook


Decision Tree-CART Problem

Gini Index for wind on sunny outlook


Decision Tree-CART Problem
Decision on sunny outlook factor

We can infer that humidity has the lowest value, so the next node will be
humidity.
Decision Tree-CART Problem

Next, we focus on the rain outlook sub-data.


Decision Tree-CART Problem
Gini of temperature for rain outlook
Decision Tree-CART Problem

Gini of humidity for rain outlook


Decision Tree-CART Problem

Gini of wind for rain outlook


Decision Tree-CART Problem

Decision for rain outlook


Decision Tree-CART Problem

Final form of the decision tree built by CART algorithm
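In practice the construction is delegated to a library. A sketch with scikit-learn on the standard play-tennis weather table (which these slides appear to use; one-hot encoding via pd.get_dummies is one convenient way to feed categorical features to sklearn's CART implementation):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

weather = pd.DataFrame({
    "outlook":  ["sunny","sunny","overcast","rain","rain","rain","overcast",
                 "sunny","sunny","rain","sunny","overcast","overcast","rain"],
    "temp":     ["hot","hot","hot","mild","cool","cool","cool","mild","cool",
                 "mild","mild","mild","hot","mild"],
    "humidity": ["high","high","high","high","normal","normal","normal",
                 "high","normal","normal","normal","high","normal","high"],
    "wind":     ["weak","strong","weak","weak","weak","strong","strong",
                 "weak","weak","weak","strong","strong","weak","strong"],
    "play":     ["no","no","yes","yes","yes","no","yes","no","yes","yes",
                 "yes","yes","yes","no"],
})

X = pd.get_dummies(weather.drop(columns="play"))  # one-hot encode categoricals
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, weather["play"])
print(export_text(clf, feature_names=list(X.columns)))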


Naïve Bayes Classification
Naïve Bayes Classification
• The Naïve Bayes algorithm is a supervised learning algorithm, based on
Bayes' theorem and used for solving classification problems.
• The Naive Bayes classifier works on the principle of conditional
probability, as given by Bayes' theorem:

P(A|B) = P(B|A) × P(A) / P(B)

where

P(A|B) is the Posterior probability: probability of hypothesis A given the
observed event B.
P(B|A) is the Likelihood: probability of the evidence given that hypothesis A
is true.
P(A) is the Prior probability: probability of the hypothesis before observing
the evidence.
P(B) is the Marginal probability: probability of the evidence.
Working of Naïve Bayes' Classifier:
Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should play
or not on a particular day according to the weather conditions.

To solve this problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given
features.
3. Use Bayes' theorem to calculate the posterior probability.

Problem 1: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset (frequency table) of
weather conditions.
Problem 2
Let us test it on a new set of features (call it today):
today = (Sunny, Hot, Normal, False)

P(today) = P(Sunny) × P(Hot) × P(Normal) × P(no wind)
         = (5/14) × (4/14) × (7/14) × (8/14)

P(Yes|today) = ((2/9) × (2/9) × (6/9) × (6/9) × (9/14)) /
               ((5/14) × (4/14) × (7/14) × (8/14)) = 0.484

P(No|today) = ((3/5) × (2/5) × (1/5) × (2/5) × (5/14)) /
              ((5/14) × (4/14) × (7/14) × (8/14)) = 0.235

Since P(Yes|today) > P(No|today), the prediction is Play = Yes.
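The same arithmetic as a sketch (the conditional probabilities are the ones quoted above, read off the frequency tables):

# Priors: 9 "yes" days and 5 "no" days out of 14.
p_yes, p_no = 9 / 14, 5 / 14

# P(feature | class) terms for today = (Sunny, Hot, Normal, False).
likelihood_yes = (2/9) * (2/9) * (6/9) * (6/9)
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)

# Evidence P(today); identical for both classes, so it only rescales.
p_today = (5/14) * (4/14) * (7/14) * (8/14)

print("P(Yes|today) =", likelihood_yes * p_yes / p_today)  # ~0.484
print("P(No |today) =", likelihood_no  * p_no  / p_today)  # ~0.235
# P(Yes|today) > P(No|today), so the classifier predicts "play".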
Problem 3: Practice

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
