MODULE 3
Similarity-based Learning: Nearest-Neighbor Learning, Weighted K-Nearest-Neighbor Algorithm,
Nearest Centroid Classifier, Locally Weighted Regression (LWR).
Decision Tree Learning: Introduction to Decision Tree Learning Model, Decision Tree Induction
Algorithms.
where n is the number of instances in the class and X represents the feature values of those instances.
2. Calculate Distance:
o Measure the Euclidean distance between the test instance t and the centroid of each class.
3. Assign Class:
o Assign the test instance t to the class whose centroid is nearest (i.e., has the smallest Euclidean distance).
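A minimal NumPy sketch of these steps is given below; the function name, the toy 2-D data, and the class labels are illustrative assumptions, not part of the worked material above.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x_test):
    """Classify x_test by the class whose centroid (mean of its training
    instances) is closest in Euclidean distance."""
    best_label, best_dist = None, float("inf")
    for label in np.unique(y_train):
        centroid = X_train[y_train == label].mean(axis=0)  # Step 1: class centroid
        dist = np.linalg.norm(x_test - centroid)           # Step 2: Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label                                      # Step 3: nearest centroid wins

# Illustrative data (assumed): two 2-D classes
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array(["A", "A", "B", "B"])
print(nearest_centroid_predict(X, y, np.array([4.9, 5.1])))  # -> B
```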
Locally Weighted Regression (LWR) fits a separate linear model for each test instance, as follows:
1. Compute the distance d(x, xi) between the test instance x and each training instance xi, and select the neighbouring instances.
2. Assign each selected instance a weight that decreases with its distance from the test instance, typically using a Gaussian kernel, where:
o d(x,xi) = Distance between test instance and training instance.
o τ = Bandwidth parameter (controls how fast weights decrease with distance).
3. Fit a linear regression model to the selected neighbors, weighted by these weights.
4. Use the weighted linear model to make the prediction for the test instance.
2. Ordinary Linear Regression Cost Function: The goal is to minimize the error
between the predicted value hβ(x) and the actual output y.
The ordinary linear regression cost function is:
J(β) = Σ (i = 1 to m) [ hβ(xi) − yi ]²
Here,
o m is the number of training instances.
o This function equally weighs all training instances.
3. Locally Weighted Regression Cost Function: In LWR, the cost function is modified
by applying a weight to each training instance based on its distance from the test
instance.
The modified cost function is:
J(β) = Σ (i = 1 to m) wi [ hβ(xi) − yi ]²
where:
o wi is the weight assigned to each training instance xi.
o Higher weight is given to points closer to the test instance.
o Points farther away get lower weights.
4. Weight Calculation: The weights are computed using a Gaussian Kernel Function:
wi = exp( −(x − xi)² / (2τ²) )
o x = Test instance.
o xi = Training instance.
o τ = Bandwidth parameter (controls the influence of nearby points).
If τ is small, only very close points will have higher weights. If τ is large, more
points will have influence.
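The steps above can be combined into a short NumPy sketch of locally weighted regression. The dataset, the value of τ, and the closed-form weighted least-squares solve are illustrative assumptions, not part of the notes above.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Fit a weighted linear model around x_query and return its prediction."""
    m = len(X)
    Xb = np.c_[np.ones(m), X]                # design matrix with intercept column
    xq = np.r_[1.0, np.atleast_1d(x_query)]  # query point with intercept term
    # Gaussian kernel weights: points close to x_query get weights near 1
    d2 = np.sum((X - x_query) ** 2, axis=1) if X.ndim > 1 else (X - x_query) ** 2
    w = np.exp(-d2 / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted least squares: beta = (X' W X)^(-1) X' W y
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return xq @ beta

# Illustrative non-linear data (assumed)
X = np.linspace(0, 6, 50)
y = np.sin(X) + 0.1 * np.random.randn(50)
print(lwr_predict(X, y, 2.0, tau=0.5))  # local prediction near x = 2
```

A small τ makes the fit follow local structure closely; a large τ makes it approach ordinary linear regression.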
Advantages:
Can fit non-linear data by fitting multiple local models.
No need to assume the data follows a global linear trend.
Disadvantages:
Computationally expensive.
Needs the entire training data at prediction time.
SUMMARY
Causation
Causation is a relationship between two variables in which x causes y; this is written as “x implies y”. Regression is different from causation: causation indicates that one event is the result of the
occurrence of the other event, i.e. there is a causal relationship between the two events.
Linear and Non-Linear Relationships
The relationship between the input features (variables) and the output (target) variable is
fundamental in Machine Learning. It has significant implications for the choice of algorithm,
model complexity, and predictive performance, and it guides the selection of the right model
for prediction.
Linear Relationship
• Proportional relationship between variables
• Represented by a straight line
• Equation: y = a * x + b
• Example: Hours of study vs Marks obtained
Advantages:
- Easy to interpret
- Faster to train
- Works well with linearly correlated data
Limitations:
- Cannot model complex patterns
- Sensitive to outliers
Non-Linear Relationship
• No proportional change between variables
• Curved relationship
• Example: Population Growth over Time
• Equation: y = a * x^2 + b or y = e^x
Popular Non-Linear Models: polynomial regression, decision trees, and neural networks are common examples.
Types of Regression
Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is
used when there is a single independent variable (predictor) and one dependent variable
(target).
Equation: The linear regression equation takes the form: Y = β0 + β1X + ε, where Y is the
dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope
(coefficient), and ε is the error term.
Purpose: Linear regression is used to establish a linear relationship between two variables
and make predictions based on this relationship. It's suitable for simple scenarios where
there's only one predictor.
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there
are two or more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors: Y =
β0 + β1X1 + β2X2 + ... + βnXn + ε, where Y is the dependent variable, X1, X2, ..., Xn are the
independent variables, β0 is the intercept, β1, β2, ..., βn are the coefficients, and ε is the error
term.
Purpose: Multiple regression allows you to model the relationship between the dependent
variable and multiple predictors simultaneously. It's used when there are multiple factors that
may influence the target variable, and you want to understand their combined effect and make
predictions based on all these factors.
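A hedged sketch of fitting such a model with NumPy's least-squares solver is shown below; the two-predictor dataset and the variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative data (assumed): two predictors X1, X2 and one target Y
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
Y = np.array([6.1, 5.9, 11.2, 10.8, 15.1])

# Prepend a column of ones so the first coefficient acts as the intercept b0
X_design = np.c_[np.ones(len(X)), X]
coeffs, residuals, rank, _ = np.linalg.lstsq(X_design, Y, rcond=None)

b0, b1, b2 = coeffs
print(b0, b1, b2)         # fitted intercept and coefficients
print(X_design @ coeffs)  # predictions Y_hat = b0 + b1*X1 + b2*X2
```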
Polynomial Regression:
Use: Polynomial regression is an extension of multiple regression used when the relationship
between the independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as
quadratic or cubic terms: Y = β0 + β1X + β2X^2 + ... + βnX^n + ε. This allows the model to fit a
curve rather than a straight line.
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the
probability of the dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)), where z is a linear combination of the independent
variables: z = β0 + β1X1 + β2X2 + ... + βnXn. A threshold (commonly 0.5) then converts this
probability into a binary outcome.
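As a small illustration of this mapping (the coefficient values below are assumed, not estimated from any dataset in these notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed coefficients beta0, beta1 and a single feature value x
beta0, beta1, x = -4.0, 1.5, 3.0
z = beta0 + beta1 * x          # linear combination of the predictors
p = sigmoid(z)                 # P(Y = 1)
label = 1 if p >= 0.5 else 0   # threshold the probability into a class
print(round(p, 3), label)      # 0.622 1
```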
Limitations of Regression
Common Problems in Regression Analysis that can affect the accuracy and performance of
the regression model:
1. Outliers:
o Outliers are abnormal data points that significantly differ from other
observations.
o They can bias the regression model because the regression line gets pulled
towards the outlier, affecting the overall prediction accuracy.
o Example: If most students score between 60-80 marks, but one student scores
10 marks, that 10 marks is an outlier.
2. Number of Cases:
o The dataset should have a sufficient number of observations (samples) to
create a reliable model.
o The recommended ratio is 20:1 (20 samples for every independent variable).
This section explains the mathematical formulation of Linear Regression using the Least
Squares Method.
y = a0 + a1x
Where: y = Dependent variable (Target)
x = Independent variable (Feature)
a0 = Intercept (Constant)
a1 = Slope of the line (Coefficient)
Error Calculation
The goal of linear regression is to find the best line that minimizes the errors between the
predicted and actual values.
1. Error Definition
ei = yi − (a0 + a1xi)
Where:
yi = Actual value
a0 + a1xi = Predicted value
Minimization of Error
1. Sum of Errors: Σ ei = Σ [ yi − (a0 + a1xi) ]
(This method is not used, as positive and negative errors cancel each other out.)
2. Sum of Squared Errors (Minimization Function):
Minimize J(a0, a1) = Σ ei² = Σ [ yi − (a0 + a1xi) ]²
Derivation of Parameters
By solving this minimization function using partial derivatives (setting ∂J/∂a0 = 0 and ∂J/∂a1 = 0), the coefficients are obtained as:
a1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
a0 = ȳ − a1 x̄
Where:
x̄ = Mean of X
ȳ = Mean of Y
A linear regression model used for determining the value of the response variable, ŷ, can be
represented as the following equation.
y = b0 + b1x1 + b2x2 + … + bnxn + e
where: y is the dependent variable, b0 is the intercept,
e is the error term, and
b1, b2, …, bn are the coefficients (regression coefficients) of the independent
variables x1, x2, …, xn.
The goal of the OLS method is to estimate the unknown parameters
(b1, b2, …, bn) by minimizing the sum of squared residuals (RSS). The sum of squared
residuals is also termed the sum of squared error (SSE).
This method is also known as the least-squares method for regression or linear regression.
Mathematically, the equations for the individual data points are:
y1 = (a0 + a1x1) + e1
y2 = (a0 + a1x2) + e2
…
yn = (a0 + a1xn) + en
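A short NumPy sketch of the least-squares formulas derived above, a1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and a0 = ȳ − a1·x̄; the small hours-vs-marks dataset is an illustrative assumption.

```python
import numpy as np

# Illustrative data (assumed): hours of study vs marks obtained
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 70.0, 78.0])

x_bar, y_bar = x.mean(), y.mean()
a1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a0 = y_bar - a1 * x_bar                                            # intercept

y_hat = a0 + a1 * x              # predictions on the fitted line
sse = np.sum((y - y_hat) ** 2)   # sum of squared errors being minimized
print(a0, a1, sse)
```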
Regression Trees
Consider the training dataset shown in Table 6.42. Discretize the continuous attribute
‘Percentage’.
Table 6.42: Training Dataset
S. No. Percentage Award
1. 95 Yes
2. 80 Yes
3. 72 No
4. 65 Yes
5. 95 Yes
6. 32 No
7. 66 No
8. 54 No
9. 89 Yes
10. 72 Yes
Solution:
For a sample, the calculations are shown below for a single distinct value, say Percentage = 32. The branch Percentage ≤ 32 contains 1 instance (0 Yes, 1 No) and the branch Percentage > 32 contains 9 instances (6 Yes, 3 No).
Entropy_Info(T, Award) = − [ (6/10) log2(6/10) + (4/10) log2(4/10) ]
= 0.9709
Entropy(0, 1) = − [ (0/1) log2(0/1) + (1/1) log2(1/1) ] = 0
Entropy(6, 3) = − [ (6/9) log2(6/9) + (3/9) log2(3/9) ]
= 0.918
Entropy_Info(T, Percentage = 32) = (1/10) × Entropy(0, 1) + (9/10) × Entropy(6, 3)
= 0 + (9/10)(0.918)
= 0.8262
Gain(Percentage = 32) = 0.9709 − 0.8262
= 0.1447
From Table 1, we can observe that the split at Percentage = 72 has the maximum gain of
0.4203. Hence, Percentage = 72 is chosen as the split point. Now we can discretize the
continuous values of Percentage into two categories, Percentage ≤ 72 and
Percentage > 72. The resulting discretized instances are shown in Table 2.
Table 2
S. No. Percentage (Continuous) Percentage (Discretized) Award
1 95 > 72 Yes
2 80 > 72 Yes
3 72 ≤72 No
4 65 ≤72 Yes
5 95 > 72 Yes
6 32 ≤72 No
7 66 ≤72 No
8 54 ≤72 No
9 89 > 72 Yes
10 72 ≤72 Yes
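The split-point search above can be checked with a short Python sketch; the helper names are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_for_split(values, labels, split):
    """Information gain of splitting the continuous attribute at <= split."""
    left = [l for v, l in zip(values, labels) if v <= split]
    right = [l for v, l in zip(values, labels) if v > split]
    weighted = (len(left) / len(labels)) * entropy(left) \
             + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - weighted

percentage = [95, 80, 72, 65, 95, 32, 66, 54, 89, 72]
award = ["Yes", "Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes"]

for s in sorted(set(percentage)):
    print(s, round(gain_for_split(percentage, award, s), 4))
# The split at 72 gives the largest gain (about 0.42), so 72 is the split point.
```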
11. Consider the training dataset in Table 6.43. Construct decision trees using ID3,
C4.5, and CART.
Table 6.43: Training Dataset
S. No. Assessment Assignment Project Seminar Result
1. Good Yes Yes Good Pass
2. Average Yes No Poor Fail
3. Good No Yes Good Pass
4. Poor No No Poor Fail
5. Good Yes Yes Good Pass
6. Average No Yes Good Pass
7. Good No No Fair Pass
8. Poor Yes Yes Good Fail
9. Average No No Poor Fail
10. Good Yes Yes Fair Pass
Solution:
ID3 algorithm:
Step 1:
Calculate the Entropy for the target class "Results".
Entropy_Info(Target Attribute = Results) = Entropy_Info(6, 4)
= − [ (6/10) log2(6/10) + (4/10) log2(4/10) ]
= 0.9709
Iteration 1:
Step 2:
Calculate the Entropy_Info and Gain for each of the attributes in the training data set.
Entropy_Info(T, Assessment)
= (5/10) [ −(5/5) log2(5/5) − (0/5) log2(0/5) ] + (3/10) [ −(1/3) log2(1/3) − (2/3) log2(2/3) ]
+ (2/10) [ −(0/2) log2(0/2) − (2/2) log2(2/2) ]
= 0 + 0.2755 + 0 = 0.2755
Gain(Assessment) = 0.9709 − 0.2755 = 0.6954
Entropy_Info(T, Assignment)
= (5/10) [ −(3/5) log2(3/5) − (2/5) log2(2/5) ] + (5/10) [ −(3/5) log2(3/5) − (2/5) log2(2/5) ]
= 0.9709
Gain(Assignment) = 0.9709 − 0.9709 = 0
Entropy_Info(T, Project)
= (6/10) [ −(5/6) log2(5/6) − (1/6) log2(1/6) ] + (4/10) [ −(1/4) log2(1/4) − (3/4) log2(3/4) ]
= 0.7145
Gain(Project) = 0.9709 − 0.7145 = 0.2564
Entropy_Info(T, Seminar)
= (5/10) [ −(4/5) log2(4/5) − (1/5) log2(1/5) ] + (3/10) [ −(0/3) log2(0/3) − (3/3) log2(3/3) ]
+ (2/10) [ −(2/2) log2(2/2) − (0/2) log2(0/2) ]
= 0.3610
Gain(Seminar) = 0.9709 − 0.3610 = 0.6099
The Gain values for all the attributes are summarized in Table 1.
Table 1
Attributes Gain
Assessment 0.6954
Assignment 0
Project 0.2564
Seminar 0.6099
Step 3: From Table 1, choose the attribute for which the entropy is minimum and
therefore the gain is maximum as the best split attribute.
The best split attribute is Assessment since it has the maximum gain. The tree grows
with the subset of instances with Assessment=’Average’.
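The Iteration 1 gains above can be reproduced with a short Python sketch; the row tuples simply restate Table 6.43, and the helper names are assumptions for illustration.

```python
import math
from collections import Counter

# Rows of Table 6.43: (Assessment, Assignment, Project, Seminar, Result)
rows = [
    ("Good", "Yes", "Yes", "Good", "Pass"), ("Average", "Yes", "No", "Poor", "Fail"),
    ("Good", "No", "Yes", "Good", "Pass"), ("Poor", "No", "No", "Poor", "Fail"),
    ("Good", "Yes", "Yes", "Good", "Pass"), ("Average", "No", "Yes", "Good", "Pass"),
    ("Good", "No", "No", "Fair", "Pass"), ("Poor", "Yes", "Yes", "Good", "Fail"),
    ("Average", "No", "No", "Poor", "Fail"), ("Good", "Yes", "Yes", "Fair", "Pass"),
]
attributes = ["Assessment", "Assignment", "Project", "Seminar"]

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

target = [r[-1] for r in rows]
for idx, attr in enumerate(attributes):
    # Entropy_Info(T, attr): entropy of each value's subset, weighted by subset size
    entropy_info = sum(
        (len(subset) / len(rows)) * entropy(subset)
        for subset in ([r[-1] for r in rows if r[idx] == v] for v in {r[idx] for r in rows})
    )
    print(attr, round(entropy(target) - entropy_info, 4))  # information gain
# Assessment has the largest gain, so it is chosen as the best split attribute.
```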
Now continue the same process for the subset of data instances branched with
Assessment=’Average’.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info and Gain is repeated
with the subset of the training set branched on Assessment = 'Average'. The subset consists of 3 data instances.
Entropy_Info(T) = Entropy_Info(1, 2)
= − [ (1/3) log2(1/3) + (2/3) log2(2/3) ]
= 0.9182
Entropy_Info(T, Assignment) = (1/3) [ −(0/1) log2(0/1) − (1/1) log2(1/1) ] + (2/3) [ −(1/2) log2(1/2) − (1/2) log2(1/2) ]
= 0.6667
Gain(Assignment) = 0.9182 − 0.6667 = 0.2515
Entropy_Info(T, Project) = (1/3) [ −(1/1) log2(1/1) − (0/1) log2(0/1) ] + (2/3) [ −(0/2) log2(0/2) − (2/2) log2(2/2) ]
= 0
Gain(Project) = 0.9182 − 0 = 0.9182
Entropy_Info(T, Seminar) = (1/3) [ −(1/1) log2(1/1) − (0/1) log2(0/1) ] + (2/3) [ −(0/2) log2(0/2) − (2/2) log2(2/2) ] + 0
= 0
Gain(Seminar) = 0.9182 − 0 = 0.9182
Here both the attributes “Project” and “Seminar” have the same Gain. So we can either
construct the decision tree using “Project” or “Seminar”. The final decision tree is
shown in Figure 1.
C4.5 Algorithm
Step 1:
Calculate the Entropy for the target class "Results".
Entropy_Info(Target Attribute = Results) = Entropy_Info(6, 4)
= − [ (6/10) log2(6/10) + (4/10) log2(4/10) ]
= 0.9709
Iteration 1:
Step 2:
Calculate the Entropy_Info, Gain, Split_Info and Gain Ratio for each of the attributes in the training data set.
Entropy_Info(T, Assessment)
= (5/10) [ −(5/5) log2(5/5) − (0/5) log2(0/5) ] + (3/10) [ −(1/3) log2(1/3) − (2/3) log2(2/3) ]
+ (2/10) [ −(0/2) log2(0/2) − (2/2) log2(2/2) ]
= 0.2755
Gain(Assessment) = 0.9709 − 0.2755 = 0.6954
Split_Info(T, Assessment) = − (5/10) log2(5/10) − (3/10) log2(3/10) − (2/10) log2(2/10)
= 1.4854
Gain Ratio(Assessment) = (Gain(Assessment))/(Split_Info(T, Assessment))
=0.6954/1.4854
=0.4681
Entropy_Info(T, Assignment) = (5/10) [ −(3/5) log2(3/5) − (2/5) log2(2/5) ] + (5/10) [ −(3/5) log2(3/5) − (2/5) log2(2/5) ]
= 0.9709
Gain(Assignment) = 0.9709 − 0.9709 = 0
Split_Info(T, Assignment) = − (5/10) log2(5/10) − (5/10) log2(5/10) = 1
Gain Ratio(Assignment) = (Gain(Assignment))/(Split_Info(T, Assignment))
=0/1
=0
Entropy_Info(T, Project)
= (6/10) [ −(5/6) log2(5/6) − (1/6) log2(1/6) ] + (4/10) [ −(1/4) log2(1/4) − (3/4) log2(3/4) ]
= 0.7145
Gain(Project) = 0.9709 − 0.7145 = 0.2564
Split_Info(T, Project) = − (6/10) log2(6/10) − (4/10) log2(4/10)
= 0.9709
Gain Ratio(Project) = (Gain(Project))/(Split_Info(T, Project))
=0.2564/0.9709
=0.2641
Entropy_Info(T, Seminar)
= (5/10) [ −(4/5) log2(4/5) − (1/5) log2(1/5) ] + (3/10) [ −(0/3) log2(0/3) − (3/3) log2(3/3) ]
+ (2/10) [ −(2/2) log2(2/2) − (0/2) log2(0/2) ]
= 0.3610
Gain(Seminar) = 0.9709 − 0.3610 = 0.6099
Split_Info(T, Seminar) = − (5/10) log2(5/10) − (3/10) log2(3/10) − (2/10) log2(2/10)
= 1.4854
Gain Ratio(Seminar) = (Gain(Seminar))/(Split_Info(T, Seminar))
=0.6099/1.4854
=0.4106
The Gain Ratio values calculated for all the attributes are shown in Table 3.
Table 3
Attributes Gain Ratio
Assessment 0.4681
Assignment 0.0
Project 0.2641
Seminar 0.4106
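The Gain Ratio values in Table 3 can be reproduced from the gains computed above and each attribute's value distribution; the sketch below assumes those gains as inputs, and the helper name split_info is an illustrative choice.

```python
import math

# Gains from Iteration 1 and the per-value instance counts of each attribute
gains = {"Assessment": 0.6954, "Assignment": 0.0, "Project": 0.2564, "Seminar": 0.6099}
value_counts = {
    "Assessment": [5, 3, 2],  # Good, Average, Poor
    "Assignment": [5, 5],     # Yes, No
    "Project": [6, 4],        # Yes, No
    "Seminar": [5, 3, 2],     # Good, Poor, Fair
}

def split_info(counts):
    """Split_Info = entropy of the attribute's value distribution."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

for attr, gain in gains.items():
    si = split_info(value_counts[attr])
    ratio = gain / si if si else 0.0
    print(attr, round(si, 4), round(ratio, 4))  # Split_Info and Gain Ratio per attribute
```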
Step 3: From the Table 3, choose the attribute for which Gain Ratio is maximum as the
best split attribute.
The best split attribute is Assessment since it has the maximum Gain Ratio. The tree
grows with the subset of instances with Assessment=’Average’.
Now continue the same process for the subset of data instances branched with
Assessment=’Average’.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info, Gain and Gain Ratio
is repeated with the subset of the training set branched on Assessment = 'Average'. The subset consists of 3 data instances.
Entropy_Info(T) = Entropy_Info(1, 2)
= − [ (1/3) log2(1/3) + (2/3) log2(2/3) ]
= 0.9182
Entropy_Info(T, Assignment) = (1/3) [ −(0/1) log2(0/1) − (1/1) log2(1/1) ] + (2/3) [ −(1/2) log2(1/2) − (1/2) log2(1/2) ]
= 0.6667
Gain(Assignment) = 0.9182 − 0.6667 = 0.251
Split_Info(T, Assignment) = − (1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183
Gain Ratio(Assignment) = (Gain(Assignment))/(Split_Info(T, Assignment))
= 0.251/0.9183
= 0.2733
Entropy_Info(T, Project) = (1/3) [ −(1/1) log2(1/1) − (0/1) log2(0/1) ] + (2/3) [ −(0/2) log2(0/2) − (2/2) log2(2/2) ]
= 0
Gain(Project) = 0.9182 − 0 = 0.9182
Split_Info(T, Project) = − (1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183
Gain Ratio(Project) = (Gain(Project))/(Split_Info(T, Project))
= 0.9182/0.9183
= 1
Entropy_Info(T, Seminar) = (1/3) [ −(1/1) log2(1/1) − (0/1) log2(0/1) ] + (2/3) [ −(0/2) log2(0/2) − (2/2) log2(2/2) ] + 0
= 0
Gain(Seminar) = 0.9182 − 0 = 0.9182
Split_Info(T, Seminar) = − (1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183
Gain Ratio(Seminar) = (Gain(Seminar))/(Split_Info(T, Seminar))
= 0.9182/0.9183
= 1
The Gain Ratio values calculated for all the attributes are shown in Table 4.
Table 4
Attributes Gain Ratio
Assignment 0.2733
Project 1
Seminar 1
Here both the attributes “Project” and “Seminar” have the same Gain Ratio. So we can
construct the decision tree using either “Project” or “Seminar”. The final decision tree is
shown in Figure 2.
CART Algorithm
Step 1: Compute the Gini_Index for the whole training set T, which has 6 'Pass' and 4 'Fail' instances:
Gini_Index(T) = 1 − (6/10)² − (4/10)² = 0.48
Step 2: Compute the Gini_Index for each of the attributes and for each candidate subset of that
attribute's values.
Assessment has 3 categories, so there are 6 subsets and hence 3 combinations of
subsets.
Table 5
Assessment Results = Pass Results =Fail
Good 5 0
Average 1 2
Poor 0 2
Table 6
Subsets Gini_Index
{Good}, {Average, Poor} 0.16
{Average}, {Good, Poor} 0.419
{Poor}, {Good, Average} 0.30
Step 3: Choose the best splitting subset which has minimum Gini_Index for an
attribute.
The subset Assessment ∈ {(Average, Poor), Good}, which has the lowest Gini_Index value of
0.16, is chosen as the best splitting subset.
Step 4: Compute ∆𝐺𝑖𝑛𝑖 for the best splitting subset of that attribute.
∆Gini(Assessment) = Gini(T) − Gini(T, Assessment)
= 0.48 - 0.16
= 0.32
Repeat the same process for the remaining attributes in the data set.
Table 7
Assignment Results = Pass Results =Fail
Yes 3 2
No 3 2
Gini_Index(T, Assignment ∈ {Yes}) = 1 − (3/5)² − (2/5)²
= 0.48
Gini_Index(T, Assignment ∈ {No}) = 1 − (3/5)² − (2/5)²
= 0.48
Gini_Index(T, Assignment ∈ {Yes, No}) = (5/10)(0.48) + (5/10)(0.48)
= 0.48
∆𝐺𝑖𝑛𝑖(Assignment) = Gini(T) − 𝐺𝑖𝑛𝑖(𝑇, Assignment)
= 0.48 - 0.48
=0
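A brief sketch of the Gini_Index and ∆Gini computation for the Assignment attribute in Iteration 1 is given below; the helper function name is an assumption for illustration.

```python
def gini(pass_count, fail_count):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = pass_count + fail_count
    if n == 0:
        return 0.0
    return 1 - (pass_count / n) ** 2 - (fail_count / n) ** 2

# Iteration 1: the full dataset has 6 Pass and 4 Fail instances
gini_T = gini(6, 4)                          # 0.48

# Assignment subsets (Table 7): Yes -> (3 Pass, 2 Fail), No -> (3 Pass, 2 Fail)
gini_yes, gini_no = gini(3, 2), gini(3, 2)   # 0.48 each
gini_assignment = (5 / 10) * gini_yes + (5 / 10) * gini_no
delta_gini = gini_T - gini_assignment        # 0.48 - 0.48 = 0
print(gini_T, gini_assignment, delta_gini)
```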
Table 8
Project Results = Pass Results =Fail
Yes 5 1
No 1 3
Table 9
Seminar Results = Pass Results =Fail
Good 4 1
Fair 2 0
Poor 0 3
Attribute Gini_Index ∆Gini
Assessment 0.16 0.32
Assignment 0.48 0
Step 5: Choose the best splitting attribute that has maximum ∆𝐺𝑖𝑛𝑖.
‘Assessment’ has the highest ∆Gini value. We choose ‘Assessment’ as the root node
and split the dataset into two subsets: the subset Assessment ∈ {Good} branches to the
leaf node Results = ’Pass’, and the other subset, Assessment ∈ {Average, Poor}, with 5
instances, is considered for Iteration 2.
Iteration 2:
In the second Iteration, the data set has 5 data instances shown in Table 12. Repeat the
same process to find the best splitting attribute and the splitting subset for that attribute.
Table 12
S. No. Assessment Assignment Project Seminar Result
2. Average Yes No Poor Fail
4. Poor No No Poor Fail
6. Average No Yes Good Pass
8. Poor Yes Yes Good Fail
9. Average No No Poor Fail
Gini_Index(T) = 1 − (1/5)² − (4/5)²
= 0.32
Table 13
Assignment Results = Pass Results =Fail
Yes 0 2
No 1 2
Gini_Index(T, Assignment ∈ {Yes}) = 1 − (0/2)² − (2/2)²
= 0
Gini_Index(T, Assignment ∈ {No}) = 1 − (1/3)² − (2/3)²
= 0.444
Gini_Index(T, Assignment ∈ {Yes, No}) = (2/5)(0) + (3/5)(0.444)
= 0.2664
∆𝐺𝑖𝑛𝑖(Assignment) = Gini(T) − 𝐺𝑖𝑛𝑖(𝑇, Assignment)
= 0.32 - 0.2664
= 0.0536
Table 14
Project Results = Pass Results =Fail
Yes 1 1
No 0 3
Table 15
Seminar Results = Pass Results =Fail
Good 1 1
Fair 0 0
Poor 0 3
Table 17
Attribute Gini_Index ∆Gini
Assignment 0.2664 0.0536
Project 0.20 0.12
Seminar 0.20 0.12
Project and Seminar have the highest ∆Gini value. The tree is further branched based
on the attribute "Project". We choose ‘Project’ and split the dataset into two subsets:
the subset Project ∈ {No} branches to the leaf node Results = ’Fail’, and the other subset,
Project ∈ {Yes}, with 2 instances as shown in Table 18, is considered for Iteration 3.
Iteration 3:
Table 18
S. No. Assessment Assignment Project Seminar Result
6. Average No Yes Good Pass
8. Poor Yes Yes Good Fail
Gini_Index(T) = 1 − (1/2)² − (1/2)²
= 0.5
Table 19
Assignment Results = Pass Results =Fail
Yes 0 1
No 1 0
Gini_Index(T, Assignment ∈ {Yes}) = 1 − (0/1)² − (1/1)²
= 0
Gini_Index(T, Assignment ∈ {No}) = 1 − (1/1)² − (0/1)²
= 0
Gini_Index(T, Assignment ∈ {Yes, No}) = (1/2)(0) + (1/2)(0)
= 0
∆𝐺𝑖𝑛𝑖(Assignment) = Gini(T) − 𝐺𝑖𝑛𝑖(𝑇, Assignment)
= 0.5 - 0 = 0.5
Table 20
Seminar Results = Pass Results =Fail
Good 1 1
Table 21
Attribute Gini_Index ∆𝑮𝒊𝒏𝒊
Assignment 0 0.5
Seminar 0.5 0
Assignment has the highest ∆𝐺𝑖𝑛𝑖 value. Here all branches end up in a leaf node and
the process of construction is completed. The final tree is shown in Figure 3.