AI&ML Module 3
Example 3.1: Consider the student performance training dataset of 8 data instances shown in Table 3.1 which
describes the performance of individual students in a course and their CGPA obtained in the previous semesters.
The independent attributes are CGPA, Assessment and Project. The target variable is ‘Result’ which is a discrete
valued variable that takes two values ‘Pass’ or ‘Fail’. Based on the performance of a student, classify whether a
student will pass or fail in that course.
Solution: Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail}, also called classes, we need to use
the training set to classify the test instance using Euclidean distance.
The task of classification is to assign a category or class to an arbitrary instance. Assign k = 3.
Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, 5) and each of the training instances
as shown in Table 3.2.
Step 2: Sort the distances in the ascending order and select the first 3 nearest training data instances to the test
instance. The selected nearest neighbors are shown in Table 3.3.
Here, we take the 3 nearest neighbors as instances 4, 5 and 7 with smallest distances.
Step 3: Predict the class of the test instance by majority voting.
The class for the test instance is predicted as ‘Fail’.
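A minimal sketch of the three steps above in Python is given below. The training tuples here are hypothetical stand-ins, since Table 3.1 is not reproduced in this extract; only the procedure (Euclidean distance, sorting, majority voting with k = 3) follows the example.

import math
from collections import Counter

# Hypothetical training instances (CGPA, Assessment, Project, Result);
# the actual values are in Table 3.1 of the text.
train = [
    (9.2, 85, 8, "Pass"), (8.0, 80, 7, "Pass"), (8.5, 81, 8, "Pass"),
    (6.0, 45, 5, "Fail"), (6.5, 50, 4, "Fail"), (8.2, 72, 7, "Pass"),
    (5.8, 38, 5, "Fail"), (8.9, 91, 9, "Pass"),
]
test = (6.1, 40, 5)
k = 3

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Step 1: distance from the test instance to every training instance
distances = [(euclidean(row[:3], test), row[3]) for row in train]
# Step 2: sort in ascending order and keep the k nearest neighbours
neighbours = sorted(distances)[:k]
# Step 3: majority vote over the neighbours' classes
predicted = Counter(label for _, label in neighbours).most_common(1)[0][0]
print(neighbours, "->", predicted)   # 'Fail' for this hypothetical data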
Example 3.2: Consider the same training dataset given in Table 3.1. Use Weighted k-NN and determine the class.
Solution: Step 1: Given a test instance (7.6, 60, 8) and a set of classes {Pass, Fail}, use the training dataset to classify the test
instance using Euclidean distance and weighting function. Assign k = 3. The distance calculation is shown in Table 3.4.
Step 2: Sort the distances in the ascending order and select the first 3 nearest training data instances to the test instance. The
selected nearest neighbors are shown in Table 3.5.
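A sketch of the weighted variant follows. Since Tables 3.4 and 3.5 are not reproduced here, the training tuples are again hypothetical and an inverse-distance weight is assumed as the weighting function; the steps (compute distances, select the k = 3 nearest, take a weighted vote) mirror the solution above.

import math
from collections import defaultdict

# Hypothetical training instances (CGPA, Assessment, Project, Result).
train = [
    (9.2, 85, 8, "Pass"), (8.0, 80, 7, "Pass"), (8.5, 81, 8, "Pass"),
    (6.0, 45, 5, "Fail"), (6.5, 50, 4, "Fail"), (8.2, 72, 7, "Pass"),
    (5.8, 38, 5, "Fail"), (8.9, 91, 9, "Pass"),
]
test = (7.6, 60, 8)
k = 3

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Step 1: distances to the test instance, sorted ascending; keep k nearest.
nearest = sorted((euclidean(r[:3], test), r[3]) for r in train)[:k]

# Step 2: weight each neighbour; inverse-distance weighting is assumed here.
votes = defaultdict(float)
for d, label in nearest:
    votes[label] += 1.0 / (d + 1e-9)     # small constant avoids division by zero

# Step 3: the class with the largest total weight wins.
print(max(votes, key=votes.get))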
Example 3.3: Consider the sample data shown in Table 3.8 with two features x and y. The target classes are ‘A’ or
‘B’. Predict the class using Nearest Centroid Classifier.
Solution:
Step 1: Compute the mean/centroid of each class. In this example there are two classes called ‘A’ and ‘B’.
Centroid of class ‘A’ = (3 + 5 + 4, 1 + 2 + 3)/3 = (12, 6)/3 = (4, 2)
Centroid of class ‘B’ = (7 + 6 + 8, 6 + 7 + 5)/3 = (21, 18)/3 = (7, 6)
Now given a test instance (6, 5), we can predict the class.
Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each of the centroids.
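These two steps can be checked with a short sketch; the class samples below are exactly the points used in the centroid computation of Step 1, and the distance is the ordinary Euclidean distance.

import math

data = {
    "A": [(3, 1), (5, 2), (4, 3)],
    "B": [(7, 6), (6, 7), (8, 5)],
}
test = (6, 5)

def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

# Step 1: per-class centroids -> A: (4.0, 2.0), B: (7.0, 6.0)
centroids = {label: centroid(pts) for label, pts in data.items()}

# Step 2: Euclidean distance from the test instance to each centroid
dist = {label: math.dist(test, c) for label, c in centroids.items()}
print(dist)                      # A: sqrt(13) ~ 3.61, B: sqrt(2) ~ 1.41
print(min(dist, key=dist.get))   # predicted class: 'B'

Since the test instance lies closer to the centroid of class 'B', the predicted class is 'B'.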
Eq. (3.1)
• The cost function is such that it minimizes the error difference between the predicted value hβ(x) and the true
value ‘y’, and it is given in Eq. (3.2).
Eq. (3.2)
Eq. (3.3)
where 𝑤𝑖 is the weight associated with each 𝑥𝑖 .
• The weight function used is a Gaussian kernel that gives a higher value for instances that are close to the test
instance; for instances far away, it tends to zero but never equals zero.
𝑤𝑖 is computed in Eq. (3.4) as,
Eq. (3.4)
where, τ is called the bandwidth parameter and controls the rate at which 𝑤𝑖 reduces to zero with distance
from 𝑥𝑖 .
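As a sketch, assuming the standard Gaussian kernel form w_i = exp(-||x_i - x||^2 / (2*tau^2)) for Eq. (3.4) (the equation itself is not reproduced in this extract), the weights for a query point can be computed and used in a weighted least-squares fit as follows; the data values are hypothetical.

import numpy as np

def gaussian_weights(X, x_query, tau):
    """w_i = exp(-||x_i - x_query||^2 / (2 * tau^2)); close points get weight near 1."""
    d2 = np.sum((X - x_query) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * tau ** 2))

def locally_weighted_predict(X, y, x_query, tau=1.0):
    # Augment with a bias column so the local model is beta0 + beta1*x + ...
    Xb = np.c_[np.ones(len(X)), X]
    xq = np.r_[1.0, x_query]
    W = np.diag(gaussian_weights(X, x_query, tau))
    # Weighted least squares: beta = (Xb' W Xb)^-1 (Xb' W y)
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xq @ beta

# Hypothetical 1-D data for illustration.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.2, 1.9, 3.2, 3.8])
print(locally_weighted_predict(X, y, np.array([2.5]), tau=0.8))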
Example 3.4: Consider a simple example with four instances shown in Table 3.9 and apply locally weighted
regression.
• The predicted output for the three closer instances is given as follows:
o The predicted output of Instance 2 is:
• Now, we need to adjust this cost function to minimize the error difference and get optimal β parameters.
o Lasso and Ridge Regression Methods These are special variants of regression in which regularization
methods are used to limit the number and size of the coefficients of the independent variables.
o These individual errors are added to compute the total error of the
predicted line. This is called sum of residuals.
o The squares of the individual errors can also be computed and added to
give a sum of squared error. The line with the lowest sum of squared error
is called line of best fit.
o In other words, OLS is an optimization technique in which the difference between the data points and the
line is minimized (Figure 3.14: Data Points and their Errors).
o Mathematically, based on Eq. (3.5), the line equations for points (x1, x2, …,
xn) are:
Eq. (3.6)
o A regression line is the line of best fit for which the sum of the squares of residuals is minimum. The
minimization can be done as minimization of individual errors by finding the parameters a0 and a1 such
that:
Eq. (3.8)
Or as the minimization of sum of absolute values of the individual errors:
Eq. (3.9)
Or as the minimization of the sum of the squares of the individual errors:
Eq. (3.10)
o The sum of the squares of the individual errors is often preferred because the individual errors do not cancel
out and are always positive, and the sum of squares increases sharply even for a small change in the error.
Therefore, this criterion is preferred for linear regression.
o Therefore, linear regression is modelled as a minimization function as follows:
Eq. (3.11)
o Here, J(a0, a1) is the criterion function of parameters a0 and a1. This needs to be minimized. This is done by
differentiating it with respect to a0 and a1 and equating the derivatives to zero, which yields the coefficient
values of a0 and a1. The estimate of a1 is given as follows:
Eq. (3.12)
o And the value of a0 is given as follows:
Eq. (3.13)
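A sketch of these closed-form estimates is given below; it assumes the standard least-squares solutions usually stated for Eqs. (3.12) and (3.13), namely a1 = (Σxᵢyᵢ − n·x̄·ȳ)/(Σxᵢ² − n·x̄²) and a0 = ȳ − a1·x̄, since the equations themselves are not reproduced in this extract. The sample data is hypothetical.

import numpy as np

def fit_simple_linear(x, y):
    """Least-squares estimates for y = a0 + a1*x (standard closed form assumed)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    a1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
    a0 = y.mean() - a1 * x.mean()
    return a0, a1

# Hypothetical weekly sales figures (in thousands); Table 3.11 is not reproduced here.
weeks = [1, 2, 3, 4, 5]
sales = [1.2, 1.8, 2.6, 3.2, 3.8]
a0, a1 = fit_simple_linear(weeks, sales)
print(a0, a1)
print("Predicted 7th-week sales:", a0 + a1 * 7)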
Example 3.5: Let us consider an example where five weeks' sales data (in thousands) is given as shown below
in Table 3.11. Apply the linear regression technique to predict the 7th and 9th week sales.
Solution: Here, there are 5 items, i.e., i = 1, 2, 3, 4, 5. The computation table is shown below (Table 3.12). Here, there
are five samples, so i ranges from 1 to 5.
Eq. (3.14)
• This can be written as: Y = Xa + e, where X is an n × 2 matrix, Y is an n × 1 vector, a is a 2 × 1 column vector and e
is an n × 1 column vector.
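Assuming Eq. (3.17) is the usual normal-equations solution â = (XᵀX)⁻¹XᵀY, the matrix approach can be sketched as follows; the same routine applies to the matrix-form examples that follow (Examples 3.6 and 3.7), though the data values below are hypothetical.

import numpy as np

def fit_ols_matrix(X, y):
    """Solve Y = X a + e for a using the normal equations (X'X) a = X'Y."""
    X = np.c_[np.ones(len(X)), np.asarray(X, float)]   # prepend a column of 1s for a0
    y = np.asarray(y, float)
    return np.linalg.solve(X.T @ X, X.T @ y)           # numerically safer than an explicit inverse

# Hypothetical data: two predictor columns x1, x2 and weekly sales y.
X = [[1.0, 4.0], [2.0, 5.0], [3.0, 8.0], [4.0, 2.0]]
y = [2.0, 3.0, 4.0, 5.0]
a = fit_ols_matrix(X, y)                               # a = [a0, a1, a2]
print(a)
print("Prediction for (x1, x2) = (5, 6):", np.r_[1.0, 5.0, 6.0] @ a)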
Example 3.6: Find linear regression of the data of week and product sales (in Thousands) given in Table 3.13. Use
linear regression in matrix form.
Solution: Here, the matrix X of the independent variable is given as:
Eq. (3.15)
• In general, this is given for ‘n’ independent variables as:
Eq. (3.16)
• Here, (𝑥1 , 𝑥2 , …, 𝑥𝑛 ) are predictor variables, y is the dependent variable, (𝑎0 , 𝑎1 , …, 𝑎𝑛 ) are the coefficients of
the regression equation and ε is the error term. This is illustrated through Example 3.7.
Example 3.7: Apply multiple regression for the values given in Table 3.14 where weekly sales along with sales for
products 𝑋1 and 𝑋2 are provided. Use matrix approach for finding multiple regression.
The regression coefficient for multiple regression is calculated the same way as linear regression:
Eq. (3.17)
Using Eq. (3.17) and substituting the values, one gets â as:
Here, the coefficients are a0 = −1.69, a1 = 3.48 and a2 = −0.05. Hence, the constructed model is: y = −1.69 + 3.48x1 − 0.05x2
• Similarly, a power function of the form y = ax^b can be transformed by applying the log function on both sides as
follows:
Eq. (3.19)
• Once the transformation is carried out, linear regression can be performed, and after the results are obtained,
the inverse functions can be applied to get the desired result.
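A sketch of the idea, under the usual assumption that Eq. (3.19) is log y = log a + b·log x: fit a straight line in log–log space and then invert the transform to recover a and b. The data values are hypothetical.

import numpy as np

# Hypothetical data generated from an (approximate) power law y = a * x**b.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 8.3, 17.8, 32.5, 49.6])

# Transform: log y = log a + b * log x, then fit an ordinary straight line.
log_x, log_y = np.log(x), np.log(y)
b, log_a = np.polyfit(log_x, log_y, 1)     # slope b, intercept log a

# Inverse transform recovers the original power-function parameters.
a = np.exp(log_a)
print(f"y ~ {a:.2f} * x^{b:.2f}")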
Polynomial Regression
• Polynomial regression provides a non-linear curve such as quadratic and cubic.
• For example, the second-degree transformation, called quadratic transformation, is given as: y = a0 + a1x + a2x^2,
and the third-degree polynomial, called cubic transformation, is given as: y = a0 + a1x + a2x^2 + a3x^3.
• Generally, polynomials of maximum degree 4 are used, as higher order polynomials take some strange shapes
and make the curve more flexible. It leads to a situation of overfitting and hence is avoided.
• Let us consider a polynomial of 2nd degree. Given points (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), …, (𝑥𝑛 , 𝑦𝑛 ), the objective is to fit a
polynomial of degree 2. The polynomial of degree 2 is given as:
Eq. (3.20)
• Such that the error E = Σ(i=1 to n) [yi − (a0 + a1xi + a2xi^2)]^2 is minimized. The coefficients a0, a1, a2 of Eq. (3.20) can be
obtained by taking partial derivatives of E with respect to each of the coefficients, ∂E/∂a0, ∂E/∂a1 and ∂E/∂a2, and
setting each to zero. This results in 2 + 1 = 3 equations given as follows:
Eq. (3.21)
• The best line is the line that minimizes the error between line and data points. Arranging the coefficients of the
above equation in the matrix form results in:
Eq. (3.22)
• This is of the form Xa = B. One can solve this equation for a as:
Eq. (3.23)
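A sketch of the second-degree fit: the 3×3 system of Eq. (3.22) is built from the sums Σx, Σx², Σx³, Σx⁴, Σy, Σxy and Σx²y (the standard normal equations for a quadratic, which the surrounding text describes) and solved for a = [a0, a1, a2]. The sample points are hypothetical.

import numpy as np

def fit_quadratic(x, y):
    """Fit y = a0 + a1*x + a2*x^2 by solving the 3x3 normal equations X a = B."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # Matrix of sums of powers of x (rows correspond to the three normal equations).
    X = np.array([
        [n,            x.sum(),        (x**2).sum()],
        [x.sum(),      (x**2).sum(),   (x**3).sum()],
        [(x**2).sum(), (x**3).sum(),   (x**4).sum()],
    ])
    B = np.array([y.sum(), (x * y).sum(), (x**2 * y).sum()])
    return np.linalg.solve(X, B)          # a = X^-1 B, as in Eq. (3.23)

# Hypothetical sample of four points (Table 3.15 is not reproduced here).
x = [1, 2, 3, 4]
y = [3, 6, 11, 18]                         # exactly y = 2 + x^2
print(fit_quadratic(x, y))                 # -> approximately [2, 0, 1]

For Example 3.8 below, the same system would be assembled from N = 4, Σy = 29, Σxy = 96 and Σx²y = 338 together with the x-sums from Table 3.16.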
Example 3.8: Consider the data provided in Table 3.15 and fit it using the second-order polynomial.
Solution: For applying polynomial regression, computation is done as shown in Table 3.16. Here,
the order is 2 and the sample i ranges from 1 to 4.
(Table 3.15: Sample Data)
• It can be noted that N = 4, Σyi = 29, Σxiyi = 96, Σxi^2·yi = 338. When the order is 2, the matrix form using Eq. (3.22) is
given as follows:
2. Whether the student should be admitted or not is based on entrance examination marks. Here, the categorical
response variable is ‘admitted’ or ‘not admitted’.
3. The student being pass or fail is based on marks secured.
• Thus, logistic regression is used as a binary classifier and works by predicting the probability of the categorical
variable.
• In general, it takes one or more features x and predicts the response y. If the probability is predicted via linear
regression, it is given as:
• The value generated by linear regression is in the range -∞ to +∞, whereas the probability of the response variable
ranges between 0 and 1.
• Hence, there must be a mapping function that maps values in the range -∞ to +∞ to the range 0–1. The core of the
mapping function in the logistic regression method is the sigmoidal function.
• A sigmoidal function is an ‘S’-shaped function that yields values between 0 and 1; it is known as the logistic
function. It is mathematically represented as:
Here, x is the independent variable and e is Euler's number. The purpose of this function is to map
any real number to a value between 0 and 1.
• Logistic regression can be viewed as an extension of linear regression, but the only difference is that the output
of linear regression can be an extremely high number. This needs to be mapped into the range 0–1, as
probability can have values only in the range 0–1. This problem is solved using log odd or logit functions.
• Here, log(.) is a logit function or log odds function. One can solve for p(x) by taking the inverse of the above
function as:
• This is the same sigmoidal function. It always gives the value in the range 0–1. Dividing the numerator and
denominator by the numerator, one gets:
• One can rearrange this by taking the minus sign outside to get the following logistic function:
Here, x is the explanatory or predictor variable, e is the Euler number, and 𝒂𝟎 , 𝒂𝟏 are the regression
coefficients. The coefficients 𝒂𝟎 , 𝒂𝟏 can be learned and the predictor predicts p(x) directly using the threshold
function as:
Example 3.9: Let us assume a binomial logistic regression problem where the classes are pass and fail. The student
dataset contains entrance marks along with historic data of whether each student was selected or not. Based on the logistic
regression, the values of the learnt parameters are a0 = 1 and a1 = 8. Assuming marks of x = 60, compute the
resultant class.
Solution: The values of regression coefficients are 𝑎0 = 1 and 𝑎1 = 8, and given that x = 60.
Based on the regression coefficients, z can be computed as:
One can fit this in a sigmoidal function using the below equation to get the probability as:
If we assume the threshold value as 0.5, then it is observed that 0.44 < 0.5, therefore, the candidate with
marks 60 is not selected.
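A sketch of binomial logistic prediction with the sigmoid and a 0.5 threshold follows; the coefficients below are hypothetical placeholders for illustration only, not the ones learnt in Example 3.9.

import math

def sigmoid(z):
    """Logistic (sigmoidal) function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, a0, a1, threshold=0.5):
    """Predict the positive class when p(x) = sigmoid(a0 + a1*x) crosses the threshold."""
    p = sigmoid(a0 + a1 * x)
    return p, ("selected" if p >= threshold else "not selected")

# Hypothetical coefficients and entrance mark, for illustration only.
print(predict(x=60, a0=-4.0, a1=0.06))    # p ~ 0.40 -> 'not selected'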
• Figure 3.16 shows symbols that are used in this module to represent different nodes in the
construction of a decision tree. A circle is used to represent a root node, a diamond symbol is
used to represent a decision node or the internal nodes, and all leaf nodes are represented
with a rectangle.
• A decision tree consists of two major procedures discussed below.
1. Building the Tree
o Goal Construct a decision tree with the given training dataset. The tree is constructed
in a top-down fashion. It starts from the root node. At every level of tree
construction, we need to find the best split attribute or best decision node among all
attributes. This process is recursive and continues until we end up at the last level of
the tree or find a leaf node which cannot be split further.
o Output Decision tree representing the complete hypothesis space.
2. Knowledge Inference or Classification
o Goal Given a test instance, infer to the target class it belongs to.
o Classification Inferring the target class for the test instance or object is based on inductive inference on
the constructed decision tree. In order to classify an object, we need to start traversing the tree from the
root. We traverse the tree by evaluating the test condition at every decision node with the test object's
attribute value and walking down the branch corresponding to the test's outcome.
o Output Target label of the test instance.
Example 3.10: How to draw a decision tree to predict a student’s academic performance based on the given
information such as class attendance, class assignments, home-work assignments, tests, participation in
competitions or other events, group activities such as projects and presentations, etc.
Solution: The target feature is the student performance in the final examination whether he will pass or fail in the
examination. The decision nodes are test nodes which check for conditions like ‘What’s the student’s class
attendance?’, ‘How did he perform in his class assignments?’, ‘Did he do his home assignments properly?’ ‘What
about his assessment results?’, ‘Did he participate in competitions or other events?’, ‘What is the performance
rating in group activities such as projects and presentations?’. Table 3.17 shows the attributes and set of values for
each attribute.
The leaf nodes represent the outcomes, that is, either ‘pass’, or ‘fail’.
A decision tree is constructed by following a set of if-else conditions which may or may not
include all the attributes, and decision nodes can have two or more than two outcomes. Hence, the tree is not
necessarily a binary tree.
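As a sketch of the inference procedure (the attribute names and values below are hypothetical, since Table 3.17 is not reproduced here), classifying a student amounts to walking a chain of if-else tests from the root to a leaf:

def classify_student(student):
    """Traverse hypothetical decision nodes; each test sends us down one branch."""
    if student["attendance"] == "poor":
        return "fail"
    if student["assignments"] == "not done":
        return "fail"
    if student["assessment"] == "good":
        return "pass"
    # remaining cases decided by group-activity performance
    return "pass" if student["group_activity"] == "good" else "fail"

print(classify_student({"attendance": "good", "assignments": "done",
                        "assessment": "average", "group_activity": "good"}))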
Eq. (3.24)
• It is concluded that if the dataset has instances that are completely homogeneous, then the entropy is 0 and if
the dataset has samples that are equally divided (i.e., 50% – 50%), it has an entropy of 1. Thus, the entropy value
ranges between 0 and 1 based on the randomness of the samples in the dataset.
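These boundary values follow directly from the entropy formula (base-2 logarithm): for a completely homogeneous set, the entropy is −1·log2(1) = 0, and for an equal 50%–50% split it is −(0.5·log2(0.5) + 0.5·log2(0.5)) = 1.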
• Let P be the probability distribution of data instances from 1 to n as shown in Eq. (3.25).
Eq. (3.25)
• Entropy of P is the information measure of this probability distribution given in Eq. (3.26),
Eq. (3.26)
where, P1 is the probability of data instances classified as class 1 and P2 is the probability of data instances
classified as class 2 and so on.
P1 = |No of data instances belonging to class 1|/ |Total no of data instances in the training dataset|
Entropy_Info(P) can be computed as shown in Eq. (3.24).
Definitions
Let T be the training dataset.
Let A be the set of attributes A = {A1, A2, A3, ……. An}.
Let m be the number of classes in the training dataset.
Let 𝑃𝑖 be the probability that a data instance or a tuple ‘d’ belongs to class 𝐶𝑖 .
It is calculated as,
𝑃𝑖 = Total no of data instances that belongs to class 𝐶𝑖 in T/Total no of tuples in the training set T
Mathematically, it is represented as shown in Eq. (3.27).
Eq. (3.27)
Expected information or Entropy needed to classify a data instance d in T is denoted as Entropy_Info(T)
given in Eq. (3.28).
Eq. (3.28)
Eq. (3.29)
where, the attribute A has got ‘v’ distinct values {a1, a2, …. av}, Ai is the number of instances for distinct
value ‘i’ in attribute A, and Entropy_Info(Ai) is the entropy for that set of instances.
Information_Gain is a metric that measures how much information is gained by branching on an attribute A.
In other words, it measures the reduction in impurity in an arbitrary subset of data. It is calculated as given in
Eq. (3.30):
Eq. (3.30)
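A sketch of these quantities under the standard ID3 definitions (Entropy_Info(T) = −Σ Pᵢ log₂ Pᵢ, Entropy_Info(T, A) as the weighted sum of subset entropies, and Gain(A) as their difference) is given below; Eqs. (3.28)–(3.30) themselves are not reproduced in this extract, so the usual formulas are assumed, and the sample rows are a hypothetical slice of a dataset like Table 3.19.

import math
from collections import Counter

def entropy(labels):
    """Entropy_Info: -sum(p_i * log2 p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(A) = Entropy_Info(T) - sum(|T_v|/|T| * Entropy_Info(T_v)) over values v of A."""
    total = entropy([r[target] for r in rows])
    n = len(rows)
    weighted = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        weighted += len(subset) / n * entropy(subset)
    return total - weighted

# Tiny hypothetical slice of a dataset like Table 3.19.
rows = [
    {"CGPA": ">=9", "Job Offer": "Yes"}, {"CGPA": ">=9", "Job Offer": "No"},
    {"CGPA": ">=8", "Job Offer": "Yes"}, {"CGPA": "<8",  "Job Offer": "No"},
]
print(entropy([r["Job Offer"] for r in rows]))   # 1.0 for a 2/2 split
print(info_gain(rows, "CGPA", "Job Offer"))      # 0.5 for this toy data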
Example 3.12: Assess a student’s performance during his course of study and predict whether a student will get a
job offer or not in his final year of the course. The training dataset T consists of 10 data instances with attributes
such as ‘CGPA’, ‘Interactiveness’, ‘Practical Knowledge’ and ‘Communication Skills’ as shown in Table 3.19. The
target class attribute is the ‘Job Offer’.
Solution: Step 1:
Calculate the Entropy for the target class ‘Job Offer’.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3) =
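Assuming the usual base-2 form of Eq. (3.28) and the class counts given (7 Yes, 3 No), this evaluates to −(7/10)·log2(7/10) − (3/10)·log2(3/10) ≈ 0.8813.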
Iteration 1: Step 2:
Calculate the Entropy_Info and Gain (Information_Gain) for each of the attributes in the training dataset.
Table 3.20 shows the number of data instances classified with Job Offer as Yes or No for the attribute CGPA.
Table 3.21 shows the number of data instances classified with Job Offer as Yes or No for the attribute
Interactiveness.
Table 3.22 shows the number of data instances classified with Job Offer as Yes or No for the attribute Practical
Knowledge.
Table 3.22 shows the number of data instances classified with Job Offer as Yes or No for the attribute
Communication Skills.
The Gain calculated for all the attributes is shown in Table 3.23:
Step 3: From Table 3.23, choose the attribute for which entropy is minimum and therefore the gain is maximum as
the best split attribute.
The best split attribute is CGPA since it has the maximum gain. So, we choose CGPA as the root node.
There are three distinct values for CGPA with outcomes ≥9, ≥8 and <8. The entropy value is 0 for ≥8 and <8 with all
instances classified as Job Offer = Yes for ≥8 and Job Offer = No for <8. Hence, both ≥8 and <8 end up in a leaf node.
The tree grows with the subset of instances with CGPA ≥9 as shown in Figure 3.18.
Now, continue the same process for the subset of data instances branched with CGPA ≥ 9.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info and Gain are repeated with the subset of training
set. The subset consists of 4 data instances as shown in the above Figure 3.18.
The gain calculated for all the attributes is shown in Table 3.24.
Here, both the attributes ‘Practical Knowledge’ and ‘Communication Skills’ have the same Gain. So, we can either
construct the decision tree using ‘Practical Knowledge’ or ‘Communication Skills’. The final decision tree is shown
in Figure 3.19.
Eq. (3.31)
where, the attribute A has got ‘v’ distinct values {a1, a2 ,…. av}, and Ai is the number of instances for distinct
value ‘i’ in attribute A.
Eq. (3.32)
1. Compute Entropy_Info Eq. (3.28) for the whole training dataset based on the target
attribute.
2. Compute Entropy_Info Eq. (3.29), Info_Gain Eq. (3.30), Split_Info Eq. (3.31) and
Gain_Ratio Eq. (3.32) for each of the attributes in the training dataset.
3. Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
subsets.
6. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
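A sketch of the per-attribute quantities in steps 1–2 above, with Split_Info and Gain_Ratio taken as their usual C4.5 definitions (Split_Info(T, A) = −Σ |Aᵢ|/|T| · log₂(|Aᵢ|/|T|) and Gain_Ratio = Info_Gain / Split_Info), since Eqs. (3.31)–(3.32) are not reproduced in this extract. The sample rows are hypothetical.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """Gain_Ratio(A) = Info_Gain(A) / Split_Info(A) for a categorical attribute A."""
    n = len(rows)
    info_gain = entropy([r[target] for r in rows])
    split_info = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        p = len(subset) / n
        info_gain -= p * entropy([r[target] for r in subset])
        split_info -= p * math.log2(p)
    return info_gain / split_info if split_info else 0.0

# Tiny hypothetical slice of a dataset like Table 3.19.
rows = [
    {"CGPA": ">=9", "Job Offer": "Yes"}, {"CGPA": ">=9", "Job Offer": "No"},
    {"CGPA": ">=8", "Job Offer": "Yes"}, {"CGPA": "<8",  "Job Offer": "No"},
]
print(gain_ratio(rows, "CGPA", "Job Offer"))   # 0.5 / 1.5 = 0.333 for this toy data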
Example 3.13: Make use of the Information Gain of the attributes calculated with the ID3 algorithm in Example
3.12 to construct a decision tree using C4.5.
Solution:
Iteration 1:
Step 1: Calculate the Class_Entropy for the target class ‘Job Offer’.
Step 2: Calculate the Entropy_Info, Gain (Info_Gain), Split_Info and Gain_Ratio for each of the attributes in the training
dataset.
CGPA:
Interactiveness:
Practical Knowledge:
Communication Skills:
Table 3.25 shows the Gain_Ratio computed for all the attributes.
Step 3: Choose the attribute for which Gain_Ratio is maximum as the best split attribute. From Table 3.25, we can
see that CGPA has highest gain ratio and it is selected as the best split attribute. We can construct the decision tree
placing CGPA as the root node shown in Figure 3.20. The training dataset is split into subsets with 4 data instances.
Practical Knowledge:
Table 3.26 shows the Gain_Ratio computed for all the attributes.
Both ‘Practical Knowledge’ and ‘Communication Skills’ have the highest gain ratio. So, the best splitting
attribute can either be ‘Practical Knowledge’ or ‘Communication Skills’, and therefore, the split can be based on any
one of these.
Here, we split based on ‘Practical Knowledge’. The final decision tree is shown in Figure 3.21.
• Remove the duplicates and consider only the unique values of the attribute.
• Now, compute the Gain for the distinct values of this continuous attribute. Table 3.28 shows the computed
values.
Hence, in this CART algorithm, we need to compute the best splitting attribute and the best splitting subset in the
chosen attribute.
The lower the Gini_Index value, the higher the homogeneity of the data instances. Gini_Index(T) is computed as given in
Eq. (3.33).
Eq. (3.33)
where Pi is the probability that a data instance or a tuple ‘d’ belongs to class Ci. It is computed as:
𝑃𝑖 = |No. of data instances belonging to class i|/|Total no of data instances in the training dataset T|
• GINI Index assumes a binary split on each attribute, therefore, every attribute is considered as a binary attribute
which splits the data instances into two subsets 𝑆1 and 𝑆2 .
• Gini_Index(T, A) is computed as given in Eq. (3.34).
Eq. (3.34)
• The splitting subset with minimum Gini_Index is chosen as the best splitting subset for an attribute. The best
splitting attribute is the one with the minimum Gini_Index or, equivalently, the maximum ΔGini, because it reduces
the impurity the most. ΔGini is computed as given in Eq. (3.35):
Eq. (3.35)
Example 3.14: Choose the same training dataset shown in Table 3.19 and construct a decision tree using CART
algorithm.
Solution:
Step 1: Calculate the Gini_Index for the dataset shown in Table 3.19, which consists of 10 data instances. The target
attribute ‘Job Offer’ has 7 instances as Yes and 3 instances as No.
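Under the standard formula assumed for Eq. (3.33), this gives Gini_Index(T) = 1 − (7/10)² − (3/10)² = 1 − 0.49 − 0.09 = 0.42.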
Step 2: Compute the Gini_Index for each of the attributes and each of the subsets in the attribute. CGPA has 3 categories,
so there are 6 subsets and hence 3 combinations of subsets (as shown in Table 3.29).
Repeat the same process for the remaining attributes in the dataset such as for Interactiveness shown in
Table 3.31, Practical Knowledge in Table 3.32, and Communication Skills in Table 3.34.
Table 3.33 shows the Gini_Index for various subsets of Practical Knowledge.
Table 3.35 shows the Gini_Index for various subsets of Communication Skills.
Table 3.36 shows the Gini_Index and ΔGini values calculated for all the attributes.
Step 5: Choose the best splitting attribute that has maximum ΔGini.
CGPA and Communication Skills have the highest ΔGini value. We can choose CGPA as the root node and
split the datasets into two subsets shown in Figure 3.22 since the tree constructed by CART is a binary tree.
Tables 3.38, 3.39, and 3.41 show the categories for attributes Interactiveness, Practical Knowledge, and
Communication Skills, respectively.
Table 3.40 shows the Gini_Index values for various subsets of Practical Knowledge.
Table 3.43 shows the Gini_Index and ΔGini values for all attributes.
Communication Skills has the highest ΔGini value. The tree is further branched based on the attribute
‘Communication Skills’. Here, we see all branches end up in a leaf node and the process of construction is
completed. The final tree is shown in Figure 3.23.
Example 3.15: Construct a regression tree using the following Table 3.44 which consists of 10 data instances and 3
attributes ‘Assessment’, ‘Assignment’ and ‘Project’. The target attribute is the ‘Result’ which is a continuous
attribute.
(Table 3.45)
(Table 3.46)
(Table 3.47)
Table 3.48 shows the standard deviation and data instances for the attribute-Assessment.
(Table 3.49)
Table 3.51 shows the Standard Deviation and Data Instances for attribute, Assignment.
(Table 3.52)
(Table 3.53)
Table 3.55 shows the standard deviation reduction for each attribute in the training dataset.
The attribute ‘Assessment’ has the maximum Standard Deviation Reduction and hence it is chosen as the
best splitting attribute.
The training dataset is split into subsets based on the attribute ‘Assessment’ and this process is continued
until the entire tree is constructed. Figure 3.24 shows the regression tree with ‘Assessment’ as the root node and the
subsets in each branch.
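A sketch of the standard-deviation-reduction criterion described above, assuming the usual regression-tree formula SDR(A) = SD(T) − Σ |Tᵥ|/|T| · SD(Tᵥ), since the corresponding equations and tables are not reproduced in this extract. The sample rows are hypothetical.

import statistics as st

def sdr(rows, attr, target):
    """Standard deviation reduction obtained by splitting on a categorical attribute."""
    targets = [r[target] for r in rows]
    base_sd = st.pstdev(targets)                 # population standard deviation of the target
    n = len(rows)
    weighted_sd = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        weighted_sd += len(subset) / n * st.pstdev(subset)
    return base_sd - weighted_sd

# Tiny hypothetical slice of a dataset like Table 3.44 (Result is continuous).
rows = [
    {"Assessment": "Good", "Result": 95}, {"Assessment": "Good", "Result": 89},
    {"Assessment": "Average", "Result": 70}, {"Assessment": "Average", "Result": 65},
    {"Assessment": "Poor", "Result": 45},
]
print(sdr(rows, "Assessment", "Result"))   # the attribute with the largest SDR is chosen as the split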
*****