Machine Learning Module-03
Module-3
Similarity-based Learning: Nearest-Neighbor Learning, Weighted K-Nearest-
Neighbor Algorithm, Nearest Centroid Classifier, Locally Weighted Regression
(LWR).
Inputs: Training dataset T, Distance metric d, Test instance t, Number of nearest neighbors k
1. For each instance i in T, compute the distance between the test instance t and instance i using a distance metric (Euclidean distance).
[Continuous attributes – the Euclidean distance between two points in the plane with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²).]
[Categorical attributes (binary) – Hamming distance: if the values of the two instances are the same, the distance d will be 0; otherwise d = 1.]
2. Sort the calculated distances in ascending order and select the first k nearest training
data instances to the test instance.
3. Predict the class of the test instance by majority voting (if the target attribute is discrete-valued) or by the mean (if the target attribute is continuous-valued) of the k selected nearest instances.
Algorithm 3.1: k-NN
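A minimal Python sketch of Algorithm 3.1 is shown below, assuming numeric feature vectors and a training set stored as (feature vector, label) pairs; the function names and the toy data are illustrative, not part of the algorithm statement above.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((q - p) ** 2 for p, q in zip(a, b)))

def knn_predict(train, test_instance, k=3):
    """k-NN classification: train is a list of (features, label) pairs."""
    # Step 1: distance from the test instance to every training instance
    distances = [(euclidean(features, test_instance), label) for features, label in train]
    # Step 2: sort in ascending order and keep the k nearest instances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the classes of the k nearest instances
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([1.0, 1.0], 'A'), ([1.2, 0.8], 'A'), ([4.0, 4.2], 'B'), ([4.1, 3.9], 'B')]
print(knn_predict(train, [1.1, 1.0], k=3))   # -> 'A'
```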
Data normalization is crucial to ensure features with different ranges don't disproportionately
influence distance calculations in k-NN. The performance of k-NN is highly dependent on the
choice of 'k', the distance metric, and the decision rule, and it is most effective with lower-
dimensional data.
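As a simple illustration of the normalization point, the following sketch applies min-max scaling to each feature column before distances are computed; the function name and the sample data are assumptions for the example only.

```python
def min_max_normalize(rows):
    """Scale every feature column to the range [0, 1] so that no single
    feature (e.g. income vs. age) dominates the Euclidean distance."""
    mins = [min(col) for col in zip(*rows)]
    maxs = [max(col) for col in zip(*rows)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

data = [[25, 40000], [35, 60000], [45, 80000]]   # age and income on very different scales
print(min_max_normalize(data))                   # both columns now lie in [0, 1]
```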
Weighted k-NN is an enhanced version of the k-NN algorithm that addresses limitations by
assigning weights to neighbors based on their distance from the test instance, giving closer
neighbors more influence in the prediction. This is achieved by making weights inversely
proportional to distances, allowing for a more refined decision-making process compared to
the standard k-NN, which treats all neighbors equally.
Inputs: Training dataset T, Distance metric d, Weighting function w(i), Test instance t, Number of nearest neighbors k
1. For each instance i in Training dataset T , compute the distance between the test instance
t and every other instance i using a distance metric ( Euclidean distance ).
[Continuous attributes – Euclidean distance between two points in the plane with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²).]
[Categorical attributes (binary) – Hamming distance: if the values of the two instances are the same, the distance d will be 0; otherwise d = 1.]
2. Sort the distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class of the test instance by weighted voting technique ( Weighting function
w(i) ) for the k selected nearest instances:
• Compute the inverse of each distance of the k selected nearest instances.
• Find the sum of the inverses.
• Compute the weight by dividing each inverse distance by the sum. ( Each weight
is a vote for its associated class ).
• Add the weights of the same class.
• Predict the class by choosing the class with the maximum vote.
Algorithm 3.2: Weighted k-NN
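A sketch of Algorithm 3.2 in Python is given below; it follows the inverse-distance weighting steps listed above, with a small epsilon added as an assumption to avoid division by zero when a neighbour coincides with the test instance.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, test_instance, k=3):
    """Weighted k-NN: closer neighbours contribute larger votes."""
    # Steps 1-2: compute distances, sort ascending, keep the k nearest
    nearest = sorted(
        (math.dist(features, test_instance), label) for features, label in train
    )[:k]
    # Step 3: inverse-distance weights, normalized so that they sum to 1
    inverses = [1.0 / (d + 1e-9) for d, _ in nearest]   # epsilon avoids divide-by-zero
    total = sum(inverses)
    votes = defaultdict(float)
    for (_, label), inv in zip(nearest, inverses):
        votes[label] += inv / total                     # each weight is a vote for its class
    return max(votes, key=votes.get)

train = [([1.0, 1.0], 'A'), ([1.2, 0.8], 'A'), ([4.0, 4.2], 'B'), ([4.1, 3.9], 'B')]
print(weighted_knn_predict(train, [3.8, 4.0], k=3))     # -> 'B'
```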
Regression is used to predict continuous variables or quantitative variables such as price and revenue. Thus, the primary concern of regression analysis is to find answers to questions such as:
There are many applications of regression analysis. Some of the applications of regression include predicting:
Fig. 3.2: Examples of (a) Positive Correlation (b) Negative Correlation (c) Random
Points with No Correlation
Causation explores whether one variable directly influences another, denoted as 'x implies y',
unlike correlation or regression which merely describe relationships. For instance, while
economic background might correlate with high marks, it doesn't necessarily cause them.
Similarly, increased cool drink sales due to temperature rise may be influenced by other factors,
highlighting that correlation doesn't equate to causation.
Linearity in regression implies that the relationship between the dependent and independent variables can be represented by a straight line (y = ax + b), where a change in one variable results in a proportional change in the other, as shown in Figure (a) below. Non-linear relationships, such as the exponential and power functions in Figures (b) and (c), do not follow this straight-line pattern.
Fig. 3.3: Examples of (a) Linear relationship (b) Non-linear relationship (c) Non-linear relationship
Functions such as the exponential function (y = ab^x) and the power function (y = x / (ax + b)) are non-linear relationships between the dependent and independent variables that cannot be fitted by a straight line. This is shown in Figures (b) and (c).
[Figure: Types of regression methods – Linear regression, Multiple linear regression, Non-linear regression, Logistic regression]
Linear Regression: It is a type of regression where a line is fitted upon given data for finding
the linear relationship between one independent variable and one dependent variable to
describe relationships.
Multiple Regression: It is a type of regression where a line is fitted for finding the linear
relationship between two or more independent variables and one dependent variable to describe
relationships among variables.
Logistic Regression: It is used for predicting categorical variables that involve one or more
independent variables and one dependent variable. This is also known as a binary classifier.
Lasso and Ridge Regression Methods: These are special variants of regression method where
regularization methods are used to limit the number and size of coefficients of the independent
variables.
1. Outliers – Outliers are abnormal data instances. They can bias the outcome of the regression model, as outliers pull the regression line towards themselves.
2. Number of cases – The ratio of cases to independent variables should be at least 20:1; that is, for every explanatory variable there should be at least 20 samples. At least five samples per variable are required in extreme cases.
3. Missing data – Missing data in training data can make the model unfit for the sampled
data.
4. Multicollinearity – If the explanatory variables are highly correlated (0.9 and above), the regression is vulnerable to bias. Singularity occurs when the correlation is a perfect 1. The remedy is to remove one of the explanatory variables that exhibit such high correlation. If there is a tie, the tolerance (1 − R squared) is used to decide which variable to eliminate.
3.6 Introduction to Linear Regression
In the simplest form, the linear regression model can be created by fitting a line among the
scattered data points. The line is of the form given in below equation
y = a₀ + a₁x + e
Here, a₀ is the intercept which represents the bias and a₁ represents the slope of the line. These
are called regression coefficients. e is the error in prediction.
3. The distribution of the error term is independent of the joint distribution of explanatory
variables.
4. The unknown parameters of the regression models are constants.
The idea of linear regression is based on the Ordinary Least Squares (OLS) approach. In this method, the data points are modelled using a straight line. Any arbitrarily drawn line is not an optimal line. Consider three data points and their errors (e1, e2, e3): the vertical distance between each point and the line (predicted by the approximate line equation y = a0 + a1x) is called an error. These individual errors are added to compute the total error of the predicted line, which is called the sum of residuals. The squares of the individual errors can also be computed and added to give the sum of squared errors. The line with the lowest sum of squared errors is called the line of best fit.
In other words, OLS is an optimization technique in which the difference between the data points and the line is minimized.
Mathematically, the line equations for the points (x1, x2, …, xn) are:
y1 = (a0 + a1x1) + e1
y2 = (a0 + a1x2) + e2
⋮
yn = (a0 + a1xn) + en
In general, the error is given as ei = yi − (a0 + a1xi).
Here, the terms (e1, e2, …, en) are the errors associated with the data points and denote the difference between the true value of an observation and the corresponding point on the line. These are also called residuals. The residuals can be positive, negative or zero.
A regression line is the line of best fit for which the sum of the squares of residuals is minimum.
The minimization is done by finding the parameters a0 and a1 that minimize the sum of the squares of the individual errors:
J(a0, a1) = Σ ei² = Σ [yi − (a0 + a1xi)]²  (summed over i = 1 to n)
The sum of the squares of the individual errors is preferred because individual errors (positive and negative) do not get cancelled out and are always positive, and the sum of squares increases sharply even for a small change in the error. Therefore, this criterion is preferred for linear regression.
Here, J(a0, a1) is the criterion function of the parameters a0 and a1, and it needs to be minimized. This is done by differentiating J with respect to a0 and a1 and setting the derivatives to zero. This yields the estimates of a0 and a1 as:
a1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²  and  a0 = ȳ − a1x̄
where x̄ and ȳ are the means of the x and y values.
Matrix notations can be used for representing the values of independent and dependent
variables.
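The closed-form estimates above can be computed directly. The following is a minimal sketch, assuming one predictor and a small illustrative dataset (the function name ols_fit is not from the text):

```python
def ols_fit(x, y):
    """Ordinary least squares estimates for y = a0 + a1*x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    a1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    a0 = y_mean - a1 * x_mean
    return a0, a1

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
a0, a1 = ols_fit(x, y)
print(round(a0, 2), round(a1, 2))   # intercept close to 0, slope close to 2
```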
Multiple regression model involves multiple predictors or independent variables and one
dependent variable. This is an extension of the linear regression problem. The basic
assumptions of multiple linear regression are that the independent variables are not highly
correlated and hence multicollinearity problem does not exist. Also, it is assumed that the
residuals are normally distributed.
For example, the multiple regression with two predictor variables x1 and x2 is given as:
y = a0 + a1x1 + a2x2 + ε
In general, (x1, x2, …, xn) are the predictor variables, y is the dependent variable, (a0, a1, …, an) are the coefficients of the regression equation, and ε is the error term.
If the relationship between the independent and dependent variables is not linear, then linear
regression cannot be used as it will result in large errors. The problem of non-linear regression
can be solved by two methods:
1. Transformation of the non-linear data to linear data, so that linear regression can handle the data
2. Polynomial (non-linear) regression, which fits a curve of higher degree directly to the data
Transformations
The first method is called transformation. The trick is to convert non-linear data to linear data
that can be handled using the linear regression method. Let us consider an exponential function
y = ae^(bx). The transformation can be done by applying the log function to both sides to get:
ln y = bx + ln a
Similarly, a power function of the form y = ax^b can be transformed by applying the log function on both sides as follows:
log y = log a + b log x
Once the transformation is carried out, linear regression can be performed and, after the results are obtained, the inverse functions can be applied to get the desired result.
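A small sketch of the transformation idea, assuming an exponential relationship y = a·e^(bx) and illustrative data; np.polyfit is used for the straight-line fit on the transformed values.

```python
import math
import numpy as np

# Fit y = a * exp(b*x) via the transformation ln y = ln a + b*x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.7, 7.4, 20.1, 54.6])       # roughly e^x, so expect a ≈ 1 and b ≈ 1

b, ln_a = np.polyfit(x, np.log(y), 1)      # linear regression on the transformed data
a = math.exp(ln_a)                         # inverse transform recovers a
print(round(a, 2), round(b, 2))
```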
Polynomial regression can handle non-linear relationships among variables by using an nth-degree polynomial.
Instead of applying transforms, polynomial regression can be directly used to deal with
different levels of curvilinearity.
Polynomial regression provides a non-linear curve such as a quadratic or cubic. For example, the second-degree polynomial (called a quadratic transformation) is given as y = a0 + a1x + a2x², and the third-degree polynomial (called a cubic transformation) is given as y = a0 + a1x + a2x² + a3x³. Generally, polynomials of maximum degree 4 are used, as higher-order polynomials take strange shapes and make the curve overly flexible; this leads to overfitting and is hence avoided.
Let us consider a polynomial of degree 2. Given points (x1, y1), (x2, y2), …, (xn, yn), the objective is to fit a polynomial of degree 2, given as:
y = a0 + a1x + a2x²
such that the error E = Σ [yi − (a0 + a1xi + a2xi²)]² (summed over i = 1 to n) is minimized. The coefficients a0, a1, a2 can be obtained by taking the partial derivatives ∂E/∂a0, ∂E/∂a1, ∂E/∂a2 and setting each of them to zero. This results in 2 + 1 = 3 equations, given as follows:
The best fit is the curve that minimizes the error between the fitted polynomial and the data points. Arranging the coefficients of the above equations in matrix form results in a system of the form Xa = B. One can solve this equation for a as:
a = X⁻¹B
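A minimal sketch of the degree-2 fit, assuming illustrative data: it builds the design matrix, forms the normal equations Xa = B, and solves for a with numpy.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 32.8])          # roughly 1 + 2x^2

V = np.column_stack([np.ones_like(x), x, x ** 2])  # columns [1, x, x^2]
X = V.T @ V                                        # coefficient matrix of the normal equations
B = V.T @ y                                        # right-hand side
a = np.linalg.solve(X, B)                          # a = X^{-1} B
print(np.round(a, 2))                              # [a0, a1, a2]
```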
Linear regression predicts the numerical response but is not suitable for predicting the
categorical variables. When categorical variables are involved, it is called classification
problem. Logistic regression is suitable for binary classification problem. Here, the output is
often a categorical variable. For example, the following scenarios are instances of predicting
categorical variables.
1. Is the mail spam or not spam? The answer is yes or no. Thus, the categorical dependent variable is a binary response of yes or no.
2. Whether a student should be admitted or not is decided based on entrance examination marks. Here, the categorical response is admitted or not admitted.
3. Whether a student passes or fails is based on the marks secured.
Thus, logistic regression, used as a binary classifier, predicts the probability of a categorical
variable y from features x. If linear regression were used, the probability would be p(x) = a₀ +
a₁x.
Logistic regression models the probability, e.g., a 0.7 probability in email classification
indicates a 70% chance of a normal email.
Since linear regression yields values from -∞ to +∞, while probabilities range from 0 to 1, a
sigmoidal (logit) function maps the values, represented as:
logit(x) = 1 / (1 + e^(−x))
Here, x is the independent variable and e is the Euler number. The purpose of the logit function is to map any real number to a value in the range 0 to 1.
Logistic regression extends linear regression, mapping its potentially large output to a 0-1
probability range using log-odds or logit functions. Odds represent the ratio of an event's
probability to the probability of it not occurring, contrasting with probability as a direct
likelihood. This is given as:
Odds = (probability of an event) / (probability of a non-event) = p / (1 − p)
Taking the log of the odds gives the log-odds or logit function, which is modelled as a linear function of the predictor:
log( p(x) / (1 − p(x)) ) = a0 + a1x
Here, log(⋅) is the logit or log-odds function. One can solve for p(x) by taking the inverse of the above function as:
p(x) = e^(a0 + a1x) / (1 + e^(a0 + a1x))
This is the same sigmoidal function and always gives a value in the range 0-1. Dividing the numerator and denominator by the numerator, and taking the minus sign outside, one gets the logistic function:
p(x) = 1 / (1 + e^(−(a0 + a1x)))
Here, x is the explanatory or predictor variable, e is the Euler number, and a0, a1 are the regression coefficients. The coefficients a0 and a1 can be learned, and the predictor then predicts the class directly using the threshold function:
y = 1 if p(x) ≥ 0.5, and y = 0 otherwise
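A short sketch of the sigmoid and threshold rule above, with assumed (not learned) coefficients purely for illustration; fitting a0 and a1 by maximum likelihood is discussed next.

```python
import math

def sigmoid(z):
    """Logistic (sigmoidal) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, a0, a1, threshold=0.5):
    """p(x) = sigmoid(a0 + a1*x); class 1 if p(x) >= threshold, else class 0."""
    p = sigmoid(a0 + a1 * x)
    return (1 if p >= threshold else 0), p

# Illustrative coefficients; in practice a0 and a1 are estimated from training data
label, prob = predict(x=2.5, a0=-4.0, a1=2.0)
print(label, round(prob, 3))   # sigmoid(1.0) ≈ 0.731 -> class 1
```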
Logistic regression parameters, crucial for understanding variable relationships, are determined
using Maximum Likelihood Estimation (MLE) on training data to minimize prediction errors.
MLE finds the optimal parameters that maximize the probability of observing the given data,
selecting from various possible coefficient sets.
If π is the probability of success of the outcome and 1 − π is the probability of failure, then the likelihood function over n independent binary observations y1, …, yn is given as:
L(π) = Π π^(yi) (1 − π)^(1 − yi)  (the product is over i = 1 to n)
To determine parameter values, the log-likelihood function is used, maximized via methods
like Newton's. Logistic regression, for binary classification, is extended to multinomial logistic
regression for multiple classes by creating pairwise classification problems (class vs. not class).
It's a simple, interpretable method, but multinomial logistic regression struggles with many
attributes and non-linear features. Multicollinearity amongst attributes can also hinder its
effectiveness.
Decision tree learning, a popular supervised model, classifies data with high accuracy by
inductively inferring general conclusions from examples, forming a tree structure from training
data to predict target classes for test data. It handles both categorical and continuous target
variables, using features as independent variables to predict the response variable, and
generates a hypothesis space within a tree, employing a preference bias to search for smaller,
efficient decision trees.
A decision tree, structured with a root, internal/decision nodes, branches, and leaf nodes,
represents classification rules derived from data. Internal nodes test attributes, branches reflect
outcomes, and leaf nodes indicate target classes. Each path from root to leaf forms a logical
rule, and the tree is a disjunction of these rules. Decision networks, an extension of Bayesian
belief networks, use directed graphs to model states, actions, outcomes, and utilities, with
specific symbols representing different nodes in the tree structure.
Goal: Construct a decision tree with the given training dataset. The tree is constructed in a top-down fashion, starting from the root node. At every level of tree construction, we need to find the best split attribute or best decision node among all attributes. This process is recursive and continues until we reach the last level of the tree or find a leaf node that cannot be split further. The tree construction is complete when all the test conditions lead to a leaf node. The leaf node contains the target class or output of classification.
Goal: Given a test instance, infer the target class it belongs to.
Classification: Inferring the target class for the test instance or object is based on inductive inference over the constructed decision tree. To classify an object, we start traversing the tree from the root, evaluate the test condition at every decision node against the test object's attribute value, and walk down the branch corresponding to the test's outcome. This process is repeated until we end up in a leaf node, which contains the target class of the test object.
Some of the issues that generally arise with a decision tree learning are that:
1. It is difficult to determine how deeply a decision tree can be grown or when to stop
growing it.
2. If training data has errors or missing attribute values, then the decision tree constructed
may become unstable or biased.
3. If the training data has continuous-valued attributes, handling them is computationally complex and they have to be discretized.
4. A complex decision tree may also be over-fitting with the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
6. Learning an optimal decision tree is also known to be NP-complete.
In decision tree construction, the best split feature is selected at each node to maximize
information gain for classifying test instances, with the process continuing until a stopping
criterion is met. This selection is based on information theory, specifically Shannon Entropy,
which quantifies the uncertainty or randomness of the data; a lower entropy indicates a more
homogeneous dataset and thus a better split. The best feature is the one that results in the largest
reduction in entropy after splitting the data, aiming to create pure or more homogeneous subsets
for effective classification.
The formula for calculating the entropy of a dataset D with n classes is given by:
Entropy(D) = − Σ pi log2(pi)  (summed over the n classes)
where:
• pi is the probability that a data instance in D belongs to class i, i.e., the proportion of instances of class i in D.
Example:
If we have 10 data instances, with 6 belonging to the positive class (say, 1) and 4 belonging to
the negative class (say, 0), then:
P(positive) = P(p1) = 6/10 = 0.6
P(negative) = P(p2) = 4/10 = 0.4
Entropy = −(0.6 log2 0.6 + 0.4 log2 0.4) ≈ 0.971
This value, close to 1, indicates a relatively high level of impurity, as the classes are not perfectly separated.
In essence, the decision tree algorithm greedily selects splits that maximize the information
gained (i.e., minimize the entropy of the resulting subsets), aiming to create branches that lead
to pure leaf nodes representing specific class labels. The stopping criterion, often based on the
entropy value being sufficiently low (approaching 0), ensures that the classification at the leaf
nodes is as certain as possible.
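A small sketch of the entropy and information-gain computations described above, using the 6-positive/4-negative example; the candidate split shown is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base-2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

labels = [1] * 6 + [0] * 4
print(round(entropy(labels), 3))        # ≈ 0.971, matching the example above

# Information gain of a hypothetical split into two subsets of 5 instances each
left, right = [1, 1, 1, 1, 0], [1, 1, 0, 0, 0]
gain = entropy(labels) - (len(left) / 10) * entropy(left) - (len(right) / 10) * entropy(right)
print(round(gain, 3))                   # reduction in entropy achieved by the split
```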
Stopping Criteria
The following are some of the common stopping conditions:
1. The data instances are homogenous which means all belong to the same class Ci, and
hence its entropy is 0.
2. A node with some defined minimum number of data instances becomes a leaf (typically when the number of data instances in the node falls between 0.25% and 1.00% of the full training dataset).
3. The maximum tree depth is reached, so further splitting is not done and the node
becomes a leaf node.
Various decision tree algorithms exist for classification in real-time environments, including
ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTree. Among these, ID3
(Iterative Dichotomizer 3), developed in 1986, and its advancement C4.5 (1993) are commonly
used. Another popular algorithm is CART (Classification and Regression Trees), introduced in
1984.
The accuracy of a decision tree is heavily influenced by the method used to select the best
attribute for splitting at each node. Different algorithms employ distinct measures for this
purpose. For instance, ID3 utilizes "Information Gain" as its splitting criterion, while C4.5 uses
"Gain Ratio." The CART algorithm, which can handle both categorical and continuous target
variables, employs the "GINI Index" to determine the optimal splits. Decision trees built with
ID3 and C4.5 are classified as univariate decision trees because they consider only a single
feature for splitting at each node. In contrast, CART can construct multivariate decision trees,
which consider a combination of univariate splits at each decision node.
ID3 is a supervised learning algorithm that constructs a univariate decision tree using a greedy,
top-down approach by selecting the best attribute at each node based on the "Information Gain"
purity measure to classify future test instances. ID3 works effectively when the attributes are
discrete or categorical and the training dataset is large with no missing attribute values, but it
requires continuous attributes to be discretized and is susceptible to overfitting on small
datasets and sensitive to outliers due to the lack of pruning.
1. Compute Entropy_Info(T) for the whole training dataset based on the target attribute.
2. Compute Entropy_Info(T, A) and Information_Gain(A) for each attribute A in the training dataset.
3. Choose the attribute for which entropy is minimum and therefore the gain is maximum
as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
subsets.
6. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
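The following sketch illustrates steps 1-3 of ID3 on a tiny, made-up categorical dataset: it computes Entropy_Info(T), the weighted entropy for each attribute, and picks the attribute with the maximum Information Gain. Attribute and function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def entropy_info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_attribute(rows, target):
    """rows: list of dicts; returns the attribute with the maximum Information Gain."""
    base = entropy_info([r[target] for r in rows])          # Entropy_Info(T)
    gains = {}
    for attr in rows[0]:
        if attr == target:
            continue
        partitions = defaultdict(list)                      # split T by the values of attr
        for r in rows:
            partitions[r[attr]].append(r[target])
        weighted = sum(len(p) / len(rows) * entropy_info(p) for p in partitions.values())
        gains[attr] = base - weighted                       # Information_Gain(attr)
    return max(gains, key=gains.get), gains

rows = [
    {'outlook': 'sunny', 'windy': 'no',  'play': 'no'},
    {'outlook': 'sunny', 'windy': 'yes', 'play': 'no'},
    {'outlook': 'rain',  'windy': 'no',  'play': 'yes'},
    {'outlook': 'rain',  'windy': 'yes', 'play': 'yes'},
]
print(best_split_attribute(rows, target='play'))   # 'outlook' gives the larger gain
```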
C4.5 is an enhancement of ID3 that addresses its limitations by handling both continuous and
discrete attributes, managing missing values by marking them as '?', and incorporating post-
pruning to avoid overfitting and build smaller, more efficient trees. Unlike ID3, which uses
Information Gain and can be biased towards attributes with more values (like a unique 'Register
No'), C4.5 employs Gain Ratio as its splitting criterion. Gain Ratio normalizes the Information
Gain by considering the "Split_Info" of an attribute, effectively reducing the bias towards
attributes with numerous values. The attribute with the highest Gain Ratio is then selected as
the best splitting attribute in C4.5.
The Split_Info of an attribute A is computed as:
Split_Info(T, A) = − Σ (|Ai| / |T|) log2(|Ai| / |T|)  (summed over the v distinct values of A)
where the attribute A has v distinct values {a1, a2, …, av} and |Ai| is the number of instances having the distinct value ai of attribute A. The Gain Ratio is then Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A).
1. Compute Entropy_Info(T) for the whole training dataset based on the target attribute.
2. Compute Entropy_Info(T, A), Info_Gain(A), Split_Info(T, A) and Gain_Ratio(A) for each attribute A in the training dataset.
3. Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
subsets.
6. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
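A small sketch of the Gain Ratio computation, assuming the information gain of an attribute has already been computed (for example with the ID3 sketch above); attribute names and values are illustrative.

```python
import math
from collections import Counter

def split_info(values):
    """Split_Info(T, A): entropy of the distribution of an attribute's values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(info_gain, attribute_values):
    """C4.5 criterion: Information Gain normalized by Split_Info."""
    si = split_info(attribute_values)
    return info_gain / si if si > 0 else 0.0

# An attribute with many distinct values (e.g. a register number) has a large
# Split_Info, so its Gain Ratio is scaled down even if its raw gain is high.
register_no = ['r1', 'r2', 'r3', 'r4']          # unique for every instance
windy = ['no', 'no', 'yes', 'yes']
print(round(gain_ratio(1.0, register_no), 2))   # 1.0 / log2(4) = 0.5
print(round(gain_ratio(0.31, windy), 2))        # 0.31 / 1.0 = 0.31
```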
CART (Classification and Regression Trees) is a versatile decision tree learning algorithm
capable of handling both categorical and continuous target variables, constructing either
classification or regression trees. Unlike ID3 and C4.5, CART generates binary trees by
recursively splitting nodes into two based on the attribute that yields the highest homogeneity
in the resulting child nodes, as measured by the GINI Index. For categorical attributes with
more than two values, CART considers all possible binary splits of these values to find the split
that maximizes the GINI Index improvement.
The lower the Gini Index value, the higher the homogeneity of the data instances. The Gini Index of a dataset T is computed as:
Gini_Index(T) = 1 − Σ Pi²  (summed over the classes in T)
where:
• Pi is the probability that a data instance in T belongs to class Ci. This probability is
calculated as the number of data instances belonging to class Ci divided by the total
number of data instances in T.
A higher GINI Index value indicates higher impurity or heterogeneity in the data. CART aims
to find splits that result in child nodes with lower GINI Index values (higher homogeneity).
The splitting subset with the minimum Gini_Index is chosen as the best splitting subset for an attribute. The best splitting attribute is the one with the minimum Gini_Index, or equivalently the maximum ΔGini, because it reduces the impurity the most.
1. Compute Gini_Index(T) Eq. for the whole training dataset based on the target attribute.
2. Compute Gini_Index(T,A) for each of the attribute and for the subsets of each attribute
in the training dataset.
3. Choose the best splitting subset which has minimum Gini_Index for an attribute.
4. Compute ΔGini(A) for the best splitting subset of that attribute.
ΔGini(A) = Gini(T) − Gini_Index(T,A)
5. Choose the best splitting attribute that has maximum ΔGini(A).
6. The best split attribute with the best split subset is placed as the root node.
7. The root node is branched into two subtrees with each subtree an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
two subsets.
8. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
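A minimal sketch of the Gini computations used by CART, on made-up labels: it evaluates Gini_Index(T), the weighted Gini of a candidate binary split, and ΔGini.

```python
from collections import Counter

def gini_index(labels):
    """Gini_Index(T) = 1 - sum(P_i^2) over the classes in T."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_binary_split(left, right):
    """Weighted Gini Index of a candidate binary split (CART splits into two)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)

parent = ['yes', 'yes', 'yes', 'no', 'no', 'no']
left, right = ['yes', 'yes', 'yes'], ['no', 'no', 'no']
delta_gini = gini_index(parent) - gini_of_binary_split(left, right)
print(round(delta_gini, 3))   # 0.5 - 0.0 = 0.5: a pure, maximally impurity-reducing split
```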
Regression trees are a variant of decision trees where the target feature is a continuous valued
variable. These trees can be constructed using an algorithm called reduction in variance which
uses standard deviation to choose the best splitting attribute.
1. Compute standard deviation for each attribute with respect to the target attribute.
2. Compute standard deviation for the number of data instances of each distinct value of
an attribute.
3. Compute weighted standard deviation for each attribute.
4. Compute standard deviation reduction by subtracting weighted standard deviation for
each attribute from standard deviation of each attribute.
5. Choose the attribute with the highest standard deviation reduction as the best split attribute.
6. The best split attribute is placed as the root node.
7. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
different subsets.
8. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
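A small sketch of the reduction-in-variance criterion (steps 1-4) on a toy dataset; the attribute and target names are illustrative, and the population standard deviation is used as an assumption.

```python
import statistics

def std_reduction(rows, attr, target):
    """Standard deviation of the target minus the weighted standard deviation
    of the target after splitting the rows on attr."""
    targets = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    weighted = sum(len(g) / len(rows) * statistics.pstdev(g) for g in groups.values())
    return statistics.pstdev(targets) - weighted

rows = [
    {'outlook': 'sunny', 'hours_played': 25.0},
    {'outlook': 'sunny', 'hours_played': 30.0},
    {'outlook': 'rain',  'hours_played': 46.0},
    {'outlook': 'rain',  'hours_played': 45.0},
]
print(round(std_reduction(rows, 'outlook', 'hours_played'), 2))  # large reduction -> good split
```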