Module 3
Chapter 4: Similarity Based Learning
4.2 Nearest-Neighbor Learning
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that classifies unlabeled data by finding the most similar labeled examples.
The value of k is critical in KNN, as it determines the number of neighbors considered when making a prediction and controls the bias-variance trade-off: a small k gives low bias but high variance (sensitive to noise), while a large k gives lower variance but higher bias (smoother, coarser decision boundaries).
Example
Imagine you’re deciding which fruit it is based on its shape and size. You compare it to fruits
you already know.
• If k = 3, the algorithm looks at the 3 closest fruits to the new one.
• If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple, because most of its neighbors are apples.
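A minimal sketch of this procedure in Python (the fruit data, feature values, and the helper name classify_knn are illustrative, not taken from the notes):

```python
import math
from collections import Counter

def classify_knn(query, labeled_points, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Euclidean distance from the query to every labeled example.
    distances = [(math.dist(query, features), label)
                 for features, label in labeled_points]
    # Keep the k closest examples and take the majority class among them.
    k_nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Illustrative fruit data: (size, roundness) -> fruit type
known_fruits = [((7, 9), "apple"), ((8, 8), "apple"), ((18, 3), "banana")]
print(classify_knn((9, 8), known_fruits, k=3))  # -> 'apple'
```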
Nearest Centroid Classifier Example: Classify the new point (5, 5), given Class A points (1, 2), (2, 3), (3, 3) and Class B points (6, 6), (7, 7), (8, 6).
Step 1: Calculate the Centroid of Each Class
Centroid of Class A:
We calculate the centroid for Class A using the average of all x and y values for the points in Class A.
The data points for Class A are:
• (1, 2)
• (2, 3)
• (3, 3)
The centroid of Class A (μA) is the average of the x-coordinates and y-coordinates:
μA = ((1 + 2 + 3)/3, (2 + 3 + 3)/3) = (6/3, 8/3) ≈ (2, 2.67)
Centroid of Class B:
We calculate the centroid for Class B using the average of all x and y values for the points in
Class B.
The data points for Class B are:
• (6, 6)
• (7, 7)
• (8, 6)
The centroid of Class B (μB) is:
μB = ((6 + 7 + 8)/3, (6 + 7 + 6)/3) = (21/3, 19/3) ≈ (7, 6.33)
Step 2: Calculate the Distance to Each Centroid
Now, we calculate the Euclidean distance from the new point (5, 5) to each of the centroids.
Distance to Class A Centroid (2,2.67):
The Euclidean distance is given by:
dA = sqrt((5 − 2)² + (5 − 2.67)²) = sqrt(9 + 5.43) = sqrt(14.43) ≈ 3.8
Distance to Class B Centroid (7,6.33):
The Euclidean distance is:
dB = sqrt((5 − 7)² + (5 − 6.33)²) = sqrt(4 + 1.77) = sqrt(5.77) ≈ 2.4
Step 3: Classify the New Point
• The distance to the Class A centroid is 3.8.
• The distance to the Class B centroid is 2.4.
Since the new point is closer to the Class B centroid, we classify the new point as Class B. The
new point (5, 5) is classified as Class B using the Nearest Centroid Classifier.
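The same computation can be written as a short Python sketch; the class labels and point coordinates follow the worked example above, while the function names centroid and nearest_centroid are illustrative:

```python
import math

def centroid(points):
    """Mean of the x-coordinates and mean of the y-coordinates."""
    xs, ys = zip(*points)
    return (sum(xs) / len(points), sum(ys) / len(points))

def nearest_centroid(query, classes):
    """Assign `query` to the class whose centroid is closest (Euclidean distance)."""
    return min(classes, key=lambda label: math.dist(query, centroid(classes[label])))

classes = {
    "A": [(1, 2), (2, 3), (3, 3)],
    "B": [(6, 6), (7, 7), (8, 6)],
}
print(nearest_centroid((5, 5), classes))  # -> 'B'
```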
Locally Weighted Regression (LWR) fits a separate, locally weighted linear model around each query point xq. A commonly used (Gaussian) kernel assigns each training point xi the weight
wi = exp(−(xi − xq)² / (2τ²))
• Here, wi is the weight for data point xi, and τ (tau, the bandwidth) controls the rate of weight decay.
• Bandwidth (τ): A critical parameter that governs how localized the regression is:
• Small τ: Focuses on nearby points, capturing finer details but risks overfitting.
• Large τ: Includes more distant points, reducing variance but increasing bias.
3. Weight Calculation: For a given query point xq, compute weights for all data points using the chosen kernel function. Points closer to xq will have higher weights.
4. Model Fitting: Using the computed weights, fit a weighted least squares regression to the data. The goal is to minimize the weighted sum of squared errors:
Σi wi (yi − (β0 + β1xi))²
5. Prediction: Once the localized model is fitted, use it to predict the target value for the query point xq.
LWR Algorithm:
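A compact sketch of locally weighted linear regression in Python, assuming the Gaussian kernel and weighted least-squares objective described above (the function name lwr_predict, the bandwidth value, and the sample data are illustrative):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query by fitting a weighted linear model around it."""
    # Gaussian kernel weights: points near x_query get weights close to 1.
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    # Design matrix with an intercept column.
    A = np.column_stack([np.ones_like(X), X])
    W = np.diag(w)
    # Weighted least squares: solve (A^T W A) beta = A^T W y.
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return beta[0] + beta[1] * x_query

# Illustrative noisy data around a sine curve.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 60)
y = np.sin(X) + 0.1 * rng.standard_normal(60)
print(lwr_predict(3.0, X, y, tau=0.5))
```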
Chapter 5:
Regression Analysis
5.1 Introduction to Regression
Regression analysis is a statistical technique used to model and analyse the relationships between variables, and it is one of the oldest supervised learning techniques.
It helps in understanding how the dependent variable (outcome) changes when one or more independent variables (predictors) are modified. Given a training dataset D containing N training points (xi, yi), where i = 1, …, N, regression models the relationship between one or more independent variables xi and a dependent variable yi. The relationship is represented by a function y = f(x).
This technique is widely used in various fields such as economics, finance, machine learning,
and social sciences to make predictions and infer relationships between variables.
Types of Regression
Simple Linear Regression: Involves one independent variable and models a linear
relationship (e.g., predicting sales based on advertising spend).
Multiple Linear Regression: Involves multiple independent variables (e.g., predicting
house prices based on location, size, and number of rooms).
Logistic Regression: Used when the dependent variable is categorical (e.g., predicting
whether a customer will buy a product: Yes/No).
Polynomial Regression: Models nonlinear relationships by adding polynomial terms (a short code sketch follows this list).
Ridge and Lasso Regression: Used for regularization to prevent overfitting.
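A brief sketch fitting two of the regression types listed above, simple linear and 2nd-degree polynomial, using NumPy's least-squares polyfit (the experience/salary data values are made up for illustration):

```python
import numpy as np

# Illustrative data: years of experience vs. salary (in thousands).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([30, 35, 42, 48, 60, 75, 95, 120])

# Simple linear regression: y = b0 + b1*x (least-squares fit).
b1, b0 = np.polyfit(x, y, deg=1)

# Polynomial regression (degree 2): y = b0 + b1*x + b2*x^2.
quad_coeffs = np.polyfit(x, y, deg=2)

print("linear fit: intercept", round(b0, 2), "slope", round(b1, 2))
print("quadratic coefficients (highest degree first):", quad_coeffs)
```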
5.2 Introduction to Linearity, Correlation, and Causation
Understanding the relationships between variables is crucial in data analysis and statistics.
Three fundamental concepts that help explain these relationships are linearity, correlation, and
causation. Each concept plays a unique role in interpreting data and making accurate
predictions.
1. Linearity
Linearity refers to a straight-line relationship between two variables. In a linear relationship,
changes in one variable result in proportional changes in another. This relationship can be
expressed using a linear equation:
Y=β0+β1X+ε
Where:
Y is the dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope (rate of change), and ε is the error term.
Examples of Linearity:
A company’s revenue increasing proportionally with its marketing budget.
2. Correlation
Correlation measures the strength and direction of a relationship between two variables. It is represented by the correlation coefficient (r), which ranges from −1 to 1:
r=1 → Perfect positive correlation (as one variable increases, the other increases).
r=−1 → Perfect negative correlation (as one variable increases, the other decreases).
r=0 → No correlation (no relationship between variables).
Examples of Correlation:
Height and weight often show a positive correlation.
Increase in gas prices and demand for electric cars may have a negative correlation.
However, correlation does not imply causation—just because two variables move together
does not mean one causes the other.
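As a small illustration of the correlation coefficient, the following Python sketch computes Pearson's r for made-up height/weight data:

```python
import numpy as np

# Illustrative paired observations: heights (cm) and weights (kg).
heights = np.array([150, 160, 165, 170, 180, 185])
weights = np.array([50, 56, 61, 66, 75, 82])

# Pearson r: covariance of the two variables divided by the product of their
# standard deviations; np.corrcoef returns the 2x2 correlation matrix.
r = np.corrcoef(heights, weights)[0, 1]
print(round(r, 3))  # close to +1: strong positive correlation
```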
3. Causation
Causation means that a change in one variable directly causes a change in another. Unlike
correlation, which only shows an association, causation requires experimental or observational
evidence.
1. Linear Regression
Purpose: Predict a continuous outcome based on one predictor.
Equation:
y=β0+β1x+ε
Use case: Straight-line relationship (e.g., predicting salary based on years of experience).
2. Multiple Linear Regression
Purpose: Predict a continuous outcome using multiple predictors.
Equation:
y=β0+β1x1+β2x2+…+βnxn+ε
Use case: Predicting house price based on area, number of rooms, location, etc.
3. Polynomial Regression
Purpose: Handle nonlinear relationships by adding polynomial terms.
Equation (2nd degree example):
y=β0+β1x+β2x²+ε
Use case: Modelling curves like the growth rate of a business over time.
4. Logistic Regression
Purpose: Classification (not traditional regression).
Output: Probability (between 0 and 1) that is mapped to classes (e.g., 0 or 1).
Equation (Sigmoid function):
p = 1 / (1 + e^−(β0+β1x))
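A minimal sketch of the sigmoid mapping used by logistic regression (the coefficient values b0 and b1 are illustrative, not fitted to real data):

```python
import math

def sigmoid(z):
    """Map any real value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative fitted coefficients for p(buy | advertising spend).
b0, b1 = -4.0, 0.8
spend = 6.0
p = sigmoid(b0 + b1 * spend)
print(round(p, 3), "-> class", 1 if p >= 0.5 else 0)
```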
Interpretation (of MSE): Larger errors are penalized more heavily because the errors are squared, and MSE is always non-negative.
3. Standard Error (explained in words)
• Refers to the standard deviation of residuals (errors between actual and predicted
values).
• Ideally, it should be small, meaning predictions are close to actual values.
• If residuals follow a normal distribution with mean zero, it's a good sign.
4. RMSE (Root Mean Squared Error)
Measures the average magnitude of prediction errors; in other words, how far off the model's predictions are from the actual values. It is especially useful because it penalizes larger errors more than smaller ones (due to squaring).
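A short Python sketch computing MSE and RMSE for a handful of predictions (the actual and predicted values are made up for illustration):

```python
import math

actual    = [3.0, 5.0, 7.5, 10.0, 12.0]
predicted = [2.8, 5.4, 7.0, 10.5, 11.6]

residuals = [a - p for a, p in zip(actual, predicted)]

# Mean squared error: average of squared residuals (always non-negative).
mse = sum(r ** 2 for r in residuals) / len(residuals)

# Root mean squared error: same units as the target variable.
rmse = math.sqrt(mse)

print(round(mse, 3), round(rmse, 3))
```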
Chapter 6
Decision Tree Learning
Decision Tree Learning is a widely used predictive model for supervised learning with a number of practical applications in various areas. It is used for both classification and regression tasks. The decision tree model represents logical rules that predict the value of a target variable by inferring from data features. This chapter provides insight into how to construct a decision tree and infer knowledge from the tree.
Learning Objectives
Understand the structure of the decision tree
Know about the fundamentals of Entropy
Learn and understand popular univariate Decision Tree Induction algorithms such as ID3 and C4.5, and the multivariate decision tree algorithm CART
Deal with continuous attributes using improved C4.5 algorithm
Construct Classification and Regression Tree (CART) for classifying both categorical
and continuous-valued target variables
Construct regression trees where the target feature is a continuous-valued variable
Understand the basics of validating and pruning of decision trees
The decision tree learning model, one of the most popular supervised predictive learning models, classifies data instances with high accuracy and consistency. The model performs inductive inference, that is, it reaches a general conclusion from observed examples, and it is widely used for solving complex classification applications.
A decision tree is a concept tree that summarizes the information contained in the training dataset in the form of a tree structure. Once the concept model is built, test data can be easily classified.
This model can be used to classify both categorical target variables and continuous-
valued target variables. Given a training dataset X, this model computes a hypothesis
function f(X) as a decision tree.
Inputs to the model are data instances or objects with a set of features or attributes which
can be discrete or continuous, and the output of the model is a decision tree which
predicts or classifies the target class for the test data object.
In statistical terms, attributes or features are called independent variables. The target feature or target class is called the response variable; it indicates the category we need to predict for a test object.
The decision tree learning model generates a complete hypothesis space in the form of
a tree structure with the given training dataset and allows us to search through the
possible set of hypotheses which in fact would be a smaller decision tree as we walk
through the tree. This kind of search bias is called preference bias.
A decision tree has a structure that consists of a root node, internal nodes/decision nodes,
branches, and terminal nodes/leaf nodes. The topmost node in the tree is the root node. Internal
nodes are the test nodes and are also called decision nodes. These nodes represent a choice
or test of an input attribute and the outcome or outputs of the test condition are the branches
emanating from this decision node. The branches are labelled as per the outcomes or output
values of the test condition. Each branch represents a sub-tree or subsection of the entire tree.
Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the
outcome of a decision path. The labels of the leaf nodes are the different target classes a data
instance can belong to.
Every path from root to a leaf node represents a logical rule that corresponds to a conjunction
of test attributes and the whole tree represents a disjunction of these conjunctions. The decision
tree model, in general, represents a collection of logical rules of classification in the form of a
tree structure.
Decision networks, otherwise called influence diagrams, have a directed graph structure with
nodes and links. It is an extension of Bayesian belief networks that represents information about
each node’s current state, its possible actions, the possible outcome of those actions, and their
utility. The concept of Bayesian Belief Network (BBN) is discussed in Chapter 9.
Figure 6.1 shows symbols that are used in this book to represent different nodes in the
construction of a decision tree. A circle is used to represent a root node, a diamond symbol is
used to represent a decision node or the internal nodes, and all leaf nodes are represented with
a rectangle.
Goal: Construct a decision tree with the given training dataset. The tree is constructed in a top-down fashion, starting from the root node. At every level of tree construction, we need to find the best split attribute or best decision node among all attributes. This process is recursive and continues until we reach the last level of the tree or find a leaf node which cannot be split further. The tree construction is complete when all the test conditions lead to a leaf node.
Goal: Given a test instance, infer the target class it belongs to.
Classification: Inferring the target class for the test instance or object is based on inductive
inference on the constructed decision tree. In order to classify an object, we need to start
traversing the tree from the root. We traverse as we evaluate the test condition on every decision
node with the test object attribute value and walk to the branch corresponding to the test’s
outcome. This process is repeated until we end up in a leaf node which contains the target class
of the test object.
Output: Target label of the test instance.
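A minimal Python sketch of this inference step, assuming the tree is stored as nested dictionaries keyed by attribute name and branch value (the tree and instance below are illustrative, loosely following the student-performance example):

```python
def classify(tree, instance):
    """Walk from the root, following the branch that matches the instance's
    attribute value at each decision node, until a leaf is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))   # attribute tested at this decision node
        value = instance[attribute]    # the instance's value for that attribute
        tree = tree[attribute][value]  # follow the matching branch
    return tree                        # leaf node: the predicted class label

# Illustrative tree: test Assessment first, then Assignment.
decision_tree = {
    "Assessment": {
        "Good": "Pass",
        "Average": {"Assignment": {"Yes": "Pass", "No": "Fail"}},
        "Poor": "Fail",
    }
}
print(classify(decision_tree, {"Assessment": "Average", "Assignment": "No"}))  # -> Fail
```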
Advantages of Decision Trees
1. Easy to model and interpret
2. Simple to understand
3. The input and output attributes can be discrete or continuous predictor variables
4. Can model a high degree of nonlinearity in the relationship between the target variables
and the predictor variables
5. Quick to train
Disadvantages of Decision Trees
Some of the issues that generally arise with decision tree learning are:
1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing
it.
2. If training data has errors or missing attribute values, then the decision tree constructed may
become unstable or biased.
3. If the training data has continuous-valued attributes, handling them is computationally complex and they have to be discretized.
4. A complex decision tree may also be over-fitting with the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
6. Learning an optimal decision tree is also known to be NP-complete.
How do we draw a decision tree to predict a student's academic performance based on given information such as class attendance, class assignments, homework assignments, tests, participation in competitions or other events, and group activities such as projects and presentations?
Solution:
The target feature is the student performance in the final examination—whether he will pass or
fail in the examination. The decision nodes are test nodes which check for conditions like:
“What’s the student’s class attendance?”
“How did he perform in his class assignments?”
The leaf nodes represent the outcomes, that is, either 'pass' or 'fail'.
A decision tree would be constructed by following a set of if-else conditions which may or may not include all the attributes, and the decision nodes can have two or more outcomes. Hence, the tree is not necessarily a binary tree.
Note: A decision tree is not always a binary tree. It is a tree which can have more than two branches.
Example 6.2:
Predict a student's academic performance, that is, whether he will pass or fail, based on the given information 'Assessment' and 'Assignment'. The following Table 6.2 shows the independent variables, Assessment and Assignment, and the target variable Exam Result with their values. Draw a binary decision tree.
Table 6.2: Attributes and Associated Values
Interpretation:
If the dataset is homogeneous (all instances belong to one class), entropy = 0
If a two-class dataset is split evenly between the classes, entropy = 1
For a two-class dataset, the value of entropy lies between 0 and 1
A lower entropy means a better (purer) split (see the worked example below)
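As a quick worked example (the class counts are illustrative): a two-class dataset with 4 'Pass' and 4 'Fail' instances has entropy = −(4/8) log2(4/8) − (4/8) log2(4/8) = 0.5 + 0.5 = 1 (maximally impure), whereas a dataset with 8 'Pass' instances and no 'Fail' instances has entropy = −(8/8) log2(8/8) = 0 (perfectly pure).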
Univariate trees (like ID3, C4.5): Split based on one attribute at a time.
Multivariate trees (like CART): Split based on a combination of attributes.
1. Compute Entropy_Info for the entire dataset (based on the target attribute).
2. Compute Entropy_Info and Information Gain for each attribute.
3. Choose the attribute with the lowest entropy / highest gain as the best split.
4. Place it as the root node.
5. Branch the dataset into subsets based on the values of the root attribute.
6. Repeat recursively for each subset until:
o A leaf node is formed.
o No instances remain.
o Entropy is 0.
Note: Stop branching when entropy = 0. At each step, choose the attribute with the highest Information Gain.
Definitions
Entropy_Info(T) = −Σi (dci / T) log2(dci / T)
Where: dci is the number of instances of class Ci, and T is the total number of instances in the dataset.
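A short Python sketch of the Entropy_Info and Information Gain computations defined above (the training rows, with attributes Assessment and Assignment and target Exam Result, are illustrative and not the actual Table 6.2 data):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy_Info: -sum over classes of (count/total) * log2(count/total)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the whole dataset minus the weighted entropy after splitting."""
    total = len(rows)
    base = entropy([r[target] for r in rows])
    weighted = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        weighted += (len(subset) / total) * entropy(subset)
    return base - weighted

# Illustrative training rows.
rows = [
    {"Assessment": "Good",    "Assignment": "Yes", "Exam Result": "Pass"},
    {"Assessment": "Good",    "Assignment": "No",  "Exam Result": "Pass"},
    {"Assessment": "Average", "Assignment": "Yes", "Exam Result": "Pass"},
    {"Assessment": "Average", "Assignment": "No",  "Exam Result": "Fail"},
    {"Assessment": "Poor",    "Assignment": "No",  "Exam Result": "Fail"},
]
for attr in ("Assessment", "Assignment"):
    print(attr, round(information_gain(rows, attr, "Exam Result"), 3))
```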