Unit 3 (MLT)

The Decision Tree Learning Algorithm is a supervised machine learning technique primarily used for classification, structured as a tree with nodes representing decisions and outcomes. It uses concepts from information theory, such as entropy and information gain, to determine the best features for splitting data, while also addressing challenges like overfitting and handling continuous attributes. The ID3 algorithm is a foundational method for creating decision trees, emphasizing simplicity and effectiveness for categorical data but facing limitations such as overfitting and poor handling of numeric data.

DECISION TREE LEARNING ALGORITHM

 It is a supervised machine learning algorithm used for both classification and
regression problems, but primarily for classification models.
 It is a tree-structured classifier consisting of root nodes, leaf nodes, branches/sub-trees,
parent nodes, and child nodes.
 Together, these help predict the output.
Decision Tree Terminologies
 Root Node: The initial node at the beginning of a decision tree, where
the entire population or dataset starts dividing based on various features
or conditions.
 Decision Nodes: Nodes resulting from the splitting of root nodes are
known as decision nodes. These nodes represent intermediate decisions
or conditions within the tree.
 Leaf Nodes: Nodes where further splitting is not possible, often
indicating the final classification or outcome. Leaf nodes are also
referred to as terminal nodes.
 Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a
sub-section of the decision tree is referred to as a sub-tree. It represents a
specific portion of the decision tree.
 Pruning: The process of removing or cutting down specific nodes in a
tree to prevent overfitting and simplify the model.
 Branch / Sub-Tree: A subsection of the entire tree is referred to as a branch
or sub-tree. It represents a specific path of decisions and outcomes
within the tree.
 Parent and Child Node: In a decision tree, a node that is divided into
sub-nodes is known as a parent node, and the sub-nodes emerging from
it are referred to as child nodes. The parent node represents a decision
or condition, while the child nodes represent the potential outcomes or
further decisions based on that condition.

Examples of Decision Tree


Example 1:
Example 2: Decision tree that will help a person to decide whether to accept
a job offer or not.

Fig: Decision Tree Terminology


Working of Decision Tree algorithm

 Starting at the Root: The algorithm begins at the top, called the “root
node,” representing the entire dataset.
 Asking the Best Questions: It looks for the most important feature or
question that splits the data into the most distinct groups. This is like
asking a question at a fork in the tree.
 Branching Out: Based on the answer to that question, it divides the data
into smaller subsets, creating new branches. Each branch represents a
possible route through the tree.
 Repeating the Process: The algorithm continues asking questions and
splitting the data at each branch until it reaches the final “leaf nodes,”
representing the predicted outcomes or classifications.
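To make these steps concrete, here is a minimal sketch using scikit-learn (assumed to be installed); the integer feature encoding and the toy dataset are illustrative and not part of the notes:

```python
# Minimal sketch (assumes scikit-learn): a tiny decision tree classifier.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [outlook, windy] encoded as integers; target: play tennis (0 = No, 1 = Yes).
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = [0, 0, 1, 1, 0, 1]

# criterion="entropy" makes the splits follow information gain, as described above.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["outlook", "windy"]))  # textual view of the learned tree
print(tree.predict([[0, 0]]))                                 # classify a new example
```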

Advantages of Decision Trees


 Easy to Understand: They are simple to visualize and interpret, making
them easy to understand even for non-experts.
 Handles Both Numerical and Categorical Data: They can work with both
types of data without needing much preprocessing.
 No Need for Data Scaling: These trees do not require normalization or
scaling of data.
 Automated Feature Selection: They automatically identify the most
important features for decision-making.
 Handles Non-Linear Relationships: They can capture non-linear patterns
in the data effectively.

Disadvantages of Decision Trees


 Overfitting Risk: It can easily overfit the training data, especially if they
are too deep.
 Unstable with Small Changes: Small changes in data can lead to
completely different trees.
 Biased with Imbalanced Data: They tend to be biased if one class
dominates the dataset.
 Limited to Axis-Parallel Splits: They struggle with diagonal or complex
decision boundaries.
 Can Become Complex: Large trees can become hard to interpret and
may lose their simplicity.
Information Theory [Attribute Selection Measures (ASM)]

 Information theory in the context of decision trees is primarily used to help the
algorithm decide which feature to split on at each step. It provides a way to
quantify uncertainty or impurity in data, and helps choose splits that maximize
the "information gain"—i.e., reduce uncertainty the most.
 Attribute selection measures (ASM) are also called splitting rules because they
decide how the tuples at a given node are to be divided.
 An attribute selection measure provides a ranking for every attribute describing the
given training tuples. The attribute with the best score for the measure is
selected as the splitting attribute for the given tuples.
 There are three famous attribute selection measures including Information Gain,
Gain Ratio, and Gini Index.

Entropy
 Entropy is a measure of disorder or impurity in the given dataset.
 It is the negative sum, over all labels, of the probability of each label times the log
probability of that same label. It is the average rate at which a stochastic data source
produces information, or equivalently the uncertainty associated with a random variable.
 Mathematically, entropy is calculated as:

Entropy(S) = - Σ (i = 1 to c) pi * log2(pi)

Where, c represents the number of classes
pi represents the probability of the ith class
For example: Suppose we have 2 classes, i.e. yes and no. The entropy formula becomes:

Entropy(S) = - P(yes) * log2 P(yes) - P(no) * log2 P(no)

where S = the set of all samples
P(yes) = probability of Yes
P(no) = probability of No
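As an illustration (not part of the original notes), a small Python sketch of this computation; the function name entropy is our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# Two-class example: 9 "yes" and 5 "no" samples -> about 0.940 bits
# (the value used later in the ID3 numerical).
print(entropy(["yes"] * 9 + ["no"] * 5))
```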

Information Gain
 Information gain is the amount of information gained about a random variable or
signal from observing another random variable. In decision trees, it is the reduction in
entropy achieved by splitting the dataset S on an attribute A:

Gain(S, A) = Entropy(S) - Σ (over values v of A) (|Sv| / |S|) * Entropy(Sv)

 It is biased toward attributes with many distinct values, which produce many small partitions.
Gain Ratio
 The information gain measure is biased toward tests with many outcomes.
 That is, it tends to select attributes having a large number of values. For instance, consider
an attribute that acts as a unique identifier, such as product_ID.
 A split on product_ID results in a huge number of partitions, each one
containing only one tuple. Because each partition is pure, the information needed
to classify data set D based on this partitioning would be Info_product_ID(D) = 0, giving
a maximal (and useless) information gain.
 The gain ratio corrects this bias by dividing the information gain by the split
information of the attribute: GainRatio(A) = Gain(A) / SplitInfo(A).

Gini Index
 It is calculated by subtracting the sum of the squared probabilities of each class
from one.
 The Gini index can be used in CART.
 The Gini index calculates the impurity of D, a data partition or collection of
training tuples, as:

Gini(D) = 1 - Σ (i = 1 to m) pi^2

where pi is the probability that a tuple in D belongs to class Ci.


 It favours larger partitions.
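A companion sketch (ours, under the same assumptions as the entropy example) for the Gini index:

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i^2)."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

# Same 9 "yes" / 5 "no" split as before: 1 - (9/14)^2 - (5/14)^2 is about 0.459.
print(gini(["yes"] * 9 + ["no"] * 5))
```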

Inductive Bias
“Inductive bias is the set of assumptions or preferences that a learning
algorithm uses to make predictions beyond the data it has been trained on.”
Without inductive bias, machine learning algorithms would be unable to
generalize from training data to unseen situations, as the possible hypotheses or
models could be infinite.
Types of Inductive Bias
There are two main types of inductive bias in machine learning: restrictive
bias and preferential bias.
1. Restrictive Bias
 Restrictive bias refers to the assumptions that limit the set of functions that the
algorithm can learn.
 For example, a linear regression model assumes that the relationship between the
input variables and the output variable is linear. This means that the model can
only learn linear functions, and any non-linear relationships between the variables
will not be captured.
 Another example of restrictive bias is the decision tree algorithm, which assumes
that the relationship between the input variables and the output variable can be
represented by a tree-like structure. This means that the algorithm can only learn
functions that can be represented by a decision tree.
2. Preferential Bias
 Preferential bias refers to the assumptions that make some functions more likely
to be learned than others.
 For example, a neural network with a large number of hidden layers and
parameters has a preferential bias towards complex, non-linear functions. This
means that the algorithm is more likely to learn complex functions than simple
ones.
 Another example of preferential bias is the k-nearest neighbors algorithm, which
assumes that similar inputs have similar outputs. This means that the algorithm is
more likely to predict the same output for inputs that are close together in
feature space.

Importance of Inductive Bias


 Generalization: Inductive bias helps models generalize from the training data to
new, unseen data. Without it, the model would have to start from scratch every
time it encountered new data, making it much less efficient and accurate.
 Preventing Overfitting: A model with no bias or assumptions might fit the
training data perfectly, capturing every minute detail, including noise. This is
known as overfitting. An inductive bias can prevent a model from overfitting by
making it favor simpler hypotheses.
 Inductive Bias in Decision Trees:
 Preference for Shorter Trees: Decision tree algorithms, like ID3, have a bias
towards shorter trees with fewer splits. This is because shorter trees are
generally easier to interpret and less prone to overfitting.
 Preference for High Information Gain Attributes: The algorithm tends to choose
attributes that provide the highest information gain when splitting the data. This
means attributes that are most predictive of the target variable are favored.

Inductive Bias in Decision Trees (Decision Tree Assumptions)


(A preference for shorter trees with fewer splits)
Here are some common assumptions and considerations when creating
decision trees:
1. Binary Splits: Decision trees typically make binary splits, meaning each node
divides the data into two subsets based on a single feature or condition. This
assumes that each decision can be represented as a binary choice.
2. Recursive Partitioning: Decision trees use a recursive partitioning process,
where each node is divided into child nodes, and this process continues until a
stopping criterion is met. This assumes that data can be effectively subdivided
into smaller, more manageable subsets.
3. Feature Independence: These trees often assume that the features used for
splitting nodes are independent. In practice, feature independence may not
hold, but it can still perform well if features are correlated.
4. Homogeneity: It aims to create homogeneous subgroups in each node,
meaning that the samples within a node are as similar as possible regarding the
target variable. This assumption helps in achieving clear decision boundaries.
5. Top-Down Greedy Approach: They are constructed using a top-down, greedy
approach, where each split is chosen to maximize information gain or minimize
impurity at the current node. This may not always result in the globally optimal
tree.

Inductive Inference in a Decision Tree


 In a decision tree, inductive inference is the process of: “Generalizing patterns
from training data to make predictions on new, unseen data.”
 In simpler terms: the model looks at specific examples (data points), figures out
rules or patterns, and then infers how to make decisions for new examples it
hasn’t seen before.
 Decision trees use information gain, Gini impurity, or entropy to decide the best
attributes to split on.
 Each split is a kind of inductive inference: “This feature seems to separate the
classes well, so it’s probably meaningful.”
 Inductive inference in decision trees = generalizing decision rules from data.
 It’s how the tree learns to make predictions based on patterns in the training set.
 Example: Let’s say your training data has examples of fruits:
Color Size Shape Fruit
Red Small Round Cherry
Yellow Medium Long Banana
Green Small Round Grape
A decision tree will analyze this and infer something like:
 If Shape == Round and Size == Small, then it's probably a Cherry or Grape.
 If Shape == Long and Color == Yellow, then it’s probably a Banana.

Those rules are inductive inferences—they were inferred from training data and
are used to predict future inputs.

ID3 algorithm (Iterative Dichotomiser 3)


 The ID3 algorithm (Iterative Dichotomiser 3) is a classic algorithm used to create
decision trees in machine learning.
 It was developed by Ross Quinlan and is one of the foundational algorithms for
tree-based models.
 ID3 builds a decision tree by choosing the best attribute to split the data at each
step.
 It does this using a measure called information gain, which is based on entropy
(a concept from information theory).

Step by Step Working of ID3


1. Start with the full training data.
2. Calculate the entropy of the current dataset (how mixed the classes are).
3. For each feature/attribute, calculate the information gain of splitting the
dataset by that attribute.
4. Choose the attribute with the highest information gain to make a decision
node.
5. Split the dataset based on that attribute’s values.
6. Repeat the process recursively on each subset, until:
o All examples in a node belong to the same class.
o No more features remain.
o The dataset is empty.

Example: Say you're building a tree to decide if someone will play tennis based on
weather:
Outlook Temperature PlayTennis
Sunny Hot No
Overcast Mild Yes
Rain Cool Yes

 ID3 will calculate the entropy of the target (PlayTennis)


 Then calculate information gain for each feature (Outlook, Temperature,
etc.)
 It will choose the one with the highest gain to split on first.
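The following is a compact Python sketch of ID3 for categorical attributes, written for illustration (the function and variable names are ours); it mirrors steps 1-6 above and is run on the small weather table just shown:

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the target column over a list of dict-rows."""
    n = len(rows)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def info_gain(rows, attr, target):
    """Reduction in entropy obtained by splitting the rows on attribute attr."""
    n = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        sub = [r for r in rows if r[attr] == v]
        remainder += len(sub) / n * entropy(sub, target)
    return entropy(rows, target) - remainder

def id3(rows, attrs, target):
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:                 # all examples share one class -> leaf
        return classes[0]
    if not attrs:                              # no attributes left -> majority class
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in set(r[best] for r in rows):   # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best], target)
    return tree

# The three rows from the table above.
data = [
    {"Outlook": "Sunny",    "Temperature": "Hot",  "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temperature": "Mild", "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Cool", "PlayTennis": "Yes"},
]
print(id3(data, ["Outlook", "Temperature"], "PlayTennis"))
```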

Pros of ID3:
 Simple and intuitive.
 Works well for categorical data.
 Good for small to medium datasets.

Limitations of ID3:
 Can overfit the training data.
 Doesn’t handle numeric data well without preprocessing.
 No pruning (in the original version).

Issues in Decision tree learning


1. Overfitting the Data
 Problem: Decision trees can become overly complex, modeling noise or
random fluctuations instead of the true underlying patterns. This usually
happens when the tree grows very deep with many small branches.
 Effect: Poor generalization to unseen data — the model performs well on
the training set but poorly on new, test data.
 Solutions:
o Pruning: Remove sections of the tree that provide little classification
power (e.g., post-pruning after the full tree is grown).
o Early Stopping: Stop splitting once the data at a node becomes
sufficiently pure, or if further splits do not significantly improve
information gain.
o Minimum Split Size: Require a minimum number of instances in a
node before it can be split.
2. Guarding Against Bad Attribute Choices
 Problem: Greedy algorithms like ID3, C4.5, and CART select the "best"
attribute at each step, but a poor choice early in the tree can badly affect
the entire model.
 Solutions:
o Use robust attribute selection measures:
 Information Gain (ID3) can sometimes favor attributes with
many values.
 Gain Ratio (C4.5) adjusts Information Gain to favor balanced
splits.
 Gini Index (CART) measures impurity to make better choices.
o Randomization: Some methods (e.g., Random Forests) introduce
randomness into attribute selection to avoid consistently bad splits.
o Cross-validation: Use validation data to test if splits are leading to
better generalization.
3. Handling Continuous Valued Attributes
 Problem: Standard decision tree algorithms were initially designed for
discrete-valued attributes. Continuous attributes (like temperature, age)
need special handling.
 Solutions:
o Threshold-based Splitting: Find a threshold (e.g., "Temperature >
75°F") that best separates the data based on information gain or Gini
Index.
o Binary splits: A continuous attribute can be split into two branches
(above or below the threshold).
o Sorting: The algorithm typically sorts the values and evaluates
candidate splits between values to find the best threshold.
o Modern algorithms like C4.5 handle continuous attributes natively.
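As a sketch of the threshold-based approach for a continuous attribute (our own illustration; the temperature values are hypothetical), sort the values and evaluate candidate thresholds at the midpoints between consecutive distinct values, keeping the one with the highest information gain:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split value <= t vs. value > t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # no candidate boundary between identical values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical temperatures with PlayTennis labels.
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
play = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"]
print(best_threshold(temps, play))
```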
4. Handling Missing Attribute Values
 Problem: Real-world data often has missing attribute values, and decision
trees must handle this gracefully during both training and prediction.
 Solutions:
o During Training:
 Ignore instances with missing values when choosing the best
attribute.
 Distribute the instance probabilistically among child nodes
based on observed proportions.
o During Prediction:
 Follow the most likely branch according to training data
distributions.
 Use surrogate splits (backup attributes that approximate the
primary split) to decide direction when the primary attribute is
missing.
 Algorithms like C4.5 use probabilistic approaches to handle missing data
both during tree construction and classification.
5. Handling Attributes with Differing Costs
 Problem: Some attributes might be expensive to measure or compute (e.g.,
a medical test), so a model should balance predictive accuracy with
attribute cost.
 Solutions:
o Cost-sensitive learning: Modify the attribute selection criterion to
penalize expensive attributes, incorporating both predictive power
and cost.
o Cost-weighted splitting: Adjust the gain calculation by dividing the
gain by the attribute’s cost.
o Post-processing pruning: After building a tree, prune it considering
not just accuracy but also cost to simplify it.
o Some algorithms (like Cost-sensitive Decision Trees) explicitly model
and optimize for a trade-off between accuracy and cost.

6. Sensitivity to Training Data: Small changes in the training data can lead to
significantly different decision trees, making the model unstable and
unreliable.
7. Difficulty with Imbalanced Datasets: If one class is significantly more
prevalent than others, the decision tree may become biased towards the
majority class, leading to poor performance on the minority class.
8. Computational Efficiency: Decision tree algorithms can be computationally
expensive, especially for large datasets.
9. Interpretability: While decision trees are relatively easy to interpret, very
complex trees can be difficult to understand.
10. Choosing Attribute Selection Measures: The choice of attribute selection
measure (e.g., information gain, Gini index) can affect the performance of the
decision tree.
11. NP-completeness: Finding the optimal decision tree is an NP-complete
problem, meaning that finding the absolute best solution is computationally
intractable.
Numerical on ID3
Question: Decision rules are to be found based on the entropy and information gain
of the features. The following table shows the decision-making factors for playing
tennis outside over the previous 14 days.

Day  Outlook  Temp.  Humidity  Wind  PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Solution using ID3 Algorithm
Gain(S, Temp) = 0.94 - (4/14)*1.0 - (6/14)*0.9183 - (4/14)*0.8113 = 0.0289
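Since the notes only show the gain for Temp, here is a sketch (ours) that recomputes the information gain of every attribute from the 14-day table above; the values in the final comment are approximate:

```python
import math
from collections import Counter

# The 14-day PlayTennis data from the question: (Outlook, Temp, Humidity, Wind, PlayTennis).
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        remainder += len(sub) / len(rows) * entropy(sub)
    return entropy(labels) - remainder

for name, col in ATTRS.items():
    print(name, round(gain(rows, col), 4))
# Approximate output: Outlook 0.2467, Temp 0.0289, Humidity 0.1518, Wind 0.0481,
# so ID3 picks Outlook as the root attribute.
```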
Example 2: Make a decision tree from the training data below using the ID3 algorithm.
Instance Based Learning

Machine learning approaches can be divided into Model-Based Learning (Eager Learning)
and Instance-Based Learning (Lazy Learning).

Model Based Learning (Eager Learning)


 Model based learning involves creating a mathematical model that can
predict outcomes based on input data.
 The model is trained on a large dataset and then used to make predictions on
new data.
 The model can be thought of as a set of rules that the machine uses to
make predictions.
 The model is typically created using statistical algorithms such as linear
regression, logistic regression, decision trees and neural networks.
 Parameterized: the model learns using a predefined, parameterized mapping function.

Instance Based Learning (Lazy Learning)


 Also known as Memory based Learning or Lazy Learning.
 Instance-based learning operates by directly comparing new data points to
stored instances without constructing a general model during training.
Instead, it retains the entire training dataset and evaluates each new
instance against the stored data to predict outcomes.
 Instance-based learning’s versatility makes it suitable for a variety of
machine learning tasks, especially where interpretability and adaptability
are essential.
Key Characteristics of Instance Based Learning
 Lazy learning approach: Training involves minimal computation, as the
algorithm defers processing until predictions are required.
 Similarity based: Predictions are made based on the similarity between the
new data instance and stored training instances.
 Distance-based comparisons: Metrics like Euclidean or Manhattan
distance are commonly used to measure the similarity between data
points.
 Local Decision Making: Decisions are made based on the local
neighborhood of the input instance. The algorithm considers only a subset
of the training data (e.g. K-nearest neighbours) to make a prediction.

Advantages
 Simplicity: Instance-based learning is straightforward to implement and
does not require complex mathematical modeling.
 Adaptability: It can quickly adapt to new data without retraining, making it
ideal for dynamic datasets.
Disadvantages:
 Storage Requirements: Storing the entire training dataset can require
significant memory, especially for large datasets.
 Computational Cost: The prediction process can be computationally
expensive, especially when dealing with large datasets, as it involves
comparing the new instance to all stored instances.

Use Cases
 Classification tasks: Algorithms like K-Nearest Neighbors (KNN) effectively
classify data points based on similarity metrics.
 Regression tasks: Methods such as locally weighted regression use
instance-based learning for making predictions in continuous spaces.
Challenges and Limitations
 High memory usage: Storing the entire dataset requires significant
memory, particularly for large datasets.
 Computational expense: Predictions are slow for large datasets since
comparisons must be made with all stored instances.
 Sensitivity to irrelevant or noisy features: These can distort similarity
measurements, leading to reduced prediction accuracy.

k-Nearest Neighbour Learning


 K-Nearest Neighbour is one of the simplest Machine Learning algorithms based
on Supervised Learning technique.
 The K-NN algorithm assumes similarity between the new case/data and the
available cases and puts the new case into the category that is most similar to
the available categories.
 The K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears, it can be
easily classified into a well-suited category using the K-NN algorithm.
 K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
 It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
 At the training phase, the KNN algorithm just stores the dataset; when it gets
new data, it classifies that data into the category that is most similar to
the new data.
 Example: Suppose we have an image of a creature that looks similar to both a cat
and a dog, and we want to know whether it is a cat or a dog. For this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new image that are similar to
the cat and dog images, and based on the most similar features it will put the image in
either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a
new data point x1. In which of these categories will this data point lie? To solve
this type of problem, we need a K-NN algorithm. With the help of K-NN, we can
easily identify the category or class of a particular data point. Consider the below
diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of the data points in
each category.
o Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have
already studied in geometry. It can be calculated as:

d = [(x2 - x1)^2 + (y2 - y1)^2]^(1/2)
o By calculating the Euclidean distance, we get the nearest neighbors:
three nearest neighbors in category A and two nearest neighbors in
category B. Consider the below image:

o As we can see the 3 nearest neighbors are from category A, hence this new
data point must belong to category A.
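The same procedure can be written as a short from-scratch Python sketch (our own illustration; math.dist requires Python 3.8+ and the sample points are hypothetical):

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (feature_vector, label) pairs; query: feature vector; k: number of neighbours."""
    # Steps 2-3: compute Euclidean distances and keep the k nearest training points.
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    # Steps 4-5: count the labels of the k neighbours and return the majority category.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-D points from two categories A and B.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"), ((6, 6), "B"), ((7, 7), "B")]
print(knn_predict(train, (2, 1), k=3))   # -> "A"
```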
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need
to try some values to find the best out of them. The most preferred value
for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model
sensitive to outliers, leading to inaccurate predictions.
o Large values for K are more stable, but they can smooth over class boundaries
(underfitting) and increase computation.
o Prefer an odd value of K (for binary classification) so that majority votes cannot tie.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective when the training data is large.
o No training is required before classification.
Disadvantages of KNN Algorithm:
o We always need to determine the value of K, which can be complex at
times.
o The computation cost is high because of calculating the distance between
the data points for all the training samples.
o A lot of memory is required for processing large data sets.

Numerical Problems on K-Nearest Neighbour

Numerical Problem: 1

Sepal Length Sepal Width Species


5.3 3.7 Setosa
5.1 3.8 Setosa
7.2 3.0 Virginica
5.4 3.4 Setosa
5.1 3.3 Setosa
5.4 3.9 Setosa
7.4 2.8 Virginica
6.1 2.8 Verscicolor
7.3 2.9 Virginica
6.0 2.7 Verscicolor
5.8 2.8 Virginica
6.3 2.3 Verscicolor
5.1 2.5 Verscicolor
6.3 2.5 Verscicolor
5.5 2.4 Verscicolor
5.2 3.1 ?
Step 1: Find Distance
Distance = [(x - a)^2 + (y - b)^2]^(1/2)
Where, x = 5.2 & y = 3.1

Sepal Length  Sepal Width  Species  Calculation for Distance  Distance
5.3 3.7 Setosa [(5.2-5.3) 2 + (3.1-3.7) 2] ½ 0.608
5.1 3.8 Setosa [(5.2-5.1) 2 + (3.1-3.8) 2] ½ 0.707
7.2 3.0 Virginica [(5.2-7.2) 2 + (3.1-3.0) 2] ½ 2.002
5.4 3.4 Setosa [(5.2-5.4) 2 + (3.1-3.4) 2] ½ 0.36
5.1 3.3 Setosa [(5.2-5.1) 2 + (3.1-3.3) 2] ½ 0.22
5.4 3.9 Setosa [(5.2-5.4) 2 + (3.1-3.9) 2] ½ 0.82
7.4 2.8 Virginica [(5.2-7.4) 2 + (3.1-2.8) 2] ½ 2.22
6.1 2.8 Verscicolor [(5.2-6.1) 2 + (3.1-2.8) 2] ½ 0.94
7.3 2.9 Virginica [(5.2-7.3) 2 + (3.1-2.9) 2] ½ 2.1
6.0 2.7 Verscicolor [(5.2-6.0) 2 + (3.1-2.7) 2] ½ 0.89
5.8 2.8 Virginica [(5.2-5.8) 2 + (3.1-2.8) 2] ½ 0.67
6.3 2.3 Verscicolor [(5.2-6.3) 2 + (3.1-2.3) 2] ½ 1.36
5.1 2.5 Verscicolor [(5.2-5.1) 2 + (3.1-2.5) 2] ½ 0.60
6.3 2.5 Verscicolor [(5.2-6.3) 2 + (3.1-2.5) 2] ½ 1.25
5.5 2.4 Verscicolor [(5.2-5.5) 2 + (3.1-2.4) 2] ½ 0.75

Step 2: Find Rank

Sepal Length  Sepal Width  Species  Calculation for Distance  Distance  Rank
5.3 3.7 Setosa [(5.2-5.3) 2 + (3.1-3.7) 2] ½ 0.608 3
5.1 3.8 Setosa [(5.2-5.1) 2 + (3.1-3.8) 2] ½ 0.707 6
7.2 3.0 Virginica [(5.2-7.2) 2 + (3.1-3.0) 2] ½ 2.002 13
5.4 3.4 Setosa [(5.2-5.4) 2 + (3.1-3.4) 2] ½ 0.36 2
5.1 3.3 Setosa [(5.2-5.1) 2 + (3.1-3.3) 2] ½ 0.22 1
5.4 3.9 Setosa [(5.2-5.4) 2 + (3.1-3.9) 2] ½ 0.82 8
7.4 2.8 Virginica [(5.2-7.4) 2 + (3.1-2.8) 2] ½ 2.22 15
6.1 2.8 Verscicolor [(5.2-6.1) 2 + (3.1-2.8) 2] ½ 0.94 10
7.3 2.9 Virginica [(5.2-7.3) 2 + (3.1-2.9) 2] ½ 2.1 14
6.0 2.7 Verscicolor [(5.2-6.0) 2 + (3.1-2.7) 2] ½ 0.89 9
5.8 2.8 Virginica [(5.2-5.8) 2 + (3.1-2.8) 2] ½ 0.67 5
6.3 2.3 Verscicolor [(5.2-6.3) 2 + (3.1-2.3) 2] ½ 1.36 12
5.1 2.5 Verscicolor [(5.2-5.1) 2 + (3.1-2.5) 2] ½ 0.60 4
6.3 2.5 Verscicolor [(5.2-6.3) 2 + (3.1-2.5) 2] ½ 1.25 11
5.5 2.4 Verscicolor [(5.2-5.5) 2 + (3.1-2.4) 2] ½ 0.75 7

Arrange in the increasing order of rank

Sepal Length  Sepal Width  Species  Calculation for Distance  Distance  Rank
5.1 3.3 Setosa [(5.2-5.1) 2 + (3.1-3.3) 2] ½ 0.22 1
5.4 3.4 Setosa [(5.2-5.4) 2 + (3.1-3.4) 2] ½ 0.36 2
5.3 3.7 Setosa [(5.2-5.3) 2 + (3.1-3.7) 2] ½ 0.608 3
5.1 2.5 Verscicolor [(5.2-5.1) 2 + (3.1-2.5) 2] ½ 0.60 4
5.8 2.8 Virginica [(5.2-5.8) 2 + (3.1-2.8) 2] ½ 0.67 5
5.1 3.8 Setosa [(5.2-5.1) 2 + (3.1-3.8) 2] ½ 0.707 6
5.5 2.4 Verscicolor [(5.2-5.5) 2 + (3.1-2.4) 2] ½ 0.75 7
5.4 3.9 Setosa [(5.2-5.4) 2 + (3.1-3.9) 2] ½ 0.82 8
6.0 2.7 Verscicolor [(5.2-6.0) 2 + (3.1-2.7) 2] ½ 0.89 9
6.1 2.8 Verscicolor [(5.2-6.1) 2 + (3.1-2.8) 2] ½ 0.94 10
6.3 2.5 Verscicolor [(5.2-6.3) 2 + (3.1-2.5) 2] ½ 1.25 11
6.3 2.3 Verscicolor [(5.2-6.3) 2 + (3.1-2.3) 2] ½ 1.36 12
7.2 3.0 Virginica [(5.2-7.2) 2 + (3.1-3.0) 2] ½ 2.002 13
7.3 2.9 Virginica [(5.2-7.3) 2 + (3.1-2.9) 2] ½ 2.1 14
7.4 2.8 Virginica [(5.2-7.4) 2 + (3.1-2.8) 2] ½ 2.22 15

Step 3: Find the nearest neighbor


If K=1
Sepal Length  Sepal Width  Species  Calculation for Distance  Distance  Rank
5.1 3.3 Setosa [(5.2-5.1) 2 + (3.1-3.3) 2] ½ 0.22 1
Species of new data=Setosa

If K=2
Sepal Length  Sepal Width  Species  Calculation for Distance  Distance  Rank
5.1 3.3 Setosa [(5.2-5.1) 2 + (3.1-3.3) 2] ½ 0.22 1
5.4 3.4 Setosa [(5.2-5.4) 2 + (3.1-3.4) 2] ½ 0.36 2
Species of new data=Setosa

If K=3
Sepal Length  Sepal Width  Species  Calculation for Distance  Distance  Rank
5.1 3.3 Setosa [(5.2-5.1) 2 + (3.1-3.3) 2] ½ 0.22 1
5.4 3.4 Setosa [(5.2-5.4) 2 + (3.1-3.4) 2] ½ 0.36 2
5.3 3.7 Setosa [(5.2-5.3) 2 + (3.1-3.7) 2] ½ 0.608 3
Species of new data=Setosa

If K=5
Sepal Length  Sepal Width  Species  Calculation for Distance  Distance  Rank
5.1 3.3 Setosa [(5.2-5.1) 2 + (3.1-3.3) 2] ½ 0.22 1
5.4 3.4 Setosa [(5.2-5.4) 2 + (3.1-3.4) 2] ½ 0.36 2
5.3 3.7 Setosa [(5.2-5.3) 2 + (3.1-3.7) 2] ½ 0.608 3
5.1 2.5 Verscicolor [(5.2-5.1) 2 + (3.1-2.5) 2] ½ 0.60 4
5.8 2.8 Virginica [(5.2-5.8) 2 + (3.1-2.8) 2] ½ 0.67 5
[Setosa=3, Verscicolor=1, Virginica=1]
Species of new data=Setosa
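The same result can be checked in a few lines of Python (a sketch, ours; the species names are written in the standard spelling "Versicolor"):

```python
import math
from collections import Counter

data = [
    (5.3, 3.7, "Setosa"),     (5.1, 3.8, "Setosa"),     (7.2, 3.0, "Virginica"),
    (5.4, 3.4, "Setosa"),     (5.1, 3.3, "Setosa"),     (5.4, 3.9, "Setosa"),
    (7.4, 2.8, "Virginica"),  (6.1, 2.8, "Versicolor"), (7.3, 2.9, "Virginica"),
    (6.0, 2.7, "Versicolor"), (5.8, 2.8, "Virginica"),  (6.3, 2.3, "Versicolor"),
    (5.1, 2.5, "Versicolor"), (6.3, 2.5, "Versicolor"), (5.5, 2.4, "Versicolor"),
]
query = (5.2, 3.1)

# Steps 1-2: distances to the query, sorted ascending (this is the "rank" order above).
ranked = sorted((math.dist((sl, sw), query), sp) for sl, sw, sp in data)

# Step 3: majority vote for K = 1, 3, 5.
for k in (1, 3, 5):
    votes = Counter(sp for _, sp in ranked[:k])
    print(k, votes.most_common(1)[0][0])
# All three values of K give "Setosa", matching the tables above.
```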

Numerical Problem: 2

Height(cm) Weight(kg) Class


167 51 Underweight
182 62 Normal
176 69 Normal
173 64 Normal
172 65 Normal
174 56 Underweight
169 58 Normal
173 57 Normal
170 55 Normal
170 57 ?

Step 1: Find Distance


Distance = [(x - a)^2 + (y - b)^2]^(1/2)
Where, x = 170 & y = 57
Height (cm)  Weight (kg)  Class  Calculation for Distance  Distance
167 51 Underweight [(170-167) 2 + (57-51) 2] ½ 6.7
182 62 Normal [(170-182) 2 + (57-62) 2] ½ 13
176 69 Normal [(170-176) 2 + (57-69) 2] ½ 13.4
173 64 Normal [(170-173) 2 + (57-64) 2] ½ 7.6
172 65 Normal [(170-172) 2 + (57-65) 2] ½ 8.2
174 56 Underweight [(170-174) 2 + (57-56) 2] ½ 4.1
169 58 Normal [(170-169) 2 + (57-58) 2] ½ 1.4
173 57 Normal [(170-173) 2 + (57-57) 2] ½ 3
170 55 Normal [(170-170) 2 + (57-55) 2] ½ 2

Step 2: Find Rank


Height (cm)  Weight (kg)  Class  Calculation for Distance  Distance  Rank
167 51 Underweight [(170-167) 2 + (57-51) 2] ½ 6.7 5
182 62 Normal [(170-182) 2 + (57-62) 2] ½ 13 8
176 69 Normal [(170-176) 2 + (57-69) 2] ½ 13.4 9
173 64 Normal [(170-173) 2 + (57-64) 2] ½ 7.6 6
172 65 Normal [(170-172) 2 + (57-65) 2] ½ 8.2 7
174 56 Underweight [(170-174) 2 + (57-56) 2] ½ 4.1 4
169 58 Normal [(170-169) 2 + (57-58) 2] ½ 1.4 1
173 57 Normal [(170-173) 2 + (57-57) 2] ½ 3 3
170 55 Normal [(170-170) 2 + (57-55) 2] ½ 2 2

Arrange in the increasing order of rank


Height (cm)  Weight (kg)  Class  Calculation for Distance  Distance  Rank
169 58 Normal [(170-169) 2 + (57-58) 2] ½ 1.4 1
170 55 Normal [(170-170) 2 + (57-55) 2] ½ 2 2
173 57 Normal [(170-173) 2 + (57-57) 2] ½ 3 3
174 56 Underweight [(170-174) 2 + (57-56) 2] ½ 4.1 4
167 51 Underweight [(170-167) 2 + (57-51) 2] ½ 6.7 5
173 64 Normal [(170-173) 2 + (57-64) 2] ½ 7.6 6
172 65 Normal [(170-172) 2 + (57-65) 2] ½ 8.2 7
182 62 Normal [(170-182) 2 + (57-62) 2] ½ 13 8
176 69 Normal [(170-176) 2 + (57-69) 2] ½ 13.4 9

Step 3: Find the nearest neighbor

If K=1
Height (cm)  Weight (kg)  Class  Calculation for Distance  Distance  Rank
169 58 Normal [(170-169) 2 + (57-58) 2] ½ 1.4 1
Class of new data=Normal

If K=2
Height (cm)  Weight (kg)  Class  Calculation for Distance  Distance  Rank
169 58 Normal [(170-169) 2 + (57-58) 2] ½ 1.4 1
170 55 Normal [(170-170) 2 + (57-55) 2] ½ 2 2
Class of new data=Normal

If K=3
Height (cm)  Weight (kg)  Class  Calculation for Distance  Distance  Rank
169 58 Normal [(170-169) 2 + (57-58) 2] ½ 1.4 1
170 55 Normal [(170-170) 2 + (57-55) 2] ½ 2 2
173 57 Normal [(170-173) 2 + (57-57) 2] ½ 3 3
Class of new data=Normal

If K=5
Height (cm)  Weight (kg)  Class  Calculation for Distance  Distance  Rank
169 58 Normal [(170-169) 2 + (57-58) 2] ½ 1.4 1
170 55 Normal [(170-170) 2 + (57-55) 2] ½ 2 2
173 57 Normal [(170-173) 2 + (57-57) 2] ½ 3 3
174 56 Underweight [(170-174) 2 + (57-56) 2] ½ 4.1 4
167 51 Underweight [(170-167) 2 + (57-51) 2] ½ 6.7 5
[Normal=3, Underweight=2]
Class of new data=Normal

Locally Weighted Regression


 Locally weighted regression is an instance-based learning algorithm.
 Locally weighted regression (LWR), also known as LOESS (locally estimated
scatterplot smoothing) or LOWESS (locally weighted scatterplot
smoothing), is a non-parametric regression method that fits multiple
regression models to different subsets of the data.
 It's particularly useful when a global linear model doesn't adequately
capture the relationship between the input and output variables, especially
when the data exhibits non-linear patterns.
 The phrase "locally weighted regression" is called
– local because the function is approximated based only on data near the
query point,
– weighted because the contribution of each training example is weighted
by its distance from the query point, and
– regression because this is the term used widely in the statistical learning
community for the problem of approximating real-valued functions.
 Locally Weighted Regression (LWR) is a non-parametric regression
algorithm used to model the relationship between variables.
 It aims to fit a linear regression model to a dataset by giving more weight to
nearby data points.
 The basic assumption for a linear regression is that the data must be
linearly distributed.
 But what if the data is not linearly distributed? Can we still apply the idea of
regression? The answer is yes: we can still apply regression, and this approach is
called locally weighted regression.
 LWR works by assigning weights to the known points based on their
proximity to the unknown point.
 The closer a known point is to the unknown point, the higher its weight.
 This means that the known points that are closer to the unknown point
have a stronger influence on the prediction.

How Locally Weighted Regression Works:


1. Input
The algorithm takes as input a training dataset consisting of instances with
input variables (features) and corresponding output values (targets), as well
as a query instance for which the regression prediction is desired.
2. Weight Calculation
For each training instance in the dataset, a weight is calculated based on its
proximity to the query instance. The most common weight function used is
the Gaussian kernel, which assigns higher weights to instances closer to the
query instance and lower weights to instances farther away.
3. Local Regression
A weighted regression model is fit to the training instances using the
calculated weights.
4. Prediction
Once the local regression model is fitted, the prediction for the query
instance is made by applying the learned model to the query instance's
input variables.
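A minimal sketch of these four steps with a Gaussian kernel (ours, assuming NumPy; tau is the kernel bandwidth, a parameter name not used in the notes):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression at a single query point.
    X: (n, d) inputs, y: (n,) targets, tau: kernel bandwidth (assumed)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])          # add a bias column
    xq = np.hstack([1.0, np.atleast_1d(x_query)])
    # Step 2: Gaussian kernel weights; nearby points get weights close to 1, far points near 0.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Step 3: weighted least squares, theta = (X^T W X)^-1 X^T W y (pinv for stability).
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    # Step 4: apply the local model to the query point.
    return xq @ theta

# Hypothetical noisy non-linear data: y = sin(x) + noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 50).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 50)
print(lwr_predict(X, y, x_query=[3.0]))   # close to sin(3.0), about 0.14
```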
Case Based Learning/ Case Based Reasoning [CBR]
 Instance-based methods such as k-NEAREST NEIGHBOR and locally
weighted regression share three key properties.
 First, they are lazy learning methods in that they defer the decision of how
to generalize beyond the training data until a new query instance is
observed.
 Second, they classify new query instances by analyzing similar instances
while ignoring instances that are very different from the query.
 Third, they represent instances as real-valued points in an n-dimensional
Euclidean space.
 In CBR, the instances are not represented as real-valued points, but instead,
they use a rich symbolic representation and the methods used to retrieve
similar instances are correspondingly more elaborate.
 CBR has been applied to problems such as:
o Conceptual design of mechanical devices based on a stored library of
previous designs.
o Reasoning about new legal cases based on previous rulings.

CASE-BASED REASONING CYCLE


Case-based reasoning consists of a cycle of the following four steps:
1. Retrieve - Given a new case, retrieve similar cases from the case base.
2. Reuse - Adapt the retrieved cases to fit to the new case.
3. Revise - Evaluate the solution and revise it based on how well it works.
4. Retain - Decide whether to retain this new case in the case base.

Examples of CASE-BASED REASONING


Example 1:
 Help desk that users call with problems to be solved.
 When users give a description of a problem, the closest cases in the case
base are retrieved.
 The diagnostic assistant could recommend some of these to the user,
adapting each case to the user’s particular situation.
 If one of the adapted cases works, that case is added to the case base, to be
used when another user asks a similar question.
 If none of the cases found works, some other method is attempted to solve
the problem, perhaps by adapting other cases or having a human help
diagnose the problem.
 When the problem is finally solved, the solution is added to the case base.

Example 2: How CADET Tool Employs CBR


CADET (Course of Action Development and Evaluation Tool) is an AI-based
decision-support tool used for military planning. It helps in generating battle
plans automatically or semi-automatically.
CADET uses CBR in the following ways:
 Plan Generation: When given a new mission objective, CADET searches a
case library of previously developed plans (past battle scenarios).
 Case Retrieval: It finds the most similar past missions or operations based
on features like mission type, enemy type, terrain, resources, etc.
 Plan Adaptation: CADET modifies the retrieved plan to fit the specifics of
the current mission (e.g., adjusting forces or tactics based on current
enemy strength).
 Plan Evaluation: After adaptation, CADET simulates and evaluates the
feasibility of the plan.
 Learning: New battle plans generated during operations can be stored back
in the case library to improve future performance.

Thus, CADET employs CBR to quickly generate detailed, actionable, and feasible
military plans by leveraging experience from past operations.

Numerical Example of CBR

Case  Monthly Income  Account Balance  Home Owner  Credit Score
1 3 2 0 2
2 2 1 1 1
3 3 2 2 4
4 0 -1 0 0
5 3 1 2 ?
The system is using the nearest neighbor retrieval algorithm with the following
similarity (distance) function: d(T, S) = Σi |Ti − Si| × wi
where T is the target case, S is the source case, i indexes the features, and wi
are the feature weights.
Answer the following questions:
a) Which case will the CBR system retrieve as the ‘best match’, if all the weights are equal to 1?
b) The solution that the CBR system should propose is the credit score rating.
Suggest how should the solution of the retrieved case be adapted for the target
case?
c) What can be changed in the similarity function to make feature ‘Account
Balance’ three times more important than any other feature? Will this change
influence the solution?
Sol: a) Which case will the CBR system retrieve as the ‘best match’, if all the
weights are equal to 1?
D(case 1)= |3-3|*1 + |1-2|*1 + |2-0|*1 = 0+1+2=3
D(case 2)= |3-2|*1 + |1-1|*1 + |2-1|*1 = 1+0+1=2
D(case 3)= |3-3|*1 + |1-2|*1 + |2-2|*1 = 0+1+0=1
D(case 4)= |3-0|*1 + |1-(-1)|*1 + |2-0|*1 = 3+2+2=7
Min Distance=1 with respect to case 3.
So the best match/ most similar case =case 3.
According to the best match, credit score for case 5= 4

Sol: b) The solution that the CBR system should propose is the credit score rating.
Suggest how should the solution of the retrieved case be adapted for the target
case?
The Credit Score for Case 3 is 4.
The only difference between the target case and Case 3 is the Account Balance (1
< 2).
Based on the other cases, one can derive that the decrease in Account Balance
should decrease the credit.
Thus, the solution of Case 3 can be adapted by decreasing the value.
The revised solution for the new case is: Credit Score = 3.

Sol: c) What can be changed in the similarity function to make feature ‘Account
Balance’ three times more important than any other feature? Will this change
influence the solution?
D(case 1)= |3-3|*1 + |1-2|*3 + |2-0|*1 = 0+3+2=5
D(case 2)= |3-2|*1 + |1-1|*3 + |2-1|*1 = 1+0+1=2
D(case 3)= |3-3|*1 + |1-2|*3 + |2-2|*1 = 0+3+0=3
D(case 4)= |3-0|*1 + |1-(-1)|* 3 + |2-0|*1 = 3+6+2=11
Min Distance=2 with respect to case 2.
So the best match/ most similar case =case 2.
According to the best match, credit score for case 5= 1
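A short sketch (ours) that reproduces parts (a) and (c) using nearest-neighbor retrieval with the weighted distance function above:

```python
def weighted_distance(target, source, weights):
    """City-block distance with per-feature weights: sum(|t_i - s_i| * w_i)."""
    return sum(abs(t - s) * w for t, s, w in zip(target, source, weights))

# Features: (Monthly Income, Account Balance, Home Owner); stored credit scores.
cases = {1: (3, 2, 0), 2: (2, 1, 1), 3: (3, 2, 2), 4: (0, -1, 0)}
scores = {1: 2, 2: 1, 3: 4, 4: 0}
target = (3, 1, 2)   # Case 5, credit score unknown

for weights in [(1, 1, 1), (1, 3, 1)]:        # part (a) and part (c)
    dists = {c: weighted_distance(target, f, weights) for c, f in cases.items()}
    best = min(dists, key=dists.get)
    print(weights, dists, "best match:", best, "proposed score:", scores[best])
```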

Relevance of CBR(Case Based Reasoning)


CBR (Case-Based Reasoning) is a problem-solving approach where new problems
are solved by referring to similar past cases and reusing their solutions. Instead of
solving a problem from scratch, the system looks for previously encountered
cases, adapts them, and applies them to the new situation.
Its relevance includes:
 Learning from Experience: Like humans, systems using CBR improve over
time by accumulating solved cases.
 Efficiency: Reduces problem-solving time by reusing previous solutions.
 Adaptability: Can deal with incomplete or imperfect information because
solutions can be adapted.
 Consistency: Provides consistent decisions based on historical knowledge.
 Domains: Very useful in medical diagnosis, legal reasoning, help desks,
design systems, and military planning.
Radial Basis Function Network (RBFN)
 An RBF network is a type of artificial neural network used mainly for
function approximation, classification, and time-series prediction.
 It has three layers:
1. Input layer: Just passes input data to the next layer.
2. Hidden layer: Applies a radial basis function (typically a Gaussian
function) to the input. Each hidden node responds strongly only to
inputs near its center (hence "radial").
3. Output layer: Combines the outputs from the hidden layer linearly to
produce the final output.
 In simple terms, RBFNs work by measuring the distance between the input
and prototype points (centers) and then mapping that distance to an
output using a smooth function.

Radial Basis Functions


Radial Basis Functions (RBFs) are utilized to calculate distances. Among these,
the Gaussian function is the most frequently employed, defined as:

φ(x) = exp( - ||x - c||^2 / (2 * sigma^2) )

Where x is the input vector, c is the center of the RBF, and sigma is the spread
parameter. The RBF measures how close the input is to the center c.

Radial Basis Function Example


A dataset has two-dimensional data points belonging to two separate classes. An
RBF Network has been trained with 20 RBF neurons on the said data set. We can
mark the prototypes selected and view the category one score on the input space.
For viewing, we can draw a 3-D mesh or a contour plot.
 The areas of highest and lowest category one score should be marked separately.
In the case of category one output node:

 All the weights for category 2 RBF neurons will be negative.


 All the weights for category 1 RBF neurons will be positive.
Finally, an approximation of the decision boundary can be plotted by computing
the scores over a finite grid.

Training Process of radial basis function neural network


An RBF neural network is trained in three stages: choosing the centers,
determining the spread parameters, and training the output weights.
Step 1: Selecting the Centers
 Techniques for Center Selection: Centers can be picked at random from the
training data or by applying techniques such as k-means clustering.
 K-Means Clustering: This widely used center-selection technique groups the
input data into k clusters, and the centers of these clusters are used as the
centers of the RBF neurons.
Step 2: Determining the Spread Parameters
 The spread parameter (σ) governs each RBF neuron's area of effect and
establishes the width of the RBF.
 Calculation: The spread parameter can be manually adjusted for each neuron
or set as a constant for all neurons. A popular method is to set σ based on the
separation between the centers, frequently using a heuristic such as dividing the
greatest distance between centers by the square root of twice the number of
centers.
Step 3: Training the Output Weights
 Linear Regression: The objective of linear regression techniques, which are
commonly used to estimate the output layer weights, is to minimize the error
between the anticipated output and the actual target values.
 Pseudo-Inverse Method: One popular technique for figuring out the weights
is to utilize the pseudo-inverse of the hidden layer outputs matrix.
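Putting the three stages together, here is an illustrative sketch (ours, assuming NumPy) with random center selection, the d_max / sqrt(2k) spread heuristic, and pseudo-inverse output weights:

```python
import numpy as np

def train_rbf(X, y, n_centers=10, seed=0):
    """Train a simple RBF network: random centers, heuristic spread, pseudo-inverse weights."""
    rng = np.random.default_rng(seed)
    # Step 1: pick centers at random from the training data (k-means is the alternative).
    centers = X[rng.choice(len(X), n_centers, replace=False)]
    # Step 2: spread heuristic, sigma = d_max / sqrt(2 * n_centers).
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    sigma = d_max / np.sqrt(2 * n_centers)

    def hidden(Xq):
        # Gaussian activation of each hidden unit: strong response only near its center.
        d2 = ((Xq[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))

    # Step 3: output weights via the pseudo-inverse of the hidden-layer output matrix.
    W = np.linalg.pinv(hidden(X)) @ y
    return lambda Xq: hidden(Xq) @ W

# Hypothetical 1-D function approximation: y = sin(x).
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel()
rbf = train_rbf(X, y, n_centers=12)
print(rbf(np.array([[np.pi / 2]])))   # should be close to sin(pi/2) = 1.0
```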

Key properties:
 Fast training (compared to deep networks).
 Good for local approximation (each neuron specializes in a small region).
 Sensitive to the choice of centers and widths of the radial functions.

Applications of RBFNNs
 Pattern Identification: RBFNNs excel at identifying patterns within
datasets, making them ideal for image and speech identification.
 Continuous Function Estimation: They are good at estimating
continuous functions, which benefits applications like curve fitting and
modeling surfaces.
 Forecasting Timeseries Data: RBFNNs can forecast future data in
timeseries, which helps in financial market predictions and also weather
forecasting.

Comparison between Locally Weighted Regression (LWR) and Radial


Basis Function Networks (RBFNs):

 Basic Idea: LWR fits a local model (like a small linear regression) around the query point using nearby data; an RBFN trains a global network whose hidden units are activated based on distance to centers (like Gaussian bumps).
 Learning Style: LWR is instance-based learning (lazy learning) with no global model built during training; an RBFN is model-based learning (eager learning) whose network parameters are tuned during training.
 Model Training: LWR has no real training phase, as all computation happens at query time; RBFN training involves choosing centers, widths, and output weights ahead of time.
 At Query Time: LWR fits a small regression only around the query point (local fitting); an RBFN passes the input through the trained network to get the output (global operation).
 Computation Cost: LWR is high at query time (it must compute distances and solve a small regression every time); an RBFN is low at query time (only a forward pass through the network).
 Memory Requirement: LWR needs to store the whole dataset (because the model is computed at query time); an RBFN can compress the dataset into a few RBF centers (fixed-size model).
 Handling Nonlinearities: LWR naturally handles complex nonlinearities via local fitting; an RBFN handles nonlinearities via radial basis functions combined at the output.
 Sensitivity: LWR is sensitive to local noise because it uses only nearby points; an RBFN is more robust, depending on training (centers can "smooth over" noisy data).
 Examples: LWR includes k-nearest neighbors with weighted regression and locally weighted linear regression; RBF networks are used for classification, regression tasks, and function approximation.
 LWR is like doing a mini regression on-the-fly, focusing only on data points
near your query input.
 RBF Networks learn a global function where hidden units represent "local
bumps" in the input space, and you simply evaluate that function at query
time.

Connection between RBFNs and Instance-Based Learning:


 When RBFNs choose centers from the training data directly, they behave
similar to instance-based learning because they retain instances and use
distance-based reasoning.
 However, a full RBF network generalizes a bit more by adjusting the centers
and weights beyond simply storing examples.
 RBFNs are neural networks that use distance-based activation.
 Instance-based learning relies on comparing new inputs to stored
instances rather than creating a full abstract model.
 RBFNs can sometimes behave like instance-based learners depending on
how the hidden layer centers are selected.
