Unit 3 (MLT)
Starting at the Root: The algorithm begins at the top, called the “root
node,” representing the entire dataset.
Asking the Best Questions: It looks for the most important feature or
question that splits the data into the most distinct groups. This is like
asking a question at a fork in the tree.
Branching Out: Based on the answer to that question, it divides the data
into smaller subsets, creating new branches. Each branch represents a
possible route through the tree.
Repeating the Process: The algorithm continues asking questions and
splitting the data at each branch until it reaches the final “leaf nodes,”
representing the predicted outcomes or classifications.
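A minimal sketch of this recursive procedure in Python is given below (illustrative only, not from the notes; a simple majority-based impurity stands in for the entropy and Gini measures discussed later, and the data is assumed to be a list of (feature-dict, label) pairs):

def majority(labels):
    # Most frequent class label in the list.
    return max(set(labels), key=labels.count)

def impurity(labels):
    # Fraction of examples not in the majority class (simple stand-in for entropy/Gini).
    return 1 - labels.count(majority(labels)) / len(labels)

def choose_best_feature(rows, features):
    # Pick the feature whose split gives the lowest weighted impurity ("the best question").
    def weighted_impurity(feature):
        total = len(rows)
        score = 0.0
        for value in set(f[feature] for f, _ in rows):
            subset = [label for f, label in rows if f[feature] == value]
            score += len(subset) / total * impurity(subset)
        return score
    return min(features, key=weighted_impurity)

def build_tree(rows, features):
    labels = [label for _, label in rows]
    # Leaf node: all examples share one label, or no features remain to split on.
    if len(set(labels)) == 1 or not features:
        return majority(labels)
    best = choose_best_feature(rows, features)
    tree = {best: {}}
    for value in set(f[best] for f, _ in rows):   # one branch per attribute value
        subset = [(f, label) for f, label in rows if f[best] == value]
        tree[best][value] = build_tree(subset, [a for a in features if a != best])
    return tree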
Information theory in the context of decision trees is primarily used to help the
algorithm decide which feature to split on at each step. It provides a way to
quantify uncertainty or impurity in data, and helps choose splits that maximize
the "information gain"—i.e., reduce uncertainty the most.
Attribute selection measures (ASMs) are also called splitting rules because they decide how the tuples at a given node are to be divided.
The attribute selection measure provides a ranking for every attribute describing the given training tuples. The attribute with the best score for the measure is selected as the splitting attribute for the given tuples.
There are three well-known attribute selection measures: Information Gain, Gain Ratio, and Gini Index.
Entropy
Entropy is a measure of disorder or impurity in the given dataset.
It is the negative sum, over all labels, of the probability of each label times the log probability of that same label. It is the average rate at which a stochastic data source produces information, or equivalently, it measures the uncertainty associated with a random variable.
Mathematically, entropy is calculated as:
Entropy(S) = - Σ p_i log2(p_i), where p_i is the proportion of tuples in S that belong to class i.
Information Gain
An amount of information gained about a random variable or signal from
observing another random variable is known as Information Gain.
It favors smaller partitions with distinct values.
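As a concrete illustration (a minimal Python sketch, not from the notes; the class labels are assumed to be a list, and an attribute split is given as a list of label sublists):

from math import log2
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions in S.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, partitions):
    # Gain = Entropy(parent) - weighted average entropy of the child partitions.
    total = len(parent_labels)
    weighted = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted

print(round(entropy(['Yes'] * 9 + ['No'] * 5), 3))   # 0.94 for 9 'Yes' and 5 'No' tuples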
Gain Ratio
The information gain measure is biased toward tests with many outcomes; that is, it tends to select attributes having a large number of distinct values. For instance, consider an attribute that acts as a unique identifier, such as product ID.
A split on product ID results in a huge number of partitions, each one containing only a single tuple. Because each partition is pure, the information needed to classify data set D based on this partitioning would be Info_product_ID(D) = 0, so the gain is maximal even though such a split is useless for prediction. The gain ratio corrects this bias by dividing the information gain by the split information of the attribute.
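A short sketch of the split information and gain ratio computation (illustrative, not from the notes):

from math import log2

def split_info(partition_sizes):
    # SplitInfo_A(D) = -sum(|D_j|/|D| * log2(|D_j|/|D|)) over the partitions of D induced by A.
    total = sum(partition_sizes)
    return -sum((n / total) * log2(n / total) for n in partition_sizes if n > 0)

def gain_ratio(info_gain, partition_sizes):
    # GainRatio(A) = Gain(A) / SplitInfo_A(D).
    si = split_info(partition_sizes)
    return info_gain / si if si > 0 else 0.0

# A unique identifier such as product ID puts one tuple in each of 14 partitions:
print(round(split_info([1] * 14), 3))   # log2(14) ≈ 3.807, a large denominator that lowers the gain ratio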
Gini Index
It is calculated by subtracting the sum of the squared probabilities of each class
from one.
The Gini index is used in CART (Classification and Regression Trees).
The Gini index measures the impurity of D, a data partition or collection of training tuples, as:
Gini(D) = 1 - Σ p_i^2, where p_i is the proportion of tuples in D that belong to class i.
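A short sketch of the computation (illustrative):

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions in D.
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

print(round(gini(['Yes'] * 9 + ['No'] * 5), 3))   # 1 - (9/14)^2 - (5/14)^2 ≈ 0.459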
Inductive Bias
“Inductive bias is the set of assumptions or preferences that a learning
algorithm uses to make predictions beyond the data it has been trained on.”
Without inductive bias, machine learning algorithms would be unable to
generalize from training data to unseen situations, as the possible hypotheses or
models could be infinite.
Types of Inductive Bias
There are two main types of inductive bias in machine learning: restrictive
bias and preferential bias.
1. Restrictive Bias
Restrictive bias refers to the assumptions that limit the set of functions that the
algorithm can learn.
For example, a linear regression model assumes that the relationship between the
input variables and the output variable is linear. This means that the model can
only learn linear functions, and any non-linear relationships between the variables
will not be captured.
Another example of restrictive bias is the decision tree algorithm, which assumes
that the relationship between the input variables and the output variable can be
represented by a tree-like structure. This means that the algorithm can only learn
functions that can be represented by a decision tree.
2. Preferential Bias
Preferential bias refers to the assumptions that make some functions more likely
to be learned than others.
For example, a neural network with a large number of hidden layers and
parameters has a preferential bias towards complex, non-linear functions. This
means that the algorithm is more likely to learn complex functions than simple
ones.
Another example of preferential bias is the k-nearest neighbors (k-NN) algorithm, which
assumes that similar inputs have similar outputs. This means that the algorithm is
more likely to predict the same output for inputs that are close together in
feature space.
Those rules are inductive inferences—they were inferred from training data and
are used to predict future inputs.
Example: Say you're building a tree to decide if someone will play tennis based on
weather:
Outlook Temperature PlayTennis
Sunny Hot No
Overcast Mild Yes
Rain Cool Yes
Pros of ID3:
Simple and intuitive.
Works well for categorical data.
Good for small to medium datasets.
Limitations of ID3:
Can overfit the training data.
Doesn’t handle numeric data well without preprocessing.
No pruning (in the original version).
6. Sensitivity to Training Data: Small changes in the training data can lead to
significantly different decision trees, making the model unstable and
unreliable.
7. Difficulty with Imbalanced Datasets: If one class is significantly more
prevalent than others, the decision tree may become biased towards the
majority class, leading to poor performance on the minority class.
8. Computational Efficiency: Decision tree algorithms can be computationally
expensive, especially for large datasets.
9. Interpretability: While decision trees are relatively easy to interpret, very
complex trees can be difficult to understand.
10. Choosing Attribute Selection Measures: The choice of attribute selection
measure (e.g., information gain, Gini index) can affect the performance of the
decision tree.
11. NP-completeness: Finding the optimal decision tree is an NP-complete
problem, meaning that finding the absolute best solution is computationally
intractable.
Numerical on ID3
Question: Decision rules are to be found based on the entropy and information gain of the features. The following table lists the decision-making factors for playing tennis outside over the previous 14 days.
Day Outlook Temp. Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Solution using ID3 Algorithm
Gain(S, Temp) = 0.94 - (4/14)*1.0 - (6/14)*0.9183 - (4/14)*0.8113 = 0.0289
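The remaining gains can be checked programmatically; the following sketch (data transcribed from the 14-day table above) reproduces the hand calculation for Temp and computes the other attributes as well:

from math import log2
from collections import Counter

data = [  # (Outlook, Temp, Humidity, Wind, PlayTennis), days 1-14
    ('Sunny', 'Hot', 'High', 'Weak', 'No'),          ('Sunny', 'Hot', 'High', 'Strong', 'No'),
    ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),      ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),       ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'), ('Sunny', 'Mild', 'High', 'Weak', 'No'),
    ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),      ('Rain', 'Mild', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'),    ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'),    ('Rain', 'Mild', 'High', 'Strong', 'No'),
]

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    remainder = 0.0
    for value in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - remainder

for name, col in [('Outlook', 0), ('Temp', 1), ('Humidity', 2), ('Wind', 3)]:
    print(name, round(gain(col), 4))
# Outlook ≈ 0.2467, Temp ≈ 0.0292 (≈ 0.0289 with the rounded intermediate entropies above),
# Humidity ≈ 0.1518, Wind ≈ 0.0481.
# Outlook has the largest gain, so ID3 places it at the root of the tree.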
Example 2: Build a decision tree from the training data below using the ID3 algorithm.
Instance Based Learning
Advantages
Simplicity: Instance-based learning is straightforward to implement and
does not require complex mathematical modeling.
Adaptability: It can quickly adapt to new data without retraining, making it
ideal for dynamic datasets.
Disadvantages:
Storage Requirements: Storing the entire training dataset can require
significant memory, especially for large datasets.
Computational Cost: The prediction process can be computationally
expensive, especially when dealing with large datasets, as it involves
comparing the new instance to all stored instances.
Use Cases
Classification tasks: Algorithms like K-Nearest Neighbors (KNN) effectively
classify data points based on similarity metrics.
Regression tasks: Methods such as locally weighted regression use
instance-based learning for making predictions in continuous spaces.
Challenges and Limitations
High memory usage: Storing the entire dataset requires significant
memory, particularly for large datasets.
Computational expense: Predictions are slow for large datasets since
comparisons must be made with all stored instances.
Sensitivity to irrelevant or noisy features: These can distort similarity
measurements, leading to reduced prediction accuracy.
o As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value of "K", so we need to try several values and pick the one that works best. The most commonly preferred value for K is 5.
o A very low value of K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers, leading to inaccurate predictions.
o Large values of K smooth out noise, but they may cause difficulties, such as blurring the boundary between classes.
o Prefer an odd value of K so that majority voting does not end in a tie (a minimal sketch of the K-NN procedure follows this list).
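A minimal K-NN sketch in Python (illustrative; Euclidean distance and a simple majority vote, mirroring the procedure used in the numerical problems below):

from math import dist            # Euclidean distance (Python 3.8+)
from collections import Counter

def knn_predict(train, query, k):
    # train: list of (feature_vector, label) pairs; query: feature vector to classify.
    # Keep the k training points closest to the query.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]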
Numerical Problem: 1
New data point to classify (as implied by the distance calculations below): Sepal Length = 5.2, Sepal Width = 3.1.
If K=2
Sepal Length | Sepal Width | Species | Calculation for Distance | Distance | Rank
5.1 | 3.3 | Setosa | √[(5.2-5.1)² + (3.1-3.3)²] | 0.22 | 1
5.4 | 3.4 | Setosa | √[(5.2-5.4)² + (3.1-3.4)²] | 0.36 | 2
Species of new data=Setosa
If K=3
Sepal Length | Sepal Width | Species | Calculation for Distance | Distance | Rank
5.1 | 3.3 | Setosa | √[(5.2-5.1)² + (3.1-3.3)²] | 0.22 | 1
5.4 | 3.4 | Setosa | √[(5.2-5.4)² + (3.1-3.4)²] | 0.36 | 2
5.3 | 3.7 | Setosa | √[(5.2-5.3)² + (3.1-3.7)²] | 0.608 | 3
Species of new data=Setosa
If K=5
Sepal Length | Sepal Width | Species | Calculation for Distance | Distance | Rank
5.1 | 3.3 | Setosa | √[(5.2-5.1)² + (3.1-3.3)²] | 0.22 | 1
5.4 | 3.4 | Setosa | √[(5.2-5.4)² + (3.1-3.4)²] | 0.36 | 2
5.3 | 3.7 | Setosa | √[(5.2-5.3)² + (3.1-3.7)²] | 0.608 | 3
5.1 | 2.5 | Versicolor | √[(5.2-5.1)² + (3.1-2.5)²] | 0.608 | 4
5.8 | 2.8 | Virginica | √[(5.2-5.8)² + (3.1-2.8)²] | 0.67 | 5
[Setosa = 3, Versicolor = 1, Virginica = 1]
Species of new data=Setosa
Numerical Problem: 2
New data point to classify (as implied by the distance calculations below): Height = 170 cm, Weight = 57 kg.
If K=1
Height (cm) | Weight (kg) | Class | Calculation for Distance | Distance | Rank
169 | 58 | Normal | √[(170-169)² + (57-58)²] | 1.4 | 1
Class of new data=Normal
If K=2
Height (cm) | Weight (kg) | Class | Calculation for Distance | Distance | Rank
169 | 58 | Normal | √[(170-169)² + (57-58)²] | 1.4 | 1
170 | 55 | Normal | √[(170-170)² + (57-55)²] | 2 | 2
Class of new data=Normal
If K=3
Height (cm) | Weight (kg) | Class | Calculation for Distance | Distance | Rank
169 | 58 | Normal | √[(170-169)² + (57-58)²] | 1.4 | 1
170 | 55 | Normal | √[(170-170)² + (57-55)²] | 2 | 2
173 | 57 | Normal | √[(170-173)² + (57-57)²] | 3 | 3
Class of new data=Normal
If K=5
Height (cm) | Weight (kg) | Class | Calculation for Distance | Distance | Rank
169 | 58 | Normal | √[(170-169)² + (57-58)²] | 1.4 | 1
170 | 55 | Normal | √[(170-170)² + (57-55)²] | 2 | 2
173 | 57 | Normal | √[(170-173)² + (57-57)²] | 3 | 3
174 | 56 | Underweight | √[(170-174)² + (57-56)²] | 4.1 | 4
167 | 51 | Underweight | √[(170-167)² + (57-51)²] | 6.7 | 5
[Normal=3, Underweight=2]
Class of new data=Normal
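The K=5 case of Problem 2 can be verified with the K-NN sketch above, using only the five rows listed in the table and the query point (Height = 170 cm, Weight = 57 kg) implied by the distance calculations:

from math import dist
from collections import Counter

train = [((169, 58), 'Normal'), ((170, 55), 'Normal'), ((173, 57), 'Normal'),
         ((174, 56), 'Underweight'), ((167, 51), 'Underweight')]
query = (170, 57)

neighbors = sorted(train, key=lambda item: dist(item[0], query))[:5]
print([round(dist(f, query), 1) for f, _ in neighbors])        # [1.4, 2.0, 3.0, 4.1, 6.7]
print(Counter(label for _, label in neighbors).most_common())  # [('Normal', 3), ('Underweight', 2)] -> Normal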
Thus, CADET employs CBR to quickly generate detailed, actionable, and feasible
military plans by leveraging experience from past operations.
Sol: b) The solution that the CBR system should propose is the credit score rating.
Suggest how the solution of the retrieved case should be adapted for the target case.
The Credit Score for Case 3 is 4.
The only difference between the target case and Case 3 is the Account Balance (the target's value is lower than that of Case 3).
Based on the other cases, one can derive that a decrease in Account Balance should decrease the credit score.
Thus, the solution of Case 3 can be adapted by decreasing its value.
The revised solution for the new case is: Credit Score = 3.
Sol: c) What can be changed in the similarity function to make feature ‘Account
Balance’ three times more important than any other feature? Will this change
influence the solution?
D(case 1)= |3-3|*1 + |1-2|*3 + |2-1|*1 = 0+3+1=4
D(case 2)= |3-2|*1 + |1-1|*3 + |2-1|*1 = 1+0+1=2
D(case 3)= |3-3|*1 + |1-2|*3 + |2-2|*1 = 0+3+0=3
D(case 4)= |3-0|*1 + |1-(-1)|* 3 + |2-0|*1 = 3+6+2=11
Min Distance=2 with respect to case 2.
So the best match/ most similar case =case 2.
According to the best match, the credit score for case 5 (the target case) = 1.
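The weighted retrieval in part (c) can be reproduced with a short sketch (case feature values and weights as implied by the calculations above; the target case has features (3, 1, 2), the second feature being Account Balance):

# Weighted Manhattan distance used for retrieving the most similar case.
cases = {1: (3, 2, 1), 2: (2, 1, 1), 3: (3, 2, 2), 4: (0, -1, 0)}
target = (3, 1, 2)
weights = (1, 3, 1)   # Account Balance (second feature) counted three times as much

def weighted_distance(a, b):
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))

distances = {cid: weighted_distance(target, feats) for cid, feats in cases.items()}
print(distances)                           # {1: 4, 2: 2, 3: 3, 4: 11}
print(min(distances, key=distances.get))   # case 2 is the best match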
A commonly used radial basis function is the Gaussian, phi(x) = exp(-||x - c||^2 / (2*sigma^2)), where x is the input vector, c is the center of the RBF, and sigma is the spread parameter. The RBF measures how close the input is to the center c.
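A small sketch of the Gaussian RBF (illustrative):

from math import exp, dist

def gaussian_rbf(x, c, sigma):
    # phi(x) = exp(-||x - c||^2 / (2 * sigma^2)): the response peaks at the center and decays with distance.
    return exp(-dist(x, c) ** 2 / (2 * sigma ** 2))

print(gaussian_rbf((0.0, 0.0), (0.0, 0.0), 1.0))            # 1.0 at the center
print(round(gaussian_rbf((1.0, 1.0), (0.0, 0.0), 1.0), 3))  # exp(-1) ≈ 0.368 away from it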
Key properties:
Fast training (compared to deep networks).
Good for local approximation (each neuron specializes in a small region).
Sensitive to the choice of centers and widths of the radial functions.
Applications of RBFNNs
Pattern Identification: RBFNNs excel at identifying patterns within
datasets, making them ideal for image and speech identification.
Continuous Function Estimation: They are good at estimating
continuous functions, which benefits applications like curve fitting and
modeling surfaces.
Forecasting Time-Series Data: RBFNNs can forecast future values in a time series, which helps in financial market prediction as well as weather forecasting.