ML Unit-3
Decision Tree
Using ID3 Algorithm to build a Decision Tree to predict the weather
The ID3 algorithm, short for Iterative Dichotomiser 3, is a classification algorithm that
follows a greedy approach to building a decision tree: at each step it selects the attribute
that yields the maximum Information Gain (IG), or equivalently the minimum Entropy (H).
What is a Decision Tree?
A supervised machine learning algorithm used to build classification and regression models
in the form of a tree structure.
A decision tree is a tree where each:
• Node represents a feature (attribute)
• Branch represents a decision (rule)
• Leaf represents an outcome (categorical or continuous)
There are many algorithms for building decision trees; here we discuss the ID3 algorithm
with an example.
What is an ID3 Algorithm?
ID3 stands for Iterative Dichotomiser 3
It is a classification algorithm that follows a greedy approach, selecting at each step the
attribute that yields the maximum Information Gain (IG) or the minimum Entropy (H).
What are Entropy and Information Gain?
Entropy is a measure of the amount of uncertainty in the dataset S. Mathematically, entropy is defined as
H(S) = − Σ_{c ∈ C} p(c) · log₂ p(c)
Where,
• S - the current dataset for which entropy is being calculated (this changes at every iteration of the ID3 algorithm).
• C - the set of classes in S (for example, C = {yes, no}).
• p(c) - the proportion of the number of elements in class c to the number of elements in set S.
In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest
entropy is used to split the set S on that particular iteration.
Entropy = 0 implies the set is pure, i.e., all elements belong to the same class.
Information Gain IG(A) tells us how much uncertainty in S was reduced after splitting
set S on attribute A. Mathematically,
IG(S, A) = H(S) − Σ_{t ∈ T} p(t) · H(t)
where T is the set of subsets created by splitting S on attribute A, and p(t) is the proportion of the number of elements in t to the number of elements in S.
What are the steps in ID3 algorithm?
The steps in the ID3 algorithm are as follows (a short code sketch follows the list):
1. Calculate the entropy of the dataset.
2. For each attribute/feature:
2.1. Calculate the entropy for each of its categorical values.
2.2. Calculate the information gain for the feature.
3. Split on the feature with the maximum information gain.
4. Repeat steps 1-3 on each subset until every branch ends in a pure node or no attributes remain.
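To make these steps concrete, here is a minimal Python sketch of the entropy and information-gain computations; the five-row dataset at the end is an illustrative fragment, not the full weather table from the notes:

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over c in C of p(c) * log2(p(c))
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # IG(S, A) = H(S) - sum over values v of A of |S_v|/|S| * H(S_v)
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Illustrative fragment of a weather dataset:
rows = [{"Outlook": "Sunny"}, {"Outlook": "Sunny"}, {"Outlook": "Overcast"},
        {"Outlook": "Rain"}, {"Outlook": "Rain"}]
labels = ["No", "No", "Yes", "Yes", "No"]
print(information_gain(rows, labels, "Outlook"))  # ~0.571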
For each attribute of the dataset, let's follow step 2 of the pseudocode.
First attribute - Outlook
Here, the attribute with maximum information gain is Outlook, so Outlook becomes the root. The decision tree built so far:
[Figure: decision tree with Outlook at the root]
Next, for a branch that is still impure, the attribute with maximum information gain is Humidity. The decision tree built so far:
[Figure: decision tree after splitting on Humidity]
Now, finding the best attribute for splitting the data with Outlook = Sunny (dataset rows [4, 5, 6, 10, 14]): the attribute with maximum information gain is Wind. The decision tree built so far:
[Figure: decision tree after splitting on Wind]
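For comparison, scikit-learn can fit a similar tree. Note that its DecisionTreeClassifier implements CART with binary splits rather than true multiway ID3 splits, so the tree shape may differ; the toy data below is illustrative, not the full table from the notes:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "Windy":   [False, True, False, False, True],
    "Play":    ["No", "No", "Yes", "Yes", "No"],
})
X = pd.get_dummies(data[["Outlook", "Windy"]])  # one-hot encode categoricals
y = data["Play"]
tree = DecisionTreeClassifier(criterion="entropy")  # entropy, as in ID3
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))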
Grouping classifiers such as decision trees divide the instance space into segments, and so can
be turned into rankers by learning an ordering on those segments.
The ranking obtained from the empirical probabilities in the leaves of a decision tree yields a
convex ROC curve on the training data.
Adding a split to a decision tree can be interpreted in terms of coverage curves as the following
two-step process:
• Split the corresponding curve segment into two or more segments;
• Reorder the segments on decreasing slope.
• To turn a feature tree into a ranker, we order its leaves on non-increasing empirical
probabilities, which is provably optimal on the training set;
• To turn the tree into a probability estimator, we predict the empirical probabilities in
each leaf, applying Laplace or m-estimate smoothing to make these estimates more
robust for small leaves;
• To turn the tree into a classifier, we choose the operating conditions and find the
operating point that is optimal under those operating conditions.
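As a small sketch of the smoothing mentioned above (the notation is assumed; binary classes): a leaf covering n examples, of which pos are positive, gets a smoothed probability rather than the raw pos/n.

def laplace(pos, n, k=2):
    # Laplace correction for k classes: (pos + 1) / (n + k)
    return (pos + 1) / (n + k)

def m_estimate(pos, n, prior, m):
    # m-estimate: (pos + m * prior) / (n + m); the Laplace correction is
    # the special case m = k, prior = 1/k
    return (pos + m * prior) / (n + m)

# A small leaf with 3 positives out of 3 gets probability 0.8, not 1.0:
print(laplace(3, 3))             # 0.8
print(m_estimate(3, 3, 0.5, 2))  # 0.8, same as Laplace here

This is why smoothing makes estimates more robust for small leaves: the fewer the examples, the more the estimate is pulled toward the prior.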
The operation of merging all leaves in a subtree is called pruning the subtree.
The advantage of pruning is that we can simplify the tree without affecting the chosen operating
point, which is sometimes useful if we want to communicate the tree model to somebody else.
The disadvantage is that we lose ranking performance.
Pruning is therefore not recommended unless (i) you only intend to use the tree for
classification, not for ranking or probability estimation; and (ii) you can define the expected
operating conditions with sufficient precision. One popular algorithm for pruning decision trees
is called reduced-error pruning.
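A minimal sketch of reduced-error pruning, assuming a simple Node class with children, predict, is_leaf and majority_label (these names are illustrative, not from the notes): working bottom-up, each subtree is tentatively replaced by a majority-class leaf, and the replacement is kept only if accuracy on a held-out pruning set does not drop.

def accuracy(tree, pruning_set):
    # Fraction of held-out (x, y) pairs the tree classifies correctly.
    return sum(tree.predict(x) == y for x, y in pruning_set) / len(pruning_set)

def reduced_error_prune(root, node, pruning_set):
    if node.is_leaf():
        return
    for child in node.children:            # prune bottom-up: children first
        reduced_error_prune(root, child, pruning_set)
    before = accuracy(root, pruning_set)
    subtree = node.children
    node.children = []                     # tentatively turn node into a leaf
    node.label = node.majority_label()     # predicting the majority class
    if accuracy(root, pruning_set) < before:
        node.children = subtree            # restore if accuracy dropped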
Pruning: the process of adjusting a decision tree in order to minimize the misclassification error.
Motive: minimize overfitting of the model.
To achieve this, we have TWO techniques:
1) Pre-pruning, or the early stopping rule
2) Post-pruning (one variation is reduced-error pruning of a decision tree)
Note: Entropy and the Gini index are sensitive to fluctuations in the class distribution; √Gini is not.
Tree learning as variance reduction
Regression trees
Standard Deviation
A decision tree is built top-down from a root node and involves partitioning
the data into subsets that contain instances with similar values (homogeneous).
We use the standard deviation to measure the homogeneity of a numerical sample;
if the sample is completely homogeneous, its standard deviation is zero.
Step 1: The standard deviation of the target values is calculated for the whole dataset.
Step 2: The dataset is split on each candidate attribute, and the standard deviation reduction SDR = SD(S) − Σₜ p(t)·SD(t) is computed, i.e., the drop in the weighted standard deviation after the split.
Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.
Step 4a: The dataset is divided based on the values of the selected attribute.
This process is run recursively on the non-leaf branches, until all data is
processed.
In practice, we need some termination criteria: for example, when the coefficient of
variation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%),
and/or when too few instances (n) remain in the branch (e.g., 3).
Step 4b: The "Overcast" subset does not need any further splitting because its CV
(8%) is less than the threshold (10%). The related leaf node gets the average of
the "Overcast" subset.
Step 4c: However, the "Sunny" branch has a CV (28%) greater than the threshold
(10%), so it needs further splitting. We select "Temp" as the best node after
"Outlook" because it has the largest SDR.
Because the number of data points in both branches (FALSE and TRUE) is 3 or
fewer, we stop further branching and assign the average of each branch to the
related leaf node.
Step 4d: Similarly, the "Rainy" branch has a CV (22%) that is greater than the
threshold (10%), so this branch needs further splitting. We select "Temp" as the
best node because it has the largest SDR.
Because the number of data points in all three branches (Cool, Hot and Mild) is
3 or fewer, we stop further branching and assign the average of each branch to
the related leaf node.
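A small sketch of the two quantities used above: standard deviation reduction (SDR) for choosing the split, and the coefficient of variation (CV) as a stopping criterion. The target values and the grouping by Outlook are illustrative assumptions:

import statistics

def sdr(targets, groups):
    # SDR = SD(S) minus the weighted SD of the subsets after the split
    n = len(targets)
    weighted = sum(len(g) / n * statistics.pstdev(g) for g in groups)
    return statistics.pstdev(targets) - weighted

def cv(values):
    # Coefficient of variation: SD as a percentage of the mean
    return statistics.pstdev(values) / statistics.mean(values) * 100

targets = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]
overcast = [46, 43, 52, 44]          # illustrative grouping by Outlook
sunny = [45, 52, 35, 38, 30]
rainy = [25, 30, 23, 46, 48]
print(sdr(targets, [overcast, sunny, rainy]))
print(cv(overcast))  # ~8%: below the 10% threshold, so this branch stops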
Linear models
In this chapter we will look at models that can be understood in terms of lines and planes,
commonly called linear models.
1. Linear models are parametric, meaning that they have a fixed form with a small
number of numeric parameters that need to be learned from data. This is different from
tree or rule models, where the structure of the model (e.g., which features to use in the
tree, and where) is not fixed in advance.
2. Linear models are stable, which is to say that small variations in the training data have
only limited impact on the learned model.
3. Linear models are less likely to overfit the training data than some other models,
largely because they have relatively few parameters. The flipside of this is that they
sometimes lead to underfitting.
The last two points can be summarised by saying that linear models have low variance but high
bias. Such models are often preferable when you have limited data and want to avoid
overfitting.
High variance–low bias models such as decision trees are preferable if data is large but
underfitting is a concern.
Linear models exist for all predictive tasks, including classification, probability estimation and
regression.
The least-squares method, discussed later in this unit, illustrates this in the simple
case of a single feature, which is called univariate regression.
Rule models:
Learning ordered rule lists
[email protected]
First-order rule learning
First-order rule learning, also known as first-order logic learning or relational learning, is a
branch of machine learning that focuses on learning logical rules from data with relational
structures. Traditional machine learning algorithms often assume that data instances are
independent and identically distributed, which may not hold true for many real-world problems
where relationships and dependencies between instances are crucial.
First-order rule learning extends classical logic and allows for the representation of complex
relationships and dependencies between objects and their attributes. It operates on first-order
logic, which is a formal language for representing knowledge and making logical inferences.
In first-order rule learning, the goal is to learn logical rules from a set of relational data, often
represented in the form of a relational database. These rules describe relationships, patterns,
and dependencies between entities and their attributes. The learned rules can then be used for
various tasks, such as classification, prediction, or generating new knowledge.
There are several approaches to first-order rule learning, including:
1. Inductive Logic Programming (ILP): ILP combines logic programming and machine
learning to induce logical rules from examples and background knowledge. ILP algorithms
typically use a top-down, hypothesis-driven approach to construct rules that accurately describe
the data.
2. Statistical Relational Learning (SRL): SRL methods integrate statistical modeling techniques
with first-order logic to capture both the uncertainty in the data and the relational structure.
SRL algorithms can handle probabilistic reasoning and leverage statistical learning techniques
such as Bayesian inference or Markov logic networks.
3. Relational Reinforcement Learning (RRL): RRL extends reinforcement learning to problems
with relational structures. It aims to learn policies or decision-making rules that consider the
relational dependencies between entities in an environment.
First-order rule learning has applications in various domains, including natural language
processing, social network analysis, bioinformatics, and robotics. By explicitly modeling
relationships and dependencies, it enables the discovery of complex patterns and facilitates
reasoning about structured data.
Linear models:
The least-squares method
The least-squares method is a widely used approach in machine learning for solving regression
problems. It aims to find the best-fitting line or curve that minimizes the sum of the squared
differences between the observed target values and the predicted values.
In the context of linear regression, the least-squares method finds the line that best fits a given
set of data points. The line is defined by the equation:
y = mx + b
where y is the target variable, x is the input variable, m is the slope of the line, and b is the y-
intercept. The goal is to find the values of m and b that minimize the sum of the squared
differences between the observed target values and the predicted values.
To achieve this, the least-squares method minimizes the objective function known as the sum
of squared residuals (SSR):
SSR = Σᵢ (yᵢ − (m·xᵢ + b))²
where yᵢ is the observed target value for the i-th data point, and xᵢ is the corresponding input
value. The sum is taken over all data points in the dataset.
The least-squares method estimates the values of m and b that minimize the SSR by taking the
partial derivatives of the SSR with respect to m and b, setting them to zero, and solving the
resulting equations. The resulting estimates are often referred to as the ordinary least squares
(OLS) estimates.
Once the values of m and b are determined, the fitted line can be used to make predictions for
new input values by plugging them into the equation.
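Setting the two partial derivatives to zero yields the familiar closed-form estimates m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b = ȳ − m·x̄. A minimal sketch in Python (the data points are made up for illustration):

def ols_fit(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    b = y_mean - m * x_mean   # the fitted line passes through (x_mean, y_mean)
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x
m, b = ols_fit(xs, ys)
print(m, b)         # slope near 2, intercept near 0
print(m * 6 + b)    # prediction for a new input x = 6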
The least-squares method is not limited to linear regression. It can also be used for fitting other
types of curves, such as polynomial regression or nonlinear regression. In these cases, the
equation defining the curve may have additional parameters, and the least-squares method aims
to find the values of these parameters that minimize the sum of squared differences.
Overall, the least-squares method is a fundamental technique in machine learning for finding
the best-fitting line or curve in regression problems by minimizing the sum of squared
differences between observed and predicted values.
Perceptrons
• Frank Rosenblatt, an American psychologist, proposed the classical perceptron model
(1958).
• Perceptron is an algorithm for supervised learning of single-layer binary linear classifiers.
The algorithm processes the elements of the training set one at a time, enabling the neuron
to learn from each example.
Types of Perceptrons:
There are two types of Perceptrons:
1) Single layer Perceptrons
2) Multilayer Perceptrons (Deep feedforward networks/ feedforward neural networks)
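A minimal sketch of the single-layer perceptron's learning rule; the AND-gate data, learning rate and epoch count are illustrative assumptions:

def train_perceptron(samples, epochs=10, lr=0.1):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            # Step activation: predict 1 if w.x + b > 0, else 0
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            # Update only on mistakes: w <- w + lr * (target - pred) * x
            err = target - pred
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_gate))  # weights that separate the AND gate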
Support Vector Machines (SVM)
The SVM algorithm finds the best line or decision boundary; this best boundary is called a
hyperplane. The algorithm selects the points from both classes that lie closest to the
hyperplane; these points are called support vectors. The distance between the support vectors
and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data: if a dataset can be
classified into two classes using a single straight line, it is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data: if a dataset
cannot be classified using a straight line, it is termed non-linear data, and the
classifier used is called a Non-linear SVM classifier.
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Kernels in Support Vector Machine
The most interesting feature of SVM is that it can work even with a non-linear dataset; for
this we use the "Kernel Trick", which makes it easier to classify the points. Suppose we have
a dataset that is not linearly separable.
Here we cannot draw a single line (or hyperplane) that classifies the points correctly. So
what we do is convert this lower-dimensional space into a higher-dimensional space using
some non-linear function (for example, a quadratic function), which allows us to find a
decision boundary that clearly divides the data points. The functions that help us do this are
called kernels, and which kernel to use is determined largely by hyperparameter tuning.
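A small sketch of this idea: two concentric rings of points are not linearly separable in two dimensions (adding a quadratic feature such as z = x² + y² would separate them with a plane), but an RBF-kernel SVM separates them without ever constructing the mapping explicitly. The synthetic data below is an assumption for illustration:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.where(np.arange(100) < 50, 1.0, 3.0)  # inner vs. outer ring
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = (radii > 2).astype(int)

# The RBF kernel computes similarities in the higher-dimensional
# space directly, so the mapping never has to be built.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))  # the rings are separated almost perfectly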
Different Kernel functions
Some kernel functions which you can use in SVM are given below:
1. Polynomial kernel
2. Sigmoid kernel
3. RBF (Radial Basis Function) kernel
4. Bessel function kernel
5. Anova Kernel
Advantages of SVM
1. SVM works well when the classes are (approximately) linearly separable
2. It is effective in high-dimensional spaces
3. With the help of the kernel trick, many complex non-linear problems can be handled
4. SVM is relatively robust to outliers
5. It is useful for tasks such as image classification
Disadvantages of SVM
1. Choosing a good kernel is not easy
2. Training can be slow and memory-intensive on large datasets