Unit-3 Decision Tree Learning (February 26, 2024)
Introduction to Decision Trees
A decision tree is a powerful and popular tool for classification and prediction. It is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Important Terminology related to Decision Trees
Root Node: It represents the entire population or sample and this further gets
divided into two or more homogeneous sets.
Splitting: It is a process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
Parent and Child Node: A node, which is divided into sub-nodes is called a
parent node of sub-nodes whereas sub-nodes are the child of a parent node.
Decision trees classify the examples by sorting them down the tree from the root
to some leaf/terminal node, with the leaf/terminal node providing the classification
of the example.
Each node in the tree acts as a test case for some attribute, and each edge
descending from the node corresponds to the possible answers to the test case. This
process is recursive in nature and is repeated for every subtree rooted at the new
node.
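As a concrete illustration of sorting an example down a tree, here is a minimal Python sketch (the tree, attribute names, and values are hypothetical illustrations, not taken from the notes) that walks a small nested-dictionary tree from the root to a leaf:

# A minimal sketch of classifying an example by sorting it down a decision tree.
# An internal node is {"attribute": name, "branches": {value: subtree}}; a leaf is a class label.
# The "Outlook"/"Humidity" attributes below are illustrative placeholders.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": "Yes",
    },
}

def classify(node, example):
    """Follow the branch matching the example's attribute value until a leaf is reached."""
    while isinstance(node, dict):            # internal node: test an attribute
        value = example[node["attribute"]]   # outcome of the test
        node = node["branches"][value]       # descend along the matching edge
    return node                              # leaf: class label

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # -> "Yes"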
Assumptions while creating Decision Tree
Below are some of the assumptions we make while using Decision tree:
In the beginning, the whole training set is considered as the root.
Feature values are preferred to be categorical. If the values are continuous then
they are discretized prior to building the model.
Records are distributed recursively on the basis of attribute values.
The order in which attributes are placed as the root or as internal nodes of the tree is determined using a statistical approach.
Decision trees follow a Sum of Product (SOP) representation, also known as Disjunctive Normal Form. For a class, every branch from the root of the tree to a leaf node having that class is a conjunction (product) of attribute values, and the different branches ending in that class form a disjunction (sum).
The primary challenge in decision tree implementation is to identify which attribute to consider as the root node and at each level. Handling this is known as attribute selection. We have different attribute selection measures to identify the attribute that can be considered the root node at each level.
The most significant predictor is designated as the root node, splitting is done to
form sub-nodes called decision nodes, and the nodes which do not split further are
terminal or leaf nodes.
In the decision tree, the dataset is divided into homogeneous and non-overlapping
regions. It follows a top-down approach as the top region presents all the
observations at a single place which splits into two or more branches that further
split. This approach is also called a greedy approach, as it only considers the current split without looking ahead at future nodes.
The decision tree algorithm continues running until a stopping criterion, such as a minimum number of observations per node, is reached.
Once a decision tree is built, many nodes may represent outliers or noisy data. Tree
pruning method is applied to remove unwanted data. This, in turn, improves the
accuracy of the classification model.
To find the accuracy of the model, a test set consisting of test tuples and their class labels is used. The percentage of test-set tuples correctly classified by the model gives the accuracy of the model. If the model is found to be accurate, it is used to classify data tuples for which the class labels are not known.
Some of the decision tree algorithms include Hunt's Algorithm, ID3, C4.5, and CART. The algorithm selection is also based on the type of target variable. Let us look at some algorithms used in decision trees:
Attribute Selection Measures for Decision Tree
If the dataset consists of N attributes, then deciding which attribute to place at the root or at different levels of the tree as internal nodes is a complicated step. Just randomly selecting any attribute to be the root cannot solve the issue; a random approach may give us bad results with low accuracy.
For solving this attribute selection problem, researchers worked and devised some
solutions. They suggested using some criteria like :
➢ Entropy,
➢ Information gain,
➢ Gini index,
➢ Gain Ratio,
➢ Reduction in Variance
➢ Chi-Square
These criteria calculate a value for every attribute. The values are sorted, and attributes are placed in the tree by following that order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root.
While using Information Gain as a criterion, we assume attributes to be categorical,
and for the Gini index, attributes are assumed to be continuous.
occupation, product, and various other variables. In this case, we are predicting
values for the continuous variables.
#1) Learning Step: The training data is fed into the system to be analyzed by a
classification algorithm. In this example, the class label is the attribute i.e. “loan
decision”. The model built from this training data is represented in the form of
decision rules.
#2) Classification: The test dataset is fed to the model to check the accuracy of the classification rules. If the model gives acceptable results, then it is applied to a new dataset with unknown class labels.
Decision tree algorithms fall under the category of supervised learning. They can be used to solve both regression and classification problems. A decision tree uses a tree representation to solve the problem, in which each leaf node corresponds to a class label and attributes are represented on the internal nodes of the tree. We can represent any boolean function on discrete attributes using a decision tree.
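To make the claim about boolean functions concrete, here is a small sketch (the particular function and its encoding are my own illustration, not from the notes) that represents the boolean function (A AND B) OR (NOT A AND C) as a decision tree and checks it against the direct formula:

from itertools import product

# Decision tree for f(A, B, C) = (A and B) or ((not A) and C),
# encoded as nested dicts: test A at the root, then B or C on the branches.
tree = {
    "attribute": "A",
    "branches": {
        1: {"attribute": "B", "branches": {1: 1, 0: 0}},   # A = 1: the answer is B
        0: {"attribute": "C", "branches": {1: 1, 0: 0}},   # A = 0: the answer is C
    },
}

def classify(node, example):
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

# Every root-to-leaf path ending in 1 is a conjunction of attribute values;
# the function is the disjunction (sum) of those paths: A·B + A'·C.
for a, b, c in product([0, 1], repeat=3):
    assert classify(tree, {"A": a, "B": b, "C": c}) == int((a and b) or ((not a) and c))
print("The tree computes (A and B) or (not A and C) for all 8 inputs.")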
Below are some assumptions that we made while using the decision tree:
• At the beginning, we consider the whole training set as the root.
• Feature values are preferred to be categorical. If the values are continuous
then they are discretized prior to building the model.
• On the basis of attribute values, records are distributed recursively.
• We use statistical methods for ordering attributes as root or the internal node.
As the image above shows, the decision tree works on the Sum of Product form, which is also known as Disjunctive Normal Form. In the image, we are predicting the use of a computer in people's daily life.
• Decision trees are prone to errors in classification problems with many
classes and a relatively small number of training examples.
• Decision trees can be computationally expensive to train. At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
Recursive Induction
Recursion and induction both belong to mathematics, and the terms are sometimes used interchangeably, but there are differences between them.
Recursion is a process in which a function is repeated again and again until some base condition is satisfied. It repeats and uses its previous values to form a sequence. The procedure applies a certain relation to the given function again and again until some base condition is met. It consists of two components:
1) Base condition: In order to stop a recursive function, a condition is needed.
This is known as a base condition. Base condition is very important. If the base
condition is missing from the code then the function can enter into an infinite loop.
2) Recursive step: It divides a big problem into small instances that are solved by
the recursive function and later on recombined in the results.
Let a_1, a_2, ..., a_n be a sequence. A recursive formula is given, for example, by:
a_n = a_(n-1) + a_1
Example: The definition of the Fibonacci series is a recursive one. It is often given by the relation:
F_n = F_(n-1) + F_(n-2), where F_0 = 0 and F_1 = 1
Example: 0, 1, 1, 2, 3, 5, 8, 13, ...
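A minimal Python sketch of the Fibonacci definition above as a recursive function, showing the base condition and the recursive step:

def fib(n):
    """Return F_n using the recursive definition F_n = F_(n-1) + F_(n-2)."""
    if n < 2:                            # base condition: F_0 = 0, F_1 = 1, stops the recursion
        return n
    return fib(n - 1) + fib(n - 2)       # recursive step: break the problem into smaller instances

print([fib(n) for n in range(9)])        # [0, 1, 1, 2, 3, 5, 8, 13, 21]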
Induction
Induction is the branch of mathematics that is used to prove a result, or a formula,
or a statement, or a theorem. It is used to establish the validity of a theorem or
result. It has two working rules:
1) Base Step: It helps us to prove that the given statement is true for some initial
value.
2) Inductive Step: It states that if the theorem is true for the nth term, then the
statement is true for (n+1)th term.
Differences between recursion and induction:
• Recursion starts from the nth term and works down to the base case; induction starts from the initial value and works up to the (n+1)th term.
• In recursion, we backtrack at each step, replacing the previous values with answers computed using the function; in induction, we prove that the statement is true for n = 1, assume it is true for n = k, and then prove it for n = k + 1.
• A recursive function is always called to find successive terms; in induction, statements or theorems are proved and no terms are computed.
Decision Tree Induction
Decision tree induction is the method of learning the decision trees from the training
set. The training set consists of attributes and class labels.
A decision tree is a flowchart-like tree structure that is built from the training set tuples. The dataset is broken down into smaller subsets that are represented as the nodes of a tree. The tree structure has a root node, internal nodes (decision nodes), leaf nodes, and branches.
The root node is the topmost node. It represents the best attribute selected for classification. The internal nodes (decision nodes) represent a test on an attribute of the dataset, while a leaf node (terminal node) represents the classification or decision label. The branches show the outcomes of the tests performed.
Some decision trees only have binary nodes, meaning exactly two branches per node, while other decision trees are non-binary.
The image below shows the decision tree for the Titanic dataset to predict
whether the passenger will survive or not.
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the
root node.
The following decision tree is for the concept buy_computer that indicates whether
a customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.
In the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). This algorithm was an extension of the concept learning systems described by E. B. Hunt, J. Marin, and P. Stone.
ID3 was later extended into C4.5. Both ID3 and C4.5 adopt a greedy, top-down approach for constructing decision trees: there is no backtracking, and the trees are constructed in a top-down recursive divide-and-conquer manner.
The ID3 algorithm builds decision trees using a top-down greedy search approach
through the space of possible branches with no backtracking. A greedy algorithm,
as the name suggests, always makes the choice that seems to be the best at that
moment.
Steps of ID3 Algorithm for Decision Tree
The algorithm starts with a training dataset with class labels that is partitioned into smaller subsets as the tree is being constructed.
On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the entropy (H) and information gain (IG) of that attribute.
It then selects the attribute which has the smallest entropy or, equivalently, the largest information gain.
The set S is then split by the selected attribute to produce subsets of the data.
The algorithm continues to recurse on each subset, considering only attributes never selected before. A small sketch of the attribute selection step is given below.
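The attribute selection step just described can be sketched as follows. This is a simplified illustration (the function and variable names are mine, not from any particular library): it computes the entropy of a set of class labels, the information gain of each candidate attribute, and returns the attribute with the largest gain.

import math
from collections import Counter

def entropy(labels):
    """Entropy H of a list of class labels: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting the rows on attribute attr."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """ID3 selection step: pick the unused attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Toy usage with hypothetical weather-style data:
rows = [{"Outlook": "Sunny", "Windy": "No"}, {"Outlook": "Sunny", "Windy": "Yes"},
        {"Outlook": "Rain", "Windy": "No"}, {"Outlook": "Overcast", "Windy": "No"}]
labels = ["No", "No", "Yes", "Yes"]
print(best_attribute(rows, labels, ["Outlook", "Windy"]))  # -> "Outlook"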
#1) Initially, there are three parameters i.e. attribute list, attribute selection
method and data partition. The attribute list describes the attributes of the
training set tuples.
#2) The attribute selection method describes the method for selecting the best
attribute for discrimination among tuples. The methods used for attribute selection
can either be Information Gain or Gini Index.
#3) The structure of the tree (binary or non-binary) is decided by the attribute
selection method.
#4) When constructing a decision tree, it starts as a single node representing the
tuples.
#5) If the root node tuples represent different class labels, then it calls an attribute
selection method to split or partition the tuples. The step will lead to the formation
of branches and decision nodes.
#6) The splitting method determines which attribute should be selected to partition the data tuples. It also determines the branches to be grown from the node according to the test outcomes. The main motive of the splitting criterion is that the partition at each branch of the decision tree should be as pure as possible, i.e., the tuples in a partition should ideally belong to the same class.
An example of a splitting attribute is shown below:
#7) The above partitioning steps are followed recursively to form a decision tree
for the training dataset tuples.
#8) The partitioning stops only when either all of the partitions are pure or the remaining tuples cannot be partitioned further.
#9) The computational complexity of the algorithm is O(n × |D| × log |D|), where n is the number of attributes describing the tuples in training dataset D and |D| is the number of training tuples.
Input:
Data partition D, which is a set of training tuples
and their associated class labels;
attribute_list, the set of candidate attributes.
Method: Generate_decision_tree(D, attribute_list)
    create a node N;
    if the tuples in D are all of the same class C then
        return N as a leaf node labeled with the class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the majority class in D;
    apply the attribute selection method to D and attribute_list to find
        the best splitting attribute, label node N with it, and remove it from attribute_list;
    for each outcome j of the test at node N
        let Dj be the set of tuples in D satisfying outcome j;
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else
            attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;
Tree Pruning
Pruning is the method of removing unwanted branches from the decision tree. Some branches of the decision tree might represent outliers or noisy data.
Tree pruning is the method to reduce the unwanted branches of the tree. This will
reduce the complexity of the tree and help in effective predictive analysis. It
reduces the overfitting as it removes the unimportant branches from the trees.
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
• Number of leaves in the tree, and
• Error rate of the tree.
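These two parameters are commonly combined into a single cost-complexity score of the form R_alpha(T) = R(T) + alpha × |leaves(T)|; this CART-style formulation and the numbers below are my own illustration, since the notes only name the two parameters:

def cost_complexity(error_rate, num_leaves, alpha):
    """Cost complexity of a (sub)tree: its error rate plus a penalty of
    alpha per leaf; a larger alpha favours smaller (more heavily pruned) trees."""
    return error_rate + alpha * num_leaves

# A fully grown tree vs. a pruned subtree, with illustrative numbers:
print(cost_complexity(error_rate=0.05, num_leaves=20, alpha=0.01))  # 0.25
print(cost_complexity(error_rate=0.10, num_leaves=5,  alpha=0.01))  # 0.15 -> pruned tree preferred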
In the Decision Tree, the major challenge is the identification of the attribute
for the root node at each level. This process is known as attribute selection.
The best split is the one that separates the classes most accurately based on a feature. Decision trees use information gain and entropy to select the feature that gives the best split.
For example :
Let us say we have to classify the type of people coming to the theatre as couples,
friends, family and the attributes are show timings, number of tickets, etc.
We know that couples get 2 tickets, families get 3 or 4 tickets, and a group of friends might get more than 4 tickets (in most cases).
Therefore, to find the class, the number of tickets might be a better split than show timings.
The algorithm uses a measure called information gain, which is calculated for each attribute; it basically tells us how much information is gained if that particular attribute is chosen as the split.
Therefore, the attribute with the maximum information gain is chosen to be the best split.
Decision Tree
Basically, the decision tree algorithm is a supervised learning algorithm that can be used in both classification and regression analysis. Unlike linear algorithms, decision tree algorithms are capable of dealing with nonlinear relationships between variables in the data.
The above diagram is a representation of the workflow of a basic decision tree, where a student needs to decide whether or not to go to school. In this example, the decision tree decides based on certain criteria. The rectangles in the diagram can be considered the nodes of the decision tree, and the splits on the nodes are what let the algorithm make a decision. In the above example, we have only two variables, which makes it easy to see where and on which node to split. To perform the right split of the nodes on a dataset with many variables, information gain comes into the picture.
Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, S_v is the subset of S with A = v, and Values(A) is the set of all possible values of A. Then the information gain of A is
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) × Entropy(S_v)
The information gain in a decision tree can be understood as the amount of information gained about the class when the instances in a node are split for making further decisions. To understand information gain, let's take an example of three nodes.
As we can see, in these three nodes we have data from two classes: in node 3 we have data for only one class, in node 2 we have less data for the second class than for the first class, and node 1 is balanced. From this we can say that in node 3 we do not need to make any further decision, because all the instances belong to the first class, whereas in node 1 there is a 50% chance of either class. We therefore need more information in node 1 than in the other nodes to describe a decision, so the information that can be gained by splitting node 1 is higher.
From the above, we can say that balanced (most impure) nodes require more information to describe. Let's take a look at the image below of two nodes with different impurities.
Here we can see that the split on the right side gives us heterogeneous nodes, whereas the split on the left side gives us homogeneous nodes. As discussed above, the split on the left has more information gain than the other, and from this we can infer that an increase in information gain gives more homogeneous, or pure, nodes.
• Use info gain to choose which attribute to label each node with
Note: No root-to-leaf path should contain the same discrete attribute twice
• Recursively construct each subtree on the subset of training instances that would be classified down that branch of the tree.
• If no attributes remain, label with a majority vote of training instances left at
that node
• If no instances remain, label with a majority vote of the parent’s training
instances.
Example: Now, let us draw a Decision Tree for the following data using
Information gain. Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Here, we have 3 features and 2 output classes. To build a decision tree using information gain, we will take each of the features and calculate the information gain for each feature.
Split on feature X
Split on feature Y
Split on feature Z
From the above images, we can see that the information gain is maximum when we make a split on feature Y, so the best-suited feature for the root node is feature Y (the short computation after this paragraph reproduces these values). Now we can see that when splitting the dataset by feature Y, each child contains a pure subset of the target variable, so we don't need to split the dataset any further. The final tree for the above dataset therefore has Y at the root, with a leaf of class I for Y = 1 and a leaf of class II for Y = 0.
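As a check on the example above, the following sketch (the helper functions are my own, not from the notes) computes the information gain for splits on X, Y, and Z and confirms that Y gives the maximum gain:

import math
from collections import Counter

data = [  # (X, Y, Z, class) from the training set above
    (1, 1, 1, "I"), (1, 1, 0, "I"), (0, 0, 1, "II"), (1, 0, 0, "II"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_index):
    labels = [row[3] for row in data]
    groups = {}
    for row in data:
        groups.setdefault(row[feature_index], []).append(row[3])
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

for name, idx in [("X", 0), ("Y", 1), ("Z", 2)]:
    print(name, round(info_gain(idx), 3))
# X 0.311, Y 1.0, Z 0.0  -> Y is the best split, as stated above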
What is Entropy?
Entropy is a measure of the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content. If X_i are the possible outcomes and p_i their probabilities, the entropy is
Entropy = − Σ_i p_i × log2(p_i)
Example:
For the set X = {a, a, a, b, b, b, b, b}, the possible outcomes are a and b, with probabilities p(a) = 3/8 and p(b) = 5/8, so the entropy is −(3/8) log2(3/8) − (5/8) log2(5/8) ≈ 0.954.
As another example, assume a family of 10 members where 5 members are pursuing graduation and 5 are not:
% of pursuing = 50%
% of not pursuing = 50%
Following the above formula of entropy, the probabilities of pursuing and not pursuing are both 0.5, and log2(0.5) = −1, so the entropy is 1 (maximum uncertainty).
Now let's assume a family of 10 members where everyone has already pursued graduation:
% of pursuing = 0%
% of not pursuing = 100%
According to this, the entropy of the situation will be 0, since there is no uncertainty in the outcome.
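These worked values can be checked with a few lines of Python (a small verification sketch):

import math

def entropy(probabilities):
    # H = sum over outcomes of -p * log2(p), with p = 0 terms skipped
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(round(entropy([3/8, 5/8]), 3))   # set {a,a,a,b,b,b,b,b}: about 0.954
print(entropy([0.5, 0.5]))             # 50% pursuing, 50% not: 1.0
print(entropy([0.0, 1.0]))             # 0% pursuing, 100% not: 0.0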
From the above, we can say that if a node contains only one class (that is, the node is pure), the entropy of the data in that node is zero; a split that produces such pure nodes yields higher information gain. If the entropy remains high, the information gain from the split is lower and the resulting nodes can be considered less pure.
One thing that is noticeable here is that information gain and entropy, as used here, work with categorical data.
In the above example, we have seen how to calculate the entropy for a single node. Let's now talk about a decision tree, where the number of nodes is larger. We know that in a decision tree we have parent nodes and child nodes; the topmost parent node is also called the root node of the tree. We start by splitting the parent node, and after the split the weighted average entropy of the child nodes is the final entropy, which is used for calculating the information gain.
Here we start with an example of class 11th and class 12th students, where we have a total of 20 students. The parent node can be split on the basis of performance, or a similar parent node can be split on the basis of class level, as in the image below.
Now according to the performance of the students, we can say
Now the entropy for the parent node can be calculated as shown above; here we can see that the entropy of the parent node is 1.
So far we have calculated the entropy for the parent and child nodes; now the weighted sum of the child entropies gives the weighted entropy of all the nodes.
From this weighted entropy we can say that the split on the basis of performance gives an entropy of around 0.95. Here we can see that the final entropy is lower than the entropy of the parent node, so we can say that the child nodes are purer (contain fewer classes) than the parent node of the tree. A similar procedure can be followed for the split based on the class level.
The same holds for its weighted entropy. The table below shows the information gain values for the example, computed using the entropy.
As we discussed in the earlier section of the article, an increment in information gain corresponds to a more homogeneous split of the node, or the formation of purer nodes. Hence, in the above example, the split based on class level gives us more homogeneous child nodes than the nodes produced by the split on the basis of performance.
Computational Complexity of ML Algorithms
Some algorithmic problems can be stated but are not solvable; that is, one can prove that no program can be written to solve the problem.
A classic example of an unsolvable algorithmic problem is the halting problem,
which states that no program can be written that can predict whether or not any other
program halts after a finite number of steps. The unsolvability of the halting problem
has immediate practical bearing on software development. For instance, it would
be frivolous to try to develop a software tool that predicts whether another program
being developed has an infinite loop in it (although having such a tool would be
immensely beneficial).
Time and space complexity play a very important role while selecting a machine learning algorithm.
Time complexity: the time complexity of an algorithm denotes the total time required by the algorithm to run to completion, as a function of the input size.
Space complexity: the space complexity of an algorithm denotes the total space used or needed by the algorithm for its working, for various input sizes. In simple words, it is the space the algorithm requires to complete the task.
Hard computing deals with problems that have exact solutions, and in which
approximate / uncertain solutions are not acceptable. This is the conventional
computing, and most algorithms courses deal with hard computing.
Let's start by looking at the worst-case time complexity when the data is dense.
When training a decision tree, a split has to be found at each node until a maximum depth d has been reached.
The strategy for finding this split is to look, for each variable (there are p of them), at the different thresholds (there are up to n of them) and evaluate the information gain that is achieved (an evaluation in O(n)). In this naive analysis, that gives a training complexity on the order of O(n² · p · d).
▪ Linear regressions
▪ Support Vector Machine
▪ k-Nearest Neighbours
4. Algorithm Complexity
Machine Learning is primarily about optimization of an objective function. Often
the function is so represented that the target is to reach the global minima. Solving
it involves heuristics, and thereby multiple iterations. In gradient descent for
instance, you need multiple iterations to reach the minima. So given an algorithm,
you can at best estimate the running 'time' for a single iteration.
We are talking about finding Minima of cost functions whose complexity depend
on the ‘value’ of the data and not just the ‘size’ of the data. The cost function is a
function of the dataset. This is a key difference between algorithms used for ML
and others.
Note that this again cannot be a parameter for comparison since for different
algorithms, the objective function would reach a minima in different number of
iterations for different data sets.
Many philosophers throughout history have advocated the idea of parsimony. Aristotle, one of the greatest Greek philosophers, goes as far as to say that "Nature operates in the shortest way possible". As a consequence, humans might be biased as well to choose the simpler explanation from a set of possible explanations with the same descriptive power. This post gives a brief overview of Occam's razor, the relevance of the principle, and ends with a note on the usage of this razor as an inductive bias in machine learning (decision tree learning in particular).
Simpler models are typically defined as models that make fewer assumptions or
have fewer elements, most commonly characterized as fewer coefficients (e.g. rules,
layers, weights, etc.). The rationale for choosing simpler models is tied back to
Occam’s Razor.
The idea is that the best scientific theory is the smallest one that explains all the
facts.
More complex hypotheses may include details of the specific cases that are at hand or easily available and, in turn, may not generalize to new cases. That is, the more assumptions a hypothesis has, the narrower it is expected to be in its application. Conversely, fewer assumptions suggest a more general hypothesis with greater predictive power over more cases.
Imagine, for example, you are trying to predict a student’s college GPA. A simple
model would be one that is based entirely on a student’s SAT score.
While this model is very simple, it might not be very accurate because often a
college student’s GPA is dependent on factors other than just his or her SAT score.
It is severely underfit and inflexible. In machine learning jargon, we would say this
type of model has high bias, but low variance.
In general, the more inflexible a model, the higher the bias; and the more a model fits the noise in the data, the higher the variance. This is known as the bias–variance tradeoff.
Image Source: Medium.com
If the model is too complex and loaded with attributes, it is at risk of capturing noise
in the data that could be due entirely to random chance. It would make amazing
predictions on the training data set, but it would perform poorly when faced with a
new data set. It won’t generalize well because it is severely overfit. It has high
variance and low bias.
A real-world example would be trying to predict a person's college GPA based on his or her SAT score, high school GPA, middle school GPA, socio-economic status, city of birth, hair color, favorite NBA team, favorite food, and average daily sleep duration.
Image Source: Scott Fortmann-Roe
In machine learning there is always this balance of bias vs. variance, inflexibility vs. flexibility, parsimony vs. prodigality.
There are many instances that favor a simpler approach, either as an inductive bias or as a constraint to begin with. Some of them are:
• Studies like this, where the results have suggested that preschoolers are
sensitive to simpler explanations during their initial years of learning and
development.
• Preference for a simpler approach and explanations to achieve the same
goal is seen in various facets of sciences; for instance, the parsimony
principle applied to the understanding of evolution.
• In theology, ontology, epistemology, etc this view of parsimony is used to
derive various conclusions.
• Variants of Occam’s razor are used in knowledge Discovery.
Occam’s razor as an inductive bias in machine learning.
• Inductive bias (or the inherent bias of the algorithm) refers to the set of assumptions the learner uses; here it takes the form of a preference for a simpler hypothesis that best fits the data. Though the razor can be used to eliminate other hypotheses, relevant justification may be needed to do so. Below is an analysis of how this principle is applicable in decision tree learning.
• The decision tree learning algorithms follow a search strategy to search the
hypotheses space for the hypothesis that best fits the training data. For
example, the ID3 algorithm uses a simple to complex strategy starting from
an empty tree and adding nodes guided by the information gain heuristic to
build a decision tree consistent with the training instances.
• Well, there can be many decision trees that are consistent with a given set of training examples, but the inductive bias of the ID3 algorithm results in a preference for simpler (or shorter) trees. This preference bias of ID3 arises from the fact that there is an ordering of the hypotheses in the search strategy. This leads to an additional bias: attributes with high information gain are preferred closer to the root. Therefore, there is a definite order the algorithm follows until it terminates on reaching a hypothesis that is consistent with the training data.
The above image depicts how the ID3 algorithm chooses the nodes in every
iteration. The red arrow depicts the node chosen in a particular iteration while
the black arrows suggest other decision trees that could have been possible in
a given iteration.
• Hence starting from an empty node, the algorithm graduates towards more
complex decision trees and stops when the tree is sufficient to classify the
training examples.
• This example raises a question: does eliminating complex hypotheses bear any consequence on the classification of unobserved instances? Simply put, does the preference for a simpler hypothesis have an advantage? If two decision trees have slightly different training errors but the same validation errors, then it is obvious that the simpler tree of the two will be chosen, since a higher validation error indicates overfitting of the data. Complex trees often have almost zero training error, but their validation errors might be high. This scenario gives a logical reason for a bias towards simpler trees. In addition to that, a simpler hypothesis might prove effective in a resource-limited environment.
• What is overfitting? Consider two hypotheses a and b. Let 'a' fit the training examples perfectly, while hypothesis 'b' has a small training error. If, over the entire set of data (i.e., including the unseen instances), hypothesis 'b' performs better, then 'a' is said to overfit the training data. To best illustrate the problem of overfitting, consider the figure below.
Figures A and B depict two decision boundaries. Assuming the green and
red points represent the training examples, the decision boundary in B
perfectly fits the data thus perfectly classifying the instances, while the
decision boundary in A does not, though being simpler than B. In this
example the decision boundary in B overfits the data. The reason being that
every instance of the training data affects the decision boundary. The added
relevance is when the training data contains noise. For example, assume in
figure B that one of the red points close to the boundary was a noise point.
Then the unseen instances in close proximity to the noise point might be
wrongly classified. This makes the complex hypothesis vulnerable to noise
in the data.
• While the overfitting behaviour of a model can be significantly avoided by settling for a simpler hypothesis, an extremely simple hypothesis may be too abstract to deduce any information needed for the task, resulting in underfitting. Overfitting and underfitting are among the major challenges to be addressed before we zero in on a machine learning model. Sometimes a complex model might be desired; it is a choice dependent on the data available, the results expected, and the application domain.
Before understanding the overfitting and underfitting, let's understand some basic
term that will help to understand this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the
machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance
of the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithms. Or it is the difference
between the predicted values and the actual values.
o Variance: If the machine learning model performs well with the training
dataset, but does not perform well with the test dataset, then variance occurs.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance.
The chances of overfitting increase the more we train our model: the more we train, the higher the chance of ending up with an overfitted model.
Example: The concept of overfitting can be understood from the graph of the linear regression output below.
As we can see from the graph, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not, because the goal of the regression model is to find the best-fit line; a line that chases every point is not a general best fit, and it will generate prediction errors on new data.
Both overfitting and underfitting cause degraded performance of the machine learning model, but the main cause is overfitting, so there are some ways by which we can reduce the occurrence of overfitting in our model (a brief sketch follows the list below):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
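Several of these remedies map directly onto common library options. As a hedged sketch (assuming scikit-learn is available; the synthetic dataset and parameter values are illustrative only), limiting tree depth and applying cost-complexity pruning both restrict model complexity:

# A minimal scikit-learn sketch: restricting depth and pruning both
# reduce the variance of a decision tree (illustrative settings only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)      # grown until leaves are pure
pruned = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,             # early stopping + pruning
                                random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pruned", pruned)]:
    print(name, round(model.score(X_train, y_train), 3), round(model.score(X_test, y_test), 3))
# The full tree typically scores near 1.0 on the training data but lower on the test data,
# while the constrained tree narrows that gap.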
Noisy Data in Machine Learning
Noisy data are data with a large amount of additional meaningless information
called noise. This includes data corruption, and the term is often used as a synonym
for corrupt data. It also includes any data that a user system cannot understand and
interpret correctly. Many systems, for example, cannot use unstructured text. Noisy
data can adversely affect the results of any data analysis and skew conclusions if
not handled properly. Statistical analysis is sometimes used to weed the noise out
of noisy data.
Noisy data are data that is corrupted, distorted, or has a low Signal-to-Noise Ratio.
Improper procedures (or improperly-documented procedures) to subtract out the
noise in data can lead to a false sense of accuracy or false conclusions.
Noisy data unnecessarily increases the amount of storage space required and can adversely affect the results of any data mining analysis.
Noisy data can be caused by hardware failures, programming errors, and gibberish
input from speech or optical character recognition (OCR) programs. Spelling errors,
industry abbreviations, and slang can also impede machine reading.
Noise is an unavoidable problem that affects the data collection and data preparation processes in machine learning applications, where errors commonly occur. Noise has two main sources, described below.
Sources of Noise
Differences in real-world measured data from the true values come from multiple
factors affecting the measurement.
Random noise is often a large component of the noise in data. Random noise in a
signal is measured as the Signal-to-Noise Ratio. Random noise contains almost
equal amounts of a wide range of frequencies and is called white noise (as colors of
light combine to make white). Random noise is an unavoidable problem. It affects
the data collection and data preparation processes, where errors commonly occur.
Noise has two main sources:
Filtering - Improper filtering can add noise if the filtered signal is treated as a
directly measured signal. For example, Convolution-type digital filters such as a
moving average can have side effects such as lags or truncation of peaks.
Differentiating digital filters amplify random noise in the original data.
Outlier data are data that appear to not belong in the data set. It can be caused by
human error such as transposing numerals, mislabeling, programming bugs, etc. If
valid data is identified as an outlier and is mistakenly removed, that also corrupts
results. If actual outliers are not removed from the data set, they corrupt the results
to a small or large degree, depending on circumstances.
Fraud: Individuals may deliberately skew data to influence the results toward a
desired conclusion. Data that looks good with few outliers reflects well on the
individual collecting it, and so there may be incentive to remove more data as
outliers or make the data look smoother than it is.
Types of Noise
A large number of components determine the quality of a dataset. Among them, the
class labels and the attribute values directly influence the quality of a classification
dataset. The quality of the class labels refers to whether the class of each example
is correctly assigned, whereas the quality of the attributes refers to their capability of properly characterizing the examples for classification purposes. If noise affects the attribute values, this capability of characterization, and therefore the quality of the attributes, is reduced. Based on these two information sources, two types of noise can be distinguished in a given dataset.
1. Class Noise (label noise)
This occurs when an example is incorrectly labeled. Class noise can be attributed to
several causes, such as subjectivity during the labeling process, data entry errors, or
inadequate information used to label each example. Class noise is further divided
into two types, such as:
2. Attribute Noise
o Erroneous attribute values: In the figure placed above, the example (1.02,
green, class = positive) has its first attribute with noise since it has the wrong
value.
o Missing or unknown attribute values: In the figure placed above, the
example (2.05, ? class = negative) has attribute noise since we do not know
the value of the second attribute.
o Incomplete attributes or do not care values: In the figure placed above, the
example (=, green, class = positive) has attribute noise since the value of the
first attribute does not affect the rest of the values of the example, including
the class of the example.
Considering class and attribute noise as corruptions of the class labels and attribute
values, respectively, is common in real-world data. Because of this, these types of
noise have also been considered in many works in the literature. For instance, the
authors reached a series of interesting conclusions, showing that attribute noise is
more harmful than class noise or that eliminating or correcting examples in datasets
with class and attribute noise may improve classifier performance. They also
showed that attribute noise is more harmful in those attributes highly correlated with
the class labels. The authors checked the robustness of methods from different
paradigms, such as probabilistic classifiers, decision trees, and instance-based
learners or support vector machines, studying the possible causes of their behaviors.
Removing noise from a data set is termed data smoothing. The following ways can
be used for Smoothing:
1. Binning
Binning is a technique where we sort the data and then partition it into equal-frequency bins. We may then replace the noisy data with the bin mean, the bin median, or the bin boundaries. This method is used to smooth or handle noisy data: first the data is sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin (a small sketch follows the list below).
o Smoothing by bin mean method: In this method, the values in the bin are
replaced by the mean value of the bin.
o Smoothing by bin median: In this method, the values in the bin are replaced
by the median value.
o Smoothing by bin boundary: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
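A minimal sketch of equal-frequency binning with smoothing by bin means (the sample values below are illustrative, not from the notes):

# Equal-frequency binning with smoothing by bin means (illustrative values).
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4                                      # 3 bins of 4 values each

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)      # replace every value in the bin by the bin mean
    smoothed.extend([round(mean, 1)] * len(bin_values))

print(smoothed)
# Bin 1 (4, 8, 9, 15) -> 9.0; Bin 2 (21, 21, 24, 25) -> 22.8; Bin 3 (26, 28, 29, 34) -> 29.2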
2. Regression
This is used to smooth the data and to help handle unnecessary data. For analysis purposes, regression helps decide the suitable variable. Linear regression refers to finding the best line to fit between two variables so that one can be used to predict the other. Multiple linear regression involves more than two variables. Using regression to find a mathematical equation that fits the data helps to smooth out the noise, as in the sketch below.
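As a small sketch of regression-based smoothing (assuming NumPy is available; the noisy series is synthetic), a line fitted to the data can stand in for the noisy observations:

import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=x.size)   # linear trend plus random noise

slope, intercept = np.polyfit(x, y, deg=1)               # fit the best straight line
smoothed = slope * x + intercept                          # fitted values replace the noisy ones

print(round(slope, 2), round(intercept, 2))               # close to the true 2.0 and 1.0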
3. Clustering
This is used for finding the outliers and also in grouping the data. Clustering is
generally used in unsupervised learning.
4. Outlier Analysis
Outliers may be detected by clustering, where similar or close values are organized
into the same groups or clusters. Thus, values that fall far apart from the cluster may
be considered noise or outliers. Outliers are extreme values that deviate from other
observations on data. They may indicate variability in measurement, experimental
errors, or novelty. In other words, an outlier is an observation that diverges from an
overall pattern in a sample. Outliers can be of the following kinds (a small detection sketch follows the list):
o Contextual outliers can be noise in data, such as punctuation symbols when
realizing text analysis or background noise signal when doing speech
recognition.
o Collective outliers can be subsets of novelties in data, such as a signal that
may indicate the discovery of new phenomena.
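The clustering-based idea above can be sketched as follows (assuming scikit-learn is available; the synthetic data and the 2-standard-deviation threshold are illustrative choices, not a prescribed rule):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, size=(50, 2)),       # one dense cluster
                    rng.normal(8, 1, size=(50, 2)),       # a second dense cluster
                    [[20.0, 20.0]]])                      # an extreme point far from both

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
dist_to_center = kmeans.transform(points).min(axis=1)     # distance to the nearest cluster centre

threshold = dist_to_center.mean() + 2 * dist_to_center.std()
outliers = points[dist_to_center > threshold]
print(outliers)                                           # the far-away point is flagged as an outlier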
Data cleaning is an important stage; after all, your results are based on your data. The more dirt in the data, the less accurate your results will prove.
Data Cleaning eliminates noise and missing values. Data Cleaning is just the first
of the many steps for data pre-processing. In addition to the above, data pre-
processing includes Aggregation, Feature Construction, Normalization,
Discretization, Concept hierarchy generation, which mostly deal with making
the data consistent. Data pre-processing, at times, also comprises 90% of the entire
process.