
Unit-3: Decision Tree Learning (February 26, 2024)

Syllabus

Decision tree learning: representing concepts as decision trees. Recursive induction of
decision trees. Picking the best splitting attributes: entropy and information gain.
Searching for simple trees and computational complexity. Occam's razor. Overfitting,
noisy data and pruning.

Introduction of Decision Tree
The decision tree is one of the most popular tools for classification and
prediction. A decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (terminal node) holds a class label.

Classification is a two-step process in machine learning: a learning step and a
prediction step. In the learning step, the model is developed from the given training
data. In the prediction step, the model is used to predict the response for new data.
The decision tree is one of the easiest classification algorithms to understand and
interpret.

A decision tree for the concept PlayTennis.

Construction of Decision Tree: A tree can be "learned" by splitting the source
set into subsets based on an attribute value test. This process is repeated on each
derived subset in a recursive manner called recursive partitioning. The recursion
is completed when all instances in the subset at a node have the same value of the
target variable, or when splitting no longer adds value to the predictions. The
construction of a decision tree classifier does not require any domain knowledge
or parameter setting, and is therefore appropriate for exploratory knowledge
discovery. Decision trees can handle high-dimensional data, and in general the
decision tree classifier has good accuracy. Decision tree induction is a typical
inductive approach to learning classification knowledge.
Decision Tree Representation: Decision trees classify instances by sorting them
down the tree from the root to some leaf node, which provides the classification of
the instance. An instance is classified by starting at the root node of the tree, testing
the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute as shown in the above figure. This
process is then repeated for the subtree rooted at the new node.
The decision tree in the above figure classifies a particular morning according to
whether it is suitable for playing tennis and returns the classification associated
with the corresponding leaf (in this case, Yes or No).

For example, the instance


(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong )
would be sorted down the leftmost branch of this decision tree and would therefore
be classified as a negative instance.
In other words, we can say that the decision tree represents a disjunction of
conjunctions of constraints on the attribute values of instances.
(Outlook = Sunny ^ Humidity = Normal) v (Outlook = Overcast) v (Outlook =
Rain ^ Wind = Weak)
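
As a minimal illustrative sketch (not part of the original notes), the PlayTennis tree in
the figure can be written as nested Python dictionaries, and an instance can be classified
by walking from the root; the attribute and value names follow the figure above:

    # The PlayTennis tree: internal nodes test one attribute,
    # leaves hold the class label ("Yes" / "No").
    play_tennis_tree = {
        "Outlook": {
            "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
            "Overcast": "Yes",
            "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
        }
    }

    def classify(tree, instance):
        """Walk the tree from the root, following the branch that matches
        the instance's value for the attribute tested at each node."""
        if not isinstance(tree, dict):      # reached a leaf node
            return tree
        attribute = next(iter(tree))        # attribute tested at this node
        branch = tree[attribute][instance[attribute]]
        return classify(branch, instance)

    instance = {"Outlook": "Sunny", "Temperature": "Hot",
                "Humidity": "High", "Wind": "Strong"}
    print(classify(play_tennis_tree, instance))   # -> "No" (the negative instance above)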

Decision Tree Algorithm


The decision tree algorithm belongs to the family of supervised learning algorithms.
Unlike many other supervised learning algorithms, the decision tree algorithm can be
used for solving both regression and classification problems.
The goal of using a decision tree is to create a model that can be used to
predict the class or value of the target variable by learning simple decision
rules inferred from prior data (the training data).
In Decision Trees, for predicting a class label for a record we start from the root of
the tree. We compare the values of the root attribute with the record’s attribute. On
the basis of comparison, we follow the branch corresponding to that value and
jump to the next node.

Important Terminology related to Decision Trees
Root Node: It represents the entire population or sample and this further gets
divided into two or more homogeneous sets.
Splitting: It is a process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, then it is called the
decision node.
Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called
pruning. It can be seen as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
Parent and Child Node: A node, which is divided into sub-nodes is called a
parent node of sub-nodes whereas sub-nodes are the child of a parent node.

Decision trees classify the examples by sorting them down the tree from the root
to some leaf/terminal node, with the leaf/terminal node providing the classification
of the example.
Each node in the tree acts as a test case for some attribute, and each edge
descending from the node corresponds to the possible answers to the test case. This
process is recursive in nature and is repeated for every subtree rooted at the new
node.
Assumptions while creating Decision Tree
Below are some of the assumptions we make while using Decision tree:
In the beginning, the whole training set is considered as the root.
Feature values are preferred to be categorical. If the values are continuous then
they are discretized prior to building the model.
Records are distributed recursively on the basis of attribute values.
The order in which attributes are placed as the root or as internal nodes of the tree
is decided using a statistical approach.
Decision Trees follow Sum of Product (SOP) representation. The Sum of product
(SOP) is also known as Disjunctive Normal Form. For a class, every branch from
the root of the tree to a leaf node with that class is a conjunction (product) of attribute
values, and the different branches ending in that class form a disjunction (sum).
The primary challenge in decision tree implementation is to identify which
attribute to consider as the root node and at each level. Handling this is
known as attribute selection. We have different attribute selection measures
to identify the attribute that can be considered as the root node at each level.

How do Decision Trees work?


The decision of where to make splits heavily affects a tree's accuracy. The
decision criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node into two or more
sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-
nodes. In other words, we can say that the purity of the node increases with respect
to the target variable. The decision tree splits the nodes on all available variables
and then selects the split which results in most homogeneous sub-nodes.
A decision tree is a supervised learning algorithm that works for both discrete and
continuous variables. It splits the dataset into subsets on the basis of the most
significant attribute in the dataset. How the decision tree identifies this attribute and
how this splitting is done is decided by the algorithms.

The most significant predictor is designated as the root node, splitting is done to
form sub-nodes called decision nodes, and the nodes which do not split further are
terminal or leaf nodes.
In the decision tree, the dataset is divided into homogeneous and non-overlapping
regions. It follows a top-down approach: the top node holds all the
observations in a single place, which then splits into two or more branches that
split further. This approach is also called a greedy approach, as it only considers the
best split at the current node without looking ahead to future nodes.
The decision tree algorithm continues running until a stopping criterion, such as a
minimum number of observations per node, is reached.

Once a decision tree is built, some nodes may represent outliers or noisy data. Tree
pruning is applied to remove these unwanted branches. This, in turn, improves the
accuracy of the classification model.

To find the accuracy of the model, a test set consisting of test tuples and their class
labels is used. The percentage of test set tuples correctly classified by the model
gives the accuracy of the model. If the model is found to be accurate, then it is
used to classify data tuples for which the class labels are not known.

Some of the decision tree algorithms include Hunt's Algorithm, ID3, C4.5, and
CART. The algorithm selection also depends on the type of target variable. Let
us look at some algorithms used in decision trees:

▪ ID3 → (Iterative Dichotomiser 3)

▪ C4.5 → (successor of ID3)

▪ CART → (Classification And Regression Tree)

▪ CHAID → (Chi-square Automatic Interaction Detection; performs
  multi-level splits when computing classification trees)

▪ MARS → (Multivariate Adaptive Regression Splines)

Attribute Selection Measures for Decision Tree

If the dataset consists of N attributes, then deciding which attribute to place at the
root or at different levels of the tree as internal nodes is a complicated step. Just
randomly selecting any attribute to be the root does not solve the issue; a
random approach may give results with low accuracy.

For solving this attribute selection problem, researchers worked and devised some
solutions. They suggested using criteria such as:

➢ Entropy,
➢ Information gain,
➢ Gini index,
➢ Gain Ratio,
➢ Reduction in Variance
➢ Chi-Square

These criteria calculate a value for every attribute. The values are sorted, and
attributes are placed in the tree in that order, i.e., the attribute with the highest
value (in the case of information gain) is placed at the root.
While using information gain as a criterion, we assume attributes to be categorical,
and for the Gini index, attributes are assumed to be continuous.
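
As a small sketch (the function names are mine, not from the notes), the first two
measures in this list can be computed from the class labels at a node as follows:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels, in bits."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def gini(labels):
        """Gini index of a list of class labels."""
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

    print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940 for a 9/5 class split
    print(gini(["yes"] * 9 + ["no"] * 5))     # ~0.459 for the same split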

Types of Decision Trees


The type of decision tree is based on the type of target variable we have. It can be
of two types:
Categorical Variable Decision Tree: A decision tree with a categorical
target variable is called a categorical variable decision tree.
Continuous Variable Decision Tree: A decision tree with a continuous target
variable is called a continuous variable decision tree.
Example:- Let’s say we have a problem to predict whether a customer will pay
his renewal premium with an insurance company (yes/ no). Here we know that the
income of customers is a significant variable but the insurance company does not
have income details for all customers. Now, as we know this is an important
variable, we can build a decision tree to predict customer income based on
occupation, product, and various other variables. In this case, we are predicting
values for a continuous variable.

Example of Creating a Decision Tree

#1) Learning Step: The training data is fed into the system to be analyzed by a
classification algorithm. In this example, the class label is the attribute "loan
decision". The model built from this training data is represented in the form of
decision rules.

#2) Classification: The test dataset is fed to the model to check the accuracy of the
classification rules. If the model gives acceptable results, then it is applied to a new
dataset with unknown class labels.

The decision tree algorithm falls under the category of supervised learning. It can
be used to solve both regression and classification problems. A decision tree uses
a tree representation in which each leaf node corresponds to a class label and
attributes are represented on the internal nodes of the tree. We can represent any
boolean function of discrete attributes using a decision tree.

Below are some assumptions that we made while using the decision tree:
• At the beginning, we consider the whole training set as the root.
• Feature values are preferred to be categorical. If the values are continuous
then they are discretized prior to building the model.
• On the basis of attribute values, records are distributed recursively.
• We use statistical methods for ordering attributes as root or the internal node.

As you can see from the above image, the decision tree works in Sum of
Products form, which is also known as Disjunctive Normal Form. In the above
image, we are predicting whether a person uses a computer in daily life.

Strengths and Weaknesses of the Decision Tree approach


The strengths of decision tree methods are:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important
for prediction or classification.
• Ease of use: Decision trees are simple to use and don’t require a lot of
technical expertise, making them accessible to a wide range of users.
• Scalability: Decision trees can handle large datasets and can be easily
parallelized to improve processing time.
• Missing value tolerance: Decision trees are able to handle missing values in
the data, making them a suitable choice for datasets with missing or
incomplete data.
• Handling non-linear relationships: Decision trees can handle non-linear
relationships between variables, making them a suitable choice for complex
datasets.
• Ability to handle imbalanced data: Decision trees can handle imbalanced
datasets, where one class is heavily represented compared to the others, by
weighting the importance of individual nodes based on the class distribution.

The weaknesses of Decision Tree methods :


• Decision trees are less appropriate for estimation tasks where the goal is to
predict the value of a continuous attribute.

• Decision trees are prone to errors in classification problems with many
classes and a relatively small number of training examples.
• Decision tree can be computationally expensive to train. The process of
growing a decision tree is computationally expensive. At each node, each
candidate splitting field must be sorted before its best split can be found. In
some algorithms, combinations of fields are used and a search must be made
for optimal combining weights. Pruning algorithms can also be expensive
since many candidate sub-trees must be formed and compared.

Recursive Induction
Recursion and induction both belong to mathematics, and the terms are sometimes used
interchangeably, but there are differences between them.
Recursion is a process in which a function gets repeated again and again until some
base condition is satisfied. It repeats and uses its previous values to form a sequence.
The procedure applies a certain relation to the given function again and again until
some base condition is met. It consists of two components:
1) Base condition: In order to stop a recursive function, a condition is needed.
This is known as the base condition. The base condition is very important: if the base
condition is missing from the code, then the function can enter an infinite loop.
2) Recursive step: It divides a big problem into small instances that are solved by
the recursive function and later recombined into the result.
Let a1, a2, ..., an be a sequence. A recursive formula is given by:
a_n = a_(n-1) + a_1
Example: The definition of the Fibonacci series is a recursive one. It is given
by the relation:
F_N = F_(N-1) + F_(N-2), where F_0 = 0 and F_1 = 1
Example: 0, 1, 1, 2, 3, 5, 8, 13, ...
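
A minimal recursive sketch of this definition in Python (the function name is mine):

    def fib(n):
        """Return the nth Fibonacci number, defined recursively."""
        if n < 2:                          # base condition: F_0 = 0, F_1 = 1
            return n
        return fib(n - 1) + fib(n - 2)     # recursive step

    print([fib(i) for i in range(8)])      # [0, 1, 1, 2, 3, 5, 8, 13]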

Induction
Induction is the branch of mathematics that is used to prove a result, or a formula,
or a statement, or a theorem. It is used to establish the validity of a theorem or
result. It has two working rules:
1) Base Step: It helps us to prove that the given statement is true for some initial
value.
2) Inductive Step: It states that if the statement is true for the nth term, then it
is true for the (n+1)th term.

Example: The assertion that the nth Fibonacci number is at most 2^n.

How to Prove a statement using induction?

Step 1: Prove or verify that the statement is true for n=1


Step 2: Assume that the statement is true for n=k
Step 3: Verify that the statement is true for n=k+1; it can then be concluded that
the statement is true for all n.

Difference between Recursion and Induction:

1. Recursion is the process in which a function is called again and again until some
   base condition is met. Induction is a way of proving a mathematical statement.
2. Recursion is a way of defining in a repetitive manner. Induction is a way of proving.
3. Recursion starts from the nth term and works down to the base case. Induction starts
   from the initial term and works up to the (n+1)th term.
4. Recursion has two components: a base condition and a recursive step. Induction has
   two steps: a base step and an inductive step.
5. In recursion, we backtrack at each step to replace the previous values with the
   answers using the function. In induction, we prove that the statement is true for
   n=1, assume it is true for n=k, and then prove it for n=k+1.
6. Recursion makes no assumptions. Induction makes an assumption for n=k.
7. A recursive function is always called to find successive terms. In induction,
   statements or theorems are proved and no terms are found.
8. Recursion can lead to infinity if no base condition is given. In induction, there is
   no concept of infinity.
Decision Tree Induction

Decision tree induction is the method of learning the decision trees from the training
set. The training set consists of attributes and class labels.

Applications of decision tree induction include : astronomy, financial analysis,


medical diagnosis, manufacturing, and production.

A decision tree is a flowchart tree-like structure that is made from training set tuples.
The dataset is broken down into smaller subsets and is present in the form of nodes
of a tree. The tree structure has a root node, internal nodes or decision nodes, leaf
node, and branches.

The root node is the topmost node. It represents the best attribute selected for
classification. Internal nodes (decision nodes) represent a test on an attribute of
the dataset, while a leaf node (terminal node) represents the classification or decision
label. The branches show the outcomes of the tests performed.

Some decision trees have only binary nodes, meaning exactly two branches per node,
while other decision trees are non-binary.

The image below shows the decision tree for the Titanic dataset to predict
whether the passenger will survive or not.

A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the
root node.
The following decision tree is for the concept buy_computer that indicates whether
a customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −


• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction for Machine Learning: ID3 Algorithm

In the late 1970s and early 1980s, J. Ross Quinlan, a machine learning researcher,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser 3). This
algorithm was an extension of the concept learning systems described by E. B. Hunt,
J. Marin, and P. Stone.
ID3 was later extended into C4.5. Both ID3 and C4.5 follow a greedy top-down
approach for constructing decision trees: there is no backtracking, and the trees are
constructed in a top-down recursive divide-and-conquer manner.
The ID3 algorithm builds decision trees using a top-down greedy search approach
through the space of possible branches with no backtracking. A greedy algorithm,
as the name suggests, always makes the choice that seems to be the best at that
moment.

Steps of ID3 Algorithm for Decision Tree

The algorithm starts with a training dataset with class labels that is partitioned into
smaller subsets as the tree is being constructed.

It begins with the original set S as the root node.

On each iteration, the algorithm iterates through every unused attribute of
the set S and calculates the entropy (H) and information gain (IG) of that attribute.
It then selects the attribute which has the smallest entropy or, equivalently, the
largest information gain.
The set S is then split by the selected attribute to produce subsets of the data.
The algorithm continues to recurse on each subset, considering only attributes never
selected before.

#1) Initially, there are three parameters i.e. attribute list, attribute selection
method and data partition. The attribute list describes the attributes of the
training set tuples.
#2) The attribute selection method describes the method for selecting the best
attribute for discrimination among tuples. The methods used for attribute selection
can either be Information Gain or Gini Index.
#3) The structure of the tree (binary or non-binary) is decided by the attribute
selection method.
#4) When constructing a decision tree, it starts as a single node representing the
tuples.
#5) If the root node tuples represent different class labels, then it calls an attribute
selection method to split or partition the tuples. The step will lead to the formation
of branches and decision nodes.
#6) The splitting method determines which attribute should be selected to
partition the data tuples. It also determines the branches to be grown from the node
according to the test outcomes. The main motive of the splitting criterion is that the
partition at each branch of the decision tree should be as pure as possible, ideally
containing tuples of the same class label.

An example of a splitting attribute is shown below:

a. The partitioning above is for a discrete-valued attribute.

b. The partitioning above is for a continuous-valued attribute.

#7) The above partitioning steps are followed recursively to form a decision tree
for the training dataset tuples.

#8) The partitioning stops only when either all the partitions are made or when the
remaining tuples cannot be partitioned further.

#9) The complexity of the algorithm is O(n × |D| × log|D|), where n is the number
of attributes describing the tuples in training dataset D and |D| is the number of tuples.
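
One way to read this bound (a sketch of the standard reasoning, assuming a reasonably
balanced tree): at each level of the tree, every tuple of D is examined against at most
n attributes, so one level costs on the order of n × |D|, and a tree built from |D|
tuples has O(log|D|) levels, so in LaTeX notation:

    O(n \times |D|) \text{ per level} \times O(\log|D|) \text{ levels}
        = O(n \times |D| \times \log|D|)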

Decision Tree Induction Algorithm


Generating a decision tree from the training tuples of data partition D.
Algorithm: Generate_decision_tree

Input:
Data partition D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and, possibly, either a split point or a splitting subset.
Output:
A Decision Tree

Method:
create a node N;

if tuples in D are all of the same class C then
    return N as a leaf node labeled with class C;

if attribute_list is empty then
    return N as a leaf node labeled
    with the majority class in D;          // majority voting

apply Attribute_selection_method(D, attribute_list)
    to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and
    multiway splits allowed then           // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute;   // remove splitting attribute

for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j;   // a partition
    if Dj is empty then
        attach a leaf labeled with the majority
        class in D to node N;
    else
        attach the node returned by
        Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
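
A minimal Python sketch of the same top-down, recursive procedure (names are mine; it
assumes categorical attributes, uses the entropy / information-gain criterion described
earlier, and falls back to majority voting when no attributes remain):

    import math
    from collections import Counter

    def entropy(rows, target):
        counts = Counter(r[target] for r in rows)
        total = len(rows)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def info_gain(rows, attr, target):
        total = len(rows)
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            remainder += len(subset) / total * entropy(subset, target)
        return entropy(rows, target) - remainder

    def id3(rows, attributes, target):
        classes = [r[target] for r in rows]
        if len(set(classes)) == 1:                 # all tuples in the same class
            return classes[0]
        if not attributes:                         # no attributes left: majority vote
            return Counter(classes).most_common(1)[0][0]
        best = max(attributes, key=lambda a: info_gain(rows, a, target))
        node = {best: {}}
        for value in {r[best] for r in rows}:      # grow a subtree per outcome
            subset = [r for r in rows if r[best] == value]
            remaining = [a for a in attributes if a != best]
            node[best][value] = id3(subset, remaining, target)
        return node

    # Hypothetical usage, with rows given as dictionaries:
    # tree = id3(training_rows, ["Outlook", "Humidity", "Wind"], target="PlayTennis")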

Tree Pruning
Pruning is the method of removing unwanted branches from the decision tree.
Some branches of the decision tree might represent outliers or noisy data.

Tree pruning is the method to reduce the unwanted branches of the tree. This will
reduce the complexity of the tree and help in effective predictive analysis. It
reduces the overfitting as it removes the unimportant branches from the trees.

Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.

Cost Complexity
The cost complexity is measured by the following two parameters −
• Number of leaves in the tree, and
• Error rate of the tree.
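
Post-pruning based on cost complexity is what scikit-learn exposes through the
ccp_alpha parameter. Below is a small, hedged sketch (the dataset and parameter choices
are illustrative only) of selecting an alpha value on held-out data:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Grow a full tree, then walk its cost-complexity pruning path:
    # a larger ccp_alpha trades a higher error rate for fewer leaves.
    full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    path = full_tree.cost_complexity_pruning_path(X_train, y_train)

    best_alpha, best_score = 0.0, 0.0
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        pruned.fit(X_train, y_train)
        score = pruned.score(X_val, y_val)     # accuracy on held-out data
        if score >= best_score:
            best_alpha, best_score = alpha, score

    final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
    final_tree.fit(X_train, y_train)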

How does a decision tree determine the best splitting attributes?

In the Decision Tree, the major challenge is the identification of the attribute
for the root node at each level. This process is known as attribute selection.

The best split is the one that separates the classes most accurately based on a feature.

A decision tree uses information gain and entropy to select the feature which gives
the best split.

For example :

Let us say we have to classify the type of group coming to the theatre as couples,
friends, or family, and the attributes are show timings, number of tickets, etc.
We know that couples get 2 tickets, families get 3 or 4 tickets, and a group of friends
might get more than 4 tickets (in most cases).

Therefore, to find the class, the number of tickets might be a better split than show
timings.

How does a ML algorithm find this using the training dataset?

It uses a measure called information gain, which is calculated for each attribute. It
basically tells us how much information the algorithm gains if that particular
attribute is chosen for the split.

Therefore, the attribute with the maximum information gain is chosen as the
best split.

Decision Tree

Basically, the decision tree algorithm is a supervised learning algorithm that can be
used in both classification and regression analysis. Unlike linear algorithms, decision
tree algorithms are capable of dealing with nonlinear relationships between
variables in the data.

The above diagram is a representation of the workflow of a basic decision tree,
where a student needs to decide whether or not to go to school. In this example, the
decision tree decides based on certain criteria. The rectangles in the diagram
can be considered the nodes of the decision tree, and the splits on the nodes are where
the algorithm makes a decision. In the above example, we have only two variables,
which are very basic and make it easy to understand where and on which node to
split. To perform the right split of the nodes for a dataset with a large number of
variables, information gain comes into the picture.

Information Gain

When we use a node in a decision tree to partition the training
instances into smaller subsets, the entropy changes. Information gain is a measure
of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S
with A = v, and Values(A) is the set of all possible values of A. Then

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

The information gain in a decision tree can be defined as the amount of
information gained at a node by splitting it when making further
decisions. To understand information gain, let's take an example of three nodes.

As we can see, in these three nodes we have data of two classes: in node 3 we have
data for only one class; in node 2 we have less data for the second class
than for the first class; and node 1 is balanced. From this we can say that
in node 3 we don't need to make any decision, because all the instances
belong to the first class, whereas in node 1 there is a 50% chance for
either class. Node 1 therefore requires more information than the other nodes
to describe a decision; that is, the uncertainty (entropy) in node 1 is highest.
From the above, we can say that balanced or most impure nodes require more
information to describe. Let's take a look at the image below showing two splits with
different impurities.

Here we can see that the split on the right side gives us heterogeneous
nodes, whereas the split on the left side gives us homogeneous nodes. As
discussed above, the split on the left has a higher information gain, and from this
we can infer that an increase in information gain corresponds to more
homogeneous or pure nodes.

To measure the information gain we use entropy, which is a quantified
measurement of the amount of uncertainty due to any process or any given
random variable. In general, information gain is the parent node's entropy minus the
weighted average entropy of the child nodes; when the parent entropy is 1 (as in the
examples below), this simplifies to:

Information Gain = 1 − Entropy

Building a Decision Tree using Information Gain — the essentials:

• Start with all training instances associated with the root node.
• Use information gain to choose which attribute to label each node with.
  Note: no root-to-leaf path should contain the same discrete attribute twice.
• Recursively construct each subtree on the subset of training instances that
  would be classified down that path in the tree.
• If all positive or all negative training instances remain, label that node
  "yes" or "no" accordingly.
• If no attributes remain, label with a majority vote of the training instances left at
  that node.
• If no instances remain, label with a majority vote of the parent's training
  instances.

Example: Now, let us draw a Decision Tree for the following data using
Information gain. Training set: 3 features and 2 classes

X  Y  Z  C
1  1  1  I
1  1  0  I
0  0  1  II
1  0  0  II

Here, we have 3 features and 2 output classes. To build a decision tree using
information gain, we take each of the features and calculate the information
gain for each feature.

Split on feature X

Split on feature Y

Split on feature Z

From the above images, we can see that the information gain is maximum when
we split on feature Y. So the best-suited feature for the root node is feature Y.
Now we can see that when splitting the dataset by feature Y, each child contains a
pure subset of the target variable, so we don't need to split the dataset further. The
final tree for the above dataset would look like this:
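
The gains quoted above can also be checked numerically; a minimal sketch (the function
names are mine, and entropy is redefined locally so the snippet is self-contained):

    import math
    from collections import Counter

    rows = [  # (X, Y, Z, class) from the table above
        (1, 1, 1, "I"),
        (1, 1, 0, "I"),
        (0, 0, 1, "II"),
        (1, 0, 0, "II"),
    ]

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def info_gain(feature_index):
        labels = [r[3] for r in rows]
        remainder = 0.0
        for value in {r[feature_index] for r in rows}:
            subset = [r[3] for r in rows if r[feature_index] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(labels) - remainder

    for name, idx in [("X", 0), ("Y", 1), ("Z", 2)]:
        print(name, round(info_gain(idx), 3))
    # X 0.311, Y 1.0, Z 0.0 -> Y gives the largest gain, so it becomes the root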

What is Entropy?
Entropy: Entropy is a measure of the uncertainty of a random variable; it
characterizes the impurity of an arbitrary collection of examples. The higher the
entropy, the higher the information content.

Example:
For the set X = {a, a, a, b, b, b, b, b}
Total instances: 8; instances of a: 3; instances of b: 5
Entropy(X) = −(3/8) log2(3/8) − (5/8) log2(5/8) ≈ 0.954


Information gain comes from information theory. In information theory, the entropy
of any random variable or random process is the average level of uncertainty involved
in the possible outcomes of the variable or process. To understand it better, let's take
the example of a coin flip, where there are only two possibilities: if the probability
of a tail is p, then the probability of a head is 1 − p. The maximum uncertainty occurs
for p = 1/2, when there is no reason to expect one outcome over another; here the
entropy is 1 bit. If the outcome is known in advance (p = 0 or p = 1), there is no
uncertainty and the entropy is 0 bits. Mathematically, the formula for entropy is:

Entropy(X) = − Σ p(Xi) log2 p(Xi)

Where

X = random variable or process

Xi = possible outcomes

p(Xi) = probability of possible outcomes.

So for any random process or variable, once we have calculated the probabilities of
the possible outcomes and, from these probabilities, the entropy, we can easily
calculate the information gain.
Let’s take an example of a family of 10 members, where 5 members are pursuing
their studies and the rest of them have completed or not pursued.

% of pursuing = 50%

% of not pursuing = 50%

Let’s first calculate the entropy for the above-given situation.

Entropy = -(0.5) * log2(0.5) -(0.5) * log2(0.5) = 1

Following the above entropy formula, we have filled in the values: the probabilities
of pursuing and not pursuing are both 0.5, and log base two of 0.5 is −1. Now let's
assume a family of 10 members where everyone has already completed graduation.

% of pursuing = 0%
% of not pursuing = 100%
According to this, the entropy of the situation will be

Entropy = −(0) * log2(0) − (1) * log2(1) = 0   (using the convention 0 · log2(0) = 0)

From the above, we can say that if a node contains only one class (formally, the
node is pure), the entropy of the data in that node will be zero; according to the
information gain formula, the information gained by producing such a node is higher,
and its purity is higher. If the entropy is higher, the information gain will be lower
and the node can be considered less pure.

One thing worth noting here is that information gain and entropy, as presented here,
work only with categorical data.

In the above example, we have seen how we can calculate the entropy for a single
node. Let’s talk about a tree or decision tree where the number of nodes is huge.

Calculate Entropy Step by Step

In a decision tree we have parent nodes and child nodes; the parent node can also be
called the root node of the tree. We start with a split of the parent node, and after
splitting, the weighted average entropy of the resulting nodes will be the final
entropy, which is used for calculating the information gain.

Entropy for the Parent Node

Here we start with an example of class 11th and class 12th students, where we have
a total of 20 students. On the basis of performance, we have one split of the parent
node, and on the basis of class level, the same parent node can be split differently,
as in the image below.

Now according to the performance of the students, we can say

Now the entropy for the parent node will be

Entropy = −(0.5) * log2(0.5) − (0.5) * log2(0.5) = 1

So the entropy of the parent node is 1.

Entropy for Child Node


There is no difference in the entropy formula for a child node; it is the same for
every node, and we can simply put the values into the formula. In one child node,
57% of the students are performing the curricular activity and the others are not:

Entropy = −(0.43) * log2(0.43) − (0.57) * log2(0.57) ≈ 0.98

This is the entropy for one child node. Let's do the same for the other child
node, where 33% of the students are involved in the curricular activity:

Entropy = −(0.33) * log2(0.33) − (0.67) * log2(0.67) ≈ 0.91


Weighted Entropy Calculation

So far, we have calculated the entropy for the parent and the child nodes; the
weighted sum of the child entropies gives the weighted entropy of the split.

Weighted Entropy: (14/20) * 0.98 + (6/20) * 0.91 = 0.959

From the weighted entropy we can say that the split on the basis of performance
gives an entropy of around 0.96. This final entropy is lower than the entropy
of the parent node, so we can say that the child nodes are purer (fewer classes
mixed in a node) than the parent node of the tree. We can follow a similar
procedure for the split based on class:

Entropy for the parent node:

Entropy = −(0.5) * log2(0.5) − (0.5) * log2(0.5) = 1

Entropy for the child nodes:

Class 11th
Entropy = −(0.8) * log2(0.8) − (0.2) * log2(0.2) ≈ 0.722
Class 12th
Entropy = −(0.2) * log2(0.2) − (0.8) * log2(0.8) ≈ 0.722

Weighted entropy:
Weighted Entropy = (10/20) * 0.722 + (10/20) * 0.722 = 0.722
Again we can see that the weighted entropy for the tree is less than the parent
entropy. Using these entropies and the formula of information gain we can
calculate the information gain.

Calculation of Information Gain


The formula of information gain based on entropy, given that the parent entropy here
is 1, is

Information Gain = 1 − Weighted Entropy

The table below shows the information gain values for the example, using the
weighted entropies calculated above:

Split                     Weighted Entropy    Information Gain
Performance of class      0.959               0.041
Class                     0.722               0.278

As discussed earlier, an increase in information gain corresponds to a more
homogeneous split of the node, i.e., the formation of purer nodes. Hence, in the
above example, the split based on class gives us more homogeneous child nodes
than the split based on performance.
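
The numbers in the table can be reproduced directly from the entropy formula; a small
check (node sizes and class proportions are those assumed above):

    import math

    def H(p):
        """Binary entropy (bits) of a node with positive-class proportion p."""
        if p in (0.0, 1.0):
            return 0.0                       # convention: 0 * log2(0) = 0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    # Split on performance: child nodes of size 14 (57% "yes") and 6 (33% "yes").
    w_perf = (14 / 20) * H(0.57) + (6 / 20) * H(0.33)
    print(round(w_perf, 3), round(1 - w_perf, 3))
    # ~0.965 exactly; 0.959 above because the child entropies were first rounded to 0.98 and 0.91

    # Split on class: two child nodes of size 10, each with an 80/20 mix.
    w_class = (10 / 20) * H(0.8) + (10 / 20) * H(0.2)
    print(round(w_class, 3), round(1 - w_class, 3))   # 0.722 and 0.278, matching the table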

Computational Complexity of ML Algorithms

Introduction of Computational Complexity

Computational Complexity, a measure of the amount of computing resources


(time and space) that a particular algorithm consumes when it runs. Computer
scientists use mathematical measures of complexity that allow them to predict,
before writing the code, how fast an algorithm will run and how much memory it
will require. Such predictions are important guides for
programmers implementing and selecting algorithms for real-world applications.

Computational complexity is a continuum, in that some algorithms require linear


time (that is, the time required increases directly with the number of items or nodes
in the list, graph, or network being processed), whereas others require quadratic or
even exponential time to complete (that is, the time required increases with the
number of items squared or with the exponential of that number). At the far end of
this continuum lie intractable problems—those whose solutions cannot be
efficiently implemented. For those problems, computer scientists seek to
find heuristic algorithms that can almost solve the problem and run in a reasonable
amount of time.

Further away still are those algorithmic problems that can be stated but are not
solvable; that is, one can prove that no program can be written to solve the problem.
A classic example of an unsolvable algorithmic problem is the halting problem,
which states that no program can be written that can predict whether or not any other
program halts after a finite number of steps. The unsolvability of the halting problem
has immediate practical bearing on software development. For instance, it would
be frivolous to try to develop a software tool that predicts whether another program
being developed has an infinite loop in it (although having such a tool would be
immensely beneficial).

Time and space complexity play a very important role when selecting a machine
learning algorithm.

Space complexity: The space complexity of an algorithm denotes the total space used
or needed by the algorithm for its working, for various input sizes. In simple words,
it is the space required to complete the task.

Time complexity: The time complexity is the number of operations an algorithm
performs to complete its task with respect to the input size. In simple words, it is
the time required to complete the task.

1. Hard computing vs Soft computing


There are two paradigms of computing - hard computing and soft computing.

Hard computing deals with problems that have exact solutions, and in which
approximate / uncertain solutions are not acceptable. This is the conventional
computing, and most algorithms courses deal with hard computing.

Soft computing, on the other hand, looks at techniques to approximately solve


problems that are not solvable in finite time. Most machine learning algorithms fall
in this category. The quality of the solution improves as you spend more time in
solving the problem.

2. A Theoretical point of view


It is harder than one would think to evaluate the complexity of a machine learning
algorithm, especially as it may be implementation dependent, properties of the data
may lead to other algorithms being used, and the training time often depends on
parameters passed to the algorithm.

Let's start by looking at the worst-case time complexity when the data is dense,
with the following notation:

• n → the number of training samples
• p → the number of features
• ntrees → the number of trees (for methods based on trees)
• nsv → the number of support vectors
• nli → the number of neurons at layer i in a neural network
3. Justifications

Decision Tree based models


Obviously, ensemble methods multiply the complexity of the original model by the
number of “voters” in the model, and replace the training size by the size of each
bag.

When training a decision tree, splits have to be found until a maximum depth d has
been reached.

The strategy for finding a split is to look, for each variable (there are p of them),
at the different thresholds (there are up to n of them) and the information gain that
each achieves (an evaluation in O(n)).

In the Breiman implementation, and for classification, it is recommended to use √p
predictors for each (weak) classifier.
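
Multiplying the factors just listed gives a rough, hedged estimate (a sketch rather
than a tight bound) of the cost of finding one split at a node holding n tuples, in
LaTeX notation:

    C_{\text{split}} \approx \underbrace{p}_{\text{variables}} \times
        \underbrace{n}_{\text{thresholds}} \times \underbrace{O(n)}_{\text{gain evaluation}}
        = O(p \, n^{2})

Following the remark on ensembles above, a forest multiplies this by the number of
trees, with p replaced by √p for classification.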

▪ Linear regressions
▪ Support Vector Machine

▪ k-Nearest Neighbours

4. Algorithm Complexity
Machine Learning is primarily about optimization of an objective function. Often
the function is so represented that the target is to reach the global minima. Solving
it involves heuristics, and thereby multiple iterations. In gradient descent for
instance, you need multiple iterations to reach the minima. So given an algorithm,
you can at best estimate the running 'time' for a single iteration.

We are talking about finding Minima of cost functions whose complexity depend
on the ‘value’ of the data and not just the ‘size’ of the data. The cost function is a
function of the dataset. This is a key difference between algorithms used for ML
and others.

Note that this again cannot be a parameter for comparison, since for different
algorithms the objective function reaches a minimum in a different number of
iterations for different data sets.

What is Occam’s Razor?


Occam’s razor argues that the simplest explanation is the one most likely to be
correct.

Many philosophers throughout history have advocated the idea of parsimony. One of
the greatest Greek philosophers, Aristotle, went as far as to say, "Nature operates
in the shortest way possible." As a consequence, humans may be biased to choose a
simpler explanation from a set of possible explanations with the same descriptive
power. This section gives a brief overview of Occam's razor, the relevance of the
principle, and ends with a note on the usage of this razor as an inductive bias in
machine learning (decision tree learning in particular).

Occam’s razor is a law of parsimony popularly stated as (in William’s words)


“Plurality must never be posited without necessity”. Alternatively, as a heuristic,
it can be viewed as, when there are multiple hypotheses to solve a problem, the
simpler one is to be preferred. It is not clear as to whom this principle can be
conclusively attributed to, but William of Occam’s (c. 1287 – 1347) preference for
simplicity is well documented. Hence this principle goes by the name, “Occam’s
razor”. This often means cutting off or shaving away other possibilities or
explanations, thus “razor” appended to the name of the principle. It should be noted
that these explanations or hypotheses should lead to the same result.

Occam’s Razor for Model Selection


Model selection is the process of choosing one from among possibly many
candidate machine learning models for a predictive modeling project.

It is often straightforward to select a model based on its expected performance, e.g.


choose the model with the highest accuracy or lowest prediction error.

Another important consideration is to choose simpler models over complex models.

Simpler models are typically defined as models that make fewer assumptions or
have fewer elements, most commonly characterized as fewer coefficients (e.g. rules,
layers, weights, etc.). The rationale for choosing simpler models is tied back to
Occam’s Razor.

The idea is that the best scientific theory is the smallest one that explains all the
facts.

Occam’s Razor is an approach to problem-solving and is commonly invoked to


mean that if all else is equal, we should prefer the simpler solutions.

Occam’s Razor: If all else is equal, the simplest solution is correct.


The problem with complex hypotheses with more assumptions is that they are likely
too specific.

They may include details of specific cases that are at hand or easily available, and
in turn, may not generalize to new cases. That is, the more assumptions a hypothesis
has, the more narrow it is expected to be in its application. Conversely, fewer

assumptions suggests a more general hypothesis with greater predictive power to
more cases.

• Simple Hypothesis: Fewer assumptions, and in turn, broad applicability.


• Complex Hypothesis: More assumptions, and in turn, narrow applicability.

This has implications in machine learning, as we are specifically trying to generalize


to new unseen cases from specific observations, referred to as inductive reasoning.
If Occam’s Razor suggests that more complex models don’t generalize well, then
in applied machine learning, it suggests we should choose simpler models as they
will have lower prediction errors on new data.

How is Occam’s Razor Relevant in Machine Learning?


Occam’s Razor is one of the principles that guides us when we are trying to select
the appropriate model for a particular machine learning problem. If the model is too
simple, it will make useless predictions. If the model is too complex (loaded with
attributes), it will not generalize well.

Imagine, for example, you are trying to predict a student’s college GPA. A simple
model would be one that is based entirely on a student’s SAT score.

College GPA = Θ * (SAT Score)

While this model is very simple, it might not be very accurate because often a
college student’s GPA is dependent on factors other than just his or her SAT score.
It is severely underfit and inflexible. In machine learning jargon, we would say this
type of model has high bias, but low variance.

In general, the more inflexible a model, the higher the bias. Also, the noisier the
model, the higher the variance. This is known as the bias–variance tradeoff.
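
For reference, this tradeoff is commonly written as a decomposition of the expected
squared error of a model's prediction f̂(x) for a target y, with σ² the irreducible
noise (a standard result, stated here in LaTeX notation):

    E\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2
        + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2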

Image Source: Medium.com

If the model is too complex and loaded with attributes, it is at risk of capturing noise
in the data that could be due entirely to random chance. It would make amazing
predictions on the training data set, but it would perform poorly when faced with a
new data set. It won’t generalize well because it is severely overfit. It has high
variance and low bias.

A real-world example would be trying to predict a person's college GPA based
on his or her SAT score, high school GPA, middle school GPA, socio-economic
status, city of birth, hair color, favorite NBA team, favorite food, and average daily
sleep duration.

College GPA = Θ1 * (SAT Score) + Θ2 * (High School GPA) + Θ3 * (Middle
School GPA) + Θ4 * (Socio-Economic Status) + Θ5 * (City of Birth) + Θ6 * (Hair
Color) + Θ7 * (Favorite NBA Team) + Θ8 * (Favorite Food) + Θ9 * (Average Daily
Sleep Duration)

Image Source: Scott Fortman Roe

In machine learning there is always this balance of bias vs. variance, inflexibility
vs. flexibility, parsimony vs. prodigality.

There are many events that favor a simpler approach either as an inductive bias or
a constraint to begin with. Some of them are :

• Studies whose results suggest that preschoolers are sensitive to simpler
explanations during their initial years of learning and
development.
• Preference for a simpler approach and explanations to achieve the same
goal is seen in various facets of sciences; for instance, the parsimony
principle applied to the understanding of evolution.
• In theology, ontology, epistemology, etc this view of parsimony is used to
derive various conclusions.
• Variants of Occam’s razor are used in knowledge Discovery.

Occam’s razor as an inductive bias in machine learning.
• Inductive bias (or the inherent bias of the algorithm) are assumptions that

are made by the learning algorithm to form a hypothesis or a generalization


beyond the set of training instances in order to classify unobserved data.
• Occam’s razor is one of the simplest examples of inductive bias. It involves

a preference for a simpler hypothesis that best fits the data. Though the
razor can be used to eliminate other hypotheses, relevant justification may
be needed to do so. Below is an analysis of how this principle is applicable
in decision tree learning.
• The decision tree learning algorithms follow a search strategy to search the

hypotheses space for the hypothesis that best fits the training data. For
example, the ID3 algorithm uses a simple to complex strategy starting from
an empty tree and adding nodes guided by the information gain heuristic to
build a decision tree consistent with the training instances.

The information gain of every attribute (not already included in
the tree) is calculated to infer which attribute to consider as the next
node. Information gain is the essence of the ID3 algorithm. It gives a
quantitative measure of the information that an attribute can provide about
the target variable, i.e., assuming only information about that attribute is
available, how efficiently we can infer the target. It can be defined as:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

• There can be many decision trees that are consistent with a given set
of training examples, but the inductive bias of the ID3 algorithm results in a
preference for simpler (or shorter) trees. This preference bias of
ID3 arises from the fact that there is an ordering of the hypotheses in the
search strategy. This leads to the additional bias that attributes with high
information gain are preferred closer to the root. Therefore, there is a definite
order the algorithm follows until it terminates on reaching a hypothesis that
is consistent with the training data.

The above image depicts how the ID3 algorithm chooses the nodes in every
iteration. The red arrow depicts the node chosen in a particular iteration while
the black arrows suggest other decision trees that could have been possible in
a given iteration.
• Hence starting from an empty node, the algorithm graduates towards more
complex decision trees and stops when the tree is sufficient to classify the
training examples.
• This example raises a question: does eliminating complex hypotheses have
any consequence for the classification of unobserved instances? Simply put,
does the preference for a simpler hypothesis have an advantage? If two
decision trees have slightly different training errors but the same validation
errors, then it is obvious that the simpler tree of the two will be chosen,
since a higher validation error is a symptom of overfitting the data. Complex trees
often have almost zero training error, but their validation errors might be
high. This scenario gives a logical reason for a bias towards simpler trees.
In addition to that, a simpler hypothesis might prove effective in a resource-
limited environment.
• What is overfitting? Consider two hypotheses a and b. Let 'a' fit the training
examples perfectly, while hypothesis 'b' has a small training error. If, over
the entire set of data (i.e., including the unseen instances), hypothesis 'b'
performs better, then 'a' is said to overfit the training data.
To best illustrate the problem of overfitting, consider the figure below.

Figures A and B depict two decision boundaries. Assuming the green and
red points represent the training examples, the decision boundary in B
perfectly fits the data thus perfectly classifying the instances, while the
decision boundary in A does not, though being simpler than B. In this
example the decision boundary in B overfits the data. The reason being that
every instance of the training data affects the decision boundary. The added
relevance is when the training data contains noise. For example, assume in
figure B that one of the red points close to the boundary was a noise point.
Then the unseen instances in close proximity to the noise point might be
wrongly classified. This makes the complex hypothesis vulnerable to noise
in the data.
• While the overfitting behaviour of a model can be significantly
reduced by settling for a simpler hypothesis, an extremely simple
hypothesis may be too abstract to capture the information needed for the
task, resulting in underfitting. Overfitting and underfitting are among the
major challenges to be addressed before we zero in on a machine learning
model. Sometimes a complex model might be desired; it is a choice
dependent on the data available, the results expected, and the application
domain.

Overfitting in Machine Learning


Overfitting and Underfitting are the two main problems that occur in machine
learning and degrade the performance of the machine learning models.

The main goal of every machine learning model is to generalize well.

Here, generalization means the ability of an ML model to provide a suitable output
for a given set of unseen inputs. It means that after being trained on the
dataset, it can produce reliable and accurate output. Hence, underfitting and
overfitting are the two terms that need to be checked to judge the performance of the
model and whether the model is generalizing well or not.

Before understanding the overfitting and underfitting, let's understand some basic
term that will help to understand this topic well:

o Signal: It refers to the true underlying pattern of the data that helps the
machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance
of the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithms. Or it is the difference
between the predicted values and the actual values.
o Variance: If the machine learning model performs well with the training
dataset, but does not perform well with the test dataset, then variance occurs.

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data
points, or more data points than required, in the given dataset. Because of
this, the model starts capturing the noise and inaccurate values present in the dataset,
and all these factors reduce the efficiency and accuracy of the model. An overfitted
model has low bias and high variance.

The chances of overfitting increase the more we train our model: the more we train
it, the higher the chance of producing an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of the overfitting can be understood by the below graph of
the linear regression output:

As we can see from the above graph, the model tries to cover all the data points
present in the scatter plot. It may look efficient, but in reality it is not. The
goal of the regression model is to find the best-fit line; here we have not got a
generalizable fit, so the model will generate prediction errors on new data.

How to avoid Overfitting in a Model

Both overfitting and underfitting degrade the performance of a machine learning
model, but the more common problem is overfitting. There are several ways to reduce
the occurrence of overfitting in our model (a short sketch follows the list below):

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
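
As a small, hedged illustration of two items from this list (limiting tree growth as a
form of early stopping / pre-pruning, and cross-validation), using scikit-learn on a toy
dataset; the parameter values are examples only:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # An unconstrained tree can grow until it memorizes the training data.
    full = DecisionTreeClassifier(random_state=0)

    # "Early stopping" of tree growth: limit depth and require a minimum
    # number of samples per leaf, which acts as a form of pre-pruning.
    stopped = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)

    # Cross-validation estimates how well each model generalizes to unseen data.
    for name, model in [("full tree", full), ("pre-pruned tree", stopped)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(name, round(scores.mean(), 3))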
Noisy Data in Machine Learning
Noisy data are data with a large amount of additional meaningless information
called noise. This includes data corruption, and the term is often used as a synonym
for corrupt data. It also includes any data that a user system cannot understand and
interpret correctly. Many systems, for example, cannot use unstructured text. Noisy
data can adversely affect the results of any data analysis and skew conclusions if
not handled properly. Statistical analysis is sometimes used to weed the noise out
of noisy data.

Noisy data are data that is corrupted, distorted, or has a low Signal-to-Noise Ratio.
Improper procedures (or improperly-documented procedures) to subtract out the
noise in data can lead to a false sense of accuracy or false conclusions.

Data = true signal + noise
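The decomposition above can be imitated with a few lines of NumPy; the sine-wave signal and the noise level below are arbitrary choices for illustration, not part of the original text.

# Minimal sketch of "data = true signal + noise" on synthetic data.
import numpy as np

rng = np.random.RandomState(0)
t = np.linspace(0, 1, 200)
true_signal = np.sin(2 * np.pi * 5 * t)               # the underlying pattern
noise = rng.normal(loc=0.0, scale=0.3, size=t.shape)  # random measurement noise
data = true_signal + noise                            # what is actually observed

print("variance of observed data:", round(float(data.var()), 3),
      "variance of true signal:", round(float(true_signal.var()), 3))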

Noisy data unnecessarily increases the amount of storage space required and can adversely affect the results of any data mining analysis.

Noisy data can be caused by hardware failures, programming errors, and gibberish
input from speech or optical character recognition (OCR) programs. Spelling errors,
industry abbreviations, and slang can also impede machine reading.

Noise is an unavoidable problem that affects the data collection and preparation
processes in machine learning applications, where errors commonly occur. Noise
has two main sources, such as:

1. Implicit errors introduced by measurement tools, such as different types of sensors.
2. Random errors introduced by batch processes or experts when the data are gathered, such as in a document digitalization process.

Sources of Noise

Differences in real-world measured data from the true values come from multiple
factors affecting the measurement.
Random noise is often a large component of the noise in data. Random noise in a
signal is measured as the Signal-to-Noise Ratio. Random noise contains almost
equal amounts of a wide range of frequencies and is called white noise (as colors of
light combine to make white). Random noise is an unavoidable problem. It affects
the data collection and data preparation processes, where errors commonly occur.
Noise has two main sources:

1. Errors introduced by measurement tools.
2. Random errors introduced by processing or by experts when the data is gathered.
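The Signal-to-Noise Ratio mentioned above is commonly estimated as the ratio of signal power to noise power, expressed in decibels. The sketch below uses synthetic data where the clean signal is known; in real applications the signal usually has to be estimated.

# Minimal sketch: estimate the Signal-to-Noise Ratio (in decibels) of a
# noisy measurement when the clean signal is known. Synthetic data only.
import numpy as np

rng = np.random.RandomState(0)
t = np.linspace(0, 1, 1000)
signal = np.sin(2 * np.pi * 5 * t)
noise = rng.normal(scale=0.2, size=t.shape)
measurement = signal + noise

signal_power = np.mean(signal ** 2)
noise_power = np.mean(noise ** 2)
snr_db = 10 * np.log10(signal_power / noise_power)
print(f"estimated SNR = {snr_db:.1f} dB")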

Filtering - Improper filtering can add noise if the filtered signal is treated as a
directly measured signal. For example, Convolution-type digital filters such as a
moving average can have side effects such as lags or truncation of peaks.
Differentiating digital filters amplify random noise in the original data.

Outlier data are data that appear to not belong in the data set. It can be caused by
human error such as transposing numerals, mislabeling, programming bugs, etc. If
valid data is identified as an outlier and is mistakenly removed, that also corrupts
results. If actual outliers are not removed from the data set, they corrupt the results
to a small or large degree, depending on circumstances.

Fraud: Individuals may deliberately skew data to influence the results toward a
desired conclusion. Data that looks good with few outliers reflects well on the
individual collecting it, and so there may be incentive to remove more data as
outliers or make the data look smoother than it is.

Types of Noise

A large number of components determine the quality of a dataset. Among them, the class labels and the attribute values directly influence the quality of a classification dataset. The quality of the class labels refers to whether the class of each example is correctly assigned, whereas the quality of the attributes refers to their capability of properly characterizing the examples for classification purposes; if noise affects the attribute values, this capability of characterization, and therefore the quality of the attributes, is reduced. Based on these two information sources, two types of noise can be distinguished in a given dataset.
1. Class Noise (label noise)

This occurs when an example is incorrectly labeled. Class noise can be attributed to
several causes, such as subjectivity during the labeling process, data entry errors, or
inadequate information used to label each example. Class noise is further divided
into two types, such as:

o Contradictory examples: Duplicate examples that have different class labels. For example, the two examples (0.25, red, class = positive) and (0.25, red, class = negative) are contradictory, since they have the same attribute values but a different class.
o Misclassified examples: Examples that are labeled as a class different from the real one. For example, the example (0.99, green, class = negative) is mislabeled, since its class label is wrong and it should be "positive".

2. Attribute Noise

This refers to corruption in the values of one or more attributes. Examples of attribute noise are:

o Erroneous attribute values: For example, the example (1.02, green, class = positive) has noise in its first attribute, since that value is wrong.
o Missing or unknown attribute values: For example, the example (2.05, ?, class = negative) has attribute noise, since we do not know the value of the second attribute.
o Incomplete attributes or "do not care" values: For example, the example (=, green, class = positive) has attribute noise, since the value of the first attribute does not affect the rest of the values of the example, including its class.
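To make the class-noise and attribute-noise examples above concrete, the sketch below (assuming pandas; the table layout and column names are my own illustration of the values quoted in the text) builds a tiny dataset containing a contradictory pair, a mislabeled example, an erroneous value, and a missing value.

# Minimal sketch mirroring the noise examples quoted above as rows of a
# small table; layout and column names are illustrative only.
import pandas as pd

examples = pd.DataFrame(
    [
        (0.25, "red",   "positive"),  # contradictory pair: same attributes,
        (0.25, "red",   "negative"),  # different class labels (class noise)
        (0.99, "green", "negative"),  # mislabeled: true class is "positive" (class noise)
        (1.02, "green", "positive"),  # erroneous first attribute value (attribute noise)
        (2.05, None,    "negative"),  # missing/unknown attribute value (attribute noise)
    ],
    columns=["x1", "x2", "label"],
)

# Contradictory examples: identical attribute values, conflicting labels.
print(examples[examples.duplicated(subset=["x1", "x2"], keep=False)])

# Missing attribute values.
print(examples[examples["x2"].isna()])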

Considering class noise and attribute noise as corruptions of the class labels and attribute values, respectively, is common in real-world data, and both types of noise have been studied in many works in the literature. For instance, some authors have reached a series of interesting conclusions, showing that attribute noise can be more harmful than class noise and that eliminating or correcting examples in datasets with class and attribute noise may improve classifier performance. They also showed that attribute noise is more harmful in attributes that are highly correlated with the class labels. These studies have also checked the robustness of methods from different paradigms, such as probabilistic classifiers, decision trees, instance-based learners, and support vector machines, studying the possible causes of their behaviors.

How to Manage Noisy Data?

Removing noise from a data set is termed data smoothing. The following ways can
be used for Smoothing:

1. Binning

Binning is a technique in which we sort the data and then partition it into equal-frequency bins. The noisy values can then be replaced with the bin mean, the bin median, or the bin boundaries. The data is first sorted, and the sorted values are then separated and stored in the form of bins. There are three methods for smoothing the data in a bin, listed below and illustrated by the sketch that follows the list.

o Smoothing by bin mean: In this method, each value in the bin is replaced by the mean value of the bin.
o Smoothing by bin median: In this method, each value in the bin is replaced by the median value of the bin.
o Smoothing by bin boundary: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by its closest boundary value.
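A minimal sketch of these three smoothing methods, using only NumPy and a small sorted data set chosen purely for illustration:

# Minimal sketch of equal-frequency binning with smoothing by bin mean,
# bin median, and bin boundaries.
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 4)          # 4 equal-frequency bins of 3 values each

smoothed_mean, smoothed_median, smoothed_boundary = [], [], []
for b in bins:
    smoothed_mean.append(np.full_like(b, b.mean(), dtype=float))
    smoothed_median.append(np.full_like(b, np.median(b), dtype=float))
    lo, hi = b.min(), b.max()
    # Each value is replaced by whichever boundary (min or max) is closer.
    smoothed_boundary.append(np.where(b - lo <= hi - b, lo, hi))

print("by mean:    ", np.concatenate(smoothed_mean))
print("by median:  ", np.concatenate(smoothed_median))
print("by boundary:", np.concatenate(smoothed_boundary))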

2. Regression

Regression is used to smooth the data and helps handle unnecessary (noisy) values when they are present. For analysis purposes, regression also helps to decide which variables are suitable. Linear regression refers to finding the best line to fit between two variables so that one can be used to predict the other, while multiple linear regression involves more than two variables. Using regression to find a mathematical equation that fits the data helps to smooth out the noise, as in the sketch below.
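A minimal sketch of regression-based smoothing, assuming scikit-learn; the synthetic linear relation and the noise level are illustrative only:

# Minimal sketch: fit a simple linear model between two variables and use
# the fitted line in place of the noisy observations.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0 + rng.normal(scale=4.0, size=50)   # noisy linear relation

model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)   # values on the fitted line replace the noisy y's

print("estimated slope:", round(model.coef_[0], 2),
      "estimated intercept:", round(model.intercept_, 2))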

3. Clustering

Clustering is used for finding outliers and also for grouping the data. It is generally used in unsupervised learning.

4. Outlier Analysis

Outliers may be detected by clustering, where similar or close values are organized into the same groups or clusters; values that fall far from any cluster may be considered noise or outliers. Outliers are extreme values that deviate from the other observations in the data. They may indicate variability in measurement, experimental errors, or novelty. In other words, an outlier is an observation that diverges from the overall pattern of a sample. Outliers can be of the following kinds (a small detection sketch follows the list):

o Univariate outliers can be found when looking at a distribution of values in a single feature space.
o Multivariate outliers can be found in an n-dimensional space (of n features). Looking at distributions in n-dimensional spaces is very difficult for the human brain, which is why we often need to train a model to do it for us.
o Point outliers are single data points that lie far from the rest of the distribution.
o Contextual outliers can be noise in data, such as punctuation symbols in text analysis or a background noise signal in speech recognition.
o Collective outliers can be subsets of novelties in data, such as a signal that may indicate the discovery of new phenomena.
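The sketch below illustrates clustering-based outlier analysis under the assumption that scikit-learn is available: points whose distance to their nearest cluster centre is unusually large are flagged as outliers. The synthetic clusters, the two planted outliers, and the 95th-percentile threshold are all arbitrary choices for demonstration.

# Minimal sketch: flag points far from their nearest K-means centre
# as potential outliers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
cluster_a = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))
planted_outliers = np.array([[10.0, -3.0], [-4.0, 8.0]])   # far from both clusters
X = np.vstack([cluster_a, cluster_b, planted_outliers])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

threshold = np.percentile(distances, 95)    # simple illustrative cut-off rule
print("flagged as outliers:")
print(X[distances > threshold])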

Data cleaning is an important stage. After all, your results are based on your data: the dirtier the data, the more inaccurate your results will be.

Data cleaning eliminates noise and missing values, and it is just the first of the many steps of data pre-processing. In addition to the above, data pre-processing includes Aggregation, Feature Construction, Normalization, Discretization, and Concept hierarchy generation, which mostly deal with making the data consistent. Data pre-processing can, at times, comprise as much as 90% of the entire process.
