
PARUL INSTITUTE OF ENGINEERING & TECHNOLOGY

FACULTY OF ENGINEERING & TECHNOLOGY

PARUL UNIVERSITY

Subject: Pattern Recognition


Chapter 4 : Non-metric methods for pattern
classification
Computer Science & Engineering
Ishwarlal Rathod (Assistant Prof. PIET-CSE)
Outline
• Concept of construction,
• splitting of nodes,
• choosing of attributes,
• overfitting,
• pruning,
• Linear Discriminant based algorithm: Perceptron, Support
Vector Machines
Decision tree

• Decision Tree is a supervised learning technique that can be used
for both classification and regression problems, but it is mostly
preferred for solving classification problems.
• It is a tree-structured classifier, where internal nodes represent
the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
• It is a graphical representation for getting all the possible
solutions to a problem/decision based on given conditions.
• In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
Construction of decision tree
Why use Decision Trees?
• Decision trees mimic the way humans make decisions, so they are
easy to understand.
• The logic behind a decision tree is easy to follow because it is
laid out in a tree-like structure.
Decision Tree Terminologies
• Root Node: Root node is from where the decision tree starts. It
represents the entire dataset, which further gets divided into two
or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot
be split further once a leaf node is reached.
• Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given conditions.
• Branch/Sub Tree: A subtree formed by splitting a node of the tree.
• Pruning: Pruning is the process of removing the unwanted
branches from the tree.
• Parent/Child node: A node that splits into sub-nodes is called the
parent node, and its sub-nodes are called child nodes.
How does the Decision Tree algorithm Work?
• In a decision tree, for predicting the class of the given dataset, the
algorithm starts from the root node of the tree.
• This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison,
follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute
value with the other sub-nodes and moves further.
• It continues the process until it reaches the leaf node of the tree.
The complete process can be better understood using the below
algorithm:
Steps
• Step-1: Begin the tree with the root node, say S, which contains
the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of
the best attribute.
• Step-4: Generate the decision tree node that contains the best
attribute.
• Step-5: Recursively make new decision trees using the subsets of
the dataset created in Step-3. Continue this process until a stage is
reached where the nodes cannot be classified further; each such
final node is a leaf node. A minimal code sketch of these steps is
shown below.
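The steps above map directly onto library implementations. The following is a minimal, illustrative sketch using scikit-learn's DecisionTreeClassifier; the library choice, the Iris dataset, and all parameter values are our own assumptions for demonstration, not prescribed by the slides.

```python
# Minimal sketch: training a decision tree classifier with scikit-learn.
# Dataset and parameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain (the ASM of Step-2);
# the recursive splitting of Steps 3-5 happens inside fit().
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
```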
Example
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select
the best attribute for the root node and for the sub-nodes.
• To solve this problem, we use a technique called an Attribute
Selection Measure (ASM).
• With this measure, we can easily select the best attribute for
the nodes of the tree.
• There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index
Information Gain
• It is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about
a class.
• According to the value of information gain, we split the node
and build the decision tree.
• A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest
information gain is split first. It can be calculated using the
formula:
Information Gain = Entropy(S) - (Weighted Avg) * Entropy(each feature)
Cont.
• Entropy: Entropy is a metric to measure the impurity in a given
attribute. It specifies randomness in data. Entropy can be
calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
• S= Total number of samples
• P(yes)= probability of yes
• P(no)= probability of no
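As an illustration, here is a small, self-contained Python sketch of the two formulas above; the yes/no toy data is invented purely for demonstration.

```python
# Sketch: entropy and information gain for a binary-labelled dataset,
# following the two formulas above.
import math

def entropy(labels):
    """Entropy(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(labels, feature_values):
    """Entropy(S) minus the weighted average entropy of each subset
    produced by splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels  = ["yes", "yes", "no", "no", "yes"]   # invented toy data
feature = ["sunny", "rainy", "sunny", "rainy", "rainy"]
print(information_gain(labels, feature))
```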
Gini Index
• Gini index is a measure of impurity or purity used while creating
a decision tree in the CART(Classification and Regression Tree)
algorithm.
• An attribute with a low Gini index should be preferred over one
with a high Gini index.
• It only creates binary splits, and the CART algorithm uses the Gini
index to create binary splits.
• Gini index can be calculated using the formula:
Gini(S) = 1 - ∑j (Pj)^2
where Pj is the proportion of samples belonging to class j.
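For concreteness, a tiny Python sketch of this formula (the labels are an invented example):

```python
# Sketch: Gini index of a set of labels, Gini(S) = 1 - sum_j (Pj)^2.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(v) / n) ** 2 for v in set(labels))

print(gini(["yes", "yes", "no", "no"]))  # 0.5, maximum impurity for 2 classes
```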
Overfitting
• Overfitting refers to the condition in which the model fits the
training data completely but fails to generalize to unseen test
data.
• Overfit condition arises when the model memorizes the noise of
the training data and fails to capture important patterns.
• A perfectly fit decision tree performs well for training data but
performs poorly for unseen test data.
• If the decision tree is allowed to train to its full strength, the
model will overfit the training data.
Overfitting
• There are various techniques to prevent the decision tree model
from overfitting.
• Pruning
– Pre-pruning
– Post-pruning
• Ensemble
– Random forest
Pruning
• Pruning is a process of deleting the unnecessary nodes from a tree
in order to get the optimal decision tree.
• A too-large tree increases the risk of overfitting, and a small tree
may not capture all the important features of the dataset.
Therefore, a technique that decreases the size of the learning tree
without reducing accuracy is known as pruning. There are mainly
two tree pruning techniques in use:
• Cost Complexity Pruning
• Reduced Error Pruning.
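As an illustration, scikit-learn exposes cost-complexity (post-)pruning through the ccp_alpha parameter; the dataset and the alpha value below are our own illustrative choices, not values from the slides.

```python
# Sketch: cost-complexity (post-)pruning via ccp_alpha. Pre-pruning would
# instead cap growth up front, e.g. with max_depth or min_samples_leaf.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# The pruned tree is smaller and often generalizes better.
print("full tree leaves:", full.get_n_leaves(), "test acc:", full.score(X_test, y_test))
print("pruned leaves:   ", pruned.get_n_leaves(), "test acc:", pruned.score(X_test, y_test))
```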
Advantages of Decision tree
• It is simple to understand, as it follows the same process that a
human follows when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other
algorithms.
Disadvantages of Decision tree
• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using
the Random Forest algorithm.
• As the number of class labels grows, the computational complexity
of the decision tree may increase.
Linear discriminant based Algorithm: Perceptron
• Perceptron is a linear supervised machine learning algorithm used
for binary classification.
• It can be used to detect patterns in input data, for example in
business intelligence applications.
• The perceptron learning algorithm is regarded as the most
straightforward artificial neural network.
• It is a single-layer neural network with four main parameters:
input values, weights and bias, net sum, and an activation
function.
Basic components of Perceptron
• Frank Rosenblatt invented the perceptron model as a binary
classifier. It contains three main components, as follows:
Contd.
• Input Nodes or Input Layer: This is the primary component of
Perceptron which accepts the initial data into the system for
further processing. Each input node contains a real numerical
value.
• Weight and Bias: The weight parameter represents the strength of
the connection between units. Weight is directly proportional to
the influence of the associated input neuron in deciding the
output. The bias can be thought of as the intercept term in a
linear equation.
Contd.
• Activation Function: This final component determines whether the
neuron will fire or not. The activation function can be considered
primarily as a step function.
• Types of Activation functions:
• Sign function
• Step function, and
• Sigmoid function
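For concreteness, the three activation functions listed above can be written as follows (a plain-Python sketch):

```python
# Sketch: the three activation functions named above.
import math

def sign(x):
    """Sign function: outputs -1 or +1."""
    return -1.0 if x < 0 else 1.0

def step(x):
    """Step function: outputs 0 or 1."""
    return 0.0 if x < 0 else 1.0

def sigmoid(x):
    """Sigmoid function: smooth output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
```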
Contd.
• The data scientist chooses the activation function (e.g. sign,
step, or sigmoid) based on the problem statement and the desired
outputs, for instance depending on whether the learning process
is slow or suffers from vanishing or exploding gradients.
How does perceptron work?
• The perceptron model begins with the multiplication of all input
values and their weights, then adds these values together to
create the weighted sum. Then this weighted sum is applied to
the activation function 'f' to obtain the desired output. This
activation function is also known as the step function and is
represented by 'f'.
Contd.
• This step function or Activation function plays a vital role in
ensuring that output is mapped between required values (0,1) or
(-1,1).
• It is important to note that the weight of input is indicative of the
strength of a node.
• Similarly, an input's bias value gives the ability to shift the
activation function curve up or down.
• Perceptron model works in two important steps as follows:
Step-1
• First, multiply all input values with their corresponding weight
values, and then add them to determine the weighted sum.
Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
• Add a special term called the bias 'b' to this weighted sum to
improve the model's performance:
∑wi*xi + b
Step-2
• An activation function is applied to the above-mentioned weighted
sum, which gives us an output either in binary form or as a
continuous value, as follows:
Y = f(∑wi*xi + b)
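Putting Step-1 and Step-2 together, the following is a small plain-Python sketch of a perceptron. The weight-update rule used for training is the standard perceptron learning rule, which these slides do not spell out, and the logical-AND example is our own illustration.

```python
# Sketch: perceptron forward pass (Steps 1-2) plus the classic update rule.
def predict(w, b, x):
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b  # Step-1
    return 1 if weighted_sum > 0 else 0                      # Step-2 (step function)

def train(samples, labels, epochs=10, lr=0.1):
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(w, b, x)  # 0 when the prediction is correct
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Learn logical AND, a linearly separable problem.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, b = train(X, y)
print([predict(w, b, x) for x in X])  # expected: [0, 0, 0, 1]
```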
Types of Perceptron Models
• Based on the number of layers, perceptrons are broadly
classified into two major categories:
• Single layer perceptron model: It is the simplest Artificial Neural
Network (ANN) model. A single-layer perceptron model consists
of a feed-forward network and includes a threshold transfer
function for thresholding on the Output. The main objective of
the single-layer perceptron model is to classify linearly separable
data with binary labels.
Contd.
• Multi-Layer Perceptron Model: The multi-layer perceptron has the
same basic structure as a single-layer perceptron but adds one or
more hidden layers between the input and the output, which a
single-layer perceptron does not have. The distinction between
these two types of perceptron models is shown in the figure below.
Perceptron Function
• The perceptron function f(x) is obtained as output by multiplying
the input 'x' with the learned weight coefficient 'w' and adding
the bias. Mathematically, we can express it as follows:
• f(x) = 1 if w.x + b > 0
• otherwise, f(x) = 0
• 'w' represents the real-valued weight vector
• 'b' represents the bias
• 'x' represents the vector of input values.
Limitations of Perceptron Model
• The output of a perceptron can only be a binary number (0 or 1)
due to the hard limit transfer function.
• Perceptron can only be used to classify linearly separable sets of
input vectors. If the inputs are not linearly separable, the
perceptron cannot classify them properly.
Support vector machine algorithm
• SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in
Machine Learning.
• The goal of the SVM algorithm is to create the best line or
decision boundary that can segregate n-dimensional space into
classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is
called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and
hence the algorithm is termed Support Vector Machine (SVM).
Contd.
Example
Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data: if a
dataset can be classified into two classes by a single straight
line, the data is termed linearly separable, and the classifier
used is called a linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable
data: if a dataset cannot be classified by a straight line, the
data is termed non-linear, and the classifier used is called a
non-linear SVM classifier.
How does SVM work?
• Linear SVM:
The working of the SVM algorithm can be understood by using
an example. Suppose we have a dataset that has two tags
(green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify a pair (x1, x2) of
coordinates as either green or blue. Consider the image below:
Contd.
• Since this is a 2-D space, we can easily separate the two classes
with a straight line. But there can be multiple lines that
separate these classes. Consider the image below:
Contd.
• Hence, the SVM algorithm helps to find the best line or decision
boundary; this best boundary is called a hyperplane.
• The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors.
• The distance between the support vectors and the hyperplane is
called the margin, and the goal of SVM is to maximize this margin.
• The hyperplane with the maximum margin is called the optimal
hyperplane. A small code sketch follows below.
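As an illustration of these ideas, the following sketch fits a maximum-margin linear classifier with scikit-learn's SVC. The toy coordinates and labels are invented to echo the green/blue example above.

```python
# Sketch: a maximum-margin linear classifier with scikit-learn's SVC.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [3, 3],   # class 0 ("green"), invented points
     [6, 5], [7, 8], [8, 6]]   # class 1 ("blue"), invented points
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The extreme points that define the margin:
print("support vectors:", clf.support_vectors_)
print("prediction for (4, 4):", clf.predict([[4, 4]]))
```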
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a
straight line, but for non-linear data we cannot draw a single
straight line. Consider the image below:
Contd.
So to separate these data points, we need to add one more
dimension. For linear data we have used the two dimensions x and
y, so for non-linear data we will add a third dimension z. It can
be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space will become as in
the image below:
Contd.
• So now, SVM will divide the datasets into classes in the following
way. Consider the below image:
Contd.
• Since we are in 3-D space, the separating boundary looks like a
plane parallel to the x-axis. If we convert it back to 2-D space
with z = 1, it becomes a circle.
• Hence we get a circle of radius 1 in the case of non-linear data.
A hand-computed sketch of this mapping follows below.
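To make the mapping concrete, here is a small plain-Python sketch of the z = x^2 + y^2 lift described above; the sample points are invented for illustration.

```python
# Sketch: the z = x^2 + y^2 mapping applied by hand. Points inside the
# unit circle get z < 1, points outside get z > 1, so a plane at z = 1
# separates them in 3-D -- which is a circle of radius 1 back in 2-D.
def lift(point):
    x, y = point
    return (x, y, x**2 + y**2)

inner = [(0.2, 0.1), (-0.3, 0.4)]   # one class, inside the circle
outer = [(1.5, 0.0), (-1.0, 1.2)]   # other class, outside the circle

for p in inner + outer:
    x, y, z = lift(p)
    label = "inner" if z < 1 else "outer"
    print(f"{p} -> z = {z:.2f} -> {label}")
```

In practice, a non-linear SVM performs such a mapping implicitly through a kernel function, e.g. SVC(kernel="rbf") in scikit-learn, rather than adding dimensions by hand.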
Thank You!!!
www.paruluniversity.ac.in
