The document discusses non-metric methods for pattern classification, focusing on decision trees, perceptrons, and support vector machines. It explains the construction, advantages, and disadvantages of decision trees, including concepts like overfitting and pruning. Additionally, it covers the workings of perceptrons and support vector machines, detailing their applications in classification problems.
PARUL INSTITUTE OF ENGINEERING & TECHNOLOGY
FACULTY OF ENGINEERING & TECHNOLOGY
PARUL UNIVERSITY
Subject: Pattern Recognition
Chapter 4: Non-metric Methods for Pattern Classification
Computer Science & Engineering
Ishwarlal Rathod (Assistant Prof., PIET-CSE)

Outline
• Concept of construction
• Splitting of nodes
• Choosing of attributes
• Overfitting
• Pruning
• Linear discriminant based algorithms: Perceptron, Support Vector Machines
Decision Tree
• Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.

Why use Decision Trees?
• Decision trees usually mimic human thinking ability while making a decision, so they are easy to understand.
• The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies
• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child Node: The root node of the tree is called the parent node, and the other nodes are called child nodes.

How does the Decision Tree algorithm work?
• In a decision tree, for predicting the class of a given record, the algorithm starts from the root node of the tree.
• The algorithm compares the value of the root attribute with the corresponding record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
• It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the steps below.

Steps
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are the leaf nodes.

Attribute Selection Measures
• While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes.
• To solve such problems there is a technique called the Attribute Selection Measure (ASM).
• Using this measure, we can easily select the best attribute for the nodes of the tree.
• There are two popular ASM techniques:
  • Information Gain
  • Gini Index

Information Gain
• Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:
  Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
• Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
  Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
  where S is the set of samples, P(yes) is the probability of "yes", and P(no) is the probability of "no".

Gini Index
• The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini index.
• It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
• The Gini index can be calculated using the formula below:
  Gini Index = 1 - Σj (Pj)², where Pj is the proportion of samples belonging to class j.
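To make these attribute selection measures concrete, the following is a minimal sketch (not part of the original slides) that computes entropy, information gain, and the Gini index for a small, made-up binary-labelled toy dataset; the function names and data values are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_c P(c) * log2 P(c), over the classes present in `labels`."""
    n = len(labels)
    return -sum((cnt / n) * log2(cnt / n) for cnt in Counter(labels).values())

def gini(labels):
    """Gini Index = 1 - sum_c P(c)^2."""
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy(S) minus the weighted average entropy of the subsets
    obtained by splitting on the given feature."""
    n = len(labels)
    weighted = 0.0
    for value in set(feature_values):
        subset = [lab for lab, f in zip(labels, feature_values) if f == value]
        weighted += (len(subset) / n) * entropy(subset)
    return entropy(labels) - weighted

# Hypothetical toy data: does the "Outlook" attribute help predict "Play"?
play    = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "overcast", "sunny", "rain", "overcast", "rain"]

print("Entropy(S)      :", round(entropy(play), 3))
print("Gini(S)         :", round(gini(play), 3))
print("Gain(S, Outlook):", round(information_gain(play, outlook), 3))
```

Running this on the toy data gives Entropy(S) = 1.0 and a positive information gain for Outlook, which is exactly the signal the tree-building algorithm uses to pick the attribute to split on first.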
Overfitting
• Overfitting refers to the condition where the model fits the training data completely but fails to generalize to unseen test data.
• The overfit condition arises when the model memorizes the noise of the training data and fails to capture the important patterns.
• A perfectly fit decision tree performs well on the training data but performs poorly on unseen test data.
• If the decision tree is allowed to grow to its full depth, the model will overfit the training data.

Preventing overfitting
• There are various techniques to prevent the decision tree model from overfitting:
• Pruning
  • Pre-pruning
  • Post-pruning
• Ensembles
  • Random forest

Pruning
• Pruning is the process of deleting unnecessary nodes from a tree in order to obtain the optimal decision tree.
• A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learned tree without reducing accuracy is therefore known as pruning. Two main types of tree pruning techniques are used:
  • Cost Complexity Pruning
  • Reduced Error Pruning
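As a sketch of how overfitting and pruning play out in practice (not from the original slides), the snippet below uses scikit-learn's DecisionTreeClassifier, which implements CART and supports cost complexity pruning through its `ccp_alpha` parameter; the dataset, split, and alpha value are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: tends to fit the training data perfectly (risk of overfitting).
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)

# Cost complexity pruning: a larger ccp_alpha removes more branches (value chosen for illustration).
pruned_tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.02,
                                     random_state=0).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(f"{name}: train acc = {tree.score(X_train, y_train):.2f}, "
          f"test acc = {tree.score(X_test, y_test):.2f}, leaves = {tree.get_n_leaves()}")
```

Typically the full tree shows perfect training accuracy with more leaves, while the pruned tree is smaller and generalizes at least as well, which is the trade-off pruning is meant to manage.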
Advantages of Decision Trees
• A decision tree is simple to understand, as it follows the same process a human follows while making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think through all the possible outcomes of a problem.
• There is less need for data cleaning compared with other algorithms.

Disadvantages of Decision Trees
• A decision tree may contain many layers, which makes it complex.
• It may have an overfitting issue, which can be addressed using the Random Forest algorithm.
• For a larger number of class labels, the computational complexity of the decision tree may increase.

Linear discriminant based algorithm: Perceptron
• The perceptron is a linear supervised machine learning algorithm used for binary classification.
• It helps to detect certain input data computations in business intelligence.
• The perceptron learning algorithm is considered the most straightforward artificial neural network.
• It is a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.

Basic components of the Perceptron
• Frank Rosenblatt invented the perceptron model as a binary classifier containing three main components:
• Input Nodes or Input Layer: This is the primary component of the perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
• Weight and Bias: The weight parameter represents the strength of the connection between units. The weight is directly proportional to the influence of the associated input on the output. The bias can be thought of as the intercept term in a linear equation.
• Activation Function: This is the final and most important component, which helps to determine whether the neuron will fire or not. The activation function can be considered primarily as a step function.
• Types of activation functions:
  • Sign function
  • Step function
  • Sigmoid function
• The data scientist chooses the activation function based on the problem statement and the desired outputs. The activation function used (e.g., sign, step, or sigmoid) may differ between perceptron models depending on whether the learning process is slow or suffers from vanishing or exploding gradients.

How does the perceptron work?
• The perceptron model begins by multiplying all input values by their weights and adding these products together to create the weighted sum. This weighted sum is then passed to the activation function 'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f'.
• This step function or activation function plays a vital role in ensuring that the output is mapped to the required values, such as (0, 1) or (-1, 1).
• The weight of an input indicates the strength of that node; similarly, the bias value allows the activation function curve to be shifted up or down.
• The perceptron model works in two important steps, as follows:

Step-1
• First, multiply all input values by their corresponding weight values and add them to determine the weighted sum. Mathematically, the weighted sum can be calculated as:
  ∑ wi*xi = w1*x1 + w2*x2 + … + wn*xn
• A special term called the bias 'b' is added to this weighted sum to improve the model's performance:
  ∑ wi*xi + b

Step-2
• An activation function is applied to the above weighted sum, which gives an output either in binary form or as a continuous value, as follows:
  Y = f(∑ wi*xi + b)
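A minimal sketch of these two steps, plus the classic perceptron weight-update rule, is given below (it is not part of the original slides). It assumes a unit step activation and a small, hypothetical linearly separable dataset (a logical AND gate); the learning rate and epoch count are illustrative.

```python
import numpy as np

def step(z):
    """Unit step activation: f(z) = 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def predict(x, w, b):
    """Step-1 and Step-2: weighted sum plus bias, passed through the activation."""
    return step(np.dot(w, x) + b)

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Perceptron learning rule: w <- w + lr * (target - prediction) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            error = target - predict(xi, w, b)
            w += lr * error * xi
            b += lr * error
    return w, b

# Hypothetical, linearly separable toy data: logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
print("weights:", w, "bias:", b)
print("predictions:", [predict(xi, w, b) for xi in X])
```

Because the AND data is linearly separable, the update rule converges after a few epochs and the learned weights reproduce the target labels, which is exactly the setting in which the perceptron is guaranteed to work.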
Types of Perceptron Models
• Based on the number of layers, perceptrons are broadly classified into two major categories:
• Single-layer perceptron model: This is the simplest Artificial Neural Network (ANN) model. A single-layer perceptron consists of a feed-forward network and uses a threshold transfer function to threshold the output. The main objective of the single-layer perceptron model is to classify linearly separable data with binary labels.
• Multi-layer perceptron model: The multi-layer perceptron learning algorithm has the same structure as a single-layer perceptron but includes one or more additional hidden layers, unlike the single-layer perceptron, which has no hidden layer.
• The distinction between these two types of perceptron models is shown in the accompanying figure.

Perceptron Function
• The perceptron function f(x) is obtained by multiplying the input vector 'x' by the learned weight vector 'w' and thresholding the result.
• Mathematically, we can express it as follows:
  f(x) = 1 if w·x + b > 0
  f(x) = 0 otherwise
• 'w' represents the real-valued weight vector
• 'b' represents the bias
• 'x' represents the vector of input values

Limitations of the Perceptron Model
• The output of a perceptron can only be a binary number (0 or 1) due to the hard-limit transfer function.
• A perceptron can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it cannot classify them properly.

Support Vector Machine algorithm
• SVM is one of the most popular supervised learning algorithms, used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that segregates the n-dimensional space into classes, so that a new data point can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.

Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, such data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, such data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.

How does SVM work?
• Linear SVM: The working of the SVM algorithm can be understood with an example. Suppose we have a dataset with two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue.
• Since this is a 2-D space, we can separate the two classes using just a straight line, but there can be multiple lines that separate these classes.
• The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called the hyperplane.
• The SVM algorithm finds the points of both classes that lie closest to the boundary. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin.
• The goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
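The maximum-margin idea can be sketched in code. The snippet below (not from the slides) fits scikit-learn's SVC with a linear kernel on a hypothetical two-class toy dataset and reports the support vectors and the margin width; the data points and the C value are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D, two-class, linearly separable toy data (features x1, x2).
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],    # class "blue" (label 0)
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])   # class "green" (label 1)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C keeps the margin essentially "hard" for this separable example.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                   # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # width of the maximum-margin band

print("support vectors:\n", clf.support_vectors_)
print("hyperplane: %.3f*x1 + %.3f*x2 + %.3f = 0" % (w[0], w[1], b))
print("margin width: %.3f" % margin)
print("prediction for (3, 2):", clf.predict([[3.0, 2.0]])[0])
```

Only the points closest to the boundary appear in `support_vectors_`; moving any of the other points would not change the learned hyperplane, which is the defining property of an SVM.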
Non-linear SVM
• If data is linearly arranged, we can separate it with a straight line, but for non-linear data we cannot draw a single straight line.
• To separate such data points, we need to add one more dimension. For linear data we used the two dimensions x and y, so for non-linear data we add a third dimension z, which can be calculated as:
  z = x² + y²
• By adding the third dimension, the sample space becomes linearly separable, and SVM can divide the dataset into classes with a separating hyperplane in this higher-dimensional space.
• Since we are now in a 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space by setting z = 1, the boundary becomes:
  x² + y² = 1
• Hence, we obtain a circular boundary of radius 1 in the case of non-linear data.
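As a closing sketch (not part of the original slides), the snippet below generates a hypothetical ring-shaped dataset, applies the z = x² + y² mapping described above, and shows that a linear SVM in the lifted 3-D space separates what a 2-D linear SVM cannot; in practice the same effect is obtained with a non-linear kernel such as RBF. The data generation and class counts are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical non-linear data: an inner cluster (class 0) surrounded by a ring (class 1).
inner = rng.normal(0.0, 0.3, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
ring = np.c_[2.0 * np.cos(angles), 2.0 * np.sin(angles)] + rng.normal(0, 0.1, size=(50, 2))
X = np.vstack([inner, ring])
y = np.array([0] * 50 + [1] * 50)

# Linear SVM in the original 2-D space: cannot separate the ring from the cluster.
linear_2d = SVC(kernel="linear").fit(X, y)

# Lift to 3-D with z = x^2 + y^2: the classes become linearly separable.
Z = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]
linear_3d = SVC(kernel="linear").fit(Z, y)

# An RBF kernel achieves the same separation without constructing z explicitly.
rbf_2d = SVC(kernel="rbf").fit(X, y)

print("2-D linear SVM accuracy :", linear_2d.score(X, y))
print("3-D lifted SVM accuracy :", linear_3d.score(Z, y))
print("2-D RBF SVM accuracy    :", rbf_2d.score(X, y))
```

The lifted-space and RBF classifiers reach (near-)perfect accuracy on this toy data, while the plain 2-D linear SVM does not, illustrating why non-linear SVMs rely on mapping the data into a higher-dimensional space.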