Chapter 3
Classification and Prediction
CONTENTS
3.1 What is classification? What is prediction?
3.2 Issues regarding classification and prediction
3.3 Classification by decision tree induction
3.4 Bayesian classification
3.5 Support vector machines
3.6 Classification by backpropagation
3.7 Other classification methods
3.7.1 K-nearest neighbor classifier
3.7.2 Neural network
3.7.3 Genetic algorithm
3.8 Prediction
3.9 Classifier accuracy
WHAT IS CLASSIFICATION?
The practice of using data to create predictions or foresee future events is known as machine learning prediction.
Prediction in machine learning refers to the output of an algorithm that has been trained on a historical dataset. The algorithm generates probable values for an unknown variable in each record of the new data.
The purpose of prediction in machine learning is to project a probable data set that relates back to the original data. Put another way, the aim of machine learning is to build models that can recognize patterns in data and use those patterns to make precise predictions about novel, unseen data.
EXAMPLE
A bank loans officer needs analysis of her data in order to learn which loan applicants are "safe" and which are "risky" for the bank.
A marketing manager at All Electronics needs data analysis to help predict whether a customer with a given profile will buy a new computer.
A medical researcher wants to analyse breast cancer data in order to predict which one of three specific treatments a patient should receive.
In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or treatment A, treatment B, or treatment C for the medical data.
These categories can be represented by discrete values, where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatment regimes.
CONT'D
Suppose that the marketing manager would like to predict how
much a given customer will spend during a sale at All Electronics.
This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. Such a model is called a predictor.
Data classification is a two-step process.
A. Building the Classifier or Model
B. Using Classifier for Classification
(The figures illustrating these two steps and the general approach for building classification models are omitted here.)
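As an illustration of the two steps, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on the built-in Iris dataset; the dataset, the model choice, and the 70/30 split are illustrative assumptions, not from the slides.

    # A minimal sketch of the two-step classification process (illustrative choices).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Step A: build the classifier (learning step) from the training data.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)

    # Step B: use the classifier to predict labels for new, unseen tuples.
    y_pred = model.predict(X_test)
    print("Accuracy on held-out data:", accuracy_score(y_test, y_pred))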
ISSUES REGARDING CLASSIFICATION AND PREDICTION
Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.
Data transformation and reduction: The data may be transformed by normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0. In methods that use distance measurements, for example, this would prevent attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges.
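A minimal sketch of the normalization idea, assuming min-max scaling into [0.0, 1.0]; the income values are made up:

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        # Rescale each value linearly from [old_min, old_max] to [new_min, new_max].
        old_min, old_max = min(values), max(values)
        span = old_max - old_min
        return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

    incomes = [12000, 54000, 98000, 30000]  # hypothetical raw incomes
    print(min_max_normalize(incomes))       # smallest maps to 0.0, largest to 1.0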
CLASSIFICATION BY DECISION TREE INDUCTION
Classification is a two-step process in machine learning: a learning step and a prediction step.
In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data.
The decision tree is one of the easiest and most popular classification algorithms to understand and interpret.
Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
NODES IN DECISION TREE
Root Node
Internal Node/Branch Node
Leaf Node
CONT'D
In supervised learning, the target result is already known. Decision trees can be used for both categorical and numerical data. Categorical data represent gender, marital status, etc., while numerical data represent age, temperature, etc.
DECISION TREE RULE GENERATION
Each path from the root to a leaf yields one IF-THEN classification rule. (The tree figure and its extracted rules, through Rule 5, are omitted here.)
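As a hedged illustration of rule generation, the sketch below fits a small tree with scikit-learn and prints its rules; each root-to-leaf path corresponds to one rule. The tiny encoded weather-style dataset is hypothetical.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Encoded attributes: outlook (0=sunny, 1=overcast, 2=rain), humidity (0=normal, 1=high).
    X = [[0, 1], [0, 0], [1, 1], [2, 0], [2, 1]]
    y = ["no", "yes", "yes", "yes", "no"]

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    # export_text prints the flowchart as nested IF-THEN style conditions.
    print(export_text(tree, feature_names=["outlook", "humidity"]))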
THE BENEFITS OF HAVING A DECISION TREE ARE AS FOLLOWS
(The list of benefits was shown as a figure and is omitted here.)
DECISION TREE INDUCTION ALGORITHMS
(The list of algorithms was shown as a figure and is omitted here.)
EXERCISE: Generate rules for the following tree. (The tree figure is omitted.)
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm in 1980 known as ID3 (Iterative Dichotomiser 3). Later, he presented C4.5, which was the successor of ID3.
ID3 and C4.5 adopt a greedy approach. In this algorithm, there is no backtracking; the trees
are constructed in a top-down recursive divide-and-conquer manner.
#1) Initially, there are three parameters, i.e., the attribute list, the attribute selection method, and the data partition.
#2) The attribute selection method describes the method for selecting the best attribute for
discrimination among tuples. The methods used for attribute selection can either be
Information Gain or Gini Index.
#3) The structure of the tree (binary or non-binary) is decided by the attribute selection
method.
#4) When constructing a decision tree, it starts as a single node representing the tuples.
#5) If the root node tuples represent different class labels, then it calls an attribute selection method to split or partition the tuples. This step leads to the formation of branches and decision nodes.
#6) The splitting method determines which attribute should be selected to partition the tuples, and the partitioning then repeats recursively on each branch. (The continuation of this slide, shown as a figure, is omitted.)
INFORMATION GAIN, ENTROPY, AND GINI INDEX
Information gain, entropy, and Gini index are commonly used metrics in decision tree algorithms to
determine the best split when building a tree.
Entropy is a measure of the impurity or uncertainty of a set of data. For a two-class set it ranges from 0 (completely pure) to 1 (completely impure). When building a decision tree, the entropy of a set is calculated before and after a split, and the change in entropy is used to determine the information gain.
Information gain is a measure of the difference in entropy between the set before and after a split. The
attribute that provides the highest information gain is chosen as the split attribute.
The Gini index is another measure of impurity or uncertainty. It ranges from 0 (completely pure) up to a maximum of 1 - 1/k for k classes (0.5 for a two-class set). The Gini index measures the probability of a random sample being incorrectly labeled when it is randomly labeled according to the distribution of the labels in the set.
When building a decision tree, the Gini index of a set is calculated before and after a split, and the change in
Gini index is used to determine the split attribute.
In general, all three metrics can be used in decision tree algorithms to determine the best split attribute. However, some situations may favor one metric over the others. For example, when dealing with binary classification problems, the Gini index is often preferred over entropy because it tends to be more computationally efficient. On the other hand, entropy is preferred when the data set is imbalanced, meaning there is a significant difference in the number of instances belonging to different classes.
Information gain is a popular metric that is often used because it is easy to understand and interpret.
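Below is a minimal sketch of the two impurity measures for a list of class labels; the label lists are made up.

    from collections import Counter
    from math import log2

    def entropy(labels):
        # E = -sum(p_i * log2(p_i)) over the class proportions p_i.
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        # Gini = 1 - sum(p_i^2) over the class proportions p_i.
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(entropy(["yes", "yes", "no", "no"]), gini(["yes", "yes", "no", "no"]))  # 1.0 0.5: maximum impurity
    print(entropy(["yes"] * 4), gini(["yes"] * 4))  # zero impurity: a pure set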
HOW TO SELECT ATTRIBUTES FOR CREATING A TREE?
Attribute selection measures, also called splitting rules, decide how the tuples are going to be split. The splitting criteria are used to best partition the dataset. These measures provide a ranking of the attributes for partitioning the training tuples.
The most popular methods of selecting the attribute are information gain and the Gini index.
#1) Information Gain
This method is the main method used to build decision trees. It reduces the information required to classify the tuples and hence the number of tests needed to classify a given tuple. The attribute with the highest information gain is selected.
The original information needed to classify a tuple in dataset D is given by:

Info(D) = E(S) = -Σᵢ pᵢ log₂(pᵢ)

where pᵢ is the probability that a tuple belongs to class Cᵢ. The information is encoded in bits; therefore, log to the base 2 is used. E(S) represents the average amount of information required to find out the class label of a tuple in dataset D; this quantity is the entropy of D.
CONT'D
After splitting D on an attribute A into partitions D₁ … Dⱼ, the information still required is Info_A(D) = E(S, A) = Σⱼ (|Dⱼ|/|D|) · Info(Dⱼ), and the information gain is Gain(A) = Info(D) - Info_A(D).
CONT'D
#2) Gain Ratio
Information gain might sometimes result in partitioning that is useless for classification. The gain ratio compensates for this: it splits the training data set into partitions and considers the number of tuples of each outcome with respect to the total tuples, normalizing the gain as GainRatio(A) = Gain(A) / SplitInfo_A(D).
The attribute with the maximum gain ratio is used as the splitting attribute.
#3) Gini Index
The Gini index calculated for a binary split of dataset D by attribute A is given by:

Gini_A(D) = (|D₁|/|D|) · Gini(D₁) + (|D₂|/|D|) · Gini(D₂)
EXAMPLE OF DECISION TREE ALGORITHM :
CONSTRUCTING A DECISION TREE
Let us take an example of the last 14 days weather dataset with attributes outlook,
temperature, wind, and humidity. The outcome variable will be playing cricket or not. We will
use the ID3 algorithm to build the decision tree.
CONT'D
Step 1: The first step will be to create a root node.
Step 2: If all results are yes, then the leaf node "yes" will be returned; otherwise the leaf node "no" will be returned.
Step 3: Find the entropy of all observations and the entropy with attribute "x", that is, E(S) and E(S, x).
Step 4: Find the information gain and select the attribute with the highest information gain.
Step 5: Repeat the above steps until all attributes are covered. A sketch of steps 3 and 4 appears after the worked example below.
CONT'D
(The worked calculations, E(S), each attribute's E(S, x), and the resulting information gains, were shown as figures and are omitted here, including the table for Outlook as "Sunny".)
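To make steps 3 and 4 concrete, here is a sketch of the information-gain calculation for the attribute Outlook, assuming the classic 14-day counts (9 "yes" and 5 "no" overall; sunny 2/3, overcast 4/0, rain 3/2). The slide's actual table was in the omitted figures, so these counts are an assumption.

    from math import log2

    def entropy(yes, no):
        total = yes + no
        e = 0.0
        for count in (yes, no):
            if count:                  # 0 * log2(0) is taken as 0
                p = count / total
                e -= p * log2(p)
        return e

    E_S = entropy(9, 5)                # entropy of the whole dataset, ~0.940

    # (yes, no) counts within each Outlook value (assumed classic counts).
    partitions = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
    E_S_outlook = sum((y + n) / 14 * entropy(y, n) for y, n in partitions.values())

    gain = E_S - E_S_outlook           # ~0.247; in the classic example Outlook
    print(round(E_S, 3), round(E_S_outlook, 3), round(gain, 3))  # wins and becomes the root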
BAYESIAN CLASSIFICATION
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayesian classification is based on Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X), where H is a hypothesis (such as a class label) and X is an observed tuple.
Studies comparing classification algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
The naive Bayesian classifier is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
The naive Bayes model is easy to build and particularly useful for very large data sets.
HOW NAIVE BAYES ALGORITHM WORKS?
EXAMPLE: ESTIMATE PROBABILITY OF NEW INSTANCE USING NAIVE BAYES ALGORITHM
(The worked example and its step-by-step solution were shown as figures and are omitted here.)
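A minimal sketch of the naive Bayes computation on a made-up two-attribute dataset; the attribute values, counts, and class names are all hypothetical. For each class it multiplies the prior P(C) by the per-attribute likelihoods P(xᵢ|C), treating the attributes as independent.

    from collections import Counter

    # Hypothetical training tuples: (age, income, class label).
    data = [("youth", "high", "no"), ("youth", "low", "yes"), ("senior", "low", "yes"),
            ("senior", "high", "no"), ("youth", "high", "no"), ("senior", "low", "yes")]

    def naive_bayes_scores(instance, data):
        class_counts = Counter(row[-1] for row in data)
        scores = {}
        for label, count in class_counts.items():
            rows = [row for row in data if row[-1] == label]
            score = count / len(data)                 # prior P(C)
            for i, value in enumerate(instance):      # likelihood P(x_i | C)
                score *= sum(1 for r in rows if r[i] == value) / len(rows)
            scores[label] = score
        return scores

    # The predicted class is the one with the larger score. (A real implementation
    # would add Laplace smoothing so unseen attribute values don't zero the score.)
    print(naive_bayes_scores(("youth", "low"), data))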
SUPPORT VECTOR MACHINES
Support Vector Machine or SVM is one of the most popular
Supervised Learning algorithms, which is used for Classification as
well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors; hence the algorithm is termed the Support Vector Machine.
CONT'D
Consider the diagram (omitted here) in which there are two different categories that are classified using a decision boundary or hyperplane.
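A minimal sketch of SVM classification with scikit-learn, using a linear kernel on two made-up 2-D clusters; the support_vectors_ attribute exposes the extreme points that define the hyperplane.

    from sklearn.svm import SVC

    X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]   # two hypothetical clusters
    y = [0, 0, 0, 1, 1, 1]

    clf = SVC(kernel="linear").fit(X, y)
    print("Support vectors:", clf.support_vectors_)        # the extreme points
    print("Prediction for [3, 2]:", clf.predict([[3, 2]])) # expected to fall in class 0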
K-NEAREST NEIGHBOUR CLASSIFIER
This technique assumes that data points that are similar can be found near one another.
It attempts to determine the distance between data points, which is commonly done using Euclidean distance, and then assigns a category based on the most frequent category or average.
The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is much similar to the new data.
WHY KNN IS A LAZY LEARNER
✓ The reason for calling certain machine learning methods lazy is that they defer the decision of how to generalize beyond the training data until each new query instance is encountered.
✓ KNN doesn't build an explicit model during the training phase.
✓ Instead, it simply stores the entire training dataset and makes predictions based on the similarity of new data points to the stored instances.
FIND THE CLASS OF NEW INSTANCE USING KNN ALGORITHM
SOLUTION
To know its class, we have to calculate the distance from the new entry to the other entries in the data set using the Euclidean distance formula.
Here's the formula: √((X₂ - X₁)² + (Y₂ - Y₁)²), where:
X₂ = new entry's brightness (20)
X₁ = existing entry's brightness
Y₂ = new entry's saturation (35)
Y₁ = existing entry's saturation
(The distance calculations, the distances rearranged in ascending order, and the resulting class assignment were shown as figures and are omitted here.)
ADVANTAGES AND DISADVANTAGES
(The list of advantages is omitted.)
Disadvantages
- It can be hard to find a good distance measure.
- It typically cannot handle more than a few dozen attributes.
- Computational cost: it requires a lot of computation and memory.
- A lot of memory is required for processing large data sets.
- Choosing the right value of K can be tricky.
HOW DOES K-NN WORK?
The working of K-NN can be explained on the basis of the following algorithm:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance from the new data point to each of the stored data points.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
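Here is a sketch of these five steps on the brightness/saturation example; the new entry (20, 35) comes from the slides, while the stored training rows are made up for illustration.

    from collections import Counter
    from math import sqrt

    # Hypothetical stored entries: ((brightness, saturation), class).
    train = [((40, 20), "Red"), ((50, 50), "Blue"), ((60, 90), "Blue"),
             ((10, 25), "Red"), ((70, 70), "Blue"), ((25, 80), "Red")]
    new_point = (20, 35)
    k = 3                                   # Step 1: select the number K of neighbours

    def distance(a, b):                     # Step 2: Euclidean distance
        return sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

    # Step 3: take the K nearest neighbours; Step 4: count each category among them.
    neighbours = sorted(train, key=lambda row: distance(row[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)

    print(votes.most_common(1)[0][0])       # Step 5: the majority category wins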
NEURAL NETWORKS
(The introductory slides of this subsection are omitted.)
Once an input layer is determined, weights are assigned. These
weights help determine the importance of any given variable, with
larger ones contributing more significantly to the output compared
to other inputs.
All inputs are then multiplied by their respective weights and then
summed. Afterward, the output is passed through an activation
function, which determines the output. If that output exceeds a
given threshold, it “fires” (or activates) the node, passing data to
the next layer in the network.
This results in the output of one node becoming the input of the next node. This process of passing data from one layer to the next layer defines this neural network as a feedforward network.
CONT’D
The neuron is the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. (The figure of a 2-input neuron is omitted.)
CONT’D
Three (3) things are happening here. First, each input is multiplied by a weight: x₁ → x₁·w₁ and x₂ → x₂·w₂.
Next, all the weighted inputs are added together with a bias b: (x₁·w₁) + (x₂·w₂) + b.
Finally, the sum is passed through an activation function: y = f((x₁·w₁) + (x₂·w₂) + b).
The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function, σ(x) = 1 / (1 + e^(-x)), which only outputs numbers in the range (0, 1).
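A minimal sketch of this 2-input neuron; the weights w = [0, 1], bias b = 4, and inputs x = [2, 3] are made-up values for illustration.

    from math import exp

    def sigmoid(x):
        return 1 / (1 + exp(-x))            # squashes any real input into (0, 1)

    def neuron(inputs, weights, bias):
        # Weighted sum plus bias, passed through the activation function.
        total = sum(x * w for x, w in zip(inputs, weights)) + bias
        return sigmoid(total)

    print(neuron([2, 3], [0, 1], 4))        # sigmoid(2*0 + 3*1 + 4) = sigmoid(7) ≈ 0.999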
CONT’D
A neural network is nothing more than a bunch of neurons connected together. (The figure of a simple neural network is omitted.)
Strengths:
- Parallel processing capability
- Storing data on the entire network
- Capability to work with incomplete knowledge
- High tolerance to noisy data
- Successful on a wide array of real-world data
Weaknesses:
- No assurance of a proper network structure
- Hardware dependence
- Long training time
- Poor interpretability
CLASSIFICATION BY BACKPROPAGATION
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
Backpropagation is an algorithm that backpropagates the errors from
the output nodes to the input nodes. Therefore, it is simply referred to
as the backward propagation of errors.
Backpropagation is a widely used algorithm for training feedforward
neural networks. It computes the gradient of the loss function with
respect to the network weights to train multi-layer networks and
update weights to minimize loss; variants such as gradient descent or
stochastic gradient descent are often used.
It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights reduces error rates and makes the model more reliable by increasing its generalization.
HOW BACKPROPAGATION ALGORITHM WORKS
(Steps 1-4 were shown in a figure that is omitted.)
Step 5: From the output layer, go back to the hidden layer to adjust the weights.
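As a hedged illustration, the sketch below trains a tiny 2-2-1 sigmoid network on a single made-up input/target pair, propagating the error from the output back through the hidden layer and nudging the weights by gradient descent each epoch.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input -> hidden weights and biases
    W2, b2 = rng.normal(size=2), 0.0                # hidden -> output
    x, target, lr = np.array([0.5, -1.0]), 1.0, 0.5 # one hypothetical training pair

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    for epoch in range(200):
        # Forward pass.
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # Backward pass: push the error from the output node back to the weights.
        d_out = (y - target) * y * (1 - y)          # gradient at the output unit
        d_hid = d_out * W2 * h * (1 - h)            # error propagated to the hidden layer
        W2 -= lr * d_out * h
        b2 -= lr * d_out
        W1 -= lr * np.outer(d_hid, x)
        b1 -= lr * d_hid

    print(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))  # output should now be close to the target 1.0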
Advantages:
- It is simple, fast, and easy to program.
- It has no parameters to tune apart from the number of inputs.
- It is flexible and efficient.
- There is no need for users to learn any special functions.
Disadvantages:
- It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
- Performance is highly dependent on the input data.
- Training can take too much time.
- The matrix-based approach is preferred over a mini-batch approach.