
CHAPTER 3

Classification and Prediction
CONTENTS
3.1 What is classification? What is prediction?
3.2 Issues regarding classification and prediction
3.3 Classification by decision tree induction
3.4 Bayesian classification
3.5 Support vector machines
3.6 Classification by back propagation
3.7 Other classification methods
    3.7.1 K-nearest neighbor classifier
    3.7.2 Neural network
    3.7.3 Genetic algorithm
3.8 Prediction
3.9 Classifier accuracy
WHAT IS CLASSIFICATION?

 Classification is the process of categorizing a given set of data (structured or unstructured) into classes.
 It is a supervised machine learning technique: an algorithm is trained on a labeled dataset and then used to predict the class or category of new, unseen data.
 The main objective of classification is to build a model that can accurately assign a label or category to a new observation based on its features.
 For example, a classification model might be trained on a dataset of images labeled as either bird or animal and then used to predict the class of new, unseen images based on features such as color, texture, and shape.
 The categorized classes are also called categories, targets, or labels.
 The process starts with predicting the classes of the given data points.
WHAT IS PREDICTION?

 The practice of using data to make predictions or foresee future events is known as machine learning prediction.
 Prediction in machine learning refers to the output of an algorithm that has been trained on a historical dataset.
 The algorithm generates probable values for unknown variables in each record of the new data.
 The purpose of prediction in machine learning is to project a probable data set that relates back to the original data.
 In other words, the aim is to build models that can recognize patterns in data and use those patterns to make precise predictions about novel, unseen data.
EXAMPLE
 A bank loans officer needs analysis of her data in order to learn which loan applicants are "safe" and which are "risky" for the bank.
 A marketing manager at All Electronics needs data analysis to help guess whether a customer with a given profile will buy a new computer.
 A medical researcher wants to analyse breast cancer data in order to predict which one of three specific treatments a patient should receive.
 In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or treatment A, treatment B, or treatment C for the medical data.
 These categories can be represented by discrete values, where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatment regimes.
CONT’S
 Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at All Electronics.
 This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. This model is a predictor (a small sketch follows below).
 Data classification is a two-step process:
A. Building the Classifier or Model
B. Using the Classifier for Classification
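Before turning to those two steps, here is a minimal sketch of the numeric prediction task mentioned above. The customer records are made up for illustration, and scikit-learn's LinearRegression is just one possible predictor:

# Numeric prediction sketch: predict a continuous value (spending) from customer features.
from sklearn.linear_model import LinearRegression

# Hypothetical historical records: [income_in_thousands, age] -> amount spent during a sale
X_train = [[40, 25], [60, 35], [80, 45], [100, 50], [55, 30]]
y_train = [300, 520, 740, 910, 480]

predictor = LinearRegression().fit(X_train, y_train)

# The output is a continuous value, not a class label
new_customer = [[70, 40]]
print(predictor.predict(new_customer))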
CONT’S

A. Building the Classifier or Model

 This step is the learning step or the learning phase.
 In this step the classification algorithm builds the classifier.
 The classifier is built from the training set, made up of database tuples and their associated class labels.
 Each tuple that constitutes the training set is referred to as a training tuple or sample.
CONT’S

B. Using the Classifier for Classification

 In this step, the classifier is used for classification.
 Here the test data is used to estimate the accuracy of the classification rules.
 The classification rules can be applied to the new data tuples if the accuracy is considered acceptable.
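A minimal sketch of this two-step process, assuming scikit-learn and its bundled iris dataset purely for illustration:

# Step A: build (learn) the classifier from labeled training tuples.
# Step B: use held-out test tuples to estimate accuracy before applying it to new data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)          # learning step
test_accuracy = accuracy_score(y_test, clf.predict(X_test))   # accuracy estimated on test data
print(test_accuracy)

# If the accuracy is acceptable, the classifier can be applied to new tuples:
print(clf.predict(X_test[:3]))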
CONT’S
 The general approach for building classification models is shown in the figure in the slides.
ISSUES REGARDING CLASSIFICATION AND PREDICTION
 Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
 Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.
 Data transformation and reduction: The data may be transformed by normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In methods that use distance measurements, for example, this would prevent attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes).
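A small sketch of the normalization step described above (min-max scaling into [0.0, 1.0]); the income and age values are made up:

# Min-max normalization: v' = (v - min) / (max - min), rescaled into [new_min, new_max]
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [12000, 35000, 58000, 99000]   # attribute with an initially large range
ages = [22, 35, 47, 61]                  # attribute with an initially smaller range
print(min_max_normalize(incomes))        # both attributes now fall in [0.0, 1.0],
print(min_max_normalize(ages))           # so neither dominates distance measures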
CLASSIFICATION BY DECISION TREE INDUCTION
 Classification is a two-step process in machine learning: a learning step and a prediction step.
 In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data.
 Decision Tree is one of the easiest and most popular classification algorithms to understand and interpret.
 Decision tree induction is the learning of decision trees from class-labeled training tuples.
 A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
NODES IN DECISION TREE

 Root Node
 Internal Node/Branch Node
 Leaf Node
CONT’S
 In supervised learning, the target result is already known. Decision trees can be used for both categorical and numerical data. Categorical data represent gender, marital status, etc., while numerical data represent age, temperature, etc.
DECISION TREE RULE GENERATION

From the above decision tree (shown as a figure in the slides) we can construct five rules, for example:

Rule 1: If outlook is sunny and windy is false, then playgolf is "Yes".

Rule 2: If outlook is sunny and windy is true, then playgolf is "No".

Rules 3 to 5 follow in the same way, one rule for each remaining path from the root to a leaf.
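As a rough sketch of how such IF-THEN rules can be read off a fitted tree, the snippet below uses scikit-learn's export_text; the tiny encoded outlook/windy dataset is made up to mirror the example:

# Each root-to-leaf path of a decision tree corresponds to one IF-THEN rule.
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded toy data: outlook (0=sunny, 1=overcast, 2=rainy), windy (0=false, 1=true)
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["Yes", "No", "Yes", "Yes", "Yes", "No"]   # playgolf

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["outlook", "windy"]))  # prints the paths as nested if-rules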
THE BENEFITS OF HAVING A DECISION TREE ARE AS FOLLOWS:

 Decision Trees usually mimic human thinking ability while making a decision, so they are easy to understand.
 The logic behind the decision tree can be easily understood because it shows a tree-like structure.
 It does not require any domain knowledge.
 The learning and classification steps of a decision tree are simple and fast.
DECISION TREE INDUCTION ALGORITHMS

 ID3 (Iterative Dichotomiser)
 C4.5, which was the successor of ID3
EXERCISE: Generate rules for the tree shown in the slides.
DECISION TREE INDUCTION ALGORITHMS
 A machine learning researcher named J. Ross Quinlan developed, in 1980, a decision tree algorithm known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, which was the successor of ID3.
 ID3 and C4.5 adopt a greedy approach. In these algorithms there is no backtracking; the trees are constructed in a top-down recursive divide-and-conquer manner.
 #1) Initially, there are three parameters, i.e. attribute list, attribute selection method and data partition.
 #2) The attribute selection method describes the method for selecting the best attribute for discrimination among tuples. The methods used for attribute selection can either be Information Gain or Gini Index.
 #3) The structure of the tree (binary or non-binary) is decided by the attribute selection method.
 #4) When constructing a decision tree, it starts as a single node representing the tuples.
 #5) If the root node tuples represent different class labels, then it calls an attribute selection method to split or partition the tuples. This step will lead to the formation of branches and decision nodes.
 #6) The splitting method will determine which attribute should be selected to partition the tuples at that node.
CONT’S

 #7) The above partitioning steps are followed recursively to form a decision tree for the training dataset tuples.
 #8) The partitioning stops only when either all the partitions are made or when the remaining tuples cannot be partitioned further.
INFORMATION GAIN, ENTROPY, AND GINI INDEX

 Information gain, entropy, and Gini index are commonly used metrics in decision tree algorithms to determine the best split when building a tree.
 Entropy is a measure of the impurity or uncertainty of a set of data. It ranges from 0 (completely pure) to 1 (completely impure, for a binary class). When building a decision tree, the entropy of a set is calculated before and after a split, and the change in entropy is used to determine the information gain.
 Information gain is the difference in entropy between the set before and after a split. The attribute that provides the highest information gain is chosen as the split attribute.
 Gini index is another measure of impurity or uncertainty, again ranging from 0 (completely pure) upward as impurity grows. The Gini index measures the probability of a random sample being incorrectly labeled when it is randomly labeled according to the distribution of the labels in the set.
 When building a decision tree, the Gini index of a set is calculated before and after a split, and the change in Gini index is used to determine the split attribute.
 In general, all three metrics can be used in decision tree algorithms to determine the best split attribute. However, some situations may favor one metric over the others. For example, when dealing with binary classification problems, the Gini index is often preferred over entropy because it tends to be more computationally efficient. On the other hand, entropy is preferred when the data set is imbalanced, meaning there is a significant difference in the number of instances belonging to different classes.
 Information gain is a popular metric because it is easy to understand and interpret.
HOW TO SELECT ATTRIBUTES FOR CREATING A TREE?
 Attribute selection measures are also called splitting rules; they decide how the tuples are going to be split. The splitting criteria are used to best partition the dataset. These measures provide a ranking of the attributes for partitioning the training tuples.
 The most popular methods of selecting the attribute are information gain and the Gini index.
#1) Information Gain
 This method is the main method used to build decision trees. It reduces the information required to classify the tuples, and hence the number of tests needed to classify a given tuple. The attribute with the highest information gain is selected.
 The original information needed for the classification of a tuple in dataset D is given by:

    E(S) = Info(D) = - Σ_i p_i log2(p_i)

where p_i is the probability that a tuple belongs to class C_i. The information is encoded in bits, therefore log to the base 2 is used. E(S) represents the average amount of information required to find out the class label of a tuple in dataset D. This measure is also called entropy.
CONT’S

 The information required for exact classification after partitioning the dataset by attribute X is given by:

    Info_X(D) = Σ_j (|D_j| / |D|) × Info(D_j)

where |D_j| / |D| is the weight of the j-th partition. This represents the information still needed to classify the dataset D after partitioning by X.

 Information gain is the difference between the original and the expected information required to classify the tuples of dataset D:

    Gain(X) = Info(D) - Info_X(D)
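A small sketch of these two formulas in plain Python, just to make the computation concrete (the one-attribute example data is made up):

import math
from collections import Counter

def entropy(labels):
    # Info(D) = - Σ_i p_i log2(p_i)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    # Gain(X) = Info(D) - Σ_j (|D_j|/|D|) * Info(D_j), partitioning on one attribute
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    expected = sum((len(part) / total) * entropy(part) for part in partitions.values())
    return entropy(labels) - expected

# Tiny made-up example: one attribute (outlook) against the class label
rows = [["sunny"], ["sunny"], ["overcast"], ["rainy"], ["rainy"]]
labels = ["No", "No", "Yes", "Yes", "No"]
print(entropy(labels), information_gain(rows, labels, 0))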
CONT’S
#2) Gain Ratio
 Information gain might sometimes result in partitioning that is useless for classification. The gain ratio, however, splits the training data set into partitions and considers the number of tuples in each partition with respect to the total number of tuples.
 The attribute with the maximum gain ratio is used as the splitting attribute.

#3) Gini Index
 The Gini index considers a binary split for each attribute. It measures the impurity of the training tuples of dataset D as:

    Gini(D) = 1 - Σ_i p_i^2

where p_i is the probability that a tuple belongs to class C_i.
CONT’S

 The Gini index calculated for a binary split of dataset D by attribute A is given by:

    Gini_A(D) = (|D1| / |D|) × Gini(D1) + (|D2| / |D|) × Gini(D2)

where D1 and D2 are the two partitions of the dataset D.

 The reduction in impurity is given by the difference between the Gini index of the original dataset D and the Gini index after partitioning by attribute A:

    ΔGini(A) = Gini(D) - Gini_A(D)

 The attribute with the maximum reduction in impurity (equivalently, the minimum Gini index after the split) is selected as the best attribute for splitting.
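A corresponding sketch for the Gini index and the reduction in impurity for a binary split (plain Python, made-up labels):

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - Σ_i p_i^2
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(d1_labels, d2_labels):
    # Gini_A(D) = (|D1|/|D|) * Gini(D1) + (|D2|/|D|) * Gini(D2)
    total = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / total) * gini(d1_labels) + (len(d2_labels) / total) * gini(d2_labels)

labels = ["Yes", "Yes", "Yes", "No", "No"]
d1, d2 = ["Yes", "Yes", "Yes"], ["No", "No"]   # one candidate binary split
print(gini(labels) - gini_split(d1, d2))        # reduction in impurity, ΔGini(A)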
EXAMPLE OF DECISION TREE ALGORITHM: CONSTRUCTING A DECISION TREE
 Let us take an example of the last 14 days' weather dataset with attributes outlook, temperature, wind, and humidity. The outcome variable will be playing cricket or not. We will use the ID3 algorithm to build the decision tree.
CONT’S
 Step 1: The first step will be to create a root node.
 Step 2: If all results are yes, then the leaf node "yes" will be returned, else the leaf node "no" will be returned.
 Step 3: Find out the entropy of all observations and the entropy with attribute "x", that is E(S) and E(S, x).
 Step 4: Find out the information gain and select the attribute with the highest information gain.
 Step 5: Repeat the above steps until all attributes are covered.
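The 14-row weather table itself appears as a figure in the slides; the sketch below assumes the commonly used version of that play dataset and walks through Steps 3-4, computing the information gain of each attribute to pick the root node:

import math
from collections import Counter

data = [  # (outlook, temperature, humidity, windy) -> play
    ("sunny", "hot", "high", False, "No"),      ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),  ("rainy", "mild", "high", False, "Yes"),
    ("rainy", "cool", "normal", False, "Yes"),  ("rainy", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"),("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),  ("rainy", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),   ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"),("rainy", "mild", "high", True, "No"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

labels = [row[-1] for row in data]
for i, name in enumerate(attributes):
    groups = {}
    for row in data:
        groups.setdefault(row[i], []).append(row[-1])
    expected = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    print(name, round(entropy(labels) - expected, 3))   # outlook has the highest gain -> root node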
CONT’S
 [The slides that follow work through the entropy and information-gain calculations for this dataset as figures, including the sub-table for Outlook = "Sunny".]
BAYESIAN CLASSIFICATION
 Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
 Bayesian classification is based on Bayes' theorem.
 Studies comparing classification algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
 The naive Bayesian classifier is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
 The naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, naive Bayes is known to outperform even highly sophisticated classification methods.
CONT’S

 Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

    P(c|x) = P(x|c) × P(c) / P(x)

• P(c|x) is the posterior probability of class c (target) given predictor x (attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
HOW DOES THE NAIVE BAYES ALGORITHM WORK?

 Step 1: Convert the data set into a frequency table.
 Step 2: Create a likelihood table by finding the probabilities.
 Step 3: Use the naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
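A small sketch of these three steps on made-up categorical data (a one-feature weather/play example is assumed here, not taken from the slides):

from collections import Counter

data = [("sunny", "No"), ("sunny", "No"), ("overcast", "Yes"), ("rainy", "Yes"),
        ("rainy", "Yes"), ("rainy", "No"), ("overcast", "Yes"), ("sunny", "Yes")]

# Step 1: frequency table of (feature value, class)
freq = Counter(data)
class_counts = Counter(label for _, label in data)

# Step 2: likelihood table P(x|c) and priors P(c)
likelihood = {(x, c): freq[(x, c)] / class_counts[c] for (x, c) in freq}
prior = {c: n / len(data) for c, n in class_counts.items()}

# Step 3: posterior (up to the common factor P(x)) for a new observation x = "sunny"
x = "sunny"
posterior = {c: likelihood.get((x, c), 0.0) * prior[c] for c in class_counts}
print(posterior, "->", max(posterior, key=posterior.get))  # class with highest posterior wins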
EXAMPLE: ESTIMATE THE PROBABILITY OF A NEW INSTANCE USING THE NAIVE BAYES ALGORITHM
 [The worked example and its solution appear as figures in the slides.]
SUPPORT VECTOR MACHINES
 Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning.
 The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put new data points in the correct category in the future. This best decision boundary is called a hyperplane.
 SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
CONT’S
 Consider the diagram in the slides, in which there are two different categories that are classified using a decision boundary or hyperplane.
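A minimal sketch of SVM classification with scikit-learn (linear kernel, made-up two-dimensional points), just to illustrate fitting a hyperplane and classifying a new point:

from sklearn.svm import SVC

# Two linearly separable groups of 2-D points (made up for illustration)
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear").fit(X, y)
print(svm.support_vectors_)      # the extreme points that define the hyperplane
print(svm.predict([[5, 5]]))     # classify a new data point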
K-NEAREST NEIGHBOUR CLASSIFIER
 This technique assumes that data points that are similar can be found near one another.
 It attempts to determine the distance between data points, which is commonly done using Euclidean distance, and then assigns a category based on the most frequent category or the average.
 The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
 The K-NN algorithm can be used for regression as well as for classification, but mostly it is used for classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, it performs an action on the dataset.
 The KNN algorithm at the training phase just stores the dataset, and when it gets new data it classifies that data into the category most similar to the new data.
WHY KNN IS A LAZY LEARNER

✓ The reason for calling certain machine learning methods lazy is that they defer the decision of how to generalize beyond the training data until each new query instance is encountered.
✓ KNN doesn't build an explicit model during the training phase.
✓ Instead, it simply stores the entire training dataset and makes predictions based on the similarity of new data points to the training instances.


EXAMPLE: FIND THE CLASS OF A NEW INSTANCE USING THE KNN ALGORITHM
 [The example data table appears as a figure in the slides.]
SOLUTION

 To know its class, we have to calculate the distance from the new entry to the other entries in the data set using the Euclidean distance formula:

    d = √((X₂ - X₁)² + (Y₂ - Y₁)²)

where:
 X₂ = new entry's brightness (20)
 X₁ = existing entry's brightness
 Y₂ = new entry's saturation (35)
 Y₁ = existing entry's saturation
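The data table for this example appears as a figure in the slides; the sketch below uses a small made-up table of (brightness, saturation, class) rows just to show the distance computation and the majority vote with k = 3:

import math
from collections import Counter

# Hypothetical training rows: (brightness, saturation, class)
rows = [(40, 20, "Red"), (50, 50, "Blue"), (60, 90, "Blue"),
        (10, 25, "Red"), (70, 70, "Blue"), (25, 80, "Blue")]
new_point = (20, 35)   # the new entry from the example

distances = sorted(
    (math.sqrt((b - new_point[0]) ** 2 + (s - new_point[1]) ** 2), label)
    for b, s, label in rows
)
k = 3
nearest = [label for _, label in distances[:k]]
print(distances[:k], "->", Counter(nearest).most_common(1)[0][0])  # majority class among the 3 nearest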
 [The computed distances, rearranged in ascending order, and the resulting class assignment appear as figures in the slides.]
ADVANTAGES AND DISADVANTAGES

Advantages
 Conceptually simple, easy to understand and explain
 Very flexible decision boundaries
 Not much learning at all

Disadvantages
 It can be hard to find a good distance measure
 Typically cannot handle more than a few dozen attributes
 Computational cost: requires a lot of computation and memory
 A lot of memory is required for processing large data sets
 Choosing the right value of K can be tricky
HOW DOES K-NN WORK?
 The working of K-NN can be explained on the basis of the following algorithm:
 Step 1: Select the number K of neighbors.
 Step 2: Calculate the Euclidean distance from the new point to the training points.
 Step 3: Take the K nearest neighbors as per the calculated Euclidean distance.
 Step 4: Among these K neighbors, count the number of data points in each category.
 Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
 Step 6: Our model is ready.
HOW TO DETERMINE THE K VALUE IN THE K-NEIGHBORS CLASSIFIER?
 The optimal k value will help you to achieve the maximum accuracy of the model. This process, however, is always challenging.
 The simplest solution is to try out k values and find the one that brings the best results on the testing set. For this, we follow these steps:
1. Select a starting k value. In practice, k is usually chosen between 3 and 10, but there are no strict rules. A small value of k results in unstable decision boundaries. A large value of k often leads to smoother decision boundaries, but not always to better metrics. So it's always about trial and error.
2. Try out different k values and note their accuracy on the testing set.
3. Choose the k with the lowest error rate and implement the model.
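A short sketch of this trial-and-error procedure with scikit-learn (the iris dataset is used here only as a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Try several k values and note the accuracy on the testing set
for k in range(3, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
# Choose the k with the lowest error rate (highest accuracy) and refit the final model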
NEURAL NETWORK
 Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms.
 Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.
 Artificial neural networks (ANNs) are comprised of node layers, containing an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to others and has an associated weight and threshold.
 If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network.
CONT’D
 Once an input layer is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output compared to other inputs.
 All inputs are then multiplied by their respective weights and summed. Afterward, the output is passed through an activation function, which determines the output. If that output exceeds a given threshold, it "fires" (or activates) the node, passing data to the next layer in the network.
 This results in the output of one node becoming the input of the next node. This process of passing data from one layer to the next defines this neural network as a feedforward network.
CONT’D
 The neuron is the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. A 2-input neuron is shown in the slides.
CONT’D
 Three things are happening here. First, each input is multiplied by a weight:

    x1 → x1 × w1,   x2 → x2 × w2

 Next, all the weighted inputs are added together with a bias b:

    (x1 × w1) + (x2 × w2) + b

 Finally, the sum is passed through an activation function:

    y = f((x1 × w1) + (x2 × w2) + b)

 The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function, f(x) = 1 / (1 + e^(-x)), which only outputs numbers in the range (0, 1).
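A tiny sketch of such a 2-input neuron with a sigmoid activation (the weight and bias values are arbitrary):

import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-x))

def neuron(inputs, weights, bias):
    # Weighted sum plus bias, passed through the activation function
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(total)

print(neuron([2, 3], weights=[0.5, -1.0], bias=4))   # output of one artificial neuron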
CONT’D
 A neural network is nothing more than a bunch of neurons connected together. A simple neural network is shown in the slides.
 This network has 2 inputs, a hidden layer with 2 neurons (h1 and h2), and an output layer with 1 neuron (o1). Notice that the inputs for o1 are the outputs from h1 and h2; that's what makes this a network.
 A hidden layer is any layer between the input (first) layer and the output (last) layer. There can be multiple hidden layers!
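A self-contained sketch of this 2-2-1 feedforward network (all weights and biases set to arbitrary example values):

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def neuron(inputs, weights, bias):
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)

def feedforward(x):
    # Hidden layer: two neurons h1 and h2, each seeing both inputs
    h1 = neuron(x, weights=[0.5, 0.5], bias=0.0)
    h2 = neuron(x, weights=[-0.5, 0.5], bias=0.0)
    # Output layer: one neuron o1 whose inputs are the outputs of h1 and h2
    return neuron([h1, h2], weights=[1.0, 1.0], bias=0.0)

print(feedforward([2, 3]))   # a single number in (0, 1)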
STRENGTHS AND WEAKNESSES

Strengths
 Parallel processing capability
 Storing data on the entire network
 Capability to work with incomplete knowledge
 High tolerance to noisy data
 Successful on a wide array of real-world data

Weaknesses
 No assurance of a proper network structure
 Hardware dependence
 Long training time
 Poor interpretability
CLASSIFICATION BY BACK PROPAGATION
 The network is feed-forward in that none of the weights cycle back to an input unit or to an output unit of a previous layer.
 Backpropagation is an algorithm that propagates the errors from the output nodes back to the input nodes. Therefore, it is simply referred to as the backward propagation of errors.
 Backpropagation is a widely used algorithm for training feedforward neural networks. It computes the gradient of the loss function with respect to the network weights in order to train multi-layer networks and update the weights to minimize the loss; variants such as gradient descent or stochastic gradient descent are often used.
 It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates and make the model more reliable by improving its generalization.
HOW THE BACKPROPAGATION ALGORITHM WORKS

 Step 1: Inputs X arrive through the preconnected path.
 Step 2: The input is modelled using true weights W. Weights are usually chosen randomly.
 Step 3: Calculate the output of each neuron from the input layer, through the hidden layer(s), to the output layer.
 Step 4: Calculate the error in the outputs.
 Step 5: From the output layer, go back to the hidden layer to adjust the weights so as to reduce the error.
 Step 6: Repeat the process until the desired output is achieved.
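A compact sketch of these steps with NumPy, training a tiny 2-2-1 network on a made-up OR-like pattern; the learning rate, epoch count, and initialization are arbitrary:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Toy data: 2 inputs -> 1 binary target (made up for illustration)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [1]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # input -> hidden weights (Step 2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))   # hidden -> output weights
lr = 0.5

for epoch in range(5000):
    # Forward pass (Step 3): input layer -> hidden layer -> output layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Error at the outputs (Step 4), then propagate it backwards (Step 5)
    error = out - y                              # derivative of 0.5 * (out - y)^2
    d_out = error * out * (1 - out)              # sigmoid derivative at the output layer
    d_hidden = (d_out @ W2.T) * h * (1 - h)      # error pushed back to the hidden layer

    # Weight updates to reduce the error
    W2 -= lr * h.T @ d_out;   b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_hidden; b1 -= lr * d_hidden.sum(axis=0, keepdims=True)

print(out.round(3))   # predictions approach the targets after repeated epochs (Step 6)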
CONT’D

Advantages
 It is simple, fast, and easy to program.
 It has no parameters to tune apart from the number of inputs.
 It is flexible and efficient.
 There is no need for users to learn any special functions.

Disadvantages
 It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
 Performance is highly dependent on the input data.
 Training can take a lot of time.
 The matrix-based approach is preferred over a mini-batch approach.
