Module 3 Notes (1)
Classification
Classification is a process of categorizing items or data into distinct groups or
classes based on certain characteristics or attributes. It's a fundamental concept
used in various fields such as statistics, machine learning, and information
science.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky. Such analysis can help provide us with a better understanding of the data
at large.
EXAMPLES:
o A bank loans officer needs analysis of her data to learn which loan applicants are
“safe” and which are “risky” for the bank.
o A marketing manager needs data analysis to help guess whether a customer with a given
profile will buy a new computer.
o A medical researcher wants to analyze breast cancer data to predict which one of three
specific treatments a patient should receive.
These categories can be represented by discrete values, where the ordering among values has no
meaning.
• The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
i. Binary Classifier: the classification problem has only two possible outcomes.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
ii. Multi-class Classifier: the classification problem has more than two outcomes.
Examples: classification of types of crops, classification of types of music.
(A short illustrative sketch of both cases follows below.)
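As a quick illustration (not part of the original notes), the sketch below builds one binary and one multi-class classifier with scikit-learn; the toy feature values and class names are invented.

```python
# A minimal sketch, assuming scikit-learn is available; all data values are invented.
from sklearn.linear_model import LogisticRegression

# Binary classifier: only two possible outcomes (e.g., SPAM = 1, NOT SPAM = 0).
X_bin = [[0.1, 3], [0.9, 15], [0.2, 2], [0.8, 20]]   # toy feature vectors
y_bin = [0, 1, 0, 1]
binary_clf = LogisticRegression().fit(X_bin, y_bin)
print(binary_clf.predict([[0.85, 18]]))               # expected: [1] on this separable toy data

# Multi-class classifier: more than two outcomes (e.g., three crop types).
X_multi = [[1.0], [1.1], [5.0], [5.2], [9.0], [9.1]]
y_multi = ["wheat", "wheat", "rice", "rice", "maize", "maize"]
multi_clf = LogisticRegression().fit(X_multi, y_multi)
print(multi_clf.predict([[5.1]]))                     # expected: ['rice']
```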
Regression analysis
Regression analysis is a statistical methodology that is most often used for numeric prediction.
Suppose that the marketing manager wants to predict how much a given customer will spend
during a sale. This data analysis task is an example of numeric prediction, where the model
constructed predicts a continuous-valued function, or ordered value, as opposed to a class
label. This model is a predictor.
Hence the terms regression and numeric prediction tend to be used synonymously, although
other methods for numeric prediction exist. Classification and numeric prediction are the two
major types of prediction problems.
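To contrast numeric prediction with classification, here is a hedged sketch of a predictor that outputs a continuous value (an amount spent) rather than a class label; the customer data is invented.

```python
# A minimal numeric-prediction (regression) sketch; the data values are invented.
from sklearn.linear_model import LinearRegression

# Each row: [customer income (k$), number of past purchases]; target: amount spent ($).
X = [[30, 2], [45, 5], [60, 8], [80, 12]]
y = [120.0, 260.0, 410.0, 600.0]

predictor = LinearRegression().fit(X, y)      # builds a continuous-valued predictor
print(predictor.predict([[55, 7]]))           # predicts a dollar amount, not a class label
```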
The process is shown for the loan application data of Figure 8.1.
In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the
classifier by analysing or “learning from” a training set made up of database tuples and their
associated class labels.
A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn),
depicting n measurements made on the tuple from n database attributes, respectively, A1, A2,
…, An. Each tuple, X, is assumed to belong to a predefined class as determined by another
database attribute called the class label attribute. The class label attribute is discrete-valued
and unordered.
It is categorical (or nominal) in that each value serves as a category or class. The individual
tuples making up the training set are referred to as training tuples and are randomly sampled
from the database under analysis. In the context of classification, data tuples can be referred
to as samples, examples, instances, data points, or objects.
Because the class label of each training tuple is provided, this step is also known as
supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to
which class each training tuple belongs). It contrasts with unsupervised learning (or
clustering), in which the class label of each training tuple is not known, and the number or
set of classes to be learned may not be known in advance.
In the second step (Figure 8.1b), the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to measure the
classifier’s accuracy, this estimate would likely be optimistic, because the classifier tends to
overfit the data (i.e., during learning it may incorporate some particular anomalies of the
training data that are not present in the general data set overall). Therefore, a test set is used,
made up of test tuples and their associated class labels. They are independent of the training
tuples, meaning that they were not used to construct the classifier.
The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier. The associated class label of each test tuple is compared
with the learned classifier’s class prediction for that tuple.
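The two-step methodology above (build the classifier on training tuples, then estimate accuracy on an independent test set) can be sketched as follows; the synthetic data simply stands in for the loan-application tuples.

```python
# A minimal sketch of the train/test methodology, assuming scikit-learn; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# The test tuples are kept independent of the training tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # learning step

# Accuracy measured on the training set is optimistic because the classifier overfits it ...
print("training accuracy:", accuracy_score(y_train, clf.predict(X_train)))
# ... so predictive accuracy is estimated on the independent test set instead.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```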
A decision tree is a flowchart-like tree structure, where each internal node (non leaf node)
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (or terminal node) holds a class label. The topmost node in a tree is the root node. A
typical decision tree is shown in Figure 8.2.
It represents the concept of buying a computer, that is, it predicts whether a customer at
AllElectronics is likely to purchase a computer. Internal nodes are denoted by rectangles, and
leaf nodes are denoted by ovals.
“How are decision trees used for classification?” Given a tuple, X, for which the associated
class label is unknown, the attribute values of the tuple are tested against the decision tree. A
path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
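To make the classification step concrete, here is a small hand-coded version of a buys_computer tree; the branch outcomes follow the usual AllElectronics example and are illustrative rather than an exact copy of Figure 8.2. Note that each root-to-leaf path corresponds to one classification rule.

```python
# A hand-coded decision tree sketch: classify a tuple by tracing a path from root to leaf.
def classify(tuple_x):
    """tuple_x is a dict of attribute values, e.g. {"age": "youth", "student": "yes", ...}."""
    if tuple_x["age"] == "youth":                     # root node test: age?
        return "yes" if tuple_x["student"] == "yes" else "no"
    elif tuple_x["age"] == "middle_aged":             # this branch leads directly to a leaf
        return "yes"
    else:                                             # age == "senior": test credit_rating
        return "yes" if tuple_x["credit_rating"] == "fair" else "no"

print(classify({"age": "youth", "student": "yes", "credit_rating": "fair"}))       # -> yes
print(classify({"age": "senior", "student": "no", "credit_rating": "excellent"}))  # -> no
```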
ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees
are constructed in a top-down recursive divide-and-conquer manner. Most algorithms for
decision tree induction also follow a top-down approach, which starts with a training set of
tuples and their associated class labels. The training set is recursively partitioned into smaller
subsets as the tree is being built.
● The algorithm is called with three parameters: D, attribute list, and Attribute selection
method. We refer to D as a data partition. Initially, it is the complete set of training tuples
and their associated class labels. The parameter attribute list is a list of attributes
describing the tuples. Attribute selection method specifies a heuristic procedure for
selecting the attribute that “best” discriminates the given tuples according to class. This
procedure employs an attribute selection measure such as information gain or the Gini
index. Whether the tree is strictly binary is generally driven by the attribute selection
measure. Some attribute selection measures, such as the Gini index, restrict the resulting
tree to binary splits. Others, like information gain, do not, thereby allowing multiway splits
(i.e., two or more branches to be grown from a node). The tree starts as a single node, N,
representing the training tuples in D (step 1).
● If the tuples in D are all of the same class, then node N becomes a leaf and is labeled
with that class (steps 2 and 3). Note that steps 4 and 5 are terminating conditions. All
terminating conditions are explained at the end of the algorithm.
● Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion. The splitting criterion tells us which attribute to test at node N by determining
the “best” way to separate or partition the tuples in D into individual classes (step 6). The
splitting criterion also tells us which branches to grow from node N with respect to the
outcomes of the chosen test. More specifically, the splitting criterion indicates the
splitting attribute and may also indicate either a split-point or a splitting subset. The
splitting criterion is determined so that, ideally, the resulting partitions at each branch are
as “pure” as possible. A partition is pure if all the tuples in it belong to the same class. In
other words, if we split up the tuples in D according to the mutually exclusive outcomes
of the splitting criterion, we hope for the resulting partitions to be as pure as possible.
● The node N is labelled with the splitting criterion, which serves as a test at the node (step
7). A branch is grown from node N for each of the outcomes of the splitting criterion.
The tuples in D are partitioned accordingly (steps 10 and 11). There are three possible
scenarios, as illustrated in Figure 8.4. A compact code sketch of this recursive procedure
is given after this list.
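Below is a compact sketch, under stated assumptions, of the top-down recursive divide-and-conquer scheme just described: D is a list of (attribute-dict, class-label) pairs, and Attribute_selection_method is passed in as a function. Pre-pruning, continuous attributes, and empty partitions are omitted for brevity.

```python
# A compact sketch of top-down recursive decision tree induction (illustrative, not the
# full textbook algorithm).
from collections import Counter

def generate_tree(D, attribute_list, attribute_selection_method):
    labels = [label for _, label in D]                 # D: list of (tuple_dict, class_label)
    if len(set(labels)) == 1:                          # all tuples in D belong to the same class
        return labels[0]                               # -> leaf labelled with that class
    if not attribute_list:                             # no attributes left to split on
        return Counter(labels).most_common(1)[0][0]    # -> leaf with the majority class
    A = attribute_selection_method(D, attribute_list)  # choose the "best" splitting attribute
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]
    for outcome in {x[A] for x, _ in D}:               # grow one branch per observed outcome
        Dj = [(x, c) for x, c in D if x[A] == outcome] # partition D on that outcome
        node["branches"][outcome] = generate_tree(Dj, remaining, attribute_selection_method)
    # (The textbook algorithm also attaches a majority-class leaf for outcomes of A that do
    #  not occur in D; that detail is omitted here.)
    return node
```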
Information Gain
ID3 uses information gain as its attribute selection measure.
Let node N represent or hold the tuples of partition D. The attribute with the highest
information gain is chosen as the splitting attribute for node N. This attribute minimizes the
information needed to classify the tuples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions. Such an approach minimizes the expected
number of tests needed to classify a given tuple and guarantees that a simple (but not
necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by

Info(D) = − Σ_{i=1}^{m} pi log2(pi)

where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|. A log function to the base 2 is used, because the information is
encoded in bits. Info(D) is just the average amount of information needed to identify the class
label of a tuple in D. Note that, at this point, the information we have is based solely on the
proportions of tuples of each class. Info(D) is also known as the entropy of D.
How much more information would we still need (after partitioning D on some attribute A
into v partitions D1, …, Dv) to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

Information gain is the difference between the original information requirement and the new
requirement:

Gain(A) = Info(D) − Info_A(D)
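As a minimal sketch (using the same (tuple-dict, class-label) representation as the induction sketch above), the helpers below compute Info(D), Info_A(D), and Gain(A); the tiny usage data is invented.

```python
import math
from collections import Counter

def info(D):
    """Info(D): expected information (entropy) needed to classify a tuple in D."""
    total = len(D)
    counts = Counter(label for _, label in D)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_A(D, A):
    """Info_A(D): expected information still required after partitioning D on attribute A."""
    total = len(D)
    partitions = Counter(x[A] for x, _ in D)
    return sum((n / total) * info([(x, c) for x, c in D if x[A] == v])
               for v, n in partitions.items())

def gain(D, A):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(D) - info_A(D, A)

# Toy usage (values invented):
D = [({"income": "high"}, "no"), ({"income": "high"}, "yes"),
     ({"income": "low"}, "yes"), ({"income": "low"}, "yes")]
print(info(D), gain(D, "income"))   # approx. 0.811 and 0.311
```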
PROBLEM 1:
Example: PROBLEM 2 - COMPUTE THE GAIN RATIO OF ATTRIBUTE INCOME
Gain Ratio
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which
attempts to overcome the bias of information gain toward attributes with many values. It
applies a kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as

SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
This value represents the potential information generated by splitting the training data set, D,
into v partitions, corresponding to the v outcomes of a test on attribute A. It differs from
information gain, which measures the information with respect to classification that is
acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

The attribute with the maximum gain ratio is selected as the splitting attribute.
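Continuing the information-gain sketch above (it reuses the gain() helper defined there), split information and gain ratio can be added as follows.

```python
import math
from collections import Counter

def split_info(D, A):
    """SplitInfo_A(D): potential information generated by splitting D on attribute A."""
    total = len(D)
    partitions = Counter(x[A] for x, _ in D)
    return -sum((n / total) * math.log2(n / total) for n in partitions.values())

def gain_ratio(D, A):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); gain() comes from the earlier sketch."""
    s = split_info(D, A)
    return gain(D, A) / s if s > 0 else 0.0   # guard against zero split information
```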
Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
Such methods typically use statistical measures to remove the least-reliable branches.
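As an illustration of post-pruning, here is a hedged sketch using scikit-learn's cost-complexity pruning (the approach associated with CART); C4.5's pessimistic pruning works differently but serves the same purpose. The data is synthetic.

```python
# A minimal post-pruning sketch via cost-complexity pruning (ccp_alpha); data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

unpruned = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X_train, y_train)

# The pruned tree has fewer nodes and often generalizes better to unseen tuples.
print(unpruned.tree_.node_count, unpruned.score(X_test, y_test))
print(pruned.tree_.node_count, pruned.score(X_test, y_test))
```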
The Bayesian Theorem is a fundamental concept in probability theory and statistics, named
after the Reverend Thomas Bayes. It provides a mathematical framework for updating
beliefs or probabilities about events based on new evidence or information. The theorem is
often represented algebraically as:

P(A∣B) = [P(B∣A) × P(A)] / P(B)
Where:
● P(A∣B) is the probability of event A occurring given that event B has occurred. This is
called the posterior probability.
● P(B∣A) is the probability of event B occurring given that event A has occurred. This is
called the likelihood.
● P(A) is the prior probability of event A occurring before any evidence is
considered. This is the initial belief.
● P(B) is the prior probability of event B occurring before any evidence is
considered.
1. Prior Probability (P(A)): This represents the initial belief or probability of event A
occurring before any new evidence is considered. It's what we believe about the
likelihood of event A based on previous knowledge, intuition, or experience.
2. Likelihood (P(B|A)): This is the probability of observing evidence B given that
event A has occurred. It quantifies how much the evidence supports or favors the
occurrence of event A. It's essentially the probability of the evidence given our
hypothesis.
3. Evidence or Observation (P(B)): This represents the total probability of observing
evidence B, regardless of the occurrence of event A. It serves as a normalization
factor and ensures that the posterior probability is a valid probability
distribution.
4. Posterior Probability (P(A|B)): This is the updated probability of event A occurring
after considering the new evidence B. It's what we want to calculate based on the
prior probability and the new evidence. It reflects our belief about the occurrence
of event A after considering the evidence.
The Bayesian Theorem allows us to update our beliefs about the likelihood of different
events as new evidence becomes available. It's widely used in various fields such as machine
learning, artificial intelligence, medicine, and finance, where uncertainty needs to be
quantified and updated based on new data or observations.
Therefore, based on these probabilities, we classify the new tuple as tall because it has the highest probability.
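The data table behind the worked example above is not shown here, but the step it illustrates is: for a new tuple X, compute P(X|Cj)P(Cj) for every class Cj and predict the class with the highest value. The sketch below does exactly that on a small invented "height" dataset (real implementations also apply Laplace smoothing to avoid zero probabilities).

```python
# A minimal naive Bayesian classification sketch; the training data is invented.
from collections import Counter

train = [  # (attribute dict, class label)
    ({"gender": "F", "height": "1.6-1.7"}, "short"),
    ({"gender": "F", "height": "1.7-1.8"}, "medium"),
    ({"gender": "M", "height": "1.7-1.8"}, "medium"),
    ({"gender": "M", "height": "1.8-1.9"}, "tall"),
    ({"gender": "M", "height": "1.9-2.0"}, "tall"),
]

def naive_bayes_predict(train, x):
    classes = Counter(label for _, label in train)
    n = len(train)
    best_class, best_score = None, -1.0
    for cj, count in classes.items():
        score = count / n                               # prior P(Cj)
        rows = [t for t, label in train if label == cj]
        for attr, value in x.items():                   # "naive" assumption: attributes independent
            matches = sum(1 for t in rows if t[attr] == value)
            score *= matches / len(rows)                # P(xk | Cj) estimated from frequencies
        if score > best_score:
            best_class, best_score = cj, score
    return best_class, best_score

print(naive_bayes_predict(train, {"gender": "M", "height": "1.9-2.0"}))   # -> ('tall', ...)
```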
Rule Based Classification
The model is represented as a set of IF-THEN rules.
Using IF-THEN Rules for Classification
The rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is
an expression of the form

IF condition THEN conclusion

An example is rule R1: IF age = youth AND student = yes THEN buys computer = yes.
The “IF” part (or left side) of a rule is known as the rule antecedent or precondition. The
“THEN” part (or right side) is the rule consequent. In the rule antecedent, the condition
consists of one or more attribute tests (e.g., age = youth and student = yes) that are logically
ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting
whether a customer will buy a computer). R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes)
To extract rules from a decision tree, one rule is created for each path from the root to a leaf
node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent
(“THEN” part).
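As a hedged illustration of how such a rule set can be applied, the sketch below stores each rule as an (antecedent, class) pair whose attribute tests are logically ANDed; the specific rules mirror the kind extracted from the buys_computer tree and are illustrative only.

```python
# A minimal rule-based classification sketch; the rule set is illustrative.
rules = [
    ({"age": "youth", "student": "no"}, "no"),
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "middle_aged"}, "yes"),
    ({"age": "senior", "credit_rating": "fair"}, "yes"),
    ({"age": "senior", "credit_rating": "excellent"}, "no"),
]

def classify_with_rules(x, rules, default="no"):
    for antecedent, consequent in rules:
        # All attribute tests in the antecedent must hold for the rule to fire.
        if all(x.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return default   # fallback (default class) if no rule covers the tuple

print(classify_with_rules({"age": "youth", "student": "yes"}, rules))   # -> yes
```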
Example:
“How can we prune the rule set?” For a given rule antecedent, any condition that does not improve the estimated accuracy
of the rule can be pruned (i.e., removed), thereby generalizing the rule. C4.5 extracts rules from an unpruned tree, and then
prunes the rules using a pessimistic approach similar to its tree pruning method. The training tuples and their associated class
labels are used to estimate rule accuracy.
There are many sequential covering algorithms. Popular variations include AQ, CN2, and the
more recent RIPPER. The general strategy is as follows. Rules are learned one at a time. Each
time a rule is learned, the tuples covered by the rule are removed, and the process repeats on
the remaining tuples. This sequential learning of rules is in contrast to decision tree induction.
Because the path to each leaf in a decision tree corresponds to a rule, we can consider
decision tree induction as learning a set of rules simultaneously.
What is Rule Induction?
Rule Induction is a technique in data mining used to extract IF-THEN rules from a dataset for classification.
IF (Temperature = Hot) AND (Humidity = High) THEN Play = No
These rules help in predicting or classifying unseen data.
First Rule:
IF Outlook = Overcast THEN PlayTennis = Yes
This covers 4 positive examples. Remove those rows from the dataset.
Next Rule:
From the remaining data, the next best rule is learned in the same way, and the process
repeats until no positive examples remain uncovered. (A code sketch of this covering
process follows the list of disadvantages below.)
❗ Disadvantages
● May overfit if too specific
● Order of rule learning can affect performance
● Doesn’t work well when rules overlap a lot
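Below is a rough sketch of the sequential covering strategy described above: learn one rule, remove the tuples it covers, and repeat. The rule search is deliberately simplified to single attribute-value tests chosen by precision; real learners such as CN2 or RIPPER grow and prune conjunctive rules. The data format (attribute dict plus a Yes/No label) is assumed for illustration, not taken from the notes.

```python
# A simplified sequential covering sketch (illustrative only).
def covers(rule, x):
    return all(x.get(a) == v for a, v in rule.items())

def learn_one_rule(data, target="Yes"):
    """Greedily pick the single attribute test with the best precision for the target class."""
    best_rule, best_precision = None, 0.0
    for x, _ in data:
        for attr, value in x.items():
            rule = {attr: value}
            covered = [(xi, yi) for xi, yi in data if covers(rule, xi)]
            precision = sum(1 for _, yi in covered if yi == target) / len(covered)
            if precision > best_precision:
                best_rule, best_precision = rule, precision
    return best_rule

def sequential_covering(data, target="Yes"):
    rules, remaining = [], list(data)
    while any(y == target for _, y in remaining):
        rule = learn_one_rule(remaining, target)
        if rule is None:
            break
        rules.append((rule, target))
        # Remove the tuples covered by the rule and repeat on the remaining tuples.
        remaining = [(x, y) for x, y in remaining if not covers(rule, x)]
    return rules
```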
You would like an estimate of how accurately the classifier can predict the purchasing
behaviour of future customers, that is, future customer data on which the classifier has not
been trained. You may even have tried different methods to build more than one classifier
and now wish to compare their accuracy. But what is accuracy? How can we estimate it? Are
some measures of a classifier’s accuracy more appropriate than others? How can we obtain a
reliable accuracy estimate? These questions are addressed in this section.
We will consider the case where the class tuples are more or less evenly distributed, as well
as the case where classes are unbalanced (e.g., where an important class of interest is rare,
such as in medical tests).
The classifier evaluation measures are summarized in Figure 8.13. They include accuracy
(also known as recognition rate), sensitivity (or recall), specificity, precision, F1, and Fβ.
Note that although accuracy is a specific measure, the word “accuracy” is also used as a
general term to refer to a classifier’s predictive abilities.
Given two classes, for example, the positive tuples may be buys computer = yes while the
negative tuples are buys computer = no.
There are four additional terms we need to know that are the “building blocks” used in
computing many evaluation measures.
True positives (TP): These refer to the positive tuples that were correctly labeled by the
classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly labeled by the
classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive
(e.g., tuples of class buys computer = no for which the classifier predicted buys computer =
yes). Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g.,
tuples of class buys computer = yes for which the classifier predicted buys computer = no).
Let FN be the number of false negatives.
The confusion matrix is a useful tool for analyzing how well your classifier can recognize
tuples of different classes. TP and TN tell us when the classifier is getting things right, while
FP and FN tell us when the classifier is getting things wrong.
🔹 Example
Let’s say we tested a model on 100 samples:
                          Predicted: Positive    Predicted: Negative
Actual: Positive (Yes)         40 (TP)                10 (FN)
Actual: Negative (No)           5 (FP)                45 (TN)
From this matrix:
● Accuracy = (40 + 45) / 100 = 85%
● Precision = 40 / (40 + 5) = 88.9%
● Recall = 40 / (40 + 10) = 80%
● F1 Score = 2 × (0.889 × 0.8) / (0.889 + 0.8) ≈ 84.2% (a quick code check of these numbers
follows below)
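A quick check of the numbers above, assuming the same TP, FN, FP, TN counts:

```python
# Recomputing the example metrics from the confusion matrix counts above.
TP, FN, FP, TN = 40, 10, 5, 45

accuracy    = (TP + TN) / (TP + TN + FP + FN)          # 0.85
precision   = TP / (TP + FP)                           # ~0.889
recall      = TP / (TP + FN)                           # 0.80 (sensitivity)
specificity = TN / (TN + FP)                           # 0.90
f1 = 2 * precision * recall / (precision + recall)     # ~0.842

print(accuracy, precision, recall, specificity, f1)
```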
🧠 Summary
● The confusion matrix is central to model evaluation in classification problems.
● It provides the foundation for key metrics like precision, recall, and F1 score.
● Helps in choosing the best model for your specific goal (e.g., high recall vs high
precision).