
Module 3

Classification
Classification is a process of categorizing items or data into distinct groups or
classes based on certain characteristics or attributes. It's a fundamental concept
used in various fields such as statistics, machine learning, and information
science.

For example, we can build a classification model to categorize bank loan applications as
either safe or risky. Such analysis can help provide us with a better understanding of the data
at large.

EXAMPLES:

o A bank loan officer needs an analysis of her data to learn which loan applicants are
“safe” and which are “risky” for the bank.
o A marketing manager needs data analysis to help guess whether a customer with a given
profile will buy a new computer.
o A medical researcher wants to analyze breast cancer data to predict which one of three
specific treatments a patient should receive.

In each of these examples,


❖ The data analysis task is classification, where a model or classifier is constructed to
predict class (categorical) labels, such as “safe” or “risky” for the loan application
data;
❖ “yes” or “no” for the marketing data;
❖ or “treatment A,” “treatment B,” or “treatment C” for the medical data.

These categories can be represented by discrete values, where the ordering among values has no
meaning.
• The algorithm that implements classification on a dataset is known as a classifier. There
are two types of classification:
i. Binary Classifier: The classification problem has only two possible outcomes.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
ii. Multi-class Classifier: The classification problem has more than two possible outcomes.
Examples: classification of types of crops, classification of types of music.

Regression analysis
is a statistical methodology that is most often used for numeric prediction.
Suppose that the marketing manager wants to predict how much a given customer will spend
during a sale. This data analysis task is an example of numeric prediction, where the model
constructed predicts a continuous-valued function, or ordered value, as opposed to a class
label. This model is a predictor.
Because regression analysis is the method most often used for numeric prediction, the terms
“regression” and “numeric prediction” tend to be used synonymously, although other methods for
numeric prediction exist. Classification and numeric prediction are the two major types of
prediction problems.

General Approach to Classification


Data classification is a two-step process, consisting of a
❖ learning step (where a classification model is constructed) and a
❖ classification step (where the model is used to predict class labels for given data).

The process is shown for the loan application data of Figure 8.1.
In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the
classifier by analysing or “learning from” a training set made up of database tuples and their
associated class labels.
A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn),
depicting n measurements made on the tuple from n database attributes, respectively, A1, A2,
..., An. Each tuple, X, is assumed to belong to a predefined class as determined by another
database attribute called the class label attribute. The class label attribute is discrete-valued
and unordered.
It is categorical (or nominal) in that each value serves as a category or class. The individual
tuples making up the training set are referred to as training tuples and are randomly sampled
from the database under analysis. In the context of classification, data tuples can be referred
to as samples, examples, instances, data points, or objects.

Because the class label of each training tuple is provided, this step is also known as
supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to
which class each training tuple belongs). It contrasts with unsupervised learning (or
clustering), in which the class label of each training tuple is not known, and the number or
set of classes to be learned may not be known in advance.

In the second step (Figure 8.1b), the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to measure the
classifier’s accuracy, this estimate would likely be optimistic, because the classifier tends to
overfit the data (i.e., during learning it may incorporate some particular anomalies of the
training data that are not present in the general data set overall). Therefore, a test set is used,
made up of test tuples and their associated class labels. They are independent of the training
tuples, meaning that they were not used to construct the classifier.

The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier. The associated class label of each test tuple is compared
with the learned classifier’s class prediction for that tuple.
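As a concrete illustration of the two steps, the following sketch uses scikit-learn and one of its bundled datasets (assumptions made purely for illustration; the text does not prescribe any library or dataset) to learn a classifier from a training set and then estimate its accuracy on an independent test set.

```python
# Minimal sketch of the two-step process (learning step, then classification step).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)             # tuples and their class labels

# Hold out an independent test set: its tuples are not used to build the classifier.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier().fit(X_train, y_train)   # Step 1: learning (training phase)
y_pred = clf.predict(X_test)                           # Step 2: classification of test tuples

# Accuracy = percentage of test tuples correctly classified.
print("Test-set accuracy:", accuracy_score(y_test, y_pred))
```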

Decision Tree Induction


Decision tree induction is the learning of decision trees from class-labeled training tuples.

A decision tree is a flowchart-like tree structure, where each internal node (non leaf node)
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (or terminal node) holds a class label. The topmost node in a tree is the root node. A
typical decision tree is shown in Figure 8.2.

It represents the concept of buying a computer, that is, it predicts whether a customer at
AllElectronics is likely to purchase a computer. Internal nodes are denoted by rectangles, and
leaf nodes are denoted by ovals.

“How are decision trees used for classification?” Given a tuple, X, for which the associated
class label is unknown, the attribute values of the tuple are tested against the decision tree. A
path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
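To make the root-to-leaf tracing concrete, here is a small hand-coded tree written as nested if-else tests. The attribute names and splits are assumptions loosely modeled on the AllElectronics buys_computer example, not necessarily the exact tree of Figure 8.2; note that each root-to-leaf path corresponds to one classification rule.

```python
# Illustrative hand-coded decision tree (assumed splits, for demonstration only).
def classify(x):
    if x["age"] == "youth":                      # test at the root node
        return "yes" if x["student"] == "yes" else "no"
    elif x["age"] == "middle_aged":
        return "yes"                             # leaf node holds the class label
    else:  # senior
        return "yes" if x["credit_rating"] == "fair" else "no"

# Tracing the path age = youth -> student = yes -> leaf "yes" is equivalent to the rule:
# IF age = youth AND student = yes THEN buys_computer = yes
print(classify({"age": "youth", "student": "yes", "credit_rating": "fair"}))  # -> yes
```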

“Why are decision tree classifiers so popular?”


● The construction of decision tree classifiers does not require any domain knowledge or
parameter setting and therefore is appropriate for exploratory knowledge discovery.
Decision trees can handle multidimensional data.
● Their representation of acquired knowledge in tree form is intuitive and generally
easy to assimilate by humans.
● The learning and classification steps of decision tree induction are simple and fast. In
general, decision tree classifiers have good accuracy.
● However, successful use may depend on the data at hand.
● Decision tree induction algorithms have been used for classification in many application
areas such as medicine, manufacturing and production, financial analysis,
astronomy, and molecular biology.
● Decision trees are the basis of several commercial rule induction systems. They are widely
used in such systems because they provide a clear, interpretable structure for decision-making:
they break complex decision processes into a series of simple, rule-based splits that are easy to
translate into "if-then" rules. As a result, they serve as a foundation for many expert systems
and data mining tools used in business and industry. Their ability to handle both categorical and
numerical data adds to their versatility and commercial appeal, and because they can be easily
visualized and understood by non-technical stakeholders, they are valuable in practical decision
support applications.

Basic Decision Tree Induction Algorithm

ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees
are constructed in a top-down recursive divide-and-conquer manner. Most algorithms for
decision tree induction also follow a top-down approach, which starts with a training set of
tuples and their associated class labels. The training set is recursively partitioned into smaller
subsets as the tree is being built.

Here we will study the ID3 algorithm for constructing decision trees.

● The algorithm is called with three parameters: D, attribute list, and Attribute selection
method. We refer to D as a data partition. Initially, it is the complete set of training tuples
and their associated class labels. The parameter attribute list is a list of attributes
describing the tuples. Attribute selection method specifies a heuristic procedure for
selecting the attribute that “best” discriminates the given tuples according to class. This
procedure employs an attribute selection measure such as information gain or the Gini
index. Whether the tree is strictly binary is generally driven by the attribute selection
measure. Some attribute selection measures, such as the Gini index, enforce the resulting
tree to be binary. Others, like information gain, do not, therein allowing multiway splits
(i.e., two or more branches to be grown from a node). The tree starts as a single node, N,
representing the training tuples in D (step 1).
● If the tuples in D are all of the same class, then node N becomes a leaf and is labeled
with that class (steps 2 and 3). Note that steps 4 and 5 are terminating conditions. All
terminating conditions are explained at the end of the algorithm.
● Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion. The splitting criterion tells us which attribute to test at node N by determining
the “best” way to separate or partition the tuples in D into individual classes (step 6). The
splitting criterion also tells us which branches to grow from node N with respect to the
outcomes of the chosen test. More specifically, the splitting criterion indicates the
splitting attribute and may also indicate either a split-point or a splitting subset. The
splitting criterion is determined so that, ideally, the resulting partitions at each branch are
as “pure” as possible. A partition is pure if all the tuples in it belong to the same class. In
other words, if we split up the tuples in D according to the mutually exclusive outcomes
of the splitting criterion, we hope for the resulting partitions to be as pure as possible.
● The node N is labelled with the splitting criterion, which serves as a test at the node (step
7). A branch is grown from node N for each of the outcomes of the splitting criterion.
The tuples in D are partitioned accordingly (steps 10 and 11). There are three possible
scenarios, as illustrated in Figure 8.4. A code sketch of this recursive strategy follows below.
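The following is a sketch of the top-down, recursive, divide-and-conquer strategy just described, restricted to multiway splits on categorical attributes. The data layout (a list of dicts with a class-label key) and the pluggable attribute_selection_method are assumptions for illustration; a concrete measure such as information gain is given in the next subsection.

```python
# Sketch of greedy top-down decision tree induction. D is a list of dicts;
# `label` names the class label attribute.
from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method, label="class"):
    classes = [t[label] for t in D]
    if len(set(classes)) == 1:               # all tuples in D belong to the same class
        return classes[0]                    # -> leaf labeled with that class
    if not attribute_list:                   # no attributes left to split on
        return Counter(classes).most_common(1)[0][0]   # -> majority-class leaf
    A = attribute_selection_method(D, attribute_list, label)   # splitting attribute
    node = {A: {}}
    remaining = [a for a in attribute_list if a != A]
    for value in sorted({t[A] for t in D}):  # one branch per outcome of the test on A
        Dj = [t for t in D if t[A] == value]
        node[A][value] = generate_decision_tree(Dj, remaining,
                                                attribute_selection_method, label)
    return node
```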

Attribute Selection Measures


An attribute selection measure is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes. If
we were to split D into smaller partitions according to the outcomes of the splitting criterion,
ideally each partition would be pure (i.e., all the tuples that fall into a given partition would
belong to the same class). Conceptually, the “best” splitting criterion is the one that most
closely results in such a scenario. Attribute selection measures are also known as splitting
rules because they determine how the tuples at a given node are to be split.

Information Gain
ID3 uses information gain as its attribute selection measure

Let node N represent or hold the tuples of partition D. The attribute with the highest
information gain is chosen as the splitting attribute for node N. This attribute minimizes the
information needed to classify the tuples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions. Such an approach minimizes the expected
number of tests needed to classify a given tuple and guarantees that a simple tree is found.
The expected information needed to classify a tuple in D is given by

Info(D) = − Σi pi log2(pi)

where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|. A log function to the base 2 is used, because the information is
encoded in bits. Info(D) is just the average amount of information needed to identify the class
label of a tuple in D. Note that, at this point, the information we have is based solely on the
proportions of tuples of each class. Info(D) is also known as the entropy of D.

How much more information would we still need (after the partitioning) to arrive at an exact
classification? This amount is measured by

InfoA(D) = Σj (|Dj|/|D|) × Info(Dj)

where Dj is the jth partition produced by splitting D on attribute A. Information gain is then
the difference between the original information requirement and the new requirement:

Gain(A) = Info(D) − InfoA(D)
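A small sketch of these measures for categorical attributes, matching the formulas above; the data layout (a list of dicts with a class-label key) is an assumption carried over from the earlier tree-growing sketch.

```python
import math
from collections import Counter

def info(D, label="class"):
    """Info(D) = -sum_i p_i * log2(p_i), the entropy of partition D."""
    total = len(D)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(t[label] for t in D).values())

def info_after_split(D, attribute, label="class"):
    """Info_A(D) = sum_j (|D_j|/|D|) * Info(D_j), expected info after splitting on A."""
    total = len(D)
    return sum((len(Dj) / total) * info(Dj, label)
               for Dj in ([t for t in D if t[attribute] == v]
                          for v in {t[attribute] for t in D}))

def gain(D, attribute, label="class"):
    """Gain(A) = Info(D) - Info_A(D); ID3 splits on the attribute maximizing this."""
    return info(D, label) - info_after_split(D, attribute, label)
```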

PROBLEM 1:

Example: PROBLEM 2- COMPUTE THE GAIN RATIO OF ATTRIBUTE INCOME

Gain Ratio

C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which
attempts to overcome the bias of information gain toward tests with many outcomes (attributes
with a large number of distinct values). It applies a kind of normalization to information gain
using a “split information” value defined analogously with Info(D) as

SplitInfoA(D) = − Σj (|Dj|/|D|) log2(|Dj|/|D|)

This value represents the potential information generated by splitting the training data set, D,
into partitions corresponding to the outcomes of a test on attribute A. It differs from
information gain, which measures the information with respect to classification that is acquired
based on the same partitioning. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfoA(D)

The attribute with the maximum gain ratio is selected as the splitting attribute.
THE SAME ABOVE PROBLEM IS SOLVED NUMERICALLY HERE

Tree Pruning

When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
Such methods typically use statistical measures to remove the least-reliable branches.

“How does tree pruning work?”


There are two common approaches to tree pruning: prepruning and postpruning.
1. In the prepruning approach, a tree is “pruned” by halting its construction early (e.g., by
deciding not to further split or partition the subset of training tuples at a given node).
Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among
the subset tuples or the probability distribution of those tuples. When constructing a tree,
measures such as statistical significance, information gain, Gini index, and so on, can be
used to assess the goodness of a split. If partitioning the tuples at a node would result in a
split that falls below a prespecified threshold, then further partitioning of the given subset
is halted.
2. The second and more common approach is postpruning, which removes subtrees from a
“fully grown” tree. A subtree at a given node is pruned by removing its branches and
replacing it with a leaf. The leaf is labeled with the most frequent class among the
subtree being replaced. For example, notice the subtree at node “A3?” in the unpruned
tree of Figure 8.6. Suppose that the most common class within this subtree is “class B.”
In the pruned version of the tree, the subtree in question is pruned by replacing it with the
leaf “class B.” A small sketch of this replacement follows below.
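A minimal postpruning sketch on the nested-dict tree representation used in the earlier sketches. The tree structure and the class labels passed in are illustrative assumptions echoing the “A3?” example, not the actual tree of Figure 8.6.

```python
from collections import Counter

def prune_at(tree, target_test, labels_in_subtree):
    """Replace the subtree whose root tests `target_test` with a leaf labeled
    by the most frequent class among the training tuples it covers."""
    if not isinstance(tree, dict):
        return tree                                        # already a leaf
    (test, branches), = tree.items()                       # one attribute test per node
    if test == target_test:
        return Counter(labels_in_subtree).most_common(1)[0][0]   # subtree -> majority leaf
    return {test: {outcome: prune_at(sub, target_test, labels_in_subtree)
                   for outcome, sub in branches.items()}}

# Illustrative: prune the subtree rooted at the "A3?" test; "class B" is assumed
# to be the most frequent class among the tuples reaching that subtree.
unpruned = {"A1?": {"yes": "class A",
                    "no": {"A3?": {"yes": "class A", "no": "class B"}}}}
print(prune_at(unpruned, "A3?", ["class B", "class B", "class A"]))
```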

Bayes Classification Methods


Bayesian classifiers are statistical classifiers. They can predict class membership probabilities
such as the probability that a given tuple belongs to a particular class. Bayesian classification
is based on Bayes’ theorem.

The Bayesian Theorem is a fundamental concept in probability theory and statistics, named
after the Reverend Thomas Bayes. It provides a mathematical framework for updating
beliefs or probabilities about events based on new evidence or information. The theorem is
often represented algebraically as

P(A|B) = P(B|A) × P(A) / P(B)

Where:

● P(A∣B) is the probability of event A occurring given that event B has occurred. This is
called the posterior probability.
● P(B∣A) is the probability of event B occurring given that event A has occurred. This is
called the likelihood.

● P(A) is the prior probability of event A occurring before any evidence is
considered. This is the initial belief.
● P(B) is the prior probability of event B occurring before any evidence is
considered.

Let's break down the components of the theorem:

1. Prior Probability (P(A)): This represents the initial belief or probability of event A
occurring before any new evidence is considered. It's what we believe about the
likelihood of event A based on previous knowledge, intuition, or experience.
2. Likelihood (P(B|A)): This is the probability of observing evidence B given that
event A has occurred. It quantifies how much the evidence supports or favors the
occurrence of event A. It's essentially the probability of the evidence given our
hypothesis.
3. Evidence or Observation (P(B)): This represents the total probability of observing
evidence B, regardless of the occurrence of event A. It serves as a normalization
factor and ensures that the posterior probability is a valid probability
distribution.
4. Posterior Probability (P(A|B)): This is the updated probability of event A occurring
after considering the new evidence B. It's what we want to calculate based on the
prior probability and the new evidence. It reflects our belief about the occurrence
of event A after considering the evidence.

The Bayesian Theorem allows us to update our beliefs about the likelihood of different
events as new evidence becomes available. It's widely used in various fields such as machine
learning, artificial intelligence, medicine, and finance, where uncertainty needs to be
quantified and updated based on new data or observations.
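As a quick numeric illustration (the numbers below are made up for this sketch, not taken from the text): suppose a disease has a 1% prior prevalence, a test detects it 95% of the time, and the test has a 5% false-positive rate. Bayes' theorem then updates the prior into a posterior as follows.

```python
# Illustrative numbers only (assumptions for this sketch).
p_a = 0.01                 # P(A): prior probability of the disease
p_b_given_a = 0.95         # P(B|A): likelihood of a positive test given the disease
p_b_given_not_a = 0.05     # false-positive rate

# P(B): total probability of the evidence (normalization factor)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A|B): posterior probability of the disease given a positive test
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # ~0.161: the evidence raises the 1% prior to about 16%
```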

Therefore, based on these probabilities, we classify the new tuple as tall because it has the highest probability.

Rule Based Classification
The model is represented as a set of IF-THEN rules.
Using IF-THEN Rules for Classification
The rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is
an expression of the form

IF condition THEN conclusion

For example, rule R1 might be: R1: IF age = youth AND student = yes THEN buys computer = yes.
The “IF” part (or left side) of a rule is known as the rule antecedent or precondition. The
“THEN” part (or right side) is the rule consequent. In the rule antecedent, the condition
consists of one or more attribute tests (e.g., age = youth and student = yes) that are logically
ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting
whether a customer will buy a computer). R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes)

Rule Extraction from a Decision Tree

To extract rules from a decision tree, one rule is created for each path from the root to a leaf
node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent
(“THEN” part).

Example:

“How can we prune the rule set?” For a given rule antecedent, any condition that does not improve the estimated accuracy
of the rule can be pruned (i.e., removed), thereby generalizing the rule. C4.5 extracts rules from an unpruned tree, and then
prunes the rules using a pessimistic approach similar to its tree pruning method. The training tuples and their associated class
labels are used to estimate rule accuracy.
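The rule-extraction procedure above (one rule per root-to-leaf path, with the path's tests ANDed into the antecedent) can be sketched on the nested-dict tree representation used in the earlier sketches; the example tree is an illustrative assumption.

```python
def extract_rules(tree, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):                      # reached a leaf
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {tree}"]
    (attribute, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():             # one path per branch outcome
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

# Illustrative tree (assumed structure, loosely modeled on the buys_computer example).
tree = {"age": {"youth": {"student": {"yes": "yes", "no": "no"}},
                "middle_aged": "yes",
                "senior": {"credit_rating": {"fair": "yes", "excellent": "no"}}}}
for rule in extract_rules(tree):
    print(rule)
```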

Rule Induction Using a Sequential Covering Algorithm


IF-THEN rules can be extracted directly from the training data (i.e., without having to
generate a decision tree first) using a sequential covering algorithm. The name comes from
the notion that the rules are learned sequentially (one at a time), where each rule for a given
class will ideally cover many of the class’s tuples (and hopefully none of the tuples of other
classes). Sequential covering algorithms are the most widely used approach to mining
disjunctive sets of classification rules, and form the topic of this subsection.

There are many sequential covering algorithms. Popular variations include AQ, CN2, and the
more recent RIPPER. The general strategy is as follows. Rules are learned one at a time. Each
time a rule is learned, the tuples covered by the rule are removed, and the process repeats on
the remaining tuples. This sequential learning of rules is in contrast to decision tree induction.
Because the path to each leaf in a decision tree corresponds to a rule, we can consider
decision tree induction as learning a set of rules simultaneously.

What is Rule Induction?
Rule Induction is a technique in data mining used to extract IF-THEN rules from a dataset for classification. For example:
IF (Temperature = Hot) AND (Humidity = High) THEN Play = No
These rules help in predicting or classifying unseen data.

🔹 What is Sequential Covering?


Sequential Covering Algorithm is a common method used for rule induction. It’s also called a
separate-and-conquer approach.
Instead of building one big model like a decision tree, it learns one rule at a time.

How Sequential Covering Works (Step-by-Step)


1. Start with all training data
2. Find the best rule that covers some of the positive examples (target class)
3. Remove the examples covered by this rule
4. Repeat the process on the remaining data until:
o All (or most) positive examples are covered
o Or a stopping condition is met
📌 This is called sequential covering because we cover the dataset sequentially, one rule at a time.
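The loop can be sketched as follows under simplifying assumptions: rules test a single attribute-value pair, the best rule is chosen greedily by its accuracy on the tuples it covers, and the data layout (a list of dicts with a class-label key) is assumed for illustration.

```python
# Simplified sequential covering: learn one single-condition rule at a time for
# the target class, remove the tuples it covers, and repeat.
def learn_one_rule(data, target, label):
    best, best_accuracy = None, -1.0
    for a in (k for k in data[0] if k != label):
        for v in {t[a] for t in data}:
            covered = [t for t in data if t[a] == v]
            accuracy = sum(t[label] == target for t in covered) / len(covered)
            if accuracy > best_accuracy:                 # greedy choice by rule accuracy
                best, best_accuracy = (a, v), accuracy
    return best

def sequential_covering(data, target, label, max_rules=10):
    rules, remaining = [], list(data)
    while remaining and any(t[label] == target for t in remaining) and len(rules) < max_rules:
        a, v = learn_one_rule(remaining, target, label)
        rules.append(f"IF {a} = {v} THEN {label} = {target}")
        remaining = [t for t in remaining if t[a] != v]  # remove the covered tuples
    return rules
```

Applied to the weather data in the example below, the first rule learned could be something like IF Outlook = Overcast THEN PlayTennis = Yes.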

🔄 Example: Weather Dataset


Let’s say we want to predict whether to play tennis (Yes/No) based on weather data.
Step 1: Initial Data
Outlook  | Temp | Humidity | Windy | PlayTennis
Sunny    | Hot  | High     | False | No
Overcast | Cool | Normal   | True  | Yes
Rainy    | Mild | High     | False | Yes
...      | ...  | ...      | ...   | ...
Step 2: Learn a Rule

First Rule:
IF Outlook = Overcast THEN PlayTennis = Yes
This covers 4 positive examples.
Remove those rows from the dataset.
Step 3: Next Rule:
From the remaining data, the next best rule could be:

IF Humidity = Normal AND Windy = False THEN PlayTennis = Yes


Repeat until most Yes examples are covered.

📐 Rule Evaluation Criteria


When creating each rule, we want it to be:
● Accurate: Correctly classifies most examples it covers
● General: Covers many examples (but not too many!)
We use metrics like:
● Coverage: the number of examples the rule applies to
● Accuracy: the fraction of covered examples the rule classifies correctly

📦 Output: Set of IF-THEN Rules


After completing the sequential covering, we get a list like:

Rule 1: IF Outlook = Overcast THEN Play = Yes


Rule 2: IF Humidity = Normal AND Windy = False THEN Play = Yes
Rule 3: ELSE Play = No
This rule list becomes our classification model.

✅ Advantages of Sequential Covering


● Easy to understand (simple IF-THEN rules)
● Efficient for datasets with clear patterns
● Good for knowledge discovery (rules are human-readable)

❗ Disadvantages
● May overfit if too specific
● Order of rule learning can affect performance
● Doesn’t work well when rules overlap a lot

Model Evaluation and Selection

Suppose you have built a classification model, for example, one that predicts which customers
will buy a computer. You would like an estimate of how accurately the classifier can predict the
purchasing behaviour of future customers, that is, future customer data on which the classifier
has not been trained. You may even have tried different methods to build more than one classifier
and now wish to compare their accuracy. But what is accuracy? How can we estimate it? Are
some measures of a classifier’s accuracy more appropriate than others? How can we obtain a
reliable accuracy estimate? These questions are addressed in this section.

Metrics for Evaluating Classifier Performance

We will consider the case where the class tuples are more or less evenly distributed, as
well as the case where classes are unbalanced (e.g., where an important class of interest is
rare, such as in medical tests).
The classifier evaluation measures are summarized in Figure 8.13. They include accuracy
(also known as recognition rate), sensitivity (or recall), specificity, precision, F1, and Fβ.
Note that although accuracy is a specific measure, the word “accuracy” is also used as a
general term to refer to a classifier’s predictive abilities.

Given two classes, for example, the positive tuples may be buys computer = yes while the
negative tuples are buys computer = no.

There are four additional terms we need to know that are the “building blocks” used in
computing many evaluation measures.

True positives (TP): These refer to the positive tuples that were correctly labeled by the
classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly labeled by the
classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive
(e.g., tuples of class buys computer = no for which the classifier predicted buys computer =
yes). Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g.,
tuples of class buys computer = yes for which the classifier predicted buys computer = no).
Let FN be the number of false negatives.

These terms are summarized in the confusion matrix of Figure 8.14.

The confusion matrix is a useful tool for analyzing how well your
classifier can recognize tuples of different classes. TP and TN tell us
when the classifier is getting things right, while FP and FN tell us when
the classifier is getting things wrong.

In addition to accuracy-based measures, classifiers can also be compared
with respect to the following additional aspects:
Speed: This refers to the computational costs involved in generating and
using the given classifier.
Robustness: This is the ability of the classifier to make correct
predictions given noisy data or data with missing values. Robustness is
typically assessed with a series of synthetic data sets representing
increasing degrees of noise and missing values.
Scalability: This refers to the ability to construct the classifier
efficiently given large amounts of data. Scalability is typically assessed
with a series of data sets of increasing size.
Interpretability: This refers to the level of understanding and insight
that is provided by the classifier or predictor. Interpretability is
subjective and therefore more difficult to assess. Decision trees and
classification rules can be easy to interpret, yet their interpretability may
diminish the more they become complex.

1. What is a Confusion Matrix?


A confusion matrix is a performance measurement tool for classification problems.
It compares the actual labels with the predicted labels made by a classification model.

🔹 3. Key Terms Explained


● True Positive (TP): Correctly predicted as positive
● True Negative (TN): Correctly predicted as negative
● False Positive (FP): Incorrectly predicted as positive
● False Negative (FN): Incorrectly predicted as negative

🔹 5. Example
Let’s say we tested a model on 100 samples:

                       | Predicted: Positive | Predicted: Negative
Actual: Positive (Yes) | 40 (TP)             | 10 (FN)
Actual: Negative (No)  | 5 (FP)              | 45 (TN)
From this matrix (recomputed in the sketch after this list):
● Accuracy = (40 + 45) / 100 = 85%
● Precision = 40 / (40 + 5) = 88.9%
● Recall = 40 / (40 + 10) = 80%
● F1 Score = 2 × (0.889 × 0.8) / (0.889 + 0.8) ≈ 84.2%
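The same numbers can be recomputed directly from the four counts, which also makes the specificity measure from Figure 8.13 explicit.

```python
# Recomputing the example metrics from the confusion matrix above.
TP, FN, FP, TN = 40, 10, 5, 45

accuracy    = (TP + TN) / (TP + TN + FP + FN)                 # 0.85
precision   = TP / (TP + FP)                                  # ~0.889
recall      = TP / (TP + FN)                                  # 0.80 (sensitivity)
specificity = TN / (TN + FP)                                  # 0.90
f1          = 2 * precision * recall / (precision + recall)   # ~0.842

print(accuracy, round(precision, 3), recall, specificity, round(f1, 3))
```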

🔹 6. When to Use Which Metric?


Metric      | Use When...
Accuracy    | Classes are balanced and all errors are equal
Precision   | False positives are costly (e.g., spam filters)
Recall      | False negatives are costly (e.g., medical tests)
F1 Score    | Need a balance between Precision and Recall
Specificity | Focus is on correctly identifying negatives

🔹 7. Model Selection Using Confusion Matrix


● Train different models (e.g., Decision Tree, SVM, Logistic Regression)
● Generate the confusion matrix for each
● Compare metrics like Precision, Recall, and F1 Score
● Choose the model that gives the best balance based on your problem needs (see the sketch below)
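A minimal sketch of this workflow, assuming scikit-learn and one of its bundled datasets purely for illustration; the models and settings shown are examples, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {"Decision Tree": DecisionTreeClassifier(random_state=0),
          "SVM": SVC(),
          "Logistic Regression": LogisticRegression(max_iter=5000)}

for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()   # one matrix per model
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"P={precision_score(y_test, y_pred):.3f} "
          f"R={recall_score(y_test, y_pred):.3f} "
          f"F1={f1_score(y_test, y_pred):.3f}")
```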

🧠 Summary
● The confusion matrix is central to model evaluation in classification problems.
● It provides the foundation for key metrics like precision, recall, and F1 score.
● Helps in choosing the best model for your specific goal (e.g., high recall vs high
precision).
