Data Mining Unit 2

Classification is a data analysis task aimed at predicting categorical class labels using a model built from input features. It has various real-world applications, including banking, marketing, and medical research, and differs from numeric prediction by focusing on discrete labels rather than continuous values. The general process involves a training phase with labeled data to learn a mapping function, followed by a testing phase to evaluate the model's accuracy on unseen data.


Classification

1. Definition of Classification

• Classification is a data analysis task used to predict categorical class labels (discrete and unordered).

• A model (also called a classifier) is built to assign data instances to predefined classes based on the input features.

2. Purpose & Real-World Applications

Application Area | Use of Classification
Banking          | Classify loan applicants as “safe” or “risky”
Marketing        | Predict whether a customer will buy a product (“yes” or “no”)
Medical research | Predict suitable treatment (“treatment A”, “B”, or “C”) based on patient data

• In all these examples, the goal is to predict a label (class) from input attributes
using historical data.

3. Classification vs Numeric Prediction

Feature        | Classification                     | Numeric Prediction
Output         | Discrete labels (e.g., safe/risky) | Continuous values (e.g., total purchase amount)
Data Type      | Categorical                        | Numeric
Example Method | Decision Trees, SVM                | Regression
Output / Model | Class label (classifier)           | Numeric value (predictor)
Goal           | Assign to a category               | Predict a numeric value

Numeric prediction is commonly associated with regression analysis, but other methods can also be used.
4. General Process of Classification

Step 1: Learning (Training Phase)

• Input: A set of training data tuples, where each tuple has:

o A set of input features:


X = (x1, x2, …, xn)

o A known class label 𝑦

• Goal: Learn a mapping function:

𝑓(𝑋) = 𝑦

• The mapping can be expressed as:

o Classification rules

o Decision trees

o Mathematical models

• This step is also called supervised learning, as the model is trained with labeled
examples.

Step 2: Classification (Testing Phase)

• Input: New, unlabeled data (test set).

• Classifier predicts class labels for these data.

• Accuracy Evaluation:

o Use test data (with known labels, but not seen during training).

o Compare predicted class vs actual class.

o Calculate accuracy:

Accuracy = (Correct predictions / Total test tuples) × 100
• The test data must be independent of the training data.
5. Key Terminology

Term                  | Description
Tuple (X)             | A data instance, also called an example, sample, or object
Attribute Vector      | Input feature vector X = (x1, x2, ..., xn)
Class Label           | Output category or target class (e.g., safe, risky)
Classifier / Model    | Learned function or rules for assigning class labels
Training Set          | Data with known class labels used to train the model
Test Set              | Data with known labels (not used during training) used to test the model
Supervised Learning   | Learning with labeled data
Unsupervised Learning | Learning without labeled data (e.g., clustering)

6. Example: Loan Application Classification

Training Data

Name         | Age         | Income | Loan Decision
Sandy Jones  | Youth       | Low    | Risky
Caroline Fox | Middle-aged | High   | Safe
Susan Lake   | Senior      | Low    | Safe

Learned Rules (Classifier Output)

IF age = youth THEN loan_decision = risky

IF income = high THEN loan_decision = safe

IF age = middle_aged AND income = low THEN loan_decision = risky

New Data Input (Test Tuple)

Name       | Age         | Income | Predicted Decision
John Henry | Middle-aged | Low    | Risky


7. Supervised vs Unsupervised Learning

Feature        | Supervised Learning        | Unsupervised Learning
Input Data     | Labeled                    | Unlabeled
Output         | Predict known class labels | Group similar data points
Example Task   | Classification             | Clustering
Example Method | Decision Trees, SVM        | K-Means, Hierarchical
Known Classes? | Yes                        | No

• Example of Unsupervised Learning:

o If loan decisions (safe/risky) were not known, we could use clustering to


group similar applications.

o These clusters could correspond to risk levels (even if we don't know what
“safe” or “risky” means at first).

8. Advantages of Classification Models

• Interpretable (in rule or tree form).

• Generalizable: Can be applied to unseen data.

• Compressed Representation: Reduces the need to store full data.

• Insights: Helps discover patterns and relationships in data.

Note on Overfitting

• A model trained too closely on the training data may overfit:

o Captures noise and anomalies.

o Performs poorly on new data.

• Hence, always test on unseen data to estimate real performance.


Decision Tree Induction
• Definition: Decision Tree Induction is a process of learning decision trees from
class-labeled training data (tuples).

• Structure:

o Root node: Topmost node representing the first attribute split.

o Internal nodes: Test on an attribute.

o Branches: Outcome of a test.

o Leaf nodes: Final class label (decision outcome).

Purpose of Decision Trees

• To classify an unknown tuple by tracing a path from root to leaf based on attribute values.

• Can be easily translated into classification rules (IF-THEN rules).

• Popular due to simplicity, interpretability, and ability to handle multidimensional data.

Advantages

• No domain knowledge or parameter tuning required.

• Good performance in many real-world tasks (medicine, finance, astronomy, etc.).

• Easy to understand for humans.

• Fast learning and classification.

Basic Decision Tree Induction Algorithm

Input:

• D: Training data (tuples with class labels)

• Attribute list: List of candidate attributes

• Attribute selection method: Heuristic (e.g., Information Gain, Gini Index)

Output:
• A Decision Tree

Steps:

1. Create a node N.

2. If all tuples in D belong to the same class → label N as a leaf node with that class.

3. If attribute list is empty → label N as a leaf node with the majority class in D.

4. Apply Attribute selection method to find the best attribute to split.

5. Label N with the splitting attribute.

6. Remove splitting attribute from attribute list (if multiway split is allowed).

7. For each outcome of the split:

o Create subset 𝐷𝑗 of tuples matching that outcome.

o If Dj is empty → attach a leaf node with majority class of D.

o Else → recursively call the algorithm on Dj.
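A minimal Python sketch of this recursive procedure, assuming tuples are (attribute-dict, label) pairs and the attribute-selection heuristic is supplied by the caller; all helper names are illustrative, not from the textbook:

```python
from collections import Counter

def majority_class(D):
    # D is a list of (attribute_dict, class_label) pairs
    return Counter(y for _, y in D).most_common(1)[0][0]

def build_tree(D, attributes, select_attribute):
    labels = {y for _, y in D}
    if len(labels) == 1:                  # step 2: all tuples in one class -> leaf
        return labels.pop()
    if not attributes:                    # step 3: no attributes left -> majority vote
        return majority_class(D)
    A = select_attribute(D, attributes)   # step 4: e.g., information gain or Gini index
    node = {"attribute": A, "branches": {}, "default": majority_class(D)}
    remaining = [a for a in attributes if a != A]   # step 6: drop A for a multiway split
    for value in {x[A] for x, _ in D}:    # step 7: one branch per observed value of A
        Dj = [(x, y) for x, y in D if x[A] == value]
        node["branches"][value] = build_tree(Dj, remaining, select_attribute)
    return node

def classify(tree, x):
    # walk from the root to a leaf; unseen values fall back to the node's majority class
    while isinstance(tree, dict):
        tree = tree["branches"].get(x[tree["attribute"]], tree["default"])
    return tree
```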

Splitting Criteria Scenarios

1. Discrete-valued attribute:

o Each known value of A gets a separate branch.

o E.g., color? → red, green, blue (each gets a branch).

2. Continuous-valued attribute:

o Two branches using a split-point.

o Conditions: A ≤ split-point and A > split-point.

3. Binary tree with discrete attribute:

o Create a subset SA of values.

o Condition: A ∈ SA? → binary split into yes/no.

Stopping Conditions

1. All tuples belong to the same class.

2. No attributes left → majority voting is used.

3. Empty partition Dj → leaf with majority class of parent D.


Computational Complexity

• O(n × |D| × log(|D|)), where:

o n = number of attributes

o |D| = number of tuples

Notable Algorithms

• ID3 (Iterative Dichotomiser 3) - by Quinlan

• C4.5 - successor of ID3

• CART (Classification and Regression Trees) - binary decision trees

Incremental Learning

• Some decision tree algorithms allow incremental updates, adjusting the tree
with new data instead of rebuilding it from scratch.

Attribute Selection Measures


What is Information Gain?

Information Gain (IG) is a measure based on information theory, particularly entropy, which
was introduced by Claude Shannon.

• Entropy (Info(D)): It tells us how impure (or uncertain) the dataset is. If a dataset
contains mixed class labels (e.g., both "yes" and "no"), entropy is high. If all tuples
belong to one class, entropy is zero.

• Formula for Entropy:


Info(D) = − Σ_{i=1}^{m} p_i · log2(p_i)

where p_i is the proportion of tuples in class C_i.

How is Information Gain Calculated?

1. Compute entropy of the whole dataset Info(D).


2. Split the dataset D based on an attribute A:

o If A is discrete, partition D into subsets based on its distinct values (e.g., "youth", "middle-aged", "senior").

o If A is continuous, evaluate possible split points (e.g., midpoints between adjacent sorted values).

3. Compute the weighted average entropy after the split, Info_A(D):

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) · Info(D_j)

where D_j is the subset of tuples with outcome a_j of attribute A.

4. Calculate the Information Gain:

Gain(A) = Info(D) − Info_A(D)

Why Use Information Gain in ID3?

• It selects the attribute that best reduces entropy, i.e., increases purity.

• The attribute with the highest Information Gain is chosen for the decision node.

• It aims to reduce uncertainty at each step and build a tree that classifies data with
as few questions as possible.

Example with “Age” Attribute (from Table 8.1):

• Dataset has 14 tuples: 9 “yes” and 5 “no”

Info(D) = −( (9/14) · log2(9/14) + (5/14) · log2(5/14) ) = 0.940 bits

Then, for age:

• Youth: 2 yes, 3 no

• Middle-aged: 4 yes, 0 no

• Senior: 3 yes, 2 no

Calculate the weighted average entropy Info_age(D):

Info_age(D) = (5/14) · 0.971 + (4/14) · 0 + (5/14) · 0.971 = 0.694 bits
Then,

Gain(age) = 0.940 − 0.694 = 0.246


Do this for all attributes, choose the one with the highest gain.
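A short Python check of the arithmetic above, assuming only the class counts per age group given in the example:

```python
import math

def entropy(counts):
    # entropy of a class distribution given as a list of counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_D = entropy([9, 5])                         # 0.940 bits for 9 "yes", 5 "no"

# (yes, no) counts within each value of age
age_partitions = {"youth": [2, 3], "middle_aged": [4, 0], "senior": [3, 2]}
total = 14
info_age = sum(sum(c) / total * entropy(c) for c in age_partitions.values())  # 0.694 bits
gain_age = info_D - info_age                     # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```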

Gain Ratio
Gain Ratio is a metric used in decision tree algorithms (like C4.5) to select the best
attribute for splitting the dataset. It addresses a major limitation of Information Gain,
which tends to favor attributes with many distinct values (like unique IDs), even when
those attributes are poor for generalization.

Why Not Just Use Information Gain?

Information Gain (IG) can be biased. For example:

• If an attribute like product_id uniquely identifies each record, splitting on it will create pure partitions, each with only one tuple.

• The entropy in each partition is 0, so IG is maximized.

• But such a split is useless for predicting new data (i.e., overfitting).

How Gain Ratio Fixes This

Gain Ratio penalizes attributes with many outcomes by normalizing the Information
Gain using a value called Split Information, which measures how broadly and evenly the
data is split.

Formulas

Let’s define:

• 𝐷: dataset

• 𝐷𝑗 : subset of 𝐷 for outcome 𝑗

• 𝑣: number of distinct outcomes for attribute 𝐴

1. Split Information:

SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) · log2(|D_j| / |D|)

This measures the potential information created by the split itself, regardless of class labels.
2. Gain Ratio:

GainRatio(A) = Gain(A) / SplitInfo_A(D)

This corrects the bias by dividing Information Gain by how “uniformly” the attribute
splits the data.

Note: If SplitInfo is very small (close to zero), GainRatio becomes unstable. To avoid
this, C4.5 ensures that the Information Gain is above average before considering
GainRatio.

Example
Given a split on income resulting in three partitions:

• low: 4 tuples

• medium: 6 tuples

• high: 4 tuples

Step 1: Compute SplitInfo


SplitInfo_income(D) = −( (4/14) · log2(4/14) + (6/14) · log2(6/14) + (4/14) · log2(4/14) ) = 1.557
Step 2: Use Gain from Example 8.1

Gain(income) = 0.029

Step 3: Compute GainRatio


GainRatio(income) = 0.029 / 1.557 ≈ 0.019
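A small Python check of this computation, reusing the partition sizes above (the 0.029 gain is taken from Example 8.1):

```python
import math

def split_info(sizes):
    # split information of a partition given only the partition sizes
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s > 0)

split_info_income = split_info([4, 6, 4])              # ~1.557
gain_income = 0.029                                    # from Example 8.1
gain_ratio_income = gain_income / split_info_income    # ~0.019
print(round(split_info_income, 3), round(gain_ratio_income, 3))
```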

Bayes Classification Methods

What Are Bayesian Classifiers?

• Bayesian classifiers are statistical classifiers that predict class membership probabilities, like the likelihood that a data tuple belongs to a specific class.

• They use Bayes’ theorem to make these predictions.

• Naïve Bayes, a simplified Bayesian method, performs comparably to decision trees and neural networks.
• They are known for high accuracy and speed, especially on large datasets.

Assumption of Naïve Bayes:

• Assumes class-conditional independence:

o The value of one attribute is independent of another, given the class label.

o This simplification makes computations more efficient, hence the term "naïve."

Gini Index
The Gini Index is used in the CART (Classification and Regression Trees) algorithm to
measure the impurity of a dataset.

Definition

For a dataset D with m classes:

Gini(D) = 1 − Σ_{i=1}^{m} p_i²

p_i: proportion of tuples in class C_i within dataset D

• A Gini index of 0 means the dataset is pure (only one class present).

• A Gini index closer to 1 indicates high impurity (mixed classes).

Binary Splits

CART allows only binary splits (dividing into two groups). For each attribute, we:

1. Try all possible binary splits.

2. Calculate the Gini index for each split.

3. Choose the split that minimizes the Gini index of the resulting partitions.

Discrete Attributes

If attribute A has v values, then there are 2^v − 2 valid binary splits (excluding the empty set and the full set).
Example:

• If income has values {low, medium, high}, valid splits are:

o {low}, {medium}, {high}, {low, medium}, {low, high}, {medium, high}

Each subset becomes a binary test like:

"Is income in {low, medium}?"

Continuous Attributes

For numeric attributes:

• Sort the values.

• Use midpoints between adjacent values as possible split points.

• Each split is of the form:

"Is A ≤ split_point?"

Weighted Gini Index After Split

If a split divides D into D1 and D2:

Gini_A(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)

Gini Gain (Reduction in Impurity)

Gain(A) = Gini(D) − Gini_A(D)

Choose the attribute and split that maximizes the gain (i.e., minimizes Gini_A(D)).

Example 8.3 – Gini Index in Action

Let the dataset D have:

• 9 tuples of class "yes"

• 5 tuples of class "no"

Step 1: Initial Impurity of D

Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
Step 2: Evaluate Attribute income

Split: income in {low, medium}

• D₁ = 10 tuples: 7 yes, 3 no → Gini(D₁) = 1 - (0.7² + 0.3²) = 0.42

• D₂ = 4 tuples: 2 yes, 2 no → Gini(D₂) = 0.5

Gini_income(D) = (10/14) · 0.42 + (4/14) · 0.5 = 0.443
Other splits gave higher values:

• {low, high} → Gini = 0.458

• {medium, high} → Gini = 0.450

Best income split: {low, medium} with Gini = 0.443

Step 3: Evaluate Other Attributes

• age = {youth, senior} → Gini = 0.357

• student → Gini = 0.367

• credit_rating → Gini = 0.429

Best overall split:


age = {youth, senior}

• Gini reduction = 0.459 - 0.357 = 0.102

So, age is selected as the root node's splitting attribute.
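A small Python check of the Gini arithmetic in Example 8.3, using only the class counts listed above:

```python
def gini(counts):
    # Gini impurity of a class distribution given as a list of counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                                   # 0.459

# binary split on income in {low, medium} vs {high}
d1, d2 = [7, 3], [2, 2]                                 # (yes, no) counts per partition
n = 14
gini_income = (sum(d1) / n) * gini(d1) + (sum(d2) / n) * gini(d2)   # 0.443
reduction = gini_D - gini_income
print(round(gini_D, 3), round(gini_income, 3), round(reduction, 3))
```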


Other Attribute Selection Measures


Common Issues with Traditional Measures:

• Information Gain:

o Biased towards attributes with many values (multi-valued).

• Gain Ratio:

o Adjusts for bias of Information Gain but prefers unbalanced splits (e.g.,
one partition is much smaller).

• Gini Index:

o Also biased toward multi-valued attributes.

o Struggles with many-class problems.


o Prefers equal-sized and pure partitions.

Alternative Measures:

• CHAID: Uses the Chi-square (χ²) test of independence.

• C-SEP: Performs better than Information Gain & Gini in certain scenarios.

• G-statistic: Similar to χ², information-theoretic.

• MDL (Minimum Description Length):

o Least biased toward multi-valued attributes.

o Chooses trees that minimize encoding cost (both tree and misclassified
exceptions).

• Multivariate Splits:

o Combine multiple attributes (e.g., linear combinations).

o Used in systems like CART.

o Essentially a type of attribute construction (discussed in data transformation).

Summary:

• No measure is universally best.

• Measures producing shallower trees (multiway, balanced splits) are often preferred.

• Shallower trees can increase leaf count and error – it's a tradeoff.

Tree Pruning (Handling Overfitting)


Purpose:

• To avoid overfitting by removing branches reflecting noise or outliers.

Two Approaches:

1. Prepruning (Early Stopping)

• Stop growing the tree early based on a threshold (e.g., gain, Gini).

• Node becomes a leaf with:

o Most frequent class

o Or probability distribution of classes


• Challenge: Choosing the right threshold (high = oversimplified, low =
ineffective)

2. Postpruning (Most Common)

• Build full tree first, then prune.

• Replace subtrees with a leaf node.

• Label = most frequent class in subtree.

Examples:

• Cost Complexity Pruning (CART):

o Based on:

▪ Number of leaves

▪ Error rate

o Uses a pruning set (different from training/test set).

• Pessimistic Pruning (C4.5):

o Uses training data, no separate prune set.

o Adjusts for optimism by adding a penalty to error estimate.

• MDL-based Pruning:

o Prune based on encoding length.

o Simplest (shortest encoded) tree is preferred.

o No need for a separate prune set.

Hybrid:

• Combine prepruning and postpruning.

3. Scalability of Decision Tree Induction

Problem:

• Traditional algorithms (ID3, C4.5, CART) require all training data in memory.

• Not suitable for large-scale, disk-resident datasets.

Scalable Algorithms:

RainForest
• Uses AVC-sets (Attribute-Value-Classlabel) at each node.

• Keeps class distributions for each attribute.

• Efficient because size depends on # of distinct values and classes, not # of tuples.

• Handles memory overflow gracefully.

• Compatible with any decision tree algorithm.

BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

• Uses bootstrapping to build decision trees from memory-fit samples.

• Constructs multiple trees from subsets → merges them to form final tree.

• Efficient:

o Only 2 scans of data.

o Faster than RainForest.

• Supports incremental updates (can adjust for new/deleted data without full
reconstruction).

4. Visual Mining for Decision Tree Induction

PBC (Perception-Based Classification):

• Interactive decision tree construction.

• Helps users visualize and understand multidimensional data.

Key Concepts:

• Uses pixel-based visualization:

o Data mapped to a circle segmented by attributes.

o Each pixel represents a value-color for the class label.

• Visualization of homogeneous class regions helps in manual splitting.

Interface:

• Data Interaction Window: shows circle segments of data.

• Knowledge Interaction Window: shows decision tree built so far.

Benefits:
• Allows manual split selection (one or more split points).

• Can generate multiway splits, especially for numeric attributes.

• Trees are usually smaller and more interpretable.

• Supports human-in-the-loop learning by combining domain expertise with algorithmic insights.

Bayes’ Theorem
Let:

• X: a data tuple (evidence).

• H: a hypothesis, e.g., “X belongs to class C.”

We want to compute the posterior probability:

𝑃( 𝐻 ∣ 𝑋 )

This is the probability that hypothesis H holds given evidence X.

Key Probability Terms:

Term     | Meaning
P(H)     | Prior probability of hypothesis H, before seeing the data
P(X)     | Prior probability of evidence X, how often this data occurs
P(X | H) | Likelihood: probability of seeing evidence X given H
P(H | X) | Posterior probability: updated belief in H after seeing X

Bayes’ Theorem Formula:

P(H | X) = ( P(X | H) · P(H) ) / P(X)
Example:

• H: Customer buys a computer.

• X: Customer is 35 years old with $40,000 income.


• Goal: Compute probability that this customer will buy a computer given age and
income.
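A minimal sketch of the computation with made-up (purely hypothetical) probabilities for this example:

```python
# Hypothetical probabilities -- not taken from the textbook dataset
p_h = 0.5            # P(H): prior that a customer buys a computer
p_x_given_h = 0.12   # P(X|H): chance a buyer is 35 years old with $40,000 income
p_x = 0.1            # P(X): chance of that age/income profile overall

p_h_given_x = p_x_given_h * p_h / p_x    # posterior P(H|X) = 0.12 * 0.5 / 0.1 = 0.6
print(p_h_given_x)
```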

Rule-Based Classification
Overview

A rule-based classifier is a classification system that uses a set of IF-THEN rules to classify data tuples. These rules are easy to understand, implement, and interpret, making them ideal for knowledge representation and decision-making tasks.

Using IF-THEN Rules for Classification

Structure of a Rule

A rule is expressed in the form:

IF <condition> THEN <class>

• Antecedent (Precondition): The "IF" part, which is a combination of one or more attribute tests (conditions). These are connected using logical AND.

• Consequent (Conclusion): The "THEN" part, which contains the predicted class
label.

Example Rule (R1):

IF age = youth AND student = yes THEN buys_computer = yes

Alternative representation:

(age = youth) ∧ (student = yes) → (buys_computer = yes)

Rule Satisfaction

• A rule is satisfied (triggered) if all attribute conditions in its antecedent are true
for a tuple.

• If a rule is triggered by a tuple, it is said to cover that tuple.


Rule Evaluation: Coverage and Accuracy

Let:

• n_covers = number of tuples covered (i.e., satisfied) by the rule

• n_correct = number of correctly classified tuples among those covered

• |D| = total number of tuples in the dataset

Coverage:

Percentage of tuples in the dataset that are covered by the rule:


coverage(R) = n_covers / |D|
Accuracy:

Percentage of correctly classified tuples among those that are covered:

accuracy(R) = n_correct / n_covers
Example (R1):

• n_covers = 2, n_correct = 2, |D| = 14

• Coverage: 2/14 ≈ 14.28%

• Accuracy: 2/2 = 100%
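A tiny Python sketch of how coverage and accuracy of a rule like R1 could be computed, assuming tuples are dictionaries with the attributes used above (field names are hypothetical):

```python
def rule_r1(t):
    # antecedent of R1: age = youth AND student = yes
    return t["age"] == "youth" and t["student"] == "yes"

def evaluate_rule(rule, predicted_class, data):
    covered = [t for t in data if rule(t)]                       # tuples the rule covers
    correct = [t for t in covered if t["buys_computer"] == predicted_class]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy
```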

Rule Firing and Prediction

To classify a new tuple X, we follow this process:

1. Check which rules are satisfied by X (triggered).

2. If only one rule is triggered, that rule fires and assigns its class label.

3. If multiple rules are triggered, use a conflict resolution strategy.

4. If no rule is triggered, use a default rule.


Conflict Resolution Strategies

When multiple rules are triggered, we must choose which one to fire. Common
strategies include:

1. Size Ordering

• Preference is given to the rule with the most conditions (most specific rule).

• Rule with the largest antecedent size (most attribute tests) is selected.

2. Rule Ordering

• Rules are arranged in a priority list (called a decision list).

• The first rule in the list that is triggered fires.

• Ordering criteria can include:

o Rule accuracy

o Rule coverage

o Rule size

o Domain expert knowledge

There are two kinds of rule ordering:

• Class-Based Rule Ordering

o Classes are sorted by importance (e.g., frequency or cost).

o Rules within each class are unordered.

• Rule-Based Ordering

o Rules are ordered globally, regardless of class.

o The first matching rule fires.

Note: Rules in a decision list imply negation of the earlier rules. That is, each rule
assumes the previous ones didn’t apply.

What if No Rule Matches?

• A default rule can be used.

• It has no condition and applies only if no other rule is triggered.

• Typically predicts the majority class (overall or among uncovered tuples).


Rule Extraction from a Decision Tree
Overview

Decision trees are a widely used classification method due to their simplicity and
effectiveness. However, large trees can be hard to interpret. To improve
interpretability, we can convert a decision tree into a set of IF-THEN classification
rules.

Why Extract Rules?

• Rules are often easier for humans to understand than a large decision tree.

• They represent knowledge in a modular, interpretable way.

• Rule sets can be used as an alternative model to a tree classifier.

How to Extract Rules from a Decision Tree

Basic Idea:

• One rule is created per path from the root to a leaf node.

• The splitting criteria (conditions) along the path form the antecedent (IF part).

• The class label at the leaf node becomes the consequent (THEN part).

Example – Extracting Rules

From a decision tree (as in Figure 8.2 of the textbook), the following rules are extracted:

R1:

IF age = youth AND student = no THEN buys_computer = no

R2:

IF age = youth AND student = yes THEN buys_computer = yes

R3:

IF age = middle-aged THEN buys_computer = yes

R4:

IF age = senior AND credit_rating = excellent THEN buys_computer = yes


R5:

IF age = senior AND credit_rating = fair THEN buys_computer = no

Properties of Extracted Rules

1. Mutually Exclusive:

o No two rules apply to the same tuple.

o Each tuple maps to exactly one path in the tree.

2. Exhaustive:

o There is a rule for every possible attribute-value combination.

o No need for a default rule.

3. Unordered Rules:

o Since the rules are mutually exclusive, order does not matter.

Rule Pruning

Although rule extraction is straightforward, the resulting rule set can be large and
contain redundant or irrelevant conditions. Therefore, rule pruning is necessary.

Purpose of Pruning:

• Simplify rules.

• Improve generalization.

• Eliminate unnecessary conditions or rules.

How to Prune Rules

For a given rule:

• Remove any condition that does not improve the rule’s estimated accuracy.

• This leads to a more general and concise rule.

C4.5 Rule Pruning Approach

• Extracts rules from an unpruned tree.


• Uses a pessimistic pruning approach:

o Rule accuracy is estimated from training data.

o To avoid optimistic bias, the estimate is adjusted pessimistically.

• Any rule that does not improve overall accuracy is removed.

Consequences of Pruning

• After pruning, the rules are no longer mutually exclusive or exhaustive.

• This can lead to rule conflicts or uncovered tuples.

Conflict Resolution (Post-Pruning)

To manage conflicts among overlapping rules, C4.5 uses class-based rule ordering:

1. Group rules by class.

2. Rank class rule sets to minimize false positives:

o A false positive occurs when a rule predicts class C but the actual class is
not C.

3. Evaluate class rule sets in order of increasing false positives.

Default Class Selection (C4.5)

• Default class is NOT the majority class.

• Instead, it is the class that contains the most uncovered training tuples.

• This strategy ensures better handling of tuples not matched by any rule.
Rule Induction Using a Sequential Covering Algorithm

Overview

• Goal: Generate a set of IF-THEN classification rules directly from training data, without building a decision tree first.

• Approach: Learn one rule at a time for each class.

• Method: Known as the Sequential Covering Algorithm, widely used for disjunctive rule learning.

Why “Sequential Covering”?

• Sequential: Rules are learned one at a time.

• Covering: Each rule ideally covers many tuples of the target class and few or
none of the others.

• After learning a rule: Remove the tuples it covers and repeat the process with
remaining data.

Popular Algorithms

Some implementations of sequential covering:

• AQ

• CN2

• RIPPER (Repeated Incremental Pruning to Produce Error Reduction)

Comparison with Decision Tree Induction

Feature         | Sequential Covering    | Decision Tree
Learning Style  | One rule at a time     | Entire tree at once
Rule Generation | Directly from data     | Paths from tree
Tuple Handling  | Covered tuples removed | All tuples used
Rule Set        | Can be overlapping     | Mutually exclusive

Basic Sequential Covering Algorithm (Steps)

Input:

- D: Dataset of class-labeled tuples

- 𝐴𝑡𝑡_𝑣𝑎𝑙𝑠: Set of all attribute-value pairs

Output:

- A set of IF-THEN rules

Method:

1. Initialize rule set as empty.

2. For each class c:

o Repeat:

▪ Learn a rule for class c.

▪ Remove tuples covered by this rule from D.

▪ Add rule to the rule set.

o Until a terminating condition is met.

3. Return the complete rule set.

Rule Learning (Learn One Rule)

• Rule learning is done in a general-to-specific manner (greedy strategy).

• Start: with the most general rule (empty antecedent).

IF [ ] THEN class = C

• At each step:

o Consider adding attribute tests (like income = high, age > 25).

o Select the test that gives the best improvement in rule quality.
• Add this test to the rule, and repeat until rule quality is satisfactory.
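A minimal general-to-specific sketch of Learn_One_Rule, assuming a rule is a list of (attribute, value) tests, tuples are dictionaries with a "class" key, and rule accuracy is the quality measure (all names are illustrative):

```python
def covers(rule, t):
    # every test in the antecedent must hold for the tuple
    return all(t[a] == v for a, v in rule)

def quality(rule, D, target):
    # rule accuracy on the tuples the rule covers
    covered = [t for t in D if covers(rule, t)]
    if not covered:
        return 0.0
    return sum(t["class"] == target for t in covered) / len(covered)

def learn_one_rule(D, att_vals, target, min_quality=0.9):
    rule = []                                   # most general rule: IF [] THEN class = target
    while quality(rule, D, target) < min_quality:
        candidates = [rule + [(a, v)] for a, v in att_vals if (a, v) not in rule]
        if not candidates:
            break
        best = max(candidates, key=lambda r: quality(r, D, target))
        if quality(best, D, target) <= quality(rule, D, target):
            break                               # greedy: stop when no conjunct improves quality
        rule = best
    return rule
```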

Search Strategy

General-to-Specific Search:

• Begins with the most general rule.

• Gradually specialize by adding conjuncts (AND conditions).

Greedy Search:

• No backtracking.

• Always picks the best local option at each step.

• Fast but may make sub-optimal choices.

Beam Search (Improved):

• Maintain top k candidates instead of just 1.

• Reduces risk of bad decisions due to local optima.

• Beam width = k

Example: Loan Decision Data

Imagine a dataset with features like age, income, credit rating, etc.

• Start with:

IF [ ] THEN loan_decision = accept

• Add best test:

IF income = high THEN loan_decision = accept

• Add next best:

IF income = high AND credit_rating = excellent THEN loan_decision = accept

• Continue until rule reaches acceptable accuracy or quality threshold.

Rule Quality

• Rules are evaluated based on quality measures, e.g.:

o Accuracy
o Coverage

o Confidence

o Information Gain

• The goal is to maximize accuracy while keeping the rule simple.

Termination Conditions

Rule induction for a class stops when:

• No more tuples of that class remain.

• No high-quality rule can be learned.

• A predefined number of rules has been generated.

• Rule’s accuracy falls below a threshold.

Advantages

• No need to build a full decision tree.

• Can produce compact, interpretable rule sets.

• Flexible in handling overlapping classes.

Limitations

• Greedy strategy may miss better rules.

• Tuple removal can sometimes lose useful training data.

• Beam search helps but increases computational complexity.


Model Evaluation and Selection
Why Evaluate a Model?

• To estimate how well a classifier will perform on unseen (future) data.

• To compare multiple classifiers and choose the best one.

• Evaluation helps avoid overfitting and provides realistic performance expectations.

8.5.1: Metrics for Evaluating Classifier Performance

Key Definitions:

• True Positives (TP): Correctly predicted positive cases.

• True Negatives (TN): Correctly predicted negative cases.

• False Positives (FP): Negative cases incorrectly predicted as positive.

• False Negatives (FN): Positive cases incorrectly predicted as negative.

Confusion Matrix (Binary Class Example):

                | Predicted Positive | Predicted Negative
Actual Positive | TP                 | FN
Actual Negative | FP                 | TN

Evaluation Metrics:

Metric                     | Formula                                                       | Meaning
Accuracy                   | (TP + TN) / (P + N)                                           | Overall correctness
Error Rate                 | (FP + FN) / (P + N) = 1 − Accuracy                            | Rate of incorrect predictions
Sensitivity (Recall / TPR) | TP / P                                                        | True positive rate (correct positives)
Specificity (TNR)          | TN / N                                                        | True negative rate (correct negatives)
Precision                  | TP / (TP + FP)                                                | Correctness of predicted positives
F1 Score                   | 2 * (Precision * Recall) / (Precision + Recall)               | Harmonic mean of precision and recall
Fβ Score                   | (1 + β²) * (Precision * Recall) / ((β² * Precision) + Recall) | Weighted F-measure

• F1: Equal weight to precision and recall.

• F2: Recall is more important.

• F0.5: Precision is more important.
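A quick Python sketch of these metrics computed from raw confusion-matrix counts (the counts are made up):

```python
TP, FP, TN, FN = 90, 10, 880, 20           # hypothetical counts
P, N = TP + FN, TN + FP

accuracy    = (TP + TN) / (P + N)
error_rate  = (FP + FN) / (P + N)
sensitivity = TP / P                        # recall / true positive rate
specificity = TN / N                        # true negative rate
precision   = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

beta = 2                                    # F2: recall weighted more heavily
f_beta = (1 + beta**2) * precision * sensitivity / (beta**2 * precision + sensitivity)
print(round(accuracy, 3), round(f1, 3), round(f_beta, 3))
```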

When Accuracy is Misleading

• In imbalanced datasets, accuracy can be high even if the classifier fails on the
minority (important) class.

o Example: If only 3% of cases are positive (e.g., cancer), a 97% accuracy might still mean all positives are misclassified.

o Use sensitivity, specificity, precision, and F-measures instead.

Additional Classifier Evaluation Criteria

• Speed: Time to train and test the model.

• Robustness: Performance with noisy or missing data.

• Scalability: Ability to handle increasing data size efficiently.

• Interpretability: How understandable the model is (e.g., decision trees vs. neural networks).

Model Evaluation Methods


1. Holdout Method

• Procedure: Split data into two sets — training (usually 2/3) and test (1/3).

• Use: Model is trained on the training set and evaluated on the test set.

• Drawback: The accuracy estimate is pessimistic because only part of the data is used for training.

2. Random Subsampling

• Variation of holdout method.


• Procedure: Repeat the holdout method k times with different random splits.

• Accuracy Estimate: Average accuracy over k iterations.

3. k-Fold Cross-Validation

• Procedure:

o Split data into k equal-sized folds.

o For each iteration, one fold is used as test set, remaining 𝑘 − 1 folds for
training.

• Advantage: Every tuple is used exactly once for testing and 𝑘 − 1 times for
training.

• Common choice: k = 10 is popular due to low bias and variance.

Special Cases:

• Leave-One-Out Cross-Validation (LOOCV):

o k=n, where n = number of samples.

o Each test set contains only one tuple.

• Stratified Cross-Validation:

o Ensures each fold has approximately the same class distribution as the
full dataset.

o More reliable for imbalanced datasets.
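A plain-Python sketch of k-fold index splitting (without stratification), assuming the data fit in memory; libraries such as scikit-learn offer ready-made versions of the same idea:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    # shuffle indices once, then carve out k roughly equal folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# each tuple index appears in exactly one test fold and in k-1 training folds
for train_idx, test_idx in k_fold_indices(n=140, k=10):
    pass  # train the model on train_idx, evaluate it on test_idx
```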

4. Bootstrap Method

• Procedure:

o Create a bootstrap sample by sampling with replacement from the dataset (same size as the original).

o Tuples not selected form the test set.

• Key Insight: On average:

o 63.2% of tuples appear in training set.

o 36.8% form the test set.


o Based on probability: (1 − 1/d)^d ≈ e⁻¹ ≈ 0.368

• .632 Bootstrap Accuracy Estimate:

Acc(M) = (1/k) · Σ_{i=1}^{k} ( 0.632 · Acc(M_i)_test + 0.368 · Acc(M_i)_train )

• Use Case: Effective for small datasets, but may be overly optimistic.
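A short simulation of the 0.632 figure, assuming uniform sampling with replacement:

```python
import random

d = 10_000
sample = [random.randrange(d) for _ in range(d)]   # bootstrap sample, same size as D
in_training = len(set(sample)) / d                 # fraction of distinct tuples drawn
print(round(in_training, 3), round(1 - in_training, 3))   # ~0.632 in training, ~0.368 left for testing
```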

Model Selection Using Statistical Tests of Significance
When you have two models (M1 and M2) and use 10-fold cross-validation to compare
their performance, it's not enough to simply pick the model with the lowest average
error rate. The difference might just be due to random variation.

What you need:

• Multiple rounds of 10-fold cross-validation (say 10 runs).

• Calculate mean error rate and standard deviation of errors for each model.

• Use the paired Student’s t-test to check if the difference is statistically significant.

The Paired t-test Formula:

To compare error rates of M1 and M2 over 10 rounds (k=10):

t = ( mean(err(M1)) − mean(err(M2)) ) / sqrt( var(M1 − M2) / k )

where:

var(M1 − M2) = (1/k) · Σ [ (err(M1)_i − err(M2)_i − mean difference)² ]

• Degrees of freedom = k - 1 = 9

• Use a t-distribution table to find the critical t-value for a 5% significance level
(look up 0.025 because it's two-tailed).

• If your computed t is larger than the table value, you reject the null hypothesis
(i.e., the difference is real).
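A small sketch of the paired t statistic over k = 10 rounds, with made-up error rates:

```python
import math

err_m1 = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.18, 0.22, 0.21, 0.20]   # hypothetical
err_m2 = [0.24, 0.25, 0.22, 0.26, 0.24, 0.23, 0.22, 0.25, 0.24, 0.23]   # hypothetical
k = len(err_m1)

diffs = [a - b for a, b in zip(err_m1, err_m2)]
mean_diff = sum(diffs) / k
var_diff = sum((d - mean_diff) ** 2 for d in diffs) / k    # variance of the paired differences
t = mean_diff / math.sqrt(var_diff / k)

# compare |t| with the critical value for k-1 = 9 degrees of freedom at the 5% level (two-tailed)
print(round(t, 2))
```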

Comparing Classifiers Based on Cost–Benefit and ROC Curves


Cost–Benefit Analysis:

In some applications, false positives and false negatives have different consequences:

• Example: Cancer diagnosis

o False negative (missed cancer): very dangerous

o False positive (incorrect cancer): more tests, costs

Instead of accuracy, compute average cost per decision by assigning real costs to TP,
FP, TN, and FN.

ROC Curve (Receiver Operating Characteristic Curve)

An ROC curve plots:

• TPR (True Positive Rate) on the y-axis

• FPR (False Positive Rate) on the x-axis

Steps to Plot ROC:

1. Use a probabilistic classifier (e.g., Naive Bayes, neural nets).

2. Rank test tuples by their predicted probability of being positive.

3. Set different thresholds and compute TPR & FPR.

4. Plot points (FPR, TPR). Move:

o Up for TP

o Right for FP

Example:

If a tuple with high probability is actually positive → TP ↑ → move up.
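A compact sketch of the ROC construction: rank tuples by predicted probability, sweep the threshold, and record (FPR, TPR) points (scores and labels are made up):

```python
# (predicted probability of the positive class, actual label) -- hypothetical scores
scored = [(0.95, 1), (0.85, 1), (0.70, 0), (0.60, 1), (0.50, 0), (0.30, 0)]
scored.sort(key=lambda s: s[0], reverse=True)

P = sum(y for _, y in scored)
N = len(scored) - P
tp = fp = 0
roc_points = [(0.0, 0.0)]
for _, y in scored:              # lowering the threshold one tuple at a time
    if y == 1:
        tp += 1                  # true positive -> move up
    else:
        fp += 1                  # false positive -> move right
    roc_points.append((fp / N, tp / P))
print(roc_points)                # ends at (1.0, 1.0); the area under these points is the AUC
```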

Interpretation:

• The steeper and higher the curve, the better the model.

• Area Under Curve (AUC):

o AUC = 1 → perfect model

o AUC = 0.5 → random guessing

o The larger the AUC, the better


Comparing Models (e.g., M1 vs M2):

• If M1’s ROC curve is above M2’s, then M1 is more accurate.

• Figure 8.20 shows M1 better than M2.

Techniques to Improve Classification Accuracy


1. Ensemble Methods

• Definition: An ensemble method is a technique that combines multiple classifiers (models) to create a composite model that is generally more accurate than any individual model.

• Purpose: Boost classification accuracy and model robustness.

• Mechanism: Each base classifier votes on the class label, and the ensemble
outputs the majority-voted label.

Why Ensembles Work

• Error Reduction: An ensemble reduces the chance of misclassification because it only misclassifies when the majority of base classifiers are wrong.

• Diversity: Ensembles are more effective when the base classifiers are diverse
(i.e., they make different kinds of errors).

• Better Than Random Guessing: Each classifier should perform better than
random chance.

• Parallelizability: Since each model can be trained independently, ensemble methods can be efficiently parallelized.

2. Key Ensemble Methods

a. Bagging (Bootstrap Aggregating) – Section

• Trains each classifier on a different bootstrap sample of the data.

• Reduces variance and helps avoid overfitting.

b. Boosting – Section

• Trains classifiers sequentially, where each new model focuses on the errors of
the previous ones.

• Aims to reduce bias and improve performance on hard-to-classify examples.


c. Random Forests – Section

• A special type of bagging that uses decision trees.

• Adds extra randomness by choosing a random subset of features for each split in
the tree.

• Very effective and widely used.

3. Class Imbalance Problem – Section

• Definition: Occurs when one class significantly outnumbers another (e.g., fraud
detection, rare disease).

• Challenge: Standard classifiers may be biased toward the majority class.

• Solutions:

o Data-level approaches (e.g., oversampling the minority class, undersampling the majority class)

o Algorithmic-level approaches (e.g., cost-sensitive learning, ensemble methods tailored for imbalance)

4. Example: Decision Boundaries

• A single decision tree struggles with representing linear decision boundaries.

• An ensemble of decision trees can approximate the linear boundary more accurately.

• This demonstrates how ensembles can generalize better.

1. Bagging (Bootstrap Aggregating)


Concept:

Bagging is an ensemble technique that improves classification (or regression) accuracy by reducing variance through averaging multiple models trained on different subsets of data.

Analogy:

• Just like seeking diagnoses from multiple doctors and deciding based on
majority vote, bagging trains multiple models (classifiers) and predicts the
class of an instance based on a majority vote from those models.
How Bagging Works:

1. Input:

o Dataset D with d tuples (instances).

o Number of models (iterations) k.

o A base learning algorithm (e.g., decision trees, naive Bayes).

2. Training Phase (Ensemble Creation):


For each i = 1 to k:

o Create a bootstrap sample Di by sampling d tuples from D with replacement.

o Train a model Mi using the learning algorithm on Di.

3. Prediction Phase:

o To classify a new instance X:

▪ Let each model Mi predict the class label of X.

▪ Use majority voting among all models to determine the final class
label.

o For regression tasks, take the average of all predictions.
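A minimal bagging sketch around any base learner; train_base and the .predict method are placeholders, not a specific library API:

```python
import random
from collections import Counter

def bagging_train(D, k, train_base):
    # D: list of (x, y) tuples; train_base(Di) returns a model with a .predict(x) method
    models = []
    for _ in range(k):
        Di = [random.choice(D) for _ in range(len(D))]   # bootstrap sample, with replacement
        models.append(train_base(Di))
    return models

def bagging_predict(models, x):
    votes = [m.predict(x) for m in models]               # each classifier gets an equal vote
    return Counter(votes).most_common(1)[0][0]           # majority class
```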

Key Features:

• Sampling with Replacement:

o Some original tuples may be repeated in a Di, while others may be left
out.

• Equal Weighting:

o All classifiers contribute equally to the final prediction.

Advantages of Bagging:

• Increased Accuracy:

o The ensemble model usually performs better than a single model trained on the full dataset.
• Reduces Variance:

o Makes the model less sensitive to fluctuations in the training data.

• Robust to Overfitting and Noise:

o Especially effective for high-variance models like decision trees.

2. Boosting and AdaBoost


Concept:

Boosting is an ensemble method that improves accuracy by combining multiple weak learners. Unlike bagging (where all models are equally weighted), boosting gives more weight to accurate classifiers and focuses learning on previously misclassified data.

Intuition:

• Like consulting multiple doctors and weighing their opinions based on past
accuracy.

• Classifiers are trained sequentially, and each new model focuses more on the
difficult cases misclassified by the previous ones.

How Boosting Works:

1. Initialize Weights:

o All training tuples are initially given equal weight: 1/d for d data points.

2. Iteratively Train k Classifiers (Rounds):


For each round i from 1 to k:

o Sample a training set Di from D with replacement, based on current tuple weights.

o Train a model Mi on Di.

o Evaluate error rate of Mi on Di using:


error(M_i) = Σ_{j=1}^{d} w_j · err(X_j)

where err(X_j) = 1 if X_j is misclassified, else 0.


o If error(Mi) > 0.5, discard Mi and retry the round.

o Update weights:

▪ Increase weights of misclassified tuples.

▪ Decrease weights of correctly classified tuples:

w_j ← w_j · error(M_i) / (1 − error(M_i))
o Normalize the weights so that they sum to 1.

3. Final Ensemble Prediction:

o Each model’s vote is weighted by its accuracy:

w_i = log( (1 − error(M_i)) / error(M_i) )
o For a new instance X:

▪ Each model Mi predicts a class label.

▪ Add 𝑤𝑖 to that class’s total vote.

▪ Return the class with the highest total weighted vote.
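A sketch of the boosting bookkeeping described above, assuming a train_base(Di) helper that returns a classifier with .predict(x); as a simplification, the weighted error here is measured over the full set D rather than Di:

```python
import math
import random

def adaboost(D, k, train_base):
    d = len(D)
    w = [1.0 / d] * d                                  # equal initial weights
    models, alphas = [], []
    for _ in range(k):
        Di = random.choices(D, weights=w, k=d)         # sample according to current weights
        M = train_base(Di)
        err = sum(w[j] for j, (x, y) in enumerate(D) if M.predict(x) != y)
        if err > 0.5 or err == 0:
            continue                                   # discard this round (err == 0 would break the log)
        for j, (x, y) in enumerate(D):
            if M.predict(x) == y:
                w[j] *= err / (1 - err)                # shrink correctly classified tuples
        total = sum(w)
        w = [wj / total for wj in w]                   # normalize so the weights sum to 1
        models.append(M)
        alphas.append(math.log((1 - err) / err))       # classifier vote weight
    return models, alphas

def adaboost_predict(models, alphas, x):
    votes = {}
    for M, a in zip(models, alphas):
        c = M.predict(x)
        votes[c] = votes.get(c, 0.0) + a               # add the model's weight to its class
    return max(votes, key=votes.get)
```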

3. Random Forests

Definition

Random Forest is an ensemble learning technique that builds multiple decision trees
and merges their predictions to produce more accurate and stable results. It is
applicable to both classification and regression tasks.

Intuition Behind Random Forest

• A single decision tree is prone to overfitting and sensitive to data noise.

• Random Forest combines many decision trees, each trained slightly differently,
to reduce variance without increasing bias.

• Like asking multiple doctors (trees) for a diagnosis, then trusting the majority
vote or average prediction.
Key Concepts and Working

1. Ensemble Learning

• Combines the predictions from multiple models.

• Random Forest is based on Bagging (Bootstrap Aggregating).

2. Bagging

• From the original dataset D with d samples, we generate k datasets D_1, D_2, ..., D_k by random sampling with replacement (bootstrap sampling).

• Some original samples may appear multiple times; others may not be selected
at all.

3. Random Feature Selection

• At each node in a decision tree, a random subset of features F is selected from the total feature set.

• The best split is chosen only among this subset, not from all features.

4. Tree Building Process

• Each tree is trained on a different bootstrap sample.

• At each split in the tree:

o Choose F random features.

o Use the CART algorithm (Classification and Regression Trees) to determine the best split among those features.

• Trees are grown to full depth and not pruned.

5. Prediction

• For classification:

o Each tree gives a class vote.

o The class with the majority vote is the final prediction.

• For regression:

o Each tree predicts a numeric value.

o The average of all predictions is taken.
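A condensed sketch of the random-forest recipe (bagging plus a random feature subset at each split), reusing build_tree and classify from the decision-tree sketch earlier; base_selector and F are assumptions of this sketch, not a specific library API:

```python
import random
from collections import Counter

def random_subset_selector(F, base_selector):
    # wrap an attribute-selection heuristic so each split only sees F random attributes
    def select(D, attributes):
        candidates = random.sample(attributes, min(F, len(attributes)))
        return base_selector(D, candidates)
    return select

def random_forest(D, k, F, base_selector):
    # D: list of (attribute_dict, class_label) pairs
    attributes = list(D[0][0].keys())
    forest = []
    for _ in range(k):
        Di = [random.choice(D) for _ in range(len(D))]          # bootstrap sample
        forest.append(build_tree(Di, attributes, random_subset_selector(F, base_selector)))
    return forest

def forest_predict(forest, x):
    # majority vote across trees (for regression, average the tree outputs instead)
    return Counter(classify(tree, x) for tree in forest).most_common(1)[0][0]
```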
