Data Mining Unit 2
1. Definition of Classification
• The goal of classification is to predict a label (class) for a tuple from its input attributes, using a model learned from historical (labeled) data.
• The classifier learns a mapping f(X) = y, where X is an attribute vector and y is its class label.
• The learned model may be represented as:
o Classification rules
o Decision trees
o Mathematical models
• This step is also called supervised learning, as the model is trained with labeled
examples.
• Accuracy Evaluation:
o Use test data (with known labels, but not seen during training).
o Calculate accuracy:
Accuracy = (Correct predictions / Total test tuples) × 100
• The test data must be independent of the training data.
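A quick sketch of this computation in Python; the label lists below are made up purely for illustration:

```python
# Minimal sketch: accuracy on an independent test set.
# y_true and y_pred are hypothetical label lists for illustration.
y_true = ["yes", "no", "yes", "yes", "no", "yes", "no", "no"]
y_pred = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true) * 100
print(f"Accuracy = {accuracy:.1f}%")   # 6 of 8 correct -> 75.0%
```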
5. Key Terminology
Term: Description
• Attribute Vector: the input feature vector X = (x1, x2, ..., xn)
• Test Set: data with known labels (not used during training), used to test the model
• Training Data: labeled tuples (attribute vectors with known class labels) used to build the model
o These clusters could correspond to risk levels (even if we don't know what
“safe” or “risky” means at first).
Note on Overfitting
• A model that fits the training data too closely (including its noise) may generalize poorly to new data; this is why evaluation must use an independent test set.
• Structure:
o Internal nodes: tests on an attribute
o Branches: outcomes of the test
o Leaf nodes: class labels
Advantages
• Requires no domain knowledge or parameter setting, handles multidimensional data, and the resulting tree is easy to interpret.
Input: a data partition D (training tuples and their class labels), an attribute list, and an attribute selection method.
Output: a decision tree.
Steps:
1. Create a node N.
2. If all tuples in D belong to the same class → label N as a leaf node with that class.
3. If the attribute list is empty → label N as a leaf node with the majority class in D.
4. Otherwise, apply the attribute selection method to D to find the "best" splitting criterion.
5. Label N with the splitting criterion.
6. Remove the splitting attribute from the attribute list (if a multiway split is allowed).
7. For each outcome of the splitting criterion, partition D and recursively grow a subtree for each non-empty partition (an empty partition becomes a leaf with the majority class of D).
1. Discrete-valued attribute: one branch is created for each known value of the attribute (or a test on a subset of its values when binary splits are required).
2. Continuous-valued attribute: a binary split is made at a split point (A ≤ split_point vs. A > split_point).
Stopping Conditions
o All tuples in the current partition belong to the same class.
o There are no remaining attributes to split on (use majority voting).
o There are no tuples left for a given branch.
Complexity
o O(n × |D| × log|D|), where n = number of attributes and |D| = number of training tuples
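The recursive procedure in the Steps above can be sketched roughly as follows; the data layout (a list of (attributes, label) pairs) and the pluggable select_attribute function are illustrative choices, not part of any specific library:

```python
from collections import Counter

def majority_class(D):
    """Most common class label in partition D (list of (attrs, label) pairs)."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def build_tree(D, attribute_list, select_attribute):
    """Recursive top-down induction following the steps above.
    select_attribute(D, attribute_list) returns the 'best' splitting attribute."""
    labels = {label for _, label in D}
    if len(labels) == 1:                      # step 2: pure partition -> leaf
        return labels.pop()
    if not attribute_list:                    # step 3: no attributes left -> majority leaf
        return majority_class(D)
    A = select_attribute(D, attribute_list)   # steps 4-5: choose splitting attribute
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]   # step 6 (multiway split)
    values = {attrs[A] for attrs, _ in D}
    for v in values:                          # step 7: one branch per outcome
        Dv = [(attrs, label) for attrs, label in D if attrs[A] == v]
        node["branches"][v] = build_tree(Dv, remaining, select_attribute)
    return node
```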
Notable Algorithms
• ID3, C4.5, and CART (all build the tree top-down in a greedy, recursive manner).
Incremental Learning
• Some decision tree algorithms allow incremental updates, adjusting the tree
with new data instead of rebuilding it from scratch.
Information Gain (IG) is a measure based on information theory, particularly entropy, which
was introduced by Claude Shannon.
• Entropy (Info(D)): It tells us how impure (or uncertain) the dataset is. If a dataset
contains mixed class labels (e.g., both "yes" and "no"), entropy is high. If all tuples
belong to one class, entropy is zero.
• It selects the attribute that best reduces entropy, i.e., increases purity.
• The attribute with the highest Information Gain is chosen for the decision node.
• It aims to reduce uncertainty at each step and build a tree that classifies data with
as few questions as possible.
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
• Youth: 2 yes, 3 no
• Middle-aged: 4 yes, 0 no
• Senior: 3 yes, 2 no
Info_age(D) = (5/14) · 0.971 + (4/14) · 0 + (5/14) · 0.971 = 0.694 bits
Then, Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits. Since age yields the highest information gain among the attributes, it is chosen as the splitting attribute.
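A small Python check of these numbers, using the class counts listed above:

```python
import math

def entropy(counts):
    """Info(D) for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

info_D = entropy([9, 5])                      # 9 "yes", 5 "no"
# Partitions induced by age: youth (2 yes, 3 no), middle-aged (4, 0), senior (3, 2)
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * entropy(p) for p in partitions)
gain_age = info_D - info_age
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
# 0.94 0.694 0.247 (≈ 0.246 when using the rounded values above)
```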
Gain Ratio
Gain Ratio is a metric used in decision tree algorithms (like C4.5) to select the best
attribute for splitting the dataset. It addresses a major limitation of Information Gain,
which tends to favor attributes with many distinct values (like unique IDs), even when
those attributes are poor for generalization.
• For example, splitting on a unique ID attribute produces one pure partition per tuple and therefore maximal Information Gain, but such a split is useless for predicting new data (i.e., it overfits).
Gain Ratio penalizes attributes with many outcomes by normalizing the Information
Gain using a value called Split Information, which measures how broadly and evenly the
data is split.
Formulas
Let’s define:
• D: the dataset
• A: the candidate splitting attribute, with v distinct values
• Dj: the j-th partition of D produced by splitting on A (j = 1, ..., v)
1. Split Information:
SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) · log2(|Dj| / |D|)
This measures the potential information created by the split itself, regardless of class
labels.
2. Gain Ratio:
GainRatio(A) = Gain(A) / SplitInfo_A(D)
This corrects the bias by dividing Information Gain by how “uniformly” the attribute
splits the data.
Note: If SplitInfo is very small (close to zero), GainRatio becomes unstable. To avoid
this, C4.5 ensures that the Information Gain is above average before considering
GainRatio.
Example
Given a split on income resulting in three partitions:
• low: 4 tuples
• medium: 6 tuples
• high: 4 tuples
Gain(income) = 0.029
SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
GainRatio(income) = 0.029 / 1.557 ≈ 0.019
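A short sketch that computes these values directly from the partition sizes and the given Gain:

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) for the partition sizes produced by attribute A."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s > 0)

si_income = split_info([4, 6, 4])        # low, medium, high
gain_income = 0.029                      # value given in the example
gain_ratio = gain_income / si_income
print(round(si_income, 3), round(gain_ratio, 3))   # 1.557 0.019
```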
Gini Index
The Gini Index is used in the CART (Classification and Regression Trees) algorithm to
measure the impurity of a dataset.
Definition
Gini(D) = 1 − Σ_{i=1}^{m} pi², where pi is the probability that a tuple in D belongs to class Ci and m is the number of classes.
• A Gini index of 0 means the dataset is pure (only one class present).
Binary Splits
CART allows only binary splits (dividing into two groups). For each attribute, we:
1. Enumerate the possible binary splits (subsets of values, or split points for continuous attributes).
2. Compute the weighted Gini index of the two resulting partitions for each split.
3. Choose the split that minimizes the Gini index of the resulting partitions.
Discrete Attributes
If attribute A has v values, then there are 2^v − 2 valid binary splits (excluding the empty set and the full set).
Example:
Continuous Attributes
"Is A ≤ split_point?"
Gini_A(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)
Step 1: Compute the Gini index of the full dataset D (9 yes, 5 no):
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
Step 2: Evaluate Attribute income
For the binary split {low, medium} vs {high} (10 and 4 tuples):
Gini_income(D) = (10/14) · 0.42 + (4/14) · 0.5 = 0.443
Other binary splits on income gave higher Gini values, so {low, medium} vs {high} is the best binary split on income.
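A small sketch reproducing these Gini values; the per-partition class counts (7 yes / 3 no and 2 yes / 2 no) are inferred from the 0.42 and 0.5 impurities quoted above:

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) for a list of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                                  # 9 "yes", 5 "no"
# Split {low, medium} -> 10 tuples (7 yes, 3 no); {high} -> 4 tuples (2 yes, 2 no)
gini_split = (10 / 14) * gini([7, 3]) + (4 / 14) * gini([2, 2])
print(round(gini_D, 3), round(gini_split, 3))          # 0.459 0.443
```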
These are used to determine which attribute to split the dataset on at each node of the
tree:
1. Information Gain: selects the attribute that maximizes the reduction in entropy (used in ID3).
2. Gain Ratio:
o Adjusts information gain by incorporating split information (used in C4.5).
3. Gini Index: measures the impurity of a partition; used by CART with binary splits.
4. Other Measures:
o G-statistic
Attribute Selection Measures, Tree Pruning, Scalability, and Visual Mining in Decision Tree Induction
• Information Gain: biased toward attributes with many distinct values.
• Gain Ratio:
o Adjusts for bias of Information Gain but prefers unbalanced splits (e.g.,
one partition is much smaller).
• Gini Index: biased toward multivalued attributes and has difficulty when the number of classes is large; tends to favor equal-sized, pure partitions.
Alternative Measures:
• C-SEP: Performs better than Information Gain & Gini in certain scenarios.
• MDL (Minimum Description Length):
o Chooses trees that minimize encoding cost (of both the tree and its misclassified exceptions).
• Multivariate Splits: splits based on a combination of attributes rather than a single attribute.
Summary:
• Shallower trees can increase leaf count and error – it's a tradeoff.
Tree Pruning
Two Approaches:
• Prepruning: stop growing the tree early based on a threshold (e.g., on gain or Gini improvement); the node becomes a leaf with the majority class.
• Postpruning: grow the full tree, then remove (prune) subtrees and replace them with leaf nodes.
Examples:
• Cost-Complexity Pruning (used in CART):
o Based on:
▪ Number of leaves
▪ Error rate
• MDL-based Pruning: prefers the tree that requires the fewest bits to encode.
Hybrid: prepruning and postpruning can be combined (interleaved).
Scalability
Problem:
• Traditional algorithms (ID3, C4.5, CART) require all training data in memory.
Scalable Algorithms:
RainForest
• Uses AVC-sets (Attribute-Value-Classlabel) at each node.
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
• Constructs multiple trees from subsets of the data that fit in memory → merges them to form the final tree.
• Efficient: requires only two scans of the full dataset D.
• Supports incremental updates (can adjust for new/deleted data without full
reconstruction).
Visual Mining for Decision Tree Induction
Key Concepts:
Interface:
Benefits:
• Allows manual split selection (one or more split points).
Bayes’ Theorem
Let:
• X: a data tuple (the observed "evidence"), described by its attribute values
• H: a hypothesis, e.g., that X belongs to a particular class C
P(H|X): the posterior probability that hypothesis H holds given the evidence X.
Term: Meaning
• P(H): prior probability that hypothesis H holds, before seeing the evidence
• P(X): prior probability of evidence X (how often this data occurs)
• P(X|H): likelihood, the probability of observing X given that H holds
P(H|X) = P(X|H) · P(H) / P(X)
Example:
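Since the worked example itself is not reproduced here, below is a minimal numeric sketch of applying the formula with made-up probabilities:

```python
# Hypothetical numbers: H = "tuple belongs to class 'buys'", X = observed attribute values.
p_H = 0.3          # prior P(H)
p_X_given_H = 0.6  # likelihood P(X|H)
p_X = 0.25         # evidence P(X)

p_H_given_X = p_X_given_H * p_H / p_X
print(round(p_H_given_X, 2))   # 0.72 -> posterior belief in H after seeing X
```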
Rule-Based Classification
Overview
Structure of a Rule
• Antecedent (Precondition): The "IF" part, a logical conjunction (AND) of attribute tests.
• Consequent (Conclusion): The "THEN" part, which contains the predicted class label.
Alternative representation: a rule can also be written as (condition) ⇒ class, with the antecedent given as a conjunction of attribute tests.
Rule Satisfaction
• A rule is satisfied (triggered) if all attribute conditions in its antecedent are true
for a tuple.
Let:
• n_covers = number of tuples covered by rule R
• n_correct = number of tuples correctly classified by R
• |D| = number of tuples in the dataset
Coverage:
coverage(R) = n_covers / |D|
Accuracy:
accuracy(R) = n_correct / n_covers
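A compact sketch of these two measures on a toy dataset; the tuples and the rule are hypothetical:

```python
def rule_stats(D, antecedent, predicted_class):
    """coverage(R) and accuracy(R) for rule R over dataset D (list of (attrs, label))."""
    covered = [(a, y) for a, y in D if antecedent(a)]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == predicted_class)
    coverage = n_covers / len(D)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# Hypothetical data and rule: IF age = youth AND student = yes THEN buys = yes
D = [({"age": "youth", "student": "yes"}, "yes"),
     ({"age": "youth", "student": "yes"}, "no"),
     ({"age": "senior", "student": "no"}, "no"),
     ({"age": "youth", "student": "no"}, "no")]
R1 = lambda a: a["age"] == "youth" and a["student"] == "yes"
print(rule_stats(D, R1, "yes"))   # coverage = 2/4 = 0.5, accuracy = 1/2 = 0.5
```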
Example (R1):
2. If only one rule is triggered, that rule fires and assigns its class label.
When multiple rules are triggered, we must choose which one to fire. Common
strategies include:
1. Size Ordering
• Preference is given to the rule with the most conditions (most specific rule).
• Rule with the largest antecedent size (most attribute tests) is selected.
2. Rule Ordering
o Rule accuracy
o Rule coverage
o Rule size
• Rule-Based Ordering: the rules are organized into one long priority list (a decision list) according to a quality measure such as those above; the first rule that matches fires.
Note: Rules in a decision list imply negation of the earlier rules. That is, each rule
assumes the previous ones didn’t apply.
Decision trees are a widely used classification method due to their simplicity and
effectiveness. However, large trees can be hard to interpret. To improve
interpretability, we can convert a decision tree into a set of IF-THEN classification
rules.
• Rules are often easier for humans to understand than a large decision tree.
Basic Idea:
• One rule is created per path from the root to a leaf node.
• The splitting criteria (conditions) along the path form the antecedent (IF part).
• The class label at the leaf node becomes the consequent (THEN part).
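A rough sketch of this path-to-rule extraction, using a small hand-built nested-dict tree (hypothetical, not the textbook's Figure 8.2):

```python
def tree_to_rules(node, conditions=()):
    """One IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(node, dict):                      # leaf -> emit a rule
        antecedent = " AND ".join(conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules

# Hypothetical tree: split on age, then on student for the "youth" branch.
tree = {"attribute": "age",
        "branches": {"youth": {"attribute": "student",
                               "branches": {"yes": "yes", "no": "no"}},
                     "middle_aged": "yes",
                     "senior": "no"}}
for rule in tree_to_rules(tree):
    print(rule)
```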
From a decision tree (as in Figure 8.2 of the textbook), the following rules are extracted:
R1:
R2:
R3:
R4:
1. Mutually Exclusive: no tuple can trigger two different rules, because each tuple follows exactly one root-to-leaf path.
2. Exhaustive: there is one rule for every possible attribute-value combination, so every tuple is covered by some rule.
3. Unordered Rules:
o Since the rules are mutually exclusive, order does not matter.
Rule Pruning
Although rule extraction is straightforward, the resulting rule set can be large and
contain redundant or irrelevant conditions. Therefore, rule pruning is necessary.
Purpose of Pruning:
• Simplify rules.
• Improve generalization.
• Remove any condition that does not improve the rule’s estimated accuracy.
Consequences of Pruning
After pruning, the rules may no longer be mutually exclusive and exhaustive. To manage conflicts among overlapping rules, C4.5 uses class-based rule ordering:
o A false positive occurs when a rule predicts class C but the actual class is
not C.
• The default class (assigned when no rule fires) is not simply the majority class; instead, it is the class that contains the most uncovered training tuples.
• This strategy ensures better handling of tuples not matched by any rule.
Rule Induction Using a Sequential Covering Algorithm
Overview
• Covering: Each rule ideally covers many tuples of the target class and few or
none of the others.
• After learning a rule: Remove the tuples it covers and repeat the process with
remaining data.
Popular Algorithms
• AQ
• CN2
Input:
• D: a dataset of class-labeled tuples
• The set of all attributes and their possible values
Output:
• A set of IF-THEN rules
Method:
o Repeat: learn one rule for the current class C, remove the tuples covered by that rule from D, and continue until a termination condition is met (e.g., no more training tuples of class C remain).
• Each rule is grown starting from the most general (empty) rule:
IF [ ] THEN class = C
• At each step:
o Consider adding attribute tests (like income = high, age > 25).
o Select the test that gives the best improvement in rule quality.
• Add this test to the rule, and repeat until rule quality is satisfactory.
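A simplified sketch of this greedy general-to-specific search (beam width 1, with rule accuracy as the quality measure); the data format and helper names are illustrative, not a specific algorithm's API:

```python
def rule_accuracy(D, conditions, target_class):
    """Accuracy of 'IF conditions THEN target_class' on the tuples it covers."""
    covered = [y for attrs, y in D if all(attrs[a] == v for a, v in conditions)]
    return (sum(1 for y in covered if y == target_class) / len(covered)) if covered else 0.0

def learn_one_rule(D, attributes, target_class, min_quality=0.9):
    """Greedy general-to-specific search: start from IF [] THEN class = C
    and keep adding the single attribute test that most improves accuracy."""
    conditions = []
    while rule_accuracy(D, conditions, target_class) < min_quality:
        used = dict(conditions)
        candidates = {(a, attrs[a]) for attrs, _ in D for a in attributes if a not in used}
        if not candidates:
            break
        best = max(candidates,
                   key=lambda c: rule_accuracy(D, conditions + [c], target_class))
        if rule_accuracy(D, conditions + [best], target_class) <= rule_accuracy(D, conditions, target_class):
            break                             # no test improves the rule -> stop growing
        conditions.append(best)
    return conditions, target_class
```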
Search Strategy
General-to-Specific Search:
• Start with the most general (empty) rule and repeatedly specialize it by adding attribute tests.
Greedy Search:
• At each step, only the best candidate rule(s) are kept; there is no backtracking.
• Beam search keeps the best k candidates at each step (beam width = k) to reduce the risk of a bad greedy choice.
Imagine a dataset with features like age, income, credit rating, etc.
• Start with the most general rule, IF [ ] THEN class = C, and greedily add tests such as income = high or credit_rating = excellent.
Rule Quality
o Accuracy
o Coverage
o Confidence
o Information Gain
Termination Conditions
Advantages
Limitations
Key Definitions:
(Confusion matrix)
                  Predicted Positive    Predicted Negative
Actual Positive   TP (true positive)    FN (false negative)
Actual Negative   FP (false positive)   TN (true negative)
Evaluation Metrics:
• In imbalanced datasets, accuracy can be high even if the classifier fails on the
minority (important) class.
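A small helper that derives the common measures from these four counts; the example numbers are hypothetical and chosen to show how accuracy can look good on an imbalanced dataset:

```python
def metrics(tp, fn, fp, tn):
    """Standard measures from the confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0   # sensitivity / TPR
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, specificity, f1

# Imbalanced example: 990 negatives, 10 positives; classifier catches only 2 positives.
print(metrics(tp=2, fn=8, fp=5, tn=985))
# accuracy ~ 0.987 even though recall on the minority class is only 0.2
```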
1. Holdout Method
• Procedure: Split the data into two sets: training (usually 2/3) and test (1/3).
• Use: Model is trained on the training set and evaluated on the test set.
2. Random Subsampling
• Repeats the holdout method k times; the overall accuracy is the average over the k iterations.
3. k-Fold Cross-Validation
• Procedure:
o Randomly partition the data into k mutually exclusive folds of approximately equal size.
o For each iteration, one fold is used as the test set and the remaining k - 1 folds for training.
• Advantage: Every tuple is used exactly once for testing and k - 1 times for training.
Special Cases:
• Stratified Cross-Validation:
o Ensures each fold has approximately the same class distribution as the
full dataset.
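A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn is available and using its built-in iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)
print(scores.mean(), scores.std())   # mean accuracy over the 10 folds
```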
4. Bootstrap Method
• Procedure:
o From a dataset of d tuples, sample d tuples with replacement to form the training set; the tuples never selected (about 36.8% on average) form the test set (the .632 bootstrap).
• Use Case: Effective for small datasets, but may be overly optimistic.
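A tiny simulation of why roughly 63.2% of the tuples end up in a bootstrap training sample (assuming NumPy is available):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
sample = rng.choice(d, size=d, replace=True)        # bootstrap training sample (indices)
unique_fraction = len(np.unique(sample)) / d
print(round(unique_fraction, 3))   # ~0.632: roughly 63.2% of tuples appear in the sample
```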
Comparing Two Models with a Statistical Test (t-test)
• Calculate the mean error rate and standard deviation of errors for each model (e.g., over 10 cross-validation folds).
• Compute the paired t-statistic over the k folds:
t = (mean error(M1) - mean error(M2)) / sqrt(var(M1 - M2) / k)
where:
o var(M1 - M2) is the variance of the per-fold differences in error
• Degrees of freedom = k - 1 = 9 (for 10-fold cross-validation)
• Use a t-distribution table to find the critical t-value for a 5% significance level
(look up 0.025 because it's two-tailed).
• If your computed t is larger than the table value, you reject the null hypothesis
(i.e., the difference is real).
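A sketch of this comparison with SciPy's paired t-test; the per-fold error rates below are made up for illustration:

```python
from scipy import stats

# Hypothetical per-fold error rates for two models over the same 10 folds.
err_m1 = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13]
err_m2 = [0.10, 0.13, 0.12, 0.11, 0.12, 0.13, 0.11, 0.12, 0.12, 0.11]

t_stat, p_value = stats.ttest_rel(err_m1, err_m2)    # paired t-test, df = 10 - 1 = 9
print(t_stat, p_value)
# Reject the null hypothesis (no real difference) if p_value < 0.05.
```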
Instead of accuracy, compute average cost per decision by assigning real costs to TP,
FP, TN, and FN.
ROC Curves
• To draw an ROC curve, sort the test tuples by decreasing classifier score and, scanning the list, move:
o Up for each TP
o Right for each FP
Example:
Interpretation:
• The steeper the curve rises and the closer it hugs the top-left corner, the better the model; the area under the curve (AUC) is 1.0 for a perfect model and 0.5 for random guessing.
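A short sketch computing ROC points and the area under the curve with scikit-learn; the labels and scores are hypothetical:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier scores (probability of the positive class).
y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))                 # points of the ROC curve (FPR, TPR)
print(roc_auc_score(y_true, y_score))      # area under the curve (1.0 = perfect, 0.5 = random)
```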
Ensemble Methods
• Mechanism: Each base classifier votes on the class label, and the ensemble
outputs the majority-voted label.
• Diversity: Ensembles are more effective when the base classifiers are diverse
(i.e., they make different kinds of errors).
• Better Than Random Guessing: Each classifier should perform better than
random chance.
b. Boosting
• Trains classifiers sequentially, where each new model focuses on the errors of
the previous ones.
c. Random Forests
• Adds extra randomness by choosing a random subset of features for each split in the tree.
d. Class Imbalance
• Definition: Occurs when one class significantly outnumbers another (e.g., fraud detection, rare disease).
• Solutions: oversampling the minority class, undersampling the majority class, threshold moving, and ensemble techniques.
1. Bagging
Analogy:
• Just like seeking diagnoses from multiple doctors and deciding based on
majority vote, bagging trains multiple models (classifiers) and predicts the
class of an instance based on a majority vote from those models.
How Bagging Works:
1. Input: a dataset D of d tuples and k, the number of models to build.
2. Training Phase: for each i = 1..k, draw a bootstrap sample Di from D (sampling with replacement) and train a classifier Mi on Di.
3. Prediction Phase:
▪ Use majority voting among all models to determine the final class
label.
Key Features:
o Some original tuples may be repeated in a Di, while others may be left
out.
• Equal Weighting: each classifier's vote counts equally (unlike boosting, where votes are weighted by accuracy).
Advantages of Bagging:
• Increased Accuracy: the bagged classifier is usually more accurate than a single classifier built from D, and it is more robust to noisy data (it reduces variance).
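A minimal hand-rolled sketch of this procedure (bootstrap samples plus unweighted majority voting), using scikit-learn decision trees and its iris data purely as stand-ins:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
k, d = 25, len(X)

# Training phase: k bootstrap samples, one tree per sample.
models = []
for _ in range(k):
    idx = rng.choice(d, size=d, replace=True)            # sample D with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Prediction phase: unweighted majority vote over the k models.
def bagged_predict(x):
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

print(bagged_predict(X[0]), y[0])
```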
2. Boosting (AdaBoost)
Intuition:
• Like consulting multiple doctors and weighing their opinions based on past
accuracy.
• Classifiers are trained sequentially, and each new model focuses more on the
difficult cases misclassified by the previous ones.
1. Initialize Weights:
o All training tuples are initially given equal weight: 1/d for d data points.
2. Train and Compute Error:
o Train classifier Mi on the weighted tuples and compute its weighted error:
error(Mi) = Σ_{j=1}^{d} wj · err(Xj), where err(Xj) = 1 if Xj is misclassified by Mi and 0 otherwise.
3. Update Tuple Weights:
o If a tuple Xj was correctly classified by Mi, multiply its weight:
wj ← wj · error(Mi) / (1 − error(Mi))
o Misclassified tuples keep their weights, so they receive relatively more attention in the next round.
o Normalize the weights so that they sum to 1.
4. Classifier Voting Weight:
o The vote of classifier Mi is weighted by:
wi = log((1 − error(Mi)) / error(Mi))
5. Classification:
o For a new instance X: each classifier Mi votes for a class with weight wi, and the class with the largest total weighted vote is returned.
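A rough sketch of these weight updates; as a simplification it trains each round directly with sample weights instead of drawing a weighted bootstrap sample, and the dataset and stump depth are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
d, k = len(X), 10
w = np.full(d, 1 / d)                      # 1. equal initial weights
models, alphas = [], []

for _ in range(k):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)       # train on the weighted tuples
    miss = stump.predict(X) != y           # err(Xj) = 1 if misclassified else 0
    error = np.sum(w[miss])                # weighted error of this round's model
    if error == 0 or error >= 0.5:         # skip degenerate rounds
        continue
    w[~miss] *= error / (1 - error)        # shrink weights of correctly classified tuples
    w /= w.sum()                           # normalize so the weights sum to 1
    models.append(stump)
    alphas.append(np.log((1 - error) / error))   # classifier's voting weight

def boosted_predict(x):
    votes = {}
    for m, a in zip(models, alphas):
        c = m.predict([x])[0]
        votes[c] = votes.get(c, 0) + a     # weighted vote per class
    return max(votes, key=votes.get)

print(boosted_predict(X[0]), y[0])
```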
3. Random Forests
Definition
Random Forest is an ensemble learning technique that builds multiple decision trees
and merges their predictions to produce more accurate and stable results. It is
applicable to both classification and regression tasks.
• Random Forest combines many decision trees, each trained slightly differently,
to reduce variance without increasing bias.
• Like asking multiple doctors (trees) for a diagnosis, then trusting the majority
vote or average prediction.
Key Concepts and Working
1. Ensemble Learning
2. Bagging
• From the original dataset D with d samples, we generate k datasets D1, D2, ..., Dk by random sampling with replacement (bootstrap sampling).
• Some original samples may appear multiple times; others may not be selected
at all.
3. Random Feature Selection
• At each node of each tree, a random subset of the features is considered as candidate splits.
• The best split is chosen only among this subset, not from all features.
5. Prediction
• For classification: each tree votes for a class, and the majority vote is returned.
• For regression: the predictions of the individual trees are averaged.
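A minimal usage sketch with scikit-learn's RandomForestClassifier; the dataset and parameter values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 trees; each split considers a random subset of features (sqrt of the total).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy by majority vote of the trees
```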