Week03 – Classification
Recall – Classification
• Definition: Classification is a supervised learning task where the goal is to predict the category or label of a given input, based on patterns learned from a dataset.
• Spam Detection (Spam vs. Not Spam)
• Fault Detection (Fault vs. Not Fault)
Source: https://developers.google.com/machine-learning/decision-forests/decision-trees
Decision Tree
• Root Node: The topmost node, representing the entire dataset
• Internal Nodes: Nodes that perform tests on features
• Branches: Edges connecting nodes, representing the outcome of a test
• Leaf Nodes: Terminal nodes that provide the final decision or prediction
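To make these terms concrete, a trained decision tree behaves like a nested set of feature tests. Below is a minimal hand-written sketch in Python for a cat-vs-dog classifier; the features (tail length, weight) and the threshold values are hypothetical, chosen only to illustrate root, internal node, branch, and leaf:

def predict(tail_length_cm, weight_kg):
    # Root node: tests the tail-length feature on the whole dataset
    if tail_length_cm < 25:        # branch: "short tail"
        # Internal node: a further test, on the weight feature
        if weight_kg < 8:
            return "cat"           # leaf node: final prediction
        else:
            return "dog"           # leaf node
    else:                          # branch: "long tail"
        return "dog"               # leaf node

print(predict(12, 4.0))  # -> "cat"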
Types of Decision Trees
• In practice, decision trees are usually binary trees, and we will focus on those
Building a Decision Tree
• We split the data based on questions like:
• "Is the pet a cat or dog?"
• "Is the temperature high or low?"
• Goal: We want each split to make the groups as pure as possible (like grouping similar items together)
Let's see whether we gain any information by splitting on the tail-length feature.
Information Gain Example
• Short tail group: H = −(0.8 · log2(0.8) + 0.2 · log2(0.2)) = 0.722
• Long tail group: H = −(0.6 · log2(0.6) + 0.4 · log2(0.4)) = 0.971
Weighted Average Entropy (after split):
H_after split = (5/10) × 0.722 + (5/10) × 0.971 = 0.846
The information gain is the entropy before the split minus H_after split; a positive gain means the split produced purer groups.
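A quick way to verify the arithmetic above is to compute the entropies directly. A minimal Python sketch, assuming the group proportions shown on the slide (0.8/0.2 and 0.6/0.4, with five samples per group):

import math

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)), skipping p = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_short = entropy([0.8, 0.2])                      # ~0.722
h_long  = entropy([0.6, 0.4])                      # ~0.971
h_after = (5 / 10) * h_short + (5 / 10) * h_long   # ~0.846

print(round(h_short, 3), round(h_long, 3), round(h_after, 3))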
[Figure: two candidate trees (Tree 1 and Tree 2) — one splitting on weight ("Weight is < 1350" vs. "Weight is > 1350"), the other on color]
Overfitting
• Overfitting occurs when a model learns the details and noise in
the training data to the extent that it negatively impacts
performance on new, unseen data.
• The model becomes too complex, capturing even irrelevant patterns,
which reduces its ability to generalize to unseen data.
Depth of the tree
• Lower depth (e.g., 2 to 7):
  • Easier to interpret
  • Ideal for scenarios where model transparency is essential
• Larger depth (e.g., > 7):
  • Can capture complex relationships
  • More prone to overfitting; may not generalize well
  • May learn noise in the data
Train-Test Split
• To prevent overfitting, we split the dataset into two parts:
• Training set: Used to train the model (e.g., 70% of the data)
• Test set: Used to evaluate the model's performance on unseen data (e.g., 30% of the data)
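A minimal sketch of this split with scikit-learn, assuming its built-in iris dataset as a stand-in (the 70/30 ratio matches the slide):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data as the test set;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(len(X_train), len(X_test))  # 105 45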
Overfitting in Decision Trees
• The tree is grown too deep, creating many branches that capture noise and irrelevant patterns in the training data
• Solution: Limit the depth of the tree and rely on test-set accuracy (see the sketch below)
Source: https://machinelearningmastery.com/overfitting-machine-learning-models/
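A short sketch of the fix, assuming scikit-learn's DecisionTreeClassifier and the iris dataset again (the depth values are illustrative): compare a depth-limited tree against an unconstrained one; a large gap between training and test accuracy is the usual sign of overfitting.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

for depth in (2, None):  # None = grow the tree until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")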
Pros and cons
Pros
• Easy to Understand and Interpret
• Handles Both Numerical and Categorical Data
• Works well for small datasets
Cons
• Prone to overfitting
• Computationally expensive for large trees