Chapter 7 Supervised Learning: Classification
7.1 INTRODUCTION
Key Takeaways
• Supervised learning relies on labelled training data.
• It has extensive real-world applications, from healthcare to finance.
• Comparing supervised learning with other methods highlights its dependency on labelled
data for classification and prediction tasks.
Key Concepts
1. Classification vs. Regression:
o Classification: Predicts a categorical or nominal output variable. The output is a
category such as ‘red’ or ‘blue’, or ‘malignant tumour’ or ‘benign tumour’.
o Regression: Predicts a numerical output variable such as ‘price’ or ‘weight’ (e.g.,
real estate prices).
2. Definition of Classification:
o A classification problem focuses on predicting a category or class label for an input
based on the training data.
o Examples of categorical outputs:
▪ Tumor type: malignant or benign
▪ Color: red or blue
3. Supervised Learning in Classification:
o The quality of the training data directly affects the model's performance. Poor-
quality training data leads to imprecise predictions.
4. Process of Classification:
o A classifier algorithm generates a classification model from labelled training data.
o This model assigns class labels to new, unseen test data.
Summary
• Classification predicts categorical outcomes based on labelled training data.
• It is a type of supervised learning where the goal is to assign a class label to test data.
• Applications span across healthcare, finance, technology, and natural disaster prediction.
• The target categorical feature is often referred to as a class.
7.4 CLASSIFICATION LEARNING STEPS
Key Steps
1. Problem Identification:
o First step: Identify a well-formed problem with clear goals and long-term impact.
o Example: Predicting if a tumor is malignant or benign.
2. Identification of Required Data:
o Select a data set that accurately represents the problem.
o Example: Use patient data with labels (malignant or benign tumors).
3. Data Pre-processing:
o Clean and transform raw data to ensure quality and relevance.
o Remove irrelevant or unnecessary elements.
o Prepare the data to be fed into the algorithm.
4. Definition of Training Data Set:
o Decide on the input-output pairs for training.
o Example: For handwriting analysis, the training set could include:
▪ A single alphabet
▪ A word
▪ A sentence (multiple words)
5. Algorithm Selection:
o Select the best algorithm for the given problem.
o Factors considered:
▪ Nature of the data
▪ Problem complexity
▪ Desired accuracy
6. Training:
o Run the selected algorithm on the training data.
o Adjust control parameters (if required) using a validation set.
7. Evaluation with Test Data Set:
o Measure the algorithm’s performance on test data.
o If performance is unsatisfactory:
▪ Retrain with adjustments.
▪ Refine the training set or algorithm.
Summary:
• The success of a classification model depends on:
o The problem definition
o The quality of training data
o The appropriate selection of algorithms and parameters.
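To make these steps concrete, here is a minimal sketch of the workflow, assuming scikit-learn is available and using its built-in breast-cancer data set as a stand-in for the malignant/benign tumour example; the classifier and parameter values are illustrative choices, not prescribed by the text.

# Minimal sketch of the classification workflow above (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 2: identification of required data (labelled tumour records).
X, y = load_breast_cancer(return_X_y=True)

# Step 4: definition of the training and test data sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: data pre-processing (here, simple feature scaling).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 5 and 6: algorithm selection and training.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Step 7: evaluation with the test data set.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))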
7.5 COMMON CLASSIFICATION ALGORITHMS
Popular Algorithms
1. k-Nearest Neighbour (kNN):
o Classifies data based on the similarity to its nearest neighbors.
2. Decision Tree:
o Uses a series of logical decisions to classify data.
3. Random Forest:
o Combines multiple decision trees for robust classification.
4. Support Vector Machine (SVM):
o Finds an optimal boundary (hyperplane) to separate classes.
5. Naïve Bayes Classifier:
o Based on Bayes’ theorem, assumes independence between features.
7.5.1 k-Nearest Neighbour (kNN)
Overview
• kNN Algorithm: A simple and powerful classification algorithm based on the principle that
similar instances (neighbors) tend to belong to the same category.
• The class label of an unknown data point is assigned based on its similarity to nearby
training data points.
1. Steps in kNN:
o A test data point (e.g., student "Josh") is compared with the training data using
Euclidean distance.
2. Key Parameters:
o Similarity Measure: Usually Euclidean distance.
o Value of k: Number of nearest neighbors to consider.
▪ k = 1: Closest neighbor determines the label.
▪ k = 3: Majority voting among 3 nearest neighbors determines the label.
3. Challenges:
o Choosing k:
▪ Large k: Risk of majority class bias.
▪ Small k: Risk of noise or outliers influencing the result.
o Three common strategies for choosing k (see the sketch after this list):
1. One common practice is to set k equal to the square root of the number of
training records.
2. An alternative approach is to test several k values on a variety of test data
sets and choose the one that delivers the best performance.
3. Another interesting approach is to choose a larger value of k, but apply a
weighted voting process in which the vote of close neighbours is considered
more influential than the vote of distant neighbours.
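As a rough illustration of these strategies (not part of the original text), the sketch below assumes scikit-learn and reuses the X_train and y_train arrays from the earlier workflow example; it tries the square-root rule, several candidate k values, and distance-weighted voting, keeping whichever combination cross-validates best.

# Sketch of the three strategies for choosing k (scikit-learn assumed;
# X_train and y_train come from the earlier workflow sketch).
import math
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Strategy 1: rule of thumb, k = square root of the number of training records.
k_rule_of_thumb = int(math.sqrt(len(X_train)))

# Strategies 2 and 3: test several k values, with and without distance-weighted
# voting, and keep the best-performing combination under cross-validation.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, k_rule_of_thumb],
    "weights": ["uniform", "distance"],  # "distance" = weighted voting
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)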
The kNN algorithm proceeds as follows (input, process, output):
1. Input:
o Training data set.
o Test data set.
o Value of k (number of neighbors).
2. Process:
o For each test data point:
1. Calculate the distance to all training data points.
2. Identify the k-closest training points.
3. Assign the class label based on:
▪ Majority voting (if k>1).
▪ Single neighbor's label (if k=1).
3. Output:
o Class label for each test data point.
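The input-process-output description above can be expressed as a short from-scratch sketch; NumPy is assumed, and the function name and arguments are illustrative rather than a standard API.

# From-scratch sketch of the kNN process (NumPy assumed).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    predictions = []
    for x in X_test:
        # Step 1: Euclidean distance from the test point to every training point.
        distances = np.linalg.norm(X_train - x, axis=1)
        # Step 2: indices of the k closest training points.
        nearest = np.argsort(distances)[:k]
        # Step 3: majority vote among the k neighbours' labels
        # (with k = 1 this is simply the nearest neighbour's label).
        label = Counter(y_train[nearest]).most_common(1)[0][0]
        predictions.append(label)
    return np.array(predictions)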
Key Takeaways
• kNN is a lazy learning algorithm that makes predictions based on neighbor similarity.
• It is effective for certain applications but can be computationally intensive at
prediction time, since no model is built in advance.
• Proper selection of the value of k and data preprocessing are critical for its success.
7.5.2 Decision Tree
Overview
• A Decision Tree is a widely-used classification algorithm that models decisions and their
possible consequences in a tree-like structure.
• Key features include fast execution, easy interpretation, and suitability for multi-
dimensional analysis with multiple classes.
Key Concepts
o Worked example: a decision tree is built from a labelled data set with the features
CGPA, Communication, Aptitude and Programming Skill to predict whether a candidate
will get a job offer; applying the finished tree to the candidate Chandra predicts
that Chandra will get the job offer.
Level 1:
∴ entropy of the data set before the split, Entropy(S) = 0.99
∴ entropy of the data set after the split, i.e. Entropy(S_CGPA), when the feature ‘CGPA’ is used
for the split = 0.69
∴ entropy of the data set after the split, i.e. Entropy(S_COMM), when the feature ‘Communication’ is
used for the split = 0.63
∴ entropy of the data set after the split, i.e. Entropy(S_APTITUDE), when the feature ‘Aptitude’ is used
for the split = 0.52
∴ entropy of the data set after the split, i.e. Entropy(S_PROG SKILL), when the feature ‘Programming
Skill’ is used for the split = 0.95
Therefore, the information gain from the feature ‘CGPA’ = 0.99 − 0.69 = 0.30, from
‘Communication’ = 0.99 − 0.63 = 0.36, from ‘Aptitude’ = 0.99 − 0.52 = 0.47, and from
‘Programming Skill’ = 0.99 − 0.95 = 0.04.
Since ‘Aptitude’ yields the highest information gain, it becomes the first node (the Level 1
split) of the decision tree.
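The entropy and information-gain figures above come from the textbook's job-offer data set, which is not reproduced here; the following generic sketch (NumPy assumed, names illustrative) shows how such values are computed for any feature.

# Generic sketch of entropy and information gain (NumPy assumed).
import numpy as np

def entropy(labels):
    # Shannon entropy of a 1-D array of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    # Entropy before the split minus the weighted entropy after
    # splitting on the given feature.
    before = entropy(labels)
    after = 0.0
    for value in np.unique(feature_values):
        subset = labels[feature_values == value]
        after += (len(subset) / len(labels)) * entropy(subset)
    return before - after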
Level 1 Split:
o Feature: Aptitude (Highest Information Gain).
o Result: Split into High and Low branches.
o One of the two branches has Entropy = 0 and terminates; the other branch is split
further at Level 2.
Level 2 Split:
o Feature: Communication.
o Result: Split into Good and Bad branches.
o Communication = Good: Entropy = 0; branch terminates.
o For the remaining branch (Communication = Bad), the entropy values are: 0.81 before
the split; 0 when the feature ‘CGPA’ is used for the split; and 0.50 when the feature
‘Programming Skill’ is used for the split.
Level 3 Split:
o Feature: CGPA (splitting on ‘CGPA’ reduces the entropy from 0.81 to 0, the highest
information gain at this level).
o Result: Final classification based on CGPA; every branch now has entropy = 0, so the
tree is complete.
Approaches to Pruning
1. Pre-Pruning:
o Stop tree growth before reaching full depth.
o Criteria:
▪ Maximum number of decision nodes.
▪ Minimum data points in a node for further splits.
o Advantages:
▪ Avoids overfitting early.
▪ Optimizes computational cost.
o Disadvantages:
▪ May skip important information.
▪ Risks missing subtle patterns in data.
2. Post-Pruning:
o Allow the tree to grow fully, then remove unnecessary branches.
o Criteria:
▪ Use error rates at nodes to decide which branches to prune.
o Advantages:
▪ Considers all available information.
▪ Often achieves better classification accuracy.
o Disadvantages:
▪ Higher computational cost.
Key Takeaways
• Pre-Pruning is faster but might miss important data patterns.
• Post-Pruning is more accurate but computationally expensive.
• The choice of pruning technique depends on the data and problem requirements.
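As a rough illustration (not from the original text), both pruning approaches map onto parameters of scikit-learn's DecisionTreeClassifier; the parameter values below are arbitrary, and X_train and y_train are the placeholder arrays from the earlier workflow sketch.

# Sketch of pre- and post-pruning with scikit-learn's DecisionTreeClassifier.
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growth early via a maximum depth and a minimum number of
# data points a node must hold before it may be split further.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=20)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the tree, then cut branches back with cost-complexity
# pruning; a larger ccp_alpha removes more branches.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
post_pruned.fit(X_train, y_train)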
7.5.3 Random Forest
Overview
• Random Forest is an ensemble classifier that combines multiple decision tree classifiers.
• Uses the principle of bagging (bootstrap aggregation) with varying feature sets to train
multiple trees.
• The majority vote among the decision trees determines the final classification.
• The ensemble model often performs better than individual decision trees.
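A minimal sketch of this idea, assuming scikit-learn (parameter values are illustrative, and X_train, y_train, X_test are the placeholder arrays from the earlier workflow sketch):

# Sketch of a Random Forest ensemble (scikit-learn assumed).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the ensemble
    max_features="sqrt",  # each split considers a random subset of features
    bootstrap=True,       # each tree is trained on a bootstrap sample (bagging)
    random_state=42)
forest.fit(X_train, y_train)

# The predicted class is decided by voting (probability averaging) across the trees.
predictions = forest.predict(X_test)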
Key Takeaways
• Random Forest leverages the power of multiple decision trees for superior classification
and regression results.
• It balances performance with robustness, making it suitable for complex real-world
problems.
• Despite its strengths, it requires higher computational resources and lacks the
interpretability of simpler models.
7.5.4 Support Vector Machine (SVM)
Overview of SVM
• Support Vector Machine (SVM) is a model used for classification and regression.
• It works by identifying a hyperplane in an N-dimensional space that separates data
instances into two classes.
• The goal of SVM is to classify data points by finding the optimal hyperplane with the
maximum margin between two classes.
Scenario 1: Segregating the Classes Correctly
• Conclusion:
o Hyperplane A is the correct hyperplane because it divides the data into the
respective classes with no misclassification.
Key Point: The selected hyperplane must correctly segregate the two classes.
Scenario 2: Maximizing the Margin
• Situation:
o Three hyperplanes A, B, and C are considered.
o All hyperplanes separate the classes correctly, but they differ in the margin
(distance) between the hyperplane and the nearest data points of both classes.
• Analysis:
o A has the highest margin compared to B and C.
o A larger margin ensures better generalization and robustness against noise or
outliers.
• Conclusion:
o Hyperplane A is the correct hyperplane because it maximizes the margin, reducing
the likelihood of misclassification for new data.
Key Point: The correct hyperplane maximizes the margin between the two classes.
Scenario 3: Balancing Margin and Classification Accuracy
• Situation:
o Two hyperplanes A and B are considered.
o A has a lower margin but classifies all data points correctly.
o B has a higher margin but introduces misclassification errors.
• Analysis:
o Although B has a larger margin, its misclassification makes it unsuitable.
o A ensures all data points are classified correctly, even with a smaller margin.
• Conclusion:
o Hyperplane A is the correct hyperplane because accuracy in classification is
prioritized over margin size.
Key Point: The correct hyperplane minimizes misclassification while maintaining an acceptable
margin.
Scenario 4: Handling Outliers
• Situation:
o The data includes an outlier (a data point significantly distant from the rest).
o Without accounting for the outlier, the hyperplane may classify the majority of data
correctly.
o Hyperplane A ignores the outlier and maximizes the margin for the remaining
points.
• Analysis:
o A provides the best generalization by ignoring the outlier.
o A hyperplane influenced by the outlier would lead to poor classification
performance for most data.
• Conclusion:
o Hyperplane A is the correct hyperplane because it is robust to outliers and ensures
better classification for the majority of data.
Key Point: The correct hyperplane must ignore outliers to achieve robustness and reliable
generalization.
The separating hyperplane can be written as W · X + b = 0, where W is the weight vector and b is
the bias. Using this equation, the objective is to find a set of values for the vector W (and
the bias b) such that two parallel hyperplanes, represented by the equations
W · X + b = +1 and W · X + b = −1,
can be specified. This is to ensure that all the data instances that belong to one class fall on
or above one hyperplane and all the data instances belonging to the other class fall on or below
the other hyperplane. The distance between these two hyperplanes is the margin, 2/||W||, which
the SVM maximizes.
Non-linearly separable data
When the data cannot be separated by a linear hyperplane, SVM handles it in two ways:
• Using slack variables and a cost function (a soft margin that tolerates a limited amount of
misclassification).
• Using kernel tricks (implicitly mapping the data into a higher-dimensional space where a
linear separator may exist).
A short sketch of both options follows.
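A minimal sketch of both options, assuming scikit-learn (parameter values are illustrative, and X_train, y_train are the placeholder arrays from the earlier workflow sketch):

# Sketch of soft-margin and kernel SVMs (scikit-learn assumed).
from sklearn.svm import SVC

# Slack variables and cost function: C controls how heavily margin violations
# are penalized; a smaller C permits a wider, softer margin.
soft_margin_svm = SVC(kernel="linear", C=0.5)
soft_margin_svm.fit(X_train, y_train)

# Kernel trick: an RBF kernel implicitly maps the data into a higher-dimensional
# space in which a linear separating hyperplane may exist.
kernel_svm = SVC(kernel="rbf", C=1.0, gamma="scale")
kernel_svm.fit(X_train, y_train)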