
Chapter 7 - Supervised Learning: Classification

7.1 INTRODUCTION

Objective of the Chapter


• To provide an understanding of popular classifiers and their working principles.

Overview of the Chapter


1. k-Nearest Neighbour (kNN):
o A simple algorithm used to classify unlabelled data.
o Classification is based on the similarity between unlabelled instances and labelled
instances in the training data.
2. Decision Tree:
o A classifier that operates through a series of logical decisions.
o Its structure resembles a tree with branches representing decision paths.
3. Random Forest (an ensemble method):
o A robust classifier built using a collection of decision trees.
o Each tree contributes to the overall classification, making it more accurate than a
single decision tree.
4. Support Vector Machine (SVM):
o A powerful and widely-used classifier.
o Operates by finding the optimal boundary (hyperplane) to separate classes in the
feature space.

Key Learning Outcomes


By the end of this chapter, you will:
• Understand the working principles of major classifiers.
• Be able to solve classification problems using:
o kNN
o Decision Trees
o Random Forest
o SVM
• Gain insight into the strengths and applications of each classifier.
7.2 EXAMPLE OF SUPERVISED LEARNING

Definition and Concept


• Supervised Learning involves using labelled training data to train a model.
o Training data provides experience or prior knowledge.
o The learning process can be compared to a teacher supervising a student.
▪ Here, the teacher = Training Data.
• Labelled Training Data:
o Contains past information with known class field or labels.
o Enables the machine to learn patterns and make predictions.
• Comparison with Other Learning Types:
o Supervised Learning: Requires labelled data for training.
o Unsupervised Learning: No labelled training data; relies on unlabelled data and focuses on finding hidden patterns or clusters.
o Semi-Supervised Learning: Uses a small amount of labelled data combined with a large amount of unlabelled data.

Real-World Example: Hospital ICU Prediction


• Problem Statement:
o Predicting which general ward patients are likely to need ICU care.
• Training Data:
o Past patient records with labels indicating ICU transfers.
• Application:
o Test results of new patients help classify them as high-risk or low-risk.

Other Examples of Supervised Learning


1. Game Outcome Prediction:
o Predict results of a game using past analysis.
2. Medical Diagnosis:
o Predict whether a tumour is malignant or benign using patient data.
3. Price Prediction:
o Forecast prices in real estate, stock markets, and more.

Did You Know?


• IBM developed ‘Watson for Oncology’, a machine learning-based tool:
o Assists physicians by analyzing millions of data points.
o Provides evidence-backed cancer care.
o Explores various treatment options efficiently.
Source: IBM Watson Health Oncology

Key Takeaways
• Supervised learning relies on labelled training data.
• It has extensive real-world applications, from healthcare to finance.
• Comparing supervised learning with other methods highlights its dependency on labelled
data for classification and prediction tasks.

7.3 CLASSIFICATION MODEL

Key Concepts
1. Classification vs. Regression:
o Classification: When we are trying to predict a categorical or nominal variable, the problem is known as a classification problem. The output variable is a category such as ‘red’ or ‘blue’, or ‘malignant tumour’ or ‘benign tumour’.
o Regression: When we are trying to predict a numerical variable such as ‘price’ or ‘weight’, the problem falls under the category of regression (e.g., real estate prices).
2. Definition of Classification:
o A classification problem focuses on predicting a category or class label for an input
based on the training data.
o Examples of categorical outputs:
▪ Tumor type: malignant or benign
▪ Color: red or blue
3. Supervised Learning in Classification:
o The quality of the training data directly affects the model's performance. Poor-
quality training data leads to imprecise predictions.
4. Process of Classification:
o A classifier algorithm generates a classification model from labelled training data.
o This model assigns class labels to new, unseen test data.

Real-World Applications of Classification


1. Healthcare:
o Identifying malignant vs. benign tumors.
o Predicting the presence of diseases.
2. Banking and Finance:
o Detecting fraudulent transactions.
▪ Past data labelled as fraudulent or legitimate is used for training.
▪ New transactions are categorized as suspicious or usual.
3. Technology:
o Image classification: Recognizing objects in images.
o Handwriting recognition: Converting handwritten text to digital format.
4. Other Applications:
o Predicting game outcomes (win/loss).
o Forecasting natural calamities like earthquakes or floods.

• The chapter’s figure shows how training data is used by a classifier algorithm to create a classification model.
• Example: classifying data into class labels such as ‘Intel’ (from the student data set used later in this chapter).

Did You Know?


1. Machine learning has saved lives by spotting 52% of breast cancer cells at least a year
before diagnosis.
2. The US Postal Service uses machine learning for handwriting recognition.
3. Facebook’s News Feed employs machine learning to personalize content for each user.

Summary
• Classification predicts categorical outcomes based on labelled training data.
• It is a type of supervised learning where the goal is to assign a class label to test data.
• Applications span across healthcare, finance, technology, and natural disaster prediction.
• The target categorical feature is often referred to as a class.
7.4 CLASSIFICATION LEARNING STEPS

Overview of the Process


Supervised learning follows a structured process to develop a classifier that solves a given
problem using labelled data.
The process is depicted in the figure referenced in the chapter and summarized in the steps below.

Key Steps
1. Problem Identification:
o First step: Identify a well-formed problem with clear goals and long-term impact.
o Example: Predicting if a tumor is malignant or benign.
2. Identification of Required Data:
o Select a data set that accurately represents the problem.
o Example: Use patient data with labels (malignant or benign tumors).
3. Data Pre-processing:
o Clean and transform raw data to ensure quality and relevance.
o Remove irrelevant or unnecessary elements.
o Prepare the data to be fed into the algorithm.
4. Definition of Training Data Set:
o Decide on the input-output pairs for training.
o Example: For handwriting analysis, the training set could include:
▪ A single alphabet
▪ A word
▪ A sentence (multiple words)
5. Algorithm Selection:
o Select the best algorithm for the given problem.
o Factors considered:
▪ Nature of the data
▪ Problem complexity
▪ Desired accuracy
6. Training:
o Run the selected algorithm on the training data.
o Adjust control parameters (if required) using a validation set.
7. Evaluation with Test Data Set:
o Measure the algorithm’s performance on test data.
o If performance is unsatisfactory:
▪ Retrain with adjustments.
▪ Refine the training set or algorithm.

Summary:
• The success of a classification model depends on:
o The problem definition
o The quality of training data
o The appropriate selection of algorithms and parameters.
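
The steps above can be sketched end to end in code. The following is a minimal illustration using scikit-learn; the data set (the library’s built-in breast-cancer data, matching the malignant/benign example), the choice of kNN as the algorithm, and all parameter values are assumptions made only for this sketch, not the chapter’s own code.

```python
# Minimal sketch of the classification learning steps using scikit-learn.
# Data set, algorithm choice, and parameter values are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 2: identify the required data (labelled: malignant vs. benign).
X, y = load_breast_cancer(return_X_y=True)

# Step 4: define the training data set (hold out part of the data for testing).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 3: data pre-processing (here, simple feature scaling).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Steps 5-6: select an algorithm and train it.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Step 7: evaluate with the test data set.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```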

7.5 COMMON CLASSIFICATION ALGORITHMS

Popular Algorithms
1. k-Nearest Neighbour (kNN):
o Classifies data based on the similarity to its nearest neighbors.
2. Decision Tree:
o Uses a series of logical decisions to classify data.
3. Random Forest:
o Combines multiple decision trees for robust classification.
4. Support Vector Machine (SVM):
o Finds an optimal boundary (hyperplane) to separate classes.
5. Naïve Bayes Classifier:
o Based on Bayes’ theorem, assumes independence between features.
7.5.1 k-Nearest Neighbour (kNN)

Overview
• kNN Algorithm: A simple and powerful classification algorithm based on the principle that
similar instances (neighbors) tend to belong to the same category.
• The class label of an unknown data point is assigned based on its similarity to nearby
training data points.

7.5.1.1 How kNN Works


1. Data Set Example (15 students from a class are taken as the data set):
o Consider a student dataset with features:
▪ Aptitude (score out of 10)
▪ Communication (score out of 10)
o Classifications:
▪ Leader: Good aptitude and communication.
▪ Speaker: Good communication, poor aptitude.
▪ Intel: Good aptitude, poor communication.

2. Steps in kNN:
o A test data point (e.g., student "Josh") is compared with the training data using
Euclidean distance.

o Class labels of the nearest neighbors determine the classification.


In the kNN algorithm, the class label of the test data elements is decided by the class label of the
training data elements which are neighbouring, i.e. similar in nature.
But there are two challenges:
1. What is the basis of this similarity, i.e. when can we say that two data elements are similar?
2. How many similar elements should be considered for deciding the class label of each test data element?
To answer the first question, though there are many measures of similarity, the most common approach adopted by kNN to measure similarity between two data elements is Euclidean distance. Considering a very simple data set having two features (say f1 and f2), the Euclidean distance between two data elements d1 and d2 can be measured by
dist(d1, d2) = sqrt((f11 − f21)² + (f12 − f22)²)
where f11, f12 are the feature values of d1 and f21, f22 are the feature values of d2.
3. Key Parameters:
o Similarity Measure: Usually Euclidean distance.
o Value of k: Number of nearest neighbors to consider.
▪ k = 1: Closest neighbor determines the label.
▪ k = 3: Majority voting among 3 nearest neighbors determines the label.
4. Challenges:
o Choosing k:
▪ Large k: Risk of majority class bias.
▪ Small k: Risk of noise or outliers influencing the result.
o Three common strategies for choosing k (a short sketch follows this list):
1. One common practice is to set k equal to the square root of the number of
training records.
2. An alternative approach is to test several k values on a variety of test data
sets and choose the one that delivers the best performance.
3. Another interesting approach is to choose a larger value of k, but apply a
weighted voting process in which the vote of close neighbours is considered
more influential than the vote of distant neighbours.
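
The sketch below illustrates these strategies; the data set, the candidate k values, and the use of cross-validation to stand in for “a variety of test data sets” are assumptions made for illustration.

```python
# Sketch of strategies for choosing k: square-root rule, testing several k values,
# and distance-weighted voting. Data set and candidate range are assumptions.
import math
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Strategy 1: k roughly equal to the square root of the number of training records.
k_sqrt = round(math.sqrt(len(X)))
print("square-root rule suggests k ≈", k_sqrt)

# Strategy 2: try several k values and keep the best-performing one.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 11, 15, 21)}
print("best k among candidates:", max(scores, key=scores.get))

# Strategy 3: a larger k combined with distance-weighted voting.
weighted_knn = KNeighborsClassifier(n_neighbors=21, weights="distance")
```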

kNN Algorithm Steps

1. Input:
o Training data set.
o Test data set.
o Value of k (number of neighbors).
2. Process:
o For each test data point:
1. Calculate the distance to all training data points.
2. Identify the k-closest training points.
3. Assign the class label based on:
▪ Majority voting (if k>1).
▪ Single neighbor's label (if k=1).
3. Output:
o Class label for each test data point.
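
A minimal from-scratch sketch of these steps (compute distances, pick the k closest training points, take a majority vote) is shown below; the toy training points and labels are hypothetical, loosely echoing the Leader/Speaker/Intel example.

```python
# Minimal from-scratch kNN following the steps above: compute distances,
# identify the k closest training points, and assign the label by majority vote.
import math
from collections import Counter

def knn_classify(train_X, train_y, test_point, k=3):
    distances = [(math.dist(x, test_point), label)
                 for x, label in zip(train_X, train_y)]
    k_nearest = sorted(distances)[:k]            # k closest training points
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]            # majority vote (k=1 reduces to nearest)

# Hypothetical (aptitude, communication) scores with class labels.
train_X = [(8, 9), (7, 8), (2, 9), (3, 8), (8, 2), (9, 3)]
train_y = ["Leader", "Leader", "Speaker", "Speaker", "Intel", "Intel"]
print(knn_classify(train_X, train_y, (6, 8), k=3))   # expected: "Leader"
```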

7.5.1.3 Why kNN is a Lazy Learner?


• Unlike eager learners, kNN does not create an abstract model during training.
• It stores the entire training data and performs classification on demand by finding nearest
neighbors.

7.5.1.4 Strengths of kNN


1. Simple: Easy to understand and implement.
2. Effective: Works well in applications like recommender systems.
3. Fast Training: Almost no training time required.

7.5.1.5 Weaknesses of kNN


1. Dependence on Training Data:
o No learning occurs; relies entirely on stored training data.
o Poor generalization if training data is not representative.
2. Slow Classification:
o Computationally expensive at runtime due to distance calculations.
3. High Memory Requirements:
o All training data must be stored and accessible.

7.5.1.6 Applications of kNN


1. Recommender Systems:
o Suggests items similar to a user's preferences (e.g., past purchases, browsing
history).
2. Information Retrieval:
o Concept Search: Finds documents/content similar to a given input.

Key Takeaways
• kNN is a lazy learning algorithm that makes predictions based on neighbor similarity.
• It is effective for certain applications but can be computationally intensive due to lack of
pre-model training.
• Proper selection of the value of k and data preprocessing are critical for its success.
7.5.2 Decision Tree

Overview
• A Decision Tree is a widely-used classification algorithm that models decisions and their
possible consequences in a tree-like structure.
• Key features include fast execution, easy interpretation, and suitability for multi-
dimensional analysis with multiple classes.

Key Concepts

1. Structure of a Decision Tree:


o Root Node: Represents the starting point (entire dataset).
o Branch Nodes: Intermediate nodes where data splits based on feature values.
o Leaf Nodes: Terminal nodes that provide the classification outcome.
2. Example:

o Scenario: Deciding whether to "Keep Going" or "Stop" a car.


o Rules:
▪ If the signal is red → Stop.
▪ If there’s no gas → Stop at the next gas station.

7.5.2.1 Building a Decision Tree


1. Recursive Partitioning:
o Splits data into subsets based on feature values, starting at the root node.
o Features are selected based on their ability to predict the target class.
2. Stopping Criteria:
o All or most of the examples at a particular node have the same class.
o All features have been used up in the partitioning.
o The tree has grown to a pre-defined threshold limit.
3. Example: GTS Recruitment:
o Data includes attributes like CGPA, Communication, Aptitude, and Programming Skills.
o Chandra, a student of CEM, wants to find out if he may be offered a job in GTS. His CGPA is quite high. His self-evaluation on the other parameters is as follows: Communication – Bad; Aptitude – High; Programming Skills – Bad (test data).
o Split based on Aptitude first, then Communication, and finally CGPA.

7.5.2.2 Searching a Decision Tree


1. Exhaustive Search:
o Explores all branches of the tree.
o Computationally expensive for large trees.
2. Branch and Bound Search:
o Uses existing best solutions to skip unnecessary paths.
o Faster than exhaustive search.

Key Algorithms for Decision Trees


1. ID3 (Iterative Dichotomiser 3):
o Uses Entropy and Information Gain to decide splits.
2. CART (Classification and Regression Tree):
o Uses the Gini index for splits.
3. CHAID (Chi-square Automatic Interaction Detector):
o Uses Chi-square test to select features.

Worked Example: GTS Recruitment (building the tree with information gain)

Level 1:
• Entropy of the data set before the split, Entropy(S) = 0.99.
• Entropy of the data set after the split when the feature ‘CGPA’ is used, Entropy(S_CGPA) = 0.69.
• Entropy after the split when ‘Communication’ is used, Entropy(S_COMM) = 0.63.
• Entropy after the split when ‘Aptitude’ is used, Entropy(S_APTITUDE) = 0.52.
• Entropy after the split when ‘Programming Skill’ is used, Entropy(S_PROG_SKILL) = 0.95.
Therefore, the information gain from ‘CGPA’ = 0.99 − 0.69 = 0.30, from ‘Communication’ = 0.99 − 0.63 = 0.36, from ‘Aptitude’ = 0.99 − 0.52 = 0.47, and from ‘Programming Skill’ = 0.99 − 0.95 = 0.04.
‘Aptitude’ gives the highest information gain, so it becomes the first (root) node of the decision tree.

Level 1 Split:
o Feature: Aptitude (highest information gain).
o Result: split into High and Low branches.
o The branch in which all examples share the same class has entropy 0 and terminates.

Level 2:
o Feature: Communication.
o Result: split into Good and Bad branches.
o Communication = Good: entropy 0; the branch terminates.
o As shown in the chapter’s figure, for the remaining branch the entropy is 0.81 before the split, 0 when ‘CGPA’ is used for the split, and 0.50 when ‘Programming Skill’ is used.

Level 3:
o Feature: CGPA (information gain 0.81 − 0 = 0.81, versus 0.81 − 0.50 = 0.31 for ‘Programming Skill’).
o Result: final classification based on CGPA.
Following the completed tree with Chandra’s attributes (Aptitude – High, Communication – Bad, CGPA – High), the prediction is that Chandra will get the job offer.
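
The small sketch below shows how such entropy and information-gain values are computed in general; the class counts passed in the example call are hypothetical and chosen only for illustration, not the chapter’s GTS data.

```python
# Sketch of entropy and information gain for a candidate split.
# The class distributions used below are hypothetical, not the chapter's GTS data.
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, child_count_lists):
    total = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(child) / total) * entropy(child) for child in child_count_lists)
    return entropy(parent_counts) - weighted_child_entropy

# Parent node: 10 "Yes" and 8 "No" examples; a split produces two child nodes.
print(round(entropy([10, 8]), 2))                           # entropy before the split
print(round(information_gain([10, 8], [[9, 1], [1, 7]]), 2))  # gain from this split
```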

7.5.2.5 Algorithm for Decision Tree
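
A minimal ID3-style sketch of the decision tree algorithm is given below, using recursive partitioning with the stopping criteria from Section 7.5.2.1; the function names, data layout, and structure are illustrative assumptions, not the book’s exact pseudocode.

```python
# ID3-style recursive partitioning sketch (illustrative, not the book's pseudocode).
# Rows are dicts of categorical features; labels are class values.
import math
from collections import Counter

def _entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, features, max_depth=5):
    # Stopping criteria: pure node, no features left, or depth threshold reached.
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]          # leaf: majority class

    def weighted_entropy_after_split(feature):
        # Lower weighted entropy after the split means higher information gain,
        # because the parent entropy is the same for every candidate feature.
        parts = {}
        for row, label in zip(rows, labels):
            parts.setdefault(row[feature], []).append(label)
        return sum(len(p) / len(labels) * _entropy(p) for p in parts.values())

    best = min(features, key=weighted_entropy_after_split)   # split with highest gain
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        keep = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*keep)
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                       [f for f in features if f != best],
                                       max_depth - 1)
    return tree

# Example call with hypothetical data:
# tree = build_tree(rows, labels, features=["CGPA", "Communication", "Aptitude"])
```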

7.5.2.6 Avoiding Overfitting in Decision Trees – Pruning


The decision tree algorithm, unless a stopping criterion is applied, may keep growing indefinitely – splitting on every feature and dividing into smaller and smaller partitions until the training data is perfectly classified. This results in the overfitting problem.

Why Pruning is Necessary


• Overfitting occurs when a decision tree:
o Splits excessively to classify every detail of the training data.
o Loses its ability to generalize for unseen data.
• Solution: Pruning reduces tree complexity and improves generalization.

Approaches to Pruning
1. Pre-Pruning:
o Stop tree growth before reaching full depth.
o Criteria:
▪ Maximum number of decision nodes.
▪ Minimum data points in a node for further splits.
o Advantages:
▪ Avoids overfitting early.
▪ Optimizes computational cost.
o Disadvantages:
▪ May skip important information.
▪ Risks missing subtle patterns in data.
2. Post-Pruning:
o Allow the tree to grow fully, then remove unnecessary branches.
o Criteria:
▪ Use error rates at nodes to decide which branches to prune.
o Advantages:
▪ Considers all available information.
▪ Often achieves better classification accuracy.
o Disadvantages:
▪ Higher computational cost.

Key Takeaways
• Pre-Pruning is faster but might miss important data patterns.
• Post-Pruning is more accurate but computationally expensive.
• The choice of pruning technique depends on the data and problem requirements.
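
As a rough illustration of both approaches, the sketch below uses scikit-learn’s DecisionTreeClassifier: pre-pruning via max_depth and min_samples_split, and post-pruning via cost-complexity pruning (ccp_alpha). The data set and parameter values are assumptions for illustration only.

```python
# Sketch of pre-pruning and post-pruning with scikit-learn decision trees.
# Parameter values are illustrative assumptions, not recommendations.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with a maximum depth and a minimum node size.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=20, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune using cost-complexity pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned test accuracy :", pre_pruned.score(X_test, y_test))
print("post-pruned test accuracy:", post_pruned.score(X_test, y_test))
```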

Strengths of Decision Trees


1. Interpretability:
o Easy to visualize and understand.
2. Versatility:
o Can handle both categorical and numerical data.
3. Efficiency:
o Fast to train and evaluate.

Weaknesses of Decision Trees


1. Overfitting:
o May create overly complex trees for noisy data.
2. Instability:
o Small changes in data can significantly alter the tree structure.
Applications of Decision Trees
1. Healthcare:
o Disease diagnosis and prediction.
2. Recruitment:
o Predicting job offers based on candidate attributes.
3. Finance:
o Fraud detection in transactions.

7.5.3 Random Forest Model

Overview
• Random Forest is an ensemble classifier that combines multiple decision tree classifiers.
• Uses the principle of bagging (bootstrap aggregation) with varying feature sets to train
multiple trees.
• The majority vote among the decision trees determines the final classification.
• The ensemble model often performs better than individual decision trees.

7.5.3.1 How Random Forest Works

1. Input Data Handling:
o Given N features in the input data, select a random subset of m features (m < N).
o Randomly select observations or data instances for training.
2. Tree Construction:
o Use the best-split principle on the m features to calculate the number of nodes d.
o Continue splitting nodes until the tree is grown to its maximum depth.
3. Training Multiple Trees:
o Generate different bootstrap samples (with replacement) to train each tree.
o Repeat steps (1)–(3) to train n decision trees.
4. Final Classification:
o Perform a majority vote across all trees to determine the class label.

7.5.3.2 Out-of-Bag (OOB) Error


• Bootstrap Sampling:
o Each tree is trained using a bootstrap sample.
o Data not included in the bootstrap sample (out-of-bag data) is used to evaluate the
model's performance.
• OOB Error Rate:
o Predictions for out-of-bag samples are aggregated, and the error rate is calculated.
o Reflects the model’s generalization ability.
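
A minimal sketch of these ideas with scikit-learn’s RandomForestClassifier follows, using bootstrap samples and the out-of-bag score; the data set and parameter values are illustrative assumptions.

```python
# Sketch of a random forest with bootstrap sampling and the out-of-bag (OOB)
# error estimate. Parameter values are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,        # number of decision trees (n)
    max_features="sqrt",     # random subset of m features considered at each split
    bootstrap=True,          # each tree trained on a bootstrap sample
    oob_score=True,          # evaluate on out-of-bag data
    random_state=0)
forest.fit(X, y)

print("OOB accuracy (1 - OOB error):", forest.oob_score_)
print("importance of first 3 features:", forest.feature_importances_[:3])
```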

Strengths of Random Forest


1. Efficiency:
o Performs well on large, complex datasets.
2. Handling Missing Data:
o Robust in estimating missing data without losing precision.
3. Balances Errors:
o Effective with unbalanced datasets.
4. Feature Importance:
o Provides estimates of feature importance in classification.
5. Unbiased Error Estimation:
o Generates an internal estimate of generalization error during training.
6. Versatility:
o Applicable to both classification and regression problems.
7. Reusability:
o Saved forests can be reused for future predictions.

Weaknesses of Random Forest


1. Interpretability:
o Not as straightforward as a single decision tree.
2. Computational Cost:
o Requires significant computational resources due to multiple tree construction.

Applications of Random Forest


1. Classification Problems:
o Suitable for a wide range of problems, including image recognition, text
classification, and medical diagnosis.
2. Regression Analysis:
o Used for predicting continuous variables with high accuracy.

Key Takeaways
• Random Forest leverages the power of multiple decision trees for superior classification
and regression results.
• It balances performance with robustness, making it suitable for complex real-world
problems.
• Despite its strengths, it requires higher computational resources and lacks the
interpretability of simpler models.

7.5.4 Support Vector Machines (SVM)

Overview of SVM
• Support Vector Machine (SVM) is a model used for classification and regression.
• It works by identifying a hyperplane in an N-dimensional space that separates data
instances into two classes.
• The goal of SVM is to classify data points by finding the optimal hyperplane with the
maximum margin between two classes.

7.5.4.1 Classification Using Hyperplanes


1. Hyperplane:
o A hyperplane is a decision boundary that separates data points into two classes in
an N-dimensional space.
o For example:
▪ In 2D space, the hyperplane is a line.
▪ In 3D space, the hyperplane is a plane.
o For higher dimensions, it becomes difficult to visualize but can be mathematically
represented.
2. Margin:
o The margin is the distance between the hyperplane and the closest data points from
either class.
o SVM aims to maximize the margin to ensure better classification and
generalization.
3. Support Vectors:
o These are the critical data points closest to the hyperplane.
o They determine the position and orientation of the hyperplane.
o Removing a support vector would change the hyperplane's position.

7.5.4.2 Identifying the Correct Hyperplane in SVM


In Support Vector Machines (SVM), identifying the correct hyperplane is critical for accurately
classifying data. While there may be multiple hyperplanes that separate the classes, SVM selects
the one that optimizes classification by following specific rules.
Scenarios for Identifying the Correct Hyperplane
Scenarios:
• Scenario 1: The hyperplane must separate the data instances accurately.
• Scenario 2: The hyperplane should maximize the margin to reduce the chances of
misclassification.
• Scenario 3: When prioritizing between margin size and misclassification, accuracy takes
precedence. A hyperplane with no misclassification is preferred.
• Scenario 4: SVM can handle outliers by ignoring them when determining the hyperplane.

Scenario 1: Multiple Hyperplanes


• Situation:
o Three hyperplanes A, B, and C are considered.
o Each hyperplane separates the data points into two classes
(e.g., triangles and circles).
• Analysis:
o Among the three hyperplanes, A separates the two classes accurately and distinctly.
• Conclusion:
o Hyperplane A is the correct hyperplane because it divides the data into the respective classes with no misclassification.
Key Point: The selected hyperplane must correctly segregate the two classes.

Scenario 2: Maximizing the Margin

• Situation:
o Three hyperplanes A, B, and C are considered.
o All hyperplanes separate the classes correctly, but they differ in the margin
(distance) between the hyperplane and the nearest data points of both classes.
• Analysis:
o A has the highest margin compared to B and C.
o A larger margin ensures better generalization and robustness against noise or
outliers.
• Conclusion:
o Hyperplane A is the correct hyperplane because it maximizes the margin, reducing
the likelihood of misclassification for new data.
Key Point: The correct hyperplane maximizes the margin between the two classes.
Scenario 3: Balancing Margin and Classification Accuracy
• Situation:
o Two hyperplanes A and B are considered.
o A has a lower margin but classifies all data points correctly.
o B has a higher margin but introduces misclassification errors.
• Analysis:
o Although B has a larger margin, its misclassification makes it unsuitable.
o A ensures all data points are classified correctly, even with a smaller margin.
• Conclusion:
o Hyperplane A is the correct hyperplane because accuracy in classification is prioritized over margin size.
Key Point: The correct hyperplane minimizes misclassification while maintaining an acceptable
margin.

Scenario 4: Handling Outliers

• Situation:
o The data includes an outlier (a data point significantly distant from the rest).
o Without accounting for the outlier, the hyperplane may classify the majority of data
correctly.
o Hyperplane A ignores the outlier and maximizes the margin for the remaining
points.
• Analysis:
o A provides the best generalization by ignoring the outlier.
o A hyperplane influenced by the outlier would lead to poor classification
performance for most data.
• Conclusion:
o Hyperplane A is the correct hyperplane because it is robust to outliers and ensures
better classification for the majority of data.
Key Point: The correct hyperplane must ignore outliers to achieve robustness and reliable
generalization.

Summary of Rules for Identifying the Correct Hyperplane


1. The hyperplane must separate data points of different classes accurately.
2. It should maximize the margin between the nearest data points of both classes and the
hyperplane.
3. In cases of conflict, the hyperplane that minimizes misclassification is preferred.
4. The hyperplane must be robust to outliers, ignoring them when necessary.
These rules ensure that the selected hyperplane generalizes well to unseen data, making SVM a
reliable and effective classification model.

7.5.4.3 Maximum Margin Hyperplane (MMH)

The Maximum Margin Hyperplane (MMH) is the hyperplane that maximizes the margin between the two classes of data points. The margin is defined as the distance between the hyperplane and the nearest data points of each class (the support vectors). Finding the MMH ensures better generalization and fewer misclassification issues for unseen data.

How to Find the MMH


1. Linearly Separable Data:
I. For linearly separable data:
▪ Draw outer boundaries (convex hulls) for the data instances of each class.
▪ Find the shortest line connecting the convex hulls.
▪ The MMH is drawn as the perpendicular bisector of this shortest line.
II. This process ensures that the hyperplane maximizes the distance from the nearest
data points of each class (support vectors).

The separating hyperplane can be written as w · x + b = 0, where w is the weight (normal) vector and b is the bias. Using this equation, the objective is to find a set of values for the vector w (and for b) such that two parallel hyperplanes, w · x + b = +1 and w · x + b = −1, can be specified. This is to ensure that all the data instances that belong to one class fall on or above one of these hyperplanes and all the data instances belonging to the other class fall on or below the other.

According to vector geometry, the distance between these two planes is 2/||w||, so maximizing the margin amounts to minimizing ||w||.
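
In the standard formulation (conventional notation assumed here, where y_i ∈ {−1, +1} is the class label of instance x_i), finding the MMH can be posed as the optimization problem:

```latex
% Hard-margin SVM as an optimization problem (conventional notation, assumed).
% Maximizing the margin 2/||w|| is equivalent to minimizing ||w||.
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad\text{subject to}\quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b)\ \ge\ 1,\qquad i = 1,\dots,N.
```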

Support Vectors and MMH


• Support Vectors:
o The data points closest to the MMH from each class.
o These points influence the position and orientation of the hyperplane.
• Role of Support Vectors:
o Removing or altering support vectors changes the position of the MMH.
2. Non-Linearly Separable Data:
Non-linearly separable data can be handled in two ways: (i) using slack variables and a cost function, or (ii) using kernel tricks.

I. Using slack variables and a cost function:
o The MMH still maximizes the margin between classes for better generalization and is determined by the support vectors, the data points closest to the hyperplane.
o For non-linearly separable data, slack variables allow some data points to fall on the wrong side of the margin, and an associated cost value penalizes such violations so that an optimal (soft-margin) hyperplane can still be found.
II. Using the kernel trick:
SVM has a technique called the kernel trick to deal with non-linearly separable data. As shown in Figure 7.23, kernels are functions that transform a lower-dimensional input space into a higher-dimensional space. In the process, linearly non-separable data becomes linearly separable. This method can be used when data instances of the classes are close to each other.
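
A short sketch of both ideas using scikit-learn’s SVC follows: the cost parameter C controls the soft margin (slack), and switching the kernel from 'linear' to 'rbf' applies the kernel trick. The synthetic data set and parameter values are illustrative assumptions.

```python
# Sketch of SVM classification with a soft margin (C) and the kernel trick.
# The synthetic data set and parameter values are illustrative assumptions.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable data: one class forms a circle inside the other.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear kernel with a soft margin: C controls the cost of slack (misclassification).
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Kernel trick: the RBF kernel implicitly maps the data to a higher-dimensional
# space where the classes become linearly separable.
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy   :", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy      :", rbf_svm.score(X_test, y_test))
print("number of support vectors:", len(rbf_svm.support_vectors_))
```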
7.5.4.5 Strengths of SVM
1. Versatility:
o Can be used for both classification and regression.
2. Robustness:
o Handles noise and outliers effectively.
3. Accurate Predictions:
o Delivers highly reliable results for classification tasks.

7.5.4.6 Weaknesses of SVM


1. Binary Classification Only:
o Designed for problems with only two classes.
2. Complexity:
o Difficult to interpret for high-dimensional data.
3. Performance:
o Computationally intensive for large datasets with many features or instances.
4. Memory Requirements:
o Requires significant memory to store and process data.

7.5.4.7 Applications of SVM


1. Bioinformatics:
o Detecting cancer and genetic disorders through binary classification.
2. Image Processing:
o Classifying images into face and non-face categories.
3. General Binary Classification:
o Suitable for problems with two distinct classes.

Chapter 7: Answers to the multiple-choice questions


1. Predicting whether a tumor is malignant or benign is an example of?
o 3. Supervised Classification Problem
2. Price prediction in the domain of real estate is an example of?
o 2. Supervised Regression Problem
3. The problems of predicting whether a tumor is malignant or benign and price prediction in
real estate are the same in nature.
o 2. FALSE
4. Supervised machine learning is as good as the data used to train it.
o 1. TRUE
5. Which type of machine learning predicts a categorical target feature based on training
data?
o 3. Supervised Classification
6. In classification, the target categorical feature is known as?
o 4. Class
7. The first step in the supervised learning model is?
o 1. Problem Identification
8. The step of cleaning/transforming the data set in supervised learning is called?
o 3. Data Pre-processing
9. Transformations applied to identified data before feeding into the algorithm refer to?
o 3. Data Pre-processing
10. The step determining the type of training data set is called?
o 4. Definition of Training Data Set
11. The entire design of the supervised learning program is done in?
o 1. Problem Identification
12. Training data run on the algorithm is referred to as?
o 4. Learned Function
13. SVM is an example of?
o 1. Linear Classifier and Maximum Margin Classifier
14. In SVM, a hard margin means?
o 1. SVM allows very low error in classification
15. In SVM, support vectors are?
o 4. Support Vectors
16. A line that separates and classifies data in SVM is called?
o 1. Hyperplane
17. The distance between the hyperplane and data points is called?
o 2. Margins
18. Which is true about SVM?
o 3. It is accurate
19. Which is true about SVM?
o 3. SVM does not perform well when we have a large dataset
20. What is the meaning of a hard margin in SVM?
o 1. SVM allows very low error in classification
21. What size of training data sets is not best suited for SVM?
o 1. Large data sets
22. Support Vectors are near the hyperplane.
o 1. True
23. In SVM, functions that transform lower-dimensional input space to higher-dimensional
space are called?
o 1. Kernels
24. Which is true about the kNN algorithm?
o 3. It can be used for both classification and regression
25. The Euclidean distance between A(4,3) and B(2,3) is?
o 2. 2
26. The Manhattan distance between A(8,3) and B(4,3) is?
o 3. 4
27. When there is noise in data in kNN, what should you do?
o 1. Increase the value of k
28. What is the relationship between training times for 1-NN, 2-NN, and 3-NN?
o 3. 1-NN ~ 2-NN ~ 3-NN
29. Which algorithm is an example of an ensemble learning algorithm?
o 1. Random Forest
30. Which is not an inductive bias in a decision tree?
o 1. It prefers longer trees over shorter trees
