Week 8. Supervised Learning. Classification

The document provides an overview of supervised learning, focusing on classification techniques such as Logistic Regression, K-Nearest Neighbors (KNN), Naïve Bayes, Support Vector Machines (SVM), Decision Trees, and Random Forests. It explains the differences between supervised and unsupervised learning, outlines the types of classification, and introduces key concepts such as the confusion matrix and related performance metrics. It also details each algorithm's methodology and its applications in real-world scenarios.

Supervised Learning:

Classification

Instructor: Sabina Mammadova


Agenda
• Supervised Learning: Classification

• Types of Classification

• Logistic Regression

• K-Nearest Neighbors (KNN)

• Naïve Bayes

• Support Vector Machine (SVM)

• Decision Tree and Random Forest

• Confusion matrix, ROC and AUC curve


Machine Learning Algorithms

• Supervised Learning
• Regression: Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression
• Classification: Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Tree, Random Forest, Naïve Bayes

• Unsupervised Learning
• Clustering: K-Means, Hierarchical, DBSCAN
• Association Analysis: Apriori, FP-Growth
• Dimensionality Reduction: PCA, LDA

• Reinforcement Learning: Q-Learning, Deep Q-Networks…
What is Machine Learning (ML) and Supervised Learning?

• Machine learning is the process of extracting knowledge


from data, combining elements of statistics, AI, and
computer science. It is widely used in daily life, from
personalized recommendations (Netflix, Amazon) to
scientific research (DNA analysis, cancer treatment).
• Supervised learning is a type of machine learning where
an algorithm is trained on labeled data. This means the
dataset contains input-output pairs, where the model learns
the relationship between inputs (features) and the correct
outputs (labels). The goal is for the model to generalize this
relationship so it can make accurate predictions on new,
unseen data.
Difference between Supervised and Unsupervised Learning

Supervised Learning:
• Input data is labelled
• There is a training phase
• Data is modelled based on the training dataset
• Known number of classes (for classification)

Unsupervised Learning:
• Input data is unlabeled
• There is no training phase
• Uses properties of the given data for clustering
• Unknown number of classes
Regression vs Classification
• Classification and Regression are both types of supervised machine
learning tasks, but they serve different purposes. Classification is used
when the goal is to predict discrete labels or categories. For example, you
might want to predict whether an email is "spam" or "not spam." The key
here is that the output is categorical; it's about classifying the input into one
of several predefined classes. Common examples of classification tasks
include disease detection (predicting whether someone is healthy or sick),
or image recognition (e.g., classifying an image as a cat, dog, or bird). Some
common algorithms used for classification are Logistic Regression, Decision
Trees, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM).
• On the other hand, Regression is used when the goal is to predict a
continuous value. In this case, you're predicting a real number rather than a
category. For example, you might want to predict the price of a house based
on features like its size, location, and number of bedrooms. The output here
is a continuous value, such as a price or temperature. Other examples of
regression tasks include predicting stock prices or forecasting sales figures.
Algorithms typically used for regression include Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree Regression, and Random Forest Regression.
Types of Classification in Machine Learning
Classification can be categorized based on the number of possible output
labels and how they are assigned to instances. The three main types are:
• Binary Classification: When there are only two possible classes
(labels), it's called binary classification. The model predicts whether an
instance belongs to Class A or Class B. Example: Spam detection:
Classify emails as Spam (1) or Not Spam (0).
• Multi-Class Classification: When there are more than two possible
classes, but each instance belongs to only one class. The model predicts
one out of multiple possible categories. Example: Handwritten digit
recognition: Classify images into digits 0-9.
• Multi-Label Classification: Unlike multi-class classification, where an
instance belongs to only one class, in multi-label classification, an
instance can belong to multiple categories at the same time. Instead of
predicting one label, the model predicts multiple labels. Example: Movie
genre classification: A movie can belong to multiple genres, e.g., Action,
Comedy, and Sci-Fi.
How to Choose the Right Classification Type?

• If the problem has only two outcomes, use Binary Classification.


• If the problem has more than two outcomes, and only one label is assigned per
instance, use Multi-Class Classification.
• If the problem requires assigning multiple labels per instance, use Multi-Label
Classification.

• Binary Classification: 2 classes, 1 label per instance. Example: Spam Detection (Spam/Not Spam).
• Multi-Class Classification: 3 or more classes, 1 label per instance. Example: Dog/Cat/Rabbit Classification.
• Multi-Label Classification: 3 or more classes, multiple labels per instance. Example: Movie Genre Prediction (Action + Comedy + Sci-Fi).
Logistic Regression
Logistic Regression
• Logistic Regression is a widely used supervised learning
algorithm for classification tasks. Despite its name, it is not used
for regression but for predicting categorical outcomes. It is
especially useful for binary classification problems, where the
target variable has two possible classes, such as "Yes/No,"
"Spam/Not Spam," or "Disease/No Disease."
• Despite its name, it is used for classification, not regression.
• Applications:
• Spam detection (Spam/Not Spam)
• Disease prediction (Diabetic/Non-Diabetic)
• Customer churn prediction (Will Leave/Will Stay)
Logistic Regression
• Logistic regression is used for binary classification. It applies the sigmoid function to the independent variables and produces a probability value between 0 and 1.

• For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise it belongs to Class 0. It is called "regression" because it is an extension of linear regression, but it is mainly used for classification problems.
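As a minimal sketch of this thresholding in code, assuming a toy one-feature dataset (the data and variable names below are illustrative, not from the slides):

```python
# Minimal sketch: logistic regression with a 0.5 decision threshold (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy dataset: one feature, two classes (0 and 1).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each input.
p = model.predict_proba([[3.5]])[0]
label = 1 if p[1] > 0.5 else 0  # the 0.5 threshold rule described above
print(p, label)
```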
Sigmoid Function

• The sigmoid function is a mathematical function used to map predicted values to probabilities.
• It maps any real value to a value between 0 and 1. Since the output of logistic regression must stay within this range, the function forms an "S"-shaped curve.
• This S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use a threshold value to decide between 0 and 1: values above the threshold are mapped to 1, and values below the threshold are mapped to 0.
Sigmoid Function
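The sigmoid (logistic) function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where, in logistic regression, $z = w^\top x + b$ is the linear combination of the input features; $\sigma(z)$ always lies strictly between 0 and 1.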
K-Nearest Neighbors (KNN)
K-Nearest Neighbor (KNN)

• In its simplest version, the K-NN algorithm considers exactly one nearest neighbor: the training data point closest to the point we want to make a prediction for.
• K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning
algorithm used for classification and regression.
• It classifies new data points based on the majority vote of the k-nearest
neighbors.
• If most of your nearest neighbors belong to Class A, the new point is
classified as Class A.
• Real-World Applications:
• Medical Diagnosis (Predicting diseases based on patient data).
• Recommendation Systems (Suggesting products based on user
similarity).
• Pattern Recognition (Handwritten digit classification).
How Does KNN Work?

• Choose the value of K (number of neighbors).


• Measure the distance between the new data point and all training
points (e.g., Euclidean distance).
• Select the K nearest neighbors.
• Perform majority voting (for classification) or compute the average
(for regression).
• Assign the new data point to the most common class among the K
neighbors.
• If K is too small → Model becomes sensitive to noise (overfitting).
• If K is too large → Model might lose important details
(underfitting).
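A minimal from-scratch sketch of the steps above, assuming Euclidean distance and a small made-up 2D dataset:

```python
# Minimal KNN classification sketch: distances, K nearest, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: measure Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: select the indices of the K nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote among the K neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> "A"
```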
How Does KNN Work?
(Figure: a new point is assigned the majority class of its K = 3 nearest neighbors)
Distance Metrics Used in KNN Algorithm

• Euclidean distance is defined as the


straight-line distance between two points
in a plane or space. You can think of it
like the shortest path you would walk if
you were to go directly from one point to
another.
• Manhattan Distance: This is the total
distance you would travel if you could
only move along horizontal and vertical
lines (like a grid or city streets). It’s also
called “taxicab distance” because a taxi
can only drive along the grid-like streets
of a city.
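In formulas, for two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$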
Support Vector Machine
What is Support Vector Machine?

• Support Vector Machine (SVM) is a powerful supervised learning


algorithm used for classification and regression tasks.
• It works by finding the optimal hyperplane that maximizes the margin
between different classes in a dataset.
• SVM uses support vectors, which are the closest data points to the
hyperplane, to define the decision boundary.
• When data is not linearly separable, SVM applies the kernel trick to
transform the data into a higher-dimensional space where it becomes
separable.
• With strong generalization capabilities, SVM is particularly effective in
high-dimensional spaces and is widely used in image recognition, text
classification, and bioinformatics.
Terminologies in Support Vector Machine?
• Hyperplane:
• The hyperplane is the decision
boundary that separates different
classes in the dataset.
• In 2D space, it is a straight line; in 3D
space, it is a plane; in higher
dimensions, it is a hyperplane.
• The goal of SVM is to find the optimal
hyperplane that maximizes the margin
between classes.
• Support vectors:
• Support vectors are the data points
closest to the hyperplane.
• These points define the margin and
influence the position of the
hyperplane.
• Even if other (non-support-vector) points are removed, the decision boundary remains the same.
Terminologies in Support Vector Machine?

• Margin:
• The margin is the distance
between the hyperplane
and the nearest support
vectors.
• A larger margin means
better generalization (less
risk of overfitting).
• SVM aims to find the
maximum margin
hyperplane (MMH) for
better separation of
classes.
Terminologies in Support Vector Machine?
• Kernel Trick:
• Some datasets are not linearly separable in their original form.
• The Kernel Trick transforms data into a higher-dimensional space, where it
becomes separable.
• Common kernel functions:
• Linear Kernel: Used when data is linearly separable.
Terminologies in Support Vector Machine?

• Kernel Trick:
• Common kernel functions:
• Polynomial Kernel: useful for curved decision boundaries.
• Radial Basis Function (RBF) Kernel: widely used for complex, non-linear decision boundaries.
(Figures: decision boundaries produced by the Polynomial and RBF kernels)
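As a sketch of how these kernels are chosen in practice, here is a minimal scikit-learn comparison; the moon-shaped synthetic dataset and the parameter values are illustrative assumptions, not from the slides:

```python
# Minimal sketch: comparing SVM kernels on a non-linearly-separable dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel)        # kernel trick: implicit higher-dimensional mapping
    clf.fit(X, y)
    print(kernel, clf.score(X, y))  # RBF typically fits this curved boundary best
```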
Decision Tree
What is Decision Tree?

• A Decision Tree is a supervised machine learning algorithm used for


both classification and regression tasks.
• It works by recursively splitting the dataset into subsets based on
feature values, creating a tree-like structure with nodes and branches.
• The root node represents the entire dataset, while internal nodes
represent decision points based on a specific feature, and leaf nodes
contain the final prediction or output.
• The splits are made to maximize the homogeneity of the resulting
subsets, using metrics like Gini Impurity or Entropy for classification
tasks.
• Decision trees are easy to interpret, visualize, and understand, making
them a popular choice for various applications, though they can be
prone to overfitting, which can be mitigated through techniques like
pruning.
Terminologies in Decision Tree

• Root Node: The initial node that


represents the complete dataset.
• Branches: The connecting lines
between nodes, indicating the
progression from one decision to the
next.
• Internal Nodes: Decision points
where choices are made based on
the values of input features.
• Leaf Nodes: The endpoint nodes at
the end of the branches,
representing the final predictions or
outcomes.
Terminologies in Decision Tree
• Entropy is a measure of uncertainty or disorder in a dataset. In the context of decision trees, it quantifies the unpredictability of the class labels. For a binary problem, entropy ranges from 0 (perfectly pure node) to 1 (completely impure node), and the goal is to reduce entropy as much as possible with each split, so as to obtain homogeneous groups of data points.
• Information Gain is a metric used to measure how effective a feature is at reducing uncertainty (entropy) in the dataset when splitting on it. It is the difference between the entropy of the original dataset and the weighted sum of the entropies of the subsets created by the split. The higher the Information Gain, the better the feature is at classifying the data.
• The Gini Index is a measure of impurity that tells us how often a randomly selected element would be incorrectly classified if it were labeled randomly according to the class distribution of the dataset. It ranges from 0 (pure node) toward 1 (maximum impurity), and the goal is to minimize the Gini Index when splitting the dataset. It is used primarily in classification tasks and is the default splitting criterion in the widely used CART algorithm.
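Under the standard definitions, for a node $S$ with class proportions $p_i$, these quantities are:

$$\text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i, \qquad \text{Gini}(S) = 1 - \sum_{i} p_i^2$$

$$\text{IG}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$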
Random Forest
What is Random Forest?

Random Forest is an ensemble


learning method primarily used for
classification and regression tasks. It
works by constructing a collection (or
"forest") of decision trees, typically
trained with a technique known as
bagging (Bootstrap Aggregating).
Instead of building a single decision
tree, Random Forest creates multiple
decision trees, each trained on a
different subset of the data, and
makes predictions based on the
majority vote (for classification) or
average (for regression) of all the
trees' predictions.
How Random Forest works
• Data Bootstrapping: Random Forest first creates several
different training sets by randomly sampling the original
training data (with replacement).
• Tree Construction: For each training set, a decision tree is built.
At each node of the tree, only a random subset of features is
considered for the split, which helps create diverse trees.
• Voting/Averaging: For classification tasks, each tree in the forest
"votes" for a class, and the class with the majority of votes
becomes the model’s prediction. For regression, the predictions
from all trees are averaged to produce the final result.
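A minimal scikit-learn sketch of this bootstrap-and-vote procedure; the iris dataset and the parameter values are illustrative choices, not from the slides:

```python
# Minimal sketch: a bagged ensemble of decision trees with majority voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of trees; max_features limits features tried per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)          # each tree is trained on a bootstrap sample
print(forest.score(X_test, y_test))   # predictions are majority votes over trees
```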
Naïve Bayes
Naïve Bayes

• Naïve Bayes classifiers are supervised machine learning algorithms


used for classification tasks, based on Bayes’ Theorem to find
probabilities.
• The main idea behind the Naive Bayes classifier is to use Bayes’
Theorem to classify data based on the probabilities of different classes
given the features of the data. It is used mostly in high-dimensional text
classification.
• Given a set of features, it calculates the probability of a class and
assigns the highest probability class to the new instance.
• The Naïve Bayes algorithm is used in spam filtering, sentiment analysis, article classification, and many more applications.
Math behind Naïve Bayes

Bayes' Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

• P(A∣B) = probability of event A given B (posterior probability).
• P(B∣A) = probability of event B given A (likelihood).
• P(A) = prior probability of A.
• P(B) = prior probability of B (evidence).
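As a sketch of this highest-posterior rule, assuming a tiny made-up dataset and scikit-learn's Gaussian variant of Naïve Bayes:

```python
# Minimal sketch: Naive Bayes assigns the class with the highest posterior.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB()
nb.fit(X, y)
# predict_proba gives the posterior P(class | features); predict takes the argmax.
print(nb.predict_proba([[1.2, 1.9]]), nb.predict([[1.2, 1.9]]))
```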
Performance Metrics
Confusion Matrix
Confusion Matrix

• A Confusion Matrix is a table used to evaluate the performance


of a classification model. It compares the actual labels with the
predicted labels and helps identify errors in classification.
• True Positive (TP): Correctly predicted positive cases (e.g.,
detecting spam when it actually is spam).
• True Negative (TN): Correctly predicted negative cases (e.g.,
detecting non-spam when it actually is not spam).
• False Positive (FP) (Type I Error): Wrongly classified a negative as
positive (e.g., marking a normal email as spam).
• False Negative (FN) (Type II Error): Wrongly classified a positive
as negative (e.g., missing a spam email).
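A minimal sketch that computes these four counts with scikit-learn; the label vectors are made-up examples:

```python
# Minimal sketch: confusion matrix for a binary classifier (1 = spam, 0 = not spam).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted, labels=[0, 1]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```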
Accuracy
Accuracy measures how often the model’s predictions are correct overall. It
gives a general idea of how well the model is performing. However,
accuracy can be misleading, especially with imbalanced datasets where one
class dominates. For example, a model that predicts the majority class
correctly most of the time might have high accuracy but still fail to capture
important details about other classes.

• Works well when the dataset has balanced classes (e.g., 50%
positive, 50% negative).
• Not reliable for imbalanced datasets, where one class dominates the
other.
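In terms of the confusion-matrix counts introduced above:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$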
Precision
Precision focuses on the quality of the model’s positive predictions. It tells
us how many of the instances predicted as positive are actually positive.
Precision is important in situations where false positives need to be
minimized, such as detecting spam emails or fraud.

• Important in scenarios where false positives are costly (e.g., spam


filters, fraud detection, medical diagnoses).
• High precision means that most of the positive predictions are truly
positive.
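In confusion-matrix terms:

$$\text{Precision} = \frac{TP}{TP + FP}$$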
Recall
Recall measures how well the model identifies all actual positive cases. It
shows the proportion of true positives detected out of all the actual positive
instances. High recall is essential when missing positive cases has
significant consequences, such as in medical diagnoses.

• Crucial when missing positive cases is critical (e.g., detecting


diseases, fraud detection, safety alarms).
• High recall ensures that most actual positives are captured.
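In confusion-matrix terms:

$$\text{Recall} = \frac{TP}{TP + FN}$$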
F1 Score
F1-score combines precision and recall into a single metric to balance their
trade-off. It provides a better sense of a model’s overall performance,
particularly for imbalanced datasets. The F1 score is helpful when both false
positives and false negatives are important, though it assumes precision
and recall are equally significant, which might not always align with the use
case.

• Suitable when dealing with imbalanced datasets, where accuracy is


misleading.
• Helps in cases where we need a balance between precision and
recall.
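The F1 score is the harmonic mean of the two metrics above:

$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$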
Thank you!
