
Classification in Machine Learning

Yashil Sukurdeep
June 27, 2024

1 Classification: Fundamentals
In this lecture, we will explore the world of classification in machine learning.
Classification is a type of supervised learning where the goal is to predict the
category of a given input based on previously seen examples. We will discuss
two main types of classification problems: binary classification and multi-class
classification. We will also introduce two popular classification algorithms: the
k-Nearest Neighbors (kNN) classifier and the Naive Bayes’ classifier.

The following framework is typically used to tackle classification problems:


• Building a training set: The training set refers to a set of labelled examples (X_train, Y_train) for the classification task, where:
  – X_train = {⃗x_1, . . . , ⃗x_n} is the input data that we wish to classify, e.g. patients, images, text messages, and so on. We will assume that each sample from the training set is a d-dimensional vector, i.e., ⃗x_i ∈ R^d for all i = 1, . . . , n. Often, each training sample will in fact be a feature vector; see Examples 1.1 and 1.2. We will express the elements of a d-dimensional (feature) vector as ⃗x = (⃗x_1, . . . , ⃗x_d), and refer to them as features.
  – Y_train = {y_1, . . . , y_n} are the labels, where the label y_i corresponds to the class of sample ⃗x_i for each i = 1, . . . , n. The set of possible classes for each label y_i will be denoted by C.
• Training a classifier: The classifier is a model which uses the training set to learn how to assign a class to any given input. There are various types of classifiers that one can use.
• Testing the classifier: Once trained, we apply the classifier to a set of unseen examples (for which we know the labels, but the classifier does not), which is called the test set (or the validation set).
• Evaluating the classifier’s performance: The goal here is to determine how well the classifier has learnt to classify new inputs that it has not seen before. We measure the performance of a classifier through a set of performance metrics.

1.1 Binary classification
Binary classification involves categorizing data into one of two classes.
Example 1.1 (Spam Detection). Consider the task of classifying text messages
as spam or ham (i.e., not spam). Here, our input data is a text message, which
of course is not exactly a vector in R^d. Nevertheless, any given text message can
be represented by a feature vector, such as an array containing the frequency of
certain keywords appearing in the text message. Based on these feature vectors,
we can build a model to predict the class of new text messages, where the set of
possible classes is C = {ham, spam}.
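As a rough sketch of how such a feature vector might be built, one can count occurrences of a few chosen keywords in each message. The keyword list, the helper name to_feature_vector, and the example message below are made up for illustration; in practice the keywords would be chosen from the training data.

```python
# Hypothetical keyword list; in practice it would be derived from the training set.
keywords = ["free", "winner", "meeting", "lunch"]

def to_feature_vector(message: str) -> list[int]:
    """Represent a text message as keyword-frequency features (one entry per keyword)."""
    words = message.lower().split()
    return [words.count(kw) for kw in keywords]

print(to_feature_vector("you are a winner claim your free free prize now"))
# -> [2, 1, 0, 0]
```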

1.2 Multi-Class Classification


Multi-class classification involves categorizing data into more than two classes.
Example 1.2 (Handwritten Digit Recognition). Each image of a handwritten
digit can be represented by its pixel values. We can use these pixel values as our
feature vector for each image. However, this might be a very high-dimensional
feature vector (think of how many pixels there are in the image captured by your
smartphone)! As a result, we often use a simplified feature vector for each
image, and use these feature vectors to build a model that classifies each image
into one of 10 classes, i.e., C = {0, 1, . . . , 9}.
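As a rough illustration, scikit-learn's small digits dataset (our choice here; the lecture does not prescribe a specific dataset) already stores each 8×8 grayscale image both as an image and as a flattened 64-dimensional feature vector:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)         # (1797, 8, 8): 1797 images of 8x8 pixels
print(digits.data.shape)           # (1797, 64): each image flattened into a vector in R^64
print(np.unique(digits.target))    # the 10 classes: [0 1 2 3 4 5 6 7 8 9]
```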

2 Classification Algorithms
We now turn our attention to a couple of widely-used classification algorithms,
or classifiers.

2.1 k-Nearest Neighbors (kNN)


The k-Nearest Neighbors algorithm is a simple, instance-based learning method
where the class of a new sample is determined by the majority class among its
k nearest neighbors in the training data, for some chosen k ∈ N. The kNN
algorithm works as follows (a short code sketch follows these steps):

1. Choose the number of neighbors k.

2. For a new data point ⃗x, calculate the distance between ⃗x and all the samples in the training set. While many choices exist, common functions used to calculate the distance between two vectors ⃗x, ⃗y ∈ R^d include:
   • Euclidean Distance:
     $$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{d} (\vec{x}_i - \vec{y}_i)^2}$$
   • Manhattan Distance:
     $$d(\vec{x}, \vec{y}) = \sum_{i=1}^{d} |\vec{x}_i - \vec{y}_i|$$
   • Minkowski Distance, where p ≥ 1:
     $$d(\vec{x}, \vec{y}) = \left( \sum_{i=1}^{d} |\vec{x}_i - \vec{y}_i|^p \right)^{1/p}$$

3. Sort the distances and determine the k-nearest neighbors based on the
smallest distances.
4. Assign the class label based on the majority class among the k-nearest
neighbors.
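A minimal sketch of these four steps in Python, using the Euclidean distance and plain NumPy; the helper name knn_predict and the toy training data are our own illustration, not a fixed API:

```python
import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, Y_train, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Step 2: Euclidean distances between x_new and every training sample.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority class among those k neighbors.
    votes = Counter(Y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny made-up training set with two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
Y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(np.array([1.1, 0.9]), X_train, Y_train, k=3))  # -> "A"
```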

Figure 1: Illustration of the k-Nearest Neighbors algorithm. The new data point ⃗x_test is classified based on the majority class of its k = 1, 3, 4 nearest neighbors (red circles and blue circles). The example with k = 4 shows why, in binary classification, it is a good idea to choose an odd k: an even k can produce a tie.

2.2 Example: Iris Flower Classification


The Iris dataset is one of the earliest datasets used in the literature on classification methods, and it remains widely used in statistics and machine learning. It contains three classes of iris flowers, C = {Iris Setosa, Iris Versicolour, Iris Virginica}. Each iris flower in the dataset is represented by a 4-dimensional feature vector, i.e., a vector in R^4. The four features are the sepal length, sepal width, petal length, and petal width. To classify a new flower, we find its k nearest neighbors in the feature space and assign the class based on the majority vote.
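A hedged sketch of this example using scikit-learn's built-in copy of the Iris dataset; the choice of library, k = 5, and the 70/30 train/test split are our own assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()  # 150 flowers, 4 features each: sepal/petal length and width
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 nearest neighbors
knn.fit(X_train, y_train)

# Classify a new flower from its 4-dimensional feature vector.
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # sepal length/width, petal length/width (cm)
print(iris.target_names[knn.predict(new_flower)[0]])   # e.g. 'setosa'
print("Test accuracy:", knn.score(X_test, y_test))
```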

2.3 Naive Bayes’ Classifier
The Naive Bayes’ classifier is based on Bayes’ theorem and assumes that features
(in a feature vector) are conditionally independent given the class label. Despite
this strong assumption, it often performs well in practice.

• Bayes’ Rule (or Bayes’ Theorem): Recall Bayes’ theorem, which states:
  $$P(y \mid \vec{x}) = \frac{P(\vec{x} \mid y)\,P(y)}{P(\vec{x})}$$
  where P(y|⃗x) is the posterior probability of class y ∈ C given features ⃗x ∈ R^d, P(⃗x|y) is the likelihood of features ⃗x given class y, P(y) is the prior probability of class y, and P(⃗x) is the probability of the features ⃗x.
• Naive Bayes’ Assumption: The Naive Bayes classifier assumes that each feature ⃗x_k in the feature vector ⃗x = (⃗x_1, . . . , ⃗x_d) is conditionally independent of every other feature given the class label y. This simplifies the computation of the likelihood P(⃗x|y):
  $$P(\vec{x} \mid y) = P(\vec{x}_1, \vec{x}_2, \dots, \vec{x}_d \mid y) = \prod_{k=1}^{d} P(\vec{x}_k \mid y)$$

• Classification Rule: To classify a new instance, we compute the posterior probability for each class and choose the class with the highest posterior probability:
  $$\hat{y} = \operatorname*{argmax}_{y \in C} P(y \mid \vec{x}) = \operatorname*{argmax}_{y \in C} P(y) \prod_{k=1}^{d} P(\vec{x}_k \mid y)$$

Steps to Classify Data Using Naive Bayes

1. Training phase:
   • Calculate the prior probability P(y) for each class y ∈ C.
   • Calculate the likelihood P(⃗x_k|y) for each feature ⃗x_k given each class y ∈ C.
2. Classification phase:
   • For a new data point ⃗x ∈ R^d, calculate the posterior probability P(y|⃗x) for each class y ∈ C.
   • Assign the data point to the class with the highest posterior probability, as sketched in the code below.
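A minimal sketch of both phases for a tiny, made-up spam/ham dataset with binary keyword features; the data, the Bernoulli likelihood model, and the Laplace (add-one) smoothing are our own assumptions, not prescribed by the lecture:

```python
import numpy as np

# Made-up training set: binary keyword features (1 = keyword present), labels spam/ham.
X_train = np.array([[1, 1, 0],
                    [1, 0, 1],
                    [0, 0, 1],
                    [0, 1, 0]])
y_train = np.array(["spam", "spam", "ham", "ham"])
classes = np.unique(y_train)

# Training phase: priors P(y) and per-feature likelihoods P(x_k = 1 | y),
# with Laplace smoothing to avoid zero probabilities.
priors = {c: np.mean(y_train == c) for c in classes}
likelihoods = {c: (X_train[y_train == c].sum(axis=0) + 1) / ((y_train == c).sum() + 2)
               for c in classes}

def predict(x):
    """Classification phase: return the class with the highest posterior probability."""
    best_class, best_score = None, -np.inf
    for c in classes:
        p = likelihoods[c]
        # log P(y) + sum_k log P(x_k | y), assuming Bernoulli features
        score = np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(np.array([1, 0, 0])))  # -> "spam"
```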

3 Performance Metrics for Classification Algorithms
To evaluate the performance of classification algorithms, several metrics are
commonly used. These metrics provide insights into how well the classifier is
performing and help in comparing different classifiers. To illustrate the definitions
of these metrics, let us focus on the binary classification setting where the
two classes are, for instance, C = {Positive, Negative}.

3.1 Confusion Matrix


The confusion matrix is a table used to describe the performance of a classi-
fication model on a set of test data for which the true values are known. It
provides a comprehensive breakdown of the classifier’s performance by showing
the counts of true positive (TP), true negative (TN), false positive (FP), and
false negative (FN) predictions.

                     Predicted Positive    Predicted Negative
Actual Positive      TP                    FN
Actual Negative      FP                    TN

Table 1: Confusion Matrix for a Binary Classification Problem
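A quick sketch of how these four counts could be tallied from toy predictions; the labels below are invented purely for illustration:

```python
# Made-up true labels and classifier predictions for a binary task.
actual    = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]
predicted = ["Positive", "Negative", "Negative", "Positive", "Positive", "Negative"]

TP = sum(a == "Positive" and p == "Positive" for a, p in zip(actual, predicted))
TN = sum(a == "Negative" and p == "Negative" for a, p in zip(actual, predicted))
FP = sum(a == "Negative" and p == "Positive" for a, p in zip(actual, predicted))
FN = sum(a == "Positive" and p == "Negative" for a, p in zip(actual, predicted))
print(TP, TN, FP, FN)  # -> 2 2 1 1
```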

3.2 Accuracy
Accuracy is the ratio of correctly predicted instances to the total number of instances. It is a simple metric that provides an overall measure of the classifier’s effectiveness.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

3.3 Precision
Precision is the ratio of correctly predicted positive observations to the total
predicted positives. It indicates how many of the predicted positive instances
are actually positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$

3.4 Recall
Recall (also known as Sensitivity or True Positive Rate) is the ratio of correctly predicted positive observations to all observations in the actual positive class. It measures how well the classifier identifies positive instances.
$$\text{Recall} = \frac{TP}{TP + FN}$$
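Continuing the toy counts from the confusion-matrix sketch above (TP = 2, TN = 2, FP = 1, FN = 1), the three metrics are straightforward to compute:

```python
TP, TN, FP, FN = 2, 2, 1, 1

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 4/6 ≈ 0.67
precision = TP / (TP + FP)                    # 2/3 ≈ 0.67
recall    = TP / (TP + FN)                    # 2/3 ≈ 0.67
print(accuracy, precision, recall)
```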

These metrics help in understanding the performance of the classification algo-
rithm beyond simple accuracy, providing a more detailed view of the classifier’s
ability to correctly identify positive and negative instances.

4 Conclusion
In this lecture, we introduced the concepts of binary and multi-class classifica-
tion and discussed two popular classification algorithms: k-Nearest Neighbors
(kNN) and the Naive Bayes’ classifier. These methods form the basis of many
machine learning applications, from spam detection to image recognition. We
also discussed performance metrics that help us evaluate our classifiers and
quantify how much confidence to place in their predictions.
