
Spring 2025

ECE 490: Introduction to


Machine Learning
Chapter 4: Supervised Machine Learning Algorithms
Machine Learning Lifecycle

ECE 490: Introduction to ML 2


Where are we in the life cycle now?
We are here

ECE 490: Introduction to ML 3


Supervised Learning

ECE 490: Introduction to ML 4


Recap: Supervised Learning

Supervised learning is a category of machine learning that uses labeled datasets


to train algorithms to predict outcomes and recognize patterns.

ECE 490: Introduction to ML 5


Algorithm vs Model

A model is the outcome of training an algorithm on data; it represents the


learned patterns, relationships, or predictions based on the training process.

ECE 490: Introduction to ML 6


Algorithm vs Model

ECE 490: Introduction to ML 7


Types of Supervised Machine Learning
(SML) Applications

ECE 490: Introduction to ML 8


Classification vs Regression

ECE 490: Introduction to ML 9


Classification

Classification is a type of supervised learning where the goal is to predict the


categorical label of new observations based on past observations.

ECE 490: Introduction to ML 10


Regression

Regression is another key type of supervised learning that focuses on predicting


continuous numerical values rather than categorical labels.

ECE 490: Introduction to ML 11


Examples of SML Applications

ECE 490: Introduction to ML 12


Optical Character Recognition (OCR)

The model is able to identify handwritten characters and classify each image as a
character. In this case, we are classifying the number digits.

ECE 490: Introduction to ML 13


Email Prioritization

The model is able to successfully detect which of the arriving emails go to spam
and which to the primary inbox.

ECE 490: Introduction to ML 14


Language Translation

The model is able to take in a sequence in


one language and output a sequence of the
same information in a different language.

Would this be classification or regression?

ECE 490: Introduction to ML 15


Language Translation

Language translation is a classification


problem because it involves predicting the
next word (or token) from a predefined
vocabulary, which can be seen as a list of
possible "classes." Each word in the
vocabulary is effectively a "class" that the
model selects based on the input context.

ECE 490: Introduction to ML 16


Linear Regression

ECE 490: Introduction to ML 17


Linear Regression
Linear regression is a statistical method
used to model the relationship between
the target variable and a feature variable
with a line.

Based on the training data points, we try to


create a line that best models the
relationships between the input feature and
the output feature.

ECE 490: Introduction to ML 18


Applications of Linear Regression

ECE 490: Introduction to ML 19


Linear Regression

Given that our dataset has one input feature and an output feature, let’s plot our
data.

ECE 490: Introduction to ML 20


Linear Regression

We want to create a line that models the training data points with the lowest error
possible. We call the modeled relationship “the best fit line”.

(Plot: the best fit line, i.e., the predicted line, over the training data points)

ECE 490: Introduction to ML 21


Linear Regression

(Plot: a training data point with its label, the predicted value on the line, and the
input feature value on the horizontal axis)

ECE 490: Introduction to ML 22


Linear Regression

We want to create a line that will not only


model the current data points, but will also
allow us to predict future outputs with high
accuracy.

Predicted value

New input (a value not seen in the dataset)
ECE 490: Introduction to ML 23
Linear Regression

To get to the best fit line, we will go


through multiple iterations of updates.

What is the model updating (or “learning”)?

In the case of creating a line, we update


the slope and the y-intercept.

ECE 490: Introduction to ML 24


Linear Regression

Since the model is trying to create a best fit line, it is optimizing the equation of a
line.

ŷ = w·x + b, where ŷ is the predicted value, x is the input feature value, and w and b
are the model trainable parameters.
ECE 490: Introduction to ML 25
“Learning” of machine learning algorithms

We call the variables that get updated during the training process “trainable
parameters” or “weights”. In some cases, we also have a “bias” term as a
trainable parameter.

- Trainable parameters because they are getting “trained” or updated during


the learning process.
- Weights because they also carry the importance and contribution of each
input value.
- Bias because it helps direct the predictions of the model for higher
accuracy.

ECE 490: Introduction to ML 26


Linear Regression

In the case of linear regression, we have both a weight and a bias in our model.

ŷ = w·x + b, where b is the bias term and w is the weight (the model’s trainable
parameters).
ECE 490: Introduction to ML 27
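To make the previous slide concrete, here is a minimal sketch (in Python, not the course's lab code) of the linear regression prediction with a weight and a bias; the names predict, w, and b are illustrative placeholders:

def predict(x, w, b):
    # prediction = weight * input feature + bias
    return w * x + b

# example usage with arbitrary parameter values
print(predict(2.0, w=0.5, b=1.0))   # -> 2.0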
How SML algorithms learn

ECE 490: Introduction to ML 28


Learning flow of SML models

ECE 490: Introduction to ML 29


SML training process

The training (or learning) process of


supervised machine learning algorithms
consists of 4 parts:

1. Preparing the training dataset


2. Initializing the algorithm
3. Loop over data points in the training
set:
a. Make a prediction (Class or Number)
b. Calculate the error of the prediction
4. Update the algorithm parameters

ECE 490: Introduction to ML 30
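A hedged, minimal sketch of this four-part training loop for one-feature linear regression is shown below; the toy data, learning rate, and squared-error update are assumptions chosen only for illustration:

import random

# 1. Prepare the training dataset (input feature, label) -- toy data
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1)) for x in range(10)]

# 2. Initialize the algorithm: random trainable parameters
w, b = random.random(), random.random()
lr = 0.01  # learning rate (a hyper-parameter)

# 3. Loop over data points in the training set
for epoch in range(100):
    for x, y in data:
        y_hat = w * x + b      # a. make a prediction
        error = y_hat - y      # b. calculate the error of the prediction
        # 4. Update the algorithm parameters (gradient of the squared error)
        w -= lr * 2 * error * x
        b -= lr * 2 * error

print(w, b)   # should approach the slope 2.0 and intercept 1.0 of the toy data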


Preparing the training dataset

The data preparation steps, as


covered in the previous chapters,
need:

1. EDA and feature tuning


2. Transformation to numerical
representation
3. Split the dataset for training
and testing

ECE 490: Introduction to ML 31


Training Linear Regression

We mentioned that linear regression has one bias term


and one weight that need to be updated using the
training loop. Let’s see how that happens.

First, we will use a dataset of randomly created points.

ECE 490: Introduction to ML 32


Training Linear Regression

Since the dataset is random and does not actually hold


any information, we don’t need to perform EDA.

We will only pre-process the data by normalizing it.

ECE 490: Introduction to ML 33


Initializing the algorithm

Every machine learning algorithm holds its


learned patterns and connections within its
parameters, often referred to as “trainable
parameters”.

These trainable parameters, at the


beginning of the training process, are
randomly initialized to be updated and
“learned” during the training process.

ECE 490: Introduction to ML 34


Linear Regression Example

So, we will choose random initializations for our trainable parameters.

ECE 490: Introduction to ML 35


Linear Regression Example

Let’s visualize the initialized line

ECE 490: Introduction to ML 36


Looping over data points

For every data point, we feed the data point to the model so it can make a
prediction.

This prediction would not be accurate, so it contains some error.

(Plot: the error is the gap between the real value and the predicted value for an
input data point)


ECE 490: Introduction to ML 37
Looping over data points

This error that we got, which is the difference between real and predicted values,
should guide our update of the trainable parameters.

The goal is to update the trainable parameters so that their update results in
a lower error.

ECE 490: Introduction to ML 38


Error calculation

For regression tasks, we have multiple


options for calculating the error between the
real and predicted values:
1. Mean Bias Error (MBE)
2. Mean Squared Error (MSE)
3. Mean Absolute Error (MAE)
4. Root Mean Squared Error (RMSE)
5. Huber Loss
6. Mean Logarithmic Error (MLE)
7. …

ECE 490: Introduction to ML 39


Error calculation - Loss function

If we calculate the error for more than one data point at a time, this is done using a
loss function.

The loss function aggregates the individual errors across the selected data
points, typically by computing the average or sum of the errors using a specified
error function (e.g., MSE, MAE).

ECE 490: Introduction to ML 40


Error Calculation - Mean Bias Error

Measures the average difference between predicted and actual values. It indicates
the direction of the error (positive or negative bias).

- Positive MBE: Overestimation.


- Negative MBE: Underestimation.

Useful for understanding bias in predictions but not ideal as a standalone metric
because it doesn't capture the magnitude of errors.

Error Function: eᵢ = ŷᵢ − yᵢ        Loss Function: MBE = (1/n) Σᵢ (ŷᵢ − yᵢ)

ECE 490: Introduction to ML 41


Linear Regression Example - MBE

In the linear regression model we initialized earlier, let’s try to get a prediction from
the first data point in the training set and calculate its MBE.

ECE 490: Introduction to ML 42


Error Calculation - Mean Squared Error

Measures the average squared differences between predicted and actual values.
Squaring emphasizes larger errors, making it sensitive to outliers.

Use this error when you want to penalize large errors more heavily or when
outliers are meaningful.

Error Function: eᵢ = (ŷᵢ − yᵢ)²        Loss Function: MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²

ECE 490: Introduction to ML 43


Linear Regression Example - MSE

Using the MSE, we see a significant difference in the error.

Both MSE and MBE can guide the update of trainable parameters during model
training, but they serve different purposes: MSE is more commonly used to
minimize overall prediction error, while MBE provides insight into whether the
model consistently overestimates or underestimates.

ECE 490: Introduction to ML 44


Error Calculation - Mean Absolute Error

Measures the average of absolute differences between predicted and actual


values. It treats all errors equally, making it robust to outliers.

Useful for when you want a simple, interpretable metric that is less sensitive to
outliers than MSE.

Error Function: eᵢ = |ŷᵢ − yᵢ|        Loss Function: MAE = (1/n) Σᵢ |ŷᵢ − yᵢ|

ECE 490: Introduction to ML 45


Linear Regression Example - MAE

The MAE provides a more robust measure of error by treating all deviations
equally, without disproportionately penalizing large errors, unlike MSE. This
characteristic makes MAE less sensitive to outliers and provides a more balanced
reflection of typical errors in the model.

ECE 490: Introduction to ML 46


Error Calculation - Root Mean Squared Error

The square root of MSE. It provides the error in the same units as the target
variable, making it more interpretable than MSE.
Useful for when you want a metric in the same scale as the target variable, while
still penalizing large errors more heavily.
Suppose you're predicting house prices in dollars. RMSE provides an error value
(e.g., $5,000) that is also in dollars. This tells you that, on average, your model's
prediction is about $5,000 off from the actual value.

Error Function: eᵢ = (ŷᵢ − yᵢ)²        Loss Function: RMSE = √( (1/n) Σᵢ (ŷᵢ − yᵢ)² )

ECE 490: Introduction to ML 47


Linear Regression Example - RMSE

In this case, RMSE and MAE have the same value. This makes sense because we
are calculating the error for a single data point.

Both are expressed in the same units as the target variable, but RMSE can
sometimes overemphasize large errors, which might distort the perception of the
model's performance.

ECE 490: Introduction to ML 48


Error Calculation - Huber Loss

Combines the properties of MSE and MAE. It behaves like MSE for small errors
and switches to MAE for large errors, making it robust to outliers while maintaining
sensitivity to small errors.

Error Function: eᵢ = ½(ŷᵢ − yᵢ)² if |ŷᵢ − yᵢ| ≤ δ, and eᵢ = δ·(|ŷᵢ − yᵢ| − ½δ) otherwise.

You set the threshold (δ) according to the size of your dataset.

Loss Function: the average of this error over the selected data points.

ECE 490: Introduction to ML 49


Linear Regression Example - Huber Loss

In our case, since the error > δ, the Huber loss applied the linear (MAE-like) branch
of the formula.

ECE 490: Introduction to ML 50


Error Calculation - MLE

Measures the error logarithmically, which reduces the impact of large errors. This
metric is useful when the target values vary over several orders of magnitude.

Useful for when handling data with widely varying scales or when large errors are
undesirable but should not dominate the metric.

Error Function: eᵢ = (log(1 + ŷᵢ) − log(1 + yᵢ))²

Loss Function: the average of this error over the selected data points.

ECE 490: Introduction to ML 51


Linear Regression Example - MLE

As you can see, MLE tends to be much smaller than other error metrics like MSE,
MAE, or RMSE. This is because it focuses on relative error rather than being
dominated by large errors and outliers.

ECE 490: Introduction to ML 52


Regression Error Functions Recap

ECE 490: Introduction to ML 53
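As a reference for the recap above, here is a minimal NumPy sketch of the regression error functions discussed in this section (MBE, MSE, MAE, RMSE, and Huber); the example arrays and the δ value are assumptions for illustration:

import numpy as np

def mbe(y, y_hat):  return np.mean(y_hat - y)            # signed average error (bias)
def mse(y, y_hat):  return np.mean((y_hat - y) ** 2)     # squared error, punishes outliers
def mae(y, y_hat):  return np.mean(np.abs(y_hat - y))    # absolute error, robust to outliers
def rmse(y, y_hat): return np.sqrt(mse(y, y_hat))        # same units as the target

def huber(y, y_hat, delta=1.0):
    # quadratic for small errors, linear for large errors
    err = np.abs(y_hat - y)
    return np.mean(np.where(err <= delta,
                            0.5 * err ** 2,
                            delta * (err - 0.5 * delta)))

y, y_hat = np.array([3.0, 5.0, 2.5]), np.array([2.5, 5.0, 4.0])
print(mbe(y, y_hat), mse(y, y_hat), mae(y, y_hat), rmse(y, y_hat), huber(y, y_hat))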


Updating Algorithm Trainable Parameters

Now, the algorithm is initialized, and we made a prediction (or a set of predictions)
and calculated their error using the chosen loss function.

We want to utilize this error to perform an update to the trainable parameters.

Intuitively, we want to update the trainable parameters in a way that will minimize
the error.

This process is performed using a family of methods called optimization
algorithms.

ECE 490: Introduction to ML 54


Optimization Algorithms - Gradient Descent
For now, we will focus on the most popular optimization algorithm: Gradient
Descent, which can be used in the training of ANY machine learning
algorithm.

Gradient descent is simply used to find the values of a function’s parameters


(coefficients) that minimize a cost function as much as possible.

ECE 490: Introduction to ML 55


Optimization Algorithms - Gradient Descent

If we were to plot the cost (or loss) function with respect to the change in
value of one trainable parameter, we would get a 2D curve.

(Plot: the loss value versus the value of one trainable parameter)

ECE 490: Introduction to ML 56


Optimization Algorithms - Gradient Descent

If we were to plot the error with respect to the change in value of two trainable
parameters, we would get a 3D surface.

Desired point: lowest value of the error

ECE 490: Introduction to ML 57


Optimization Algorithms - Gradient Descent

To get the lowest error, we should get the minimum of the loss (or cost) function.

How do we get the global minimum of a function?

ECE 490: Introduction to ML 58


Optimization Algorithms - Gradient Descent

The global minimum of a function is a point where the derivative of the function equals 0.

Thus, we want to move the parameter closer to where the loss is at its
minimum through incremental steps which consider the derivative of the loss
function.

Where the
derivative of the
function = 0

ECE 490: Introduction to ML 59


Optimization Algorithms - Gradient Descent

Thus, the update moves as follows:

ECE 490: Introduction to ML 60


Optimization Algorithms - Gradient Descent

Gradient Descent (GD) is an iterative process where you start at a coefficient’s


initial point (x0) and you move step by step until you reach the minimum of the
loss function.

The update rule of your position is given by this formula:

ECE 490: Introduction to ML 61


Optimization Algorithms - Gradient Descent

Let’s dissect the update rule of gradient descent:

New_value = old_value - learning_rate * derivative_of_cost_function_wrt_param

1. New_value: The updated value of the parameter.


2. old_value: The previous value of the parameter. In the case of the first
update, this is the random initialization value of the parameter.
3. learning_rate: The rate of update (how fast or slow) we will move towards the
optimal coefficients/parameters.
4. derivative_of_cost_function_wrt_param: The derivative of the cost function
used to direct the update closer to the global minimum.

ECE 490: Introduction to ML 62


Optimization Algorithms - Gradient Descent

The choice of the learning rate affects the size of the steps we are taking to get to
the minimum.

ECE 490: Introduction to ML 63


Optimization Algorithms - Gradient Descent

Let’s take MSE as the choice for our loss function and compute its derivative with
respect to the trainable parameters

ECE 490: Introduction to ML 64
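A minimal sketch of one gradient-descent step for the linear regression parameters using the MSE derivative; the variable names and the example data are illustrative assumptions, not the course's lab code:

import numpy as np

def gd_step(x, y, w, b, lr=0.1):
    y_hat = w * x + b
    n = len(x)
    # derivatives of MSE = (1/n) * sum((y_hat - y)^2) with respect to w and b
    dw = (2.0 / n) * np.sum((y_hat - y) * x)
    db = (2.0 / n) * np.sum(y_hat - y)
    # update rule: new_value = old_value - learning_rate * derivative
    return w - lr * dw, b - lr * db

x = np.array([0.0, 0.5, 1.0])
y = np.array([1.0, 2.0, 3.0])
w, b = 0.0, 0.0
for _ in range(200):
    w, b = gd_step(x, y, w, b)
print(w, b)   # approaches the best-fit slope and intercept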


Updating Algorithm Trainable Parameters

Depending on the size and complexity


of our data, we might choose to update
our parameters:

- Every data point,


- Every batch of data points,
- After the ingestion of all the training
data points.

ECE 490: Introduction to ML 65


When do we perform this update?

We can apply gradient descent:

1. After every data point: using Stochastic Gradient Descent

2. After every mini-batch of data points: using Mini-Batch Gradient Descent
3. After all the training data points: using (full) Batch Gradient Descent

ECE 490: Introduction to ML 66


Linear Regression Example
Going back to our linear regression example, we can now train our algorithm by
performing the parameter update loop using gradient descent.

ECE 490: Introduction to ML 67


Linear Regression Example

We can now visualize the initial


and final model.

ECE 490: Introduction to ML 68


Multi-Linear Regression

ECE 490: Introduction to ML 69


Multi-Linear Regression

While we can model the relationship between one feature and the output using a
line, we can model the relationship between two input features and the output using
a plane.

ECE 490: Introduction to ML 70


Multi-Linear Regression

Using the same logic, the relationship between three or more features and the
output can be modeled using a hyperplane.

ECE 490: Introduction to ML 71


Multi-Linear Regression

Thus, multi-linear regression is modeled by a first-degree polynomial in several
variables. The number of variables in the polynomial depends on the number of
input features:

ŷ = b + w₁x₁ + w₂x₂ + … + wₙxₙ

where ŷ is the output, x₁ … xₙ are the input features, w₁ … wₙ are the weights, and b
is the bias.

ECE 490: Introduction to ML 72


Updates in multi-linear regression

We apply the gradient descent update rule for each weight in the multi-linear
regression model in each iteration.

Derivative of the cost function (MSE) with respect to weight wⱼ:
∂MSE/∂wⱼ = (2/n) Σᵢ (ŷᵢ − yᵢ) xᵢⱼ

Update rule: wⱼ ← wⱼ − learning_rate · ∂MSE/∂wⱼ

ECE 490: Introduction to ML 73
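The same update can be written in vectorized form for multi-linear regression; this is a sketch whose array shapes and names are assumptions, with all weights and the bias updated together in each iteration:

import numpy as np

def gd_step_multi(X, y, w, b, lr=0.1):
    # X: (n_samples, n_features), w: (n_features,), b: scalar
    y_hat = X @ w + b
    n = X.shape[0]
    dw = (2.0 / n) * X.T @ (y_hat - y)   # one partial derivative per weight
    db = (2.0 / n) * np.sum(y_hat - y)
    return w - lr * dw, b - lr * db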


Classification Algorithms

ECE 490: Introduction to ML 74


Binary vs Multi-Class Classification

ECE 490: Introduction to ML 75


Binary vs Multi-Class Classification

ECE 490: Introduction to ML 76


What would the data look like?

ECE 490: Introduction to ML 77


Logistic Regression

ECE 490: Introduction to ML 78


Logistic Regression
Logistic Regression is a statistical method used for binary classification
problems, where the outcome variable is categorical with two possible outcomes.

ECE 490: Introduction to ML 79


Logistic Regression

So why does it have ‘Regression’ in its name if it is a classification problem?

Logistic Regression is an extension of linear regression and multi-linear


regression to be used for classification.

This extension is done by adding a mapping function that would allow us to map
the output of the linear regression part to a class.

ECE 490: Introduction to ML 80


Logistic Regression

The equation of logistic regression: we add a “Sigmoid” function after the
(multi-)linear regression block, so the output becomes ŷ = sigmoid(w·x + b).

ECE 490: Introduction to ML 81


Logistic Regression

The sigmoid function: sigmoid(z) = 1 / (1 + e^(−z)), which maps any real value to
the interval (0, 1).

Given that our threshold is 0.5: if the output of the sigmoid function is above 0.5,
the data point belongs to class A. Otherwise, the data point belongs to class B.

ECE 490: Introduction to ML 82
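A small sketch of the sigmoid mapping and the default 0.5 threshold; the class names and parameter values are placeholders, not part of the course material:

import numpy as np

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, w, b, threshold=0.5):
    p = sigmoid(np.dot(w, x) + b)        # linear block, then sigmoid
    return "class A" if p > threshold else "class B"

print(predict_class(np.array([1.0, 2.0]), w=np.array([0.8, -0.3]), b=0.1))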


Logistic Regression

Should the threshold always be 0.5? No.

This is the default threshold. However, according to your data, you might find
another threshold more suitable.

ECE 490: Introduction to ML 83


What would the line in the LR block represent?

We saw that the sigmoid function is preceded by a linear regression block.


Instead of modeling the data, the line is used as a classifying boundary.

ECE 490: Introduction to ML 84


Trainable parameters in Logistic Regression

We know what the trainable parameters in the (multi-)linear regression block are
(coefficients + bias).

Do we have any trainable parameters in the sigmoid block?

ECE 490: Introduction to ML 85


Trainable parameters in Logistic Regression

What about the threshold that we use to decide if the output of the sigmoid block
maps to class A or class B?

The threshold is pre-set and not changed or “learned” during the training process.
This means that it is a fixed parameter and not a trainable parameter.
We call these types of parameters “hyper-parameters”.

ECE 490: Introduction to ML 86


Training Classification Algorithms

ECE 490: Introduction to ML 87


Classification Training Lifecycle

To update the parameters of a classification function, we will have to follow the


same training steps as before:
1. Random initialization of training parameters
2. Passing the function through one or multiple training samples
3. Calculating the error
4. Using gradient descent to update the value of the trainable parameters

Which part do you think differs between classification and regression?

ECE 490: Introduction to ML 88


Classification Training Lifecycle

To update the parameters of a classification function, we will have to follow the


same training steps as before:
1. Random initialization of training parameters
2. Passing the function through one or multiple training samples
3. Calculating the error
4. Using gradient descent to update the value of the trainable parameters

We cannot measure the error using the same metrics for classification and
regression.

ECE 490: Introduction to ML 89


Classification Error Metrics- Binary

Let’s say we have this example where we will predict if a person is male (1) or
female (0) based on their height

ECE 490: Introduction to ML 90


Classification Error Metrics- Binary

The output of the logistic regression function will be the output of the sigmoid
function. This means that it will be a float between 0 and 1.

ECE 490: Introduction to ML 91


Binary Cross Entropy

The goal of training is to maximize the likelihood that the model assigns to the
correct labels. Instead of maximizing the likelihood directly, we minimize the
negative log-likelihood, leading to: BCE = −[ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]

ECE 490: Introduction to ML 92


Binary Cross Entropy

This function penalizes wrong predictions heavily by taking the log of predicted
probabilities. Logarithmic scaling ensures that confident wrong predictions are
penalized much more than weakly wrong predictions.

ECE 490: Introduction to ML 93


Binary Cross Entropy

The loss function we saw is called Binary Cross Entropy. “Binary” means that it is
used to calculate the error and update binary classification problems.

Entropy measures the uncertainty or unpredictability of a probability distribution.
If an event is certain, entropy is low.

Cross-entropy measures the distance between two probability distributions:
- True distribution P (actual labels)
- Predicted distribution Q (model’s predictions)

ECE 490: Introduction to ML 94
Binary Cross Entropy

Imagine we have two models predicting the probability of an image being a "cat".

ECE 490: Introduction to ML 95


Binary Cross Entropy

ECE 490: Introduction to ML 96


Binary Cross Entropy: Loss vs Cost

As we mentioned before, a loss (or error) is for one data sample, while a cost is for
multiple data samples.

ECE 490: Introduction to ML 97
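A minimal sketch of binary cross entropy as a per-sample loss and as a cost averaged over several samples; the small clipping constant is an assumed numerical safeguard, not part of the definition:

import numpy as np

def bce_loss(y, p, eps=1e-12):
    # loss for one sample: -[y*log(p) + (1-y)*log(1-p)]
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_cost(y_true, p_pred):
    # cost: average loss over multiple samples
    return np.mean([bce_loss(y, p) for y, p in zip(y_true, p_pred)])

print(bce_loss(1, 0.9))   # small loss: confident and correct
print(bce_loss(1, 0.1))   # large loss: confident and wrong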


Classification Error Metrics

So far, with logistic regression, we saw a binary classification model that outputs a
value between 0 and 1, where:
● Anything below the threshold (0.5) belongs to one class and anything above
belongs to another.

What if we have multiple classes?

ECE 490: Introduction to ML 98


Multi-Class Classification

Multi-class classification
models output an array of
probabilities instead of one
probability (likelihood of
belonging to class A).

ECE 490: Introduction to ML 99


Classification Error Metrics

For multi-class classification, we use Categorical Cross Entropy instead of
Binary Cross Entropy:

CCE = −(1/n) Σᵢ Σⱼ₌₁..k yᵢⱼ · log(ŷᵢⱼ)

where k is the number of classes, ŷᵢⱼ is the predicted probability that sample i
belongs to class j, and yᵢⱼ is the true (one-hot) value that sample i belongs to class j.
We sum over all classes for each sample since one-hot encoding ensures only the
true class contributes to the loss.

ECE 490: Introduction to ML 100
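A minimal sketch of categorical cross entropy with one-hot labels; the array shapes and example values are assumptions for illustration:

import numpy as np

def cce_cost(Y_true, Y_pred, eps=1e-12):
    # Y_true: one-hot labels, shape (n_samples, k)
    # Y_pred: predicted probabilities, shape (n_samples, k)
    Y_pred = np.clip(Y_pred, eps, 1.0)
    # only the true class of each sample contributes to the inner sum
    return -np.mean(np.sum(Y_true * np.log(Y_pred), axis=1))

Y_true = np.array([[0, 1, 0], [1, 0, 0]])
Y_pred = np.array([[0.2, 0.7, 0.1], [0.8, 0.1, 0.1]])
print(cce_cost(Y_true, Y_pred))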


Categorical Cross Entropy

ECE 490: Introduction to ML 101


Logistic Regression for MultiClass Classification

Can we use logistic regression for multiclass


classification? Yes.

To do so, we have two options:


1. One vs All algorithm. We train multiple
binary logistic regression models, each
distinguishing one class from all others.
2. If we switch the sigmoid function with a
softmax function, we can output an array of
probabilities instead of a singular probability
value.

ECE 490: Introduction to ML 102
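For option 2, here is a small sketch of the softmax function that would replace the sigmoid; subtracting the maximum score is a common numerical-stability detail, assumed here:

import numpy as np

def softmax(z):
    # turns a vector of scores into an array of probabilities that sums to 1
    z = z - np.max(z)      # numerical stability
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # e.g. [~0.66, ~0.24, ~0.10]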


Gradient Descent using BCE and CCE

We perform the same steps in the gradient descent update, where we also derive
the cost function and insert its derivative into the gradient update rule.

ECE 490: Introduction to ML 103


Training logistic regression from scratch
in lab ‘Classification Algorithms’

ECE 490: Introduction to ML 104


Naive Bayes

ECE 490: Introduction to ML 105


Naive Bayes

Naive Bayes is a probability-based algorithm, unlike logistic regression, which was a
linear-regression-based algorithm used to output a probability.

In Naive Bayes, we use a probability function to calculate the probability of a data
point belonging to a class.

ECE 490: Introduction to ML 106


Probability Recap

Calculating the probability of an event means finding how likely this event is to
happen.

Probabilities of a calibrated (fair) die:

ECE 490: Introduction to ML 107


Probability Recap

Expectation: Summation of all possible values of a random variable, multiplied by


the probability of each.

ECE 490: Introduction to ML 108


Probability Recap

In most cases, the probability is not equally distributed.


● What if we don’t know if the dice is calibrated?
● What if we don’t know if the coin is calibrated?
● Then we don’t know the probabilities, we need to approximate them.

ECE 490: Introduction to ML 109


Naive Bayes

The Naive Bayes algorithm is based on Bayes’ Theorem for conditional
probability.

Let’s recall conditional probability (Bayes’ rule): P(A|B) = P(B|A) · P(A) / P(B)

ECE 490: Introduction to ML 110


Naive Bayes

How does conditional probability apply to prediction problems?

In our case of classification, we want to calculate the probability of a data point


belonging to class A, to class B, to class C… Then, we will use the highest
probability to select the class.

What guides our decision when choosing the class? The information in the features.

ECE 490: Introduction to ML 111


Naive Bayes

What guides our decision when choosing the class? The information in the features.

ECE 490: Introduction to ML 112


Posterior Probability

ECE 490: Introduction to ML 113


Naive Bayes Training

Unlike KNN, which stores all training examples, Naïve Bayes compresses the data
into a small set of probability values.

During training, it computes:


- Class priors: P(Ck)– the probability of each class occurring.
- Feature likelihoods: P(X∣Ck) – how often a feature takes a certain value,
given a class.
- For categorical data: It counts the occurrences of feature values for each
class and normalizes them into probabilities.
- For continuous data: It estimates the mean and variance of the feature
values per class.

ECE 490: Introduction to ML 114


Naive Bayes Training

After computing class priors and feature likelihoods, it applies Bayes’ Theorem to
compute posterior probabilities.

This makes it computationally efficient


because training is just counting and
computing probabilities—there’s no
iterative optimization.

ECE 490: Introduction to ML 115


Naive Bayes Training

To summarize, the Naive Bayes algorithm is trained using three steps:


1. Counts feature occurrences per class.
2. Estimates likelihoods (of features) using probability distributions (e.g.,
Gaussian for continuous data, Multinomial for text data).
3. Applies Bayes' Theorem to compute posterior probabilities.

ECE 490: Introduction to ML 116
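A minimal sketch of this "training is just counting" idea for a single categorical feature; the toy weather data and variable names are hypothetical and only for illustration:

from collections import Counter, defaultdict

# toy dataset of (weather, play?) pairs -- purely illustrative values
data = [("sunny", "yes"), ("sunny", "no"), ("rainy", "no"),
        ("sunny", "yes"), ("overcast", "yes"), ("rainy", "no")]

# 1. Count feature occurrences per class
class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)
for feature, label in data:
    feature_counts[label][feature] += 1

# 2. Estimate class priors P(Ck) and feature likelihoods P(X | Ck)
n = len(data)
priors = {c: count / n for c, count in class_counts.items()}
likelihoods = {c: {f: cnt / class_counts[c] for f, cnt in fc.items()}
               for c, fc in feature_counts.items()}

# 3. Apply Bayes' theorem (up to a normalizing constant) for a new input
x = "sunny"
posteriors = {c: priors[c] * likelihoods[c].get(x, 0.0) for c in priors}
print(max(posteriors, key=posteriors.get))   # class with the highest probability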


Naive Bayes Example

To solidify how Naïve Bayes functions, we will consider an


example with one feature.
The question we want to answer: will kids play if the weather is sunny?

How do we start with applying Naive Bayes?

ECE 490: Introduction to ML 117


Naive Bayes Example

Step 1: Count feature occurrences per class.

ECE 490: Introduction to ML 118


Naive Bayes Example

Step 2: Calculate the class probabilities and feature likelihoods.

In order to calculate the likelihood of kids playing with respect to the weather
condition, we begin by computing the probability of each condition and event.

ECE 490: Introduction to ML 119


Naive Bayes Example

Step 3: Apply Bayes' Theorem to compute posterior probabilities.


The model assigns the class with the highest probability

ECE 490: Introduction to ML 120


Naive Bayes Assumptions

We notice that we calculate the feature likelihoods without taking into


consideration more than one feature at a time.

This means that we assume there are no feature correlations or dependencies.

This assumption may be false for many use cases where there is at least minor
correlation between the features that should be considered for accurate modeling.

ECE 490: Introduction to ML 121


Naive Bayes Assumptions

Let’s verify this assumption with an example. If we have 2 features F1 and F2,
Bayes’ rule becomes as follows:

ECE 490: Introduction to ML 122


Naive Bayes Assumptions

So in case of 2 features and 2 classes, for example, we compute the following


probabilities:

ECE 490: Introduction to ML 123


Applications of Naive Bayes

ECE 490: Introduction to ML 124


Algorithms that can be used for both
regression and classification

ECE 490: Introduction to ML 125


K-Nearest Neighbors For Classification

ECE 490: Introduction to ML 126


KNNs for Classification

It classifies a new data instance based on the k most similar training examples.

Similarity, in this case, is measured by distance.

ECE 490: Introduction to ML 127


KNNs for Classification

How do KNNs choose which class the new data point belongs to?

K Nearest Neighbors: K is the number of nearest data points that surround the
new input.

We obtain the K nearest data points by measuring the distance between the new
data point and all the other data points. The labels of the K closest data points are
used to decide which class the new data point belongs to.

ECE 490: Introduction to ML 128


KNNs for Classification

K is a hyperparameter that we set


before the training process.

We use a majority voting mechanism


to decide the class of the new data
sample.

ECE 490: Introduction to ML 129


KNNs for Classification

What do KNNs learn? What are the trainable parameters? Are we just comparing
every data point to the data points in our dataset?

KNN is an instance-based learning algorithm, meaning that it does not


explicitly learn a function or a set of parameters during training. Instead, it
memorizes the training data and makes predictions by comparing new inputs to
stored instances.

Learning in KNN is essentially data storage and indexing, which enables


efficient nearest-neighbor searches.

ECE 490: Introduction to ML 130


KNNs for Classification

ECE 490: Introduction to ML 131


KNNs for Regression

ECE 490: Introduction to ML 132


KNNs for Regression
Instead of using a majority vote to classify a new data point, KNN for regression
averages the values of the K nearest neighbors to estimate the value of the new
data point.

New input data point


ECE 490: Introduction to ML 133
KNNs for Regression

How KNN Regression Works:

1. Distance Calculation: For a new data point, the algorithm calculates the
distance between this point and all points in the training dataset. Common
distance metrics include Euclidean, Manhattan, and Minkowski distances.
2. Identifying Neighbors: The algorithm identifies the K data points in the
training set that are closest to the new point based on the calculated
distances.
3. Prediction: The target value for the new data point is predicted by averaging
the target values of its K nearest neighbors.

ECE 490: Introduction to ML 134
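A minimal sketch of these three steps for KNN regression, using Euclidean distance; K, the data, and the function name are illustrative assumptions:

import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # 1. Distance calculation to every training point (Euclidean)
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Identify the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # 3. Predict by averaging their target values
    return np.mean(y_train[nearest])

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(X_train, y_train, np.array([2.5])))   # average of the 3 closest targets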


Choice of K (Classification and Regression)

When K is too small (e.g., K=1 or 2)

1. The prediction is based on very few points, meaning small fluctuations in the
data can have a big impact.
2. If there’s noise in the dataset, the model might rely too much on those noisy
points, leading to inconsistent or unstable predictions.

Think of it like asking just one or two people for advice—if they have extreme
opinions, your decision might not be well-balanced.

ECE 490: Introduction to ML 135


Choice of K (Classification and Regression)

When K is too large (e.g., K=50 or 100)

● The prediction is influenced by many points, including ones that are farther
away and might not be very similar.
● The model smooths out variations, which can make it less sensitive to specific
details in the data.
● This is like averaging opinions from a very large group—while you get a
general sense of the trend, you might lose important local nuances.

ECE 490: Introduction to ML 136


Support Vector Machine for Classification

ECE 490: Introduction to ML 137


SVMs for classification

Suppose we have data to be used for binary classification.


What would you say makes the best linear separator of the two classes?

ECE 490: Introduction to ML 138


SVMs for classification

The goal of SVM is to construct the most accurate linear separator, one that will
correctly classify any new input.

ECE 490: Introduction to ML 139


SVMs for classification

But what determines the best linear separator?


In support vector machine, the best linear separator is the line that maximizes the
distance between it and the closest data points from each class.

Max distance between


separator and class A

Max distance between


separator and class B
ECE 490: Introduction to ML 140
SVMs for classification

The closest data points to the separator are called the support vectors

Support vectors

ECE 490: Introduction to ML 141


SVMs for classification

This distance, called the margin, should not only be maximized; the separator
should also be equidistant from the support vectors of the two classes.

Margin

ECE 490: Introduction to ML 142


SVM inference

SVM classifies new data points by inserting the


feature values in the equation of the line. Then,
decides the class by checking the position of the
new data point with respect to the line.

1. If w1x1 + w2x2 + b > 0. Then, the point falls


above the line and it belongs to class A.
2. Otherwise, the point falls below the line and
belongs to class B.

ECE 490: Introduction to ML 143


Training SVM for Classification

Assuming we have two features x1 and x2, our classifier is a straight line. The
classifier equation is: w1·x1 + w2·x2 + b = 0

where w1 and w2 are the weights of x1 and x2 respectively and b is the bias term.
We define the weight vector w = (w1, w2).

ECE 490: Introduction to ML 144


Training SVM for Classification

As we know, the goal of SVM is to maximize the value of the margin. The margin is
defined by: margin = 2 / ||w||, where w·x + b = 0 is the equation of the linear
separator.

So, maximizing the margin means minimizing ||w||, since they are inversely
proportional. To simplify the quadratic optimization problem, instead of minimizing
||w|| directly, we minimize (½)·||w||², which makes the function differentiable and
easier to optimize.

ECE 490: Introduction to ML 145


Training SVM for Classification

We can solve this optimization problem using Lagrange multipliers.

RECAP: The Lagrange multiplier technique lets you find the maximum or
minimum of a multivariable function (f(x,y,..)) when there is some constraint on the
input values you are allowed to use.

To ensure that all data points are correctly classified with a margin of at least 1,
the constraints are written as: yᵢ·(w·xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ), with
yᵢ ∈ {−1, +1}.

ECE 490: Introduction to ML 146


Training SVM for Classification

Now, we have a primal optimization problem: minimize (½)·||w||² subject to
yᵢ·(w·xᵢ + b) ≥ 1 for all i.

Then, we introduce the Lagrange multipliers and solve for them.

ECE 490: Introduction to ML 147


Higher Dimension SVM for Classification

If we had three features, the classifier becomes a separating plane.

ECE 490: Introduction to ML 148


Types of Margins in SVMs

Data is not perfect. Some data points may lie within the margin. Thus, we have
two types of margins:

ECE 490: Introduction to ML 149


SVM Hard Margin

Hard Margin conditions:

● The data should be linearly separable


● We select two parallel hyperplanes separating the two classes of data
● The hyperplanes are chosen so that the distance between them is as
large as possible

ECE 490: Introduction to ML 150


SVM Soft Margin

Soft Margin conditions:

● The data is not linearly separable


● We allow for errors in classification
● Finding the maximal margin means maximizing the margin between the
data points and the hyperplane

ECE 490: Introduction to ML 151


SVM Error Function

In the SVM training process, we need to penalize two scenarios:
1. Wrongly classified data points
2. Data points that are correctly classified but fall within the margin

ECE 490: Introduction to ML 152


SVM Error Function

Thus, we use a new loss function called Hinge Loss which considers the two
penalties that we want.

ECE 490: Introduction to ML 153
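A minimal sketch of hinge loss as it is commonly defined, with labels encoded as ±1; this exact form is an assumption, since the slide's formula is an image:

import numpy as np

def hinge_loss(y, score):
    # y in {-1, +1}; score = w1*x1 + w2*x2 + b
    # zero when the point is correctly classified outside the margin,
    # positive when it is inside the margin or misclassified
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(+1,  2.0))   # 0.0 -> correct, outside the margin
print(hinge_loss(+1,  0.5))   # 0.5 -> correct, but inside the margin
print(hinge_loss(+1, -1.0))   # 2.0 -> misclassified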


SVM Error Function

ECE 490: Introduction to ML 154


SVM Error Function

ECE 490: Introduction to ML 155


SVM for Classification

What if our dataset is not linearly separable?

ECE 490: Introduction to ML 156


SVM for Classification

To solve this issue, we can increase the dimensionality of our dataset by adding a
new feature. This would allow us to use a linear separator. This solution is called
the “kernel trick”.

ECE 490: Introduction to ML 157


SVM for Classification
The kernel trick works by creating a new feature that is a function of the available
features.
Given feature X and feature Y, we add a new feature Z where Z = X² + Y².

ECE 490: Introduction to ML 158
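A tiny sketch of this feature lift; the example points are illustrative, and the added radius-squared column is what lets a linear separator split an inner cluster from an outer ring:

import numpy as np

def add_radius_feature(X):
    # X has columns [x, y]; append z = x^2 + y^2 as a third feature
    z = X[:, 0] ** 2 + X[:, 1] ** 2
    return np.column_stack([X, z])

X = np.array([[0.1, 0.2], [2.0, 1.5]])
print(add_radius_feature(X))   # the new column separates inner vs. outer points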


SVM for Classification
However, after training, we don’t want to keep calculating a new feature for every
input. Thus, we need to map the separator back to the lower-dimensional space.

Best Separating Hyperplane Equation: Z = constant = X² + Y², which is the
equation of a circle in 2D space.

ECE 490: Introduction to ML 159


SVM for Classification

ECE 490: Introduction to ML 160


SVM for classification

ECE 490: Introduction to ML 161


Support Vector Regressors

ECE 490: Introduction to ML 162


SVR

Support Vector Machines (SVMs) can be


adapted for regression problems in a method
known as Support Vector Regression
(SVR).
Rather than finding a hyperplane that
separates classes, SVR seeks a function that
approximates the relationship between input
features and a continuous target variable.

ECE 490: Introduction to ML 163


Training SVR

Instead of trying to fit every data point exactly, SVR introduces a margin of
tolerance ε. Errors smaller than ε are ignored.

The ε-insensitive loss function is defined as: L_ε(y, ŷ) = 0 if |y − ŷ| ≤ ε, and
|y − ŷ| − ε otherwise.

ECE 490: Introduction to ML 164
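A minimal sketch of the ε-insensitive loss described above; the piecewise form shown is the commonly used definition, assumed here because the slide's formula is an image:

import numpy as np

def eps_insensitive_loss(y, y_hat, eps=0.1):
    # zero inside the tolerance tube, linear outside it
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

print(eps_insensitive_loss(3.0, 3.05))   # 0.0 -> error smaller than eps, ignored
print(eps_insensitive_loss(3.0, 3.50))   # 0.4 -> penalized beyond the tube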


Decision Trees

ECE 490: Introduction to ML 165


Decision Trees

Consider that you have a 2-hour break between classes, and you’re looking for a
place to eat.

What steps would you take to eliminate potential restaurants and eventually
choose the best option?

ECE 490: Introduction to ML 166


Decision Trees
Your decision process might look like this:

Is it close by?
- No: Eliminated
- Yes: Does it include a student discount?
    - No: Eliminated
    - Yes: Is it crowded?
        - No: Eliminated
        - Yes: Chosen

ECE 490: Introduction to ML 167


Decision Trees

Decision trees work in a similar manner, splitting the data through a sequence of
feature-based questions.

ECE 490: Introduction to ML 168


Thank you

ECE 490: Introduction to ML 169
