C4: Supervised Machine Learning
The model is able to identify handwritten characters and classify each image as a
character. In this case, we are classifying handwritten digits.
The model is able to successfully detect which of the arriving emails go to spam
and which to the primary inbox.
Given that our dataset has one input feature and an output feature, let’s plot our
data.
We want to create a line that models the training data points with the lowest
possible error. We call the modeled relationship the “best fit line”.
[Figure: plot annotating a data point, its label, and its predicted value]
Since the model is trying to create a best fit line, it is optimizing the equation of a
line.
$\hat{y} = w x + b$, where $\hat{y}$ is the predicted value, $x$ is the input feature
value, and $w$ and $b$ are the model's trainable parameters.
ECE 490: Introduction to ML 25
“Learning” of machine learning algorithms
We call the variables that are updated during the training process “trainable
parameters” or “weights”. In some cases, we also have a “bias” term as a
trainable parameter.
In the case of linear regression, we have both a weight and a bias in our model.
How SML algorithms learn
This error, the difference between the real and predicted values, should guide our
update of the trainable parameters.
The goal is to update the trainable parameters so that the error decreases.
If we calculate the error for more than one data point at a time, this is done using a
loss function.
The loss function aggregates the individual errors across the selected data
points, typically by computing the average or sum of the errors using a specified
error function (e.g., MSE, MAE).
The Mean Bias Error (MBE) measures the average signed difference between predicted
and actual values. Its sign indicates the direction of the error (positive or
negative bias).
It is useful for understanding bias in predictions but not ideal as a standalone
metric: positive and negative errors cancel out, so it doesn't capture the
magnitude of errors.
In the linear regression model we initialized earlier, let’s try to get a prediction from
the first data point in the training set and calculate its MBE.
The Mean Squared Error (MSE) measures the average of the squared differences
between predicted and actual values. Squaring emphasizes larger errors, making it
sensitive to outliers.
Use this error when you want to penalize large errors more heavily or when
outliers are meaningful.
Both MSE and MBE can guide the update of trainable parameters during model
training, but they serve different purposes: MSE is more commonly used to
minimize overall prediction error, while MBE provides insight into whether the
model consistently overestimates or underestimates.
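The Mean Absolute Error (MAE) measures the average absolute difference between
predicted and actual values. It is useful when you want a simple, interpretable
metric that is less sensitive to outliers than MSE.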
The MAE provides a more robust measure of error by treating all deviations
equally, without disproportionately penalizing large errors, unlike MSE. This
characteristic makes MAE less sensitive to outliers and provides a more balanced
reflection of typical errors in the model.
The Root Mean Squared Error (RMSE) is the square root of MSE. It provides the
error in the same units as the target variable, making it more interpretable than
MSE.
It is useful when you want a metric on the same scale as the target variable, while
still penalizing large errors more heavily.
Suppose you're predicting house prices in dollars. RMSE provides an error value
(e.g., $5,000) that is also in dollars. Roughly speaking, this tells you that a
typical prediction is about $5,000 off from the actual value.
In this case, we got the same value for RMSE and MAE. This makes sense
because we are calculating the error for a single data point.
Both are expressed in the same units as the target variable, but RMSE can
sometimes overemphasize large errors, which might distort the perception of the
model's performance.
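The metrics discussed so far can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names, the sign convention for MBE (predicted minus actual), and the sample values are all assumptions made for the example.

```python
import math

def mbe(y_true, y_pred):
    # Mean Bias Error: average signed difference (assumed convention: predicted - actual).
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences.
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of absolute differences.
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root Mean Squared Error: square root of MSE, in the units of the target.
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical values, chosen only for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 10.0]

for name, fn in [("MBE", mbe), ("MSE", mse), ("MAE", mae), ("RMSE", rmse)]:
    print(name, fn(y_true, y_pred))

# With a single data point, RMSE and MAE coincide: sqrt(e^2) == |e|.
print(rmse([3.0], [2.0]), mae([3.0], [2.0]))
```

On more than one data point the two metrics generally differ, because squaring gives the larger errors more weight before the average is taken.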
The Huber loss combines the properties of MSE and MAE. It behaves like MSE for
small errors and switches to MAE for large errors, making it robust to outliers
while maintaining sensitivity to small errors.
Error Function:
In our case, since the absolute error is greater than delta, the Huber loss uses its
modified MAE (linear) branch.
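A minimal sketch of the Huber loss for a single error value, assuming the commonly used form: half the squared error when the absolute error is at most delta, and a shifted, scaled absolute error otherwise. The function name and the sample values are hypothetical.

```python
def huber_loss(error, delta=1.0):
    # error = predicted - actual; delta sets where the loss switches branches.
    abs_error = abs(error)
    if abs_error <= delta:
        # Small error: quadratic, MSE-like branch.
        return 0.5 * error ** 2
    # Large error: linear, MAE-like branch, shifted so both branches meet at delta.
    return delta * (abs_error - 0.5 * delta)

print(huber_loss(0.5, delta=1.0))  # quadratic branch
print(huber_loss(3.0, delta=1.0))  # linear branch
```

The `0.5 * delta` shift in the linear branch is what keeps the loss continuous at the switching point.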
This metric measures the error on a logarithmic scale, which reduces the impact of
large errors. It is useful when the target values vary over several orders of
magnitude, or when large errors are undesirable but should not dominate the metric.
Error Function:
Loss Function:
As you can see, MLE tends to be much smaller than other error metrics like MSE,
MAE, or RMSE. This is because it measures relative error on a logarithmic scale
rather than absolute differences.
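To see why a logarithmic error stays small, here is a hedged sketch. It assumes the metric is a mean squared logarithmic error computed with log(1 + value); the slides' exact formula is not shown here, so treat this form, the function names, and the sample values as assumptions.

```python
import math

def mean_squared_log_error(y_true, y_pred):
    # Assumed form: average of (log(1 + predicted) - log(1 + actual))^2.
    # Requires all values to be greater than -1.
    return sum(
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets spanning several orders of magnitude.
y_true = [100.0, 1000.0, 10000.0]
y_pred = [110.0, 900.0, 12000.0]

print("log error:", mean_squared_log_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
```

Each prediction here is off by 10 to 20 percent, so the log-based value is small, while MSE is dominated by the largest target.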
Now the algorithm is initialized, and we have made a prediction (or a set of
predictions) and calculated the error using the chosen loss function.
Intuitively, we want to update the trainable parameters in a way that will minimize
the error.
If we were to plot the cost (or loss) function with respect to the change in
value of one trainable parameter, we would get a 2D curve.
If we were to plot the error with respect to the change in value of two trainable
parameters, we would get a 3D surface.
To get the lowest error, we need to find the minimum of the loss (or cost) function.
At a minimum, the derivative of the function equals 0; for a convex loss, such as
MSE in linear regression, that point is the global minimum.
Thus, we want to move each parameter closer to where the loss is at its
minimum through incremental steps that use the derivative of the loss
function.
[Figure: loss curve, marking the point where the derivative of the function = 0]
The choice of the learning rate affects the size of the steps we are taking to get to
the minimum.
Let’s take MSE as our loss function and compute its derivative with respect to
the trainable parameters.
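As a worked sketch, assume the single-feature model $\hat{y}_i = w x_i + b$, $N$ training points, and a learning rate written as $\alpha$ (the symbols $N$, $J$, and $\alpha$ are notation chosen for this example). Applying the chain rule to the MSE cost gives:

```latex
\begin{align*}
J(w, b) &= \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2,
  \qquad \hat{y}_i = w x_i + b \\
\frac{\partial J}{\partial w} &= \frac{2}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right) x_i \\
\frac{\partial J}{\partial b} &= \frac{2}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right) \\
w &\leftarrow w - \alpha \frac{\partial J}{\partial w},
  \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}
\end{align*}
```

Each step moves the parameters against the sign of the derivative, and $\alpha$ scales how far they move.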
While we can model the relationship between one feature and the output using a
line, we can model the relationship between two input features and the output using
a plane.
Using the same logic, the relationship between three or more features and the
output can be modeled using a hyperplane.
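$\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$, where $\hat{y}$ is the output, $b$ is
the bias, $w_1, \dots, w_n$ are the weights, and $x_1, \dots, x_n$ are the input features.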
We apply the gradient descent update rule for each weight in the multi-linear
regression model in each iteration.
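Update rule: $w_j \leftarrow w_j - \alpha \, \frac{\partial J}{\partial w_j}$, where $J$ is the cost and $\alpha$ is the learning rate.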
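A minimal sketch of this per-weight update for multi-linear regression, assuming MSE as the cost. The function names, the learning rate, the iteration count, and the toy data (generated from y = 2*x1 + 3*x2 + 1) are hypothetical choices for illustration.

```python
def predict(weights, bias, x):
    # Multi-linear regression: bias plus the weighted sum of the input features.
    return bias + sum(w * xi for w, xi in zip(weights, x))

def gradient_step(weights, bias, X, y, lr):
    # One gradient descent iteration on the MSE cost.
    n = len(X)
    errors = [predict(weights, bias, xi) - yi for xi, yi in zip(X, y)]
    # Partial derivative of MSE with respect to each weight and the bias.
    grad_w = [
        2.0 / n * sum(e * xi[j] for e, xi in zip(errors, X))
        for j in range(len(weights))
    ]
    grad_b = 2.0 / n * sum(errors)
    # Apply the update rule to every trainable parameter.
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    new_bias = bias - lr * grad_b
    return new_weights, new_bias

# Hypothetical toy data following y = 2*x1 + 3*x2 + 1.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]

weights, bias = [0.0, 0.0], 0.0
for _ in range(5000):
    weights, bias = gradient_step(weights, bias, X, y, lr=0.05)

print(weights, bias)
```

All gradients are computed from the same set of errors before any parameter changes, so every weight is updated from the same state in each iteration.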
Linear regression is extended to classification by adding a mapping function that
maps the output of the linear regression part to a class.
The equation of logistic regression: We add a “Sigmoid” function after the linear
regression block.
[Figure: block diagram with a "Multi-Linear Regression" block and a block marked "What is this?"]
We know what the trainable parameters of the (multi) linear regression block are
(coefficients + bias).
What about the threshold that we use to decide if the output of the sigmoid block
maps to class A or class B?
The threshold is pre-set and not changed or “learned” during the training process.
This means that it is a fixed parameter and not a trainable parameter.
We call these types of parameters “hyper-parameters”.
We cannot measure the error in classification using the same metrics as in
regression.
Let’s say we have an example where we predict whether a person is male (1) or
female (0) based on their height.
The output of the logistic regression function will be the output of the sigmoid
function. This means that it will be a float between 0 and 1.
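A minimal sketch of the logistic regression forward pass: a linear block, then the sigmoid, then the fixed threshold. The weight and bias values below are made up for the height example (height in centimeters) and are not trained values.

```python
import math

def sigmoid(z):
    # Maps any real number to a float between 0 and 1 (numerically stable form).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    # Linear regression block followed by the sigmoid block.
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def predict_class(weights, bias, x, threshold=0.5):
    # The threshold is a fixed hyper-parameter, not a trainable parameter.
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

# Hypothetical parameters: the decision boundary sits at a height of 170 cm.
weights, bias = [0.2], -34.0

print(predict_proba(weights, bias, [180.0]), predict_class(weights, bias, [180.0]))
print(predict_proba(weights, bias, [160.0]), predict_class(weights, bias, [160.0]))
```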
The goal of training is to maximize the likelihood that the model assigns to the
correct labels. Instead of maximizing likelihood directly, we minimize the negative
log-likelihood, leading to:
Imagine we have two models predicting the probability of an image being a "cat".
As we mentioned before, a loss (or error) is for one data sample while a cost is for
multiple data samples.
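A hedged sketch of the negative log-likelihood for binary labels (binary cross-entropy), comparing two hypothetical models on one image whose true label is "cat" (1). The predicted probabilities 0.9 and 0.6 are invented for illustration, and the clipping constant `eps` is an implementation assumption to avoid log(0).

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Loss for one sample: negative log of the probability given to the true label.
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def cost(y_trues, p_preds):
    # Cost for multiple samples: average of the per-sample losses.
    losses = [binary_cross_entropy(t, p) for t, p in zip(y_trues, p_preds)]
    return sum(losses) / len(losses)

# Two hypothetical models scoring the same cat image (true label = 1).
loss_a = binary_cross_entropy(1, 0.9)  # confident and correct
loss_b = binary_cross_entropy(1, 0.6)  # correct but less confident

print(loss_a, loss_b)
```

Both models would be counted as correct at a 0.5 threshold, but the more confident one receives the lower loss.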
So far, with logistic regression, we saw a binary classification model that outputs a
value between 0 and 1, where:
● Anything below the threshold (0.5) belongs to one class and anything above
belongs to the other.
Multi-class classification
models output an array of
probabilities instead of one
probability (likelihood of
belonging to class A).
Predicted value: sample i belongs to class j
True value: sample i belongs to class j
We sum over all classes for each sample since one-hot encoding ensures only the
true class contributes to the loss.
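A minimal sketch of this multi-class (categorical) cross-entropy, assuming one-hot true labels and predicted probabilities that already sum to 1 for each sample. The function names and sample values are hypothetical.

```python
import math

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    # Sum over all classes j; only the true class (where y is 1) contributes.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_onehot, probs))

def categorical_cost(Y_onehot, P):
    # Average the per-sample losses over all samples i.
    losses = [categorical_cross_entropy(y, p) for y, p in zip(Y_onehot, P)]
    return sum(losses) / len(losses)

# Hypothetical predictions for two samples and three classes.
Y_onehot = [[1, 0, 0], [0, 0, 1]]
P = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

print(categorical_cross_entropy(Y_onehot[0], P[0]))
print(categorical_cost(Y_onehot, P))
```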
We perform the same steps as in the gradient descent update: we differentiate
the cost function and insert the derivative into the gradient update rule.
Calculating the probability of an event means finding how likely this event is to
happen.
The Naive Bayes algorithm is based on Bayes’ theorem for conditional
probability.
What guides our decision to choosing the class? The information in the features.
Unlike KNN, which stores all training examples, Naïve Bayes compresses the data
into a small set of probability values.
In order to calculate the likelihood of kids playing with respect to the weather
condition, we begin by computing the probability of each condition and event.
The assumption that the features are independent of each other may be false for
many use cases, where there is at least minor correlation between the features
that should be considered for accurate modeling.
Let’s verify this assumption with an example. If we have two features, F1 and F2,
Bayes’ rule becomes:
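Under the naive independence assumption, the posterior of a class C given F1 and F2 is proportional to P(C) · P(F1 | C) · P(F2 | C). A minimal sketch follows; the class names, feature values, and all probabilities are invented for illustration and are not taken from the course dataset.

```python
def naive_bayes_posterior(priors, likelihoods, features):
    # priors: {class: P(class)}
    # likelihoods: {class: [ {value: P(value | class)} for each feature ]}
    scores = {}
    for cls, prior in priors.items():
        score = prior
        # Independence assumption: multiply the per-feature likelihoods.
        for j, value in enumerate(features):
            score *= likelihoods[cls][j][value]
        scores[cls] = score
    # Normalize so the posteriors sum to 1.
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical probabilities: F1 = weather, F2 = wind.
priors = {"play": 0.6, "no_play": 0.4}
likelihoods = {
    "play": [{"sunny": 0.7, "rainy": 0.3}, {"weak": 0.8, "strong": 0.2}],
    "no_play": [{"sunny": 0.25, "rainy": 0.75}, {"weak": 0.4, "strong": 0.6}],
}

posterior = naive_bayes_posterior(priors, likelihoods, ["sunny", "weak"])
print(posterior)
```

Only these few probability values are needed at prediction time, which is what "compressing" the training data refers to.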
How do KNNs choose which class the new data point belongs to?
K Nearest Neighbors
What do KNNs learn? What are the trainable parameters? Are we just comparing
every new data point to the data points in our dataset?
1. Distance Calculation: For a new data point, the algorithm calculates the
distance between this point and all points in the training dataset. Common
distance metrics include Euclidean, Manhattan, and Minkowski distances.
2. Identifying Neighbors: The algorithm identifies the K data points in the
training set that are closest to the new point based on the calculated
distances.
3. Prediction: The target value for the new data point is predicted by averaging
the target values of its K nearest neighbors.
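The three steps above can be sketched as follows, using Euclidean distance and averaging for the prediction. The function name and the toy data are hypothetical.

```python
import math

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance calculation: Euclidean distance to every training point.
    distances = [(math.dist(x, x_new), y) for x, y in zip(X_train, y_train)]
    # 2. Identifying neighbors: keep the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    # 3. Prediction: average the target values of the k nearest neighbors.
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical one-feature training set.
X_train = [[1.0], [2.0], [3.0], [10.0]]
y_train = [1.0, 2.0, 3.0, 10.0]

print(knn_predict(X_train, y_train, [2.2], k=3))
print(knn_predict(X_train, y_train, [2.2], k=4))
```

With k=4 the distant point at 10 enters the average and pulls the prediction away from the local neighborhood. Nothing is trained here: the function only stores and compares data points.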
1. The prediction is based on very few points, meaning small fluctuations in the
data can have a big impact.
2. If there’s noise in the dataset, the model might rely too much on those noisy
points, leading to inconsistent or unstable predictions.
Think of it like asking just one or two people for advice—if they have extreme
opinions, your decision might not be well-balanced.
● The prediction is influenced by many points, including ones that are farther
away and might not be very similar.
● The model smooths out variations, which can make it less sensitive to specific
details in the data.
● This is like averaging opinions from a very large group—while you get a
general sense of the trend, you might lose important local nuances.
The goal of SVM is to find the linear separator that most accurately classifies
new inputs.
The closest data points to the separator are called the support vectors.
The distance between the separator and the support vectors, called the margin,
should be maximized, and the separator should lie equidistant from the support
vectors on either side.
Assuming we have two features x1 and x2, our classifier is a straight line. The
classifier equation is $w_1 x_1 + w_2 x_2 + b = 0$,
where w1 and w2 are the weights of x1 and x2 respectively and b is the bias term.
We define the weight vector w where w = (w1, w2).
As we know, the goal of SVM is to maximize the margin. The margin is
defined by:
Equation of the linear separator
So, maximizing the margin means minimizing ||w||, since they are inversely
proportional. To simplify the optimization problem, instead of minimizing ||w||
we minimize (½)(||w||²), which turns it into a quadratic problem that is
differentiable and easier to optimize.
RECAP: The Lagrange multiplier technique lets you find the maximum or
minimum of a multivariable function (f(x,y,..)) when there is some constraint on the
input values you are allowed to use.
To ensure that all data points are correctly classified with a margin of at least 1,
the constraints are written as:
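For reference, a sketch of the resulting constrained problem in its usual hard-margin form. It assumes the class labels are encoded as $y_i \in \{-1, +1\}$ and that $x_i$ is the feature vector of sample $i$; this notation is an assumption of the example.

```latex
\begin{align*}
\min_{w,\, b} \quad & \frac{1}{2} \lVert w \rVert^2 \\
\text{subject to} \quad & y_i \left( w \cdot x_i + b \right) \ge 1 \quad \text{for all } i
\end{align*}
```

With this label encoding, a single inequality covers both classes: points of class +1 need $w \cdot x_i + b \ge 1$ and points of class −1 need $w \cdot x_i + b \le -1$.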
Data is not perfect. Some data points may lie within the margin. Thus, we have
two types of margins:
Thus, we use a new loss function called Hinge Loss which considers the two
penalties that we want.
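A minimal sketch of the hinge loss for one sample, assuming labels in {-1, +1} and a score equal to w · x + b. The function name and the sample scores are hypothetical.

```python
def hinge_loss(y_true, score):
    # y_true is -1 or +1; score is the raw separator output w . x + b.
    # Zero loss when the point is on the correct side and outside the margin.
    return max(0.0, 1.0 - y_true * score)

print(hinge_loss(1, 2.0))   # correct side, outside the margin
print(hinge_loss(1, 0.5))   # correct side, but inside the margin
print(hinge_loss(1, -1.0))  # wrong side of the separator
```

The loss grows linearly once a point is inside the margin, and keeps growing when the point is misclassified.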
To solve this issue, we can increase the dimensionality of our dataset by adding a
new feature. This allows us to use a linear separator. This solution is
called the “kernel trick”.
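A small sketch of the idea with an explicit extra feature: one-dimensional points that no single threshold can separate become linearly separable once x² is added. The data, the hand-picked separator, and the function names are hypothetical; in practice, kernel functions obtain the effect of such a mapping implicitly, without building the new features.

```python
def add_squared_feature(x):
    # Map a 1D point x to the 2D point (x, x^2).
    return [x, x * x]

def linear_classifier(w, b, features):
    # Sign of the linear separator w . features + b.
    score = b + sum(wi * fi for wi, fi in zip(w, features))
    return 1 if score >= 0 else -1

# Hypothetical 1D data: the inner points are class -1, the outer points class +1.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

# Hand-picked separator in the new space: the line x^2 = 1.
w, b = [0.0, 1.0], -1.0
predictions = [linear_classifier(w, b, add_squared_feature(x)) for x in xs]

print(predictions == labels)
```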
Instead of trying to fit every data point exactly, SVR introduces a margin of
tolerance ε. Errors smaller than ε are ignored.
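A minimal sketch of the ε-insensitive loss this describes, for a single prediction. The function name, the default ε, and the sample values are assumptions for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    # Errors smaller than epsilon are ignored; larger ones are penalized linearly
    # by the amount they exceed epsilon.
    return max(0.0, abs(y_true - y_pred) - epsilon)

print(epsilon_insensitive_loss(5.0, 5.3, epsilon=0.5))  # inside the tolerance
print(epsilon_insensitive_loss(5.0, 6.0, epsilon=0.5))  # outside the tolerance
```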
Consider that you have a 2-hour break between classes, and you’re looking for a
place to eat.
What steps would you take to eliminate potential restaurants and eventually
choose the best option?
[Figure: decision tree with Yes/No branches leading to "Chosen" or "Eliminated"]