Lecture 09 ML

This document provides an overview of several machine learning models and techniques:
- Polynomial regression, learning curves, and regularized linear models such as ridge, lasso, and elastic net regression are discussed for modeling relationships between variables and reducing overfitting.
- Early stopping is presented as a way to regularize models by stopping training when the validation error reaches a minimum.
- Logistic regression is introduced for estimating probabilities and binary classification, using a "logistic function" to calculate class probabilities.


TODAY'S CONTENT

• Polynomial Regression.
• Learning Curves.
• Regularized Linear Models
  • Ridge Regression.
  • Lasso Regression.
  • Elastic Net.
• Early Stopping.
• Logistic Regression
  • Estimating Probabilities.
  • Training and Cost Function.
  • Decision Boundaries.
  • SoftMax Regression.
#1 Polynomial Regression

• Polynomial Regression is a special case of Linear Regression in which the relationship between the independent variables and the dependent variable is modeled as an nth-degree polynomial.
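The slides give no code, but a minimal sketch in scikit-learn (the noisy quadratic data below is made up for illustration) expands the features with PolynomialFeatures and then fits a plain LinearRegression on the expanded features:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up noisy quadratic data for illustration
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

# Expand the features to degree 2, then fit a plain linear model on them
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)            # columns: [x, x^2]
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # should come out close to 2, [1, 0.5]
```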
#1 Polynomial Regression

• If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression.
• A high-degree Polynomial Regression model severely overfits the training data, while the linear model underfits it.
• So, how can you decide how complex your model should be? How can you tell whether your model is overfitting or underfitting the data?
#2 Learning Curves

• Learning curves are plots that display the performance of a machine learning model as the number of training samples increases. We use them to evaluate the model's ability to generalize to new, unseen data, and to identify issues such as overfitting and underfitting.
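One way to produce such curves (a sketch of my own, assuming scikit-learn and matplotlib) is to train the model on increasingly large subsets of the training set and record training and validation RMSE:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

def plot_learning_curves(model, X, y):
    """Plot training and validation RMSE versus training-set size."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train) + 1):
        model.fit(X_train[:m], y_train[:m])           # refit on the first m instances
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")
    plt.plot(np.sqrt(val_errors), "b-", label="validation")
    plt.xlabel("Training set size"); plt.ylabel("RMSE"); plt.legend()

# Example usage with the made-up data from the previous snippet:
# plot_learning_curves(LinearRegression(), X, y)
```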
#2 Learning Curves

• When there are just one or two instances in the training set, the model can fit them perfectly, which is why the curve starts at zero. But as new instances are added to the training set, it becomes impossible for the model to fit the training data perfectly, both because the data is noisy and because it is not linear at all.
(Figure: Learning curves for the Linear Model)
• These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close and fairly high.
#2 Learning Curves

• The error on the training data is much lower than with the Linear Regression model.
• There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model.
• However, if you used a much larger training set, the two curves would continue to get closer.
(Figure: Learning curves for the polynomial model)
# The Bias/Variance Tradeoff
A model has good accuracy when it makes good predictions on new or unseen data, for example data that is not included in the training set. Good accuracy also means that the values predicted by the model are very close to the actual values.

Bias is low and variance is high when the model performs well on the training data but poorly on the test data. High variance means the model cannot generalize to new or unseen data (this is the case of overfitting).

If the model performs poorly (i.e., it is less accurate and cannot generalize) on both the training data and the test data, it has high bias and low variance (this is the case of underfitting).

If the model performs well on both training and test data, meaning its predictions are close to the actual values for unseen data, accuracy is high; bias is low and variance is also low. The best model must have low bias (a low error rate on the training data) and low variance (it generalizes, with a low error rate on new or test data). This is the case of the best-fit model.
# THE BIAS/VARIANCE TRADEOFF
#3 Regularized Linear Model

A good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data. For example, a simple way to regularize a polynomial model is to reduce the polynomial degree.

For a linear model, regularization is typically achieved by constraining the weights of the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.
#3 Regularized Linear Model
Note:
In linear regression, the final output is a weighted sum of the feature variables, represented by the equation: y = w₁x₁ + w₂x₂ + w₃x₃ + … + wₙxₙ + w₀

- Here the weights w₁, w₂, …, wₙ represent the importance of the features (x₁, x₂, …, xₙ). A feature is of high importance if it has a large weight associated with it.

- Weights are calculated according to the cost function; for linear regression the cost function is the mean squared error (MSE). The weights are tweaked, the MSE is recomputed, and the set of weights with the minimum MSE is taken as the final output.

- To improve the model, or to reduce the effect of noise in the model, we need to reduce the weights associated with the noise: the smaller the weight associated with the noise, the less it contributes to the predicted output.

- Regularized Cost Function = MSE + Regularization term
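As a small illustrative sketch (not from the slides), a regularized cost of this form can be written directly in NumPy; the function name, the ridge-style L2 penalty, and the default α value are my own assumptions:

```python
import numpy as np

def regularized_cost(theta, X, y, alpha=1.0):
    """MSE plus an L2 (ridge-style) regularization term on the weights.

    theta[0] is treated as the bias term and is not regularized (illustrative choice).
    """
    predictions = X @ theta
    mse = np.mean((predictions - y) ** 2)
    l2_penalty = alpha * np.sum(theta[1:] ** 2)
    return mse + l2_penalty
```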


#3.1 Ridge Regression

• Ridge Regression is a regularized version of Linear Regression: a regularization term equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function.

• This forces the learning algorithm to not only fit the data but also keep the model
weights as small as possible.

• The hyperparameter α controls how much you want to regularize the model. If α = 0 then
Ridge Regression is just Linear Regression. If α is very large, then all weights end up very
close to zero and the result is a flat line going through the data’s mean. To choose the best
hyperparameter value, we do hyperparameter tuning.

• Note: whenever we apply this technique, we first scale the data, as Ridge Regression is sensitive to the scale of the input features. This is true of most regularized models.
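A minimal scikit-learn sketch of this (my own illustration, reusing the made-up X, y from the earlier snippets; the α value is arbitrary) scales the features before fitting Ridge:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the inputs first, since Ridge is sensitive to feature scale
ridge_reg = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_reg.fit(X, y)            # X, y: training data from the earlier snippets
print(ridge_reg.predict([[1.5]]))
```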
#3.1 Ridge Regression

• The figure shows several Ridge models trained on some linear data using different α values. On the left, plain Ridge models are used, leading to linear predictions.
• On the right, the data is first expanded using PolynomialFeatures(degree=10), and the Ridge models are applied to the resulting features: this is Polynomial Regression with Ridge regularization.
• Note how increasing α leads to flatter (i.e., less extreme, more reasonable) predictions; this reduces the model's variance but increases its bias.
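The right-hand setup could be reproduced roughly like this (a sketch under the assumption that scikit-learn is used; the α values are illustrative):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Polynomial Regression with Ridge regularization, one model per alpha
for alpha in (1e-5, 1.0, 100.0):
    model = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )
    model.fit(X, y)    # X, y: training data from the earlier snippets
```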
#3.2 Lasso Regression

• An important characteristic of Lasso Regression is that it tends to eliminate the least important features by shrinking their weights to zero; because of this, it is also used for feature selection.
• It adds a regularization term to the cost function equal to $\alpha \sum_{i=1}^{n} |\theta_i|$.
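As an illustrative sketch (not from the slides), Lasso in scikit-learn drives the weights of unimportant features to zero; the data and α below are made up:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Made-up data: only the first of three features actually matters
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_demo, y_demo)
print(lasso_reg.coef_)   # weights of the useless features are driven to (or near) zero
```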
#3.3 Elastic Net Regression
• Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso's regularization terms.

• Regularized Cost function = MSE + $r\alpha \sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2}\alpha \sum_{i=1}^{n} \theta_i^2$

• When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.

• So when should you use Linear Regression, Ridge, Lasso, or Elastic Net?
It is almost always preferable to have at least a little bit of regularization, so
generally you should avoid plain Linear Regression. Ridge is a good default, but if
you suspect that only a few features are actually useful, you should prefer Lasso or
Elastic Net since they tend to reduce the useless features’ weights down to zero as we
have discussed.
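A minimal scikit-learn sketch (my illustration; note that scikit-learn calls the mix ratio l1_ratio rather than r, and the values below are arbitrary):

```python
from sklearn.linear_model import ElasticNet

# l1_ratio plays the role of r: 0 -> pure Ridge-style penalty, 1 -> pure Lasso-style penalty
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)          # X, y: training data from the earlier snippets
print(elastic_net.coef_)
```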
#4 Early Stopping
• A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum.
• As the epochs go by, the algorithm learns and its prediction error (RMSE) on the training set naturally goes down, and so does its prediction error on the validation set.
• However, after a while the validation error stops decreasing and actually starts to go back up. This indicates that the model has started to overfit the training data.
• With early stopping you just stop training as soon as the validation error reaches the minimum.
(Figure: a high-degree Polynomial Regression model trained using Batch Gradient Descent)
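A sketch of basic early stopping with scikit-learn's SGDRegressor (my own illustration, assuming preprocessed training and validation sets X_train, y_train, X_val, y_val; in practice you would roll back to the best saved model):

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# warm_start=True keeps the learned weights between successive .fit() calls,
# so each call performs one more epoch of training
sgd_reg = SGDRegressor(max_iter=1, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, tol=None)

best_val_rmse = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)          # continues where it left off
    val_rmse = np.sqrt(mean_squared_error(y_val, sgd_reg.predict(X_val)))
    if val_rmse < best_val_rmse:
        best_val_rmse = val_rmse
        best_model = deepcopy(sgd_reg)     # remember the best model seen so far
```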
#5 Logistic Regression
• Logistic Regression (also called Logit Regression) is commonly used to estimate the
probability that an instance belongs to a particular class (e.g., what is the probability
that this email is spam?).

• If the estimated probability is greater than 50%, then the model predicts that the
instance belongs to that class (called the positive class, labeled “1”), or else it predicts
that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a
binary classifier.
#5 Logistic Regression
• Imagine you have a magical box that can tell you if a fruit is an apple or an orange. You
have a bunch of fruits, and you want to know if each fruit is an apple (1) or an orange (0).

• Logistic regression is like a magical way to use the features of the fruits (like color, shape,
size) to make predictions. It’s as if the magical box draws a line on the features to separate
apples from oranges.

• The magic box uses a special formula called “logistic function” to calculate the probability of
a fruit being an apple (the chances of it being 1). If the probability is more than 0.5, the
box says it’s an apple; if it’s less than 0.5, the box says it’s an orange.

• For example, if the probability of a fruit being an apple is 0.8, the box is quite confident it’s
an apple. But if the probability is 0.2, the box thinks it’s more likely to be an orange.

• Logistic regression helps us classify things into two groups (like apples and oranges) based
on their features. It’s like a magical tool that uses probabilities to make smart decisions and
sort things out!
#5.1 Estimating Probabilities
• The Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.
• The logistic is a sigmoid function that outputs a number between 0 and 1: $\sigma(t) = \frac{1}{1 + e^{-t}}$ (the Logistic Function).
#5.1 Estimating Probabilities
• Once the Logistic Regression model has estimated the probability p that an instance x belongs to the positive class, it can make its prediction ŷ easily:

$$\hat{y} = \begin{cases} 0 & \text{if } p < 0.5 \\ 1 & \text{if } p \ge 0.5 \end{cases}$$
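A brief scikit-learn sketch of both steps (my illustration; X_train, y_train, and X_new are assumed placeholders): predict_proba returns the estimated class probabilities, and predict applies the 0.5 threshold.

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)                 # a binary-labeled training set

probabilities = log_reg.predict_proba(X_new)  # columns: P(class 0), P(class 1)
predictions = log_reg.predict(X_new)          # 1 where P(class 1) >= 0.5, else 0
```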
#5.2 Training and Cost Function
• The objective of training is to set the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).

• Cost function of a single training instance:

$$C(\theta) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$$

• The bad news is that there is no known closed-form equation to compute the value of 𝜃 that
minimizes this cost function (there is no equivalent of the Normal Equation).
But the good news is that this cost function is convex, so Gradient Descent (or any other
optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not
too large and you wait long enough).
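Since there is no closed-form solution, the cost is minimized iteratively; below is a plain NumPy sketch of batch gradient descent on this log loss (my own illustration; the learning rate and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_logistic_regression(X, y, eta=0.1, n_iterations=1000):
    """Batch gradient descent on the average log loss.

    X is assumed to already include a bias column of ones; y contains 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iterations):
        p_hat = sigmoid(X @ theta)           # estimated probabilities
        gradient = X.T @ (p_hat - y) / m     # gradient of the average log loss
        theta -= eta * gradient
    return theta
```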
#5.3 Decision Boundaries

• The fundamental application of logistic regression is to determine a decision boundary for a binary classification problem.

• It is called a decision boundary since we will observe instances of a different class on each side of the boundary. Our intention in logistic regression is to find a proper fit for the decision boundary so that we will be able to predict which class a new feature set corresponds to.
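As an illustrative sketch (not from the slides), with a single feature the decision boundary of a fitted LogisticRegression model is simply the point where the estimated probability crosses 0.5; the iris petal-width setup below is an assumption for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Binary problem: is the flower an Iris virginica, based on petal width?
iris = load_iris()
X_iris = iris.data[:, 3:]                     # petal width (cm)
y_iris = (iris.target == 2).astype(int)       # 1 if Iris virginica, else 0

log_reg = LogisticRegression().fit(X_iris, y_iris)

# The decision boundary is where the estimated probability crosses 0.5
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
proba = log_reg.predict_proba(X_new)[:, 1]
boundary = X_new[proba >= 0.5][0, 0]
print(f"Decision boundary at petal width of about {boundary:.2f} cm")
```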
#5.3 Decision Boundaries
#5.4 SoftMax Regression
• The Logistic Regression model can be generalized to support multiple classes directly,
without having to train and combine multiple binary classifiers. This is called SoftMax
Regression, or Multinomial Logistic Regression.

• The idea is quite simple: when given an instance x, the SoftMax Regression model first
computes a score Sk(x) for each class k, then estimates the probability of each class by
applying the SoftMax function (also called the normalized exponential) to the scores.

• Once you have computed the score of every class for the instance x, you can estimate the
probability pk that the instance belongs to class k by running the scores through the
softmax function

• The Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not
multioutput) so it should be used only with mutually exclusive classes such as different
types of plants. You cannot use it to recognize multiple people in one picture.
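A short NumPy sketch of the softmax (normalized exponential) itself, as my own illustration of how class scores become class probabilities; the scores are made up:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores s_k(x) into class probabilities that sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])    # made-up scores for three classes
print(softmax(scores))                 # roughly [0.659 0.242 0.099] -> predict class 0
```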
#5.4 SoftMax Regression
Thanks…
