Lecture 09 ML
TODAY'S TOPICS
• Learning Curves
• Regularized Linear Models
• Ridge Regression
• Lasso Regression
• Elastic Net
#1 Polynomial Regression
• Polynomial Regression is a special case of Linear Regression in which the relationship
between the independent variables and the dependent variable is modeled as an
nth-degree polynomial.
A good way to reduce overfitting is to regularize the model (i.e., to constrain it): the
fewer degrees of freedom it has, the harder it will be for it to overfit the data. For
example, a simple way to regularize a polynomial model is to reduce the number of
polynomial degrees.
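• For instance, here is a minimal sketch of that idea (assuming scikit-learn; the dataset and the degree values are made up for illustration): the degree-2 model has far fewer degrees of freedom than the degree-10 one, and is therefore much less prone to overfitting.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy 1-D dataset: quadratic signal plus noise (illustrative only)
rng = np.random.RandomState(42)
X = 6 * rng.rand(100, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.randn(100)

# A high-degree polynomial has many degrees of freedom and can overfit;
# reducing the degree is a simple way to regularize the model.
for degree in (10, 2):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(degree, model.score(X, y))  # training-set R^2; the higher degree also fits the noise
```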
- Here the weights w1, w2, …, wn represent the importance of the features (x1, x2, …, xn).
A feature has high importance if it has a large weight associated with it.
- The weights are learned by minimizing a cost function; for linear regression the cost function is the
mean squared error (MSE). The weights are tweaked repeatedly, the MSE is recomputed each time, and the
set of weights with the minimum MSE is kept as the final model.
- To improve the model, or to reduce the effect of noise, we need to shrink the weights associated with
noisy features: the smaller the weight on a noisy feature, the less it contributes to the predicted output.
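• For reference, the MSE cost function minimized by plain linear regression over $m$ training instances can be written as:
$\mathrm{MSE}(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\theta^{\mathsf{T}}\mathbf{x}^{(i)} - y^{(i)}\right)^{2}$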
• Ridge Regression adds a regularization term to this cost function. This forces the learning algorithm
to not only fit the data but also keep the model weights as small as possible.
• The hyperparameter α controls how much you want to regularize the model. If α = 0 then
Ridge Regression is just Linear Regression. If α is very large, then all weights end up very
close to zero and the result is a flat line going through the data’s mean. To choose the best
hyperparameter value, we do hyperparameter tuning.
• Note: whenever we apply this technique, we first scale the data, as Ridge Regression is
sensitive to the scale of the input features. This is true for most regularized models.
#3.1 Ridge Regression
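• A minimal sketch of this workflow (assuming scikit-learn; the data and the alpha value are made up for illustration): scale the features first, then fit Ridge Regression, which adds an L2 penalty on the weights to the MSE.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Illustrative data: three features on wildly different scales
rng = np.random.RandomState(0)
X = rng.rand(100, 3) * [1, 100, 10_000]
y = X @ np.array([2.0, 0.03, 0.0002]) + rng.randn(100)

# Scale first (Ridge is sensitive to feature scale), then fit.
# alpha = 0 would reduce to plain Linear Regression; a very large alpha
# shrinks all weights toward zero. Tune alpha via hyperparameter tuning in practice.
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X, y)
print(ridge_model[-1].coef_)  # learned weights on the scaled features
```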
• Elastic Net regularized cost function: $J(\theta) = \mathrm{MSE}(\theta) + r\alpha \sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^2$, where the mix ratio $r$ controls the balance between the Lasso (L1) and Ridge (L2) penalty terms.
• So when should you use Linear Regression, Ridge, Lasso, or Elastic Net?
It is almost always preferable to have at least a little bit of regularization, so
generally you should avoid plain Linear Regression. Ridge is a good default, but if
you suspect that only a few features are actually useful, you should prefer Lasso or
Elastic Net since they tend to reduce the useless features’ weights down to zero as we
have discussed.
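• A small sketch of that behavior (scikit-learn assumed; the data, alpha, and l1_ratio values are made up for illustration): only two of the ten features carry signal, and Lasso / Elastic Net drive most of the useless weights to exactly zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)  # only features 0 and 1 are useful

for model in (Ridge(alpha=1.0),
              Lasso(alpha=0.1),
              ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))  # Lasso/ElasticNet: mostly zeros
```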
#4 Early Stopping
• A very different way to regularize iterative learning
algorithms such as Gradient Descent is to stop
training as soon as the validation error reaches a
minimum.
• As the epochs go by, the algorithm learns and its
prediction error (RMSE) on the training set naturally
goes down, and so does its prediction error on the
validation set.
• However, after a while the validation error stops decreasing and actually starts to go back
up. This indicates that the model has started to overfit the training data.
[Figure: high-degree Polynomial Regression model trained using Batch Gradient Descent]
• With early stopping you just stop training as soon as the validation error reaches the minimum.
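• A simplified sketch of early stopping (assuming scikit-learn; the dataset, polynomial degree, and learning rate are made up for illustration). Stochastic Gradient Descent is used here instead of Batch Gradient Descent, but the idea is the same: keep the model with the lowest validation RMSE seen so far.

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Noisy quadratic data, fitted with a deliberately high-degree polynomial model
rng = np.random.RandomState(42)
X = 6 * rng.rand(200, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)

preprocess = make_pipeline(PolynomialFeatures(degree=30, include_bias=False), StandardScaler())
X_train_p = preprocess.fit_transform(X_train)
X_val_p = preprocess.transform(X_val)

# warm_start=True: each call to fit() continues from the current weights,
# so one fit() call behaves like one extra epoch of training.
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       learning_rate="constant", eta0=0.0005, random_state=42)

best_val_rmse, best_model = float("inf"), None
for epoch in range(500):
    sgd_reg.fit(X_train_p, y_train)
    val_rmse = mean_squared_error(y_val, sgd_reg.predict(X_val_p)) ** 0.5
    if val_rmse < best_val_rmse:   # remember the best model seen on the validation set
        best_val_rmse, best_model = val_rmse, deepcopy(sgd_reg)

print(best_val_rmse)
```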
#5 Logistic Regression
• Logistic Regression (also called Logit Regression) is commonly used to estimate the
probability that an instance belongs to a particular class (e.g., what is the probability
that this email is spam?).
• If the estimated probability is greater than 50%, then the model predicts that the
instance belongs to that class (called the positive class, labeled “1”), or else it predicts
that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a
binary classifier.
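• A minimal sketch of this behavior (assuming scikit-learn; the tiny "spam" dataset below is made up for illustration): the model estimates a probability, and predicts the positive class when that probability is at least 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up features per email, e.g. counts of suspicious words and exclamation marks
X = np.array([[0, 1], [1, 0], [2, 3], [5, 4], [6, 7], [7, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = spam (positive class), 0 = not spam

clf = LogisticRegression().fit(X, y)

new_email = np.array([[4.0, 4.0]])
p_spam = clf.predict_proba(new_email)[0, 1]   # estimated probability of class 1
print(p_spam, clf.predict(new_email))         # predicts 1 when p_spam >= 0.5
```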
#5 Logistic Regression
• Imagine you have a magical box that can tell you if a fruit is an apple or an orange. You
have a bunch of fruits, and you want to know if each fruit is an apple (1) or an orange (0).
• Logistic regression is like a magical way to use the features of the fruits (like color, shape,
size) to make predictions. It’s as if the magical box draws a line on the features to separate
apples from oranges.
• The magic box uses a special formula called “logistic function” to calculate the probability of
a fruit being an apple (the chances of it being 1). If the probability is more than 0.5, the
box says it’s an apple; if it’s less than 0.5, the box says it’s an orange.
• For example, if the probability of a fruit being an apple is 0.8, the box is quite confident it’s
an apple. But if the probability is 0.2, the box thinks it’s more likely to be an orange.
• Logistic regression helps us classify things into two groups (like apples and oranges) based
on their features. It’s like a magical tool that uses probabilities to make smart decisions and
sort things out!
#5.1 Estimating Probabilities
• The Logistic Regression model computes a weighted sum of the input features (plus a bias
term), but instead of outputting the result directly like the Linear Regression model does, it
outputs the logistic of this result.
• The logistic is a sigmoid function that outputs a number between 0 and 1.
• Logistic function: $\sigma(t) = \dfrac{1}{1 + e^{-t}}$
#5.1 Estimating Probabilities
• Once the Logistic Regression model has estimated the probability $\hat{p}$ that an instance x
belongs to the positive class, it can make its prediction ŷ easily:
• $\hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \ge 0.5 \end{cases}$
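• Putting the last two slides together, a tiny NumPy sketch (the weights and the instance are made up for illustration):

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

theta = np.array([0.5, -1.2, 3.0])   # illustrative parameters (bias first)
x = np.array([1.0, 2.0, 0.5])        # instance with a leading 1 for the bias term

p_hat = sigmoid(theta @ x)           # estimated probability of the positive class
y_hat = int(p_hat >= 0.5)            # prediction: 1 if p_hat >= 0.5, else 0
print(p_hat, y_hat)
```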
#5.2 Training and Cost Function
• The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
• Cost function of a single training instance: $c(\theta) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1 - \hat{p}) & \text{if } y = 0 \end{cases}$
• The bad news is that there is no known closed-form equation to compute the value of 𝜃 that
minimizes this cost function (there is no equivalent of the Normal Equation).
But the good news is that this cost function is convex, so Gradient Descent (or any other
optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not
too large and you wait long enough).
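• A minimal Gradient Descent sketch on this cost function (NumPy only; the synthetic data, learning rate, and number of iterations are made up for illustration). Because the log loss is convex, these updates keep moving toward the global minimum.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Synthetic binary-classification data (bias column of 1s prepended)
rng = np.random.RandomState(0)
m = 200
X = np.c_[np.ones(m), rng.randn(m, 2)]
true_theta = np.array([-0.5, 2.0, -3.0])
y = (sigmoid(X @ true_theta) > rng.rand(m)).astype(float)

eta, theta = 0.1, np.zeros(3)          # learning rate and initial parameters
for _ in range(1000):
    p_hat = sigmoid(X @ theta)
    gradient = X.T @ (p_hat - y) / m   # gradient of the average log loss
    theta -= eta * gradient            # one Batch Gradient Descent step

print(theta)  # approaches the minimizer of the convex log loss
```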
#5.3 Decision Boundaries
#5.4 SoftMax Regression
• The idea is quite simple: when given an instance x, the SoftMax Regression model first
computes a score Sk(x) for each class k, then estimates the probability of each class by
applying the SoftMax function (also called the normalized exponential) to the scores.
• Once you have computed the score of every class for the instance x, you can estimate the
probability pk that the instance belongs to class k by running the scores through the
softmax function
• The Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not
multioutput) so it should be used only with mutually exclusive classes such as different
types of plants. You cannot use it to recognize multiple people in one picture.
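• A short sketch of the SoftMax step itself (NumPy only; the class scores are made up for illustration): exponentiate the scores, normalize them into probabilities, and predict the single highest-probability class.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential: turns raw class scores s_k(x) into probabilities."""
    exps = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # illustrative scores for 3 mutually exclusive classes
probas = softmax(scores)

print(probas)             # class probabilities, summing to 1
print(probas.argmax())    # SoftMax Regression predicts the single most likely class
```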
Thanks…