HAAI Linear Models Slides
Often we want to predict a particular property/value for an entity, given some other
properties/values
Examples:
- Predict a student's marks in the Class 12 Physics finals, given their marks in practice tests
- Predict the price of houses in a city, given their area, number of rooms, …
Such problems, where we try to predict a continuous value, are called regression problems
How can we learn to predict the prices of houses of other sizes in the city,
as a function of their area?
Dataset of house Area vs Price in a city
An example of a supervised learning problem!
When the target variable we are trying to predict is continuous: regression problem
Training Set → Learning Algorithm → h (the hypothesis function)
Learn a function h(x) so that h(x) is a good predictor for the corresponding value of y.
x (house area) → h → y' (predicted price)
Acknowledgement: Some of the images are taken from course materials of Professor Andrew Ng
Linear Regression
Hypothesis: hw(x) = w0 + w1·x
Parameters: w = [w0, w1]
Cost function (mean squared error): E(w) = (1/2m) Σ_i ( hw(x^(i)) − y^(i) )²
We want to find the values of w that minimize the cost function E(w).
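As an illustration, here is a minimal NumPy sketch of this cost (the function name mse_cost and the toy data are illustrative, not from the slides):

import numpy as np

def mse_cost(w, X, y):
    """Mean squared error cost E(w) = (1/2m) * sum of squared prediction errors.
    X carries a leading column of ones so that w[0] plays the role of w0."""
    errors = X @ w - y              # hw(x^(i)) - y^(i) for every example i
    return (errors ** 2).sum() / (2 * len(y))

# Toy data: a leading column of ones, then the house area (in some scaled unit)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(mse_cost(np.array([1.0, 1.0]), X, y))   # 0.0: w = [1, 1] fits this data exactly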
For simplicity, let us for now assume w0 = 0
Hypothesis: hw(x) = w1·x
Parameter: w1
We want to find the value of w1 that minimizes the cost function E(w1).
Now let us again let both w0 and w1 vary
Hypothesis: hw(x) = w0 + w1·x
Parameters: w = [w0, w1]
We want to find the values of w that minimize the cost function E(w).
[Figure: surface plot of the cost E(w) over the parameters w0 and w1]
Gradient Descent for minimizing a cost function
[Figure: a cost surface E(w) over w0 and w1 with multiple valleys]
In general, gradient descent can get stuck at a local minimum.
The MSE cost function in linear regression, however, is always a convex function:
● it always has a single minimum
● gradient descent always converges (for a suitable learning rate)
Gradient descent algorithm
Repeat until convergence {
    wj := wj − α · ∂E(w)/∂wj    (simultaneously update w0 and w1; α is the learning rate)
}
Take x0 = 1 (for easier notation), so hw(x) = w0·x0 + w1·x1. The updates then become:
    wj := wj − (α/m) · Σ_i ( hw(x^(i)) − y^(i) ) · xj^(i)
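A minimal sketch of this algorithm in NumPy, under the same conventions (x0 = 1 as a leading column of ones; the names gradient_descent, alpha, n_iters are illustrative):

import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for the MSE cost.
    X includes a leading column of ones (the x0 = 1 convention above)."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        errors = X @ w - y              # hw(x^(i)) - y^(i)
        gradient = X.T @ errors / m     # dE/dwj for every j at once
        w = w - alpha * gradient        # simultaneous update of all wj
    return w

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(gradient_descent(X, y))           # converges towards [1.0, 1.0]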
Practical aspects of applying gradient descent
Feature Scaling: make sure features are on a similar scale. If features lie in numerically very different ranges, gradient descent updates will be dominated by the numerically larger features.
E.g. x1 = area between 300 - 5000 sq.ft.
x2 = #bedrooms between 1 - 5
Normalization strategies (a sketch follows this list):
- Divide by the maximum value of the feature
- Min-max scaling: replace xi with (xi − min) / (max − min)
- Mean normalization: replace xi with (xi − μi) / si to make features have approximately zero mean (μi: mean of the feature; si: its standard deviation or range)
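A small sketch of two of these strategies, assuming NumPy (the function names are illustrative):

import numpy as np

areas = np.array([300.0, 1200.0, 5000.0])   # x1: area, 300 - 5000 sq.ft.
bedrooms = np.array([1.0, 3.0, 5.0])        # x2: #bedrooms, 1 - 5

def min_max_scale(x):
    """Map feature values into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def mean_normalize(x):
    """Make the feature approximately zero mean: (x - mean) / std."""
    return (x - x.mean()) / x.std()

print(min_max_scale(areas))      # [0.0, ~0.19, 1.0]
print(mean_normalize(bedrooms))  # roughly [-1.22, 0.0, 1.22]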
Practical aspects of applying gradient descent
Is gradient descent working properly?
- Plot how E(w) changes with every iteration of gradient descent (see the sketch after this list)
- For a sufficiently small learning rate, E(w) should decrease with every iteration
- If not (e.g., if E(w) fluctuates or increases), the learning rate needs to be reduced
- However, a learning rate that is too small means slow convergence!
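One way to implement this check, as a sketch (the history list stands in for the plot; names are illustrative):

import numpy as np

def gradient_descent_with_history(X, y, alpha=0.1, n_iters=100):
    """Batch gradient descent that also records E(w) at every iteration."""
    m, n = X.shape
    w = np.zeros(n)
    history = []
    for _ in range(n_iters):
        errors = X @ w - y
        history.append((errors ** 2).sum() / (2 * m))   # E(w) this iteration
        w = w - alpha * (X.T @ errors) / m
    return w, history

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
w, history = gradient_descent_with_history(X, y)
# history should decrease at every step; fluctuations suggest reducing alpha
assert all(a >= b for a, b in zip(history, history[1:]))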
Linear hypothesis: y' = w0 + w1·x
Polynomial hypothesis: y' = w0 + w1·x + w2·t, where t = x²
Can also be multivariate:
y' = w0 + (w11·x1 + w12·x1² + w13·x1³ + …) + (w21·x2 + w22·x2² + w23·x2³ + …) + …
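A sketch of how polynomial terms reduce to ordinary features, assuming NumPy (add_poly_features is an illustrative name):

import numpy as np

def add_poly_features(x, degree):
    """Build columns [1, x, x^2, ..., x^degree]; t = x^2 becomes just another feature."""
    return np.column_stack([x ** d for d in range(degree + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = add_poly_features(x, degree=2)   # columns: 1, x, x^2
print(X_poly.shape)                       # (4, 3)
# X_poly can be fed to the same gradient descent as before (after feature scaling!)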
Logistic Regression
Given an input x, try to predict / estimate the probability that y = 1 for this x
Want: 0 ≤ hw(x) ≤ 1, and hw(x) should be differentiable at all points
Choose: hw(x) = 𝝈(wᵀx), where 𝝈(z) = 1 / (1 + e^(−z)) is the sigmoid function
Suppose we predict y = 1 when hw(x) ≥ 0.5, and y = 0 otherwise.
Since 𝝈(z) ≥ 0.5 exactly when z ≥ 0:
➢ predict y = 1 when wᵀx ≥ 0
➢ predict y = 0 when wᵀx < 0
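A minimal sketch of this hypothesis and decision rule, assuming NumPy (sigmoid and predict are illustrative names):

import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X):
    """Predict y = 1 when hw(x) = sigmoid(w.x) >= 0.5, i.e. when w.x >= 0."""
    return (X @ w >= 0).astype(int)

w = np.array([-3.0, 1.0, 1.0])                     # boundary: -3 + x1 + x2 = 0
X = np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 2.0]])   # rows: [x0 = 1, x1, x2]
print(sigmoid(X @ w))   # estimated probabilities that y = 1
print(predict(w, X))    # [0, 1]: -3 + 1 + 1 < 0, while -3 + 2 + 2 >= 0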
Separating two classes of points
Linear decision boundary: predict y = 1 if −3 + x1 + x2 ≥ 0
Non-linear (circular) decision boundary: predict y = 1 if −1 + x1² + x2² ≥ 0
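The circular boundary can be checked with the same linear machinery over squared features; a quick sketch (the feature layout [1, x1², x2²] is an assumption for illustration):

import numpy as np

# The circular boundary -1 + x1^2 + x2^2 >= 0 is linear in the features [1, x1^2, x2^2]
w_circle = np.array([-1.0, 1.0, 1.0])
points = np.array([[0.5, 0.5], [1.0, 1.0]])        # (x1, x2) pairs
X_circle = np.column_stack([np.ones(len(points)),
                            points[:, 0] ** 2,
                            points[:, 1] ** 2])
print((X_circle @ w_circle >= 0).astype(int))      # [0, 1]: inside vs. outside the unit circle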
Logistic Regression Cost function
If y is 0:  CE cost(hw(x), y) = −log( 1 − hw(x) )
If y is 1:  CE cost(hw(x), y) = −log( hw(x) )
Now use the fact that y is always either 0 or 1 to write this as a single closed-form expression.
Logistic Regression Cost function
Remember: y is always either 0 or 1, so the two cases combine into one expression:
CE cost(hw(x), y) = −y · log( hw(x) ) − (1 − y) · log( 1 − hw(x) )
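A sketch of this closed-form cost in NumPy, averaged over the m training examples (the name ce_cost and the averaging are conventional, not from the slides):

import numpy as np

def ce_cost(w, X, y):
    """Average cross-entropy cost; per example, one of the two terms vanishes
    because y is always either 0 or 1."""
    h = 1.0 / (1.0 + np.exp(-(X @ w)))    # hw(x) = sigmoid(w.x)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 2.0], [1.0, -2.0]])   # rows: [x0 = 1, one feature]
y = np.array([1.0, 0.0])
print(ce_cost(np.array([0.0, 1.0]), X, y))   # ~0.127: both points classified well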
Given a new input x, to make a prediction, output the estimated probability that y = 1 for input x:
hw(x) = 𝝈(wᵀx), where 𝝈(z) = 1 / (1 + e^(−z))