
Linear Regression With One Variable


2. Linear Regression with One Variable (Univariate Linear Regression)

MODEL REPRESENTATION:

➢ Other notation:
X = space of input values
Y = space of output values
Training set = list of m training examples (x(i), y(i)); i = 1, 2, …, m

Hypothesis Function: the function produced by feeding the training
data (inputs and outputs, since this is supervised learning) to the
learning algorithm; it can then be used to predict the o/p for new
input data.
➢ For a supervised problem: h : X → Y such that h(x) is a “good”
predictor for the corresponding value of y.
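As a minimal sketch (the parameter values here are hypothetical, just for illustration), the univariate hypothesis h(x) = Θ0 + Θ1·x can be written as:

```python
# Hypothesis for univariate linear regression: h(x) = theta0 + theta1 * x.
def h(x, theta0, theta1):
    """Predict the output y for input x using the current parameters."""
    return theta0 + theta1 * x

# Example: with theta0 = 1 and theta1 = 2, the prediction for x = 3 is 7.
print(h(3, 1, 2))  # 7
```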

COST FUNCTION: measures the accuracy of our hypothesis function. We
choose the parameters of h(x) so that h(x) is close to y for the data
in the training set.
The cost function is the target of a minimization:
J(Θ0, Θ1) = (1/2m) · Σ from i=1 to m of (h(x(i)) − y(i))^2
i.e., we try to minimize (1/2m) times the sum of squared differences
b/w the predicted values and the actual values given in the dataset.
➢ “J” is the cost function, also called the squared error cost
function or mean squared error.
➢ The (1/2m) factor averages the squared differences.
➢ Minimizing means finding the values of the Θ parameters for which
the cost function is smallest.

Cost ➔ an average of the differences between all the results of the
hypothesis on inputs from the x's and the actual outputs y's.
➢ The mean is halved as a convenience for the computation of the
gradient descent: differentiating the square function produces a
factor of 2 that cancels out the 1/2 term.

COST FUNCTION INTUITION: The training set data is scattered on the
x–y plane, and we try to draw a straight line through it. Goal → find
the best-fitting line.
Ideally, the line would pass through every point of our training data
set. In such a case, the value of “J” will be 0.
➢ For simplicity, let h(x) = Θ1·x (i.e., Θ0 = 0):
For Θ1 = 1 ➔ h(x) = x ➔ J(Θ1) = 0
For Θ1 = 0.5 ➔ h(x) = 0.5x ➔ J(Θ1) ≈ 0.58
For Θ1 = 0 ➔ h(x) = 0 ➔ J(Θ1) ≈ 2.33
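These values can be checked numerically. A minimal sketch, assuming the three-point training set (1,1), (2,2), (3,3) (not stated explicitly above, but it reproduces the quoted J values):

```python
# Squared-error cost J(theta1) for h(x) = theta1 * x on an assumed
# three-point training set (1,1), (2,2), (3,3).
xs = [1, 2, 3]
ys = [1, 2, 3]

def J(theta1):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1.0))            # 0.0 -> the line fits every point exactly
print(round(J(0.5), 2))  # 0.58
print(round(J(0.0), 2))  # 2.33
```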

➢ J(Θ1) is the average of the squared differences b/w h(x) and y:
➢ h(x) = predicted value at the given training data
➢ y = actual value of the o/p in the training data

➢ Vertical lines (in the plot) represent this difference.

➢ For different values of Θ1, we try to minimize J(Θ1) {the error},
whose minimum here occurs at Θ1 = 1. Therefore, we choose Θ1 = 1 as
our best-fitting curve: h(x) = x.

≫ For a more complex h(x) fxn like h(x) = Θ0 + Θ1x, we have to plot
J(Θ0, Θ1) in 3D, as J can be different for different combinations of
Θ0 and Θ1.
These surfaces are more easily represented using contour figures: a
contour plot is a graph that contains many contour lines, and a
contour line of a two-variable function has a constant value at all
points on the line.

Graph b/w Θ0 and Θ1: each ellipse is the set of combinations of Θ0
and Θ1 for which the value of J is the same. Points not on the drawn
ellipses are also valid; each corresponds to its own unique value of J.
➢ The best combination of Θ0 and Θ1 (the one which minimizes
J(Θ0, Θ1)) lies around the center of the innermost ellipse.

GRADIENT DESCENT: an algorithm to minimize J(Θ0, Θ1)

➢ Start with some Θ0, Θ1
➢ Keep changing Θ0, Θ1 to reduce J until a minimum is reached
Repeat until convergence: Θj := Θj − α · (∂/∂Θj) J(Θ0, Θ1), for j = 0, 1

Here, we start at some value of (Θ0, Θ1) and keep going down the
J(Θ0, Θ1) surface until we reach a local minimum. We will know that
we have succeeded when our cost function is at the very bottom of one
of the pits in our graph.
➢ If we start at a different value of (Θ0, Θ1), we may end up at a
different minimum.
Note that we are not graphing x and y themselves, but the parameter
range of our hypothesis function and the cost resulting from
selecting a particular set of parameters.

The slope of the tangent is the derivative at that point, and it
gives us a direction to move in. We make steps down the cost function
in the direction of steepest descent. The size of each step is
determined by the parameter α, called the learning rate.
SIMULTANEOUS UPDATE: first we calculate the new values for both Θ0
and Θ1; only then do we update them. So the order of execution of the
statements is:
temp0 := Θ0 − α · (∂/∂Θ0) J(Θ0, Θ1)
temp1 := Θ1 − α · (∂/∂Θ1) J(Θ0, Θ1)
Θ0 := temp0
Θ1 := temp1

Here, if we updated Θ0 before actually calculating the new Θ1, the Θ0
used in the equation for Θ1 would be the new Θ0, not the one we
wanted to minimize for.
At each iteration j, one should simultaneously update all the
parameters Θ0, Θ1, …, Θn. Updating a specific parameter prior to
calculating another one on the jth iteration would yield a wrong
implementation.
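A sketch of the two update orders, using toy gradient functions (g0 and g1 below are hypothetical, chosen only so that the coupling between the parameters is visible):

```python
# Correct simultaneous update: both gradients are evaluated at the
# OLD (theta0, theta1) before either parameter is assigned.
def simultaneous_update(theta0, theta1, alpha, grad0, grad1):
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1

# Buggy sequential update: theta1's gradient sees the NEW theta0.
def wrong_update(theta0, theta1, alpha, grad0, grad1):
    theta0 = theta0 - alpha * grad0(theta0, theta1)
    theta1 = theta1 - alpha * grad1(theta0, theta1)
    return theta0, theta1

# Toy coupled gradients (hypothetical, not derived from any real J):
g0 = lambda t0, t1: t0 + t1
g1 = lambda t0, t1: t0 - t1

print(simultaneous_update(1.0, 2.0, 0.1, g0, g1))
print(wrong_update(1.0, 2.0, 0.1, g0, g1))
```

With these toy gradients the two versions disagree in the second component (2.1 vs. 2.13): exactly the wrong-implementation trap described above.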

GRADIENT DESCENT INTUITION: for simplicity, we use only one
parameter:
h(x) = Θ1·x
→ α is positive
For a value of Θ1, if the slope of J(Θ1) is positive: Θ1 decreases
For a negative slope of J(Θ1): Θ1 increases
Θ1 eventually converges to its minimum.

If the value of α is too small: gradient descent takes baby steps
towards the min and converges slowly.
If α is too large: gradient descent takes huge steps; in that case
gradient descent may even overshoot the min whenever the diff b/w
the current Θ and Θmin is less than the jump in Θ (α × derivative of
J), and it may then keep moving further and further away from the
min (diverge).
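A minimal sketch of all three regimes on the convex toy cost J(Θ) = Θ², whose derivative is 2Θ and whose minimum sits at Θ = 0 (this cost and these α values are illustrative, not from the text):

```python
# One-parameter gradient descent on J(theta) = theta**2 (dJ/dtheta = 2*theta).
def descend(theta, alpha, steps):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # theta_new = (1 - 2*alpha) * theta
    return theta

print(descend(1.0, 0.01, 10))  # too small: still far from 0 after 10 steps
print(descend(1.0, 0.4, 10))   # reasonable: very close to 0
print(descend(1.0, 1.1, 10))   # too large: |theta| grows each step (diverges)
```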

➢ Therefore, we should adjust the learning rate to ensure that the
gradient descent algorithm converges in a reasonable time.
➢ If Θ is already at its local minimum, the slope will be 0, and
thus Θ won't change.

Even if the learning rate α is fixed, the slope gets smaller as we
approach the minimum, so the steps automatically become smaller.
For a linear regression model, the derivatives of J are:
For Θ0 – derivative w.r.t. Θ0: (∂/∂Θ0) J = (1/m) · Σ from i=1 to m of (h(x(i)) − y(i))
For Θ1 – derivative w.r.t. Θ1: (∂/∂Θ1) J = (1/m) · Σ from i=1 to m of (h(x(i)) − y(i)) · x(i)
➢ For a linear regression model, the cost curve is always convex
(bowl-shaped).
It therefore has only one optimum → the global minimum (reached,
assuming the learning rate α is not too large).
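Putting the update rule and these derivatives together, a minimal batch gradient descent sketch (the dataset here is made up, generated from y = 1 + 2x so the correct optimum is known in advance):

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x.
# Each step uses ALL m training examples, per the derivative formulas above.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # y = 1 + 2x, so the minimum is at (1, 2)

theta0, theta1 = 0.0, 0.0
alpha = 0.05
m = len(xs)

for _ in range(5000):
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                             # dJ/dtheta0
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
    # Simultaneous update: both gradients came from the old thetas.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(round(theta0, 3), round(theta1, 3))  # 1.0 2.0
```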

➢ On the contour plot, we start with any value of Θ0 and Θ1 and then
minimize J.
➢ We approach the min as we move towards the center.
Starting at an arbitrary value of Θ0 and Θ1, we minimize J(Θ0, Θ1)
with our gradient descent algo.

J is a complicated quadratic function, and the ellipses shown above
are its contours.
Batch Gradient Descent → each step of gradient descent uses all the
training examples.
The point of all this is that if we start with a guess for our
hypothesis and then repeatedly apply these gradient descent update
equations, our hypothesis will become more and more accurate.
