Linear Regression with One Variable


Linear Regression with One Variable

Lecture 02 - 03

Mirza Mohammad Lutfe Elahi Silvia Ahmed


CSE 445 Machine Learning ECE@NSU
Linear Regression
• Housing price data example
– Supervised learning regression problem

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

• What do we start with?


– Training set (this is your data set)

CSE 445 Machine Learning ECE@NSU


Training Set of Housing prices
Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

• Notation (used throughout the course)
– m = number of training examples
– x's = input variables / features
– y's = output / "target" variable
• (x, y) – a single training example
• (x^(i), y^(i)) – a specific example (the i-th training example)
• i is an index into the training set
CSE 445 Machine Learning ECE@NSU
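As a minimal sketch of this notation (assuming Python with NumPy, which is not part of the slides), the training set can be stored as two arrays:

```python
import numpy as np

# Training set from the table above: size in square feet (x), price in 1000s of $ (y)
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

m = len(x)          # m = number of training examples -> 4
print(x[0], y[0])   # (x^(1), y^(1)) = the 1st training example: 2104.0 460.0
```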
Training Set of Housing prices
[Scatter plot "Housing Price": price (in 1000s of $) vs. size (feet²)]
CSE 445 Machine Learning ECE@NSU


Model Representation

• With our training set defined - how do we use it?


– Take training set
– Pass into a learning algorithm
– Algorithm outputs a function, denoted h (h for hypothesis)
• This function takes an input (e.g. size of new house)
• Tries to output the estimated value of y
CSE 445 Machine Learning ECE@NSU
How do we Represent h?
• Hypothesis function h –
– hθ(x) = θ0 + θ1x    (shorthand: h(x))

[Scatter plot "Housing Price" with the fitted straight line h(x) = θ0 + θ1x: price (in 1000s of $) vs. size (feet²)]
CSE 445 Machine Learning ECE@NSU


Model Representation
• What does this mean?
– It means y is a linear function of x.
– θi are the parameters
• θ0 is the zero condition (intercept)
• θ1 is the gradient (slope)
• This kind of function is linear regression with one variable
– Also called Univariate Linear Regression
• So in summary
– A hypothesis takes in some variable x
– Uses parameters determined by a learning algorithm
– Outputs a prediction based on that input

CSE 445 Machine Learning ECE@NSU


Hypothesis Function
hθ(x) = θ0 + θ1x
• This is the equation of a straight line.
• We set values for θ0 and θ1 to get our estimated output hθ(x).
• hθ is a function that tries to map the input data (the x's) to the output data (the y's).
• Example –

Input x    Output y
0          4
1          7
2          7
3          8

• Let θ0 = 2 and θ1 = 2, then hθ(x) = 2 + 2x
• For input 1, the hypothesis gives hθ(1) = 4 (off by 3, since the actual output is 7).
• We need to try out various values for θ0 and θ1 to find the best possible fit –
• i.e. the most representative "straight line" through the data points mapped on the x-y plane.
CSE 445 Machine Learning ECE@NSU
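A small sketch of evaluating this hypothesis on the example table (Python; `predict` is our own name, not from the slides):

```python
def predict(x, theta0, theta1):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

xs = [0, 1, 2, 3]          # example inputs from the slide
ys = [4, 7, 7, 8]          # example outputs from the slide

theta0, theta1 = 2, 2      # the guessed parameter values
for x, y in zip(xs, ys):
    h = predict(x, theta0, theta1)
    print(x, h, y)         # for x = 1: h = 4 but y = 7, so we are off by 3
```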
Cost Function
Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

• Hypothesis: hθ(x) = θ0 + θ1x


• θi’s: parameters

How to choose θi’s?

CSE 445 Machine Learning ECE@NSU


Cost Function
• A cost function lets us figure out how to fit the best
straight line to our data.
• We can measure the accuracy of our hypothesis
function by using a cost function.
• Choosing values for θi (parameters)
– Different values give different functions.

CSE 445 Machine Learning ECE@NSU


Cost Function
hθ(x) = θ0 + θ1x

[Three example lines:]
h(x) = 1.5 + 0·x     (θ0 = 1.5, θ1 = 0)
h(x) = 0.5x          (θ0 = 0,   θ1 = 0.5)
h(x) = 1 + 0.5x      (θ0 = 1,   θ1 = 0.5)
CSE 445 Machine Learning ECE@NSU


Linear Regression – Cost Function

• Based on our training set we want to generate parameters which make the straight line fit the data
– Choose these parameters so that hθ(x) is close to y for our training examples
• Basically, use the x's in the training set with hθ(x) to give an output which is as close to the actual y value as possible
• Think of hθ(x) as a "y imitator" – it tries to convert the x into y, and since we already have y we can evaluate how well hθ(x) does this.

CSE 445 Machine Learning ECE@NSU


Cost Function
• To formalize this:
– We want to solve a minimization problem
– Minimize the squared difference between hθ(x) and y for each/any/every example –
• minimize (hθ(x) - y)²
– Sum this over the training set

CSE 445 Machine Learning ECE@NSU


Cost Function
• Minimize the squared difference between the predicted house price and the actual house price
– scaled by 1/2m
• 1/m – means we determine the average
• 1/2m – the 2 makes the math a bit easier, and doesn't change the parameters we determine at all (i.e. half the smallest value is still the smallest value!)
– Minimizing this means we get the values of θ0 and θ1 which, on average, give the minimal deviation of hθ(x) from y when we use those parameters in our hypothesis function
• More cleanly, this is the cost function

J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

• And we want to minimize this cost function
– min over θ0, θ1 of J(θ0, θ1)
– Our cost function (because of the summation term) inherently looks at ALL the data in the training set at any time

CSE 445 Machine Learning ECE@NSU
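A minimal sketch of this cost function in code (Python/NumPy; `compute_cost` is our own name):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(x)
    h = theta0 + theta1 * x               # predictions for all m training examples
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(compute_cost(x, y, 0.0, 0.2))       # cost for one candidate choice of parameters
```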


Cost Function
• Let's consider some intuition about the cost function and why we want to use it
– The cost function is used to determine the parameters
– The values of the parameters determine how the hypothesis behaves: different values generate different hypotheses
• Simplified hypothesis
– Assume θ0 = 0, so that hθ(x) = θ1x

[Plot: with θ0 = 0 the hypothesis is a straight line through the origin]

CSE 445 Machine Learning ECE@NSU


Cost Function
• The cost function and goal here are very similar to when we have θ0, but with a simpler parameter
– The simplified hypothesis makes visualizing the cost function J(θ1) a bit easier
• So the hypothesis passes through (0, 0)
• Two key functions we want to understand
– hθ(x)
• The hypothesis is a function of x – a function of what the size of the house is
– J(θ1)
• Is a function of the parameter θ1

CSE 445 Machine Learning ECE@NSU


Cost Function
hθ(x) = θ1x

hθ(x) (for fixed θ1, this is a function of x)     J(θ1) (function of the parameter θ1)

[Left plot: for θ1 = 1 the line hθ(x) = x passes through every training point.
Right plot: the corresponding point on the J(θ1) curve – at θ1 = 1, J(θ1) = 0.]
CSE 445 Machine Learning ECE@NSU


Cost Function
hθ(x) = θ1x

hθ(x) (for fixed θ1, this is a function of x)     J(θ1) (function of the parameter θ1)

[Left plot: for θ1 = 0.5 the line hθ(x) = 0.5x falls below the training points.
Right plot: at θ1 = 0.5, J(θ1) ≈ 0.58.]
CSE 445 Machine Learning ECE@NSU
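These numbers can be reproduced with a short sketch (Python/NumPy), assuming the three training points (1, 1), (2, 2), (3, 3) suggested by the plots:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # assumed training inputs
y = np.array([1.0, 2.0, 3.0])   # assumed training outputs
m = len(x)

def J(theta1):
    """Cost for the simplified hypothesis h(x) = theta1 * x."""
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

print(J(1.0))   # 0.0   - the line fits every point exactly
print(J(0.5))   # ~0.58 - the value quoted on the slide
print(J(0.0))   # ~2.33
```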


Cost Function
hθ(x) = θ1x

hθ(x) (for fixed θ1, this is a function of x)     J(θ1) (function of the parameter θ1)

[Left plot: the lines for θ1 = 1, θ1 = 0.5 and θ1 = 0 over the training data.
Right plot: J(θ1) traced out as θ1 varies – a bowl-shaped curve whose minimum is at θ1 = 1.]
CSE 445 Machine Learning ECE@NSU


Deeper Insight - Cost Function
• Using our original hypothesis with both parameters θ0 and θ1,
– the cost function is
• J(θ0, θ1)
• Example,
– Say
• θ0 = 50
• θ1 = 0.06
– Previously we plotted our cost function by plotting
• θ1 vs J(θ1)

CSE 445 Machine Learning ECE@NSU


Deeper Insight - Cost Function
– Now we have two parameters
• Plot becomes a bit more complicated
• Generates a 3D surface plot where the axes are
– X = θ1
– Z = θ0
– Y = J(θ0, θ1)
• the height (y) indicates the value of the cost function, so find where y is at a
minimum

CSE 445 Machine Learning ECE@NSU


Deeper Insight - Cost Function
• Instead of a surface plot we can use contour figures/plots
– A set of ellipses in different colors
– Each ellipse (color) joins points with the same value of J(θ0, θ1); they sit at different locations because θ0 and θ1 vary
– Imagine a bowl-shaped function coming out of the screen, so that the middle of the concentric ellipses is the minimum

CSE 445 Machine Learning ECE@NSU
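As a sketch of how such a contour plot can be drawn (Python with NumPy and matplotlib; the grid ranges are our own guesses, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
m = len(x)

# Grid of candidate parameter values
T0, T1 = np.meshgrid(np.linspace(-200, 400, 200), np.linspace(-0.2, 0.6, 200))

# Evaluate J(theta0, theta1) over the whole grid
J = np.zeros_like(T0)
for xi, yi in zip(x, y):
    J += (T0 + T1 * xi - yi) ** 2
J /= 2 * m

plt.contour(T0, T1, J, levels=30)   # each contour line is one value of J
plt.xlabel("theta0")
plt.ylabel("theta1")
plt.show()
```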


Deeper Insight - Cost Function
• Each point on the contour plot represents a pair of parameter values for θ0 and θ1
– Our example here puts the values at
• θ0 ≈ 800
• θ1 ≈ -0.15

– Not a good fit


• i.e. these parameters give a value on our contour plot far from the center

CSE 445 Machine Learning ECE@NSU


Deeper Insight - Cost Function

– If we have
• θ0 = ~360
• θ1 = 0
• This gives a better hypothesis, but still not great - not in the center of
the contour plot

CSE 445 Machine Learning ECE@NSU




Deeper Insight - Cost Function

• Finally we find the minimum, which gives the best hypothesis

CSE 445 Machine Learning ECE@NSU


Deeper Insight - Cost Function

• Doing this by eye/hand is a painful task
• What we really want is an efficient algorithm for finding the minimum for θ0 and θ1

CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm
• Minimize cost function J
• Gradient descent
– Used all over machine learning for minimization
• Start by looking at a general J() function
• Problem
– We have J(θ0, θ1)
– We want to get min J(θ0, θ1)
• Gradient descent applies to more general functions
– J(θ0, θ1, θ2 .... θn)
– min J(θ0, θ1, θ2 .... θn)

CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm
How does it work?
• Start with initial guesses
– Start at (0, 0) (or any other value)
– Keep changing θ0 and θ1 a little bit to try and reduce J(θ0, θ1)
• Each time you change the parameters, you take the step (along the gradient) which reduces J(θ0, θ1) the most
• Repeat
• Do so until you converge to a local minimum
• Has an interesting property
– Where you start can determine which minimum you end up in

CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm

• Here we can see one initialization point led to one local minimum
• The other led to a different one

CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm

• What does this all mean?
– Repeat until convergence:

θj := θj − α ∂/∂θj J(θ0, θ1)    (simultaneously for j = 0 and j = 1)

– i.e. update θj by subtracting α times the partial derivative of the cost function with respect to θj
• Notation
– := denotes assignment
– α (alpha)
• Is a number called the learning rate
• Controls how big a step you take
– If α is big we have an aggressive gradient descent
– If α is small we take tiny steps
– ∂/∂θj J(θ0, θ1) is the derivative term

CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm

• There is a subtlety about how this gradient descent algorithm is implemented
– We do this for θ0 and θ1
– For j = 0 and j = 1, this means we simultaneously update both
– How do we do this?
• Compute the right-hand side for both θ0 and θ1
– So we need temp values
• Then update θ0 and θ1 at the same time

CSE 445 Machine Learning ECE@NSU
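A minimal sketch of the simultaneous update in code (Python; `dJ_dtheta0` and `dJ_dtheta1` stand for the two partial-derivative terms, passed in as functions):

```python
def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Compute both right-hand sides first, using the *old* parameter values...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...then assign, so theta0 and theta1 are updated at the same time.
    return temp0, temp1
```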


Gradient Descent Algorithm
• To understand gradient descent, we'll return to a simpler
function where we minimize one parameter to help explain the
algorithm in more detail
– where θ1 is a real number
• Two key terms in the algorithm
– Alpha
– Derivative term
• Notation nuances
– Partial derivative vs. derivative
• Use partial derivative when we have multiple variables but only derive with
respect to one
• Use derivative when we are deriving with respect to all the variables

CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm
• Derivative term: d/dθ1 J(θ1)
• The derivative says
– Take the tangent at the current point and look at the slope of that line
– If the slope is negative (we are to the left of the minimum), α is always positive, so the update θ1 := θ1 − α · (negative number) increases θ1, moving it towards the minimum
– Similarly, if the slope is positive (we are to the right of the minimum), the update decreases θ1 – either way J(θ1) gets smaller

CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm
J(θ1)   (θ1 ∈ ℝ)

[Plot: point on the right of the minimum]
Positive slope: d/dθ1 J(θ1) ≥ 0
θ1 := θ1 − α · (positive number)  →  θ1 decreases

[Plot: point on the left of the minimum]
Negative slope: d/dθ1 J(θ1) ≤ 0
θ1 := θ1 − α · (negative number)  →  θ1 increases
CSE 445 Machine Learning ECE@NSU
Gradient Descent Algorithm
θ1 := θ1 − α d/dθ1 J(θ1)    (θ1 ∈ ℝ)

• If α is too small, gradient descent can be slow.
[Plot: many tiny steps creeping down the J(θ1) curve]

• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
[Plot: steps jumping back and forth across the minimum]
CSE 445 Machine Learning ECE@NSU
Gradient Descent Algorithm
J(θ1)   (θ1 ∈ ℝ)

• Suppose θ1 is already at a local optimum of J(θ1).
• The slope (derivative) there is zero, so the update

θ1 := θ1 − α · d/dθ1 J(θ1) = θ1 − α · 0

leaves θ1 unchanged.
CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm
• Gradient descent can converge to a local minimum, even with the learning rate α fixed.

θ1 := θ1 − α d/dθ1 J(θ1)    (θ1 ∈ ℝ)

[Plot: successive gradient descent steps on J(θ1), each smaller than the last]

• As we approach a local minimum, the derivative term gets smaller, so gradient descent automatically takes smaller steps. So there is no need to decrease α over time.
CSE 445 Machine Learning ECE@NSU


Gradient Descent Algorithm

Applying gradient descent to the linear regression cost function gives:

repeat until convergence {
θ0 := θ0 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
θ1 := θ1 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
}    (update θ0 and θ1 simultaneously)

CSE 445 Machine Learning ECE@NSU
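Putting the update rules together, a sketch of batch gradient descent on the housing data (Python/NumPy; `gradient_descent` and the value of α are our own choices – with unscaled house sizes the learning rate has to be very small):

```python
import numpy as np

def gradient_descent(x, y, alpha=1e-7, iterations=10000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = theta0 + theta1 * x - y            # h(x^(i)) - y^(i) for every example
        temp0 = theta0 - alpha * error.sum() / m
        temp1 = theta1 - alpha * (error * x).sum() / m
        theta0, theta1 = temp0, temp1              # simultaneous update
    return theta0, theta1

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(gradient_descent(x, y))
```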


Gradient Descent Algorithm

[A sequence of figure-only slides follows here; their figures are not preserved in this text version.]

CSE 445 Machine Learning ECE@NSU


“Batch” Gradient Descent

“Batch”: Each step of gradient descent uses all the training examples.

CSE 445 Machine Learning ECE@NSU


Implementation
House sizes: 2104, 1416, 1534, 852        Hypothesis: hθ(x) = -40 + 0.25x

Predictions:
-40 × 1 + 0.25 × 2104
-40 × 1 + 0.25 × 1416
-40 × 1 + 0.25 × 1534
-40 × 1 + 0.25 × 852

Prediction = Data Matrix × Parameters
(the data matrix X has a column of 1's, multiplying θ0, next to the column of house sizes, multiplying θ1)

CSE 445 Machine Learning ECE@NSU
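The same computation as a sketch in code (Python/NumPy):

```python
import numpy as np

sizes = np.array([2104, 1416, 1534, 852], dtype=float)

# Data matrix: a column of 1's (multiplying theta0) next to the house sizes (multiplying theta1)
X = np.column_stack([np.ones_like(sizes), sizes])
theta = np.array([-40.0, 0.25])

predictions = X @ theta        # Prediction = Data Matrix * Parameters
print(predictions)             # [486.  314.  343.5 173. ]
```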


Implementation
House sizes: 2104, 1416, 1534, 852        We have 3 competing hypotheses:
1. hθ(x) = -40 + 0.25x
2. hθ(x) = 200 + 0.1x
3. hθ(x) = -150 + 0.4x

Predictions (one column per hypothesis):

486   410   692
314   342   416
344   353   464
173   285   191

1st hθ   2nd hθ   3rd hθ

CSE 445 Machine Learning ECE@NSU
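And the same idea with all three hypotheses at once (again a Python/NumPy sketch): putting one column of parameters per hypothesis into a matrix means a single matrix multiplication produces every column of predictions.

```python
import numpy as np

sizes = np.array([2104, 1416, 1534, 852], dtype=float)
X = np.column_stack([np.ones_like(sizes), sizes])

# One column [theta0; theta1] per competing hypothesis
thetas = np.array([[-40.0, 200.0, -150.0],
                   [ 0.25,   0.1,    0.4]])

predictions = X @ thetas       # 4 houses x 3 hypotheses
print(np.round(predictions))
# [[486. 410. 692.]
#  [314. 342. 416.]
#  [344. 353. 464.]
#  [173. 285. 191.]]
```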
