
Lecture 3b

Linear Regression
with Multiple Variables

Silvia Ahmed, CSE445 Machine Learning, ECE@NSU


Learning goals

• After this presentation, you should be able to:


• Apply linear regression with multiple variables.
• Apply feature scaling and mean normalization successfully for multiple features.



Multiple Features

• Notation
• n = number of features
• m = number of training examples
• x^(i) = input (features) of the ith training example
• x_j^(i) = value of feature j in the ith training example



Multiple Features
• Multiple variables = multiple features
• In the original version we had
• x = house size (used to predict)
• y = house price
• In a new scheme we have more variables (such as number of bedrooms, number of floors, age of the home)
• x1, x2, x3, x4 are the four features
• x1 - size (feet squared)
• x2 - Number of bedrooms
• x3 - Number of floors
• x4 - Age of home (years)
• y is the output variable (price)



Hypothesis for Multiple Features

Previously: hθ(x) = θ0 + θ1x

Now: hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn

E.g.:
hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4
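For illustration (not part of the original slides), here is a minimal NumPy sketch that evaluates a hypothesis of this form as a dot product; the coefficients reuse the example above and the feature values are made up:

    import numpy as np

    # Parameters from the example: h(x) = 80 + 0.1*x1 + 0.01*x2 + 3*x3 - 2*x4
    theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])

    # One made-up example: size, number of bedrooms, number of floors, age
    x = np.array([2104.0, 5.0, 1.0, 45.0])

    # Prepend x0 = 1 so the intercept theta0 is handled by the same dot product
    x = np.concatenate(([1.0], x))

    prediction = theta @ x   # h_theta(x) = theta . x
    print(prediction)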



Hypothesis for Multiple Features
• For convenience of notation, define x0 = 1, so every example i has an additional 0th feature
• The feature vector is now (n + 1)-dimensional, indexed from 0
• This is a column vector called x
• x = [x0, x1, …, xn] (with x0 = 1)
• Each example has its own such column vector
• The parameters also live in a 0-indexed, (n + 1)-dimensional vector
• This is also a column vector, called θ
• θ = [θ0, θ1, …, θn]
• The parameter vector is the same for every example



Model for Multiple Features
Hypothesis: hθ(x) = θ · x = θ0x0 + θ1x1 + θ2x2 + … + θnxn

Parameters: θ0, θ1, θ2, …, θn

Cost function: J(θ) = (1/2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

Gradient descent:
Repeat {
  θj := θj − α ∂/∂θj J(θ)
} (simultaneously update for every j = 0, 1, …, n)
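A minimal NumPy sketch of the cost function above (illustrative, not from the slides), assuming X is an m × (n+1) matrix whose first column is all ones and y is a length-m vector of targets:

    import numpy as np

    def compute_cost(X, y, theta):
        # J(theta) = (1 / 2m) * sum_i (h_theta(x^(i)) - y^(i))^2, with h_theta(x) = X @ theta
        m = len(y)
        errors = X @ theta - y
        return (errors @ errors) / (2 * m)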



Gradient Descent Algorithm
Previously (n = 1):
Repeat
{
θ0 := θ0 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))

θ1 := θ1 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x^(i)

(simultaneously update θ0, θ1)


}



Gradient Descent Algorithm
New algorithm (n ≥ 1):
Repeat
{
θj := θj − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x_j^(i)

(simultaneously update θj for j = 0, 1, …, n)

}

Written out for the first few parameters:

θ0 := θ0 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x_0^(i)
θ1 := θ1 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x_1^(i)
θ2 := θ2 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x_2^(i)
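A vectorized NumPy sketch of this update rule (illustrative; same assumptions as the cost-function sketch above: X is m × (n+1) with a leading column of ones):

    import numpy as np

    def gradient_descent(X, y, theta, alpha, num_iters):
        # theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
        # The matrix form X.T @ errors computes this sum for every j at once,
        # so all parameters are updated simultaneously.
        m = len(y)
        for _ in range(num_iters):
            errors = X @ theta - y            # shape (m,)
            gradient = (X.T @ errors) / m     # shape (n+1,)
            theta = theta - alpha * gradient
        return theta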



Gradient Descent Algorithm
• We do this for each j (0 through n) as a simultaneous update (just as when n = 1)
• So we reset θj to:
  • θj minus the learning rate (α) times the partial derivative of J(θ) with respect to θj
• In non-calculus words, this means we take:
  • The learning rate
  • Times 1/m (makes the maths easier)
  • Times the sum, over all examples, of
    • The hypothesis applied to the example's feature vector, minus the actual value, times the jth feature value of that example



Feature Scaling
• When a problem has multiple features, make sure those features have a similar scale
• Gradient descent will then converge more quickly
• E.g. x1 = size (0 – 2000 feet²)
  x2 = number of bedrooms (1 – 5)
• [Contour plot of J(θ) over θ1 and θ2]
• With such different ranges, the contours produced by plotting θ1 vs. θ2 form a very tall, thin shape
• Running gradient descent can then take a long time to find the global minimum


Feature Scaling
• Idea: make sure features are on a similar scale
  0 ≤ x1 ≤ 1
  0 ≤ x2 ≤ 1
• If you rescale each value of x1 and x2 by dividing by the max for that feature
• [Contour plot of J(θ) over θ1 and θ2]
• The contours become more like circles (as the features are scaled between 0 and 1)



Feature Scaling
• Get every feature into approximately a −1 ≤ xi ≤ 1 range
• Want to avoid large ranges, small ranges, or very different ranges from one another
• Rule of thumb for acceptable ranges:
  • −3 to +3 is generally fine; anything much bigger is bad
  • −1/3 to +1/3 is okay; anything much smaller is bad

  x0 = 1 (always)
  0 ≤ x1 ≤ 3 (okay)
  −2 ≤ x2 ≤ 0.5 (okay)
  −100 ≤ x3 ≤ 100 (range too large)
  −0.0001 ≤ x4 ≤ 0.0001 (range too small)



Mean Normalization
• Take a feature xi
• Replace it by (xi − mean)/max
• So your values all have an average of about 0
• E.g. x1 = (size − 1000) / 2000, giving −0.5 ≤ x1 ≤ 0.5
  x2 = (#bedrooms − 2) / 5, giving −0.5 ≤ x2 ≤ 0.5

• Instead of the max, the standard deviation or (max − min) can also be used:
  x1 := (x1 − μ1) / σ1
  x2 := (x2 − μ2) / σ2
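An illustrative NumPy sketch (not from the slides) of mean normalization using the standard-deviation variant, applied column-wise to a feature matrix that does not yet contain the x0 column:

    import numpy as np

    def mean_normalize(X):
        # x_j := (x_j - mu_j) / sigma_j for each feature column j
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        return (X - mu) / sigma, mu, sigma

    # Made-up example: house size and number of bedrooms on very different scales
    X = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0],
                  [1416.0, 2.0]])
    X_norm, mu, sigma = mean_normalize(X)

The same mu and sigma should be reused to scale any new example before making a prediction.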



Learning Rate α
• θj := θj − α ∂/∂θj J(θ)

• Debugging: how to make sure gradient descent is working correctly
• How to choose learning rate α



Learning Rate α
• θj := θj − α ∂/∂θj J(θ)
• [Plot: J(θ) vs. number of iterations (0–400), flattening out as gradient descent converges]
• The number of iterations needed varies a lot:
  • 30 iterations
  • 3,000 iterations
  • 3,000,000 iterations
• It is very hard to tell in advance how many iterations will be needed
• You can often make a guess from a plot like this after the first 100 or so iterations



Learning Rate α
• [Plot: J(θ) vs. number of iterations (0–400)]
• Automatic convergence tests
  • Declare convergence if J(θ) decreases by less than 10⁻³ in one iteration
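A sketch of such an automatic convergence test (illustrative; the 10⁻³ threshold is the one quoted above) added to the gradient-descent loop sketched earlier:

    import numpy as np

    def gradient_descent_with_convergence(X, y, theta, alpha, max_iters, tol=1e-3):
        m = len(y)
        prev_cost = np.inf
        history = []
        for _ in range(max_iters):
            errors = X @ theta - y
            cost = (errors @ errors) / (2 * m)   # J(theta) at the current parameters
            history.append(cost)
            # Declare convergence if J(theta) decreased by less than tol in one iteration
            if prev_cost - cost < tol:
                break
            prev_cost = cost
            theta = theta - alpha * (X.T @ errors) / m
        return theta, history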



Learning Rate α
• [Three plots of J(θ) vs. number of iterations, each increasing or oscillating rather than decreasing]
• If J(θ) behaves like this, gradient descent is not working
• Use a smaller α



Learning Rate α
• For sufficiently small α, J(θ) should decrease on every iteration
• But if α is too small, gradient descent can be slow to converge
• So:
  • If α is too small: slow convergence
  • If α is too large: J(θ) may not decrease on every iteration; may not converge
• To choose α, try
  …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
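One illustrative way to run this sweep, assuming the X, y and gradient_descent_with_convergence sketch from earlier; the candidate values are the ones listed above:

    import numpy as np

    # Candidate learning rates, roughly 3x apart as suggested above
    alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]

    for alpha in alphas:
        theta0 = np.zeros(X.shape[1])
        theta, history = gradient_descent_with_convergence(X, y, theta0, alpha, max_iters=400)
        # Inspect (or plot) each cost history and pick the largest alpha
        # for which J(theta) still decreases steadily
        print(alpha, history[-1])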



Feature Selection
• House price prediction
• hθ(x) = θ0 + θ1 × frontage + θ2 × depth

• Two features:
  • Frontage: width of the plot of land along the road (x1)
  • Depth: depth of the plot away from the road (x2)



Feature Selection
• You don't have to use just two features
• Can create new features

• You might decide that an important feature is the land area
• So, create a new feature: area (x3) = frontage × depth, and use hθ(x) = θ0 + θ1 × area (see the sketch below)
• Area is a better indicator of price than frontage and depth separately
• Often, by defining new features you may get a better model
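A tiny illustrative sketch of creating such a derived feature (the values are made up):

    import numpy as np

    frontage = np.array([50.0, 40.0, 80.0])    # x1: width of the plot along the road
    depth = np.array([100.0, 120.0, 90.0])     # x2: depth of the plot away from the road

    # New feature x3: land area, often more informative than frontage and depth separately
    area = frontage * depth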



Polynomial regression
• A polynomial may fit the data better
• e.g. a quadratic function: hθ(x) = θ0 + θ1x + θ2x²



Polynomial regression
• For the housing data we could use a quadratic function
• But it may not fit the data so well: a quadratic eventually turns back down, which would mean housing prices decrease when size gets really big



Polynomial regression
• So instead we could use a cubic function
• hθ(x) = θ0 + θ1x + θ2x² + θ3x³



Polynomial regression
• So instead we could use a cubic function
• hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3
• hθ(x) = θ0 + θ1(size) + θ2(size)² + θ3(size)³
  x1 = (size)
  x2 = (size)²
  x3 = (size)³

• Make sure to apply feature scaling (sketched below), since the ranges now differ enormously:
  size = 1 – 1000
  (size)² = 1 – 1,000,000
  (size)³ = 1 – 10⁹
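An illustrative sketch of building these polynomial features and scaling them; the mean_normalize helper from the mean-normalization sketch earlier is assumed:

    import numpy as np

    size = np.array([100.0, 250.0, 500.0, 1000.0])   # made-up house sizes

    # Polynomial features: x1 = size, x2 = size^2, x3 = size^3
    X_poly = np.column_stack([size, size**2, size**3])

    # The ranges now differ enormously, so feature scaling is essential
    X_scaled, mu, sigma = mean_normalize(X_poly)

    # A square-root feature (next slide) is another option with a much gentler range
    X_alt = np.column_stack([size, np.sqrt(size)])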
Choice of features
hθ(x) = θ0 + θ1(size) + θ2√(size)

• Instead of a conventional polynomial you could use fractional powers of a variable, i.e. x^(1/something): square root, cube root, etc.



Reference and further reading
• Coursera course: “Supervised Machine Learning: Regression and Classification”, https://www.coursera.org/learn/machine-learning/home

