Linear Regression with Multiple Features
• Notation
– n = number of features
– m = number of training examples
– x(i) = input (features) of ith training example
– xj(i) = value of feature j in the ith training example
Hypothesis: hθ(x) = θᵀx = θ0x0 + θ1x1 + … + θnxn (with x0 = 1)
x ϵ ℝn+1, θ ϵ ℝn+1
Cost function: J(θ) = (1/(2m)) Σi=1..m (hθ(x(i)) - y(i))²
Gradient descent:
Repeat {
θj := θj - α ∂J(θ)/∂θj
} (simultaneously update θj for every j = 0, 1, … n)
Writing out the partial derivative, the updates are:
θ0 := θ0 - α (1/m) Σi=1..m (hθ(x(i)) - y(i)) x0(i)
θ1 := θ1 - α (1/m) Σi=1..m (hθ(x(i)) - y(i)) x1(i)
θ2 := θ2 - α (1/m) Σi=1..m (hθ(x(i)) - y(i)) x2(i)
…
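A minimal NumPy sketch of this cost function and update rule (not from the slides; function and variable names are illustrative, and X is assumed to be the m × (n+1) matrix with a leading column of 1s):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = (1/(2m)) * sum((h(x) - y)^2), with X the m x (n+1) design matrix."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeat the simultaneous update theta := theta - alpha * (1/m) * X^T (X theta - y)."""
    m = len(y)
    for _ in range(num_iters):
        gradient = (X.T @ (X @ theta - y)) / m   # (1/m) * sum over examples of error * x_j
        theta = theta - alpha * gradient          # updates every theta_j at the same time
    return theta
```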
Gradient Descent Algorithm
• We're doing this for each j (from 0 to n) as a simultaneous update (just like when n = 1)
• So, we update θj to
– θj minus the learning rate (α) times the partial derivative of the cost function J(θ) with respect to θj
– In non-calculus words, this means we subtract:
• The learning rate α
• Times 1/m (which makes the maths easier)
• Times the sum, over all training examples, of
– The hypothesis applied to that example's feature vector, minus the actual value, times the jth feature value of that example (see the loop-based sketch below)
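A loop-based sketch that mirrors this wording one step at a time (illustrative names; assumes NumPy arrays):

```python
import numpy as np

def gradient_descent_loops(X, y, theta, alpha, num_iters):
    """Literal per-parameter form: theta_j := theta_j - alpha * (1/m) * sum_i (h(x(i)) - y(i)) * x_j(i)."""
    m, n_plus_1 = X.shape
    for _ in range(num_iters):
        new_theta = theta.copy()                    # so every theta_j uses the old theta (simultaneous update)
        for j in range(n_plus_1):
            total = 0.0
            for i in range(m):
                h_i = X[i] @ theta                  # hypothesis for training example i
                total += (h_i - y[i]) * X[i, j]     # (prediction - actual) * jth feature value
            new_theta[j] = theta[j] - alpha * total / m
        theta = new_theta
    return theta
```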
Feature Scaling
• Idea: Make sure features are on a similar scale
• E.g. scale each value of x1 and x2 by dividing by the max for that feature, so that
0 ≤ x1 ≤ 1
0 ≤ x2 ≤ 1
• The contours of J(θ) (plotted over θ1 and θ2) then become more like circles (since both features are scaled between 0 and 1), so gradient descent takes a more direct path to the minimum
• Aim to get every feature into roughly the -1 ≤ xi ≤ 1 range (x0 = 1 always); for example (see the sketch after this list):
– 0 ≤ x1 ≤ 3 is fine
– -2 ≤ x2 ≤ 0.5 is fine
– -100 ≤ x3 ≤ 100 is too large; rescale
– -0.0001 ≤ x4 ≤ 0.0001 is too small; rescale
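A small sketch of scaling by the max, under the assumption that X holds one feature per column (names are illustrative):

```python
import numpy as np

def scale_by_max(X):
    """Divide every feature column by its max absolute value so each feature lies roughly in [-1, 1]."""
    max_vals = np.abs(X).max(axis=0)
    max_vals[max_vals == 0] = 1.0       # avoid dividing a constant-zero column by zero
    return X / max_vals
```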
Mean Normalization
• Take a feature xi
– Replace it by (xi - μi)/si, where μi is the mean of the feature and si is its range (max - min) or simply its max (see the sketch below)
– So the values of each feature all have an average of about 0
• E.g. -0.5 ≤ x1 ≤ 0.5
-0.5 ≤ x2 ≤ 0.5
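A sketch of mean normalization dividing by the range, as assumed above (illustrative names):

```python
import numpy as np

def mean_normalize(X):
    """Replace each feature x with (x - mean) / (max - min), so values average ~0 (roughly -0.5..0.5)."""
    mu = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)
    spread[spread == 0] = 1.0            # guard against constant features
    return (X - mu) / spread
```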
Checking Convergence
[Plots: J(θ) vs. no. of iterations]
• Plot J(θ) against the number of iterations; if gradient descent is working, J(θ) should decrease after every iteration
• Automatic convergence tests
– Declare convergence if J(θ) decreases by less than 10⁻³ in one iteration (sketched below)
• If J(θ) is increasing or bouncing up and down, gradient descent is not working
– Use a smaller α
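A sketch of the automatic convergence test with the 10⁻³ threshold from above; it reuses the compute_cost sketch given earlier (illustrative names):

```python
def gradient_descent_with_check(X, y, theta, alpha, max_iters, tol=1e-3):
    """Stop once J(theta) decreases by less than tol in one iteration."""
    m = len(y)
    prev_cost = compute_cost(X, y, theta)          # compute_cost as sketched earlier
    for _ in range(max_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        cost = compute_cost(X, y, theta)
        if prev_cost - cost < tol:                 # automatic convergence test
            break
        prev_cost = cost
    return theta
```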
Learning Rate α
• For sufficiently small α, J(θ) should decrease on every
iteration
• But if α is too small, gradient descent can be slow to converge
• So
– If α is too small: slow convergence
– If α is too large: J(θ) may not decrease on every iteration;
may not converge
• To choose α, try
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
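One way to try these α values in practice, assuming a prepared X and y and the compute_cost sketch above (illustrative; 100 iterations is an arbitrary choice):

```python
import numpy as np

# Try a range of learning rates (each roughly 3x the previous) and keep the
# largest one for which J(theta) still decreases on every iteration.
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
for alpha in alphas:
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(100):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
        costs.append(compute_cost(X, y, theta))
    trend = "decreasing" if all(np.diff(costs) <= 0) else "not decreasing"
    print(f"alpha={alpha}: J(theta) {trend}, final cost {costs[-1]:.4f}")
```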
• Two features
– Frontage: width of the plot of land along the road (x1)
– Depth: depth of the plot away from the road (x2)
Normal Equation
• Example: a training set with
– n = 4 features
– m = 4 training examples
• X = the m × (n+1) design matrix: one row per training example, with x0 = 1 in the first column
• y = the m-dimensional vector of target values
• θ = (XᵀX)⁻¹Xᵀy, θ ϵ ℝn+1
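A one-line NumPy sketch of the normal equation, using pinv for numerical robustness (illustrative names):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, with X the m x (n+1) design matrix (leading column of 1s)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```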
• Normal equation
– Can be used for linear regression only
– Can solve for θ directly, without iterations
• Disadvantages
– Doesn't generalize to other learning algorithms
– Slow when the number of features n is large (n > 10,000), since it requires inverting the (n+1) × (n+1) matrix XᵀX