2 Linear Regression
• Supervised Learning
• Generic equation
• Unsupervised Learning
Linear Regression
• We aim to predict a continuous target value given an input feature vector.
• The $d$-dimensional feature vector is denoted by $\mathbf{x} \in \mathbb{R}^d$, while $y \in \mathbb{R}$ is the output variable.
• The hypothesis function is defined by $\hat{y} = \mathbf{w}^T \mathbf{x} + w_0$ (worked out below for $d = 2$).
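As a small added illustration (not from the slides), with $d = 2$ the hypothesis is just a weighted sum of the two features plus an intercept:

$$\hat{y} = w_1 x_1 + w_2 x_2 + w_0$$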
Linear Regression
Figure: a fitted regression line, labelled with its slope and its intercept.
Optimization: Linear Regression
• Mapping of the $i$-th instance: $\hat{y}_i = \mathbf{w}^T \mathbf{x}_i + w_0$
• Overall deviation of the predictions $\hat{y}_i$ from the targets $y_i$ over all $N$ instances
Optimization: Linear Regression
• Overall deviation: $J = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$
• Optimization problem: find the coefficients $\mathbf{w}$ and $w_0$ that minimize $J$
Optimization: Linear Regression
• We minimize the squared deviation rather than its absolute value.
• Gradient descent (a minimal code sketch follows this list):
1. Start with an initial guess for $\mathbf{w}$, say $\mathbf{w}^{(0)}$.
• Gradient descent lets you initialize the search for the minimum from any starting point.
2. Iterate until convergence:
• Compute the gradient of $J$ with respect to the linear coefficients $\mathbf{w}^{(t)}$ at step $t$.
• Update $\mathbf{w}^{(t)}$ to get $\mathbf{w}^{(t+1)}$ by taking a step in the opposite direction of the gradient.
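The following is a minimal NumPy sketch of this loop, not the lecture's own code. It assumes the layout used on the later slides ($X \in \mathbb{R}^{d \times N}$ with a constant first row absorbing the intercept) and uses the gradient $-X(\mathbf{y} - X^T \mathbf{w})$ derived below; names such as `learning_rate` and `n_steps` are illustrative.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, n_steps=2000):
    """Fit linear-regression weights by gradient descent.

    X: (d, N) data matrix, one instance per column, with a first row of
       ones so that w[0] plays the role of the intercept w_0.
    y: (N,) vector of targets.
    """
    d, _ = X.shape
    w = np.zeros(d)                       # step 1: initial guess w^(0)
    for _ in range(n_steps):              # step 2: iterate (fixed budget here)
        gradient = -X @ (y - X.T @ w)     # dJ/dw = -X (y - X^T w)
        w = w - learning_rate * gradient  # move opposite to the gradient
    return w

# Toy usage: recover y = 2x + 1 from noisy samples.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + 1.0 + 0.05 * rng.standard_normal(50)
X = np.vstack([np.ones_like(x), x])       # shape (2, 50): bias row + feature row
print(gradient_descent(X, y))             # approximately [1.0, 2.0]
```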
In matrix form, the $N$ feature vectors are stacked as the columns of a data matrix $X \in \mathbb{R}^{d \times N}$, and each prediction is

$$\hat{y}_i = \mathbf{w}^T \mathbf{x}_i + w_0$$
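A small sketch of this layout (an added illustration, assuming NumPy; the variable names are not from the slides): with instances as columns, all $N$ predictions come from one matrix product.

```python
import numpy as np

d, N = 3, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((d, N))   # data matrix: column i is the feature vector x_i
w = rng.standard_normal(d)        # linear coefficients
w0 = 0.5                          # intercept

y_hat = X.T @ w + w0              # y_hat[i] = w^T x_i + w_0, shape (N,)
print(y_hat)
```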
Optimization: Linear Regression
$$\operatorname*{argmin}_{w_0, w_1, \dots, w_d} \; J = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \right)^2$$

$$\hat{y}_i = \mathbf{w}^T \mathbf{x}_i, \qquad X \in \mathbb{R}^{d \times N}$$
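The step from the objective above (with an explicit $w_0$) to the compact form on the next slide is the usual bias-absorption trick, which the later derivation assumes via the constant feature $x_{0i}$:

$$\tilde{\mathbf{x}}_i = \begin{bmatrix} 1 \\ \mathbf{x}_i \end{bmatrix}, \qquad \tilde{\mathbf{w}} = \begin{bmatrix} w_0 \\ \mathbf{w} \end{bmatrix}, \qquad \hat{y}_i = \mathbf{w}^T \mathbf{x}_i + w_0 = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i$$

so that $x_{0i} = 1$ for every instance and $w_0$ becomes an ordinary coordinate of the weight vector.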
Optimization: Linear Regression
$$\operatorname*{argmin}_{\mathbf{w}} \; J = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \mathbf{w}^T \mathbf{x}_i \right)^2 = \frac{1}{2} \left\lVert \mathbf{y} - X^T \mathbf{w} \right\rVert_F^2$$

$$\operatorname*{argmin}_{\mathbf{w}} \; J = \frac{1}{2} \left[ \left( y_1 - \mathbf{w}^T \mathbf{x}_1 \right)^2 + \left( y_2 - \mathbf{w}^T \mathbf{x}_2 \right)^2 + \dots + \left( y_N - \mathbf{w}^T \mathbf{x}_N \right)^2 \right]$$

By the chain rule, for the first term:

$$\frac{\partial \left( y_1 - \mathbf{w}^T \mathbf{x}_1 \right)^2}{\partial w_0} = \frac{\partial \left( y_1 - \mathbf{w}^T \mathbf{x}_1 \right)^2}{\partial \left( y_1 - \mathbf{w}^T \mathbf{x}_1 \right)} \times \frac{\partial \left( y_1 - \mathbf{w}^T \mathbf{x}_1 \right)}{\partial w_0} = -2 \left( y_1 - \mathbf{w}^T \mathbf{x}_1 \right) x_{01}$$
Optimization: Linear Regression
With the same objective and per-term derivative as above, summing the contribution of every term gives

$$\frac{\partial J}{\partial w_0} = \frac{1}{2} \left[ -2 \left( y_1 - \mathbf{w}^T \mathbf{x}_1 \right) x_{01} - 2 \left( y_2 - \mathbf{w}^T \mathbf{x}_2 \right) x_{02} - \dots - 2 \left( y_N - \mathbf{w}^T \mathbf{x}_N \right) x_{0N} \right]$$
Optimization: Linear Regression
Cancelling the factor of 2 against the $\frac{1}{2}$:

$$\frac{\partial J}{\partial w_0} = - \left[ \left( y_1 - \mathbf{w}^T \mathbf{x}_1 \right) x_{01} + \left( y_2 - \mathbf{w}^T \mathbf{x}_2 \right) x_{02} + \dots + \left( y_N - \mathbf{w}^T \mathbf{x}_N \right) x_{0N} \right]$$

This is a row of $X$ times the residual vector, so in matrix form

$$\frac{\partial J}{\partial w_0} = - X_{0\cdot} \left( \mathbf{y} - X^T \mathbf{w} \right)$$

where $X_{0\cdot}$ denotes the row of $X$ holding the constant feature $x_0$.
Optimization: Linear Regression
$$\frac{\partial J}{\partial w_0} = - X_{0\cdot} \left( \mathbf{y} - X^T \mathbf{w} \right)$$

The same derivation applies to every coefficient $w_j$:

$$\frac{\partial J}{\partial w_j} = - X_{j\cdot} \left( \mathbf{y} - X^T \mathbf{w} \right)$$

Stacking all the partial derivatives gives the full gradient:

$$\frac{\partial J}{\partial \mathbf{w}} = - X \left( \mathbf{y} - X^T \mathbf{w} \right)$$
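As a sanity check (an added sketch, not part of the lecture; it assumes NumPy), the analytic gradient above can be compared against finite differences of $J$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 4, 20
X = rng.standard_normal((d, N))            # features as columns, as on the slides
y = rng.standard_normal(N)
w = rng.standard_normal(d)

def J(w):
    residual = y - X.T @ w
    return 0.5 * residual @ residual       # J = 1/2 * ||y - X^T w||^2

analytic = -X @ (y - X.T @ w)              # dJ/dw = -X (y - X^T w)

eps = 1e-6
numeric = np.array([
    (J(w + eps * np.eye(d)[j]) - J(w - eps * np.eye(d)[j])) / (2 * eps)
    for j in range(d)
])
print(np.max(np.abs(analytic - numeric)))  # tiny, e.g. on the order of 1e-9
```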
Optimization: Closed-Form Solution
At the point of minimization, the gradient
of the loss function with respect to the
model parameters is zero, indicating that
the best fit has been found.
$$\frac{\partial J}{\partial \mathbf{w}} = - X \left( \mathbf{y} - X^T \mathbf{w} \right) = \mathbf{0}$$
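The slide stops at the gradient; solving the zero-gradient condition (assuming $X X^T$ is invertible, an assumption added here) gives the usual normal-equation form:

$$X \left( \mathbf{y} - X^T \mathbf{w} \right) = \mathbf{0} \;\;\Rightarrow\;\; X X^T \mathbf{w} = X \mathbf{y} \;\;\Rightarrow\;\; \mathbf{w} = \left( X X^T \right)^{-1} X \mathbf{y}$$

In code one would typically solve the linear system directly, e.g. `np.linalg.solve(X @ X.T, X @ y)`, rather than forming the inverse explicitly.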