Lecture 2 Annotated

The document discusses maximum likelihood estimation (MLE), which is a method for estimating the parameters of a probabilistic model given observations. MLE chooses the parameters that maximize the likelihood or probability of the observed data given the model. The document provides examples of applying MLE to estimate the probability of heads for a coin flip (binomial distribution) and the mean and variance of a Gaussian/normal distribution.


Maximum Likelihood Estimation
Your first consulting job
• Billionaire: I have a special coin; if I flip it, what's the probability it will be heads?
• You: Please flip it a few times:

• You: The probability is:

• Billionaire: Why?
Coin – Binomial Distribution

• Data: sequence D = (HHTHT…), k heads out of n flips

• Hypothesis: P(Heads) = θ, P(Tails) = 1 − θ
• Flips are i.i.d.:
  • Independent events
  • Identically distributed according to the Binomial distribution

• P(D | θ) = θ^k (1 − θ)^(n−k)
Maximum Likelihood Estimation

• Data: sequence D = (HHTHT…), k heads out of n flips

• Hypothesis: P(Heads) = θ, P(Tails) = 1 − θ

  P(D | θ) = θ^k (1 − θ)^(n−k)

• Maximum likelihood estimation (MLE): choose the θ that maximizes the probability of the observed data, P(D | θ):

  θ̂_MLE = arg max_θ P(D | θ) = arg max_θ log P(D | θ)

Your first learning algorithm

θ̂_MLE = arg max_θ log P(D | θ)
       = arg max_θ log [ θ^k (1 − θ)^(n−k) ]

• Set the derivative to zero: d/dθ log P(D | θ) = 0
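The slides stop at the stationarity condition. As a quick numerical sanity check (not part of the lecture), the sketch below evaluates the log-likelihood log θ^k(1 − θ)^(n−k) on a grid and confirms the maximizer is the closed-form answer θ̂ = k/n; the counts k and n are made up for illustration.

```python
# Minimal sketch (assumed data): grid-search the coin-flip log-likelihood
# and compare against the closed-form MLE theta_hat = k / n.
import numpy as np

k, n = 7, 10                                # 7 heads out of 10 flips (illustrative)
thetas = np.linspace(0.001, 0.999, 9_999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])           # ~0.7, the grid maximizer
print(k / n)                                # closed form from d/dtheta log P(D|theta) = 0
```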
Maximum Likelihood Estimation

Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE): θ̂_MLE = arg max_θ L_n(θ)


Recap
• Learning is…
• Collect some data
• E.g., coin flips
• Choose a hypothesis class or model
• E.g., binomial
• Choose a loss function
• E.g., data likelihood
• Choose an optimization procedure
• E.g., set derivative to zero to obtain MLE
What about continuous variables?

• Billionaire: What if I am measuring a continuous variable?


• You: Let me tell you about Gaussians…
Some properties of Gaussians

• Affine transformation (multiplying by a scalar and adding a constant):
  • X ~ N(μ, σ²)
  • Y = aX + b ➔ Y ~ N(aμ + b, a²σ²)

• Sum of (independent) Gaussians:
  • X ~ N(μ_X, σ²_X)
  • Y ~ N(μ_Y, σ²_Y), independent of X
  • Z = X + Y ➔ Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
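A small simulation (not in the slides) can sanity-check both closure properties; the particular constants are arbitrary, and the second check uses an independently drawn second Gaussian, since independence is what makes the variances add.

```python
# Simulation check of the two Gaussian closure properties (illustrative constants).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a, b = 2.0, 1.5, 3.0, -1.0

X = rng.normal(mu, sigma, size=1_000_000)
Y = a * X + b                                  # affine transformation of a Gaussian
print(Y.mean(), Y.var())                       # ~ a*mu + b = 5.0 and a^2*sigma^2 = 20.25

X2 = rng.normal(1.0, 2.0, size=1_000_000)      # drawn independently of X
Z = X + X2                                     # sum of independent Gaussians
print(Z.mean(), Z.var())                       # ~ 2.0 + 1.0 = 3.0 and 1.5^2 + 2.0^2 = 6.25
```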
MLE for Gaussian

• Prob. of i.i.d. samples D = {x_1, …, x_n} (e.g., temperature):

  P(D | μ, σ) = P(x_1, …, x_n | μ, σ) = (1 / (σ√(2π)))^n ∏_{i=1}^n e^{−(x_i − μ)² / (2σ²)}

• Log-likelihood of data:

  log P(D | μ, σ) = −n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)² / (2σ²)

• What is θ̂_MLE for θ = (μ, σ²)? Draw a picture!


Your second learning algorithm:
MLE for mean of a Gaussian

• What’s MLE for mean?

  d/dμ log P(D | μ, σ) = d/dμ [ −n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)² / (2σ²) ]
MLE for variance

• Again, set derivative to zero:

  d/dσ log P(D | μ, σ) = d/dσ [ −n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)² / (2σ²) ]
Learning Gaussian parameters

• MLE:

  μ̂_MLE = (1/n) Σ_{i=1}^n x_i

  σ̂²_MLE = (1/n) Σ_{i=1}^n (x_i − μ̂_MLE)²

• MLE for the variance of a Gaussian is biased:

  E[σ̂²_MLE] ≠ σ²

• Unbiased variance estimator:

  σ̂²_unbiased = (1/(n−1)) Σ_{i=1}^n (x_i − μ̂_MLE)²
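The sketch below (illustrative, not from the slides) computes both variance estimators on many small samples; averaging over repetitions shows the 1/n estimator undershoots σ² by the factor (n−1)/n, while the 1/(n−1) estimator does not.

```python
# Sketch: MLE mean/variance vs. the unbiased variance estimator on repeated samples.
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_var, n = 10.0, 4.0, 5

mle_vars, unbiased_vars = [], []
for _ in range(100_000):
    x = rng.normal(true_mu, np.sqrt(true_var), size=n)
    mu_hat = x.mean()                                    # mu_hat_MLE
    mle_vars.append(np.mean((x - mu_hat) ** 2))          # divides by n     -> biased
    unbiased_vars.append(np.sum((x - mu_hat) ** 2) / (n - 1))

print(np.mean(mle_vars))        # ~ (n-1)/n * 4.0 = 3.2
print(np.mean(unbiased_vars))   # ~ 4.0
```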
Maximum Likelihood Estimation
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE): θ̂_MLE = arg max_θ L_n(θ)

Under benign assumptions, as the number of observations n → ∞ we have θ̂_MLE → θ*.

The MLE is a "recipe" that begins with a model for the data, f(x; θ).
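To make the "recipe" concrete, here is a hedged sketch (not from the slides) that picks f(x; θ) to be a Gaussian, writes down the negative log-likelihood, and hands it to a generic optimizer; scipy.optimize.minimize is just one reasonable choice, and the data are simulated.

```python
# The generic MLE recipe, sketched: model f(x; theta), then maximize the log-likelihood
# numerically. Here theta = (mu, log sigma); the log-parameterization keeps sigma > 0.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=500)        # simulated observations

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)           # close to the closed-form MLE: sample mean and 1/n std
```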
Applications preview
Maximum Likelihood Estimation
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE): θ̂_MLE = arg max_θ L_n(θ)

Under benign assumptions, as the number of observations n → ∞ we have θ̂_MLE → θ*.

Why is it useful to recover the “true” parameters θ* of a probabilistic model?


• Estimation of the parameters θ* is the goal
• Help interpret or summarize large datasets
• Make predictions about future data
• Generate new data X ∼ f( · ; θ̂_MLE)
Estimation
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Opinion polls
How does the greater
population feel about an issue?
Correct for over-sampling?
• θ* is “true” average opinion
• X1, X2, … are sample calls

A/B testing
How do we figure out which ad
results in more click-through?
• θ* are the “true” average rates
• X1, X2, … are binary “clicks”
Interpret
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Customer segmentation / clustering


Can we identify distinct groups of
customers by their behavior?
• θ* describes “center” of distinct groups
• X1, X2, … are individual customers

Data exploration
What are the degrees of freedom of the
dataset?
• θ* describes the principal directions of
variation
• X1, X2, … are the individual images
Predict
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Content recommendation
Can we predict how much someone will
like a movie based on past ratings?
• θ* describes user’s preferences
• X1, X2, … are (movie, rating) pairs

Object recognition / classification


Identify a flower given just its picture?
• θ* describes the characteristics of
each kind of flower
• X1, X2, … are the (image, label) pairs
Generate
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Text generation
Can AI generate text that could have been written by a human?
"Kaia the dog wasn't a natural pick to go to mars. No one could have predicted she would…"
• θ* describes language structure
• X1, X2, … are text snippets found online

https://chat.openai.com/chat

Text-to-image generation
Can AI generate an image from a prompt? ("dog talking on cell phone under water, oil painting")
• θ* describes the coupled structure of images and text
• X1, X2, … are the (image, caption) pairs found online

https://labs.openai.com/
Linear Regression
The regression problem, 1-dimensional

Given past sales data on zillow.com, predict:

y = House sale price from
x = {# sq. ft.}

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ,  y_i ∈ ℝ

[Figure: scatter plot of Sale Price vs. # square feet]
Fit a function to our data, 1-d

Given past sales data on zillow.com, predict:

y = House sale price from
x = {# sq. ft.}

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ,  y_i ∈ ℝ

Hypothesis/Model: linear
Consider y_i = x_i w + ε_i where ε_i ~ N(0, σ²) i.i.d.

[Figure: best linear fit of Sale Price vs. # square feet]
The regression problem, d-dim

Given past sales data on zillow.com, predict:

y = House sale price from
x = {# sq. ft., zip code, date of sale, etc.}

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Hypothesis/Model: linear
Consider y_i = x_i^⊤ w + ε_i where ε_i ~ N(0, σ²) i.i.d.

[Figure: Sale price as a function of # square feet and # bathrooms]
The regression problem, d-dim

Given past sales data on zillow.com, predict:

y = House sale price from
x = {# sq. ft., zip code, date of sale, etc.}

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Hypothesis/Model: linear
Consider y_i = x_i^⊤ w + ε_i where ε_i ~ N(0, σ²) i.i.d.
Equivalently, p(y | x, w, σ) = (1 / √(2πσ²)) e^{−(y − x^⊤ w)² / (2σ²)}

[Figure: Sale price as a function of # square feet and # bathrooms]
Maximizing log-likelihood

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Model: p(y | x, w, σ) = (1 / √(2πσ²)) e^{−(y − x^⊤ w)² / (2σ²)}

Likelihood: P(D | w, σ) = ∏_{i=1}^n p(y_i | x_i, w, σ) = ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)}
Maximum Likelihood Estimation
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE): θ̂_MLE = arg max_θ L_n(θ)

Under benign assumptions, as the number of observations n → ∞ we have θ̂_MLE → θ*.

Why is it useful to recover the “true” parameters θ* of a probabilistic model?


• Estimation of the parameters θ* is the goal
• Help interpret or summarize large datasets
• Make predictions about future data
• Generate new data X ∼ f( · ; θ̂_MLE)
Maximizing log-likelihood

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Model: p(y | x, w, σ) = (1 / √(2πσ²)) e^{−(y − x^⊤ w)² / (2σ²)}

Likelihood: P(D | w, σ) = ∏_{i=1}^n p(y_i | x_i, w, σ) = ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)}

Maximize (w.r.t. w): log P(D | w, σ) = log [ ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)} ]

[Handwritten derivation: the log turns the product into a sum, log P(D | w, σ) = n log(1/√(2πσ²)) − Σ_{i=1}^n (y_i − x_i^⊤ w)² / (2σ²); the first term does not depend on w, so
arg max_w log P(D | w, σ) = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)².
Taking the gradient in w and setting it to zero gives Σ_{i=1}^n x_i y_i = (Σ_{i=1}^n x_i x_i^⊤) w; assuming (Σ_{i=1}^n x_i x_i^⊤)^{-1} exists,
ŵ = (Σ_{i=1}^n x_i x_i^⊤)^{-1} Σ_{i=1}^n x_i y_i.]
Maximizing log-likelihood

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Model: p(y | x, w, σ) = (1 / √(2πσ²)) e^{−(y − x^⊤ w)² / (2σ²)}

Likelihood: P(D | w, σ) = ∏_{i=1}^n p(y_i | x_i, w, σ) = ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)}

Maximize (w.r.t. w): log P(D | w, σ) = log [ ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)} ]

ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²
Maximizing log-likelihood
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

Set the derivative to zero, solve for w.
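As a numerical cross-check on simulated data (not part of the slides), the sketch below minimizes the negative Gaussian log-likelihood in w with a generic optimizer and compares the result to the least-squares solution; the two agree up to optimizer tolerance, which is exactly the equivalence stated above.

```python
# Sketch: maximizing the Gaussian log-likelihood over w equals least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(w):
    resid = y - X @ w
    return n * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum(resid ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle)
print(w_ls)            # same vector up to optimizer tolerance
```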
Maximizing log-likelihood
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

Set the derivative to zero, solve for w:

ŵ_MLE = ( Σ_{i=1}^n x_i x_i^⊤ )^{-1} Σ_{i=1}^n x_i y_i
The regression problem in matrix notation
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

y = [y_1; …; y_n] ∈ ℝ^n    X = [x_1^⊤; …; x_n^⊤] ∈ ℝ^{n×d}

d : # of features
n : # of examples/datapoints
The regression problem in matrix notation
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

y = [y_1; …; y_n] ∈ ℝ^n    X = [x_1^⊤; …; x_n^⊤] ∈ ℝ^{n×d}

d : # of features
n : # of examples/datapoints

y_i = x_i^⊤ w + ε_i   ⟺   y = Xw + ε
The regression problem in matrix notation
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

y = [y_1; …; y_n]    X = [x_1^⊤; …; x_n^⊤]
d : # of features,  n : # of examples/datapoints

y_i = x_i^⊤ w + ε_i   ⟺   y = Xw + ε

ℓ2 norm: ∥z∥₂² = Σ_{i=1}^n z_i² = z^⊤ z

ŵ_LS = arg min_w ∥y − Xw∥₂²
     = arg min_w (y − Xw)^⊤ (y − Xw)

[Handwritten derivation: ∇_w (y − Xw)^⊤(y − Xw) = −2X^⊤ y + 2X^⊤ X w = 0; if (X^⊤ X)^{-1} exists, then ŵ_MLE = (X^⊤ X)^{-1} X^⊤ y. The annotation also notes the gradient rules ∇_w (g(w) + h(w)) = ∇_w g(w) + ∇_w h(w) and ∇_w (w^⊤ B w) = (B + B^⊤) w used along the way.]
The regression problem in matrix notation
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

y = [y_1; …; y_n]    X = [x_1^⊤; …; x_n^⊤]
d : # of features,  n : # of examples/datapoints

y_i = x_i^⊤ w + ε_i   ⟺   y = Xw + ε

ℓ2 norm: ∥z∥₂² = Σ_{i=1}^n z_i² = z^⊤ z

ŵ_LS = arg min_w ∥y − Xw∥₂²
     = arg min_w (y − Xw)^⊤ (y − Xw)

ŵ_LS = ŵ_MLE = (X^⊤ X)^{-1} X^⊤ y
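A minimal sketch of the closed form on simulated data. Forming (X^⊤X)^{-1} explicitly matches the formula, but solving the normal equations (or calling a least-squares routine) is the usual implementation choice because it is cheaper and better behaved when X^⊤X is ill-conditioned.

```python
# Closed-form least squares, two equivalent ways (illustrative data).
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 4
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, 0.0, -1.0, 3.0]) + rng.normal(scale=0.1, size=n)

w_inverse = np.linalg.inv(X.T @ X) @ X.T @ y     # literal (X^T X)^{-1} X^T y
w_solve = np.linalg.solve(X.T @ X, X.T @ y)      # solve the normal equations instead
print(np.allclose(w_inverse, w_solve))           # True
print(w_solve)                                   # ~ [2, 0, -1, 3]
```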
The regression problem in matrix notation

ŵ_LS = arg min_w ∥y − Xw∥₂² = (X^⊤ X)^{-1} X^⊤ y

What about an offset?

ŵ_LS, b̂_LS = arg min_{w,b} Σ_{i=1}^n ( y_i − (x_i^⊤ w + b) )²
            = arg min_{w,b} ∥y − (Xw + 1b)∥₂²
Dealing with an offset

ŵ_LS, b̂_LS = arg min_{w,b} ∥y − (Xw + 1b)∥₂²
Dealing with an offset

ŵ_LS, b̂_LS = arg min_{w,b} ∥y − (Xw + 1b)∥₂²

Setting the gradients with respect to w and b to zero:

X^⊤ X ŵ_LS + b̂_LS X^⊤ 1 = X^⊤ y
1^⊤ X ŵ_LS + b̂_LS 1^⊤ 1 = 1^⊤ y

If X^⊤ 1 = 0 (i.e., if each feature is mean-zero) then

ŵ_LS = (X^⊤ X)^{-1} X^⊤ y

b̂_LS = (1/n) Σ_{i=1}^n y_i
Make Predictions

ŵ_LS = (X^⊤ X)^{-1} X^⊤ y

b̂_LS = (1/n) Σ_{i=1}^n y_i

A new house is about to be listed. What should it sell for?

ŷ_new = x_new^⊤ ŵ_LS + b̂_LS
Process

Decide on a model for the likelihood function f(x; θ)

Find the function which fits the data best
  Choose a loss function, e.g., least squares
  Pick the function which minimizes the loss on the data

Use the function to make predictions on new examples

Linear regression with non-linear basis functions
Recap: Linear Regression
[Figure: label y vs. input x with two linear fits, f(x) = 400·x and f(x) = 100,000 + 200·x]

• In general, in high dimensions, we fit a linear model with intercept

  y_i ≃ w^⊤ x_i + b,  or equivalently  y_i = w^⊤ x_i + b + ϵ_i with error ϵ_i,

  with model parameters (w ∈ ℝ^d, b ∈ ℝ) that minimize the ℓ2-loss

  ℒ(w, b) = Σ_{i=1}^n ( y_i − (w^⊤ x_i + b) )²
Recap: Linear Regression
• The least squares solution, i.e. the minimizer of the ℓ2-loss, can be written in closed form as a function of the data X and y (or equivalently derived using straightforward linear algebra by setting the gradient to zero):

  [ŵ_LS; b̂_LS] = ( [X 1]^⊤ [X 1] )^{-1} [X 1]^⊤ y,  where 1 ∈ ℝ^n is the all-ones column
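Equivalently, the offset can be absorbed into the closed form above by appending a column of ones to X, as this sketch on simulated data does (np.linalg.lstsq handles the matrix-inverse details):

```python
# The augmented closed form [w_LS; b_LS] = ([X 1]^T [X 1])^{-1} [X 1]^T y (illustrative data).
import numpy as np

rng = np.random.default_rng(6)
n, d = 150, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 2.0, -1.0]) + 4.0 + rng.normal(scale=0.3, size=n)

X_aug = np.hstack([X, np.ones((n, 1))])          # append the all-ones column
wb, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = wb[:-1], wb[-1]
print(w_hat, b_hat)                              # ~ [1, 2, -1] and ~4
```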
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• Linear model with parameter (b, w_1):
  ŷ_i = b + w_1 x_i

[Figure: label y vs. input x]
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• Linear model with parameter (b, w_1):
  ŷ_i = b + w_1 x_i

• Quadratic model with parameter (b, w = [w_1; w_2]):
  ŷ_i = b + w_1 x_i + w_2 x_i²

[Figure: label y vs. input x]
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• Linear model with parameter (b, w_1):
  ŷ_i = b + w_1 x_i

• Quadratic model with parameter (b, w = [w_1; w_2]):
  ŷ_i = b + w_1 x_i + w_2 x_i²

• Degree-p polynomial model with parameter (b, w = [w_1; …; w_p]):
  ŷ_i = b + w_1 x_i + w_2 x_i² + … + w_p x_i^p

[Figure: label y vs. input x]
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• Linear model with parameter (b, w_1):
  ŷ_i = b + w_1 x_i

• Quadratic model with parameter (b, w = [w_1; w_2]):
  ŷ_i = b + w_1 x_i + w_2 x_i²

• Degree-p polynomial model with parameter (b, w = [w_1; …; w_p]):
  ŷ_i = b + w_1 x_i + w_2 x_i² + … + w_p x_i^p

• General p-features with parameter w = [w_1; …; w_p]:
  ŷ_i = ⟨w, h(x_i)⟩ where h : ℝ → ℝ^p

[Figure: label y vs. input x]
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• General p-features with parameter w = [w_1; …; w_p]:
  ŷ_i = ⟨w, h(x_i)⟩ where h : ℝ → ℝ^p

[Figure: label y vs. input x]

Note: h can be arbitrary non-linear functions! E.g.,
h(x) = [log(x), x², sin(x), x]
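As one hedged example of such an h (the specific features simply mirror the note above), the helper below maps each positive scalar input to the four features [log(x), x², sin(x), x]:

```python
# A possible non-linear feature map h: R -> R^4 (illustrative choice of features).
import numpy as np

def h(x):
    """Map positive scalars x to the features [log(x), x^2, sin(x), x], one row per input."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.log(x), x ** 2, np.sin(x), x])

x = np.array([0.5, 1.0, 2.0, 4.0])
print(h(x).shape)      # (4, 4): one row of p = 4 features per data point
```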



Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• General p-features with parameter w = [w_1; …; w_p]:
  ŷ_i = ⟨w, h(x_i)⟩ where h : ℝ → ℝ^p

[Figure: label y vs. input x]

How do we learn w?
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• General p-features with parameter w = [w_1; …; w_p]:
  ŷ_i = ⟨w, h(x_i)⟩ where h : ℝ → ℝ^p

[Figure: label y vs. input x]

How do we learn w?

  ŵ = arg min_w ∥Hw − y∥₂²,  where  H = [h(x_1)^⊤; …; h(x_n)^⊤] ∈ ℝ^{n×p}

For a new test point x, predict ŷ = ⟨ŵ, h(x)⟩
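Putting the pieces together, a short sketch on simulated 1-d data: build H from a degree-p polynomial feature map (with an extra intercept column, an assumption on my part to handle the offset), solve the least-squares problem for ŵ, then predict at a new test point.

```python
# Feature-map regression end to end (illustrative data and feature map).
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
x = rng.uniform(-2, 2, size=n)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.2, size=n)

def h(x, p):
    """Degree-p polynomial features [x, x^2, ..., x^p]."""
    x = np.atleast_1d(x).astype(float)
    return np.column_stack([x ** j for j in range(1, p + 1)])

H = np.hstack([h(x, p), np.ones((n, 1))])        # rows h(x_i)^T plus an intercept column
w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)    # arg min_w ||Hw - y||_2^2

x_new = 1.5
H_new = np.hstack([h(x_new, p), np.ones((1, 1))])
print(H_new @ w_hat)                             # prediction y_hat = <w_hat, h(x_new)>
```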
Which p should we choose?
• First instance of a class of models with different representation power = model complexity

[Figure: two fits of label y vs. input x with different model complexity]

• How do we determine which is the better model?
Generalization
• we say a predictor generalizes if it performs as well on unseen data as on training data (we will formalize this in the next lecture)
• the data used to train a predictor is training data or in-sample data
• we want the predictor to work on out-of-sample data
• we say a predictor fails to generalize if it performs well on in-sample data but does not perform well on out-of-sample data
Generalization
• we say a predictor generalizes if it performs as well on unseen data as on training data (we will formalize this in the next lecture)
• the data used to train a predictor is training data or in-sample data
• we want the predictor to work on out-of-sample data
• we say a predictor fails to generalize if it performs well on in-sample data but does not perform well on out-of-sample data

• train a cubic predictor on 32 (in-sample) white circles: Mean Squared Error (MSE) 174

• predict label y for 30 (out-of-sample) blue circles: MSE 192

• conclude this predictor/model generalizes, as in-sample MSE ≃ out-of-sample MSE


Split the data into training and testing

• a way to mimic how the predictor performs on unseen data

• given a single dataset S = {(x_i, y_i)}_{i=1}^n

• we split the dataset into two: training set and test set (e.g., 90/10)

• training set used to train the model:

  minimize ℒ_train(w) = (1 / |S_train|) Σ_{i ∈ S_train} (y_i − x_i^⊤ w)²

• test set used to evaluate the model:

  ℒ_test(w) = (1 / |S_test|) Σ_{i ∈ S_test} (y_i − x_i^⊤ w)²

• this assumes that the test set is similar to unseen data

• test set should never be used in training or picking unknowns
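A minimal sketch of the split on simulated data (the 90/10 ratio and the plain linear model are just the example from the slide; in practice the split is randomized once and the test set is then left untouched):

```python
# 90/10 train/test split and the two empirical losses (illustrative data).
import numpy as np

rng = np.random.default_rng(8)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=n)

perm = rng.permutation(n)                        # random split, done exactly once
train_idx, test_idx = perm[: int(0.9 * n)], perm[int(0.9 * n):]
X_tr, y_tr = X[train_idx], y[train_idx]
X_te, y_te = X[test_idx], y[test_idx]

w_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)   # train only on the training set

mse_train = np.mean((y_tr - X_tr @ w_hat) ** 2)
mse_test = np.mean((y_te - X_te @ w_hat) ** 2)        # evaluate once on the held-out set
print(mse_train, mse_test)
```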
Train/test error vs. complexity

[Figure: train and test error as a function of the degree p of the polynomial regression]

• Degree p = 5, since it achieves the minimum test error
• Train error monotonically decreases with model complexity
• Test error has a U shape
• The test set should never be used in training or picking the degree p
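The U-shaped test curve can be reproduced with a small sweep over the polynomial degree p on simulated data (a sketch; the degree range and noise level are arbitrary): train error keeps falling, while test error typically turns back up once the model overfits.

```python
# Sweep the polynomial degree and track train/test MSE (illustrative data).
import numpy as np

rng = np.random.default_rng(9)
n = 60
x = rng.uniform(-2, 2, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)

idx = rng.permutation(n)
tr, te = idx[: n // 2], idx[n // 2:]

def design(x, p):
    """Columns [1, x, x^2, ..., x^p]."""
    return np.column_stack([x ** j for j in range(p + 1)])

for p in range(1, 11):
    w, *_ = np.linalg.lstsq(design(x[tr], p), y[tr], rcond=None)
    mse_tr = np.mean((y[tr] - design(x[tr], p) @ w) ** 2)
    mse_te = np.mean((y[te] - design(x[te], p) @ w) ** 2)
    print(p, round(mse_tr, 3), round(mse_te, 3))   # train falls; test is typically U-shaped
```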
