Lecture 2 Annotated

The document discusses maximum likelihood estimation (MLE), which is a method for estimating the parameters of a probabilistic model given observations. MLE chooses the parameters that maximize the likelihood or probability of the observed data given the model. The document provides examples of applying MLE to estimate the probability of heads for a coin flip (binomial distribution) and the mean and variance of a Gaussian/normal distribution.


Maximum Likelihood Estimation
Your first consulting job
• Billionaire: I have a special coin; if I flip it, what's the probability it will be heads?
• You: Please flip it a few times:

• You: The probability is:

• Billionaire: Why?
Coin – Binomial Distribution

• Data: sequence D = (HHTHT…), k heads out of n flips

• Hypothesis: P(Heads) = θ, P(Tails) = 1 − θ
• Flips are i.i.d.:
  • Independent events
  • Identically distributed according to the Binomial distribution

• P(D | θ) = θ^k (1 − θ)^(n−k)
Maximum Likelihood Estimation

• Data: sequence D = (HHTHT…), k heads out of n flips

• Hypothesis: P(Heads) = θ, P(Tails) = 1 − θ

  P(D | θ) = θ^k (1 − θ)^(n−k)

• Maximum likelihood estimation (MLE): choose the θ that maximizes the probability of the observed data, P(D | θ):

  θ̂_MLE = arg max_θ P(D | θ) = arg max_θ log P(D | θ)

Your first learning algorithm

θ̂_MLE = arg max_θ log P(D | θ)
       = arg max_θ log [ θ^k (1 − θ)^(n−k) ]

• Set the derivative to zero: d/dθ log P(D | θ) = 0
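The slides stop at the stationarity condition. As a quick numerical sanity check (not part of the lecture), the sketch below evaluates the log-likelihood log θ^k(1 − θ)^(n−k) on a grid and confirms the maximizer is the closed-form answer θ̂ = k/n; the counts k and n are made up for illustration.

```python
# Minimal sketch (assumed data): grid-search the coin-flip log-likelihood
# and compare against the closed-form MLE theta_hat = k / n.
import numpy as np

k, n = 7, 10                                # 7 heads out of 10 flips (illustrative)
thetas = np.linspace(0.001, 0.999, 9_999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])           # ~0.7, the grid maximizer
print(k / n)                                # closed form from d/dtheta log P(D|theta) = 0
```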
Maximum Likelihood Estimation

Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE): θ̂_MLE = arg max_θ L_n(θ)


Recap
• Learning is…
• Collect some data
• E.g., coin flips
• Choose a hypothesis class or model
• E.g., binomial
• Choose a loss function
• E.g., data likelihood
• Choose an optimization procedure
• E.g., set derivative to zero to obtain MLE
What about continuous variables?

• Billionaire: What if I am measuring a continuous variable?


• You: Let me tell you about Gaussians…
Some properties of Gaussians

• Affine transformation (multiplying by a scalar and adding a constant):
  • X ~ N(μ, σ²)
  • Y = aX + b ➔ Y ~ N(aμ + b, a²σ²)

• Sum of (independent) Gaussians:
  • X ~ N(μ_X, σ²_X)
  • Y ~ N(μ_Y, σ²_Y), independent of X
  • Z = X + Y ➔ Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
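A small simulation (not in the slides) can sanity-check both closure properties; the particular constants are arbitrary, and the second check uses an independently drawn second Gaussian, since independence is what makes the variances add.

```python
# Simulation check of the two Gaussian closure properties (illustrative constants).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a, b = 2.0, 1.5, 3.0, -1.0

X = rng.normal(mu, sigma, size=1_000_000)
Y = a * X + b                                  # affine transformation of a Gaussian
print(Y.mean(), Y.var())                       # ~ a*mu + b = 5.0 and a^2*sigma^2 = 20.25

X2 = rng.normal(1.0, 2.0, size=1_000_000)      # drawn independently of X
Z = X + X2                                     # sum of independent Gaussians
print(Z.mean(), Z.var())                       # ~ 2.0 + 1.0 = 3.0 and 1.5^2 + 2.0^2 = 6.25
```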
MLE for Gaussian

• Prob. of i.i.d. samples D = {x_1, …, x_n} (e.g., temperature):

  P(D | μ, σ) = P(x_1, …, x_n | μ, σ) = (1 / (σ√(2π)))^n ∏_{i=1}^n e^{−(x_i − μ)² / (2σ²)}

• Log-likelihood of data:

  log P(D | μ, σ) = −n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)² / (2σ²)

• What is θ̂_MLE for θ = (μ, σ²)? Draw a picture!


Your second learning algorithm:
MLE for mean of a Gaussian

• What’s MLE for mean?

  d/dμ log P(D | μ, σ) = d/dμ [ −n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)² / (2σ²) ]
MLE for variance

• Again, set derivative to zero:

  d/dσ log P(D | μ, σ) = d/dσ [ −n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)² / (2σ²) ]
Learning Gaussian parameters

• MLE:

  μ̂_MLE = (1/n) Σ_{i=1}^n x_i

  σ̂²_MLE = (1/n) Σ_{i=1}^n (x_i − μ̂_MLE)²

• MLE for the variance of a Gaussian is biased:

  E[σ̂²_MLE] ≠ σ²

• Unbiased variance estimator:

  σ̂²_unbiased = (1/(n−1)) Σ_{i=1}^n (x_i − μ̂_MLE)²
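The sketch below (illustrative, not from the slides) computes both variance estimators on many small samples; averaging over repetitions shows the 1/n estimator undershoots σ² by the factor (n−1)/n, while the 1/(n−1) estimator does not.

```python
# Sketch: MLE mean/variance vs. the unbiased variance estimator on repeated samples.
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_var, n = 10.0, 4.0, 5

mle_vars, unbiased_vars = [], []
for _ in range(100_000):
    x = rng.normal(true_mu, np.sqrt(true_var), size=n)
    mu_hat = x.mean()                                    # mu_hat_MLE
    mle_vars.append(np.mean((x - mu_hat) ** 2))          # divides by n     -> biased
    unbiased_vars.append(np.sum((x - mu_hat) ** 2) / (n - 1))

print(np.mean(mle_vars))        # ~ (n-1)/n * 4.0 = 3.2
print(np.mean(unbiased_vars))   # ~ 4.0
```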
Maximum Likelihood Estimation
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE): θ̂_MLE = arg max_θ L_n(θ)

Under benign assumptions, as the number of observations n → ∞ we have θ̂_MLE → θ*.

The MLE is a "recipe" that begins with a model for the data, f(x; θ).
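To make the "recipe" concrete, here is a hedged sketch (not from the slides) that picks f(x; θ) to be a Gaussian, writes down the negative log-likelihood, and hands it to a generic optimizer; scipy.optimize.minimize is just one reasonable choice, and the data are simulated.

```python
# The generic MLE recipe, sketched: model f(x; theta), then maximize the log-likelihood
# numerically. Here theta = (mu, log sigma); the log-parameterization keeps sigma > 0.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=500)        # simulated observations

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)           # close to the closed-form MLE: sample mean and 1/n std
```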
Applications preview
Maximum Likelihood Estimation
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE): θ̂_MLE = arg max_θ L_n(θ)

Under benign assumptions, as the number of observations n → ∞ we have θ̂_MLE → θ*.

Why is it useful to recover the “true” parameters θ* of a probabilistic model?


• Estimation of the parameters θ* is the goal
• Help interpret or summarize large datasets
• Make predictions about future data
• Generate new data X ∼ f( · ; θ̂_MLE)
Estimation
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Opinion polls
How does the greater
population feel about an issue?
Correct for over-sampling?
• θ* is “true” average opinion
• X1, X2, … are sample calls

A/B testing
How do we figure out which ad
results in more click-through?
• θ* are the “true” average rates
• X1, X2, … are binary “clicks”
Interpret
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Customer segmentation / clustering


Can we identify distinct groups of
customers by their behavior?
• θ* describes “center” of distinct groups
• X1, X2, … are individual customers

Data exploration
What are the degrees of freedom of the
dataset?
• θ* describes the principal directions of
variation
• X1, X2, … are the individual images
Predict
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Content recommendation
Can we predict how much someone will
like a movie based on past ratings?
• θ* describes user’s preferences
• X1, X2, … are (movie, rating) pairs

Object recognition / classification


Identify a flower given just its picture?
• θ* describes the characteristics of
each kind of flower
• X1, X2, … are the (image, label) pairs
Generate
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Text generation
Can AI generate text that could have been written by a human?
"Kaia the dog wasn't a natural pick to go to mars. No one could have predicted she would…"
• θ* describes language structure
• X1, X2, … are text snippets found online

https://chat.openai.com/chat

Text-to-image generation
Can AI generate an image from a prompt? ("dog talking on cell phone under water, oil painting")
• θ* describes the coupled structure of images and text
• X1, X2, … are the (image, caption) pairs found online

https://labs.openai.com/
Linear Regression
The regression problem, 1-dimensional

Given past sales data on zillow.com, predict:

y = House sale price from
x = {# sq. ft.}

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ,  y_i ∈ ℝ

[Figure: scatter plot of Sale Price vs. # square feet]
Fit a function to our data, 1-d

Given past sales data on zillow.com, predict:

y = House sale price from
x = {# sq. ft.}

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ,  y_i ∈ ℝ

Hypothesis/Model: linear
Consider y_i = x_i w + ε_i where ε_i ~ N(0, σ²) i.i.d.

[Figure: best linear fit of Sale Price vs. # square feet]
The regression problem, d-dim

Given past sales data on zillow.com, predict:

y = House sale price from
x = {# sq. ft., zip code, date of sale, etc.}

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Hypothesis/Model: linear
Consider y_i = x_i^⊤ w + ε_i where ε_i ~ N(0, σ²) i.i.d.

[Figure: Sale price as a function of # square feet and # bathrooms]
The regression problem, d-dim

Given past sales data on zillow.com, predict:

y = House sale price from
x = {# sq. ft., zip code, date of sale, etc.}

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Hypothesis/Model: linear
Consider y_i = x_i^⊤ w + ε_i where ε_i ~ N(0, σ²) i.i.d.
Equivalently, p(y | x, w, σ) = (1 / √(2πσ²)) e^{−(y − x^⊤ w)² / (2σ²)}

[Figure: Sale price as a function of # square feet and # bathrooms]
Maximizing log-likelihood

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Model: p(y | x, w, σ) = (1 / √(2πσ²)) e^{−(y − x^⊤ w)² / (2σ²)}

Likelihood: P(D | w, σ) = ∏_{i=1}^n p(y_i | x_i, w, σ) = ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)}
Maximum Likelihood Estimation
Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE): θ̂_MLE = arg max_θ L_n(θ)

Under benign assumptions, as the number of observations n → ∞ we have θ̂_MLE → θ*.

Why is it useful to recover the “true” parameters θ* of a probabilistic model?


• Estimation of the parameters θ* is the goal
• Help interpret or summarize large datasets
• Make predictions about future data
• Generate new data X ∼ f( · ; θ̂_MLE)
Maximizing log-likelihood

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Model: p(y | x, w, σ) = (1 / √(2πσ²)) e^{−(y − x^⊤ w)² / (2σ²)}

Likelihood: P(D | w, σ) = ∏_{i=1}^n p(y_i | x_i, w, σ) = ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)}

Maximize (w.r.t. w): log P(D | w, σ) = log [ ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)} ]

[Handwritten derivation: the log turns the product into a sum, log P(D | w, σ) = n log(1/√(2πσ²)) − Σ_{i=1}^n (y_i − x_i^⊤ w)² / (2σ²); the first term does not depend on w, so
arg max_w log P(D | w, σ) = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)².
Taking the gradient in w and setting it to zero gives Σ_{i=1}^n x_i y_i = (Σ_{i=1}^n x_i x_i^⊤) w; assuming (Σ_{i=1}^n x_i x_i^⊤)^{-1} exists,
ŵ = (Σ_{i=1}^n x_i x_i^⊤)^{-1} Σ_{i=1}^n x_i y_i.]
Maximizing log-likelihood

Training Data: {(x_i, y_i)}_{i=1}^n,  x_i ∈ ℝ^d,  y_i ∈ ℝ

Model: p(y | x, w, σ) = (1 / √(2πσ²)) e^{−(y − x^⊤ w)² / (2σ²)}

Likelihood: P(D | w, σ) = ∏_{i=1}^n p(y_i | x_i, w, σ) = ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)}

Maximize (w.r.t. w): log P(D | w, σ) = log [ ∏_{i=1}^n (1 / √(2πσ²)) e^{−(y_i − x_i^⊤ w)² / (2σ²)} ]

ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²
Maximizing log-likelihood
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

Set the derivative to zero, solve for w.
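As a numerical cross-check on simulated data (not part of the slides), the sketch below minimizes the negative Gaussian log-likelihood in w with a generic optimizer and compares the result to the least-squares solution; the two agree up to optimizer tolerance, which is exactly the equivalence stated above.

```python
# Sketch: maximizing the Gaussian log-likelihood over w equals least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(w):
    resid = y - X @ w
    return n * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum(resid ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle)
print(w_ls)            # same vector up to optimizer tolerance
```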
Maximizing log-likelihood
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

Set the derivative to zero, solve for w:

ŵ_MLE = ( Σ_{i=1}^n x_i x_i^⊤ )^{-1} Σ_{i=1}^n x_i y_i
The regression problem in matrix notation
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

y = [y_1; …; y_n] ∈ ℝ^n    X = [x_1^⊤; …; x_n^⊤] ∈ ℝ^{n×d}

d : # of features
n : # of examples/datapoints
The regression problem in matrix notation
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

y = [y_1; …; y_n] ∈ ℝ^n    X = [x_1^⊤; …; x_n^⊤] ∈ ℝ^{n×d}

d : # of features
n : # of examples/datapoints

y_i = x_i^⊤ w + ε_i   ⟺   y = Xw + ε
The regression problem in matrix notation
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

y = [y_1; …; y_n]    X = [x_1^⊤; …; x_n^⊤]
d : # of features,  n : # of examples/datapoints

y_i = x_i^⊤ w + ε_i   ⟺   y = Xw + ε

ℓ2 norm: ∥z∥₂² = Σ_{i=1}^n z_i² = z^⊤ z

ŵ_LS = arg min_w ∥y − Xw∥₂²
     = arg min_w (y − Xw)^⊤ (y − Xw)

[Handwritten derivation: ∇_w (y − Xw)^⊤(y − Xw) = −2X^⊤ y + 2X^⊤ X w = 0; if (X^⊤ X)^{-1} exists, then ŵ_MLE = (X^⊤ X)^{-1} X^⊤ y. The annotation also notes the gradient rules ∇_w (g(w) + h(w)) = ∇_w g(w) + ∇_w h(w) and ∇_w (w^⊤ B w) = (B + B^⊤) w used along the way.]
The regression problem in matrix notation
ŵ_MLE = arg min_w Σ_{i=1}^n (y_i − x_i^⊤ w)²

y = [y_1; …; y_n]    X = [x_1^⊤; …; x_n^⊤]
d : # of features,  n : # of examples/datapoints

y_i = x_i^⊤ w + ε_i   ⟺   y = Xw + ε

ℓ2 norm: ∥z∥₂² = Σ_{i=1}^n z_i² = z^⊤ z

ŵ_LS = arg min_w ∥y − Xw∥₂²
     = arg min_w (y − Xw)^⊤ (y − Xw)

ŵ_LS = ŵ_MLE = (X^⊤ X)^{-1} X^⊤ y
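A minimal sketch of the closed form on simulated data. Forming (X^⊤X)^{-1} explicitly matches the formula, but solving the normal equations (or calling a least-squares routine) is the usual implementation choice because it is cheaper and better behaved when X^⊤X is ill-conditioned.

```python
# Closed-form least squares, two equivalent ways (illustrative data).
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 4
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, 0.0, -1.0, 3.0]) + rng.normal(scale=0.1, size=n)

w_inverse = np.linalg.inv(X.T @ X) @ X.T @ y     # literal (X^T X)^{-1} X^T y
w_solve = np.linalg.solve(X.T @ X, X.T @ y)      # solve the normal equations instead
print(np.allclose(w_inverse, w_solve))           # True
print(w_solve)                                   # ~ [2, 0, -1, 3]
```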
The regression problem in matrix notation

ŵ_LS = arg min_w ∥y − Xw∥₂² = (X^⊤ X)^{-1} X^⊤ y

What about an offset?

ŵ_LS, b̂_LS = arg min_{w,b} Σ_{i=1}^n ( y_i − (x_i^⊤ w + b) )²
            = arg min_{w,b} ∥y − (Xw + 1b)∥₂²
Dealing with an offset

ŵ_LS, b̂_LS = arg min_{w,b} ∥y − (Xw + 1b)∥₂²
Dealing with an offset

ŵ_LS, b̂_LS = arg min_{w,b} ∥y − (Xw + 1b)∥₂²

Setting the gradients with respect to w and b to zero:

X^⊤ X ŵ_LS + b̂_LS X^⊤ 1 = X^⊤ y
1^⊤ X ŵ_LS + b̂_LS 1^⊤ 1 = 1^⊤ y

If X^⊤ 1 = 0 (i.e., if each feature is mean-zero) then

ŵ_LS = (X^⊤ X)^{-1} X^⊤ y

b̂_LS = (1/n) Σ_{i=1}^n y_i
Make Predictions

ŵ_LS = (X^⊤ X)^{-1} X^⊤ y

b̂_LS = (1/n) Σ_{i=1}^n y_i

A new house is about to be listed. What should it sell for?

ŷ_new = x_new^⊤ ŵ_LS + b̂_LS
Process

Decide on a model for the likelihood function f(x; θ)

Find the function which fits the data best
  Choose a loss function, e.g., least squares
  Pick the function which minimizes the loss on the data

Use the function to make predictions on new examples

Linear regression with non-linear basis functions
Recap: Linear Regression
[Figure: label y vs. input x with two linear fits, f(x) = 400·x and f(x) = 100,000 + 200·x]

• In general, in high dimensions, we fit a linear model with intercept

  y_i ≃ w^⊤ x_i + b,  or equivalently  y_i = w^⊤ x_i + b + ϵ_i with error ϵ_i,

  with model parameters (w ∈ ℝ^d, b ∈ ℝ) that minimize the ℓ2-loss

  ℒ(w, b) = Σ_{i=1}^n ( y_i − (w^⊤ x_i + b) )²
Recap: Linear Regression
• The least squares solution, i.e. the minimizer of the ℓ2-loss, can be written in closed form as a function of the data X and y (or equivalently derived using straightforward linear algebra by setting the gradient to zero):

  [ŵ_LS; b̂_LS] = ( [X 1]^⊤ [X 1] )^{-1} [X 1]^⊤ y,  where 1 ∈ ℝ^n is the all-ones column
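Equivalently, the offset can be absorbed into the closed form above by appending a column of ones to X, as this sketch on simulated data does (np.linalg.lstsq handles the matrix-inverse details):

```python
# The augmented closed form [w_LS; b_LS] = ([X 1]^T [X 1])^{-1} [X 1]^T y (illustrative data).
import numpy as np

rng = np.random.default_rng(6)
n, d = 150, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 2.0, -1.0]) + 4.0 + rng.normal(scale=0.3, size=n)

X_aug = np.hstack([X, np.ones((n, 1))])          # append the all-ones column
wb, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = wb[:-1], wb[-1]
print(w_hat, b_hat)                              # ~ [1, 2, -1] and ~4
```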
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• Linear model with parameter (b, w_1):
  ŷ_i = b + w_1 x_i

[Figure: label y vs. input x]
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• Linear model with parameter (b, w_1):
  ŷ_i = b + w_1 x_i

• Quadratic model with parameter (b, w = [w_1; w_2]):
  ŷ_i = b + w_1 x_i + w_2 x_i²

[Figure: label y vs. input x]
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• Linear model with parameter (b, w_1):
  ŷ_i = b + w_1 x_i

• Quadratic model with parameter (b, w = [w_1; w_2]):
  ŷ_i = b + w_1 x_i + w_2 x_i²

• Degree-p polynomial model with parameter (b, w = [w_1; …; w_p]):
  ŷ_i = b + w_1 x_i + w_2 x_i² + … + w_p x_i^p

[Figure: label y vs. input x]
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• Linear model with parameter (b, w_1):
  ŷ_i = b + w_1 x_i

• Quadratic model with parameter (b, w = [w_1; w_2]):
  ŷ_i = b + w_1 x_i + w_2 x_i²

• Degree-p polynomial model with parameter (b, w = [w_1; …; w_p]):
  ŷ_i = b + w_1 x_i + w_2 x_i² + … + w_p x_i^p

• General p-features with parameter w = [w_1; …; w_p]:
  ŷ_i = ⟨w, h(x_i)⟩ where h : ℝ → ℝ^p

[Figure: label y vs. input x]
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• General p-features with parameter w = [w_1; …; w_p]:
  ŷ_i = ⟨w, h(x_i)⟩ where h : ℝ → ℝ^p

[Figure: label y vs. input x]

Note: h can be arbitrary non-linear functions! E.g.,
h(x) = [log(x), x², sin(x), x]
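As one hedged example of such an h (the specific features simply mirror the note above), the helper below maps each positive scalar input to the four features [log(x), x², sin(x), x]:

```python
# A possible non-linear feature map h: R -> R^4 (illustrative choice of features).
import numpy as np

def h(x):
    """Map positive scalars x to the features [log(x), x^2, sin(x), x], one row per input."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.log(x), x ** 2, np.sin(x), x])

x = np.array([0.5, 1.0, 2.0, 4.0])
print(h(x).shape)      # (4, 4): one row of p = 4 features per data point
```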



Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• General p-features with parameter w = [w_1; …; w_p]:
  ŷ_i = ⟨w, h(x_i)⟩ where h : ℝ → ℝ^p

[Figure: label y vs. input x]

How do we learn w?
Quadratic regression in 1-dimension
• Data: X = [x_1; x_2; …; x_n],  y = [y_1; y_2; …; y_n]

• General p-features with parameter w = [w_1; …; w_p]:
  ŷ_i = ⟨w, h(x_i)⟩ where h : ℝ → ℝ^p

[Figure: label y vs. input x]

How do we learn w?

  ŵ = arg min_w ∥Hw − y∥₂²,  where  H = [h(x_1)^⊤; …; h(x_n)^⊤] ∈ ℝ^{n×p}

For a new test point x, predict ŷ = ⟨ŵ, h(x)⟩
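Putting the pieces together, a short sketch on simulated 1-d data: build H from a degree-p polynomial feature map (with an extra intercept column, an assumption on my part to handle the offset), solve the least-squares problem for ŵ, then predict at a new test point.

```python
# Feature-map regression end to end (illustrative data and feature map).
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
x = rng.uniform(-2, 2, size=n)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.2, size=n)

def h(x, p):
    """Degree-p polynomial features [x, x^2, ..., x^p]."""
    x = np.atleast_1d(x).astype(float)
    return np.column_stack([x ** j for j in range(1, p + 1)])

H = np.hstack([h(x, p), np.ones((n, 1))])        # rows h(x_i)^T plus an intercept column
w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)    # arg min_w ||Hw - y||_2^2

x_new = 1.5
H_new = np.hstack([h(x_new, p), np.ones((1, 1))])
print(H_new @ w_hat)                             # prediction y_hat = <w_hat, h(x_new)>
```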
Which p should we choose?
• First instance of a class of models with different representation power = model complexity

[Figure: two fits of label y vs. input x with different model complexity]

• How do we determine which is the better model?
Generalization
• we say a predictor generalizes if it performs as well on unseen data as on training data (we will formalize this in the next lecture)
• the data used to train a predictor is training data or in-sample data
• we want the predictor to work on out-of-sample data
• we say a predictor fails to generalize if it performs well on in-sample data but does not perform well on out-of-sample data
Generalization
• we say a predictor generalizes if it performs as well on unseen data as on training data (we will formalize this in the next lecture)
• the data used to train a predictor is training data or in-sample data
• we want the predictor to work on out-of-sample data
• we say a predictor fails to generalize if it performs well on in-sample data but does not perform well on out-of-sample data

• train a cubic predictor on 32 (in-sample) white circles: Mean Squared Error (MSE) 174

• predict label y for 30 (out-of-sample) blue circles: MSE 192

• conclude this predictor/model generalizes, as in-sample MSE ≃ out-of-sample MSE


Split the data into training and testing

• a way to mimic how the predictor performs on unseen data

• given a single dataset S = {(x_i, y_i)}_{i=1}^n

• we split the dataset into two: training set and test set (e.g., 90/10)

• training set used to train the model:

  minimize ℒ_train(w) = (1 / |S_train|) Σ_{i ∈ S_train} (y_i − x_i^⊤ w)²

• test set used to evaluate the model:

  ℒ_test(w) = (1 / |S_test|) Σ_{i ∈ S_test} (y_i − x_i^⊤ w)²

• this assumes that the test set is similar to unseen data

• test set should never be used in training or picking unknowns
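A minimal sketch of the split on simulated data (the 90/10 ratio and the plain linear model are just the example from the slide; in practice the split is randomized once and the test set is then left untouched):

```python
# 90/10 train/test split and the two empirical losses (illustrative data).
import numpy as np

rng = np.random.default_rng(8)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=n)

perm = rng.permutation(n)                        # random split, done exactly once
train_idx, test_idx = perm[: int(0.9 * n)], perm[int(0.9 * n):]
X_tr, y_tr = X[train_idx], y[train_idx]
X_te, y_te = X[test_idx], y[test_idx]

w_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)   # train only on the training set

mse_train = np.mean((y_tr - X_tr @ w_hat) ** 2)
mse_test = np.mean((y_te - X_te @ w_hat) ** 2)        # evaluate once on the held-out set
print(mse_train, mse_test)
```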
Train/test error vs. complexity

[Figure: train and test error as a function of the degree p of the polynomial regression]

• Degree p = 5, since it achieves the minimum test error
• Train error monotonically decreases with model complexity
• Test error has a U shape
• The test set should never be used in training or picking the degree p
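The U-shaped test curve can be reproduced with a small sweep over the polynomial degree p on simulated data (a sketch; the degree range and noise level are arbitrary): train error keeps falling, while test error typically turns back up once the model overfits.

```python
# Sweep the polynomial degree and track train/test MSE (illustrative data).
import numpy as np

rng = np.random.default_rng(9)
n = 60
x = rng.uniform(-2, 2, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)

idx = rng.permutation(n)
tr, te = idx[: n // 2], idx[n // 2:]

def design(x, p):
    """Columns [1, x, x^2, ..., x^p]."""
    return np.column_stack([x ** j for j in range(p + 1)])

for p in range(1, 11):
    w, *_ = np.linalg.lstsq(design(x[tr], p), y[tr], rcond=None)
    mse_tr = np.mean((y[tr] - design(x[tr], p) @ w) ** 2)
    mse_te = np.mean((y[te] - design(x[te], p) @ w) ** 2)
    print(p, round(mse_tr, 3), round(mse_te, 3))   # train falls; test is typically U-shaped
```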
