
Machine Learning Basics

Lecture 2: Linear Classification


Princeton University COS 495
Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$
• s.t. the expected loss is small
  $L(f) = \mathbb{E}_{(x,y)\sim D}[l(f, x, y)]$
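To make the empirical-loss formula concrete, here is a minimal Python sketch (added for illustration; the hypothesis, toy data, and choice of squared loss are invented, not part of the lecture):

```python
import numpy as np

def empirical_loss(f, X, y):
    """Average loss of hypothesis f over the finite training sample.
    Here the loss l(f, x, y) is taken to be the squared error."""
    predictions = np.array([f(x) for x in X])
    return np.mean((predictions - y) ** 2)

# Toy sample; the expected loss under D cannot be computed from data alone,
# so we minimize this empirical average as a proxy.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.1, 3.9, 6.2])
print(empirical_loss(lambda x: 2.0 * x[0], X, y))
```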
Machine learning 1-2-3

• Collect data and extract features (the "experience")
• Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$ (where prior knowledge enters)
• Optimization: minimize the empirical loss
Example: Linear regression
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ (linear model $\mathcal{H}$) that minimizes the $l_2$ loss
  $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2$
Why the $l_2$ loss
• Why not choose another loss?
  • $l_1$ loss, hinge loss, exponential loss, …
• Empirical reason: easy to optimize
  • For the linear case there is a closed-form solution: $w = (X^T X)^{-1} X^T y$ (see the sketch below)
• Theoretical reason: a way to encode prior knowledge

Questions:
• What kind of prior knowledge?
• Is there a principled way to derive the loss?
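A minimal numpy sketch of that closed-form solution (an illustration added here; the toy design matrix and targets are invented, and the explicit inverse is shown only to mirror the formula, with `np.linalg.lstsq` as the numerically preferred route):

```python
import numpy as np

# Toy design matrix (n = 4 samples, d = 2 features) and targets.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([0.9, 3.1, 4.8, 7.2])

# Closed-form least squares: w = (X^T X)^{-1} X^T y.
w_explicit = np.linalg.inv(X.T @ X) @ X.T @ y
# Preferred in practice: solve the least-squares problem directly.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_explicit, w_lstsq)  # the two should agree
```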
Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE)
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• Would like to pick $\theta$ so that $P_\theta(x, y)$ fits the data well
Maximum Likelihood Estimation (MLE)
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• "Fitness" of $\theta$ to one data point $(x_i, y_i)$:
  $\text{likelihood}(\theta; x_i, y_i) := P_\theta(x_i, y_i)$
Maximum Likelihood Estimation (MLE)
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• "Fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$:
  $\text{likelihood}(\theta; \{(x_i, y_i)\}) := P_\theta(\{(x_i, y_i)\}) = \prod_i P_\theta(x_i, y_i)$
Maximum Likelihood Estimation (MLE)
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$
  $\theta_{ML} = \arg\max_{\theta \in \Theta} \prod_i P_\theta(x_i, y_i)$
Maximum Likelihood Estimation (MLE)
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$; since $\log$ is monotone, this is equivalent to maximizing the log-likelihood
  $\theta_{ML} = \arg\max_{\theta \in \Theta} \log\left[\prod_i P_\theta(x_i, y_i)\right]$
  $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$
Maximum Likelihood Estimation (MLE)
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• MLE is equivalent to minimizing the negative log-likelihood loss:
  $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$
  $l(P_\theta, x_i, y_i) = -\log P_\theta(x_i, y_i)$
  $\hat{L}(P_\theta) = -\sum_i \log P_\theta(x_i, y_i)$
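As a small illustration of "MLE = minimize the negative log-likelihood" (added here, not part of the slides; the coin-flip data are invented), for i.i.d. Bernoulli data the minimizer of the NLL is just the empirical frequency:

```python
import numpy as np

# Toy coin flips (1 = heads). The MLE of the heads probability
# should equal the empirical frequency of heads.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

thetas = np.linspace(0.01, 0.99, 99)
# Negative log-likelihood of i.i.d. Bernoulli(theta) data.
nll = -(flips.sum() * np.log(thetas)
        + (len(flips) - flips.sum()) * np.log(1 - thetas))
print(thetas[np.argmin(nll)], flips.mean())  # both should be ~0.7
```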
MLE: conditional log-likelihood
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(y \mid x): \theta \in \Theta\}$ be a family of conditional distributions indexed by $\theta$
  (we only care about predicting $y$ from $x$, not about modeling $p(x)$)
• MLE: negative conditional log-likelihood loss
  $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$
  $l(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$
  $\hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$
MLE: conditional log-likelihood
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(y \mid x): \theta \in \Theta\}$ be a family of conditional distributions indexed by $\theta$
• MLE: negative conditional log-likelihood loss
  $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$
  $l(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$
  $\hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$
• Modeling $P(y \mid x)$ is the discriminative approach; modeling $P(x, y)$ is the generative approach
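One line of algebra (added here for clarity; it is not on the original slide) shows why dropping $p(x)$ is harmless when the model of $x$ does not depend on $\theta$:

$\log P_\theta(x_i, y_i) = \log P_\theta(y_i \mid x_i) + \log P(x_i)$, so $\arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i) = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$ whenever the $\log P(x_i)$ terms are constant in $\theta$.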
Example: $l_2$ loss
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n}\sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$

$l_2$ loss: Normal + MLE
• Define $P_\theta(y \mid x) = \text{Normal}(y;\, f_\theta(x), \sigma^2)$
• Then $\log P_\theta(y_i \mid x_i) = -\frac{1}{2\sigma^2}(f_\theta(x_i) - y_i)^2 - \log\sigma - \frac{1}{2}\log(2\pi)$
• Hence $\theta_{ML} = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$
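A quick numerical check of this equivalence (a sketch added here; the synthetic data, the use of `scipy.optimize.minimize`, and the fixed $\sigma = 1$ are all choices made for illustration): minimizing the Gaussian negative log-likelihood over $w$ should recover the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=50)

def gaussian_nll(w, sigma=1.0):
    # Negative log-likelihood of y_i ~ Normal(w^T x_i, sigma^2).
    r = X @ w - y
    return np.sum(r**2 / (2 * sigma**2) + np.log(sigma) + 0.5 * np.log(2 * np.pi))

w_mle = minimize(gaussian_nll, x0=np.zeros(2)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle, w_ls)  # should agree up to optimizer tolerance
```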
Linear classification
Example 1: image classification

[Figure: example photos labeled "indoor" and "outdoor"]
Example 2: Spam detection
          | #"$" | #"Mr." | #"sale" | … | Spam?
Email 1   |  2   |   1    |    1    |   | Yes
Email 2   |  0   |   1    |    0    |   | No
Email 3   |  1   |   1    |    1    |   | Yes
…         |      |        |         |   |
Email n   |  0   |   0    |    0    |   | No
New email |  0   |   0    |    1    |   | ??
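A minimal sketch (added here; the keyword list and email text are invented) of how an email could be turned into the count features shown in the table:

```python
# Count how many times each keyword appears in the email body.
keywords = ["$", "Mr.", "sale"]

def email_to_features(text):
    return [text.count(k) for k in keywords]

print(email_to_features("Big sale, Mr. Smith: save $$ today"))  # [2, 1, 1]
```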
Why classification
• Classification: a kind of summary
• Easy to interpret
• Easy for making decisions
Linear classification

[Figure: a linear decision boundary $w^T x = 0$ with Class 1 on the side where $w^T x > 0$ and Class 0 on the side where $w^T x < 0$; the weight vector $w$ is drawn perpendicular to the boundary]

Note (translated from the handwritten annotation): $w$ acts as a normal vector to the boundary; points on the side opposite the normal have $w^T x < 0$, so the line in the middle splits the space into the two half-spaces $w^T x > 0$ and $w^T x < 0$.
Linear classification: natural attempt
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Hypothesis $f_w(x) = w^T x$ (linear model $\mathcal{H}$)
  • $y = 1$ if $w^T x > 0$
  • $y = 0$ if $w^T x < 0$
• Prediction: $y = \text{step}(f_w(x)) = \text{step}(w^T x)$
Linear classification: natural attempt
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ to minimize the 0-1 loss
  $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[\text{step}(w^T x_i) \ne y_i]$
• Drawback: difficult to optimize
  • NP-hard in the worst case
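A short numpy sketch of the step prediction and the 0-1 empirical loss it induces (illustration only; the weights and data are invented):

```python
import numpy as np

def predict(w, X):
    # step(w^T x): predict 1 if w^T x > 0, else 0
    return (X @ w > 0).astype(int)

def zero_one_loss(w, X, y):
    # Fraction of training points the classifier gets wrong.
    return np.mean(predict(w, X) != y)

X = np.array([[1.0, 2.0], [-1.0, 0.5], [2.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 0, 1, 0])
print(zero_one_loss(np.array([1.0, 0.0]), X, y))  # 0.0 for this toy data
```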
Linear classification: simple approach
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2$
• That is, reduce to linear regression, ignoring the fact that $y \in \{0, 1\}$
Linear classification: simple approach
• Drawback: not robust to "outliers"

[Figure borrowed from Pattern Recognition and Machine Learning, Bishop]
Compare the two
[Figure: $y = w^T x$ and $y = \text{step}(w^T x)$ plotted against $w^T x$]
Between the two
• Prediction bounded in $[0, 1]$
• Smooth
• Sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$

[Figure borrowed from Pattern Recognition and Machine Learning, Bishop]


Linear classification: sigmoid prediction
• Squash the output of the linear function:
  $\text{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
• Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (\sigma(w^T x_i) - y_i)^2$
Linear classification: logistic regression
• Squash the output of the linear function:
  $\text{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
• A better approach: interpret the output as a probability
  $P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
  $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$
Linear classification: logistic regression
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $w$ that minimizes
  $\hat{L}(w) = -\frac{1}{n}\sum_{i=1}^{n} \log P_w(y_i \mid x_i)$
  $\hat{L}(w) = -\frac{1}{n}\sum_{i: y_i = 1} \log\sigma(w^T x_i) - \frac{1}{n}\sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$
• Logistic regression: MLE with the sigmoid model
Linear classification: logistic regression
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $w$ that minimizes
  $\hat{L}(w) = -\frac{1}{n}\sum_{i: y_i = 1} \log\sigma(w^T x_i) - \frac{1}{n}\sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$
• No closed-form solution; need to use gradient descent (a sketch follows below)
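A minimal gradient-descent sketch for this objective (added for illustration; the synthetic data, step size, and iteration count are arbitrary choices, not the lecture's code):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_nll(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)  # clip for numerical safety
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    # Gradient of the averaged negative log-likelihood: (1/n) X^T (sigma(Xw) - y)
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)

w = np.zeros(2)
for _ in range(500):              # fixed step size, no convergence check
    w -= 0.5 * gradient(w, X, y)
print(w, logistic_nll(w, X, y))
```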
Properties of the sigmoid function
• Bounded:
  $\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)$
• Symmetric:
  $1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)$
• Gradient:
  $\sigma'(a) = \frac{\exp(-a)}{(1 + \exp(-a))^2} = \sigma(a)(1 - \sigma(a))$
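A quick finite-difference check of the gradient identity (added here for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)   # central difference
analytic = sigmoid(a) * (1 - sigmoid(a))                # sigma'(a) identity
print(np.max(np.abs(numeric - analytic)))               # should be ~1e-10 or smaller
```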
