ML_basics_lecture2_linear_classification
Prior knowledge
Example: Linear regression
• Given training data {(x_i, y_i) : 1 ≤ i ≤ n} i.i.d. from distribution D
• Find f_w(x) = w^T x in the linear model class ℋ that minimizes the l2 loss
  L(f_w) = (1/n) Σ_{i=1}^n (w^T x_i − y_i)^2
Why l2 loss?
• Why not choose another loss?
  • l1 loss, hinge loss, exponential loss, …
• Empirical answer: the l2 loss is easy to optimize
  • For the linear case it even has a closed form: w = (X^T X)^{−1} X^T y
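To make the closed form concrete, here is a minimal numpy sketch; the toy data and dimensions are made up for illustration, and the linear solve avoids forming the inverse explicitly.

```python
import numpy as np

# Toy data: n = 100 examples with d = 3 features (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Closed-form least-squares solution w = (X^T X)^{-1} X^T y,
# computed via a linear solve instead of an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# l2 training loss L(f_w) = (1/n) * sum_i (w^T x_i - y_i)^2
l2_loss = np.mean((X @ w_hat - y) ** 2)
print(w_hat, l2_loss)
```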
Questions:
• What kind of prior knowledge?
• Principled way to derive the loss?
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE)
• Given training data {(x_i, y_i) : 1 ≤ i ≤ n} i.i.d. from distribution D
• Let {P_θ(x, y) : θ ∈ Θ} be a family of distributions indexed by θ
• MLE: negative log-likelihood loss
  l(P_θ, (x_i, y_i)) = − log P_θ(x_i, y_i)
  L(P_θ) = − Σ_i log P_θ(x_i, y_i)
MLE: conditional log-likelihood
• Given training data {(x_i, y_i) : 1 ≤ i ≤ n} i.i.d. from distribution D
• Let {P_θ(y | x) : θ ∈ Θ} be a family of conditional distributions indexed by θ
  (modeling P(y | x) is discriminative; modeling P(x, y) is generative)
• MLE: negative conditional log-likelihood loss
  l(P_θ, (x_i, y_i)) = − log P_θ(y_i | x_i)
  L(P_θ) = − Σ_i log P_θ(y_i | x_i)
  θ_ML = argmax_{θ ∈ Θ} Σ_i log P_θ(y_i | x_i)
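As a concrete (if simplistic) illustration of MLE as minimizing a negative conditional log-likelihood, the sketch below assumes a hypothetical one-parameter family P_θ(y | x) = Normal(y; θx, σ²) and finds θ_ML by grid search; all data and constants are made up.

```python
import numpy as np

# Toy data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 * x + 0.3 * rng.normal(size=200)

def neg_log_likelihood(theta, x, y, sigma=0.3):
    # L(P_theta) = -sum_i log P_theta(y_i | x_i), where P_theta(y | x)
    # is Gaussian with mean theta * x and fixed standard deviation sigma.
    log_p = -0.5 * ((y - theta * x) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return -np.sum(log_p)

# MLE by brute-force grid search: theta_ML = argmax sum_i log P_theta(y_i | x_i)
thetas = np.linspace(-3, 3, 601)
theta_ml = thetas[np.argmin([neg_log_likelihood(t, x, y) for t in thetas])]
print(theta_ml)   # close to 1.5
```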
Example: l2 loss
• Given training data {(x_i, y_i) : 1 ≤ i ≤ n} i.i.d. from distribution D
• Find f_θ(x) that minimizes L(f_θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)^2
• Define P_θ(y | x) = Normal(y; f_θ(x), σ^2)
• Then log P_θ(y_i | x_i) = −(1/(2σ^2)) (f_θ(x_i) − y_i)^2 − log σ − (1/2) log(2π)
• So θ_ML = argmin_{θ ∈ Θ} (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)^2, i.e. MLE under a Gaussian model recovers the l2 loss
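A quick numeric sanity check of this derivation: assuming a fixed σ, the average Gaussian negative log-likelihood equals the l2 loss scaled by 1/(2σ²) plus constants that do not depend on θ, so both objectives have the same minimizer. The predictions below are random stand-ins for f_θ(x_i).

```python
import numpy as np

rng = np.random.default_rng(1)
pred = rng.normal(size=50)          # stand-in for f_theta(x_i), made up for illustration
y = rng.normal(size=50)
sigma = 0.7

# Average Gaussian negative log-likelihood
nll = np.mean(0.5 * ((pred - y) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

# The same quantity rebuilt from the l2 loss plus theta-independent constants
l2 = np.mean((pred - y) ** 2)
rebuilt = l2 / (2 * sigma ** 2) + np.log(sigma) + 0.5 * np.log(2 * np.pi)

print(np.isclose(nll, rebuilt))     # True
```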
Linear classification
Example 1: image classification
(figure: example photos labeled indoor vs. outdoor)
Example 2: Spam detection
            #"$"   #"Mr."   #"sale"   …   Spam?
Email 1      2       1        1            Yes
Email 2      0       1        0            No
Email 3      1       1        1            Yes
…
Email n      0       0        0            No
New email    0       0        1            ??
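A minimal sketch of the representation the table suggests: each email becomes a vector of keyword counts and a linear rule scores it. The weight vector here is hypothetical, chosen only to illustrate the idea, not learned from these emails.

```python
import numpy as np

# Rows: emails 1-3 plus the new email; columns: counts of "$", "Mr.", "sale".
X = np.array([[2, 1, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])
y = np.array([1, 0, 1])          # spam labels (1 = Yes) for the first three emails

w = np.array([1.0, -0.5, 1.0])   # hypothetical weights, not learned from this data
pred = (X @ w > 0).astype(int)   # linear classification rule: w^T x > 0 -> spam

print(pred[:3], y)               # predictions vs. labels on the labeled emails
print(pred[3])                   # prediction for the new email
```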
Why classification
• Classification: a kind of summary
• Easy to interpret
• Easy for making decisions
Linear classification
• Decision boundary: w^T x = 0; predict class 1 if w^T x > 0 and class 0 if w^T x < 0
• Prediction: y = step(w^T x)
• Drawback: not robust to “outliers”
(figure: linear decision boundary w^T x = 0 with normal vector w separating the two classes; plot of step(w^T x) against w^T x)
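A tiny sketch of the decision rule y = step(w^T x); the weight vector and the two points are made up so that one lands on each side of the boundary.

```python
import numpy as np

def step(a):
    # step(a) = 1 if a > 0, else 0
    return (a > 0).astype(int)

w = np.array([1.0, -1.0])             # hypothetical weight vector
X = np.array([[2.0, 1.0],             # point with w^T x > 0 -> class 1
              [0.5, 2.0]])            # point with w^T x < 0 -> class 0
print(step(X @ w))                    # [1 0]
```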
Between the two (linear output and step function)
• Prediction bounded in [0,1]
• Smooth
• Sigmoid: σ(a) = 1 / (1 + exp(−a))
• Find w that minimizes L(f_w) = (1/n) Σ_{i=1}^n (σ(w^T x_i) − y_i)^2
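A minimal sketch of this sigmoid-squashed l2 objective; the inputs, labels, and weight vector are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def squared_loss(w, X, y):
    # L(f_w) = (1/n) * sum_i (sigmoid(w^T x_i) - y_i)^2
    return np.mean((sigmoid(X @ w) - y) ** 2)

# Toy inputs and 0/1 labels, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
print(squared_loss(np.array([1.0, -1.0]), X, y))
```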
Linear classification: logistic regression
• Squash the output of the linear function:
  sigmoid(w^T x) = σ(w^T x) = 1 / (1 + exp(−w^T x))
• A better approach: interpret it as a probability
  P_w(y = 1 | x) = σ(w^T x) = 1 / (1 + exp(−w^T x))
  P_w(y = 0 | x) = 1 − P_w(y = 1 | x) = 1 − σ(w^T x)
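A short sketch of this probabilistic reading, with a hypothetical weight vector; the last lines also show the standard compact Bernoulli form that combines both cases.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.5, -1.0])     # hypothetical weights, for illustration
x = np.array([2.0, 1.0])

p1 = sigmoid(w @ x)           # P_w(y = 1 | x)
p0 = 1.0 - p1                 # P_w(y = 0 | x)

# Both cases combine into a single Bernoulli likelihood:
# P_w(y | x) = sigmoid(w^T x)^y * (1 - sigmoid(w^T x))^(1 - y)
for y in (0, 1):
    print(y, p1 ** y * p0 ** (1 - y))
```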
Linear classification: logistic regression
• Given training data {(x_i, y_i) : 1 ≤ i ≤ n} i.i.d. from distribution D
• Find w that minimizes the negative conditional log-likelihood
  L(w) = −(1/n) Σ_{i=1}^n log P_w(y_i | x_i)
       = −(1/n) Σ_{i: y_i=1} log σ(w^T x_i) − (1/n) Σ_{i: y_i=0} log[1 − σ(w^T x_i)]
• Logistic regression = MLE with the sigmoid model
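A minimal end-to-end sketch of minimizing this loss, assuming plain full-batch gradient descent; the toy data, step size, and iteration count are all made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y):
    # L(w) = -(1/n) sum_{i: y_i=1} log sigmoid(w^T x_i)
    #        -(1/n) sum_{i: y_i=0} log(1 - sigmoid(w^T x_i))
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # Gradient of the negative log-likelihood: (1/n) X^T (sigmoid(Xw) - y)
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Toy data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

w = np.zeros(2)
for _ in range(200):            # plain full-batch gradient descent
    w -= 0.1 * grad(w, X, y)
print(w, nll(w, X, y))
```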