DM - Lecture 3
LOGISTIC REGRESSION
Jesse Davis
Motivating Problem: Expected Goals
Y is binary:
E[Y | X = x] = Σ_y P(Y = y | X = x) · y
= P(Y = 1 | X = x) · 1 + P(Y = 0 | X = x) · 0
= P(Y = 1 | X = x)
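As a quick numerical illustration (the goal probability below is a made-up value), the sample mean of a binary outcome approaches P(Y = 1), which is why an expected-goals value is simply the predicted probability of scoring:

```python
import random

random.seed(0)
p_goal = 0.1   # hypothetical P(Y = 1 | X = x) for some shot situation

# Simulate many binary outcomes Y in {0, 1} for that situation.
outcomes = [1 if random.random() < p_goal else 0 for _ in range(100_000)]

# E[Y | X = x] = P(Y = 1 | X = x): the average outcome is close to p_goal.
print(sum(outcomes) / len(outcomes))
```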
Suppose Y takes one of K values, Y ∈ {1, …, K}.
Model the probabilities as:
P(Y_j = 1 | x_j, W) = exp(w_1 · x_j) / (1 + Σ_{k=1}^{K−1} exp(w_k · x_j))
…
P(Y_j = K−1 | x_j, W) = exp(w_{K−1} · x_j) / (1 + Σ_{k=1}^{K−1} exp(w_k · x_j))
P(Y_j = K | x_j, W) = 1 / (1 + Σ_{k=1}^{K−1} exp(w_k · x_j))
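A minimal sketch of these K probabilities in code, assuming w_k · x_j denotes a dot product; the weight matrix and feature vector below are made up, and class K plays the role of the reference class whose score is fixed at 0:

```python
import numpy as np

def class_probabilities(x, W):
    """Return [P(Y = 1 | x, W), ..., P(Y = K | x, W)] given the K-1 weight
    vectors stacked as rows of W; class K is the reference class."""
    scores = np.exp(W @ x)                  # exp(w_k . x) for k = 1, ..., K-1
    denom = 1.0 + scores.sum()              # 1 + sum_{k=1}^{K-1} exp(w_k . x)
    return np.append(scores, 1.0) / denom   # last entry is P(Y = K | x, W)

# Example with K = 3 classes and d = 2 features (assumed values).
W = np.array([[0.5, -1.0],
              [1.5,  0.2]])
x = np.array([1.0, 2.0])
probs = class_probabilities(x, W)
print(probs, probs.sum())   # the K probabilities sum to 1
```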
Feature: Dist * Angle
Understanding 1-D Logistic Regression
P(Y=1 | X) = 0.5 when −w0 − w1·x = 0, i.e., x = −w0 / w1
Steepness of curve controlled by w1
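A small sketch (with assumed weights) checking both facts: the probability crosses 0.5 exactly at x = −w0/w1, and a larger |w1| makes the transition around that point steeper:

```python
import math

def p1(x, w0, w1):
    # 1-D logistic regression: P(Y = 1 | x) = 1 / (1 + exp(-w0 - w1 * x))
    return 1.0 / (1.0 + math.exp(-w0 - w1 * x))

w0, w1 = 2.0, -0.5
print(p1(-w0 / w1, w0, w1))   # exactly 0.5 at x = -w0/w1

# Steepness: with the midpoint fixed at x = 0 (w0 = 0), compare the change
# in probability over the same small interval for a small vs. a large w1.
for w1_test in (0.5, 5.0):
    print(w1_test, p1(0.5, 0.0, w1_test) - p1(-0.5, 0.0, w1_test))
```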
What’s the Best Representation for Modeling Shot Location?
Two features:
Raw X
Raw Y
Learn: Weights
Question: What would make good weights?
Insight: If yj = 1, want P(Y=1 | xj) > P(Y=0 | xj)
If yj = 0, want P(Y=1 | xj) close to 0
If yj = 1, want P(Y=1 | xj) close to 1
How Do We Select Weights?
Idea: argmax_w Σ_{j=1}^{n} ln P(y_j | x_j, w)
argmax_w Σ_{j=1}^{n} ln P(y_j | x_j, w) = argmax_w Σ_{j=1}^{n} [ y_j ln p_j + (1 − y_j) ln(1 − p_j) ],
where p_j = P(y_j = 1 | x_j, w)
= argmax_w Σ_{j=1}^{n} [ y_j ln( exp(w · x_j) / (1 + exp(w · x_j)) ) + (1 − y_j) ln( 1 / (1 + exp(w · x_j)) ) ]
Apply ln(x/y) = ln x − ln y:
= argmax_w Σ_{j=1}^{n} [ y_j (w · x_j) − y_j ln(1 + exp(w · x_j)) − (1 − y_j) ln(1 + exp(w · x_j)) ]
Combining the two log terms and writing w · x_j = w_0 + Σ_{i=1}^{d} w_i x_{j,i} with an explicit intercept:
argmax_w Σ_{j=1}^{n} [ y_j (w_0 + Σ_{i=1}^{d} w_i x_{j,i}) − ln(1 + e^{w_0 + Σ_{i=1}^{d} w_i x_{j,i}}) ]
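A quick numerical check of this simplification, using made-up weights and one example: y ln p + (1 − y) ln(1 − p), with p = exp(w·x)/(1 + exp(w·x)), equals y·(w_0 + Σ_i w_i x_i) − ln(1 + e^{w_0 + Σ_i w_i x_i}):

```python
import math

def log_likelihood_terms(y, x, w0, w):
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))   # w0 + sum_i w_i x_i
    p = math.exp(z) / (1.0 + math.exp(z))           # P(y = 1 | x, w)
    direct = y * math.log(p) + (1 - y) * math.log(1 - p)
    simplified = y * z - math.log(1.0 + math.exp(z))
    return direct, simplified

# Assumed example with d = 2 features; both forms agree for y = 1 and y = 0.
print(log_likelihood_terms(1, [0.3, -1.2], 0.5, [1.0, 2.0]))
print(log_likelihood_terms(0, [0.3, -1.2], 0.5, [1.0, 2.0]))
```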
Gradient Ascent: repeatedly update each weight in the direction of the gradient, w_i ← w_i + η · ∂J(w)/∂w_i
Derivative of Logistic Regression Objective
J(w) = Σ_{j=1}^{n} [ y_j (w_0 + Σ_{i=1}^{d} w_i x_{j,i}) − ln(1 + e^{w_0 + Σ_{i=1}^{d} w_i x_{j,i}}) ]
Writing z_j = w_0 + Σ_{k=1}^{d} w_k x_{j,k} for brevity:
∂J(w)/∂w_i = Σ_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{z_j})) · ∂(1 + e^{z_j})/∂w_i ]
= Σ_{j=1}^{n} [ y_j x_{j,i} − (e^{z_j} / (1 + e^{z_j})) · ∂z_j/∂w_i ]
= Σ_{j=1}^{n} [ y_j x_{j,i} − (e^{z_j} / (1 + e^{z_j})) · x_{j,i} ]
∂J(w)/∂w_i = Σ_{j=1}^{n} x_{j,i} ( y_j − P(y_j = 1 | x_j, w) )
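A minimal gradient-ascent sketch built directly on this gradient; the toy data, learning rate, and number of steps are assumptions for illustration only:

```python
import numpy as np

def prob_one(X, w):
    # P(y = 1 | x, w), with X already carrying a leading column of 1s for w0.
    return 1.0 / (1.0 + np.exp(-X @ w))

def gradient_ascent(X, y, lr=0.1, steps=200):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 so w[0] is w0
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (y - prob_one(X, w))          # sum_j x_{j,i} (y_j - P(y_j = 1 | x_j, w))
        w += lr * grad                             # gradient ascent step
    return w

# Toy one-feature data set (assumed): larger values are more often labelled 1.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(gradient_ascent(X, y))   # learned [w0, w1]
```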
Sketch of Gradient Ascent for Logistic Regression
Suppose
Weight vector: w = [1, -2, -1]
Example: ((1, 1), +)
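To make the sketch concrete, here is one hand-checkable update for this setup, reading w = [1, −2, −1] as [w0, w1, w2] and ((1, 1), +) as features x = (1, 1) with label y = 1; the learning rate η below is an assumed value, not given on the slide:

```python
import math

w = [1.0, -2.0, -1.0]   # [w0, w1, w2]
x = [1.0, 1.0]          # features (x1, x2)
y = 1                   # "+" label
eta = 0.5               # assumed learning rate

# Current prediction: z = w0 + w1*x1 + w2*x2 = 1 - 2 - 1 = -2
z = w[0] + w[1] * x[0] + w[2] * x[1]
p = 1.0 / (1.0 + math.exp(-z))          # ~0.12, far below the true label 1

# Gradient ascent step: w_i <- w_i + eta * x_i * (y - p), with x_0 = 1 for w0.
grad = [1.0 * (y - p), x[0] * (y - p), x[1] * (y - p)]
w = [wi + eta * gi for wi, gi in zip(w, grad)]
print(p, w)   # updated weights increase P(y = 1 | x, w) on this example
```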
Ideas why there is a drop?
Aside
[Figure: search results page showing the keyword query, the algorithmic search results, and ad slots 1 and 2]
Advertising Model (Simplified)
Advertisers
Bid $X per keyword w
Pay $X every time someone clicks on the ad
Google’s insight
Profit = $X × click-through rate (CTR)
The key to profit
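A tiny illustration of the profit formula with made-up bids and click-through rates, showing why the highest bidder is not automatically the most profitable ad to show:

```python
# Hypothetical ads: bid in dollars per click and estimated click-through rate.
ads = {"ad_A": {"bid": 5.00, "ctr": 0.01},   # high bid, rarely clicked
       "ad_B": {"bid": 2.00, "ctr": 0.05}}   # lower bid, clicked more often

# Expected profit per impression = $X * CTR.
for name, ad in ads.items():
    print(name, ad["bid"] * ad["ctr"])   # ad_A: 0.05, ad_B: 0.10
```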