DM - Lecture 3


1

LOGISTIC REGRESSION

Jesse Davis
Motivation Problem: Expected Goals
2

Question: What constitutes a good shot in soccer?

Objective: Expected number of goals scored by a shot


3 The Logistic Regression Model
Scoring Problems:
A Regression Approach
4
• Our objective: Learn a function to approximate E[Y | X] for each X

• Y is binary:

    E[Y | X = x] = \sum_y P(Y = y | X = x) \cdot y
                 = P(Y = 1 | X = x) \cdot 1 + P(Y = 0 | X = x) \cdot 0
                 = P(Y = 1 | X = x)

• New objective: Estimate the posterior class probabilities
Binary Logistic Regression
5
• Discriminative model for learning: P(Y | X)

• Form: Sigmoid applied to a linear function of the data

    P(Y_j = 0 | x_j, w) = 1 / (1 + \exp(w_0 + \sum_{i=1}^{d} w_i x_{j,i}))

    P(Y_j = 1 | x_j, w) = \exp(w_0 + \sum_{i=1}^{d} w_i x_{j,i}) / (1 + \exp(w_0 + \sum_{i=1}^{d} w_i x_{j,i}))

    (each w_i is a real-valued weight; x_{j,i} is the value of feature i for example j)

• Make a decision by thresholding P(Y=1 | X):
    - If P(Y=1 | X) > t, predict positive
    - If P(Y=1 | X) ≤ t, predict negative
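To make the form concrete, here is a minimal Python sketch (not from the slides) of computing P(Y=1 | x, w) and thresholding it; the weights, threshold, and feature values below are made up purely for illustration.

```python
import math

def p_y1(x, w0, w):
    """P(Y=1 | x, w) = exp(w0 + sum_i w_i * x_i) / (1 + exp(w0 + sum_i w_i * x_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable evaluation of the sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(x, w0, w, t=0.5):
    """Predict positive (1) when P(Y=1 | x) exceeds the threshold t, else negative (0)."""
    return 1 if p_y1(x, w0, w) > t else 0

# Made-up weights and one example.
w0, w = 0.5, [1.2, -0.8]
x = [0.3, 1.1]
print(p_y1(x, w0, w), predict(x, w0, w, t=0.5))
```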
Multinomial Logistic Regression
6
• Suppose Y takes one of K values: {1, …, K}
• Model the probabilities as:

    P(Y_j = 1 | x_j, W)   = \exp(w_1 x_j) / (1 + \sum_{k=1}^{K-1} \exp(w_k x_j))
        ⋮
    P(Y_j = K−1 | x_j, W) = \exp(w_{K−1} x_j) / (1 + \sum_{k=1}^{K-1} \exp(w_k x_j))

    P(Y_j = K | x_j, W)   = 1 / (1 + \sum_{k=1}^{K-1} \exp(w_k x_j))

• Learn a separate weight vector w_k for each of the classes 1, …, K−1

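As an illustration of these K-class probabilities, a small sketch (not from the slides) assuming the K−1 weight vectors are stored as rows of a list W; all numbers are made up.

```python
import math

def multinomial_probs(x, W):
    """Return [P(Y=1 | x), ..., P(Y=K | x)] given K-1 weight vectors W[0..K-2].

    P(Y=k | x) = exp(w_k . x) / (1 + sum_k' exp(w_k' . x)) for k = 1..K-1,
    P(Y=K | x) = 1 / (1 + sum_k' exp(w_k' . x)).
    """
    scores = [math.exp(sum(wi * xi for wi, xi in zip(w_k, x))) for w_k in W]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# Made-up example with K = 3 classes and 2 features.
W = [[0.4, -1.0],   # weight vector for class 1
     [1.5, 0.2]]    # weight vector for class 2
probs = multinomial_probs([1.0, 2.0], W)
print(probs, sum(probs))  # the K probabilities sum to 1
```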
Features for Logistic Regression
7
• Binary: Penalty = Yes / No

• Discrete: Dummy coding, so use k − 1 indicator values
    BodyPart = {Head, Foot, Other}
                 Head   Foot   Other
      x_{i,h}      1      0      0
      x_{i,f}      0      1      0

• Reals: x position of the shot

• Complex variables:
    - Dist = distance to the center of the goal
    - Angle = angle to the goal
    - Dist * Angle
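A sketch (not from the slides) of how such a feature vector might be assembled for one shot. The pitch dimensions, goal position, and the particular definition of "angle to goal" below are assumptions made purely for illustration.

```python
import math

# Assumed conventions, for illustration only: a 105 x 68 m pitch with the
# centre of the goal at (105, 34).
GOAL_X, GOAL_Y = 105.0, 34.0

def shot_features(x, y, body_part):
    """Feature vector for one shot: dummy-coded body part (k - 1 indicators,
    with 'Other' as the reference level), raw location, distance and angle to
    the goal, and the Dist * Angle interaction."""
    head = 1.0 if body_part == "Head" else 0.0
    foot = 1.0 if body_part == "Foot" else 0.0

    dist = math.hypot(GOAL_X - x, GOAL_Y - y)
    # One possible definition of "angle to goal": how far off-centre the shot
    # is, i.e. the angle between the shot-to-goal-centre line and straight on.
    angle = math.atan2(abs(GOAL_Y - y), GOAL_X - x)
    return [head, foot, x, y, dist, angle, dist * angle]

print(shot_features(94.0, 38.0, "Foot"))
```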
Understanding 1-D Logistic Regression
8
• As x → +∞, P(Y=1 | x) → 1 (assuming w_1 > 0)
• As x → −∞, P(Y=1 | x) → 0
• P(Y=1 | x) = 0.5 when w_0 + w_1 x = 0, i.e. at x = −w_0 / w_1
• The steepness of the curve is controlled by w_1
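These properties can be checked numerically; a quick Python sketch (not from the slides), with made-up weights and w_1 > 0.

```python
import math

def p1(x, w0, w1):
    """1-D logistic regression: P(Y=1 | x) = 1 / (1 + exp(-(w0 + w1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

w0, w1 = 2.0, 4.0                           # made-up weights, w1 > 0
print(p1(50.0, w0, w1), p1(-50.0, w0, w1))  # -> close to 1 and close to 0
print(p1(-w0 / w1, w0, w1))                 # -> exactly 0.5 at x = -w0 / w1

# Steepness: with w0 = 0 both curves cross 0.5 at x = 0, but the curve with
# the larger w1 moves away from 0.5 much faster near the crossing point.
print(p1(0.2, 0.0, 1.0), p1(0.2, 0.0, 5.0))  # ~0.55 vs ~0.73
```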
What’s the Best Representation for Modeling Shot Location?
9

• Two features: raw X and raw Y
• Distance d to the center of the goal
• Angle α to the goal
• Polar coordinates
• Grid cells
• Some combination?
Logistic Regression:
Effect of Feature Space
10
[Figure: model predictions using Raw (X, Y) vs. using Distance to Goal]

Logistic Regression vs.
Probability Trees
11
[Figure: Raw (X, Y) with Logistic Regression vs. Raw (X, Y) with a Probability Tree]

Hand-crafted vs. Learned Features
12
Learner                Feature Set                                   Brier Score
Logistic Regression    Raw (X, Y), Distance, Angle, Angle*Distance   0.0784
Probability Tree       Raw (X, Y)                                    0.0787
Logistic Regression    Raw (X, Y), Distance                          0.0791
Logistic Regression    Raw (X, Y)                                    0.0837

(Brier score = MSE of the predicted probabilities; lower is better)
“Learned partition” of the pitch shockingly good
But requires 100,000 shots to train…
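For reference, a Brier score like the ones in the table is just the mean squared difference between the predicted probability and the 0/1 outcome; the sketch below (with made-up shots) is one way to compute it.

```python
def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes; lower is better."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# Made-up shots: outcome (1 = goal) and the model's predicted goal probability.
y = [0, 0, 1, 0, 1]
p = [0.05, 0.20, 0.40, 0.10, 0.70]
print(brier_score(y, p))  # ~0.10 for these made-up numbers
```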
13 Training Logistic Regression
Logistic Regression Training Task
14
• Given: Data S = ((x_1, y_1), …, (x_n, y_n))
• Learn: Weights w

Question: What would make good weights?
• Insight: If y_j = 1, want P(Y=1 | x_j) > P(Y=0 | x_j)
    - If y_j = 0, want P(Y=1 | x_j) close to 0
    - If y_j = 1, want P(Y=1 | x_j) close to 1

• Idea:  argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w)

  Push the predicted probability towards 0 for negative examples
  and towards 1 for positive examples
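A small sketch (not from the slides) of the quantity being maximized: each positive example contributes ln P(Y=1 | x_j), each negative example contributes ln P(Y=0 | x_j); the data and weights below are made up.

```python
import math

def p_y1(x, w0, w):
    """P(Y=1 | x, w) under the logistic regression model."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(data, w0, w):
    """sum_j ln P(y_j | x_j, w) over a data set of (x, y) pairs with y in {0, 1}."""
    total = 0.0
    for x, y in data:
        p = p_y1(x, w0, w)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Made-up data; weights that fit the labels better give a higher log-likelihood.
data = [([1.0, 0.5], 1), ([0.2, 0.1], 0), ([0.9, 0.8], 1)]
print(log_likelihood(data, w0=-1.0, w=[2.0, 1.0]))
print(log_likelihood(data, w0=0.0, w=[0.0, 0.0]))   # ln(0.5) per example
```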
How Do We Select Weights?
15
Let y_j ∈ {0, 1} and p_j = P(y_j = 1 | x_j, w).
For brevity, let w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}.

argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w)
    = \sum_{j=1}^{n} [ y_j \ln p_j + (1 − y_j) \ln(1 − p_j) ]

    = \sum_{j=1}^{n} [ y_j \ln( \exp(w x_j) / (1 + \exp(w x_j)) ) + (1 − y_j) \ln( 1 / (1 + \exp(w x_j)) ) ]

      (the argument of the first ln is P(Y_j = 1 | x_j); the second is P(Y_j = 0 | x_j))
How Do We Select Weights?
16
Let y_j ∈ {0, 1} and p_j = P(y_j = 1 | x_j, w), with w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}.

argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w)
    = \sum_{j=1}^{n} [ y_j \ln p_j + (1 − y_j) \ln(1 − p_j) ]

    = \sum_{j=1}^{n} [ y_j \ln( \exp(w x_j) / (1 + \exp(w x_j)) ) + (1 − y_j) \ln( 1 / (1 + \exp(w x_j)) ) ]

    = \sum_{j=1}^{n} [ y_j w x_j − y_j \ln(1 + \exp(w x_j)) + (1 − y_j)(−\ln(1 + \exp(w x_j))) ]

      (apply \ln(x / y) = \ln x − \ln y)
How Do We Select Weights?
17
Let y_j ∈ {0, 1} and p_j = P(y_j = 1 | x_j, w), with w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}.

argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w)
    = \sum_{j=1}^{n} [ y_j \ln p_j + (1 − y_j) \ln(1 − p_j) ]

    = \sum_{j=1}^{n} [ y_j \ln( \exp(w x_j) / (1 + \exp(w x_j)) ) + (1 − y_j) \ln( 1 / (1 + \exp(w x_j)) ) ]

    = \sum_{j=1}^{n} [ y_j w x_j − y_j \ln(1 + \exp(w x_j)) + (1 − y_j)(−\ln(1 + \exp(w x_j))) ]

    = \sum_{j=1}^{n} [ y_j w x_j − y_j \ln(1 + \exp(w x_j)) + y_j \ln(1 + \exp(w x_j)) − \ln(1 + \exp(w x_j)) ]
      (the two ± y_j \ln(1 + \exp(w x_j)) terms cancel)

    = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]
How Do We Select Weights?
18
argmax_w \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

• Bad: No closed-form solution
• Good: Concave ⇒ easy to optimize

Gradient Ascent:
• Start from an initial guess for w
• While not converged:
      w^{t+1} ← w^{t} + η ∇l(w^{t})

[Figure: a concave objective l(w) over w, with the gradient ∇l(w) pointing towards the goal (the maximum)]
Derivative of Logistic Regression
Objective
19
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]
Derivative of Logistic Regression
Objective
20
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]
Derivative of Logistic Regression
Objective
21
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]

Chain rule: if F(x) = f(g(x)), then F′(x) = f′(g(x)) g′(x)
Here: ∂/∂x \ln(x) = 1/x
Derivative of Logistic Regression
Objective
22
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) ∂/∂w_i (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) ]

Chain rule: if F(x) = f(g(x)), then F′(x) = f′(g(x)) g′(x)
Here: ∂/∂x e^x = e^x
Derivative of Logistic Regression
Objective
23
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) ∂/∂w_i (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) x_{j,i} ]
Derivative of Logistic Regression
Objective
24
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) ∂/∂w_i (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) x_{j,i} ]

           = \sum_{j=1}^{n} x_{j,i} ( y_j − P(y_j = 1 | x_j, w) )
Sketch of Gradient Ascent for
Logistic Regression
25
Randomly initialize w_0^0, w_1^0, …, w_d^0

while (not converged) do
    Compute the gradient of J(w): [ ∂J(w)/∂w_0, …, ∂J(w)/∂w_d ]

    w_0^{t+1} ← w_0^{t} + η \sum_{j=1}^{n} ( y_j − P(y_j = 1 | x_j, w^t) )

    for each variable i do
        w_i^{t+1} ← w_i^{t} + η \sum_{j=1}^{n} x_{j,i} ( y_j − P(y_j = 1 | x_j, w^t) )
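Below is a minimal Python sketch of this procedure (not the course's reference implementation): batch gradient ascent with a fixed learning rate η and a fixed iteration count as a crude stand-in for a convergence test; the data set and hyperparameters are made up.

```python
import math
import random

def p_y1(x, w0, w):
    """P(y=1 | x, w) under the logistic regression model (numerically stable)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(data, d, eta=0.1, n_iters=1000, seed=0):
    """Batch gradient ascent on the log-likelihood, using the updates above:
       w_0 <- w_0 + eta * sum_j (y_j - P(y_j=1 | x_j, w))
       w_i <- w_i + eta * sum_j x_{j,i} * (y_j - P(y_j=1 | x_j, w))
    A fixed number of iterations stands in for a real convergence check."""
    rng = random.Random(seed)
    w0 = rng.uniform(-0.1, 0.1)
    w = [rng.uniform(-0.1, 0.1) for _ in range(d)]
    for _ in range(n_iters):
        errs = [(x, y - p_y1(x, w0, w)) for x, y in data]   # y_j - P(y_j=1 | x_j, w)
        w0 += eta * sum(e for _, e in errs)
        for i in range(d):
            w[i] += eta * sum(x[i] * e for x, e in errs)
    return w0, w

# Tiny made-up data set: the label is 1 when the first feature is large.
data = [([0.1, 1.0], 0), ([0.4, 0.2], 0), ([1.5, 0.3], 1), ([2.0, 0.8], 1)]
w0, w = train(data, d=2)
print(w0, w, [round(p_y1(x, w0, w), 2) for x, _ in data])
```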
Example of Update
26
• Suppose
    - Weight vector: w = [1, −2, −1]
    - Example: ((1, 1), +)
• Thus: P(+ | (1,1), w) ≈ 0.12

    ∂l(w)/∂w_1 = 1 · [1 − 0.12]        ∂l(w)/∂w_2 = 1 · [1 − 0.12]

• The updates are (with η > 0):
    w_1^{t+1} ← −2 + η(0.88)
    w_2^{t+1} ← −1 + η(0.88)
  For every x_i ≠ 0, w_i increases, making P(Y=1 | (1,1), w^{t+1}) > P(Y=1 | (1,1), w^t)

• Incorrect prediction: for all x_i ≠ 0, w_i changes
  ⇒ Weights are (somewhat) interdependent!
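The numbers on this slide can be checked directly; a quick sketch with a made-up learning rate η, which also updates the intercept w_0 following the full update rule from the previous slide (the slide itself only shows w_1 and w_2).

```python
import math

w0, w1, w2 = 1.0, -2.0, -1.0      # weight vector w = [1, -2, -1]
x1, x2, y = 1.0, 1.0, 1           # the positive example ((1, 1), +)

z = w0 + w1 * x1 + w2 * x2        # = -2
p = math.exp(z) / (1.0 + math.exp(z))
print(round(p, 2))                # ~0.12, as on the slide

# Gradient components x_i * (y - p) for the two (non-zero) features:
print(round(x1 * (y - p), 2), round(x2 * (y - p), 2))   # both ~0.88

# One gradient-ascent step with a made-up eta; P(Y=1 | (1,1), w) increases.
eta = 0.5
w0, w1, w2 = (w0 + eta * (y - p),
              w1 + eta * x1 * (y - p),
              w2 + eta * x2 * (y - p))
z_new = w0 + w1 * x1 + w2 * x2
print(math.exp(z_new) / (1.0 + math.exp(z_new)) > p)    # True
```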
An Important Detail:
Often Want to Use Regularization
27
• Apply regularization for better generalization

• Idea: Penalize high weights

    argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w) − λ R(w)

    L2: R(w) = \|w\|_2^2 = \sum_{i=1}^{d} w_i^2
    L1: R(w) = \|w\|_1 = \sum_{i=1}^{d} |w_i|

• λ encodes a tradeoff between fitting the data (a.k.a. the training loss) and model complexity
Notes on Regularization
28
• Tune λ: The best value is highly problem dependent
    - A large λ makes overfitting less likely
    - If λ is too large, the model may underfit

• L2 works with standard gradient methods:

    w_i^{t+1} ← w_i^{t} + η ( −λ w_i^{t} + \sum_{j=1}^{n} x_{j,i} ( y_j − P(y_j = 1 | x_j, w^t) ) )

  Note: Do not regularize w_0

• L1 requires specialized optimizers
    - Also enforces sparsity: sets some weights exactly to zero
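A sketch of how the L2 update above could be dropped into the earlier gradient-ascent loop (again not a reference implementation; the data, η, and λ values are made up). Note that w_0 is updated without the −λw_0 term.

```python
import math
import random

def p_y1(x, w0, w):
    """P(y=1 | x, w), numerically stable."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_l2(data, d, eta=0.1, lam=0.5, n_iters=1000, seed=0):
    """Gradient ascent on the L2-regularized log-likelihood:
       w_i <- w_i + eta * (-lam * w_i + sum_j x_{j,i} * (y_j - P(y_j=1 | x_j, w)))
    The intercept w_0 is not regularized."""
    rng = random.Random(seed)
    w0 = rng.uniform(-0.1, 0.1)
    w = [rng.uniform(-0.1, 0.1) for _ in range(d)]
    for _ in range(n_iters):
        errs = [(x, y - p_y1(x, w0, w)) for x, y in data]
        w0 += eta * sum(e for _, e in errs)
        for i in range(d):
            w[i] += eta * (-lam * w[i] + sum(x[i] * e for x, e in errs))
    return w0, w

# Same made-up data as before: a larger lam shrinks the weights towards zero.
data = [([0.1, 1.0], 0), ([0.4, 0.2], 0), ([1.5, 0.3], 1), ([2.0, 0.8], 1)]
print(train_l2(data, d=2, lam=0.0)[1])
print(train_l2(data, d=2, lam=2.0)[1])
```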
29 Applications
Challenge Problem: Flu Prediction
30
• Goal: Predict flu outbreaks

• Think about it for 5 minutes:
    - What data could you use?
    - What features would you look at?
    - What else would be important?


Logistic Regression: Predicting Flu
31
• Influenza results in 250,000–500,000 deaths worldwide each year

• Quickly detecting outbreaks can reduce risk

• Currently, the Centers for Disease Control and Prevention (CDC) tracks this by aggregating counts of people with flu-like illnesses

• Problem: This is slow and not timely

Ginsberg et al., Nature 2009


http://static.googleusercontent.com/media/research.google.com/en//archive/papers/detecting-influenza-epidemics.pdf
Idea: Exploit Query Logs
32
• Goal: Predict the proportion of doctor visits that are flu-related, as calculated by the CDC

• Data: Historical Google search data

• Model: Logistic regression

• Features: The queries most correlated with the target variable
How Many Queries to Use?
33

Any ideas why there is a drop?
Aside
34
• Often there is a variable, not used in the analysis, that is correlated with both the target variable (Y) and the predictor variables (X)
    - In this example, it is the season

• In causal studies, these are called confounding variables

• This occurs often, so watch out for it!


Predictive Performance
35
[Figure: fit to the training data vs. fit to the test data]


Logistic Regression: Predicting CTR
for Advertising
36
[Diagram: search results page for a keyword query, with the algorithmic search results shown beside Ad slot 1 and Ad slot 2]
Advertising Model (Simplified)
37
• Advertisers:
    - Bid $X on a keyword w
    - Pay $X every time someone clicks on the ad
    - Give a maximum budget

• Question: Someone searches on w; which ad should the search engine display?
Advertising Model
38
• Naïve solution: Just return the ad with the highest bid on w

• Google’s insight:
    - Profit = $X * click-through rate (CTR)   ← the key for profit
    - Display based on CTR
Approach: Logistic Regression
39
• Features include:
    - Bid term CTRs of related ads from other accounts (e.g., CTRs of other ads with the keyword “camera”)
    - Appearance, advertiser reputation, landing page quality, relevance of the bid terms to the ad, words in the ad
Summary
40
• Logistic regression is a workhorse in practice

• Feature construction is key, given the limited representational power of logistic regression

• Fast training and predictions

• Many relevant industry applications


Questions?
41
