DM - Lecture 3


1

LOGISTIC REGRESSION

Jesse Davis
Motivation Problem: Expected Goals
2

Question: What constitutes a good shot in soccer?

Objective: Expected number of goals scored by a shot


3 The Logistic Regression Model
Scoring Problems:
A Regression Approach
4
• Our objective: Learn a function to approximate E[Y | X] for each X

• Y is binary:

    E[Y | X = x] = \sum_y P(Y = y | X = x) \cdot y
                 = P(Y = 1 | X = x) \cdot 1 + P(Y = 0 | X = x) \cdot 0
                 = P(Y = 1 | X = x)

• New objective: Estimate the posterior class probabilities
Binary Logistic Regression
5
• Discriminative model for learning: P(Y | X)

• Form: Sigmoid applied to a linear function of the data

    P(Y_j = 0 | x_j, w) = 1 / (1 + \exp(w_0 + \sum_{i=1}^{d} w_i x_{j,i}))

    P(Y_j = 1 | x_j, w) = \exp(w_0 + \sum_{i=1}^{d} w_i x_{j,i}) / (1 + \exp(w_0 + \sum_{i=1}^{d} w_i x_{j,i}))

    (each w_i is a real-valued weight; x_{j,i} is the value of feature i for example j)

• Make a decision by thresholding P(Y=1 | X):
    - If P(Y=1 | X) > t, predict positive
    - If P(Y=1 | X) ≤ t, predict negative
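To make the form concrete, here is a minimal Python sketch (not from the slides) of computing P(Y=1 | x, w) and thresholding it; the weights, threshold, and feature values below are made up purely for illustration.

```python
import math

def p_y1(x, w0, w):
    """P(Y=1 | x, w) = exp(w0 + sum_i w_i * x_i) / (1 + exp(w0 + sum_i w_i * x_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable evaluation of the sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(x, w0, w, t=0.5):
    """Predict positive (1) when P(Y=1 | x) exceeds the threshold t, else negative (0)."""
    return 1 if p_y1(x, w0, w) > t else 0

# Made-up weights and one example.
w0, w = 0.5, [1.2, -0.8]
x = [0.3, 1.1]
print(p_y1(x, w0, w), predict(x, w0, w, t=0.5))
```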
Multinomial Logistic Regression
6
• Suppose Y takes one of K values: {1, …, K}
• Model the probabilities as:

    P(Y_j = 1 | x_j, W)   = \exp(w_1 x_j) / (1 + \sum_{k=1}^{K-1} \exp(w_k x_j))
        ⋮
    P(Y_j = K−1 | x_j, W) = \exp(w_{K−1} x_j) / (1 + \sum_{k=1}^{K-1} \exp(w_k x_j))

    P(Y_j = K | x_j, W)   = 1 / (1 + \sum_{k=1}^{K-1} \exp(w_k x_j))

• Learn a separate weight vector w_k for each of the classes 1, …, K−1

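As an illustration of these K-class probabilities, a small sketch (not from the slides) assuming the K−1 weight vectors are stored as rows of a list W; all numbers are made up.

```python
import math

def multinomial_probs(x, W):
    """Return [P(Y=1 | x), ..., P(Y=K | x)] given K-1 weight vectors W[0..K-2].

    P(Y=k | x) = exp(w_k . x) / (1 + sum_k' exp(w_k' . x)) for k = 1..K-1,
    P(Y=K | x) = 1 / (1 + sum_k' exp(w_k' . x)).
    """
    scores = [math.exp(sum(wi * xi for wi, xi in zip(w_k, x))) for w_k in W]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# Made-up example with K = 3 classes and 2 features.
W = [[0.4, -1.0],   # weight vector for class 1
     [1.5, 0.2]]    # weight vector for class 2
probs = multinomial_probs([1.0, 2.0], W)
print(probs, sum(probs))  # the K probabilities sum to 1
```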
Features for Logistic Regression
7
• Binary: Penalty = Yes / No

• Discrete: Dummy coding, so use k − 1 indicator values
    BodyPart = {Head, Foot, Other}
                 Head   Foot   Other
      x_{i,h}      1      0      0
      x_{i,f}      0      1      0

• Reals: x position of the shot

• Complex variables:
    - Dist = distance to the center of the goal
    - Angle = angle to the goal
    - Dist * Angle
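A sketch (not from the slides) of how such a feature vector might be assembled for one shot. The pitch dimensions, goal position, and the particular definition of "angle to goal" below are assumptions made purely for illustration.

```python
import math

# Assumed conventions, for illustration only: a 105 x 68 m pitch with the
# centre of the goal at (105, 34).
GOAL_X, GOAL_Y = 105.0, 34.0

def shot_features(x, y, body_part):
    """Feature vector for one shot: dummy-coded body part (k - 1 indicators,
    with 'Other' as the reference level), raw location, distance and angle to
    the goal, and the Dist * Angle interaction."""
    head = 1.0 if body_part == "Head" else 0.0
    foot = 1.0 if body_part == "Foot" else 0.0

    dist = math.hypot(GOAL_X - x, GOAL_Y - y)
    # One possible definition of "angle to goal": how far off-centre the shot
    # is, i.e. the angle between the shot-to-goal-centre line and straight on.
    angle = math.atan2(abs(GOAL_Y - y), GOAL_X - x)
    return [head, foot, x, y, dist, angle, dist * angle]

print(shot_features(94.0, 38.0, "Foot"))
```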
Understanding 1-D Logistic Regression
8
• As x → +∞, P(Y=1 | x) → 1 (assuming w_1 > 0)
• As x → −∞, P(Y=1 | x) → 0
• P(Y=1 | x) = 0.5 when w_0 + w_1 x = 0, i.e. at x = −w_0 / w_1
• The steepness of the curve is controlled by w_1
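These properties can be checked numerically; a quick Python sketch (not from the slides), with made-up weights and w_1 > 0.

```python
import math

def p1(x, w0, w1):
    """1-D logistic regression: P(Y=1 | x) = 1 / (1 + exp(-(w0 + w1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

w0, w1 = 2.0, 4.0                           # made-up weights, w1 > 0
print(p1(50.0, w0, w1), p1(-50.0, w0, w1))  # -> close to 1 and close to 0
print(p1(-w0 / w1, w0, w1))                 # -> exactly 0.5 at x = -w0 / w1

# Steepness: with w0 = 0 both curves cross 0.5 at x = 0, but the curve with
# the larger w1 moves away from 0.5 much faster near the crossing point.
print(p1(0.2, 0.0, 1.0), p1(0.2, 0.0, 5.0))  # ~0.55 vs ~0.73
```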
What’s the Best Representation for Modeling Shot Location?
9

• Two features: raw X and raw Y
• Distance d to the center of the goal
• Angle α to the goal
• Polar coordinates
• Grid cells
• Some combination?
Logistic Regression:
Effect of Feature Space
10
[Figure: model predictions using Raw (X, Y) vs. using Distance to Goal]

Logistic Regression vs.
Probability Trees
11
[Figure: Raw (X, Y) with Logistic Regression vs. Raw (X, Y) with a Probability Tree]

Hand-crafted vs. Learned Features
12
Learner                Feature Set                                   Brier Score
Logistic Regression    Raw (X, Y), Distance, Angle, Angle*Distance   0.0784
Probability Tree       Raw (X, Y)                                    0.0787
Logistic Regression    Raw (X, Y), Distance                          0.0791
Logistic Regression    Raw (X, Y)                                    0.0837

(Brier score = MSE of the predicted probabilities; lower is better)
“Learned partition” of the pitch shockingly good
But requires 100,000 shots to train…
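For reference, a Brier score like the ones in the table is just the mean squared difference between the predicted probability and the 0/1 outcome; the sketch below (with made-up shots) is one way to compute it.

```python
def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes; lower is better."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# Made-up shots: outcome (1 = goal) and the model's predicted goal probability.
y = [0, 0, 1, 0, 1]
p = [0.05, 0.20, 0.40, 0.10, 0.70]
print(brier_score(y, p))  # ~0.10 for these made-up numbers
```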
13 Training Logistic Regression
Logistic Regression Training Task
14
• Given: Data S = ((x_1, y_1), …, (x_n, y_n))
• Learn: Weights w

Question: What would make good weights?
• Insight: If y_j = 1, want P(Y=1 | x_j) > P(Y=0 | x_j)
    - If y_j = 0, want P(Y=1 | x_j) close to 0
    - If y_j = 1, want P(Y=1 | x_j) close to 1

• Idea:  argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w)

  Push the predicted probability towards 0 for negative examples
  and towards 1 for positive examples
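A small sketch (not from the slides) of the quantity being maximized: each positive example contributes ln P(Y=1 | x_j), each negative example contributes ln P(Y=0 | x_j); the data and weights below are made up.

```python
import math

def p_y1(x, w0, w):
    """P(Y=1 | x, w) under the logistic regression model."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(data, w0, w):
    """sum_j ln P(y_j | x_j, w) over a data set of (x, y) pairs with y in {0, 1}."""
    total = 0.0
    for x, y in data:
        p = p_y1(x, w0, w)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Made-up data; weights that fit the labels better give a higher log-likelihood.
data = [([1.0, 0.5], 1), ([0.2, 0.1], 0), ([0.9, 0.8], 1)]
print(log_likelihood(data, w0=-1.0, w=[2.0, 1.0]))
print(log_likelihood(data, w0=0.0, w=[0.0, 0.0]))   # ln(0.5) per example
```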
How Do We Select Weights?
15
Let y_j ∈ {0, 1} and p_j = P(y_j = 1 | x_j, w).
For brevity, let w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}.

argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w)
    = \sum_{j=1}^{n} [ y_j \ln p_j + (1 − y_j) \ln(1 − p_j) ]

    = \sum_{j=1}^{n} [ y_j \ln( \exp(w x_j) / (1 + \exp(w x_j)) ) + (1 − y_j) \ln( 1 / (1 + \exp(w x_j)) ) ]

      (the argument of the first ln is P(Y_j = 1 | x_j); the second is P(Y_j = 0 | x_j))
How Do We Select Weights?
16
Let y_j ∈ {0, 1} and p_j = P(y_j = 1 | x_j, w), with w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}.

argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w)
    = \sum_{j=1}^{n} [ y_j \ln p_j + (1 − y_j) \ln(1 − p_j) ]

    = \sum_{j=1}^{n} [ y_j \ln( \exp(w x_j) / (1 + \exp(w x_j)) ) + (1 − y_j) \ln( 1 / (1 + \exp(w x_j)) ) ]

    = \sum_{j=1}^{n} [ y_j w x_j − y_j \ln(1 + \exp(w x_j)) + (1 − y_j)(−\ln(1 + \exp(w x_j))) ]

      (apply \ln(x / y) = \ln x − \ln y)
How Do We Select Weights?
17
Let y_j ∈ {0, 1} and p_j = P(y_j = 1 | x_j, w), with w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}.

argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w)
    = \sum_{j=1}^{n} [ y_j \ln p_j + (1 − y_j) \ln(1 − p_j) ]

    = \sum_{j=1}^{n} [ y_j \ln( \exp(w x_j) / (1 + \exp(w x_j)) ) + (1 − y_j) \ln( 1 / (1 + \exp(w x_j)) ) ]

    = \sum_{j=1}^{n} [ y_j w x_j − y_j \ln(1 + \exp(w x_j)) + (1 − y_j)(−\ln(1 + \exp(w x_j))) ]

    = \sum_{j=1}^{n} [ y_j w x_j − y_j \ln(1 + \exp(w x_j)) + y_j \ln(1 + \exp(w x_j)) − \ln(1 + \exp(w x_j)) ]
      (the two ± y_j \ln(1 + \exp(w x_j)) terms cancel)

    = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]
How Do We Select Weights?
18
argmax_w \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

• Bad: No closed-form solution
• Good: Concave ⇒ easy to optimize

Gradient Ascent:
• Start from an initial guess for w
• While not converged:
      w^{t+1} ← w^{t} + η ∇l(w^{t})

[Figure: a concave objective l(w) over w, with the gradient ∇l(w) pointing towards the goal (the maximum)]
Derivative of Logistic Regression
Objective
19
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]
Derivative of Logistic Regression
Objective
20
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]
Derivative of Logistic Regression
Objective
21
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]

Chain rule: if F(x) = f(g(x)), then F′(x) = f′(g(x)) g′(x)
Here: ∂/∂x \ln(x) = 1/x
Derivative of Logistic Regression
Objective
22
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) ∂/∂w_i (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) ]

Chain rule: if F(x) = f(g(x)), then F′(x) = f′(g(x)) g′(x)
Here: ∂/∂x e^x = e^x
Derivative of Logistic Regression
Objective
23
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) ∂/∂w_i (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) x_{j,i} ]
Derivative of Logistic Regression
Objective
24
J(w) = \sum_{j=1}^{n} [ y_j (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) − \ln(1 + e^{w_0 + \sum_{i=1}^{d} w_i x_{j,i}}) ]

Writing w x_j = w_0 + \sum_{i=1}^{d} w_i x_{j,i}:

∂J(w)/∂w_i = \sum_{j=1}^{n} [ y_j x_{j,i} − (1 / (1 + e^{w x_j})) ∂/∂w_i (1 + e^{w x_j}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) ∂/∂w_i (w_0 + \sum_{i=1}^{d} w_i x_{j,i}) ]

           = \sum_{j=1}^{n} [ y_j x_{j,i} − (e^{w x_j} / (1 + e^{w x_j})) x_{j,i} ]

           = \sum_{j=1}^{n} x_{j,i} ( y_j − P(y_j = 1 | x_j, w) )
Sketch of Gradient Ascent for
Logistic Regression
25
Randomly initialize w_0^0, w_1^0, …, w_d^0

while (not converged) do
    Compute the gradient of J(w): [ ∂J(w)/∂w_0, …, ∂J(w)/∂w_d ]

    w_0^{t+1} ← w_0^{t} + η \sum_{j=1}^{n} ( y_j − P(y_j = 1 | x_j, w^t) )

    for each variable i do
        w_i^{t+1} ← w_i^{t} + η \sum_{j=1}^{n} x_{j,i} ( y_j − P(y_j = 1 | x_j, w^t) )
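Below is a minimal Python sketch of this procedure (not the course's reference implementation): batch gradient ascent with a fixed learning rate η and a fixed iteration count as a crude stand-in for a convergence test; the data set and hyperparameters are made up.

```python
import math
import random

def p_y1(x, w0, w):
    """P(y=1 | x, w) under the logistic regression model (numerically stable)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(data, d, eta=0.1, n_iters=1000, seed=0):
    """Batch gradient ascent on the log-likelihood, using the updates above:
       w_0 <- w_0 + eta * sum_j (y_j - P(y_j=1 | x_j, w))
       w_i <- w_i + eta * sum_j x_{j,i} * (y_j - P(y_j=1 | x_j, w))
    A fixed number of iterations stands in for a real convergence check."""
    rng = random.Random(seed)
    w0 = rng.uniform(-0.1, 0.1)
    w = [rng.uniform(-0.1, 0.1) for _ in range(d)]
    for _ in range(n_iters):
        errs = [(x, y - p_y1(x, w0, w)) for x, y in data]   # y_j - P(y_j=1 | x_j, w)
        w0 += eta * sum(e for _, e in errs)
        for i in range(d):
            w[i] += eta * sum(x[i] * e for x, e in errs)
    return w0, w

# Tiny made-up data set: the label is 1 when the first feature is large.
data = [([0.1, 1.0], 0), ([0.4, 0.2], 0), ([1.5, 0.3], 1), ([2.0, 0.8], 1)]
w0, w = train(data, d=2)
print(w0, w, [round(p_y1(x, w0, w), 2) for x, _ in data])
```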
Example of Update
26
• Suppose
    - Weight vector: w = [1, −2, −1]
    - Example: ((1, 1), +)
• Thus: P(+ | (1,1), w) ≈ 0.12

    ∂l(w)/∂w_1 = 1 · [1 − 0.12]        ∂l(w)/∂w_2 = 1 · [1 − 0.12]

• The updates are (with η > 0):
    w_1^{t+1} ← −2 + η(0.88)
    w_2^{t+1} ← −1 + η(0.88)
  For every x_i ≠ 0, w_i increases, making P(Y=1 | (1,1), w^{t+1}) > P(Y=1 | (1,1), w^t)

• Incorrect prediction: for all x_i ≠ 0, w_i changes
  ⇒ Weights are (somewhat) interdependent!
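The numbers on this slide can be checked directly; a quick sketch with a made-up learning rate η, which also updates the intercept w_0 following the full update rule from the previous slide (the slide itself only shows w_1 and w_2).

```python
import math

w0, w1, w2 = 1.0, -2.0, -1.0      # weight vector w = [1, -2, -1]
x1, x2, y = 1.0, 1.0, 1           # the positive example ((1, 1), +)

z = w0 + w1 * x1 + w2 * x2        # = -2
p = math.exp(z) / (1.0 + math.exp(z))
print(round(p, 2))                # ~0.12, as on the slide

# Gradient components x_i * (y - p) for the two (non-zero) features:
print(round(x1 * (y - p), 2), round(x2 * (y - p), 2))   # both ~0.88

# One gradient-ascent step with a made-up eta; P(Y=1 | (1,1), w) increases.
eta = 0.5
w0, w1, w2 = (w0 + eta * (y - p),
              w1 + eta * x1 * (y - p),
              w2 + eta * x2 * (y - p))
z_new = w0 + w1 * x1 + w2 * x2
print(math.exp(z_new) / (1.0 + math.exp(z_new)) > p)    # True
```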
An Important Detail:
Often Want to Use Regularization
27
• Apply regularization for better generalization

• Idea: Penalize high weights

    argmax_w \sum_{j=1}^{n} \ln P(y_j | x_j, w) − λ R(w)

    L2: R(w) = \|w\|_2^2 = \sum_{i=1}^{d} w_i^2
    L1: R(w) = \|w\|_1 = \sum_{i=1}^{d} |w_i|

• λ encodes a tradeoff between fitting the data (a.k.a. the training loss) and model complexity
Notes on Regularization
28
• Tune λ: The best value is highly problem dependent
    - A large λ makes overfitting less likely
    - If λ is too large, the model may underfit

• L2 works with standard gradient methods:

    w_i^{t+1} ← w_i^{t} + η ( −λ w_i^{t} + \sum_{j=1}^{n} x_{j,i} ( y_j − P(y_j = 1 | x_j, w^t) ) )

  Note: Do not regularize w_0

• L1 requires specialized optimizers
    - Also enforces sparsity: sets some weights exactly to zero
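A sketch of how the L2 update above could be dropped into the earlier gradient-ascent loop (again not a reference implementation; the data, η, and λ values are made up). Note that w_0 is updated without the −λw_0 term.

```python
import math
import random

def p_y1(x, w0, w):
    """P(y=1 | x, w), numerically stable."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_l2(data, d, eta=0.1, lam=0.5, n_iters=1000, seed=0):
    """Gradient ascent on the L2-regularized log-likelihood:
       w_i <- w_i + eta * (-lam * w_i + sum_j x_{j,i} * (y_j - P(y_j=1 | x_j, w)))
    The intercept w_0 is not regularized."""
    rng = random.Random(seed)
    w0 = rng.uniform(-0.1, 0.1)
    w = [rng.uniform(-0.1, 0.1) for _ in range(d)]
    for _ in range(n_iters):
        errs = [(x, y - p_y1(x, w0, w)) for x, y in data]
        w0 += eta * sum(e for _, e in errs)
        for i in range(d):
            w[i] += eta * (-lam * w[i] + sum(x[i] * e for x, e in errs))
    return w0, w

# Same made-up data as before: a larger lam shrinks the weights towards zero.
data = [([0.1, 1.0], 0), ([0.4, 0.2], 0), ([1.5, 0.3], 1), ([2.0, 0.8], 1)]
print(train_l2(data, d=2, lam=0.0)[1])
print(train_l2(data, d=2, lam=2.0)[1])
```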
29 Applications
Challenge Problem: Flu Prediction
30
• Goal: Predict flu outbreaks

• Think about it for 5 minutes:
    - What data could you use?
    - What features would you look at?
    - What else would be important?


Logistic Regression: Predicting Flu
31
• Influenza results in 250,000–500,000 deaths worldwide each year

• Quickly detecting outbreaks can reduce risk

• Currently, the Centers for Disease Control and Prevention (CDC) tracks this by aggregating counts of people with flu-like illnesses

• Problem: This is slow and not timely

Ginsberg et al., Nature 2009


http://static.googleusercontent.com/media/research.google.com/en//archive/papers/detecting-influenza-epidemics.pdf
Idea: Exploit Query Logs
32
• Goal: Predict the proportion of doctor visits that are flu-related, as calculated by the CDC

• Data: Historical Google search data

• Model: Logistic regression

• Features: The queries most correlated with the target variable
How Many Queries to Use?
33

Any ideas why there is a drop?
Aside
34
• Often there is a variable, not used in the analysis, that is correlated with both the target variable (Y) and the predictor variables (X)
    - In this example, it is the season

• In causal studies, these are called confounding variables

• This occurs often, so watch out for it!


Predictive Performance
35
[Figure: fit to the training data vs. fit to the test data]


Logistic Regression: Predicting CTR
for Advertising
36
[Diagram: search results page for a keyword query, with the algorithmic search results shown beside Ad slot 1 and Ad slot 2]
Advertising Model (Simplified)
37
• Advertisers:
    - Bid $X on a keyword w
    - Pay $X every time someone clicks on the ad
    - Give a maximum budget

• Question: Someone searches on w; which ad should the search engine display?
Advertising Model
38
• Naïve solution: Just return the ad with the highest bid on w

• Google’s insight:
    - Profit = $X * click-through rate (CTR)   ← the key for profit
    - Display based on CTR
Approach: Logistic Regression
39
• Features include:
    - Bid term CTRs of related ads from other accounts (e.g., CTRs of other ads with the keyword “camera”)
    - Appearance, advertiser reputation, landing page quality, relevance of the bid terms to the ad, words in the ad
Summary
40
• Logistic regression is a workhorse in practice

• Feature construction is key, given the limited representational power of logistic regression

• Fast training and predictions

• Many relevant industry applications


Questions?
41
