
Regression and Classification
with Linear Models

CMPSCI 383
Nov 15, 2011

1
Today's topics

• Learning from Examples: brief review
• Univariate Linear Regression
• Batch gradient descent
• Stochastic gradient descent
• Multivariate Linear Regression
• Regularization
• Linear Classifiers
• Perceptron learning rule
• Logistic Regression

2
Learning from Examples (supervised learning)

3–8
Important issues

• Generalization
• Overfitting
• Cross-validation
• Holdout cross-validation
• K-fold cross-validation
• Leave-one-out cross-validation
• Model selection

9
Recall Notation

Training set: (x_1, y_1), (x_2, y_2), …, (x_N, y_N)

where each y_j was generated by an unknown function y = f(x).

Discover a function h (the hypothesis) that best approximates the true function f.

10
Loss Functions

Suppose the true prediction for input x is f(x) = y, but the hypothesis gives h(x) = ŷ.

L(x, y, ŷ) = Utility(result of using y given input x) − Utility(result of using ŷ given input x)

Simplified version: L(y, ŷ)

Absolute-value loss: L_1(y, ŷ) = |y − ŷ|
Squared-error loss: L_2(y, ŷ) = (y − ŷ)^2
0/1 loss: L_{0/1}(y, ŷ) = 0 if y = ŷ, else 1

Generalization loss: expected loss over all possible examples
Empirical loss: average loss over the available examples

11
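
As a concrete illustration (a minimal sketch, not part of the original slides; the function names are my own), these losses and the empirical loss can be written directly in Python:

def abs_loss(y, y_hat):
    """Absolute-value loss: L1(y, y_hat) = |y - y_hat|."""
    return abs(y - y_hat)

def squared_loss(y, y_hat):
    """Squared-error loss: L2(y, y_hat) = (y - y_hat)^2."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """0/1 loss: 0 if the prediction is exactly right, else 1."""
    return 0 if y == y_hat else 1

def empirical_loss(loss, h, examples):
    """Average loss of hypothesis h over the available (x, y) examples."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)

# Example: evaluate a toy hypothesis on three examples.
examples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
h = lambda x: 2.0 * x
print(empirical_loss(squared_loss, h, examples))
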
Univariate Linear Regression

12
Univariate Linear Regression contd.

Weight vector: w = [w_0, w_1]

h_w(x) = w_1 x + w_0

Find the weight vector that minimizes the empirical loss, e.g., L_2:

Loss(h_w) = ∑_{j=1}^N L_2(y_j, h_w(x_j)) = ∑_{j=1}^N (y_j − h_w(x_j))^2 = ∑_{j=1}^N (y_j − (w_1 x_j + w_0))^2

i.e., find w* such that

w* = argmin_w Loss(h_w)

13


Weight Space

14
Finding w*

Find weights such that:

∂/∂w_0 ∑_{j=1}^N (y_j − (w_1 x_j + w_0))^2 = 0   and   ∂/∂w_1 ∑_{j=1}^N (y_j − (w_1 x_j + w_0))^2 = 0

15
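
Setting both partial derivatives to zero and solving gives the standard closed-form least-squares solution for w_1 and w_0. A minimal Python sketch (my own illustration, not from the slides):

def fit_univariate(xs, ys):
    """Closed-form least-squares fit: returns (w0, w1) minimizing
    sum_j (y_j - (w1*x_j + w0))^2."""
    N = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    w1 = (N * sum_xy - sum_x * sum_y) / (N * sum_xx - sum_x ** 2)
    w0 = (sum_y - w1 * sum_x) / N
    return w0, w1

# Noise-free line y = 3x + 1 should be recovered exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]
print(fit_univariate(xs, ys))  # approximately (1.0, 3.0)
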
Gradient Descent

w_i ← w_i − α (∂/∂w_i) Loss(w)

where α is the step size or learning rate.

16
Gradient Descent contd.

For one training example (x, y):

w_0 ← w_0 + α (y − h_w(x))   and   w_1 ← w_1 + α (y − h_w(x)) x

For N training examples:

w_0 ← w_0 + α ∑_j (y_j − h_w(x_j))   and   w_1 ← w_1 + α ∑_j (y_j − h_w(x_j)) x_j

This is batch gradient descent.

Stochastic gradient descent: take a step for one training example at a time.
17
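
A minimal sketch of the two schemes for the univariate case (my own illustration; the learning rate and epoch counts are arbitrary choices):

def batch_gd(xs, ys, alpha=0.01, epochs=1000):
    """Batch gradient descent: each step sums the error over all examples."""
    w0 = w1 = 0.0
    for _ in range(epochs):
        err = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        w0 += alpha * sum(err)
        w1 += alpha * sum(e * x for e, x in zip(err, xs))
    return w0, w1

def stochastic_gd(xs, ys, alpha=0.01, epochs=1000):
    """Stochastic gradient descent: take a step for one example at a time."""
    w0 = w1 = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - (w1 * x + w0)
            w0 += alpha * err
            w1 += alpha * err * x
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]
print(batch_gd(xs, ys))       # both should approach (1.0, 3.0)
print(stochastic_gd(xs, ys))
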
The Multivariate case

h_sw(x_j) = w_0 + w_1 x_{j,1} + … + w_n x_{j,n} = w_0 + ∑_i w_i x_{j,i}

Augmented vectors: add a feature to each x by tacking on a 1: x_{j,0} = 1

Then:

h_sw(x_j) = w ⋅ x_j = w^T x_j = ∑_i w_i x_{j,i}

and the batch gradient descent update becomes:

w_i ← w_i + α ∑_j (y_j − h_w(x_j)) x_{j,i}

18
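
A sketch of the multivariate batch update with augmented inputs, written with numpy (my own illustration; variable names are hypothetical):

import numpy as np

def batch_gd_multivariate(X, y, alpha=0.01, epochs=2000):
    """Batch gradient descent for multivariate linear regression.
    X is the data matrix (one row per example); a column of 1s is
    prepended so that w[0] plays the role of the intercept w_0."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])  # augmented: x_{j,0} = 1
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        err = y - Xa @ w            # y_j - h_w(x_j) for every example
        w += alpha * Xa.T @ err     # w_i += alpha * sum_j err_j * x_{j,i}
    return w

# Data generated from y = 1 + 2*x1 + 1*x2, so the fit should approach [1, 2, 1].
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([2.0, 3.0, 4.0, 7.0])
print(batch_gd_multivariate(X, y))
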

The Multivariate case contd.

Or, solving analytically:

Let y be the vector of outputs for the training examples, and X the data matrix, in which each row is an input vector.

Solving y = Xw for w*:

w* = (X^T X)^{−1} X^T y

where (X^T X)^{−1} X^T is the pseudoinverse of the data matrix.

19
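
The same fit computed analytically with numpy (a sketch; numpy.linalg.pinv applies the pseudoinverse in a numerically safer way than forming the inverse of X^T X explicitly):

import numpy as np

X = np.array([[1.0, 0.0, 1.0],   # augmented data matrix: first column is the 1s feature
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0]])
y = np.array([2.0, 3.0, 4.0, 7.0])

# Normal-equation form: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent, but more robust: use the pseudoinverse directly.
w_pinv = np.linalg.pinv(X) @ y

print(w_star)   # approximately [1. 2. 1.]
print(w_pinv)
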
Regularization

Cost(h) = EmpLoss(h) + λ Complexity(h)

Complexity(h_w) = L_q(w) = ∑_i |w_i|^q

20
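
A sketch of the regularized cost for the L1 (q = 1) and L2 (q = 2) penalties (illustrative only; λ is a free parameter, and in practice the intercept w_0 is often left out of the penalty, though it is included here for simplicity):

import numpy as np

def regularized_cost(w, X, y, lam, q):
    """Cost(h_w) = EmpLoss(h_w) + lambda * Complexity(h_w),
    with EmpLoss the average squared error and
    Complexity = L_q(w) = sum_i |w_i|^q."""
    emp_loss = np.mean((y - X @ w) ** 2)
    complexity = np.sum(np.abs(w) ** q)
    return emp_loss + lam * complexity

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w = np.array([1.0, 2.0])
print(regularized_cost(w, X, y, lam=0.1, q=1))  # L1 penalty
print(regularized_cost(w, X, y, lam=0.1, q=2))  # L2 penalty
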
L1 vs. L2 Regularization

21
Linear Classification: hard thresholds

22
Linear Classification: hard thresholds contd.

• Decision boundary:
  • In the linear case: a linear separator, i.e., a hyperplane
• Linearly separable:
  • Data is linearly separable if the classes can be separated by a linear separator
• Classification hypothesis:

h_w(x) = Threshold(w ⋅ x), where Threshold(z) = 1 if z ≥ 0 and 0 otherwise

23
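
A sketch of the hard-threshold hypothesis on an augmented input, where w[0] plays the role of the bias (my own illustration; weights and inputs are made up):

import numpy as np

def threshold(z):
    """Threshold(z) = 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def classify(w, x):
    """Hard-threshold hypothesis: h_w(x) = Threshold(w . x)."""
    return threshold(np.dot(w, x))

# The decision boundary w . x = 0 is a hyperplane in input space.
w = np.array([-1.5, 1.0, 1.0])   # bias weight, then one weight per feature
x = np.array([1.0, 0.5, 0.4])    # augmented input: leading 1, then features
print(classify(w, x))            # 0.5 + 0.4 - 1.5 < 0, so class 0
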
Perceptron Learning Rule

For a single sample (x, y):

w_i ← w_i + α (y − h_w(x)) x_i

• If the output is correct, i.e., y = h_w(x), then the weights don't change.
• If y = 1 but h_w(x) = 0, then w_i is increased when x_i is positive and decreased when x_i is negative.
• If y = 0 but h_w(x) = 1, then w_i is decreased when x_i is positive and increased when x_i is negative.

Perceptron Convergence Theorem: for any data set that's linearly separable and any training procedure that continues to present each training example, the learning rule is guaranteed to find a solution in a finite number of steps.

24
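
A minimal sketch of training with this rule (my own illustration; logical AND is linearly separable, so the rule converges in a finite number of passes, as the theorem guarantees):

import numpy as np

def threshold(z):
    return 1 if z >= 0 else 0

def train_perceptron(X, y, alpha=0.1, epochs=100):
    """Perceptron learning rule: w_i <- w_i + alpha * (y - h_w(x)) * x_i,
    applied one example at a time. Rows of X are augmented inputs (leading 1)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            err = y_j - threshold(np.dot(w, x_j))
            w += alpha * err * x_j   # no change when the prediction is correct
    return w

# Logical AND, which is linearly separable.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([threshold(np.dot(w, x)) for x in X])   # [0, 0, 0, 1]
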
Perceptron Performance

25
Linear Classification with Logistic Regression

An important function!
26
Logistic Regression

h_w(x) = Logistic(w ⋅ x) = 1 / (1 + e^{−w⋅x})

For a single sample (x, y) and the L_2 loss function:

w_i ← w_i + α (y − h_w(x)) h_w(x) (1 − h_w(x)) x_i

where h_w(x) (1 − h_w(x)) is the derivative of the logistic function.

27
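
A sketch of stochastic updates with this rule on a small separable data set (my own illustration; the data, learning rate, and epoch count are arbitrary):

import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.5, epochs=2000):
    """Stochastic updates for logistic regression with L2 loss:
    w_i <- w_i + alpha * (y - h_w(x)) * h_w(x) * (1 - h_w(x)) * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            h = logistic(np.dot(w, x_j))
            w += alpha * (y_j - h) * h * (1.0 - h) * x_j
    return w

# One-dimensional separable data: negative x -> class 0, positive x -> class 1.
X = np.array([[1, -2], [1, -1], [1, 1], [1, 2]], dtype=float)   # augmented inputs
y = np.array([0, 0, 1, 1])
w = train_logistic(X, y)
print((logistic(X @ w) >= 0.5).astype(int))   # [0 0 1 1]
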
Logistic Regression Performance

separable case

28
Summary

• Learning from Examples: brief review
• Loss functions
• Generalization
• Overfitting
• Cross-validation
• Regularization
• Univariate Linear Regression
• Batch gradient descent
• Stochastic gradient descent
• Multivariate Linear Regression
• Regularization
• Linear Classifiers
• Perceptron learning rule
• Logistic Regression

29
Next Class

• Artificial Neural Networks, Nonparametric Models, & Support Vector Machines
• Secs. 18.7 – 18.9

30
