
Revisiting Logistic Regression & Naïve Bayes

Aarti Singh
Machine Learning 10-701/15-781 Jan 27, 2010

Generative and Discriminative Classifiers


Training classifiers involves learning a mapping f: X -> Y, or P(Y|X).

Generative classifiers (e.g. Naïve Bayes):
- Assume some functional form for P(X,Y) (or for P(X|Y) and P(Y))
- Estimate the parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X)

Discriminative classifiers (e.g. Logistic Regression):
- Assume some functional form for P(Y|X)
- Estimate the parameters of P(Y|X) directly from training data

Logistic Regression
Assumes the following functional form for P(Y|X):

P(Y=1|X) = 1 / (1 + exp(-(w_0 + sum_i w_i X_i)))

Alternatively,

log [ P(Y=1|X) / P(Y=0|X) ] = w_0 + sum_i w_i X_i   (a Linear Decision Boundary in X)

Unlike Naïve Bayes, logistic regression DOES NOT require any conditional independence assumptions.
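As a quick illustration (not from the slides), the assumed functional form can be written directly in code; the weights below are arbitrary made-up values.

```python
import numpy as np

def p_y1_given_x(x, w0, w):
    """Logistic regression model: P(Y=1|X=x) = 1 / (1 + exp(-(w0 + w.x)))."""
    return 1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x))))

# Illustrative (made-up) weights and a single feature vector
w0, w = -1.0, np.array([2.0, -0.5])
x = np.array([0.8, 1.2])
print(p_y1_given_x(x, w0, w))           # posterior probability of class 1
print(p_y1_given_x(x, w0, w) >= 0.5)    # predicted label; the boundary w0 + w.x = 0 is linear in x
```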



Connection to Gaussian Naïve Bayes


There are several distributions that can lead to a linear decision boundary. As another example, consider a generative model:

The class-conditional density P(X|Y=y) is assumed to lie in the exponential family; observe that the Gaussian is a special case.

Connection to Gaussian Naïve Bayes

[Equation: the resulting expression for P(Y|X) decomposes into a constant term plus a first-order (linear in X) term.]

Special case: P(X|Y=y) ~ Gaussian(mu_y, Sigma_y) with shared covariance Sigma_0 = Sigma_1 (i.e. c_{ij,0} = c_{ij,1} for all i, j). Conditionally independent features: c_{ij,y} = 0 for i != j (Gaussian Naïve Bayes).

Generative vs Discriminative
Given infinite data (asymptotically): if the conditional independence assumption holds, discriminative and generative classifiers perform similarly.

If the conditional independence assumption does NOT hold, the discriminative classifier outperforms the generative one.

Generative vs Discriminative
Given finite data (n data points, p features), the Ng-Jordan paper shows:

Naïve Bayes (generative) requires n = O(log p) samples to converge to its asymptotic error, whereas logistic regression (discriminative) requires n = O(p). Why? Naïve Bayes assumes independent class-conditional densities:
- a smaller model class is easier to learn
- parameter estimates are not coupled: each parameter is learned independently, not jointly, from the training data

Naïve Bayes vs Logistic Regression


Verdict: Both learn a linear decision boundary. Naïve Bayes makes more restrictive assumptions and has higher asymptotic error, BUT it converges faster to its (less accurate) asymptotic error.
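As an illustrative sketch only (not the Ng-Jordan experiment; the synthetic data, sample sizes, and use of scikit-learn are assumptions), one can compare the two classifiers at different training-set sizes:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))
y = (X @ rng.normal(size=p) + 0.5 * rng.normal(size=n) > 0).astype(int)  # synthetic, roughly linear labels
X_test, y_test = X[800:], y[800:]                                        # hold out the last 200 points

for n_train in (20, 100, 800):   # small vs. large training sets
    nb = GaussianNB().fit(X[:n_train], y[:n_train])
    lr = LogisticRegression(max_iter=1000).fit(X[:n_train], y[:n_train])
    print(n_train, nb.score(X_test, y_test), lr.score(X_test, y_test))
```

On data like this one would expect Naïve Bayes to look relatively better at the smallest training sizes, with logistic regression catching up or overtaking it as n grows.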

Experimental Comparison (Ng-Jordan '01)


UCI Machine Learning Repository: 15 datasets (8 with continuous features, 7 with discrete features).

[Figure: learning curves (error vs. training set size) comparing Naïve Bayes and Logistic Regression on the UCI datasets. More in the paper.]

Classification so far (Recap)


Classification Tasks
Features, X -> Labels, Y

- Diagnosing sickle cell anemia: X = cell image, Y = anemic cell vs. healthy cell
- Tax fraud detection
- Web classification: Y = Sports / Science / News
- Predicting a Squirrel Hill resident: X = {drives to CMU, Rachel's fan, shops at SH Giant Eagle}, Y = resident vs. not resident

Classification
Goal: learn a prediction rule f: X -> Y (features X, labels Y, e.g. Sports / Science / News) with small probability of error, P(f(X) != Y).

Classification
Optimal predictor (Bayes classifier): f*(x) = arg max_y P(Y = y | X = x)

But this depends on the unknown distribution P(X, Y).

Classification algorithms
However, we can learn a good prediction rule from training data (X_1, Y_1), ..., (X_n, Y_n), assumed independent and identically distributed (i.i.d.). A learning algorithm maps this training data to an estimated rule.

So far:
- Decision Trees
- K-Nearest Neighbor
- Naïve Bayes
- Logistic Regression

Linear Regression
Aarti Singh
Machine Learning 10-701/15-781 Jan 27, 2010

Discrete to Continuous Labels


Classification (discrete labels):
- X = Document, Y = Topic (Sports / Science / News)
- X = Cell Image, Y = Diagnosis (anemic cell vs. healthy cell)

Regression (continuous labels):
- Stock market prediction: X = date (e.g. Feb 01), Y = ? (the price to be predicted)

Regression Tasks
Weather prediction: X = time (e.g. 7 pm), Y = temperature.

Estimating contamination: X = new location, Y = sensor reading.

Supervised Learning
Goal: learn a prediction rule f: X -> Y from training data.

- Classification (e.g. Y in {Sports, Science, News}): performance measured by the probability of error, P(f(X) != Y).
- Regression (e.g. predicting the stock value Y for X = Feb 01): performance measured by the mean squared error, E[(f(X) - Y)^2].

Regression
Optimal predictor (the conditional mean): f*(x) = E[Y | X = x].

(Subscripts on the expectation are dropped for notational convenience.)

Regression
Optimal predictor (the conditional mean): f*(x) = E[Y | X = x].

Intuition: under a signal plus (zero-mean) noise model, Y = f*(X) + eps with E[eps | X] = 0, the conditional mean recovers the signal f*. But it depends on the unknown distribution.

Regression algorithms
A learning algorithm maps training data to an estimate f_hat:

- Linear Regression
- Lasso, Ridge Regression (Regularized Linear Regression)
- Nonlinear Regression, Kernel Regression
- Regression Trees, Splines, Wavelet estimators, ...

Empirical Risk Minimizer: replace the unknown mean squared error by its empirical mean over the training data,
f_hat = arg min_{f in F} (1/n) sum_{i=1}^n (f(X_i) - Y_i)^2.

Linear Regression
Least Squares Estimator

F = class of linear functions.

Uni-variate case: f(X) = beta_1 + beta_2 X, with beta_1 the intercept and beta_2 the slope.

Multi-variate case: f(X) = beta_1 X^(1) + ... + beta_p X^(p) = X beta, where X = (X^(1), ..., X^(p)) is the feature vector and beta = (beta_1, ..., beta_p)^T (an intercept can be included by taking X^(1) = 1).

Least Squares Estimator

beta_hat = arg min_beta (1/n) sum_{i=1}^n (X_i beta - Y_i)^2 = arg min_beta (1/n) ||A beta - Y||^2,

where A is the n x p matrix whose i-th row is the feature vector X_i, and Y = (Y_1, ..., Y_n)^T.

Normal Equations
Setting the gradient to zero gives the normal equations:

A^T A beta = A^T Y
(p x p)(p x 1) = (p x 1)

If A^T A is invertible, beta_hat = (A^T A)^{-1} A^T Y.

When is A^T A invertible? (Homework 2) Recall: full-rank matrices are invertible. What is the rank of A^T A?
What if A^T A is not invertible? (Homework 2) Regularization (later).
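For concreteness, a minimal numpy sketch of solving the normal equations (the synthetic data and variable names are illustrative; solve() is used rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
A = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with an intercept column
beta_true = np.array([1.0, 2.0, -0.5])
Y = A @ beta_true + 0.1 * rng.normal(size=n)                     # signal plus zero-mean noise

# Normal equations: (A^T A) beta = A^T Y
beta_hat = np.linalg.solve(A.T @ A, A.T @ Y)
print(beta_hat)   # close to beta_true
```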


Geometric Interpretation

Difference in prediction on the training set (the residual): Y - A beta_hat.

A beta_hat is the orthogonal projection of Y onto the linear subspace spanned by the columns of A.

Revisiting Gradient Descent


Even when A^T A is invertible, computing (A^T A)^{-1} A^T Y can be computationally expensive if A is huge.

Gradient Descent
- Initialize: beta^(0)
- Update: beta^(t+1) = beta^(t) - alpha * grad J(beta^(t)); for the least-squares objective J(beta) = ||A beta - Y||^2, the gradient is 2 A^T (A beta - Y), which equals 0 at the minimizer.
- Stop: when some criterion is met, e.g. a fixed number of iterations, or when the change ||beta^(t+1) - beta^(t)|| falls below a threshold.
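A minimal gradient-descent sketch for the least-squares objective, reusing the A, Y notation above (the step-size rule and stopping tolerance are illustrative choices, not the lecture's):

```python
import numpy as np

def least_squares_gd(A, Y, n_iters=5000, tol=1e-10):
    """Gradient descent on J(beta) = ||A beta - Y||^2 (a sketch)."""
    # Step size chosen from the largest eigenvalue of the Hessian 2 A^T A,
    # echoing the dependence on the eigenvalue spread discussed on a later slide.
    step = 0.9 / (2 * np.linalg.eigvalsh(A.T @ A)[-1])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = 2 * A.T @ (A @ beta - Y)            # gradient of the squared error
        beta_new = beta - step * grad
        if np.linalg.norm(beta_new - beta) < tol:  # stop when the update is negligible
            break
        beta = beta_new
    return beta
```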

Effect of step-size

Large step size: fast convergence, but larger residual error; oscillations are also possible.
Small step size: slow convergence, but small residual error.

When does Gradient Descent succeed?


The algorithm's view of the objective is myopic: it only follows the local gradient.

http://www.ce.berkeley.edu/~bayen/

http://demonstrations.wolfram.com

Guaranteed to converge to a local minimum if the step size is small enough (for the convex least-squares objective, a local minimum is the global minimum).

The error in the j-th eigendirection of A^T A contracts geometrically at a rate set by the corresponding eigenvalue, so convergence depends on the eigenvalue spread.

Least Squares and MLE


Intuition: signal plus (zero-mean) noise model, Y_i = X_i beta + eps_i with eps_i ~ N(0, sigma^2) i.i.d.

Log likelihood: log P(Y_1, ..., Y_n | X_1, ..., X_n, beta) = -(n/2) log(2 pi sigma^2) - (1/(2 sigma^2)) sum_i (Y_i - X_i beta)^2.

Maximizing this over beta is the same as minimizing sum_i (Y_i - X_i beta)^2: the Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian noise model!

Regularized Least Squares and MAP


What if A^T A is not invertible? Add a prior and take a MAP estimate: maximize (log likelihood + log prior).

I) Gaussian prior: beta ~ N(0, tau^2 I)

-> Ridge Regression: beta_hat = arg min_beta sum_i (Y_i - X_i beta)^2 + lambda ||beta||_2^2

Closed form: HW. The prior belief that beta is Gaussian with zero mean biases the solution toward small beta.
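As a sketch (the closed form itself is a homework exercise; the function and variable names here are illustrative), the standard ridge solution (A^T A + lambda I)^{-1} A^T Y can be computed directly:

```python
import numpy as np

def ridge_regression(A, Y, lam):
    """Ridge estimator: argmin_beta ||A beta - Y||^2 + lam * ||beta||^2.
    Uses the standard closed form (A^T A + lam*I)^{-1} A^T Y; adding lam*I
    makes the system invertible even when A^T A is not (for lam > 0)."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)
```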

Regularized Least Squares and MAP


What if A^T A is not invertible? Again, maximize (log likelihood + log prior).

II) Laplace prior on beta

-> Lasso: beta_hat = arg min_beta sum_i (Y_i - X_i beta)^2 + lambda ||beta||_1

Closed form: HW. The prior belief that beta is Laplace with zero mean biases the solution toward small (and sparse) beta.

Ridge Regression vs Lasso


Ridge Regression penalizes the l2 norm, ||beta||_2^2; Lasso penalizes the l1 norm, ||beta||_1. (HOT!) Ideally one would use an l0 penalty (the number of nonzero coordinates), but the optimization then becomes non-convex.

[Figure: level sets of J(beta) (betas with constant J(beta)) shown against the constraint sets: betas with constant l2 norm, constant l1 norm, and constant l0 norm.]

Lasso (the l1 penalty) results in sparse solutions: coefficient vectors with more zero coordinates. Good for high-dimensional problems, since you don't have to store all coordinates!
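A small proximal-gradient (ISTA) sketch shows how the l1 penalty zeroes out coordinates; ISTA is just one possible solver and is an assumption here, not the lecture's method:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding, the proximal operator of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, Y, lam, n_iters=2000):
    """Proximal gradient descent for argmin_beta ||A beta - Y||^2 + lam * ||beta||_1."""
    step = 0.9 / (2 * np.linalg.eigvalsh(A.T @ A)[-1])  # step from the smooth part's curvature
    beta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = 2 * A.T @ (A @ beta - Y)                  # gradient of the squared-error term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta                                          # many coordinates end up exactly zero
```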


Beyond Linear Regression


- Polynomial regression
- Regression with nonlinear features / basis functions
- Kernel regression (local / weighted regression)
- Regression trees (spatially adaptive regression)

Polynomial Regression
Univariate case: f(X) = beta_1 + beta_2 X + beta_3 X^2 + ... + beta_m X^(m-1), a linear model in the nonlinear features 1, X, X^2, ..., X^(m-1), with the coefficients beta_k as the weight of each feature.
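A short sketch of polynomial regression as linear regression on nonlinear features (the sine signal, noise level, and degree are made-up choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # made-up smooth signal plus noise

degree = 3
A = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, x^3: nonlinear features, linear in beta
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta_hat                            # fitted cubic polynomial
```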

Nonlinear Regression
f(X) = sum_k beta_k phi_k(X): basis coefficients beta_k and nonlinear features / basis functions phi_k(X).

- Fourier basis: a good representation for oscillatory functions.
- Wavelet basis: a good representation for functions localized at multiple scales.

Local Regression

Again f(X) = sum_k beta_k phi_k(X), with basis coefficients and nonlinear features / basis functions.

Globally supported basis functions (polynomial, Fourier) will not yield a good representation when the function has localized, spatially varying structure.

Kernel Regression (Local)

Weighted Least Squares: weigh each training point based on its distance to the test point.

K = kernel, h = bandwidth of the kernel.

Nadaraya-Watson Kernel Regression

f_hat(X) = sum_i K((X - X_i)/h) Y_i / sum_i K((X - X_i)/h): a locally constant fit.

Nadaraya-Watson Kernel Regression

A locally constant fit. With a box-car kernel, the denominator counts the number of training points in the h-ball around X and the numerator sums their Ys, so the estimate is just the average of the Ys in the h-ball around X. Recall the nearest-neighbor classifier: averaging (regression) corresponds to majority vote (classification).
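A minimal Nadaraya-Watson sketch for 1-D inputs, assuming a Gaussian kernel (the box-car kernel mentioned above would just replace the weights with 0/1 indicators); the data below are made up:

```python
import numpy as np

def nadaraya_watson(x_query, X_train, Y_train, h):
    """Kernel-weighted average of the training Ys at x_query (1-D inputs, Gaussian kernel)."""
    weights = np.exp(-0.5 * ((x_query - X_train) / h) ** 2)   # K((x - X_i)/h)
    return np.sum(weights * Y_train) / np.sum(weights)

# Tiny usage example
X_train = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
Y_train = np.array([0.1, 0.9, 1.1, 2.2, 1.8])
print(nadaraya_watson(1.2, X_train, Y_train, h=0.5))
```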

Choice of Bandwidth
The bandwidth h:
- should depend on n, the number of training data points (determines the variance)
- should depend on the smoothness of the function (determines the bias)

Large bandwidth: averages more data points, reduces noise (lower variance).
Small bandwidth: less smoothing, more accurate fit (lower bias).
Bias-variance tradeoff: more to come in later lectures.

Spatially adaptive regression


If the function's smoothness varies spatially, we want to allow the bandwidth h to depend on X: local polynomials, splines, wavelets, regression trees.

Regression trees
Binary decision tree: e.g. split on "Num Children?" into branches >= 2 and < 2.

Average (fit a constant) on each leaf.

Regression trees
Quad Decision Tree

[Figure: a quad decision tree recursively partitions the input space into cells.]

- Polynomial fit on each leaf.
- If splitting a cell sufficiently reduces the residual error, then split; else stop (compare the residual error with and without the split).
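For a quick feel, here is a sketch using scikit-learn's DecisionTreeRegressor, which fits a constant on each leaf (the data and hyperparameters are made up, and this is not the quad-tree / polynomial-leaf variant described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 1))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=200)   # piecewise-constant signal plus noise

# max_depth / min_samples_leaf play the role of the stopping rule sketched above
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10).fit(X, y)
print(tree.predict([[0.25], [0.75]]))   # roughly the leaf averages near 1.0 and 3.0
```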



Summary
Discriminative vs Generative Classifiers
- Naïve Bayes vs Logistic Regression

Regression
- Linear Regression: Least Squares Estimator, Normal Equations, Gradient Descent, Geometric Interpretation, Probabilistic Interpretation (connection to MLE)
- Regularized Linear Regression (connection to MAP): Ridge Regression, Lasso
- Polynomial Regression, Basis (Fourier, Wavelet) Estimators
- Kernel Regression (Localized)
- Regression Trees
