Revisiting Logistic Regression & Naïve Bayes
Aarti Singh
Machine Learning 10-701/15-781 Jan 27, 2010
Logistic Regression
Assumes the following functional form for P(Y|X):
P(Y=1|X) = 1 / (1 + exp(−(w₀ + Σᵢ wᵢXᵢ)))
Alternatively, the log-odds are linear in X:
ln[ P(Y=1|X) / P(Y=0|X) ] = w₀ + Σᵢ wᵢXᵢ
where w₀ is the constant term and Σᵢ wᵢXᵢ is the first-order term.
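A minimal NumPy sketch of evaluating this functional form; the weights here are made-up placeholders, not fitted values:

import numpy as np

def p_y1_given_x(x, w0, w):
    """Logistic regression posterior P(Y=1|X=x) with constant term w0 and weights w."""
    return 1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x))))

# Example with arbitrary weights: predict Y=1 if the posterior exceeds 1/2.
x = np.array([0.5, -1.2, 3.0])
w0, w = 0.1, np.array([0.8, -0.3, 0.05])
print(p_y1_given_x(x, w0, w) > 0.5)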
Special case: P(X|Y=y) ~ Gaussian(μ_y, Σ_y) where Σ₀ = Σ₁ (σᵢⱼ,₀ = σᵢⱼ,₁), with conditionally independent features, σᵢⱼ,y = 0 for i ≠ j (Gaussian Naïve Bayes).
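A brief sketch (the standard derivation, not reproduced from the slide) of why this special case yields the logistic form, writing σᵢ² for the shared variance of feature Xᵢ:

P(Y=1|X) = P(X|Y=1)P(Y=1) / [P(X|Y=1)P(Y=1) + P(X|Y=0)P(Y=0)]
         = 1 / (1 + exp( ln[P(Y=0)/P(Y=1)] + Σᵢ ln[P(Xᵢ|Y=0)/P(Xᵢ|Y=1)] ))

With P(Xᵢ|Y=y) = N(μᵢ,y, σᵢ²) and the same σᵢ² for both classes, each log ratio is linear in Xᵢ:
ln[P(Xᵢ|Y=0)/P(Xᵢ|Y=1)] = (μᵢ,₁² − μᵢ,₀²)/(2σᵢ²) + Xᵢ(μᵢ,₀ − μᵢ,₁)/σᵢ²
so P(Y=1|X) = 1 / (1 + exp(−(w₀ + Σᵢ wᵢXᵢ))) with wᵢ = (μᵢ,₁ − μᵢ,₀)/σᵢ² — exactly the logistic regression form above.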
Generative vs Discriminative
Given infinite data (asymptotically): if the conditional independence assumption holds, discriminative and generative classifiers perform similarly.
Generative vs Discriminative
Given finite data (n data points, p features) — Ng-Jordan paper:
Naïve Bayes (generative) requires n = O(log p) samples to converge to its asymptotic error, whereas logistic regression (discriminative) requires n = O(p). Why? Independent class-conditional densities:
- smaller classes are easier to learn
- parameter estimates are not coupled — each parameter is learnt independently, not jointly, from the training data.
More in Paper
[Figures: experiments from the Ng-Jordan paper — error vs. training set size for Naïve Bayes and Logistic Regression]
Classification Tasks
Diagnosing sickle cell anemia — features X: cell image; labels Y: anemic cell vs. healthy cell
Tax fraud detection
Web classification — labels Y: topic
Classification
Goal: learn a prediction rule that maps features X to labels Y.
Performance measure: probability of error, P(f(X) ≠ Y).
Classification
Optimal predictor (Bayes classifier): f*(X) = arg max_y P(Y = y | X)
Classification algorithms
The optimal rule requires knowing P(Y|X), which is unknown; however, we can learn a good prediction rule from training data.
[Figure: training data → learning algorithm → prediction rule]
So far
Linear Regression
Aarti Singh
Machine Learning 10-701/15-781 Jan 27, 2010
[Recap figures: classification examples — X = cell image, Y = diagnosis; Y = topic]
Regression
[Figure: time series of past observations; predict the value Y = ? at X = Feb 01]
Regression Tasks
Weather prediction — X: time (e.g., 7 pm), Y: temperature
Estimating contamination — X: new location, Y: sensor reading
Supervised Learning
Goal: learn a predictor f mapping features X to outputs Y (e.g., predict Y = ? at X = Feb 01).
Classification: Y is discrete — performance measured by the probability of error, P(f(X) ≠ Y).
Regression: Y is continuous — performance measured by the mean squared error, E[(f(X) − Y)²].
Regression
Optimal predictor (conditional mean): f*(x) = E[Y | X = x]
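A one-step justification (standard argument, not from the slide) that the conditional mean minimizes the mean squared error. For any predictor f,

E[(Y − f(X))²] = E[(Y − E[Y|X])²] + E[(E[Y|X] − f(X))²],

because the cross term vanishes: conditioning on X, E[(Y − E[Y|X]) | X] = 0. Only the second term depends on f, and it is made zero by choosing f(X) = E[Y|X], i.e. f*(x) = E[Y | X = x].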
Regression
Optimal predictor (conditional mean): f*(x) = E[Y | X = x]
Regression algorithms
Learning algorithm: training data → prediction rule. Examples:
- Linear Regression
- Lasso, Ridge Regression (Regularized Linear Regression)
- Nonlinear Regression: Kernel Regression, Regression Trees, Splines, Wavelet estimators, ...
Empirical Risk Minimizer:
f̂ₙ = arg min_{f ∈ F} (1/n) Σᵢ (f(Xᵢ) − Yᵢ)²
— the empirical mean of the squared loss over the training data replaces the (unknown) expectation.
Linear Regression
Least Squares Estimator
Univariate model: f(X) = β₁ + β₂X, with β₁ the intercept and β₂ the slope; least squares picks β to minimize Σᵢ (Yᵢ − β₁ − β₂Xᵢ)².
Multi-variate case: f(X) = Xβ, where Y is the n×1 response vector, X the n×p feature matrix, and β the p×1 coefficient vector, so
β̂ = arg min_β ‖Y − Xβ‖²
Normal Equations
(XᵀX) β = XᵀY        (XᵀX: p×p, β: p×1, XᵀY: p×1)
If XᵀX is invertible,
β̂ = (XᵀX)⁻¹ XᵀY
When is XᵀX invertible? (Homework 2) Recall: full-rank matrices are invertible. What is the rank of XᵀX?
What if XᵀX is not invertible? (Homework 2) Regularization (later)
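A minimal NumPy sketch of solving the normal equations (illustrative only; the data and variable names are my own):

import numpy as np

# Toy design matrix (n x p) and response vector; first column is a constant feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
Y = np.array([1.1, 1.9, 3.2, 4.4])

# Solve (X^T X) beta = X^T Y; np.linalg.solve assumes X^T X is invertible.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent, and more robust in practice: least squares via np.linalg.lstsq.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat, beta_lstsq)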
Geometric Interpretation
Xβ̂ is the orthogonal projection of Y onto the linear subspace spanned by the columns of X.
Gradient Descent
Alternative to solving the normal equations: start from an initial β and repeatedly step along the negative gradient of the squared error,
β^(t+1) = β^(t) + α Xᵀ(Y − Xβ^(t))
The updates converge to the least squares solution provided the step size α is small enough.
Effect of step-size α
Large α ⇒ fast convergence but larger residual error; also possible oscillations
Small α ⇒ slow convergence but small residual error
http://www.ce.berkeley.edu/~bayen/
http://demonstrations.wolfram.com
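A minimal sketch of gradient descent for the least squares objective (not the lecture's code; the step size and iteration count are illustrative):

import numpy as np

def least_squares_gd(X, Y, alpha=0.01, iters=1000):
    """Gradient descent on J(beta) = ||Y - X beta||^2."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2.0 * X.T @ (Y - X @ beta)   # gradient of the squared error
        beta = beta - alpha * grad           # step against the gradient
    return beta

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
Y = np.array([1.1, 1.9, 3.2, 4.4])
print(least_squares_gd(X, Y))  # approaches the normal-equations solution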
Probabilistic interpretation: with Gaussian noise, maximizing the log likelihood of the training data is equivalent to minimizing the squared error — the Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian model!
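Spelled out slightly (with the Gaussian noise assumption made explicit; n is the number of training points):

Yᵢ = Xᵢβ + εᵢ,  εᵢ ~ N(0, σ²) independent  ⇒
log P(Y₁,…,Yₙ | X, β) = Σᵢ log[ (1/√(2πσ²)) exp(−(Yᵢ − Xᵢβ)²/(2σ²)) ]
                      = −(n/2) log(2πσ²) − (1/(2σ²)) Σᵢ (Yᵢ − Xᵢβ)²

Only the last term depends on β, so arg max_β (log likelihood) = arg min_β Σᵢ (Yᵢ − Xᵢβ)², the least squares estimate.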
Regularized least squares — MAP interpretation: maximize [log likelihood + log prior].
I) Gaussian Prior on β
Ridge Regression:
β̂ = arg min_β ‖Y − Xβ‖² + λ‖β‖₂²
Closed form: HW. The prior belief that β is Gaussian with zero mean biases the solution toward small ‖β‖.
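The slide defers the closed form to the homework; one standard form is β̂ = (XᵀX + λI)⁻¹XᵀY, sketched here in NumPy (the value of lam is arbitrary):

import numpy as np

def ridge(X, Y, lam):
    """Closed-form ridge regression: minimizes ||Y - X beta||^2 + lam * ||beta||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
Y = np.array([1.1, 1.9, 3.2, 4.4])
print(ridge(X, Y, lam=0.1))  # coefficients shrunk toward zero relative to least squares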
II) Laplace Prior on β (log likelihood + log prior again)
Lasso:
β̂ = arg min_β ‖Y − Xβ‖² + λ‖β‖₁
Closed form: HW. The prior belief that β is Laplace with zero mean biases the solution toward small ‖β‖₁.
Lasso (ℓ1 penalty) results in sparse solutions — a coefficient vector with more zero coordinates. Good for high-dimensional problems: we don't have to store all coordinates!
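Since the Lasso solution is deferred to the homework, here is only an illustrative sketch of one standard solver (iterative soft-thresholding / proximal gradient); the penalty and iteration count are arbitrary choices:

import numpy as np

def soft_threshold(z, t):
    """Shrink each coordinate toward zero by t; exact zeros create sparsity."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, iters=2000):
    """Proximal gradient (ISTA) for 0.5*||Y - X beta||^2 + lam*||beta||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest eigenvalue of X^T X
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (Y - X @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

With a large enough lam, some coordinates of beta come out exactly zero, which is the sparsity property noted above.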
Polynomial Regression
Univariate case: f(X) = β₀ + β₁X + β₂X² + ... + β_mX^m — nonlinear in X but still linear in the coefficients β, so it can be fit by least squares on the nonlinear features [1, X, X², ..., X^m].
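A minimal sketch: fit a degree-3 polynomial by ordinary least squares on the expanded features (the degree and data are made up):

import numpy as np

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)   # noisy nonlinear target

degree = 3
Phi = np.vander(x, degree + 1, increasing=True)   # columns [1, x, x^2, x^3]
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta)   # coefficients of 1, x, x^2, x^3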
Nonlinear Regression
f(X) = Σⱼ βⱼ φⱼ(X), where the βⱼ are basis coefficients and the φⱼ are nonlinear features / basis functions, e.g. the Fourier basis or a wavelet basis.
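The same least squares machinery works with any fixed basis; a sketch with a small Fourier basis (the number of frequencies is an arbitrary choice):

import numpy as np

def fourier_features(x, n_freq=4):
    """Basis functions [1, cos(2*pi*k*x), sin(2*pi*k*x)] for k = 1..n_freq."""
    cols = [np.ones_like(x)]
    for k in range(1, n_freq + 1):
        cols.append(np.cos(2 * np.pi * k * x))
        cols.append(np.sin(2 * np.pi * k * x))
    return np.column_stack(cols)

x = np.linspace(0, 1, 50)
y = np.where(x < 0.5, 1.0, -1.0)             # a step function to approximate
beta, *_ = np.linalg.lstsq(fourier_features(x), y, rcond=None)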
Local Regression
f(X) = Σⱼ βⱼ φⱼ(X): basis coefficients βⱼ and nonlinear features / basis functions φⱼ, as before.
Globally supported basis functions (polynomial, Fourier) will not yield a good representation.
Weighted Least Squares: weight each training point based on its distance to the test point X,
β̂(X) = arg min_β Σᵢ wᵢ(X) (Yᵢ − f(Xᵢ; β))²,   wᵢ(X) = K((X − Xᵢ)/h)
K — kernel; h — bandwidth of the kernel
Locally constant fit (kernel regression):
f̂(X) = Σᵢ K((X − Xᵢ)/h) Yᵢ  /  Σᵢ K((X − Xᵢ)/h)
With a box-car kernel, the denominator counts the training points in an h-ball around X and the numerator sums their Yᵢ, so the estimate is simply the average of the Y's in the ball.
Recall the NN classifier: averaging here plays the role of the majority vote there.
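A minimal sketch of this locally constant estimate with a box-car kernel (the bandwidth h is arbitrary here):

import numpy as np

def kernel_regression_boxcar(x_query, X_train, Y_train, h):
    """Average the Y's of training points within distance h of the query point."""
    in_ball = np.abs(X_train - x_query) <= h
    if not np.any(in_ball):
        return np.nan                      # no training point falls in the h-ball
    return Y_train[in_ball].mean()

X_train = np.array([0.1, 0.2, 0.4, 0.7, 0.9])
Y_train = np.array([1.0, 1.2, 1.9, 3.1, 3.9])
print(kernel_regression_boxcar(0.3, X_train, Y_train, h=0.15))  # averages nearby Y's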
Choice of Bandwidth h
Should depend on n, the number of training data points (determines variance), and on the smoothness of the function (determines bias).
Large bandwidth: averages more data points, reduces noise (lower variance).
Small bandwidth: less smoothing, more accurate fit (lower bias).
Bias-variance tradeoff: more to come in later lectures.
If function smoothness varies spatially, we want to allow the bandwidth h to depend on X: local polynomials, splines, wavelets, regression trees.
Regression Trees
[Figure: binary decision tree, e.g. splitting on "Num Children ≥ 2" vs. "< 2"]
[Figure: quad decision tree — recursively partition the input space into cells of side h; split a cell further if needed, else stop]
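Not the lecture's construction, but a quick sketch of fitting a regression tree with scikit-learn (max_depth is an arbitrary choice):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Piecewise-smooth target: the tree adapts its cell size (local "bandwidth") to the data.
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.where(X.ravel() < 0.5, np.sin(8 * np.pi * X.ravel()), 0.2 * X.ravel())

tree = DecisionTreeRegressor(max_depth=5)
tree.fit(X, y)
print(tree.predict([[0.25], [0.75]]))   # piecewise-constant predictions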
Summary
- Discriminative vs. Generative Classifiers: Naïve Bayes vs. Logistic Regression
- Regression
- Linear Regression: Least Squares Estimator, Normal Equations, Gradient Descent, Geometric Interpretation, Probabilistic Interpretation (connection to MLE)
- Regularized Linear Regression (connection to MAP): Ridge Regression, Lasso
- Polynomial Regression, Basis (Fourier, Wavelet) Estimators
- Kernel Regression (localized)
- Regression Trees