Supervised Learning: Logistic Regression and SVMs
Rayid Ghani
Slides liberally borrowed and customized from lots of excellent online sources
Rayid Ghani @rayidghani
Types of Learning
• Classification (binary, multi-class, …)
• Clustering (hierarchical, sequential, …)
• Anomaly Detection
• PCA
• Regression
y = f(x)
• y: output / dependent variable
• f: learned function
• x: features/variables/inputs/predictors/independent variables
• Testing: apply f to a new test example x and output the predicted value y = f(x)
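The train-then-predict setup above can be sketched in a few lines. The choice of a 1-nearest-neighbour rule for f is purely illustrative (the slides do not specify a learner here); the point is only the shape of the workflow: learn f from (x, y) pairs, then apply f to a new x.

```python
# Minimal sketch of supervised learning: learn f from labelled (x, y)
# pairs, then apply f to a new test example x.
# f is a 1-nearest-neighbour rule, chosen only for illustration.

def train(examples):
    """'Training' for 1-NN just stores the labelled examples."""
    def f(x):
        # Predict the label of the closest training example.
        nearest = min(examples, key=lambda ex: abs(ex[0] - x))
        return nearest[1]
    return f

# Toy training data: (feature, label) pairs.
train_data = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
f = train(train_data)

# Testing: apply f to a new example and output the predicted y = f(x).
y = f(2.6)  # nearest training point is (3.0, 1), so y = 1
```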
Find β_i to maximize the conditional likelihood of the training data: β̂ = argmax_β Σ_i log P(y_i | x_i; β)
How do you train a Logistic Regression model?
• Logistic Regression: maximize the conditional likelihood while avoiding overfitting via
  – Early stopping
  – Regularization: add a penalty term to the cost function based on non-zero coefficients
    • L1 (Lasso): penalty λ Σ |β_i| (drives some coefficients exactly to zero)
    • L2 (Ridge): penalty λ Σ β_i² (shrinks coefficients toward zero)
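A hedged sketch of the training recipe above: gradient descent on the mean negative log-likelihood with an L2 (ridge) penalty λ Σ β_i² added. The data, learning rate, step count, and λ value are illustrative choices of ours, not from the slides.

```python
# Logistic regression trained by gradient descent, with an L2 penalty
# lam * b1**2 added to the mean negative log-likelihood (the intercept
# is left unpenalised, a common convention).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg_l2(X, y, lam=0.1, lr=0.5, steps=2000):
    """Fit intercept b0 and one coefficient b1 for 1-D inputs."""
    b0, b1 = 0.0, 0.0
    n = len(X)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b0 + b1 * xi)
            g0 += (p - yi)        # gradient of NLL w.r.t. intercept
            g1 += (p - yi) * xi   # gradient of NLL w.r.t. coefficient
        b0 -= lr * g0 / n
        b1 -= lr * (g1 / n + 2 * lam * b1)  # d/db1 of lam * b1**2
    return b0, b1

# Toy 1-D data: small x -> class 0, large x -> class 1.
X = [0.0, 1.0, 2.0, 3.0]
y = [0, 0, 1, 1]
b0, b1 = fit_logreg_l2(X, y)
```

Larger λ shrinks b1 further toward zero; λ = 0 recovers plain maximum-likelihood training.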
Regularization
• What is the impact of L1 (Least Absolute Shrinkage and Selection Operator)?
• Visualization:
https://florianhartl.com/logistic-regression-geom
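One way to see the impact of L1: for the scalar problem min_b (b - a)² + penalty(b), where a stands in for an unregularised coefficient estimate, the closed-form solutions are known. L2 shrinks b but never makes it zero; L1 soft-thresholds, setting b exactly to zero for small |a| — which is why Lasso performs variable selection. The constants below are illustrative.

```python
# Closed-form 1-D solutions contrasting L1 and L2 regularization.

def l2_solution(a, lam):
    # argmin_b (b - a)**2 + lam * b**2  =>  b = a / (1 + lam)
    # Shrunk toward zero, but never exactly zero for a != 0.
    return a / (1.0 + lam)

def l1_solution(a, lam):
    # argmin_b (b - a)**2 + lam * |b|  =>  soft-thresholding:
    # b = 0 exactly when |a| <= lam / 2.
    t = lam / 2.0
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

a, lam = 0.3, 1.0
b_l1 = l1_solution(a, lam)  # 0.0 -> the feature is dropped entirely
b_l2 = l2_solution(a, lam)  # 0.15 -> shrunk but kept
```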
Linear Classifiers
y = f(x) = sign(wTx + b)
• Detailed Math: http://www.engr.mun.ca/~baxter/Publications/LagrangeForSVMs.pdf
If C is large, errors cost a lot (hard margin). If C is small, errors are tolerated (softer margin).
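The decision rule y = sign(wᵀx + b) is simple to write down. In the sketch below, w and b are fixed by hand just to show the rule; an SVM would instead choose them to maximise the margin, with C controlling how costly training errors are (large C approaches a hard margin, small C tolerates errors).

```python
# A linear classifier: y = f(x) = sign(w^T x + b).
# w and b are hand-picked for illustration, not learned.

def sign(z):
    return 1 if z >= 0 else -1

def linear_classify(w, b, x):
    # Compute the decision value w . x + b, then take its sign.
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = [1.0, -1.0], 0.5
y_pos = linear_classify(w, b, [2.0, 1.0])  # sign(2 - 1 + 0.5) = +1
y_neg = linear_classify(w, b, [0.0, 2.0])  # sign(0 - 2 + 0.5) = -1
```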
Logit vs Linear SVMs
• Logistic regression maximizes the conditional likelihood of the training data
• SVMs: no probability estimates

P(c | d) = P(d | c) P(c) / P(d)
Naïve Bayes Classifier
ĉ = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c) / P(d)
  = argmax_{c ∈ C} P(d | c) P(c)   (dropping the denominator)
Naïve Bayes Classifier
ĉ = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)
• P(c): how often does this class occur? We can just count the relative frequencies in a data set.
• P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
• Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class c.
• Cons
  – Independence assumption may be violated in practice (features are correlated)
  – Gives skewed probability estimates when # of features is large. Why?
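The counting recipe above can be sketched end to end: estimate P(c) and P(xi | c) as relative frequencies in a toy data set, then predict ĉ = argmax_c P(c) · Π_i P(xi | c), with the denominator P(d) dropped. The toy documents are invented for illustration, and the Laplace smoothing is an addition of ours (the slides only mention raw counting).

```python
# Naive Bayes from counts: priors and per-class word probabilities are
# relative frequencies; prediction is argmax_c P(c) * prod_i P(x_i | c).
from collections import Counter, defaultdict

# Toy labelled documents: (words, class).
data = [
    (["cheap", "pills"], "spam"),
    (["cheap", "meds"], "spam"),
    (["meeting", "notes"], "ham"),
    (["project", "notes"], "ham"),
]

class_counts = Counter(c for _, c in data)
word_counts = defaultdict(Counter)  # word_counts[c][w] = count of w in class c
for words, c in data:
    word_counts[c].update(words)

vocab = len({w for cnt in word_counts.values() for w in cnt})

def predict(words):
    best_c, best_score = None, -1.0
    for c in class_counts:
        # Prior P(c): relative frequency of the class in the data set.
        score = class_counts[c] / len(data)
        total = sum(word_counts[c].values())
        for w in words:
            # Laplace-smoothed estimate of P(w | c); smoothing avoids
            # zeroing out the product for unseen (word, class) pairs.
            score *= (word_counts[c][w] + 1) / (total + vocab)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

label = predict(["cheap", "pills"])  # "spam"
```

Note how the independence assumption turns the O(|X|^n · |C|)-parameter table P(x1, …, xn | c) into n · |X| · |C| per-word counts.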
Questions to think about
• When would Naïve Bayes fail?