The document discusses linear models for classification, focusing on logistic regression and support vector machines (SVM). It covers concepts such as empirical risk minimization, regularization, and the differences between binary and multiclass classification methods. Additionally, it highlights practical considerations for implementing these models using scikit-learn and the trade-offs associated with stochastic gradient descent.

6012B0419Y Machine Learning

Linear Models for Classification


13-11-2022

Guido van Capelleveen

(Prepared by: Stevan Rudinac)


Slide Credit
● Andreas Müller, lecturer at the Data Science Institute at Columbia University
● Author of the book we will be using for this course, “Introduction to Machine Learning with Python”

● Great materials available at:
● https://github.com/amueller/applied_ml_spring_2017/
● https://amueller.github.io/applied_ml_spring_2017/
Linear Models for Binary Classification

(Regularized) Empirical Risk Minimization

The objective combines two parts: a data-fitting term (the loss on the training data) and a regularization term that penalizes complex models.

Who remembers this slide from the previous lecture?
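As a sketch of the objective this slide refers to, using assumed notation (per-sample loss ℓ, weights w, intercept b, regularization strength α, penalty R):

```latex
\min_{w,\,b} \;\;
\underbrace{\sum_{i=1}^{n} \ell\big(y_i,\; w^{\top} x_i + b\big)}_{\text{data fitting}}
\;+\;
\underbrace{\alpha \, \mathcal{R}(w)}_{\text{regularization}}
```

Common choices for the penalty R(w) are the squared L2 norm and the L1 norm.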


Differences Between Algorithms
● The way in which they measure how well a particular combination of coefficients and intercept fits the training data (the loss function)

● If, and what kind of, regularization they use

By default, both logistic regression and linear SVM in scikit-learn apply L2 regularization.
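A quick illustration of the defaults (a minimal sketch; the toy dataset is an assumption, not from the slides):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy two-class dataset, chosen here only for illustration
X, y = make_blobs(centers=2, random_state=42)

# Both estimators apply an L2 penalty on the coefficients by default
logreg = LogisticRegression().fit(X, y)   # penalty="l2" by default
svm = LinearSVC().fit(X, y)               # penalty="l2" by default

print(logreg.coef_, logreg.intercept_)
print(svm.coef_, svm.intercept_)
```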
Picking a loss?

Obvious idea: minimize the number of misclassifications, i.e. the 0-1 loss.
But the 0-1 loss is non-convex and discontinuous => relax it to a convex surrogate, such as the log loss (logistic regression) or the hinge loss (SVM).
Logistic Regression

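For reference, a minimal sketch of the logistic regression model and its per-sample loss, assuming the usual convention y_i ∈ {-1, +1}:

```latex
% Logistic (sigmoid) model for the probability of the positive class
p(y = +1 \mid x) = \frac{1}{1 + \exp\!\big(-(w^{\top} x + b)\big)}

% Corresponding per-sample loss (log loss)
\ell(y_i, x_i) = \log\!\big(1 + \exp\!\big(-y_i (w^{\top} x_i + b)\big)\big)
```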
Penalized Logistic Regression

C is the inverse of the regularization strength alpha (up to a factor of n_samples, depending on whether the loss is summed or averaged).

Both versions (L1- and L2-penalized) are strongly convex; the L2 version is also smooth (differentiable).

All points contribute to w (a dense solution to the dual).
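A sketch of the two penalized objectives in the C-parameterization (labels y_i ∈ {-1, +1} assumed; constant factors omitted):

```latex
% L2-penalized logistic regression
\min_{w,\,b} \; C \sum_{i=1}^{n} \log\!\big(1 + \exp\!\big(-y_i (w^{\top} x_i + b)\big)\big) \;+\; \|w\|_2^2

% L1-penalized logistic regression
\min_{w,\,b} \; C \sum_{i=1}^{n} \log\!\big(1 + \exp\!\big(-y_i (w^{\top} x_i + b)\big)\big) \;+\; \|w\|_1
```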

Effect of Regularization

Decision boundaries of a linear SVM for different values of C

● Small C (a lot of regularization) limits the influence of individual points!
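A minimal sketch of this comparison (the dataset and the printed summary are assumptions; the slides show decision-boundary plots instead):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Toy two-class data, chosen only to illustrate the effect of C
X, y = make_blobs(centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1, 100]:
    svm = LinearSVC(C=C).fit(X, y)
    # Larger C -> less regularization -> individual points pull the boundary harder
    print(f"C={C:>6}: coef={svm.coef_.ravel()}, train acc={svm.score(X, y):.2f}")
```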
(soft margin) linear SVM

● Both versions (L1- and L2-penalized) are strongly convex; neither is smooth, because the hinge loss is not differentiable at its kink.
● Only some points contribute to w: the support vectors (a sparse solution to the dual).
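For reference, a sketch of the C-parameterized soft-margin objectives (hinge loss; labels y_i ∈ {-1, +1} assumed, constant factors omitted):

```latex
% L2-penalized (standard) soft-margin linear SVM
\min_{w,\,b} \; C \sum_{i=1}^{n} \max\!\big(0,\; 1 - y_i (w^{\top} x_i + b)\big) \;+\; \|w\|_2^2

% L1-penalized variant
\min_{w,\,b} \; C \sum_{i=1}^{n} \max\!\big(0,\; 1 - y_i (w^{\top} x_i + b)\big) \;+\; \|w\|_1
```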

SVM or LogReg?

● Do you need probability estimates?
  - Yes: use Logistic Regression.
  - No: it doesn’t matter; try either / both.

● Need a compact model, or believe the solution is sparse? Use L1.
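A minimal sketch of the L1 option (the solver choice and dataset are assumptions):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# An L1 penalty tends to drive many coefficients exactly to zero -> compact model
lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(lr_l1.coef_), "of", lr_l1.coef_.size)
```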

LR Example C=1

● The default value of C=1 provides good performance

● However, the training and test set scores are very close, so we are likely underfitting
LR Example C=100

● Using C=100 results in higher training and test set accuracies


● A more complex model performs better in this case

LR Example C=0.01

● Training and test set accuracy decrease


● What does this tell us?
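A minimal sketch reproducing this comparison (assuming the book's breast cancer example; the split and solver settings are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Compare the three regularization settings from the slides
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train)
    print(f"C={C:>6}: train={lr.score(X_train, y_train):.3f}, "
          f"test={lr.score(X_test, y_test):.3f}")
```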

Overfitting and Underfitting
Multiclass Classification

Reduction to Binary Classification
For 4 classes:

● One vs Rest (the standard approach)
  1v{2, 3, 4}, 2v{1, 3, 4}, 3v{1, 2, 4}, 4v{1, 2, 3}
  n binary classifiers, each trained on all of the data

● One vs One
  1v2, 1v3, 1v4, 2v3, 2v4, 3v4
  n * (n - 1) / 2 binary classifiers, each trained on a fraction of the data
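A minimal sketch of both reductions using scikit-learn's meta-estimators (the base estimator and dataset are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # n = 3 binary classifiers
print(len(ovo.estimators_))  # n * (n - 1) / 2 = 3 binary classifiers
```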

Prediction with One Vs Rest

● Predict the “class with the highest score”, i.e. the class whose classifier gives the highest value of the classification confidence formula (the decision function)

● It is just a heuristic: it is not obvious why it should work, but it works well in practice
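A sketch of the rule on the iris data (the estimator choice is an assumption):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
clf = LinearSVC(max_iter=10000).fit(X, y)  # uses one-vs-rest internally

scores = clf.decision_function(X)          # shape (n_samples, n_classes)
predictions = np.argmax(scores, axis=1)    # highest-scoring class wins
print((predictions == clf.predict(X)).all())   # predict() applies the same rule
```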
One vs Rest
Prediction with One Vs One
● “Vote for highest positives”
● Classify by all classifiers.
● Count how often each class was predicted.
● Return most commonly predicted class.

● Again – just a heuristic.
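A from-scratch sketch of this voting scheme (dataset and base classifier are assumptions; scikit-learn's OneVsOneClassifier does this for you):

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Train one classifier per pair of classes, on only those two classes
pairwise = {}
for a, b in combinations(classes, 2):
    mask = np.isin(y, [a, b])
    pairwise[(a, b)] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

# Classify with all pairwise classifiers and count the votes per class
votes = np.zeros((len(X), len(classes)), dtype=int)
for (a, b), clf in pairwise.items():
    pred = clf.predict(X)
    for cls in (a, b):
        votes[:, cls] += (pred == cls)

prediction = votes.argmax(axis=1)  # most commonly predicted class wins
print((prediction == y).mean())
```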

One vs One
Iris Dataset

What about Coefficients?

● Each row of coef_ contains the coefficient vector for one of the
three classes
● Each column holds the coefficient value for a specific feature
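Illustrated on the iris data (a sketch; the estimator settings are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

print(logreg.coef_.shape)       # (3, 4): one row per class, one column per feature
print(logreg.intercept_.shape)  # (3,):  one intercept per class
```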

Coefficient and Intercept

Multinomial Logistic Regression
Probabilistic multi-class model:
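A sketch of the model the slide refers to, in its usual softmax form (one weight vector w_c and intercept b_c per class assumed):

```latex
p(y = c \mid x) = \frac{\exp\!\big(w_c^{\top} x + b_c\big)}{\sum_{c'} \exp\!\big(w_{c'}^{\top} x + b_{c'}\big)}
\qquad
\hat{y} = \arg\max_{c} \; w_c^{\top} x + b_c
```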

Same prediction rule as OvR!
In scikit-learn
● OvO: only SVC
● OvR: the default for all linear models, even LogisticRegression
● Multinomial: LogisticRegression(multi_class="multinomial")
● clf.decision_function(X) returns the per-class scores w^T x
● logreg.predict_proba(X) returns class probabilities
● SVC(probability=True) is not great (it needs an expensive internal cross-validation, and the resulting probabilities can be inconsistent with the decision function)
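A sketch of these options on iris (note: multi_class was an explicit parameter in the scikit-learn releases current at the time of this lecture; newer versions deprecate it):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr = LogisticRegression(multi_class="ovr", max_iter=1000).fit(X, y)
softmax = LogisticRegression(multi_class="multinomial", max_iter=1000).fit(X, y)

print(ovr.decision_function(X[:2]))      # per-class scores w^T x + b
print(softmax.predict_proba(X[:2]))      # multinomial class probabilities

svc = SVC(probability=True).fit(X, y)    # OvO internally; probabilities need extra CV
print(svc.predict_proba(X[:2]))
```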

Computational Considerations
(for all linear models)

Solver Choices
● Don’t use SVC(kernel="linear"); use LinearSVC
● For n_features >> n_samples: use Lars (or LassoLars) instead of Lasso
● For small n_samples (< 10,000?), don’t worry
● LinearSVC, LogisticRegression: set dual=False if n_samples >> n_features
● LogisticRegression(solver="sag") when n_samples is large
● Stochastic Gradient Descent when n_samples is “really large”
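A sketch of what these recommendations look like in code (the dataset shapes are assumptions, chosen only to match each regime):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LassoLars, LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

# Wide data (n_features >> n_samples): LassoLars instead of Lasso
X_wide, y_wide = make_regression(n_samples=100, n_features=5000, random_state=0)
lars = LassoLars(alpha=0.1).fit(X_wide, y_wide)

# Tall data (n_samples >> n_features): dual=False, or the "sag" solver
X_tall, y_tall = make_classification(n_samples=50000, n_features=20, random_state=0)
svm = LinearSVC(dual=False).fit(X_tall, y_tall)
logreg = LogisticRegression(solver="sag", max_iter=1000).fit(X_tall, y_tall)

# Very large n_samples: stochastic gradient descent
sgd = SGDClassifier().fit(X_tall, y_tall)  # default loss="hinge" -> linear SVM
```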

Stochastic Gradient Descent

Leon Bottou and Olivier Bousquet. 2007. The tradeoffs of large scale
learning. In Proceedings of the 20th International Conference on Neural
Information Processing Systems, 161–168.

https://scikit-learn.org/stable/modules/sgd.html
SGD Pros and Cons
PROS
● Efficiency

● Ease of implementation (lots of opportunities for code tuning)

CONS
● SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations


● SGD is sensitive to feature scaling
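Because SGD is sensitive to feature scaling, a common pattern (a sketch, not from the slides) is to standardize the features in a pipeline before the SGD-trained model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features to zero mean / unit variance, then fit the linear model with SGD
pipe = make_pipeline(StandardScaler(),
                     SGDClassifier(alpha=1e-4, max_iter=1000, random_state=0))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```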

