
Machine Learning Algorithms

Lorenzo Servadei, Sebastian Schober, Daniela S Lopera, Wolfgang Ecker


Introduction to Machine Learning Algorithms

2
Table of Contents

Machine Learning Algorithms – an Overview

The Data Driven Approach

K-Nearest Neighbors

Linear Classifier

Loss Functions

Decision Trees and Random Forest

3
Table of Contents

• Machine Learning Algorithms – an Overview


• The Data Driven Approach
• K-Nearest Neighbors
• Linear Classifier
• Loss Functions
• Decision Trees and Random Forest

4
Machine Learning Algorithms – an Overview

5
Machine Learning Algorithms – an Overview

Parametric Learning Algorithms

Non-parametric Learning Algorithms:
• KNNs
• Decision Trees
• Random Forest
• etc.

6
Problem of hard coded classification

Unlike, e.g., sorting a list of numbers, there is no obvious way to hard-code an algorithm for recognizing a cat or other classes.

7
Edge-based method?

(Figure: find edges, then find corners.)

John Canny, “A Computational Approach to Edge Detection”, IEEE TPAMI 1986

8
A possible Approach

1. Collect a dataset of images and labels


2. Use Machine Learning to train a classifier
3. Evaluate the classifier on new images
Example training set

9
A possible Approach

Train: memorize all data and labels.

Predict: the label of the most similar training image.

10
Example Dataset: CIFAR10

10 classes
50,000 training images
10,000 testing images

Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.

11
Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.

12
L1 Distance

L1 distance: d1(I1, I2) = Σ_p |I1^p − I2^p|  (sum of absolute pixel-wise differences)

13
Code implementation, using Numpy
Nearest Neighbor classifier


14
Code implementation, using Numpy
Nearest Neighbor classifier

Memorize training data


15
Code implementation, using Numpy
Nearest Neighbor classifier

For each test image:


Find closest train image
Predict label of nearest image


16
Code implementation, using Numpy
Nearest Neighbor classifier

Q: With N
examples, how
fast are training
and prediction?


17
Code implementation, using Numpy
Nearest Neighbor
classifier

Q: With N
examples, how
fast are training
and prediction?

A: Train O(1),
predict
O(N)


18
Code implementation, using Numpy
Nearest Neighbor
classifier

Q: With N
examples, how
fast are training
and prediction?

A: Train O(1),
predict
O(N)

This is bad: we want classifiers that are fast at prediction; slow training is OK.
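A minimal NumPy sketch of the Nearest Neighbor classifier described above (illustrative, not the original slide code): training memorizes the data in O(1), prediction scans all N training images with the L1 distance.

    import numpy as np

    class NearestNeighbor:
        def train(self, X, y):
            # O(1): simply memorize all training data and labels
            self.Xtr = X          # X is N x D, each row a flattened image
            self.ytr = y          # y holds the N integer labels

        def predict(self, X):
            # O(N) per test image: compare against every training image
            num_test = X.shape[0]
            y_pred = np.zeros(num_test, dtype=self.ytr.dtype)
            for i in range(num_test):
                distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)  # L1 distance
                nearest = np.argmin(distances)       # index of the closest training image
                y_pred[i] = self.ytr[nearest]        # copy its label
            return y_pred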

19
K-Nearest Neighbors


20
K-Nearest Neighbors

Instead of copying the label from the nearest neighbor, take a majority vote from the K closest points.

(Figure: decision regions for K = 1, K = 3, K = 5.)
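A hedged sketch of the majority vote, extending the NearestNeighbor sketch above (it assumes labels are non-negative integers and that k is smaller than the number of training examples):

    import numpy as np

    def predict_knn(Xtr, ytr, Xte, k=3):
        # For each test image: find the k closest training images (L1 distance)
        # and predict the most common label among them.
        y_pred = np.zeros(Xte.shape[0], dtype=ytr.dtype)
        for i in range(Xte.shape[0]):
            distances = np.sum(np.abs(Xtr - Xte[i, :]), axis=1)
            knn_idx = np.argpartition(distances, k)[:k]      # indices of the k closest
            y_pred[i] = np.bincount(ytr[knn_idx]).argmax()   # majority vote
        return y_pred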


21
What does it look like?

22
What does it look like?

23
K-Nearest Neighbors: Distance Metric

L1 (Manhattan) distance: d1(I1, I2) = Σ_p |I1^p − I2^p|        L2 (Euclidean) distance: d2(I1, I2) = sqrt( Σ_p (I1^p − I2^p)² )
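A tiny illustration of the two distances on toy 4-pixel "images" (the values are arbitrary):

    import numpy as np

    I1 = np.array([56.0, 231.0, 24.0, 2.0])
    I2 = np.array([10.0, 20.0, 24.0, 17.0])

    l1 = np.sum(np.abs(I1 - I2))             # L1 (Manhattan) distance
    l2 = np.sqrt(np.sum((I1 - I2) ** 2))     # L2 (Euclidean) distance
    print(l1, l2)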


24
K-Nearest Neighbors: Distance Metric

L1 (Manhattan) distance vs. L2 (Euclidean) distance

(Figure: K = 1 decision boundaries under the L1 and the L2 distance.)


25
K-Nearest Neighbors: Demo

http://vision.stanford.edu/teaching/cs231n-demos/knn/

26
K-Nearest Neighbors: Hyperparameters

• What is the best value of k to use? What is the best distance to use?

• These are hyperparameters: choices about the algorithm that we set rather than learn.

• Very problem-dependent.

27
Searching for Hyperparameters

Idea #1: Choose hyperparameters that work best on the data

Your Dataset

28
Searching for Hyperparameters

Idea #1: Choose hyperparameters that work best on the data


BAD: always works perfectly on training
data

Your Dataset

29
Searching for Hyperparameters

Idea #1: Choose hyperparameters that work best on the data


BAD: always works perfectly on training
data

Your Dataset

Idea #2: Split data into train and test, choose hyperparameters that work best
on test data

train test

30
Searching for Hyperparameters

Idea #1: Choose hyperparameters that work best on the data


BAD: always works perfectly on training
data

Your Dataset

Idea #2: Split data into train and test, choose hyperparameters that work best
on test data

train test

BAD: No idea how algorithm


will perform on new data

31
Searching for Hyperparameters

Idea #3: Split data into train, val, and test; choose hyperparameters on val
and evaluate on test

train validation test

32
Searching for Hyperparameters

Idea #4: K-Folds Cross-Validation: Split data into folds, try each fold as
validation and average the results

fold 1 fold 2 fold 3 fold 4 fold 5 test

fold 1 fold 2 fold 3 fold 4 fold 5 test

fold 1 fold 2 fold 3 fold 4 fold 5 test

Useful for small datasets, but not used too frequently in deep learning
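A hedged sketch of 5-fold cross-validation for choosing k, reusing the predict_knn sketch from the K-Nearest Neighbors part (the candidate k values and the accuracy criterion are illustrative):

    import numpy as np

    def cross_validate_k(X_train, y_train, k_choices=(1, 3, 5, 8, 10), num_folds=5):
        X_folds = np.array_split(X_train, num_folds)
        y_folds = np.array_split(y_train, num_folds)
        mean_accuracy = {}
        for k in k_choices:
            fold_acc = []
            for i in range(num_folds):
                # fold i is the validation split, the remaining folds form the training split
                X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
                y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
                y_pred = predict_knn(X_tr, y_tr, X_folds[i], k)   # kNN sketch from above
                fold_acc.append(np.mean(y_pred == y_folds[i]))
            mean_accuracy[k] = np.mean(fold_acc)
        # pick the k with the best average validation accuracy
        return max(mean_accuracy, key=mean_accuracy.get)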

33
Searching for Hyperparameters

Idea #5: Nested Cross-Validation: Two Loops, for model selection and
evaluation

34
Searching for Hyperparameters

Example of 5-fold cross-validation for the value of k. Each point is a single outcome.

35
k-Nearest Neighbors‘ Drawbacks

k-Nearest Neighbor is never used on images:

- Very slow at test time
- Distance metrics on raw pixels are not informative

(Figure: Original vs. Boxed, Shifted, and Tinted versions; all three modified images have the same L2 distance to the one on the left. Original image is CC0 public domain.)

36
Curse of Dimensionality

k-Nearest Neighbor is never used on images:

- Curse of dimensionality: covering the space densely requires a number of training points exponential in the dimension. (Figure: Dimensions = 1, Points = 4; Dimensions = 2, Points = 4²; Dimensions = 3, Points = 4³.)

37
k-Nearest Neighbor: Summary

In image classification we start with a training set of images and labels, and
must predict labels on the test set

The K-Nearest Neighbors classifier predicts labels based on nearest training


examples

Distance metric and K are hyperparameters

Choose hyperparameters using the validation set; only run on the test set once at
the very end!


38
Linear Classifiers

(Example captioned images: "Two young girls are playing with lego toy." "Boy is doing backflip on wakeboard." "Man in black shirt is playing guitar." "Construction worker in orange safety vest is working on road.")

Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figures copyright IEEE, 2015. Reproduced for educational purposes.


39
Recall CIFAR10

50,000 training images


each image is 32x32x3

10,000 test images.

40
Parametric Approach

Image x: array of 32x32x3 numbers (3072 numbers total)
f(x,W) → 10 numbers giving class scores
W: parameters or weights

41
Parametric Approach: Linear Classifier
f(x,W) = Wx

Image x: array of 32x32x3 numbers (3072 numbers total)
f(x,W) → 10 numbers giving class scores
W: parameters or weights

42
Parametric Approach: Linear Classifier

f(x,W) = Wx

x: image, 3072x1 (array of 32x32x3 = 3072 numbers)
W: parameters or weights, 10x3072
f(x,W): 10x1 (10 numbers giving class scores)

43
Classification through a linear classifier

Image with 4 pixels, and 3 classes (cat/dog/ship)

Input image (2x2 pixels):
  56  231
  24    2

Stretch pixels into a column: x = [56, 231, 24, 2]

44
Classification through a linear classifier

Image with 4 pixels, and 3 classes (cat/dog/ship)

Stretch pixels into column: x = [56, 231, 24, 2]

        W                    x          b         scores
  0.2  -0.5   0.1   2.0      56        1.1       -96.8   Cat score
  1.5   1.3   2.1   0.0  ·  231    +   3.2   =   437.9   Dog score
  0.0   0.25  0.2  -0.3      24       -1.2        61.95  Ship score
                              2
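A small NumPy sketch of this score computation, using the numbers shown on the slide (illustrative):

    import numpy as np

    x = np.array([56.0, 231.0, 24.0, 2.0])         # stretched input image, shape (4,)
    W = np.array([[0.2, -0.5,  0.1,  2.0],         # one row of weights per class
                  [1.5,  1.3,  2.1,  0.0],
                  [0.0,  0.25, 0.2, -0.3]])        # shape (3, 4)
    b = np.array([1.1, 3.2, -1.2])                 # one bias per class, shape (3,)

    scores = W.dot(x) + b                          # f(x, W) = Wx + b, shape (3,)
    print(dict(zip(["cat", "dog", "ship"], scores)))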

45
What is a linear Classifier

Image with 4 pixels, and 3 classes (cat/dog/ship)


Algebraic Viewpoint

f(x,W) = Wx


46
What is a linear Classifier
Image with 4 pixels, and 3 classes (cat/dog/ship)

Algebraic Viewpoint: f(x,W) = Wx

  W:
    0.2  -0.5   0.1   2.0
    1.5   1.3   2.1   0.0
    0.0   0.25  0.2  -0.3

  b:      1.1    3.2   -1.2
  Score: -96.8  437.9   61.95

Input image: x = [56, 231, 24, 2]

47
Interpreting a linear Classifier


48
Interpreting a linear Classifier
Visual Viewpoint


49
Interpreting a linear Classifier
Geometric Viewpoint

f(x,W) = Wx + b

Array of 32x32x3 numbers


(3072 numbers total)

Plot created using Wolfram Cloud. Cat image by Nikita is licensed under CC-BY 2.0.

50
Hard cases for a linear classifier

Case 1: Class 1 is the first and third quadrants; Class 2 is the second and fourth quadrants.
Case 2: Class 1 is 1 <= L2 norm <= 2; Class 2 is everything else.
Case 3: Class 1 has three modes; Class 2 is everything else.

51
Linear Classifier: Three Viewpoints

Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up space

52
Linear Classifier: Three Viewpoints
f(x,W) = Wx + b
So far: we defined a (linear) score function.

Example class scores for 3 images (cat, car, frog) for some W:

  -3.45   -0.51    3.42
  -8.87    6.04    4.64
   0.09    5.31    2.65
   2.9    -4.22    5.1
   4.48   -4.19    2.64
   8.02    3.58    5.55
   3.78    4.49   -4.34
   1.06   -4.37   -1.5
  -0.36   -2.09   -4.79
  -0.72   -2.93    6.14

How can we tell whether this W is good or bad?

Cat image by Nikita is licensed under CC-BY 2.0. Car image is CC0 1.0 public domain. Frog image is in the public domain.

53
Linear Classifier: TODO List

TODO:

1. Define a loss function


that quantifies our
unhappiness with the
scores across the training
data.

2. Come up with a way of


efficiently finding the
parameters that minimize
the loss function.
(optimization)
Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain


54
Linear Classifier: Scores

Suppose: 3 training examples, 3 classes. With some W the scores are:

        cat:   3.2    1.3    2.2
        car:   5.1    4.9    2.5
        frog: -1.7    2.0   -3.1

55
Loss Functions
Suppose: 3 training examples, 3 classes. With some W the scores are:

        cat:   3.2    1.3    2.2
        car:   5.1    4.9    2.5
        frog: -1.7    2.0   -3.1

A loss function tells how good our current classifier is.

Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is its (integer) label, the loss over the dataset is a sum (average) of the per-example losses:

  L = (1/N) * sum_i L_i( f(x_i, W), y_i )

56
Multiclass SVM Loss
Suppose: 3 training examples, 3 classes. With some W the scores are:

        cat:   3.2    1.3    2.2
        car:   5.1    4.9    2.5
        frog: -1.7    2.0   -3.1

Multiclass SVM loss: given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:

  L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)

57
Multiclass SVM Loss
Suppose: 3 training examples, 3 classes. With some W the scores are:

        cat:   3.2    1.3    2.2
        car:   5.1    4.9    2.5
        frog: -1.7    2.0   -3.1

Multiclass SVM loss ("hinge loss"), with s = f(x_i, W):

  L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)

58
Multiclass SVM Loss
Suppose: 3 training examples, 3 classes. With some W the scores are:

        cat:   3.2    1.3    2.2
        car:   5.1    4.9    2.5
        frog: -1.7    2.0   -3.1

Multiclass SVM loss: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1), with s = f(x_i, W).

59
Multiclass SVM Loss
Scores as before (cat: 3.2, 1.3, 2.2; car: 5.1, 4.9, 2.5; frog: -1.7, 2.0, -3.1); multiclass SVM loss L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1).

Loss for the cat image (correct class score 3.2):
  = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
  = max(0, 2.9) + max(0, -3.9)
  = 2.9 + 0
  = 2.9

Losses: 2.9

60
Multiclass SVM Loss
Scores and SVM loss as before.

Loss for the car image (correct class score 4.9):
  = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
  = max(0, -2.6) + max(0, -1.9)
  = 0 + 0
  = 0

Losses: 2.9, 0

61
Multiclass SVM Loss
Scores and SVM loss as before.

Loss for the frog image (correct class score -3.1):
  = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
  = max(0, 6.3) + max(0, 6.6)
  = 6.3 + 6.6
  = 12.9

Losses: 2.9, 0, 12.9

62
Multiclass SVM Loss
Scores and SVM loss as before; per-example losses 2.9, 0, 12.9.

The loss over the full dataset is the average:

  L = (2.9 + 0 + 12.9) / 3 = 5.27

63
Multiclass SVM Loss
Scores and SVM loss as before; losses 2.9, 0, 12.9.

Q: What happens to the loss if the car scores change a bit?

64
Multiclass SVM Loss
Scores and SVM loss as before; losses 2.9, 0, 12.9.

Q2: What is the min/max possible loss?

65
Multiclass SVM Loss
Scores and SVM loss as before; losses 2.9, 0, 12.9.

Q3: At initialization W is small so all s ≈ 0. What is the loss?

66
Multiclass SVM Loss
Scores and SVM loss as before; losses 2.9, 0, 12.9.

Q4: What if the sum was over all classes? (including j = y_i)

67
Multiclass SVM Loss
Scores and SVM loss as before; losses 2.9, 0, 12.9.

Q5: What if we used the mean instead of the sum?

68
Multiclass SVM Loss
Scores and SVM loss as before; losses 2.9, 0, 12.9.

Q6: What if we used

69
Multiclass SVM Loss - Implementation
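A minimal NumPy sketch of the per-example multiclass SVM loss defined above (illustrative; margin of 1 as in the formula):

    import numpy as np

    def svm_loss_single(x, y, W):
        # Multiclass SVM loss for one example (x, y), vectorized over classes
        scores = W.dot(x)                                 # class scores s = f(x, W)
        margins = np.maximum(0, scores - scores[y] + 1)   # hinge on every class
        margins[y] = 0                                    # skip the correct class (j != y_i)
        return np.sum(margins)

    # Sanity check with the cat column of the running example: scores 3.2, 5.1, -1.7
    scores = np.array([3.2, 5.1, -1.7])
    margins = np.maximum(0, scores - scores[0] + 1)
    margins[0] = 0
    print(np.sum(margins))   # 2.9, the value computed on the earlier slide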


70
Multiclass SVM Loss

E.g. suppose that we found a W such that L = 0. Is this W unique?

71
Multiclass SVM Loss

E.g. suppose that we found a W such that L = 0. Is this W unique?

No! 2W also has L = 0!

72
Multiclass SVM Loss: Parameters Search
Suppose: 3 training examples, 3 classes, with the scores as before (cat: 3.2, 1.3, 2.2; car: 5.1, 4.9, 2.5; frog: -1.7, 2.0, -3.1).

Before (car image, with W):
  = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
  = max(0, -2.6) + max(0, -1.9)
  = 0 + 0
  = 0

With W twice as large:
  = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1)
  = max(0, -6.2) + max(0, -4.8)
  = 0 + 0
  = 0

Losses: 2.9, 0

73
Regularization

Data loss: model predictions should match the training data.

74
Regularization

Data loss: model predictions should match the training data.
Regularization: prevent the model from doing too well on the training data.

75
Regularization


76
Regularization
Full loss: L = (1/N) * sum_i L_i( f(x_i, W), y_i ) + λ R(W), where λ = regularization strength (hyperparameter).

Data loss: model predictions should match the training data.
Regularization: prevent the model from doing too well on the training data.

77
Regularization
λ = regularization strength (hyperparameter)

Data loss: model predictions should match the training data.
Regularization: prevent the model from doing too well on the training data.

Simple examples:
  L2 regularization: R(W) = sum_k sum_l W_{k,l}^2
  L1 regularization: R(W) = sum_k sum_l |W_{k,l}|
  Elastic net (L1 + L2): R(W) = sum_k sum_l ( beta * W_{k,l}^2 + |W_{k,l}| )

78
Regularization
λ = regularization strength (hyperparameter)

Data loss: model predictions should match the training data.
Regularization: prevent the model from doing too well on the training data.

Simple examples: L2 regularization, L1 regularization, Elastic net (L1 + L2).
More complex: Dropout, Batch normalization, Stochastic depth, fractional pooling, etc.
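A hedged NumPy sketch of the simple regularizers and of how λ enters the full loss (the data-loss function is passed in as a parameter, e.g. the svm_loss_single sketch above; the value of lam is illustrative):

    import numpy as np

    def l2_reg(W):
        return np.sum(W * W)         # R(W): sum of squared weights

    def l1_reg(W):
        return np.sum(np.abs(W))     # R(W): sum of absolute weights

    def full_loss(W, X, y, data_loss_fn, lam=0.1, reg=l2_reg):
        # L = (1/N) * sum_i L_i(f(x_i, W), y_i) + lam * R(W)
        N = X.shape[0]
        data_loss = sum(data_loss_fn(X[i], y[i], W) for i in range(N)) / N
        return data_loss + lam * reg(W)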

79
Regularization
λ = regularization strength (hyperparameter)

Data loss: model predictions should match the training data.
Regularization: prevent the model from doing too well on the training data.

Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature


80
Regularization - Expressing Preferences


81
Regularization - Expressing Preferences

Simple examples:
  L2 regularization: R(W) = sum_k sum_l W_{k,l}^2
  L1 regularization: R(W) = sum_k sum_l |W_{k,l}|

L2 regularization likes to “spread out” the weights.


82
Regularization – Prefer simpler Models


83
Regularization – Prefer simpler Models

(Figure: two candidate fits f1 and f2 to training data y as a function of x.)

84
Regularization – Prefer simpler Models

(Figure: the same two fits f1 and f2.)
Regularization pushes against fitting the data
too well so we don’t fit noise in the data


85
Softmax Classifier

Softmax Classifier (Multinomial Logistic Regression)


Want to interpret raw classifier scores as probabilities

cat:   3.2
car:   5.1
frog: -1.7

86
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities.

Softmax function: P(Y = k | X = x_i) = exp(s_k) / sum_j exp(s_j), with s = f(x_i, W).

cat:   3.2
car:   5.1
frog: -1.7

87
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities.
Softmax function: probabilities must be >= 0, so exponentiate the scores.

        scores     exp (unnormalized probabilities)
cat:      3.2        24.5
car:      5.1       164.0
frog:    -1.7         0.18

88
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities.
Softmax function: probabilities must be >= 0 (exponentiate) and must sum to 1 (normalize).

        scores     exp        normalized probabilities
cat:      3.2        24.5        0.13
car:      5.1       164.0        0.87
frog:    -1.7         0.18       0.00

89
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities.

        unnormalized            unnormalized         normalized
        log-probabilities       probabilities        probabilities
        (logits)                (exp)                (normalize)
cat:      3.2                     24.5                 0.13
car:      5.1                    164.0                 0.87
frog:    -1.7                      0.18                0.00

90
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities.

        logits     exp       probabilities
cat:      3.2       24.5        0.13        →  L_i = -log(0.13) = 2.04
car:      5.1      164.0        0.87
frog:    -1.7        0.18       0.00

Maximum Likelihood Estimation: choose the probabilities to maximize the likelihood of the observed data (see CS 229 for details).

91
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities.

        probabilities     correct probs
cat:        0.13              1.00
car:        0.87              0.00
frog:       0.00              0.00

Compare the predicted probabilities with the correct probabilities.

92
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities.

        probabilities     correct probs
cat:        0.13              1.00
car:        0.87              0.00
frog:       0.00              0.00

Compare the two distributions with the Kullback–Leibler divergence: D_KL(P || Q) = sum_y P(y) log( P(y) / Q(y) ).

93
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities.

        probabilities     correct probs
cat:        0.13              1.00
car:        0.87              0.00
frog:       0.00              0.00

Comparing the two distributions with the cross-entropy gives the softmax (cross-entropy) loss: L_i = -log( P(Y = y_i | X = x_i) ).
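A minimal NumPy sketch of the softmax / cross-entropy computation walked through above (the max subtraction is a standard numerical-stability trick, not part of the slides):

    import numpy as np

    scores = np.array([3.2, 5.1, -1.7])    # cat, car, frog logits from the running example
    correct_class = 0                       # the example image is a cat

    shifted = scores - np.max(scores)                      # stability; does not change the result
    probs = np.exp(shifted) / np.sum(np.exp(shifted))      # softmax: roughly [0.13, 0.87, 0.00]
    loss = -np.log(probs[correct_class])                   # cross-entropy loss, roughly 2.04
    print(probs, loss)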

94
Distance Metric - Intuition

Take L2 (Euclidean) distance in parametric models

Noisy target: y = f(x) + ε, where f(x) = E(y | x) is the deterministic target and ε = y − f(x) is the noise.

95
Distance Metric - Intuition

Take L2 (Euclidean) distance in parametric models

Assuming ε has a Gaussian distribution, ε ~ N(0, σ²), we can rewrite the model in the form

  p(y | x, θ) = N(y | μ(x), σ²), where μ(x) = wᵀx.

The same holds if we apply it to a high-order polynomial expansion, μ(x) = wᵀφ(x).

96
Distance Metric - Intuition

Recall the Negative Log Likelihood (NLL). We can express the log likelihood of the data as

  ℓ(θ) = sum_{i=1..N} log p(y_i | x_i, θ),   NLL(θ) = −ℓ(θ).

Inserting the definition of the Gaussian:

  NLL(w) = (1 / (2σ²)) * RSS(w) + (N/2) * log(2πσ²),

where RSS(w) = sum_i (y_i − wᵀx_i)² is the sum of squared residuals. Averaging them gives the MSE – Mean Squared Error, MSE(w) = RSS(w) / N, so minimizing the NLL under Gaussian noise is the same as minimizing the MSE.

97
Regularization - Intuition

Posterior distribution over the weights, from a prior over the weights and the likelihood:

  p(w | D) ∝ p(D | w) p(w)

MLE = Maximum Likelihood Estimation: maximize p(D | w).
MAP = Maximum a Posteriori estimation: maximize p(D | w) p(w).

98
Regularization - Intuition

Posterior distribution over the weights, from a prior over the weights and the likelihood: with a Gaussian prior on w, MAP estimation amounts to minimizing

  J(w) = (1/N) * sum_i (y_i − wᵀx_i)² + λ ||w||²

This is also called Ridge Regression; the penalization term is the l2 regularization.
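A hedged NumPy sketch of ridge regression in closed form, illustrating the l2-penalized objective above (the toy data and the value of lam are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                    # toy design matrix
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=50)      # targets with Gaussian noise

    lam = 0.1                                        # regularization strength
    # minimize ||y - Xw||^2 + lam * ||w||^2  =>  w = (X^T X + lam I)^{-1} X^T y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(w_ridge)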

99
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression): want to interpret raw classifier scores as probabilities.

Maximize the probability of the correct class; putting it all together:

  L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) )

Scores: cat 3.2, car 5.1, frog -1.7.

10
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression): maximize the probability of the correct class; putting it all together: L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) ). Scores: cat 3.2, car 5.1, frog -1.7.

Q: What is the min/max possible loss L_i?

10
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression): L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) ). Scores: cat 3.2, car 5.1, frog -1.7.

Q: What is the min/max possible loss L_i?
A: min 0, max infinity.

10
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression): L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) ). Scores: cat 3.2, car 5.1, frog -1.7.

Q2: At initialization all s will be approximately equal; what is the loss?

10
Softmax Classifier
Softmax Classifier (Multinomial Logistic Regression): L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) ). Scores: cat 3.2, car 5.1, frog -1.7.

Q2: At initialization all s will be approximately equal; what is the loss?
A: log(C), e.g. log(10) ≈ 2.3.

10
Softmax vs. SVM


10
Softmax vs. SVM


10
Softmax vs. SVM

Assume scores [10, -2, 3], [10, 9, 9], and [10, -100, -100], where the first class is the correct one.

Q: Suppose I take a datapoint and jiggle it a bit (changing its score slightly). What happens to the loss in both cases (SVM vs. softmax)?

10
Recap

- We have some dataset of (x, y), e.g. images and their labels
- We have a score function: s = f(x; W) = Wx
- We have a loss function:

    Softmax:   L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) )
    SVM:       L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
    Full loss: L = (1/N) * sum_i L_i + R(W)

10
Recap
How do we find the best W?

- We have some dataset of (x, y)
- We have a score function: s = f(x; W) = Wx
- We have a loss function:

    Softmax:   L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) )
    SVM:       L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
    Full loss: L = (1/N) * sum_i L_i + R(W)

10
Motivation example: Kinect

11
Image classification example

[MSR Tutorial on decision forests by Criminisi et al., 2011]

11
Classification tree

[Criminisi et al, 2011]


11
Another commerce example

Simafore.com
11
From a spreadsheet to a
decision node

[AI book of Stuart Russell and Peter Norvig]


11
A learned decision tree

[AI book of Stuart Russell and Peter Norvig]


11
How do we construct the tree? I.e., how do we pick the attributes (nodes)?

For a training set containing p positive examples and n negative examples, the entropy is:

  H( p/(p+n), n/(p+n) ) = − (p/(p+n)) * log2( p/(p+n) ) − (n/(p+n)) * log2( n/(p+n) )
11
How to pick nodes?

❑ A chosen attribute A, with K distinct values, divides the training set E into subsets E1, …, EK.

❑ The Expected Entropy (EH) remaining after trying attribute A (with branches i = 1, 2, …, K), where the i-th child contains p_i positive and n_i negative examples, is:

  EH(A) = sum_{i=1..K} ( (p_i + n_i) / (p + n) ) * H( p_i/(p_i+n_i), n_i/(p_i+n_i) )

❑ The Information Gain (I), or reduction in entropy, for this attribute is:

  I(A) = H( p/(p+n), n/(p+n) ) − EH(A)

❑ Choose the attribute with the largest I.

[Hwee Tou Ng & Stuart Russell]


11
Example
❑ Convention: for the training set, p = n = 6, so H(6/12, 6/12) = 1 bit.

❑ Consider the attributes Patrons and Type (and others too):

  I(Patrons) = 1 − [ (2/12) H(0, 1) + (4/12) H(1, 0) + (6/12) H(2/6, 4/6) ] ≈ 0.541 bits

  I(Type) = 1 − [ (2/12) H(1/2, 1/2) + (2/12) H(1/2, 1/2) + (4/12) H(2/4, 2/4) + (4/12) H(2/4, 2/4) ] = 0 bits

Patrons has the larger information gain, so it is chosen as the splitting attribute.

[Hwee Tou Ng & Stuart Russell]
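A hedged Python sketch of the entropy / information-gain computation, using the split counts from the restaurant example above (the helper names are illustrative):

    import math

    def H(*probs):
        # Entropy in bits of a discrete distribution; 0 * log(0) is treated as 0
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def info_gain(splits, p, n):
        # splits: list of (p_i, n_i) counts for the children created by the attribute
        eh = sum((pi + ni) / (p + n) * H(pi / (pi + ni), ni / (pi + ni))
                 for pi, ni in splits)
        return H(p / (p + n), n / (p + n)) - eh

    # Restaurant example: 6 positive and 6 negative examples overall
    print(info_gain([(0, 2), (4, 0), (2, 4)], 6, 6))          # Patrons: about 0.541 bits
    print(info_gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))  # Type: 0 bits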


11
Classification tree

[Criminisi et al, 2011]


11
Use information gain to decide splits

[Criminisi et al, 2011]


12
Building a random tree

12
Random Forests algorithm

[From the book of Hastie, Friedman and Tibshirani]


12
Randomization

[Criminisi et al, 2011]


12
Recall - The Tradeoff of the Out-of-Sample Error

Take the expectation with respect to x and obtain the bias and the variance: as the model complexity H increases, the bias decreases and the variance increases.

12
Building a forest (ensemble)

[Criminisi et al, 2011]


12
Loss functions
A loss function implements a metric between the true output and the predicted output: it measures the distance between Y_pred and Y.

12
Loss Function implements a metric
What is a metric?

A map J(x, y) is a metric if and only if it meets the following conditions:

1. Positive definite: J(x, y) >= 0, and J(x, y) = 0 if and only if x = y

2. Symmetric: J(x, y) = J(y, x)

3. Triangle inequality: J(x, y) <= J(x, z) + J(z, y) for all z

Intuitively, a metric maps a pair of elements to the distance between them.

12
Basis of the master slides
The TUM Corporate Design Style Guide serves as the basis.
The presentation template is optimized for good readability and a clear presentation of information.

128
Sources

http://cs231n.stanford.edu/syllabus.html
https://www.cs.ubc.ca/~nando/340-2012/lectures.php
https://github.com/jonesgithub/book-1/blob/master/ML%20Machine%20Learning-A%20Probabilistic%20Perspective.pdf

129
