
Course series: Deep Learning for NLP

Machine Learning

Lecture # 1

Fahim Dalvi and Hassan Sajjad


Qatar Computing Research Institute, HBKU
Machine Learning

A technique that gives machines the ability to learn, without any explicit programming
Machine Learning

In simpler terms, a machine should be able to see some data, and learn to make decisions based on what it has seen
Machine Learning
An example: You are a car dealer, and you have a historical
record of which cars are accident prone. How can you “teach”
a computer to predict which new cars will be accident prone?

Maximum Speed Acceleration Color Car age Accident Prone?

Car 1 240 km/h Fast Red 2 yrs Yes

Car 2 100 km/h Fast Yellow 2 yrs No

Car 3 240 km/h Fast Blue 1 yr No

Car 4 200 km/h Slow Blue 5 yrs Yes

Car 5 100 km/h Fast Yellow 5 yrs Yes

Car 6 100 km/h Slow Black 6 yrs No

Car 7 150 km/h Fast Red 2 yrs ?


Machine Learning

How can you make your decision?


Search for closest vehicle in the past?
Come up with a set of rules?
How can you decide what knowledge is important?
Machine Learning
Historically, rule based systems were common:
if car.acceleration == fast and car.age > 1 and ...:
    print("accident prone")
elif car.acceleration == slow and car.maxspeed > 150 and ...:
    print("accident prone")
elif car.acceleration == slow and car.maxspeed < 50:
    print("not accident prone")
elif ...
...

Domain Specific • Cumbersome • Not easy to learn from new data


Machine Learning
Then, machine learning techniques came about...

[Diagram: training data (cars 1-6 with labels) → machine learning algorithms → "Model"]

Domain Agnostic • Robust • Easy to learn from new data


Machine Learning
Let’s talk about training data

[Diagram: training data (cars 1-6 with labels) → machine learning algorithms → "Model"]

Domain Agnostic • Robust • Easy to learn from new data


Training Data
Maximum Speed Acceleration Color Car age Accident Prone?

Car 1 240 km/h Fast Red 2 yrs Yes

Car 2 100 km/h Fast Yellow 2 yrs No

Car 3 240 km/h Fast Blue 1 yr No

Car 4 200 km/h Slow Red 5 yrs Yes

Car 5 100 km/h Fast Yellow 5 yrs Yes

Car 6 100 km/h Slow Black 6 yrs No

Input Features: Maximum Speed, Acceleration, Color, Car age • Labels: Accident Prone?

We use training examples with labels to train a model
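A minimal sketch of how this table could be encoded numerically before training (the integer encoding below is an illustrative assumption, not the course's choice; real pipelines often one-hot encode categorical features instead):

```python
# Hypothetical numeric encoding of the training table above.
acceleration = {"Slow": 0, "Fast": 1}
color = {"Red": 0, "Yellow": 1, "Blue": 2, "Black": 3}

# Input features: [max speed (km/h), acceleration, color, age (yrs)]
X = [
    [240, acceleration["Fast"], color["Red"],    2],
    [100, acceleration["Fast"], color["Yellow"], 2],
    [240, acceleration["Fast"], color["Blue"],   1],
    [200, acceleration["Slow"], color["Red"],    5],
    [100, acceleration["Fast"], color["Yellow"], 5],
    [100, acceleration["Slow"], color["Black"],  6],
]
# Labels: 1 = "Yes" (accident prone), 0 = "No"
y = [1, 0, 0, 1, 1, 0]
```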


Machine Learning
Algorithms use this training data

[Diagram: training data (cars 1-6 with labels) → machine learning algorithms → "Model"]

Domain Agnostic • Robust • Easy to learn from new data


Algorithms
In this case, we have labels for each car. This class of problems is handled by supervised learning algorithms.

Unsupervised learning algorithms work on unlabelled data
Algorithms
Many techniques exist to build models:
• "Finding similar cars" type methods (Unsupervised):
– K-means clustering
– Hierarchical clustering
• "Create set of rules" type methods (Supervised):
– Support vector machines
– Logistic Regression
– Neural Networks
Machine Learning
Algorithms use this training data to produce a model

[Diagram: training data (cars 1-6 with labels) → machine learning algorithms → "Model"]

Domain Agnostic • Robust • Easy to learn from new data


Machine Learning Model
Example of a model: A function that takes
information about a car, and predicts whether it’s
accident prone or not

f([car image]) = No

f([car image]) = Yes
Machine Learning Model
Example of a model: A function that takes a word, and predicts its part-of-speech tag

f(“car”) = Noun
f(“beautiful”) = Adjective
f(“she”) = Pronoun
Machine Learning Model

At a high level, the basic idea is to figure out which features are important, and how important they are for prediction:

0.3 x car.maxspeed
+ 0.2 x car.acceleration
+ 0.0 x car.color
+ 0.5 x car.age

An older car is more likely to be accident prone; the color of a car has no impact on accidents.

These numbers are the feature weights.
Machine Learning

[Diagram: training data (cars 1-6 with labels) → machine learning algorithms → "Model"]
Supervised Learning: Classification
Process of assigning objects to categories

For example, Car 1 belongs to category “Accident Prone”


Maximum Speed Acceleration Color Car age Accident Prone?

Car 1 240 km/h Fast Red 2 yrs Yes

Car 2 100 km/h Fast Yellow 2 yrs No

Car 3 240 km/h Fast Blue 1 yr No

Car 4 200 km/h Slow Blue 5 yrs Yes

Car 5 100 km/h Fast Yellow 5 yrs Yes

Car 6 100 km/h Slow Black 6 yrs No


Supervised Learning
Process of predicting values or categories

Pricing Example: €55.41, €80.12, €90.00, …, €97.55 → Regression
Car Example: Accident Prone / Not Accident Prone → Binary Classification
POS Example: Noun, Verb, Pronoun, Adjective → Multiclass Classification

Regression: predicting continuous real values
Binary Classification: predicting two classes, e.g. yes or no
Multiclass Classification: predicting more than two classes
Binary Classification
Exercise
Binary Classification Exercise
• We will start by looking at a simple technique -
Linear classification for two classes
Binary Classification Exercise
• We will start by looking at a simple technique -
Linear classification for two classes
• Our model/function will predict just one real
number
• If this number is < 0, we will consider it to belong
to Class 1. If it is ≥ 0, we will consider it to
belong to Class 2.
Binary Classification Exercise
• Choose w0, w1 and b such that positive examples give a result w0·x0 + w1·x1 + b > 0 and negative examples give a result < 0

(Recall the weighted-sum form from earlier:
0.3 x car.maxspeed + 0.2 x car.acceleration + 0.0 x car.color + 0.5 x car.age)

x0   x1   Class
 2    0   Positive
 5   -2   Positive
-2    2   Negative
-1   -3   Negative
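To make the exercise concrete, here is a quick Python sketch that checks one candidate weight setting (the w0 = 3, w1 = 1, b = 3 solution proposed on the next slide) against the table:

```python
def f(x0, x1, w0=3, w1=1, b=3):
    # Linear score: w0*x0 + w1*x1 + b
    return w0 * x0 + w1 * x1 + b

examples = [((2, 0), "Positive"), ((5, -2), "Positive"),
            ((-2, 2), "Negative"), ((-1, -3), "Negative")]

for (x0, x1), label in examples:
    score = f(x0, x1)
    predicted = "Positive" if score > 0 else "Negative"
    print((x0, x1), "score =", score, "->", predicted)
```

Every positive example scores above zero and every negative example below, so these weights separate the data.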
Binary Classification Exercise
Find weights that separate positive examples from negative examples

[Plot: the four examples in the (x0, x1) plane]
Binary Classification Exercise
• Potential Solution: w0 = 3, w1 = 1 and b = 3

w0·x0 + w1·x1 + b = 0 should define a decision boundary

3·x0 + x1 + 3 = 0 defines one such decision boundary

Positive examples will be on one side of the boundary, and negative examples on the other

Some points on this boundary (x0, x1): (0, -3), (-1, 0), (-2, 3)
Binary Classification
What did we do?
- We were given a set of points in space
- We tried to draw a line to separate the
“positive” points from the “negative” points
- The line was defined using “feature weights”
Vector Spaces
Vector Spaces

[3D plot with axes Speed, Acceleration, Age]

Imagine every feature as a dimension in space

Every object (car) can be represented as a point in space
Vector Spaces

[Plot: words as points in space, clustered by part of speech. NOUN: Wall, Table, Laptop, Cat; PRONOUN: they, he, she, we, I; ADJECTIVE: sharp, beautiful, fast]
Lifecycle of Training a
Machine Learning Model
Learning Lifecycle

[Diagram: input data → objective function (with parameters) → loss function → optimization, which updates the parameters]
Objective Function

The objective function defines our goal:

f(x, W, b) = N numbers
(Input: x • Parameters: P = {W, b} • Outputs: N numbers)

N can be 1, as in our previous example:
• This 1 number can be a real valued output (for example depicting price, age etc). This is called regression.
• This 1 number can also be used in the special case of binary classification (two classes) like we did in the previous exercise - i.e. Class 1 if f > 0 and Class 2 if f ≤ 0

N = 2 can be used for binary classification
• Example: scores for the two classes - accident prone and not accident prone

N = M can be used for M-class classification
• Example: M = 40 for POS tags

The parameters W and b are learned by the algorithm, just like we learned the weights and bias in the previous exercise!
Learning Lifecycle

[Diagram: input data → objective function (with parameters) → loss function → optimization, which updates the parameters]
Loss Function
Given a set of parameters P = {P1, P2, …}, how do you know which one to use?

P1 = {w0: 3, w1: 1, b: 3}
P2 = {w0: -2, w1: -1, b: -1}
P3 = {w0: -3, w1: -1, b: 3}

[Plot: the three candidate decision boundaries in the (x0, x1) plane]

We use the concept of loss: a loss function takes in the output of our model, compares it to the true value and then gives us a measure of how "far" our model is.
Loss Function
A loss function is any function that gives a measure
of how far your scores are from their true values

Not accident prone Accident prone

[1.0, 0.0] ← True values → [0.0, 1.0]


Loss Function Exercise
Consider two cars and three sets of parameters

Not accident prone (car 1)        Accident prone (car 2)

Which set of parameters is the best?

f(car1, P1) = [0.5, 0.5]    f(car2, P1) = [0.1, 0.9]
f(car1, P2) = [0.7, 0.3]    f(car2, P2) = [0.3, 0.7]
f(car1, P3) = [0.1, 0.9]    f(car2, P3) = [0.9, 0.1]

P1: confused model • P2: less confident but correct model • P3: very confident but wrong model
Loss Function
A potential loss function in this case is the sum of the absolute differences of the scores (one term per class):

L(car1, P1) = sum(|f(car1, P1) - [1.0, 0.0]|)
            = sum([ |-0.5| , |0.5| ]) = 1

L(car2, P1) = sum(|f(car2, P1) - [0.0, 1.0]|)
            = sum([ |0.1| , |-0.1| ]) = 0.2

L(car1, P2) = 0.6    L(car2, P2) = 0.6
L(car1, P3) = 1.8    L(car2, P3) = 1.8

Average loss for both cars:
L(P1) = 0.6    L(P2) = 0.6    L(P3) = 1.8
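The computation above can be sketched in a few lines of Python (car and parameter names follow the slides):

```python
def abs_loss(scores, true):
    # Sum of absolute differences between model scores and true values
    return sum(abs(s - t) for s, t in zip(scores, true))

true_car1, true_car2 = [1.0, 0.0], [0.0, 1.0]
outputs = {  # (f(car1, P), f(car2, P)) for each parameter set
    "P1": ([0.5, 0.5], [0.1, 0.9]),
    "P2": ([0.7, 0.3], [0.3, 0.7]),
    "P3": ([0.1, 0.9], [0.9, 0.1]),
}

for name, (s1, s2) in outputs.items():
    avg = (abs_loss(s1, true_car1) + abs_loss(s2, true_car2)) / 2
    print(name, round(avg, 2))  # P1: 0.6, P2: 0.6, P3: 1.8
```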
Loss Function
Average loss for both cars
L(P1) = 0.6 L(P2) = 0.6 L(P3) = 1.8

A lower value of the loss indicates a better model


i.e. we are closer to the true values
In this case, P1 and P2 share the lowest value of 0.6, so we know they are better than P3. However, we also know that P2 is better than P1, and this implies our loss function is not very good right now!
Loss Function
Better loss function: Mean Squared Error

Loss is equal to the sum of the squares of the differences in the scores:

MSE = Σi (fi - yi)²

MSE(car1, P1) = 0.50    MSE(car2, P1) = 0.02
MSE(car1, P2) = 0.18    MSE(car2, P2) = 0.18
MSE(car1, P3) = 1.62    MSE(car2, P3) = 1.62

Average loss:
L(P1) = 0.26    L(P2) = 0.18    L(P3) = 1.62
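The same comparison with squared differences, as a sketch:

```python
def mse(scores, true):
    # Sum of squared differences (the "Mean Squared Error" of the slides)
    return sum((s - t) ** 2 for s, t in zip(scores, true))

true_car1, true_car2 = [1.0, 0.0], [0.0, 1.0]
outputs = {
    "P1": ([0.5, 0.5], [0.1, 0.9]),
    "P2": ([0.7, 0.3], [0.3, 0.7]),
    "P3": ([0.1, 0.9], [0.9, 0.1]),
}

for name, (s1, s2) in outputs.items():
    avg = (mse(s1, true_car1) + mse(s2, true_car2)) / 2
    print(name, round(avg, 2))  # P1: 0.26, P2: 0.18, P3: 1.62
```

Unlike the absolute-difference loss, P2 now scores strictly better than P1.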
Loss Function

Mean Squared Error works better, as it penalizes values that are further away from the true value more heavily
Loss Function
Many other choices for loss functions:
• Absolute Distance loss
• Hinge loss
• Logistic loss
• Cross Entropy loss

Loss Function

The loss function is also known as the cost function in some literature
Learning Lifecycle

Parameters

Optimization
Objective Function
Input data Function

Loss Function
Optimization
Now that we have a way of defining loss, we need a way to use it to improve our parameters.

This process is called optimization - where our goal is to "minimize" the loss function, i.e. bring it as close to zero as possible
Optimization Exercise
Find the value of x in the following equation:

x + 5 = ?

For every guess you will get the following hints:

Direction: Higher Lower

Error: Very far Far Close Very close


Optimization Exercise

[Guess sequence: 45, 15, 58, 25, 55, 20, 50, 48, 47, 46, 45]
Optimization Exercise

Just like you did the exercise of updating x based on our feedback, machines can also look at the loss ("Higher", "Very far") and decide to update x appropriately
Optimization Exercise

Optimization algorithms use the loss value to mathematically nudge the parameters P of the objective function to be more "correct"
Optimization Exercise

What are some strategies you used to optimize x?


Optimization: Random Search
• Potential Solution: Guess randomly each time
• Pros:
– Very simple
• Cons:
– Not very efficient
– The loss value is unused
– Potentially may never find a good solution
Optimization: Gradient Search
• Better Solution: Gradient based search
• Every function can be represented in space

Any loss function can also be represented in space

• Our goal is to minimize the loss, i.e. find a set of parameters P such that the loss is close to zero

[Plot: a loss surface with its minimum value marked]
Optimization: Gradient Search
Functions are just like terrain - they have mountains and valleys.

We want to minimize loss, i.e. go to the bottom of the terrain
Optimization: Gradient Search
Q: Imagine you are blindfolded on a mountain, how will you go to the bottom?

A: Sense the slope around you, and move in the direction where the slope points downwards
Optimization: Gradient Search
Concept of gradient == "your sense of slope" for the loss function

The gradient of a function is mathematically defined as the slope of the tangent, i.e. the slope at any given point on the function
Optimization: Gradient Search
Tangent at x = 1.5
Slope = 3.0

Tangent at x = 0.8
Slope = 1.6
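The slopes on this slide are consistent with the curve f(x) = x², whose slope is 2x (an assumption about the plotted function). A numerical check, as a sketch:

```python
def f(x):
    return x ** 2  # assumed curve from the slide

def numerical_slope(g, x, h=1e-6):
    # Central-difference approximation of the derivative at x
    return (g(x + h) - g(x - h)) / (2 * h)

print(round(numerical_slope(f, 1.5), 3))  # ~3.0
print(round(numerical_slope(f, 0.8), 3))  # ~1.6
```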
Optimization: Gradient Search
Once we know the direction, we can move towards
the minimum.

Are we done?
Optimization: Learning Rate
How far should we move?

The step size or learning rate defines how big a step we should take in the direction of the gradient.

It must be well controlled - too small a step and it may take a long time to reach the bottom; too big a step and we may miss the minimum altogether!
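A small sketch of this effect, minimizing the toy function f(x) = x² (gradient 2x); the specific learning rates are illustrative assumptions:

```python
def gradient_descent(lr, steps=50, x=5.0):
    # Repeatedly step against the gradient of f(x) = x**2
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(gradient_descent(0.01))  # too small: still far from 0 after 50 steps
print(gradient_descent(0.4))   # well chosen: essentially at the minimum
print(gradient_descent(1.1))   # too big: overshoots and diverges
```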
Optimization
Various optimization algorithms

Alec Radford (Reddit)


Local Minima

The optimization algorithm may get "stuck" at a local minimum of a function

[Plot: a curve with its global minimum and a local minimum marked]

Optimization: Demo

Lecture 1 - Learning Rate and Optimization Demo


Optimization
• How can we compute the slope of the function?
• Compute gradients analytically
• Backpropagation
Optimization
Let us compute the gradient of MSE analytically:

∂/∂fi Σj (fj - yj)² = 2 (fi - yi)

But what if the function was slightly more complicated?

Analytical gradients become much more complicated and tedious to compute!

Backpropagation to the rescue!


Backpropagation

Backpropagation is a technique to compute gradients of any function with respect to a variable using the concept of a computation graph

Backpropagation

Computation graph: a graphical way of describing any function
Backpropagation

Intuition
• Divide the loss function into small differentiable
steps
• Calculate the gradient of each small step and use
chain rule to calculate the gradient of your input
parameters
• Recommended reading: see supplementary
material of lecture 1
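A minimal sketch of the idea on the toy function L = (w·x + b - y)², broken into small differentiable steps (all values here are illustrative):

```python
# Forward pass through the computation graph:
# z = w*x ; s = z + b ; d = s - y ; L = d**2
x, y = 2.0, 1.0   # one training example
w, b = 0.5, 0.1   # current parameters

z = w * x
s = z + b
d = s - y
L = d ** 2

# Backward pass: local gradients combined via the chain rule
dL_dd = 2 * d        # L = d**2
dL_ds = dL_dd * 1.0  # d = s - y
dL_dw = dL_ds * x    # s = w*x + b, so ds/dw = x
dL_db = dL_ds * 1.0  # ds/db = 1

# Sanity check against a numerical gradient
h = 1e-6
num_dw = (((w + h) * x + b - y) ** 2 - ((w - h) * x + b - y) ** 2) / (2 * h)
print(dL_dw, num_dw)  # both ~0.4
```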
Optimization

To complete the picture, we can then use the gradients to update the parameters using gradient descent.

Recall: We want to take a "step" in the direction of the slope:

P ← P - η · ∇P L

where η is the step size, also called the learning rate.
Learning Lifecycle

[Diagram: input data → objective function (with parameters) → loss function → optimization, which updates the parameters]
Multiclass Classification
Let us now look at another complete example of
classification
Multiclass Classification
Recall that we have been using linear regression so
far and making decisions based on the sign of the
output

f(x, W, b) = 1 real number

Output: if the number is less than 0, it is accident prone; else it is not accident prone
Multiclass Classification
In general, we design our function f such that we
output one number per class:

f(x, W, b) = 2 numbers

Outputs: the scores for the two classes - accident prone and not accident prone
Multiclass Classification
From now on, we will use this generalized technique,
since it can be easily extended to more than two
classes

f(x, W, b) = M numbers

Outputs: M scores for the M classes
Multiclass Classification
From now on, we will use this generalized technique,
since it can be easily extended to more than two
classes

f(x, W, b) = M numbers (a vector)

Everything else remains the same - the loss function now operates on vectors instead of real numbers
Multiclass Classification

[Diagram: f = w·x + b - a weight vector times the feature vector gives one real number (Linear Regression)]

Multiclass Classification

[Diagram: f = W·x + b - a weight matrix times the feature vector gives a vector of scores (Multi-class Linear Classification)]
Multiclass Classification
Prediction

In regression: f(x, W, b) >= 0

In classification: argmax(f(x, W, b))

Pick the class with the highest score
Softmax function
Softmax
With the argmax function, our classifier has always
output some “scores”, and we just pick whichever
score is higher:

f(x, W, b) = 2 numbers

Outputs
The scores for the two
classes - accident
prone and not
accident prone
Softmax
However, these scores are not interpretable: their absolute values don't give us any insight; we can only compare them relative to each other.

f(x, W, b) = 2 numbers

Outputs
The scores for the two
classes - accident
prone and not
accident prone
Softmax
The softmax function helps us transform these
values into probability distributions:

[Diagram: scores from the classifier → softmax → scores as a probability distribution]
Softmax
The softmax function helps us transform these values into probability distributions:

softmax([-1.85, 0.42, 0.15]) = [0.06, 0.54, 0.40]

Each output can be treated as the probability of that class, and the outputs sum to one.

The Softmax function also acts as a normalizer, i.e. we can now compare scores from different models and examples directly
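A sketch of softmax in plain Python (subtracting the max before exponentiating is a standard numerical-stability trick, not something the slides require):

```python
import math

def softmax(scores):
    m = max(scores)                           # for numerical stability
    exps = [math.exp(s - m) for s in scores]  # exponentiate each score
    total = sum(exps)
    return [e / total for e in exps]          # normalize: outputs sum to 1

probs = softmax([-1.85, 0.42, 0.15])
print(probs)  # approximately [0.06, 0.54, 0.41], matching the slide up to rounding
```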
Cross Entropy Loss
Cross Entropy Loss
Recall MSE: Mean Squared Error

We saw that MSE is better than just taking the absolute difference.

In practice, we use Cross Entropy loss, which generally performs better for more complex models.
Cross Entropy Loss

CE = - Σi yi · log(fi)

Here, y represents the true probability distribution (so yi = 1 for the correct class i, and 0 otherwise)

fi represents the score of class i from our classifier
Cross Entropy Loss

Simplifying for our case: if c is the correct class, then yc = 1 and all other yi's are 0. Therefore, we only have one element left from the summation:

CE = - log(fc)
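A sketch of this collapsed form (the example scores are taken from the voting example later in the lecture):

```python
import math

def cross_entropy(probs, correct_class):
    # With a one-hot y, the sum collapses to -log of the probability
    # the model assigns to the correct class
    return -math.log(probs[correct_class])

# Model 1's scores for Person1 (correct class: Democrat, index 2)
print(round(cross_entropy([0.3, 0.3, 0.4], 2), 2))  # 0.92
```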
Cross Entropy Loss

[Plot: Mean Squared Error vs Cross Entropy loss curves]
Cross Entropy Loss
Why cross entropy?
Consider three people, Person1 is a Democrat, Person2 is a
Republican and Person3 is Other. We have two models to
classify these people:
Model 1:
          SOther  SRepublican  SDemocrat
Person1    0.3       0.3          0.4
Person2    0.3       0.4          0.3
Person3    0.1       0.2          0.7

Model 2:
          SOther  SRepublican  SDemocrat
Person1    0.1       0.2          0.7
Person2    0.1       0.7          0.2
Person3    0.3       0.4          0.3

https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/
Cross Entropy Loss
(Model 1 and Model 2 score tables as above)

Both models misclassify Person3, but is one model better than the other?
Cross Entropy Loss
(Model 1 and Model 2 score tables as above)

Model 2 is better, since it classifies Person1 and Person2 with higher scores on the correct class, and misclassifies Person3 with a smaller error in the scores
Cross Entropy Loss
Mean Squared Error:
Model 1: Person1 = 0.54, Person2 = 0.54, Person3 = 1.34; Average = 0.81
Model 2: Person1 = 0.14, Person2 = 0.14, Person3 = 0.74; Average = 0.34
Cross Entropy Loss
Cross Entropy:
Model 1: Person1 = -log(0.4) = 0.92, Person2 = -log(0.4) = 0.92, Person3 = -log(0.1) = 2.30; Average = 1.38
Model 2: Person1 = -log(0.7) = 0.36, Person2 = -log(0.7) = 0.36, Person3 = -log(0.3) = 1.20; Average = 0.64
Cross Entropy Loss
Mean Squared Error: Model 1 Average = 0.81, Model 2 Average = 0.34

Cross Entropy: Model 1 Average = 1.38, Model 2 Average = 0.64


Cross Entropy Loss
Mean Squared Error: Model 1 Average = 0.81, Model 2 Average = 0.34
Cross Entropy: Model 1 Average = 1.38, Model 2 Average = 0.64

The Cross Entropy loss difference between the two models is greater than the Mean Squared Error difference!
Cross Entropy Loss
In general, Cross Entropy penalizes confident incorrect predictions much more than Mean Squared Error
Cross Entropy Loss
A more principled reason arises from the underlying mathematics of MSE and Cross Entropy:

MSE causes the gradients to become very small as the network scores become better, so learning slows down!
Cross Entropy and Softmax
Cross Entropy and Softmax
Cross Entropy is mathematically defined to compare two probability distributions.

Our ground truth is already represented as a probability distribution (with all the probability mass on the correct class):

y = [0.00, 1.00, 0.00]

However, the scores directly from a linear classifier do not form any such distribution:

f = [-1.85, 0.42, 0.15]

Solution: Use softmax!

softmax(f) = [0.06, 0.54, 0.40]
Putting it all together
[Pipeline: Linear Classifier → Softmax → Cross Entropy Loss]
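As a closing sketch, the whole pipeline trained on the toy points from the binary-classification exercise. This is plain Python for illustration only (the practical uses Keras instead); the learning rate and epoch count are arbitrary choices, and the gradient used is the standard softmax-plus-cross-entropy gradient (probability minus one-hot label):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: (x0, x1) -> class (1 = positive, 0 = negative)
data = [((2, 0), 1), ((5, -2), 1), ((-2, 2), 0), ((-1, -3), 0)]

# Linear classifier: one weight pair and one bias per class
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
lr = 0.1

for epoch in range(500):
    for (x0, x1), label in data:
        scores = [W[c][0] * x0 + W[c][1] * x1 + b[c] for c in range(2)]
        probs = softmax(scores)
        # Gradient of cross entropy after softmax w.r.t. score c: p_c - y_c
        for c in range(2):
            grad = probs[c] - (1.0 if c == label else 0.0)
            W[c][0] -= lr * grad * x0
            W[c][1] -= lr * grad * x1
            b[c] -= lr * grad

def predict(x0, x1):
    scores = [W[c][0] * x0 + W[c][1] * x1 + b[c] for c in range(2)]
    return max(range(2), key=scores.__getitem__)  # argmax

print([predict(x0, x1) for (x0, x1), _ in data])
```

After training, the model should classify all four points correctly, mirroring the hand-picked weights from the exercise.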
Summary
• Classification
• Objective function
• Loss function
– sum of absolute differences
– mean squared error
• Optimization
– random search
– gradient search
– backpropagation
Lecture 1 Supplementary
• Backpropagation
• Bias
• Parameter Initialization
• Regularization
Lecture 1 Practical
• Setup
• Jupyter Notebooks
• Introduction to Python & Numpy
• Linear Algebra in five minutes
• Data Representation
• Linear classifier in Keras
