Lecture 1 - Machine Learning
Machine Learning
Lecture # 1
Training data: a table of cars with their features and an “Accident Prone?” label (e.g. Car 2: No, Car 3: No, Car 6: No)
Training data → Algorithms → Machine Learning “Model”
f( ) = No
f( ) = Yes
Machine Learning Model
Example of a model: a function that takes a word
and predicts its part-of-speech tag
f(“car”) = Noun
f(“beautiful”) = Adjective
f(“she”) = Pronoun
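A minimal sketch of such a model in Python (the word-to-tag lookup below is hand-written for illustration; a learned model would instead generalize from training data):

```python
# A toy "model": a function mapping a word to a part-of-speech tag.
# The mapping here is a made-up lookup table, not a learned function.
def f(word):
    tags = {"car": "Noun", "beautiful": "Adjective", "she": "Pronoun"}
    return tags.get(word, "Unknown")

print(f("car"))        # Noun
print(f("beautiful"))  # Adjective
print(f("she"))        # Pronoun
```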
Machine Learning Model
0.3 x car.maxspeed
+ 0.2 x car.acceleration
+ 0.0 x car.color
+ 0.5 x car.age

An older car is more likely to be accident prone (hence the 0.5 weight on car.age).
The color of a car has no impact on accidents (hence the 0.0 weight on car.color).
These coefficients are the model's feature weights.
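As a sketch of how such a model scores a car (the car's attribute values below are invented for illustration; in practice they come from the data):

```python
# Feature weights from the slide
weights = {"maxspeed": 0.3, "acceleration": 0.2, "color": 0.0, "age": 0.5}

# A hypothetical car, represented by its feature values
car = {"maxspeed": 1.2, "acceleration": 0.8, "color": 0.3, "age": 2.0}

# The model's score is the weighted sum of the features
score = sum(weights[name] * value for name, value in car.items())
print(score)  # 0.3*1.2 + 0.2*0.8 + 0.0*0.3 + 0.5*2.0 = 1.52
```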
Machine Learning
Training data → Algorithms → Machine Learning “Model”
Supervised Learning: Classification
Process of assigning objects to categories
x0   x1   Class
 2    0   Positive
 5   -2   Positive
-2    2   Negative
-1   -3   Negative
Binary Classification Exercise
• Choose w1, w2, and b such that positive examples give a result > 0 and negative examples give a result < 0
  (the same form as the car model above: a weighted sum of the features, w1 x x0 + w2 x x1 + b)

x0   x1   Class
 2    0   Positive
 5   -2   Positive
-2    2   Negative
-1   -3   Negative
Binary Classification Exercise
Find weights that separate the positive examples from the negative examples.
[Figure: the four points plotted in the (x0, x1) plane]
Binary Classification Exercise
• Potential solution: w1 = 3, w2 = 1 and b = 3
[Figure: the solution drawn as a line in the (x0, x1) plane; the points (-3, 0), (0, -1), and (3, -2) are marked on the plot]
Binary Classification
What did we do?
- We were given a set of points in space
- We tried to draw a line to separate the
“positive” points from the “negative” points
- The line was defined using “feature weights”
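A quick check of the proposed weights against the exercise data (a sketch; the points and labels are the ones from the table above):

```python
# Exercise points (x0, x1) with their classes
points = [((2, 0), "Positive"), ((5, -2), "Positive"),
          ((-2, 2), "Negative"), ((-1, -3), "Negative")]

# Proposed solution from the slide
w1, w2, b = 3, 1, 3

for (x0, x1), label in points:
    score = w1 * x0 + w2 * x1 + b
    predicted = "Positive" if score > 0 else "Negative"
    print(x0, x1, score, predicted == label)  # all True: the line separates the classes
```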
Vector Spaces
[Figure: cars plotted as points in a vector space with axes Acceleration, Age, and Speed]
[Figure: words plotted in a vector space, clustering by part of speech: NOUN (Wall, Table, Laptop, Cat), PRON (they, he, she, we, I), ADJECTIVE (sharp, beautiful, fast)]
Lifecycle of Training a Machine Learning Model

Learning Lifecycle
Input data → Objective Function (with Parameters) → Loss Function → Optimization → updated Parameters
Objective Function
f(x, W, b) = N numbers
(x is the input; W and b are the parameters P)
For the two-class accident-prone car example: f(x, W, b) = 2 numbers
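A sketch of such an objective function with NumPy (the input vector and the parameter values are made up for illustration; here N = 2, one score per class):

```python
import numpy as np

def f(x, W, b):
    """Objective function: maps an input vector x to one score per class."""
    return W @ x + b

x = np.array([1.2, 0.8, 0.3, 2.0])   # input features (e.g. a car)
W = np.random.randn(2, 4) * 0.1      # parameters: a 2x4 weight matrix
b = np.zeros(2)                      # parameters: one bias per class

print(f(x, W, b))                    # 2 numbers, one score per class
```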
Loss Function
Given a set of parameters P = {P1, P2, …}, how do you know which one to use?
P1 = {w1: 3, w2: 1, b: 3}
P2 = {w1: -2, w2: -1, b: -1}
P3 = {w1: -3, w2: -1, b: 3}
[Figure: the three candidate lines P1, P2, P3 drawn in the (x0, x1) plane]
We use the concept of loss.
A loss function takes in the output of our model, compares it to the true value, and gives us a measure of how "far off" our model is.
Loss Function
A loss function is any function that gives a measure of how far your scores are from their true values.
[Table: the scores produced under P1, P2, and P3 for each example, one column per class]
Loss Function
A potential loss function in this case is the sum of the absolute differences of the scores:
L(x, P1) = sum( |f(x, P1) - [1.0, 0.0]| )
         = sum([ |-0.5|, |0.5| ]) = 1
Average loss over the training data:
L(P1) = 0.26    L(P2) = 0.18    L(P3) = 1.62
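A sketch of this loss in NumPy (the scores and the one-hot true values are the example numbers from the slide):

```python
import numpy as np

def l1_loss(scores, target):
    """Sum of absolute differences between predicted scores and true values."""
    return np.sum(np.abs(scores - target))

scores = np.array([0.5, 0.5])   # example model output for one input
target = np.array([1.0, 0.0])   # true values: class 0 is the correct class
print(l1_loss(scores, target))  # |0.5 - 1.0| + |0.5 - 0.0| = 1.0
```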
Optimization
Now that we have a way of defining loss, we need
a way to use it to improve our parameters
Optimization Exercise
x + 5 = ?
[Figure: successive guesses for x shown on a number line: 45, 15, 58, 25, 55, 20, 50, 48, 47, 46, 45]
Optimization algorithms use the loss value to mathematically nudge the parameters P of the objective function to be more "correct".
Optimization Exercise
[Figure: the loss plotted as a curve, with the tangent at x = 0.8; its slope is 1.6]
Optimization: Gradient Search
Once we know the direction, we can move towards
the minimum.
Are we done?
Optimization: Learning Rate
How far should we move?
Intuition
• Divide the loss function into small differentiable
steps
• Calculate the gradient of each small step and use the chain rule to calculate the gradient of your input parameters
• Recommended reading: see supplementary
material of lecture 1
Optimization
The learning rate controls the step size of each update.
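A minimal gradient-descent sketch (the loss L(x) = (x - 3)^2 and the learning rate 0.1 are illustrative choices, not values from the slides):

```python
# Gradient descent on a simple one-parameter loss, L(x) = (x - 3)^2
def grad(x):
    return 2 * (x - 3)          # derivative of the loss

x = 0.0                         # initial parameter
learning_rate = 0.1             # controls the step size

for step in range(50):
    x = x - learning_rate * grad(x)   # nudge x against the gradient

print(x)   # close to 3.0, the minimum of the loss
```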
Learning Lifecycle
Input data → Objective Function (with Parameters) → Loss Function → Optimization → updated Parameters
Multiclass Classification
Let us now look at another complete example of
classification
Multiclass Classification
Recall that we have been using linear regression so
far and making decisions based on the sign of the
output
f(x, W, b) = 1 real number
Output: if the number is less than 0, the car is accident prone; otherwise it is not accident prone.
Multiclass Classification
In general, we design our function f such that we
output one number per class:
f(x, W, b) = 2 numbers
Outputs: the scores for the two classes, accident prone and not accident prone.
Multiclass Classification
From now on, we will use this generalized technique, since it easily extends to more than two classes:
f(x, W, b) = M numbers (a vector)
Outputs: M scores for the M classes.
Linear Regression vs. Multiclass Classification
Linear regression: f(x, W, b) is a single real number, and we decide based on its sign, i.e. whether f(x, W, b) >= 0.
Multiclass classification: f(x, W, b) is a vector of scores, computed with a weight matrix W and the input vector x, and we decide with argmax(f(x, W, b)).
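A sketch of this classification rule (W, b, and x below are made-up values; W is a matrix with one row of weights per class):

```python
import numpy as np

W = np.array([[0.3, 0.2, 0.0, 0.5],      # row 0: weights for "accident prone"
              [-0.1, 0.4, 0.0, -0.3]])   # row 1: weights for "not accident prone"
b = np.array([0.1, -0.1])
x = np.array([1.2, 0.8, 0.3, 2.0])       # input feature vector

scores = W @ x + b                       # one score per class (a vector)
predicted_class = np.argmax(scores)      # classification: pick the highest score
print(scores, predicted_class)
```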
Softmax
However, these scores are not interpretable: their absolute values don't give us any insight; we can only compare them relative to each other.
f(x, W, b) = 2 numbers
Outputs: the scores for the two classes, accident prone and not accident prone.
Softmax
The softmax function helps us transform these values into a probability distribution: each output can be treated as the probability of that class, and the outputs sum to one.
softmax([-1.85, 0.42, 0.15]) = [0.06, 0.54, 0.40]
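A sketch of softmax in NumPy, using the scores from the slide:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into a probability distribution (non-negative, sums to 1)."""
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / np.sum(exps)

scores = np.array([-1.85, 0.42, 0.15])
print(softmax(scores))          # approximately [0.06, 0.54, 0.40]
print(softmax(scores).sum())    # 1.0
```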
Cross Entropy
f = [-1.85, 0.42, 0.15]
softmax(f) = [0.06, 0.54, 0.40]

Cross Entropy and Softmax
Cross entropy is mathematically defined to compare two probability distributions.
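A sketch of the cross-entropy loss between the softmax output and a one-hot true distribution (taking class 1, the one with probability 0.54, as the true class is an assumption for illustration):

```python
import numpy as np

def cross_entropy(predicted, target):
    """Cross entropy between a true distribution and a predicted distribution."""
    return -np.sum(target * np.log(predicted))

predicted = np.array([0.06, 0.54, 0.40])   # softmax(f) from the slide
target = np.array([0.0, 1.0, 0.0])         # assumed one-hot true distribution
print(cross_entropy(predicted, target))    # -log(0.54) ≈ 0.62
```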
Putting it all together
Linear Classifier → Softmax → Cross Entropy
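A compact end-to-end sketch of this pipeline (linear classifier, softmax, cross entropy) trained with gradient descent; the data, shapes, learning rate, and step count are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: 4 points in 2D with class indices (as in the exercise)
X = np.array([[2., 0.], [5., -2.], [-2., 2.], [-1., -3.]])
y = np.array([0, 0, 1, 1])

W = rng.normal(scale=0.1, size=(2, 2))   # one row of weights per class
b = np.zeros(2)
lr = 0.1                                 # learning rate

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for step in range(200):
    scores = X @ W.T + b                                    # linear classifier
    probs = softmax(scores)                                 # softmax
    loss = -np.mean(np.log(probs[np.arange(len(y)), y]))    # cross entropy

    # Gradient of the loss with respect to the scores, then W and b
    dscores = probs.copy()
    dscores[np.arange(len(y)), y] -= 1
    dscores /= len(y)
    W -= lr * (dscores.T @ X)            # gradient descent update
    b -= lr * dscores.sum(axis=0)

print(loss, np.argmax(X @ W.T + b, axis=1))  # small loss; predictions should match y
```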
Summary
• Classification
• Objective function
• Loss function
– sum of absolute differences
– mean squared error
• Optimization
– random search
– gradient search
– backpropagation
Lecture 1 Supplementary
• Backpropagation
• Bias
• Parameter Initialization
• Regularization
Lecture 1 Practical
• Setup
• Jupyter Notebooks
• Introduction to Python & Numpy
• Linear Algebra in five minutes
• Data Representation
• Linear classifier in Keras