Machine Learning Overview
What is Machine Learning (ML)?
[Diagram: Artificial Intelligence encompasses several kinds of reasoning; Machine Learning, and within it Deep Learning, corresponds to inductive reasoning]
⬣ Deductive reasoning: deduce specific conclusions from general rules (search from a rule model). Example tools: JESS, CLIPS, Drools, Esper, CEP engines, Spark Streams.
⬣ Inductive reasoning: reason from specific examples to general rules or a model. Example tools: Scikit Learn, TensorFlow, Rapid Miner, Spark MLlib.
⬣ Abductive reasoning: inference to the best explanation/hypothesis for a set of observations.
Applications:
⬣ Image classification: at inference (testing) time, an input image goes through the model to produce class scores
⬣ Stock market data: input → model → prediction
⬣ Robotics: a combination of perception and decision-making/controls; the model outputs a probability distribution over actions {left, right, up, down}
Supervised Learning and Parametric Models

Supervised Learning
⬣ Dataset: X = {x_1, x_2, …, x_N}, where x ∈ ℝ^d (examples), and Y = {y_1, y_2, …, y_N} (labels)
⬣ Train input: X, Y
⬣ Learning output: f : X → Y, e.g. P(y|x)

Unsupervised Learning
⬣ Input: X (examples only, no labels)
⬣ Learning output: P(x)
⬣ Example: clustering, density estimation, etc.
Parametric – Linear Classifier
Parametric Model

Supervised learning with a parametric model:
⬣ Data: an image is fed to the model, which outputs class scores (e.g. car, coffee cup, bird)
⬣ Model: f(x, W) = Wx + b
⬣ Pipeline: input → model → class scores → optimizer

Simple Function
⬣ Classifier result (output) = weights · input + bias
⬣ x: input (vector); w: weights (scalar or vector); b: bias (scalar)
⬣ (Note: if w and x are column vectors we often show this as wᵀx)
Image adapted from: https://en.wikipedia.org/wiki/Linear_equation#/media/File:Linear_Function_Graph.svg
Linear Classification and Regression

Simple linear classifier:
⬣ Calculate score: f(x, w) = w · x + b
⬣ Binary classification rule (w is a vector): y = 1 if f(x, w) ≥ 0, and y = 0 otherwise
⬣ For a multi-class classifier, take the class with the highest (max) score: f(x, W) = Wx + b (see the sketch after this list)
⬣ Idea: separate classes (e.g. car vs. bird) with high-dimensional linear separators (hyperplanes)
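To make the decision rules concrete, here is a minimal NumPy sketch (the function names and toy numbers are ours, not from the slides):

```python
import numpy as np

def binary_predict(x, w, b):
    """Binary rule: y = 1 if f(x, w) = w . x + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

def multiclass_predict(x, W, b):
    """Multi-class rule: compute scores f(x, W) = Wx + b and take the argmax."""
    scores = W @ x + b
    return int(np.argmax(scores))

# toy usage: a 4-dimensional input, 3 classes
x = np.array([1.0, 2.0, -0.5, 0.3])
w, b_scalar = np.array([0.5, -0.2, 0.1, 0.0]), -0.1
W, b_vec = np.random.randn(3, 4), np.zeros(3)
print(binary_predict(x, w, b_scalar), multiclass_predict(x, W, b_vec))
```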
Input Dimensionality

Model: f(x, W) = Wx + b, with weights W, input x, and bias b.
(Note that in practice, implementations can use xW instead, assuming a different shape for W. That is just a different convention and is equivalent.)

Weights and the bias trick:
⬣ We can move the bias term into the weight matrix by appending b as an extra column of W, and a "1" at the end of the input x:
  rows of the augmented W are [w_i1  w_i2  ⋯  w_id  b_i], and the augmented input is [x_1, x_2, …, x_d, 1]ᵀ
⬣ This results in one matrix-vector multiplication (see the sketch below)!
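A quick sketch of the bias trick in NumPy (shapes are illustrative); it checks that the augmented form gives exactly the same class scores:

```python
import numpy as np

c, d = 3, 4                           # illustrative: 3 classes, 4-dimensional input
W = np.random.randn(c, d)
b = np.random.randn(c)
x = np.random.randn(d)

W_aug = np.hstack([W, b[:, None]])    # append b as an extra column -> shape (c, d + 1)
x_aug = np.append(x, 1.0)             # append a "1" to the input    -> shape (d + 1,)

assert np.allclose(W @ x + b, W_aug @ x_aug)  # one matrix-vector multiplication
```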
Example with an image with 4 pixels and 3 classes (cat/dog/ship):
⬣ Stretch the pixels into a column: x = [56, 231, 24, 2]
⬣ W = [0.2  −0.5  0.1  2.0; 1.5  1.3  2.1  0.0; 0.0  0.25  0.2  −0.3], b = [1.1, 3.2, −1.2]
⬣ Wx + b = [−96.8, 437.9, 60.75] → cat score, dog score, ship score (checked in code below)
Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n
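To make the arithmetic concrete, the same computation in NumPy (numbers copied from the example above):

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1,  2.0],    # cat row
              [1.5,  1.3, 2.1,  0.0],    # dog row
              [0.0, 0.25, 0.2, -0.3]])   # ship row
b = np.array([1.1, 3.2, -1.2])
x = np.array([56.0, 231.0, 24.0, 2.0])   # pixels stretched into a column

print(W @ x + b)   # [-96.8, 437.9, 60.75] -> cat, dog, ship scores
```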
Example: Visual Viewpoint
⬣ With f(x, W) = Wx + b, we can convert each row of the weight matrix back into the shape of the image and visualize it as a template for its class
⬣ Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n
Performance Measure

The class scores from the model f(x, W) = Wx + b feed into a loss, which the optimizer uses to improve the model.

Multiclass SVM loss:
⬣ Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, the SVM loss has the form
  L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
  where s_{y_i} is the score for the correct class, the s_j are the scores for the other classes, and 1 is the margin (delta)
⬣ Example scores for three images: cat 3.2 / 1.3 / 2.2, car 5.1 / 4.9 / 2.5, frog −1.7 / 2.0 / −3.1
⬣ For the first image (correct class: cat), the loss is (implemented in the sketch below):
  L_i = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1)
      = max(0, 2.9) + max(0, −3.9)
      = 2.9 + 0
      = 2.9
Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n
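The same loss in code, as a small sketch (the function name is ours; the scores are the first column above):

```python
import numpy as np

def multiclass_svm_loss(scores, correct_class, margin=1.0):
    """L_i = sum over j != y_i of max(0, s_j - s_{y_i} + margin)."""
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0.0      # the correct class does not contribute
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])   # cat, car, frog scores for the first image
print(multiclass_svm_loss(scores, correct_class=0))   # 2.9
```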
Regression Example
⬣ L1 loss: L_i = |y − Wx_i|
⬣ L2 loss: L_i = |y − Wx_i|²
⬣ For probabilities: convert scores with the softmax function, e^{s_k} / Σ_j e^{s_j}, and use a logistic loss
Source: https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/loss3.jpg
Regularization
⬣ Often, we add a regularization term to the loss function, e.g. with L1 regularization: L_i = |y − Wx_i|² + |W|
⬣ Example regularizations: L1/L2 on the weights (encourage small values)
Gradient Descent

Recall the components of the learning setup:
⬣ Input (and representation)
⬣ Functional form of the model, including its parameters, e.g. f(x, W) = Wx + b producing class scores (car, coffee cup, bird)
⬣ Performance measure to improve: a loss or objective function
⬣ Algorithm for finding the best parameters: an optimization algorithm (the optimizer)
Optimization
⬣ As the weights change, the loss changes as well
⬣ This is often somewhat smooth locally, so small changes in the weights produce small changes in the loss (loss surfaces)
Derivatives

Strategy: follow the slope!
⬣ We can find the steepest descent direction by computing the derivative (gradient):
  f′(a) = lim_{h→0} ( f(a + h) − f(a) ) / h
⬣ The steepest descent direction is the negative gradient
⬣ Intuitively, the derivative measures how the function changes as the argument a changes by a small step size Δx, as the step size goes to zero
⬣ In machine learning, we want to know how the loss function changes as the weights are varied
⬣ We can consider each parameter separately by taking the partial derivative of the loss function with respect to that parameter
Image and equation from: https://en.wikipedia.org/wiki/Derivative#/media/File:Tangent_animation.gif
Gradient Descent

This idea can be turned into an algorithm (gradient descent), as sketched below:
⬣ Choose a model: f(x, W) = Wx
⬣ Update the parameters: w_j = w_j − ∂L/∂w_j
⬣ Add a learning rate to prevent too big of a step: w_j = w_j − α ∂L/∂w_j
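A minimal sketch of the resulting loop, assuming a linear model with L2 loss so that the gradient has the simple analytic form derived later in this section (∂L/∂w_j = −2 Σ_i δ_i x_ij); the function name and toy data are ours:

```python
import numpy as np

def gradient_descent(X, y, lr=0.005, steps=500):
    """Full-batch gradient descent for f(x, w) = w . x with L = sum_i (y_i - w . x_i)^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        delta = y - X @ w             # residuals: delta_i = y_i - w . x_i
        grad = -2.0 * X.T @ delta     # dL/dw_j = -2 * sum_i delta_i * x_ij
        w = w - lr * grad             # step scaled by the learning rate
    return w

# toy usage: recover w_true from noiseless data
X = np.random.randn(100, 2)
w_true = np.array([2.0, -1.0])
print(gradient_descent(X, y=X @ w_true))   # approximately [2.0, -1.0]
```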
Often, we only compute the gradients across a small subset of the data:
⬣ Full-batch gradient descent: L = (1/N) Σ_i L(f(x_i, W), y_i)
⬣ Mini-batch gradient descent: L = (1/M) Σ_i L(f(x_i, W), y_i), where the sum runs over a subset (mini-batch) of the data of size M

Computing Gradients

There are several ways to compute ∂L/∂w_i:
⬣ Manual differentiation
⬣ Symbolic differentiation
⬣ Numerical differentiation (e.g. finite differences; a small check follows this list)
⬣ Automatic differentiation
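As an example of numerical differentiation, finite differences are a common way to sanity-check an analytic gradient; a small sketch (our names), reusing the L2-loss gradient from the loop above:

```python
import numpy as np

def loss(w, X, y):
    return np.sum((y - X @ w) ** 2)

def analytic_grad(w, X, y):
    return -2.0 * X.T @ (y - X @ w)

def numerical_grad(w, X, y, eps=1e-6):
    grad = np.zeros_like(w)
    for j in range(w.size):           # perturb one parameter at a time
        w_step = w.copy()
        w_step[j] += eps
        grad[j] = (loss(w_step, X, y) - loss(w, X, y)) / eps
    return grad

X, y, w = np.random.randn(20, 3), np.random.randn(20), np.random.randn(3)
print(np.allclose(analytic_grad(w, X, y), numerical_grad(w, X, y), atol=1e-3))   # True
```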
Manual Differentiation

For some functions, we can analytically derive the partial derivative.

Example: derivation of the update rule for a linear model with L2 loss.

Function and loss:  L = Σ_{i=1}^{N} (y_i − wᵀx_i)²

So what is ∂L/∂w_j?

  ∂L/∂w_j = ∂/∂w_j Σ_{i=1}^{N} (y_i − wᵀx_i)²
          = Σ_{i=1}^{N} −2 (y_i − wᵀx_i) x_ij
          = −2 Σ_{i=1}^{N} δ_i x_ij,   where δ_i = y_i − wᵀx_i

This gives the update rule:  w_j ← w_j + 2η Σ_{i=1}^{N} δ_i x_ij
If we add a non-linearity (sigmoid), the derivation is more complex:

  σ(x) = 1 / (1 + e^{−x})

First, one can derive that σ′(x) = σ(x)(1 − σ(x)).

  f(x) = σ( Σ_k w_k x_k )

  L = Σ_i ( y_i − σ( Σ_k w_k x_ik ) )²

  ∂L/∂w_j = Σ_i 2 ( y_i − σ( Σ_k w_k x_ik ) ) · ( −∂/∂w_j σ( Σ_k w_k x_ik ) )
          = Σ_i −2 ( y_i − σ( Σ_k w_k x_ik ) ) σ′( Σ_k w_k x_ik ) ∂/∂w_j ( Σ_k w_k x_ik )
          = Σ_i −2 δ_i σ(d_i)(1 − σ(d_i)) x_ij

  where δ_i = y_i − f(x_i) and d_i = Σ_k w_k x_ik

This gives the sigmoid perceptron update rule (sketched in code below):

  w_j ← w_j + 2η Σ_{i=1}^{N} δ_i σ_i (1 − σ_i) x_ij,   where σ_i = σ( Σ_{k=1}^{m} w_k x_ik ) and δ_i = y_i − σ_i
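A small sketch of this update for a single sigmoid unit (names are ours); it also checks the identity σ′(x) = σ(x)(1 − σ(x)) with a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# check sigma'(x) = sigma(x) * (1 - sigma(x)) numerically
x, eps = 0.7, 1e-6
assert np.isclose((sigmoid(x + eps) - sigmoid(x)) / eps,
                  sigmoid(x) * (1 - sigmoid(x)), atol=1e-4)

def sigmoid_update(w, X, y, lr):
    """One step of w_j <- w_j + 2*lr * sum_i delta_i * s_i * (1 - s_i) * x_ij."""
    s = sigmoid(X @ w)                        # s_i = sigma(sum_k w_k x_ik)
    delta = y - s                             # delta_i = y_i - s_i
    return w + 2.0 * lr * X.T @ (delta * s * (1 - s))

w = sigmoid_update(np.zeros(3), np.random.randn(10, 3), np.random.rand(10), lr=0.1)
print(w)
```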
Decomposing a Function

Given a library of simple functions such as sin(x), cos(x), log(x), exp(x), and x³, we can compose them into a complicated function, e.g.

  −log( 1 / (1 + e^{−w·x}) )

which decomposes into a chain of simple pieces (illustrated below):

  w·x → u → 1 / (1 + e^{−u}) → p → −log(p) → L

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun
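A tiny sketch of the same decomposition in code (values are ours), keeping the intermediate results u and p:

```python
import numpy as np

def forward(w, x):
    """Evaluate L = -log(1 / (1 + exp(-w.x))) as a chain of simple functions."""
    u = np.dot(w, x)                  # u = w . x
    p = 1.0 / (1.0 + np.exp(-u))      # p = 1 / (1 + e^{-u})
    L = -np.log(p)                    # L = -log(p)
    return u, p, L

u, p, L = forward(np.array([0.5, -1.0]), np.array([2.0, 1.0]))
print(u, p, L)   # u = 0.0, p = 0.5, L = log(2), approximately 0.693
```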
Linear Algebra View: Vector and Matrix Sizes

With the bias folded into the weight matrix, the classifier is a single matrix-vector product: each row of W is [w_i1  w_i2  ⋯  w_id  b_i], and the input is x = [x_1, x_2, …, x_d, 1]ᵀ.

Sizes: W is [c × (d + 1)] and x is [(d + 1) × 1],
where c is the number of classes and d is the dimensionality of the input.
Dimensionality of Derivatives

Conventions:
⬣ What is the size of ∂v1/∂v2, the derivative of one vector with respect to another? A matrix whose entry at row i, column j is ∂v1_i / ∂v2_j.
⬣ This matrix of partial derivatives is called a Jacobian (a finite-difference illustration follows).
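To see the convention in action, here is a finite-difference sketch (our helper) that builds the Jacobian of a simple vector-valued function; for a linear map f(v) = Wv the Jacobian is just W:

```python
import numpy as np

def jacobian_fd(f, v, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m at v; entry [i, j] is df_i / dv_j."""
    f0 = f(v)
    J = np.zeros((f0.size, v.size))
    for j in range(v.size):
        v_step = v.copy()
        v_step[j] += eps
        J[:, j] = (f(v_step) - f0) / eps
    return J

W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(jacobian_fd(lambda v: W @ v, np.zeros(3)))   # approximately W, with shape (2, 3)
```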
⬣ What is the size of ∂s/∂M, the derivative of a scalar with respect to a matrix? A matrix of the same size as M, whose entry at position [i, j] is ∂s/∂m[i, j].
⬣ What is the size of ∂L/∂W? Since the loss L is a scalar, it is a matrix of the same size as W: each entry is the partial derivative of the loss with respect to that weight.
Jacobians of Batches

Batches of data are matrices or tensors (multi-dimensional matrices). Examples:
⬣ Each instance is a vector of size m; our batch is of size [B × m]
⬣ Each instance is a matrix (e.g. a grayscale image) of size W × H; our batch is [B × W × H]
⬣ Each instance is a multi-channel matrix (e.g. a color image with R, G, B channels) of size C × W × H; our batch is [B × C × W × H]

Jacobians become tensors, which is complicated:
⬣ Instead, flatten the input to a vector and get a vector of derivatives (see the sketch below)
⬣ This can also be done for partial derivatives between two vectors, two matrices, or two tensors
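A short NumPy sketch of the flattening step (the batch sizes are illustrative):

```python
import numpy as np

B, C, W, H = 8, 3, 32, 32              # illustrative batch of color images
batch = np.random.rand(B, C, W, H)     # shape [B x C x W x H]

flat = batch.reshape(B, -1)            # flatten each instance to a vector
print(flat.shape)                      # (8, 3072): one row of length C*W*H per instance
```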
How is Deep Learning Different? Hierarchical Compositionality

So what is deep (machine) learning?
⬣ Representation learning
⬣ Neural networks
⬣ Deep unsupervised / reinforcement / structured / <insert-qualifier-here> learning
⬣ Simply: deep learning

Its key ingredients:
⬣ (Hierarchical) Compositionality: a cascade of non-linear transformations; multiple layers of representations
⬣ End-to-End Learning: learning (goal-driven) representations; learning the feature extraction
⬣ Distributed Representations: no single neuron "encodes" everything; groups of neurons work together
Hierarchical Compositionality

Given a library of simple functions such as sin(x), cos(x), log(x), exp(x), and x³, how do we compose them into a complicated function?

Idea 1: Linear combinations
⬣ Boosting
⬣ Kernels
⬣ …
⬣ f(x) = Σ_i α_i g_i(x)

Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]
Feature Engineering

The traditional pipeline combines fixed, hand-crafted features (e.g. HoG, Textons for vision; Bag-of-words for NLP) with a learned classifier:
⬣ VISION: SIFT/HOG (fixed) → K-Means / pooling (unsupervised) → classifier (supervised) → "car"
⬣ SPEECH: MFCC (fixed) → Mixture of Gaussians (unsupervised) → classifier (supervised) → \ˈd ē p\
⬣ NLP: "This burrito place is yummy and fun!" → parse tree, syntactic features (fixed) → n-grams (unsupervised) → classifier (supervised) → "+"
Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun
Deep models replace the fixed stages with a cascade of learned modules:

  Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier

so the whole pipeline goes from fixed to learned.
Distributed Representations
[Figure (panels a, b): simple features such as vertical, horizontal, and elliptical patterns combine to represent shapes like rectangles and ellipses, i.e. Distributed = V + H + E; cells matching none of the features show "no pattern".]