Machine Learning (ML) is the study of algorithms that improve their performance on tasks based on experience, typically through data. It differs from traditional programming by learning from data inputs and producing outputs through trained models. ML encompasses various types including supervised, unsupervised, and reinforcement learning, and has applications in fields such as image classification, natural language processing, and robotics.

Machine Learning Overview

What is Machine Learning (ML)?
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Tom Mitchell (Machine Learning, 1997)


Machine Learning is the study of algorithms that:
⬣ Improve their performance
⬣ On some task(s)
⬣ Based on experience (typically data)
How is it Different than Programming?

Programming: Input → Algorithm → Output

Machine Learning (Training): Data + Labels → Algorithm → Model
Machine Learning (Inference/Testing): Input → Model → Output

Machine learning thrives when it is difficult to design an algorithm to perform the task.
Applications:

Some tasks are easy to specify as an algorithm, e.g. sorting:

algorithm quicksort(A, lo, hi) is
    if lo < hi then
        p := partition(A, lo, hi)
        quicksort(A, lo, p - 1)
        quicksort(A, p + 1, hi)

algorithm partition(A, lo, hi) is
    pivot := A[hi]
    i := lo
    for j := lo to hi - 1 do
        if A[j] < pivot then
            swap A[i] with A[j]
            i := i + 1
    swap A[i] with A[hi]
    return i

Others are not, e.g. deciding whether an image shows a coffee cup.

Machine Learning Applications


Machine Learning and Artificial Intelligence

Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning

⬣ Deductive reasoning: deduce conclusions from rules; search (e.g. JESS, CLIPS, Drools, Esper, CEP engines, Spark Streams)
⬣ Inductive reasoning: reason from specific examples to general rules or a model (e.g. Scikit Learn, TensorFlow, Rapid Miner, Spark MLlib)
⬣ Abductive reasoning: inference to the best explanation/hypothesis for a set of observations

Adapted from: https://www.datanami.com/2018/03/20/u-s-pursues-abductive-reasoning-to-divine-intent/
Given an image, output a class label
⬣ Often output a probability distribution over labels

Applications:
⬣ Image → Model → class scores (e.g. Car / Coffee Cup / Bird)
⬣ Medical image → Model → class scores (e.g. Normal / Benign / Malignant)

Example: Image Classification


Given a series of measurements, output a prediction for the next time period

Application: Stock market data → Model → prediction (e.g. monthly prices from February through July, with earlier months as input and later months as the prediction)

Example: Time Series Prediction


Very large number of NLP sub-tasks:
⬣ Syntax Parsing
⬣ Parts of speech
⬣ Named entity recognition
⬣ Summarization
⬣ Similarity / paraphrasing

Different from classification: variable-length sequential inputs and/or outputs

Example: Natural Language Processing (NLP)


Sentiment Analysis: Text → Model → class scores (Negative / Neutral / Positive)

Example: Natural Language Processing (NLP)


Application: Decision-making tasks
⬣ Sequence of inputs/outputs
⬣ Actions affect the environment
⬣ Combination of perception and decision-making/controls

(Figure: observations → Model → probability distribution over actions, e.g. {left, right, up, down})

Example: Decision-Making Tasks


Robotics involves a combination of AI/ML techniques:
⬣ Sense: Perception
⬣ Plan: Planning
⬣ Act: Controls/Decision-Making

Some things are learned (perception), while others are programmed
⬣ Evolving landscape

Example: Robotics
Supervised Learning and Parametric Models
Supervised Learning

Dataset: X = {x_1, x_2, ..., x_N} where x ∈ ℝ^d (examples), with labels Y = {y_1, y_2, ..., y_N} where y ∈ ℝ^c

⬣ Train input: X, Y
⬣ Learning output: f : X → Y, e.g. P(y|x)

Terminology:
⬣ Model
⬣ Category / Class
⬣ Note inputs x_i and labels y_i are each represented as vectors

(Figure: dataset table of Example 1 / Label 1 through Example N / Label N)

Types of Machine Learning


Unsupervised Learning

Dataset: X = {x_1, x_2, ..., x_N} where x ∈ ℝ^d (examples only, no labels)

⬣ Input: X
⬣ Learning output: P(x)
⬣ Example: Clustering, density estimation, etc.

Types of Machine Learning


Reinforcement Learning
⬣ Supervision in the form of reward
⬣ No supervision on what action to take

(Agent–environment loop: the agent observes a state, takes an action, and the environment returns a reward and the next state.)

Adapted from: http://cs231n.stanford.edu/slides/2020/lecture_17.pdf

Types of Machine Learning


Supervised Learning:
⬣ Train input: X, Y
⬣ Learning output: f : X → Y, e.g. P(y|x)

Unsupervised Learning:
⬣ Input: X
⬣ Learning output: P(x)
⬣ Example: Clustering, density estimation, etc.

Reinforcement Learning:
⬣ Supervision in the form of reward
⬣ No supervision on what action to take

These are very often combined
⬣ Sometimes within the same model!

Types of Machine Learning


Non-Parametric – Nearest Neighbor

Non-parametric model: no explicit model for the function. Examples:
⬣ Nearest neighbor classifier
⬣ Decision tree

Procedure: take the label of the nearest example (in the figure, a query image is assigned the label of its closest stored example, e.g. "dog"). A small sketch follows below.

Supervised Learning
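A minimal NumPy sketch of the nearest-neighbor procedure above; the toy examples, feature dimension, and labels are made up for illustration:

import numpy as np

def nearest_neighbor_predict(train_X, train_y, query):
    """Return the label of the training example closest to the query (1-NN)."""
    # Euclidean distance from the query to every stored training example
    dists = np.linalg.norm(train_X - query, axis=1)
    return train_y[np.argmin(dists)]

# Toy dataset: 4 stored examples with 3 features each (hypothetical values)
train_X = np.array([[1.0, 0.0, 0.2],   # cat
                    [0.1, 1.0, 0.9],   # dog
                    [0.5, 0.5, 0.0],   # car
                    [0.2, 0.9, 1.0]])  # dog
train_y = np.array(["cat", "dog", "car", "dog"])

print(nearest_neighbor_predict(train_X, train_y, np.array([0.15, 0.95, 0.95])))  # -> "dog"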
Parametric – Linear Classifier

Parametric model: explicitly model the function f : X → Y in the form of a parametrized function f(x, W) = y, for example f(x, W) = Wx + b. Examples:
⬣ Logistic regression/classification
⬣ Neural networks

Procedure: calculate a score per class for the example, and return the label with the maximum score (argmax).

Supervised Learning
Data: Image → Model f(x, W) = Wx + b → Class scores (Car / Coffee Cup / Bird)

Input {X, Y} where:
⬣ X is an image
⬣ Y is a ground truth label annotated by an expert (human)
⬣ f(x, W) = Wx + b is our model, chosen to be a linear function in this case
⬣ W and b are the parameters (weights) of our model that must be learned

Example: Image Classification


Input image is high-dimensional
⬣ For example n = 512, so a 512×512 image = 262,144 pixels
⬣ Learning a classifier with high-dimensional inputs is hard

Before deep learning, it was typical to perform feature engineering
⬣ Hand-design algorithms for converting raw input into a lower-dimensional set of features

(Input image as an n×n matrix x with entries x_11 ... x_nn)

Input Representation: Feature Engineering


Example: Color histogram
⬣ Vector of numbers representing the number of pixels falling within each bin (a small sketch follows below)
⬣ We will later see that learning the feature representation itself is much more effective

(Data: Image → Features: Histogram)

Input Representation: Feature Engineering
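A small sketch of a color-histogram feature in NumPy, counting how many pixel values fall in each bin per channel; the bin count and the random image are illustrative assumptions, not from the slides:

import numpy as np

def color_histogram(image, bins=8):
    """Per-channel histogram of pixel intensities, concatenated into one feature vector."""
    feats = []
    for c in range(image.shape[2]):                   # loop over R, G, B channels
        counts, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        feats.append(counts)
    return np.concatenate(feats).astype(np.float32)   # length = 3 * bins

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3))          # fake 32x32 RGB image
print(color_histogram(img).shape)                     # (24,)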


⬣ Labels are categories, but we need a numerical representation
⬣ Assigning a number to each category is arbitrary
⬣ Instead, represent a probability distribution over categories
⬣ The ground truth label then becomes a probability distribution where the correct category's probability is 1 and all others are 0 (e.g. ground truth 'Coffee Cup' converts to scores: Car 0.0, Coffee Cup 1.0, Bird 0.0)
⬣ Note for regression this is not an issue, as the ground truth label (e.g. a housing price) is already a number

Output Representation: Representing Categories
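A minimal sketch of this one-hot (probability distribution) encoding of a category; the class names are the slide's example, the helper function is hypothetical:

import numpy as np

def one_hot(label_index, num_classes):
    """Probability-distribution representation: 1 at the correct class, 0 elsewhere."""
    y = np.zeros(num_classes)
    y[label_index] = 1.0
    return y

classes = ["Car", "Coffee Cup", "Bird"]
print(one_hot(classes.index("Coffee Cup"), len(classes)))  # [0. 1. 0.]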


Data: Image → Features: Histogram → Model f(x, W) = Wx + b → Class scores (Car / Coffee Cup / Bird)

Input {X, Y} where:
⬣ X is an image histogram
⬣ Y is a ground truth label represented as a probability distribution
⬣ f(x, W) = Wx + b is our model, chosen to be a linear function in this case
⬣ W and b are the weights of our model that must be learned

Example: Image Classification


Data: Text → Features: Word histogram → Model f(x, W) = Wx + b → Class scores (Negative / Neutral / Positive)

Input {X, Y} where:
⬣ X is a sentence, represented as a word histogram of word counts (e.g. this: 1, that: 0, is: 2, ..., extremely: 1, hello: 0, onomatopoeia: 0, ...)
⬣ Y is a ground truth label annotated by an expert (human)
⬣ f(x, W) = Wx + b is our model, chosen to be a linear function in this case
⬣ W and b are the weights of our model that must be learned

Example: Text Classification (Sentiment Analysis)
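A small sketch of the word-histogram (bag-of-words) feature described above; the vocabulary uses words from the slide's table, and the example sentence is made up:

from collections import Counter

def word_histogram(sentence, vocabulary):
    """Count how often each vocabulary word appears in the sentence (bag-of-words feature)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["this", "that", "is", "extremely", "hello", "onomatopoeia"]
print(word_histogram("This is extremely good this is", vocab))  # [2, 0, 2, 1, 0, 0]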


Components of a Parametric Learning Algorithm

⬣ Input (and representation)
⬣ Functional form of the model
  ⬣ Including parameters
⬣ Performance measure to improve
  ⬣ Loss or objective function
⬣ Algorithm for finding the best parameters
  ⬣ Optimization algorithm

(Pipeline: Data: Image → Features: Histogram → Model f(x, W) = Wx + b → Class scores (Car / Coffee Cup / Bird) → Loss Function → Optimizer)

Components of a Parametric Model


f(x, w) = y, where x is the input (a vector), w are the classifier weights, and y is the output (a scalar or vector)

⬣ Input: continuous number or vector
⬣ Output: a continuous number
  ⬣ For classification, typically a score
  ⬣ For regression, the quantity we want to regress to (house prices, crime rate, etc.)
⬣ w is a vector of weights to optimize to fit the target function

Model: Discriminative Parameterized Function


What is the simplest function you can think of? A line:

    y = mx + b

Our model is:

    f(x, w) = w · x + b

where x is the input, w are the weights, b is the bias (a scalar), and f(x, w) is the classifier result.

(Note: if w and x are column vectors we often write this as wᵀx)

Image adapted from: https://en.wikipedia.org/wiki/Linear_equation#/media/File:Linear_Function_Graph.svg

Simple Function
Linear Classification and Regression

Simple linear classifier:
⬣ Calculate the score: f(x, w) = w · x + b
⬣ Binary classification rule (w is a vector): y = 1 if f(x, w) ≥ 0, and y = 0 otherwise
⬣ For a multi-class classifier, take the class with the highest (max) score: f(x, W) = Wx + b
⬣ Idea: separate classes via high-dimensional linear separators (hyperplanes)
⬣ One of the simplest parametric models, but surprisingly effective
⬣ Very commonly used!
⬣ Let's look more closely at each element (a small sketch follows below)

Linear Classification and Regression
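A minimal NumPy sketch of the two rules just described, the binary threshold rule and the multi-class argmax rule; the weights and inputs are made-up values:

import numpy as np

def binary_classify(w, b, x):
    """Binary rule: y = 1 if w·x + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

def multiclass_classify(W, b, x):
    """Multi-class rule: scores = Wx + b, predict the argmax class."""
    scores = W @ x + b
    return int(np.argmax(scores))

w = np.array([0.5, -1.0])
x = np.array([2.0, 0.5])
print(binary_classify(w, b=0.1, x=x))                 # 1, since 0.5*2 - 1*0.5 + 0.1 = 0.6 >= 0

W = np.array([[0.5, -1.0], [-0.2, 0.3], [1.0, 1.0]])  # 3 classes, 2 features (made up)
print(multiclass_classify(W, np.zeros(3), x))         # class 2: scores = [0.5, -0.25, 2.5]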


Data: Image → Model f(x, W) = Wx + b → Class scores (Car / Coffee Cup / Bird)

The n×n input image, with entries x_11 ... x_nn, is flattened into a single column vector [x_11, x_12, ..., x_21, x_22, ..., x_nn]ᵀ.

To simplify notation we will refer to the inputs as x_1 ... x_m, where m = n × n.

Input Dimensionality
Model: f(x, W) = Wx + b

Each row of W is the classifier for one class:

    [ w_11 w_12 ... w_1m ]   [ x_1 ]   [ b_1 ]   ← classifier for class 1
    [ w_21 w_22 ... w_2m ] · [ x_2 ] + [ b_2 ]   ← classifier for class 2
    [ w_31 w_32 ... w_3m ]   [  ⋮  ]   [ b_3 ]   ← classifier for class 3
              W              [ x_m ]      b
                                x

(Note that in practice, implementations can use xW instead, assuming a different shape for W. That is just a different convention and is equivalent.)

Weights
Model: f(x, W) = Wx + b

⬣ We can move the bias term into the weight matrix, and append a "1" at the end of the input vector
⬣ This results in a single matrix-vector multiplication!

    [ w_11 w_12 ... w_1m  b_1 ]   [ x_1 ]
    [ w_21 w_22 ... w_2m  b_2 ] · [  ⋮  ]
    [ w_31 w_32 ... w_3m  b_3 ]   [ x_m ]
                 W                [  1  ]
                                     x

Weights
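A quick NumPy sketch of this "bias trick"; the small weight matrix, bias, and input values are made up, and the point is only that the two forms give the same result:

import numpy as np

# Original form: f(x) = Wx + b
W = np.array([[0.2, -0.5], [1.5, 1.3]])   # 2 classes, 2 features (made up)
b = np.array([1.1, 3.2])
x = np.array([2.0, 3.0])

# Bias trick: append b as an extra column of W and a 1 at the end of x
W_aug = np.hstack([W, b[:, None]])        # shape (2, 3)
x_aug = np.append(x, 1.0)                 # shape (3,)

print(W @ x + b)                          # [ 0.  10.1]
print(W_aug @ x_aug)                      # identical result, one matrix-vector multiply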
Example with an image with 4 pixels and 3 classes (cat/dog/ship). Stretch the pixels into a column vector x = [56, 231, 24, 2]ᵀ:

    [ 0.2  -0.5   0.1   2.0 ]   [  56 ]   [  1.1 ]   [ -96.8 ]  Cat score
    [ 1.5   1.3   2.1   0.0 ] · [ 231 ] + [  3.2 ] = [ 437.9 ]  Dog score
    [ 0.0   0.25  0.2  -0.3 ]   [  24 ]   [ -1.2 ]   [ 60.75 ]  Ship score
               W                [   2 ]       b
                                   x

Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n

Example
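The same computation written out in NumPy, as a quick check of the numbers above:

import numpy as np

W = np.array([[0.2, -0.5,  0.1,  2.0],
              [1.5,  1.3,  2.1,  0.0],
              [0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])
x = np.array([56.0, 231.0, 24.0, 2.0])   # the 4 pixel values stretched into a column

scores = W @ x + b
print(scores)                            # [-96.8  437.9  60.75] -> cat, dog, ship scores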
Visual Viewpoint

We can convert the weight vector for each class back into the shape of the image and visualize it (shown for the ten classes: plane, car, bird, cat, deer, dog, frog, horse, ship, truck).

Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n

Interpreting a Linear Classifier


Geometric Viewpoint

f(x, W) = Wx + b

Each image is an array of 32x32x3 numbers (3072 numbers total). Plot created using Wolfram Cloud.

Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n

Interpreting a Linear Classifier


Case 1 — Class 1: number of pixels > 0 is odd; Class 2: number of pixels > 0 is even
Case 2 — Class 1: 1 <= L2 norm <= 2; Class 2: everything else
Case 3 — Class 1: three modes; Class 2: everything else

Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n

Hard Cases for a Linear Classifier


⬣ Algebraic viewpoint: f(x, W) = Wx
⬣ Visual viewpoint: one template per class
⬣ Geometric viewpoint: hyperplanes cutting up space

Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n

Linear Classifier: Three Viewpoints


Performance Measure for a Classifier

⬣ Input (and representation)
⬣ Functional form of the model
  ⬣ Including parameters
⬣ Performance measure to improve
  ⬣ Loss or objective function
⬣ Algorithm for finding the best parameters
  ⬣ Optimization algorithm

(Pipeline: Data: Image → Features: Histogram → Model f(x, W) = Wx + b → Class scores (Car / Coffee Cup / Bird) → Loss Function → Optimizer)

Components of a Parametric Model


⬣ The output of a classifier can be considered a score
⬣ For a binary classifier, use the rule: y = 1 if f(x, w) ≥ 0, and y = 0 otherwise
⬣ This can be used for many classes by considering one class versus all the rest (one-versus-all)
⬣ For a multi-class classifier, we can take the maximum score

(Figure: Model f(x, W) = Wx + b → class scores for Car / Coffee Cup / Bird)

Classification using Scores


Several issues with scores:
⬣ Not very interpretable (no bounded value)

We often want probabilities:
⬣ More interpretable
⬣ Can relate to a probabilistic view of machine learning

We use the softmax function to convert scores s = f(x, W) to probabilities:

    P(Y = k | X = x) = e^{s_k} / Σ_j e^{s_j}

Converting Scores to Probabilities
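A minimal NumPy sketch of the softmax function above; the max-subtraction is a standard numerical-stability trick that does not change the result, and the example scores are the cat/car/frog values used later in the slides:

import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities: P(Y=k|x) = exp(s_k) / sum_j exp(s_j)."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([3.2, 5.1, -1.7])))   # ~[0.13, 0.87, 0.00]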


We need a performance measure to optimize:
⬣ It penalizes the model for being wrong
⬣ It allows us to modify the model to reduce this penalty
⬣ It is known as an objective or loss function

In machine learning we use empirical risk minimization:
⬣ Reduce the loss over the training dataset
⬣ We average the loss over the training data

Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is the (integer) label, the loss over the dataset is the average of the loss over the examples:

    L = (1/N) Σ_i L_i(f(x_i, W), y_i)

Performance Measure
Multiclass SVM loss (an example of a "hinge loss"):

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:

    L_i = Σ_{j ≠ y_i} { 0                    if s_{y_i} ≥ s_j + 1
                      { s_j − s_{y_i} + 1    otherwise

        = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

(The figure plots the hinge: each other class j is penalized when its score s_j comes within a margin ("delta") of 1 of the correct-class score s_{y_i}.)

Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n

Performance Measure for Scores


Multiclass SVM loss example. Suppose we have 3 training examples and 3 classes. With some W the scores f(x, W) = Wx are:

            image 1   image 2   image 3
    cat        3.2       1.3       2.2
    car        5.1       4.9       2.5
    frog      -1.7       2.0      -3.1

Using L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1), the loss for the first example (correct class: cat) is:

    L_1 = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1)
        = max(0, 2.9) + max(0, −3.9)
        = 2.9 + 0
        = 2.9

Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n

SVM Loss Example
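A small NumPy sketch of the multiclass SVM (hinge) loss, reproducing the 2.9 value computed above for the first example:

import numpy as np

def multiclass_svm_loss(scores, correct_class):
    """L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1)."""
    margins = np.maximum(0, scores - scores[correct_class] + 1)
    margins[correct_class] = 0          # do not count the correct class itself
    return margins.sum()

print(multiclass_svm_loss(np.array([3.2, 5.1, -1.7]), correct_class=0))  # 2.9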


⬣ If we use the softmax function to convert scores to probabilities, the right loss function to use is cross-entropy:

    L_i = −log P(Y = y_i | X = x_i)

⬣ It can be derived by looking at the distance between two probability distributions (the output of the model and the ground truth)
⬣ It can also be derived from a maximum likelihood estimation perspective: choose the parameters that maximize the likelihood of the observed data

Performance Measure for Probabilities


Softmax Classifier (Multinomial Logistic Regression)

We want to interpret raw classifier scores as probabilities. With s = f(x_i; W):

    P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j}   (softmax function)

Probabilities must be ≥ 0 and must sum to 1. The loss is L_i = −log P(Y = y_i | X = x_i).

            logits (unnormalized log-probabilities)   exp (unnormalized probs)   normalized probabilities
    cat                    3.2                               24.5                        0.13
    car                    5.1                              164.0                        0.87
    frog                  -1.7                                0.18                       0.00

For the correct class "cat": L_i = −log(0.13)

Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, from CS 231n

Cross-Entropy Loss Example
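A minimal NumPy sketch of the cross-entropy loss computed directly from the scores (softmax followed by −log of the correct-class probability), using the same cat/car/frog scores as above:

import numpy as np

def cross_entropy_loss(scores, correct_class):
    """L_i = -log P(Y = y_i | X = x_i), with P given by the softmax of the scores."""
    shifted = scores - np.max(scores)                  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[correct_class]

print(cross_entropy_loss(np.array([3.2, 5.1, -1.7]), correct_class=0))  # ~2.04 = -log(0.13)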


If we are performing regression, we can directly optimize to match the ground truth value
⬣ Example: house price prediction

    L_i = |y − Wx_i|     (L1 loss)
    L_i = |y − Wx_i|²    (L2 loss)

⬣ For probabilities, the logistic/softmax form is used instead: P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j}

Source: https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/loss3.jpg

Regression Example
Often, we add a regularization term to the loss function, e.g. an L2 loss with L1 regularization on the weights:

    L_i = |y − Wx_i|² + |W|

Example regularizations:
⬣ L1/L2 on the weights (encourage small values)

Regularization
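A small NumPy sketch of the regularized loss above; the regularization strength lam is an assumption (the slide's formula has no explicit weighting), and the weights/data are made-up values:

import numpy as np

def regularized_loss(W, x, y, lam=0.1):
    """L2 regression loss plus an L1 penalty on the weights (lam is a made-up value)."""
    pred = W @ x
    data_loss = np.sum((y - pred) ** 2)     # |y - Wx|^2
    reg_loss = lam * np.sum(np.abs(W))      # |W| (L1 regularization, encourages small weights)
    return data_loss + reg_loss

W = np.array([[0.5, -0.2]])
x = np.array([1.0, 2.0])
y = np.array([0.3])
print(regularized_loss(W, x, y))            # (0.3 - 0.1)^2 + 0.1*0.7 = 0.04 + 0.07 = 0.11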
Gradient Descent

⬣ Input (and representation)
⬣ Functional form of the model
  ⬣ Including parameters
⬣ Performance measure to improve
  ⬣ Loss or objective function
⬣ Algorithm for finding the best parameters
  ⬣ Optimization algorithm

(Pipeline: Data: Image → Features: Histogram → Model f(x, W) = Wx + b → Class scores (Car / Coffee Cup / Bird) → Loss Function → Optimizer)

Components of a Parametric Model


Given a model and loss function, finding the best set of weights is a search problem
⬣ Find the combination of weights that minimizes our loss function

Several classes of methods:
⬣ Random search
⬣ Genetic algorithms (population-based search)
⬣ Gradient-based optimization

In deep learning, gradient-based methods are dominant, although they are not the only possible approach.

Optimization
As the weights change, the loss changes as well
⬣ The loss is often somewhat smooth locally, so small changes in the weights produce small changes in the loss

We can therefore think about iterative algorithms that take the current values of the weights and modify them a bit.

Loss Surfaces
Strategy: Follow the Slope!

⬣ We can find the steepest descent direction by computing the derivative (gradient):

    f'(a) = lim_{h→0} [f(a + h) − f(a)] / h

⬣ The steepest descent direction is the negative gradient
⬣ Intuitively: the derivative measures how the function changes as the argument a changes by a small step size, as the step size goes to zero
⬣ In machine learning: we want to know how the loss function changes as the weights are varied
  ⬣ We can consider each parameter separately by taking the partial derivative of the loss function with respect to that parameter

Image and equation from: https://en.wikipedia.org/wiki/Derivative#/media/File:Tangent_animation.gif

Derivatives
This idea can be turned into an algorithm (gradient descent); a small sketch follows below:

⬣ Choose a model: f(x, W) = Wx
⬣ Choose a loss function: L_i = |y − Wx_i|²
⬣ Calculate the partial derivative for each parameter: ∂L/∂w_j
⬣ Update the parameters: w_j = w_j − ∂L/∂w_j
⬣ Add a learning rate to prevent too big of a step: w_j = w_j − α ∂L/∂w_j

Gradient Descent
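A minimal gradient-descent loop for the model and loss listed above (f(x, W) = Wx with squared error), using the analytic gradient; the synthetic data, learning rate, and step count are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 examples, 3 features (synthetic)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                           # targets generated from a known linear model

w = np.zeros(3)                          # initial parameters
alpha = 0.01                             # learning rate
for step in range(500):
    residual = y - X @ w                 # delta_i = y_i - w.x_i
    grad = -2 * X.T @ residual / len(X)  # dL/dw for the mean squared error
    w = w - alpha * grad                 # parameter update
print(w)                                 # approaches [1.0, -2.0, 0.5]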
Often, we only compute the gradients across a small subset of the data

⬣ Full batch gradient descent: L = (1/N) Σ_i L_i(f(x_i, W), y_i)
⬣ Mini-batch gradient descent: L = (1/M) Σ_i L_i(f(x_i, W), y_i)
  ⬣ Where M is the size of a mini-batch (subset) of the data
⬣ We iterate over mini-batches:
  ⬣ Get a mini-batch, compute the loss, compute the derivatives, and take a step

Mini-Batch Gradient Descent
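A small sketch of the mini-batch loop described above, again on synthetic linear-regression data; the batch size, learning rate, and epoch count are made-up choices:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])                     # synthetic regression targets

w, alpha, batch_size = np.zeros(3), 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(X))                    # shuffle, then walk through mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]                        # get a mini-batch
        grad = -2 * Xb.T @ (yb - Xb @ w) / len(Xb)     # loss derivative on the mini-batch only
        w = w - alpha * grad                           # take a step
print(w)                                               # close to [1.0, -2.0, 0.5]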


Gradient descent is guaranteed to converge under some conditions

⬣ For example, the learning rate has to be appropriately reduced throughout training
⬣ It will converge to a local minimum
  ⬣ Small changes in the weights would not decrease the loss
⬣ It turns out that some of the local minima it finds in practice (if trained well) are still pretty good!

Gradient Descent Properties


We know how to compute the model output and the loss function.

Several ways to compute ∂L/∂w_i:
⬣ Manual differentiation
⬣ Symbolic differentiation
⬣ Numerical differentiation
⬣ Automatic differentiation

Computing Gradients
For some functions, we can analytically derive the partial derivative.

Example: derivation of the update rule for a linear model with squared-error loss.

Function: f(w, x_i) = wᵀx_i (assume w and x_i are column vectors, so this is the same as w · x_i)
Loss: L = Σ_{i=1..N} (y_i − wᵀx_i)²
Dataset: N examples (indexed by i)

Gradient descent tells us we should update w as follows to minimize L:

    w_j ← w_j − η ∂L/∂w_j

So what is ∂L/∂w_j?

    ∂L/∂w_j = ∂/∂w_j Σ_{i=1..N} (y_i − wᵀx_i)²
            = Σ_{i=1..N} 2 (y_i − wᵀx_i) ∂/∂w_j (y_i − wᵀx_i)
            = −2 Σ_{i=1..N} δ_i ∂/∂w_j (wᵀx_i)                where δ_i = y_i − wᵀx_i
            = −2 Σ_{i=1..N} δ_i ∂/∂w_j Σ_{k=1..m} w_k x_ik
            = −2 Σ_{i=1..N} δ_i x_ij

Update rule:  w_j ← w_j + 2η Σ_{i=1..N} δ_i x_ij
Manual Differentiation
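The derived update rule w_j ← w_j + 2η Σ_i δ_i x_ij, written out in vectorized NumPy; the synthetic dataset and the learning rate η are made-up values for illustration:

import numpy as np

def linear_regression_step(w, X, y, eta=0.001):
    """One manual-gradient update: w_j <- w_j + 2*eta*sum_i (y_i - w.x_i) * x_ij."""
    delta = y - X @ w                  # delta_i = y_i - w^T x_i, one per example
    return w + 2 * eta * X.T @ delta   # vectorized form of the per-component update

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([0.7, -1.3])
w = np.zeros(2)
for _ in range(300):
    w = linear_regression_step(w, X, y)
print(w)                               # approaches [0.7, -1.3]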
If we add a non-linearity (sigmoid), the derivation is more complex:

    σ(x) = 1 / (1 + e^{−x})

First, one can derive that σ'(x) = σ(x)(1 − σ(x)).

    f(x) = σ(Σ_k w_k x_k)

    L = Σ_i ( y_i − σ(Σ_k w_k x_ik) )²

    ∂L/∂w_j = Σ_i 2 ( y_i − σ(Σ_k w_k x_ik) ) ( −∂/∂w_j σ(Σ_k w_k x_ik) )
            = Σ_i −2 ( y_i − σ(Σ_k w_k x_ik) ) σ'(Σ_k w_k x_ik) ∂/∂w_j Σ_k w_k x_ik
            = Σ_i −2 δ_i σ(d_i)(1 − σ(d_i)) x_ij

    where d_i = Σ_k w_k x_ik  and  δ_i = y_i − f(x_i)

The sigmoid perceptron update rule:

    w_j ← w_j + 2η Σ_{i=1..N} δ_i σ_i (1 − σ_i) x_ij

    where σ_i = σ(Σ_{k=1..m} w_k x_ik)  and  δ_i = y_i − σ_i

Adding a Non-Linear Function
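A small NumPy sketch of the sigmoid-unit update rule above; the binary targets, learning rate, and iteration count are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_unit_step(w, X, y, eta=0.1):
    """w_j <- w_j + 2*eta*sum_i delta_i * sigma_i * (1 - sigma_i) * x_ij."""
    s = sigmoid(X @ w)                  # sigma_i for every example
    delta = y - s                       # delta_i = y_i - sigma_i
    return w + 2 * eta * X.T @ (delta * s * (1 - s))

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)   # binary targets from a linear rule
w = np.zeros(2)
for _ in range(2000):
    w = sigmoid_unit_step(w, X, y)
print((sigmoid(X @ w).round() == y).mean())         # prediction accuracy, close to 1.0 on most runs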


Given a library of simple functions (sin(x), cos(x), log(x), exp(x), x³, ...), we can compose them into a complicated function, e.g.:

    −log( 1 / (1 + e^{−w·x}) )

which decomposes into a chain of simpler pieces:

    u = w·x   →   p = 1 / (1 + e^{−u})   →   L = −log(p)

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Decomposing a Function
Linear Algebra View: Vector and Matrix Sizes
With the bias folded into the weight matrix:

    [ w_11 w_12 ... w_1m  b_1 ]   [ x_1 ]
    [ w_21 w_22 ... w_2m  b_2 ] · [ x_2 ]
    [ w_31 w_32 ... w_3m  b_3 ]   [  ⋮  ]
                 W                [ x_m ]
                                  [  1  ]
                                     x

Sizes: W is c × (d + 1) and x is (d + 1) × 1, where c is the number of classes and d is the dimensionality of the input.

Closer Look at a Linear Classifier


Conventions:
⬣ Size of derivatives for scalars, vectors, and matrices: assume we have a scalar s ∈ ℝ¹, a vector v ∈ ℝ^m, i.e. v = [v_1, v_2, ..., v_m]ᵀ, and a matrix M ∈ ℝ^{k×ℓ}
⬣ What is the size of ∂v/∂s? ℝ^{m×1} (a column vector of size m): [∂v_1/∂s, ∂v_2/∂s, ..., ∂v_m/∂s]ᵀ
⬣ What is the size of ∂s/∂v? ℝ^{1×m} (a row vector of size m): [∂s/∂v_1, ∂s/∂v_2, ..., ∂s/∂v_m]

Dimensionality of Derivatives
Conventions:
⬣ What is the size of ∂v¹/∂v²? A matrix whose entry at row i, column j is ∂v¹_i / ∂v²_j
⬣ This matrix of partial derivatives is called a Jacobian

(Note this is a slightly different convention than on Wikipedia)

Dimensionality of Derivatives
Conventions:
⬣ What is the size of ∂s/∂M? A matrix of the same shape as M, whose entry at row i, column j is ∂s/∂m[i,j]

Dimensionality of Derivatives
⬣ What is the size of ∂L/∂W?
⬣ Remember that the loss is a scalar and W is a matrix (with the bias folded in):

    W = [ w_11 w_12 ... w_1m  b_1 ]
        [ w_21 w_22 ... w_2m  b_2 ]
        [ w_31 w_32 ... w_3m  b_3 ]

The Jacobian is also a matrix, of the same shape as W:

    [ ∂L/∂w_11  ∂L/∂w_12  ...  ∂L/∂w_1m  ∂L/∂b_1 ]
    [ ∂L/∂w_21  ∂L/∂w_22  ...  ∂L/∂w_2m  ∂L/∂b_2 ]
    [ ∂L/∂w_31  ∂L/∂w_32  ...  ∂L/∂w_3m  ∂L/∂b_3 ]

Dimensionality of Derivatives
Batches of data are matrices or tensors (multi-dimensional matrices). Examples:
⬣ Each instance is a vector of size m; our batch is of size [B×m]
⬣ Each instance is a matrix (e.g. a grayscale image) of size W×H; our batch is [B×W×H]
⬣ Each instance is a multi-channel matrix (e.g. a color image with R, G, B channels) of size C×W×H; our batch is [B×C×W×H]

Jacobians become tensors, which is complicated
⬣ Instead, flatten the input to a vector and get a vector of derivatives (the figure shows an n×n matrix flattened into the vector [x_11, x_12, ..., x_nn])
⬣ This can also be done for partial derivatives between two vectors, two matrices, or two tensors

Jacobians of Batches
How is Deep Learning Different? Hierarchical Compositionality
So What is Deep (Machine) Learning?

⬣ Representation Learning
⬣ Neural Networks
⬣ Deep Unsupervised / Reinforcement / Structured / <insert-qualifier-here> Learning
⬣ Simply: Deep Learning

(Hierarchical) Compositionality:
⬣ Cascade of non-linear transformations
⬣ Multiple layers of representations

End-to-End Learning:
⬣ Learning (goal-driven) representations
⬣ Learning feature extraction

Distributed Representations:
⬣ No single neuron "encodes" everything
⬣ Groups of neurons work together

So What is Deep (Machine) Learning?


VISION: image → hand-crafted features (SIFT/HOG, fixed) → your favorite classifier (learned) → "car"

SPEECH: audio → hand-crafted features (MFCC, fixed) → your favorite classifier (learned) → \ˈd ē p\

NLP: "This burrito place is yummy and fun!" → hand-crafted features (Bag-of-words, fixed) → your favorite classifier (learned) → "+"

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Traditional Machine Learning


VISION: pixels → edge → texton → motif → part → object

SPEECH: sample → spectral band → formant → motif → phone → word

NLP: character → word → NP/VP/... → clause → sentence → story

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Hierarchical Compositionality
Given a library of simple functions (sin(x), cos(x), log(x), exp(x), x³, ...), compose them into a complicated function.

Idea 1: Linear combinations
⬣ Boosting
⬣ Kernels
⬣ ...

    f(x) = Σ_i α_i g_i(x)

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Building A Complicated Function


Given a library of simple functions (sin(x), cos(x), log(x), exp(x), x³, ...), compose them into a complicated function.

Idea 2: Compositions
⬣ Deep Learning
⬣ Grammar models
⬣ Scattering transforms, ...

    f(x) = g_1(g_2(... g_n(x) ...))

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Building A Complicated Function


Given a library of simple functions (sin(x), cos(x), log(x), exp(x), x³, ...), compose them into a complicated function.

Idea 2: Compositions
⬣ Deep Learning
⬣ Grammar models
⬣ Scattering transforms, ...

    f(x) = log( cos( exp( sin³(x) ) ) )

    sin(x) → x³ → exp(x) → cos(x) → log(x)

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Building A Complicated Function


Image → Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier → "car"

Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Deep Learning = Hierarchical Compositionality


How is Deep Learning Different? End-to-End Learning
(Hierarchical) Compositionality:
⬣ Cascade of non-linear transformations
⬣ Multiple layers of representations

End-to-End Learning:
⬣ Learning (goal-driven) representations
⬣ Learning feature extraction

Distributed Representations:
⬣ No single neuron "encodes" everything
⬣ Groups of neurons work together

So What is Deep (Machine) Learning?


VISION: image → hand-crafted features (SIFT/HOG, fixed) → your favorite classifier (learned) → "car"

SPEECH: audio → hand-crafted features (MFCC, fixed) → your favorite classifier (learned) → \ˈd ē p\

NLP: "This burrito place is yummy and fun!" → hand-crafted features (Bag-of-words, fixed) → your favorite classifier (learned) → "+"

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Traditional Machine Learning


SIFT, Spin Images, HoG, Textons, and many many more...

Feature Engineering
VISION: image → SIFT/HOG (fixed) → K-Means/pooling ("learned", unsupervised) → classifier (supervised) → "car"

SPEECH: audio → MFCC (fixed) → Mixture of Gaussians ("learned", unsupervised) → classifier (supervised) → \ˈd ē p\

NLP: "This burrito place is yummy and fun!" → Parse Tree Syntactic (fixed) → n-grams ("learned", unsupervised) → classifier (supervised) → "+"

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Traditional Machine Learning (more accurately)


VISION: image → SIFT/HOG (fixed) → K-Means/pooling ("learned", unsupervised) → classifier (supervised) → "car"

SPEECH: audio → MFCC (fixed) → Mixture of Gaussians ("learned", unsupervised) → classifier (supervised) → \ˈd ē p\

NLP: "This burrito place is yummy and fun!" → Parse Tree Syntactic (fixed) → n-grams ("learned", unsupervised) → classifier (supervised) → "+"

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

Deep Learning = End-to-End Learning


"Shallow" models: Hand-crafted Feature Extractor (fixed) → "Simple" Trainable Classifier (learned)

Deep models: Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier (learned internal representations)

Adapted from slides by: Marc'Aurelio Ranzato, Yann LeCun

“Shallow” vs Deep Learning


How is Deep Learning Different? Distributed Representations
(Hierarchical) Compositionality:
⬣ Cascade of non-linear transformations
⬣ Multiple layers of representations

End-to-End Learning:
⬣ Learning (goal-driven) representations
⬣ Learning feature extraction

Distributed Representations:
⬣ No single neuron "encodes" everything
⬣ Groups of neurons work together

So What is Deep (Machine) Learning?


Local vs Distributed

(Figure: a toy example contrasting (a) a local representation with (b) a distributed one, using patterns described by orientation (horizontal/vertical) and shape (rectangle/ellipse); some units correspond to "no pattern".)

Adapted from slides by Moontae Lee

Distributed Representations Toy Example


Local = VR + HR + HE = ?

Distributed = V+H+E ≈

Adapted from slides by Moontae Lee

Power of Distributed Representations!


(Hierarchical) Compositionality:
⬣ Cascade of non-linear transformations
⬣ Multiple layers of representations

End-to-End Learning:
⬣ Learning (goal-driven) representations
⬣ Learning feature extraction

Distributed Representations:
⬣ No single neuron "encodes" everything
⬣ Groups of neurons work together

So What is Deep (Machine) Learning?
