Lecture 1
8 ÷ 2(2 + 2)
16
calc(*args)
Why do we need to study machine learning?
Machine learning: revolution in technology
Machine learning: revolution in science
Machine learning: revolution in engineering
Types of Learning
• Supervised learning:
Example: distinguish photos of cats from photos of dogs
• Unsupervised learning
Example: figure out that cat and dog photos show different animals
• Reinforcement learning
Example: play Go
Types of Learning
• Supervised learning:
• Linear and nonlinear models
• Basic learning and approximation theory
• Learning/optimization algorithms
• Unsupervised learning
• Dimensional reduction, clustering and generative models
• Reinforcement learning
• Markov decision processes, reinforcement learning algorithms
What this course is
• A (hopefully) gentle introduction to machine learning and deep learning.
• A holistic view of the modern interplay of deep learning models with applied mathematics, including optimization, differential equations, and control.
Other examples
• Video captures
• Financial time series
• Numerical measurements from experiments
What about general discrete data?
We make an important distinction
• Ordinal data
Data that has a natural notion of order, e.g.
• Star ratings of a product
• Level of language proficiency
• Letter grades of a class
• Nominal data
Data that has no order, e.g.
• Categories of image classification
• Answers to True/False questions
We need to embed these discrete data into something we can represent on a computer, e.g. real/floating-point numbers.
The type of embedding depends on the nature of the data!
• Ordinal data
We want the embedding to preserve this ordering, so we typically use real numbers:
⋆,⋆⋆,⋆⋆⋆ → 1, 2, 3
• Nominal data
This is somewhat the opposite: we want the embedding not to introduce spurious ordering, e.g. one-hot embedding
$$\text{apple},\ \text{orange},\ \text{pear} \;\to\; \begin{pmatrix}1\\0\\0\end{pmatrix},\ \begin{pmatrix}0\\1\\0\end{pmatrix},\ \begin{pmatrix}0\\0\\1\end{pmatrix}$$
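As a concrete illustration, here is a minimal NumPy sketch of both embeddings; the category lists and dictionary names are our own, not from the lecture:

```python
import numpy as np

# Ordinal data: embed into real numbers that preserve the order.
stars = ["*", "**", "***"]
ordinal_embedding = {s: float(i + 1) for i, s in enumerate(stars)}  # * -> 1.0, ** -> 2.0, ...

# Nominal data: one-hot vectors introduce no spurious ordering.
fruits = ["apple", "orange", "pear"]
one_hot = {f: np.eye(len(fruits))[i] for i, f in enumerate(fruits)}

print(ordinal_embedding["**"])  # 2.0
print(one_hot["orange"])        # [0. 1. 0.]
```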
Classes of machine learning problems
Supervised Learning Unsupervised Learning Reinforcement Learning
Regression Clustering Value iteration
Classification Dimensional reduction Policy gradient
Function approximation Generative models Actor-critic
Inverse problems/design Anomaly detection Exploration
… … …
$$\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}}$$
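A minimal sketch of such a train/test split; the 80/20 ratio and the placeholder data are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)              # placeholder inputs
y = rng.normal(size=N)              # placeholder labels

idx = rng.permutation(N)            # shuffle, then split
n_train = int(0.8 * N)              # assumed 80/20 split
x_train, y_train = x[idx[:n_train]], y[idx[:n_train]]
x_test, y_test = x[idx[n_train:]], y[idx[n_train:]]
```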
Examples
• Image recognition
• Weather prediction
• Stock price prediction
• …
Given dataset: $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$
Inputs: $x_i$; Outputs/labels: $y_i$; Data size: $N$
Goal: learn the relationship $x_i \to y_i$
$$x_1 \;\xrightarrow{\;f^*\ (\text{Oracle})\;}\; y_1 = \text{Cat} = \begin{pmatrix}1\\0\end{pmatrix}, \qquad x_2 \;\xrightarrow{\;f^*\;}\; y_2 = \text{Dog} = \begin{pmatrix}0\\1\end{pmatrix}$$
$$\mathcal{D} = \big\{(x_i,\; y_i = f^*(x_i))\big\}_{i=1}^{N}$$

$$\min_{f \in \mathcal{H}} R_{\text{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i),\; \underbrace{f^*(x_i)}_{=\,y_i}\big)$$
This is called empirical risk minimization (ERM)
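To make this concrete, here is a small sketch that evaluates the empirical risk of candidate models; the squared loss and the toy oracle are our choices for illustration:

```python
import numpy as np

def empirical_risk(f, xs, ys, loss=lambda yhat, y: 0.5 * (yhat - y) ** 2):
    """R_emp(f) = (1/N) * sum_i L(f(x_i), y_i)."""
    return np.mean([loss(f(xi), yi) for xi, yi in zip(xs, ys)])

xs = np.linspace(0.0, 1.0, 10)
ys = 2.0 * xs + 1.0                                     # oracle f*(x) = 2x + 1
print(empirical_risk(lambda x: 2.0 * x + 1.0, xs, ys))  # 0.0 (f = f*)
print(empirical_risk(lambda x: x, xs, ys))              # > 0 for a worse candidate
```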
So, is learning just optimization?
We want to do well on unseen data! In other words, our model must generalize.
What we can solve (empirical risk minimization)
≠
What we really want to solve: population risk minimization
$$\min_{f \in \mathcal{H}} R_{\text{pop}}(f) = \mathbb{E}_{x \sim \mu}\big[L\big(f(x),\; f^*(x)\big)\big]$$
where $\mu$ denotes the data distribution.
Three paradigms of supervised learning
[Diagram: the hypothesis space $\mathcal{H}$ and the target $f^*$. Approximation relates $f^*$ to the best element $f_{\mathcal{H}} \in \mathcal{H}$; Optimization (using $\mathcal{D}$) produces the learned $\hat{f}$; Generalization relates performance on $\mathcal{D}$ to performance on unseen data.]
Linear Models
Simple linear regression
$$\mathcal{H} = \{f : f(x) = w_0 + w_1 x,\;\; w_0 \in \mathbb{R},\; w_1 \in \mathbb{R}\}$$
Solution: set
$$\frac{\partial R_{\text{emp}}}{\partial w_0}(\hat{w}_0, \hat{w}_1) = 0 \quad \text{and} \quad \frac{\partial R_{\text{emp}}}{\partial w_1}(\hat{w}_0, \hat{w}_1) = 0$$

$$\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}, \qquad \hat{w}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \bar{x} = \frac{1}{N}\sum_i x_i, \quad \bar{y} = \frac{1}{N}\sum_i y_i$$

$$\hat{f}(x) = \hat{w}_0 + \hat{w}_1 x$$
Ordinary Least Squares Formula (1D)
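A minimal NumPy sketch of this closed-form solution; the synthetic dataset is ours for illustration:

```python
import numpy as np

def ols_1d(x, y):
    """Closed-form simple linear regression: returns (w0_hat, w1_hat)."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)  # noisy line
w0, w1 = ols_1d(x, y)
print(w0, w1)  # close to (1.0, 2.0)
```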
Approximation
Is the linear hypothesis space large enough?
Mean square loss:
$$L(y', y) = \frac{1}{2}\,(y - y')^2$$

Huber loss:
$$L(y', y) = \begin{cases} \dfrac{1}{2}\,(y - y')^2 & \text{if } |y - y'| \le \delta \\[6pt] \delta\Big(|y - y'| - \dfrac{1}{2}\,\delta\Big) & \text{otherwise} \end{cases}$$
Mean square vs Huber loss in regression
We perform a linear regression on a noisy dataset with outliers.
What do you observe?
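A sketch of this experiment, assuming plain gradient descent for the Huber fit and δ = 1 (both our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)
y[::10] += 5.0                                    # inject outliers

# Mean-square fit: closed-form OLS.
w1_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0_ls = y.mean() - w1_ls * x.mean()

# Huber fit: gradient descent on the Huber risk (delta = 1.0).
def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss with respect to the residual r."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

w0, w1 = 0.0, 0.0
for _ in range(5000):
    g = huber_grad(w0 + w1 * x - y)
    w0 -= 0.1 * g.mean()
    w1 -= 0.1 * (g * x).mean()

print("MSE fit:  ", w0_ls, w1_ls)   # pulled toward the outliers
print("Huber fit:", w0, w1)         # closer to (1.0, 2.0)
```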
General linear basis models
The simple linear regression we have seen is quite limited
• Only for 1D inputs
• Can only fit linear relationships
$$\mathcal{H}_M = \Big\{f : f(x) = \sum_{j=0}^{M-1} w_j\, \phi_j(x),\;\; w_j \in \mathbb{R}\Big\}$$
in compact form
$$\min_{w \in \mathbb{R}^M} \frac{1}{2N}\,\|\Phi w - y\|^2$$

$$\Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \ddots & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}, \qquad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_{M-1} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$
We want to solve
$$\min_{w \in \mathbb{R}^M} \frac{1}{2N}\,\|\Phi w - y\|^2$$

Setting the gradient to zero gives
$$\Phi^{\top}\big(\Phi \hat{w} - y\big) = 0$$

Rearranging, we have the general Ordinary Least Squares formula
$$\hat{w} = \big(\Phi^{\top}\Phi\big)^{-1}\Phi^{\top} y$$
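A minimal sketch of general OLS with a polynomial basis $\phi_j(x) = x^j$ (our choice of basis); we use `np.linalg.lstsq` rather than forming the explicit inverse, which is numerically safer but solves the same problem:

```python
import numpy as np

def design_matrix(x, M):
    """Phi[i, j] = phi_j(x_i) with the polynomial basis phi_j(x) = x**j."""
    return np.vander(x, M, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

Phi = design_matrix(x, M=6)
# Solve min_w ||Phi w - y||^2, i.e. w_hat = (Phi^T Phi)^{-1} Phi^T y.
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)
```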
When $\Phi^{\top}\Phi$ is not invertible, the minimizer is not unique: every
$$\hat{w}(u) = \Phi^{\dagger} y + \big(I - \Phi^{\dagger}\Phi\big)\,u, \qquad u \in \mathbb{R}^{M}$$
is a solution, where $\Phi^{\dagger}$ is the Moore–Penrose pseudoinverse.
Recall:
$$\mathcal{H}_M = \Big\{f : f(x) = \sum_{j=0}^{99} w_j\, x^j\Big\}$$
so $M = 100$, but $N = 10$.
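A sketch of this underdetermined case: with $M = 100$ and $N = 10$, every choice of $u$ below interpolates the training data exactly (seed and basis are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 100
x = np.linspace(-1.0, 1.0, N)
y = rng.normal(size=N)

Phi = np.vander(x, M, increasing=True)     # phi_j(x) = x**j, j = 0..99
Phi_pinv = np.linalg.pinv(Phi)             # Moore-Penrose pseudoinverse

u = rng.normal(size=M)                     # arbitrary direction
w_min = Phi_pinv @ y                       # minimum-norm solution (u = 0)
w_alt = Phi_pinv @ y + (np.eye(M) - Phi_pinv @ Phi) @ u

# Both fit the N training points exactly, yet the weights differ.
print(np.allclose(Phi @ w_min, y), np.allclose(Phi @ w_alt, y))
print(np.linalg.norm(w_min), np.linalg.norm(w_alt))
```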
For classification we compose the linear model with a softmax output:
$$\mathcal{H}_M = \Big\{f : f(x) = g\Big(\sum_{j=0}^{M-1} w_j\, \phi_j(x)\Big),\;\; w_j \in \mathbb{R}^{2}\Big\}$$
$$g(z)_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$$
$$\min_{W \in \mathbb{R}^{M \times 2}} R_{\text{emp}}(W) = \min_{W \in \mathbb{R}^{M \times 2}} \frac{1}{N} \sum_{i=1}^{N} L\Big(g\big((\Phi W)_i\big),\; y_i\Big)$$
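A minimal sketch of this softmax classifier trained by gradient descent; the cross-entropy loss, toy dataset, and step size are our assumptions:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # subtract max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, M, K = 100, 3, 2
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, M, increasing=True)       # basis: 1, x, x^2
Y = np.zeros((N, K))
Y[np.arange(N), (x > 0).astype(int)] = 1.0   # one-hot labels: class = sign of x

W = np.zeros((M, K))
for _ in range(2000):
    P = softmax(Phi @ W)
    grad = Phi.T @ (P - Y) / N               # gradient of the cross-entropy risk
    W -= 0.5 * grad

accuracy = (softmax(Phi @ W).argmax(axis=1) == Y.argmax(axis=1)).mean()
print(accuracy)                              # near 1.0 on this separable toy set
```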