
Deep Learning from First Principles

Tan Minh Nguyen


Department of Mathematics, NUS
Logistics
• Discussion Forums
• CampusWire: https://campuswire.com/p/GE20115D5
(Access Code: 7410)
• Course Materials
• Slides, lecture notes, and textbooks
• Where: Files on CampusWire
Introduction
Can machines think?
I propose to consider the question, "Can machines think?" This should begin with definitions of the meaning of the terms "machine" and "think." The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous…

Alan Turing, 1950

Turing, A. M. (1950). "Computing Machinery and Intelligence". Mind 59 (236): 433–460.


The Imitation Game

I believe that in about fifty years' time it will be possible to programme computers … to make them play the imitation game so well that an average interrogator will not have more than 70 percent chance of making the right identification after five minutes of questioning.

Alan M. Turing, 1950
https://en.wikipedia.org/wiki/Turing_test#/media/File:Turing_test_diagram.png
From artificial intelligence to machine learning

Can machines think?

Can machines do what thinking beings do?

How can machines learn to do some things that thinking beings do?
In this short course, we are interested in the study of algorithmic and mathematical approaches to (deep) learning.
A concrete definition of learning
A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with
experience E.
Tom Mitchell

[Figure: computation vs. learning. Given a program calc(*args) and the input 8 ÷ 2(2 + 2), computation produces the output 16; learning instead infers the program calc(*args) from the input 8 ÷ 2(2 + 2) and the output 16.]
Why do we need to study machine learning?
Machine learning: revolution in technology
Machine learning: revolution in science
Machine learning: revolution in engineering
Types of Learning
• Supervised learning:
Example: distinguish photos of cats from photos of dogs
• Unsupervised learning
Example: figure out that cat and dog photos show different
animals
• Reinforcement learning
Example: play Go
Types of Learning
• Supervised learning:
• Linear and nonlinear models
• Basic learning and approximation theory
• Learning/optimization algorithms
• Unsupervised learning
• Dimensionality reduction, clustering, and generative models
• Reinforcement learning
• Markov decision processes, reinforcement learning algorithms
What this course is
• A (hopefully) gentle introduction to machine learning and
deep learning.
• A holistic view of the modern interplay of deep learning
models with applied mathematics, including optimization,
differential equations, and control.

What this course isn't

• A comprehensive survey of state-of-the-art machine learning models and methods
• A "math class"
Preliminaries
Types of data (from our survey)
[Figure: survey results on students' math/Python backgrounds, prior use of ML, whether this is their introductory class, and their expectations.]


Representing data in computers
Many kinds of data are numerical in nature

Other examples
• Video captures
• Financial time series
• Numerical measurements from experiments
What about general discrete data?
We make an important distinction
• Ordinal data
Data that has a natural notion of order, e.g.
• Star ratings of a product
• Level of language proficiency
• Letter grades of a class
• Nominal data
Data that has no order, e.g.
• Categories of image classification
• Answers to True/False questions
We need to embed these discrete data into something we can represent on a computer, e.g. real/floating-point numbers.
The type of embedding depends on the nature of the data!
• Ordinal data
We want the embedding to preserve this ordering, so we typically use real numbers
⋆,⋆⋆,⋆⋆⋆ → 1, 2, 3
• Nominal data
This is somewhat the opposite: we want the embedding not to introduce spurious ordering, e.g. one-hot embedding

$\mathrm{apple}, \mathrm{orange}, \mathrm{pear} \;\to\; \begin{pmatrix}1\\0\\0\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}$
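A minimal sketch of both embeddings in Python (the category lists and helper names here are our own, purely for illustration):

```python
import numpy as np

def embed_ordinal(levels, value):
    # Ordinal data: map to real numbers that preserve the order.
    return float(levels.index(value) + 1)

def embed_one_hot(categories, value):
    # Nominal data: a one-hot vector introduces no spurious ordering.
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

print(embed_ordinal(["*", "**", "***"], "**"))               # 2.0
print(embed_one_hot(["apple", "orange", "pear"], "orange"))  # [0. 1. 0.]
```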
Classes of machine learning problems
Supervised Learning: regression, classification, function approximation, inverse problems/design, …
Unsupervised Learning: clustering, dimensionality reduction, generative models, anomaly detection, …
Reinforcement Learning: value iteration, policy gradient, actor-critic, exploration, …

There are many intersections between them!


Evaluation and Selection using Data
In more quantitative terms, given a dataset $\mathcal{D}$, we split it into

$\mathcal{D} = \mathcal{D}_{\mathrm{train}} \cup \mathcal{D}_{\mathrm{test}}$

• $\mathcal{D}_{\mathrm{train}}$ is called the training set, and it is used to train our machine learning model
• $\mathcal{D}_{\mathrm{test}}$ is called the testing set, and it is used to evaluate the performance of our model. We should not peek at this set while training!
• An additional split into a validation set $\mathcal{D}_{\mathrm{valid}}$ is sometimes used to perform hyperparameter tuning and model selection
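A minimal sketch of such a split (the 80/10/10 proportions and the function name are illustrative assumptions, not fixed by the lecture):

```python
import numpy as np

def split_dataset(x, y, train_frac=0.8, valid_frac=0.1, seed=0):
    # Shuffle once, then carve the indices into train/valid/test pieces.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train = int(train_frac * len(x))
    n_valid = int(valid_frac * len(x))
    tr, va, te = np.split(idx, [n_train, n_train + n_valid])
    return (x[tr], y[tr]), (x[va], y[va]), (x[te], y[te])
```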
Supervised Learning
What is supervised learning?
Supervised learning is the simplest and most prevalent type of machine learning problem.

It is about learning to make predictions

Examples
• Image recognition
• Weather prediction
• Stock price prediction
• …
Given dataset: $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$
Inputs: $x_i$. Outputs/labels: $y_i$. Data size: $N$.
Goal: learn the relationship $x_i \to y_i$

[Figure: the oracle $f^*$ maps inputs to labels, e.g. $x_1$ (a cat photo) $\to y_1 = \text{Cat} = (1, 0)^\top$ and $x_2$ (a dog photo) $\to y_2 = \text{Dog} = (0, 1)^\top$.]

The oracle can be
• Deterministic: $y_i = f^*(x_i)$
• Random: $y_i \sim p^*(\cdot \mid x_i)$, e.g. $y_i = f^*(x_i) + \epsilon_i$
Hypothesis space
The oracle $f^*$ is unknown to us, except through the dataset

$\mathcal{D} = \{(x_i, y_i = f^*(x_i))\}_{i=1}^{N}$

The supervised learning approach:
1. Define a hypothesis space $\mathcal{H}$ consisting of a set of candidate functions, e.g.
$\mathcal{H} = \{f : f(x) = w_0 + w_1 x\}$
2. Find the "best" function $\hat{f}$ in $\mathcal{H}$ that approximates $f^*$
What you get depends on ℋ!

Curve fitting methods and the message they send. https://xkcd.com/2048/


What does best approximation mean?
It is useful to define a loss function $L(y', y)$ which is small if $y \approx y'$ and large otherwise. Then, we can find the best approximation by solving an optimization problem

$\min_{f \in \mathcal{H}} R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i), \underbrace{f^*(x_i)}_{y_i}\big)$

This is called empirical risk minimization (ERM).
So, is learning just optimization?
We want to do well on unseen data! In other words, our model
must generalize.
What we can solve: empirical risk minimization

$\min_{f \in \mathcal{H}} R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i), f^*(x_i)\big), \qquad x_i \sim \mu$

What we really want to solve: population risk minimization

$\min_{f \in \mathcal{H}} R_{\mathrm{pop}}(f) = \mathbb{E}_{x \sim \mu}\big[L\big(f(x), f^*(x)\big)\big]$

The difference between the two is the generalization gap.
Three paradigms of supervised learning

[Figure: the three paradigms of supervised learning. Approximation: how well the best element of $\mathcal{H}$ can approximate $f^*$. Optimization: finding $\hat{f} \in \mathcal{H}$ using $\mathcal{D}$. Generalization: how $\hat{f}$ performs on unseen data.]
Linear Models
Simple linear regression

This is the simplest case, where the $x_i, y_i$ are all scalars.

Step 1: Define the hypothesis space

$\mathcal{H} = \{f : f(x) = w_0 + w_1 x, \; w_0 \in \mathbb{R}, w_1 \in \mathbb{R}\}$

Step 2: Find the best approximation

We need to define a loss function

$L(y', y) = \frac{1}{2}(y - y')^2$

Then, the empirical risk minimization problem is

$\min_{f \in \mathcal{H}} R_{\mathrm{emp}}(f) = \min_{w_0, w_1} \frac{1}{2N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i)^2$
Empirical risk minimization problem:

$\min_{w_0, w_1} \frac{1}{2N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i)^2$

Solution: set

$\frac{\partial R_{\mathrm{emp}}}{\partial w_0}(\hat{w}_0, \hat{w}_1) = 0 \quad \text{and} \quad \frac{\partial R_{\mathrm{emp}}}{\partial w_1}(\hat{w}_0, \hat{w}_1) = 0$

which yields the ordinary least squares formula (1D):

$\hat{w}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \quad \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}, \quad \text{where } \bar{x} = \frac{1}{N}\sum_i x_i, \; \bar{y} = \frac{1}{N}\sum_i y_i$

$\hat{f}(x) = \hat{w}_0 + \hat{w}_1 x$
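A minimal sketch of this closed-form solution in NumPy (the synthetic data is our own, for illustration):

```python
import numpy as np

def ols_1d(x, y):
    # Closed-form ordinary least squares for f(x) = w0 + w1 * x.
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(20)  # noisy linear oracle
print(ols_1d(x, y))  # approximately (2.0, 3.0)
```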
Approximation
Is the linear hypothesis space large enough?

[Figure: the right panel shows a linear fit to clearly nonlinear data, an instance of under-fitting.]


Overfitting and generalization
Polynomial hypothesis space: $\mathcal{H} = \{f : f(x) = \sum_{j=0}^{M-1} w_j x^j\}$

If the hypothesis space is too big, over-fitting can happen, with or without noise!
The role of loss functions
So far, we have only considered the mean-square loss

$L(y', y) = \frac{1}{2}(y - y')^2$

There are many other choices, e.g. the Huber loss

$L(y', y) = \begin{cases} \frac{1}{2}(y - y')^2 & \text{if } |y - y'| \le \delta \\ \delta |y - y'| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$
Mean square vs Huber loss in regression
We perform a linear regression on a noisy dataset with outliers.
What do you observe?
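A minimal sketch comparing the two losses pointwise (NumPy; $\delta = 1$ is an arbitrary illustrative choice). Because the Huber loss grows only linearly in large residuals, outliers contribute far less to the empirical risk, so the fitted line is pulled around less:

```python
import numpy as np

def mse_loss(y_pred, y):
    return 0.5 * (y - y_pred) ** 2

def huber_loss(y_pred, y, delta=1.0):
    # Quadratic near zero, linear for large residuals.
    r = np.abs(y - y_pred)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

residuals = np.array([0.1, 1.0, 10.0])  # the last point is an outlier
print(mse_loss(0.0, residuals))    # [0.005  0.5  50. ]
print(huber_loss(0.0, residuals))  # [0.005  0.5   9.5]
```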
General linear basis models
The simple linear regression we have seen is quite limited
• Only for 1D inputs
• Can only fit linear relationships

It turns out that we can easily generalize the previous approach by considering linear basis models.
General linear basis models
Consider $x \in \mathbb{R}^d$ and the new hypothesis space

$\mathcal{H}_M = \left\{ f : f(x) = \sum_{j=0}^{M-1} w_j \phi_j(x) \right\}$

Each $\phi_j : \mathbb{R}^d \to \mathbb{R}$ is called a basis function or feature map.

Why is this a generalization?
• Take $d = 1$, $M = 2$, $\phi_0(x) = 1$, $\phi_1(x) = x$
• In general, $M$ can be large and the $\phi_j$'s can be highly nonlinear, but $f$ is linear in $w = (w_0, \ldots, w_{M-1})$
Examples of basis functions
Some choices of basis functions in 1D
• Polynomial basis: $\phi_j(x) = x^j$
• Gaussian basis: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)$
• Sigmoid basis: $\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$ with $\sigma(b) = \frac{1}{1 + e^{-b}}$
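A minimal sketch of these three families in NumPy (the centers $\mu_j$ and width parameters are illustrative choices):

```python
import numpy as np

def polynomial_basis(x, M):
    # Columns: 1, x, x^2, ..., x^(M-1); one row per data point.
    return np.stack([x ** j for j in range(M)], axis=1)

def gaussian_basis(x, centers, sigma=0.2):
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

def sigmoid_basis(x, centers, s=0.2):
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(0.0, 1.0, 5)
print(polynomial_basis(x, 3).shape)                   # (5, 3)
print(gaussian_basis(x, np.linspace(0, 1, 4)).shape)  # (5, 4)
```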
Ordinary least squares for linear basis models

The empirical risk minimization problem is now

$\min_{f \in \mathcal{H}_M} R_{\mathrm{emp}}(f) = \min_{w \in \mathbb{R}^M} R_{\mathrm{emp}}(w) = \min_{w \in \mathbb{R}^M} \frac{1}{2N} \sum_{i=1}^{N} \big( f(x_i) - y_i \big)^2 = \min_{w \in \mathbb{R}^M} \frac{1}{2N} \sum_{i=1}^{N} \left( \sum_{j=0}^{M-1} w_j \phi_j(x_i) - y_i \right)^2$
We can rewrite

$\min_{w \in \mathbb{R}^M} \frac{1}{2N} \sum_{i=1}^{N} \left( \sum_{j=0}^{M-1} w_j \phi_j(x_i) - y_i \right)^2$

in compact form

$\min_{w \in \mathbb{R}^M} \frac{1}{2N} \|\Phi w - y\|^2$

$\Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \ddots & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}, \quad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_{M-1} \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$
We want to solve

$\min_{w \in \mathbb{R}^M} \frac{1}{2N} \|\Phi w - y\|^2$

We can do this by setting $\nabla R_{\mathrm{emp}}(\hat{w}) = 0$.

Suppose $\Phi^\top \Phi$ is invertible; then we have

$\Phi^\top (\Phi \hat{w} - y) = 0$

Rearranging, we obtain the general ordinary least squares formula

$\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y$
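A minimal sketch of this formula with a cubic polynomial basis (synthetic data for illustration; in practice np.linalg.lstsq is usually preferred over forming $\Phi^\top \Phi$ explicitly, for numerical stability):

```python
import numpy as np

def fit_ols(Phi, y):
    # Solve the normal equations Phi^T Phi w = Phi^T y.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)  # noisy oracle

Phi = np.stack([x ** j for j in range(4)], axis=1)  # basis 1, x, x^2, x^3
w_hat = fit_ols(Phi, y)
print(w_hat)  # coefficients of the fitted cubic
```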

What happens if $\Phi^\top \Phi$ is not invertible, i.e. it is singular?

In the singular case, we have an infinite number of solutions, all of which have $R_{\mathrm{emp}}(\hat{w}) = 0$. They are given by

$\hat{w}(u) = \Phi^+ y + (I - \Phi^+ \Phi) u, \quad u \in \mathbb{R}^M$

Here, $\Phi^+$ denotes the Moore-Penrose pseudoinverse of $\Phi$.
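A minimal sketch of this solution family using NumPy's pseudoinverse (an underdetermined case with $M > N$ is assumed, so the residual can actually reach zero):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((3, 5))  # N = 3 samples, M = 5 basis functions
y = rng.standard_normal(3)

Phi_pinv = np.linalg.pinv(Phi)
w_min = Phi_pinv @ y  # the minimum-norm solution (u = 0)
u = rng.standard_normal(5)
w_other = w_min + (np.eye(5) - Phi_pinv @ Phi) @ u  # another exact solution

print(np.linalg.norm(Phi @ w_min - y))    # ~0
print(np.linalg.norm(Phi @ w_other - y))  # ~0 as well
```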

How do we pick a solution?


Regularization
Often, it is advantageous to consider the regularized least squares problem

$\min_{w \in \mathbb{R}^M} \frac{1}{2N} \|\Phi w - y\|^2 + \lambda \underbrace{C(w)}_{\text{regularizer}}$

Types of regularization
• $\ell^2$ regularization: $C(w) = \|w\|_2^2$ (ridge regression)
• $\ell^1$ regularization: $C(w) = \|w\|_1 = \sum_j |w_j|$ (least absolute shrinkage and selection operator, or lasso)
• …
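For the $\ell^2$ case, the minimizer is again available in closed form. A minimal sketch (with the $\frac{1}{2N}$ normalization above, setting the gradient to zero gives $(\Phi^\top \Phi + 2N\lambda I)\hat{w} = \Phi^\top y$; the $2N$ factor is specific to this normalization):

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    # Minimize (1/2N) * ||Phi w - y||^2 + lam * ||w||_2^2 in closed form.
    N, M = Phi.shape
    return np.linalg.solve(Phi.T @ Phi + 2 * N * lam * np.eye(M), Phi.T @ y)
```

Note that adding $2N\lambda I$ makes the matrix invertible even when $\Phi^\top \Phi$ is singular, which also answers the question above of how to pick a solution in the singular case.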
Regularization and generalization
We apply $\ell^2$ regularization on the over-fitting examples.

Recall:
$\mathcal{H}_M = \{f : f(x) = \sum_{j=0}^{M-1} w_j x^j\}$
so $M = 100$, but $N = 10$

[Figure: the fitted polynomial without regularization vs. with regularization.]


Classification using linear basis models

In 𝐾-class classification problems, each 𝑦+ takes on the class label


of one of 𝐾 classes.

We will use the one-hot encoding introduced earlier to represent


each 𝑦+ that belongs to class 𝑘 as
𝑦+ = 0, … , 0, 1, 0, … , 0 ∈ ℝV
kth Position
We require a slight change of hypothesis space

$\mathcal{H}_M = \left\{ f : f(x) = g\left( \sum_{j=0}^{M-1} w_j \phi_j(x) \right), \; w_j \in \mathbb{R}^K \right\}$

The function $g : \mathbb{R}^K \to \mathbb{R}^K$ is called an activation function, and the most commonly used one is the soft-max function

$g(z)_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$

Notice that $g$ always outputs a vector which can be interpreted as probabilities over the $K$ classes.
Everything else remains the same, and we can define the empirical risk minimization problem for classification as

$\min_{W \in \mathbb{R}^{M \times K}} R_{\mathrm{emp}}(W) = \min_{W \in \mathbb{R}^{M \times K}} \frac{1}{N} \sum_{i=1}^{N} L\big( g\big( (\Phi W)_i \big), y_i \big)$

What loss function should we use? We can always use the mean-square loss, but there is a better choice: the cross-entropy loss

$L(y', y) = -\sum_{k=1}^{K} y_k \log y'_k$
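A minimal sketch of the soft-max and cross-entropy computations in NumPy (subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the formulas above require):

```python
import numpy as np

def softmax(z):
    # g(z)_k = exp(z_k) / sum_j exp(z_j), applied row-wise.
    z = z - z.max(axis=-1, keepdims=True)  # stability; result unchanged
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_prob, y_onehot, eps=1e-12):
    # L(y', y) = -sum_k y_k log y'_k, averaged over the batch.
    return -np.mean(np.sum(y_onehot * np.log(y_prob + eps), axis=-1))

logits = np.array([[2.0, 0.5, -1.0]])
y = np.array([[1.0, 0.0, 0.0]])  # one-hot label for class 1
p = softmax(logits)
print(p, cross_entropy(p, y))
```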
Summary
1. Machine learning vs AI
2. Types of Learning Problems
• Supervised learning
• Unsupervised learning
• Reinforcement learning
3. Linear models as a baseline for supervised learning
Useful Tools
Version control with Git
• https://www.freecodecamp.org/news/what-is-git-and-how-to-use-it-c341b049ae61/
Interactive python with Jupyter notebooks
• https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook
Data visualization using Seaborn and Pandas
• https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html
Further Reading
Matrix Cookbook
• https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
More on linear models (Pattern Recognition and Machine
Learning, Bishop)
• http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
