
CSC 311: Introduction to Machine Learning

Lecture 12 - Reinforcement Learning

Roger Grosse Rahul G. Krishnan Guodong Zhang

University of Toronto, Fall 2021



Reinforcement Learning Problem
Recall: we categorized types of ML by how much information they
provide about the desired behavior.
Supervised learning: labels of desired behavior
Unsupervised learning: no labels
Reinforcement learning: reward signal evaluating the outcome of
past actions
Bandit problems (Lecture 10) are a simple instance of RL where each
decision is independent.
More commonly, we focus on sequential decision making: an agent
chooses a sequence of actions which each affect future possibilities
available to the agent.

An agent observes the world, takes an action, and the world's state changes; the agent acts with the goal of achieving long-term rewards.



Reinforcement Learning
Most RL is done in a mathematical framework called a Markov Decision Process
(MDP).



MDPs: States and Actions

First let’s see how to describe the dynamics of the environment.


The state is a description of the environment in sufficient detail to
determine its evolution.
Think of Newtonian physics.
What would be the state variables for a puck sliding on a
frictionless table?

Markov assumption: the state at time t + 1 depends directly on the state and action at time t, but not on past states and actions.
To describe the dynamics, we need to specify the transition probabilities P(S_{t+1} | S_t, A_t).
In this lecture, we assume the state is fully observable, a highly
nontrivial assumption.



MDPs: States and Actions

Suppose you're controlling a robot hand. What should be the set of states and actions?

In general, the right granularity of states and actions depends on what you're trying to achieve.
MDPs: Policies

The way the agent chooses the action in each step is called a
policy.
We’ll consider two types:
Deterministic policy: A_t = π(S_t) for some function π : S → A
Stochastic policy: A_t ∼ π(· | S_t) for some function π : S → P(A). (Here, P(A) is the set of distributions over actions.)
With stochastic policies, the distribution over rollouts, or trajectories, factorizes:

p(s_1, a_1, ..., s_T, a_T) = p(s_1) π(a_1 | s_1) P(s_2 | s_1, a_1) π(a_2 | s_2) ··· P(s_T | s_{T−1}, a_{T−1}) π(a_T | s_T)

Note: the fact that policies need consider only the current state is
a powerful consequence of the Markov assumption and full
observability.
If the environment is partially observable, then the policy needs to
depend on the history of observations.
MDPs: Rewards
In each time step, the agent receives a reward from a distribution
that depends on the current state and action

R_t ∼ R(· | S_t, A_t)

For simplicity, we'll assume rewards are deterministic, i.e.

R_t = r(S_t, A_t)

What's an example where R_t should depend on A_t?


The return determines how good the outcome of an episode was.
Undiscounted: G = R_0 + R_1 + R_2 + ···
Discounted: G = R_0 + γ R_1 + γ² R_2 + ···
The goal is to maximize the expected return, E[G].
γ is a hyperparameter called the discount factor, which determines how much we care about rewards now vs. rewards later.
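As a small, made-up illustration of the return computation (the helper name and the example rewards are not from the lecture), a minimal sketch in Python:

def discounted_return(rewards, gamma):
    """Compute G = R_0 + gamma*R_1 + gamma^2*R_2 + ... for a finite episode."""
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r
    return G

# Three steps of reward 1 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))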
MDPs: Rewards

How might you define a reward function for an agent learning to play a video game?
Change in score (why not current score?)
Some measure of novelty (this is sufficient for most Atari games!)
Consider two possible reward functions for the game of Go. How
do you think the agent’s play will differ depending on the choice?
Option 1: +1 for win, 0 for tie, -1 for loss
Option 2: Agent's territory minus opponent's territory (at end)
Specifying a good reward function can be tricky.
https://www.youtube.com/watch?v=tlOIHko8ySg



Markov Decision Processes

Putting this together, a Markov Decision Process (MDP) is defined by a tuple (S, A, P, R, γ).
S: State space. Discrete or continuous
A: Action space. Here we consider a finite action space, i.e., A = {a_1, ..., a_|A|}.
P : Transition probability
R : Immediate reward distribution
γ: Discount factor (0 ≤ γ < 1)
Together these define the environment that the agent operates in, and
the objectives it is supposed to achieve.
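To make the tuple concrete, here is a minimal sketch (not from the slides) of a small tabular MDP stored as NumPy arrays, together with a rollout sampler that follows a stochastic policy; the two-state numbers, array names, and seed are assumptions chosen for illustration.

import numpy as np

# A tiny tabular MDP: |S| = 2 states, |A| = 2 actions.
num_states, num_actions = 2, 2

# P[s, a, s'] = probability of moving to state s' after taking action a in state s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])

# r[s, a] = deterministic immediate reward for taking action a in state s.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

gamma = 0.9  # discount factor

# Each P[s, a, :] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)

def sample_rollout(P, r, policy, s0, T, seed=0):
    """Sample a length-T trajectory of (state, action, reward) triples.

    policy[s] is a distribution over actions, so the trajectory is drawn from
    p(s_1, a_1, ...) = p(s_1) prod_t pi(a_t | s_t) P(s_{t+1} | s_t, a_t).
    """
    rng = np.random.default_rng(seed)
    s, traj = s0, []
    for _ in range(T):
        a = rng.choice(num_actions, p=policy[s])
        traj.append((s, a, r[s, a]))
        s = rng.choice(num_states, p=P[s, a])
    return traj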



Finding a Policy

Now that we’ve defined MDPs, let’s see how to find a policy that
achieves a high return.
We can distinguish two situations:
Planning: given a fully specified MDP.
Learning: agent interacts with an environment with unknown
dynamics.
I.e., the environment is a black box that takes in actions and
outputs states and rewards.
Which framework would be most appropriate for chess? Super
Mario?



Value Functions



Value Function

The value function V^π for a policy π measures the expected return if you start in state s and follow policy π.

V^π(s) ≜ E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k} | S_t = s ].

This measures the desirability of state s.



Value Function

Rewards: −1 per time-step


Actions: N, E , S, W
States: Agent’s location

[Slide credit: D. Silver]



Value Function

Arrows represent the policy π(s) for each state s

[Slide credit: D. Silver]



Value Function

Numbers represent the value V^π(s) of each state s

[Slide credit: D. Silver]



Bellman equations
The foundation of many RL algorithms is the fact that value functions satisfy a recursive relationship, called the Bellman equation:

V^π(s) = E_π[G_t | S_t = s]
       = E_π[R_t + γ G_{t+1} | S_t = s]
       = Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) E_π[G_{t+1} | S_{t+1} = s'] ]
       = Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s') ]

Viewing V^π as a vector (where entries correspond to states), define the Bellman backup operator T^π:

(T^π V)(s) ≜ Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ]

The Bellman equation says that V^π is a fixed point of the Bellman operator:

T^π V^π = V^π.
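One way to compute this fixed point is to apply T^π repeatedly until V stops changing. Here is a sketch (reusing the P[s, a, s'], r[s, a] array layout from the earlier MDP snippet; the tolerance and the uniform example policy are arbitrary choices):

import numpy as np

def policy_evaluation(P, r, gamma, policy, tol=1e-8):
    """Iteratively apply the Bellman backup T^pi until V^pi converges."""
    num_states = P.shape[0]
    V = np.zeros(num_states)
    while True:
        # (T^pi V)(s) = sum_a pi(a|s) [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
        Q = r + gamma * P @ V              # Q[s, a] under the current V
        V_new = (policy * Q).sum(axis=1)   # average over actions using pi(a|s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Example: a uniform random policy over two actions
# policy = np.full((2, 2), 0.5)
# V = policy_evaluation(P, r, gamma, policy)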
Value Function

A value function for golf:


[Figure: the value function for a golf example, giving the number of strokes needed to reach the hole from each region of the course (sand, green).]

Sutton and Barto, Reinforcement Learning: An Introduction



State-Action Value Function
A closely related but usefully different function is the state-action value function, or Q-function, Q^π for policy π, defined as:

Q^π(s, a) ≜ E_π[ Σ_{k≥0} γ^k R_{t+k} | S_t = s, A_t = a ].

If you knew Q^π, how would you obtain V^π?

V^π(s) = Σ_a π(a | s) Q^π(s, a).

If you knew V^π, how would you obtain Q^π?
Apply a Bellman-like equation:

Q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s')

This requires knowing the dynamics, so in general it's not easy to recover Q^π from V^π.
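In the tabular case both conversions are one line each. A sketch, assuming the same array layout as before (P[s, a, s'], r[s, a], policy[s, a]):

import numpy as np

def v_from_q(Q, policy):
    """V^pi(s) = sum_a pi(a|s) Q^pi(s, a): needs only the policy."""
    return (policy * Q).sum(axis=1)

def q_from_v(V, P, r, gamma):
    """Q^pi(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) V^pi(s'): needs the dynamics P."""
    return r + gamma * P @ V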
State-Action Value Function

Q^π satisfies a Bellman equation very similar to the one for V^π (the proof is analogous):

Q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) Σ_{a'} π(a' | s') Q^π(s', a')  ≜ (T^π Q^π)(s, a)



Dynamic Programming and Value Iteration



Optimal State-Action Value Function

Suppose you're in state s. You get to pick one action a, and then follow (fixed) policy π from then on. What do you pick?

arg max_a Q^π(s, a)

If a deterministic policy π is optimal, then it must be the case that for any state s:

π(s) = arg max_a Q^π(s, a),

otherwise you could improve the policy by changing π(s). (See Sutton & Barto for a proper proof.)



Optimal State-Action Value Function

Bellman equation for an optimal policy π*:

Q^{π*}(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) Q^{π*}(s', π*(s'))
             = r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q^{π*}(s', a')

Now Q* = Q^{π*} is the optimal state-action value function, and we can rewrite the optimal Bellman equation without mentioning π*:

Q*(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q*(s', a')  ≜ (T* Q*)(s, a)

Turns out this is sufficient to characterize the optimal policy. So we simply need to solve the fixed point equation T* Q* = Q*, and then we can choose π*(s) = arg max_a Q*(s, a).
Bellman Fixed Points

So far: showed that some interesting problems could be reduced to finding fixed points of Bellman backup operators:

Evaluating a fixed policy π:    T^π Q^π = Q^π

Finding the optimal policy:    T* Q* = Q*

Idea: keep iterating the backup operator over and over again.

Q ← T^π Q    (policy evaluation)
Q ← T* Q    (finding the optimal policy)

We're treating Q^π or Q* as a vector with |S| · |A| entries.
This type of algorithm is an instance of dynamic programming.



Bellman Fixed Points

An operator f (mapping from vectors to vectors) is a contraction map if

‖f(x_1) − f(x_2)‖ ≤ α ‖x_1 − x_2‖

for some scalar 0 ≤ α < 1 and vector norm ‖·‖.

Let f^(k) denote f iterated k times. A simple induction shows

‖f^(k)(x_1) − f^(k)(x_2)‖ ≤ α^k ‖x_1 − x_2‖.

Let x* be a fixed point of f. Then for any x,

‖f^(k)(x) − x*‖ ≤ α^k ‖x − x*‖.

Hence, iterated application of f, starting from any x, converges to the fixed point x*.
Finding the Optimal Value Function: Value Iteration

Let's use dynamic programming to find Q*.

Value Iteration: Start from an initial function Q_1. For each k = 1, 2, ..., apply

Q_{k+1} ← T* Q_k

Writing out the update in full,

Q_{k+1}(s, a) ← r(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) max_{a' ∈ A} Q_k(s', a')

Observe: a fixed point of this update is exactly a solution of the optimal Bellman equation, which we saw characterizes the Q-function of an optimal policy.
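A minimal sketch of value iteration on a tabular MDP (same assumed array layout as the earlier snippets; the stopping tolerance is an arbitrary choice, and the greedy policy is read off at the end):

import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Repeatedly apply the optimal Bellman backup T* until Q converges."""
    num_states, num_actions = r.shape
    Q = np.zeros((num_states, num_actions))
    while True:
        # (T* Q)(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')
        Q_new = r + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    greedy_policy = Q.argmax(axis=1)   # pi*(s) = argmax_a Q*(s, a)
    return Q, greedy_policy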



Value Iteration
[Figure: repeatedly applying T* (or T^π) maps two Q-functions Q_1 and Q_2 closer together.]

Claim: The value iteration update is a contraction map:

‖T* Q_1 − T* Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞

where ‖·‖_∞ denotes the L_∞ norm, defined as ‖x‖_∞ = max_i |x_i|.
If this claim is correct, then value iteration converges exponentially to
the unique fixed point.
The exponential decay factor is γ (the discount factor), which means
longer term planning is harder.
Bellman Operator is a Contraction (optional)

|(T* Q_1)(s, a) − (T* Q_2)(s, a)|
    = | [ r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q_1(s', a') ] − [ r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q_2(s', a') ] |
    = γ | Σ_{s'} P(s' | s, a) ( max_{a'} Q_1(s', a') − max_{a'} Q_2(s', a') ) |
    ≤ γ Σ_{s'} P(s' | s, a) max_{a'} | Q_1(s', a') − Q_2(s', a') |
    ≤ γ max_{s', a'} | Q_1(s', a') − Q_2(s', a') | Σ_{s'} P(s' | s, a)
    = γ max_{s', a'} | Q_1(s', a') − Q_2(s', a') |
    = γ ‖Q_1 − Q_2‖_∞

This is true for any (s, a), so

‖T* Q_1 − T* Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞,

which is what we wanted to show.
Value Iteration Recap

So far, we’ve focused on planning, where the dynamics are known.


The optimal Q-function is characterized in terms of a Bellman
fixed point update.
Since the Bellman operator is a contraction map, we can just keep
applying it repeatedly, and we’ll converge to a unique fixed point.
What are the limitations of value iteration?
assumes known dynamics
requires explicitly representing Q ∗ as a vector
|S| can be extremely large, or infinite
|A| can be infinite (e.g. continuous voltages in robotics)
But value iteration is still a foundation for a lot of more practical
RL algorithms.



Towards Learning

Now let's focus on reinforcement learning, where the environment is unknown. How can we apply learning?
1 Learn a model of the environment, and do planning in the model
(i.e. model-based reinforcement learning)
You already know how to do this in principle, but it’s very hard to
get to work. Not covered in this course.
2 Learn a value function (e.g. Q-learning, covered in this lecture)
3 Learn a policy directly (e.g. policy gradient, not covered in this
course)
How can we deal with extremely large state spaces?
Function approximation: choose a parametric form for the policy
and/or value function (e.g. linear in features, neural net, etc.)



Q-Learning



Monte Carlo Estimation
Recall the optimal Bellman equation:

Q*(s, a) = r(s, a) + γ E_{S' ∼ P(· | s, a)}[ max_{a'} Q*(S', a') ]

Problem: we need to know the dynamics to evaluate the expectation.
Monte Carlo estimation of an expectation µ = E[X]: repeatedly sample X and update

µ ← µ + α(X − µ)

Idea: Apply Monte Carlo estimation to the Bellman equation by sampling S' ∼ P(· | s, a) and updating:

Q(s, a) ← Q(s, a) + α [ r(s, a) + γ max_{a'} Q(S', a') − Q(s, a) ]

where the bracketed term is the Bellman error.

This is an example of temporal difference learning, i.e. updating our predictions to match our later predictions (once we have more information).
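As a tiny illustration of the running-mean update µ ← µ + α(X − µ) above (the coin-flip distribution and step size below are made-up choices, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
mu, alpha = 0.0, 0.01
for _ in range(10_000):
    x = rng.binomial(n=1, p=0.3)   # sample X with E[X] = 0.3
    mu += alpha * (x - mu)         # running estimate of E[X]
print(mu)                          # close to 0.3 (noisy, since alpha stays fixed)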
Monte Carlo Estimation

Problem: Every iteration of value iteration requires updating Q for every state.
There could be lots of states
We only observe transitions for states that are visited
Idea: Have the agent interact with the environment, and only
update Q for the states that are actually visited.
Problem: We might never visit certain states if they don’t look
promising, so we’ll never learn about them.
Idea: Have the agent sometimes take random actions so that it
eventually visits every state.
ε-greedy policy: a policy which picks arg max_a Q(s, a) with
probability 1 − ε and a random action with probability ε. (Typical
value: ε = 0.05)
Combining all three ideas gives an algorithm called Q-learning.



Q-Learning with ε-Greedy Policy
Parameters:
Learning rate α
Exploration parameter ε
Initialize Q(s, a) for all (s, a) ∈ S × A.
The agent starts at state S_0. For time step t = 0, 1, ...:

    Choose A_t according to the ε-greedy policy, i.e.,
        A_t ← argmax_{a ∈ A} Q(S_t, a)    with probability 1 − ε
        A_t ← uniformly random action in A    with probability ε

    Take action A_t in the environment.
    The state changes from S_t to S_{t+1} ∼ P(· | S_t, A_t).
    Observe S_{t+1} and R_t (could be r(S_t, A_t), or could be stochastic).
    Update the action-value function at state-action (S_t, A_t):

        Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t + γ max_{a' ∈ A} Q(S_{t+1}, a') − Q(S_t, A_t) ]
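A sketch of this loop for a tabular problem. It is not the course's code: it assumes a hypothetical env object with reset() returning an integer state and step(a) returning a (next_state, reward) pair, and the hyperparameter values are placeholders.

import numpy as np

def q_learning(env, num_states, num_actions, gamma=0.99,
               alpha=0.1, epsilon=0.05, num_steps=100_000, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        # move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q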
Exploration vs. Exploitation

The ε-greedy policy is a simple mechanism for managing the exploration-exploitation tradeoff:

π_ε(S; Q) = { argmax_{a ∈ A} Q(S, a)            with probability 1 − ε
            { uniformly random action in A      with probability ε

The ε-greedy policy ensures that most of the time (probability 1 − ε) the agent exploits its incomplete knowledge of the world by choosing the best action (i.e., the action with the highest action-value), but occasionally (probability ε) it explores other actions.
Without exploration, the agent may never find some good actions.
ε-greedy is one of the simplest, but most widely used, methods for trading off exploration and exploitation. The exploration-exploitation tradeoff is an important topic of research.



Examples of Exploration-Exploitation in the Real World

Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
[Slide credit: D. Silver]



An Intuition on Why Q-Learning Works (Optional)

Consider a tuple (S, A, R, S'). The Q-learning update is

Q(S, A) ← Q(S, A) + α [ R + γ max_{a' ∈ A} Q(S', a') − Q(S, A) ].

To understand this better, let us focus on its stochastic equilibrium, i.e., where the expected change in Q(S, A) is zero. We have

E[ R + γ max_{a' ∈ A} Q(S', a') − Q(S, A) | S, A ] = 0
⇒ (T* Q)(S, A) = Q(S, A)

So at the stochastic equilibrium, we have (T* Q)(S, A) = Q(S, A).


Because the fixed-point of the Bellman optimality operator is unique
(and is Q ∗ ), Q is the same as the optimal action-value function Q ∗ .



Off-Policy Learning

Q-learning update again:

Q(S, A) ← Q(S, A) + α [ R + γ max_{a' ∈ A} Q(S', a') − Q(S, A) ].

Notice: this update doesn't mention the policy anywhere. The only thing the policy is used for is to determine which states are visited.
This means we can follow whatever policy we want (e.g. ε-greedy), and it still converges to the optimal Q-function. Algorithms like this are known as off-policy algorithms, and this is an extremely useful property.
Policy gradient (another popular RL algorithm, not covered in this
course) is an on-policy algorithm. Encouraging exploration is
much harder in that case.



Function Approximation



Function Approximation

So far, we've been assuming a tabular representation of Q: one entry for every state/action pair.
This is impractical to store for all but the simplest problems, and doesn't share structure between related states.
Solution: approximate Q using a parameterized function, e.g.
linear function approximation: Q(s, a) = w^T ψ(s, a)
compute Q with a neural net
Update Q using backprop:

t ← r(s_t, a_t) + γ max_a Q(s_{t+1}, a)
θ ← θ + α (t − Q(s_t, a_t)) ∇_θ Q(s_t, a_t).
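A sketch of one such update with linear function approximation Q(s, a) = w^T ψ(s, a); the feature map psi, the action set, and the transition arguments are assumed inputs here, not anything defined in the slides.

import numpy as np

def q_update_linear(w, psi, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning step for the linear model Q(s, a) = w . psi(s, a)."""
    q_next = max(w @ psi(s_next, a2) for a2 in actions)   # max_a Q(s_{t+1}, a)
    target = r + gamma * q_next                           # t in the slide's notation
    features = psi(s, a)
    td_error = target - w @ features
    # for a linear model, the gradient of Q(s, a) with respect to w is just psi(s, a)
    return w + alpha * td_error * features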



Function Approximation

Approximating Q with a neural net is a decades-old idea, but DeepMind got it to work really well on Atari games in 2013 ("deep Q-learning")
They used a very small network by today's standards

Main technical innovation: store experience into a replay buffer, and perform Q-learning using stored experience
Gains sample efficiency by separating environment interaction from optimization — don't need new experience for every SGD update!



Atari

Mnih et al., Nature 2015. Human-level control through deep reinforcement learning
Network was given raw pixels as observations
Same architecture shared between all games
Assume fully observable environment, even though that’s not the
case
After about a day of training on a particular game, often beat
“human-level” performance (number of points within 5 minutes of
play)
Did very well on reactive games, poorly on ones that require
planning (e.g. Montezuma’s Revenge)
https://www.youtube.com/watch?v=V1eYniJ0Rnk
https://www.youtube.com/watch?v=4MlZncshy1Q
Recap and Other Approaches
All discussed approaches estimate the value function first. They are
called value-based methods.
There are methods that directly optimize the policy, i.e., policy search
methods.
Model-based RL methods estimate the true, but unknown, model of the environment P by an estimate P̂, and use the estimate P̂ in order to plan.
There are hybrid methods.

[Figure: diagram relating the value function, the policy, and the model.]



Reinforcement Learning Resources

Books:
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning:
An Introduction, 2nd edition, 2018.
Csaba Szepesvari, Algorithms for Reinforcement Learning, 2010.
Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien
Ernst, Reinforcement Learning and Dynamic Programming Using
Function Approximators, 2010.
Dimitri P. Bertsekas and John N. Tsitsiklis, Neuro-Dynamic
Programming, 1996.
Courses:
Video lectures by David Silver
CIFAR and Vector Institute's Reinforcement Learning Summer School, 2018.
Deep Reinforcement Learning, CS 294-112 at UC Berkeley



Closing Thoughts



Overview

What this course focused on:


Supervised learning: regression, classification
Choose model, loss function, optimizer
Parametric vs. nonparametric
Generative vs. discriminative
Iterative optimization vs. closed-form solutions
Unsupervised learning: dimensionality reduction and clustering
Reinforcement learning: value iteration
This lecture: what we left out, and teasers for other courses



CSC413 Teaser: Neural Nets

This course covered some fundamental ideas, most of which are more than 10 years old.
Big shift of the past decade: neural nets and deep learning
2010: neural nets significantly improved speech recognition accuracy
(after 20 years of stagnation)
2012–2015: neural nets reduced error rates for object recognition by
a factor of 6
2016: a program called AlphaGo defeated the human Go champion
2015–2018: neural nets learned to produce convincing
high-resolution images
2018–today: transformers demonstrate a sophisticated ability to generate natural language text and learn from few examples



CSC413 Teaser: Automatic Differentiation

In this course, you derived update rules by hand


Backprop is totally mechanical. Now we have automatic
differentiation tools that compute gradients for you.
In CSC413, you learn how an autodiff package can be
implemented
Lets you do fancy things like differentiate through the whole
training procedure to compute the gradient of validation loss with
respect to the hyperparameters.
With TensorFlow, PyTorch, etc., we can build much more complex
neural net architectures than we could previously.



CSC413 Teaser: Beyond Scalar/Discrete Targets

This course focused on regression and classification, i.e. scalar-valued or discrete outputs
That only covers a small fraction of use cases. Often, we want to
output something more structured:
text (e.g. image captioning, machine translation)
dense labels of images (e.g. semantic segmentation)
graphs (e.g. molecule design)
This used to be known as structured prediction, but now it’s so
routine we don’t need a name for it.



CSC413 Teaser: Representation Learning

We talked about neural nets as learning feature maps you can use
for regression/classification
More generally, want to learn a representation of the data
such that mathematical operations on the representation are
semantically meaningful
Classic (decades-old) example: representing words as vectors
Measure semantic similarity using the dot product between word
vectors (or dissimilarity using Euclidean distance)
Represent a web page with the average of its word vectors



CSC413 Teaser: Representation Learning
Here’s a linear projection of word representations for cities and capitals
into 2 dimensions (part of a representation learned using word2vec)
The mapping city → capital corresponds roughly to a single direction in
the vector space:

Mikolov et al., 2013, "Efficient estimation of word representations in vector space"



CSC413 Teaser: Representation Learning
In other words, vec(Paris) − vec(France) ≈ vec(London) − vec(England)
This means we can solve analogies by doing arithmetic on word vectors:
e.g. "Paris is to France as London is to ___?"
Find the word whose vector is closest to

vec(France) − vec(Paris) + vec(London)
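A sketch of the nearest-neighbour lookup behind such analogies; the tiny vocabulary and random vectors are placeholders (real word2vec embeddings would come from a trained model), so the printed answer here is not meaningful.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["paris", "france", "london", "england", "banana"]
vecs = {w: rng.normal(size=50) for w in vocab}   # placeholder embeddings

query = vecs["france"] - vecs["paris"] + vecs["london"]

def closest(query, vecs, exclude=()):
    """Return the word whose vector has the highest cosine similarity to the query."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vecs if w not in exclude), key=lambda w: cos(query, vecs[w]))

print(closest(query, vecs, exclude={"france", "paris", "london"}))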
Example analogies:

Mikolov et al., 2013, "Efficient estimation of word representations in vector space"



CSC413 Teaser: Representation Learning
One of the big goals is to learn disentangled representations, where
individual dimensions tell you something meaningful

[Figure: latent traversals along individual dimensions: (a) baldness, (b) face width, (c) gender, (d) mustache.]

Chen et al., 2018, "Isolating sources of disentanglement in variational autoencoders"
CSC413 Teaser: Image-to-Image Translation
Due to convenient autodiff frameworks, we can combine multiple
neural nets together into fancy architectures. Here’s the CycleGAN.

Zhu et al., 2017, "Unpaired image-to-image translation using cycle-consistent adversarial networks"



CSC413 Teaser: Image-to-Image Translation

Style transfer problem: change the style of an image while preserving the content.

Data: Two unrelated collections of images, one for each style



CSC412 Teaser: Probabilistic Graphical Models
In this course, we just scratched the surface of probabilistic
models.
Probabilistic graphical models (PGMs) let you encode complex
probabilistic relationships between lots of variables.

Ghahramani, 2015, "Probabilistic ML and artificial intelligence"



CSC412 Teaser: P G M Inference

We derived inference methods by inspection for some easy special cases (e.g. GDA, naïve Bayes)
In CSC412, you’ll learn much more general and powerful inference
techniques that expand the range of models you can build
Exact inference using dynamic programming, for certain types of
graph structures (e.g. chains)
Markov chain Monte Carlo
forms the basis of a powerful probabilistic modeling tool called Stan
Variational inference: try to approximate a complex, intractable, high-dimensional distribution using a tractable one
Try to minimize the KL divergence
Based on the same math from our EM lecture



CSC412 Teaser: Beyond Clustering

We've seen unsupervised learning algorithms based on two ways of organizing your data
low-dimensional spaces (dimensionality reduction)
discrete categories (clustering)
Other ways to organize/model data
hierarchies
dynamical systems
sets of attributes
topic models (each document is a mixture of topics)
Motifs can be combined in all sorts of different ways



CSC412 Teaser: Beyond Clustering

Latent Dirichlet Allocation (LDA)

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropoli-
tan Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading
supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual
$100,000 donation, too.

Blei et al., 2003, "Latent Dirichlet Allocation"



CSC412 Teaser: Automatic Statistician

Automatic search over Gaussian process kernel structures

Duvenaud et al., 2013, "Structure discovery in nonparametric regression through compositional kernel search"
Image: Ghahramani, 2015, "Probabilistic ML and artificial intelligence"



Resources

Continuing with machine learning


Courses
csc413/2516, “Neural Networks and Deep Learning”
csc412/2506, “Probabilistic Learning and Reasoning”
Various topics courses (varies from year to year)
Videos from top ML conferences (NIPS/NeurIPS, ICML, ICLR, UAI)
Tutorials and keynote talks are aimed at people with your level of
background (know the basics, but not experts in a subfield)
Try to reproduce results from papers
If they’ve released code, you can use that as a guide if you get stuck
Lots of excellent free resources available online!

