Lec 12
The way the agent chooses the action in each step is called a
policy.
We’ll consider two types:
Deterministic policy: A_t = π(S_t) for some function π : S → A
Stochastic policy: A_t ∼ π(· | S_t) for some function π : S → P(A). (Here, P(A) is the set of distributions over actions.)
With stochastic policies, the distribution over rollouts, or
trajectories, factorizes:
p(s_1, a_1, . . . , s_T, a_T) = p(s_1) π(a_1 | s_1) P(s_2 | s_1, a_1) π(a_2 | s_2) · · · P(s_T | s_{T−1}, a_{T−1}) π(a_T | s_T)
Note: the fact that policies need consider only the current state is
a powerful consequence of the Markov assumption and full
observability.
If the environment is partially observable, then the policy needs to
depend on the history of observations.
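As a concrete illustration of this factorization, here is a minimal sketch of sampling a rollout from a toy MDP under a stochastic policy (the transition table P, initial distribution, policy, and horizon below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: 3 states, 2 actions (all numbers are made up for illustration).
n_states, n_actions = 3, 2
p1 = np.array([1.0, 0.0, 0.0])                                    # initial distribution p(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = distribution over next states
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s] = pi(. | s), a stochastic policy

def sample_rollout(T=5):
    """Sample s_1, a_1, ..., s_T, a_T according to p(s_1) prod_t pi(a_t | s_t) P(s_{t+1} | s_t, a_t)."""
    s = rng.choice(n_states, p=p1)
    trajectory = []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[s])   # A_t ~ pi(. | S_t): depends only on the current state
        trajectory.append((s, a))
        s = rng.choice(n_states, p=P[s, a])  # S_{t+1} ~ P(. | S_t, A_t)
    return trajectory

print(sample_rollout())
```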
MDPs: Rewards
In each time step, the agent receives a reward from a distribution that depends on the current state and action:
R_t ∼ R(· | S_t, A_t)
For simplicity, we will often treat the reward as a deterministic function of the state and action:
R_t = r(S_t, A_t)
Now that we’ve defined MDPs, let’s see how to find a policy that
achieves a high return.
We can distinguish two situations:
Planning: given a fully specified MDP.
Learning: agent interacts with an environment with unknown
dynamics.
I.e., the environment is a black box that takes in actions and
outputs states and rewards.
Which framework would be most appropriate for chess? Super
Mario?
The value function V^π for a policy π measures the expected return if you start in state s and follow policy π:
V^π(s) ≜ E_π[ G_t | S_t = s ] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k} | S_t = s ].
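To make the definition concrete, here is a minimal Monte Carlo sketch: sample rollouts that start in s, compute the (truncated) discounted return for each, and average. The policy and step functions are hypothetical stand-ins for a policy and environment interface, not anything defined in these slides:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """G = sum_k gamma^k R_k for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def mc_value_estimate(s, policy, step, gamma=0.9, n_rollouts=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s): average the discounted return over rollouts starting at s.
    policy(state) samples an action and step(state, action) returns (next_state, reward);
    both are hypothetical interfaces assumed for this sketch."""
    returns = []
    for _ in range(n_rollouts):
        state, rewards = s, []
        for _ in range(horizon):             # truncate the infinite sum; gamma**horizon is negligible
            a = policy(state)
            state, r = step(state, a)
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```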
[Figure: Sutton and Barto, Reinforcement Learning: An Introduction]
For the optimal policy π*, acting greedily means π*(s') = arg max_{a'} Q^{π*}(s', a'), so the Bellman equation becomes
Q^{π*}(s, a) = r(s, a) + γ Σ_{s'} p(s' | s, a) max_{a'} Q^{π*}(s', a')
Now Q* = Q^{π*} is the optimal state-action value function, and we can rewrite the optimal Bellman equation without mentioning π*:
Q*(s, a) = r(s, a) + γ Σ_{s'} p(s' | s, a) max_{a'} Q*(s', a')
The right-hand side is the Bellman optimality backup applied to Q*, written (T*Q*)(s, a).
In operator notation, both Bellman equations say that the value function is a fixed point of its backup operator:
T^π Q^π = Q^π
T* Q* = Q*
Idea: keep iterating the backup operator over and over again.
Q ← T^π Q (policy evaluation)
Q ← T* Q (finding the optimal policy)
We're treating Q^π or Q* as a vector with |S| · |A| entries.
This type of algorithm is an instance of dynamic programming.
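A minimal sketch of the Q ← T*Q iteration on a small made-up MDP, treating Q as an |S| × |A| table (all sizes and numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# A made-up MDP: r[s, a] is the reward, P[s, a] a distribution over next states.
r = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def bellman_optimality_backup(Q):
    """(T* Q)(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) max_{a'} Q(s', a')."""
    return r + gamma * P @ Q.max(axis=1)

# Value iteration: iterate Q <- T* Q; Q is just a table with |S| * |A| entries.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_new = bellman_optimality_backup(Q)
    if np.max(np.abs(Q_new - Q)) < 1e-8:     # sup-norm change shrinks by a factor of gamma per step
        Q = Q_new
        break
    Q = Q_new

greedy_policy = Q.argmax(axis=1)             # act greedily with respect to the (near-)optimal Q
print(Q, greedy_policy)
```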
If f is an α-contraction with α < 1 and fixed point x*, then
‖f^(k)(x) − x*‖ ≤ α^k ‖x − x*‖.
Hence, iterated application of f, starting from any x, converges to x*.
Finding the Optimal Value Function: Value Iteration
Putting the steps together,
‖T*Q_1 − T*Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞,
which is what we wanted to show: T* is a γ-contraction in the ∞-norm.
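A quick numeric sanity check of this contraction claim on random Q tables for a made-up MDP (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
r = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def T_star(Q):
    # (T* Q)(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) max_{a'} Q(s', a')
    return r + gamma * P @ Q.max(axis=1)

Q1 = rng.normal(size=(n_states, n_actions))
Q2 = rng.normal(size=(n_states, n_actions))

lhs = np.max(np.abs(T_star(Q1) - T_star(Q2)))   # ||T* Q1 - T* Q2||_inf
rhs = gamma * np.max(np.abs(Q1 - Q2))           # gamma * ||Q1 - Q2||_inf
print(lhs, rhs, lhs <= rhs + 1e-12)             # the inequality holds for any Q1, Q2
```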
Value Iteration Recap
The ε-greedy policy ensures that most of the time (probability 1 − ε) the agent exploits its incomplete knowledge of the world by choosing the best action (i.e., the one corresponding to the highest action-value), but occasionally (probability ε) it explores other actions.
Without exploration, the agent may never find some good actions.
ε-greedy is one of the simplest, but widely used, methods for trading off exploration and exploitation. The exploration-exploitation tradeoff is an important topic of research.
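A minimal sketch of ε-greedy action selection for one state's row of a tabular Q (the values and ε are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore); otherwise pick argmax_a Q(s, a) (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

q_s = np.array([0.1, 0.5, 0.3])                   # made-up action values for a single state
action = epsilon_greedy(q_s, epsilon=0.1)
```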
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
[Slide credit: D. Silver]
Q(S, A) ← Q(S, A) + α [ R + γ max_{a'∈A} Q(S', a') − Q(S, A) ].
To understand this better, let us focus on its stochastic equilibrium, i.e., where the expected change in Q(S, A) is zero. We have
E[ R + γ max_{a'∈A} Q(S', a') − Q(S, A) | S, A ] = 0
⟹ (T*Q)(S, A) = Q(S, A)
Notice: this update doesn't mention the policy anywhere. The only thing the policy is used for is to determine which states are visited.
This means we can follow whatever policy we want (e.g. ε-greedy), and it still converges to the optimal Q-function. Algorithms like this are known as off-policy algorithms, and this is an extremely useful property.
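Putting the update and an ε-greedy behaviour policy together, here is a minimal sketch of tabular Q-learning. The env.reset() / env.step(a) interface is an assumed Gym-style environment, not something specified in these slides:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Off-policy tabular Q-learning with an epsilon-greedy behaviour policy.
    Assumes a Gym-style interface: env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # The behaviour policy only determines which states get visited.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```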
Policy gradient (another popular R L algorithm, not covered in this
course) is an on-policy algorithm. Encouraging exploration is
much harder in that case.
With a parametric approximator Q_θ (for instance a linear model or a neural net), the corresponding update computes a bootstrapped target and takes a gradient step:
t ← r(s_t, a_t) + γ max_a Q(s_{t+1}, a)
θ ← θ + α (t − Q(s_t, a_t)) ∇_θ Q(s_t, a_t).
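A minimal sketch of this update with a linear approximator Q_θ(s, a) = θᵀφ(s, a), where the feature map phi is a hypothetical placeholder; as in the update above, the target t is held fixed when taking the gradient:

```python
import numpy as np

def q_update_linear(theta, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """One approximate Q-learning step for Q_theta(s, a) = theta . phi(s, a).
    phi is a hypothetical feature map returning a vector of the same length as theta."""
    q_next = max(theta @ phi(s_next, a2) for a2 in actions)
    t = r + gamma * q_next                 # bootstrapped target, held fixed w.r.t. theta
    td_error = t - theta @ phi(s, a)
    grad = phi(s, a)                       # gradient of theta . phi(s, a) with respect to theta
    return theta + alpha * td_error * grad
```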
htt ps://www.youtube.com/watch?v=4MlZncshy1Q
Recap and Other Approaches
All discussed approaches estimate the value function first. They are
called value-based methods.
There are methods that directly optimize the policy, i.e., policy search
methods.
Model-based RL methods estimate the true, but unknown, environment model P with an estimate P̂, and use P̂ in order to plan (a minimal count-based sketch appears after this list).
There are hybrid methods.
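For the model-based bullet above, a minimal count-based sketch of estimating P̂ from observed (s, a, s') transitions (the data format is an assumption made for this sketch):

```python
import numpy as np

def estimate_transition_model(transitions, n_states, n_actions):
    """P_hat[s, a, s'] = count(s, a, s') / count(s, a), from a list of (s, a, s_next) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Fall back to a uniform distribution for state-action pairs that were never visited.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
```

The resulting P̂ can then be plugged into a planning method such as the value-iteration sketch earlier.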
[Figure: diagram relating value-based, policy-based, and model-based approaches]
Books:
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning:
An Introduction, 2nd edition, 2018.
Csaba Szepesvari, Algorithms for Reinforcement Learning, 2010.
Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien
Ernst, Reinforcement Learning and Dynamic Programming Using
Function Approximators, 2010.
Dimitri P. Bertsekas and John N. Tsitsiklis, Neuro-Dynamic
Programming, 1996.
Courses:
Video lectures by David Silver
CIFAR and Vector Institute's Reinforcement Learning Summer School, 2018.
Deep Reinforcement Learning, CS 294-112 at UC Berkeley
We talked about neural nets as learning feature maps you can use for regression/classification.
More generally, we want to learn a representation of the data such that mathematical operations on the representation are semantically meaningful.
Classic (decades-old) example: representing words as vectors
Measure semantic similarity using the dot product between word
vectors (or dissimilarity using Euclidean distance)
Represent a web page with the average of its word vectors
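A minimal sketch of these operations with made-up word vectors (real systems would learn the embeddings, e.g. with word2vec or GloVe):

```python
import numpy as np

# Made-up 4-dimensional word vectors; in practice these would be learned embeddings.
word_vecs = {
    "cat": np.array([0.9, 0.1, 0.0, 0.2]),
    "dog": np.array([0.8, 0.2, 0.1, 0.1]),
    "car": np.array([0.0, 0.9, 0.8, 0.1]),
}

def similarity(u, v):
    """Semantic similarity as the dot product between word vectors."""
    return float(word_vecs[u] @ word_vecs[v])

def doc_vector(words):
    """Represent a document (e.g. a web page) by the average of its word vectors."""
    return np.mean([word_vecs[w] for w in words if w in word_vecs], axis=0)

print(similarity("cat", "dog"), similarity("cat", "car"))
print(doc_vector(["cat", "dog", "car"]))
```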
Zhu et al., 2017, “Unpaired image-to-image translation using cycle-consistent adversarial networks”
Ghahramani, 2015, “Probabilistic machine learning and artificial intelligence”
Example document from Blei et al. (2003), “Latent Dirichlet Allocation”:
The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading
supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual
$100,000 donation, too.
Blei et al., 2003, “Latent Dirichlet Allocation”