Reinforcement Learning: Mitchell, Ch. 13 (See Also Barto & Sutton Book On-Line)

Reinforcement learning allows an agent to learn from experience through trial-and-error interactions with its environment. The agent receives feedback in the form of rewards or punishments, without being explicitly told which actions to take. The goal is for the agent to learn a policy that maximizes its total reward over time through exploration and exploitation. Markov decision processes provide a framework for modeling reinforcement learning problems and finding optimal policies using techniques like value iteration, policy iteration, and Q-learning. Q-learning is a model-free approach that can learn directly from experience without knowing the transition and reward functions of the environment.


Reinforcement Learning

Mitchell, Ch. 13
(see also Barto & Sutton book on-line)
Rationale
• Learning from experience
• Adaptive control
• Examples are not explicitly labeled; feedback is delayed
• Problem of credit assignment – which action(s) led to the payoff?
• Trade off short-term thinking (immediate reward) against long-term consequences
Agent Model
• Transition function T: S×A → S (the environment)
• Reward function R: S×A → ℝ (the payoff)
• Stochastic but Markov:
  P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …) = P(s_{t+1} | s_t, a_t)
• Policy = decision function, π: S → A
• “Rationality” – maximize long-term expected reward
  – Discounted long-term reward (a convergent series):
    V^π(s_t) = E[ r_t + γ·r_{t+1} + γ²·r_{t+2} + … ] = E[ Σ_{i≥0} γ^i·r_{t+i} ],  0 ≤ γ < 1
  – Alternatives: finite time horizon, uniform weights
[Figure: agent–environment interaction loop, governed by R and T]
Markov Decision Processes (MDPs)
• if R and T (= P) are known, solve for the value function V(s)
• policy evaluation
• Bellman equations
• dynamic programming (|S| equations in |S| unknowns)
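For a fixed policy π the Bellman equations read V^π(s) = R(s, π(s)) + γ·Σ_{s'} T(s, π(s), s')·V^π(s'), i.e. |S| linear equations in |S| unknowns. A minimal policy-evaluation sketch under an assumed numpy array layout (the names and shapes are illustrative, not from the slides):

import numpy as np

def evaluate_policy(T, R, policy, gamma=0.9):
    """Solve the Bellman equations for a fixed deterministic policy.

    T      : transition probabilities, shape (|S|, |A|, |S|)
    R      : expected rewards, shape (|S|, |A|)
    policy : chosen action per state, shape (|S|,)
    """
    n_states = T.shape[0]
    # Transition matrix and reward vector induced by the policy
    T_pi = T[np.arange(n_states), policy]   # shape (|S|, |S|)
    R_pi = R[np.arange(n_states), policy]   # shape (|S|,)
    # |S| linear equations in |S| unknowns: (I - gamma * T_pi) V = R_pi
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)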
MDPs
• finding optimal policies

• Value iteration – update V(s) iteratively until
  π(s) = argmax_a [ R(s, a) + γ·Σ_{s'} T(s, a, s')·V(s') ]
  stops changing

• Policy iteration – alternate between choosing π and updating V over all states

• Monte Carlo sampling – run random scenarios using π and take the average rewards as V(s)
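A minimal value-iteration sketch under the same assumed array layout as above; the stopping tolerance is an illustrative choice:

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Iterate the Bellman optimality update until V stops changing.

    T : transition probabilities, shape (|S|, |A|, |S|)
    R : expected rewards, shape (|S|, |A|)
    Returns the value function V and the greedy policy.
    """
    V = np.zeros(T.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    return V, Q.argmax(axis=1)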
Q-learning: model-free
• Q-function: reformulate the value function in terms of both S and A, so it can be learned independently of R and T (= δ)
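For reference, the definition being summarized (Mitchell Ch. 13, deterministic case):

  Q(s, a) ≡ r(s, a) + γ·V*(δ(s, a)),   π*(s) = argmax_a Q(s, a)

so an agent that has learned Q can act optimally without ever knowing the reward function r or the transition function δ.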
Q-learning algorithm
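The algorithm body on this slide did not survive conversion to text; below is a minimal tabular sketch of the deterministic-world update Q̂(s, a) ← r + γ·max_{a'} Q̂(s', a') inside a generic episode loop. The environment interface (env.reset(), env.step(), env.actions) and the ε-greedy exploration parameter are assumptions for illustration, not part of the slides.

import random
from collections import defaultdict

def q_learning(env, n_episodes=1000, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning for a deterministic environment (Mitchell Ch. 13 style).

    Assumes a hypothetical env with reset() -> s, step(a) -> (s', r, done),
    and a finite action list env.actions.
    """
    Q = defaultdict(float)                 # Q[(s, a)], initialized to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice between exploration and exploitation
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # deterministic-world update: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a_)] for a_ in env.actions)
            s = s_next
    return Q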
Convergence
• Theorem: Q̂ converges to Q* after each state–action pair is visited infinitely often (assuming bounded rewards, |r| < ∞)
• Proof idea: over each interval in which every (s, a) pair is visited, the magnitude of the largest error in the Q̂ table decreases by at least a factor of γ
• “on-policy” training
  – exploitation vs. exploration
  – will the relevant parts of the space be explored if the agent sticks to its current (sub-optimal) policy?
  – ε-greedy policies: choose the action with the maximum Q value most of the time, or a random action an ε fraction of the time (sketched after this list)
• “off-policy”
  – learn from simulations or traces
  – SARSA: database of training examples as tuples <s, a, r, s', a'>
• Actor–critic
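Minimal sketches of ε-greedy action selection and of the update applied to one stored <s, a, r, s', a'> tuple (the standard SARSA update). The Q dictionary layout matches the earlier tabular sketch (a defaultdict keyed by (state, action)); the learning rate alpha is an illustrative assumption.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One update computed from a stored <s, a, r, s', a'> tuple."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])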
Non-deterministic case
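This slide's body did not survive conversion; the non-deterministic training rule it most likely showed (Mitchell Ch. 13) averages new estimates into the table with a decaying learning rate:

  Q̂_n(s, a) ← (1 − α_n)·Q̂_{n−1}(s, a) + α_n·[ r + γ·max_{a'} Q̂_{n−1}(s', a') ],
  where α_n = 1 / (1 + visits_n(s, a))

which still converges to Q* when rewards and transitions are stochastic.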
Temporal Difference Learning
• convergence is not the problem
• representation of a large Q table is the problem (domains with many states or continuous actions)
• how to represent large Q tables? (a sketch follows this list)
  – neural network
  – function approximation
  – basis functions
  – hierarchical decomposition of the state space
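One way to replace the table: a minimal sketch of a semi-gradient Q-learning step with a linear basis-function approximator, Q(s, a) ≈ w·φ(s, a). The feature map phi, the weight vector w, and the step size alpha are illustrative assumptions, not from the slides.

import numpy as np

def td_step(w, phi, s, a, r, s_next, actions, alpha=0.05, gamma=0.9):
    """One semi-gradient Q-learning update for a linear approximator Q(s, a) = w . phi(s, a)."""
    q_sa = w @ phi(s, a)
    # bootstrapped target: reward plus discounted best next-state estimate
    q_next = max(w @ phi(s_next, a_) for a_ in actions)
    td_error = (r + gamma * q_next) - q_sa
    # gradient of w . phi(s, a) with respect to w is just phi(s, a)
    return w + alpha * td_error * phi(s, a)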
