Lecture#1 - RL An Introduction 2023
Reinforcement Learning (RL): An Introduction
All the lectures of this course are based on Reinforcement Learning: An Introduction (2nd ed.) by Richard S. Sutton and Andrew G. Barto, The MIT Press, Cambridge, Massachusetts; London, England.
Reinforcement Learning: What is Reinforcement Learning (RL)?
Reinforcement Learning
• The idea that we learn by interacting with our environment is probably the first
to occur to us when we think about the nature of learning.
• When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it
does have a direct sensorimotor connection to its environment.
• Throughout our lives, such interactions are undoubtedly a major source of knowledge
about our environment and ourselves.
• Whether we are learning to drive a car or to hold a conversation, we are acutely aware of
how our environment responds to what we do, and we seek to influence what
happens through our behavior.
Reinforcement Learning
• The learner is not told which actions to take, but instead must discover which actions yield
the most reward by trying them.
• In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.
• These two characteristics
• trial-and-error search
• delayed reward
are the two most important distinguishing features of reinforcement learning.
Reinforcement Learning
• The agent also must have a goal or goals relating to the state
of the environment.
Reinforcement Learning
• Markov decision processes are intended to include just these three aspects:
• sensation
• action
• goal
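Formally (a standard definition that the slides do not write out), a Markov decision process is the tuple

\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R)

where \mathcal{S} is the set of states (sensation), \mathcal{A} is the set of actions (action), P(s' \mid s, a) gives the transition dynamics, and R is the reward function that encodes the goal. A discount factor \gamma \in [0, 1] is often included as a fifth element.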
RL vs SL vs USL
Reinforcement Learning is, like Supervised Learning and Unsupervised Learning, one of the main areas of Machine Learning and Artificial Intelligence.
Reinforcement Learning
Up till now we have learnt that RL is concerned with the learning process of an arbitrary being, formally known as an Agent, in the world surrounding it, known as the Environment.
The Agent seeks to maximize the reward it receives from the Environment, and performs different actions in order to learn how the Environment responds and gain more rewards.
It is therefore heavily used to solve different kinds of games, such as Tic-Tac-Toe and Chess.
Reinforcement Learning
The object of this kind of learning is for the system to extrapolate, or generalize, its
responses so that it acts correctly in situations not present in the training set.
Reinforcement Learning
• One of the challenges that arises in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation.
• Exploration means that you search over the whole sample space (exploring the sample space), while
• Exploitation means that you are exploiting the promising areas found when you did the exploration.
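One standard way to balance the two (a technique not named on these slides) is the epsilon-greedy rule: with a small probability the agent explores by picking a random action, and otherwise it exploits the action whose estimated value is currently highest. A minimal Python sketch, assuming the value estimates are kept in a list q:

import random

def epsilon_greedy(q, epsilon=0.1):
    # With probability epsilon: explore (uniformly random action index).
    if random.random() < epsilon:
        return random.randrange(len(q))
    # Otherwise: exploit (index of the highest current estimate).
    return max(range(len(q)), key=lambda a: q[a])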
Reinforcement Learning
• To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before.
• The agent has to exploit what it has already experienced in order to obtain reward, but it also
has to explore in order to make better action selections in the future.
Reinforcement Learning
• The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task.
• The agent must try a variety of actions and progressively favor those that appear to be best.
• On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward.
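A standard way to form such an estimate (covered later in Sutton and Barto, not on this slide) is the incremental sample average: after observing the n-th reward R_n for an action, update the estimate Q_n by

Q_{n+1} = Q_n + \frac{1}{n}\,(R_n - Q_n)

which equals the average of the first n rewards and converges to the action's expected reward as n grows.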
Reinforcement Learning
• For now, we simply note that the entire issue of balancing exploration and
exploitation does not even arise in supervised and unsupervised learning.
Reinforcement Learning Example
• A mobile cleaner robot decides whether it should enter a new room in search of more trash
to collect or start trying to find its way back to its battery recharging station.
• It makes its decision based on the current charge level of its battery and how quickly and
easily it has been able to find the recharger in the past.
• The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.
Typical RL Scenario
• Agent: The learner and decision maker. Child, dog, robot, program, etc.
• Environment: The agent's surroundings, or the things it interacts with. The agent observes the environment and decides to take an action, which changes the environment (it may also change on its own).
• The key idea behind reinforcement learning is that we have an environment, which represents the outside world to the agent, and an agent that takes actions and receives observations from the environment, consisting of a reward for its action and information about its new state.
• The reward informs the agent of how good or bad the taken action was, and the observation tells it what its next state in the environment is. Its actions may also affect not only the immediate rewards but the rewards for the next situations.
Elements of Reinforcement Learning
Elements of Reinforcement Learning--Action Spaces
Elements of Reinforcement Learning--Policy
• Deterministic policy is a mapping π:S → A. For each state s∈S, it yields the
action a∈A that the agent will choose while in state s.
• Stochastic policy is a mapping π:S × A → [0,1]. For each state s∈S and
action a∈A, it yields the probability π(a∣s) that the agent chooses action a
while in state s.
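As a concrete illustration (the states, actions, and probabilities here are made up, not from the slides), both kinds of policy can be written as plain Python functions:

import random

# Deterministic policy pi: S -> A.
def pi_deterministic(state):
    return "right" if state == "s0" else "left"

# Stochastic policy pi: S x A -> [0, 1], stored as pi(a|s) tables;
# acting means sampling an action according to these probabilities.
PI = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.9, "right": 0.1},
}

def pi_stochastic(state):
    actions = list(PI[state])
    weights = list(PI[state].values())
    return random.choices(actions, weights=weights)[0]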
Elements of Reinforcement Learning--Rewards
• A reward signal defines the goal of a reinforcement learning problem.
• On each time step, the environment sends to the reinforcement learning agent a single
number called the reward.
• Rewards convey how "good" an agent's actions are, not what the "best" actions would have been.
• If the agent was given instructive feedback (what action it should have taken) this would be a
supervised learning problem, not a reinforcement learning problem.
• The agent’s sole objective is to maximize the total reward [Cumulative Rewards] it receives
over the long run.
• Thus, the reward signal defines what are the good and bad events for the agent.
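The total reward over the long run is called the return. In the standard discounted form (the discount factor \gamma is introduced later in the book, not on this slide):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1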
Elements of Reinforcement Learning--Rewards
• In a biological system, we might think of rewards as analogous to the experiences
of pleasure or pain.
• They are the immediate and defining features of the problem faced by the agent.
• The reward signal is the primary basis for altering the policy; if an action selected
by the policy is followed by low reward, then the policy may be changed to select
some other action in that situation in the future.
Elements of Reinforcement Learning--Rewards
• Reward: A reward is a scalar feedback signal; it indicates how well the agent is doing at step t. The agent's sole objective is to maximize the total reward it receives over the long run.
• The reward signal is the primary basis for altering the policy.
• R(s) indicates the reward for simply being in the state s.
• R(s, a) indicates the reward for being in a state s and taking an action a.
• R(s, a, s') indicates the reward for being in a state s, taking an action a, and ending up in a state s'.
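The three parameterizations correspond to three function signatures. A minimal Python sketch (the state names and reward numbers are invented for illustration):

# R(s): reward for simply being in state s.
def reward_s(s):
    return 10.0 if s == "goal" else -1.0

# R(s, a): reward for taking action a in state s.
def reward_sa(s, a):
    return -5.0 if (s, a) == ("cliff_edge", "forward") else -1.0

# R(s, a, s'): reward for taking action a in s and ending up in s'.
def reward_sas(s, a, s_next):
    return 10.0 if s_next == "goal" else -1.0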
Elements of Reinforcement Learning--Values
• A value function specifies what is good in the long run for a reinforcement learning problem.
• Roughly speaking, the value of a state is the total amount of reward an agent can expect
to accumulate over the future, starting from that state.
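Written out in standard notation (not shown on the slide), the value of a state s under a policy \pi is the expected discounted return starting from s:

v_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \right]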
Elements of Reinforcement Learning--Values
• Without rewards there could be no values, and the only purpose of estimating
values is to achieve more reward. Nevertheless, it is values with which we are
most concerned when making and evaluating decisions. Action choices are made
based on value judgments.
• We seek actions that bring about states of highest value, not highest reward,
because these actions obtain the greatest amount of reward for us over the long
run.
Elements of Reinforcement Learning--Model of the environment
• The final element of some reinforcement learning systems is a model of the
environment.
• For example, given a state and action, the model might predict the resultant next state
and next reward.
Elements of Reinforcement Learning--Model of the environment
• A model predicts what the environment will do next. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.
• A model (sometimes called a transition model) gives an action's effect in a state. In particular, T(S, a, S') defines a transition T where being in state S and taking an action a takes us to state S' (S and S' may be the same).
• For stochastic actions (noisy, non-deterministic) we also define a
probability P(S’|S, a) which represents the probability of reaching a
state S’ if action a is taken in state S.
RL is a subfield of machine learning where the machine "lives" in an environment and is capable of perceiving the state of that environment as a vector of features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment. The goal of a reinforcement learning algorithm is to learn a policy. A policy is a function (similar to the model in supervised learning) that takes the feature vector of a state as input and outputs an optimal action to execute in that state. The action is optimal if it maximizes the expected average reward.
Reinforcement learning solves a particular kind of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, resource management, or logistics.
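As a closing sketch, such a policy can be a plain function from a state's feature vector to an action; the linear scoring rule and the weights below are made up for illustration, not a method from the book:

# Score each action with a learned weight vector over the state's
# features and pick the highest-scoring action.
WEIGHTS = {
    "left":  [0.1, -0.4, 0.3],
    "right": [0.5,  0.2, -0.1],
}

def policy(features):
    def score(action):
        return sum(w * x for w, x in zip(WEIGHTS[action], features))
    return max(WEIGHTS, key=score)

# Example: pick an action for a state with features [1.0, 0.5, -0.2].
print(policy([1.0, 0.5, -0.2]))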