Module_1 - Reinforcement Learning and Markov Decision Process

This chapter introduces reinforcement learning (RL) and its framework through Markov Decision Processes (MDP), emphasizing the agent's interaction with the environment to maximize rewards through feedback. Key concepts include the elements of RL such as policies, reward signals, and value functions, along with various algorithms like Q-learning and policy iteration. The chapter also discusses practical applications of RL, such as in gaming and robotics, and provides a historical context for the evolution of the field.

Chapter 1: Reinforcement Learning and Markov Decision Process

Table of Contents

• Chapter Learning Outcomes

• Introduction

• Reinforcement Learning

• Examples of Reinforcement Learning

• Elements of Reinforcement Learning

• Example: Tic-Tac-Toe

• History of Reinforcement Learning

• Learning Sequential Decision Making

• A Formal Framework on Markov Decision Processes and Policies

• Value Function and Bellman Equations

• Solving Markov Decision Process

• Dynamic Programming Model-Based Solution Technique

• Reinforcement Learning Model-Free Solution Technique

• Summary



Chapter Learning Outcomes
At the end of this module, the students are expected to:

• Implement Reinforcement Learning.

• Identify the steps of Learning Sequential decision making.

• Utilize the Markov Decision Process.

• Apply Dynamic Programming Model-Based Solution Technique.



Introduction
• By executing actions and observing the outcomes of those actions, an agent
learns how to behave in a given environment via reinforcement learning.

• It is a feedback-based machine learning technique.

• The agent receives a reward (positive feedback) for each good action and is
penalised or given negative feedback for each bad action.



Reinforcement Learning
• Contrary to supervised learning, the agent in reinforcement learning learns
naturally through feedback without the need for labelled data.

• The agent can only learn from its own experience because there is no labelled data.

• It addresses a particular class of problems, such as those in robotics, gaming and other areas where decisions must be made sequentially with a long-term objective.

• The agent engages with the environment and explores it independently. In reinforcement learning, an agent's main objective is to maximise positive reinforcement and thereby improve its performance.

• The agent learns through trial and error, based on its experience.

• It develops the skills necessary to carry out the mission more effectively.

• Thus, we might state: ‘Reinforcement learning is a form of machine learning in which an intelligent agent (computer program) interacts with the environment and learns to act within it’.

• One example of reinforcement learning is how a robotic dog learns to move its arms.

• It is a fundamental component of artificial intelligence, and the idea of reinforcement learning underlies all AI agents. In this case, there is no need to pre-program the agent, because it learns on its own without human assistance.

• Let us say an AI agent is present in a maze setting, and its objective is to locate the diamond.

• The agent interacts with the environment by taking certain actions; depending on those actions, the agent's state changes, and it also receives feedback in the form of rewards or penalties.

• The agent keeps repeating these three steps (take an action, change state or remain in it, and receive feedback), and by doing so it learns about and explores its surroundings.

• The agent learns which behaviours result in positive feedback or rewards and which result in negative feedback or penalties.

• The agent receives positive points for rewards and negative points for penalties.



Key Features:

• In reinforcement learning, the agent is not given prior instructions about the environment or about what needs to be done.

• It is based on trial and error.

• The agent performs the subsequent action and modifies its states in response
to feedback from the preceding action.

• The agent might receive a reward afterwards.

• The agent must explore the stochastic environment in order to maximise positive rewards.

Terms:

• Agent: A thing that can observe and investigate its surroundings and take
appropriate action.

• Environment: The surroundings or circumstances in which the agent is present. In RL, we assume that the environment is stochastic, i.e. essentially random.

• Action: An agent's actions are the movements they make while in the
environment.

• State: The situation returned by the environment after each action the agent takes.

• Reward: Feedback from the environment that the agent receives to assess its
performance.

• Policy: Based on the present state, a policy is a technique used by the agent
to determine what to do next.

• Value: The expected long-term return from a state, computed with a discount factor, in contrast to the short-term reward.

• Q-value: Essentially the same as the value, except that it takes the current action a as an additional parameter.
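To make these terms concrete, here is a small, hedged Python sketch of one agent-environment interaction step; the two-state world, its action names and its reward values are invented purely for illustration and are not part of the chapter.

```python
import random

# Toy illustration of the terms above: agent, environment, action, state, reward, policy.
class Environment:
    """A stochastic world in which the agent tries to reach the 'goal' state."""
    def __init__(self):
        self.state = "start"

    def step(self, action):
        # The environment is stochastic: the same action may lead to different states.
        if action == "forward" and random.random() < 0.7:
            self.state = "goal"
        reward = 1.0 if self.state == "goal" else -0.1   # feedback signal for the agent
        return self.state, reward

def policy(state):
    """A simple deterministic policy: what the agent does in each state."""
    return "forward"

env = Environment()
action = policy(env.state)         # the agent chooses an action from the current state
state, reward = env.step(action)   # the environment returns the new state and a reward
print(state, reward)
```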

Algorithms:

Types of reinforcement learning algorithms:

• Q-learning

• Policy iteration

• Value iteration.



Q-learning

• Q-learning, one of the most widely used reinforcement learning methods, learns a value for each state-action pair.

• The output of Q-learning depends on two inputs: the state and the action.

• Q-learning is employed to solve the reinforcement learning problem when there is a finite number of states and actions.
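As a rough sketch of the idea, the Python fragment below shows the tabular Q-learning update; the hyperparameters alpha, gamma and epsilon, and the way states and actions are represented, are illustrative assumptions rather than anything specified in this chapter.

```python
import random
from collections import defaultdict

Q = defaultdict(float)                  # Q[(state, action)] -> current value estimate
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate

def choose_action(state, actions):
    """Epsilon-greedy selection over the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning step: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```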

Policy iteration

• Policy iteration alternates between two steps, policy evaluation and policy improvement, to compute values for states and actions.

• In this reinforcement learning system, there is an agent and a domain of states and actions.

• The goal of the reinforcement learner is to find a policy that increases the return from every starting state without decreasing the return obtainable from any successor state.

• Policy iteration is used in reinforcement learning problems where there are infinitely many states and actions.

Value iteration

• Value iteration computes values for states and actions using the reinforcement signal generated by the reward function.

• This technique is applied when the environment's transition function is known and the action-value function (a Q-function that gives a value for each state and action) needs to be found.

• Value iteration is used when the reinforcement learning algorithm is provided with complete knowledge of the environment's transition function.
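A minimal value-iteration sketch follows, assuming the transition model is known and stored as P[s][a] -> list of (probability, next_state, reward) triples; this model format, the discount gamma and the stopping threshold theta are assumptions made for the example.

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Compute state values for a known, finite MDP by repeated Bellman optimality backups."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Best expected one-step return over all actions available in s.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

# A tiny made-up model: from s0 the action 'go' reaches s1 and earns a reward of 1.
P = {"s0": {"go": [(1.0, "s1", 1.0)]}, "s1": {"stay": [(1.0, "s1", 0.0)]}}
print(value_iteration(P))
```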

Examples of Reinforcement Learning


• Playing games like Chess & Go

• Self-driving cars

• Data centre automated cooling using Deep RL

• Personalised product recommendation system

• Ad recommendation system

• Personalised video recommendations



• Customised action in video games

• Personalised chatbot response

• AI-powered stock buying/selling

• RL can be used for NLP use cases

• RL in healthcare

Elements of Reinforcement Learning


• Policy

• Reward Signal

• Value Function

• Model of the environment

Policy

• A policy is the way an agent acts at a specific moment in time.

• It maps the perceived states of the environment to the actions to be taken in those states.

• A policy is the fundamental component of RL, because the policy alone specifies how the agent will behave.

• In some situations it might be a straightforward function or a lookup table, while in others it may require more general computation, such as a search procedure.

• A policy may be stochastic or deterministic.

Reward Signal

• The reward signal establishes the aim of reinforcement learning.

• At each state, the environment immediately sends the learning agent a signal known as the reward signal.

• These rewards are given according to the agent's good and bad actions.

• The agent's main goal is to maximise the total reward it receives for doing the right thing.

• The policy can be altered by the reward signal.

• For instance, if an action chosen by the agent yields a poor reward, the policy
may be altered to choose different behaviour in the future.
Value Function

• The value function tells the agent how good a state or action is, and what reward can be expected.

• A value function indicates what is good in the long run, whereas a reward signals the immediate desirability of each good or bad action.

• The reward is a necessary component of the value function, because value cannot exist without reward.

• Values are estimated in order to obtain greater long-term reward.

Model of the environment

• The model, which imitates the behaviour of the environment, is the final
component in reinforcement learning.

• One can draw conclusions about the behaviour of the environment using the
model.

• A model, for instance, can forecast the subsequent state and reward if a state
and an action are provided.

• The model is used for planning, which means that it offers a mechanism to
choose a course of action by taking into account all potential outcomes before
those outcomes actually occur.

• The term ‘model-based approach’ refers to methods for tackling RL problems using models. In contrast, a model-free strategy is one that does not employ a model.
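A hedged sketch of what a ‘model of the environment’ could look like in code is given below: given a state and an action, it predicts the next state and the reward. The GridModel class and its simple deterministic dynamics are hypothetical examples, not something defined in the chapter.

```python
class GridModel:
    """Toy model of a 1-D grid world: states 0..4, actions -1 (left) and +1 (right)."""

    def predict(self, state, action):
        next_state = min(4, max(0, state + action))   # the grid has walls at both ends
        reward = 1.0 if next_state == 4 else 0.0      # the goal sits at the right end
        return next_state, reward

model = GridModel()
# Planning use: ask the model what would happen before actually acting.
print(model.predict(3, +1))   # -> (4, 1.0)
```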

Example: Tic-Tac-Toe

We can use Q-Learning to implement this game.



The following are the steps to follow to implement:

• A finite set of actions A (positions where a mark can be placed on the game board)

• A finite set of states S (each state is a configuration of the game board)

• A reward function R(s, a) that returns a value when action a is taken in state s

• A transition function T(s, a, s')
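The sketch below shows one hedged way of representing these ingredients in Python; the board encoding and the reward values (1 for a win, 0.5 for a draw) are illustrative choices, not prescribed by the chapter.

```python
EMPTY, X, O = ".", "X", "O"

def initial_state():
    return tuple([EMPTY] * 9)          # a state: the 3x3 board stored as a 9-tuple

def actions(state):
    return [i for i, cell in enumerate(state) if cell == EMPTY]   # empty squares

def step(state, action, mark):
    board = list(state)
    board[action] = mark
    return tuple(board)                # the next state after placing a mark

def reward(state, mark):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
    if any(all(state[i] == mark for i in line) for line in lines):
        return 1.0                     # win
    if all(cell != EMPTY for cell in state):
        return 0.5                     # draw
    return 0.0                         # game not finished yet
```

These pieces could then be plugged into the Q-learning update sketched earlier, with Q indexed by (state, action) pairs.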

History of Reinforcement Learning


• The exciting and quickly evolving field of reinforcement learning in machine
learning will have a big impact on how technology is used in the future and on
how we live our daily lives.

• The goal of reinforcement learning, a field distinct from supervised and unsupervised learning, is to solve problems through a sequence of decisions, each of which is optimised to maximise the rewards earned for making the right choice.

• Reinforcement learning draws from and contributes to neuroscience, as well as optimal control theory and animal learning in experimental psychology.

• The history of reinforcement learning, from its inception to the present, is briefly summarised in this section.



Learning Sequential Decision Making
Architecture of a sequential decision-making problem



• In reinforcement learning jargon, we refer to our sequential decision-making problem as the environment.

• We also have the agent, which is the decision-maker.

• The environment includes all elements, even those that may appear to belong
to the agent, over which it does not have complete control.

• For instance, if you were a decision-making agent (which you are), your hands would be a component of the environment, not of the agent.

• This rigorous separation between the agent and its surroundings may seem
counterintuitive at first, but it is necessary as we only want the decision-maker
to perform one function—making decisions. Your hands are not a component
of the decision-making agent as they do not make decisions.

• Zooming in, we can see that most agents follow a three-step process: all
agents engage with the environment, all agents assess their actions based on
the outcomes, and all agents alter some aspect of their actions.

• Actions by the agent have the potential to affect the environment.

• When you choose the same action at different times while the environment is
in the same state, the results may not be the same. This is known as a
nondeterministic response to an action.



• Even if you choose to focus your study efforts on the final exam, you may not
always receive top scores.

• The environment always has a set of variables configured in a way that is relevant to the decision-maker at any given moment.

• The quantity of money available for trading, stock prices, the day of the week,
the week of the year and a categorical variable indicating the political
condition of the nation, for instance, all contribute to the environment state of
a stock trading agent.

• One state could be created from any combination of the variables. Decision-
making is made more difficult by the fact that environments can exist in
several states.

• When an action is communicated from the agent to the environment, the environment changes its internal state and adopts the new state as a result of the action.

• Keep in mind that the agent may or may not have access to the precise
environment states; these states are internal to the environment.

• Some problems, a game of poker for instance, have factors in the environment that are not completely visible, such as the cards held by other players.

• The environment responds to the agent's action once it changes states.

• As part of this response, the environment typically provides the agent with an observation that hints at the environment's internal state.
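The nondeterministic response described above can be sketched as sampling a next state from a transition distribution; the states, actions and probabilities below (preparing for an exam) are invented to echo the example in the text.

```python
import random

# The same action taken in the same state may lead to different next states.
transitions = {
    ("prepared", "take_exam"): [("top_score", 0.6), ("average_score", 0.4)],
}

def step(state, action):
    """Sample a next state according to the environment's transition probabilities."""
    outcomes, weights = zip(*transitions[(state, action)])
    return random.choices(outcomes, weights=weights, k=1)[0]

print(step("prepared", "take_exam"))   # may print either outcome
```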

A Formal Framework on Markov Decision Processes and Policies


• Reinforcement learning problems are formalised using the Markov Decision Process (MDP).

• If the environment is fully observable, its dynamics can be modelled as a Markov Process.

• In MDP, the agent continuously engages with the environment and takes
action. The environment reacts to each action and creates a new state.



• The RL environment is described using MDP, and practically, all RL problems
may be formalised using MDP.

• MDP contains a tuple of four elements (S, A, Pa, Ra):

• A finite set of states S

• A finite set of actions A

• Pa: the probability of moving from state S to state S' under action a

• Ra: the reward received after transitioning from state S to state S' due to action a

• The MDP makes use of the Markov property, so we must first understand that concept.

• It states that if the agent is in the current state S1, takes action a1 and moves to state S2, then the transition from S1 to S2 depends only on the current state and the chosen action; it does not depend on earlier states, actions or rewards.

• In other words, the Markov property states that the current state transition is independent of any previous state or action.

• An MDP thus satisfies the Markov property and formalises the RL problem.

• Players in a game of chess, for example, merely concentrate on the current situation and are not required to recall previous moves or situations.

• There are only a finite number of states, rewards and actions in a finite MDP.
In RL, we just take the finite MDP into account.

• The Markov Process, which makes use of the Markov property, is a memoryless process with a sequence of random states S1, S2, ..., St. The Markov process is also called a Markov chain, a tuple (S, P) consisting of the state set S and the transition function P.

• The dynamics of the system can be defined by these two elements (S and P).
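A hedged sketch of the (S, A, Pa, Ra) tuple for a tiny, made-up two-state MDP is shown below; all of the states, actions, probabilities and rewards are invented for illustration.

```python
S = ["s0", "s1"]                      # finite set of states
A = ["stay", "move"]                  # finite set of actions

# Pa[(s, a)] -> {next_state: probability}
Pa = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}

# Ra[(s, a, s')] -> reward for that transition (unlisted transitions give 0)
Ra = {("s0", "move", "s1"): 1.0}

def expected_reward(s, a):
    """Expected one-step reward of taking action a in state s."""
    return sum(p * Ra.get((s, a, s2), 0.0) for s2, p in Pa[(s, a)].items())

print(expected_reward("s0", "move"))   # -> 0.8
```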

Value Function and Bellman Equations


• Being a key component of numerous Reinforcement Learning algorithms, the
Bellman equation is referenced frequently in the Reinforcement Learning
literature.

• In short, the Bellman equation decomposes the value function into the immediate reward plus the discounted values of future states.

• Using this equation, the computation of the value function is made simpler,
allowing us to identify the best solution to a challenging problem by
decomposing it into smaller, recursive subproblems and determining the best
solutions to each of those.

• Almost all reinforcement learning algorithms involve estimating value functions of states (or of state-action pairs) that measure how good it is for the agent to be in a given state (or to perform a given action in a given state).

• The idea of ‘how good’ is defined here in terms of future rewards, or more precisely, in terms of expected return. Of course, the rewards the agent obtains depend on the actions it takes in the future.

• Value functions are, therefore, defined in relation to specific policies.

• Remember that a policy π is a mapping from every state s ∈ S and action a ∈ A(s) to the probability π(a|s) of taking action a when in state s.

• Informally, the value of a state s under a policy π, written vπ(s), is the expected return when starting in s and following π thereafter. For MDPs we can define it formally as

vπ(s) = Eπ[Gt | St = s] = Eπ[Σ_{k=0}^{∞} γ^k R_{t+k+1} | St = s],

• where Eπ[·] denotes the expected value of a random variable given that the agent follows policy π, and t can be any time step.

• Keep in mind that the terminal state's value, if any, is always zero.



• We call vπ the state-value function for policy π. Similarly, we define the value of taking action a in state s under policy π, denoted qπ(s, a), as the expected return starting from s, taking action a, and thereafter following π:

qπ(s, a) = Eπ[Gt | St = s, At = a].

• Experience can be used to estimate the values of v and q.

• For instance, if an agent follows policy π and maintains, for each state encountered, an average of the actual returns that have followed that state, then the average converges to the state's value vπ(s) as the number of times the state is encountered approaches infinity.

• If separate averages are kept for each action taken in a state, these averages similarly converge to the action values qπ(s, a).

• Such estimation techniques are referred to as Monte Carlo methods, as they involve averaging over many random samples of actual returns.
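A minimal sketch of this Monte Carlo idea follows: vπ(s) is estimated by averaging the returns observed after visiting s. The episode format (a list of (state, reward-received-on-leaving-that-state) pairs), the discount value and the sample episodes are assumptions made for the example.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.9):
    """Every-visit Monte Carlo: average the discounted return following each state visit."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return from each visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}

episodes = [[("s0", 0.0), ("s1", 1.0)],
            [("s0", 0.0), ("s0", 0.0), ("s1", 1.0)]]
print(mc_value_estimate(episodes))
```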

Solving Markov Decision Process


• Robot Recycling MDP: By simplifying the recycling robot and adding some more detail, it can be turned into a straightforward illustration of an MDP.

• Remember that the agent makes its decisions at moments determined by external events.

• Every time this happens, the robot chooses whether to:

• actively look for a can;

• wait for someone to bring it a can; or

• return to home base to charge its batteries.

• Assume the environment is configured as follows:

• Actively looking for cans is the best way to find them, although doing so
drains the robot's battery faster than waiting does.

• There is a chance that the robot's battery will run out while it is
searching. The robot must now stop operating and wait to be rescued.



• The agent only considers the battery's energy level when making judgements.

• The state set is S = {high, low}, because it can discriminate between two levels, high and low.

• Let us refer to the acts of the agent as waiting, searching and recharging.

• Recharging would always be wasteful when the energy level is high, so we do not include it in the action set for that state.

• The action sets of the agent are A(high) = {search, wait} and A(low) = {search, wait, recharge}.

• If the energy level is high, a period of active search can always be completed without running the risk of depleting the battery.

• After a period of searching that begins with a high energy level, the level remains high with probability α and drops to low with probability 1 − α.

• However, a period of searching begun when the energy level is low leaves it low with probability β and depletes the battery with probability 1 − β.

• In the latter scenario, the robot must be saved, after which the battery must be
fully recharged.

• Each can the robot collects earns it a reward of one unit, while each time it has to be rescued it receives a reward of −3.

• Let rsearch and rwait represent, respectively, the expected number of cans the robot will gather while searching and while waiting, with rsearch > rwait.

• Finally, to keep things simple, assume that no cans can be collected on a step in which the battery is depleted and that no cans can be collected on a run home for recharging.

• As a result, this system is a finite MDP; its transition probabilities and expected rewards follow from the description above and are sketched below.
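Since the table itself is not reproduced here, the sketch below encodes the dynamics just described as a Python dictionary in a P[(state, action)] -> list of (probability, next_state, expected_reward) format; alpha, beta, r_search and r_wait are left as parameters, and the sample numbers passed in at the end are invented.

```python
def recycling_robot(alpha, beta, r_search, r_wait):
    """Transition dynamics as P[(state, action)] -> list of (prob, next_state, reward)."""
    return {
        ("high", "search"):  [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
        ("high", "wait"):    [(1.0, "high", r_wait)],
        ("low", "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # rescued
        ("low", "wait"):     [(1.0, "low", r_wait)],
        ("low", "recharge"): [(1.0, "high", 0.0)],
    }

# Example numbers only; the chapter leaves these quantities symbolic.
P = recycling_robot(alpha=0.8, beta=0.6, r_search=2.0, r_wait=1.0)
```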



Dynamic Programming Model-Based Solution Technique
• The term ‘dynamic programming’ (DP) refers to a group of techniques that can be used to determine optimal policies given a perfect model of the environment as a Markov decision process (MDP).

• Due to their high computing cost and assumption of a perfect model, classical
DP algorithms are of limited practical value in reinforcement learning,
although they are still significant conceptually.

• DP is a crucial foundation for understanding the methods discussed in the following chapters. In fact, all of these approaches can be seen as attempts to obtain results similar to those of DP, but with less computational effort and without relying on an accurate representation of the environment.

• The use of value functions to organise and structure the search for effective
policies is the fundamental concept behind DP and reinforcement learning, in
general.

• We demonstrate how to compute the value functions defined in the previous slides.

• Once we have identified the optimal value functions v* or q*, which satisfy the Bellman optimality equations, we can easily obtain optimal policies.

DP Techniques:

• Policy Iteration

• Policy Evaluation



• Policy Improvement

• Value Iteration
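The sketch below illustrates the first three of these DP techniques for a known finite MDP (value iteration was sketched earlier), using the same assumed model format P[(state, action)] -> list of (probability, next_state, reward) as the recycling-robot dictionary above; gamma and theta are illustrative parameters.

```python
def policy_evaluation(policy, P, states, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: compute V for a fixed policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(V, P, states, actions, gamma=0.9):
    """Make the policy greedy with respect to the current value estimates."""
    return {s: max(actions[s],
                   key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
            for s in states}

def policy_iteration(P, states, actions, gamma=0.9):
    """Alternate evaluation and improvement until the policy is stable."""
    policy = {s: actions[s][0] for s in states}     # start from an arbitrary policy
    while True:
        V = policy_evaluation(policy, P, states, gamma)
        new_policy = policy_improvement(V, P, states, actions, gamma)
        if new_policy == policy:                    # stable policy -> optimal under the model
            return policy, V
        policy = new_policy
```

For instance, this could be run on the recycling-robot dictionary sketched earlier with states ['high', 'low'] and actions {'high': ['search', 'wait'], 'low': ['search', 'wait', 'recharge']}.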

Reinforcement Learning Model-Free Solution Technique


• A model-free algorithm is one that uses neither the transition probability distribution nor the reward function associated with the Markov decision process (MDP), which in reinforcement learning (RL) represents the problem to be solved.

• The reward function and the transition probability distribution are frequently
referred to as the environment's ‘model’ or MDP, hence, the term ‘model-free’.

• A model-free RL algorithm can be described as an ‘explicit’ trial-and-error algorithm.

• Q-learning is an illustration of a model-free method.
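To contrast this with the model-based DP sketches above, here is a small self-contained example in which the agent never queries transition probabilities or a reward function but learns purely from sampled experience; the corridor environment and all parameter values are invented for illustration.

```python
import random
from collections import defaultdict

class Corridor:
    """States 0..4; the episode ends with reward 1 when state 4 is reached."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):                       # a is +1 (right) or -1 (left)
        self.s = min(4, max(0, self.s + a))
        done = (self.s == 4)
        return self.s, (1.0 if done else 0.0), done

env, Q = Corridor(), defaultdict(float)
alpha, gamma, epsilon, actions = 0.5, 0.9, 0.2, [+1, -1]

for _ in range(200):                          # trial-and-error episodes
    s, done = env.reset(), False
    while not done:
        # The agent only uses sampled (s, a, r, s') experience, never a model.
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda x: Q[(s, x)])
        s2, r, done = env.step(a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in actions) - Q[(s, a)])
        s = s2

print(max(actions, key=lambda a: Q[(0, a)]))  # greedy action at the start after training
```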

Summary
• By executing actions and observing the outcomes of those actions, an agent learns how to behave in a given environment via reinforcement learning.

• Contrary to supervised learning, the agent in reinforcement learning learns naturally through feedback without the need for labelled data.

• We might state that ‘Reinforcement learning is a form of machine learning in which an intelligent agent (computer program) interacts with the environment and learns to act within it’.

• Q-learning, one of the most widely used reinforcement learning methods, learns a value for each state-action pair.

• The goal of the reinforcement learner is to find a policy that increases the return from every starting state without decreasing the return obtainable from any successor state.

• Value iteration is used when the reinforcement learning algorithm is provided with complete knowledge of the environment's transition function.

• The goal of reinforcement learning, a field distinct from supervised and unsupervised learning, is to solve problems through a sequence of decisions, each of which is optimised to maximise the rewards earned for making the right choice.

• The Markov Process, which makes use of the Markov property, is a memoryless process with a sequence of random states S1, S2, ..., St. The Markov process is also called a Markov chain, a tuple (S, P) consisting of the state set S and the transition function P.

• The term ‘dynamic programming’ (DP) refers to a group of techniques that can be used to determine optimal policies given a perfect model of the environment as a Markov decision process (MDP).

Self Assessment Questions

1. Reinforcement Learning works based on _________

a. Supervised ML

b. Unsupervised ML

c. Rewards and Penalty

d. All of the above

Answer: c

2. Select all the correct statements about Reinforcement Learning.

a. The agent gets rewards or penalties according to its actions

b. It is a machine learning technique

c. The target of an agent is to maximise the rewards

d. All of the above

Answer: d

3. ___________________ is an application of Reinforcement Learning.

a. Recommendation system

b. Topic modeling

c. Pattern recognition

d. Content classification

Answer: a

