Module_1 - Reinforcement Learning and Markov Decision Process
Table of Contents
• Introduction
• Reinforcement Learning
• Example: Tic-Tac-Toe
• Summary
• The agent receives positive feedback (a reward) for each good action and is penalised, i.e., given negative feedback, for each bad action.
• The agent can only learn from its own experience, because there is no labelled data.
• The agent engages with the environment and explores it independently. In reinforcement learning, an agent's main objective is to maximise the positive reinforcement it receives while improving its behaviour.
• In doing so, it develops the skills necessary to carry out its task more effectively.
• The agent interacts with the environment by taking certain actions; depending on those actions, the agent's state changes, and it also receives feedback in the form of rewards or penalties.
• The agent keeps repeating these three steps (take an action, change its state or remain in it, and receive feedback), and by doing so it learns about and investigates its surroundings.
• The agent receives good points for rewards and negative points for penalties.
• In real life, the agent is not given instructions regarding the surroundings or
what needs to be done.
• The agent performs the subsequent action and modifies its states in response
to feedback from the preceding action.
Terms:
• Agent: A thing that can observe and investigate its surroundings and take
appropriate action.
• Action: An agent's actions are the movements they make while in the
environment.
• State: Following each action the agent does, the environment responds with a
circumstance called ‘state’.
• Reward: Feedback from the environment that the agent receives to assess its
performance.
• Policy: Based on the present state, a policy is a technique used by the agent
to determine what to do next.
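To tie these terms together, here is a minimal sketch of the interaction loop described above. SimpleEnv and random_policy are hypothetical placeholders invented for this illustration; they are not part of this module.

```python
import random

class SimpleEnv:
    """A toy two-state environment (hypothetical, for illustration only)."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The environment reacts to the action: the state may change and a
        # reward (positive or negative feedback) is returned to the agent.
        self.state = (self.state + action) % 2
        reward = 1 if self.state == 1 else -1
        return self.state, reward

def random_policy(state):
    # Policy: a rule mapping the current state to an action.
    return random.choice([0, 1])

env = SimpleEnv()
state = env.reset()
total_reward = 0
for t in range(10):                      # the agent repeats the three steps
    action = random_policy(state)        # 1. take an action
    state, reward = env.step(action)     # 2. the state is altered (or stays the same)
    total_reward += reward               # 3. feedback (reward or penalty) is received
print("total reward after 10 steps:", total_reward)
```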
Algorithms:
• Q-learning
• Policy iteration
• Value iteration.
Policy iteration
• Policy iteration computes values for states and actions by alternating two steps, policy evaluation and policy improvement, as sketched below.
• The goal of the reinforcement learner is to find a policy that increases the reinforcement obtained from every starting state without decreasing the reinforcement obtainable from any successor state.
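As an illustration only, the following sketch shows how policy evaluation and policy improvement can be alternated on a generic finite MDP. The dictionary layout P[state][action] = [(probability, next_state, reward), ...] and all names are assumptions made for this sketch, not something prescribed by the module.

```python
# Policy iteration sketch on a generic finite MDP.
# Assumed layout: P[state][action] = [(probability, next_state, reward), ...],
# where every next_state also appears as a key of P.

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Iteratively estimate the value of every state under a fixed policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_improvement(P, V, gamma=0.9):
    """Make the policy greedy with respect to the current value estimates."""
    return {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                           for p, s2, r in P[s][a]))
            for s in P}

def policy_iteration(P, gamma=0.9):
    policy = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    while True:
        V = policy_evaluation(P, policy, gamma)
        new_policy = policy_improvement(P, V, gamma)
        if new_policy == policy:                # policy is stable: stop
            return policy, V
        policy = new_policy
```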
Value iteration
• Value iteration combines policy evaluation and policy improvement into a single update: each state's value is repeatedly backed up using the Bellman optimality equation until the values converge, after which a greedy policy is read off (a code sketch appears under DP Techniques later in this module).
Applications of reinforcement learning:
• Self-driving cars
• Ad recommendation system
• RL in healthcare
Elements of reinforcement learning:
• Reward Signal
• Value Function
• Policy
Reward Signal
• Rewards (incentives) are given in accordance with the agent's good and bad actions.
• The agent's main goal is to maximise the total reward it collects for doing the right thing.
• For instance, if an action chosen by the agent yields a poor reward, the policy
may be altered to choose different behaviour in the future.
Value Function
• The value function tells the agent how good a given state and action are, in terms of the reward that can be expected in the long run.
• A value function describes how good a state or action is for the future, whereas a reward is the immediate signal for each good or bad action.
• The model, which imitates the behaviour of the environment, is the final
component in reinforcement learning.
• One can draw conclusions about the behaviour of the environment using the
model.
• A model, for instance, can forecast the subsequent state and reward if a state
and an action are provided.
• The model is used for planning, which means that it offers a mechanism to
choose a course of action by taking into account all potential outcomes before
those outcomes actually occur.
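One way to picture such a model is as a lookup from a state-action pair to a predicted next state and reward, which can then be used to plan before acting. The dictionary, the states 's0'/'s1', the actions and the value estimates below are made-up placeholders for illustration only.

```python
# A tabular model: given (state, action), predict the next state and reward.
# All states, actions, numbers and value estimates here are made up.
model = {
    ("s0", "left"):  ("s0", 0.0),
    ("s0", "right"): ("s1", 1.0),
    ("s1", "left"):  ("s0", 0.0),
    ("s1", "right"): ("s1", 0.5),
}

def plan_one_step(state, actions, value, gamma=0.9):
    """Use the model to look one step ahead and pick the best action
    before anything is actually executed in the real environment."""
    def backed_up(action):
        next_state, reward = model[(state, action)]
        return reward + gamma * value[next_state]
    return max(actions, key=backed_up)

values = {"s0": 0.0, "s1": 2.0}   # assumed value estimates, for illustration
print(plan_one_step("s0", ["left", "right"], values))   # -> 'right' with these numbers
```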
Example: Tic-Tac-Toe
• The environment includes all elements, even those that may appear to belong
to the agent, over which it does not have complete control.
• For instance, if you were the decision-making agent (which you are), your hands would be a component of the environment, not of the agent.
• This rigorous separation between the agent and its surroundings may seem
counterintuitive at first, but it is necessary as we only want the decision-maker
to perform one function—making decisions. Your hands are not a component
of the decision-making agent as they do not make decisions.
• Zooming in, we can see that most agents follow a three-step process: all
agents engage with the environment, all agents assess their actions based on
the outcomes, and all agents alter some aspect of their actions.
• When you choose the same action at different times while the environment is
in the same state, the results may not be the same. This is known as a
nondeterministic response to an action.
• For instance, the environment state of a stock-trading agent is made up of the quantity of money available for trading, the stock prices, the day of the week, the week of the year and a categorical variable indicating the political condition of the nation.
• Any combination of these variables forms one state (a sketch of such a combined state follows this bullet). Decision-making is made more difficult by the fact that environments can exist in several states.
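As a rough illustration of how such variables could be bundled into a single state, the sketch below groups them in a named tuple; the field names and example values are assumptions, not a prescribed representation.

```python
from typing import NamedTuple

class TradingState(NamedTuple):
    # Each distinct combination of these fields is one possible environment state.
    cash_available: float
    stock_price: float
    day_of_week: int          # 0-6
    week_of_year: int         # 1-52
    political_condition: str  # e.g. "stable", "election", "crisis"

state = TradingState(10_000.0, 142.3, 2, 17, "stable")
print(state)
```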
• Keep in mind that the agent may or may not have access to the precise
environment states; these states are internal to the environment.
• In its response, the environment often provides the agent with a report (an observation) that suggests what its internal state might be.
• In MDP, the agent continuously engages with the environment and takes
action. The environment reacts to each action and creates a new state.
• Each such transition occurs with a transition probability Pa(s, s'), the probability of moving from state s to state s' when action a is taken.
• The MDP relies on the Markov property, so we must first understand it.
• It states that if the agent in the current state s1 takes action a1 and then moves to state s2, the transition from s1 to s2 depends only on the current state and the chosen action; it does not depend on any earlier states, actions or rewards.
• To put it another way, the Markov property says that the current state transition is independent of all previous states and actions.
• A finite MDP has only a finite number of states, rewards and actions. In RL, we consider only finite MDPs.
• The dynamics of the system can be defined by these two elements (S and P).
• In conclusion, the Bellman equation decomposes the value function into the immediate reward and the discounted value of future states.
• Using this equation, the computation of the value function is made simpler,
allowing us to identify the best solution to a challenging problem by
decomposing it into smaller, recursive subproblems and determining the best
solutions to each of those.
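In standard RL notation (not reproduced in this module's text), this decomposition is usually written as the Bellman expectation equation for the state-value function:

```latex
% Bellman expectation equation (standard notation): the value of a state is the
% expected immediate reward plus the discounted value of the successor state.
\[
v_\pi(s) = \mathbb{E}_\pi\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \bigm| S_t = s\bigr]
         = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
\]
```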
• The idea of ‘how good’ is defined here in terms of potential future rewards, or more precisely, in terms of the expected return. Of course, the rewards the agent obtains will depend on the actions it takes in the future.
• Informally, the value of a state s under a policy, written v(s), is the expected return when starting in s and following the policy thereafter. For MDPs, we can define v(s) formally as v(s) = E[Gt | St = s], where the return Gt = R(t+1) + γ·R(t+2) + γ²·R(t+3) + … is the discounted sum of future rewards.
• E[·] stands for the expected value of a random variable given that the agent follows the policy, and t can be any time step.
• Keep in mind that the terminal state's value, if any, is always zero.
• If separate averages of the observed returns are kept for each action taken in a state, these averages will similarly converge to the action values q(s, a) (see the sketch below).
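A minimal sketch of this averaging idea is given below; the incremental-mean update and the names Q and N are illustrative assumptions.

```python
from collections import defaultdict

# Running averages of observed returns, kept separately per (state, action) pair.
Q = defaultdict(float)   # current average return for (state, action)
N = defaultdict(int)     # number of returns averaged so far

def update_average(state, action, observed_return):
    """Incremental mean: as samples accumulate, Q[(s, a)] converges to q(s, a)."""
    key = (state, action)
    N[key] += 1
    Q[key] += (observed_return - Q[key]) / N[key]

# Example: three returns observed after taking action 'a' in state 's'.
for g in (1.0, 0.0, 2.0):
    update_average("s", "a", g)
print(Q[("s", "a")])   # 1.0, the average of the three observed returns
```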
• Actively looking for cans is the best way to find them, although doing so
drains the robot's battery faster than waiting does.
• There is a chance that the robot's battery will run out while it is
searching. The robot must now stop operating and wait to be rescued.
• The state set is S = {high, low}, because the robot can distinguish two battery-charge levels, high and low.
• Let us refer to the agent's actions as waiting, searching and recharging.
• Recharging would never be sensible when the energy level is high, so we do not include it in the action set for that state.
• The action sets of the agent are therefore A(high) = {search, wait} and A(low) = {search, wait, recharge}.
• A period of active searching can always be completed when the energy level is high, without any risk of depleting the battery.
• After a period of searching that starts with the energy level high, the level remains high with probability α and drops to low with probability 1 - α.
• A period of searching undertaken when the energy level is low, however, leaves it low with probability β and depletes the battery with probability 1 - β.
• In the latter scenario, the robot must be saved, after which the battery must be
fully recharged.
• Each can the robot collects earns it a reward of one unit, while every time it has to be rescued it receives a reward of -3.
• Let r_search and r_wait represent, respectively, the anticipated number of cans the robot will gather while searching and while waiting, with r_search > r_wait.
• Finally, to keep things simple, assume that no cans can be gathered on a step when the battery is low and that no cans can be collected on a run home for recharging.
• As a result, this system is a finite MDP; its transition probabilities and expected rewards are summarised in the sketch below.
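A sketch of that table, written as a Python dictionary in the same P[state][action] layout assumed earlier, is shown below. The symbols α, β, r_search and r_wait are the parameters of the example; the concrete numbers are placeholders chosen only so the snippet runs.

```python
# Recycling-robot transition table, in the assumed P[state][action] layout:
# each entry is a list of (probability, next_state, reward) triples.
alpha, beta = 0.8, 0.6               # placeholder values for the example's parameters
r_search, r_wait = 2.0, 1.0          # expected cans collected; r_search > r_wait

P = {
    "high": {
        "search":   [(alpha,     "high", r_search),
                     (1 - alpha, "low",  r_search)],
        "wait":     [(1.0,       "high", r_wait)],
    },
    "low": {
        "search":   [(beta,      "low",  r_search),
                     (1 - beta,  "high", -3.0)],   # battery depleted: robot rescued, then recharged
        "wait":     [(1.0,       "low",  r_wait)],
        "recharge": [(1.0,       "high", 0.0)],
    },
}
```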
• Due to their high computing cost and assumption of a perfect model, classical
DP algorithms are of limited practical value in reinforcement learning,
although they are still significant conceptually.
• The use of value functions to organise and structure the search for effective
policies is the fundamental concept behind DP and reinforcement learning, in
general.
• Once we have identified the optimal value functions, v* or q*, that satisfy the Bellman optimality equations, we can easily obtain optimal policies.
DP Techniques:
• Policy Iteration
• Policy Evaluation
• Value Iteration
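Continuing the illustrative P[state][action] dictionary layout assumed earlier, a value-iteration sketch could look as follows; it is a sketch under those assumptions, not a definitive implementation.

```python
# Value iteration sketch: back up every state with the Bellman optimality update
# until the values stop changing, then read off a greedy policy.
# Assumed layout: P[state][action] = [(probability, next_state, reward), ...].

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return policy, V
```

For instance, calling value_iteration(P) on the recycling-robot dictionary sketched earlier returns a greedy policy together with the converged state values.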
• The reward function and the transition probability distribution are frequently
referred to as the environment's ‘model’ or MDP, hence, the term ‘model-free’.
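Q-learning, listed among the algorithms earlier, is the classic model-free method: it learns from sampled transitions alone and never queries the reward function or the transition probabilities directly. Below is a minimal tabular sketch; the epsilon-greedy helper and all names are assumptions for illustration.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch: model-free, it uses only sampled transitions
# (state, action, reward, next_state) and never queries P or R directly.
Q = defaultdict(float)          # Q[(state, action)] -> estimated action value

def epsilon_greedy(state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One-step Q-learning update (off-policy: bootstraps from the best next action)."""
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```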
Summary
• By executing actions and observing their outcomes, an agent learns how to behave in a given environment via reinforcement learning.
• The goal of the reinforcement learner is to find a policy that increases the reinforcement obtained from every starting state without decreasing the reinforcement obtainable from any successor state.