UNIT 5: REINFORCEMENT LEARNING
Reinforcement learning works on a feedback-based process in which an AI agent (a
software component) automatically explores its surroundings by hit and trial: taking
actions, learning from experience, and improving its performance. The agent gets rewarded
for each good action and punished for each bad action; hence the goal of a
reinforcement learning agent is to maximize the rewards.
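This feedback loop can be sketched in a few lines of Python; the Environment class and its reset()/step() methods below are made-up placeholders for illustration, not a particular library:

# Minimal sketch of the reinforcement learning feedback loop.
# The Environment class and its reset()/step() interface are illustrative
# placeholders, not a specific library.

class Environment:
    def reset(self):
        # return the starting state of a toy one-state task
        return 0

    def step(self, action):
        # return (next_state, reward, done): good action rewarded, bad punished
        reward = 1 if action == 1 else -1
        return 0, reward, False

env = Environment()
state = env.reset()
total_reward = 0
for t in range(10):                          # the agent explores by trial and error
    action = t % 2                           # placeholder exploration strategy
    state, reward, done = env.step(action)   # feedback from the environment
    total_reward += reward                   # the agent tries to maximize this sum
print("total reward:", total_reward)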
In reinforcement learning, there is no labelled data as in supervised learning, and the agent
learns only from its own experience. The reinforcement learning process is similar to the way a
human being learns; for example, a child learns various things through experience in day-to-day
life.
An example of reinforcement learning is playing a game, where the game is the
environment, the agent's moves at each step define the states, and the goal of the agent
is to get a high score. The agent receives feedback in terms of rewards and punishments.
Due to its way of working, reinforcement learning is employed in different fields such as
game theory, operations research, information theory, and multi-agent systems.
Reinforcement Learning Algorithms
There are three approaches to implementing reinforcement learning algorithms:
Value-Based – The main goal of this method is to maximize a value function. Here, the agent,
acting under a policy, expects a long-term return from the current states.
Policy-Based – In policy-based methods, you try to come up with a policy (strategy) such that
the actions performed in each state help you gain maximum reward in the future. Two types
of policy-based methods are deterministic and stochastic.
Model-Based – In this method, we create a virtual model of the environment, and the agent
learns to perform within that specific environment; what each of the three approaches stores is sketched just below this list.
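A minimal sketch, using made-up states, actions, and numbers, of what each approach keeps in memory:

# What each approach keeps in memory, sketched with plain dictionaries.
# The state names, action names, and numbers are invented for illustration.

states, actions = ["s0", "s1"], ["left", "right"]

# Value-based: a value estimate per state (or per state-action pair).
V = {"s0": 0.0, "s1": 1.0}

# Policy-based: a policy mapping each state to an action (deterministic)
# or to a probability distribution over actions (stochastic).
deterministic_policy = {"s0": "right", "s1": "left"}
stochastic_policy = {"s0": {"left": 0.3, "right": 0.7},
                     "s1": {"left": 0.5, "right": 0.5}}

# Model-based: a learned (virtual) model of the environment,
# e.g. transition probabilities P(next_state | state, action).
model = {("s0", "right"): {"s1": 0.9, "s0": 0.1}}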
Types of Reinforcement Learning:
There are two types:
1. Positive Reinforcement
Positive reinforcement is defined as an event that, occurring because of a specific behavior, increases the
strength and frequency of that behavior. It has a positive impact on behavior.
Advantages
– Maximizes the performance of an action
– Sustains change for a longer period
Disadvantage
– Excess reinforcement can lead to an overload of states, which can diminish the results.
2. Negative Reinforcement
Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or
avoided; in other words, the agent learns to take actions that prevent the negative condition from occurring in the future.
Advantages
– Maximizes behavior
– Provides at least a minimum standard of performance
Disadvantage
– It only provides enough to meet the minimum standard of behavior
Learning Models for Reinforcement (Markov Decision Process):
Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed
as Reinforcement Learning algorithms. In this problem, an agent is supposed to decide the best
action to select based on its current state. When this step is repeated, the problem is known as
a Markov Decision Process.
The Markov decision process (MDP) is a mathematical framework used for modeling decision-making
problems where the outcomes are partly random and partly under the control of the decision maker.
It is a framework that can address most reinforcement learning (RL) problems.
A Markov Decision Process (MDP) model contains:
A set of possible world states S.
A set of Models.
A set of possible actions A.
A real-valued reward function R(s,a).
A policy π, the solution of the Markov Decision Process.
What is a State?
A State is a set of tokens that represent every state that the agent can be in.
What is a Model?
A Model (sometimes called Transition Model) gives an action’s effect in a state. In particular,
T(S, a, S’) defines a transition T where being in state S and taking an action ‘a’ takes us to state
S’ (S and S’ may be the same). For stochastic actions (noisy, non-deterministic) we also define a
probability P(S’|S,a) which represents the probability of reaching a state S’ if action ‘a’ is taken
in state S. The Markov property states that the effects of an action taken in a state depend only
on that state and not on the prior history.
What are Actions?
An action set A is the set of all possible actions. A(s) defines the set of actions that can be taken
in state S.
What is a Reward?
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’)
indicates the reward for being in a state S, taking an action ‘a’ and ending up in a state S’.
What is a Policy?
A Policy is a solution to the Markov Decision Process. A policy is a mapping from S to A. It
indicates the action ‘a’ to be taken while in state S.
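Putting the five components together, a tiny MDP can be written down directly as data. The states, probabilities, and rewards below are invented purely for illustration:

import random

# States S and actions A (all names and numbers are illustrative).
S = ["s0", "s1", "s2"]
A = ["stay", "go"]

# Transition model T(S, a, S') given as probabilities P(S' | S, a);
# by the Markov property they depend only on the current state and action.
P = {
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s2": 1.0},
    ("s1", "stay"): {"s1": 1.0},
}

# Reward function R(s, a).
R = {("s0", "go"): -0.1, ("s0", "stay"): 0.0,
     ("s1", "go"): 1.0,  ("s1", "stay"): 0.0}

# A policy: a mapping from states to actions.
policy = {"s0": "go", "s1": "go"}

def sample_next_state(s, a):
    # draw S' according to P(S' | S, a)
    next_states = list(P[(s, a)].keys())
    weights = list(P[(s, a)].values())
    return random.choices(next_states, weights=weights)[0]

print(sample_next_state("s0", policy["s0"]))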
EXAMPLE: Let us take the example of a grid world:
An agent lives in the grid. The example above is a 3×4 grid. The grid has a START state (grid no
1,1). The purpose of the agent is to wander around the grid to finally reach the Blue Diamond
(grid no 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid
no 4,2). Also, grid no 2,2 is a blocked grid; it acts as a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved,
the agent stays in the same place. So, for example, if the agent moves LEFT in the START grid, it
stays put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such
sequences can be found:
RIGHT RIGHT UP UP RIGHT
UP UP RIGHT RIGHT RIGHT
The agent receives a reward at each time step:
A small reward at each step (it can be negative, in which case it can also be termed a punishment;
in the above example, entering the Fire grid can have a reward of -1).
Big rewards come at the end (good or bad).
The goal is to maximize the sum of rewards.
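As a sketch, the grid world above can be solved by value iteration over the MDP it defines. The step reward of -0.04, terminal rewards of +1 and -1, discount factor of 0.9, and the assumption of deterministic moves are illustrative textbook-style choices, not values given in this unit:

# Value iteration on the 3x4 grid world, treating moves as deterministic for simplicity.
# Coordinates are (column, row); (2,2) is the wall, (4,3) the diamond, (4,2) the fire.
GAMMA = 0.9            # discount factor (assumed)
STEP_REWARD = -0.04    # small negative reward per step (assumed)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
WALL = (2, 2)
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

def next_state(s, move):
    # bumping into the wall or the border leaves the agent in place
    c, r = s[0] + MOVES[move][0], s[1] + MOVES[move][1]
    if (c, r) == WALL or not (1 <= c <= 4 and 1 <= r <= 3):
        return s
    return (c, r)

V = {s: 0.0 for s in STATES}
for _ in range(100):                       # repeat the Bellman update until it converges
    for s in STATES:
        if s in TERMINALS:
            V[s] = TERMINALS[s]
            continue
        V[s] = max(STEP_REWARD + GAMMA * V[next_state(s, m)] for m in MOVES)

policy = {s: max(MOVES, key=lambda m: V[next_state(s, m)])
          for s in STATES if s not in TERMINALS}
print(policy[(1, 1)])   # best first move from START (UP and RIGHT are tied here)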
Q-Learning:
The discount factor, 𝛾, is a real value ∈ [0, 1] that determines how much importance the agent gives
to future rewards compared with the immediate reward: a value close to 0 makes the agent care mostly
about immediate rewards, while a value close to 1 makes it weigh long-term (future) rewards almost
as heavily as present ones.
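For example, with 𝛾 = 0.9 the discounted return of a short, made-up reward sequence can be computed as:

# Discounted return G = r0 + gamma*r1 + gamma^2*r2 + ...
gamma = 0.9
rewards = [1, 0, 0, 5]                 # illustrative reward sequence
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)                               # 1 + 0 + 0 + 0.9**3 * 5 = 4.645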
Main terminologies in Q-Learning:
1. Agent: It is an assumed entity which performs actions in an environment to gain some
reward.
2. Environment (e): A scenario that an agent has to face.
3. Rewards: For every action, the agent will get a positive or negative reward.
4. Episodes: When an agent ends up in a terminating state and can’t take a new action.
5. Q-Values: Used to determine how good an action A taken at a particular state S is, denoted
Q(S, A).
6. Value Function: It specifies the value of a state, that is, the total amount of reward an
agent should expect to accumulate starting from that state.
7. State (s): State refers to the current situation returned by the environment.
8. Policy (π): It is a strategy applied by the agent to decide the next action based on
the current state.
9. Temporal Difference: A formula used to update the Q-value using the value of the current
state and action and the previous state and action. Temporal Difference Learning in machine
learning is a method for learning how to predict a quantity that depends on future values of a
given signal. It can be used to learn both the V-function and the Q-function,
whereas Q-learning is a specific TD algorithm used to learn the Q-function; a single TD(0) update is sketched below.
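As an illustration, a single TD(0) update for the V-function nudges the current estimate toward the observed reward plus the discounted value of the next state; the learning rate and the values below are assumed:

# One TD(0) update for the state-value function V.
alpha, gamma = 0.1, 0.9        # learning rate and discount factor (assumed)
V = {"s": 0.5, "s_next": 1.0}  # current value estimates (illustrative)
reward = 0.0                   # reward observed on the transition s -> s_next
td_target = reward + gamma * V["s_next"]
td_error = td_target - V["s"]
V["s"] += alpha * td_error     # move V(s) a little toward the TD target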
Q-LEARNING ALGORITHM:
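In outline, Q-learning initializes a Q-table, repeatedly selects actions (e.g. ε-greedily), observes the reward and next state, and applies the update Q(s,a) ← Q(s,a) + α[r + 𝛾·max Q(s',a') − Q(s,a)]. A minimal tabular sketch follows; the environment interface (reset()/step()/actions) and the hyperparameter values are assumptions for illustration:

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning; `env` is assumed to provide reset() -> state,
    # step(action) -> (next_state, reward, done), and a list env.actions.
    Q = defaultdict(float)                       # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection: explore sometimes, exploit otherwise
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # temporal-difference update toward r + gamma * max_a' Q(s', a')
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q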
Application of Reinforcement Learning:
1. RL in Marketing:
Marketing is all about promoting and then selling the products or services of either
your brand or someone else's. In the process of marketing, finding the right audience,
which yields a larger return on the investment you or your company is making, is a challenge
in itself.
2. RL in Healthcare
Healthcare is an important part of our lives and through DTRs (a sequence-based use-case
of RL), doctors can discover the treatment type, appropriate doses of drugs, and timings for
taking such doses.
DTRs are equipped with:
a sequence of rules which determine the current health status of a patient.
Then, they optimally propose treatments for diseases like diabetes,
HIV, cancer, and mental illness.
If required, these DTRs (i.e. Dynamic Treatment Regimes) can reduce or remove the
delayed impact of treatments through their multi-objective healthcare optimization
solutions.
3. RL in Robotics
Robotics, without any doubt, is about training a robot in such a way that it can
perform tasks just like a human being can. But there is still a bigger challenge the robotics
industry faces today: robots are not able to use common sense while making various
moral and social decisions. Here, a combination of Deep Learning and Reinforcement Learning,
i.e. Deep Reinforcement Learning, comes to the rescue and equips robots with a “Learn
How To Learn” model. With this, robots can now:
improve their decisions, for example by grasping well the various objects visible to them.
solve complicated tasks that even humans may fail at, since robots now know what to learn and
how to learn from the different levels of abstraction of the datasets available to
them.
4. RL in Gaming
Gaming is something that a huge number of people can't live without nowadays. By
optimizing games through Reinforcement Learning algorithms, we may
expect better performance of our favorite games related to adventure, action, or
mystery.
5. RL in Image Processing
Image processing is another important method of enhancing the current version of an
image to extract some useful information from it. There are several steps involved, such as:
Capturing the image with machines like scanners.
Analyzing and manipulating it.
Using the output image obtained after analysis for representation and description
purposes.
6. RL in Manufacturing
Manufacturing is all about producing goods that can satisfy our basic needs and
essential wants. Cobot manufacturers (i.e. manufacturers of collaborative robots that
can perform various manufacturing tasks alongside a workforce of more than 100 people) are
helping a lot of businesses with their own RL solutions for packaging and quality testing.
Introduction to Deep Q Learning:
Q-Learning creates an exact matrix (the Q-table) for the working agent,
which it can “refer to” to maximize its reward in the long run.
Although this approach is not wrong in itself, this is only practical for very small
environments and quickly loses its feasibility when the number of states and actions in
the environment increases.
Imagine an environment with 10,000 states and 1,000 actions per state. This would
create a table of 10 million cells. Things will quickly get out of control!
This presents two problems:
First, the amount of memory required to save and update that table would
increase as the number of states increases
Second, the amount of time required to explore each state to create the
required Q-table would be unrealistic
So, the idea is to approximate these Q-values with machine learning models such as a
neural network.
The basic working step in Deep Q-Learning is that the initial state is fed into the neural
network, which returns the Q-values of all possible actions as output. The difference
between Q-Learning and Deep Q-Learning can be illustrated as follows:
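A minimal sketch of the function approximator used in Deep Q-Learning: a small neural network takes a state vector and outputs one Q-value per action, replacing the Q-table lookup. The layer sizes and the use of PyTorch here are assumptions for illustration:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one Q-value per possible action.
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),          # one output per action
        )

    def forward(self, state):
        return self.net(state)

# Instead of looking up a row of the Q-table, we query the network:
q_net = QNetwork(state_dim=4, n_actions=2)     # illustrative dimensions
state = torch.zeros(1, 4)                      # illustrative state vector
q_values = q_net(state)                        # Q-value for every action
action = q_values.argmax(dim=1).item()         # greedy action, as with a Q-table row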