Unit VI

Deep Reinforcement Learning (DRL) is a machine learning approach that enables agents to learn optimal behaviors through trial and error by receiving rewards or penalties based on their actions in an environment. It combines reinforcement learning principles with deep learning architectures, allowing for complex decision-making in dynamic environments, such as self-driving cars and robotics. Key concepts include agents, actions, states, rewards, and various algorithms like Q-learning and policy iteration, which help in maximizing cumulative rewards over time.

What is Deep Reinforcement Learning?

To understand Deep Reinforcement Learning better, imagine that you want your computer to play chess
with you. The first question to ask is this:
Would it be possible if the machine was trained in a supervised fashion?
In theory, yes. But—
There are two drawbacks that you need to consider.
Firstly, to move forward with supervised learning you need a relevant dataset.
Secondly, if we train the machine to replicate human behavior in the game of chess, the machine
will never be better than the human, because it is simply replicating the same behavior.
So supervised learning is not the right way to train the machine for this task.
But is there a way to have an agent play a game entirely by itself?
Yes, that’s where Reinforcement Learning comes into play.
Reinforcement Learning is a type of machine learning algorithm that learns to solve a multi-step problem
by trial and error. The machine is trained on realistic scenarios to make a sequence of decisions. It receives
either rewards or penalties for the actions it performs, and its goal is to maximize the total reward.
By Deep Reinforcement Learning we mean Reinforcement Learning in which the agent uses Artificial Neural
Networks with multiple layers, loosely inspired by the working of a human brain, to approximate its policy or value functions.
Reinforcement Learning definitions
Before we move on, let’s have a look at some of the definitions that you’ll encounter when learning about
Reinforcement Learning.
Agent - Agent (A) takes actions that affect the environment. Citing an example, the machine learning to
play chess is the agent.
Action - It is the set of all possible operations/moves the agent can make. The agent makes a decision on
which action to take from a set of discrete actions (a).
Environment - All actions that the reinforcement learning agent makes directly affect the environment.
Here, the board of chess is the environment. The environment takes the agent's present state and action as
information and returns the reward to the agent with a new state.
For example, the move made by the bot will either have a negative/positive effect on the whole game and
the arrangement of the board. This will decide the next action and state of the board.
State - A state (S) is a particular situation in which the agent finds itself.
This can be the state of the agent at any intermediate time (t).
Reward (R) - The environment gives feedback by which we determine the validity of the agent’s actions
in each state. It is crucial in the scenario of Reinforcement Learning where we want the machine to learn
all by itself and the only critic that would help it in learning is the feedback/reward it receives.
For example, in a chess game the bot receives a positive reward when it captures one of the opponent's
pieces.
Discount factor - Over time, the discount factor modifies the importance of rewards. Given the
uncertainty of the future, it is better to weigh immediate rewards more heavily than distant ones. The discount
factor reduces the degree to which future rewards affect our value function estimates.
Policy (π) - It decides what action to take in a certain state to maximize the reward.
Value (V) - It measures how good a specific state is. It is the expected discounted reward that the
agent collects when following the specific policy from that state.
Q-value or action-value - Q Value is a measure of the overall expected reward if the agent (A) is in state
(s) and takes action (a), and then plays until the end of the episode according to some policy (π).
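To make these definitions concrete, here is a minimal, illustrative sketch of the agent-environment loop in Python. The toy environment, its dynamics and the random policy are assumptions for illustration only, not part of the text above.

import random

class ToyEnv:
    """Toy environment: the state is a counter; reaching 5 ends the episode."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        # action 1 moves the counter forward, action 0 does nothing
        self.state += action
        reward = 1.0 if self.state >= 5 else 0.0   # reward (R) returned by the environment
        done = self.state >= 5                     # the episode ends in the terminal state
        return self.state, reward, done

def random_policy(state):
    """Policy (pi): maps a state to an action; here it simply picks at random."""
    return random.choice([0, 1])

env = ToyEnv()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random_policy(state)            # the agent (A) selects an action (a)
    state, reward, done = env.step(action)   # the environment returns a new state (S) and reward (R)
    total_reward += reward                   # the agent's goal: maximize the cumulative reward
print(total_reward)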
Model-based vs Model-free learning algorithms
There are two main types of Reinforcement Learning algorithms:
1. Model-based algorithms
2. Model-free algorithms
1. Model-based algorithms
Model-based algorithms use the transition and reward functions to estimate the optimal policy.
• They are used in scenarios where we have complete knowledge of the environment and how it
reacts to different actions.
• In Model-based Reinforcement Learning the agent has access to the model of the environment i.e.,
action required to be performed to go from one state to another, probabilities attached, and
corresponding rewards attached.
• They allow the reinforcement learning agent to plan ahead by simulating the outcomes of candidate actions before taking them.
• For static/fixed environments, Model-based Reinforcement Learning is more suitable.
2. Model-free algorithms
Model-free algorithms find the optimal policy with very limited knowledge of the dynamics of the
environment. They do not have any transition/reward function to judge the best policy.
• They estimate the optimal policy directly from experience i.e., interaction between agent and
environment without having any hint of the reward function.
• Model-free Reinforcement Learning should be applied in scenarios involving incomplete
information of the environment.
• In the real world, we usually don't have a fixed environment. Self-driving cars face a dynamic environment
with changing traffic conditions, route diversions etc. In such scenarios, Model-free algorithms
outperform other techniques.

Common mathematical and algorithmic frameworks


Now, let’s have a look at some of the most common frameworks used in Deep Reinforcement Learning.
Markov Decision Process (MDP)
A Markov Decision Process is a mathematical framework that gives us a way to formalize
sequential decision making.
This formalization is the basis of the problems that are solved by Reinforcement Learning. The
central component of a Markov Decision Process (MDP) is a decision maker called an agent that
interacts with the environment it is placed in.
These interactions occur sequentially over time.
At each time step, the agent receives some representation of the environment's state. Given this
representation, the agent selects an action. The environment then transitions into some new
state and the agent is given a reward as a consequence of its previous action.
Let’s wrap up everything that we have covered till now.
The process of selecting an action from a given state, transitioning to a new state and receiving a reward
happens sequentially over and over again. This creates something called a trajectory that shows the
sequence of states, actions and rewards.
Throughout the process, it is the responsibility of the reinforcement learning agent to maximize the total
amount of rewards that it received from taking actions in given states of environments.
The agent not only wants to maximize the immediate rewards but the cumulative reward it receives in the
whole process.
(Figure: the agent-environment interaction loop of states, actions and rewards.)

An important point to note about the Markov Decision Process is that it does not focus only on the
immediate reward but aims to maximize the total reward of the entire trajectory. Sometimes, it might
prefer to accept a smaller reward at the next time step in order to obtain a higher reward later on.
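As a small worked sketch of the cumulative reward over a whole trajectory, the discounted return of a list of rewards can be computed as follows (the reward values and discount factor are illustrative assumptions):

def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# a small reward now versus a bigger reward later: with gamma = 0.9 the agent may
# still prefer the trajectory whose large reward arrives a few steps in the future
print(discounted_return([1, 0, 0, 0, 0]))   # 1.0
print(discounted_return([0, 0, 0, 0, 10]))  # 10 * 0.9**4 ≈ 6.56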
Bellman Equations
Let’s cover the important Bellman Concepts before moving forward.
➔ State is a numerical representation of what an agent observes at a particular point in an environment.
➔ Action is the input the agent is giving to the environment based on a policy.
➔ Reward is a feedback signal from the environment to the reinforcement learning agent reflecting how
the agent has performed in achieving the goal.
Bellman Equations aim to answer these questions:
The agent is currently in a given state ‘s’. Assuming that the agent takes the best possible actions at all
subsequent time steps, what long-term reward can it expect?
or
What is the value of the state the agent is currently in?
Bellman Equations are a set of recursive equations used in Reinforcement Learning; the form discussed
here applies to deterministic environments.
The value of a given state (s) is determined by taking a maximum over the actions the agent can take in that
state. The aim of the agent is to pick the action that is going to maximize the value.
Therefore, it takes the reward of the optimal action ‘a’ in state ‘s’ and adds a
multiplier ‘γ’, the discount factor, which diminishes the value of future rewards over time. Every time the agent takes
an action it ends up in a next state s'.
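Written out, the deterministic Bellman optimality equation described above takes the standard form (shown here as a sketch, with s' denoting the state reached by taking action a in state s):

V(s) = max_a [ R(s, a) + γ · V(s') ]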

Rather than summing over numerous time steps, this equation simplifies the computation of the value
function, allowing us to find the best solution to a complex problem by breaking it down into smaller,
recursive subproblems.
Dynamic Programming
For the Bellman Optimality Equations, if we have large state spaces it becomes extremely difficult, and close to
impossible, to solve this system of equations explicitly.
Hence, we shift our approach from recursion to Dynamic Programming.
Dynamic Programming is a method of solving problems by breaking them into simpler sub-problems. In
Dynamic Programming, we are going to create a lookup table to estimate the value of each state.
There are two classes of Dynamic Programming:
1. Value Iteration
2. Policy Iteration
1. Value iteration
In this method, the optimal policy (optimal action for a given state) is obtained by choosing the action that
maximizes optimal state-value function for the given state.
The optimal state-value function is obtained using an iterative function and hence its name—Value
Iteration.
By iteratively improving the estimate of V, the Value Iteration method computes the optimal state-value
function V(s). The algorithm initializes V(s) with arbitrary random values, and the Q(s, a) and V(s)
values are updated until they converge. Value Iteration is guaranteed to converge to the optimal values, as sketched below.
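A minimal Value Iteration sketch on a tiny deterministic chain MDP. The states, actions, transitions and rewards below are illustrative assumptions, not taken from the text:

GAMMA = 0.9
states = [0, 1, 2, 3]            # state 3 is terminal
actions = ["stay", "right"]

def step(s, a):
    """Deterministic transition and reward: reaching state 3 earns a reward of 1."""
    if a == "right" and s < 3:
        return s + 1, 1.0 if s + 1 == 3 else 0.0
    return s, 0.0

V = {s: 0.0 for s in states}     # V(s) initialized with arbitrary values
for _ in range(100):             # iterate until the values (approximately) converge
    V = {s: 0.0 if s == 3 else
         max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in actions))
         for s in states}        # V(s) = max_a [ R(s, a) + gamma * V(s') ]
print(V)                         # approximately {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}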
2. Policy iteration
This algorithm has two phases in its working:
1. Policy Evaluation—It computes the values for the states in the environment using the policy provided
by the policy improvement phase.
2. Policy Improvement—Looking into the state values provided by the policy evaluation part, it
improves the policy so that it can get higher state values.
Firstly, the reinforcement learning agent starts with a random policy π(0). Policy Evaluation will evaluate
the value functions, such as the state values, for that particular policy.
Policy Improvement will then improve the policy and give us π(1), and so on, until we reach the optimal
policy, at which point the algorithm stops. This algorithm communicates back and forth between the two phases:
Policy Improvement gives the policy to the Policy Evaluation module, which computes values.
Later, looking at the computed values, Policy Improvement improves the policy, and this process iterates.

Policy Evaluation is also iterative.


Firstly, the reinforcement learning agent gets the policy from the Policy Improvement phase. In the
beginning, this policy is random.
Here, the policy is like a table of state-action pairs, which we can randomly initialize. Later, Policy
Evaluation evaluates the values for all the states. This step runs in a loop until the process converges,
which is marked by the values no longer changing.
Then comes the role of the Policy Improvement phase. It is just a one-step process: for each state, we take the
action that maximizes the one-step lookahead value R(s, a) + γV(s'), and that greedy choice becomes the policy for the next iteration.

To understand the Policy Evaluation algorithm better, have a look at the sketch below.
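A minimal sketch of the two phases on an illustrative four-state chain MDP (the environment and the starting policy are assumptions for illustration, not the text's own example):

GAMMA = 0.9
states, actions = [0, 1, 2, 3], ["stay", "right"]   # state 3 is terminal

def step(s, a):
    """Deterministic transitions: 'right' moves toward state 3, which pays a reward of 1."""
    if a == "right" and s < 3:
        return s + 1, 1.0 if s + 1 == 3 else 0.0
    return s, 0.0

policy = {s: "stay" for s in states}      # start from an arbitrary (poor) policy

# Policy Evaluation: compute the state values of the current policy until they stop changing.
V = {s: 0.0 for s in states}
for _ in range(100):
    for s in states:
        if s != 3:
            s2, r = step(s, policy[s])
            V[s] = r + GAMMA * V[s2]

# Policy Improvement: a single greedy step with respect to the evaluated values.
for s in states:
    if s != 3:
        policy[s] = max(actions, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])

print(policy)   # state 2 now prefers "right"; repeating both phases yields the optimal policy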

Q-learning
Q-Learning combines the policy and value functions, and it tells us jointly how useful a given action is in
gaining some future reward.
Quality is assigned to a state-action pair as Q (s,a) based on the future value that it expects given the
current state and best possible policy the agent has. Once the agent learns this Q-Function, it looks for the
best possible action at a particular state (s) that yields the highest quality.
Once we have an optimal Q-function (Q*), we can determine the optimal policy by choosing, in each state,
the action that maximizes Q*(s, a).

In other words, Q* gives the largest expected return achievable by any policy π for each possible state-
action pair.

In the basic Q-Learning approach, we need to maintain a look-up table called q-map for each state-action
pair and the corresponding value associated with it.
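A minimal sketch of the tabular update behind this q-map, using the standard Q-learning rule; the learning rate, discount factor, action set and the example transition are illustrative assumptions:

from collections import defaultdict
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
actions = [0, 1]
Q = defaultdict(float)                 # the q-map: (state, action) -> Q-value

def td_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def epsilon_greedy(s):
    """Mostly pick the best known action for state s, occasionally explore at random."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# one illustrative transition: in state 0, action 1 yields reward 1.0 and leads to state 1
td_update(s=0, a=1, r=1.0, s_next=1)
print(Q[(0, 1)])   # 0.1 after a single update (alpha * reward)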
Deep Q-Learning, also known as a Deep Q-Network (DQN), employs a Neural Network architecture to predict the Q-values for a
given state.
Neural Networks and Deep Reinforcement Learning
Reinforcement Learning involves managing state-action pairs and keeping track of the value (reward)
attached to each action in order to determine the optimal policy.
This method of maintaining a state-action-value table is not feasible in real-life scenarios where the
number of possibilities is very large.
Instead of utilizing a table, we can make use of Neural Networks to predict the Q-values of actions in a given
state, as in the sketch below.
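As a minimal sketch of this idea, a small fully connected network can map a state vector to one Q-value per action. This assumes PyTorch is available, and the state dimension, number of actions and layer sizes are illustrative choices, not a prescribed DQN architecture:

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2

q_network = nn.Sequential(
    nn.Linear(STATE_DIM, 64),   # state features in
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),   # one predicted Q-value per possible action
)

state = torch.zeros(1, STATE_DIM)            # a dummy state vector (batch of one)
q_values = q_network(state)                  # predicted Q(s, a) for every action a
best_action = q_values.argmax(dim=1).item()  # act greedily with respect to the predictions
print(q_values, best_action)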

Applications of deep Reinforcement Learning


Finally, let’s have a look at some of the real-world applications of Reinforcement Learning.
Industrial manufacturing
Deep Reinforcement Learning is very commonly applied in Robotics.
The actions that the robot has to take are inherently sequential. Agents learn to interact with dynamic
changing environments and thus find applications in industrial automation and manufacturing.
Labor expenses, product faults and unexpected downtime are being reduced with significant improvement
in transition times and production speed.
Self-driving cars
Machine Learning technologies power self-driving cars.
Autonomous vehicles use large amounts of visual data and leverage image processing capabilities in
combination with Neural Network architectures.
The algorithms learn to recognize pedestrians, roads, traffic and street signs in the environment and act
accordingly. They are trained in complex scenarios and learn to excel at decision making in situations that
involve minimizing harm to humans, choosing the best route to follow, and so on.
Trading and Finance
We have seen how supervised learning and time-series analysis help in predicting the stock market.
But neither helps us decide what to do in a particular situation. An RL agent can select
whether to hold, buy, or sell a share. To guarantee that the RL model is working optimally, it is assessed
using market benchmark standards.
Natural Language Processing
Reinforcement Learning is spreading its wings and has reached NLP too. Different NLP tasks like
question answering, summarization and chatbot implementation can be handled by a Reinforcement Learning
agent.
Virtual Bots are trained to mimic conversations. Sequences with crucial conversation properties including
coherence, informativity, and simplicity of response are rewarded using policy gradient approaches.
Healthcare
Reinforcement Learning in healthcare is an area of continuous research. Bots equipped with biological
information are extensively trained to perform surgeries that require precision. RL bots help in better
diagnosis of diseases and can predict how a disease will progress if treatment is delayed, and so on.
Playing Atari with Deep Reinforcement Learning

Reinforcement learning is based on a system of rewards and punishments (reinforcements) for a machine that gets a problem to solve. It is a cutting-edge technology that forces the AI model to be creative – it is provided only with the indicator of success and no additional hints. Experiments combining deep learning and reinforcement learning have been done in particular by DeepMind (in 2013) and by Gerald Tesauro even before (in 1992). We focused on reducing the time needed to train the model.

A well-designed system of rewards is essential in human education. Now, with reinforcement learning, such a system has become a pillar of teaching computers to perform more sophisticated tasks, such as beating human champions in the game Go. In the near future it may be driving an autonomous car. In the case of the Atari 2600 games, the only indicator of success was the points the artificial intelligence earned. There were no further hints or suggestions. Thus the algorithm had to learn the rules of the game and find the most effective tactics by itself to maximize the long-term rewards it earned.

In 2013 the learning algorithm needed a whole week of uninterrupted training in an arcade learning environment to reach superhuman levels in classics like Breakout (knocking out a wall of colorful bricks with a ball) or Space Invaders (shooting down alien invaders with a mobile laser cannon). By 2016 DeepMind had cut the time to 24 hours by improving the algorithm.


While the whole process may sound like a bunch of scientists having fun at work, playing
Atari with deep reinforcement learning is a great way to evaluate a learning model. On a more
sobering note, if someone had a problem understanding the rules of “Space Invaders”, would you
let him drive your car?

Cutting the time of deep reinforcement learning

DeepMind’s work inspired various implementations and modifications of the base algorithm
including high-quality open-source implementations of reinforcement learning algorithms
presented in Tensorpack and Baselines. In our work we used Tensorpack.
The reinforcement learning agent learns only from visual input, and has access to only the same
information given to human players. From a single image the RL agent can learn about the
current positions of game objects, but by combining the current image with a few that preceded
it, the deep neural network is able to learn not only positions, but also the game’s physical
characteristics, such as speed at which objects are moving.
The results of the parallelization experiment conducted by deepsense.ai were impressive – the
algorithm required only 20 minutes to master Atari video games, a vast improvement over the
approximately one week required in the original experiments done by DeepMind. We provided
the code and technical details on arXiv, GitHub and in a blog post, so that others can easily
recreate the results. Similar experiments optimizing the training time of Atari games have been
conducted by Adam Stooke and Pieter Abbeel from UC Berkeley among others,
including OpenAI and Uber.
Replacing the silicon spine

To make the learning process more effective, we used an innovative multi-node infrastructure
based on Xeon processors provided by Intel.

The experiment proves that effective machine learning is possible on various architectures,
including more common CPUs. The freedom to choose the infrastructure is crucial in seeking
ways to further optimize the chosen metrics. Sometimes the time of training is the decisive
factor; at other times it is the price of computing power that is most critical.
Instead of insisting that all machine learning be done using a particular type of hardware, in
practice a diversified architecture may prove more efficient. As machine learning is computing-
power-hungry, the wise use of resources may save both money and time.

Biases of mortality revealed by reinforcement learning

Reinforcement learning is much more than just an academic game. By enabling a computer to
learn “by itself”, with no hints or suggestions, reinforcement learning lets the machine act
innovatively and overcome universal human biases.
A good example is playing chess. Reinforcement learning agents tend to move in an unorthodox
way that is rarely seen among human players. Sacrificing a bishop only to open up the opponent’s
position is one of the best examples of such superhuman tactics.

So why Atari games?


A typical Atari game provides an environment consisting of a single screen with a limited
context and a relatively simple goal to achieve. However, the number of variables which AI
must consider is comparable to other visual training environments. Achieving superhuman
performance in Atari games is a good indicator that an algorithm will perform well in other
tasks. A robotic “game” may mean delivering a human to a destination point without incident or
accident or reducing power usage in an intelligent building without any interruption to the
business being conducted inside. The huge potential of reinforcement learning is seen in
robotics, an area deepsense.ai is continuously developing.

A robotic arm can be effectively programmed to perform repetitive tasks like putting in screws
on an assembly line. The task is always done in the same conditions, with no variables or
unexpected events. But when empowered with reinforcement learning and computer vision, the
arm will be able to find a bottle of milk in a refrigerator, a particular book on a bookshelf or a
plate in a dryer. The possibilities are practically endless. An interesting demonstration of
reinforcement learning in robotics may be seen in the video below, which was taken during an
experiment conducted by Chelsea Finn, Sergey Levine and Pieter Abbeel from Cal-Berkeley.

Coding every possible position of milk in every possible fridge would be a Herculean, and
unnecessary, undertaking. A better approach is to provide the machine with many visual
examples from which it learns features of a bottle of milk and then learns through trial and error
how to grasp the bottle. Powered by machine learning, the machine would become a semi-
autonomous assistant for elderly or injured people. It would be able to work in different lighting
conditions or deal with messy fridges.
Warsaw University professors and deepsense.ai contributors Piotr Miłoś, Błażej Osiński and
Henryk Michalewski recently conducted a project dubbed “Learning to Run”. They focused on
building software for modern, sophisticated leg prostheses that automatically adjust to the
wearer’s walking style. Their model can be easily applied in highly flexible environments
involving many rapidly changing variables, like financial markets, urban traffic management or
any real-time challenge requiring rapid decision-making. Given the rapid development of
reinforcement learning methods, we can be sure that 2018 will bring the next spectacular success
in this area.

Web link: https://deepsense.ai/playing-atari-with-deep-reinforcement-learning-deepsense-ais-approach/

For reference : https://builtin.com/artificial-intelligence/deep-q-learning


Markov Decision Process
Reinforcement Learning:

Reinforcement Learning is a type of Machine Learning. It allows machines and software
agents to automatically determine the ideal behavior within a specific context, in order to
maximize their performance. Simple reward feedback is required for the agent to learn its
behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement
Learning is defined by a specific type of problem, and all its solutions are classed as
Reinforcement Learning algorithms. In the problem, an agent is supposed to decide the best
action to select based on its current state. When this step is repeated, the problem is known as
a Markov Decision Process.
A Markov Decision Process (MDP) model contains:
• A set of possible world states S.
• A set of Models.
• A set of possible actions A.
• A real-valued reward function R(s,a).
• A policy π, which is the solution of the Markov Decision Process.

What is a State?
A State is a set of tokens that represent every state that the agent can be in.
What is a Model?
A Model (sometimes called Transition Model) gives an action’s effect in a state. In particular, T(S, a, S’)
defines a transition T where being in state S and taking an action ‘a’ takes us to state S’ (S and S’ may be
the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S’|S,a) which
represents the probability of reaching a state S’ if action ‘a’ is taken in state S. Note that the Markov property
states that the effects of an action taken in a state depend only on that state and not on the prior history.
What are Actions?
An Action A is a set of all possible actions. A(s) defines the set of actions that can be taken being in state
S.
What is a Reward?
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the state S. R(S,a)
indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’) indicates the reward for being
in a state S, taking an action ‘a’ and ending up in a state S’.
What is a Policy?
A Policy is a solution to the Markov Decision Process. A policy is a mapping from states S to actions A. It indicates the
action ‘a’ to be taken while in state S.
Let us take the example of a grid world:

An agent lives in the grid. The above example is a 3*4 grid. The grid has a START state (grid no 1,1). The
purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no 4,3). Under all
circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a
blocked grid; it acts as a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have taken, the agent stays in
the same place. So for example, if the agent chooses LEFT in the START grid it would stay put in the START
grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such sequences can be
found:
• RIGHT RIGHT UP UP RIGHT
• UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The moves are now noisy: 80% of the time the intended action works correctly; 20% of the time the action the
agent takes causes it to move at right angles. For example, if the agent chooses UP, the probability of going UP is 0.8,
whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and
RIGHT are at right angles to UP).
The agent receives a reward at each time step:
• A small reward each step (it can be negative, in which case it can also be termed a punishment; in the above example,
entering the Fire has a reward of -1).
• Big rewards come at the end (good or bad).
• The goal is to maximize the sum of rewards, as in the sketch below.
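A minimal sketch of this noisy grid world in Python. The coordinate conventions, the -0.04 step cost and the +1 goal reward are assumptions for illustration (the text only fixes -1 for the Fire):

WALLS = {(2, 2)}
GOAL, FIRE = (4, 3), (4, 2)
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def move(state, direction):
    """Deterministic move; hitting a wall or the border leaves the agent where it is."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def transition_probs(state, action):
    """P(s' | s, a) under the 0.8 / 0.1 / 0.1 noisy dynamics described above."""
    slip_a, slip_b = PERPENDICULAR[action]
    probs = {}
    for direction, p in [(action, 0.8), (slip_a, 0.1), (slip_b, 0.1)]:
        nxt = move(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

def reward(state):
    """+1 at the Diamond, -1 at the Fire, an assumed small step cost elsewhere."""
    if state == GOAL:
        return 1.0
    if state == FIRE:
        return -1.0
    return -0.04

print(transition_probs((1, 1), "UP"))   # UP succeeds with 0.8; LEFT is blocked, so the agent stays with 0.1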

References: http://reinforcementlearning.ai-depot.com/
http://artint.info/html/ArtInt_224.html
MDPs and Policy vs. Value Learning
Reinforcement learning is one of the AI disciplines that most closely resembles human thinking. Essentially,
reinforcement learning models AI scenarios using a combination of environments and rewards. In that
world, the role of an AI agent is to learn about the environment while maximizing its total reward. One of the
most popular mechanisms to represent reinforcement learning problems is known as Markov Decision
Processes (MDPs), which decompose scenarios into a series of states, connected by actions and associated with
a specific reward. In MDPs, an AI agent can transition from state to state by selecting an action and
obtaining the corresponding reward.
Conceptually, MDPs aim to help AI agents find the optimal policy in a target environment. Policies are
defined by the action an AI agent takes in a specific state. The objective of MDP policies is to maximize
the future return for the AI agent. The biggest challenge in any MDP scenario is always how to get the
AI agent to the reward. Broadly speaking, the solutions to this challenge fall into two main categories:
policy and value learning.
Policy learning focuses on directly inferring a policy that maximizes the reward in a specific environment.
Contrasting with that approach, value learning tries to quantify the value of every state-action pair. Let’s
explain those concepts using the example of an AI agent trying to learn a new chess opening. Using policy
learning, the AI agent would try to infer a strategy to develop the pieces in a way that achieves a
certain well-known position. In the case of value learning, the AI agent would assign a value to
every position and select the moves that score highest. Taking a psychological perspective, policy-learning
is closer to how adults reason through cognitive challenges while value-learning is closer to how babies
learn.
Q-Learning and Deep-Q-Networks
Q-Learning is one of the most popular forms of value-based reinforcement learning. Conceptually, Q-Learning
algorithms focus on learning a Q-Function that qualifies a state-action pair. A Q-Value represents the
expected long-term reward, assuming that the agent takes a perfect sequence of actions
from a specific state.
One of the main theoretical artifacts of Q-Learning is known as the Bellman Equation, and it states that
“the maximum future reward for a specific action is the current reward plus the maximum reward for
taking the next action”. That recursive rule seems to make a lot of sense, but it runs into all sorts of practical
issues.
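In symbols, the rule quoted above is usually written as (a sketch of the standard form, with s' the next state and a' the next action):

Q(s, a) = r(s, a) + γ · max_a' Q(s', a')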
The main challenge with Q-Learning and the Bellman Equation comes down to the compute cost associated with
estimating all combinations of state-action rewards. The computation cost quickly gets out of control in
problems involving a decent number of states. To deal with that challenge, there are techniques that try to
approximate the Q-function instead of learning an exact one by evaluating all possible Q-Values.
