Reinforcement Learning For IoT - Final


Presented by: Islam Watheq Fouad, Ali Abdul-hameed, Sami Mohammed Saleem, Hussain Ghanim Salim, Ali Mutasher

Supervisor: Prof. Dr. Abbas AL-BAKRI
Reinforcement learning (RL)
Reinforcement learning (RL) is very different from both supervised and unsupervised
learning. It's the way most living beings learn—interacting with the environment.

Have you ever observed infants and how they learn to turn over, sit up, crawl, and even
stand?

Have you watched how baby birds learn to fly? The parents throw them out of the nest,
they flutter for some time, and they slowly learn to fly.

All of this learning involves the following components:


Trial and error: The baby tries different ways and is unsuccessful many times
before finally succeeding in doing it.

Goal-oriented: All of the efforts are toward reaching a particular goal. The goal
for the human baby can be to crawl, and for the baby bird, to fly.

Interaction with the environment: The only feedback that they get is from the
environment.
Reinforcement learning (RL)

RL (in Artificial Intelligence) can be defined as a computational approach to goal-directed
learning and decision-making from interaction with the environment, under some idealized
conditions.

- The Agent in RL isn't given any explicit instructions; it learns only from its
interaction with the environment.
- This interaction with the environment, as shown in the following diagram, is a
cyclic process.

- The Agent can sense the state of the Environment.


- The Agent can perform specific well-defined actions on the Environment.

This causes two things: first, a change in the state of the environment, and
second, a reward is generated (under ideal conditions).
Reinforcement learning (RL)
This cycle continues:
RL terminology
Let's consider two examples: an agent finding a route in a maze and an agent steering the
wheel of a Self-Driving Car (SDC). The two are illustrated in the following diagram:
RL terminology
- States s: The states can be thought of as a set of tokens (or representation) that
can define all of the possible states the environment can be in. The state can be
continuous or discrete.
- Actions a(s): Actions are the set of all possible things that the agent can do in a
particular state. The set of possible actions, a, depends on the present state, s.
Actions may or may not result in the change of state. They can be discrete or
continuous.
- Reward r(s, a, s'): It's a scalar value returned by the environment when the agent
selects an action. It defines the goal; the agent gets a higher reward if the action
brings it near the goal, and a low (or even negative) reward otherwise.
- Policy π(s): It defines a mapping between each state and the action to take in that
state. The policy can be deterministic—that is, for each state there is a well-defined
action to take. The policy can also be stochastic—that is, an action is taken with
some probability. It can be implemented as a simple look-up table, or it can be a
function dependent on the present state. The policy is the core of the RL agent.
RL terminology
- Value function V(s): It defines the goodness of a state in the long run. It can be
thought of as the total amount of reward the agent can expect to accumulate over
the future, starting from the state s.

There are two ways in which the value function is normally considered:

1- Value function Vπ(s): It's the goodness of state s when following the policy π.
2- State-action value function (or Q-function) Qπ(s, a): It's the goodness of state s, taking action
a, and thereafter following policy π.

- Model of the environment: It's an optional element. It mimics the behavior of the
environment, and it contains the physics of the environment; in other words, it
defines how the environment will behave. The model of the environment is
defined by the transition probability to the next state.
Deep reinforcement learning
RL algorithms can be classified into two, based on what they iterate/approximate:

1- Value-based methods: In these methods, the algorithms take the action that
maximizes the value function. The agent here learns to predict how good a given
state or action would be. Hence, here, the aim is to find the optimal value. An
example of the value-based method is Q-learning.

2- Policy-based methods: In these methods, the algorithms predict the best policy
which maximizes the value function. The aim is to find the optimal policy. An
example of the policy-based method is policy gradients. Here, we approximate
the policy function, which allows us to map each state to the best corresponding
action.
 We can use neural networks as function approximators to obtain an approximate
value of either the policy or the value function.
 When we use deep neural networks as a policy approximator or a value
approximator, we call it deep reinforcement learning (DRL).
 DRL has, in the recent past, given very successful results.
Some successful applications
1. AlphaGo Zero.
 Developed by Google's DeepMind team, AlphaGo Zero mastered the game of Go
without any human knowledge;
 it starts from an absolutely blank slate (tabula rasa).
 It uses:
 Monte Carlo Tree Search guided by the neural network to select
the moves.
 One neural network to approximate both the move probabilities
and the value.
 This neural network takes as input the raw board representation.
 The reinforcement learning algorithm incorporates look-ahead
search inside the training loop.
 The neural network was optimized on Google Cloud using
TensorFlow, with 64 GPU workers and 19 CPU parameter servers.
Some successful applications (cont.)
2- AI-controlled sailplanes:
 Microsoft developed a controller system that can run on many different autopilot
hardware platforms such as Pixhawk and Raspberry Pi 3.
 It can keep the sailplane in the air without using a motor,
 by autonomously finding and catching rides on naturally occurring thermals.
 The controller helps the sailplane to operate on its own; it detects and uses
thermals to travel without the aid of a motor or a person.
 They implemented it as a partially observable Markov decision process (MDP).
 They employ Bayesian reinforcement learning and use Monte Carlo tree
search to search for the best action.
 They've divided the whole system into two levels of planners: a high-level planner
that makes decisions based on experience, and a low-level planner that uses Bayesian
reinforcement learning to detect and latch onto thermals in real time.
Some successful applications (cont.)
AI-controlled sailplanes video
Some successful applications (cont.)
3. Locomotion behavior:
 DeepMind researchers provided the agents with rich and diverse environments.
The environments presented a spectrum of challenges at different levels of
difficulty.
 The agent was provided with difficulties in increasing order; this led the agent to
learn sophisticated locomotion skills without performing any reward engineering.
Simulated environments
 Since RL involves trial and error, it makes sense to train our RL agent first in a
simulated environment.
 A large number of applications exist that can be used for the creation of such an
environment.

SOME COMMON TYPES INCLUDE:


1. OpenAI gym:
 It contains a collection of environments that we can use to train our RL
agents.
 It is an open source toolkit to develop and compare RL algorithms.
 It contains a variety of simulated environments that can be used to train
agents and develop new RL algorithms.
 To start, you'll first have to install gym.
 It supports various environments, from simple text-based to three-
dimensional.
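For instance, a minimal interaction loop with a gym environment (a sketch assuming the classic gym API of the time, where reset returns the state and step returns four values; not the slides' original code) looks roughly like this:

```python
import gym

env = gym.make('Taxi-v2')          # any installed environment name works here
state = env.reset()
for _ in range(100):
    action = env.action_space.sample()            # a random action for now
    state, reward, done, info = env.step(action)  # advance the environment
    if done:
        state = env.reset()
env.close()
```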
Some common types include (cont.)
2. Unity ML-Agents SDK:
 It allows developers to transform games and simulations created using the Unity editor
into environments where intelligent agents can be trained using DRL, evolutionary
strategies, or other machine learning methods through a simple-to-use Python API.
 It works with TensorFlow.
 Provides the ability to train intelligent agents for two-dimensional/three-dimensional
and VR/AR games.

3. Gazebo:
 We can build three-dimensional worlds with physics-based simulation.
 Along with the Robot Operating System (ROS) and the OpenAI gym interface, it
forms gym-gazebo, which can be used to train RL agents.
Some common types include (cont.)
4. Blender learning environment:
 It's a Python interface for the Blender Game Engine, and it
also works over OpenAI gym.
 Its base is Blender, a free three-dimensional modeling software with an
integrated game engine; this provides an easy-to-use,
powerful set of tools for creating games.
 It provides an interface to the Blender Game Engine, and
the games themselves are designed in Blender.
 We can then create the custom virtual environment to train
an RL agent on a specific problem.
The environments supported in the latest version can
be grouped as follows:

1. Algorithms: It contains environments that involve performing computations,
such as addition. While we can easily perform the computations on a computer,
what makes these problems interesting as an RL problem is that the agent
learns these tasks purely by example.
2. Atari: This environment provides a wide variety of classical Atari/arcade
games.
3. Box2D: It contains robotics tasks in two dimensions, such as a car racing agent
or a bipedal robot walk.
4. Classic control: This contains classical control theory problems, such as
balancing a cart pole.
5. MuJoCo: This is proprietary (you can get a one-month free trial). It supports
various robot simulation tasks. The environment includes a physics engine;
hence, it's used for training robotic tasks.
6. Robotics: This environment too uses the physics engine of MuJoCo. It
simulates goal-based tasks for fetch and shadow-hand robots.
7. Toy text: It's a simple text-based environment—very good for beginners.
Q-learning
The goal of Q-learning is to learn an optimal action selection policy. Given a specific
state, s, and a specific action, a, Q-learning attempts to learn the value of taking
that action in that state. In its simplest version:

HOW ..?
1- Q-learning can be implemented with the help of look-up tables.
 We maintain a table of values for every state (row) and action (column) possible in
the environment. The algorithm attempts to learn the value—that is, how good it is
to take a particular action in the given state.
Q-learning(cont.)
2- We start by initializing all of the entries in the Q-table to 0; this ensures that all
state-action pairs have a uniform value (and hence an equal chance of being chosen). Later,
 we observe the rewards obtained by taking a particular action and, based on the
rewards, we update the Q-table.
Here, α is the learning rate. This shows the basic Q-learning algorithm:
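The update rule referred to here (the slide does not reproduce the formula; this is the standard tabular Q-learning update) is:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

where γ is the discount factor, r is the reward received, and s' is the resulting state.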
Q-learning(cont.)
At the end of learning,
we'll have a good Q-table with the optimal policy. An important question here is:

How do we choose the action at the second step?


There are two alternatives:
First: we choose the action randomly. This allows our agent to explore all of the
possible actions with equal probability but, at the same time, it ignores the
information it has already learned.

Second: we choose the action for which the value is maximum;
initially, all of the actions have the same Q-value but, as the agent learns,
some actions will acquire a high value and others a low value.
In this case, the agent is exploiting the knowledge it has already learned.
Q-learning(cont.)
Q: So what's better: exploration or exploitation?
This is called the exploration-exploitation trade-off. A natural way to solve this
problem is to rely mostly on what the agent has learned, but at the same time to
sometimes just explore. This is achieved via the use of the epsilon-greedy
algorithm.

The basic idea is that the agent chooses an action randomly with probability ε, and
exploits the information learned in previous episodes with probability (1-ε). The
algorithm chooses the best option (greedy) most of the time (1-ε) but
sometimes (ε) it makes a random choice.
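As an illustration, a minimal epsilon-greedy action selection for a Q-table might look like the following sketch (assuming a NumPy array Q indexed as Q[state, action]; this is not the slides' original code):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state]))           # exploit
```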
Let's now try to implement what we learned in a simple problem.
Taxi drop-off using Q-tables

The simple Q-learning algorithm involves maintaining a table of the size m×n,
where m is the total number of states and n the total number of possible
actions.
Therefore,
we choose a problem from the Toy text group, since its state space and
action space are small. For illustrative purposes, we choose the Taxi-v2
environment.
The goal of our agent is to pick up the passenger at one location and drop them
off at another. The agent receives +20 points for a successful drop-off and
loses 1 point for every time step it takes.
Taxi drop-off using Q-tables(cont.)
There's also a 10-point penalty for illegal pick-up and drop-off.
The state space has walls shown by | and four location marks, R, G, Y, and B.
The taxi is shown by a box; the pick-up and drop-off locations can be any of
these four location marks.

The pick-up point is colored blue, and the drop-off is colored purple. The Taxi-v2
environment has a state space of size 500 and action space of size 6, making a Q-
table with 500×6=3000 entries:
In the taxi drop-off environment, the taxi is denoted by the
yellow box. The location mark, R, is the pick-up position,
and G is the drop-off location:
Taxi drop-off using Q-tables(cont.)

1- We start by importing the necessary modules and creating our environment.
Since, here, we just need to make a look-up table, using TensorFlow won't be
necessary. As mentioned previously, the Taxi-v2 environment has 500 possible
states and 6 possible actions.

2- We initialize the Q-table of size (500×6) with all zeros, and define the
hyperparameters: γ, the discount factor, and α, the learning rate. We also set the
values for the maximum episodes (one episode means one complete run from reset to
done=True) and the maximum steps per episode the agent will learn for.

3- Now, for each episode, we choose the action with the highest value, perform
the action, and update the Q-table based on the received rewards (see the sketch
after this list).
4- Let's now see how the learned agent works:
The following diagram shows the agent's behavior in a particular example.
The empty car is shown as a yellow box, and the car with the passenger is shown
by a green box.
You can see that, in the given case, the agent picks up and drops off the passenger
in 11 steps; the pick-up location is marked (B) and the destination is marked
(R):
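A minimal sketch of the training loop described in steps 1-3 above, using the classic gym API and illustrative hyperparameter values (an assumption-based reconstruction, not the original code):

```python
import gym
import numpy as np

env = gym.make('Taxi-v2')
Q = np.zeros((env.observation_space.n, env.action_space.n))  # the 500 x 6 Q-table
alpha, gamma, epsilon = 0.5, 0.9, 0.1       # learning rate, discount, exploration
max_episodes, max_steps = 2000, 100

for episode in range(max_episodes):
    state = env.reset()
    for _ in range(max_steps):
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        new_state, reward, done, _ = env.step(action)
        # Q-learning update toward r + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[new_state])
                                     - Q[state, action])
        state = new_state
        if done:
            break
```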
Taxi drop-off using Q-tables(cont.)
Q-Network
The simple Q-learning algorithm involves maintaining a table of the size m×n, where
m is the total number of states and n the total number of possible actions.
This means we can't use it for large state and action spaces.

An alternative is to replace the table with a neural network acting as a function
approximator, approximating the Q-function for each possible action.
The weights of the neural network in this case store the Q-table information (they
match a given state with the corresponding action and its Q-value).

When the neural network that we use to approximate the Q-function is a deep neural
network, we call it a Deep Q-Network (DQN).
The neural network takes the state as its input and calculates the Q-value of all of the
possible actions.
Taxi drop-off using Q-Network
If we consider the preceding Taxi drop-off example, our neural network will consist of
500 input neurons (the state represented by 1×500 one-hot vector) and 6 output
neurons, each neuron representing the Q-value for the particular action for the given
state.
The neural network will here approximate the Q-value for each action.
Hence:
The network should be trained so that its approximated Q-value and the target Q-value
are the same.
We train the neural network so that the squared error of the difference between the
target Q and the predicted Q is minimized.
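Concretely (the slide doesn't show the equations; this is the standard formulation), the target and the loss take the form:

Qtarget = r + γ max_a' Q(s', a')
Loss = (Qtarget − Qpredicted)²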
Taxi drop-off using Q-Network
The aim is to learn the unknown Qtarget function. The weights of QNetwork are
updated using backpropagation so that this loss is minimized. We make the
neural network, QNetwork, approximate the Q-value. It's a very simple
single-layer neural network, with methods to provide actions and their Q-values
(get_action), train the network (learnQ), and get the predicted Q-value (Qnew):
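A hedged sketch of such a QNetwork, written here with Keras (the original code uses TensorFlow directly; the class and method names follow the slide, everything else is an illustrative assumption):

```python
import numpy as np
import tensorflow as tf

class QNetwork:
    def __init__(self, obs_size=500, action_size=6, learning_rate=0.001):
        # A single dense layer maps the one-hot state to one Q-value per action.
        self.model = tf.keras.Sequential([
            tf.keras.layers.Dense(action_size, input_shape=(obs_size,),
                                  use_bias=False)
        ])
        self.model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                           loss='mse')

    def Qnew(self, state):
        """Predicted Q-values for all actions, given a one-hot state vector."""
        return self.model.predict(state[np.newaxis, :], verbose=0)[0]

    def get_action(self, state):
        """Greedy action and the Q-values it was chosen from."""
        q = self.Qnew(state)
        return int(np.argmax(q)), q

    def learnQ(self, state, target_q):
        """One gradient step moving the prediction toward target_q."""
        self.model.fit(state[np.newaxis, :], target_q[np.newaxis, :], verbose=0)
```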
Taxi drop-off using Q-Network
We now incorporate this neural network in our earlier code where we
trained an RL agent for the Taxi drop-off problem.

We'll need to make some changes; first, the state returned by the OpenAI gym
step and reset functions in this case is just the numeric identification of the state,
so we need to convert it into a one-hot vector. Also, instead of a Q-table
update, we'll now get the new predicted Q from QNetwork, find the target
Q, and train the network so as to minimize the loss.

This should have done a good job but, as you can see, even after training
for 1,000 episodes, the network has a high negative reward, and if you
check the performance of the network:
Taxi drop-off using Q-Network

It appears to just take random steps. Yes, our network hasn't learned
anything; the performance is worse than the Q-table. This can also be
verified from the reward plot while training—ideally, the rewards should
increase as the agent learns, but nothing of the sort happens here; the
rewards simply fluctuate up and down.
Taxi drop-off using Q-Network
• What happened? Why is the neural network failing to learn,
and can we make it better?
Consider the scenario when the taxi should go west to pick up and,
randomly, the agent chose west; the agent gets a reward and the
network will learn that, in the present state (represented by a one-hot
vector), going west is favorable.
Next, consider another state similar to this one (correlated state
space):
The agent again makes the west move, but this time it results in a
negative reward, so now the agent will unlearn what it had learned
earlier.
Taxi drop-off using Q-Network

Hence, similar state-actions but divergent targets confuse the learning process.

This is called catastrophic forgetting. The problem arises here because
consecutive states are highly correlated and so, if the agent learns in
sequence (as it does here), this extremely correlated input state space
won't let the agent learn.
Taxi drop-off using Q-Network
Can we break the correlation between the input presented to the
network?
Yes, we can: we can construct a replay buffer, where
we first store each state, its corresponding action, and the consequent
reward and resultant state (state, action, reward, new state).
The actions, in this case, are chosen completely randomly,
thereby ensuring a wide range of actions and resultant states. The
replay buffer will finally consist of a large list of these tuples (S, A, R,
S').
Next,
we present the network with these tuples randomly (instead of
sequentially); this randomness will break the correlation between
consecutive input states. This is called experience replay.
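A minimal replay-buffer sketch (an illustrative assumption, not the original code) could look like this:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest tuples are discarded first

    def add(self, state, action, reward, new_state, done):
        self.buffer.append((state, action, reward, new_state, done))

    def sample(self, batch_size):
        """Return a random batch of (S, A, R, S', done) tuples, breaking the
        correlation between consecutive states."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```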
Taxi drop-off using Q-Network

It not only resolves the issues with correlation in input state space but
also allows us to learn from the same tuples more than once, recall rare
occurrences, and in general, make better use of the experience. In one
way, you can say that, by using a replay buffer, we've reduced the
problem to one of supervised learning (with the replay buffer as an input-
output dataset), where the random sampling of the input ensures that the
network is able to generalize.

Another problem with our approach is that we're updating the target Q
immediately. This too can cause harmful correlations. Remember that,
in Q-learning, we're trying to minimize the difference between the
Qtarget and the currently predicted Q.
Taxi drop-off using Q-Network
This difference is called a temporal difference (TD) error (and
hence Q-learning is a type of TD learning).
At present, we update our Qtarget immediately, hence there exists
a correlation between the target and the parameters we're changing
(weights through Qpred). This is almost like chasing a moving
target and hence won't give a generalized direction.

We can resolve the issue by using fixed Q-targets—that is, use two
networks, one for predicting Q and another for target Q. Both are
exactly the same in terms of architecture, with the predicting
QNetwork changing weights at each step, but the weights of the
target Q-Network are updated only after some fixed number of learning
steps. This provides a more stable learning environment.
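A hedged sketch of the fixed Q-target idea (the names and the update interval below are illustrative assumptions):

```python
import tensorflow as tf

def build_q_model(obs_size, action_size):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(obs_size,)),
        tf.keras.layers.Dense(action_size)
    ])

q_online = build_q_model(500, 6)     # updated at every training step
q_target = build_q_model(500, 6)     # frozen copy used to compute the target Q
q_target.set_weights(q_online.get_weights())

def maybe_sync(step, update_every=100):
    # Copy the online weights into the target network at fixed intervals only.
    if step % update_every == 0:
        q_target.set_weights(q_online.get_weights())
```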
DQN to play an Atari game:
Finally:

we make one more small change: right now our epsilon has had a fixed value throughout
learning.

But, in real life, this isn't so. Initially, when we know nothing, we explore a lot but, as we
become familiar, we tend to take the learned path. The same can be done in our epsilon-
greedy algorithm, by changing the value of epsilon as the network learns through each
episode, so that epsilon decreases with time. Equipped with these tricks, let's now build a
DQN to play an Atari game.
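One common way to decay epsilon per episode (the schedule below is an illustrative assumption, not the exact one used in the original code):

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
max_episodes = 1000

for episode in range(max_episodes):
    # ... run one episode with epsilon-greedy action selection ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)  # explore less over time
```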

DQN to play an Atari game:

At the heart of DQN is a deep convolutional neural network that takes as input the raw pixels
of the game environment (just like any human player would see), captured one screen at a
time, and as output, returns the value for each possible action. The action with the maximum
value is the chosen action:
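A hedged Keras sketch of such a network, following the widely used DQN layout for Atari (84×84 grayscale frames, 4 stacked); the exact architecture in the original material may differ:

```python
import tensorflow as tf

def build_dqn(n_actions, input_shape=(84, 84, 4)):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 8, strides=4, activation='relu',
                               input_shape=input_shape),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation='relu'),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(n_actions)   # one Q-value per possible action
    ])
```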
Double DQN
 When we use a max operator both to select an action
and to evaluate it, the result is overestimated
values for the actions.
 We can decouple the selection from the evaluation by using
Double DQN.
 In Double DQN, we have two Q-Networks with different
weights; both learn from random experience, but one is used to
determine the action using the epsilon-greedy policy and
the other to determine its value.
 This reduces the overestimation and helps us to train the
agent quickly and more reliably.
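In equation form (the standard Double DQN target, not reproduced on the slides):

Qtarget = r + γ Q_target(s', argmax_a' Q_online(s', a'))

so the online network selects the action while the target network evaluates it.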
Dueling DQN
 Dueling DQN decouples the Q-function into the value
function and advantage function.
 The value function represents the value of the state, independent of the action.
 The advantage function provides a relative measure of the utility (advantage/goodness) of
action a in that state.
 In Dueling DQN, the same convolutional network is used to extract
features, but it's separated into two separate streams in the later
stages, one providing the value and the other providing the
advantage.
 Later, the two stages are recombined using an aggregating
layer to estimate the Q-value.
 This ensures that the network produces separate estimates
for the value function and the advantage function.
Dueling DQN (cont.)

The basic architecture of Dueling DQN


Dueling DQN (cont.)
 The intuition behind this decoupling of value and
advantage is that it's unnecessary to estimate the value of
each action choice.
 For example, in the car race
• if there's no car in front, then the action turn left or turn
right is not required and so there's no need to estimate
the value of these actions on the given state.
 This allows it to learn which states are valuable, without
having to determine the effect of each action for each
state.
 At the aggregate layer, the value and advantage are
combined such that it's possible to recover both V and A
uniquely from a given Q.
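A common form of the aggregating layer (the slides don't show the equation; this is the standard Dueling DQN formulation) is:

Q(s, a) = V(s) + ( A(s, a) − mean_a' A(s, a') )

Subtracting the mean advantage is what makes V and A recoverable from a given Q.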
Policy gradients
 The neural network learns a policy for selecting the actions
that maximize the rewards by adjusting its weights using
steepest gradient ascent, hence the name policy gradients.
 The policy is represented by a neural network whose input is
a representation of states and whose output is action
selection probabilities.
 The weights of this network are the policy parameters that
we need to learn.
- We use a parameterized stochastic policy π to update
the weights of this network.
Policy gradients (cont.)
Why policy gradients?
 Firstly, we estimate the optimal policy directly, without needing
to store additional data.
- It's simple to implement.
 Secondly, we can train it to learn true stochastic policies.
 Finally, it's well suited for continuous action spaces.

- For example:
Pong using policy gradients.
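The underlying update (the standard REINFORCE-style policy gradient; the slides do not give the formula) is:

∇θ J(θ) = E[ ∇θ log πθ(a|s) · R ]

that is, the weights are moved in the direction that makes actions which led to high rewards more probable.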
The actor-critic algorithm
 The actor-critic method separates the policy evaluation from
the value evaluation.

 It consists of two neural networks:

- One approximating the policy,
called the actor network.
- The other approximating the value,
called the critic network.

Actor-critic architecture
The actor-critic algorithm (cont.)
 We alternate between a policy evaluation and a policy
improvement step, resulting in more stable learning.
 The critic uses the state and action values to estimate a
value function, which is then used to update the actor's policy
network parameters so that the overall performance
improves.
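In a typical advantage actor-critic update (the standard formulation, stated here as an assumption since the slides give no equations), the TD error δ = r + γ V(s') − V(s) plays both roles: the critic is trained to minimize δ², and the actor's weights are moved along ∇θ log πθ(a|s) · δ, so that actions with a positive advantage become more probable.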
