Reinforcement Learning For IoT - Final
Have you ever observed infants and how they learn to turn over, sit up, crawl, and even
stand?
Have you watched how baby birds learn to fly? The parents push them out of the nest,
they flutter for some time, and they slowly learn to fly.
Goal-oriented: All of the efforts are directed toward reaching a particular goal. The goal
for the human baby can be to crawl, and for the baby bird, to fly.
Interaction with the environment: The only feedback that they get is from the
environment.
Reinforcement learning (RL)
- The agent in RL isn't given any explicit instructions; it learns only from its
interaction with the environment.
- This interaction with the environment, as shown in the following diagram, is a
cyclic process.
Each interaction causes two things: first, a change in the state of the environment, and
second, a reward is generated (under ideal conditions).
Reinforcement learning (RL)
This cycle continues:
RL terminology
Let's consider two examples: an agent finding a route in a maze and an agent steering the
wheel of a Self-Driving Car (SDC). The two are illustrated in the following diagram:
RL terminology
- States s: The states can be thought of as a set of tokens (or representation) that
can define all of the possible states the environment can be in. The state can be
continuous or discrete.
- Actions a(s): Actions are the set of all possible things that the agent can do in a
particular state. The set of possible actions, a, depends on the present state, s.
Actions may or may not result in the change of state. They can be discrete or
continuous.
- Reward r(s, a, s'): It's a scalar value returned by the environment when the agent
selects an action. It defines the goal; the agent gets a higher reward if the action
brings it near the goal, and a low (or even negative) reward otherwise.
- Policy π(s): It defines a mapping between each state and the action to take in that
state. The policy can be deterministic, that is, for each state there is a well-defined
action to take.
The policy can also be stochastic, that is, each action is taken with
some probability. It can be implemented as a simple look-up table, or it can be a
function of the present state. The policy is the core of the RL agent.
RL terminology
- Value function V(s): It defines the goodness of a state in the long run. It can be
thought of as the total amount of reward the agent can expect to accumulate over
the future, starting from the state s.
There are two ways in which the value function is normally considered (both are written as equations after this terminology list):
1- State-value function Vπ(s): the goodness of state s when following the policy π.
2- Action-value function (or Q-function) Qπ(s, a): the goodness of taking action a in state s,
and thereafter following policy π.
- Model of the environment: It's an optional element. It mimics the behavior of the
environment, and it contains the physics of the environment; in other words, it
defines how the environment will behave. The model of the environment is
defined by the transition probability to the next state.
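In standard notation, with a discount factor γ (introduced later in the Q-learning slides), the two value functions above can be written as:

$$ V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big] $$
$$ Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big] $$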
Deep reinforcement learning
RL algorithms can be classified into two groups, based on what they iterate over/approximate:
1- Value-based methods: In these methods, the algorithms take the action that
maximizes the value function. The agent here learns to predict how good a given
state or action would be. Hence, here, the aim is to find the optimal value. An
example of the value-based method is Q-learning.
2- Policy-based methods: In these methods, the algorithms directly search for the
optimal policy, the one that maximizes the expected reward. An
example of a policy-based method is policy gradients. Here, we approximate
the policy function, which allows us to map each state to the best corresponding
action.
We can use neural networks as function approximators to get an approximate value of
either the policy or the value function.
When we use deep neural networks as the policy approximator or value approximator,
we call it deep reinforcement learning (DRL).
DRL has, in the recent past, given very successful results.
Some successful applications
1. AlphaGo Zero:
Developed by Google's DeepMind team (in the paper Mastering the game of Go
without human knowledge), AlphaGo Zero starts from an absolutely blank slate (tabula rasa).
It uses:
- A Monte Carlo Tree Search guided by the neural network to select the moves.
- One neural network to approximate both the move probabilities and the value.
This neural network takes as input the raw board representation.
The reinforcement learning algorithm incorporates look-ahead
search inside the training loop.
The neural network was optimized on Google Cloud using
TensorFlow, with 64 GPU workers and 19 CPU parameter servers.
Some successful applications (cont.)
2- AI-controlled sailplanes:
Microsoft developed a controller system that can run on many different autopilot
hardware platforms such as Pixhawk and Raspberry Pi 3.
It can keep the sailplane in the air without using a motor,
by autonomously finding and catching rides on naturally occurring thermals.
The controller helps the sailplane to operate on its own; it detects and uses
thermals to travel without the aid of a motor or a person.
They implemented it as a partially observable Markov decision process (MDP).
They employ Bayesian reinforcement learning and use Monte Carlo tree
search to search for the best action.
They've divided the whole system into two levels of planners: a high-level planner that
makes decisions based on experience, and a low-level planner that uses Bayesian
reinforcement learning to detect and latch onto thermals in real time.
Some successful applications (cont.)
AI-controlled sailplanes video
Some successful applications (cont.)
3. Locomotion behavior:
DeepMind researchers provided the agents with rich and diverse environments.
The environments presented a spectrum of challenges at different levels of
difficulty.
The agent was provided with difficulties in increasing order; this led the agent to
learn sophisticated locomotion skills without performing any reward engineering.
Simulated environments
Since RL involves trial and error, it makes sense to train our RL agent first in a
simulated environment.
While a large number of applications exist that can be used for the creation of an
environment, some common types include the following:
3. Gazebo:
We can build three-dimensional worlds with physics-based simulation.
Gazebo, along with the Robot Operating System (ROS) and the OpenAI gym interface,
forms gym-gazebo, which can be used to train RL agents.
Some common types include (cont.)
4. Blender learning environment:
It's a Python interface for the Blender Game Engine, and it
also works over OpenAI gym.
It has Blender at its base: free three-dimensional modeling software with an
integrated game engine. This provides an easy-to-use,
powerful set of tools for creating games.
It provides an interface to the Blender game engine, and
the games themselves are designed in Blender.
We can then create a custom virtual environment to train
an RL agent on a specific problem.
The environments supported in the latest version of OpenAI gym can
be grouped as follows: Algorithmic, Atari, Box2D, Classic control, MuJoCo, Robotics, and Toy text.
HOW ..?
1- Q-learning can be implemented with the help of look-up tables.
We maintain a table of values for every state (row) and action (column) possible in
the environment. The algorithm attempts to learn the value—that is, how good it is
to take a particular action in the given state.
Q-learning(cont.)
2- We start by initializing all of the entries in the Q-table to 0; this ensures that all actions
start with a uniform value (and hence an equal chance of being chosen). Later,
we observe the rewards obtained by taking a particular action and, based on the
rewards, we update the Q-table.
Here, α is the learning rate. This shows the basic Q-learning update:
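In standard notation, with γ the discount factor, the update described above is:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big] $$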
Q-learning(cont.)
At the end of learning,
we'll have a good Q-table with the optimal policy. An important question here is:
how should the agent choose an action while it's still learning?
The first way is to choose an action at random; in this case, the agent is exploring the environment.
The second way is to choose the action for which the value is maximum;
initially, all of the actions have the same Q-value but, as the agent learns,
some actions will get a high value and others a low value.
In this case, the agent is exploiting the knowledge it has already learned.
Q-learning(cont.)
Q: So what's better: exploration or exploitation?
This is called the exploration-exploitation trade-off. A natural way to solve this
problem is by mostly relying on what the agent has learned, but at the same time
sometimes just exploring. This is achieved via the epsilon-greedy (ε-greedy)
algorithm, sketched below.
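A minimal sketch of ε-greedy action selection (the helper name is illustrative; it assumes a NumPy Q-table indexed by state):

import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    # With probability epsilon, explore: pick a random action.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # Otherwise, exploit: pick the action with the highest Q-value.
    return int(np.argmax(Q[state]))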
The simple Q-learning algorithm involves maintaining a table of the size m×n,
where m is the total number of states and n the total number of possible
actions.
Therefore,
we choose a problem from the Toy-text group, since its state space and
action space are small. For illustrative purposes, we choose the Taxi-v2
environment.
The goal of our agent is to pick up the passenger at one location and drop them
off at another. The agent receives +20 points for a successful drop-off and
loses 1 point for every time step it takes.
Taxi drop-off using Q-tables(cont.)
There's also a 10-point penalty for illegal pick-up and drop-off.
The state space has walls shown by | and four location marks: R, G, Y, and B.
The taxi is shown by a box; the pick-up and drop-off locations can be any
of these four location marks.
The pick-up point is colored blue, and the drop-off is colored purple. The Taxi-v2
environment has a state space of size 500 and action space of size 6, making a Q-
table with 500×6=3000 entries:
In the taxi drop-off environment, the taxi is denoted by the
yellow box. The location mark, R, is the pick-up position,
and G is the drop-off location:
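A minimal sketch of creating and inspecting this environment (assuming the classic OpenAI gym API and the Taxi-v2 environment ID):

import gym

env = gym.make('Taxi-v2')      # Toy-text taxi environment
state = env.reset()            # returns the numeric state id
print(env.observation_space)   # Discrete(500)
print(env.action_space)        # Discrete(6)
env.render()                   # prints the text grid with the R, G, Y, B marks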
Taxi drop-off using Q-tables(cont.)
2- We initialize the Q-table of size (500×6) with all zeros, and define the
hyperparameters: γ, the discount factor, and α, the learning rate. We also set the
values for the maximum number of episodes (one episode means one complete run from reset to
done=True) and the maximum number of steps per episode that the agent will learn for, as sketched below:
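A sketch of this step (variable names and hyperparameter values are illustrative; env is the Taxi-v2 environment created earlier):

import numpy as np

n_states = env.observation_space.n    # 500
n_actions = env.action_space.n        # 6
Q = np.zeros((n_states, n_actions))   # Q-table, all entries start at zero

gamma = 0.9          # discount factor
alpha = 0.1          # learning rate
max_episodes = 2000  # complete runs from reset to done=True
max_steps = 100      # maximum steps per episode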
3- Now, for each episode, we choose the action with the highest Q-value, perform
the action, and update the Q-table based on the received rewards, as in the sketch below.
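A minimal sketch of that training loop, using the variables defined above (the tiny noise term is an illustrative way to break ties between equal Q-values early in training; the original code's exploration scheme may differ):

for episode in range(max_episodes):
    state = env.reset()
    for step in range(max_steps):
        # Pick the currently best action (small noise breaks ties while Q is still all zeros).
        action = int(np.argmax(Q[state] + np.random.randn(n_actions) * 1e-3))
        next_state, reward, done, info = env.step(action)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        if done:
            break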
4- Let's now see how the learned agent works:
The following diagram shows the agent behavior in a particular example.
The empty car is shown as a yellow box, and the car with the passenger is shown
by a green box.
You can see that, in the given case, the agent picks up and drops off the passenger
in 11 steps; the pick-up location is marked (B) and the destination is marked
(R):
Taxi drop-off using Q-tables(cont.)
Q-Network
The simple Q-learning algorithm involves maintaining a table of the size m×n, where
m is the total number of states and n the total number of possible actions.
This means we can't use it for large state spaces and action spaces.
Instead, we can use a neural network to approximate the Q-function. When that neural
network is a deep neural network, we call it a Deep Q-Network (DQN).
The neural network takes the state as its input and calculates the Q-value of all of the
possible actions.
Taxi drop-off using Q-Network
If we consider the preceding Taxi drop-off example, our neural network will consist of
500 input neurons (the state represented by 1×500 one-hot vector) and 6 output
neurons, each neuron representing the Q-value for the particular action for the given
state.
The neural network will here approximate the Q-value for each action.
Hence, the network should be trained so that its approximated Q-value and the target Q-value
are the same.
We train the neural network so that the squared error between the
target Q and the predicted Q is minimized.
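Concretely, for a transition (s, a, r, s'), the target and the loss are:

$$ Q_{\text{target}} = r + \gamma \max_{a'} Q(s', a') $$
$$ \text{Loss} = \big(Q_{\text{target}} - Q_{\text{pred}}(s, a)\big)^2 $$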
Taxi drop-off using Q-Network
The aim:
Define a QNetwork class with methods to get the action values (get_action), train the
network (learnQ), and get the predicted Q-value (Qnew):
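A minimal sketch of such a class (written here with Keras for illustration; the layer sizes and internals of the original implementation may differ):

import numpy as np
import tensorflow as tf

class QNetwork:
    def __init__(self, n_states, n_actions, lr=0.001):
        # One-hot state in, one Q-value per action out.
        self.model = tf.keras.Sequential([
            tf.keras.layers.Dense(10, activation='relu', input_shape=(n_states,)),
            tf.keras.layers.Dense(n_actions, activation='linear')
        ])
        self.model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(lr))

    def Qnew(self, state):
        # Predicted Q-values for all actions in the given one-hot state.
        return self.model.predict(state[np.newaxis, :], verbose=0)[0]

    def get_action(self, state):
        # Greedy action with respect to the current Q estimates.
        return int(np.argmax(self.Qnew(state)))

    def learnQ(self, state, target_q):
        # One training step pushing the predicted Q toward the target Q (MSE loss).
        self.model.fit(state[np.newaxis, :], target_q[np.newaxis, :], epochs=1, verbose=0)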
Taxi drop-off using Q-Network
We now incorporate this neural network in our earlier code where we
trained an RL agent for the Taxi drop-off problem.
We'll need to make some changes; first, the state returned by the OpenAI
step and reset functions in this case is just a numeric identifier of the state,
so we need to convert it into a one-hot vector. Also, instead of a Q-table
update, we'll now get the new predicted Q from the QNetwork, find the target
Q, and train the network so as to minimize the loss.
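A sketch of these changes (helper names are illustrative; net is an instance of the QNetwork sketched earlier):

import numpy as np

def one_hot(state_id, n_states=500):
    # Convert gym's integer state id into a one-hot vector.
    vec = np.zeros(n_states, dtype=np.float32)
    vec[state_id] = 1.0
    return vec

def q_network_update(net, state, action, reward, next_state, gamma=0.9):
    # Replaces the Q-table update inside the episode loop.
    state_vec = one_hot(state)
    target_q = net.Qnew(state_vec)                       # start from current predictions
    target_q[action] = reward + gamma * np.max(net.Qnew(one_hot(next_state)))
    net.learnQ(state_vec, target_q)                      # minimize (target - predicted)^2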
This should have done a good job but, as you can see, even after training
for 1,000 episodes, the network still has a high negative reward.
Taxi drop-off using Q-Network
If you check the performance of the network, it appears to just take random
steps. Yes, our network hasn't learned anything; the performance is worse
than the Q-table. This can also be verified from the reward plot during
training: ideally, the rewards should increase as the agent learns, but nothing
of the sort happens here; the rewards simply go up and down.
Taxi drop-off using Q-Network
• What happened? Why is the neural network failing to learn,
and can we make it better?
Consider the scenario when the taxi should go west to pick up and,
randomly, the agent chose west; the agent gets a reward and the
network will learn that, in the present state (represented by a one-hot
vector), going west is favorable.
Next, consider another state similar to this one (correlated state
space):
The agent again makes the west move, but this time it results in a
negative reward, so now the agent will unlearn what it had learned
earlier.
Taxi drop-off using Q-Network
The fix is to use an experience replay buffer: we store the agent's experience
tuples (state, action, reward, next state) and train the network on random
mini-batches sampled from this buffer.
It not only resolves the issue of correlation in the input state space but
also allows us to learn from the same tuples more than once, recall rare
occurrences and, in general, make better use of the experience. In a
way, by using a replay buffer, we've reduced the problem to one of
supervised learning (with the replay buffer as the input-output dataset),
where the random sampling of the input ensures that the network is able to generalize.
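A minimal sketch of such a buffer (class and parameter names are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # Old experiences are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)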
Another problem with our approach is that we're updating the target Q
immediately. This too can cause harmful correlations. Remember that,
in Q-learning, we're trying to minimize the difference between the
Qtarget and the currently predicted Q.
Taxi drop-off using Q-Network
This difference is called a temporal difference (TD) error (and
hence Q-learning is a type of TD learning).
At present, we update our Qtarget immediately, hence there exists
a correlation between the target and the parameters we're changing
(weights through Qpred). This is almost like chasing a moving
target and hence won't give a generalized direction.
We can resolve the issue by using fixed Q-targets—that is, use two
networks, one for predicting Q and another for target Q. Both are
exactly the same in terms of architecture, with the predicting
QNetwork changing its weights at each step, while the weights of the
target Q-Network are updated only after a fixed number of learning steps. This
provides a more stable learning environment.
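A sketch of the fixed Q-target idea (assuming the Keras-style QNetwork sketched earlier; the sync interval is illustrative):

# Two networks with identical architecture: one for prediction, one for targets.
q_net = QNetwork(n_states=500, n_actions=6)
target_net = QNetwork(n_states=500, n_actions=6)

UPDATE_EVERY = 100  # copy weights only after this many learning steps

def maybe_sync_target(step):
    # The target network lags behind, giving a stable target to chase.
    if step % UPDATE_EVERY == 0:
        target_net.model.set_weights(q_net.model.get_weights())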
DQN to play an Atari game:
Finally, we make one more small change: right now, our epsilon has a fixed value throughout
learning.
But, in real life, this isn't so. Initially, when we know nothing, we explore a lot but, as we
become familiar, we tend to take the learned path. The same can be done in our epsilon-
greedy algorithm, by changing the value of epsilon as the network learns through each
episode, so that epsilon decreases with time. Equipped with these tricks, let's now build a
DQN to play an Atari game.
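A common way to implement this epsilon decay per episode (the decay rate and bounds here are illustrative):

EPS_MIN, EPS_DECAY = 0.01, 0.995
epsilon = 1.0  # explore a lot at the start

def decay_epsilon(epsilon):
    # Gradually rely more on what has been learned, but never stop exploring entirely.
    return max(EPS_MIN, epsilon * EPS_DECAY)

# At the end of each episode: epsilon = decay_epsilon(epsilon)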
At the heart of DQN is a deep convolutional neural network that takes as input the raw pixels
of the game environment (just like any human player would see), captured one screen at a
time, and as output, returns the value for each possible action. The action with the maximum
value is the chosen action:
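A minimal sketch of such a network (layer sizes follow the widely used DQN setup of four stacked 84×84 grayscale frames; the original code's exact values may differ):

import tensorflow as tf

def build_atari_dqn(n_actions, input_shape=(84, 84, 4)):
    # Convolutional layers read the raw pixels; the dense head outputs one Q-value per action.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 8, strides=4, activation='relu', input_shape=input_shape),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation='relu'),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(n_actions, activation='linear')
    ])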
Double DQN
When we use a max operator both to select an action
and to evaluate an action, the result is overestimated
values for actions.
We can decouple the selection from the evaluation by using
Double DQN.
In Double DQN, we have two Q-Networks with different
weights; both learn from random experience, but one is used to
determine the action using the epsilon-greedy policy and
the other to determine its value.
This reduces the overestimation and helps us to train the
agent quickly and more reliably.
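One common formulation, with an online network (weights θ) and a target network (weights θ⁻), writes the Double DQN target for a transition (s, a, r, s') as:

$$ y = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^{-}\big) $$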
Dueling DQN
Dueling DQN decouples the Q-function into the value
function and the advantage function.
- The value function V(s) represents the value of the state, independent of the action.
- The advantage function A(s, a) provides a relative measure of the utility (advantage/goodness) of
action a in the state.
In Dueling DQN, the same convolutional network is used to extract
features, but it's separated into two streams in the later
stages: one providing the value and another providing the
advantage.
Later, the two stages are recombined using an aggregating
layer to estimate the Q-value.
This ensures that the network produces separate estimates
for the value function and the advantage function.
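A common choice for the aggregating layer combines the two streams as:

$$ Q(s, a) = V(s) + \Big( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \Big) $$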
Dueling DQN (cont.)
- For example:
Pong using policy gradients.
The actor-critic algorithm
The actor-critic method separates the policy (the actor) from
the value estimation (the critic).
Actor-critic architecture
The actor-critic algorithm (cont.)
We alternate between a policy evaluation and a policy
improvement step, resulting in more stable learning.
The critic uses the state and action values to estimate a
value function, which is then used to update the actor's policy
network parameters so that the overall performance
improves.
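For a simple one-step actor-critic, the updates for a transition (s, a, r, s') can be written as follows (ψ are the critic's parameters, θ the actor's, δ the TD error; this is the standard formulation, assumed here for illustration):

$$ \delta = r + \gamma V_{\psi}(s') - V_{\psi}(s) $$
$$ \psi \leftarrow \psi + \alpha_c \, \delta \, \nabla_{\psi} V_{\psi}(s) $$
$$ \theta \leftarrow \theta + \alpha_a \, \delta \, \nabla_{\theta} \log \pi_{\theta}(a \mid s) $$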