Reinforcement Learning Course Material
Definition: Reinforcement learning is learning what to do, how to map situations to actions
so as to maximize a numerical reward signal. The learner is not told which actions to take,
but instead must discover which actions yield the most reward by trying them.
The two most important distinguishing features of reinforcement learning are trial-and-error
search and delayed reward. RL is a distinct learning paradigm within machine learning (it is
neither purely supervised nor unsupervised); it is a technique that allows an agent to take actions
and interact with an environment so as to maximize the total reward.
RL is usually modeled as a Markov Decision Process (MDP). Figure 4.1 gives the
process of learning in RL using a Markov Decision Process. Based on the state of the environment,
the agent takes an action. This changes the state of the environment, and based on the outcome the
environment gives a reward. Using the reward, the agent learns and adjusts its future actions
accordingly. This learning continues until the required outcome is reached.
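This interaction loop can be sketched in a few lines of Python. The sketch below is only illustrative; the env and agent objects and their method names (reset, step, select_action, learn) are assumed placeholders rather than any particular library's API.

```python
# Illustrative sketch of the RL interaction loop described above.
# The env/agent interfaces are hypothetical placeholders.
def run_episode(env, agent):
    state = env.reset()                      # observe the initial state
    done = False
    total_reward = 0.0
    while not done:                          # repeat until the outcome is reached
        action = agent.select_action(state)  # act based on the current state
        next_state, reward, done = env.step(action)     # environment reacts
        agent.learn(state, action, reward, next_state)  # learn from the reward
        state = next_state
        total_reward += reward
    return total_reward
```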
4.1.1. Examples
A good way to understand reinforcement learning is to consider some of the examples and
possible applications that have guided its development.
● A kid learning to walk (Fig 4.2) is an example of reinforcement learning. Encouragement
from the parents acts as the reward, and falling down acts as the failure (penalty). By
learning from these failures and rewards, the kid learns to walk.
● Master chess player makes a move. The choice is informed both by planning—
anticipating possible replies and counterreplies—and by immediate, intuitive judgments
of the desirability of particular positions and moves.
● An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time.
The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal
costs without sticking strictly to the set points originally suggested by engineers.
● A gazelle calf struggles to its feet minutes after being born. Half an hour later it is
running at 20 miles per hour.
● A mobile robot decides whether it should enter a new room in search of more trash to
collect or start trying to find its way back to its battery recharging station. It makes its
decision based on the current charge level of its battery and how quickly and easily it has
been able to find the recharger in the past.
● Phil prepares his breakfast. Closely examined, even this apparently mundane activity
reveals a complex web of conditional behavior and interlocking goal–subgoal
relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching
for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of
behavior are required to obtain a bowl, spoon, and milk jug. Each step involves a series
of eye movements to obtain information and to guide reaching and locomotion. Rapid
judgments are continually made about how to carry the objects or whether it is better to
ferry some of them to the dining table before obtaining others. Each step is guided by
goals, such as grasping a spoon or getting to the refrigerator, and is in service of other
goals, such as having the spoon to eat with once the cereal is prepared and ultimately
obtaining nourishment. Whether he is aware of it or not, Phil is accessing information
about the state of his body that determines his nutritional needs, level of hunger, and food
preferences.
These examples share features that are so basic that they are easy to overlook. All involve
interaction between an active decision-making agent and its environment, within which the agent
seeks to achieve a goal despite uncertainty about its environment. The agent’s actions are
permitted to affect the future state of the environment (e.g., the next chess position, the level of
reservoirs of the refinery, the robot’s next location and the future charge level of its battery),
thereby affecting the options and opportunities available to the agent at later times. Correct
choice requires taking into account indirect, delayed consequences of actions, and thus may
require foresight or planning. At the same time, in all these examples the effects of actions cannot
be fully predicted; thus the agent must monitor its environment frequently and react
appropriately. For example, Phil must watch the milk he pours into his cereal bowl to keep it
from overflowing. All these examples involve goals that are explicit in the sense that the agent
can judge progress toward its goal based on what it can sense directly. The chess player knows
whether or not he wins, the refinery controller knows how much petroleum is being produced,
the mobile robot knows when its batteries run down, and Phil knows whether or not he is
enjoying his breakfast.
In all of these examples the agent can use its experience to improve its performance over
time. The chess player refines the intuition he uses to evaluate positions, thereby improving his
play; the gazelle calf improves the efficiency with which it can run; Phil learns to streamline
making his breakfast. The knowledge the agent brings to the task at the start—either from
previous experience with related tasks or built into it by design or evolution—influences what is
useful or easy to learn, but interaction with the environment is essential for adjusting behavior to
exploit specific features of the task.
Beyond the agent and the environment, one can identify four main sub-elements of a
reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a
model of the environment.
Fig.4.3: Example for Reinforcement Learning
- A policy defines the learning agent’s way of behaving at a given time. Roughly speaking,
a policy is a mapping from perceived states of the environment to actions to be taken
when in those states.
- A reward signal defines the goal of a reinforcement learning problem. On each time step,
the environment sends to the reinforcement learning agent a single number called the
reward. The agent’s sole objective is to maximize the total reward it receives over the
long run.
- Whereas the reward signal indicates what is good in an immediate sense, a value function
specifies what is good in the long run. Roughly speaking, the value of a state is the total
amount of reward an agent can expect to accumulate over the future, starting from that
state.
- The fourth and final element of some reinforcement learning systems is a model of the
environment. This is something that mimics the behavior of the environment, or more
generally, that allows inferences to be made about how the environment will behave. For
example, given a state and action, the model might predict the resultant next state and
next reward.
● Your cat is an agent that is exposed to the environment. In this case, it is your house. An
example of a state could be your cat sitting, and you use a specific word (a command) to
make the cat walk.
● Our agent reacts by performing an action, causing a transition from one "state" to another "state."
● For example, your cat goes from sitting to walking.
● The reaction of an agent is an action, and the policy is a method of selecting an action
given a state in expectation of better outcomes.
● After the transition, the agent may get a reward or a penalty in return.
Playing chess at the level of a World Champion is among the oldest challenges in
computer science, and Deep Blue was the first World Champion-class chess computer. When
World Chess Champion Garry Kasparov resigned the last game of a six-game match against
IBM’s Deep Blue supercomputer on 11 May 1997, his loss marked Deep Blue’s achievement of
its goal.
Deep Blue’s 1996 debut in the first Kasparov versus Deep Blue match in Philadelphia
finally eclipsed Deep Thought II. The 1996 version of Deep Blue used a new chess chip
designed at IBM Research over the course of three years. A major revision of this chip
participated in the historic 1997 rematch between Kasparov and Deep Blue, in which Deep Blue
achieved its goal. Figure 4.4 shows IBM Deep Blue beating the chess champion Garry Kasparov.
Fig 4.4: IBM DeepBlue beats the chess champion Garry Kasparov
IBM Watson was the first computer to defeat the TV game show Jeopardy! champions (Ken
Jennings and Brad Rutter). Research teams are working to adapt Watson to other information-intensive
fields, such as telecommunications, financial services and government. Figure 4.5 shows a
photo of the IBM Watson computer defeating the human champions in the final.
AlphaGo is a computer program that plays the board game Go. It was developed by
DeepMind Technologies which was later acquired by Google. AlphaGo versus Lee Sedol, also
known as the Google DeepMind Challenge Match, was a five-game Go match between 18-
time world champion Lee Sedol and AlphaGo, a computer Go program developed by
Google DeepMind, played in Seoul, South Korea between the 9th and 15th of March 2016.
A later version, AlphaGo Zero, learns entirely from self-play, without human data, by using a
novel form of reinforcement learning in which AlphaGo Zero becomes its own teacher. The
system starts off with a neural network that knows nothing about
the game of Go. It then plays games against itself, by combining this neural network with a
powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves,
as well as the eventual winner of the games. Figure 4.6 shows the photo for Google DeepMind
AlphaGo versus Lee Sedol
Fig 4.6: Google DeepMind AlphaGo versus Lee Sedol
This updated neural network is then recombined with the search algorithm to create a new,
stronger version of AlphaGo Zero, and the process begins again. In each iteration, the
performance of the system improves by a small amount, and the quality of the self-play games
increases, leading to more and more accurate neural networks and ever stronger versions of
AlphaGo Zero.
In the diagram below, the environment is the maze. The goal of the agent is to solve this maze by
taking optimal actions.
Fig. 4.7: Agent and environment
It is clear from the diagram how the agent and environment interact with each other. The agent
sends an action to the environment, and the environment sends an observation and a reward to the
agent after executing each action received from the agent. The observation is nothing but the
internal state of the environment. Figure 4.7 gives this scenario of agent and environment.
A reward Rt is a scalar feedback signal that indicates how well the agent is doing at time step t.
The agent’s job is to maximize the expected sum of rewards.
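In standard notation (following the Sutton and Barto reference; the discount factor γ is not introduced above and is stated here as the usual convention), the return the agent seeks to maximize is:

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad 0 \le \gamma \le 1,
```

and the agent's objective is to maximize the expected return E[G_t].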
Agent: An AI algorithm.
An environment interacts with the agent by sending its state and a reward. Thus, the following are
the steps to create an environment (a minimal code sketch follows the list).
● Create a Simulation.
● Add a State vector which represents the internal state of the Simulation.
● Add a Reward system into the Simulation.
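A minimal sketch of these three steps in Python could look like the following (the class name PaddleBallEnv, the contents of the state vector, and the reward values are assumptions made for illustration):

```python
# Illustrative environment skeleton following the three steps above.
class PaddleBallEnv:
    def __init__(self):
        self.reset()

    def reset(self):
        # Step 1: create the simulation (place the paddle and the ball).
        self.paddle_x = 0.0
        self.ball_x, self.ball_y = 0.0, 100.0
        self.ball_dx, self.ball_dy = 3.0, -3.0
        # Step 2: the state vector represents the internal state of the simulation.
        self.state = [self.paddle_x, self.ball_x, self.ball_y,
                      self.ball_dx, self.ball_dy]
        return self.state

    def step(self, action):
        # Step 3: the reward system (e.g. +1 for a hit, -1 for a miss, 0 otherwise).
        reward, done = 0.0, False
        # ... update the paddle and the ball here, set reward and done ...
        return self.state, reward, done
```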
Consider another game, a simple paddle-and-ball game, where we have a paddle on the
ground and the paddle needs to hit the moving ball. If the ball touches the ground instead of
the paddle, that is a miss.
The inbuilt turtle module is used to implement it in Python. Turtle provides an easy and simple
interface to build and move different shapes. Figure 4.8 shows the code for setting up the
environment.
A blank window of size (600, 600) pixels was created. The midpoint coordinate of the
window is (0,0). So the ball can go up-down and left-right by 300 pixels.
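The exact code of Figure 4.8 is not reproduced here, but a setup along these lines (window title, colours and variable names are assumptions) creates such a window with the turtle module:

```python
import turtle

# Create a 600 x 600 pixel window; its centre is the coordinate origin (0, 0),
# so the ball can move roughly 300 pixels up/down and left/right.
win = turtle.Screen()
win.title("Paddle and Ball")
win.bgcolor("black")
win.setup(width=600, height=600)
win.tracer(0)   # disable automatic animation; win.update() will redraw each frame
```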
Figure 4.12 gives the code for adding paddle movement. Two functions are created
to implement moving the paddle left and right. Then we bind these functions to the left and
right arrow keys. This means that when the right arrow key is pressed, the function paddle_right
is called and the paddle moves to the right by 20 pixels.
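A sketch of the paddle, the two movement functions, and the key bindings described above (continuing the setup sketch; the paddle size, its position near the bottom of the window, and the boundary check are assumptions):

```python
# Paddle: a stretched square near the bottom of the window.
paddle = turtle.Turtle()
paddle.shape("square")
paddle.shapesize(stretch_wid=1, stretch_len=5)
paddle.color("white")
paddle.penup()
paddle.goto(0, -275)

def paddle_right():
    # Move the paddle 20 pixels to the right, staying inside the window.
    x = paddle.xcor()
    if x < 250:
        paddle.setx(x + 20)

def paddle_left():
    # Move the paddle 20 pixels to the left, staying inside the window.
    x = paddle.xcor()
    if x > -250:
        paddle.setx(x - 20)

# Bind the functions to the right and left arrow keys.
win.listen()
win.onkeypress(paddle_right, "Right")
win.onkeypress(paddle_left, "Left")
```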
Figure 4.13 shows the environment with ball and paddle. We have completed 30% of the
environment. Let’s add the ball movement now.
Fig.4.14: Code for adding ball movement
Figure 4.14 shows the code for the ball movement. For the ball, the horizontal velocity is set to 3
and the vertical velocity to -3. This means the ball moves by 3 pixels horizontally and -3 pixels
vertically after each frame. So for each frame, we have to update the ball position in our main
loop using its velocity.
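A sketch of the ball and the main loop implied by this description (continuing the same sketch; the ball's starting position is an assumption):

```python
# Ball: a small circle that starts above the paddle.
ball = turtle.Turtle()
ball.shape("circle")
ball.color("red")
ball.penup()
ball.goto(0, 100)
ball.dx = 3    # horizontal velocity in pixels per frame
ball.dy = -3   # vertical velocity in pixels per frame (negative = downward)

while True:
    win.update()                       # redraw the current frame
    ball.setx(ball.xcor() + ball.dx)   # update the ball position by its velocity
    ball.sety(ball.ycor() + ball.dy)
```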
Fig. 4.15: Ball Movement
Now we are 40% done. But wait, the ball just crossed the screen; it should have collided with the
side walls. So we have to put the following boundary checks into the code (a code sketch follows
the list).
● The ball should collide with the upper and side walls.
● The ball should collide with the paddle.
● If the ball touches the ground, then the game should start again from the beginning.
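These checks could be written as follows inside the main loop, right after the ball position update (the exact thresholds are assumptions based on the 600 x 600 window and the paddle placed near the bottom):

```python
# Boundary checks inside the main loop (thresholds are illustrative).
if ball.xcor() > 290 or ball.xcor() < -290:   # side walls: reverse x velocity
    ball.dx *= -1
if ball.ycor() > 290:                          # upper wall: reverse y velocity
    ball.dy *= -1
if ball.ycor() < -265 and abs(ball.xcor() - paddle.xcor()) < 55:
    ball.dy *= -1                              # paddle hit: bounce back up
if ball.ycor() < -290:                         # the ball touched the ground
    ball.goto(0, 100)                          # restart from the beginning
    ball.dy = -3
```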
Figure 4.17 gives the code for adding the scorecard and Figure 4.18 gives the environment after
adding the scorecard. We maintain two variables called hit and miss. If the ball hits the paddle,
we increment hit; otherwise we increment miss. Then we can create a scorecard which prints the
score at the top middle of the screen.
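A scorecard along these lines can be drawn with a hidden turtle (the font and position are assumptions):

```python
# Scorecard written at the top middle of the screen.
hit, miss = 0, 0
scorecard = turtle.Turtle()
scorecard.hideturtle()
scorecard.color("white")
scorecard.penup()
scorecard.goto(0, 250)
scorecard.write("Hit: 0   Missed: 0", align="center",
                font=("Courier", 24, "normal"))

def update_score(hit, miss):
    # Erase the old text and write the new score.
    scorecard.clear()
    scorecard.write(f"Hit: {hit}   Missed: {miss}", align="center",
                    font=("Courier", 24, "normal"))
```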
All of this is packed inside a tiny step function. This is the function through which the
agent interacts with the environment. The agent calls this function and provides the action value
in the argument, and the function returns the state vector and the reward back to the agent.
There is one more variable this function returns, called done. This tells the agent
whether the episode has terminated or not. In our case the episode terminates when the ball
touches the ground, and then a new episode starts. The code will make much more sense
when you see it collectively.
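The step function itself could be sketched as below, tying the pieces above together (the action encoding and the reward values +3/-3 are assumptions for illustration):

```python
def step(action):
    """One interaction step: apply the action, advance the simulation by one
    frame, and return (state, reward, done). Assumed action encoding:
    0 = move left, 1 = stay, 2 = move right."""
    global hit, miss
    reward, done = 0.0, False

    if action == 0:
        paddle_left()
    elif action == 2:
        paddle_right()

    # Advance the ball by one frame.
    ball.setx(ball.xcor() + ball.dx)
    ball.sety(ball.ycor() + ball.dy)
    win.update()

    # Wall bounces.
    if ball.xcor() > 290 or ball.xcor() < -290:
        ball.dx *= -1
    if ball.ycor() > 290:
        ball.dy *= -1

    # Reward for hitting the ball with the paddle.
    if ball.ycor() < -265 and abs(ball.xcor() - paddle.xcor()) < 55:
        ball.dy *= -1
        hit += 1
        reward = 3.0
        update_score(hit, miss)

    # The episode terminates when the ball touches the ground.
    if ball.ycor() < -290:
        ball.goto(0, 100)
        ball.dy = -3
        miss += 1
        reward = -3.0
        done = True
        update_score(hit, miss)

    state = [paddle.xcor(), ball.xcor(), ball.ycor(), ball.dx, ball.dy]
    return state, reward, done
```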
4.5. Categorizing Reinforcement Learning Agents
● Value Based Agent: In this case, the agent evaluates all the states in the state space,
and the policy is implicit, i.e. the value function tells the agent how good each action
is in a particular state, and the agent chooses the best one.
● Policy Based Agent: Here, instead of representing a value function inside the
agent, we explicitly represent the policy. The agent searches directly for the optimal
policy, which in turn enables it to act optimally.
● Actor-Critic Agent: This agent combines the value-based and policy-based approaches.
It is an agent that stores both a policy (the actor) and an estimate of how much reward
it gets from each state (the critic).
● Model-Based Agent: Here, the agent tries to build a model of how the environment
works, and then plans to get the best possible behavior.
● Model-Free Agent: Here the agent doesn’t try to understand the environment, i.e. it
doesn’t try to build the environment’s dynamics. Instead, it goes directly for the policy
and/or value function: it just sees experience and tries to figure out a policy for behaving
optimally to get the most possible reward.
4.6. Action-Value Function
A state-action value function is also called the Q function. It specifies how good it is for
an agent to perform a particular action in a state with a policy π. The Q function is denoted by
Q(s). It denotes the value of taking an action in a state following a policy π.
Almost all reinforcement learning algorithms are based on estimating value functions--functions
of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state
(or how good it is to perform a given action in a given state). The notion of "how good" here is
defined in terms of future rewards that can be expected, or, to be precise, in terms of expected
return. Of course the rewards the agent can expect to receive in the future depend on what
actions it will take.
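In the standard notation of the Sutton and Barto reference, the action-value function for policy π is the expected return after taking action a in state s and following π thereafter:

```latex
q_\pi(s, a)
  \;=\; \mathbb{E}_\pi\!\left[\, G_t \;\middle|\; S_t = s,\ A_t = a \,\right]
  \;=\; \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s,\ A_t = a \right].
```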
Fig.4.24: A Zebra Zero robot arm learned a peg-in-hole insertion task with a model-free
policy gradient approach
Robots learn novel behaviors through trial and error interactions. This unburdens the
human operator from having to pre-program accurate behaviors. This is particularly important as
we deploy robots in scenarios where the environment may not be known.
Motion Control – machine learning helps robots with dynamic interaction and obstacle
avoidance to maintain productivity. Data – AI and machine learning both help robots
understand physical and logistical data patterns to be proactive and act accordingly.
Uses of Robotics: Robots are widely used in manufacturing, assembly, packing and
packaging, mining, transport, earth and space exploration, surgery, weaponry, laboratory
research, safety, and the mass production of consumer and industrial goods.
Examples of implementing reinforcement learning in robotics include aerial vehicles,
robotic arms, autonomous vehicles, and humanoid robots. Figures 4.23 and 4.24 illustrate a
small sample of robots with behaviors that were learned by reinforcement learning.
Figure 4.23 shows the OBELIX robot, which drives autonomously around the campus of a
university in Germany. The OBELIX robot is a wheeled mobile robot that learned to push boxes
with a value function-based approach. Fig 4.24 shows a Zebra Zero robot arm that learned a
peg-in-hole insertion task with a model-free policy gradient approach.
Reinforcement learning is generally a hard problem and many of its challenges are
particularly apparent in the robotics setting. As the states and actions of most robots are
inherently continuous, we are forced to consider the resolution at which they are represented.
We must decide how fine grained the control is that we require over the robot, whether
we employ discretization or function approximation, and what time step we establish.
Additionally, as the dimensionality of both states and actions can be high, we face the “Curse of
Dimensionality”. As robotics deals with complex physical systems, samples can be expensive due
to the long execution time of complete tasks, required manual interventions, need for
maintenance and repair. In these real-world measurements, we must cope with the uncertainty
inherent in complex physical systems. A robot requires that the algorithm runs in real-time. The
algorithm must be capable of dealing with delays in sensing and execution that are inherent in
physical systems.
Fig.4.25: This figure shows schematic drawings of the ball-in-a-cup motion (a), the final
learned robot motion (c), as well as a kinesthetic teach-in (b). The green arrows show the
directions of the current movements in that frame. The human cup motion was taught to
the robot by imitation learning. The robot manages to reproduce the imitated motion quite
accurately, but the ball misses the cup by several centimeters. After approximately 75
iterations of the Policy learning by Weighting Exploration with the Returns (PoWER)
algorithm the robot has improved its motion so that the ball regularly goes into the cup.
The children’s game ball-in-a-cup, also known as balero and bilboquet, is challenging
even for adults. The toy consists of a small cup held in one hand (in this case, it is attached to the
end-effector of the robot) and a small ball hanging on a string attached to the cup’s bottom (for
the employed toy, the string is 40cm long).
Initially, the ball is at rest, hanging down vertically. The player needs to move quickly to
induce motion in the ball through the string, toss the ball in the air, and catch it with the cup.
Figure 4.25 shows schematic drawings of the ball-in-a-cup motion. A possible movement is
illustrated in Figure 4.25a.
The state of the system can be described by joint angles and joint velocities of the robot
as well as the Cartesian coordinates and velocities of the ball (neglecting states that cannot be
observed straightforwardly like the state of the string or global room air movement). The actions
are the joint space accelerations, which are translated into torques by a fixed inverse dynamics
controller.
Thus, the reinforcement learning approach has to deal with twenty state and seven action
dimensions, making discretization infeasible. An obvious reward function would be a binary
return for the whole episode, depending on whether the ball was caught in the cup or not. In
order to give the reinforcement learning algorithm a notion of closeness, initially the
programmers used a reward function based solely on the minimal distance between the ball and
the cup. However, the algorithm exploited rewards resulting from hitting the cup with the
ball from below or from the side, as such behaviors are easier to achieve and yield comparatively
high rewards. To avoid such local optima, it was essential to find a good reward function that
contains the additional prior knowledge that getting the ball into the cup is only possible from
one direction.
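As an illustration only (not the exact reward used in the original experiment), a shaped reward with this prior knowledge can pay out only at the moment the ball crosses the plane of the cup opening while moving downward, decaying with the horizontal distance to the cup centre:

```python
import numpy as np

def episode_reward(ball_traj, cup_traj, length_scale=0.04):
    """Illustrative shaped reward for ball-in-a-cup.
    ball_traj, cup_traj: arrays of shape (T, 3) with x, y, z positions over time.
    The reward is non-zero only when the ball crosses the cup-opening plane
    from above, so hitting the cup from below or from the side earns nothing."""
    for t in range(1, len(ball_traj)):
        bx, by, bz = ball_traj[t]
        cx, cy, cz = cup_traj[t]
        moving_down = bz < ball_traj[t - 1][2]
        crosses_rim_plane = ball_traj[t - 1][2] >= cz > bz
        if moving_down and crosses_rim_plane:
            # Closeness-based reward: 1.0 for a centred catch, decaying with
            # the horizontal distance between ball and cup centre.
            dist_sq = (bx - cx) ** 2 + (by - cy) ** 2
            return float(np.exp(-dist_sq / length_scale ** 2))
    return 0.0  # the ball never entered from above during this episode
```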
The task exhibits some surprising complexity as the reward is not only affected by the
cup’s movements but foremost by the ball’s movements. As the ball’s movements are very
sensitive, small perturbations of the initial conditions or small changes of the arm movement will
drastically affect the outcome. Creating an accurate simulation is hard due to the nonlinear,
unobservable dynamics of the string and its non-negligible weight.
A demonstration for imitation was obtained by recording the motions of a human player
performing kinesthetic teach-in as shown in Figure 4.25b. Kinesthetic teach-in means “taking the
robot by the hand”, performing the task by moving the robot while it is in gravity-compensation
mode, and recording the joint angles, velocities and accelerations. It requires a backdrivable
robot system that is similar enough to a human arm to not cause embodiment issues. Even with
demonstration, the resulting robot policy fails to catch the ball with the cup, leading to the need
for self-improvement by reinforcement learning. Policy search methods are better suited for a
scenario like this, where the task is episodic, local optimization is sufficient (thanks to the initial
demonstration), and high dimensional, continuous states and actions need to be taken into
account. A single update step in a gradient based method usually requires as many episodes as
parameters to be optimized. Since the expected number of parameters was in the hundreds, a
different approach had to be taken because gradient based methods are impractical in this
scenario. Furthermore, the step-size parameter for gradient based methods often is a crucial
parameter that needs to be tuned to achieve good performance. Instead, an expectation-
maximization-inspired algorithm was employed that requires significantly fewer samples and has
no learning rate. Figure 4.25c gives the final learned robot motion.
Why is Robotics important? - Robotics technology influences every aspect of work and
home. Robotics has the potential to positively transform lives and work practices, raise
efficiency and safety levels and provide enhanced levels of service. In these industries robotics
already underpins employment.
4.7.2. Gaming
Reinforcement learning and games have a long and mutually beneficial common history.
From one side, games are rich and challenging domains for testing reinforcement learning
algorithms. From the other side, in several games the best computer players use reinforcement
learning. Games are designed to entertain, amuse and challenge humans, so, by studying games,
we can (hopefully) learn about human intelligence, and the challenges that human intelligence
needs to solve. At the same time, games are challenging domains for RL algorithms as well,
probably for the same reason they are so for humans: they are designed to involve interesting
decisions.
- A virtual assistant (typically abbreviated to VA, also called a virtual office assistant) is
generally self-employed and provides professional administrative, technical, or creative
(social) assistance to clients remotely from a home office.
Tasks of Virtual Assistants - Typical tasks a virtual assistant might perform include
scheduling appointments, making phone calls, making travel arrangements, and managing
email accounts.
Examples - Three such applications are Siri on Apple devices (Fig.4.26), Cortana on Microsoft
Devices (Fig.4.27) and Google Assistant on Android devices (Fig.4.28).
Because virtual assistants are independent contractors rather than employees, clients are not
responsible for any employee-related taxes, insurance or benefits, except in the context that those
indirect expenses are included in the VA's fees.
Fig. 4.26: Apple’s Siri
Fig. 4.27: Microsoft’s Cortana
Fig. 4.28: Google Assistant
Clients also avoid the logistical problem of providing extra office space, equipment or
supplies. Clients pay for 100% productive work, and can work with virtual assistants,
individually, or in multi-VA firms to meet their exact needs. Virtual assistants usually work for
other small businesses, but can also support busy executives. It is estimated that there are as few
as 5,000 to 10,000 or as many as 25,000 virtual assistants worldwide. The profession is growing
in centralized economies with "fly-in fly-out" staffing practice.
Common modes of communication and data delivery include the Internet, e-mail and
phone-call conferences, online work spaces, and fax machines. Increasingly virtual assistants are
utilizing technology such as Skype and Zoom, Slack, as well as Google Voice. Professionals in
this business work on a contractual basis and a long-lasting cooperation is standard.
In recent years virtual assistants have also worked their way into many mainstream businesses
and with the advent of VOIP services such as Skype and Zoom it has been possible to have a
virtual assistant who can answer your phone remotely without the end user's knowledge. This
allows many businesses to add a personal touch in the form of a receptionist without the
additional cost of hiring someone.
Virtual assistants come from a variety of business backgrounds, but most have several
years' experience earned in the "real" (non-virtual) business world, or several years' experience
working online or remotely.
A dedicated virtual assistant is someone working in an office under the management of a
company. The facility, the internet connection, and training are provided by the company,
though not in all cases.
The home-based virtual assistant works either in an office-sharing environment or from
their own house. The general VA is sometimes called an online administrative assistant, online
personal assistant or online sales assistant. A virtual webmaster assistant, a virtual marketing
assistant and virtual content writing assistant are specific professionals that are usually
experienced employees from a corporate environment that started to set up their own virtual
offices.
References
1. Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction,”
Second Edition, The MIT Press, 2018
2. http://www.csis.pace.edu/~ctappert/dps/pdf/ai-chess-deep.pdf
3. https://researcher.watson.ibm.com/researcher/files/us-heq/W(3)%20INTRODUCTION%2006177724.pdf
4. https://deepmind.com/blog/article/alphago-zero-starting-scratch
5. https://www.ias.informatik.tu-darmstadt.de/uploads/Publications/Kober_IJRR_2013.pdf
6. https://en.wikipedia.org/wiki/Virtual_assistant