
UNIT IV REINFORCEMENT LEARNING

Introduction to Reinforcement Learning

Definition: Reinforcement learning is learning what to do, how to map situations to actions
so as to maximize a numerical reward signal. The learner is not told which actions to take,
but instead must discover which actions yield the most reward by trying them.

The two most important distinguishing features of reinforcement learning are trial-and-error search and delayed reward. RL, sometimes loosely described as a semi-supervised learning model in machine learning, is a technique that allows an agent to take actions and interact with an environment so as to maximize the total reward.

RL is usually modeled as a Markov Decision Process (MDP). Figure 4.1 shows the process of learning in RL as a Markov Decision Process. Based on the state of the environment, the agent takes an action. This action changes the state, and based on the outcome the environment gives a reward. Using the reward, the agent learns and adjusts its future actions accordingly. This learning continues until the required outcome is reached.

Fig 4.1: The agent-environment interaction in a Markov Decision Process
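To make this loop concrete, the following is a minimal sketch of the agent-environment interaction in Python. It is illustrative only: the env and agent objects, and their reset, step, select_action and learn methods, are hypothetical placeholders rather than the API of any particular library.

# A minimal sketch of the agent-environment loop of Fig 4.1 (illustrative only).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                              # initial state of the environment
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.select_action(state)          # agent chooses an action based on the state
        next_state, reward, done = env.step(action)  # environment reacts and returns a reward
        agent.learn(state, action, reward, next_state, done)  # agent updates itself from the reward
        total_reward += reward
        state = next_state
        if done:                                     # stop when the required outcome is reached
            break
    return total_reward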

4.1.1. Examples

Imagine a baby is given a TV remote control at your home (environment). In simple terms, the baby (agent) will first observe and construct his/her own representation of the environment (state). Then the curious baby will take certain actions like hitting the remote control (action) and observe how the TV responds (next state). As a non-responding TV is dull, the baby dislikes it (receiving a negative reward) and will take fewer actions that lead to such a result (updating the policy), and vice versa. The baby will repeat the process until he/she finds a policy (what to do under different circumstances) that he/she is happy with (maximizing the total (discounted) rewards).

Fig 4.2: Kid Learning to walk

A good way to understand reinforcement learning is to consider some of the examples and
possible applications that have guided its development.

● A kid learning to walk (Fig 4.2) is an example of reinforcement learning. The kid receives rewards as encouragement from the parents, and failures in the form of falls. By learning from these failures and rewards, the kid learns to walk.
● A master chess player makes a move. The choice is informed both by planning—
anticipating possible replies and counterreplies—and by immediate, intuitive judgments
of the desirability of particular positions and moves.
● An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time.
The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal
costs without sticking strictly to the set points originally suggested by engineers.
● A gazelle calf struggles to its feet minutes after being born. Half an hour later it is
running at 20 miles per hour.
● A mobile robot decides whether it should enter a new room in search of more trash to
collect or start trying to find its way back to its battery recharging station. It makes its
decision based on the current charge level of its battery and how quickly and easily it has
been able to find the recharger in the past.
● Phil prepares his breakfast. Closely examined, even this apparently mundane activity
reveals a complex web of conditional behavior and interlocking goal–subgoal
relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching
for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of
behavior are required to obtain a bowl, spoon, and milk jug. Each step involves a series
of eye movements to obtain information and to guide reaching and locomotion. Rapid
judgments are continually made about how to carry the objects or whether it is better to
ferry some of them to the dining table before obtaining others. Each step is guided by
goals, such as grasping a spoon or getting to the refrigerator, and is in service of other
goals, such as having the spoon to eat with once the cereal is prepared and ultimately
obtaining nourishment. Whether he is aware of it or not, Phil is accessing information
about the state of his body that determines his nutritional needs, level of hunger, and food
preferences.

These examples share features that are so basic that they are easy to overlook. All involve
interaction between an active decision-making agent and its environment, within which the agent
seeks to achieve a goal despite uncertainty about its environment. The agent’s actions are
permitted to affect the future state of the environment (e.g., the next chess position, the level of
reservoirs of the refinery, the robot’s next location and the future charge level of its battery),
thereby affecting the options and opportunities available to the agent at later times. Correct
choice requires taking into account indirect, delayed consequences of actions, and thus may
require foresight or planning. At the same time, in all these examples the effects of actions cannot
be fully predicted; thus the agent must monitor its environment frequently and react
appropriately. For example, Phil must watch the milk he pours into his cereal bowl to keep it
from overflowing. All these examples involve goals that are explicit in the sense that the agent
can judge progress toward its goal based on what it can sense directly. The chess player knows
whether or not he wins, the refinery controller knows how much petroleum is being produced,
the mobile robot knows when its batteries run down, and Phil knows whether or not he is
enjoying his breakfast.

In all of these examples the agent can use its experience to improve its performance over
time. The chess player refines the intuition he uses to evaluate positions, thereby improving his
play; the gazelle calf improves the efficiency with which it can run; Phil learns to streamline
making his breakfast. The knowledge the agent brings to the task at the start—either from
previous experience with related tasks or built into it by design or evolution—influences what is
useful or easy to learn, but interaction with the environment is essential for adjusting behavior to
exploit specific features of the task.

4.1.2. Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main sub-elements of a
reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a
model of the environment.
Fig.4.3: Example for Reinforcement Learning

- A policy defines the learning agent’s way of behaving at a given time. Roughly speaking,
a policy is a mapping from perceived states of the environment to actions to be taken
when in those states.
- A reward signal defines the goal of a reinforcement learning problem. On each time step,
the environment sends to the reinforcement learning agent a single number called the
reward. The agent’s sole objective is to maximize the total reward it receives over the
long run.
- Whereas the reward signal indicates what is good in an immediate sense, a value function
specifies what is good in the long run. Roughly speaking, the value of a state is the total
amount of reward an agent can expect to accumulate over the future, starting from that
state.
- The fourth and final element of some reinforcement learning systems is a model of the
environment. This is something that mimics the behavior of the environment, or more
generally, that allows inferences to be made about how the environment will behave. For
example, given a state and action, the model might predict the resultant next state and
next reward.

In the example given in Fig.4.3,

● Your cat is an agent that is exposed to the environment, in this case your house. An example of a state could be your cat sitting, while you use a specific word (a cue) for the cat to walk.
● The agent reacts by performing an action, causing a transition from one state to another; for example, your cat goes from sitting to walking.
● The reaction of an agent is an action, and the policy is a method of selecting an action given a state in the expectation of better outcomes.
● After the transition, the agent may get a reward or a penalty in return.

4.2. Game Playing

4.2.1. Deep Blue in Chess

Deep Blue was the first World Champion-class chess computer, addressing one of the oldest challenges in computer science. When World Chess Champion Garry Kasparov resigned the last game of a six-game match against IBM’s Deep Blue supercomputer on 11 May 1997, his loss marked Deep Blue’s achievement of its goal.

Deep Blue’s 1996 debut in the first Kasparov versus Deep Blue match in Philadelphia finally eclipsed Deep Thought II. The 1996 version of Deep Blue used a new chess chip designed at IBM Research over the course of three years. A major revision of this chip participated in the historic 1997 rematch between Kasparov and Deep Blue, in which the machine achieved its goal. Fig 4.4 shows IBM Deep Blue beating the chess champion Garry Kasparov.

Fig 4.4: IBM Deep Blue beats the chess champion Garry Kasparov

4.2.2. IBM Watson in Jeopardy


Watson is a question-answering computer system capable of answering questions
posed in natural language, developed in IBM's DeepQA project by a research team led by
principal investigator David Ferrucci.

Watson was the first computer to defeat TV game show Jeopardy! champions Ken Jennings and Brad Rutter. Research teams are working to adapt Watson to other information-intensive fields, such as telecommunications, financial services and government. Figure 4.5 shows the IBM Watson computer defeating its human opponents in the final.

Fig. 4.5:IBM Watson computer defeats human in final

4.2.3. Google’s Deep Mind in AlphaGo

AlphaGo is a computer program that plays the board game Go. It was developed by
DeepMind Technologies which was later acquired by Google. AlphaGo versus Lee Sedol, also
known as the Google DeepMind Challenge Match, was a five-game Go match between 18-
time world champion Lee Sedol and AlphaGo, a computer Go program developed by
Google DeepMind, played in Seoul, South Korea between the 9th and 15th of March 2016.

A later version, AlphaGo Zero, is able to learn by using a novel form of reinforcement learning in which the system becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games. Figure 4.6 shows Google DeepMind's AlphaGo versus Lee Sedol.
Fig 4.6:Google DeepMind AlphaGo versus Lee Sedol

This updated neural network is then recombined with the search algorithm to create a new,
stronger version of AlphaGo Zero, and the process begins again. In each iteration, the
performance of the system improves by a small amount, and the quality of the self-play games
increases, leading to more and more accurate neural networks and ever stronger versions of
AlphaGo Zero.

4.3. Agents and Environment

Reinforcement learning is a branch of machine learning where we have an agent and an environment. The environment is nothing but a task or simulation, and the agent is an AI algorithm that interacts with the environment and tries to solve it.

In Figure 4.7 below, the environment is a maze. The goal of the agent is to solve this maze by taking optimal actions.
Fig. 4.7:Agent and environment

It is clear from the diagram how the agent and environment interact with each other. The agent sends an action to the environment, and the environment sends an observation and a reward back to the agent after executing each action it receives. The observation is nothing but the internal state of the environment. Figure 4.7 shows this agent-environment scenario.

A reward Rt is a scalar feedback signal that indicates how well the agent is doing at time step t.
The agent’s job is to maximize the expected sum of rewards.
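In the standard discounted formulation, this objective can be written as the expected return, where the discount factor \gamma (with 0 \le \gamma \le 1) weights future rewards:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

and the agent seeks to maximize \mathbb{E}[G_t].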

So we need 2 things in order to apply reinforcement learning.

Agent: An AI algorithm.

Environment: A task/simulation which needs to be solved by the Agent.

An environment interacts with the agent by sending its state and a reward. Thus, the following are the steps to create an environment (a minimal skeleton is sketched after this list):

● Create a simulation.
● Add a state vector which represents the internal state of the simulation.
● Add a reward system into the simulation.
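As an illustration of these three steps, here is a minimal, hypothetical environment skeleton in Python. The class name, method signatures, state vector and reward values are assumptions made for this sketch, not a specific library API.

# Minimal environment skeleton illustrating the three steps above (illustrative only).
class Environment:
    def __init__(self):
        self.reset()

    def reset(self):
        # Step 1: create / reset the simulation and return the initial state.
        self.state = [0.0, 0.0]        # Step 2: state vector describing the simulation
        return self.state

    def step(self, action):
        # Apply the agent's action to the simulation and update the state vector.
        self.state = self._simulate(action)
        # Step 3: reward system scoring the outcome of the action.
        done = self._goal_reached()
        reward = 1.0 if done else -0.01
        return self.state, reward, done

    def _simulate(self, action):
        return self.state              # placeholder for the simulation dynamics

    def _goal_reached(self):
        return False                   # placeholder for the terminal condition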

4.3.1. Creating Environment

Consider another game, a simple paddle-and-ball game, where we have a paddle on the ground and the paddle needs to hit the moving ball. If the ball touches the ground instead of the paddle, that's a miss.

The inbuilt turtle module is used to implement it in Python. Turtle provides an easy and simple interface to build and move different shapes. Figure 4.8 shows the code for setting up the environment.

Figure 4.8: Code for setting up the environment

4.3.2. Code for creating background window

A blank window of size (600, 600) pixels is created. The midpoint coordinate of the window is (0, 0), so the ball can move up/down and left/right by up to 300 pixels.
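Since the code of Figure 4.8 is not reproduced here, the following is a minimal sketch of what such a setup typically looks like with the turtle module; the window title and colours are arbitrary choices made for this sketch.

import turtle

# Create a 600 x 600 window; the coordinate origin (0, 0) is at the centre,
# so objects can move roughly 300 pixels in each direction.
win = turtle.Screen()
win.title("Paddle-Ball Environment")
win.bgcolor("black")
win.setup(width=600, height=600)
win.tracer(0)   # turn off automatic animation; we call win.update() once per frame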

Fig.4.9. Blank Screen


Figure 4.9 shows the blank screen that is created. Thus, 10% of the environment is completed. Let's add a paddle at the bottom and a ball at the centre.

Fig.4.10: Code for adding paddle and ball
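The code of Figure 4.10 is not reproduced here either; a plausible turtle equivalent, continuing the setup sketch above, is shown below (shapes, sizes, colours and starting positions are assumptions).

# Paddle: a stretched square near the bottom of the window.
paddle = turtle.Turtle()
paddle.shape("square")
paddle.color("white")
paddle.shapesize(stretch_wid=1, stretch_len=5)   # roughly 20 x 100 pixels
paddle.penup()
paddle.goto(0, -250)

# Ball: a small circle starting near the centre.
ball = turtle.Turtle()
ball.shape("circle")
ball.color("red")
ball.penup()
ball.goto(0, 100)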

Fig.4.11: Paddle and Ball


Figure 4.10 shows the code for adding the paddle and ball into the environment, and Figure 4.11 shows the resulting look of the environment. Now 20% of the environment is complete. Let's add left and right paddle movement on pressing the left and right arrow keys.

Fig.4.12: Adding paddle movement

Figure 4.12 gives the code for adding paddle movement. Two functions are created to move the paddle left and right, and these functions are bound to the left and right arrow keys. This means that on pressing the right arrow key, the function paddle_right is called and the paddle moves to the right by 20 pixels.
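Continuing the sketch, the key bindings might look as follows; the function names follow the text, while the 20-pixel step size is taken from it and the clamping limits are assumptions.

# Move the paddle 20 pixels left or right, keeping it inside the window.
def paddle_left():
    x = paddle.xcor()
    if x > -250:
        paddle.setx(x - 20)

def paddle_right():
    x = paddle.xcor()
    if x < 250:
        paddle.setx(x + 20)

# Bind the functions to the left and right arrow keys.
win.listen()
win.onkeypress(paddle_left, "Left")
win.onkeypress(paddle_right, "Right")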

Fig.4.13: Environment with ball and paddle

Figure 4.13 shows the environment with ball and paddle. We have completed 30% of the
environment. Let’s add the ball movement now.
Fig.4.14: Code for adding ball movement

Figure 4.14 shows the code for the ball movement. The ball's horizontal velocity is set to 3 and its vertical velocity to -3, meaning the ball moves by 3 pixels horizontally and -3 pixels vertically after each frame. So, for each frame, we have to update the ball position in the main loop using its velocity.
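A sketch of this update, continuing the code above, is given below; the velocity attributes dx and dy are simply attached to the ball turtle for convenience.

# Velocity components: 3 pixels per frame horizontally, -3 vertically.
ball.dx = 3
ball.dy = -3

# Main loop: redraw the screen and update the ball position once per frame.
while True:
    win.update()
    ball.setx(ball.xcor() + ball.dx)
    ball.sety(ball.ycor() + ball.dy)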

Fig.4.15:Ball Movement

Now 40% is done. But wait, the ball just crossed the screen; it should have collided with the side walls. So we have to put the following boundary checks into the code.
● The ball should bounce off the upper and side walls.
● The ball should bounce off the paddle.
● If the ball touches the ground, the game should start again from the beginning.

Fig.4.16: Code for adding boundary checks


Figure 4.16 gives the code for adding the boundary checks, and Figure 4.17 shows the environment after adding the boundaries.
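A sketch of such checks, which would sit inside the main loop of the earlier sketch right after the ball position update, is given below; the exact pixel thresholds are assumptions.

    # Bounce off the left and right walls.
    if ball.xcor() > 290 or ball.xcor() < -290:
        ball.dx *= -1

    # Bounce off the top wall.
    if ball.ycor() > 290:
        ball.dy *= -1

    # Bounce off the paddle.
    if ball.ycor() < -230 and abs(ball.xcor() - paddle.xcor()) < 55:
        ball.dy *= -1

    # Ball touched the ground: restart from the centre.
    if ball.ycor() < -290:
        ball.goto(0, 100)
        ball.dy = -3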

Fig.4.17: Environment after adding boundary


Looks good. This is 70% done. Let's add the finishing touch: a scorecard.

Fig.4.18: Code for adding scorecard

Figure 4.18 gives the code for adding the scorecard, and the figure below shows the environment after adding it. We maintain two variables called hit and miss: if the ball hits the paddle we increment hit, otherwise we increment miss. We then create a scorecard that prints the score at the top middle of the screen.
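A sketch of the scorecard, continuing the earlier code, is shown below; the font and position are assumptions. The counters hit and miss would be incremented in the paddle-collision and ground-collision branches of the main loop, followed by a call to update_score().

hit, miss = 0, 0

# A hidden turtle used only for writing the score at the top of the screen.
scorecard = turtle.Turtle()
scorecard.hideturtle()
scorecard.color("white")
scorecard.penup()
scorecard.goto(0, 250)

def update_score():
    scorecard.clear()
    scorecard.write("Hit: {}   Miss: {}".format(hit, miss),
                    align="center", font=("Courier", 16, "normal"))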

Fig.4.18: Environment after adding Scorecard


We have now completed 90% of the environment. What is left is to add a state vector and the reward system to this simulation. The state vector contains the following information:
● Position of the paddle on the x-axis
● Position of the ball on the x- and y-axes
● Velocity of the ball on the x- and y-axes
Following is the reward system implemented:
● Give a reward of +3 if the ball touches the paddle.
● Give a reward of -3 if the ball misses the paddle.
● Give a reward of -0.1 each time the paddle moves, so that the paddle does not move unnecessarily.
The agent will choose one of the actions from the action space and send it to the environment. Following is the action space implemented:
● 0 - Move the paddle to the left.
● 1 - Do nothing.
● 2 - Move the paddle to the right.
The agent will send one of these numbers to the environment, and the environment performs the
action corresponding to that number. Figure 4.19 shows the code for adding reward and state
vector.

Fig.4.19:Code for adding reward and state vector

All of this is packed inside a tiny step function. This is the function through which the agent interacts with the environment: the agent calls this function and provides the action value as an argument, and the function returns the state vector and reward back to the agent. There is one more variable this function returns, called done, which tells the agent whether the episode has terminated or not. In our case the episode terminates when the ball touches the ground, and then a new episode starts. The code makes much more sense when you see it collectively.
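The code of Figure 4.19 is not reproduced here, so the sketch below, which continues the earlier sketches (win, ball, paddle, paddle_left, paddle_right, hit and miss), shows one plausible shape for such a step function; the thresholds and reward magnitudes follow the text, everything else is an assumption.

def step(action):
    # Apply one action, advance the simulation one frame and return
    # (state, reward, done) to the agent. Illustrative sketch only.
    global hit, miss
    reward, done = 0.0, False

    # Action space: 0 = move left, 1 = do nothing, 2 = move right.
    if action == 0:
        paddle_left()
        reward -= 0.1            # small penalty for moving the paddle
    elif action == 2:
        paddle_right()
        reward -= 0.1

    # Advance the ball by one frame (same update as the main loop).
    win.update()
    ball.setx(ball.xcor() + ball.dx)
    ball.sety(ball.ycor() + ball.dy)

    # Wall bounces.
    if ball.xcor() > 290 or ball.xcor() < -290:
        ball.dx *= -1
    if ball.ycor() > 290:
        ball.dy *= -1

    # Reward system: +3 for hitting the paddle, -3 for missing it.
    if ball.ycor() < -230 and abs(ball.xcor() - paddle.xcor()) < 55:
        ball.dy *= -1
        hit += 1
        reward += 3
    if ball.ycor() < -290:
        miss += 1
        reward -= 3
        done = True              # episode ends when the ball touches the ground
        ball.goto(0, 100)

    # State vector: paddle x, ball x and y, ball velocity in x and y.
    state = [paddle.xcor(), ball.xcor(), ball.ycor(), ball.dx, ball.dy]
    return state, reward, done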
4.5. Categorizing Reinforcement Learning Agents
● Value-Based Agent: The agent evaluates all the states in the state space, and the policy is implicit: the value function tells the agent how good each action is in a particular state, and the agent chooses the best one.
● Policy-Based Agent: Here, instead of representing a value function inside the agent, we explicitly represent the policy. The agent searches directly for the optimal policy, which enables it to act optimally.
● Actor-Critic Agent: This agent is both value-based and policy-based. It stores both a policy (the actor) and an estimate of how much reward it gets from each state (the critic).
● Model-Based Agent: The agent tries to build a model of how the environment works, and then uses that model to plan the best possible behavior.
● Model-Free Agent: The agent does not try to understand the environment, i.e. it does not try to build the dynamics. Instead it goes directly to the policy and/or value function: it only sees experience and tries to figure out a policy for behaving optimally so as to get the most possible reward.
4.6. Action-Value Function
A state-action value function is also called the Q function. It specifies how good it is for an agent to perform a particular action in a state under a policy π. The Q function is denoted by Q(s, a); it denotes the value of taking action a in state s while following policy π.
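In standard notation, with discount factor \gamma, the state-value and action-value functions of a policy \pi can be written as

v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right],
\qquad
q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right],

i.e. the expected return when starting from state s (and, for q, taking action a) and thereafter following \pi.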

Reward vs Value Function


A reward is immediate. The value function, by contrast, is an efficient way to determine the long-term value of being in a state. Denoted by V(s), this value function measures the potential future rewards we may get from being in state s.

Define the Value Function


Figure 4.20 shows an example of transitioning from one state to another and the value associated with each state.
Fig 4.20: State A leads to state B or C
In Figure 4.20, how do we determine the value of state A? There is a 50-50 chance of ending up in either of the next two possible states, state B or state C. The value of state A is simply the sum, over all next states, of the probability of reaching that state multiplied by the reward for reaching it; the value of state A is 0.5. In Figure 4.21, you find yourself in state D with only one possible route, to state E. Since state E gives a reward of 1, state D's value is also 1, since the only outcome is to receive that reward. If you are in state F (also in Figure 4.21), it can only lead to state G, followed by state H. Since state H has a negative reward of -1, state G's value will also be -1, and likewise for state F.

Fig 4.21: Example of change in states based on value function
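As a worked version of the Figure 4.20 case, and assuming (as the stated result of 0.5 suggests) that reaching state B yields a reward of 1 while reaching state C yields 0:

V(A) = P(B)\,R(B) + P(C)\,R(C) = 0.5 \times 1 + 0.5 \times 0 = 0.5

Similarly, for Figure 4.21, V(D) = 1.0 \times R(E) = 1, and V(F) = V(G) = R(H) = -1.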

Almost all reinforcement learning algorithms are based on estimating value functions: functions
of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state
(or how good it is to perform a given action in a given state). The notion of "how good" here is
defined in terms of future rewards that can be expected, or, to be precise, in terms of expected
return. Of course the rewards the agent can expect to receive in the future depend on what
actions it will take.

4.7. Deep Reinforcement Learning


Deep reinforcement learning is a category of machine learning and artificial
intelligence where intelligent machines can learn from their actions similar to the way
humans learn from experience.
Deep reinforcement learning combines artificial neural networks with a reinforcement
learning architecture that enables software-defined agents to learn the best actions possible in a
virtual environment in order to attain their goals. Fig.4.22 shows how Deep Reinforcement
Learning works.

Fig.4.22.:Deep Reinforcement Learning


That is, it unites function approximation and target optimization, mapping state-
action pairs to expected rewards.
Deep reinforcement learning (DRL) uses deep learning and reinforcement learning principles to create efficient algorithms applied in areas like robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.
Combining deep learning architectures (deep neural networks) with reinforcement learning algorithms (Q-learning, actor-critic, etc.) makes it possible to scale to previously unsolvable problems. That is because DRL is able to learn from raw sensor or image signals as input. A remarkable milestone is the Deep Q-Network (DQN), in which the agent uses end-to-end reinforcement learning with convolutional neural networks to play Atari games.
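To make the DQN idea concrete, the following is a minimal sketch of an Atari-style convolutional Q-network and its one-step TD target, written with PyTorch; it illustrates the general technique and is not the published DeepMind code.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a stack of 4 grayscale 84x84 frames to one Q-value per action.
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        return self.head(self.features(x))

def td_loss(q_net, target_net, batch, gamma=0.99):
    # One Q-learning update on a batch of transitions (s, a, r, s', done).
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a)
    with torch.no_grad():
        max_next = target_net(s_next).max(dim=1).values     # max_a' Q_target(s', a')
        target = r + gamma * (1.0 - done) * max_next        # TD target
    return nn.functional.smooth_l1_loss(q_sa, target)       # Huber-style loss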

Challenges with reinforcement learning


The main challenge in reinforcement learning lies in preparing the simulation environment, which is highly dependent on the task to be performed.
When the model has to achieve superhuman performance in chess, Go or Atari games, preparing the simulation environment is relatively simple.
When it comes to building a model capable of driving an autonomous car, however, building a realistic simulator is crucial before letting the car ride on the street.
❖ The model has to figure out how to brake or avoid a collision in a safe
environment, where sacrificing even a thousand cars comes at a minimal cost.
❖ Transferring the model out of the training environment and into the real world is
where things get tricky.
4.8. Applications
4.8.1. Robotics
What is robotics? Robotics is the engineering science and technology that involves the conception, design, operation and manufacture of robots. Electronics, mechanics and software are brought together by robotics. Robots are used for jobs that are dirty, dull and dangerous.
In robotics, the ultimate goal of reinforcement learning is to endow robots with the ability
to learn, improve, adapt and reproduce tasks with dynamically changing constraints based on
exploration and autonomous learning.
Reinforcement learning offers robotics a framework and set of tools for the design of
sophisticated and hard-to-engineer behaviors. Reinforcement learning (RL) enables a robot to
autonomously discover an optimal behavior through trial-and-error interactions with its
environment.
Instead of explicitly detailing the solution to a problem, in reinforcement learning the designer only specifies the task. In the reinforcement learning problem, an agent explores the space of possible strategies and receives feedback on the outcome of the choices made. From this information, a “good” – or ideally optimal – policy (i.e., strategy or controller) must be deduced.
Fig.4.23: Obelix driving autonomously on the campus

Fig.4.24: A Zebra Zero robot arm learned a peg-in-hole insertion task with a model-free
policy gradient approach

Robots learn novel behaviors through trial and error interactions. This unburdens the
human operator from having to pre-program accurate behaviors. This is particularly important as
we deploy robots in scenarios where the environment may not be known.
● Motion Control – machine learning helps robots with dynamic interaction and obstacle avoidance to maintain productivity.
● Data – AI and machine learning both help robots understand physical and logistical data patterns to be proactive and act accordingly.
Uses of Robotics: Robots are widely used in manufacturing, assembly, packing and
packaging, mining, transport, earth and space exploration, surgery, weaponry, laboratory
research, safety, and the mass production of consumer and industrial goods.
Examples of implementing reinforcement learning in robotics include aerial vehicles, robotic arms, autonomous vehicles, and humanoid robots. Figures 4.23 and 4.24 illustrate a small sample of robots with behaviors that were learned by reinforcement. Figure 4.23 shows the Obelix robot, which drives autonomously on a university campus in Germany. The OBELIX robot is a wheeled mobile robot that learned to push boxes with a value-function-based approach. Figure 4.24 shows a Zebra Zero robot arm that learned a peg-in-hole insertion task with a model-free policy gradient approach.
Reinforcement learning is generally a hard problem and many of its challenges are
particularly apparent in the robotics setting. As the states and actions of most robots are
inherently continuous, we are forced to consider the resolution at which they are represented.
We must decide how fine grained the control is that we require over the robot, whether
we employ discretization or function approximation, and what time step we establish.
Additionally, as the dimensionality of both states and actions can be high, we face the “Curse of Dimensionality”. As robotics deals with complex physical systems, samples can be expensive due to the long execution time of complete tasks, required manual interventions, and the need for maintenance and repair. In these real-world settings, we must cope with the uncertainty inherent in complex physical systems. A robot also requires that the algorithm run in real time; the algorithm must be capable of dealing with the delays in sensing and execution that are inherent in physical systems.

(a) Schematic drawings of the ball-in-a-cup motion

(b) Kinesthetic teach-in


(c) Final learned robot motion

Fig.4.25: This figure shows schematic drawings of the ball-in-a-cup motion (a), the final
learned robot motion (c), as well as a kinesthetic teach-in (b). The green arrows show the
directions of the current movements in that frame. The human cup motion was taught to
the robot by imitation learning. The robot manages to reproduce the imitated motion quite
accurately, but the ball misses the cup by several centimeters. After approximately 75
iterations of the Policy learning by Weighting Exploration with the Returns (PoWER)
algorithm the robot has improved its motion so that the ball regularly goes into the cup.

The children’s game ball-in-a-cup, also known as balero and bilboquet, is challenging
even for adults. The toy consists of a small cup held in one hand (in this case, it is attached to the
end-effector of the robot) and a small ball hanging on a string attached to the cup’s bottom (for
the employed toy, the string is 40cm long).
Initially, the ball is at rest, hanging down vertically. The player needs to move quickly to
induce motion in the ball through the string, toss the ball in the air, and catch it with the cup.
Figure 4.25 shows schematic drawings of the ball-in-a-cup motion. A possible movement is
illustrated in Figure 4.25a.
The state of the system can be described by joint angles and joint velocities of the robot
as well as the Cartesian coordinates and velocities of the ball (neglecting states that cannot be
observed straightforwardly like the state of the string or global room air movement). The actions
are the joint space accelerations, which are translated into torques by a fixed inverse dynamics
controller.
Thus, the reinforcement learning approach has to deal with twenty state and seven action
dimensions, making discretization infeasible. An obvious reward function would be a binary
return for the whole episode, depending on whether the ball was caught in the cup or not. In
order to give the reinforcement learning algorithm a notion of closeness, initially the
programmers used a reward function based solely on the minimal distance between the ball and
the cup. However, the algorithm exploited rewards resulting from hitting the cup with the
ball from below or from the side, as such behaviors are easier to achieve and yield comparatively
high rewards. To avoid such local optima, it was essential to find a good reward function that
contains the additional prior knowledge that getting the ball into the cup is only possible from
one direction.
The task exhibits some surprising complexity as the reward is not only affected by the
cup’s movements but foremost by the ball’s movements. As the ball’s movements are very sensitive to small perturbations, small changes in the initial conditions or in the arm movement will drastically affect the outcome. Creating an accurate simulation is hard due to the nonlinear,
unobservable dynamics of the string and its non-negligible weight.

A demonstration for imitation was obtained by recording the motions of a human player
performing kinesthetic teach-in as shown in Figure 4.25b. Kinesthetic teach-in means “taking the
robot by the hand”, performing the task by moving the robot while it is in gravity-compensation
mode, and recording the joint angles, velocities and accelerations. It requires a backdrivable
robot system that is similar enough to a human arm to not cause embodiment issues. Even with
demonstration, the resulting robot policy fails to catch the ball with the cup, leading to the need
for self improvement by reinforcement learning. Policy search methods are better suited for a
scenario like this, where the task is episodic, local optimization is sufficient (thanks to the initial
demonstration), and high dimensional, continuous states and actions need to be taken into
account. A single update step in a gradient based method usually requires as many episodes as
parameters to be optimized. Since the expected number of parameters was in the hundreds, a
different approach had to be taken because gradient based methods are impractical in this
scenario. Furthermore, the step-size parameter for gradient based methods often is a crucial
parameter that needs to be tuned to achieve good performance. Instead, an expectation-
maximization-inspired algorithm was employed that requires significantly fewer samples and has
no learning rate. Figure 4.25c gives the final learned robot motion.
Why is robotics important? Robotics technology influences every aspect of work and home. Robotics has the potential to positively transform lives and work practices, raise efficiency and safety levels, and provide enhanced levels of service. In many industries, robotics already underpins employment.

4.8.2. Gaming
Reinforcement learning and games have a long and mutually beneficial common history.
From one side, games are rich and challenging domains for testing reinforcement learning
algorithms. From the other side, in several games the best computer players use reinforcement
learning. Games are designed to entertain, amuse and challenge humans, so, by studying games,
we can (hopefully) learn about human intelligence, and the challenges that human intelligence
needs to solve. At the same time, games are challenging domains for RL algorithms as well,
probably for the same reason they are so for humans: they are designed to involve interesting
decisions.

4.8.3. Diagnostic systems


Unlike traditional supervised learning methods that usually rely on one-shot, exhaustive
and supervised reward signals, RL tackles sequential decision making problems with sampled,
evaluative and delayed feedback simultaneously.
Such distinctive features make RL a suitable candidate for developing
powerful solutions in a variety of healthcare domains, where diagnosing decisions or treatment
regimes are usually characterized by a prolonged and sequential procedure. RL approaches have
been successfully applied in a number of healthcare domains to date.
Broadly, these application domains can be categorized into three main types: dynamic
treatment regimes (DTR) in chronic disease or critical care, automated medical diagnosis,
and other general domains such as health resources allocation and scheduling, optimal process
control, drug discovery and development, as well as health management.
A DTR is composed of a sequence of decision rules to determine the course of actions
(e.g., treatment type, drug dosage, or reexamination timing) at a time point according to
the current health status and prior treatment history of an individual patient.
Unlike traditional randomized controlled trials that are mainly used as an evaluative tool
for confirming the efficacy of a newly developed treatment, DTRs are tailored for generating
new scientific hypotheses and developing optimal treatments across or within groups of patients.
An optimal DTR that is capable of optimizing the final clinical outcome of particular
interest can be derived. The design of DTRs can be viewed as a sequential decision making
problem that fits into the RL framework well. The series of decision rules in DTRs are
equivalent to the policies in RL, while the treatment outcomes are expressed by the reward
functions.
The inputs in DTRs are a set of clinical observations and assessments of patients, and the
outputs are the treatment options at each stage, equivalent to the states and actions in RL,
respectively. Apparently, applying RL methods to solve DTR problems demonstrates several
benefits. RL is capable of achieving time-dependent decisions on the best treatment for each
patient at each decision time, thus accounting for heterogeneity across patients. This precise
treatment can be achieved even without relying on the identification of any accurate
mathematical models or explicit relationship between treatments and outcomes.

Automatic medical diagnosis is a mapping process from a patient’s information, such as treatment history and current signs and symptoms, to an accurate classification of a disease. Being a
complex task, medical diagnosis often requires ample medical investigation on the clinical
situations, causing significant cognitive burden for clinicians to assimilate valuable information
from complex and diverse clinical reports.
It has been reported that diagnostic error accounts for as high as 10% of deaths and 17%
of adverse events in hospitals. The error-prone process in diagnosis and the necessity to assist the
clinicians for a better and more efficient decision making urgently call for a significant
revolution of the diagnostic process, leading to the advent of an automated diagnostic era that is fueled by advanced big data analysis and machine learning techniques.
Besides the above applications of RL in DTR design and automated medical diagnosis, there are many other applications in broader healthcare domains that focus on problems in health resource scheduling and allocation, optimal process control, drug discovery and development, as well as health management.

4.8.4. Virtual Assistants


What is a virtual Assistant?

- A virtual assistant (typically abbreviated to VA, also called a virtual office assistant) is
generally self-employed and provides professional administrative, technical, or creative
(social) assistance to clients remotely from a home office.
Tasks of Virtual Assistants - Typical tasks a virtual assistant might perform include
scheduling appointments, making phone calls, making travel arrangements, and managing
email accounts.

Skills required by virtual assistants

● Communication Skills. This is a crucial part of being a successful virtual assistant.
● Cloud-Based Knowledge. When working remotely you need comprehensive knowledge of how to share information and of the best process for you and your clients.
● Time Management Skills
● Take-Charge Attitude
● Organizational Skills

Examples - Three such applications are Siri on Apple devices (Fig.4.26), Cortana on Microsoft
Devices (Fig.4.27) and Google Assistant on Android devices (Fig.4.28).

Because virtual assistants are independent contractors rather than employees, clients are not
responsible for any employee-related taxes, insurance or benefits, except in the context that those
indirect expenses are included in the VA's fees.

Fig.4.26:Apple’s Siri
Fig.4.27: Microsoft’s Cortana

Fig.4.28:Google Assistant

Clients also avoid the logistical problem of providing extra office space, equipment or
supplies. Clients pay for 100% productive work, and can work with virtual assistants,
individually, or in multi-VA firms to meet their exact needs. Virtual assistants usually work for
other small businesses, but can also support busy executives. It is estimated that there are as few
as 5,000 to 10,000 or as many as 25,000 virtual assistants worldwide. The profession is growing
in centralized economies with "fly-in fly-out" staffing practice.
Common modes of communication and data delivery include the Internet, e-mail and
phone-call conferences, online work spaces, and fax machines. Increasingly virtual assistants are
utilizing technology such as Skype and Zoom, Slack, as well as Google Voice. Professionals in
this business work on a contractual basis and a long-lasting cooperation is standard.

In recent years virtual assistants have also worked their way into many mainstream businesses
and with the advent of VOIP services such as Skype and Zoom it has been possible to have a
virtual assistant who can answer your phone remotely without the end user's knowledge. This
allows many businesses to add a personal touch in the form of a receptionist without the
additional cost of hiring someone.

Virtual assistants include individuals as well as companies who work remotely as independent professionals, providing a wide range of products and services both to businesses as
well as consumers. Virtual assistants perform many different roles, including typical secretarial
work, website editing, social media marketing, customer service, data entry, accounts (MYOB,
Quickbooks) and many other remote tasks. The virtual industry has changed substantially as it
attracts others new to the field.

Virtual assistants come from a variety of business backgrounds, but most have several
years' experience earned in the "real" (non-virtual) business world, or several years' experience
working online or remotely.

A dedicated virtual assistant is someone working in the office under the management of a
company. The facility and internet connection as well as training are provided by the company,
however not in all cases.

A home-based virtual assistant works either in an office-sharing environment or from their own house. General VAs are sometimes called online administrative assistants, online personal assistants or online sales assistants. Virtual webmaster assistants, virtual marketing assistants and virtual content-writing assistants are specific professionals who are usually experienced employees from a corporate environment who have started to set up their own virtual offices.

