AI in Robots

BY

MOHAMMED E. EL-TELBANY
Electronics Research Institute
Computers and Systems Department
TABLE OF CONTENTS
Table of Contents
Preface
Chapter I
Intelligence in Autonomous Robots
1.1 Mobile Robots
1.2 Mobile Robots Applications
1.3 Intelligence in Robotics
1.4 Mobile Robot Sensors
1.4.1 Tactile Sensors
1.4.2 Infrared Sensors
1.4.3 Sonar Sensors
1.4.4 Compass Sensors
1.4.5 Shaft Encoder
1.4.6 Vision Sensors
1.4.7 Sensors Fusion
1.5 Mobile Robot Actuators
1.6 Assignments
Chapter II
Behavior-Based Control
2.1 Behavior-Based Robotics
2.2 Subsumption Control Architecture
2.3 Potential Fields
2.4 Schema-Based Reactive Control
2.5 Temporal Sequencing
2.6 Motor Schema Versus Subsumption Architecture
2.7 Other Architectures
2.7.1 Colony Architecture
2.7.2 DAMN Architecture
2.8 Deliberative Control
2.9 Inspiration Sources of Behaviours
2.10 Assignments
Chapter III
Behaviors Design and Programming
3.1 Behavior Encoding
3.2 Concurrent Behaviors Combination
3.3 Subsumption Architecture Behaviors Programming
3.4 Behaviors and Schema Theory
PREFACE
These notes discuss the artificial intelligence approaches implemented in the field of mobile robots. The notes were motivated by the desire to educate students interested in mobile robots and artificial intelligence. They attempt to provide both a foundation and an appreciation for mobile robotics, while at the same time putting newer trends in machine learning techniques in perspective, especially reinforcement learning.
Mohammed E. El-Telbany
CHAPTER I
INTELLIGENCE IN AUTONOMOUS ROBOTS
presented, with examples of the kind of readings that can be obtained from them and their most common applications.
For example, in industry, many different transport systems are established. Besides conveyors and forklifts, other automated transporters called AGVs (Automated Guided Vehicles) carry parts to different stations in production halls. Figure 1.1 shows such a system where the AGVs act automatically but not autonomously. Usually, they follow marked paths to find their goal points with high accuracy, but this precision comes at the cost of autonomy. If one looks for autonomous systems that can perform transportation tasks, very often behavior-based systems can be found. Figure 1.2 shows another robot with a manipulator arm and a payload of about 100 kg. Again, this system is controlled with a behavior-based approach.
Figure 1.2: A mobile robot with arm and a high payload of about 100 kg.
Figure 1.3 shows a service robot as a last example. Such a service task does not require millimeter accuracy. In fact, these systems need a reactive behavior of the kind "the cup is not above the table yet, so move a bit forward".
Figure 1.4: Tactile sensors. (a) A microswitch being used as a short-range digital touch sensor. (b) A whisker and microswitch used as a long-range touch sensor.
time is measured. As the speed of sound is known the distance to the object that reflected
the pulse can be computed from the time that elapsed between emission and reception,
using Equation 1.1.
d = (1/2) v t,        (1.1)

with d being the distance to the nearest object within the sonar cone, v the speed of the emitted signal in the medium (i.e., 344 m/s for sound in air at 20 °C), and t the time elapsed between emission of a pulse and the reception of its echo, as shown in Figure 1.5.
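As a simple illustration of Equation (1.1), the following Java sketch converts a measured round-trip echo time into a distance estimate. The class and method names are illustrative only and are not taken from any particular robot library.

public class SonarRange {
    // Speed of sound in air at about 20 °C, in metres per second.
    static final double SPEED_OF_SOUND = 344.0;

    // Equation (1.1): d = (1/2) * v * t, with t the round-trip echo time in seconds.
    static double distanceFromEchoTime(double echoTimeSeconds) {
        return 0.5 * SPEED_OF_SOUND * echoTimeSeconds;
    }

    public static void main(String[] args) {
        double t = 0.0058; // an echo received after 5.8 ms
        System.out.printf("Estimated distance: %.3f m%n", distanceFromEchoTime(t));
        // prints roughly 0.998 m
    }
}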
There are a number of uncertainties associated with readings from sonar sensors. Firstly, the sensitivity of sonar is cone-shaped as shown in Figure 1.6, and an object detected at distance d could be anywhere within the sonar cone, on an arc at distance d from the robot. So the accuracy of object position measurement is a function of the width of the sonar beam pattern. Secondly, objects with flat surfaces and sharp edges reflect very little echo in most directions and will probably not be noticed; this is called the specular reflection problem. Finally, after being reflected from one surface, the sound may strike another surface and be reflected back to the sensor. The time delay will then not correspond to a physical object, but to a ghost object.
Edge detection is the process of finding the boundaries between areas of different light intensity in the image.
Motion detection is the process of detecting things that are moving in the field of
vision. You can detect motion by comparing two edge-detection results. If an edge has
moved or a new edge has appeared since the previous edge-detection process was
performed, motion has occurred.
Object line detection is the process of connecting detected edges into lines.
Object shape detection is the process of assembling detected lines to see if they form
some known shape.
Image understanding, the final step of vision processing, is where the system assesses whether it is seeing the right things. Color, depth, and texture provide additional information about a viewed scene.
Computer vision requires knowledge of the field of computer graphics and image
processing and will not be addressed in this report.
1.6 Assignments
1. Write a report on the impact of mobile robots on society. (Hint: explore the applications of mobile robots in industry, agriculture, services, etc., by searching the Internet.)
2. What is an intelligent robot?
3. Write a report that surveys the methods and techniques used in sensor fusion.
4. For the Electronics Research Institute mobile robot platform, obtain a copy of the system documentation. How is it powered, recharged, and controlled? Learn how to power the robot on and off and how to control it via software.
CHAPTER II
BEHAVIOR-BASED CONTROL
Building a mobile robot that shows intelligent behavior while executing a complex task is an interesting research challenge. At the same time, there has been an upsurge of interest in learning in autonomous mobile robots within the behavior-based paradigm. This approach concentrates on physical systems situated in the real world and promotes simple associative learning between sensing and acting. Behavior-based control has served as an effective methodology for autonomous mobile robot control and learning in a large number of autonomous robot problem domains. This chapter presents an overview of behavior-based control methods and compares them to show their pros and cons.
Control systems using this architecture solve their task in several steps. First, the sensor input is used to modify the internal representation of the environment. Second, planning is performed based on the internal representation. This results in a series of actions for the robot to take to reach a specified goal. Third, this series of actions is used to control the motors of the robot. This completes the cycle of the control system, and it is restarted to achieve new goals. The first general-purpose mobile robot, Shakey, constructed in the late 1960s at the Stanford Research Institute and shown in Figure 2.2, demonstrated that general-purpose planning systems using STRIPS were not very efficient and much too slow for practical use.
The general planning approach has several problems. Maintaining the model is in many cases difficult because sensors are limited or imperfect. The plans produced by the planner often do not have the effects in the real world that were anticipated, so planning cannot be done using this open-loop approach (Wyatt, 1997; Gat, 1998). Brooks (1990a; 1990b; 1991a; 1991b) argues that this traditional approach is very difficult and slow, and that the resulting programs are very brittle and not generalizable to other robotic platforms or to a different environment. It is clear that a different approach is required if one wants to bring robots into everyday life: creating systems that are more robust and do not require too many modifications to the layout of the environment where the robot will operate. To overcome these limitations the behavior-based control approach was developed (Brooks, 1986). In the behavior-based approach, instead of decomposing the task based on functionality, the decomposition is done based on task-achieving modules or competences, called behaviors, layered on top of each other as shown in Figure 2.3; this is called a vertical control architecture.
The most common behavior-based control approaches are the subsumption (Brooks, 1986) and motor schema (Arkin, 1989; 1998; Arkin and Balch, 1997; Murphy, 2000) architectures, which are made up of several behaviors running in parallel1. Each behavior calculates a mapping from sensor inputs - only the sensor inputs relevant for the task of that behavior are used - to motor outputs (cf. Appendix A). The suggested motor outputs from the behavior with the highest priority are used to control the robot's motors, or the outputs are summed to generate one motor output. These architectures are called behavior-based control approaches and represent methodologies for endowing robots with a collection of intelligent behaviors (Matarić, 1994b; 1995b; 1995c; 1997a; 1998a; Parker, 1996). Behavior-based approaches are an extension of the reactive architecture; their computation is not limited to lookup tables and the execution of simple functional mappings. Behavior-based systems are typically designed so that the effects of the behaviors interact in the environment rather than internally through the system.
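The following Java sketch illustrates, under simplified assumptions, the behavior abstraction and the two coordination styles just described: a competitive, priority-based arbiter and a cooperative, summing arbiter. The interface and class names are illustrative and do not come from any specific robot framework.

import java.util.List;

// Illustrative behavior abstraction: a mapping from sensor inputs to two motor outputs.
interface Behavior {
    double[] motorOutput(double[] sensors); // suggested motor commands
    int priority();                         // used only by competitive arbitration
    boolean isActive(double[] sensors);
}

class Arbiter {
    // Competitive (subsumption-style): the active behavior with the highest priority wins.
    static double[] competitive(List<Behavior> behaviors, double[] sensors) {
        Behavior winner = null;
        for (Behavior b : behaviors) {
            if (b.isActive(sensors) && (winner == null || b.priority() > winner.priority())) {
                winner = b;
            }
        }
        return winner == null ? new double[] {0.0, 0.0} : winner.motorOutput(sensors);
    }

    // Cooperative (motor-schema style): the outputs of all active behaviors are summed.
    static double[] cooperative(List<Behavior> behaviors, double[] sensors) {
        double[] sum = new double[] {0.0, 0.0}; // assumes two motor outputs
        for (Behavior b : behaviors) {
            if (!b.isActive(sensors)) continue;
            double[] out = b.motorOutput(sensors);
            sum[0] += out[0];
            sum[1] += out[1];
        }
        return sum;
    }
}

Subsumption corresponds to the competitive case, while motor schemas correspond to the cooperative, summing case.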
The largest limitation of the subsumption architecture is its complexity. Creating correct subsuming and inhibitory links into other behaviors requires detailed knowledge of how the other modules, and the system as a whole, function. Another important limitation is the lack of explicit goals and goal-handling capabilities (Pirjanian, 1999). Since there is no
1 There are infinitely many possible robot control schemes. These schemes are classified into four classes: reactive control (don't think, react), deliberative control (think, then act), hybrid control (think and act independently in parallel), and behavior-based control (think the way you act). Each of these approaches has its strengths and weaknesses, and all play important and successful roles in certain problems and applications. Behavior-based and hybrid control have the same expressive and computational capabilities: both can store representations and look ahead, but they work in very different ways. Reactive control is also popular in highly stochastic environments that demand very fast responses. Deliberative control is the best choice for domains that require a great deal of strategy and optimization, and in turn search and learning (Matarić, 2002).
centralized control, the higher-level behaviors win the arbitration when they compete against lower-level ones, and the functionality of the subsumption architecture is viewed as an emergent property of the interaction of its behaviors and depends on the static and dynamic properties of the environment.
U(q) = U_att(q) + U_rep(q),        (2.1)

where U_att is the attractive potential and U_rep is a repulsive potential associated with obstacles. The force F = −∇U(q) points in the most promising direction to move in. From Equation (2.1) it is seen that F is the sum of two forces: an attractive force F_att = −∇U_att and a repulsive force F_rep = −∇U_rep. These force vectors will generally point directly towards the goal except in the area around the obstacles, as shown in Figure 2.4. There are many ways to define these elementary functions. For example, the attractive potential field U_att can be defined as a parabolic well:
U_att(q) = (1/2) ξ ||q − q_goal||²,        (2.2)

where ξ > 0 is a scaling factor and q_goal denotes the goal configuration. Figure 2.4(b) is a plot of U_att for a two-dimensional case. The repulsive potential field U_rep can be defined as follows:
U_rep(q) = (1/2) η (1/ρ(q) − 1/ρ₀)²    if ρ(q) ≤ ρ₀
U_rep(q) = 0                           if ρ(q) > ρ₀        (2.3)

where η > 0 is a scaling factor, ρ(q) denotes the distance from q to the nearest obstacle, and ρ₀ is the distance of influence of the obstacles.
Figure 2.5: (a) The configuration of obstacles (black) in the map. (b) The attractive potential associated with the goal configuration. (c) The repulsive potential associated with obstacles. (d) The sum of the attractive and repulsive potentials (adapted from Pirjanian, 1999).
There are two major difficulties with the potential fields method as applied to robot navigation. The first is that this method can be slow and require substantial internal state, as it appears that vectors must be computed based on a map of the entire region. But a potential fields method can be easily adapted to a reactive system. There is no need to factor in forces outside of the robot's perceptual range, though true potential fields require computing vectors based on the entire field. A goal outside the perceptual range may need to be maintained, but obstacles that the robot cannot sense do not necessarily need to affect the motion vectors. Thus the vector associated with the current position of the robot can be computed using only the sensory information from a single time step (Arkin, 1989). This vector computation using only current sensor readings makes the method reactive, in that it requires no internal state, and allows the vector at the current position of the robot to be computed quite quickly. The second potential field problem is local minima. The robot's motion at any moment is the sum of the forces affecting it. What happens, then, when these forces sum perfectly to zero? The robot could remain indefinitely in a single position, perfectly immobilized at a local minimum. If no force
disrupts this equilibrium, movement may cease. Thus any potential fields method must employ strategies for coping with this problem of local minima. One of these strategies is generating a time-varying noise vector; others are to relocate the goal temporarily when stuck in a local minimum, or to mark areas that have already been visited as impassable. All these methods do not remove the existence of local minima and are not always successful. Potential field navigation is very suitable for local navigation, since the environment only has to be known in the vicinity of the vehicle and only a short piece of the path is calculated with each evaluation of the force fields.
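The noise-vector escape strategy mentioned above can be sketched in Java as follows; the vector class and the threshold value are assumptions made for illustration.

import java.util.Random;

// A sketch of the "noise vector" escape strategy for local minima in a potential field.
class Vec2 {
    double x, y;
    Vec2(double x, double y) { this.x = x; this.y = y; }
    double norm() { return Math.hypot(x, y); }
}

class LocalMinimumEscape {
    static final double STUCK_THRESHOLD = 1e-3; // forces nearly cancel out (assumed value)
    static final Random rng = new Random();

    // If the attractive and repulsive forces sum to (almost) zero, add a random
    // perturbation so the robot does not remain immobilized at the local minimum.
    static Vec2 resultingForce(Vec2 attractive, Vec2 repulsive) {
        Vec2 sum = new Vec2(attractive.x + repulsive.x, attractive.y + repulsive.y);
        if (sum.norm() < STUCK_THRESHOLD) {
            double angle = 2 * Math.PI * rng.nextDouble();
            sum = new Vec2(Math.cos(angle), Math.sin(angle)); // unit-length noise vector
        }
        return sum;
    }
}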
normalized. The resultant vector is sent to the robot hardware for execution as shown in
Figure 2.6.
Figure 2.6: Motor schema example. The diagram on the left shows a
vector corresponding to move-to-goal schema; the center diagram shows
avoid-obstacles schema. On the right, the two schemas are summed,
resulting in a complete behavior for reaching the goal.
For example, consider the foraging task shown in Figure 2.7. The forage task for a robot is to wander about the environment looking for items of interest (attractors). Upon encountering one of these attractors, the robot moves towards it, finally attaching itself to it. After attachment the robot navigates to the home base, where it deposits the attractor. In this approach to solving the forage task, a robot can be in one of three behavioral states: wander, acquire and deliver. The robot begins in the wander state. If there are no attractors within the robot's field of view, the robot remains in wander until one is encountered. When an attractor is encountered, a transition to the acquire state is triggered. While in the acquire state the robot moves towards the attractor and, when it is sufficiently close, attaches to it. The last state, deliver, is triggered when the robot attaches to the attractor. While in the deliver state the robot carries the attractor back to home base. Upon reaching home base, the robot deposits the attractor there and reverts to the wander state (Balch, 1998a).
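The temporal sequencing of the three states can be sketched as a small finite-state machine in Java; the boolean predicates used here stand in for the robot's actual perceptual tests and are assumptions for illustration.

// A minimal sketch of the wander/acquire/deliver sequencing described above.
enum ForageState { WANDER, ACQUIRE, DELIVER }

class ForagingController {
    ForageState state = ForageState.WANDER;

    ForageState step(boolean attractorVisible, boolean attachedToAttractor, boolean atHomeBase) {
        switch (state) {
            case WANDER:
                if (attractorVisible) state = ForageState.ACQUIRE;
                break;
            case ACQUIRE:
                if (attachedToAttractor) state = ForageState.DELIVER;
                break;
            case DELIVER:
                if (atHomeBase) state = ForageState.WANDER; // deposit and start over
                break;
        }
        return state;
    }
}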
                              Subsumption      Motor Schemas
Hierarchical                  Yes              No
Reconfigurability/Learning    No               Yes
Arbitration Mechanism         Competitive      Cooperative
Behavior implementation       Hardwired        Software
Behavior Reusability          No               Yes
Figure 2.8: The spectrum of robot control (adapted from Arkin, 1998).
The robot that uses deliberative reasoning in control requires much knowledge about
the world and uses this knowledge to predict the outcome of actions to optimize its
performance. If the information is inaccurate or out of date, the outcome of the reasoning
process is probably incorrect. In a dynamic world, where objects may be moving, it is
potentially dangerous to rely on past information that may no longer be valid.
Representational world models are therefore generally constructed from both prior
knowledge about the environment and incoming sensor data (cf. chapter IV). On the other
side of the control spectrum are the reactive systems. Reactive control is a technique for tightly
coupling perception and action, typically in the context of motor behaviors, to produce timely
robotic responses in dynamic and unstructured worlds.
Both deliberative planning systems and purely reactive control systems have their
limitations, but using both forms of knowledge in a robotic architecture can make
behavior-based navigation more flexible and general. Hybrid deliberative-reactive robotic
architecture can combine the traditional abstract representational knowledge with the
responsiveness, robustness, and flexibility of purely reactive systems. However, building
such a hybrid system requires certain compromises. The interface between deliberation and
reactivity is not yet well understood and research in this area is still ongoing. An
important design issue of hybrid control systems is the number of layers. Usually the hybrid
architecture consists of two or three layers. In the case of two layers the interface between
the layers is very important because it links rapid reactions with long-range planning. In the
case of three layers the middle layer coordinates between the other two layers, much like
the interface in the two-layer architecture. There are four main interface strategies for the
various hybrid architectural designs:
Selection: planning is viewed as configuration and determines the behavioral
composition and parameters used during execution.
Advising: the planner suggests changes that the reactive control system may or may
not use.
Adaptation: the planner continuously alters the ongoing reactive behaviors based on
the changing conditions within the world and task requirements.
Postponing: planning is viewed as a least commitment process. The planner postpones
making decisions on action until as late as possible.
Strong evidence exists that hybrid deliberative and behavior-based systems are found in
biology, so research in this area could lead to new insights. For example, there seem to be
two distinct systems concerned with controlling human behavior, a reactive and automatic
system (e.g. when one's hand touches something hot) and a willed and conscious control
system (e.g. grasping something).
Evolution has shaped animals to fit their niche. As the environment is always changing, a successful animal must be capable of adapting to these changes or it will become extinct.
2.10 Assignments
1. Explain in one or two sentences each of the following terms: reflexes, taxes, and fixed-action pattern.
2. Write a GUI program using Java or C++ to simulate a robot containing the sensors and actuators, which are coupled together in reflexive behaviours to do the following:
a. Reflexive avoid: turn left when the robot touches something (use touch sensors and two motors).
b. Phototaxis: follow a black line (use the IR sensor to detect the difference between light and dark).
c. Fixed-action pattern avoid: back up and turn right when the robot encounters a negative obstacle (a cliff).
3. Describe the difference between robot control using a horizontal decomposition and a
vertical decomposition.
4. Evaluate the subsumption architecture in terms of: support for modularity, niche
targetability, ease of portability to other domains, robustness.
CHAPTER III
BEHAVIORS DESIGN AND PROGRAMMING
To encode the behavioral response that a stimulus evokes, a functional mapping is created from the stimulus plane to the motor plane, by factoring the robot's motor response into orthogonal components: strength and orientation. Strength denotes the magnitude of the response and orientation denotes the direction of action. A behavior can be expressed as a triple (S, R, β), where S denotes the domain of all interpretable stimuli, R denotes the range of possible responses, and β denotes the mapping β: S → R.
S consists of all the perceivable stimuli. Each individual stimulus or percept s ∈ S is represented as a binary tuple (p, λ) having a particular type or perceptual class p and a property of strength λ. The stimulus strength can be defined in a variety of ways: discrete (e.g., binary: absent or present; categorical: absent, weak, medium, strong), real valued, or continuous. The instantaneous response r ∈ R of a mobile robot that moves on flat ground and can rotate only about its central axis has three degrees of freedom, expressed as r = [x, y, θ]. For each active behavior we can formally establish a mapping between stimulus and response using the behavior function β(s) = r. Associated with a particular behavior β may be a scalar gain value g that modifies the magnitude of the overall response r for a given s as r′ = g·r. These gain values are used to compose behaviors by specifying their strengths relative to one another. In the extreme case, g can be used to turn off a behavior by setting it to 0, thus reducing r′ to zero [Arkin, 1998]. β can be defined arbitrarily, but it must be defined over all relevant p in S. The behavior mappings β fall into three general categories:
Null: The stimulus produces no motor response.
Discrete: The stimulus produces a response from an enumerated set of prescribed choices (e.g., turn-right, stop, etc.). For example, β may consist of a finite set of (situation, response) pairs; sensing provides the index for finding the appropriate situation. Another strategy involves the use of a rule-based system, where β is represented as a collection of IF-THEN rules [Matarić, 1994], which can be extended using fuzzy logic by synthesizing fuzzy IF-THEN rules [Saffiotti, 1997; Hoffmann, 1998].
Continuous: The stimulus produces a response that is continuous over R's range. One of the most common methods for implementing continuous responses is based on the potential field technique (cf. Section 2.3).
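As an illustration of a discrete behavior mapping β with a gain g, the following Java sketch maps a simplified stimulus to an enumerated response and scales the result by the behavior's gain; the stimulus and response representations are assumptions made for illustration only.

// A simplified sketch of a discrete behavior mapping beta: S -> R with a gain g.
class Stimulus {
    String perceptualClass; // p, e.g. "obstacle-left"
    double strength;        // lambda
    Stimulus(String p, double lambda) { perceptualClass = p; strength = lambda; }
}

class Response {
    double turn;  // orientation component (degrees)
    double speed; // strength component
    Response(double turn, double speed) { this.turn = turn; this.speed = speed; }
}

class AvoidBehavior {
    double gain = 1.0; // g; setting it to 0 switches the behavior off

    Response beta(Stimulus s) {
        Response r;
        if ("obstacle-left".equals(s.perceptualClass) && s.strength > 0.5) {
            r = new Response(30.0, 0.5);   // turn-right
        } else if ("obstacle-right".equals(s.perceptualClass) && s.strength > 0.5) {
            r = new Response(-30.0, 0.5);  // turn-left
        } else {
            r = new Response(0.0, 0.0);    // null response
        }
        return new Response(gain * r.turn, gain * r.speed); // r' = g * r
    }
}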
Figure 3.3: Subsumption-based model for a foraging robot (adapted from Arkin, 1998).
Coding these behaviors is straightforward. For example, a wander behavior, which makes the robot move in a random direction for some time, can be written as the following Java code segment.
private int wander()
{
    /* tt counts down the time remaining in the current wander period */
    if ( tt > 5 )
    {
        tt--;
        move_random();   /* move in a random direction */
    }
    return 0; /* priority 0: allows the find_goal behavior to take over */
}
The pickup behavior turns towards the sensed attractor and goes forward; if the robot is at the attractor, it closes the gripper. It has the following Java code segment.
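A minimal sketch of such a pickup behavior, written in the same style as the wander() segment and using hypothetical helper functions (at_attractor, close_gripper, attractor_direction, turn, forward) and hypothetical priority values, could look like this:

private int pickup()
{
    if ( at_attractor() )
    {
        close_gripper();                /* attach to the attractor */
        return 2;                       /* hypothetical priority: let deliver take over */
    }
    turn( attractor_direction() );      /* turn towards the sensed attractor */
    forward( 1 );                       /* and go forward */
    return 1;                           /* stay in pickup while approaching */
}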
schema represents the template for the physical activity; the perceptual schema embodies the sensing, as shown in Figure 3.4.
Schema theory implements the motor schema as a vector field, which gives the direction and magnitude of the action. In schema theory, the perceptual schema is permitted to pass both the percept and a gain to the motor schema. The motor schema can use the gain to compute the magnitude of the output action. However, schema theory does not specify how the output from concurrent behaviors is combined: in some circumstances the outputs of concurrent behaviors are combined or summed, in others they occur in a sequence, and sometimes there is a winner-take-all effect.
Any primitive potential field is usually represented by a single function, and the vector acting on the robot is computed at each update. Consider a robot with a single range sensor facing forward. The designer has decided that a repulsive field with a linear drop-off is appropriate. The formula is:

V_direction = -180
V_magnitude = (D − d)/D    for d ≤ D
V_magnitude = 0            for d > D        (3.1)

where D is the maximum range of the field's effect, or the maximum distance at which the robot can detect the obstacle. D is always the range at which the robot should respond to the stimulus. The roboticist sets D to 2 meters and the robot's maximum velocity to 10. This can be coded in Java as follows:
class vector {
    double magnitude;
    double direction;
    public vector() {
        magnitude = 0;
        direction = 0;
    }
}

class perceptual_schema {
    /* returns the current range reading from the forward sonar, in meters */
    double readSonar() {
        return 0.0; /* placeholder: replaced by the real sensor driver */
    }
}

class motor_schema {
    public static final double MAX_DIST = 2.0;      /* D: range of the field's effect */
    public static final double MAX_VELOCITY = 10.0; /* maximum robot velocity */

    /* repulsive field with linear drop-off, Equation (3.1) */
    vector repulsive(double d, double D) {
        vector output = new vector();
        if (d <= D)
        {
            output.direction = -180;           /* point directly away from the obstacle */
            output.magnitude = (D - d) / D;    /* stronger the closer the obstacle */
        }
        /* if d > D, magnitude stays 0: stay where you are */
        return output;
    }
}
To illustrate how the repulsive potential field can be used by a behavior, consider runaway for a robot with a single sensor, implemented as a motor schema. The following code segment clarifies the idea.
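A minimal sketch of such a runaway motor schema, built from the perceptual and motor schema classes defined above (the class and method names here are assumptions for illustration), could look like this:

class runaway_behavior {
    perceptual_schema ps = new perceptual_schema();
    motor_schema ms = new motor_schema();

    vector execute_runaway() {
        double reading = ps.readSonar();                      /* perceptual schema: sense range   */
        return ms.repulsive(reading, motor_schema.MAX_DIST);  /* motor schema: repulsive vector   */
    }
}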
Figure 3.6: Behavior fusion via vector summation (adapted from Arkin, 1998).
For example, consider adding another behavior, move_to_goal, that is represented by an attractive potential field and uses the shaft encoders to sense whether the goal position has been reached. The move_to_goal behavior exerts an attractive field over the entire space; wherever the robot is, it will feel a force from the goal. The runaway behavior exerts a repulsive field in a radius around the obstacle. Extending the example in the previous section, the two behaviors can be combined in Java as follows:
vector output, vec1, vec2;
while (isAlive())
{
    vec1 = execute_runaway();        /* repulsive vector from the obstacle       */
    vec2 = execute_move_to_goal();   /* attractive vector towards the goal       */
    /* the two vectors could also be multiplied by their gain values here */
    output = vector_sum(vec1, vec2); /* combine the two vectors by summation     */
    turn(output.direction);
    forward(output.magnitude * motor_schema.MAX_VELOCITY);
}
3.7 Assignments
1. Describe the difference between the two dominant methods for combining behaviours
in a reactive architecture, subsumption and potential field summation.
2. Suppose you were to construct a library of potential fields of the five primitives using
Java or C++. What parameters would you include as arguments to allow a user to
customize the fields?
CHAPTER IV
4.1 Mapping
Mapping is the process of constructing a model of the environment based on the
sensors measurements. There are different approaches to representing and using spatial
information. On one side, there are purely grid-based maps, also called geometric or metric maps. In these representations, a single, global coordinate system, in which all mapping and navigation takes place, defines the robot's environment. In geometric mapping, low-level
geometric primitives are extracted from sensor data to build up high-level abstractions of the
environment. Thus, the world might be modeled using wire-frame, surface boundary, or
volumetric representations, for example. This approach is appealing because it provides a
concise representation that is easily manipulated and displayed by computers and is readily
understood by humans. However, the geometric paradigm has drawbacks in that it leads to
world models that are potentially fragile and unrepresentative of the sensor data because it
relies heavily upon a priori environment assumptions and heuristics [Elfes, 1987]. An
alternative to the geometric paradigm is the sensor-based paradigm. Here, the environment is
represented as a collection of the sensor readings, obtained through sensor models. No a
priori assumptions or models of the world are employed. Most typical of this approach are
cellular representations such as the occupancy grid [Elfes, 1989]. However, representing each cell of the grid explicitly imposes unrealistic storage requirements; one alternative is to take advantage of the fact that many of the cells will have a similar state, especially those cells that correspond to spatially adjacent regions. The general approach to solving this problem is to represent the space using cells of non-uniform shape and size, such as quadtrees [Nebot and Pagac, 1995]. As shown in Figure 4.1, the map is a grid, with each cell of the grid representing some amount of space in the real world.
These methods make weaker assumptions about the environment than in the geometric
paradigm, and usually have a high sensing to computation ratio, since no high-level
abstractions are generated. These approaches work well within bounded environments
where the robot has opportunities to realign itself with the global coordinate system using
external markers. On the other side are topological maps, also called qualitative maps, in which the robot's environment is represented as places and connections between places. Each node in such a graph corresponds to a distinct place or landmark, as shown in Figure 4.2.
Arcs connect these nodes if there exists a direct path between them. This approach has the advantage that it alleviates the problems of dealing with movement uncertainty: movement errors do not accumulate in topological maps as they do in maps with a global coordinate system, since the robot navigates locally.
Figure 4.3: Cross section of a conical sonar signal. Any object along arc A
will yield a sensed distance of d.
Since the sensor uses merely the first echo it receives for range calculations, it is
impossible to determine the exact angular location of the reflecting object with respect to
the acoustic axis; only its radial distance can be measured. The arc labeled A depicts the
angular uncertainty of the ultrasonic sensor - it is bounded by the width of the ultrasonic
cone (typically 30 degrees). The occupancy grid system [Moravec, 1988; Elfes, 1989] accounts for this uncertainty with a Gaussian probability function that models the error of the ultrasonic sensor. Here, cells close to the sensed distance from the sensor and near to the
ultrasonic sensor. Here, cells close to the sensed distance from the sensor and near to the
acoustic axis have a high likelihood of occupancy. Likewise, cells near to the ultrasound
sensor and close to the acoustic axis have a high likelihood of vacancy. After each sensor
reading, Bayes theorem (Equation 4.1) is applied to update the probability of occupancy of
each of the cells intersecting the ultrasonic cone. A Gaussian model is used as the evidence
of occupancy (OCC) or vacancy (VAC) and the current value of the cell provide the prior
probability of occupancy.
P(occ | s_{k+1}) = P(s_{k+1} | occ) P(occ | s_k) / [ P(s_{k+1} | occ) P(occ | s_k) + P(s_{k+1} | vac) P(vac | s_k) ]        (4.1)

Here P(occ | s_k) is the current value of the cell (after k sensor readings), P(occ | s_{k+1}) will be the new cell value, and P(s_{k+1} | occ) and P(s_{k+1} | vac) are obtained from the sensor model. Initially, the occupancy status of a cell is unknown, so we set P(occ | s_0) = 0.5 for all cells. After several readings by multiple sensors at different locations in the environment, a map representing the objects and empty space of the environment is constructed. The algorithm is summarized in Table 4.1.
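As an illustration of the update in Equation (4.1), the following Java sketch applies the Bayesian rule to a single grid cell; the sensor-model probabilities passed to update() are illustrative values, not the Gaussian sensor model itself.

// A sketch of the Bayesian occupancy update of Equation (4.1) for a single cell.
public class OccupancyCell {
    double pOcc = 0.5; // unknown occupancy before any reading

    // pSgivenOcc = P(s_{k+1} | occ), pSgivenVac = P(s_{k+1} | vac)
    void update(double pSgivenOcc, double pSgivenVac) {
        double numerator = pSgivenOcc * pOcc;
        double denominator = pSgivenOcc * pOcc + pSgivenVac * (1.0 - pOcc);
        pOcc = numerator / denominator; // P(occ | s_{k+1})
    }

    public static void main(String[] args) {
        OccupancyCell cell = new OccupancyCell();
        cell.update(0.8, 0.3); // a reading that supports occupancy (illustrative numbers)
        cell.update(0.8, 0.3); // a second, consistent reading
        System.out.printf("P(occ) after two readings: %.3f%n", cell.pOcc);
    }
}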
Grid maps are memory consuming, since the complexity of the map does not depend on the complexity of the environment; as a result, processing these kinds of maps can be very time consuming, since the position of the robot itself must be known accurately. They also require more pre-processing when used by symbolic problem solvers such as behavior coordination algorithms.
On the other hand, a topological map is difficult to construct and to maintain, since it requires the recognition of landmarks and places; the robot must have the sensory equipment for doing this, and even then it is no simple task. It is efficient and faster for planning, since the exact robot position is not that important. It saves memory and space, since the resolution depends on the complexity of the environment, and it offers a very convenient interface towards symbolic problem solvers. However, the paths calculated from topological maps may not be optimal in terms of energy consumption or distance traveled. Both methods have their advantages and disadvantages; the most important ones are stated in Table 4.2.
                              Grid Maps    Topological Maps
Building / Maintainability    easy         difficult
Memory space                  huge         small
Exact robot position needed   yes          no
Planning speed                slow         fast
4.2 Localization
An important sub-problem in the robot navigation task is that of localization. That is, given accumulated knowledge about the world and current sensor readings, what is the position of the robot within the workspace? Localization is particularly difficult for map-based approaches that learn their maps, since the accuracy of a metric map depends on the alignment of the robot with its map. This would be a trivial problem if the navigation system could rely upon its velocity history (i.e., employ dead reckoning) to calculate its global coordinates. Unfortunately, distance and orientation measurements made from dead reckoning are highly prone to errors due to shaft encoder imprecision, wheel slippage, irregularities in the surface of the workspace, etc., which accumulate over time. Odometry and inertial navigation methods, being relative position measurements, therefore require supplemental information in order to periodically flush the metric error from the system and re-determine the robot's position in the environment [Borenstein et al., 1996].
External measurement techniques, also called absolute measurement techniques, are often used in conjunction with relative measurement techniques to improve the localization estimate. Examples of absolute measurement techniques are:
Guidepath: one of the simplest forms of localization. The guidepath can be, for instance, a wire (which transmits audio or radio signals) or a magnetic strip [Everett, 1995], which a robot can follow. Guidepaths indicate static paths in an environment and are not suitable in applications where mobile robots should be allowed to move flexibly.
Active beacons use a set of transmitters (beacons) whose locations in the workspace are known in advance. The transmitters, in most cases, use light or radio signals to announce their presence. The active beacons may transmit their identity and current location. For a robot to be able to use the active beacons for its localization at some position in the workspace, at least three of the beacons have to be visible or detectable at that position.
Artificial landmark recognition methods try to recognize objects or images with a distinctive shape placed in the workspace. The positions of the landmarks are known, and if three or more landmarks are detectable at a certain position, an estimate of the robot position can be calculated.
4.3 Navigation
The process of generating a sequence of actions that have to be performed in order for a robot to move through its environment (also called workspace) autonomously and without collisions is called navigation. Methods for navigation, in their most simple form, fall into one of two categories: deliberative or reactive. Purely reactive methods offer real-time motion response provided by simple computations and do not use any representation of the environment (i.e., they have no understanding of the layout of the workspace, e.g. where the obstacles are situated). They compute a function using the parameters obtained from external and proprioceptive sensors on the robot. They can only assure reaching the target if the way has already been made and learned, or if they were designed to follow a certain path.
Pure deliberative methods, on the other hand, are highly representation dependent and express a higher level of intelligence. They use a priori knowledge to pre-calculate robot motion. A typical way to utilize this knowledge is to plan a path through the workspace for the robot, so-called path planning. The path planner generates minimum-cost paths to the goal(s) using the map of the workspace. There are a large number of approaches that combine both reactive and deliberative navigation, and these seem to be the most effective ones. They combine the efficiency of planning with the flexibility and adaptive ability of reactive systems. As a result of this integration, deliberative navigation provides intermediate goals to the reactive collision avoidance behavior that controls the velocity and the exact direction of the robot's motion.
The robot path consists of segments. Gateways or landmarks are needed so the robot can tell when it has completed a segment and another should begin (cf. Section 4.1). These distinctive places are derived from the sensor readings, and after the robot determines these observable and ultimately recognizable landmarks, they can be integrated easily into behaviour control. For example, Matarić [1990] demonstrated the integration of qualitative maps and behavior-based robotic systems in a subsumption-based system, as shown in Figure 4.4. Landmarks are derived from sonar data, using features that are stable and consistent over time. In particular, right walls, left walls, and corridors are the landmark features derived from the sensors. Spatial relationships connecting the various landmarks are added by constructing the graph. Map navigation consists of following the boundaries of wall or corridor regions.
reference can be precisely specified. The configuration of the robot is a vector that holds the current values of all the single-valued parameters that need to be specified in order to identify uniquely the robot's exact position. These parameters are broadly known as the generalized coordinates of the robot. For a mobile robot the generalized coordinates might be the Cartesian coordinates of its center and the orientation of its heading. The configuration space C of the robot is simply the space in which its configuration vector lives, i.e. the set of all possible configurations, and its dimension is equal to the number of generalized coordinates [Lozano-Pérez, 1983]. Since the robot can only be in a single configuration at any time, it is represented as a single point in C. Obstacles in the workspace imply that certain areas are not accessible to the robot.
The configuration q of a robot A specifies the exact position and orientation of A relative to a fixed reference frame. The configuration space (often referred to as "C-space") of A is the set of all possible configurations of A. For a robot with n degrees of freedom (DOF), its C-space, C, is an n-dimensional space (typically an n-dimensional manifold), since there is a one-to-one, onto map between q and a specific arrangement of the robot's joints and the robot's position within its (real-world) workspace. Obstacles are mapped into C-space by determining which configurations of the robot produce collisions with an obstacle; these configurations are deemed forbidden. Let A(q) denote the set of points occupied by A when A is in configuration q. The C-space obstacle (or "C-obstacle") associated with a physical obstacle B is defined as:

CB = { q ∈ C | A(q) ∩ B ≠ ∅ }        (4.2)

The freespace F is then:

F = C \ ∪_i CB_i        (4.3)
Motion plans are constructed in F. Note that the closure of freespace includes the obstacle boundaries. Since generally a robot is allowed to touch an obstacle (but not penetrate it), the word "freespace" (and the symbol F) will often be used to refer to the closure of freespace as well. In the case illustrated in Figure 4.5(a), the robot is a rigid body in a 2D environment (its workspace), and it is capable of translating in any direction (i.e., horizontally and vertically) but not rotating. Thus, this robot has 2 DOF, and its configuration space (shown in (b)) is equivalent to ℝ². Configuration space is a useful concept for classical motion planners aimed at creating paths for (holonomic3) multi-DOF robots. Classical motion planning is a two-step process. In the first step, the C-obstacles are computed from knowledge of the physical obstacles and of the robot geometry. The second step constructs a path within F. A path in C of the robot A from the start
3 Holonomic robots are capable of motion directly between configurations: for example, a robot which can move forward, backward, and side-to-side freely in an obstacle-free 2D environment. An automobile is an example of a non-holonomic vehicle; other techniques must be used (perhaps in addition to C-space) to capture the inability of a car to move directly sideways.
the feature in that region than to another feature. The resulting roadmap is the edge
between regions. This edge is the safest possible path as it avoids obstacles by the
maximum distance possible.
In approximate cell decomposition, cells usually have a standard predefined shape but perhaps variable size. Any graph-search algorithm can then be used for path construction. The details of the decomposition, however, can significantly affect the completeness and the computational complexity of the algorithm, as well as the quality of the resulting path. These methods are perhaps the best studied and have been applied widely for robot navigation. Simple instances of such methods are quadtrees [Nebot and Pagac, 1995] and grid-based representations [Moravec and Martin, 1996], as shown in Figures 4.8 and 4.9. One example of the approximate cell decomposition method is the quadtree algorithm. This algorithm represents C-space as a quadtree structure by performing the steps listed in Table 4.2, and the resulting decomposition is shown in Figure 4.9. A path can then be created at any time by building a connectivity graph based on the quadtree and computing a path through the known portions of F.
determined by the field until the goal configuration is reached. Using only our potential energy function U(q), we can determine the path which our robot should take through F using the algorithm listed in Table 4.2.
β_{i,j} = arctan( (y_i − y_r) / (x_i − x_r) )        (4.4)

m_{i,j} = c²_{i,j} (a − b·d²_{i,j})        (4.5)

where c_{i,j} is the certainty of cell C_{i,j} in the grid, d_{i,j} is the distance from the robot to cell C_{i,j}, and a and b are constants.
4.4 Assignments
1. Implement a potential field path-planning method and compute a 2-D trajectory across a rectangular room with several non-convex obstacles in it.
2. Implement a visibility-graph-based planner for an environment composed of polygonal obstacles.
3. An important property of an occupancy grid is that it supports sensor fusion. Discuss this statement.
CHAPTER V
ROBOT LEARNING
In order to build mobile robot systems that come close to what people dream of, researchers have suggested embedding learning techniques in them to improve their efficiency and to prepare them to deal with unforeseen situations and circumstances in their environment. Until now, learning has often been neglected, and the designer of a mobile robot system has generally taken on the extremely difficult task of trying to anticipate all possible contingencies and interactions among the robot components ahead of time. The ultimate goal of machine learning techniques is to deal with the noise in the robot's internal and external sensors and with the inconsistencies in the behavior of the environment. Machine learning focuses on developing approaches and algorithms by which an artificial agent (i.e., a robot) automatically acquires knowledge, develops strategies and modifies its behavioral tendencies through experience.
possess models, and in which the robot may be hampered by incomplete perception and unreliable effectors. On the other hand, control theory gives us methods for constructing closed-loop adaptive control policies that depend on the environment's state and on time [Singh, 1994; Barto et al., 1995]. Techniques for constructing such control policies include analytical methods for simple (linear) systems, and numerical methods such as dynamic programming for more complex (non-linear and stochastic) systems. These methods construct control policies off-line, given an accurate environment model [Wyatt, 1997]. When trying to design adaptive control policies for robots, we typically have no model of the environment. Thus machine learning arises for developing on-line and model-free approaches and algorithms for constructing the robot's closed-loop policy by automatically acquiring knowledge, developing strategies and modifying its behavioral tendencies through experience [Kaelbling et al., 1996; Sutton and Barto, 1998; Keerthi and Ravindran, 1999; Mahadevan, 1996b; Russell, 1996; Mitchell, 1997; Dietterich, 1997]. Reinforcement learning (RL) describes a large class of learning tasks and algorithms characterized by the fact that the robot learns an associative mapping π: X → A (from the state space X to the action space A) by maximizing a scalar evaluation (reward) of its performance received from the environment. The robot learns strategies for interacting with its environment in a closed-loop cycle as shown in Figure 5.1. This is the basic reinforcement-learning model, where the
robot-environment interaction is viewed as a coupled dynamic system in which the outputs of the robot are transformed into inputs to the environment and the outputs of the environment are transformed into inputs to the robot. The inputs to the robot usually do not capture a complete description of the environment's state, and the environment changes continuously through time. Furthermore, the robot influences the future sequence of its inputs by its outputs at each stage. Through this, the robot explores its dynamic environment, attempting to monitor the environment and figure out by itself how it can exploit the acquired experience to maximize its reward, which it will receive based on the sequences of actions it takes.
Learning solely from reinforcement signals makes the environment a powerful force shaping the behaviors of autonomous robots. Robots can also directly shape their environments. Because of this interaction of shaping forces, it is important to develop autonomous robots in a system where all of the forces for change are learnable, to optimize
system performance over its lifetime. When learning control policies, we must be able to
evaluate them with respect to each other. The robot's goal is to find, based on its
experience with the environment, a strategy, or an optimal policy, for choosing actions that
will yield as much reward as possible. The most obvious metric, the sum of all rewards over
the life of the robot,
∑_{t=0}^{∞} r(t),        (5.1)
is generally not used. For robots with infinite lifetimes, all possible sequences of rewards
would sum to infinity. This is not the case for a robot with a finite lifetime, however. In this
case, the obvious metric turns into the finite horizon measure,
∑_{t=0}^{k} r(t).        (5.2)
This measure sums the rewards over some finite number, k, of time steps. The average reward measure,

lim_{k→∞} (1/k) ∑_{t=0}^{k} r(t),        (5.3)

extends this by using the average reward received over the whole lifetime of the robot.
The infinite horizon discounted measure,

∑_{t=0}^{∞} γ^t r(t),        (5.4)
uses a discount factor γ, 0 ≤ γ < 1, to control the relative importance of short-term and long-term reinforcements. As γ → 0 the short-term reinforcement signals become more important. When γ = 0 the only reinforcement signal that matters is the immediate one, and we obtain the one-step greedy policy: the best action is the one that gives the greatest immediate reward. The optimal finite horizon policy may be non-stationary, in the sense that different actions may be selected for the same state at different time steps. In problems with hard deadlines, finite horizon policies are the right choice. In problems without hard deadlines, k → ∞, and an infinite-horizon policy is desired. It is a theorem that the optimal infinite horizon policy is stationary [Haykin, 1999]. Most reinforcement-learning problems are typically cast as Markov decision processes (MDPs)4, as shown in Figure 5.2.
4 Complete observability of the state is necessary for learning methods based on MDPs. In the case of noisy and incomplete information in the robot's observations, the perceptual aliasing or hidden state problem arises, so the MDP framework needs to be extended to the partially observable Markov decision process (POMDP). See Kaelbling et al. (1996) for the strategies implemented to deal with this problem.
In a Markov decision process there is a finite set of states, S, a finite set of actions, A, and time is discrete. The reward function

R: S × A → ℝ        (5.5)

returns an immediate measure of how good an action was. The resulting state, s_{t+1}, depends on the transition function

T: S × A → Π(S),        (5.6)

where T(s, a, s′) denotes the probability that taking action a in state s leads to state s′. The robot seeks a policy that maximizes the expected discounted return from the start state,

V*(s_0) = max_π E[ ∑_{t=0}^{∞} γ^t R(s_t, a_t) ],        (5.7)
where R(s_t, a_t) is the reward for the state-action pair at time t. If we know the functions T and R, then we can define an optimal value function, V*(s), over states:
V*(s) = max_a [ R(s, a) + γ ∑_{s′∈S} T(s, a, s′) V*(s′) ]        (5.8)
This function assigns a value to each state, which is the best immediate reward that we
can get for any action from that state added to the optimal value from each of the possible
resulting states, weighted by their probability. If we know this function, then we can define
the optimal policy, π*, by simply selecting the action, a, that gives the maximum value:
π*(s) = argmax_a [ R(s, a) + γ ∑_{s′∈S} T(s, a, s′) V*(s′) ]        (5.9)
Equation (5.9) is called Bellman's optimality equation. It provides a means of computing the optimal policy for the Markov decision process by performing a one-step look-ahead search
and choosing the action whose backup value is the largest. There are well-understood dynamic programming methods for computing V*, such as value iteration and policy iteration, which lead to a simple procedure for learning the optimal value function, and hence the optimal policy. The two algorithms, policy iteration and value iteration, are off-line and model-based methods: they need the T and R functions [Dietterich, 1997]. Instead of learning T and R, however, we can incrementally learn the optimal value function directly. In the next sections, we describe well-known algorithms that attempt to iteratively approximate the optimal value function.
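To make Equation (5.8) concrete, the following Java sketch runs value iteration on a tiny, made-up tabular MDP; the transition and reward tables, and all constants, are illustrative assumptions and not taken from the text.

// A sketch of value iteration (Equation 5.8) on a small illustrative tabular MDP.
public class ValueIteration {
    public static void main(String[] args) {
        int nStates = 3, nActions = 2;
        double gamma = 0.9, theta = 1e-6;

        double[][] R = new double[nStates][nActions];            // R(s, a)
        double[][][] T = new double[nStates][nActions][nStates]; // T(s, a, s')

        // Toy model: action 1 in state 1 moves to state 2 and yields reward 1.
        for (int s = 0; s < nStates; s++)
            for (int a = 0; a < nActions; a++)
                T[s][a][s] = 1.0;          // default: stay put, zero reward
        T[1][1][1] = 0.0; T[1][1][2] = 1.0; R[1][1] = 1.0;
        T[0][1][0] = 0.0; T[0][1][1] = 1.0;

        double[] V = new double[nStates];
        double delta;
        do {
            delta = 0.0;
            for (int s = 0; s < nStates; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int a = 0; a < nActions; a++) {
                    double q = R[s][a];
                    for (int s2 = 0; s2 < nStates; s2++) q += gamma * T[s][a][s2] * V[s2];
                    best = Math.max(best, q);
                }
                delta = Math.max(delta, Math.abs(best - V[s]));
                V[s] = best;
            }
        } while (delta > theta);

        for (int s = 0; s < nStates; s++) System.out.printf("V*(%d) = %.3f%n", s, V[s]);
    }
}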
Initialize:
    π ← policy to be evaluated
    V(s) ← an arbitrary state-value function
    Returns(s) ← an empty list, for all s ∈ S
Do forever:
    Generate an episode using π.
    For each state s appearing in the episode:
        R ← return following the first occurrence of s
        Append R to Returns(s)
        V(s) ← average(Returns(s))
End Do

Table 5.1: First-visit Monte Carlo method for estimating V^π(s).
In order to estimate V^π(s), the value of state s under policy π, given a set of episodes obtained by following π and passing through s, we first define each occurrence of state s in an episode to be a visit to s. The every-visit Monte Carlo method then estimates V^π(s) as the average of the returns following all visits to s in a set of episodes. Another approach, called the first-visit Monte Carlo method (shown in Table 5.1), averages just the returns following the first visit to s. Both first-visit and every-visit Monte Carlo converge to V^π(s) as the number of visits (or first visits) to s goes to infinity [Kaelbling et al., 1996; Sutton and Barto, 1998; Keerthi and Ravindran, 1999].
V(s_t) ← V(s_t) + α_t [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]        (5.10)
There are two parameters: a learning rate, α, and a discount factor, γ. The learning rate controls how much we change the current estimate of V(s_t) based on each new experience. Equation (5.10) forms the basis of the TD(0) algorithm, which is really an instance of a more general class of algorithms called TD(λ), with λ = 0, where 0 ≤ λ ≤ 1 is a parameter that controls how much of the current difference r_{t+1} + γV(s_{t+1}) − V(s_t) is applied to all previous states. As shown in Table 5.2, TD(0) looks only one step ahead when adjusting value estimates; although it will eventually arrive at the correct answer, it can take quite a while to do so. The general TD(λ) is similar to TD(0) but is generalized to propagate information across trajectories of states according to their eligibility, e_t(s), which is a function that indicates how far in the past and how frequently the state has been visited:
V(s) ← V(s) + α_t [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] e_t(s)        (5.11)
The value of e_t(s) is updated at each transition for all s ∈ S. There are two forms of update equations. An accumulating trace updates the eligibility of a state using

e_t(s) = γλ e_{t−1}(s) + 1    if s is the current state
e_t(s) = γλ e_{t−1}(s)        otherwise        (5.12)

while a replacing trace uses

e_t(s) = 1                    if s is the current state
e_t(s) = γλ e_{t−1}(s)        otherwise        (5.13)
Under both mechanisms, the eligibility of a state decays away exponentially when the state is unvisited. Under an accumulating trace, the eligibility is increased by a constant every time the state is visited; under a replacing trace, the eligibility of a state is reset to a constant on each visit. An eligibility trace can be thought of as a short-term memory process, initiated the first time a state is visited by the robot. The degree of activation depends on the recency of the most recent visit and on the frequency of visits. Thus, eligibility traces implement two heuristics: a recency heuristic and a frequency heuristic [Sutton and Barto, 1998]. Sutton [1988] and others [Dayan, 1992; Dayan and Sejnowski, 1994] note that TD(λ) converges more quickly for λ > 0. Learning the value function V^π(s) for a fixed policy can be combined with a policy learner to get what is known as an actor-critic or adaptive heuristic critic system [Barto et al., 1983]. This alternates between learning the value function for the current policy and modifying the policy based on the learned value function.
TD(0) Algorithm
For each s ∈ S, initialize the table entry V(s) to 0.
Observe the current state s.
Do forever:
    Select an action a_t using s_t and π, and execute it.
    Receive the immediate reward r_{t+1}.
    Observe the new state s_{t+1}.
    Update the table entry V(s_t) according to Equation (5.10).
    Set s_t to s_{t+1}.
End Do

Table 5.2: TD(0) learning algorithm for estimating V^π(s).
We can also define an optimal action-value function, Q*(s, a), giving the expected discounted return of taking action a in state s and acting optimally thereafter:

Q*(s, a) = R(s, a) + γ ∑_{s′∈S} T(s, a, s′) max_{a′} Q*(s′, a′)        (5.14)
Once the optimal value function is known, the optimal policy, π*(s), can be easily calculated:

π*(s) = argmax_a Q*(s, a)        (5.15)
Starting from random initial values, the Q-function can be approximated incrementally according to the following rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t) ]        (5.16)
The Q-learning algorithm is summarized in Table 5.3. Watkins and Dayan [1992] proved that the iteration converges to Q*(s, a) under the condition that the learning set includes an infinite number of episodes for each state and action. If the function is properly computed, an agent can act optimally simply by looking up the best-valued action for any situation. Q-learning chooses the optimal action for a state based on the value of the Q-function for that state. If only the best action is chosen, it is possible that some actions will never be chosen from some states. This is bad, because we can never be certain of having found the best action from a state unless we have tried them all. So we must have some sort of exploration strategy, which encourages us to take some non-optimal actions in order to gain information about the world. This is known as the exploration/exploitation problem [Thrun, 1992a, 1992b; Wiering, 1999].
Q-learning Algorithm
For each pair (s \in S, a \in A), initialize the table entry Q(s, a) to 0.
Observe the current state s_t.
Do forever:
    Select an action a_t and execute it.
    Receive immediate reward r_{t+1}.
    Observe the new state s_{t+1}.
    Update the table entry Q(s_t, a_t) using this real experience, as in Equation (5.16).
    Set s_t to s_{t+1}.
End Do
Table 5.3: Q-learning algorithm.
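A minimal tabular Q-learning sketch following Table 5.3 and Equation (5.16) might look as follows; the environment interface and the epsilon-greedy action selection (discussed in the next paragraph) are assumptions of this example.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.  env.reset()
    and env.step(a) -> (next_state, reward, done) are assumed interfaces;
    actions is the finite action set."""
    Q = defaultdict(float)                      # Q[(s, a)], initially 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:       # explore occasionally
                a = random.choice(actions)
            else:                               # exploit best known action
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            # one-step Q-learning update, Eq. (5.16)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q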
The simplest exploration strategy follows a policy of optimism in the face of uncertainty: when the effect of an action is poorly understood, it is usually better to take that action in order to reduce this uncertainty. For example, exploration bonuses can be introduced and propagated to artificially raise the estimated value of untried actions [Sutton, 1990a; Wyatt, 1997; Wiering, 1999]. Another approach, the \epsilon-greedy strategy, is to select a random action a certain percentage of the time, \epsilon, and the best action otherwise [Sutton and Barto, 1998]. The last approach attempts to take into account how certain we are that actions are good or bad. It is based on the Boltzmann or Gibbs distribution and is often called soft-max exploration, where at each time step, action a is chosen from state s with the following probability:

\Pr(a \mid s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}        (5.17)
where T is a temperature parameter that controls the amount of exploration: there is much exploration in states where the Q-values for different actions are almost equal, and little exploration in states where the Q-values are very different.
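The two exploration strategies just described can be written as small action-selection helpers; Q is assumed to be a table keyed by (state, action) pairs, as in the sketches above.

import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the
    action with the highest estimated Q-value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def softmax_action(Q, s, actions, T=1.0):
    """Boltzmann (soft-max) exploration, Eq. (5.17): actions with higher
    Q-values are chosen more often; the temperature T controls how
    strongly the distribution favours them."""
    prefs = [math.exp(Q[(s, a)] / T) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]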
A closely related on-policy algorithm, SARSA, replaces the maximization in the Q-learning target with the value of the action actually taken in the next state:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]        (5.18)

Just like TD, SARSA learns the value for a fixed policy and must be combined with a policy-learning component in order to make a complete reinforcement learning system.
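A one-line sketch of the SARSA update in the same tabular setting (the (state, action)-keyed table is an assumption carried over from the previous examples):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy SARSA update, Eq. (5.18): the target uses the action
    a_next actually selected by the current policy in s_next, rather
    than the greedy maximum used by Q-learning."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])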
A learned model of the world can also be used to generate hypothetical experiences. Learning from these hypothetical experiences is useful when making errors is costly, when the robot has few real experiences from interaction with the environment, and when speeding up the learning process is urgent [Sutton, 1990a; 1990b; 1991; Clouse, 1997]. The earliest use of planning in reinforcement learning was the Dyna architecture, shown in Table 5.4, which integrates learning and planning and was proposed by Sutton [1990a; 1990b; 1991]. Planning is treated as being virtually identical to reinforcement learning, except that while learning updates the appropriate value-function estimates according to real experience, planning updates these same value-function estimates for hypothetical experience drawn from the world model. The model component of the Dyna algorithm is a table indexed by state-action pairs that contains the resulting state and reward as an entry [Sutton and Barto, 1998]. The table is filled as real experience is gathered. When the state and action spaces are too large, storing all experience becomes impractical. There is a version of Dyna called Prioritized Sweeping [Moore and Atkeson, 1993], and the closely related Queue-Dyna [Peng and Williams, 1993], that store a fixed number of the most useful experiences and update the experience tuples for which there will be large update errors, using prioritized value functions. Lin [1993] proposed the experience replay method, where past useful experiences are stored in a fixed-length list and presented to the reinforcement learner again and again in reverse order. This is also a kind of planning, because some costly experiences from the past are used to learn from once again. It is useful for learning in real worlds, where exploring new experiences is often much more expensive than replaying old experiences.
Q-Dyna Learning Algorithm
For each pair (s \in S, a \in A), initialize the table entry Q(s, a) to 0.
Observe the current state s_t.
Do forever:
    Select an action a_t and execute it.
    Receive immediate reward r_{t+1}.
    Observe the new state s_{t+1}.
    Update the table entry Q(s_t, a_t) using this real experience, as in Equation (5.16).
    Set s_t to s_{t+1}.
    Update the Model using this real experience.
    Repeat k times:
        Select a previously experienced state-action pair (s, a) at random.
        Predict the reward r and next state s' from the Model.
        Update the table entry Q(s, a) using this hypothetical experience, as in Equation (5.16).
    End Repeat
End Do
Table 5.4: Q-Dyna learning algorithm.
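A compact sketch of the Q-Dyna idea, combining the real-experience update with k model-based (hypothetical) updates per step; the environment interface and the random sampling of previously seen state-action pairs for planning are assumptions of this illustration.

import random
from collections import defaultdict

def dyna_q(env, actions, episodes=200, k=10, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Dyna-Q: each real transition updates Q and a table-based
    world model; k hypothetical transitions drawn from the model are
    then replayed with the same update rule, Eq. (5.16)."""
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s_next, done)
    def select(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])
    def update(s, a, r, s_next, done):
        best = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = select(s)
            s_next, r, done = env.step(a)
            update(s, a, r, s_next, done)        # learn from real experience
            model[(s, a)] = (r, s_next, done)    # update the world model
            for _ in range(k):                   # planning with hypothetical experience
                ps, pa = random.choice(list(model.keys()))
                pr, ps_next, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return Q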
The reinforcement function translates the observed state into a reinforcement value. That is, part of the robot's a priori knowledge is the ability to recognize states and assign a value to them. Balch [1998b] provides taxonomies for the various kinds of reinforcement functions that have been developed for multi-robot systems, based on a number of issues. The taxonomy of multi-robot reinforcement functions examines the source of reward and its relation to performance, locality, time, and continuity. In order to define the reinforcement function, the designer distills a large amount of task-specific knowledge into a number. Thus, the reinforcement function is usually closely coupled to the performance metric for a task. Since learning robots strive to maximize the reward signal provided to them, performance is maximized when their reward closely parallels performance. It is sometimes the case, however, in many embedded autonomous agent applications, and especially in multi-robot systems, that robots cannot or should not be rewarded strictly according to overall system performance. For example, the reinforcement function designer will encode a notion of what classes of states represent the robot's goal; however, the robot's sensors may not provide enough information for an accurate computation of performance, reward may be delayed, or performance may depend on the actions of other robots over which the agent has limited knowledge and/or control [Balch, 1998b]. Since the reinforcement function is the robot's only indication of how well it is meeting its goals, it must be designed with great care. For all practical purposes, the reinforcement function defines the robot's task. For complex problems, it may be necessary to refine the reinforcement function as a robot discovers ways to get the reinforcement without accomplishing the task. In order to design a correct reinforcement function, there must first be a clear idea of what the robot's goals should be, and the reinforcement function should parallel the performance metric [Balch, 1998b]. Given a simple goal, it is straightforward to define a reinforcement function leading to the proper behavior. Reinforcement learning gains much of its power from the ability to use the reinforcement function to specify different degrees of goal satisfaction and to identify sub-goals of a complex task. Rewarding sub-goals is one way to encode domain and task knowledge and to speed up learning [Matarić, 1994c]. The relative magnitude of reinforcements needs to reflect the relative importance of the states being rewarded. One could prioritize sub-goals by assigning reinforcements of greater magnitude to the more important ones [Karlsson, 1997].
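As a purely illustrative example of encoding sub-goals, a reinforcement function for a hypothetical foraging robot might reward delivery most, reward grasping a puck as a sub-goal, and penalize collisions; all field names here are invented for the sketch.

def foraging_reward(state):
    """Illustrative shaped reinforcement function for a hypothetical
    foraging robot.  The main goal (delivering a puck to home base)
    earns the largest reward, the sub-goal of grasping a puck earns a
    smaller one, and collisions are penalised."""
    if state.delivered_puck:     # task completed: the actual goal
        return 10.0
    if state.grasped_puck:       # sub-goal: rewarded less than the goal
        return 2.0
    if state.collided:           # undesired state
        return -1.0
    return 0.0                   # all other states are neutral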
Evolutionary learning is another form of adaptation. The main difference between evolution and learning concerns how the adaptation process is carried out in space and time. Evolution is based on a population of individuals, whereas learning takes place within a single individual. Evolution captures relatively slow environmental changes; learning, on the other hand, captures faster environmental changes.
Combining evolution and learning during the life of the robot is one way to overcome the lengthy evolutionary time scale. Learning can improve the search properties of artificial evolution by making the controller more robust to changes that occur very fast. This is typically achieved by evolving neural controllers that can learn with reinforcement learning or back-propagation.
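A minimal sketch of the evolutionary side of such a scheme: a simple generational algorithm over a controller's parameter vector. The fitness function is assumed to evaluate the controller on the robot or in simulation, and could itself include a lifetime learning phase as described above.

import random

def evolve_controller(fitness, n_weights, pop_size=20, generations=50,
                      mutation_std=0.1):
    """Evolve a controller's parameter vector (e.g. neural-network
    weights) with truncation selection and Gaussian mutation.
    fitness(weights) is an assumed scoring function returning a number."""
    population = [[random.gauss(0, 1) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]         # keep the better half
        offspring = []
        while len(offspring) < pop_size - len(parents):
            p = random.choice(parents)
            child = [w + random.gauss(0, mutation_std) for w in p]  # mutate
            offspring.append(child)
        population = parents + offspring
    return max(population, key=fitness)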
[Comparison of learning methods; the original column layout and one column heading were lost in extraction, the two recoverable column headings being Temporal Difference and Statistical Learning:
Memory space: Large / Small / Small
Tracking changes: No / Yes / Yes
Model-based learning: No / Yes / Yes
Learning speed: Slow / Middle / Fast
Optimality of policy: Mostly / Approached / Mostly
Credit assignment: Implicit / Explicit / Explicit
Explore/exploit problem: Yes / Yes / No]
Coordination and cooperation are a vital part of applying multiple robots to a mission in an efficient manner. Cooperation is a form of interaction that makes an individual robot capable of collaborating with other robots to accomplish tasks that are beyond an individual's capabilities. Designing and implementing a cooperative team of robots requires deciding how to resolve organizational problems. Learning to choose an organization scheme is based on the improvement gained in completing the mission cooperatively within the efficiency constraints (i.e., execution time, interference, and robustness). Gaining dynamic organization behavior through learning lets the robots complement each other in a cooperative manner and cope with dynamic environments. Different organizations will exist, each with different behaviors, and these behavioral differences lead to differences in performance. Therefore, no one organization is best suited for every mission. Dynamically assigning each robot a different role through learning allows a group of robots to improve their performance: the roles provide constraints, so that each robot accepts tasks that match its role in the organization, based on that robot's skills and experience. As shown in Figure 5.3, increasing the level of communication and dynamic task allocation leads to a virtual centralized controller that directs the multi-robot system toward coherent behavior. The cooperative and adaptive character of the robots' behavior increases the organization level, which is demonstrated by higher efficiency and decreased execution time, while preserving the autonomy of the robots in the team.
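As a toy illustration of role- and skill-based task allocation (the scoring function and greedy assignment are assumptions of this sketch, not a method from the text):

def allocate_tasks(robots, tasks, skill):
    """Greedily assign each task to the free robot whose learned skill
    estimate for it is highest.  skill(robot, task) is an assumed
    scoring function, e.g. learned from past performance; real systems
    may instead negotiate assignments through auctions or other
    communication mechanisms."""
    assignment = {}
    free = set(robots)
    for task in tasks:
        if not free:
            break
        best = max(free, key=lambda r: skill(r, task))
        assignment[task] = best
        free.remove(best)
    return assignment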
REFERENCES
Borenstein J., and Koren Y., (1989). Real-Time Obstacle Avoidance for Fast Mobile Robots. IEEE Trans. Systems, Man, and Cybernetics, 19(5): 1179-1187.
Borenstein J., and Koren Y., (1991). The Vector Field Histogram - Fast Obstacle Avoidance for Mobile Robots. IEEE Trans. Robotics and Automation, 7(3): 278-288.
Borenstein J., Everett H., and Feng L., (1996). Navigating Mobile Robots - Systems and Techniques. A.K. Peters, Wellesley, MA.
Brooks R. and Stein L. (1994). Building brains for bodies. Journal Autonomous
Robots, 1: 7-25.
Brooks R., (1986). A Robust Layered Control System for a Mobile Robot. IEEE
Journal of Robotics and Automation, 2(1): 14-22.
Brooks R., (1990a). Challenges for Complete Creature Architectures. In:
Proceedings of the first international conference on simulation of adaptive
Behavior, MIT Press, pp. 484-448.
Brooks R., (1990b). Elephants Don't Play Chess. In: Maes P., (Ed.), Designing Autonomous Agents, MIT Press, pp. 3-15.
Brooks R., (1990c). The Behavior Language; User's Guide. Technical Report AIM-1127, MIT Artificial Intelligence Lab.
Brooks R., (1991a). Intelligence Without Reason. In: Proceedings IJCAI-91,
Sydney.
Brooks R., (1991b). Intelligence Without Representation. Artificial Intelligence 47,
139-160.
Canny John (1988). The Complexity of Robot Motion Planning. Cambridge,
Massachusetts: MIT Press.
Cao Y., Fukunaga A., and Kahng A. (1997). Co-operative Mobile Robotics:
Antecedents and Directions. Journal of Autonomous Robotics, 4:1-23.
Clouse J., (1997). On Integrating Apprentice learning and Reinforcement Learning.
Ph.D. Thesis, University of Massachusetts.
Clouse J., and Utgoff P., (1992). A teaching method for reinforcement. Machine
learning: In Proceedings of the Ninth International Conference, pp. 182-189.
Connell J. (1990). Minimalist Mobile Robotics: A Colony Architecture for an Artificial
Creature. Academic Press.
Connell J. (1991). SSS: a hybrid architecture applied to robot navigation. In:
proceedings of IEEE International conference on robotics and automation, France,
pp. 2719-2724.
Dayan P., (1992). The Convergence of TD() for general . Machine Learning, 8:
341-362.
Dayan P., and Sejnowski T., (1994). TD() Converges with Probability 1. Machine
Learning, 14: 295-301.
Dias B., and Stentz A., (2000). A Free Market Architecture for Distributed Control
of a Multirobot System. In: the 6th International Conference on Intelligent
Autonomous Systems, pp. 115-122.
Dietterich T., (1997). Machine Learning Research: Four Current Directions. AI
Magazine 18(4), 97-136.
61
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
[54]
[55]
Doran J., Franklin S., Jennings N., and Norman T., (1997). On Cooperation in
Multi-Robot Systems. The knowledge Engineering review, 12 (3).
Doyle A., (1995). Algorithms and Computational Techniques for Robot Path Planning.
Ph.D. Thesis, School of Electronic Engineering and Computer Systems, University
of Wales.
Dudek G., and Jenkin M., (2000). Computational Principles of Mobile Robotics. Cambridge University Press.
Elfes A., (1987). Sonar-based real-world mapping and navigation. IEEE Transactions on Robotics and Automation, RA-3(3), pp. 249-265.
Elfes A., (1989). Using occupancy grids for mobile robot perception and navigation. IEEE Computer, pp. 46-57.
El-Telbany M., AbdelWhab A., and Shaheen S. (2001). Learning Multi-robot Cooperation by Expertise Distribution. Proceedings of the International Conference
on Industrial Electronics, Technology & Automation. Cairo, Egypt.
El-Telbany M., AbdelWhab A., and Shaheen S. (2002a). Learning Co-ordination by
Spatial and Expertise Distribution in Multi-agent Systems. The Egyptian Journal of
Engineering and Applied Science, 49 (2), 357-367.
El-Telbany M., AbdelWhab A., and Shaheen S. (2002b). Behaviour-Based Multi-Robots Learning: A Reinforcement Learning Survey. The Egyptian Informatics Journal. (To appear).
El-Telbany M., AbdelWhab A., and Shaheen S. (2002c). Speed up Reinforcement
Learning in Multi-robot Domains by Apprentice Learning. In WSEAS
International conference, Greece.
Everett H., (1995). Sensors for Mobile Robots - Theory and Application. A.K. Peters, Wellesley, MA.
Floreano D., (1997). Evolutionary Mobile Robots. In D. Quagliarelli , J Periaux, C.
Polini, and G. Winter (eds), Genetic Algorithms in Engineering and Computer
Science, Chichester: Johan Wiley and Sons, Ltd.
Fontán M., and Matarić M., (1996). A Study of Territoriality: the Role of Critical Mass in Adaptive Task Division. In: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, pp. 553-561.
Fontán M., and Matarić M., (1998). Territorial Multi-Robot Task Division. IEEE Transactions on Robotics and Automation, 14(5).
Gat E., (1998). On three-layer architectures. In: Bonnasso R., and Murphy P. (Eds.), Artificial Intelligence and Mobile Robots, AAAI Press.
Gerkey B., and Matarić M., (2000). Murdoch: Publish/Subscribe Task Allocation for Heterogeneous Agents. Proceedings, Autonomous Agents 2000, Barcelona, Spain, June 3-7, 203-204.
62
[56]
[57]
[58]
[59]
[60]
[61]
[62]
[63]
[64]
[65]
[66]
[67]
[68]
[69]
[70]
[71]
[72]
[73]
Gerkey B., and Matarić M., (2001). Principled Communication for Dynamic Multi-Robot Task Allocation. In: Rus D., and Singh S., (Eds.) Experimental Robotics VII, 353-362, Springer-Verlag, Berlin.
Gerkey B., and Matarić M., (2002). Pusher-Watcher: an Approach to Fault-Tolerant Tightly Coupled Robot Coordination. In: Proceedings of the IEEE International Conference on Robotics and Automation, Washington.
Gerkey B., and Matarić M., (2002). Sold!: Auction Methods for Multi-Robot Coordination. IEEE Transactions on Robotics and Automation.
Goldberg D., (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
Goldberg D., and Matarić M., (1997). Interference as a Tool for Designing and Evaluating Multi-Robot Controllers. In: Proceedings AAAI-97, pp. 637-642.
Goldberg D., and Matarić M., (1999). Coordinating mobile robot group behavior using a model of interaction dynamics. In: Proceedings of the Third International Conference on Autonomous Agents, Washington, ACM Press.
Grefenstette J. and Schultz A. (1994). An Evolutionary approach to learning in
robots. In proceedings Machine Learning Workshop on Robot Learning, New
Brunswick, NJ.
Haykin S., (1999). Neural Networks: A Comprehensive Foundation. Second
Edition, Prentice-Hall, Inc.
Hoffmann F., (1998). Soft computing techniques for the design of the mobile robot
behaviors. Journal of Information Sciences.
Holland J., (1986). Escaping brittleness: the possibilities of general-purpose learning
algorithms applied to parallel rule-based systems. In R.S. Michalski and T.M.
Mitchell (eds.), Machine learning: An Artificial Intelligence Approach, Volume II,
Morgan Kaufmann.
Humphrys M., (1995). W-Learning: Competition among selfish Q-Learners.
Technical Report no. 362, University of Cambridge, Computer Laboratory.
Humphrys M., (1997). Action Selection Methods using Reinforcement Learning.
Ph.D., Thesis, University of Cambridge, Computer Laboratory.
Kaelbling L., Littman M., and Moore A., (1996). Reinforcement Learning: A
Survey. Journal of Artificial Intelligence Research, 4, pp.237-258.
Kaiser M., Dillmann R., and Rogalla O., (1996). Communication as the Basis for
Learning in Multi-Agent Systems. In: Workshop on Learning in Distributed AI
Systems, Budapest, Hungary, 1996.
Karlsson J., (1997). Learning to Solve Multiple Goals. Ph.D. Thesis, University of
Rochester, New York.
Keerthi S., and Ravindran B., (1999) A Tutorial Survey of Reinforcement Learning.
Department of Computer Science, Indian Institute of Science.
Khatib O., (1986). Real-time obstacle avoidance for manipulators and mobile
robots. The International Journal of Robotics Research, 5(1).
Koza J., (1992). Genetic Programming, on the Programming of Computers by
Means of Natural Selection. MIT Press.
63
[74]
[75]
[76]
[77]
[78]
[79]
[80]
[81]
[82]
[83]
[84]
[85]
[86]
[87]
[88]
[89]
[90]
[91]
Kube C. R., and Zhang H., (1992). Collective Robotics Intelligence. In: From
Animal to Animates: Second International conference on Simulation of Adaptive
Behaviors, PP. 460-468.
Kube C. R., and Zhang H., (1993). Collective Robotics: From Social Insects to
Robots. Journal of Adaptive Behavior, 2 (2): 189-219.
Kube C. R., and Zhang H., (1996). The use of Perceptual Cues in Multi-Robot Box-Pushing. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2085-2090.
Kube R., and Bonabeau E., (1998). Cooperative transport by ants and robots.
Journal of Robotics and Autonomous Systems.
Lin Long-Ji, (1992). Self-improving reactive agents based on reinforcement
learning, planning and teaching. Machine learning, 8, 293-321.
Lin Long-Ji, (1993). Scaling up reinforcement learning for robot control.
Proceedings of the 10th International Conference of Machine learning, pp. 182-189.
Lozano-Pérez T., (1983). Spatial Planning: A Configuration Space Approach. IEEE Transactions on Computers, C-32, pp. 108-120.
Lund, H. (1995). Evolving Robot Control System. In Proceedings of the first
NWGA, University of Vaasa, Finland.
Maes P., (1994). Modeling Adaptive Autonomous Agents. Journal of Artificial Life
1(2), 135-162.
Maes P., and Brooks R., (1991). Learning to Coordinate Behaviors. In: Proceedings
of the Eighth National Conference on AI, Boston.
Mahadevan S., (1996a). Machine Learning for robots: A Comparison of Different
Paradigms. In: proceedings of International Conference on Intelligent Robots and
systems.
Mahadevan S., (1996b). The NSF Workshop on Reinforcement Learning: Summary
and Observations. AI Magazine.
Mahadevan S., Connell J., (1992). Automatic Programming of behavior-based
Robots using reinforcement Learning. Artificial Intelligent, pp. 311-365.
Matarić M., (1990). A Distributed Model for Mobile-Robot Environment Learning and Navigation. M.Sc. Thesis, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, USA.
Matarić M., (1992). Minimizing complexity in controlling a collection of mobile robots. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 830-835.
Matarić M., (1994a). Learning Social Behaviors. In: Proceedings of the Third International Conference on Simulation of Adaptive Behavior, pp. 453-462.
Matarić M., (1994b). Interaction and Intelligent Behavior. Ph.D. Thesis, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, USA.
Matarić M., (1994c). Reward Functions for Accelerating Learning. In: Proceedings of the Eleventh International Machine Learning Conference, Morgan Kaufmann Publishers Inc.
Nagayuki Y., Ishii S., and Doya K., (1999). Multi-Agent Reinforcement Learning:
An Approach Based on the Other Agents Internal Model. Advanced
Telecommunications Research Institute, Japan.
Nebot E., and Pagac D., (1995). Quadtree Representations and Ultrasonic Information for
Mapping an Autonomous Guided Vehicles Environment. International Journal of
Computers and Their Applications, 2(3), pp. 160-170.
Nehmzow U., and Mitchell T., (1995). A Prospective Student's Introduction to the Robot Learning Problem. Technical Report UMCS-95-12-6, University of Manchester.
Nehmzow U. and Smithers T., (1992). Using motor actions for location
recognition. In Varela F., and Bourgine P. (Eds.) Toward a practice of autonomous
systems. MIT Press, pp. 96-104.
Ostergaard E., Matarić M., and Sukhatme G., (2001b). Distributed Multi-Robot Task Allocation for Emergency Handling. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Maui, Hawaii, 821-826.
Ostergaard E., Matarić M., and Sukhatme G., (2002). Multi-Robot Task Allocation in the Light of Uncertainty. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA-2002), Washington.
Ostergaard E., Sukhatme G., and Matarić M., (2001a). Emergent Bucket Brigading. Autonomous Agents 2001, Montreal, Canada.
Parker L., (1992). Adaptive Action Selection for Cooperative Agent Teams. In
International Conference on Simulation of Adaptive Behavior, pp. 442-450.
Parker L., (1994). Heterogeneous Multi-robot Cooperation. Ph.D. thesis,
Massachusetts Institute of Technology, Artificial Intelligence laboratory,
Cambridge, USA.
Parker L., (1995a). The Effect of Action Recognition and robot Awareness in
Cooperative Robotic Teams. In proceedings of International Conference on
Intelligent Robotics and Systems, 1, pp. 212-219.
Parker L., (1995b). L-ALLIANCE: A mechanism for adaptive action selection in
heterogeneous multi-robot teams. Technical Report ORNL/TM-13000, Oak Ridge
National laboratory, Tennessee.
Parker L., (1996). On the Design of Behavior-based Multi-Robot Teams. In Journal
of Advanced Robotics, 10, pp. 547-578.
Parker L., (1998a). ALLIANCE: An Architecture for Fault-Tolerant Multi-robot
Cooperation. IEEE Transactions on Robotics and Automation 14(2), pp. 220-240.
Parker L., (1998b). Distributed Control of Multi-robot Teams: Cooperative Baton-Passing Task. In: Proceedings of the 4th International Conference on Information Systems Analysis and Synthesis, Volume 3, pp. 89-94.
Parker L., (2000a). Current State of the art in Distributed Autonomous Mobile
Robotics. In: proceedings of Fifth International Symposium on Distributed
Autonomous Robotic systems.