Summary of Lecture 2: Locomotion Control: Biologically Inspired Artificial Intelligence (WS03: 410)
Practical work
• Recent matches (Friday November 14th):
[Block diagram: CPG and reflexes driving the actuators, with proprioceptive feedback to both]
Concept of Limit Cycle
• A limit cycle is an oscillatory regime in a dynamical system: an isolated periodic orbit that neighbouring trajectories converge to.

CPG-and-reflex control
Two types of implementations:
1. The CPG produces desired positions (sketched below): the desired angle θ̃ is compared with the measured angle θ, and a feedback (PID) controller turns the error into a motor command u sent to the robot.
2. The CPG is a neural oscillator that directly produces the motor signals (Taga 1994).
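A minimal sketch of the first implementation type, assuming the CPG is abstracted as a simple sine generator and the robot as a first-order toy joint; the PID gains, amplitude, and frequency are illustrative, not values from any model in the lecture:

```python
import numpy as np

# Implementation type 1: the CPG outputs a desired joint angle theta_des and a
# feedback (PID) controller turns the tracking error into a motor command u.
# The "CPG" below is just a sine generator; all numeric values are illustrative.

def cpg_desired_angle(t, amplitude=0.5, frequency=1.0, offset=0.0):
    """Desired joint angle produced by the (abstract) CPG at time t."""
    return offset + amplitude * np.sin(2.0 * np.pi * frequency * t)

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def command(self, desired, measured):
        error = desired - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Control loop with a crude first-order "robot joint" standing in for the plant.
dt, theta = 0.01, 0.0
pid = PID(kp=20.0, ki=0.5, kd=1.0, dt=dt)
for step in range(1000):
    t = step * dt
    theta_des = cpg_desired_angle(t)
    u = pid.command(theta_des, theta)   # motor command
    theta += u * dt                     # toy plant: angle rate proportional to u
```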
Taga's neuromechanical simulation
[Figures: walking gait of the simulated biped and the corresponding limit cycle]

Interesting aspects:
• Locomotion seen as a limit cycle due to the global entrainment between the neuro-musculo-skeletal system and the environment
• Robustness against (small) variations in the environment (e.g. small slopes)

Cons:
• Hand-tuning of (many) parameters for obtaining satisfactory limit cycles
Inter-oscillator coupling
• Two parameters (a_ij and b_ij) per coupling

Body CPG
• Model: 40 segments
• Assumptions:
  • Lamprey-like system: chain of oscillators
  • Two oscillators per segment
  • Closest neighbour coupling
  • Double symmetry: left-right and per segment
[Figure: EMG recordings in the axial musculature (Delvolvé et al. 1997)]
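The sketch below illustrates the idea of a chain of oscillators with closest-neighbour coupling, using generic phase oscillators in place of the neural oscillators of the lecture; the coupling weight and phase bias stand in, loosely, for the two parameters a_ij and b_ij per coupling, and all numeric values are illustrative.

```python
import numpy as np

# A chain of coupled phase oscillators as a simplified stand-in for the body CPG.
# Each segment is reduced to a single oscillator; each coupling carries a weight
# and a phase bias (illustrative analogues of a_ij and b_ij).

n_segments = 40
omega = 2.0 * np.pi * 1.0 * np.ones(n_segments)   # intrinsic frequencies (1 Hz)
weight = 4.0                                      # coupling strength
phase_bias = 2.0 * np.pi / n_segments             # desired lag per segment

phases = np.random.uniform(0.0, 2.0 * np.pi, n_segments)
dt = 0.001
for _ in range(20000):
    dphases = omega.copy()
    for i in range(n_segments):
        # closest-neighbour coupling only
        for j in (i - 1, i + 1):
            if 0 <= j < n_segments:
                bias = phase_bias if j < i else -phase_bias
                dphases[i] += weight * np.sin(phases[j] - phases[i] - bias)
    phases += dt * dphases

# After convergence, neighbouring phases differ by roughly phase_bias, i.e. a
# travelling wave propagates down the chain (the regime used for swimming).
segment_outputs = np.cos(phases)
```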
Complete CPG
Generation of a standing wave for walking and of a travelling wave for swimming
Real salamander: from walking to swimming
Real salamander: swimming
Salamander applet
Outcomes
• Simple control signals for controlling the speed, direction, and type of gait (E_body_left, E_body_right, E_limb_left, E_limb_right, and τ)
• Cons:
  • Fewer mathematical tools than other methods
  • Not (yet) a clear design methodology; it is recommended to use learning algorithms

Quadruped robot controlled with a CPG-and-reflex based controller
Kimura Lab, National Univ. of Electro-Communications, Tokyo
[Videos: knee-bending reflex; camera control for obstacle detection and avoidance]
Lecture 3: Learning algorithms
Topics:
1. CPG-and-reflex based control of locomotion (end of lecture 2)
2. Evolutionary algorithms
3. Reinforcement learning

Evolutionary algorithms
There exist different types of learning:
• Evolution
• Supervised learning
• Learning by imitation
• Reinforcement learning
• Unsupervised learning
• …

We will start with an overview of evolutionary algorithms.
GA: algorithm
1. Initial population
• The initial population is normally randomly generated
• In some cases, prior knowledge of the problem can be used to introduce some particular solutions (chromosomes) in the population

GA: encoding
Let's assume we would like to find the maximum of a fitness function f(x,y).

GA: selection
• Parents are chosen depending on their fitness: the higher the fitness, the higher the chance to be chosen
• Different schemes are possible (sketched below):
  • Fitness-based selection: probability directly proportional to the fitness
  • Rank-based selection: probability inversely proportional to the rank (i.e. first, second, …)
  • Tournament selection: pick two potential parents and keep the best (repeat until you have enough parents)
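Minimal sketches of the three selection schemes above, assuming the population is given as a list of (chromosome, fitness) pairs with non-negative fitnesses; function names are illustrative:

```python
import random

def fitness_proportional(population, n_parents):
    """Probability of being picked is directly proportional to fitness."""
    chromosomes = [c for c, f in population]
    fitnesses = [f for c, f in population]
    return random.choices(chromosomes, weights=fitnesses, k=n_parents)

def rank_based(population, n_parents):
    """Probability depends on rank only: best gets the largest weight."""
    ranked = sorted(population, key=lambda cf: cf[1], reverse=True)
    weights = [1.0 / (rank + 1) for rank in range(len(ranked))]   # 1, 1/2, 1/3, ...
    return random.choices([c for c, f in ranked], weights=weights, k=n_parents)

def tournament(population, n_parents):
    """Pick two potential parents at random and keep the better one; repeat."""
    parents = []
    while len(parents) < n_parents:
        a, b = random.sample(population, 2)
        parents.append(a[0] if a[1] >= b[1] else b[0])
    return parents
```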
GA: Crossover operator
Crossover operator: a recombination operator that swaps genetic material between two parent chromosomes.
One-point crossover: both parents are cut at the same randomly chosen point and their tails are swapped to produce two children.

GA: Mutation operator
Mutation operator: each allele in a gene has a probability M of being mutated.
Example: 011101 → 001101 (the second allele is flipped)
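A sketch of both operators on bitstring chromosomes; the mutation probability used in the example call is illustrative:

```python
import random

def one_point_crossover(parent1, parent2):
    """Cut both parents at the same random point and swap their tails."""
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(chromosome, m=0.01):
    """Flip each bit independently with probability m."""
    return [1 - allele if random.random() < m else allele for allele in chromosome]

p1 = [0, 1, 1, 1, 0, 1]
p2 = [0, 0, 1, 1, 0, 1]
c1, c2 = one_point_crossover(p1, p2)
c1 = mutate(c1, m=0.05)
```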
GA: Ending criterion
Different possibilities:

Typical run: generation 0
[Figure: initial population scattered over the fitness landscape f(x,y), with x and y up to 5.0]
Typical run: convergence
[Figure: population converged around the maximum of the fitness landscape f(x,y)]

Typical run
[Plots over generations: maximum, average, and minimum fitness; genetic diversity (e.g. sum of standard deviations of gene values within the population)]
GA: applications
In robotics, GAs are used either to optimize parameters in a controller, e.g. a sinus-based controller or a finite-state machine (i.e. a set of if-then rules), or, more commonly, to optimize parameters such as synaptic weights in a neural network.

Lecture 2: sinus controller
  θ_i = θ_i0 + A_i sin(ν_i t + φ_i)

GA applications: Karl Sims' evolved creatures
• GA used to evolve both body shape and controller
• Fitness function: speed of locomotion
• Controller: special type of neural network (with some neurons producing sinusoidal signals)
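A minimal sketch of how the sinus controller could be encoded for a GA: each joint contributes four genes (θ_i0, A_i, ν_i, φ_i) and a chromosome is decoded into joint-angle trajectories; the genome layout and the fitness idea in the comment are assumptions, not the encoding used in the lecture:

```python
import math

def decode(chromosome, n_joints):
    """Chromosome = [theta0_1, A_1, nu_1, phi_1, theta0_2, ...] (assumed layout)."""
    params = []
    for i in range(n_joints):
        theta0, amp, nu, phi = chromosome[4 * i: 4 * i + 4]
        params.append((theta0, amp, nu, phi))
    return params

def joint_angles(params, t):
    """theta_i(t) = theta_i0 + A_i * sin(nu_i * t + phi_i)."""
    return [theta0 + amp * math.sin(nu * t + phi) for theta0, amp, nu, phi in params]

# A GA would evaluate a chromosome by running the robot (or its simulation)
# with these joint angles and using, e.g., the distance covered as the fitness.
```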
Sims, K., "Evolving Virtual Creatures," Computer Graphics (SIGGRAPH '94) Annual Conference Proceedings, July 1994, pp. 43-50.
Urzelai, J., Floreano, D., Dorigo, M., and Colombetti, M., "Incremental Robot Shaping," Connection Science, 10, 341-360, 1998.
ES: encoding
[Figure: fitness landscape f(x,y); genes are real-valued]

ES: Mutation operator
At each generation, a gene x is mutated as follows:
  x(t+1) = x(t) + N_0(σ_x(t))
where N_0(σ) is a Gaussian random number with mean 0 and standard deviation σ.
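A sketch of this mutation rule, assuming one standard deviation per gene and omitting the self-adaptation of σ used in many ES variants; the numbers are illustrative:

```python
import random

def es_mutate(genes, sigmas):
    """x(t+1) = x(t) + N_0(sigma_x(t)), applied gene by gene."""
    return [x + random.gauss(0.0, sigma) for x, sigma in zip(genes, sigmas)]

genes = [1.2, -0.4, 3.0]
sigmas = [0.1, 0.1, 0.5]
offspring = es_mutate(genes, sigmas)
```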
GP: example of encoding
Symbolic (as opposed to parametric) fitting of a function:
• Functions: +, -, *, /, …
• Terminals: variables or numeric values
[Figure: example expression trees built from these functions and terminals (subexpressions such as b*b − 2*2*a*c and 2*a)]

GP: example of crossover operator
[Figure: two parent trees exchange randomly chosen subtrees to produce two children]
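A small sketch of the GP encoding and subtree crossover, representing an expression as a nested Python list [function, left, right]; the function set, the protected division, and the example tree for b*b − 4*a*c are illustrative choices, not the exact example from the slide:

```python
import random
import copy

def evaluate(tree, env):
    """Evaluate an expression tree; terminals are variable names or numbers."""
    if not isinstance(tree, list):
        return env.get(tree, tree) if isinstance(tree, str) else tree
    op, left, right = tree
    a, b = evaluate(left, env), evaluate(right, env)
    if op == '+': return a + b
    if op == '-': return a - b
    if op == '*': return a * b
    return a / b if b != 0 else 1.0          # protected division

def all_paths(tree, path=()):
    """Paths of every node, so a crossover point can be chosen at random."""
    paths = [path]
    if isinstance(tree, list):
        paths += all_paths(tree[1], path + (1,))
        paths += all_paths(tree[2], path + (2,))
    return paths

def crossover(parent1, parent2):
    """Replace a random subtree of parent1 by a random subtree of parent2."""
    child = copy.deepcopy(parent1)
    target = random.choice(all_paths(child))
    donated = copy.deepcopy(parent2)
    for idx in random.choice(all_paths(donated)):
        donated = donated[idx]
    if not target:                            # whole tree replaced
        return donated
    node = child
    for idx in target[:-1]:
        node = node[idx]
    node[target[-1]] = donated
    return child

# Example tree for b*b - 4*a*c
expr = ['-', ['*', 'b', 'b'], ['*', 4, ['*', 'a', 'c']]]
value = evaluate(expr, {'a': 1.0, 'b': 3.0, 'c': 2.0})   # 9 - 8 = 1.0
```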
Evolutionary algorithms: summary
Pros:
Cons:

Topics:
1. CPG-and-reflex based control of locomotion (end of lecture 2)
2. Evolutionary algorithms
3. Reinforcement learning

The next slides are adapted from Sutton and Barto's course.
Key Features of RL
❐ Learner is not told which actions to take (i.e. no supervision), but receives a reward every so often
❐ Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
[Diagram: the agent sends actions to the environment and receives states and rewards in return]
❐ Policy: what to do
❐ Reward: what is good
❐ Value: what is good because it predicts reward
❐ Model: what follows what
[Diagram: the agent contains a policy, reward, value, and model of the environment, and exchanges states, actions, and rewards with the environment]

Agent and environment interact at discrete time steps: t = 0, 1, 2, …
  Agent observes state at step t: s_t ∈ S
  produces action at step t: a_t ∈ A(s_t)
  gets resulting reward: r_{t+1} ∈ ℝ
  and resulting next state: s_{t+1}
Resulting sequence: … s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, …
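A generic sketch of this interaction loop, with a toy one-dimensional environment and a random policy standing in for the agent; the state space, actions, and reward are invented for illustration:

```python
import random

def random_policy(state, actions):
    """Placeholder agent: pick an action uniformly at random."""
    return random.choice(actions)

def step(state, action):
    """Toy environment: move left/right on a line, reward +1 for reaching 5."""
    next_state = state + (1 if action == 'right' else -1)
    reward = 1.0 if next_state == 5 else 0.0
    done = next_state in (5, -5)
    return next_state, reward, done

state, trajectory = 0, []
for t in range(100):
    action = random_policy(state, ['left', 'right'])
    next_state, reward, done = step(state, action)
    trajectory.append((state, action, reward, next_state))   # (s_t, a_t, r_{t+1}, s_{t+1})
    state = next_state
    if done:
        break
```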
The Agent Learns a Policy
Policy at step t, π_t: a mapping from states to action probabilities
  π_t(s, a) = probability to take action a_t = a when s_t = s

Policy
Stochastic environment: taking action a in state S leads to the next state S' through a stochastic transition.
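A minimal sketch of a stochastic policy as a table of action probabilities per state; the states, actions, and probabilities are made up for illustration:

```python
import random

# pi_t(s, a): for each state, a probability distribution over actions.
policy = {
    'standing': {'walk': 0.7, 'swim': 0.3},
    'in_water': {'walk': 0.1, 'swim': 0.9},
}

def sample_action(state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action('in_water')
```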
Returns
Suppose the sequence of rewards after step t is: r_{t+1}, r_{t+2}, r_{t+3}, …
What do we want to maximize?

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
  R_t = r_{t+1} + r_{t+2} + … + r_T
where T is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes, so future rewards are discounted:
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0..∞} γ^k r_{t+k+1}
where γ, 0 ≤ γ ≤ 1, is the discount rate.
  shortsighted 0 ← γ → 1 farsighted
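The two return definitions, computed on a short reward list; the reward values and the truncation of the infinite sum to a finite list are for illustration only:

```python
def episodic_return(rewards):
    """R_t = r_{t+1} + r_{t+2} + ... + r_T for an episode's reward list."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k+1}; gamma near 0 is shortsighted, near 1 farsighted."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.5]
print(episodic_return(rewards))          # 1.5
print(discounted_return(rewards, 0.9))   # 0.81*1.0 + 0.729*0.5 = 1.1745
```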
Reinforcement learning algorithms
End of lecture 3