Machine Learning and System Identification For Estimation in Physical Systems
2018
Document Version:
Publisher's PDF, also known as Version of record
In this thesis, we draw inspiration from both classical system identification and
modern machine learning in order to solve estimation problems for real-world,
physical systems. The main approach to estimation and learning adopted is optimization based. Concepts such as regularization will be utilized for encoding of prior knowledge, and basis-function expansions will be used to add nonlinear modeling power while keeping data requirements practical.
The thesis covers a wide range of applications, many inspired by robotics, but also extending beyond this already wide field. The use of the proposed methods and algorithms is in many cases illustrated in the real-world applications that motivated the research. Topics covered include dynamics modeling and estimation, model-based reinforcement learning, spectral estimation, friction modeling, and state estimation and calibration in robotic machining.
In the work on modeling and identification of dynamics, we develop regularization strategies that allow us to incorporate prior domain knowledge into flexible, overparameterized models. We make use of classical control theory to gain insight into training and regularization, while using flexible tools from modern deep learning. A particular focus of the work is to allow the use of modern methods in scenarios where gathering data is associated with a high cost.
In the robotics-inspired parts of the thesis, we develop methods that are practically motivated, and we ensure that they are implementable outside the research setting as well. We demonstrate this by performing experiments in realistic settings and providing open-source implementations of all proposed methods and algorithms.
Acknowledgements
I would like to acknowledge the influence of my PhD thesis supervisor Prof. Rolf
Johansson and my Master’s thesis advisor Dr. Vuong Ngoc Dung at SIMTech, who
both encouraged me to pursue the PhD degree, for which I am very thankful. Prof.
Johansson has continuously supported my ideas and let me define my work with
great freedom, thank you.
My thesis co-supervisor, Prof. Anders Robertsson: thank you for your never-ending enthusiasm, good mood and encouragement. When working
100% overtime during hot July nights in the robot lab, it helps to know that one is
never alone.
I would further like to direct my appreciation to friends and colleagues at the
department. It has often fascinated me, how a passionate and excited speaker
can make a boring topic appear interesting. No wonder a group of 50+ highly
motivated and passionate individuals can make an already interesting subject
fantastic. In particular Prof. Bo Bernhardsson, my office mates Gautham Nayak
Seetanadi and Mattias Fält and my travel mates Martin Karlsson, Olof Troeng and
Richard Pates, you have all made the last 5 years outside and at the department
particularly enjoyable.
Credit also goes to Jacob Wikmark, Dr. Björn Olofsson and Martin Karlsson for incredibly generous and careful proofreading of the manuscript of this thesis, and to Leif Andersson for helping out with typesetting; you have all been very helpful!
Finally, I would like to thank my family in Vinslöv who have provided and
continue to provide a solid foundation to build upon, to my family from Sparta
who provided a second home and a source of both comfort and adventure, and to
the welcoming new addition to my family in the Middle East.
Financial support
Parts of the presented research were supported by the European Commission
under the 7th Framework Programme under grant agreement 606156 Flexifab.
Parts of the presented research were supported by the European Commission
under the Framework Programme Horizon 2020 under grant agreement 644938
SARAFun. The author is a member of the LCCC Linnaeus Center and the ELLIIT
Excellence Center at Lund University.
Contents
1. Introduction 13
1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2. Publications and Contributions 16
1
Introduction
Technical computing, sensing and control are well-established fields, still making
steady progress today. Rapid advancements in the ability to train flexible machine
learning models, enabled by amassing data and breakthroughs in the understand-
ing of the difficulties behind gradient-based training of deep architectures, have
made the considerably younger field of machine learning explode with interest.
Together, they have made automation feasible in situations we previously could
not dream of.
The vast majority of applications within machine learning are, thus far, in
domains where data is plentiful, such as image classification and text analysis.
Flexible machine-learning models thrive on large datasets, and much of the advancement of deep learning is often attributed to growing datasets rather than to algorithmic improvements [Goodfellow et al., 2016]. In practice, it took a few
breakthrough ideas to enable training of these deep and flexible architectures,
but few argue with the fact that the size of the dataset is of great importance. In
many domains, notably domains involving mechanical systems such as robots
and cars, gathering the data required to make use of a modern machine-learning
model often proves difficult. While a simple online search returns thousands of
pictures of a particular object, and millions of Wikipedia articles are downloaded
in seconds, collecting a single example of a robot task requires actually operating a
robot, in real time. Not only is this associated with a tremendous overhead, but the data collected during such an experiment, under one particular policy or controller, is also not always informative about the system and its behavior once it has gone through training. This has seemingly made the progress of machine learning in control of physical systems lag behind, and traditional methods still dominate today.
Design methods based on control theory have long served us well. Complex prob-
lems are broken down into subproblems which are easily solved. The complexity
arising when connecting these subsystems together is handled by making the
design of each subsystem robust to uncertainties in its inputs [Åström and Murray,
2010]. While this has been a very successful strategy, it leaves us with a number of
questions. Are we leaving performance on the table by connecting individually
designed systems together instead of optimizing the complete system? Are we
wasting effort designing subsystems using time-consuming, traditional methods,
[Figure 1.1: bar chart showing, for each chapter, the weight (0.0–0.8) of each of the three main topics.]
Figure 1.1 This thesis can be divided into three main topics. This figure indicates
the topic distribution for each chapter, where a dark blue color indicates a strong
presence of a topic. The topic distribution was automatically found using latent
Dirichlet allocation (LDA) [Murphy, 2012].
1.1 Notation
Notation frequently used in the thesis is summarized in Table 1.1. Many methods
developed in the thesis are applied within robotics and we frequently reference
different coordinate frames. The tool-flange frame TF is attached to the tool flange of a robot, the mechanical interface between the robot and the payload or tool. The robot base frame RB is the base of the forward-kinematics function of a manipulator, but could also be, e.g., the frame of an external optical tracking system that measures the location of the tool frame, as in the case of a flying robot. A sensor delivers measurements in the sensor frame S. The joint coordinates, e.g., joint angles for a serial manipulator, are denoted q. The vector of robot joint torques is denoted τ, and external forces and torques acting on the robot are gathered in the wrench f. The Jacobian of a function is denoted J, and the Jacobian of a manipulator is denoted J(q). We use k to denote a vector of parameters to be estimated, except in the case of deep networks, which we parameterize by the weights w. The gradient of a function f with respect to x is denoted ∇x f. We use x_t to denote the state vector at time t in Markov systems, but frequently omit this time index and use x⁺ to denote x_{t+1} in equations where all other variables are given at time t. The matrix ⟨s⟩ ∈ so(3) is formed by the elements of a vector s and has the skew-symmetric property ⟨s⟩ + ⟨s⟩ᵀ = 0 [Murray et al., 1994].
Table 1.1 Definition and description of coordinate frames, variables and notation.
2
Publications and
Contributions
The contributions of this thesis and its author, as well as a list of the papers this
thesis is based on, are detailed below.
Included publications
Bagge Carlson, F., A. Robertsson, and R. Johansson (2015a). “Modeling and identi-
fication of position and temperature dependent friction phenomena without
temperature sensing”. In: Int. Conf. Intelligent Robots and Systems (IROS),
Hamburg. IEEE.
Bagge Carlson, F., R. Johansson, and A. Robertsson (2015b). “Six DOF eye-to-hand
calibration from 2D measurements using planar constraints”. In: Int. Conf.
Intelligent Robots and Systems (IROS), Hamburg. IEEE.
Bagge Carlson, F., A. Robertsson, and R. Johansson (2017). “Linear parameter-
varying spectral decomposition”. In: 2017 American Control Conf (ACC), Seat-
tle.
Bagge Carlson, F., A. Robertsson, and R. Johansson (2018a). “Identification of LTV
dynamical models with smooth or discontinuous time evolution by means
of convex optimization”. In: IEEE Int. Conf. Control and Automation (ICCA),
Anchorage, AK.
Bagge Carlson, F., R. Johansson, and A. Robertsson (2018b). “Tangent-space regu-
larization for neural-network models of dynamical systems”. arXiv preprint
arXiv:1806.09919.
Other publications
The following papers, authored or co-authored by the author of this thesis, cover
related topics in robotics but are not included in this thesis:
and proposed solutions to many unique problems arising in the FSW context.
The chapter also outlines an open-source software framework for simulation of
the state-estimation procedure, intended to guide the user in application of the
method and assembly of the hardware sensing.
The thesis is concluded in Sec. 14.5 with a brief discussion of directions for future work.
Software
The research presented in this thesis is accompanied by open-source software
implementing all proposed methods and allowing reproduction of simulation
results. A summary of the released software is given below.
[LPVSpectral.jl, B.C., 2016] Sparse and LPV spectral-estimation methods; implements [Bagge Carlson et al., 2017].
[PFSeamTracking.jl, B.C. et al., 2016] Seam tracking and simulation [Bagge Carlson et al., 2016].
[JacProp.jl, B.C., 2018] Implements all methods in [Bagge Carlson et al., 2018a].
Part I
Model Estimation
3
Introduction—System
Identification and Machine
Learning
Estimation, identification and learning are three words often used to describe
similar notions. Different fields have traditionally preferred one or the other, but
no matter what term has been used, the concepts involved have been similar, and
the end goals have been the same. The machine learning community talks about model learning: the act of observing data generated by a system and building a model that can either predict the output given an unseen input, or generate new
data from the same distribution as the observed data was generated from [Bishop,
2006; Murphy, 2012; Goodfellow et al., 2016]. The control community, on the other
hand, talks about system identification, the act of perturbing a system using a
controlled input, observing the response of the system and estimating/identifying
a model that agrees with the observations [Ljung, 1987; Johansson, 1993]. Although
terminology, application and sometimes also methods have differed, both fields
are concerned with building models that capture structure observed in data.
This thesis will use the terms more or less interchangeably and they will always
refer to solving an optimization problem. The function we optimize is specifically
constructed to encode how well the model agrees with the observations, or rather,
the degree of mismatch between the model predictions and the data. Optimiza-
tion of a cost function is a very common and the perhaps dominating strategy
in the field, but approaches such as Bayesian inference offer an alternative strat-
egy, focusing on statistical models. Bayesian methods offer interesting and often
valuable insight into the complete posterior distribution of the model parameters
after having observed the data [Bishop, 2006; Murphy, 2012]. This comes at the
cost of computational complexity. Bayesian methods often involve intractable
high-dimensional integration, necessitating approximate solution methods such
as Monte Carlo methods. Variational inference is another popular approximate solution method, which transforms the Bayesian inference problem into an optimization problem over a parameterized probability density [Bishop, 2006; Murphy, 2012].
For control design and analysis, Linear Time-Invariant (LTI) models have been
hugely important, mainly motivated by their simplicity and the fact that both
performance and robustness properties are well understood. The identification
of linear models shares these properties in many regards, and has been a staple of system identification since the very beginning [Ljung, 1987]. Not only are the theory and properties of linear identification well understood, the computational complexity of many of the linear identification algorithms is also favorable.

1 Interpretable machine learning is an emerging field trying to provide insight into the workings of black-box models.
Methods that have been made available by decades of progression of Moore’s
law are, however, often underappreciated among system identification practi-
tioners. With the computational power available today, one can solve large op-
timization problems and high dimensional integrals, leading to the emergence
of the fields of deep learning [Goodfellow et al., 2016], large-scale convex opti-
mization [Boyd and Vandenberghe, 2004] and Bayesian nonparametrics [Hjort
et al., 2010; Gershman and Blei, 2011]. In this thesis, we hope to contribute to
bridging some of the gap between the system-identification literature and modern
machine learning. We believe that the interchange will be bidirectional, because
even though new powerful methods have been developed in the learning com-
munities, classical system identification has both useful domain knowledge and a
strong systems-theoretical background, with well developed concepts such as sta-
bility, identifiability and input design, that are seldom talked about in the learning
community.
x_{t+1} = Ax_t + Bu_t + v_t
y_t = x_t + e_t (3.1)
x⁺ − x̂⁺ = v
x̂⁺ = Ax + Bu (3.2)
and PEM constitutes the optimal method if all errors are equation errors [Ljung,
1987], i.e., e = 0. If we instead adopt the model v = 0, we arrive at the output-error
or simulation-error method [Ljung, 1987; Sjöberg et al., 1995], where we minimize
y⁺ − x̂⁺ = e
x̂⁺ = Ax̂ + Bu (3.3)
The difference between (3.2) and (3.3) may seem subtle, but it has big consequences. In (3.3), no measurements of x are ever used to form the predictions x̂. Instead, the model's own previous predictions are fed back to form new predictions, i.e., the model is simulated.
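The contrast between the two criteria can be made concrete in a few lines of code. The following is a sketch; the scalar system, noise levels and seed are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, T = 0.9, 1.0, 200            # assumed true parameters and horizon
u = rng.standard_normal(T)         # input sequence
x = np.zeros(T + 1)
for t in range(T):
    x[t + 1] = a * x[t] + b * u[t] + 0.1 * rng.standard_normal()
y = x + 0.1 * rng.standard_normal(T + 1)   # y_t = x_t + e_t, cf. (3.1)

def prediction_error(ah, bh):
    # (3.2): one-step predictions are formed from the *measured* state.
    xhat = ah * y[:T] + bh * u
    return np.sum((y[1:] - xhat) ** 2)

def simulation_error(ah, bh):
    # (3.3): predictions are formed from the model's *own* earlier predictions.
    xhat = np.zeros(T + 1)
    for t in range(T):
        xhat[t + 1] = ah * xhat[t] + bh * u[t]
    return np.sum((y[1:] - xhat[1:]) ** 2)
```

Minimizing the two criteria over (ah, bh) generally yields different estimates; the prediction-error cost stays bounded even for unstable parameter values, whereas the simulation error may grow without bound.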
3.2 Stability
3.3 Inductive Bias and Prior Knowledge
guaranteed to return a stable model. One can imagine many ways of dealing with this issue. A conceptually simple way is to search only among stable models. This is in general hard, but successful approaches include that of [Manchester et al., 2012]. Model classes that include only stable models may unfortunately be
restrictive and limit the use of intuition in choosing a model architecture. Another
strategy is to project the found model onto a subset of stable models, provided that
such a projection is available. There is, however, no guarantee that the projection
is the optimal model in the set of stable models. A hybrid approach is to, in each
iteration of an optimization problem, project the model onto the set of stable
models, a technique that in general gradient-based optimization is referred to as
projected gradient descent [Goldstein, 1964]. The hope with such a strategy is that
the optimization procedure will stay close to the desired target set and thus seek
out favorable points within this set, whereas projection of only the final solution
might allow the optimization procedure to stray far away from good solutions
within the desired target set. A closely related approach will be used in Chap. 13,
where the optimization variable is a rotation matrix in SO(3), a space which is
easy to project onto but harder to optimize over directly.
The set of stable discrete-time LTI models is easy to describe: as long as all eigenvalues of the A matrix in (3.1) have magnitude no greater than 1, the model is stable [Åström and Murray, 2010; Glad and Ljung, 2014]. If all eigenvalue magnitudes are strictly less than one, the model is exponentially stable and all energy contained within the system will eventually decay to zero. For nonlinear models, characterizing the set of stable models is in general much harder. One way of proving that a nonlinear system is stable is to find a Lyapunov function; systematic ways of finding such a function are unfortunately lacking.
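For discrete-time LTI models, the eigenvalue condition is straightforward to check numerically; a small sketch (the function name and example matrices are my own):

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of A; a value < 1 implies exponential
    stability of x+ = Ax, so all energy in the system eventually decays."""
    return np.max(np.abs(np.linalg.eigvals(A)))

A_stable = np.array([[0.5, 0.2],
                     [0.0, 0.8]])    # triangular: eigenvalues 0.5 and 0.8
A_unstable = np.array([[1.1, 0.0],
                       [0.3, 0.4]])  # triangular: eigenvalues 1.1 and 0.4

print(spectral_radius(A_stable) < 1)    # True
print(spectral_radius(A_unstable) < 1)  # False
```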
the capacity is there. The inductive bias, however, is clearly more towards natural
images.
Closely related to inductive bias are the concepts of statistical priors and
regularization, both of which are explicit attempts at endowing the model with
inductive bias [Murphy, 2012]. The concept of using regularization to encode prior
knowledge will be used extensively in the thesis.
A different approach to encoding prior knowledge is intelligent initialization
of overparameterized models. It is well known that the gradient descent algorithm converges to the minimum-norm solution for overparameterized convex problems if initialized near zero [Wilson et al., 2017]. This can be seen as an implicit bias, or regularization, encoded by the initialization. Similarly, known time constants can be encoded by initializing the matrices of recurrent mappings with well-chosen eigenvalues, or as differentiation chains, etc. These topics will not be discussed much further in this thesis, but may be worth considering during modeling and training.
Can the problem of estimating models for dynamical control systems be reduced to that of finding an architecture with the appropriate inductive bias? We argue that it is at least beneficial to have the model architecture working with us rather than against us. The question then becomes: how can we construct our models such that they want to learn good dynamics models? Decades of research in classical control theory and system identification will hopefully prove useful in answering this question. We hope that the classical control perspective and the modern machine learning perspective come together in this thesis, helping us find good models for dynamical systems.
4
State Estimation
Unfortunately, very few pairs of distributions p(x⁺|x) and p0 will lead to a tractable integral in (4.1) and a distribution p(x⁺) that we can represent in closed form. The particle filter therefore approximates p0 with a collection of samples, or particles, {x̂^i}_{i=1}^N, where each particle can be seen as a distinct hypothesis of the correct state. Particles are easily propagated through p(x⁺|x) to obtain a new collection at time t = 1, forming a sampled representation of p(x1).
When a measurement y becomes available, we associate each particle with a
weight given by the likelihood of the measurement given the particle state and the observation model p(y|x). Particles that represent state hypotheses that yield a high likelihood are deemed more likely to be correct, and are given a higher weight.

1 A notable exception to this is recursive least-squares estimation of a linear combination of parameters [Ljung and Söderström, 1983; Åström and Wittenmark, 2013b].
The collection of particles will spread out more and more with each application
of the dynamics model f . This is a manifestation of the curse of dimensionality,
since the dimension of the space that the density p(x 0:t ) occupies grows with t .
To mitigate this, a re-sampling step is performed. The re-sampling favors particles
with higher weights and thus focuses the attention of the finite collection of
particles to areas of the state-space with high posterior density. We can thus think
of the particle filter as a continuous analogue to the approximate branch-and-
bound method beam search [Zhang, 1999].
The recent popularity of particle filters, fueled by the increase in available
computational power, has led to a corresponding increase in publications describ-
ing the subject. Interested readers may refer to one of such publications for a
more formal description of the subject, e.g., [Gustafsson, 2010; Thrun et al., 2005;
Rawlings and Mayne, 2009].
We summarize the particle filter algorithm in Algorithm 1.
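The propagate–weight–resample cycle described above can be sketched in a few lines; a minimal bootstrap filter for a scalar model, where the dynamics, noise levels and particle count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000                        # number of particles
sig_v, sig_e = 0.5, 0.3         # process and measurement noise std
f = lambda x: 0.9 * x           # assumed state-transition mean

def pf_step(particles, y):
    # 1. Propagate each hypothesis through the dynamics model p(x+|x)
    particles = f(particles) + sig_v * rng.standard_normal(N)
    # 2. Weight by the measurement likelihood p(y|x), here Gaussian
    w = np.exp(-0.5 * ((y - particles) / sig_e) ** 2)
    w /= w.sum()
    # 3. Resample to focus the particles on high-posterior regions
    return rng.choice(particles, size=N, p=w)

particles = rng.standard_normal(N)      # samples from the prior p0
for y in (0.1, 0.2, 0.15):              # a short measurement sequence
    particles = pf_step(particles, y)
estimate = particles.mean()             # posterior mean estimate
```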
LEMMA 1
An affine transformation of a normally distributed random variable is normally distributed, with mean and covariance
x ∼ N(µ, Σ) (4.2)
y = c + Bx (4.3)
y ∼ N(c + Bµ, BΣBᵀ) (4.4)
□
LEMMA 2
When both prior and likelihood are Gaussian, the posterior distribution is Gaussian with
µ̄ = Σ̄(Σ0⁻¹µ0 + Σ1⁻¹µ1) (4.5)
Σ̄ = (Σ0⁻¹ + Σ1⁻¹)⁻¹ (4.6)
where the terms in the first equation were expanded, all terms including x collected and the square completed. Terms not including x become part of the normalization constant and do not determine the mean or covariance. □
COROLLARY 1
The equations for the posterior mean and covariance can be written in update form according to
µ̄ = µ0 + K(µ1 − µ0) (4.11)
Σ̄ = Σ0 − KΣ0 (4.12)
K = Σ0(Σ0 + Σ1)⁻¹ (4.13)
Proof. The expression for Σ̄ is obtained from the matrix inversion lemma applied to (4.6), and µ̄ is obtained by expanding Σ̄, first in front of µ0 using (4.12), and then in front of µ1 using (4.6) together with the identity (Σ0⁻¹ + Σ1⁻¹)⁻¹ = Σ0(Σ0 + Σ1)⁻¹Σ1. □
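The equivalence between the update form (4.11)–(4.13) and the standard information form of the Gaussian posterior, Σ̄ = (Σ0⁻¹ + Σ1⁻¹)⁻¹ and µ̄ = Σ̄(Σ0⁻¹µ0 + Σ1⁻¹µ1), is easy to verify numerically; the example matrices below are arbitrary:

```python
import numpy as np

S0 = np.array([[2.0, 0.3], [0.3, 1.0]])  # prior covariance Σ0
S1 = np.array([[1.0, 0.0], [0.0, 0.5]])  # likelihood covariance Σ1
m0 = np.array([1.0, 2.0])                # prior mean µ0
m1 = np.array([0.0, 1.0])                # likelihood mean µ1

# Information (inverse-covariance) form of the posterior
Sbar = np.linalg.inv(np.linalg.inv(S0) + np.linalg.inv(S1))
mbar = Sbar @ (np.linalg.inv(S0) @ m0 + np.linalg.inv(S1) @ m1)

# Update form (4.11)-(4.13)
K = S0 @ np.linalg.inv(S0 + S1)
mbar2 = m0 + K @ (m1 - m0)
Sbar2 = S0 - K @ S0

assert np.allclose(mbar, mbar2) and np.allclose(Sbar, Sbar2)
```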
4.3 The Kalman Filter
x_{t+1} = Ax_t + Bu_t + v_t (4.14)
y_t = Cx_t + e_t (4.15)
where the noise terms v and e are independent and Gaussian with mean zero and covariance R1 and R2, respectively. The estimation begins with an initial estimate of the state, x0, with covariance P0. By iteratively applying (4.14) to x0, we obtain
x̂_{t|t−1} = Ax̂_{t−1|t−1} + Bu_{t−1} (4.16)
P_{t|t−1} = AP_{t−1|t−1}Aᵀ + R1 (4.17)
where both equations follow from Lemma 1, and the notation x̂_{i|j} denotes the estimate of x at time i, given information available at time j. Equation (4.17) clearly illustrates that the covariance after a time update is the sum of a term due to the covariance from the previous time step and the added term R1, which is the uncertainty added by the state-transition noise v. We further note that the properties of A determine whether or not these equations alone are stable. For stable A and u ≡ 0, the mean estimate of x converges to zero, with a stationary covariance given by the solution to the discrete-time Lyapunov equation P = APAᵀ + R1.
Equations (4.16) and (4.17) constitute the prediction step; we will now proceed to also incorporate a measurement of the state in the measurement update step.
By Lemma 1, the mean and covariance of the expected measurement are given by
ŷ_{t|t−1} = Cx̂_{t|t−1} (4.18)
P^y_{t|t−1} = CP_{t|t−1}Cᵀ (4.19)
which, if we drop C in front of both ŷ and P^y, and Cᵀ at the end of P^y, turns into
x̂_{t|t} = x̂_{t|t−1} + K_t (y_t − Cx̂_{t|t−1}) (4.23)
P_{t|t} = P_{t|t−1} − K_t C P_{t|t−1} (4.24)
K_t = P_{t|t−1}Cᵀ (CP_{t|t−1}Cᵀ + R2)⁻¹ (4.25)
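Equations (4.16)–(4.17) and (4.23)–(4.25) translate directly into code; a minimal sketch, where the example system and noise covariances are made up:

```python
import numpy as np

def kf_predict(xhat, P, A, B, u, R1):
    # Time update, (4.16)-(4.17)
    return A @ xhat + B @ u, A @ P @ A.T + R1

def kf_update(xhat, P, C, y, R2):
    # Measurement update, (4.23)-(4.25)
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R2)
    return xhat + K @ (y - C @ xhat), P - K @ C @ P

# Illustrative two-state system (position and velocity)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
R1, R2 = 0.01 * np.eye(2), np.array([[0.1]])

xhat, P = np.zeros(2), np.eye(2)
xhat, P = kf_predict(xhat, P, A, B, np.array([1.0]), R1)
trace_before = np.trace(P)
xhat, P = kf_update(xhat, P, C, np.array([0.5]), R2)
# Incorporating a measurement never increases the covariance:
assert np.trace(P) <= trace_before
```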
5
Dynamic Programming
Dynamic programming (DP) is a general strategy due to Bellman (1953) for solving
problems that enjoy a particular structure, often referred to as optimal substruc-
ture. In DP, the problem is broken down recursively into overlapping sub-problems,
the simplest of which is easy to solve. While DP is used to solve problems in a
diverse set of applications, such as sequence alignment, matrix-chain multipli-
cation and scheduling, we will focus our introduction on the application to op-
timization problems where the sequential structure arises due to time, such as
state-estimation, optimal control and reinforcement learning.
V^µ(x_t) = c_t + V^µ(x_{t+1}) = c_t + V^µ(f(x_t, u_t)) (5.1)
Of particular interest is the optimal value function V*, i.e., the value function of the optimal controller µ*:
V*(x_t) = min_u ( c(x_t, u) + V*(f(x_t, u)) ) (5.2)
5.1 Optimal Control
We can thus both solve for V*_{T−1} and represent it efficiently using a single positive definite matrix. The algorithm for calculating the optimal V* and the optimal controller µ* is in this case called the Linear-Quadratic Regulator (LQR) [Åström, 2012].
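For concreteness, the backward dynamic-programming recursion for LQR (the standard discrete-time Riccati iteration) can be sketched as follows; the quadratic cost x'Qx + u'Ru and the scalar example are illustrative:

```python
import numpy as np

def lqr_backward(A, B, Q, R, QT, T):
    """Backward DP for x+ = Ax + Bu with stage cost x'Qx + u'Ru and
    terminal cost x'QT x.  Returns S with V*_0(x) = x'Sx, and the
    time-indexed feedback gains, u_t = -L_t x_t."""
    S = QT
    gains = []
    for _ in range(T):
        L = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)  # optimal gain
        S = Q + A.T @ S @ (A - B @ L)                      # Riccati step
        gains.append(L)
    return S, gains[::-1]

# Scalar example with a = b = q = r = 1: the stationary cost-to-go
# approaches the fixed point of s = 1 + s/(1 + s), the golden ratio.
A = B = Q = R = np.array([[1.0]])
S, gains = lqr_backward(A, B, Q, R, Q, 100)
```

Note the backward direction: the gains early in the returned list are the most converged and approach the stationary infinite-horizon gain.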
The similarity with the Kalman filter is no coincidence. The Kalman filter
essentially solves the maximum-likelihood problem, which when the noise is
Gaussian is equivalent to solving a quadratic optimization problem. The LQR
algorithm and the Kalman filter are thus dual to each other. This duality between
linear control and estimation problems is well known and most classical control
texts discuss it. In Chap. 7, we will explore the similarities further and let them
guide us to efficient algorithms for identification problems.
Iterative LQR
The LQR algorithm is incredibly powerful in the restricted setting where it applies.
In O (T ) time it calculates both the optimal policy and the optimal value function.
Its applicability is unfortunately limited to linear systems, but these systems may
be time varying. An algorithm that makes use of LQR for nonlinear systems is
Iterative LQR (iLQR) [Todorov and Li, 2005]. By linearizing the nonlinear system
along the trajectory, the LQR algorithm can be employed to estimate an optimal
control signal sequence. This sequence can be applied to the nonlinear system
in simulation to obtain a new trajectory along which we can linearize the system
and repeat the procedure. This algorithm is a special case of a more general
algorithm, Differential Dynamic Programming (DDP) [Mayne, 1966], where a
quadratic approximation to both a general cost function and a nonlinear dynamics
model is formed along a trajectory, and the dynamic-programming problem is
solved.
Since both DDP, and the special case iLQR, make use of linear approximations
of the dynamics, a line search or trust region must be employed in order to ensure
convergence. We will revisit this topic in Chap. 11, where we employ iLQR to solve a reinforcement-learning problem using estimation techniques developed in Chap. 7.
1 Showing this involves algebra remarkably similar to the derivations in Sec. 4.3.
5.2 Reinforcement Learning
Table 5.1 The RL landscape. Methods marked with * or (*) estimate (may estimate) a value function, and methods marked with † or (†) estimate (may estimate) an explicit policy. ¹[Levine and Koltun, 2013], ²[Sutton et al., 2000], ³[Watkins and Dayan, 1992], ⁴[Sutton, 1991], ⁵[Silver et al., 2014], ⁶[Rummery and Niranjan, 1994], ⁷[Williams, 1988], ⁸[Schulman et al., 2015].
incoming data does not always hold any information regarding the optimal value
function, which is the ultimate goal of learning. Model-based methods, on the
other hand, use the incoming data to learn about the dynamics of the agent and
the environment. While it is possible to imagine an environment with evolving
dynamics, the dynamics are often laws of nature and do not change, or at least
change much slower than the value function and the policy, quantities we are
explicitly modifying continuously. This is one of the main reasons model-based
methods tend to be more data efficient than model-free methods.
Model-based methods are not without problems though. Optimization un-
der an inaccurate model might cause the RL algorithm to diverge. In Chap. 11,
we will make use of models and identification methods developed in Part I for
reinforcement-learning purposes. The strategy will be based on trajectory op-
timization under estimated models and an estimate of the uncertainty in the
estimated model will be taken into account during the optimization.
6
Linear Quadratic
Estimation and
Regularization
6.2 Least-Squares Estimation
R̃ = USVᵀ (6.1)
R = U diag(1, 1, det(UVᵀ)) Vᵀ (6.2)
y = Ak (6.3)
where
y = [y1, …, yN]ᵀ, A = [u1 v1; …; uN vN] ∈ ℝ^{N×2}, k = [k1, k2]ᵀ
and solving the optimization problem of Eq. (6.4), with solution (6.5).
THEOREM 1
The vector k* of parameters that solves the optimization problem
k* = arg min_k ‖y − Ak‖₂² (6.4)
is given by
k* = (AᵀA)⁻¹Aᵀy (6.5)
Proof. Expanding the cost function, we obtain
J = ‖y − Ak‖₂² = (y − Ak)ᵀ(y − Ak)
= yᵀy − yᵀAk − kᵀAᵀy + kᵀAᵀAk
= (k − (AᵀA)⁻¹Aᵀy)ᵀ AᵀA (k − (AᵀA)⁻¹Aᵀy) + yᵀ(I − A(AᵀA)⁻¹Aᵀ)y
where we identify the last expression as a sum of two terms: one that does not depend on k, and a quadratic form that is positive semidefinite (AᵀA is always positive (semi)definite). The estimate k* that minimizes J is thus the value that makes the quadratic form equal to zero. □
The expression (6.5) is known as the least-squares solution, and the matrix (AᵀA)⁻¹Aᵀ, defined whenever A has full column rank, is commonly referred to as the pseudo-inverse of A. If A is a square matrix, the pseudo-inverse reduces to the standard matrix inverse. If A, however, is a tall matrix, the equation y = Ak is overdetermined and (6.5) produces the solution k* that minimizes (6.4). We emphasize that the important property of the model y_n = k1 u_n + k2 v_n that allows us to find the solution to the optimization problem in closed form is that the parameters to be estimated enter linearly. The signals u and v may be arbitrarily complex functions of some other variable, as long as these are known.
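A small numerical illustration of this point, with fabricated data: u and v below are nonlinear functions of time, yet the problem remains linear in k and is solved by (6.5).

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 100)
u = np.sin(2 * np.pi * t)        # known nonlinear regressor
v = t ** 2                       # another known nonlinear regressor
k_true = np.array([2.0, -1.0])
y = k_true[0] * u + k_true[1] * v + 0.01 * rng.standard_normal(100)

A = np.column_stack([u, v])      # N x 2 regressor matrix, cf. (6.3)
k_hat = np.linalg.solve(A.T @ A, A.T @ y)   # the solution (6.5)
```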
Consistency
A consistent estimate is one which is asymptotically unbiased and has vanishing
variance as the number of data points grows. The consistency of the least-squares
estimate can be analyzed by calculating the bias and variance properties. Consider
the standard model, with an added noise term v, for which consistency is given by
the following theorem:
THEOREM 2
The closed-form expression k̂ = (A^T A)^{-1} A^T y is an unbiased and consistent estimate
of k in the model

y = Ak + v, \quad v \sim \mathcal{N}(0, \sigma^2), \quad \mathbb{E}\{A^T v\} = 0
Proof The bias and variance of the resulting least-squares estimate are

\mathbb{E}\{\hat{k}\} = \mathbb{E}\{(A^T A)^{-1} A^T (Ak + v)\} = k + \mathbb{E}\{(A^T A)^{-1} A^T v\}

If the regressors are uncorrelated with the noise, \mathbb{E}\{(A^T A)^{-1} A^T v\} = 0, we can con-
clude that \mathbb{E}\{\hat{k}\} = k and the estimate is unbiased. For the variance, we have

\mathbb{V}\{\hat{k}\} = \mathbb{E}\{(A^T A)^{-1} A^T v v^T A (A^T A)^{-1}\}
= \sigma^2 \mathbb{E}\{(A^T A)^{-1}\}
= \sigma^2 (A^T A)^{-1}

which vanishes as the number of data points grows, provided that the excitation is
persistent; the estimate is thus consistent. □
The least-squares problem (6.4)
is convex and admits a particularly simple, closed-form expression for the min-
imum. If another norm is used instead of the L_2 norm, the estimate will have
different properties. The choice of other norms will, in general, not admit a solution in closed form, but for many norms of interest, the optimization problem
remains convex. This fact will in practice guarantee that a global minimum can
be found easily using iterative methods. Many of the methods described in this
thesis could equally well be solved with another convex loss function, such as
the L 1 norm for increased robustness, or the L ∞ norm for a minimum worst-case
scenario. For an introduction to convex optimization and a description of the
properties of different convex loss functions, see [Boyd and Vandenberghe, 2004].
Computation
Although the solutions to the least-squares problems are available in closed form,
it is ill-advised to actually perform the calculation k = (A^T A)^{-1} A^T y [Golub and Van
Loan, 2012]. Numerically more robust strategies include
• performing a Cholesky factorization of A^T A,
• performing a QR-decomposition of A,
• performing a singular-value decomposition of A,
where the latter two methods avoid the calculation of A^T A altogether, which can
Van Loan, 2012]. In fact, the method of performing a Cholesky factorization of
A^T A is closely related to the QR-decomposition: with A = QR, we have

A^T A = (QR)^T (QR) = R^T R

so the Cholesky factor of A^T A coincides with the triangular factor R of the QR-decomposition.
Many numerical computation tools, including Julia, Matlab and numpy, pro-
vide numerically robust methods to calculate the solution to the least-squares
problem, indicated in Algorithm 2. These methods typically analyze the matrix
A and choose a suitable numerical algorithm to execute based on its properties
[Julialang, 2017].
Algorithm 2  Syntax for solving the least-squares problem k = (A^T A)^{-1} A^T y in differ-
ent programming languages.
k = A\y                            # Julia
k = A\y                            % Matlab
k = numpy.linalg.lstsq(A, y)[0]    # Python with numpy
k <- qr.solve(A, y)                # R
If the number of features, and thus the matrix ATA, is too large for the problem
to be solved by factorizing A, an iterative method such as conjugate gradients or
GMRES [Saad and Schultz, 1986] can solve the problem by performing matrix-
vector products only.
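A minimal sketch of this idea, using conjugate gradients on the normal equations (the CGNR variant), so that only products with A and A^T are required; the problem data below are illustrative:

```python
import numpy as np

def cg_normal_equations(A, y, iters=100, tol=1e-10):
    """Solve min ||y - A k||^2 via conjugate gradients on A^T A k = A^T y,
    touching A only through matrix-vector products (A^T A is never formed)."""
    k = np.zeros(A.shape[1])
    r = A.T @ y - A.T @ (A @ k)     # residual of the normal equations
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A.T @ (A @ p)          # two mat-vec products per iteration
        alpha = rs / (p @ Ap)
        k += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return k

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20))
k_true = rng.standard_normal(20)
y = A @ k_true
print(np.linalg.norm(cg_normal_equations(A, y) - k_true))
```

For very large or sparse problems, library implementations such as LSQR or GMRES with preconditioning are preferable to this bare-bones loop.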
6.3 Basis-Function Expansions

Intuitively, a basis-function expansion (BFE) decomposes an intricate function
or signal into a linear combination of simple basis functions. The Fourier transform
can be given this interpretation, where an arbitrary signal is decomposed as a sum
of complex exponentials. A simple example is the polynomial model f(v) = \phi(v) k,
where \phi(v) = [v^0 \; v^1 \; \cdots \; v^J] is the vector of basis-function activations. The function
f(v) = \phi(v) k can be highly nonlinear and even discontinuous in v, but is linear in
the parameters, making it easy to fit to data.
While the low-order monomials v^i are easy to work with and provide a reasonable
fit when the relationship between y and v is simple, they tend to perform
worse when the relationship is complex.
Figure 6.2 Basis-function expansions fit to noisy data from the function
f (v) = 0.3v 2 − 0.5 using normalized (-) and nonnormalized (- -) basis functions.
Non-normalized basis functions are shown mirrored in the vertical axis.
Normalization
For some applications, it may be beneficial to normalize the kernel vector for each
input point [Bugmann, 1998] such that
\bar{\phi}(v) = \left( \sum_{i=1}^{K} \kappa(v, \mu_i, \gamma_i) \right)^{-1} \phi(v)
One major difference between a standard BFE and a normalized BFE (NBFE) is
the behavior far (in terms of the width of the basis functions) from the training
data. The prediction of a BFE will tend towards zero, whereas the prediction from
an NBFE tends to keep its value close to the boundary of the data. Figure 6.2
shows the fit of two BFEs of the function f (v) = 0.3v 2 − 0.5 together with the basis
functions used. The BFE tends towards zero both outside the data points and in
the interval of missing data in the center. The NBFE on the other hand generalizes
better and keeps its current prediction trend outside the data. The performance
of NBFEs is studied in detail in [Bugmann, 1998].
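A compact numpy sketch of fitting a Gaussian BFE, with and without normalization; the centers, widths and test function follow the example above, while the helper names are our own:

```python
import numpy as np

def gaussian_kernels(v, mu, gamma):
    # One Gaussian basis function per center mu_i with precision gamma
    return np.exp(-gamma * (v[:, None] - mu[None, :]) ** 2)

def fit_bfe(v, y, mu, gamma, normalize=False):
    Phi = gaussian_kernels(v, mu, gamma)
    if normalize:
        # Normalized activations phi_bar(v), cf. the expression above
        Phi = Phi / Phi.sum(axis=1, keepdims=True)
    k, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return k, Phi

rng = np.random.default_rng(0)
v = np.linspace(-4, 4, 200)
y = 0.3 * v**2 - 0.5 + 0.05 * rng.standard_normal(v.size)

mu = np.linspace(-4, 4, 10)          # basis-function centers
k, Phi = fit_bfe(v, y, mu, gamma=1.0, normalize=True)
err = np.max(np.abs(Phi @ k - (0.3 * v**2 - 0.5)))
print(err)
```

Evaluating both variants outside the interval [-4, 4] reproduces the qualitative behavior of Fig. 6.2: the nonnormalized prediction decays toward zero while the normalized one holds its boundary trend.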
Figure 6.3 Illustration of the effect of different regularization terms on the struc-
ture of the solution vector. The squared 2-norm promotes small components, the
1-norm promotes sparse components (few nonzero elements) and the nonsquared,
group-wise 2-norm promotes sparse groups (clusters of elements).
6.4 Regularization
A regularized least-squares problem takes the form \min_k \|y - Ak\|_2^2 + \lambda f(k), where f is a regularization function. Examples of f include
• \|k\|_2^2 to promote small k
• \|k\|_1 to promote sparse k
• \sum_t \|k_t\|_2 to promote group-sparse k_t
with effects illustrated in Fig. 6.3. We will detail the effects of the mentioned
example regularization terms in the following sections.
number of model parameters is very large and therefore seldom used. In machine
learning it is more common to penalize a convex function of the parameter vector,
such as its norm. Depending on the norm in which we measure the size of the
parameter vector, this procedure has many names. For the common L 2 norm,
the resulting method is commonly referred to as Tikhonov regularized regression,
ridge regression or weight decay if one adopts an optimization perspective, or
maximum a posteriori (MAP) estimation with a Gaussian prior, if one adopts a
Bayesian view on the estimation problem [Murphy, 2012]. If the problem is linear
in the parameters, the solution to the resulting optimization problem remains
in closed form, as indicated by the following theorem. Here, we demonstrate
an alternative way of deriving the least-squares solution, based on differentiation
instead of completion of squares.
THEOREM 3
The vector k^* of parameters that solves the optimization problem

k^* = \arg\min_k \; \tfrac{1}{2}\|y - Ak\|_2^2 + \tfrac{\lambda}{2}\|k\|_2^2 \qquad (6.7)

is given by k^* = (A^T A + \lambda I)^{-1} A^T y.

Proof Setting the derivative of the cost function J to zero yields

\frac{dJ}{dk} = -A^T (y - Ak) + \lambda k = 0
(A^T A + \lambda I) k = A^T y
k = (A^T A + \lambda I)^{-1} A^T y

Since A^T A is positive semidefinite, both first- and second-order conditions for a
minimum are satisfied by k^* = (A^T A + \lambda I)^{-1} A^T y. □
REMARK 1
When deriving the expression for k^* by differentiation, the factors 1/2 appearing in
(6.7) are commonly inserted for aesthetic purposes; they do not affect k^*. □
We immediately notice that the solution to the regularized problem (6.7) reduces
to the solution of the ordinary least-squares problem (6.4) in the case λ = 0. The
[Figure 6.4: Level curves of the data-fit term \|y - \hat{y}(k)\|^2 together with level curves of the penalties \|k\|_2^2 and \|k\|_1 in the (k_1, k_2)-plane.]
regularization adds the positive term λ to all diagonal elements of A^T A, which
reduces the condition number of the matrix to be inverted and ensures that the
problem is well posed [Golub and Van Loan, 2012]. The regularization reduces the
variance in the estimate at the expense of the introduction of a bias.
For numerically robust methods of solving the ridge-regression problem, see,
e.g., the excellent manual by Hansen (1994).
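A small numpy illustration of the closed-form ridge solution and its conditioning effect; the nearly collinear data are constructed for illustration, and a production implementation would factorize the stacked matrix [A; \sqrt{\lambda} I] instead of forming A^T A:

```python
import numpy as np

def ridge(A, y, lam):
    # Closed-form Tikhonov/ridge solution (A^T A + lam*I)^{-1} A^T y
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
A[:, 4] = A[:, 3] + 1e-6 * rng.standard_normal(100)  # nearly collinear columns
y = A @ np.ones(5) + 0.1 * rng.standard_normal(100)

lam = 1e-2
# Regularization adds lam to every eigenvalue of A^T A, improving conditioning
print(np.linalg.cond(A.T @ A))                     # very large
print(np.linalg.cond(A.T @ A + lam * np.eye(5)))   # far smaller
k = ridge(A, y, lam)
```

The same solution is obtained from `np.linalg.lstsq` applied to the augmented system with rows \sqrt{\lambda} I appended to A and zeros appended to y, which avoids squaring the condition number.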
L 1 -regularized regression
L 1 regularization is commonly used to promote sparsity in the solution [Boyd
and Vandenberghe, 2004; Murphy, 2012]. The sparsity promoting effect can be
understood by considering the gradient of the penalty function, which remains
large even for small values of the argument. In contrast, the squared L 2 norm
has a vanishing gradient for small arguments. Further intuition for the sparsity
promoting quality of the L 1 norm is gained by considering the level curves of the
function, see Fig. 6.4. A third way of understanding the properties of the L_1-norm
penalty is as the convex relaxation of the L_0 penalty \sum_i \mathbb{1}\{k_i \neq 0\}, i.e., the number of
nonzero entries in k.
The L_1 norm is a convex function, but L_1-regularized problems do not admit a
solution in closed form and, worse yet, the L_1 norm is nonsmooth. When this kind of
problem arises in this thesis, the ADMM algorithm [Parikh and Boyd, 2014] will
be employed to efficiently find a solution.1
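While the thesis employs ADMM, the sparsity mechanism can be illustrated with the simpler proximal-gradient (ISTA) iteration, whose core is the same soft-thresholding operator; all problem data below are synthetic:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t*||x||_1: shrinks toward zero, exact zeros appear
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(A, y, lam, iters=2000):
    # minimize 0.5*||y - A k||^2 + lam*||k||_1 by proximal gradient descent
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    k = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ k - y)
        k = soft_threshold(k - grad / L, lam / L)
    return k

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))
k_true = np.zeros(30)
k_true[[2, 7, 15]] = [1.5, -2.0, 1.0]    # sparse ground truth
y = A @ k_true + 0.01 * rng.standard_normal(100)

k = lasso_ista(A, y, lam=0.5)
print(np.flatnonzero(np.abs(k) > 1e-6))  # indices of the nonzero coefficients
```

Because the threshold sets small coordinates exactly to zero, the recovered k is genuinely sparse, in contrast to the ridge solution which merely shrinks all coefficients.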
grouping of variables.
If the parameter vector is known to be group sparse, i.e., some of the groups
are exactly zero, one way to encourage a solution with this property is to add the
group-lasso penalty [Yuan and Lin, 2006]
\sum_t \|k_t\|_2 \qquad (6.9)
The addition of (6.9) to the cost function will promote a solution where the length
of some k t is exactly zero. To understand why this is the case, one can interpret
(6.9) as the L 1 -norm of lengths of vectors k t . Of key importance in (6.9) is that the
norm is nonsquared, as the sum of squared L 2 norms over groups coincides with
the standard squared L 2 norm penalty without groups
\sum_t \|k_t\|_2^2 = \|k\|_2^2
Trend filtering
An important class of signal-reconstruction methods that has been popularized
lately is trend filtering methods [Kim et al., 2009; Tibshirani et al., 2014]. Trend
filtering methods work by specifying a fitness criterion that determines the good-
ness of fit, as well as a regularization term, often chosen with sparsity promoting
qualities. As a simple example, consider the reconstruction ŷ of a noisy signal
y = \{y_t \in \mathbb{R}\}_{t=1}^{T} with piecewise constant segments. To this end, we may formulate
and solve the convex optimization problem

\operatorname*{minimize}_{\hat{y}} \;\; \|y - \hat{y}\|_2^2 + \lambda \sum_t |\hat{y}_{t+1} - \hat{y}_t| \qquad (6.10)
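A minimal ADMM sketch for problem (6.10); the dense difference matrix and the repeated linear solve are for clarity only, and a banded factorization would be used in practice:

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def tv_denoise(y, lam, rho=1.0, iters=300):
    """ADMM for: minimize ||y - x||^2 + lam * sum_t |x_{t+1} - x_t|,
    using the splitting z = D x with D the first-order difference operator."""
    T = y.size
    D = np.diff(np.eye(T), axis=0)         # (T-1) x T difference matrix
    x = y.copy()
    z = D @ x
    u = np.zeros(T - 1)
    M = 2 * np.eye(T) + rho * D.T @ D      # constant; factor once in practice
    for _ in range(iters):
        x = np.linalg.solve(M, 2 * y + rho * D.T @ (z - u))
        z = soft(D @ x + u, lam / rho)     # prox of the l1 term
        u += D @ x - z                     # scaled dual update
    return x

rng = np.random.default_rng(0)
truth = np.concatenate([np.zeros(50), np.ones(50)])  # piecewise constant
y = truth + 0.1 * rng.standard_normal(100)
x = tv_denoise(y, lam=1.0)
```

The recovered x is essentially piecewise constant: the l1 penalty on the differences zeroes out most of D x, leaving a small number of jumps.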
The regularization term in (6.10) can be written \lambda \|D_1 \hat{y}\|_1, where D_1 is the first-order
difference operator, and we thus realize that the solution will have a sparse first-order
time difference; see Fig. 6.5 for an example application.
We remark that trend filtering is a noncausal operation and would, with the
terminology employed in this thesis, technically be referred to as a smoothing
operation.

6.5 Estimation of LTI Models
Linear time-invariant models are fundamental within the field of control, and
decades of research have been devoted to their identification. We do not intend to
cover much of this research here, but instead limit ourselves to establish notation
and show how an LTI model lends itself to estimation by means of LS if the full
state-sequence is known.
A general LTI model takes the form

x_{t+1} = A x_t + B u_t + v_t, \quad t \in [1, T] \qquad (6.12)
y_t = C x_t + D u_t + e_t \qquad (6.13)

With the full state sequence available, the estimation of A and B can be cast as a
least-squares problem y = Ak with

y = \begin{bmatrix} x_1 \\ \vdots \\ x_T \end{bmatrix} \in \mathbb{R}^{Tn}, \quad
k = \operatorname{vec}\!\left([A\; B]^T\right) \in \mathbb{R}^{K}, \quad
A = \begin{bmatrix} I_n \otimes [x_0^T\; u_0^T] \\ \vdots \\ I_n \otimes [x_{T-1}^T\; u_{T-1}^T] \end{bmatrix} \in \mathbb{R}^{Tn \times K}
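The construction above can be sketched in numpy as follows; the system matrices and the noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 2, 1, 100
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])

u = rng.standard_normal((T, m))
x = np.zeros((T + 1, n))
for t in range(T):
    x[t + 1] = A_true @ x[t] + B_true @ u[t] + 0.01 * rng.standard_normal(n)

# y stacks x_1..x_T; row block t of the regressor is I_n kron [x_t^T u_t^T]
y = x[1:].reshape(-1)
Phi = np.vstack([np.kron(np.eye(n), np.concatenate([x[t], u[t]])[None, :])
                 for t in range(T)])
k, *_ = np.linalg.lstsq(Phi, y, rcond=None)
AB = k.reshape(n, n + m)      # k = vec([A B]^T), i.e. [A B] flattened row-wise
A_hat, B_hat = AB[:, :n], AB[:, n:]
```

With persistent excitation in u, the estimates recover A and B up to the noise level.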
7  Estimation of LTV Models
7.1 Introduction
Time-varying systems and models arise in many situations. A common case is the
linearization of a nonlinear system along a trajectory [Ljung, 1987]. A linear time-
varying (LTV) model obtained from such a procedure facilitates control design
and trajectory optimization using linear methods, which are in many respects
better understood than nonlinear control synthesis.
The difficulty of the task of identifying time-varying dynamical models varies
greatly with the model considered and the availability of measurements of the
state sequence. For smoothly changing dynamics, linear in the parameters, the
recursive least-squares algorithm with exponential forgetting (RLSλ) is a common
option. If a Gaussian random-walk model for the parameters is assumed, a Kalman
filtering/smoothing algorithm [Rauch et al., 1965] gives the filtering/smoothing
densities of the parameters in closed form. However, the assumption of Brownian-
walk dynamics is often restrictive. Discontinuous dynamics changes occur, for
instance, when an external controller changes operation mode, when a sudden
contact between a robot and its environment is established, when an unmodeled
disturbance enters the system or when a component in the system suddenly fails.
Identification of systems with nonsmooth dynamics evolution has been stud-
ied extensively. The book by Costa et al. (2006) treats the case where the dynamics
are known, but the state sequence unknown, i.e., state estimation. Nagarajaiah
and Li (2004) examine the residuals from an initial constant dynamics fit to deter-
mine regions in time where improved fit is needed, addressing this need by the
introduction of additional constant dynamics models. Results on identifiability
and observability in jump-linear systems in the noncontrolled (autonomous) set-
ting are available due to Vidal et al. (2002). The main result on identifiability in
[Vidal et al., 2002] was a rank condition on a Hankel matrix constructed from the
collected output data, similar to classical results on the least-squares identifica-
tion of ARX models which appears as rank constraints on the, typically Toeplitz or
block-Toeplitz, regressor matrix. Identifiability of the problems proposed in this
chapter is discussed in Sec. 7.3.
In this work, we draw inspiration from the trend-filtering literature to develop
new system-identification methods for LTV models. In trend filtering, a curve is fit to data under a penalty promoting sparse changes, cf. Sec. 6.4.
7.2 Model and Identification Problems
x_{t+1} = A_t x_t + B_t u_t + v_t, \quad k_t = \operatorname{vec}\!\left([A_t\; B_t]^T\right) \qquad (7.1)

k_{t+1} = H_t k_t + w_t, \quad y_t = \left(I_n \otimes [x_t^T\; u_t^T]\right) k_t + e_t \qquad (7.2)
The model (7.1)-(7.2) is limited by its lack of noise models. However, this simple
model will allow us to develop very efficient algorithms for identification. We defer
the discussion on measurement noise to Sec. 7.9.
Upon inspection of (7.2), the connection between the present model identifi-
cation problem and the state-estimation problem of Chap. 4 should be apparent.
The model (7.2) implies that the coefficients of the LTV model themselves evolve
according to a linear dynamical system, and are thus amenable to estimation us-
ing state-estimation techniques. If no prior knowledge is available, the dynamics
matrix H t can be taken as the identity matrix, H = I , implying that the model
coefficients follow a random walk dictated by the properties of w t , i.e., the state
transition density function p w (k t +1 |k t ). This particular choice of H corresponds
to the optimization problem we will consider in the following section. The emis-
sion density function is given by p e (y t |x t , u t , k t ). Particular choices of p e and p w
admit data likelihoods concave in the parameters and hence amenable to convex
optimization, a point that will be elaborated upon further in this chapter. We em-
phasize here that the state in the parameter evolution model refers to the current
parameters k t and not the system state x t of (7.1).
The following sections will introduce a number of optimization problems
with different regularization functions, corresponding to different choices of p w ,
and different regularization arguments, corresponding to different choices of H .
We also discuss the properties of the identification resulting from the different
modeling choices. We divide our exposition into a number of cases characterized
by the qualitative properties of the evolution of the parameter state k t .
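As a sketch of this viewpoint, a Kalman filter over the parameter state k_t with H = I and the time-varying measurement matrix C_t = I_n ⊗ [x_t^T u_t^T] can be written as follows; the noise covariances and the test system are illustrative assumptions, and the thesis develops a smoother of which this is only the forward pass:

```python
import numpy as np

def kalman_filter_params(x, u, Rw, Re):
    """Forward Kalman filter for k_t in (7.2) with H = I:
    k_{t+1} = k_t + w_t,  x_{t+1} = (I_n kron [x_t^T u_t^T]) k_t + e_t."""
    T, n, m = x.shape[0] - 1, x.shape[1], u.shape[1]
    K = n * (n + m)
    k, P = np.zeros(K), 1e4 * np.eye(K)    # vague prior on the parameters
    ks = np.zeros((T, K))
    for t in range(T):
        C = np.kron(np.eye(n), np.concatenate([x[t], u[t]])[None, :])
        S = C @ P @ C.T + Re               # innovation covariance
        Kg = P @ C.T @ np.linalg.inv(S)    # Kalman gain
        k = k + Kg @ (x[t + 1] - C @ k)    # measurement update
        P = P - Kg @ C @ P
        ks[t] = k
        P = P + Rw                         # time update (random-walk drift)
    return ks

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
T = 300
u = rng.standard_normal((T, 1))
x = np.zeros((T + 1, 2))
for t in range(T):
    x[t + 1] = A @ x[t] + B @ u[t]

ks = kalman_filter_params(x, u, Rw=1e-6 * np.eye(6), Re=1e-4 * np.eye(2))
AB_hat = ks[-1].reshape(2, 3)              # rows of [A B]
```

For a constant system the filtered parameters settle at the true [A B]; a nonzero Rw lets them drift to track genuinely time-varying dynamics.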
where \tilde{A} and \tilde{Y} are appropriately constructed matrices and the first-order dif-
ferentiation operator matrix D_1 is constructed such that \lambda^2 \|D_1 \tilde{k}\|_2^2 equals the
regularization term \lambda^2 \sum_t \|k_{t+1} - k_t\|_2^2. Forming and factorizing this very large
matrix directly is, however, intractable for
all but toy problems. An important observation to make to allow for an efficient
method for solving (7.3) is that the cost function is the negative data log-likelihood
of the Brownian random-walk parameter model (7.2) with H = I , which motivates
us to develop a dynamic programming algorithm based on a Kalman smoother.
Details on the estimation algorithms are deferred until Sec. 7.5.
A system with low-pass character is often said to have a long time constant
[Åström and Murray, 2010]. For a discrete-time linear system, long time constants
correspond to the dynamics matrix having eigenvalues close to the point 1 in
the complex plane. The choice H = I has all eigenvalues at the point 1, reinforcing the intuition of (7.3) promoting a low-frequency evolution of the dynamics.
The connection between eigenvalues, small time-differences and long time con-
stants will be explored further in Chap. 8, where inspiration from (7.3) is drawn to
enhance dynamics models in the deep-learning setting.
An example of identification by solving Eq. (7.3) is provided in Sec. 7.7, where
the influence of λ is illustrated.
This optimization problem, too, has a closed-form solution on the form (7.4) with
the corresponding second-order differentiation operator D 2 . Equation (7.5) is
the negative data log-likelihood of a Brownian random-walk parameter model
with added momentum. The matrix H corresponding to this model is derived
in Sec. 7.4, where a Kalman smoother with augmented state is developed to find
the optimal solution. We also extend problem (7.5) to more general regularization
terms in Sec. 7.5.
which results in a dynamics evolution with sparse changes in the coefficients, but
changes to different entries of k t are not necessarily occurring at the same time
instants. The formulation (7.6), however, promotes a solution in which the change
occurs at the same time instants for all coefficients in A and B , i.e., k t +1 = k t for
most t .
Equations (7.3) and (7.5) admitted simple interpretations as the likelihood of a
dynamical model on the form (7.2). Unfortunately, Eq. (7.6) does not admit as
simple an interpretation. Its solution hence requires an iterative solver and is dis-
cussed in Sec. 7.A. Example usage of this optimization problem for identification
is illustrated in Sec. 7.6.
\operatorname*{minimize}_{k} \;\; \|y - \hat{y}\|_2^2 \quad \text{subject to} \quad \sum_t \mathbb{1}\{k_{t+1} \neq k_t\} \leq M \qquad (7.8)
where 1{·} is the indicator function. This problem is nonconvex and we propose
solving it using dynamic programming (DP). The proposed algorithm is outlined
in Sec. 7.B.
Summary
The qualitative results of solving the proposed optimization problems are summa-
rized in Table 7.1. The table illustrates how the choice of regularizer and order of
time-differentiation of the parameter vector affect the resulting solution.
Table 7.1  Qualitative properties of the solutions for different choices of regularization norm and differentiation order D_n.

Norm  D_n  Result
1     1    Small number of steps (piecewise constant)
1     2    Small number of bends (piecewise affine)
2     1    Small steps (slowly varying)
2     2    Small bends (smooth)
Two-step refinement
Since many of the proposed formulations of the optimization problem penalize
the size of the changes to the parameters in order to promote sparsity, a bias
is introduced and solutions in which the changes are slightly underestimated
are favored. To mitigate this issue, a two-step procedure can be implemented
wherein, in the first step, time instances where the parameter state vector k_t changes
significantly are identified; we call these time instances knots. To identify the
knots, we observe the argument inside the sum of the regularization term, i.e.,
a_t^1 = \|k_{t+1} - k_t\|_2 or a_t^2 = \|k_{t+2} - 2k_{t+1} + k_t\|_2. Time instances where a_t takes
large values indicate suitable time indices for knots.
In the second step, the sparsity-promoting penalty is removed and equality
constraints are introduced between the knots. The second step can be computed
very efficiently by noticing that the problem can be split into several identical
sub-problems, each of which has a closed-form solution.
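A sketch of the first step, detecting knots from a_t; the parameter trajectory and the threshold are illustrative:

```python
import numpy as np

def find_knots(ks, threshold):
    # a_t = ||k_{t+1} - k_t||_2; large values indicate knots
    a = np.linalg.norm(np.diff(ks, axis=0), axis=1)
    return np.flatnonzero(a > threshold)

# Parameter trajectory with one abrupt change at t = 50 (synthetic)
ks = np.vstack([np.tile([0.5, 0.1], (50, 1)), np.tile([0.2, 0.4], (50, 1))])
ks += 0.001 * np.random.default_rng(0).standard_normal(ks.shape)
knots = find_knots(ks, threshold=0.1)
print(knots)
```

Between detected knots, the refinement step then fits one constant parameter vector per segment with ordinary least squares.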
7.3 Well-Posedness and Identifiability
PROPOSITION 1
Optimization problems (7.3) and (7.6) have unique global minima for λ > 0 if and
only if the corresponding LTI optimization problem has a unique solution. □
Proof The cost function is a sum of two convex terms. For a global minimum to
be nonunique, the Hessians of the two terms must have intersecting nullspaces.
In the limit λ → ∞ the problem reduces to the LTI problem. The nullspace of the
regularization Hessian, which is invariant to λ, does thus not share any directions
with the nullspace of \tilde{A}^T \tilde{A}, which establishes the equivalence of identifiability
between the LTI problem and the LTV problems. □
PROPOSITION 2
Optimization problems (7.5) and (7.9) with higher-order differentiation in the
regularization term have unique global minima for λ > 0 if and only if there does
not exist any vector v \neq 0 \in \mathbb{R}^{n+m} such that

C_t^{xu} v = \begin{bmatrix} x_t x_t^T & x_t u_t^T \\ u_t x_t^T & u_t u_t^T \end{bmatrix} v = 0 \quad \forall t \qquad □
Proof Again, the cost function is a sum of two convex terms and for a global
minimum to be nonunique, the Hessians of the two terms must have in-
tersecting nullspaces. In the limit λ → ∞ the regularization term reduces
to a linear constraint set, allowing only parameter vectors that lie along a
line through time. Let \tilde{v} \neq 0 be such a vector, parameterized by t as
\tilde{v} = [\bar{v}^T \; 2\bar{v}^T \; \cdots \; t\bar{v}^T \; \cdots \; T\bar{v}^T]^T \in \mathbb{R}^{TK}, where \bar{v} = \operatorname{vec}(\{v\}_1^n) \in \mathbb{R}^K and v is an ar-
bitrary vector in \mathbb{R}^{n+m}; \tilde{v} \in \operatorname{null}(\tilde{A}^T\tilde{A}) implies that the loss is invariant to the pertur-
bation \alpha\tilde{v} to \tilde{k} for an arbitrary \alpha \in \mathbb{R}. \tilde{A}^T\tilde{A} is given by \operatorname{blkdiag}(\{I_n \otimes C_t^{xu}\}_1^T),
and requiring each block to annihilate t\bar{v} for all t yields the stated condition. □
REMARK 2
If we restrict our view to constant systems with stationary Gaussian inputs with
covariance \Sigma_u, we have, as T \to \infty, (1/T)\sum_t x_t x_t^T approaching the stationary con-
trollability Gramian given by the solution to \Sigma = A\Sigma A^T + B\Sigma_u B^T. Not surprisingly,
the well-posedness of the optimization problem is thus linked to the excitation of
the system modes through the controllability of the system. □
For the LTI problem to be well-posed, the system must be identifiable and the
input u must be persistently exciting of sufficient order [Johansson, 1993].
7.4 Kalman Smoother for Identification
We will see that for priors from certain families, the resulting optimization
problem remains convex. For the special case of a Gaussian prior over the dynam-
ics parameters or the output, the posterior mean of the parameter vector is once
again conveniently obtained from a Kalman-smoothing algorithm, modified to
include the prior.
General case
A general prior over the parameter-state variables k t can be specified as p(k t |z t ),
where the variable z t is a placeholder for whatever signal might be of relevance,
for instance, the time index t or state x t . The data log-likelihood of (7.2) with the
prior p(k t |z t ) added takes the form
\log p(k, y | x, z)_{1:T} = \sum_{t=1}^{T} \log p(y_t | k_t, x_t) + \sum_{t=1}^{T-1} \log p(k_{t+1} | k_t) + \sum_{t=1}^{T} \log p(k_t | z_t) \qquad (7.14)
which factors conveniently due to the Markov property of a state-space model. For
particular choices of density functions in (7.14), notably Gaussian and Laplacian,
the negative log-likelihood function becomes convex. The next section will elab-
orate on the Gaussian case and introduce a recursive algorithm that efficiently
solves for the full posterior. The Laplacian case, while convex, does not admit an
equally efficient algorithm. The Laplacian likelihood is, however, more robust to
outliers in the data, making the trade-off worth consideration.
Gaussian case
If all densities in (7.14) are Gaussian and k is modeled with the Brownian random-
walk model (7.2) (Gaussian w_t), (7.14) can be written on the form (scaling constants
omitted)

-\log p(k, y | x, z)_{1:T} = \sum_{t=1}^{T} \|y_t - \hat{y}(k_t, x_t)\|^2_{\Sigma_e^{-1}}
+ \sum_{t=1}^{T-1} \|k_{t+1} - k_t\|^2_{\Sigma_w^{-1}}
+ \sum_{t=1}^{T} \|\mu_0(z_t) - k_t\|^2_{\Sigma_0(z_t)^{-1}} \qquad (7.15)
where ¯· denotes the posterior value. This additional correction can be interpreted
as receiving a second measurement µ0 (z t ) with covariance Σ0 (z t ). For the Kalman-
smoothing algorithm, x̂ t |t and P t |t in (7.17) and (7.18) are replaced with x̂ t |T and
P t |T .
A prior over the output of the system, or a subset thereof, is straightforward
to include in the estimation by means of an extra update step, with C , R 2 and y
being replaced with their appropriate values according to the prior.
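The extra update step can be sketched as follows, with C = I for a prior directly on the parameters; all numbers are illustrative:

```python
import numpy as np

def prior_update(k, P, mu0, Sigma0):
    """Incorporate the prior p(k_t|z_t) = N(mu0(z_t), Sigma0(z_t)) as an
    extra Kalman measurement update with C = I."""
    S = P + Sigma0                 # innovation covariance of the "measurement" mu0
    K = P @ np.linalg.inv(S)
    k_post = k + K @ (mu0 - k)
    P_post = P - K @ P
    return k_post, P_post

# A confident prior (small Sigma0) pulls the estimate strongly toward mu0
k, P = np.array([1.0, 0.0]), np.eye(2)
mu0, Sigma0 = np.array([0.0, 0.0]), 1e-2 * np.eye(2)
k_post, P_post = prior_update(k, P, mu0, Sigma0)
print(k_post)  # close to mu0
```

The posterior covariance shrinks accordingly, which is the uncertainty estimate later exploited in trajectory optimization.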
We remark that although all optimization problems proposed in this chapter
could be solved by generic solvers, the statistical interpretation and solution pro-
vided by the Kalman-smoothing algorithm provide us not only with the optimal
solution, but also an uncertainty estimate. The covariance matrix of the state,
Eq. (4.24), is an estimate of the covariance of the parameter vector for each time
step. This uncertainty estimate may be of importance in downstream tasks using
the estimated model. One such example is trajectory optimization, where it is
useful to limit the step length based on the uncertainty in the model. We will make
use of this when performing trajectory optimization in a reinforcement-learning
setting in Chap. 11, where model uncertainties are taken into account by enforc-
ing a constraint on the Kullback-Leibler divergence between two consecutive
trajectory distributions.
7.6 Example—Jump-Linear System

To demonstrate the identification of a system with abruptly changing dynamics,
we consider a simulated system whose parameters change to

A_t = \begin{bmatrix} 0.5 & 0.05 \\ 0.0 & 0.5 \end{bmatrix}, \quad B_t = \begin{bmatrix} 0.2 \\ 1.0 \end{bmatrix}

at t = 200. The input was Gaussian noise of zero mean and unit variance; state
transition noise and measurement noise (y t = x t +1 +e t ) of zero mean and σe = 0.2
were added. In this problem the parameters change abruptly; a suitable choice of
identification algorithm is thus (7.6). Figure 7.1 depicts the estimated coefficients
in the dynamics matrices after solving (7.6), for a value of λ chosen using the
L-curve method [Hansen, 1994]. The figure indicates that the algorithm correctly
Figure 7.1  Piecewise constant state-space dynamics. True values are shown with
dashed, black lines. Gaussian state-transition and measurement noise with σ = 0.2
were added. At t = 200 the dynamics of the system change abruptly. A suitable
choice of regularization term \lambda \|k^{+} - k\|_2 allows us to estimate a dynamics model
that exhibits an abrupt change in the coefficients, without specifying the number
of such changes a priori. Please note that this figure shows the coefficients of k
corresponding to the A-matrix of Eq. (7.1) only.
identifies the abrupt change in the system parameters at t = 200 and maintains
an otherwise near constant parameter vector. This example highlights how the
sparsity-promoting penalty can be used to indicate whether or not something has
abruptly changed the dynamics, without specifying the number of such changes
a priori. The methods briefly discussed in Sec. 7.2 can be utilized, should it be
desirable to have two separate LTI-models describing the system.
7.7 Example—Low-Frequency Evolution
Figure 7.2  Trajectories of the system (7.2) (black) together with models estimated
by solving (7.3) with varying values of λ (λ² ∈ {1, 100, 10⁴, 10⁶}). The smoothing
effect of a high λ is illustrated, and as λ → ∞, the estimated model converges to an
LTI model. The optimal value is given by λ ≈ 10. Please note that this figure shows
the coefficients of k corresponding to the A-matrix of (7.1) only. The coefficients of
the B-matrix evolve similarly but are omitted for clarity.
Figure 7.3 The top panes show (log) error distributions on training data when
solving Eq. (7.3) on data generated by (7.1)-(7.2). The prediction error is a strictly
increasing function of λ, whereas the model error is minimized by the optimal
value for λ. The bottom four panes show quantile-quantile plots of prediction
errors on the training data, for the different choices of λ.
The optimal value for λ produces normal residuals, whereas other choices for λ produce heavy-tailed distributions. While this analysis is available even when the true system is
not known, it might produce less clear outcomes when the data is not generated
by a system included in the model set being searched over. Alternative ways of
setting values for λ include cross validation and maximum-likelihood estimation
under the statistical model (7.1)-(7.2), illustrated in Fig. 7.4. The likelihood of the
data given a model on the form (7.2) is easily calculated during the forward-pass
of the Kalman algorithm.
Yet another option for determining the value of λ is to consider it as a relative
time-constant between the evolution of the state and the evolution of the model
parameters. A figure like Fig. 7.2, together with prior knowledge of the system, is
often useful in determining λ.
7.8 Example—Nonsmooth Robot Arm with Stiff Contact
[Figure 7.5: End-effector coordinates x, y and joint states q_1, q_2, q̇_1, q̇_2 of the robot arm over time, with the constraint, the stiff contact and the velocity sign changes marked, together with the simulation of the estimated model.]
[Figure 7.6: The estimated model simulated on a validation trajectory; joint states q_1, q_2, q̇_1, q̇_2 with the original velocity sign change and stiff contact instants marked.]
The estimated model is able to capture the dynamics both during the nonsmooth sign change
of the velocity, and also during the establishment of the stiff contact. The learned
dynamics of the contact is, however, time-dependent. This time-dependence is,
in some situations, a drawback of LTV-models. This drawback is illustrated in
Fig. 7.6, where the model is used on a validation trajectory where a different noise
sequence was added to the control torque. Due to the novel input signal, the
contact is established at a different time-instance and as a consequence, there is
an error transient in the simulated data.
7.9 Discussion
This chapter presents methods for estimation of linear, time-varying models. The
methods presented extend directly to nonlinear models that remain linear in the
parameters.
When estimating an LTV model from a trajectory obtained from a nonlinear
system, one is effectively estimating the linearization of the system around that
trajectory. A first-order approximation to a nonlinear system is not guaranteed to
generalize well as deviations from the trajectory become large. Many nonlinear sys-
tems are, however, approximately locally linear, such that they are well described
by a linear model in a small neighborhood around the linearization/operating
point. For certain methods, such as iterative learning control and trajectory cen-
tric reinforcement learning, a first-order approximation to the dynamics is used
for efficient optimization, while the validity of the approximation is ensured by
incorporating penalties or constraints on the deviation between two consecutive
trajectories [Levine and Koltun, 2013]. We explore this concept further using the
methods proposed in this chapter, in Chap. 11.
The methods presented allow very efficient learning of this first-order ap-
proximation due to the postulated prior belief over the nature of the change in
dynamics parameters, encoded by the regularization terms. Prior knowledge en-
coded this way puts less demand on the data required for successful identification.
The identification process will thus not be as invasive as when excessive noise
is added to the input for identification purposes, allowing learning of flexible,
overparameterized models that fit available data well. This makes the proposed
identification methods attractive in applications such as guided policy search
(GPS) [Levine and Koltun, 2013; Levine et al., 2015] and nonlinear iterative learning
control (ILC) [Bristow et al., 2006], where they can lead to dramatically decreased
sample complexity.
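To make the role of the regularization concrete, the following sketch (in Python; the code accompanying the thesis is written in Julia) solves a scalar instance of a smoothness-regularized problem of the same form as (7.5), min Σₜ (y_t − k_t u_t)² + λ Σₜ (k_{t+1} − k_t)², directly via its normal equations. All data and the value of λ are hypothetical, and the thesis's Kalman-smoother implementation computes such solutions far more efficiently:

```python
import numpy as np

def fit_tv_params(y, u, lam):
    """Estimate a scalar time-varying parameter k_t in y_t = k_t * u_t with a
    smoothness penalty lam * sum_t (k_{t+1} - k_t)^2, via the normal equations."""
    T = len(y)
    D = np.diff(np.eye(T), axis=0)        # first-difference operator, (T-1) x T
    H = np.diag(u**2) + lam * (D.T @ D)   # regularized normal-equation matrix
    return np.linalg.solve(H, u * y)

rng = np.random.default_rng(0)
T = 200
k_true = 1.0 + 0.5 * np.sin(np.linspace(0, np.pi, T))  # slowly drifting parameter
u = rng.normal(size=T)                                 # exciting input
y = k_true * u + 0.01 * rng.normal(size=T)
k_hat = fit_tv_params(y, u, lam=50.0)
print(np.max(np.abs(k_hat - k_true)))                  # small: the drift is recovered
```

Note that the smoothness prior makes the problem well posed even at time steps where the instantaneous excitation u_t is nearly zero.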
The proposed methods that lend themselves to estimation through the Kalman
smoother-based algorithm could find use as a layer in a neural network. Amos
and Kolter (2017) showed that the solution to a quadratic program is differentiable
and can be incorporated as a layer in a deep-learning model. The forward pass
through such a network involves solving the optimization problem, making the
proposed methods attractive due to the O (T ) solution.
When faced with a system where time-varying dynamics is suspected and
no particular knowledge regarding the dynamics evolution is available, or when
the dynamics are known to vary slowly, a reasonable first choice of algorithm
is (7.5). This algorithm is also, by far, the fastest of the proposed methods due
to the Kalman-smoother implementation of Sec. 7.5.¹ As a consequence of the
ease of solution, finding a good value for the regularization parameter λ is also
significantly easier. Example use cases include when dynamics are changing with a
continuous auxiliary variable, such as temperature, altitude or velocity. If a smooth
parameter drift is found to correlate with an auxiliary variable, LPV-methodology
can be employed to model the dependency explicitly, something that will not be
elaborated upon further in this thesis but was implemented and tested in
[BasisFunctionExpansions.jl, B.C., 2016].
Dynamics may change abruptly as a result of, e.g., system failure, change of
operating mode, or when a sudden disturbance enters the system, such as a policy
change affecting a market or a window opening, affecting the indoor temperature.
The identification method (7.6) can be employed to identify when such changes occur.
¹ The Kalman-smoother implementation is often several orders of magnitude faster than solving the
optimization problems with an iterative solver.
Measurement-noise model
The identification algorithms developed in this chapter were made available by
the simple nature of the dynamical model Eq. (7.1). In practice, measurements
are often corrupted by noise, in particular if parts of the state sequence are derived
from time differentiation of measured quantities. We will not cover the topic of
noise-model estimation in depth here, but will provide a few suggested treatments
from the literature that could be considered in a scenario with poor signal-to-noise
ratio.
A very general approach to estimation of noise models is pseudo-linear regres-
sion [Ljung and Söderström, 1983; Ljung, 1987]. The general idea is to estimate
the noise components and include them in the model. In the present context, this
could amount to estimating a model using any of the methods described above,
calculating the model residuals e_t = y_t − ŷ_t, and building a model ê_t = ρ(e_{t−1}, e_{t−2}, …).
The combined problem of estimating both states x and parameters k can be
cast as a nonlinear filtering problem [Ljung and Söderström, 1983]. The nonlinear
nature of the resulting problem necessitates a nonlinear filtering approach, such
as the extended Kalman filter [Ljung and Söderström, 1983] or the particle filter
(see Sec. 4.2). The literature on iterated filtering [Ionides et al., 2006; Lindström
et al., 2012] considers this nonlinear filtering problem in the context of constant
dynamics.
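A hedged sketch of the pseudo-linear regression step (hypothetical residual data and model order, not code from the thesis): fit an AR model ê_t = ρ(e_{t−1}, …) to the residual sequence by ordinary least squares:

```python
import numpy as np

def fit_ar(e, order):
    """Fit an AR noise model e_t ~ rho_1 e_{t-1} + ... + rho_p e_{t-p}
    to a residual sequence by ordinary least squares."""
    Phi = np.column_stack([e[order - i:len(e) - i] for i in range(1, order + 1)])
    rho, *_ = np.linalg.lstsq(Phi, e[order:], rcond=None)
    return rho

# Hypothetical residual sequence: AR(1) colored noise with rho = 0.8
rng = np.random.default_rng(1)
e = np.zeros(5000)
for t in range(1, len(e)):
    e[t] = 0.8 * e[t - 1] + rng.normal()

rho = fit_ar(e, order=1)
print(rho)  # close to [0.8]
```

The estimated noise model can then be included in the predictor and the procedure iterated, as in the cited references.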
7.10 Conclusions
where f and g are convex functions and A is a matrix. The optimization problems
with the group-lasso penalty (7.6) and (7.9) can be written on the form (7.19) by
constructing A such that it performs the computations k_{t+1} − k_t or k_{t+2} − 2k_{t+1} + k_t
and letting g be the norm ‖·‖₂.
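A minimal illustration (in Python, with a hypothetical scalar sequence; in the thesis each k_t is a vector, so A would act blockwise) of constructing such a difference matrix A:

```python
import numpy as np

def difference_operator(T, order):
    """Matrix A with (A k)_t = k_{t+1} - k_t (order 1) or
    (A k)_t = k_{t+2} - 2 k_{t+1} + k_t (order 2), for a length-T sequence k."""
    A = np.eye(T)
    for _ in range(order):
        A = np.diff(A, axis=0)   # each application differences consecutive rows
    return A

k = np.array([1.0, 2.0, 4.0, 7.0])
print(difference_operator(4, 1) @ k)  # [1. 2. 3.]
print(difference_operator(4, 2) @ k)  # [1. 1.]
```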
7.B Solving (7.8)
The optimization problem of Eq. (7.8) is nonconvex and harder to solve than the
other problems proposed in this chapter. To solve small-scale instances of the
problem, we modify the algorithm developed in [Bellman, 1961], an algorithm fre-
quently referred to as segmented least-squares [Bellman and Roth, 1969]. Bellman
(1961) approximates a curve by piecewise linear segments. We instead associate
each segment (set of consecutive time indices during which the parameters are
constant) with a dynamics model, as opposed to a simple straight line.²
The algorithm relies on the key fact that the value function for a sequential
optimization problem with quadratic cost and parameters entering linearly is
quadratic. This allows us to find the optimal solution in O(T²) time instead of the
combinatorial O(T choose M) time of exhaustively searching over the placement
of M segment boundaries.
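For orientation, the sketch below (Python, hypothetical data) implements Bellman's original segmented least-squares dynamic program for piecewise-linear curve fitting — the algorithm our method extends — with a fixed penalty per segment. It is a naive version with O(T²) subproblems, not the thesis's extended algorithm:

```python
import numpy as np

def segment_cost(x, y, i, j):
    """Least-squares cost of fitting one straight line to points i..j (inclusive)."""
    X = np.column_stack([x[i:j + 1], np.ones(j - i + 1)])
    theta, *_ = np.linalg.lstsq(X, y[i:j + 1], rcond=None)
    r = y[i:j + 1] - X @ theta
    return r @ r

def segmented_least_squares(x, y, penalty):
    """Bellman-style DP: optimal partition into line segments, with a fixed
    cost per segment. V[j] is the optimal cost for the first j points."""
    T = len(x)
    V = np.zeros(T + 1)
    split = np.zeros(T + 1, dtype=int)
    for j in range(1, T + 1):
        costs = [V[i] + segment_cost(x, y, i, j - 1) + penalty for i in range(j)]
        split[j] = int(np.argmin(costs))
        V[j] = costs[split[j]]
    segments, j = [], T
    while j > 0:                      # backtrack to recover the segmentation
        segments.append((int(split[j]), int(j - 1)))
        j = split[j]
    return V[T], segments[::-1]

# Hypothetical data: two exact linear pieces with a break between t = 9 and t = 10
x = np.arange(20.0)
y = np.where(x < 10, x, 25.0 - x)
cost, segments = segmented_least_squares(x, y, penalty=1.0)
print(segments)  # [(0, 9), (10, 19)]
```

Replacing the per-segment straight-line fit with a per-segment dynamics model yields the extension described above.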
² Indeed, if a simple integrator is chosen as dynamics model and a constant input is assumed, the
result of our extended algorithm reduces to the segmented least-squares solution.
³ The help of Pontus Giselsson in developing this algorithm is gratefully acknowledged.
8 Identification and Regularization of Nonlinear Black-Box Models
8.1 Introduction
ẋ = f_c(x, u)    (8.1)
where x is a Markovian state vector, u is the input and f_c is a function that maps
the current state and input to the state time derivative. An example of such a
model is a rigid-body dynamical model of a robot
q̈ = −M⁻¹(q) ( C(q, q̇)q̇ + G(q) + F(q̇) − u ),    x = [q; q̇]    (8.2)
x_{t+1} = f(x_t, u_t)
where f is a function that maps the current state and input to the state at the next
time-instance [Åström and Wittenmark, 2013a]. We have previously discussed the
case where f is a linear function of the state and the control input, and we now
extend our view to nonlinear functions f .
Learning a globally valid dynamics model fˆ of an arbitrary nonlinear system f
with little or no prior information is a challenging problem. Although in principle,
any sufficiently complex function approximator, such as a deep neural network
(DNN), could be employed, high demands are put on the amount of data required
to prevent overfitting and to obtain a faithful representation of the dynamics over
the entire state space. If prior knowledge is available, it can often be used to reduce
the demands on the amount of data required to learn an accurate model [Sjöberg
et al., 1995].
Early efforts in nonlinear modeling include Volterra-Wiener models that make
use of basis-function expansions to model nonlinearities [Johansson, 1993]. This
type of model exhibits several drawbacks and is seldom used in practice, one of
which is the difficulty of incorporating prior knowledge into the model. Oftentimes,
this can also be hard to incorporate into a flexible black-box model such as a deep
neural network. This chapter will highlight a few general attempts at doing so,
compatible with a wide class of function approximators.
In many applications, the linearization of f is important, a typical example
being linear control design [Glad and Ljung, 2014]. In applications such as iterative
learning control (ILC) [Bristow et al., 2006] and trajectory centric, episode-based
reinforcement learning (TCRL) [Levine and Koltun, 2013], the linearization of the
nonlinear dynamics along a trajectory is often needed for optimization. Identi-
fication of f must thus not only yield a good model for prediction/simulation,
but also the Jacobian J fˆ of fˆ must be close to the true system Jacobian J . The
linearization of a model around a trajectory returns a Linear Time-Varying (LTV)
model on the form
x_{t+1} = A_t x_t + B_t u_t
where the matrices A and B constitute the output Jacobian. This kind of model
was learned efficiently using dynamic programming in Chap. 7. However, not
all situations allow for accurate learning of an LTV model around a trajectory. A
potential problem that can arise is insufficient excitation provided by the control
input [Johansson, 1993]. Prior knowledge regarding the evolution of the dynamics,
encoded in form of carefully designed regularization, was utilized in Chap. 7 in
order to obtain a well-posed optimization problem and a meaningful result. While
this proved to work well in many circumstances, it might fail if excitation is small
relative to how fast the dynamics changes along a trajectory. When model identifi-
cation is a subtask in an outer algorithm that optimizes a control-signal trajectory
or a feedback policy, adding excessive noise for identification purposes may be
undesirable, making regularization solely over time as in Chap. 7 insufficient. A
step up in sophistication from LTV models is to learn a nonlinear model that is
valid globally. A nonlinear model can be learned from several consecutive trajecto-
ries and is thus able to incorporate more data than an LTV model. Unfortunately,
a black-box nonlinear model also requires more data to learn a high fidelity model
and not suffer from overfitting.
The discussion so far indicates two issues: 1) Complex nonlinear models have
the potential to be valid globally, but may suffer from overfitting and thus not learn
a function that generalizes and learns the correct linearization. 2) LTV models can
be learned efficiently and can represent the linearized dynamics well, but require
sufficient excitation, are time-based and valid only locally.
In this chapter, we draw inspiration from the regularization methods detailed
in Chap. 7 for learning of a general, nonlinear black-box model, fˆ. Since an LTV
no matter the complicated nature of g , the Jacobian of the output with respect
to the input will contain the identity component. This allows gradients to flow ef-
fortlessly through deep architectures composed of stacked residual units, making
training of models as deep as 1000 layers possible [He et al., 2015]. In many cases,
learning a function g around the identity, is vastly easier than learning the full
function f . Nguyen et al. (2018) even proved that under certain circumstances and
with enough skip connections to the output layer, a DNN has no local minima and
a continuous path of decreasing loss exists from any starting point to the global
optimum.
Consider what it takes for a network with one hidden layer to learn the identity
mapping. If we use tanh as the hidden layer activation function, the incoming
weights must be small to make sure the tanh is operating in its linear region, while
the outgoing weights have to be the reciprocal of the incoming. The number of
neurons required is the same as the input dimension. For an activation function
that does not have a near-linear region around zero, such as the relu or the sigmoid
functions [Goodfellow et al., 2016], help from the bias term is further required to
center the activation in the linear region. For a deep network, this has to happen
for every layer.
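A quick numerical check of this claim (a hypothetical NumPy sketch): with incoming weights εI, reciprocal outgoing weights I/ε and zero biases, a tanh hidden layer reproduces the identity to within O(ε²):

```python
import numpy as np

eps = 1e-3
x = np.linspace(-1.0, 1.0, 11)
# One hidden tanh layer: incoming weights eps*I, outgoing weights I/eps, zero bias.
y = np.tanh(eps * x) / eps
print(np.max(np.abs(y - x)))  # error is O(eps^2): the layer is near-identity
```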
While a deep model is perfectly capable of learning the identity mapping,
it is needless to say that incorporating this mapping explicitly can sometimes
be beneficial. In particular, if the input and output of the function live in the
same domain, e.g., image in—image out, state in—state out, it is often easier
to learn the residual around the identity. To quote Sjöberg et al. (1995), "Even if
nonlinear structures are to be applied there is no reason to waste parameters to
estimate facts that are already known". The identity can in this case be considered
a nominal model. Whenever available, simple nominal models provide excellent
starting points, and modeling and learning the residuals of such a model may
be easier than learning the combined effect of the nominal model and residual
effects.
While we argue for the use of prior knowledge where available, in particular
when modeling physical systems where identification data can be hard to acquire,
we would also like to offer a counter example highlighting the need to do so
wisely. Frederick Jelinek, a researcher in natural language processing, famously
said¹ "Every time I fire a linguist, the performance of the speech recognizer goes
up". The quote indicates that the data—natural language as used by people—
did not follow the grammatical and syntactical rules laid down by the linguists.
Prior knowledge might in this case have introduced severe bias that held the
performance of the model back. Natural language processing is a domain in which
data is often easy to obtain and plentiful enough to allow fitting of very flexible
models without negative consequences.
A different notion of residual connection is that found in a form of recur-
rent neural network called a Long Short-Term Memory (LSTM) network [Hochre-
iter and Schmidhuber, 1997]. While the resnet [He et al., 2015] was deep in the
sense of multiple consecutive layers, an RNN is deep in the sense that a func-
tion is applied recursively in time. LSTMs were invented to mitigate the issue
of vanishing/exploding gradients while backpropagating through time to train
recurrent neural networks. When a model, or more generally a function f with
Jacobian ∇x f (x) = J (x), is applied to a state recursively, the Jacobian of the re-
cursive mapping grows as ∇x f (n) (x) ∼ J n (x), where f (n) (x) denotes the n times
recursive application of f , f (... f ( f (x))). Any eigenvalues of J greater than 1 will
grow exponentially—exploding gradients—and eigenvalues smaller than 1 will
decay exponentially—vanishing gradients. If we model f as
f(x) = g(x) + x    (8.5)
the Jacobian will be the identity plus some small deviation, effectively helping the
eigenvalues of J stay close to 1. Slightly simplified, LSTMs effectively propagate
the state with an identity function according to (8.5).
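This effect is easy to verify numerically. In the hypothetical NumPy sketch below, a small random matrix J_g stands in for the Jacobian of the residual function g:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
Jg = 0.02 * rng.normal(size=(n, n))   # Jacobian of a "small" residual function g
Jf = np.eye(n) + Jg                    # Jacobian of f(x) = g(x) + x

print(np.abs(np.linalg.eigvals(Jf)))   # all magnitudes close to 1

# After 50 recursive applications the Jacobian behaves like Jf^50:
print(np.linalg.norm(np.linalg.matrix_power(Jf, 50)))  # moderate: no explosion
print(np.linalg.norm(np.linalg.matrix_power(Jg, 50)))  # essentially 0: vanishing
```

Without the identity term, the same recursion through Jg alone vanishes almost immediately, illustrating the gradient-propagation argument above.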
¹ The exact wording and circumstances around this quote constitute the topic of a very lengthy
footnote on the Wikipedia page of Jelinek: https://en.wikipedia.org/wiki/Frederick_Jelinek.
Other efforts at mitigating issues with the training of deep models include
careful weight initialization. If, for instance, all weights are initialized to be unitary
matrices, the norm of vectors propagated through the network stays constant. The
initialization of the network can further be used to our advantage. Prior knowledge
of, e.g., resonances etc. can be encoded into the initial model weights. In this
chapter, we will explore this concept further, along with the effects of identity
connections, and motivate them from a control-theoretic perspective.
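As an illustration of the norm-preservation property (a hypothetical NumPy sketch; random orthogonal matrices are used here as the real-valued counterpart of unitary ones):

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))    # sign fix makes the draw uniform (Haar)

rng = np.random.default_rng(3)
x = rng.normal(size=16)
v = x.copy()
for _ in range(100):                  # propagate through 100 orthogonal "layers"
    v = random_orthogonal(16, rng) @ v
print(np.linalg.norm(x), np.linalg.norm(v))  # the norm is preserved
```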
Learning objective
x_{t+1} = f̂(x_t, u_t),    f̂ : ℝⁿ × ℝᵐ ↦ ℝⁿ
which we will frequently write on the form x⁺ = f̂(x, u) by omitting the time index
t and letting ·⁺ indicate ·_{t+1}. We further consider the linearization of f̂ around a
trajectory τ
x_{t+1} = A_t x_t + B_t u_t
k_t = vec([A_tᵀ B_tᵀ])    (8.6)
with the Jacobian
J_f = [∇_x f_1ᵀ ∇_u f_1ᵀ ; … ; ∇_x f_nᵀ ∇_u f_nᵀ] = [A B] ∈ ℝⁿˣ⁽ⁿ⁺ᵐ⁾
and ∇_x f_i denotes the gradient of the i:th output of f with respect to x. Our
estimate f̂(x, u, w) of f(x, u) will be parameterized by a vector w.² The distinction
between f and f̂ will, however, be omitted unless required for clarity.
We frame the learning problem as an optimization problem with the goal of
adjusting the parameters w of fˆ to minimize a cost function V (w) by means of
gradient descent. The cost function V (w) can take many forms, but we will limit
the scope of this work to quadratic loss functions of the one-step prediction error,
i.e.,
V(w) = ½ Σₜ (x⁺ − f̂(x, u, w))ᵀ (x⁺ − f̂(x, u, w))
² We use w to denote all the parameters of the neural network, i.e., weight matrices and bias vectors.
8.3 Estimating a Nonlinear Black-Box Model
x⁺ − x = ∆x = g(x, u),    g : ℝⁿ × ℝᵐ ↦ ℝⁿ
f(x, u) = g(x, u) + x
where the second equation is equivalent to the first, but highlights a convenient
implementation form that does not require transformation of the data.
To gain insight into how this seemingly trivial change in representation may
affect learning, we note that this transformation will alter the Jacobian according
to
J_g = [A − Iₙ  B]    (8.8)
³ We take the eigenvalues of a function to refer to the eigenvalues of the function Jacobian.
Optimization landscape
To gain insight into the training of f and g , we analyze the expressions for the
gradient and Hessian of the respective cost functions. For a linear model x⁺ =
Ax + Bu, rewritten on regressor form y = Ak with all parameters of A and B
concatenated into the vector k, and a least-squares cost function
V(k) = ½ (y − Ak)ᵀ(y − Ak), the gradient and Hessian are given by
∇_k V = −Aᵀ(y − Ak)
∇²_k V = AᵀA
The Hessian is clearly independent of both the output y and the parameters k,
and differentiating the output, i.e., learning a map to ∆x instead of to x⁺, does not
have any major impact on gradient-based learning. For a nonlinear model, this is
not necessarily the case:
V(w) = ½ Σₜ (x⁺ − f(x, u, w))ᵀ (x⁺ − f(x, u, w))
∇_w V = −Σ_{t=1}^T Σ_{i=1}^n (x_i⁺ − f_i(x, u, w)) ∇_w f_i
∇²_w V = Σ_{t=1}^T Σ_{i=1}^n [ ∇_w f_i ∇_w f_iᵀ − (x_i⁺ − f_i(x, u, w)) ∇²_w f_i ]
where x_i⁺ − f_i(x, u, w) constitutes the prediction error. In this case, the Hessian
depends on both the parameters and the target x⁺. The transformation from f to
g changes the gradients and Hessians according to
∇_w V = −Σ_{t=1}^T Σ_{i=1}^n (∆x_i − g_i(x, u, w)) ∇_w g_i
∇²_w V = Σ_{t=1}^T Σ_{i=1}^n [ ∇_w g_i ∇_w g_iᵀ − (∆x_i − g_i(x, u, w)) ∇²_w g_i ]
8.4 Weight Decay
When training begins, both f and g are initialized with small random weights, and
‖f‖ and ‖g‖ will typically be small. If the system we are modeling is of low-pass
character, i.e., ‖∆x‖ is small, the prediction error of g will be closer to zero compared
to that of f.
8.5 Tangent-Space Regularization
which penalizes changes in the input-output Jacobian of the model over time, a
strategy we refer to as Jacobian propagation or Jacprop.
Taking the gradient of terms depending on the model Jacobian requires calculation
of higher-order derivatives. Depending on the framework used for optimization,
this can limit the applicability of the method. We thus proceed to describe
how we implemented the penalty term of (8.9).
Implementation details
The inclusion of (8.9) in the cost function implies the presence of nested differen-
tiation in the gradient of the cost function with respect to the parameters, ∇w V .
The complications arise in the calculation of the term
where J is composed of ∇x f (x, u, w) and ∇u f (x, u, w). Many, but not all, deep-
learning frameworks unfortunately lack support for nested differentiation. Further,
most modern deep-learning frameworks employ reverse-mode automatic differ-
entiation (AD) for automatic calculation of gradients of cost functions with respect
to network weight matrices [Merriënboer et al., 2018]. Reverse-mode AD is very
well suited for scalar functions of many parameters. For vector-valued functions,
however, reverse-mode AD essentially requires separate differentiation of each
output of the function. This scales poorly and is a seemingly unusual use case
in deep learning; most AD frameworks have no explicit support for calculating
Jacobians. These two obstacles together might make implementing the suggested
regularization hard in practice. Indeed, an attempt was made at implementing the
regularization with nested automatic differentiation, where ∇w V was calculated
using reverse-mode AD and J (w) using forward-mode AD. This required find-
ing AD software capable of nested differentiation and was made very difficult by
subtleties regarding closing over the correct variables for the inner differentiation.
The resulting code was also very slow to execute.
The examples detailed later in this chapter instead make use of handwritten
inner differentiation, where ∇w V once again is calculated using reverse-mode AD,
but the calculation of J (w) is done manually. For a DNN composed of affine trans-
formations (W x + b) followed by elementwise nonlinearities (σ), this calculation
can be expressed recursively as
a₁ = W₁x + b₁
aᵢ = Wᵢlᵢ₋₁ + bᵢ,    i = 2 … L
lᵢ = σ(aᵢ)
∇ₓlᵢ = ∇ₓ{σ(aᵢ)} = (∇σ)|ₐᵢ · ∇ₓaᵢ = (∇σ)|ₐᵢ · Wᵢ · ∇ₓlᵢ₋₁
Algorithm 4 Julia code for calculation of both the forward pass and the input-output
Jacobian of a neural network f. The code uses the tanh activation function and
assumes that the weight matrices are stored according to w = {W₁, b₁, …, W_L, b_L}

using LinearAlgebra  # for I and Diagonal

∇σ(a) = Matrix(Diagonal((sech.(a).^2)[:]))  # ∇ₐ tanh(a), a diagonal Jacobian

function forward_jac(w, x)
    l = x
    J = Matrix{eltype(w[1])}(I, length(x), length(x))  # initial J = Iₙ
    for i = 1:2:length(w)-2
        W, b = w[i], w[i+1]
        a = W*l .+ b
        l = tanh.(a)          # elementwise activation σ = tanh
        J = ∇σ(a) * W * J     # chain rule: (∇σ)|ₐᵢ · Wᵢ · ∇ₓlᵢ₋₁
    end
    J = w[end-1] * J          # linear output layer
    return w[end-1]*l .+ w[end], J
end
8.6 Evaluation
Nominal model
Both functions f and g are modeled as neural networks. For the linear-system
task, the networks had 1 hidden layer with 20 neurons; in the pendulum task, the
networks had 3 hidden layers with 30 neurons each.
A comparative study of 6 different activation functions, presented in Sec. 8.A,
indicated that some unbounded activation functions, such as the relu and leaky
relu functions [Ramachandran et al., 2017], are less suited for the task at hand,
and generally, the tanh activation function performed best and was chosen for the
evaluation.
We train the models using the ADAM optimizer with a fixed step-size and
fixed number of epochs. The framework for training, including all simulation
experiments reported in this chapter, is published at [JacProp.jl, B.C., 2018] and is
implemented in the Julia programming language [Bezanson et al., 2017] and the
(Violin plots, grouped by training method: Standard and Jacprop.)
Figure 8.1 Left: Distribution of prediction errors on the validation data for g + x
trained on the linear-system task. Each violin represents 12 Monte-Carlo runs.
Right: Distribution of errors in estimated Jacobians. The figure indicates that
tangent-space regularization through Jacobian propagation is effective and re-
duces the error in the estimated Jacobian without affecting the prediction error
performance.
validation data indicates that overfitting has occurred. The number of parameters
in the models was 6.2 times larger than the number of parameters in the true
linear system and overfitting is thus a concern in this scenario.
To calculate the Jacobian error, we calculate the shortest distance between
each eigenvalue in the true Jacobian to any of the eigenvalues of the estimated
Jacobians, and vice versa. We then sum these distances and take the mean over
all the data points in the validation set. This allows us to penalize both failure to
place an eigenvalue close to a true eigenvalue, and placing an eigenvalue without
a true eigenvalue nearby. The model trained with tangent-space regularization
learns better Jacobians while producing the same prediction error, as indicated in
Fig. 8.1 and Fig. 8.2.
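A sketch of this error measure in Python, per our reading of the description above (the evaluation additionally averages over all validation points):

```python
import numpy as np

def eig_distance(J_true, J_est):
    """Sum, over both directions, of each eigenvalue's distance to the
    closest eigenvalue of the other Jacobian."""
    lt = np.linalg.eigvals(J_true)
    le = np.linalg.eigvals(J_est)
    D = np.abs(lt[:, None] - le[None, :])     # pairwise distances in the complex plane
    return D.min(axis=1).sum() + D.min(axis=0).sum()

A = np.diag([0.9, 0.5])
print(eig_distance(A, A))                     # 0.0 for a perfect estimate
print(eig_distance(A, np.diag([0.9, 0.9])))  # 0.4: the eigenvalue at 0.5 is missed
```

The two directions of the minimum implement the two failure modes described: missing a true eigenvalue, and placing a spurious one.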
The effect of weight decay on the learned Jacobian is illustrated in Fig. 8.3.
Due to overparameterization, heavy overfitting is expected without adequate
regularization. Not only is it clear that learning of g + x has been more successful
than learning of f in the absence of weight decay, but we also see that weight
decay has had a deteriorating effect on learning f , whereas it has been beneficial in
learning g + x. This indicates that the choice of architecture interacts with the use
of standard regularization techniques and must be considered while modeling.
Pendulum-on-cart task
A pendulum attached to a moving cart is simulated to assess the effectiveness of
Jacprop and weight decay on a system with nonlinear dynamics and thus a chang-
ing Jacobian along a trajectory. An example trajectory of the system described
by (8.11)-(8.12), which has 4 states (θ, θ̇, p, v) and one control input u, is shown
in Fig. 8.4. This task demonstrates the utility of tangent-space regularization for
systems where the regularization term is not the theoretically perfect choice, as
was the case with the linear system. Having access to the true system model and
state representation also allows us to compare the learned Jacobian to the true
system Jacobian. We simulate the system with a superposition of sine waves of
Figure 8.3 Eigenvalues of learned Jacobians for the linear system task. True eigen-
values are shown in red, and eigenvalues of the learned model for points sampled
randomly in the state space are shown in blue. The top/bottom rows show models
trained without/with weight decay, left/right columns show f /g . Weight decay
has a deteriorating effect on learning f , pulling some eigenvalues towards 0 while
causing others to become much larger than 1, resulting in a very unstable system.
Weight decay is beneficial for learning g + x, keeping the eigenvalues close to 1.
Figure 8.5 Left: Distribution of prediction errors on the validation data for the
pendulum on a cart task using tanh activation functions. Each violin represents 30
Monte-Carlo runs. The figure indicates that tangent-space regularization through
Jacobian propagation is effective and reduces prediction error for f , but not g ,
where weight decay performs equally well. Right: Distribution of errors in estimated
Jacobians. Jacprop is effective at improving the fidelity of the model Jacobians for
both f and g .
different frequencies and random noise as input and compare prediction error as
well as the error in the estimated Jacobian. The dynamical equations of the system
are given by
θ̈ = −(g/l) sin(θ) + (u/l) cos(θ) − d θ̇    (8.11)
v̇ = p̈ = u    (8.12)
where g , l , d denote the acceleration of gravity, the length of the pendulum and
the damping, respectively.
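The system (8.11)-(8.12) can be simulated with a standard Runge-Kutta integrator as in the Python sketch below; the parameter values and input signal are hypothetical stand-ins, not the ones used in the evaluation:

```python
import numpy as np

def dynamics(x, u, g=9.82, l=1.0, d=0.1):
    """State derivative of (8.11)-(8.12); state x = (theta, theta_dot, p, v).
    Parameter values are hypothetical."""
    th, thd, p, v = x
    thdd = -(g / l) * np.sin(th) + (u / l) * np.cos(th) - d * thd
    return np.array([thd, thdd, v, u])

def rk4_step(x, u, h):
    """One Runge-Kutta 4 step with zero-order-hold input u."""
    k1 = dynamics(x, u)
    k2 = dynamics(x + h / 2 * k1, u)
    k3 = dynamics(x + h / 2 * k2, u)
    k4 = dynamics(x + h * k3, u)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

h, N = 0.01, 200
traj = [np.zeros(4)]
for t in range(N):                      # hypothetical sum-of-sines input
    u = np.sin(0.5 * t * h) + 0.5 * np.sin(1.3 * t * h)
    traj.append(rk4_step(traj[-1], u, h))
traj = np.array(traj)
print(traj.shape)  # (201, 4)
```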
Once again we train two models, one with weight decay and one with Jacprop.
The regularization parameters were in both cases chosen such that prediction
error on test data was minimized. The models were trained for 2500 epochs on
two trajectories of 200 time steps, approximately an order of magnitude fewer
data points than the number of parameters in the models.
The prediction and Jacobian errors for validation data, i.e., trajectories not
seen during training, are shown in Fig. 8.5. The results indicate that while learning
f , tangent-space regularization leads to reduced prediction errors compared to
weight decay, with lower mean error and smaller spread, indicating more robust
learning. Learning of g + x did not benefit much from Jacobian propagation in
terms of prediction performance compared to weight decay, and both training
methods perform on par and reach a much lower prediction error than the f
models.
To assess the fidelity of the learned Jacobian, we compare it to the ground-truth
Jacobian of the simulator. We display the distribution of errors in the estimated
Jacobians in Fig. 8.5. The results show a significant benefit of tangent-space regu-
larization over weight decay for learning both f and g + x, with a reduction of the
Figure 8.6 Eigenvalues of the pendulum system using the tanh activation func-
tion on validation data.
mean error as well as a smaller spread of errors (please note that the figure has
a logarithmic y-axis).
The individual entries of the model Jacobian along a trajectory for one instance
of the trained models are visualized as functions of time in Fig. 8.7. The figure
illustrates the smoothing effect of the tangent-space regularization and verifies
the smoothness assumption on the pendulum system. We also note that the
regularization employed does not restrict the ability of the learned model to
change its Jacobian along the trajectory, tracking the true system Jacobian. This is
particularly indicated in the (1, 2) and (4, 3) entries in Fig. 8.7. The figure also shows
how weight decay tuned for optimal prediction performance allows a rapidly
changing Jacobian, indicating overfitting. If a higher weight-decay penalty is used,
this overfitting is reduced, at the expense of prediction performance, hinting at
the heavily biasing properties of excessive weight decay.
The eigenvalues of the true system and learned models are visualized in Fig. 8.6.
8.7 Discussion
Figure 8.7 Individual entries in the Jacobian as functions of time step along a
trajectory of the pendulum system (Ground truth -- , Weight decay — , Jacprop
— ). The figure verifies the smoothness assumption on the system and indicates
that Jacprop is successful in promoting smoothness of the Jacobian of the estimated
model. The entries (1, 2) and (4, 3) change the most along the trajectory, and the
Jacprop-regularized model tracks these changes well without excessive smoothing.
In this domain, the notion of time constants is less well defined. The state of a
recurrent neural network for natural language modeling can change very abruptly
depending on which word is input. LSTMs thus also incorporate a gating mechanism
to allow components of the state to be "forgotten". For mechanical systems,
an example of an analogous situation is a state constraint. A dynamical model of a
bouncing ball, for instance, must learn to forget the velocity component of the
state when the position reaches the constraint.
The scope of this chapter was limited to settings where a state sequence is
known. This allowed us to reason about eigenvalues of Jacobians and compare
the learned Jacobians to those of the ground-truth model of a simulator. In a more
general setting, learning the transformation of past measurements and inputs
to a state representation is required, e.g., using a network with recurrence or an
auto-encoder [Karl et al., 2016]. Initial results indicate that the conclusions drawn
regarding the formulation ( f vs. g + x) of the model and the effect of weight decay
remain valid in the RNN setting, but a more detailed analysis is the target of future
work. The concept of tangent-space regularization applies equally well to the
Jacobian from input to hidden state in an RNN, and potential benefits of this kind
of regularization in the general RNN setting remain to be investigated.
We also restricted our exposition to the simplest possible noise model, cor-
responding to the equation-error problem discussed in Sec. 3.1. In a practical
scenario, estimating a more sophisticated noise model may be desirable. Noise
models in the deep-learning setting add complexity to the estimation in the same
way as for linear models. A practical approach, inspired by pseudo-linear regres-
sion [Ljung, 1987], is to train a model without noise model and use this model to
estimate the noise sequence through the prediction errors. This noise sequence
can then be used to train a noise model. If this noise model is trained together
with the dynamics model, back-propagation through time is required and the
computational complexity is increased. Deep-learning examples including noise
models are found in [Karl et al., 2016].
8.8 Conclusions
We investigated different architectures of a neural-network model for modeling
of dynamical systems and found that the relationship between sample time and
system bandwidth affects the preferred choice of architecture: an approximator
architecture incorporating an identity element, similar to that of LSTMs and
resnets, trains faster and generally generalizes better in terms of all metrics when
the sample rate is high. An analysis of gradient and Hessian expressions motivated
the difference, and the conclusions were reinforced by experiments.
The effect of including L 2 weight decay was investigated and shown to vary
greatly with the model architecture. Implications on the stability and tangent-
space eigenvalues of the learned model highlight the need to consider the archi-
tecture choice carefully.
We further demonstrated how tangent-space regularization by means of Ja-
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models
8.B Deviations from the Nominal Model
[Figure 8.8: violin plots for the f and g model formulations, for different activation functions.]
Figure 8.8 Distributions (cropped at extreme values) of log-prediction errors on the validation data after 20, 500, and 1500 (left, middle, right) epochs of training for different activation functions. Each violin represents 200 Monte-Carlo runs and is independently normalized such that its width is proportional to the density, with a fixed maximum width.
9 Friction Modeling and Estimation
9.1 Introduction
All mechanical systems with moving parts are subject to friction. The friction force
is a product of interaction forces on an atomic level and is always resisting relative
motion between two elements in contact. Because of the complex nature of the
interaction forces, friction is usually modeled based on empirical observations.
The simplest model of friction is the Coulomb model, (9.1), which assumes a
constant friction force acting in the reverse direction of motion
F_f = k_c sign(v)   (9.1)
where k_c is the Coulomb friction constant and v is the relative velocity between the interacting surfaces.
A slight extension to the Coulomb model also includes velocity-dependent terms
F_f = k_v v + k_c sign(v)   (9.2)
where k_v is the viscous friction coefficient. The Coulomb model and the viscous
model are illustrated in Fig. 9.1. If the friction is observed to vary with the direction
of motion, sign(v), the model (9.2) can be extended to
F_f = k_v v + k_{c+} sign(v_+) + k_{c−} sign(v_−)   (9.3)
where k_{c+} and k_{c−} denote the Coulomb levels for positive and negative velocities, respectively. The models above can further be augmented with stiction, a friction force at rest that can exceed the Coulomb level,
F_f = F_e             if v = 0 and |F_e| ≤ k_s
F_f = k_s sign(F_e)   if v = 0 and |F_e| > k_s   (9.4)
where k_s is the stiction friction coefficient and F_e denotes the externally applied force. An external force greater than the
stiction force will, according to model (9.4), cause an instantaneous acceleration
and a discontinuity in the friction force.
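The static models above are straightforward to implement; the sketch below uses NumPy, and the coefficient values in the usage example are hypothetical:

```python
import numpy as np

def coulomb(v, k_c):
    """Coulomb model (9.1): a constant force opposing the direction of motion."""
    return k_c * np.sign(v)

def viscous(v, k_c, k_v):
    """Viscous model (9.2): Coulomb friction plus a velocity-dependent term."""
    return k_v * v + k_c * np.sign(v)

def asymmetric(v, k_v, k_cp, k_cm):
    """Direction-dependent model (9.3): separate Coulomb levels for
    positive and negative velocities."""
    return k_v * v + np.where(v > 0, k_cp, np.where(v < 0, -k_cm, 0.0))

v = np.array([-2.0, -0.5, 0.5, 2.0])
print(viscous(v, k_c=15.0, k_v=5.0))   # [-25. -17.5 17.5 25.]
```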
The models above suffice for many purposes but cannot explain several commonly observed friction-related phenomena, such as the Stribeck effect and dynamical behavior [Olsson et al., 1998]. To explain more complicated behavior, dynamical models such as the Dahl model [Dahl, 1968] and the LuGre model [De Wit et al., 1995] have been proposed.
Most proposed friction models include velocity-dependent effects, but no
position dependence. A dependence upon position is however often observed,
and may stem from, for instance, imperfect assembly, irregularities in the contact
surfaces or application of lubricant, etc. [Armstrong et al., 1994]. Modeling of the
position dependence is unfortunately nontrivial due to an often irregular relation-
ship between the position and the friction force. Several authors have however
made efforts in the area. Armstrong (1988) used accurate friction measurements
to implement a look-up table for the position dependence and Huang et al. (1998)
adaptively identified a sinusoidal position dependence.
More recent endeavors by Kruif and Vries (2002) used an Iterative Learning
Control approach to learn a feedforward model including position-dependent
friction terms.
In [Bittencourt and Gunnarsson, 2012], no significant positional dependence of the friction in a robot joint was found. However, a clear dependence upon the temperature of the contact region was reported. To allow for temperature sensing, the grease in the gear box was replaced by an oil-based lubricant, and the temperature was measured in the oil-flow circuit.
A standard approach to dealing with systems with varying parameters is recursive identification during normal operation [Johansson, 1993]. Recursive identification of the models (9.1) and (9.2) could account for both position and temperature dependence. While straightforward in theory, it is often hard to carry out robustly in practical situations. The presence of external forces, accelerating motions, etc., requires either a pause in the adaptation or an accurate model of the additional dynamics. Many control programs, such as time-optimal programs, never exhibit zero acceleration, leaving no opportunity for parameter adaptation.
To see why unmodeled dynamics cause a bias in the estimated parameters,
consider the system
f = m a + f_ext + F_f   (9.5)
9.2 Models and Identification Procedures
Figure 9.2 Dual-arm robot and industrial manipulator IRB140 used for experi-
mental verification of proposed models and identification procedures.
with mass m and externally applied force f_ext. If we model this system as
f̂ = F_f   (9.6)
we effectively have the disturbance terms
f − f̂ = m a + f_ext   (9.7)
These terms do not constitute uncorrelated random noise with zero mean and
will thus introduce a bias in the estimate. It is therefore of importance to obtain as
accurate models as possible offline, where conditions can be carefully controlled
to minimize the influence of unmodeled dynamics.
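The bias mechanism is easy to demonstrate numerically. In the sketch below (all values made up), a constant external load combined with a velocity profile that spends unequal time in the two directions biases the Coulomb estimate, while modeling the load removes the bias:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
k_c, f_ext = 15.0, 5.0     # Coulomb level and constant external load (hypothetical)

# Velocity profile spending unequal time in the two directions of motion
v = np.concatenate([np.full(700, 0.5), np.full(300, -0.5)])
f = k_c * np.sign(v) + f_ext + 0.1 * rng.standard_normal(N)

# Modeling f = F_f only: f_ext does not average out over sign(v) and biases k_c
k_biased = np.linalg.lstsq(np.sign(v)[:, None], f, rcond=None)[0][0]

# Modeling the external force (here via an intercept column) removes the bias
A = np.column_stack([np.sign(v), np.ones(N)])
k_unbiased = np.linalg.lstsq(A, f, rcond=None)[0][0]
print(k_biased, k_unbiased)   # ≈ 17.0 (biased by f_ext*(N+ - N-)/N = 2) and ≈ 15.0
```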
This chapter develops a model that incorporates positional friction depen-
dence as well as an implicitly temperature-dependent term. The proposed ad-
ditions can be combined or used independently as appropriate. Since many in-
dustrially relevant systems lack temperature sensing in areas of importance for
friction modeling, a sensor-less approach is proposed. Both models are used for
identification of friction in the joint of an industrial collaborative-style robot, see
Fig. 9.2, and special aspects of position dependence are verified on a traditional
industrial manipulator.
This section first introduces a general identification procedure for friction models
linear in the parameters, based on the least-squares method, followed by the
introduction of a model that allows for the friction to vary with position. Third, a
model that accounts for temperature-varying friction phenomena is introduced. Here, a sensor-less approach is adopted, in which the power loss due to friction is used as the input to a first-order system.
As the models are equally suited for friction due to linear and angular move-
ments, the terms force and torque are here used interchangeably.
C(p, v) = 0,  a = 0   ⇒   τ = G(p) + F(v)   (9.9)
To further simplify the presentation, it is assumed that G(p) = 0. This can easily be
achieved by either aligning the axis of rotation with the gravitational vector such
that gravitational forces vanish, by identifying and compensating for a gravity
model1 or, as in [Bittencourt and Gunnarsson, 2012], performing a symmetric
experiment with both positive and negative velocities and calculating the torque
difference.
As a result of the discontinuity of the Coulomb model and the related uncer-
tainty in estimating the friction force at zero velocity, datapoints where the velocity
is not significantly different from zero must be removed from the dataset used
for estimation. Since the estimated sign of the velocity is likely to be wrong at these points, their inclusion might lead to severe bias in the estimate of the friction parameters.
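The least-squares procedure with this velocity threshold can be sketched on simulated data; the parameter values and the threshold below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
k_v, k_c = 5.0, 15.0            # true parameters (hypothetical)
v = rng.uniform(-2.0, 2.0, 2000)
tau = k_v * v + k_c * np.sign(v) + 0.5 * rng.standard_normal(v.size)

# Discard points where the velocity is not significantly different from zero;
# near v = 0 the sign, and hence the Coulomb regressor, is unreliable
mask = np.abs(v) > 0.1
A = np.column_stack([v[mask], np.sign(v[mask])])
k_hat, *_ = np.linalg.lstsq(A, tau[mask], rcond=None)
print(k_hat)   # ≈ [k_v, k_c]
```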
9.3 Position-Dependent Model
with the relations between relevant motor- and arm-side quantities given by
q_a = G q_m,   τ_a = G⁻¹ τ_m   (9.11)
This causes cross couplings between the arm-side and motor-side friction for the
last three joints. For the first three joints, only the sum of arm- and motor-side
friction is visible, but for joint 5 and 6, friction models of the motor-side friction
for joint 4 and joints 4 and 5, respectively, are needed. Experiments on an ABB
IRB2400 robot indicate that the friction on both sides of the motor is of roughly
equal importance for the overall result, and special treatment of the wrist joints is
therefore crucial.
For the example above, a friction model for joint 6 will thus have to contain
terms dependent on the velocities of the fourth and fifth motors as well, e.g.:
κ(p, µ, σ) = exp(−(p − µ)² / (2σ²))   (9.13)

φ(p) : (p ∈ P) ↦ R^{1×K}
φ(p) = [κ(p, µ_1, σ), ⋯, κ(p, µ_K, σ)]   (9.14)
where µ_i ∈ P, i = 1, ..., K, is a set of K evenly spaced centers. For each input position
p ∈ P ⊆ R, the kernel vector φ(p) will have activated (>0) entries for the kernels
with centers close to p. Refer to Fig. 6.2 for an illustration of RBFs. The kernel
vector is included in the regressor-matrix A from Sec. 6.2 such that if used together
with a nominal, viscous friction model, A and the parameter vector k are given by
A = [ v_1 sign(v_1) φ(p_1) ; ⋯ ; v_N sign(v_N) φ(p_N) ] ∈ R^{N×(2+K)},   k = [k_v, k_c, k_κ]ᵀ   (9.15)

F_f = F_n + φ(p) k_κ   (9.16)
where F n is one of the nominal models from Sec. 9.1. The attentive reader might
expect that for appropriate choices of basis functions, the regressor matrix A will
be rank-deficient. This is indeed the case since the sign(v) column lies in the
span of φ. However, the interpretation of the resulting model coefficients is more
intuitive if the Coulomb level is included as a baseline around which the basis-
function expansion models the residuals. To mitigate the issue of rank deficiency, one can either estimate the nominal model first and then fit the BFE to the residuals, or include a slight ridge-regression penalty on k_κ.
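The latter mitigation can be sketched as follows; the data, kernel count, and penalty strength below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, lam = 3000, 10, 1e-2          # data size, kernel count, ridge strength (hypothetical)
k_v, k_c = 5.0, 15.0

p = rng.uniform(-np.pi, np.pi, N)                           # positions
v = rng.uniform(0.2, 2.0, N) * rng.choice([-1.0, 1.0], N)   # velocities away from zero
mu = np.linspace(-np.pi, np.pi, K)                          # evenly spaced centers
sigma = mu[1] - mu[0]
phi = np.exp(-(p[:, None] - mu[None, :])**2 / (2 * sigma**2))   # kernels (9.13)-(9.14)

# Simulated torque with a position-dependent friction component
tau = k_v * v + k_c * np.sign(v) + 2.0 * np.sin(p) + 0.1 * rng.standard_normal(N)

A = np.column_stack([v, np.sign(v), phi])                   # regressor matrix (9.15)
# Ridge penalty on the kernel parameters only, leaving the nominal model unpenalized
R = np.diag(np.r_[0.0, 0.0, np.full(K, lam)])
k = np.linalg.solve(A.T @ A + R, A.T @ tau)
rmse = np.sqrt(np.mean((tau - A @ k)**2))
print(k[:2], rmse)   # k[:2] ≈ [k_v, k_c]; rmse close to the noise level
```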
The above method is valid for position-varying Coulomb friction. It is conceiv-
able that the position dependence is affected by the velocity, in which case the
model (9.16) will produce a sub-optimal result. The RBF network can, however, be
designed to cover the space (P × V ) ⊆ R2 . The inclusion of velocity dependence
comes at the cost of an increase in the number of parameters from K p to K p K v ,
where K p and K v denote the number of basis-function centers in the position and
velocity input spaces, respectively.
The expression for the RBF kernel will in this extended model assume the form
κ(x, µ, Σ) = exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))   (9.17)
where x = [p v]ᵀ ∈ P × V, µ ∈ P × V, and Σ is the covariance matrix determining the widths of the basis functions

φ(x) : (x ∈ P × V) ↦ R^{1×(K_p K_v)}
φ(x) = [κ(x, µ_1, Σ), ⋯, κ(x, µ_{K_p K_v}, Σ)]   (9.18)

9.4 Energy-Dependent Model
Friction is often observed to vary with the temperature of the contact surfaces
and lubricants involved [Bittencourt and Gunnarsson, 2012]. Many systems of
industrial relevance lack the sensors needed to measure the temperature of the
contact regions, thus rendering temperature-dependent models unusable.
The main contributor to the rise in temperature that occurs during operation
is heat generated by friction. This section introduces a model that estimates the
generated energy, and also estimates its influence on the friction.
A simple model for the temperature change in a system with temperature T, surrounding temperature T_s, and power input W, is given by
dT(t)/dt = k_s (T_s − T(t)) + k_W W(t)   (9.19)
for some constants k_s > 0, k_W > 0. After the variable change ∆T(t) = T(t) − T_s and transformation to the Laplace domain, the model (9.19) can be written
∆T_c(s) = k_W / (s + k_s) · W_c(s)   (9.20)
where the power input generated by friction losses equals the product of the friction force and the velocity
W(t) = |F_f(t) v(t)|   (9.21)
We propose to include the estimated power lost due to friction, and its influence on friction itself, in the friction model according to
F_f(t) = F_n(t) + E(t)   (9.22)
where the friction force F f has been divided into the nominal friction F n and the
signal E , corresponding to the influence of the thermal energy stored in the joint.
The nominal model F n can be chosen as any of the models previously introduced,
including (9.16). The energy is assumed to be supplied by the instantaneous power
due to friction, W , and is dissipating as a first order system with time constant
τ̄e . A discrete representation is obtained after Zero-Order-Hold (ZOH) sampling
[Wittenmark et al., 2002] according to
E_d(z) = H(z) W_d(z) = k_e / (z − τ_e) · W_d(z)   (9.24)
In the suggested model form, (9.22) to (9.24), the transfer function H (z) in-
corporates both the notion of energy being stored and dissipated, as well as the
influence of the stored energy on the friction.
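The transfer function (9.24) corresponds to a one-pole recursion. A minimal sketch, using the discrete parameter values of Table 9.1 and a hypothetical constant power input:

```python
import numpy as np

k_e, tau_e = -0.5, 0.9983        # discrete-time parameters, cf. Table 9.1

def energy_signal(W, k_e, tau_e):
    """Filter the friction power W through H(z) = k_e/(z - tau_e), eq. (9.24)."""
    E = np.zeros(W.size + 1)
    for t in range(W.size):
        E[t + 1] = tau_e * E[t] + k_e * W[t]
    return E[1:]

W = np.full(3600, 15.0)          # stationary friction-power input (hypothetical)
E = energy_signal(W, k_e, tau_e)
print(E[-1])   # approaches the stationary value k_e * 15 / (1 - tau_e)
```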
Denote by τ̂_n the output of the nominal model F_n. Estimation of the signal E can now be done by rewriting (9.22) in two different ways
Ê = τ − τ̂_n   (9.25)
τ − Ê = F_n   (9.26)
where (9.25) provides an estimate of E from the prediction errors of the nominal model, and (9.26) allows the nominal model to be re-estimated once an estimate Ê is available.
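The resulting alternating scheme can be sketched as below. This is a simplified sketch of the idea, not the thesis' Algorithm 6; all parameter values are hypothetical, and the one-pole model is fitted by ordinary least squares on the noise-corrupted residuals, which in practice may call for more careful estimators:

```python
import numpy as np

rng = np.random.default_rng(4)
k_v, k_c = 5.0, 15.0             # nominal friction parameters (hypothetical)
k_e, tau_e = -5e-4, 0.995        # energy-model parameters (hypothetical)

# Simulate (9.22)-(9.24): tau = F_n + E, with E driven by the power W = |F_f v|
N = 5000
v = rng.uniform(0.2, 2.0, N) * rng.choice([-1.0, 1.0], N)
tau, W, E = np.zeros(N), np.zeros(N), 0.0
for t in range(N):
    F_f = k_v * v[t] + k_c * np.sign(v[t]) + E
    W[t] = abs(F_f * v[t])
    tau[t] = F_f + 0.05 * rng.standard_normal()
    E = tau_e * E + k_e * W[t]

A = np.column_stack([v, np.sign(v)])
W_meas = np.abs(tau * v)         # friction power estimated from measurements, (9.21)

# Alternate between the two rewritings of (9.22):
#   (i)  E_hat = tau - tau_n_hat  -> identify H(z)
#   (ii) tau - E_filt = F_n       -> re-identify the nominal model
theta = np.linalg.lstsq(A, tau, rcond=None)[0]
for _ in range(5):
    E_hat = tau - A @ theta
    # one-pole model E[t] = tau_e*E[t-1] + k_e*W[t-1] is linear in (tau_e, k_e)
    reg = np.column_stack([E_hat[:-1], W_meas[:-1]])
    tau_e_hat, k_e_hat = np.linalg.lstsq(reg, E_hat[1:], rcond=None)[0]
    E_filt = np.zeros(N)
    for t in range(1, N):
        E_filt[t] = tau_e_hat * E_filt[t - 1] + k_e_hat * W_meas[t - 1]
    theta = np.linalg.lstsq(A, tau - E_filt, rcond=None)[0]

rmse_nominal = np.sqrt(np.mean((tau - A @ np.linalg.lstsq(A, tau, rcond=None)[0])**2))
rmse_energy = np.sqrt(np.mean((tau - A @ theta - E_filt)**2))
print(theta, rmse_nominal, rmse_energy)
```

Including the energy term reduces the residual substantially compared to the nominal model alone, mirroring the simulation study in Sec. 9.5.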
The proposed model suggests that the change in friction due to the tempera-
ture change occurs in the Coulomb friction. This assumption is always valid for
the nominal model (9.1), and a reasonable approximation for the model (9.2) if
k_c ≫ k_v v or if the system is both operated and identified in a small interval of
velocities. If, however, the temperature change has a large effect on the viscous
friction or on the position dependence, a 3D basis-function expansion can be
performed in the space P × V × E , E ∈ E . This general model can handle arbitrary
nonlinear dependencies between position, velocity and estimated temperature.
The energy signal E can then be estimated using a simple nominal model and subsequently incorporated in the more complex model.
Figure 9.3 A realization of simulated signals. The figure shows how the envelope of the applied torque approximately decays as the signal E. Dashed blue lines are drawn to illustrate the determination of initial guesses for the time constant τ̄_e and the gain k̄_e.
Initial guess
For this scheme to work, an initial estimate of the parameters in H (z) is needed.
This can be easily obtained by observing the raw torque data from an experiment.
Consider for example Fig. 9.3, where the system (9.22) and (9.23) has been sim-
ulated. The figure depicts the torque signal as well as the energy signal E . The
envelope of the torque signal decays approximately as the signal E , which allows
for easy estimation of the gain k̄_e and the time constant τ̄_e. The time constant τ̄_e is determined by the time it takes for the signal to reach (1 − e⁻¹) ≈ 63 % of its final value. Since G(s) is essentially a low-pass filter, the output E = G(s)W will approximately reach E_∞ = G(0)E(W) = k̄_e E(W) if sent a stationary, stochastic input W with fast enough time constant (≪ τ̄_e). Here, E(·) denotes the statistical expectation operator and E_∞ is the final value of the signal E. An initial estimate of the gain k̄_e can thus be obtained from the envelope of the torque signal as
k̄_e ≈ E_∞ / E(W) ≈ E_∞ / ((1/N) Σ_n W_n)   (9.27)
We refer to Fig. 9.3 for an illustration, where dashed guides have been drawn to
illustrate the initial guesses.
The discrete counterpart to G(s) can be obtained by discretization with rele-
vant sampling time [Wittenmark et al., 2002].
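The initial-guess procedure can be sketched on a simulated step response; the continuous-time values below are hypothetical:

```python
import numpy as np

k_e_bar, tau_e_bar = -3.0, 600.0     # continuous-time gain and time constant [s] (hypothetical)
t = np.arange(0.0, 3600.0, 1.0)      # sample times, h = 1 s

W = np.full(t.size, 15.0)                                  # stationary power input
E = k_e_bar * np.mean(W) * (1.0 - np.exp(-t / tau_e_bar))  # step response of G(s)

E_inf = E[-1]
# Time constant: time at which the response reaches (1 - 1/e) ≈ 63 % of its final value
tau_guess = t[np.argmax(np.abs(E) >= 0.63 * np.abs(E_inf))]
# Gain from (9.27): k_e ≈ E_inf / mean(W)
k_guess = E_inf / np.mean(W)
print(tau_guess, k_guess)   # close to tau_e_bar and k_e_bar
```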
9.5 Simulations
To analyze the validity of the proposed technique for estimation of the energy-
dependent model, a simulation was performed. The system described by (9.22)
and (9.23) was simulated to create 50 realizations of the relevant signals, and the
proposed method was run for 50 iterations to identify the model parameters. The
parameters used in the simulation are provided in Table 9.1. Initial guesses were chosen at random from the uniform distributions k̂_e ∼ U(0, 3k̄_e) and τ̂_e ∼ U(0, 3τ̄_e).
Table 9.1 Parameter values used in simulation. Values given on the format x/y represent continuous/discrete values.

Parameter                 Value
k_v                       5
k_c                       15
k_e                       −3 / −0.5
τ_e                       10 / 0.9983
Measurement noise σ_τ     0.5 Nm
Sample time h             1 s
Duration                  3600 s
Iterations                50
Figure 9.4 shows that the estimated parameters converge rapidly to their true
values, and Fig. 9.5 indicates that the Root Mean Square output Error (RMSE)
converges to the level of the added measurement noise. Figure 9.5 further shows
that the errors in the parameter estimates, as defined by (9.28), were typically
below 5 % of the parameter values.
NPE = sqrt( Σ_{i=1}^{N_p} ( (x̂_i − x_i) / |x_i| )² )   (9.28)
9.6 Experiments
The proposed models and identification procedures were applied to data from
experiments with the dual-arm and the IRB140 industrial robots, see Fig. 9.2.
Procedure
For the IRB140, the first joint was used. The rest of the arm was positioned so as to
minimize the moment of inertia. For the dual-arm robot, joint four in one of the
arms was positioned such that the influence of gravity vanished.
A program that moved the selected joint at piecewise constant velocities
between the two joint limits was executed for approximately 20 min. Torque-,
Figure 9.4 Estimated parameters during 50 simulations. The horizontal axis dis-
plays the iteration number and the vertical axis the current parameter value. True
parameter values are indicated with dashed lines.
Figure 9.5 Evolution of errors during the simulations, the horizontal axis displays
the iteration number. The left plot shows normalized norms of parameter errors,
defined in (9.28), and the right plot shows the RMS output error using the estimated
parameters. The standard deviation of the added measurement noise is shown
with a dashed line.
Figure 9.6 Illustration of the torque dependence upon the motor position for the
IRB140 robot.
velocity-, and position data were sampled and filtered at 250 Hz and subsequently
sub-sampled and stored at 20 Hz, resulting in 25 000 data points. Points approxi-
mately satisfying (9.9) were selected for identification, resulting in a set of 16 000
data points.
Nominal Model The viscous model (9.3) was fit using the ordinary LS procedure
from Sec. 9.2. This model was also used as the nominal model in the subsequent
fitting of position model (9.16) and energy model, (9.22) to (9.24).
Position model For the position-dependent model, the number of basis func-
tions and their bandwidth was determined using cross validation. A large value
of σ has a strong regularizing effect and resulted in a model that generalized well
outside the training data. The model was fit using normalized basis functions as
discussed in Sec. 6.3.
Due to the characteristics of the gear box and electrical motor in many indus-
trial robots, there is a clear dependence not only on the arm position, but also on
the motor position. Figure 9.6 shows the torque versus the motor position when
the joint is operated at constant velocity. This effect is especially strong on the IRB140, and results are therefore illustrated for this robot. Both arm and motor positions are available through the simple relationship p_motor = mod_{2π}(g · p_arm), where g denotes the gear ratio. This allows for a basis-function expansion also in the space of motor positions. To illustrate this, p_motor was expanded into K_{p_m} × K_v = 36 × 6
basis functions, corresponding to the periodicity observed in Fig. 9.6. The results
for the model with motor-position dependence are reported separately. Further
modeling and estimation of the phenomena observed in Fig. 9.6 is carried out
in Chap. 10, where a spectral estimation technique is developed, motivated by the
observation that the spectrum of the signal in Fig. 9.6 is modulated by the velocity
of the motor.
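The motor-position expansion can be sketched as follows; the gear ratio and data are hypothetical, and a periodic kernel would treat the wrap-around at 2π more gracefully than the plain Gaussian used here:

```python
import numpy as np

g = 121.0                                # gear ratio (hypothetical)
p_arm = np.linspace(0.0, 2 * np.pi, 1000)
p_motor = np.mod(g * p_arm, 2 * np.pi)   # motor position from arm position

K = 36                                   # kernel centers over one motor revolution
mu = np.linspace(0.0, 2 * np.pi, K, endpoint=False)
sigma = 2 * np.pi / K
# Gaussian kernel activations in the space of motor positions
phi = np.exp(-(p_motor[:, None] - mu[None, :])**2 / (2 * sigma**2))
print(phi.shape)   # (1000, 36)
```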
To reduce variance in the estimated kernel parameters, all position-dependent
models were estimated using ridge regression (Sec. 6.4), where an L2-penalty was
put on the kernel parameters. The strength of the penalty was determined using
cross validation. All basis-function expansions were performed with normalized
basis functions.
Figure 9.7 Estimated parameters from experimental data. The horizontal axis
displays the iteration number and the vertical axis displays the current parameter
value.
Table 9.2 Performance indicators for the three different models identified on the
dual-arm robot.
Energy model The energy-dependent model was identified for the dual-arm
robot using the procedure described in Algorithm 6. The initial guesses for H (z)
were τ̄e = 10 min and k̄ e = −0.1. The nominal model was chosen as the viscous fric-
tion model (9.3). Once the signal E was estimated, a kernel expansion in the space
P × V × E with 40 × 6 × 3 basis functions was performed to capture temperature-
dependent effects in both the Coulomb and viscous friction parameters.
Results
The convergence of the model parameters is shown in Fig. 9.7. Figure 9.8 and
Fig. 9.9 illustrate how the models identified for the dual-arm robot fit the exper-
Figure 9.8 Model fit to experimental data (dual-arm). The upper plot shows an early stage of the experiment, when the joint is cold; the lower plot shows a later stage, when the joint has been warmed up.
imental data. The upper plot in Fig. 9.8 shows an early stage of the experiment
when the joint is cold. At this stage, the model without the energy term under-
estimates the torque needed, whereas the energy model does a better job. The
lower plot shows a later stage of the experiment where the mean torque level is
significantly lower. Here, the model without energy term is instead slightly overes-
timating the friction torque. The observed behavior is expected, since the model
without energy dependence will fit the average friction level during the entire
experiment. The two models correspond well in the middle of the experiments
(not shown). Figure 9.9 illustrates the friction torque predicted by the estimated
model as a function of position and velocity. The visible rise in the surface at large
positive positions and positive velocities corresponds to the increase in friction
torque observed in Fig. 9.8 at, e.g., time t = 0.3 min.
The nominal model (9.3) cannot account for any of the positional effects and produces an overall much worse fit than the position-dependent models.
Different measures of model fit for the three models are presented in Table 9.2 and
Fig. 9.11 (Fit (%), Final Prediction Error, Root Mean Square Error, Mean Absolute
Error). For definitions, see e.g., [Johansson, 1993].
[Figure 9.9: friction torque [Nm] predicted by the estimated model, as a function of position [rad] and velocity [rad/s].]
Figure 9.10 Model fit including kernel expansion for motor position on IRB140.
During t = [0 s, 22 s], the joint traverses a full revolution of 2π rad. The same dis-
tance was traversed backwards with a higher velocity during t = [22 s, 33 s]. Notice
the repeatable pattern as identified by the position-dependent models.
For the IRB140, three models are compared: the nominal model (9.3), a model with a basis-function expansion in the space P_arm, and a model with an additional basis-function expansion in the space P_motor × V. The resulting model fits are
shown in Fig. 9.10. What may seem like random measurement noise in the torque
signal is in fact predictable using a relatively small set of parameters. Figure 9.12
illustrates that the large dependence of the torque on the motor position results
in large errors. The inclusion of a basis-function expansion of the motor position
in the model reduces the error significantly.
9.7 Discussion
The proposed models try to increase the predictive power of common friction mod-
els, and thereby increase their utility for model-based filtering, by incorporating
position and temperature dependence into the friction model. Systems with varying parameters can, in theory, be estimated with recursive algorithms, so-called online identification. As elaborated upon in Sec. 9.1, online or observer-based
identification of friction models is often difficult in practice due to the presence of
additional dynamics or external forces. The proposed models are identified offline,
during a controlled experiment, and are thus not subject to the problems asso-
ciated with online identification. However, apart from the temperature-related
parameters, all suggested models are linear in the parameters, and could be
updated recursively using, for instance, the well-known recursive least-squares
algorithm or the Kalman-smoothing algorithms in Chap. 7.
This chapter makes use of standard and well-known models for friction, combined with a basis-function expansion to model position dependence. This choice
was motivated by the large increase in model accuracy achieved for a relatively
small increase in model complexity. Linear models are easy to estimate and the
solution to the least-squares optimization problem is well understood. Depending
on the intended use of the friction model, the most fruitful avenue to investigate in order to increase the model accuracy further varies. For the purpose of force estimation, accurate models of the stiction force are likely important. Stationary
joints impose a fundamental limitation in the accuracy of the force estimate, and
the maximum stiction force determines the associated uncertainty of the estimate.
Preliminary work shows that the problem of indeterminacy of the friction force
for static joints of redundant manipulators can be mitigated by superposition of a
periodic motion in the nullspace of the manipulator Jacobian. Exploration of this
remains an interesting avenue for future work.
Although outside the scope of this work, effects of joint load on the friction be-
havior can be significant [Bittencourt and Gunnarsson, 2012]. Such dependencies
could be incorporated in the proposed models using the same RBF approach as
for the incorporation of position dependence, i.e., through an RBF expansion in the joint-load (l ∈ L) dimension according to φ(x) : (x ∈ P × E × L) ↦ R^{1×(K_p K_e K_l)},
with K l basis-function centers along dimension L. This strategy would capture
possible position and temperature dependencies in the load-friction interaction.
The temperature-dependent part of the proposed model originates from the simplest possible model for energy storage, a generic first-order differential equation. Since the generated energy is initially unknown, incorporating it in the model is not straightforward. We rely on the assumption that a simple initial
friction model can be estimated without this effect and subsequently be used
to estimate the generated energy loss. The energy loss estimated by this model
can then be incorporated in a more complex model. Iterating this scheme was
shown to converge in simulations, but depending on the conditions, the scheme
might diverge. This might happen if, e.g., the friction varies significantly with
temperature, where significantly is taken as compared to the nominal friction
value at room temperature. In such situations, the initially estimated model will
be far from the optimum, reducing the chance of convergence. In practice, this
issue is easily mitigated by estimating the initial model only on data that comes
from the joint at room temperature.
In its simplest form, the proposed energy-dependent model assumes that the
change in friction occurs in the Coulomb friction level. This is always valid for the
Coulomb model, and a reasonable approximation for the viscous friction model
if k_c ≫ k_v v or if the system is both operated and identified in a small interval of
velocities. If the viscous friction k v v is large, the approximation will be worse. This
where the Coulomb and viscous constants are seen as functions of the estimated
energy signal E , i.e., a Linear Parameter-Varying model (LPV). To accomplish
this, a kernel expansion including the estimated energy signal was suggested and
evaluated experimentally.
Although models based on the internally generated power remove the need for
temperature sensing in some scenarios, they do not cover significant variations in
the surrounding temperature. The power generated in, for instance, an industrial
robot is, however, often high enough to cause a much larger increase in tempera-
ture than the expected temperature variations of its surrounding [Bittencourt and
Gunnarsson, 2012].
9.8 Conclusions
10 Spectral Estimation
10.1 Introduction
Spectral estimation refers to a family of methods that analyze the frequency con-
tents of a sampled signal by means of decomposition into a linear combination
of periodic basis functions. Armed with an estimate of the spectrum of a signal,
it is possible to determine the distribution of power among frequencies, identify
disturbance components and design filters, etc. [Stoica and Moses, 2005]. The
spectrum also serves as a powerful feature representation for many classifica-
tion algorithms [Bishop, 2006]; e.g., from the spectrum of a human voice recording it is often trivial to distinguish male from female speakers, or to tell whether a saw blade is in need of replacement by analyzing the sound it makes while cutting.
Standard spectral density-estimation techniques such as the discrete Fourier transform (DFT) exhibit several well-known limitations. These methods are typically designed for data sampled equidistantly in time or space. Whenever this
fails to hold, typical approaches employ some interpolation technique in order to
perform spectral estimation on equidistantly sampled data. Other possibilities in-
clude employing a method suitable for nonequidistant data, such as least-squares
spectral analysis [Wells et al., 1985]. Fourier transform-based methods further suf-
fer from spectral leakage due to the assumption that all sinusoidal basis functions
are orthogonal over the data window [Puryear et al., 2012]. Least-squares spectral
estimation takes the correlation of the basis functions into account and further
allows for estimation of arbitrary/known frequencies without modification [Wells
et al., 1985].
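Least-squares spectral estimation amounts to a few lines of code. The sketch below (signal and frequencies made up) recovers the amplitudes of two known frequencies from nonequidistantly sampled data, with no interpolation required:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 10.0, 400))     # nonequidistant sample locations
y = 2.0 * np.sin(2 * np.pi * 1.5 * x) + 0.5 * np.cos(2 * np.pi * 3.2 * x)

# Arbitrary/known frequencies can be used directly; LS accounts for the
# correlation between the basis functions over the irregular data window
omegas = 2 * np.pi * np.array([1.5, 3.2])
Phi = np.hstack([np.column_stack([np.sin(w * x), np.cos(w * x)]) for w in omegas])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
amps = np.hypot(coef[0::2], coef[1::2])      # amplitude per frequency
print(amps)   # ≈ [2.0, 0.5]
```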
In some applications, the spectral content is varying with an external variable,
for instance, a controlled input. As a motivating example, we consider the torque
ripple induced by the rotation of an electrical motor. Spectral analysis of the
torque signal is made difficult by the spectrum varying with the velocity of the
motor, both because the frequency of the ripple is directly proportional to the velocity and because of the properties of an electric DC motor. A higher velocity induces higher-magnitude torque ripple, but also a stronger filtering effect due to the inertia of the rotating parts. The effect of a sampling delay on the phase of
the measured ripple is similarly proportional to the velocity.
10.2 LPV Spectral Decomposition
1 We limit our exposition to V ⊆ R for clarity, but higher dimensional scheduling spaces are possible.
which are linear in the parameters k. Identification of linear models using the least-squares procedure was described in Sec. 6.2. The model (10.1) can be written in compact form by noting that e^{iω} = cos ω + i sin ω, which will be used extensively throughout the chapter to simplify notation.2
We will now proceed to formalize the proposed spectral-decomposition
method.
Signal model
Our exposition in this chapter will make use of Gaussian basis functions. The
method is, however, not limited to this choice and extends readily to any other set
of basis functions. A discussion on different choices is held in Sec. 6.3.
We start by establishing some notation. Let k denote the Fourier-series coefficients3 of interest. The kernel activation vector φ(v_i) : (v ∈ V) ↦ R^J maps the input to a set of basis-function activations and is given by

φ(v_i) = [κ(v_i, θ_1) ⋯ κ(v_i, θ_J)]ᵀ ∈ R^J   (10.2)

κ(v, θ_j) = κ_j(v) = exp(−(v − µ)² / (2σ²))   (10.3)

ŷ_i = Σ_{ω∈Ω} Σ_{j=1}^{J} k_{ω,j} κ_j(v_i) e^{−iωx_i} = Σ_{ω∈Ω} k_ωᵀ φ(v_i) e^{−iωx_i},   k_ω ∈ C^J   (10.4)
² Note that solving the complex LS problem using complex regressors e^{iω} is not equivalent to solving
the real LS problem using sin/cos regressors.
3 We use the term Fourier-series coefficients to represent the parameters in the spectral decomposi-
tion, even if the set of basis functions are not chosen so as to constitute a true Fourier series.
4 We note at this stage that x ∈ X can be arbitrarily sampled and are not restricted to lie on an
equidistant grid, as is the case for, e.g., Fourier transform-based methods.
over V through the BFE. This formulation reduces to the standard Fourier-style
spectral model (10.5) in the case φ(v) ≡ 1:

ŷ = Σ_{ω∈Ω} k_ω e^{−iωx} = Φk   (10.5)

where Φ = [e^{−iω₁x} · · · e^{−iω_O x}]. If the number J of basis functions equals the number
of data points N , the model will exactly interpolate the signal, i.e., ŷ = y. If in
addition to J = N , the basis-function centers are placed at µ j = v j , we obtain a
Gaussian process regression interpretation where κ is the covariance function.
Owing to the numerical properties of the analytical solution of the least-squares
problem, it is often beneficial to reduce the number of parameters significantly,
so that J ≪ N. If the chosen basis functions are suitable for the signal of interest,
the error induced by this dimensionality reduction is small. In practice,
the number of RBFs to include, J, and the bandwidth Σ are usually chosen based
on evidence maximization or cross-validation [Murphy, 2012].
To facilitate estimation of the parameters in (10.4), we rewrite the model by
stacking the regressor vectors in a regressor matrix A, see Sec. 10.2, such that

A_{n,:} = vec(φ(v_n)Φᵀ)ᵀ ∈ C^{O·J},   n = 1, …, N

We further define Ã by expanding the regressor matrix into its real and imaginary
parts

Ã = [ℜA  ℑA] ∈ R^{N×2OJ}

such that routines for real-valued least-squares problems can be used. The complex coefficients are, after solving the real-valued least-squares problem,⁵ retrieved
as k = k_ℜ + i k_ℑ where

[k_ℜᵀ  k_ℑᵀ]ᵀ = arg min_{k̃} ‖Ãk̃ − y‖
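A sketch of this construction in NumPy (hypothetical test data; the reference implementation is the Julia package LPVSpectral.jl): the rows of A combine the basis-function activations with the complex exponentials, Ã stacks real and imaginary parts, and the complex coefficients are recovered as k = k_ℜ + i k_ℑ:

```python
import numpy as np

def lpv_spectral(y, x, v, omegas, centers, sigma):
    """Sketch of the LPV spectral decomposition: returns k of shape (O, J),
    where k[o, j] is the complex coefficient of frequency omegas[o] and
    Gaussian basis function j."""
    N, O, J = len(y), len(omegas), len(centers)
    phi = np.exp(-(v[:, None] - centers[None, :])**2 / (2 * sigma**2))  # (N, J)
    E = np.exp(-1j * np.outer(x, omegas))                               # (N, O)
    # Row n contains all products of activations and complex exponentials
    A = (E[:, :, None] * phi[:, None, :]).reshape(N, O * J)
    At = np.hstack([A.real, A.imag])         # A-tilde: real-valued regressors
    w, *_ = np.linalg.lstsq(At, y, rcond=None)
    k = w[:O * J] + 1j * w[O * J:]           # k = k_Re + i k_Im
    return k.reshape(O, J)

# A signal with constant amplitude 2 at omega = 3; the recovered amplitude
# function A(omega, v) = |k_omega^T phi(v)| should be close to 2 for all v.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 400))
v = x / 10
y = 2.0 * np.cos(3.0 * x)
centers, sigma = np.linspace(0, 1, 5), 0.25
k = lpv_spectral(y, x, v, [3.0], centers, sigma)
phi = np.exp(-(v[:, None] - centers[None, :])**2 / (2 * sigma**2))
amp = np.abs(phi @ k[0])
print(amp.min(), amp.max())  # both close to 2
```

The residual amplitude error here stems only from the basis-function expansion approximating a constant function.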
Figure 10.1 Gaussian (dashed) and normalized Gaussian (solid) windows as
functions of the scheduling variable v. Regular windows are shown mirrored in
the x-axis for clarity.
LEMMA 3
Let a signal y be composed of the linear combination y = k₁ cos(x) + k₂ sin(x);
then y can be written in the form

y = A cos(x − ϕ)

with

A = √(k₁² + k₂²),   ϕ = arctan(k₂/k₁)

From this we obtain the following two functions for a particular frequency ω:

A(ω) = |k_ω| = √((ℜk_ω)² + (ℑk_ω)²)   (10.7)

ϕ(ω) = arg(k_ω) = arctan(ℑk_ω/ℜk_ω)   (10.8)
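In code, with NumPy (illustrative coefficient value), the amplitude and phase are simply the modulus and argument of the complex coefficient; note that using arctan2 rather than the plain ratio ℑk/ℜk resolves the quadrant of the phase:

```python
import numpy as np

k = 3.0 + 4.0j                        # hypothetical complex coefficient k_omega
A = np.sqrt(k.real**2 + k.imag**2)    # amplitude, equivalently np.abs(k)
phi = np.arctan2(k.imag, k.real)      # phase, equivalently np.angle(k)
print(A)                              # 5.0
print(np.isclose(phi, np.angle(k)))   # True
```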
Covariance properties
We will now investigate and prove that (10.7) and (10.8) lead to asymptotically
unbiased and consistent estimates of A and ϕ, and will provide a strategy to obtain
confidence intervals. We will initially consider a special case for which analysis
is simple, whereafter we invoke the RBF universal approximation results of Park
and Sandberg (1991) to show that the estimators are well motivated for a general
class of functions. We start by considering signals of the form (10.9), for which
unbiased and consistent estimates of the parameters are readily available:
PROPOSITION 3
Let a signal y be given by

y = αᵀφ cos(ωx) + βᵀφ sin(ωx) + e   (10.9)

with φ = φ(v), and let α̂ and β̂ denote unbiased estimates of α and β. Then the
expected value of the amplitude estimate

Â(α̂, β̂) = √((α̂ᵀφ)² + (β̂ᵀφ)²)   (10.10)

is bounded as

A < E{Â} < √(A² + φᵀΣ_α φ + φᵀΣ_β φ)   (10.11)
Proof. Since α, β and e appear linearly in (10.9), unbiased and consistent estimates α̂ and β̂ are available from the least-squares procedure (see Sec. 6.2). The
expected value of Â² is given by

E{Â²} = E{(α̂ᵀφ)² + (β̂ᵀφ)²} = E{(α̂ᵀφ)²} + E{(β̂ᵀφ)²}   (10.12)

We further have

E{(α̂ᵀφ)²} = (E{α̂ᵀφ})² + V{α̂ᵀφ} = (αᵀφ)² + φᵀΣ_α φ   (10.13)

and analogously for β̂. Combining the terms yields E{Â²} = A² + φᵀΣ_α φ + φᵀΣ_β φ,
and Jensen's inequality gives E{Â} ≤ √(E{Â²}),
which provides the upper bound on the expectation of Â. The lower bound is
obtained by writing Â as

Â(k̂) = √((α̂ᵀφ)² + (β̂ᵀφ)²) = ‖k̂‖   (10.16)

and noting that, since the norm is convex,

E{Â} = E{‖k̂‖} > ‖E{k̂}‖ = ‖k‖ = A   (10.17)   □
COROLLARY 2

Â = √((α̂ᵀφ)² + (β̂ᵀφ)²)   (10.18)

is an asymptotically unbiased and consistent estimate of A.
Proof. Since the least-squares estimate, upon which the estimated quantity is
based, is unbiased and consistent, the variances in the upper bound in (10.11)
will shrink as the number of datapoints increases and both the upper and lower
bounds will become tight, hence
E{Â} → A as N → ∞   □
Analogous bounds for the phase function are harder to obtain, but the simple
estimator ϕ̂ = arg(k̂) based on k̂ obtained from the least-squares procedure is still
asymptotically consistent [Kay, 1993].
Estimates using the least-squares method (6.5) are, under the assumption
of uncorrelated Gaussian residuals of variance σ2 , associated with a posterior
parameter covariance σ2 (ATA)−1 . This will in a straightforward manner produce
confidence intervals for a future prediction of y as a linear combination of the
estimated parameters. Obtaining unbiased estimates of the confidence intervals
for the functions A(v, ω) and ϕ(v, ω) is made difficult by their nonlinear nature.
We therefore proceed to establish an approximation strategy.
The estimated parameters k̂ are distributed according to a complex-normal
distribution CN(ℜz + iℑz, Γ, C), where Γ = E{(k̂ − z)(k̂ − z)ᴴ} denotes the covariance
matrix and C = E{(k̂ − z)(k̂ − z)ᵀ} the relation matrix. For details on the CN
distribution, see, e.g., [Picinbono, 1996]. A linear combination of squared variables
distributed according to a complex normal (CN) distribution is distributed
according to a generalized χ² distribution.
Line-spectral estimation
In many applications, finding a sparse spectral estimate with only a few nonzero
frequency components is desired. Sparsity-promoting regularization can be em-
ployed when solving for the Fourier coefficients in order to achieve this. This
procedure is sometimes referred to as line-spectral estimation [Stoica and Moses,
2005] or L 1 -regularized spectral estimation. While this technique only requires
the addition of a regularization term of the form ‖k‖₁ to the cost function in (6.5),
the resulting problem no longer has a closed-form solution, necessitating an
iterative solver. Along with the standard periodogram and Welch spectral estimates,
we compare L 1 -regularized spectral estimation to the proposed approach on a
sparse estimation problem in the next section. We further incorporate group-lasso
regularization for the proposed approach. The group-lasso, described in Sec. 6.4,
amounts to adding the term

Σ_{ω∈Ω} ‖k_ω‖₂   (10.20)
to the cost function. We solve the resulting lasso and group-lasso regularized
spectral-estimation problems using the ADMM algorithm [Parikh and Boyd, 2014].
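The computational core of the ADMM iterations for the group-lasso term (10.20) is the proximal operator of the sum of group norms, which performs block soft-thresholding: each group k_ω is shrunk toward zero, and groups with small norm are set exactly to zero. A sketch with hypothetical coefficients (not the LPVSpectral.jl implementation):

```python
import numpy as np

def prox_group_lasso(k, lam):
    """Proximal operator of lam * sum_w ||k_w||_2 (block soft-thresholding).
    k: array of shape (O, J), one row per frequency group."""
    norms = np.linalg.norm(k, axis=1, keepdims=True)
    scale = np.maximum(1 - lam / np.maximum(norms, 1e-12), 0.0)
    return scale * k  # groups with ||k_w|| <= lam are set exactly to zero

k = np.array([[3.0, 4.0],    # strong group, norm 5
              [0.1, 0.1]])   # weak group, norm ~0.14
out = prox_group_lasso(k, 1.0)
print(out[1])  # weak group shrunk exactly to zero: [0. 0.]
print(out[0])  # strong group shrunk by the factor (1 - 1/5) = 0.8
```

Setting whole groups to zero is what produces the sparse spectra discussed in the next section.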
Simulated signals
To assess the qualities of the proposed spectral-decomposition method, a test
signal y_t is generated as follows:

A(4π, v) = 2v²

A(20π, v) = 2/(5v + 1)

A(100π, v) = 3e^{−10(v−0.5)²}

ϕ(ω, v) = 0.5 A(ω, v)   (10.21)
where the constants are chosen to allow for convenient visualization. The signals
y_t and v_t are visualized as functions of the sampling points x in Fig. 10.2 (right),
and the functions A and ϕ together with the resulting estimates and confidence
intervals using J = 50 basis functions are shown in Fig. 10.2 (left). The traditional
power spectral density can be calculated from the estimated coefficients as

P(ω) = |Σ_{j=1}^{J} k̂_{ω,j}|²   (10.22)
and is compared to the periodogram and Welch spectral estimates in Fig. 10.3.
This figure illustrates how the periodogram and Welch methods fail to clearly iden-
tify the frequencies present in the signal due to the dependence on the scheduling
variable v. The LPV spectral method, however, correctly identifies all three fre-
quencies present. Incorporation of L 1 regularization or group-lasso regularization
introduces a bias. The L 1 -regularized periodogram is severely biased, but manages
to identify the three frequency components present. The difference between the
LPV method and the group-lasso LPV method is small; the regularization
correctly sets all non-present frequencies to zero, but at the expense of a small
bias.
10.3 Experimental Results
Figure 10.2 Left: True and estimated functional dependencies A(ω, v), with 95%
confidence intervals, for ω = 4π, 20π and 100π. Right: Test signal y_t and scheduling signal v_t with N = 500 datapoints. The signal contains three frequencies, where
the amplitude and phase are modulated by the functions (10.21) depicted in the
left panel. For visualization purposes, v is chosen as an increasing signal. We can
clearly see how the signal y_t on the right has a higher amplitude when v ≈ 0.5
and is dominated by a low frequency when v ≈ 1, corresponding to the amplitude
functions on the left.
Figure 10.3 Estimated spectra (power vs. ω [rad/s]) for the test signal. The
periodogram and Welch methods fail to identify the frequencies present in the
signal due to the dependence on the scheduling variable v. The L₁-regularized
periodogram correctly identifies the frequency components present, but is severely
biased. The LPV spectral method correctly identifies all three frequencies present,
and the group-lasso LPV method identifies the correct spectrum with a small bias
and correctly sets all other frequencies to 0.
Measured signals
The proposed method was used to analyze measurements obtained from an ABB
dual-arm robot (Fig. 9.2). Due to torque ripple and other disturbances, there is
a velocity-dependent periodic signal present in the velocity control error, which
will serve as the subject of analysis. The analyzed signal is shown in Fig. 10.4.
The influence of Coulomb friction on the measured signal is mitigated by
limiting the support of half of the basis functions to positive velocities and vice
versa. A total of 10 basis functions was used and the model was identified
with ridge regression. The regularization parameter was chosen using the
L-curve method [Hansen, 1994]. The identified spectrum is depicted in Fig. 10.5,
where the dominant frequencies are identified. These frequencies correspond
well with a visual inspection of the data. Figure 10.5 further illustrates the result of
applying the periodogram and Welch spectral estimators to data that has been
sorted and interpolated to an equidistant grid. These methods correctly identify
the main frequency, 4 rev−1 , but fail to identify the lower-amplitude frequencies
at 7 rev−1 and 9 rev−1 visible in the signal. The amplitude functions for the three
strongest frequencies are illustrated in Fig. 10.6, where it is clear that the strongest
frequency, 4 rev−1 , has most of its power distributed over the lower-velocity data-
points, whereas the results indicate a slight contribution of frequencies at 7 rev−1
and 9 rev−1 at higher velocities, corresponding well with a visual inspection of the
signal. Figure 10.6 also displays a histogram of the velocity values of the analyzed
data. The confidence intervals are narrow for velocities present in the data, while
they become wider outside the represented velocities.
Figure 10.4 Measured signal y_t as a function of sampling location x [rad], i.e.,
motor position. The color information indicates the value of the velocity/scheduling
variable in each datapoint. Please note that this is not a plot of the measured data
sequentially in time. This figure indicates that there is a high-amplitude periodicity
of 4 rev−1 for low velocities, and slightly higher frequencies but lower-amplitude
signals at 7 rev−1 and 9 rev−1 for higher velocities.
Figure 10.5 Estimated spectra (power vs. f [rev−1]) for the measured signal. The
dominant frequencies are identified by the proposed method, while the Fourier-based
methods correctly identify the main frequency, 4 rev−1, but fail to identify the
lower-amplitude frequencies at 7 rev−1 and 9 rev−1 visible in the signal in Fig. 10.4.
Figure 10.6 Amplitude functions A(v) for the frequencies f = 1, 4, 7 and 9 rev−1,
shown together with a histogram of the number of datapoints in V.
10.4 Discussion
In this chapter, we make further use of basis-function expansions, this time in the
context of spectral estimation. The common denominator is the desire to model
a functional relationship where the function is low-dimensional and can have
an arbitrarily complicated form. The goal is to estimate how the amplitude and
phase of sinusoidal basis functions that make up a signal vary with an auxiliary
signal. Due to the phase variable entering nonlinearly, the estimation problem
is rephrased as the estimation of linear parameter-varying coefficients of sines
and cosines of varying frequency. The amplitude and phase functions are then
calculated using nonlinear transforms of the estimated coefficients. While it was
shown that the simple estimators of the amplitude and phase functions are biased,
this bias vanishes as the number of datapoints increases.
From the expression for the expected value of the amplitude function,

A < E{Â} < √(A² + φᵀΣ_α φ + φᵀΣ_β φ)   (10.23)
we see that the bias vanishes as Σα and Σβ are reduced. Further insight into this
inequality can be gained by considering the scalar, nonlinear transform x ↦ |x| of a Gaussian variable with mean μ and standard deviation σ.
If µ is several standard deviations away from zero, the nonlinear aspect of the
absolute-value function will have negligible effect. When µ/σ becomes smaller,
say less than 2, the effect starts becoming significant. Hence, if the estimated
coefficients are significantly different from zero, the bias is small. This is apparent
also from the figures indicating the estimated functional relationship with esti-
mated confidence bounds. For areas where data is sparse, the confidence bounds
become wider and the estimate of the mean seems to be inflated in these areas.
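The inflation of the mean and the bound (10.23) are easy to verify numerically. In a scalar toy setting (hypothetical numbers) where αᵀφ and βᵀφ are replaced by Gaussian estimates with standard deviation σ, the Monte-Carlo mean of Â lands strictly between A and the upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, s = 1.0, 0.5, 0.3              # true alpha'phi, beta'phi and their std dev
A = np.hypot(a, b)                   # true amplitude
ahat = a + s * rng.standard_normal(200000)
bhat = b + s * rng.standard_normal(200000)
Ahat = np.hypot(ahat, bhat)          # biased amplitude estimate
upper = np.sqrt(A**2 + 2 * s**2)     # sqrt(A^2 + phi'Sa phi + phi'Sb phi)
print(A < Ahat.mean() < upper)       # True: the mean is inflated but bounded
```

Shrinking s toward zero makes both bounds tight, mirroring the consistency argument above.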
The leakage present in the standard Fourier-based methods is usually unde-
sired. The absence of leakage might, however, be problematic when the number of
estimated frequencies is low, and the analyzed signal contains a very well defined
frequency. If this frequency is not included in the set of basis frequencies, the ab-
sence of leakage might lead to this component being left unnoticed. Introduction
of leakage is technically straightforward, but best practices for doing so remain to
be investigated.
10.5 Conclusions
leakage and allows for estimation of arbitrarily chosen frequencies. The closed-form
calculation of the spectrum requires O(J³O³) operations due to the matrix inversion
associated with solving the LS problem, which is the main drawback
of the method if the number of frequencies to estimate is large (the product JO
greater than a few thousand). For larger problems, an iterative solution method
must be employed. Iterative solution further enables the addition of a group-lasso
regularization, which was shown to introduce sparsity in the estimated spectrum.
Implementations of all methods and examples discussed in this chapter are
made available in [LPVSpectral.jl, B.C., 2016].
Appendix A. Proofs
Proof of Lemma 3. The amplitude A is given by two trigonometric identities, and
the phase by

arctan(k₂/k₁) = arctan(A sin ϕ/(A cos ϕ)) = ϕ   □

Proof of Proposition 4. Let vᵀ = [xᵀ yᵀ]. The mean and variance of ṽ = Lv are given by

E{ṽ} = L E{v} = 0

E{ṽṽᵀ} = E{Lvvᵀ Lᵀ} = L I Lᵀ = Σ   □
11 Model-Based Reinforcement Learning
In this chapter, we will briefly demonstrate how the models developed in Part
I of the thesis can be utilized for reinforcement-learning purposes. As alluded
to in Sec. 5.2, there are many approaches to RL, some of which learn a value
function, some a policy directly and some that learn the dynamics of the system.
While value-function-based methods are very general and can be applied to a
very wide range of problems without making many assumptions, they are highly
data inefficient and require a vast amount of interaction with the environment in
order to learn even simple policies. Model-based methods trade off some of the
generality by making more assumptions, but in return offer greater data efficiency.
In line with the general topic of this thesis, we will demonstrate the use of the LTV
models and identification algorithms from Chap. 7 together with the black-box
models from Chap. 8 in an example of model-based reinforcement learning. The
learning algorithm used is derived from [Levine and Koltun, 2013], where it was
shown to have impressive data-efficiency when applied to physical systems. In this
chapter, we show how the data-efficiency can be improved further by estimating
the dynamical models using the methods from Chap. 7.
with a quadratic. The solution will, due to the approximations, not be exact, and
the procedure must thus be iterated, where new approximations are obtained
around the previous solution trajectory. This algorithm can be used for model-
based, trajectory centric reinforcement learning by, after each optimization pass,
performing a rollout on the real system and updating the system model using
the new data. We will begin with an outline of the algorithm [Todorov and Li,
2005] and then provide a detailed description on how we use it for model-based
reinforcement learning.
The algorithm
The standard LQR algorithm uses dynamic programming to find the optimal linear,
time-varying feedback gain along a trajectory of an, in general, linear time-varying
system. To extend this to the nonlinear, non-Gaussian setting, we linearize the
system and cost function c along a trajectory τi = {x̂ t , û t }Tt=1 , where i denotes the
learning iteration starting at 1 and s denotes the superstate [x T uT]T:
x + = Ax + Bu (11.1)
1
c(x, u) = s Tc s + s Tc ss s (11.2)
2
where subscripts denote partial derivatives. Using the linear-quadratic model
(11.1), the optimal state and state-action value functions V and Q can be calculated recursively, starting at t = T, using the expressions

Q_ss = c_ss + f̂_sᵀ V⁺_xx f̂_s   (11.3)

Q_s = c_s + f̂_sᵀ V⁺_x   (11.4)

V_xx = Q_xx − Q_uxᵀ Q_uu⁻¹ Q_ux   (11.5)

V_x = Q_x − Q_uxᵀ Q_uu⁻¹ Q_u   (11.6)

where all quantities are given at time t and ·⁺ denotes a quantity at t + 1. Subscripts,
as in Q_s and Q_ux, denote partial derivatives, ∇_s Q and ∇_ux Q, respectively. The
calculation of value functions is started at time t = T with V_{T+1} = 0. Given the
calculated value functions, the optimal update to the control-signal trajectory, k,
and feedback gain, K, are given by

K = −Q_uu⁻¹ Q_ux   (11.7)

k = −Q_uu⁻¹ Q_u   (11.8)
These are used to update the nominal trajectory τ as
u_{i+1} = u_i + k_i   (11.9)

C(x) = K x̄   (11.10)
where x̄ = x − x̂ denotes deviations from the nominal trajectory. The update of
the state trajectory in τ is done by forward simulation of the system using u i +1 as
input. In the reinforcement-learning setting, this corresponds to performing an
experiment on the real system.
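The backward pass (11.3)-(11.8) can be sketched compactly in Python (a hypothetical double-integrator example; constant Jacobians A, B and cost derivatives are used only for brevity, the recursion is identical with time-varying A_t, B_t):

```python
import numpy as np

def backward_pass(A, B, cs, css, T):
    """Backward pass of (iterative) LQR per (11.3)-(11.8).
    A (n,n), B (n,m): dynamics Jacobian f_s = [A B];
    cs (n+m,), css (n+m,n+m): cost gradient and Hessian in s = [x; u].
    Returns feedback gains K_t and feedforward updates k_t."""
    n, m = B.shape
    fs = np.hstack([A, B])                      # (n, n+m)
    Vx, Vxx = np.zeros(n), np.zeros((n, n))     # V_{T+1} = 0
    Ks, ks = [], []
    for _ in range(T):
        Qss = css + fs.T @ Vxx @ fs             # (11.3)
        Qs = cs + fs.T @ Vx                     # (11.4)
        Qxx, Qux, Quu = Qss[:n, :n], Qss[n:, :n], Qss[n:, n:]
        Qx, Qu = Qs[:n], Qs[n:]
        K = -np.linalg.solve(Quu, Qux)          # (11.7)
        k = -np.linalg.solve(Quu, Qu)           # (11.8)
        Vxx = Qxx - Qux.T @ np.linalg.solve(Quu, Qux)  # (11.5)
        Vx = Qx - Qux.T @ np.linalg.solve(Quu, Qu)     # (11.6)
        Ks.append(K); ks.append(k)
    return Ks[::-1], ks[::-1]

# Double integrator with quadratic cost (cs = 0, so c = 0.5 s'css s)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
cs = np.zeros(3)
css = np.diag([1.0, 1.0, 0.1])
Ks, ks = backward_pass(A, B, cs, css, 50)
print(np.abs(np.linalg.eigvals(A + B @ Ks[0])).max() < 1.0)  # stable: True
```

Along a nominal trajectory of a nonlinear system, the constant A, B would simply be replaced by the linearizations at each time step.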
11.1 Iterative LQR—Differential Dynamic Programming
Stochastic controller
For learning purposes, it might be beneficial to let the controller be stochastic in
order to more efficiently explore the statespace around the current trajectory. A
Gaussian controller
p(u|x) = N(K x̄, Σ)   (11.11)

can be calculated with the above framework, where the choice Σ = Q_uu⁻¹ was shown
by Levine and Koltun (2013) to optimize a maximum-entropy cost function.
Exploring with Gaussian noise with covariance matrix Q_uu⁻¹ makes intuitive
sense. It allows us to add large noise in the directions where the value function does
not increase much. Directions where the value function seems to increase sharply,
however, should not be explored as much as this would lead the optimization to
high-cost areas of the state-space.
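A small numerical illustration (hypothetical Q_uu) of why Σ = Q_uu⁻¹ is an intuitive exploration covariance: directions in which Q_uu has large curvature, i.e., where the cost rises sharply, receive little noise, while cheap directions are explored aggressively:

```python
import numpy as np

rng = np.random.default_rng(0)
Quu = np.array([[10.0, 0.0],   # high curvature: cost rises sharply here
                [0.0, 0.1]])   # low curvature: cheap to explore
K = np.eye(2)                  # hypothetical feedback gain
xbar = np.array([0.2, -0.1])   # deviation from the nominal trajectory

Sigma = np.linalg.inv(Quu)                  # exploration covariance
L = np.linalg.cholesky(Sigma)
u = K @ xbar + L @ rng.standard_normal(2)   # one sample u ~ N(K xbar, Quu^-1)

samples = K @ xbar + (L @ rng.standard_normal((2, 10000))).T
print(samples.std(axis=0))  # roughly [0.32, 3.16]: little noise where curvature is high
```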
The term Σ f in (11.13) corresponds to the uncertainty in the model of f along
the trajectory. If f is modeled as an LTV model and estimated using the methods
in Chap. 7, this term can be derived from quantities calculated by the Kalman
smoothing algorithm, making these identification methods a natural fit to the
current setting.
The trajectory distribution is kept close to that of the previous iteration through
the constraint

D_KL(p(τ) ‖ p̂(τ)) ≤ ε   (11.14)

where

p(τ) = p(x₀) ∏_t p(x_{t+1}|x_t, u_t) p(u_t|x_t)   (11.15)
and p̂(τ) denotes the previous trajectory distribution. The addition of (11.14)
allows the resulting constrained optimization problem to be solved efficiently
using a small modification to the algorithm, reminiscent of the Lagrangian method.
The simplicity of the resulting problem is a product of the linear-Gaussian nature
of the models involved.
In the third step of Algorithm 7, we employ the iLQR algorithm with a bound on the
KL divergence between two consecutive trajectory distributions. To incorporate
control-signal bounds, we employ a variant of iLQR due to Tassa et al. (2014). Our
implementation of iLQR, allowing control-signal constraints and constraints on the
KL divergence between trajectory distributions, is made available at
[DifferentialDynamicProgramming.jl, B.C., 2016].
We compare three different models: the ground-truth system model, an LTV
model obtained by solving (7.5) using the Kalman smoothing algorithm, and an
LTI model. The total cost over T = 400 time steps is shown as a function of learning
iteration in Fig. 11.1. The figure illustrates how the learning procedure reaches
the optimal cost of the ground-truth model when an LTV model is used, whereas
when using an LTI model, the learning diverges. The figure further illustrates that
if the LTV model is fit using a prior (Sec. 7.5), the learning speed is increased. We
obtain the prior in two different ways. In one experiment, we fit a neural network
model on the form x + = g (x, u) + x with the same structure and parameters as
in Sec. 8.6, with tangent-space regularization. After every rollout, the black-box
11.2 Example—Reinforcement Learning
Figure 11.1 Reinforcement learning example. Three different model types are
used to iteratively optimize the trajectory of a pendulum on a cart. Colored bands
show standard deviation around the median. Linear expansions of the dynamics
are unstable in the upward position and stable in the downward position. The
algorithm thus fails with an LTI model. A standard LTV model learns the task well
and reaches the optimal cost in approximately 15 learning iterations. Building a
successive DNN model for use as prior to the LTV model improves convergence
slightly. Using a DNN model from a priori identification as prior improves conver-
gence at the onset of training when the amount of information obtained about
the system is low. Interestingly, the figure highlights how use of this model ham-
pers performance near convergence. This illustrates that successively updating
the DNN model with new data along the current trajectory improves the model in
the areas of the state space important to the task. The optimal control algorithm
queried the true system model a total of 52 times.
model is refit using all available data up to that point.1 No uncertainty estimate is
available from such a model, and we thus heuristically decay the covariance of
the prior obtained from this model as the inverse of the learning iteration. This
prior information is easily incorporated into the learning of the LTV model using
the method outlined in Sec. 7.5, and the precision of the prior acts as a
hyperparameter weighing the LTV model against the neural network model. In the
second experiment with priors, we pretrain a neural network model of the system
in a standard system-identification fashion, and use this model as a prior without
updating it with new data during the learning iterations. The figure indicates
that use of a successively updated prior model is beneficial to convergence, but
1 Retraining in every iteration is computationally costly, but ensures that the model does not con-
verge to a minimum only suitable for the first seen trajectories.
Figure 11.2 Optimized trajectories of the pendulum system after learning, using
three different methods (legend: iLQR with true model, LTV, LTV successive prior);
the panels show θ, θ̇ and the control signal u. The state consists of angle θ, angular
velocity θ̇, position p and velocity v. DDP denotes the optimal control procedure
with access to the true system model. The noise in the control signal trajectories
of the RL algorithms is added for exploration purposes.
it takes a few trajectories of data before the difference is noticeable. If the prior
model is pretrained the benefit is immediate, but failure to update the prior model
hampers convergence towards the end of learning. This behavior indicates that
it was beneficial to update the model with data following the state-visitation
distribution of the policy under optimization, as this focuses the attention of the
neural network, allowing it to accurately represent the system in relevant areas of
the state space. In taking this experiment further, a Bayesian prior model, allowing
calculation of the true posterior over the Jacobian for use in (7.16), would be
beneficial. Example trajectories found by the learning procedure using different
models are illustrated in Fig. 11.2. Due to the constraints on the control signal, all
algorithms converge to similar policies of bang-bang type.
Although a simple example, Fig. 11.1 illustrates how an appropriate choice of
model class can allow a trajectory-based reinforcement learning problem to be
solved using very few experiments on the real process. We emphasize that one
iteration of Algorithm 7 corresponds to one experiment on the process. Levine
and Koltun (2013), from which inspiration for this approach was drawn, require a
handful of experiments on the process per learning iteration and policy update,
just to fit an adequate dynamics model.
Part II

Chapter 12. Introduction—Friction Stir Welding

∆q = C_j(τ)   (12.1)

∆X = C_C(f)   (12.2)

where τ and f are the joint torques and external forces, respectively, X is some
notion of Cartesian pose, and C denotes some, possibly nonlinear, compliance
function. The corresponding inverse relations are typically referred to as stiffness
models. Robotic compliance modeling has been investigated by many authors,
where the most straightforward approach is based on linear models obtained by
measuring the deflections under application of known external loads. To avoid the
dependence on expensive equipment capable of accurately measuring the deflections,
techniques such as the clamping method have been proposed [Bennett et al.,
1992; Lehmann et al., 2013; Sörnmo, 2015; Olofsson, 2015] for the identification of
models of the form (12.1). This approach makes the assumption that deflections
only occur in the joints, in the direction of movement. Hence, deflections occur-
ring in the links or in the joints orthogonally to the movement cause model errors,
limiting the resulting accuracy of the model obtained [Sörnmo, 2015]. In [Guillo
and Dubourg, 2016], the use of arm-side encoders was investigated to allow for
direct measurement of the joint deflections. As of today, arm-side encoders are
not available in the vast majority of robots, and the modification required to install
them is yet another obstacle to the adoption of robotic FSW. The method further
suffers from the lack of modeling of link- and orthogonal joint deflections.
Cartesian models like (12.2) have been investigated in the FSW context by [De
Backer, 2014; Guillo and Dubourg, 2016; Abele et al., 2008]. The proposed Cartesian
deflection models are local in nature and not valid globally. This requires separate
models to be estimated throughout the workspace, which is time consuming and
limits the flexibility of the setup.
Although the use of compliance models leads to a reduction of the uncertainty
introduced by external forces, it is difficult to obtain compliance models accurate
enough throughout the entire workspace. This fact serves as the motivation for
complementing the compliance modeling with sensor-based feedback. Sensor-
based feedback is standard in conventional robotic arc and spot welding, where
the crucial task of the feedback action is to align the weld torch with the seam
along the transversal direction, with the major uncertainty being the placement
of the work pieces. During FSW, however, the uncertainties in the robot pose are
significant, while the tilt angle of the tool in addition to its position is of great
importance [De Backer et al., 2012]. This requires a state estimator capable of
estimating accurately at least four DOF, with slightly lower performance required
in the tool rotation axis and the translation along the weld seam. Conventional
seam-tracking sensors are capable of measuring 1-3 DOF only [Nayak and Ray,
2013; Gao et al., 2012], limiting the information available to a state estimator and
thus maintaining the need for, e.g., compliance modeling.
Motivated by the concerns raised above, we embark on developing calibra-
tion methods, a state estimator and a framework for simulation of robotic seam
tracking under the influence of large external process forces. Chapter 13 develops
methods for calibration of 6 DOF force/torque sensors and a seam-tracking laser
sensor, while a particle-filter based state-estimator and simulation framework is
developed in Chap. 14. Notation, coordinate frames and variables used in this part
of the thesis are listed in Table 12.1.
12.1 Kinematics
Kinematics refers to the motion of objects without concern for the forces involved
(as opposed to dynamics). This section will briefly introduce concepts within
kinematics that will become important in subsequent chapters, together with the
necessary notation. Most of the discussion will focus on the representation and
manipulation of coordinate frames.
13 Calibration
The field of robotics offers a wide range of calibration problems, the solutions to
which are oftentimes crucial for the accuracy or success of a robotic application.
For our purposes, we will loosely define calibration as the act of finding a trans-
formation of a measurement from the space it is measured in, to another space
in which it is more useful to us. For example, a sensor measuring a quantity in its
intrinsic coordinate system must be calibrated for us to know how to interpret the
measurements in an external coordinate system, such as that of the robot or its
tool. This chapter will describe a number of calibration algorithms developed in
order to make use of various sensors in the seam-tracking application, but their
usefulness outside this application will also be highlighted.
The calibration problems considered in this chapter are all geometric in nature
and related to the kinematics of a mechanical system. In kinematic calibration,
we estimate the kinematic parameters of a structure. Oftentimes, and in all cases
considered here, we have a well-defined kinematic equation describing the kinematic
structure, but this equation contains unknown parameters. To calibrate those
parameters, one has to devise a method to collect data that allow us to solve for
the parameters of interest. Apart from solving the equations given the collected
data, the data collection itself is what sets different calibration methods apart
from each other, and from system identification in general. Many methods require
special-purpose equipment, either for precise perturbations of the system, or for
precise measurements of the system. Use of special-purpose equipment limits the
availability of a calibration method. We therefore propose calibration algorithms
that solve for the desired kinematic parameters without any special-purpose
equipment, making them widely applicable.
13.1 Force/Torque Sensor Calibration
In order to make use of a force/torque sensor, the rotation matrix R^S_TF between
the tool flange and the sensor coordinate systems, the mass m held by the force
sensor at rest, and the translational vector r ∈ R3 from the sensor origin to the
center of mass are required. Methods from the literature typically involve fixing
the force/torque sensor in a jig and applying known forces/torques to the sensor
[Song et al., 2007; Chen et al., 2015]. In the following, we will develop and analyze
a calibration method that only requires movement of the sensor attached to the
tool flange in free air, making it very simple to use.
Method
The relevant force and torque equations are given by
where g is the gravity vector given in the robot base frame and f, τ are the force
and torque measurements, respectively, such that f̄ = [f^T τ^T]^T. At first glance, this
is a hard problem to solve. The equation for the force relation does not appear to
allow us to solve for both m and R^S_TF, the constraint R ∈ SO(3) is difficult to handle,
and the equation for the torque contains the nonlinear term R^S_TF〈r〉. Fortunately,
however, the problem can be separated into two steps, and the constraint R^S_TF ∈
SO(3) will allow us to distinguish R^S_TF from m.
A naive approach to the stated calibration problem is to formulate an optimiza-
tion problem where R is parameterized using, e.g., Euler angles. A benefit of this
approach is its automatic and implicit handling of the constraint R ∈ SO(3). One
then minimizes the residuals of (13.1) and (13.2) with respect to all parameters us-
ing a local optimization method. This approach is, however, prone to convergence
to local optima and is hard to conduct in practice.
Instead, we start by noting that multiplying a matrix with a scalar only affects
its singular values, but not its singular vectors, mR = U (mS)V T. Thus, if we solve a
convex relaxation to the problem and estimate the product mR, we can recover
R by projecting mR onto SO(3) using the procedure in Sec. 6.1. Given R we can
easily recover m. Equation (13.1) is linear in mR and the minimization step can
readily be conducted using the least-squares approach of Sec. 6.2. To facilitate this
estimation, we write (13.1) in the equivalent form
which is a linear equation in the parameters s that can be solved using the standard
least-squares procedure. The least-squares solution to this problem was, however,
found during experiments to be very sensitive to measurement noise in f_S. This is
due to the fact that f_S appears not only in the dependent variable on the right-hand
side, but also in the regressor 〈f_S + R^TF_RB(mg)〉. This is thus an errors-in-variables
problem for which the solution is given by the total least-squares procedure [Golub
and Van Loan, 2012], which we make use of in the following evaluation.
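The separation argued for above can be illustrated with a short numpy sketch (the function name and interface are mine, not those of [Robotlib.jl, B.C., 2015]; `M_hat` denotes a least-squares estimate of the product mR):

```python
import numpy as np

def recover_R_and_m(M_hat):
    """Split an estimate of the product m*R into a rotation matrix in SO(3)
    and a positive scalar mass.

    Since m*R = U(mS)V^T, the scalar only scales the singular values, so
    projecting onto SO(3) recovers R and the singular values recover m."""
    U, S, Vt = np.linalg.svd(M_hat)
    R = U @ Vt
    if np.linalg.det(R) < 0:           # guard against a reflection
        U[:, -1] *= -1
        R = U @ Vt
    m = np.mean(S)                     # equal to m for noise-free data
    return R, m
```

For noisy data the singular values are no longer identical, and their mean serves as a simple estimate of m.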
Numerical evaluation
The two algorithms, the relaxation-based and the Cayley-transform based, were
compared on the problem of finding R^S_TF by simulating a force-calibration scenario
where a random R^S_TF and 100 random poses R^TF_RB were generated. In one simulation,
we let the first algorithm find R^S_TF and m with an error in the initial estimate of m
by a factor of 2, while the Cayley algorithm was given the correct mass. The results,
depicted in the left panel of Fig. 13.1, indicate that the two methods performed on
par with each other. The figure shows the error in the estimated rotation matrix as a
function of the added measurement noise in f S . In a second experiment, depicted
in the right panel of Fig. 13.1, we started both algorithms with a mass estimate
with 10 % error. Consequently, the Cayley algorithm performed significantly worse
Figure 13.1 The error in the estimated rotation matrix is shown as a function
of the added measurement noise for two force-calibration methods, relaxation
based and Cayley-transform based. On the left, the relaxation-based method was
started with an initial mass estimate m 0 = 2m whereas the Cayley-transform based
method was given the correct mass. On the right, both algorithms were given
m 0 = 1.1m
for low noise levels, while the difference was negligible when the measurement
noise was large enough to dominate the final result.
The experiment showed that not only does the relaxation-based method per-
form on par with the unconstrained Cayley-transform based method, it also allows
us to estimate the mass, reducing the room for potential errors due to an error in
the mass estimate. It is thus safe to conclude that the relaxation-based algorithm
is superior to the Cayley algorithm in all situations. Implementations of both
algorithms are provided in [Robotlib.jl, B.C., 2015].
13.2 Laser-Scanner Calibration
Laser scanners have been widely used for many years in the field of robotics.
A large group of laser scanners, such as 2D laser range finders and laser stripe
profilers, provide accurate distance measurements confined to a plane. By moving
either the scanner or the scanned object, a 2D laser scanner can be used to build a
3D representation of an object or the environment. To this purpose, laser scanners
are commonly mounted on mobile platforms, aerial drones or robots.
This section considers the calibration of such a sensor, and as a motivating
example, we consider a wrist-mounted laser scanner for robotic 3D scanning
and weld-seam-tracking applications. The method does, however, work equally
well in any situation where a similar sensor is mounted on a drone or mobile
platform, as long as the location of the platform is known in relation to a fixed
coordinate system. We will use the term robot to refer to any such system, and the
robot coordinate system to refer to either the robot base coordinate system, or the
coordinate system of a tracking device measuring the location of the tool flange.
To relate the measurements of the scanner to the robot coordinate system, the
rigid transformation between the scanner coordinate system and the tool flange
Figure 13.2 ABB IRB140 used for experimental verification. The sensor (blue)
is mounted on the wrist and the plane of the laser light intersects a white flat
surface.
kinematic model of the robot, and convergence results are therefore only presented
for initial guesses very close to their true values (0.01mm/0.01°). Initial estimates
this accurate are very hard to obtain unless a very precise CAD model of all parts
involved is available.
Zhang and Pless (2004) found the transformation between a camera and a
laser range finder using a checkerboard pattern and used computer vision to esti-
mate the location of the calibration planes. With the equations of the calibration
planes known, the desired transformation matrix was obtained from a set of linear
equations.
Planar constraints have also been considered by Ikits and Hollerbach (1997)
who employed a non-linear optimization technique to estimate the kinematic
parameters. The method requires careful definition of the planes and cannot
handle arbitrary frame assignments.
A wrist-mounted sensor can be seen as an extension of the kinematic chain of
the robot. Initial guesses can be poor, especially if based on visual estimates. This
section presents a method based solely on solving linear sets of equations. The
method accepts a very crude initial estimate of the desired kinematic parameters,
which is refined in an iterative procedure. The placement of the calibration planes
is assumed unknown, and their locations are found automatically together with
the desired transformation matrix. The exposition will focus on sensors
measuring distances in a plane, but extensions of the proposed method to 3D
laser scanners, such as LIDARs, and to 1D point lasers will be provided towards
the end.
Preliminaries
Throughout this section, the kinematic notation presented in Sec. 12.1 will be
used. The normal of a plane from which measurement point i is taken, given in
frame A, will be denoted n_i^A.
A plane is completely specified by
n^T p = d,   ‖n‖_2 = 1    (13.12)

where d is the orthogonal distance from the origin to the plane, n is the plane
normal, and p is any point on the plane.
Laser-scanner characteristics
The laser scanner consists of an optical sensor and a laser source emitting light in a
plane that intersects a physical plane in a line, see Fig. 13.2. The three dimensional
location of a point along the projected laser line may be calculated by triangula-
tion, based on the known geometry between the camera and the laser emitter. A
single measurement from the laser scanner typically yields the coordinates of a
large number of points in the laser plane. Alternatively, a measurement consists of
a single point and the angle of the surface, which is easily converted to two points
by sampling a second point on the line through the given point. Comments on
statistical considerations for this sampling are provided in Sec. 13.A.
Method
The objective of the calibration is to find the transformation matrix T^S_TF ∈ SE(3)
that relates the measurements of the laser scanner to the coordinate frame of the
tool flange of the robot.
The kinematic chain of a robot will here consist of the transformation between
the robot base frame and the tool flange, T^TF_RB_i, given by the forward kinematics¹ in
pose i, and the transformation between the tool flange and the sensor, T^S_TF. The
sensor, in turn, projects laser light onto a plane with unknown equation. A point
observed by the sensor can be translated to the robot base frame by

p_RB_i = T^TF_RB_i T^S_TF p_S_i    (13.13)

where i denotes the index of the pose.
To find T^S_TF, an iterative two-step method is proposed, which starts with an
initial guess of the matrix. In each iteration, the equations for the planes are found
using eigendecomposition, whereafter a set of linear equations is solved for an
improved estimate of the desired transformation matrix. The scheme, illustrated
in Fig. 13.3, is iterated until convergence.
Finding the calibration planes Consider initially a set of measurements, P_S =
[p_1, ..., p_{N_p}]_S, gathered from a single plane. The normal can be found by Principal
Component Analysis (PCA), which amounts to performing an eigendecomposition
of the covariance matrix C of the points [Pearson, 1901]. The eigenvector
corresponding to the smallest eigenvalue of C will be the desired estimate of the
plane normal.² To this purpose, all points are transformed to a common frame,
the robot base frame, using (13.13) and the current estimate of T^S_TF.
To fully specify the plane equation, the center of mass µ of P_RB is calculated.
The distance d to the plane is then calculated as the length of the projection of
the vector µ onto the plane normal

d = ‖n̄(n̄^T µ)‖    (13.14)

where n̄ is a unit-length normal given by PCA. This distance can be encoded
into the normal by letting ‖n‖ = d. The normal is then simply found by

n = n̄(n̄^T µ)    (13.15)
1 Or alternatively, given by an external tracking system.
2 This eigenvalue will correspond to the mean squared distance from the points to the plane.
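The plane-estimation step, (13.14)-(13.15), can be sketched in a few lines of numpy (a hypothetical helper of my own, not the thesis implementation in Robotlib.jl):

```python
import numpy as np

def fit_plane(P):
    """Fit a plane to the points P (N x 3) by PCA.

    Returns the plane normal n with the orthogonal distance d encoded
    as its length, i.e. ||n|| = d."""
    mu = P.mean(axis=0)
    C = np.cov(P.T)                       # covariance matrix of the points
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    n_bar = eigvecs[:, 0]                 # eigenvector of smallest eigenvalue
    return n_bar * (n_bar @ mu)           # n = n_bar (n_bar^T mu)
```

Note that the result is invariant to the sign ambiguity of the eigenvector, since n̄ appears twice in the product.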
This procedure is repeated for all measured calibration planes and results in a
set of normals that will be used to find the optimal T^S_TF.
Solving for T^S_TF All measured points should fulfill the equation for the plane they
were obtained from. This means that for a single point p, the following must hold

n̄^T p = d  ⟺  n^T p = ‖n‖²    (13.16)

A measurement point obtained from the sensor in the considered setup should
thus fulfill the following set of linear equations

p_RB_i = T^TF_RB_i T^S_TF p_S_i    (13.17)
n^T p_RB_i = ‖n‖²    (13.18)
p_S_i = [p_S_i^T  1]^T = [x_S_i  y_S_i  z_S_i  1]^T    (13.19)
Since (13.17) and (13.18) are linear in the parameters, all elements of T^S_TF can be
extracted into k, and A_i can be obtained by performing the matrix multiplications
in (13.17) and (13.18) and identifying terms containing any of the elements of k.
Terms which do not include any parameter to be identified are associated with q_i.
The final expressions for A_i and q_i given above can then be obtained by identifying
matrix-multiplication structures among the elements of A_i and q_i.
Equation (13.20) does not have a unique solution. A set of at least nine points
gathered from at least three planes is required in order to obtain a unique solution
for the vector k. This can be obtained by stacking the entries in (13.20) according
to

Ak = Y,   A = [A_1^T  A_2^T  ⋯  A_{N_p}^T]^T,   Y = [‖n_1‖² − q_1   ‖n_2‖² − q_2   ⋯   ‖n_{N_p}‖² − q_{N_p}]^T    (13.23)
The resulting problem is linear in the parameters, and the vector k* of parameters
that minimizes

k* = arg min_k ‖Ak − Y‖²    (13.24)
can then be obtained from the least-squares procedure of Sec. 6.2. We make a note
at this point that while solving (13.24) is the most straightforward way of obtaining
an estimate, the problem contains errors in both A and Y and is therefore of errors-
in-variables type and a candidate for the total least-squares procedure. We discuss
the two solution methods further in Sec. 13.B.
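For reference, the (unweighted) total least-squares solution follows from an SVD of the augmented matrix [A Y]; below is a generic sketch of this standard construction (the function name is mine):

```python
import numpy as np

def tls(A, y):
    """Total least-squares solution of A k ≈ y with errors in both A and y.

    The solution is read off from the right singular vector associated
    with the smallest singular value of the stacked matrix [A y]."""
    n = A.shape[1]
    _, _, Vt = np.linalg.svd(np.column_stack([A, y]))
    v = Vt[-1]            # right singular vector of the smallest singular value
    return -v[:n] / v[n]
```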
Since k only contains the first two columns of R^S_TF, the third column is formed
as

R_3 = R_1 × R_2    (13.25)

where × denotes the cross product between R_1 and R_2, which produces a vector
orthogonal to both R_1 and R_2. When solving (13.24) we are actually solving a
problem with the same relaxation as the one employed in Sec. 13.1, since the
constraints R_1^T R_2 = 0, ‖R_1‖ = ‖R_2‖ = 1 are not enforced. The resulting R^S_TF will
thus in general not belong to SO(3). The closest valid rotation matrix can be found
by projection onto SO(3) according to the procedure outlined in Sec. 6.1.
This projection will change the corresponding entries in k* and the resulting
coefficients will no longer solve the problem (13.24). A second optimization
problem can thus be formed to re-estimate the translational part of k, given the
orthogonalized rotational part. Let k be decomposed according to

k = [R̃*  p],   R̃* ∈ R^{1×6},   p ∈ R^{1×3}    (13.26)

and denote by A_{n:k} columns n to k of A. The optimal translational vector, given
the orthonormal rotation matrix, is found by solving the following optimization
problem

Ỹ = Y − A_{1:6} R̃*    (13.27)
p* = arg min_p ‖A_{7:9} p − Ỹ‖_2    (13.28)
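Assuming A and Y have been stacked as in (13.23) and the six rotational entries have been orthogonalized, the re-estimation step (13.27)-(13.28) amounts to a second least-squares solve; a numpy sketch:

```python
import numpy as np

def reestimate_translation(A, Y, R_tilde):
    """Re-estimate the translational part of k after the rotational part
    has been projected onto SO(3).

    A       : (N x 9) stacked regressor matrix
    Y       : (N,)    stacked dependent variable
    R_tilde : (6,)    orthogonalized entries of the two rotation columns"""
    Y_tilde = Y - A[:, :6] @ R_tilde              # remove rotational contribution
    p, *_ = np.linalg.lstsq(A[:, 6:9], Y_tilde, rcond=None)
    return p
```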
Figure 13.4 Convergence results for simulated data during 100 realizations, each
with 10 poses sampled from each of 3 planes. Measurement noise level σ = 0.5mm
is marked with a dashed line. On the left, the Frobenius norm between the true
and estimated matrices; on the right, the RMS distance between measurement
points and the estimated plane.
Results
The performance of the method was initially studied in simulations, which allows
for a comparison between the obtained estimate and the ground truth. The sim-
ulation study is followed by experimental verification using a real laser scanner
mounted on the wrist of an industrial manipulator.
Figure 13.5 Errors in T^S_TF before and after calibration for 100 realizations with 30
calibration iterations. For each realization, 10 poses were sampled from each of 3
planes. The measurement noise level σ = 0.5mm is marked with a dashed line. On
the left, the translational error between the true matrix and the estimated, on the
right, the rotational error.
Figure 13.6 A visualization of the reconstructed planes used for data collection.
The planes were placed so as to be close to orthogonal to each other, surrounding
the robot. Figure 13.2 presents a photo of the setup.
Figure 13.7 Convergence results for experimental data gathered from 5 planes.
The RMS distance between measurement points and the estimated planes is
shown together with the mean over all planes.
The translational part of the initial guess was obtained by estimating the distance
from the tool flange to the origin of the laser scanner, whereas the rotation matrix
was obtained by estimating the projection of the coordinate axes of the scanner
onto the axes of the tool flange.3
The convergence behavior, illustrated in Fig. 13.7, is similar to that in the
simulation, and the final error was on the same level as the noise in the sensor
data. A histogram of the final errors is shown in Fig. 13.8, indicating a symmetric
but heavy-tailed distribution. We remark that if a figure like Fig. 13.8 indicates the
presence of outliers or errors with a highly non-Gaussian distribution, one could
consider alternative distance metrics in (13.24), such as the L_1 norm, for more
robust performance.
3 The fact that the initial estimate of the rotation matrix was the identity matrix is a coincidence.
Figure 13.8 Histogram of errors. Error [m], horizontal scale ·10⁻³.
Discussion
The calibration method described is highly practically motivated. Calibration is
often tedious and an annoyance to the operator of any technical equipment. The
method described tries to mitigate this problem by making use of data that is easy
to acquire. In its simplest form, the algorithm requires some minor bookkeeping in
order to associate the recorded points with the plane they were collected from. An
extension to the algorithm that would make it even more accessible is automatic
identification of planes using a RANSAC [Fischler and Bolles, 1981] or clustering
procedure.
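Such an automatic identification of planes could be sketched as follows (a hypothetical RANSAC variant; the threshold and iteration count are illustrative, not values from the thesis):

```python
import numpy as np

def ransac_plane(P, iters=200, tol=1e-3, rng=None):
    """Return a boolean mask of the largest planar subset of P (N x 3),
    found by repeatedly fitting planes to random point triplets."""
    if rng is None:
        rng = np.random.default_rng(0)
    best = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = P[rng.choice(len(P), size=3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-12:
            continue                          # degenerate (collinear) triplet
        dist = np.abs((P - p0) @ (n / norm))  # orthogonal distances to plane
        inliers = dist < tol
        if inliers.sum() > best.sum():
            best = inliers
    return best
```

The mask can then be used to split the recorded points into per-plane sets before running the calibration, removing the bookkeeping burden from the operator.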
While the method was shown to be robust against large initial errors in the
estimated transform, the effect of a nonzero curvature of the planes from which
the data is gathered remains to be investigated.
In practice, physical limitations may restrict the poses from which data can be
gathered, since neither the sensor nor the robot may collide with the planes. The
failure of the collected poses to adequately span the relevant space may make
the algorithm sensitive to errors in certain directions in the space of estimated
parameters. Fortunately, however, directions that are difficult to span when gath-
ering calibration data are also unlikely to be useful when the calibrated sensor is
deployed. The found transformation will consequently be accurate for relevant
poses that were present in the calibration data.
Conclusions
This section has presented a robust, iterative method composed of linear
subproblems for the kinematic calibration of a 2D laser sensor. Large uncertainties in
the initial estimates are handled and the estimation error converges to below the
level of the measurement noise. The calibration routine can be used for any type of
laser sensor that measures distances in a plane, as long as the forward kinematics
is known, such as when the sensor is mounted on the flange of an industrial robot
or on a mobile platform or aerial drone, tracked by an external tracking system.
Extensions to other kinds of laser sensors are provided in Sections 13.C and 13.D.
An implementation of the proposed method is made available in [Robotlib.jl,
B.C., 2015].
13.A Point Sampling
Some laser sensors for weld-seam tracking do not provide all the measured points,
but instead provide the location of the weld seam and the angle of the surface.
In this situation one must sample a second point along the line implied by the
provided measurements in order for the proposed algorithm to work. The sampling
of the second point, p_2, is straightforward, but entails a trade-off between
noise sensitivity and numerical accuracy. Given measurements of a point and an
angle, p_m, α_m, corrupted with measurement noise e_p, e_α, respectively, p_2 can be
calculated as
p_2 = p_m + γ l    (13.32)
l = [cos α_m   sin α_m]^T    (13.33)
α_m = α + e_α,   e_α ∼ N(0, σ_α²)    (13.34)
p_m = p + e_p,   e_p ∼ N(0, Σ_p)    (13.35)
V{p_2} = V{p_m + γ l} = V{p_m} + γ² V{l}    (13.36)
       ≈ Σ_p + γ² σ_α² ∇_α l ∇_α l^T    (13.37)
       = Σ_p + γ² σ_α² [ sin²α_m            −sin α_m cos α_m
                         −sin α_m cos α_m    cos²α_m ]    (13.38)

where (13.36) holds if E{e_α e_p} = 0. We thus see that the variance in the sampled
point will be strictly larger than the variance in the measured point, and we should
ideally trust this second point less when we form the estimate of both the plane
equations and TTSF . Since the original problem is of errors-in-variables type, the
optimal solution to the problem with unequal variances is given by the weighted
total least-squares procedure [Fang, 2013]. Experiments outlined in Sec. 13.B
indicated that as long as γ was chosen on the same scale as ‖p_S‖, taking this
uncertainty into account had no effect on the fixed point to which the algorithm
converged.
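The sampling of p_2 and its approximate covariance, (13.32)-(13.38), can be written compactly as (a numpy sketch with my own function name):

```python
import numpy as np

def sample_second_point(p_m, alpha_m, gamma, Sigma_p, sigma_alpha):
    """Sample p2 = p_m + gamma*l along the measured line and return its
    first-order covariance approximation."""
    l = np.array([np.cos(alpha_m), np.sin(alpha_m)])
    grad = np.array([-np.sin(alpha_m), np.cos(alpha_m)])   # d l / d alpha
    p2 = p_m + gamma * l
    Sigma_p2 = Sigma_p + gamma**2 * sigma_alpha**2 * np.outer(grad, grad)
    return p2, Sigma_p2
```

As the derivation shows, the sampled point always has at least as much variance as the measured one, and should ideally be down-weighted accordingly.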
A_i k = ‖n_i‖² − n_i^T p^TF_RB_i    (13.39)
A_i = [ n_i^T R^TF_RB_i x_S_i   n_i^T R^TF_RB_i y_S_i   n_i^T R^TF_RB_i ]    (13.40)
makes fewer assumptions than the first approach. While conceptually simple, this
approach requires N_p bootstrapping procedures and is thus more computationally
expensive. One can, however, run the complete algorithm until convergence
using estimates based on regular LS, and only after convergence switch to the
more accurate WTLS procedure to improve accuracy.
Evaluation
The first strategy was implemented and tested on the simulated data from
Sec. 13.2. We used 500 bootstrap samples and 15 iterations in the WTLS algo-
rithm. No appreciable difference in fixed points was detected between solving for
TTSF using the standard LS procedure and the WTLS procedure. We thus concluded
that the considerable additional complexity involved in estimating the covariance
matrices of the data and solving the optimization problem using the more compli-
cated WTLS algorithm is not warranted. The evaluation can be reproduced using
code available in [Robotlib.jl, B.C., 2015] and the WTLS algorithm implementation
is made available in [TotalLeastSquares.jl, B.C., 2018].
13.C Calibration of Point Lasers
Eq. (13.19) We now assume, without loss of generality, that the laser point lies
along the line y_S = z_S = 0. As a result, the second and third columns of T^S_TF
cannot be solved for. These two vectors can be set to zero.
Eq. (13.20) The truncated vector k ∈ R⁶ will now consist of the first column of R^S_TF
and the translation vector p^S_TF.
Eq. (13.21) The three middle elements of A_i, corresponding to n_i^T R^TF_RB_i y_S_i, are
removed.
The proposed algorithm can also be utilized for 3D distance sensors, such as
LIDARs. This type of sensor provides richer information about the environment,
and thus allows a richer set of calibration algorithms to be employed. We make no
claims regarding the effectiveness of the proposed algorithm in this scenario, and
simply note the adjustments needed to employ it.
To use the proposed algorithm, we modify it according to
Eq. (13.19) We no longer assume that the laser line lies in the plane z_S = 0. As a
result, the full matrix T^S_TF can now be solved for immediately, without the
additional step of forming R_3 = R_1 × R_2.
Eq. (13.20) The vector k ∈ R¹² will now consist of all the columns of R^S_TF and the
translation vector p^S_TF.
To make use of this algorithm in practice, one has to consider the problem of
assigning measured points to the correct planes. This can be hard when employing
14
State Estimation for FSW
In this chapter, we will consider the problem of state estimation in the context
of friction stir welding. The state estimator we develop is able to incorporate
measurements from the class of laser sensors that was considered in the previous
chapter, as well as compliance models and force-sensor measurements. As alluded
to in Chap. 12, the FSW process requires accurate control of the full 6 DOF pose
of the robot relative to the seam. The problem of estimating the pose of the
welding tool during welding is both nonlinear and non-Gaussian, motivating a
state estimator beyond the standard Kalman filter. The following sections will
highlight the unique challenges related to state estimation associated with friction
stir welding and propose a particle-filter based estimator, a method that was
introduced in Sec. 4.2.
We also develop a framework for seam-tracking simulation in Sec. 14.2, where
the relation between sources of error and estimation performance is analyzed.
Through geometric reasoning, we show that some situations call for additional
sensing on top of what is provided by a single laser sensor. The framework is
intended to assist the user in selection of an optimal sensor configuration for
a given seam, where sensor configurations vary in, e.g., the number of sensors
applied and their distance from the tool center point (TCP). The framework also
helps the user tune the state estimator, a problem which is significantly harder for
a particle-filter based estimator compared to a Kalman filter.
measurements that naturally occupy the same Cartesian space as the tool pose
we are ultimately interested in estimating. Although subject to errors, the sensor
information available from the robot is naturally transformed to SE (3) by means of
the forward kinematics function F k (q). This information will be used to increase
the observability in directions of the state-space where the external sensing leaves
us blind.
The velocities and accelerations present during FSW are typically very low
and we therefore chose not to include velocities in the state to be estimated.
This reduces the state dimension and computational burden significantly, while
maintaining a high estimation performance.
Preliminaries
This section briefly introduces a number of coordinate frames and variables used
in the later description of the method. For a general treatment of coordinate
frames and transformations, see [Murray et al., 1994].
The following text will reference a number of different coordinate frames. We
list them all in Table 12.1 and provide a visual guide to relate them to each other
in Fig. 14.1. Table 12.1 further introduces a number of variables and some special
notation that will be referred to in the following description. All Cartesian-space
variables are given in the robot base frame RB unless otherwise noted.
Figure 14.1 Coordinate frames (x, y, z) = (red, green, blue). The origin of the
sensor frame S is located in the laser plane at the desired seam intersection point.
The tool frame is denoted by T .
14.1 State Estimator
Nominal trajectory
Before we describe the details of the state estimator, we will establish the concept
of the nominal trajectory. In the linear-Gaussian case, the reference trajectory of a
control system is of no importance while estimating the state; this follows from
(4.24), which shows that the covariance of the state estimate is independent of
the control signal. In the present context, however, we make use of the reference
trajectory for two purposes. 1.) It provides prior information regarding the state
transition. This lets us bypass a lot of modeling complexity by assuming that
the robot controller will do a good job following the velocities specified in the
reference trajectory. We know, however, that the robot controller will follow this
reference with a potentially large position error, due to deflections, etc., outlined
above. 2.) The reference trajectory provides the nominal seam geometry needed
to determine the likelihood of a measurement from the laser sensor given a state
hypothesis X̂ .
To get a suitable representation of the nominal trajectory used to propagate
the particles forward, we can, e.g., perform a simulation of the robot program
using a simulation software, often provided by the robot manufacturer. This pro-
cedure eliminates the need to reverse engineer the robot path planner. During the
simulation, a stream of joint angles is saved, which, when run through the forward
kinematics, returns the nominal trajectory in Cartesian space. The simulation
framework outlined in the following sections provides a number of methods for
generating a nominal trajectory for simulation experiments.
The nominal trajectory will consist of a sequence of joint coordinates {q̄_t}_{t=1}^N
which, if run through the forward kinematic function, yields a sequence of points
{p_t = F_k(q̄_t)}_{t=1}^N specifying the nominal seam geometry.
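Generating the nominal seam geometry from logged joint coordinates is then a direct map through the forward kinematics; a small sketch, assuming a user-supplied fk that returns a 4x4 homogeneous transformation matrix:

```python
import numpy as np

def nominal_trajectory(q_bar, fk):
    """Map a sequence of nominal joint coordinates through the forward
    kinematics fk and return the Cartesian points (translational parts)."""
    return np.array([fk(q)[:3, 3] for q in q_bar])
```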
Density functions
To employ a particle filter for state estimation, specification of a statistical model
of the state evolution and the measurement generating process is required, see
Sec. 4.2. This section introduces and motivates the various modeling choices in
the form of probability density functions used in the proposed particle filter.
State transition
p(X + |X , f) (14.1)
The state-transition function is a model of the state at the next time instance, given
the state at the current time instance. We model the mean of the state-transition
density (14.1) using the robot reference trajectory. The reference trajectory is
generated by, e.g., the robot controller or FSW path planner, and consists of a
sequence of poses which the robot should traverse. We assume that a tracking
controller can make corrections to this nominal trajectory, based on the state
estimates from the state estimator.
The shape of the density should encode the uncertainty in the update of
the robot state from one sample to another. For a robot moving in free space,
this uncertainty is usually very small. Under the influence of varying external
process forces, however, significant uncertainty is introduced [De Backer, 2014;
Sörnmo, 2015; Olofsson, 2015]. Based on this assumption, we may choose a density
where the width is a function of the process force. For example, we may choose a
multivariate Gaussian distribution and let the covariance matrix be a function of
the process force.
When the robot is subject to large external forces applied at the tool, the measure-
ments provided by the robot will not provide an accurate estimate of the tool pose
through the forward-kinematics function. If a compliance model C j (τ) is available,
we may use it to reduce the uncertainty induced by kinematic deflections, a topic
explored in detail in [Lehmann et al., 2013; Sörnmo, 2015; Olofsson, 2015]. We
thus choose the following model for the mean of the robot measurement density
p(q, f|X) (14.2):

µ{p(q, f|X)} = F_k(q + C_j(τ))    (14.3)
The uncertainty in the robot measurement comes from several sources. The
joint resolvers/encoders are affected by noise, which is well modeled as a Gaus-
sian random variable. When Gaussian errors, e q , in the joint measurements are
propagated through the linearized forward-kinematics function, the covariance
matrix Σ_C of the resulting Cartesian-space errors e_C is obtained by approximating
e_q = dq as

q_m = q + e_q = q + dq
e_q ∼ N(0, Σ_q)
e_C ∼ N(0, J Σ_q J^T)
d 〈F k (q)〉∨
eC = dq = J dq = J e q
dq
n o n o
ΣC = E eC eCT = E J e q e Tq J T = J E e q e Tq J T
© ª
where the approximation J (q + e q ) ≈ J (q) has been made. The twist coordinate
representation 〈F k (q)〉∨ is obtained by taking the matrix logarithm log(F k (q)),
which produces a twist ξ ∈ se(3); the operation (·)∨ : se(3) → R⁶ then returns the
twist coordinates [Murray et al., 1994].¹
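The propagation ΣC = J Σq Jᵀ is easy to verify numerically. In the sketch below the Jacobian and the noise scale are random stand-ins, not quantities from the thesis; a Monte-Carlo estimate of the propagated covariance is compared with the linearized formula.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_q = 1e-8 * np.eye(6)            # joint-noise covariance (made-up scale)
J = rng.standard_normal((6, 6))       # stand-in for the kinematic Jacobian

# Linearized propagation of e_C = J e_q
Sigma_C = J @ Sigma_q @ J.T

# Monte-Carlo check: sample joint errors and propagate them
e_q = rng.multivariate_normal(np.zeros(6), Sigma_q, size=100_000)
Sigma_mc = np.cov((e_q @ J.T).T)
```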
Except for the measurement noise e q , the errors in the robot measurement
update density are not independent between samples. The error in both the for-
ward kinematics and the compliance model is configuration dependent. Since
the velocity of the robot is bounded, the configuration will change slowly and
configuration-dependent errors will thus be highly correlated in time. The stan-
dard derivation of the particle filter relies on the assumption that the measurement
errors constitute a sequence of independent, identically distributed (i.i.d.) random
variables. Independent measurement errors can be averaged between samples
to obtain a more accurate estimate, which is not possible with correlated errors,
where several consecutive measurements all suffer from the same error.
Time-correlated errors are in general hard to handle in the particle filtering
framework and no systematic way to cope with this problem has been found.
One potential approach is to incorporate the correlated error as a state to be
estimated [Evensen, 2003; Åström and Wittenmark, 2013a]. This is feasible only if
there exists a way to differentiate between the different sources of error, something
which in the present context would require additional sensing. State augmen-
tation further doubles the state dimension, with a corresponding increase in
computational complexity.
Since only a combination of the tracking error, the kinematic error and the
dynamic error is measurable, we propose to model the time-correlated uncer-
tainties as a uniform random variable with a width d chosen as the maximum
expected error. When performing the measurement update with the densities of
several perfectly correlated uniform random variables, the posterior distribution
equals the prior distribution. The uniform distribution is thus invariant under
the measurement update. We illustrate this in Fig. 14.2, where the effect of the
measurement update is displayed for a hybrid between the Gaussian and uniform
distributions.
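The invariance of the uniform density under repeated measurement updates with perfectly correlated errors can be demonstrated on a grid. The sketch below (grid resolution and the value of d are arbitrary choices) multiplies a posterior repeatedly by the same density: the Gaussian keeps narrowing, while the uniform is left unchanged.

```python
import numpy as np

x = np.linspace(-4, 4, 801)
h = x[1] - x[0]
d = 1.0                                   # half-width of the uniform density

def normalize(p):
    return p / (p.sum() * h)

gauss = normalize(np.exp(-x**2 / 2))
unif  = normalize((np.abs(x) <= d).astype(float))

# Perfectly correlated errors: every measurement update multiplies the
# posterior by the same density.
post_g, post_u = gauss.copy(), unif.copy()
for _ in range(50):
    post_g = normalize(post_g * gauss)    # the Gaussian keeps narrowing
    post_u = normalize(post_u * unif)     # the uniform is left unchanged

var_g = (x**2 * post_g).sum() * h         # variance after 50 updates
```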
The complete robot measurement density function with the above modeling
choices, (14.2), is formed by the convolution of the densities for a Gaussian, pG ,
and a uniform, pU , random variable, according to
    p(q, f|X ) = ∫_{R^k} pU (x − y) pG (y) dy                            (14.4)
¹ If the covariance of the measurements q m is obtained on the motor side of the gearbox, the
Cartesian-space covariance takes the form ΣC = J G E{e q e q^T} G^T J^T, where G is the gear-ratio
matrix of (9.11).
[Figure 14.2: The effect of the measurement update on a Gaussian density and on a
Gaussian + uniform density, shown after 2 and 50 measurements over −4σ to 4σ.]
    p(q, f|X ) ≈ { C                                    if |∆x| ≤ d
                 { C exp(−(|∆x| − d)² / (2σ²))          if |∆x| > d       (14.5)

    C = 1 / (√(2π) σ + 2d)
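A direct implementation of the hybrid density (14.5) is straightforward; the values of σ and d below are placeholders. The density is flat on the plateau |∆x| ≤ d, has Gaussian flanks, and integrates to one by the choice of C.

```python
import numpy as np

sigma, d = 0.3, 1.0    # placeholder noise scale and correlated-error width

def p_hybrid(dx):
    """Hybrid Gaussian/uniform density of (14.5)."""
    C = 1.0 / (np.sqrt(2 * np.pi) * sigma + 2 * d)
    dx = np.abs(dx)
    return np.where(dx <= d, C,
                    C * np.exp(-(dx - d)**2 / (2 * sigma**2)))

x = np.linspace(-5, 5, 20001)
area = p_hybrid(x).sum() * (x[1] - x[0])   # the density integrates to one
```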
[Figure: surface plot of the density p(q, f|X ) over x 1 , x 2 ∈ [−10, 10] mm.]
In the multivariable case, the condition |∆x| ≤ d is replaced by ‖∆x‖₂ ≤ d.
[Figure: geometry of the laser-sensor measurement, showing the laser line across
the seam, the measurement m̂, the nominal points p 1 , p 2 , the closest point p i ,
the error e, the sensor transform T TSF and the densities p(m| X̂ ) and p(q| X̂ ).]
The mean of the laser measurement density is chosen as

    µ{p(m|X )} = p i
and the shape should be chosen to reflect the error distribution of the laser sensor,
here modeled as a normal distribution according to
    p(m|X ) = (2π)^(−3/2) |Σ|^(−1/2) exp(−½ e^T Σ^(−1) e),    e = m̂ − p i
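Evaluating this density for a particle's measurement likelihood can be sketched with NumPy; the covariance values below are placeholders.

```python
import numpy as np

def p_m_given_X(m_hat, p_i, Sigma):
    """Gaussian measurement density with error e = m_hat - p_i."""
    e = np.asarray(m_hat, float) - np.asarray(p_i, float)
    k = e.size
    norm = (2 * np.pi)**(-k / 2) / np.sqrt(np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * e @ np.linalg.solve(Sigma, e))

Sigma = np.diag([1e-2, 1e-2, 4e-2])   # placeholder sensor covariance [mm^2]
w = p_m_given_X([0.1, -0.05, 0.2], [0.0, 0.0, 0.0], Sigma)
```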
Many seam-tracking sensors are also capable of measuring the angle of the
weld surface around the normal of the laser plane. An angle measurement is easily
compared to the corresponding angle hypothesis of a particle using standard roll,
pitch, yaw calculations. Using the convention in Fig. 14.1, the angle around the
normal of the laser plane corresponds to the yaw angle. Roll and pitch angles are
unfortunately not directly measurable by this type of sensor. If, however, a sensor
with two or more laser planes is used, it is possible to estimate the full orientation
of the sensor. This will be analyzed further in Sec. 14.2.
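Extracting the yaw angle from a particle's orientation hypothesis follows standard roll, pitch, yaw formulas. The ZYX convention below is an assumption and must be matched to the convention of Fig. 14.1.

```python
import numpy as np

def rpy_from_R(R):
    """Roll, pitch, yaw of a rotation matrix (ZYX convention assumed)."""
    yaw   = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arcsin(-R[2, 0])
    roll  = np.arctan2(R[2, 1], R[2, 2])
    return roll, pitch, yaw

def Rz(a):  # rotation about z, used here to construct a test orientation
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# The yaw of the hypothesis can be compared directly with the sensor's
# angle measurement around the normal of the laser plane.
roll, pitch, yaw = rpy_from_R(Rz(0.3))
```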
Reduction of computational time. The evaluation of p(m|X ) can be
computationally expensive due to the search procedure. We can reduce this cost by
reducing the number of points to search over. This can be achieved by approxi-
mating the trajectory with a piecewise affine function. Since the intersection point
between the nominal seam line and the laser light plane is calculated, this does
not affect the accuracy of the evaluation of p(m|X ) much. To this end, we solve
the following convex optimization problem
    minimize    ‖y − z‖²_F + λ Σ_{t=1}^{N−2} Σ_{j=1}^{3} |w t,j |
      z,w
                                                                          (14.8)
    subject to  ‖y − z‖∞ ≤ ε
                w t,j = z t,j − 2z t+1,j + z t+2,j
where y ∈ R^(N×3) are the positions of the nominal trajectory points, z is the
approximation of y, and ε is the maximum allowed approximation error. The nonzero
elements of w will determine the location of the knots in the piecewise affine
approximation and λ will influence the number of knots.²
The proposed optimization problem does not incorporate constraints on the
orientation error of the approximation. This error will, however, be small if the
trajectory is smooth with bounded curvature and a constraint is put on the error
in the translational approximation, as in (14.8).
Optimization problem (14.8) can be seen as a multivariable trend-filtering
problem, a topic which was discussed in greater detail in Sec. 6.4.
² w t = z t − 2z t+1 + z t+2 is a discrete second-order differentiation of z.
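The role of w in (14.8) can be illustrated with plain NumPy: for a piecewise affine signal, the second differences are nonzero exactly at the knots. The example trajectory below is made up.

```python
import numpy as np

# A piecewise affine signal whose slope changes from 1 to 3 at t = 50
t = np.arange(100.0)
z = np.where(t < 50, t, 50 + 3 * (t - 50))

# Discrete second-order difference w_t = z_t - 2 z_{t+1} + z_{t+2}
w = z[:-2] - 2 * z[1:-1] + z[2:]

knots = np.flatnonzero(np.abs(w) > 1e-9)   # nonzero entries mark the knots
```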
Visualization
An often time-consuming part of implementing a particle filtering
framework is the tuning of the filter parameters. Due to the highly nonlinear
nature of the present filtering problem, this is not as straightforward as in the
Kalman-filtering scenario. A poorly tuned Kalman filter manifests itself as either
too noisy, or too slow. A poorly tuned particle filter may, however, suffer from
catastrophic failures such as mode collapse or particle degeneracy [Gustafsson,
2010; Thrun et al., 2005; Rawlings and Mayne, 2009].
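A common numerical indicator of degeneracy, standard in the particle-filtering literature though not necessarily the diagnostic used in this framework, is the effective sample size of the particle weights:

```python
import numpy as np

def effective_sample_size(w):
    """N_eff = 1 / sum(w_i^2) for (normalized) particle weights."""
    w = np.asarray(w, float)
    w = w / w.sum()
    return 1.0 / np.sum(w**2)

healthy    = effective_sample_size(np.ones(500))         # all weights equal
degenerate = effective_sample_size([1.0] + [1e-12] * 499)  # one dominant weight
```

A value near the particle count indicates a healthy weight distribution, while a value near one signals that a single particle carries almost all the weight.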
To identify the presence of mode collapse or particle degeneracy and to assist
in the tuning of the filter, we provide a visualization tool that displays the true
trajectory as traversed by the robot together with the distribution of the particles,
as well as each particle’s hypothesis measurement location. An illustrative example
is shown in Fig. 14.5, where one dimension in the filter state is shown as a function
of time in a screen shot of the visualizer.
To further aid the tuning of the filter, we perform several simulations in par-
allel with nominal filter parameters perturbed by samples from a prespecified
distribution and perform statistical significance tests to determine the parameters
of most importance to the result for a certain sensor/trajectory configuration.
Figure 14.6 displays the statistical significance of various filter parameters for a
certain trajectory and sensor configuration. The color coding indicates the log(P)-
values for the corresponding parameters in a linear model predicting the errors
in Cartesian directions and orientation. As an example, the figure indicates that
the parameter σW 2 , corresponding to the orientation noise in the state update,
has a significant impact on the errors in all Cartesian directions. The sign and
value of the underlying linear model can then serve as a guide to fine tuning of
this parameter.
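The perturb-and-regress idea can be sketched as follows. The data are synthetic, only one parameter truly affects the error, and the p-values use a normal approximation rather than the exact t-distribution; none of this is the thesis' implementation.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)
n, k = 200, 4                            # simulations and perturbed parameters
P = rng.standard_normal((n, k))          # sampled parameter perturbations
err = 0.8 * P[:, 2] + 0.1 * rng.standard_normal(n)   # only parameter 2 matters

X = np.column_stack([np.ones(n), P])     # linear model with intercept
beta, res, *_ = np.linalg.lstsq(X, err, rcond=None)
s2 = res[0] / (n - X.shape[1])           # residual variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
t_stat = beta / se
pvals = np.array([erfc(abs(t) / sqrt(2)) for t in t_stat])  # two-sided
```

The parameter with the smallest p-value is flagged as significant, and the sign and magnitude of its coefficient guide the fine tuning.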
One of the main goals of this work was to enable analysis of the optimal sensor
configuration for a given seam geometry. On the one hand, not all seam geometries
allow for accurate estimation of the state in all directions, and on the other hand,
not all seam geometries require accurate control in all directions. The optimal
sensor configuration depends heavily on the amount of features present in the
14.3 Analysis of Sensor Configurations
Figure 14.5 Visualization of a particle distribution as a function of time during
a simulation. The black line indicates the evolution of one coordinate of the true
state as a function of the time step and the heatmap illustrates the density of the
particles. This figure illustrates how the uncertainty of the estimate is reduced as a
feature in the trajectory becomes visible for the sensor at time step 50. The sensor is
located slightly ahead of the tool, hence, the distribution becomes narrow slightly
before the tool reaches the feature. The feature is in this case a sharp bend in the
otherwise straight seam.
Figure 14.6 An illustration of how the various parameters in the software frame-
work can be tuned. By fitting linear models, with tuning parameters as factors,
that predict various errors as linear combinations of parameter values, parameters
with significant effect on the performance can be identified using the log(P)-values
(color coded). The x-axis indicates the factors and the y-axis indicates the pre-
dicted errors in orientation and translation. The parameters are described in detail
in the software framework.
Figure 14.7 A sensor with a single laser stripe is not capable of distinguishing
between wrong translation and wrong orientation. The two hypotheses X 1 , X 2 both
share the closest measurement point on the seam. The second laser stripe invali-
dates the erroneous hypothesis X 2 which would have the second measurement
point far from the seam. Without the second laser stripe it is clear that the available
sensor information can not distinguish X 1 and X 2 from each other.
[Figure 14.8 panels: Error z [mm] and Error rot [deg], for 0–2 sensors and the
xy/yz trajectories.]
Figure 14.8 Error distributions for various sensor configurations (0-2 sensors)
and two different trajectory types (xy,yz). In both trajectory cases, y is the major
movement direction along which the laser sensors obtain little or no information.
The same filter parameters, tuned for the x y-trajectory, were used in all experi-
ments.
14.4 Discussion
Figure 14.9 Trajectories x y (solid) and y z (dashed). Distance [mm] along each
axis (x, y, z) is depicted as a function of time step.
number for the maximum error in the forward kinematics under no load is usually
provided by the robot manufacturer, or can be obtained using, e.g., an external
optical tracking system.
A major source of uncertainty is compliance in the structure of the robot.
Deflections in the robot joints and links caused by large process forces result in
an uncertainty in the measured tool position. This problem can be mitigated
by a compliance model, C j (τ) in (14.3), reducing the uncertainty to the level
of the model uncertainty [Lehmann et al., 2013]. Although several authors have
considered compliance modeling, the large range of possible seam geometries and
the large range of possible process forces make finding a globally valid, sufficiently
accurate compliance model very difficult.
The reduction of uncertainty through additional sensing offers the possibility
of reducing the remaining errors greatly, potentially eliminating the need for a
state estimator altogether. Sensors capable of measuring the full 6DOF pose, such
as optical tracking systems, are unfortunately very expensive. They further require
accurate measurements also of the workpiece, potentially placing additional bur-
den on the operator. Relative sensing, such as the laser sensors considered in
this work, directly measure the relevant distance between the seam and the tool.
Unfortunately, they suffer from a number of weaknesses. They can for obvious
reasons not measure the location of the seam at the tool center point, and must
thus measure the seam at a different location where it is visible. In a practical
scenario, this might cause the sensor to measure the seam up to 50 mm from the
TCP. Since the location of the TCP relative to the seam must be inferred through
geometry from this sensor measurement, the full 6 DOF pose becomes relevant,
even if it does not have to be accurately controlled. A second weakness of the con-
sidered relative sensing is the lack of observability along the seam direction. While
dual laser sensors allow measuring more degrees of freedom than a single laser
sensor, no number of laser sensors can infer the position along a straight seam.
The proposed state estimator tries to infer as much as possible about the state
by requiring knowledge of the seam geometry. This is necessary for the estimator
to know what sensor measurements to expect. The particle filter maintains a
representation of the full filtering density; it is thus possible to determine in
simulation whether or not the uncertainty is such that the worst-case tracking
performance is sufficient. The uncertainty is in general highest along the direction
of movement since no sensor information is available in this direction. Fortunately,
however, this direction is also the direction with lowest required tracking accuracy.
Situations that require higher tracking accuracy in this direction luckily coincide
with the situations that allow for higher estimation accuracy, when a feature is
present in the seam. An example of this was demonstrated in Fig. 14.5.
While the proposed framework is intended for simulation in order to aid the
design of a specialized state estimator, some measures were taken to reduce the
computational time while at the same time reducing the number of parameters
the operator has to tune. The most notable such measure was the choice to not
include velocities in the state. The velocities typically present in the FSW context
are fairly low, while forces are high. The acceleration in the transverse direction
can thus be high enough to render the estimation of velocities impossible on
the short time-scale associated with vibrations in the process. The bandwidth of
the controller is further far from enough for compensation to be feasible. In the
directions along the seam, the velocity is typically well controlled by the robot
controller apart from during the transient occurring when contact is established.
Once again, the bandwidth is not sufficient to compensate for errors occurring at
the frequencies present during the transient.
Lastly, the method does not include estimation of errors in the location of the
work piece. Without assumptions on either the error in the work-piece location or
the error in the forward kinematics of the robot, these two sources of error can not
be distinguished. Hence, augmenting the state with a representation of the work-
piece error will not be fruitful. If significant variation in work-piece placement
is suspected we instead propose to add a scanning phase prior to welding. This
would allow for using the laser sensor to, under no load, measure the location of
sufficiently many points along the seam to be able to estimate the location of the
work piece in the coordinate system of the robot. This procedure, which could
be easily automated, would compensate for errors in both work-piece placement
and the kinematic chain of the robot.
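Estimating the work-piece location from such a scan reduces, given point correspondences, to a rigid-transform fit. The SVD-based solution below belongs to the family of algorithms compared by Eggert et al. (1997); the seam points and the placement error are made-up data.

```python
import numpy as np

def rigid_transform(A, B):
    """Least-squares R, t with R @ a_i + t ≈ b_i (SVD-based, Kabsch-style)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)              # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    return R, cb - R @ ca

rng = np.random.default_rng(2)
A = rng.uniform(-100, 100, (20, 3))        # nominal seam points [mm]
ang = 0.05                                 # small placement rotation [rad]
R_true = np.array([[np.cos(ang), -np.sin(ang), 0.0],
                   [np.sin(ang),  np.cos(ang), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([2.0, -1.0, 0.5])        # placement offset [mm]
B = A @ R_true.T + t_true                  # scanned locations of the points
R_est, t_est = rigid_transform(A, B)
```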
14.5 Conclusions
We have suggested a particle-filter based state estimator capable of estimating
the full 6 DOF pose of the tool relative to the seam in a seam-tracking scenario.
Sensor fusion is carried out between the robot internal measurements, propagated
through a forward kinematics model with large uncertainties due to the applied
process forces, and measurements from a class of seam-tracking laser sensors.
We have highlighted some of the difficulties related to state estimation where
accurate measurements come in a reduced-dimensional space, together with
highly uncertain measurements of the full state space, where the uncertainties are
highly correlated in time.
The presented framework is available as open-source [PFSeamTracking.jl,
B.C. et al., 2016] and the algorithm has been successfully implemented at The
Welding Institute (TWI) in Sheffield, UK, and is capable of executing at approxi-
mately 1000 Hz using 500 particles on a standard desktop PC.
Conclusions and Future Work
We have presented a wide range of problems and methods within estimation for
physical systems. Common to many parts of the thesis is the use of ideas from
machine learning to solve classical identification problems. Many of the problems
considered could, in theory, be solved by gathering massive amounts of data and
training a deep neural network. Instead, we have opted for developing methods
that make use of prior knowledge where available, and flexibility where not. This
has resulted in practical methods that require a practical amount of data. The
proof of this has in many cases been provided by experimental application on
physical systems, the very systems that motivated the work.
Looking forward, we see robust uncertainty quantification as a very interesting
and important direction for future work. Some of the developed methods have a
probabilistic interpretation and lend themselves well to maximum a posteriori
inference, in restricted settings. Lifting this restriction is straightforward in theory
but most often computationally challenging. Work on approximate methods has
recently made great strides in the area, but the field requires further attention
before its application as a robust technology.
Bibliography
Dahl, P. (1968). A solid friction model. Tech. rep. TOR-0158 (3107-18)-1. Aerospace
Corp, El Segundo, CA.
Daniilidis, K. (1999). “Hand-eye calibration using dual quaternions”. The Interna-
tional Journal of Robotics Research 18:3, pp. 286–298.
De Backer, J. (2014). Feedback Control of Robotic Friction Stir Welding. PhD thesis.
ISBN 978-91-87531-00-2, University West, Trollhättan, Sweden.
De Backer, J. and G. Bolmsjö (2014). “Deflection model for robotic friction stir
welding”. Industrial Robot: An International Journal 41:4, pp. 365–372.
De Backer, J., A.-K. Christiansson, J. Oqueka, and G. Bolmsjö (2012). “Investigation
of path compensation methods for robotic friction stir welding”. Industrial
Robot: An International Journal 39:6, pp. 601–608.
De Wit, C. C., H. Olsson, K. J. Åström, and P. Lischinsky (1995). “A new model
for control of systems with friction”. Automatic Control, IEEE Trans. on 40:3,
pp. 419–425.
Eckart, C. and G. Young (1936). “The approximation of one matrix by another of
lower rank”. Psychometrika 1:3, pp. 211–218.
Eggert, D. W., A. Lorusso, and R. B. Fisher (1997). “Estimating 3-d rigid body
transformations: a comparison of four major algorithms”. Machine Vision and
Applications 9:5-6, pp. 272–290.
Evensen, G. (2003). “The ensemble Kalman filter: theoretical formulation and
practical implementation”. Ocean Dynamics 53:4, pp. 343–367. ISSN: 1616-
7341. DOI: 10.1007/s10236-003-0036-9.
Fang, X. (2013). “Weighted total least squares: necessary and sufficient conditions,
fixed and random parameters”. Journal of geodesy 87:8, pp. 733–749.
Fischler, M. A. and R. C. Bolles (1981). “Random sample consensus: a paradigm for
model fitting with applications to image analysis and automated cartography”.
Communications of the ACM 24:6, pp. 381–395.
Gao, X., D. You, and S. Katayama (2012). “Seam tracking monitoring based on
adaptive Kalman filter embedded Elman neural network during high-power fiber
laser welding”. Industrial Electronics, IEEE Transactions on 59:11, pp. 4315–
4325.
Gershman, S. J. and D. M. Blei (2011). A tutorial on Bayesian nonparametric models.
eprint: arXiv:1106.2697.
Glad, T. and L. Ljung (2014). Control theory. CRC press, Boca Raton, Florida.
Goldstein, A. A. (1964). “Convex programming in Hilbert space”. Bulletin of the
American Mathematical Society 70:5, pp. 709–710.
Golub, G. H. and C. F. Van Loan (2012). Matrix computations. Vol. 3. Johns Hopkins
University Press, Baltimore.
Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. http://www.
deeplearningbook.org. MIT Press, Cambridge MA.
Karl, M., M. Soelch, J. Bayer, and P. van der Smagt (2016). Deep variational Bayes fil-
ters: unsupervised learning of state space models from raw data. eprint: arXiv:
1605.06432.
Karlsson, M., F. Bagge Carlson, A. Robertsson, and R. Johansson (2017). “Two-
degree-of-freedom control for trajectory tracking and perturbation recovery
during execution of dynamical movement primitives”. In: 20th IFAC World
Congress, Toulouse.
Kay, S. M. (1993). Fundamentals of statistical signal processing, volume I: estima-
tion theory. Prentice Hall, Englewood Cliffs, NJ.
Khalil, H. K. (1996). “Nonlinear systems”. Prentice-Hall, New Jersey 2:5.
Kim, S.-J., K. Koh, S. Boyd, and D. Gorinevsky (2009). “`1 trend filtering”. SIAM
review 51:2, pp. 339–360.
Kingma, D. and J. Ba (2014). “Adam: a method for stochastic optimization”. arXiv
preprint arXiv:1412.6980.
Kruif, B. J. de and T. J. de Vries (2002). “Support-vector-based least squares for
learning non-linear dynamics”. In: Decision and Control, 2002, Proc. IEEE Conf.,
Las Vegas. Vol. 2, pp. 1343–1348.
Lehmann, C., B. Olofsson, K. Nilsson, M. Halbauer, M. Haage, A. Robertsson, O.
Sörnmo, and U. Berger (2013). “Robot joint modeling and parameter identifi-
cation using the clamping method”. In: 7th IFAC Conference on Manufacturing
Modelling, Management, and Control. Saint Petersburg, Russia, pp. 843–848.
Lennartson, B., R. H. Middleton, and I. Gustafsson (2012). “Numerical sensitivity
of linear matrix inequalities using shift and delta operators”. IEEE Transactions
on Automatic Control 57:11, pp. 2874–2879.
Levine, S. and P. Abbeel (2014). “Learning neural network policies with guided
policy search under unknown dynamics”. In: Advances in Neural Information
Processing Systems, Montreal, pp. 1071–1079.
Levine, S. and V. Koltun (2013). “Guided policy search”. In: Int. Conf. Machine
Learning (ICML), Atlanta, pp. 1–9.
Levine, S., N. Wagener, and P. Abbeel (2015). “Learning contact-rich manipulation
skills with guided policy search”. In: Robotics and Automation (ICRA), IEEE Int.
Conf., Seattle. IEEE, pp. 156–163.
Lindström, E., E. Ionides, J. Frydendall, and H. Madsen (2012). “Efficient iterated
filtering”. IFAC Proceedings Volumes 45:16, pp. 1785–1790.
Ljung, L. (1987). System identification: theory for the user. Prentice-hall, Englewood
Cliffs, NJ.
Ljung, L. and T. Söderström (1983). Theory and practice of recursive identification.
MIT press, Cambridge, MA.
Manchester, I. R., M. M. Tobenkin, and A. Megretski (2012). “Stable nonlinear sys-
tem identification: convexity, model class, and consistency”. IFAC Proceedings
Volumes 45:16, pp. 328–333. DOI: 10.3182/20120711-3-BE-2027.00405.
Yuan, M. and Y. Lin (2006). “Model selection and estimation in regression with
grouped variables”. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 68:1, pp. 49–67.
Zhang, Q. and R. Pless (2004). “Extrinsic calibration of a camera and laser range
finder (improves camera calibration)”. In: Intelligent Robots and Systems, 2004.
(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, Sendai,
Japan. Vol. 3, 2301–2306 vol.3. DOI: 10.1109/IROS.2004.1389752.
Zhang, W. (1999). State-space search: Algorithms, complexity, extensions, and ap-
plications. Springer Science & Business Media, New York.
Zhuang, H., S. Motaghedi, and Z. S. Roth (1999). “Robot calibration with pla-
nar constraints”. In: Robotics and Automation, 1999. Proceedings. 1999 IEEE
International Conference on, Detroit, Michigan. Vol. 1, 805–810 vol.1. DOI:
10.1109/ROBOT.1999.770073.