0% found this document useful (0 votes)
42 views

Machine Learning and System Identification For Estimation in Physical Systems

Uploaded by

hongyu zheng
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
42 views

Machine Learning and System Identification For Estimation in Physical Systems

Uploaded by

hongyu zheng
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 185

Machine Learning and System Identification for Estimation in Physical Systems

Bagge Carlson, Fredrik

2018

Document Version:
Publisher's PDF, also known as Version of record

Link to publication

Citation for published version (APA):


Bagge Carlson, F. (2018). Machine Learning and System Identification for Estimation in Physical Systems. Lund:
Department of Automatic Control, Faculty of Engineering LTH, Lund University.

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors
and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the
legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study
or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain
• You may freely distribute the URL identifying the publication in the public portal
Take down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove
access to the work immediately and investigate your claim.

LUNDUNI
VERSI
TY

PO Box117
22100L und
+4646-2220000

Download date: 07. Apr. 2020


Machine Learning and System Identification for
Estimation in Physical Systems

Fredrik Bagge Carlson

Department of Automatic Control


PhD Thesis TFRT-1122
ISBN 978-91-7753-920-9 (print)
ISBN 978-91-7753-921-6 (web)
ISSN 0280–5316

Department of Automatic Control


Lund University
Box 118
SE-221 00 LUND
Sweden

© 2018 by Fredrik Bagge Carlson. All rights reserved.


Printed in Sweden by Media-Tryck.
Lund 2018
To Farah and the next step
Abstract

In this thesis, we draw inspiration from both classical system identification and
modern machine learning in order to solve estimation problems for real-world,
physical systems. The main approach to estimation and learning adopted is op-
timization based. Concepts such as regularization will be utilized for encoding
of prior knowledge and basis-function expansions will be used to add nonlinear
modeling power while keeping data requirements practical.
The thesis covers a wide range of applications, many inspired by applications
within robotics, but also extending outside this already wide field. Usage of the
proposed methods and algorithms are in many cases illustrated in the real-world
applications that motivated the research. Topics covered include dynamics mod-
eling and estimation, model-based reinforcement learning, spectral estimation,
friction modeling and state estimation and calibration in robotic machining.
In the work on modeling and identification of dynamics, we develop regu-
larization strategies that allow us to incorporate prior domain knowledge into
flexible, overparameterized models. We make use of classical control theory to
gain insight into training and regularization while using flexible tools from modern
deep learning. A particular focus of the work is to allow use of modern methods in
scenarios where gathering data is associated with a high cost.
In the robotics-inspired parts of the thesis, we develop methods that are practi-
cally motivated and ensure that they are implementable also outside the research
setting. We demonstrate this by performing experiments in realistic settings and
providing open-source implementations of all proposed methods and algorithms.

5
Acknowledgements

I would like to acknowledge the influence of my PhD thesis supervisor Prof. Rolf
Johansson and my Master’s thesis advisor Dr. Vuong Ngoc Dung at SIMTech, who
both encouraged me to pursue the PhD degree, for which I am very thankful. Prof.
Johansson has continuously supported my ideas and let me define my work with
great freedom, thank you.
My thesis co-supervisor, Prof. Anders Robertsson, thank you for your never-
ending enthusiasm, source of good mood and encouragement. When working
100% overtime during hot July nights in the robot lab, it helps to know that one is
never alone.
I would further like to direct my appreciation to friends and colleagues at the
department. It has often fascinated me, how a passionate and excited speaker
can make a boring topic appear interesting. No wonder a group of 50+ highly
motivated and passionate individuals can make an already interesting subject
fantastic. In particular Prof. Bo Bernhardsson, my office mates Gautham Nayak
Seetanadi and Mattias Fält and my travel mates Martin Karlsson, Olof Troeng and
Richard Pates, you have all made the last 5 years outside and at the department
particularly enjoyable.
Credit also goes to Jacob Wikmark, Dr. Björn Olofsson and Dr.
c Martin Karlsson
for incredibly generous and careful proof reading of the manuscript to this thesis,
and to Leif Andersson for helping out with typesetting, you have all been very
helpful!
Finally, I would like to thank my family in Vinslöv who have provided and
continue to provide a solid foundation to build upon, to my family from Sparta
who provided a second home and a source of both comfort and adventure, and to
the welcoming new addition to my family in the Middle East.

7
Financial support
Parts of the presented research were supported by the European Commission
under the 7th Framework Programme under grant agreement 606156 Flexifab.
Parts of the presented research were supported by the European Commission
under the Framework Programme Horizon 2020 under grant agreement 644938
SARAFun. The author is a member of the LCCC Linnaeus Center and the ELLIIT
Excellence Center at Lund University.

8
Contents

1. Introduction 13
1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2. Publications and Contributions 16

Part I Model Estimation 21


3. Introduction—System Identification and Machine Learning 23
3.1 Models of Dynamical Systems . . . . . . . . . . . . . . . . . . . . . 24
3.2 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Inductive Bias and Prior Knowledge . . . . . . . . . . . . . . . . . . 27
4. State Estimation 29
4.1 General State Estimation . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 The Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 The Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5. Dynamic Programming 34
5.1 Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . 36
6. Linear Quadratic Estimation and Regularization 38
6.1 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 38
6.2 Least-Squares Estimation . . . . . . . . . . . . . . . . . . . . . . . 39
6.3 Basis-Function Expansions . . . . . . . . . . . . . . . . . . . . . . 42
6.4 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.5 Estimation of LTI Models . . . . . . . . . . . . . . . . . . . . . . . 49
7. Estimation of LTV Models 52
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.2 Model and Identification Problems . . . . . . . . . . . . . . . . . 53
7.3 Well-Posedness and Identifiability . . . . . . . . . . . . . . . . . . . 57
7.4 Kalman Smoother for Identification . . . . . . . . . . . . . . . . . 59
7.5 Dynamics Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.6 Example—Jump-Linear System . . . . . . . . . . . . . . . . . . . . 61
7.7 Example—Low-Frequency Evolution . . . . . . . . . . . . . . . . 62

9
Contents

7.8 Example—Nonsmooth Robot Arm with Stiff Contact . . . . . . . 65


7.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.A Solving (7.6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.B Solving (7.8) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
8. Identification and Regularization of Nonlinear Black-Box Models 72
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
8.2 Computational Aspects . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3 Estimating a Nonlinear Black-Box Model . . . . . . . . . . . . . . 76
8.4 Weight Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.5 Tangent-Space Regularization . . . . . . . . . . . . . . . . . . . . 80
8.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.A Comparison of Activation Functions . . . . . . . . . . . . . . . . 90
8.B Deviations from the Nominal Model . . . . . . . . . . . . . . . . 90
9. Friction Modeling and Estimation 93
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.2 Models and Identification Procedures . . . . . . . . . . . . . . . 95
9.3 Position-Dependent Model . . . . . . . . . . . . . . . . . . . . . . . 97
9.4 Energy-Dependent Model . . . . . . . . . . . . . . . . . . . . . . 99
9.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
10. Spectral Estimation 112
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
10.2 LPV Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . 113
10.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 120
10.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
10.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
10.A Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
11. Model-Based Reinforcement Learning 127
11.1 Iterative LQR—Differential Dynamic Programming . . . . . . . . 127
11.2 Example—Reinforcement Learning . . . . . . . . . . . . . . . . . 130

Part II Robot State Estimation 135


12. Introduction—Friction Stir Welding 137
12.1 Kinematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
13. Calibration 140
13.1 Force/Torque Sensor Calibration . . . . . . . . . . . . . . . . . . 140
13.2 Laser-Scanner Calibration . . . . . . . . . . . . . . . . . . . . . . 143

10
Contents

13.A Point Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153


13.B Least-Squares vs. Total Least-Squares . . . . . . . . . . . . . . . . . 154
13.C Calibration of Point Lasers . . . . . . . . . . . . . . . . . . . . . . 155
13.D Calibration of 3D Lasers and LIDARs . . . . . . . . . . . . . . . . 155
14. State Estimation for FSW 157
14.1 State Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
14.2 Simulation Framework . . . . . . . . . . . . . . . . . . . . . . . . 166
14.3 Analysis of Sensor Configurations . . . . . . . . . . . . . . . . . . 166
14.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
14.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Conclusions and Future Work 173
Bibliography 174

11
1
Introduction

Technical computing, sensing and control are well-established fields, still making
steady progress today. Rapid advancements in the ability to train flexible machine
learning models, enabled by amassing data and breakthroughs in the understand-
ing of the difficulties behind gradient-based training of deep architectures, have
made the considerably younger field of machine learning explode with interest.
Together, they have made automation feasible in situations we previously could
not dream of.
The vast majority of applications within machine learning are, thus far, in
domains where data is plentiful, such as image classification and text analysis.
Flexible machine-learning models thrive on large datasets, and much of the ad-
vancements of deep learning is often attributed to growing datasets, rather than
algorithmic advancements [Goodfellow et al., 2016]. In practice, it took a few
breakthrough ideas to enable training of these deep and flexible architectures,
but few argue with the fact that the size of the dataset is of great importance. In
many domains, notably domains involving mechanical systems such as robots
and cars, gathering the data required to make use of a modern machine-learning
model often proves difficult. While a simple online search returns thousands of
pictures of a particular object, and millions of Wikipedia articles are downloaded
in seconds, collecting a single example of a robot task requires actually operating a
robot, in real time. Not only is this associated with a tremendous overhead, but the
data collected during this experiment using a particular policy or controller is also
not always informative of the system and its behavior when it has gone through
training. This has seemingly made the progress of machine learning in control of
physical systems lag behind, and traditional methods are still dominating today.
Design methods based on control theory have long served us well. Complex prob-
lems are broken down into subproblems which are easily solved. The complexity
arising when connecting these subsystems together is handled by making the
design of each subsystem robust to uncertainties in its inputs [Åström and Murray,
2010]. While this has been a very successful strategy, it leaves us with a number of
questions. Are we leaving performance on the table by connecting individually
designed systems together instead of optimizing the complete system? Are we
wasting effort designing subsystems using time-consuming, traditional methods,

13
Chapter 1. Introduction

when larger, more complex subsystems could be designed automatically using


data-based methods?
In this thesis, we will draw inspiration from both classical system identification
and machine learning. The combination is potentially powerful, where system
identification’s deep roots in physics and domain knowledge allow us to use flexi-
ble machine-learning methods in applications where the data alone is insufficient.
The motivation for the presented research mainly comes from the projects Flexi-
fab and SARAFun. The Flexifab project investigated the use of industrial robots for
friction stir welding, whereas the SARAFun project considered robotic assembly.
While these two projects may seem dissimilar, and indeed they are, they have both
presented research problems within estimation in physical systems. The thesis
is divided into three parts, not related to the project behind the research, but
rather based on the types of problems considered. In Part I, we consider modeling,
learning and identification problems. Many of these problems were encountered
in a robotics context but result in generic methods that extend outside the field
of robotics. We also illustrate the use of some of the developed methods in rein-
forcement learning and trajectory optimization. In Part II, we consider problems
motivated by the friction-stir-welding (FSW) process. FSW is briefly introduced,
whereafter we consider a number of calibration problems, arising in the FSW
context, but finding application also outside of FSW [Bao et al., 2017; Chalus and
Liska, 2018; Yongsheng et al., 2017]. We also discuss state estimation in the FSW
context, a problem extending to general machining with industrial manipulators.
The outline of the thesis is given in Chap. 2 and visualized graphically in Fig. 1.1.

Topic distribution per chapter


1.0
.

0.8
lib
Ca
d

0.6
an
t.

s
Es

ic

0.4
am
e
at

yn
St

gD

0.2
in

g
in
n
ar

el
od

0.0
Le

g
L

n
n

W
n

SW
s

RL
in

el

el
M

LQ

io
io

tio

tio
FS
od

od
m

at

ed
at

rF
nd

ic

ra
m

im
im

xM
m

as
Fr

fo
lib
.a

ra

st

-B
st

n
d

Ca
Bo
og

LT

lE
ee

io
s.I

el
Pr

at
od
tra
Sy

at

ck

tim
ic
St

M
ec
a
am

Bl

Sp

es
yn

e
at
D

St

Figure 1.1 This thesis can be divided into three main topics. This figure indicates
the topic distribution for each chapter, where a dark blue color indicates a strong
presence of a topic. The topic distribution was automatically found using latent
Dirichlet allocation (LDA) [Murphy, 2012].

14
1.1 Notation

1.1 Notation
Notation frequently used in the thesis is summarized in Table 1.1. Many methods
developed in the thesis are applied within robotics and we frequently reference
different coordinate frames. The tool-flange frame T F is attached to the tool
flange of a robot, the mechanical interface between the robot and the payload or
tool. The robot base frame RB is the base of the forward-kinematics function of
a manipulator, but could also be, e.g., the frame of an external optical tracking
system that measures the location of the tool frame in the case of a flying robot
etc. A sensor delivers measurements in the sensor frame S . The joint coordinates,
e.g., joint angles for a serial manipulator, are denoted q. The vector of robot joint
torques is denoted τ, and external forces and torques acting on the robot are
gathered in the wrench f. The Jacobian of a function is denoted J and the Jacobian
of a manipulator is denoted J (q). We use k to denote a vector of parameters to
be estimated except in the case of deep networks, which we parameterize by the
weights w. The gradient of a function f with respect to x is denoted ∇x f . We use
x t to denote the state vector at time t in Markov systems, but frequently omit this
time index and use x + to denote x t +1 in equations where all other variables are
given at time t . The matrix 〈s〉 ∈ so is formed by the elements of a vector s and has
the skew-symmetric property 〈s〉 + 〈s〉T = 0 [Murray et al., 1994].

Table 1.1 Definition and description of coordinate frames, variables and notation.

RB Robot base frame.


TF Tool-flange frame, attached to the TCP.
S Sensor frame.
q ∈ Rn Joint Coordinate
q̇ ∈ Rn Joint velocity
τ ∈ Rn Torque vector
f ∈ R6 External force/torque wrench
R BA ∈ SO(3) Rotation matrix from B to A
T AB ∈ SE (3) Transformation matrix from B to A
F k (q) ∈ SE (3) Robot forward kinematics at pos. q
J (q) ∈ R6×n Manipulator Jacobian at pos. q
〈s〉 ∈so(3) Skew-symmetric matrix with parameters s ∈ R3
x + = f (x, u) Rn × Rm 7→ Rn Dynamics model
x State variable
u Input/control signal
x+ x at the next sample instant
k Parameter vector
∇x f Gradient of f with respect to x
x̂ Estimate of variable x
xi : j Elements i , i + 1, ..., j of x

15
2
Publications and
Contributions

The contributions of this thesis and its author, as well as a list of the papers this
thesis is based on, are detailed below.

Included publications

This thesis is based on the following publications:

Bagge Carlson, F., A. Robertsson, and R. Johansson (2015a). “Modeling and identi-
fication of position and temperature dependent friction phenomena without
temperature sensing”. In: Int. Conf. Intelligent Robots and Systems (IROS),
Hamburg. IEEE.
Bagge Carlson, F., R. Johansson, and A. Robertsson (2015b). “Six DOF eye-to-hand
calibration from 2D measurements using planar constraints”. In: Int. Conf.
Intelligent Robots and Systems (IROS), Hamburg. IEEE.
Bagge Carlson, F., A. Robertsson, and R. Johansson (2017). “Linear parameter-
varying spectral decomposition”. In: 2017 American Control Conf (ACC), Seat-
tle.
Bagge Carlson, F., A. Robertsson, and R. Johansson (2018a). “Identification of LTV
dynamical models with smooth or discontinuous time evolution by means
of convex optimization”. In: IEEE Int. Conf. Control and Automation (ICCA),
Anchorage, AK.
Bagge Carlson, F., R. Johansson, and A. Robertsson (2018b). “Tangent-space regu-
larization for neural-network models of dynamical systems”. arXiv preprint
arXiv:1806.09919.

In the publications listed above, F. Bagge Carlson developed manuscripts,


models, identification procedures, implementations and performed experiments.
A. Robertsson and R. Johansson assisted in improving the manuscripts.

16
Chapter 2. Publications and Contributions

Bagge Carlson, F., M. Karlsson, A. Robertsson, and R. Johansson (2016). “Particle


filter framework for 6D seam tracking under large external forces using 2D
laser sensors”. In: Int. Conf. Intelligent Robots and Systems (IROS), Daejeong,
South Korea.

In this publication, F. Bagge Carlson contributed with a majority of the im-


plementation and structure of the state estimator and manuscript. M. Karlsson
assisted in parts of the implementation and contributed with ideas on the struc-
ture of the state estimator, as well as assistance in preparing the manuscript. A.
Robertsson and R. Johansson assisted in improving the manuscript.
Parts of the work presented in this thesis have previously been published in
the Licentiate Thesis by the author

Bagge Carlson, F. (2017). Modeling and Estimation Topics in Robotics. Licentiate


Thesis TFRT-3272. Dept. Automatic Control, Lund University, Sweden.

Other publications

The following papers, authored or co-authored by the author of this thesis, cover
related topics in robotics but are not included in this thesis:

Bagge Carlson, F., N. D. Vuong, and R. Johansson (2014). “Polynomial reconstruc-


tion of 3D sampled curves using auxiliary surface data”. In: 2014 IEEE Int. Conf.
Robotics and Automation (ICRA) Hong-Kong.
Stolt, A., F. Bagge Carlson, M. M. Ghazaei Ardakani, I. Lundberg, A. Robertsson, and
R. Johansson (2015). “Sensorless friction-compensated passive lead-through
programming for industrial robots”. In: Int. Conf. Intelligent Robots and Systems
(IROS), Hamburg.
Karlsson, M., F. Bagge Carlson, J. De Backer, M. Holmstrand, A. Robertsson, and
R. Johansson (2016). “Robotic seam tracking for friction stir welding under
large contact forces”. In: 7th Swedish Production Symposium (SPS), Lund.
Karlsson, M., F. Bagge Carlson, J. De Backer, M. Holmstrand, A. Robertsson, R.
Johansson, L. Quintino, and E. Assuncao (2019). “Robotic friction stir welding,
challenges and solutions”. Welding in the World, The Int. Journal of Materials
Joining. ISSN: 0043-2288. Submitted.
Karlsson, M., F. Bagge Carlson, A. Robertsson, and R. Johansson (2017). “Two-
degree-of-freedom control for trajectory tracking and perturbation recovery
during execution of dynamical movement primitives”. In: 20th IFAC World
Congress, Toulouse.
Bagge Carlson, F. and M. Haage (2017). YuMi low-level motion guidance using the
Julia programming language and Externally Guided Motion Research Inter-
face. Technical report TFRT-7651. Department of Automatic Control, Lund
University, Sweden.

17
Chapter 2. Publications and Contributions

Outline and Contributions


The following is an outline of the contents and contributions of subsequent chap-
ters.
Chapters 3 to 6 serve as an introduction and the only contribution is the
organization of the material. An attempt at highlighting interesting connections
between control theory, system identification and machine learning is made,
illustrating similarities between the fields. Methods from the literature serving as
background and inspiration for the contributions outlined in subsequent chapters
are introduced here.
Chapter 7 is based on “Identification of LTV Dynamical Models with Smooth
or Discontinuous Time Evolution by means of Convex Optimization” and presents
a framework for identification of Linear Time-Varying models. The contributions
made in the chapter include

• Organization of identification methods into a common framework.

• Development of efficient algorithms for solving a set of optimization prob-


lems based on dynamic programming.

• Proof of well-posedness for a set of optimization problems.

• Modification of a standard dynamic-programming algorithm to allow inclu-


sion of prior information.

Usage of the proposed methods is demonstrated in numerical examples and an


open-source framework implementing the methods is made available. Methods
developed in this chapter are further used in Chap. 11.
Chapter 8 is based on “Tangent-Space Regularization for Neural-Network
Models of Dynamical Systems” and treats identification of dynamics models
using methods from deep learning. The chapter provides an analysis of how
standard deep-learning regularization affects the learning of dynamical systems
and a new regularization approach is proposed and shown to introduce less bias
compared to traditional regularization. Structural choices in the deep-learning
model are further viewed from a dynamical-systems perspective and the effects of
these choices are analyzed from an optimization perspective. The discussion is
supported by extensive numerical evaluation.
Chapter 9 is based on “Modeling and identification of position and tempera-
ture dependent friction phenomena without temperature sensing” and introduces
two new friction models. It is shown how, for some industrial manipulators, the
joint friction varies with the joint angle. A model and identification procedure for
this angle-dependent friction is introduced and verified experimentally to reduce
errors in friction modeling.
Chapter 9 further introduces a friction model that makes use of estimated
power losses due to friction. Power losses are known to increase the joint tempera-
ture and in turn, influence friction. The main benefit of the model is the offline
identification and open-loop application, eliminating the need for adaptation of

18
Chapter 2. Publications and Contributions

friction parameters during operation. Also this model is verified experimentally as


well as in simulations.
The work in Chap. 10 was motivated by observations gathered during the
work presented in Chap. 9, where residuals from friction modeling indicated the
presence of a highly periodic disturbance. Analysis of this disturbance, which
turned out to be modulated by the velocity of the joint, led to the development of a
new spectral estimation method, the main topic and contribution of this chapter.
The method decomposes the spectrum of a signal along an auxiliary dimension
and allows for the estimation of a functional dependence between the auxiliary
variable and the Fourier coefficients of the signal under analysis. The method
was demonstrated on a simulated signal as well as applied to the residual signal
from Chap. 9. The chapter also includes a statistical proof of consistency of the
proposed method.
In Chap. 11, usage of the methods developed in Chapters 7 and 8 is illustrated
in an application of model-based reinforcement learning, parts of which were
originally introduced in “Identification of LTV Dynamical Models with Smooth
or Discontinuous Time Evolution by means of Convex Optimization”. It is shown
how the regularized methods presented in Chap. 7 allow solving a model-based
trajectory-optimization problem without any prior model of the system. It is
further shown how incorporating the deep-learning models of Chap. 8 using
the modified dynamic-programming solver presented in Chap. 7 can accelerate
the learning procedure by accumulating experience between experiments. The
combination of dynamics model and learning algorithm was shown to result in a
highly data-efficient reinforcement-learning algorithm.
Chapter 12 introduces the friction-stir-welding (FSW) process that served
as motivation for the conducted research. Chapter 13 introduces algorithms to
calibrate force sensors and laser sensors that make use of easily gathered data,
important for practical application of the methods.
The algorithm developed for calibration of force/torque sensors solves a con-
vex relaxation of an optimization problem, and it is shown how the optimal so-
lution to the originally constrained problem is obtained by a projection onto
the constraint set. The main benefit of the proposed algorithm is its numerical
robustness and the lack of requirement for special calibration equipment.
The algorithm proposed for calibration of laser sensors, originally presented
in “Six DOF eye-to-hand calibration from 2D measurements using planar con-
straints”, was motivated by the FSW process and finds the transformation matrix
between the coordinate systems of the sensor and the tool. This method elimi-
nates the need for special-purpose equipment in the calibration procedure and
was shown to be robust to errors in the required initial guess. Use of the algorithm
was demonstrated in both simulations and using a real sensor.
Chapter 14 is based on “Particle Filter Framework for 6D Seam Tracking Under
Large External Forces Using 2D Laser Sensors” and builds upon the work from
Chap. 13. In this chapter, a state estimator capable of incorporating the sensing
modalities described in Chap. 13 is introduced. The main contribution is an inte-
grated framework for state estimation in the FSW context, with discussions about,

19
Chapter 2. Publications and Contributions

and proposed solutions to, many unique problems arising in the FSW context.
The chapter also outlines an open-source software framework for simulation of
the state-estimation procedure, intended to guide the user in application of the
method and assembly of the hardware sensing.
The thesis is concluded in Sec. 14.5 with a brief discussion around directions
for future work.

Software
The research presented in this thesis is accompanied by open-source software
implementing all proposed methods and allowing reproduction of simulation
results. A summary of the released software is given below.

[Robotlib.jl, B.C., 2015] Robot kinematics, dynamics and calibration. Imple-


ments [Bagge Carlson et al., 2015b; Bagge Carlson et al., 2015a].

[Robotlab.jl, B.C. et al., 2017 ] Real-time robot controllers in Julia. Connections


to ABB robots [Bagge Carlson and Haage, 2017].

[LPVSpectral.jl, B.C., 2016] (Sparse and LPV) Spectral estimation methods, im-
plements [Bagge Carlson et al., 2017].

[PFSeamTracking.jl, B.C. et al., 2016] Seam tracking and simulation [Bagge Carl-
son et al., 2016].

[LowLevelParticleFilters.jl, B.C., 2018] General state estimation and parameter


estimation for dynamical systems.

[BasisFunctionExpansions.jl, B.C., 2016] Tools for estimation and use of basis-


function expansions.

[DifferentialDynamicProgramming.jl, B.C., 2016] Optimal control and model-


based reinforcement learning.

[DynamicMovementPrimitives.jl, B.C. et al., 2016] DMPs in Julia, implements


[Karlsson et al., 2017].

[LTVModels.jl, B.C., 2017] Implements all methods in [Bagge Carlson et al.,


2018b].

[JacProp.jl, B.C., 2018] Implements all methods in [Bagge Carlson et al., 2018a].

20
Part I

Model Estimation
3
Introduction—System
Identification and Machine
Learning

Estimation, identification and learning are three words often used to describe
similar notions. Different fields have traditionally preferred one or the other, but
no matter what term has been used, the concepts involved have been similar, and
the end goals have been the same. The machine learning community talks about
model learning. The act of observing data generated by a system and building a
model that can either predict the output given an unseen input, or generate new
data from the same distribution as the observed data was generated from [Bishop,
2006; Murphy, 2012; Goodfellow et al., 2016]. The control community, on the other
hand, talks about system identification, the act of perturbing a system using a
controlled input, observing the response of the system and estimating/identifying
a model that agrees with the observations [Ljung, 1987; Johansson, 1993]. Although
terminology, application and sometimes also methods have differed, both fields
are concerned with building models that capture structure observed in data.
This thesis will use the terms more or less interchangeably and they will always
refer to solving an optimization problem. The function we optimize is specifically
constructed to encode how well the model agrees with the observations, or rather,
the degree of mismatch between the model predictions and the data. Optimiza-
tion of a cost function is a very common and the perhaps dominating strategy
in the field, but approaches such as Bayesian inference offer an alternative strat-
egy, focusing on statistical models. Bayesian methods offer interesting and often
valuable insight into the complete posterior distribution of the model parameters
after having observed the data [Bishop, 2006; Murphy, 2012]. This comes at the
cost of computational complexity. Bayesian methods often involve intractable
high-dimensional integration, necessitating approximate solution methods such
as Monte Carlo methods. Variational inference is another popular approximate so-
lution method that transform the Bayesian inference problem to an optimization
problem over a parameterized probability density [Bishop, 2006; Murphy, 2012].

23
Chapter 3. Introduction—System Identification and Machine Learning

No matter what learning paradigm one chooses to employ, a model structure


must be chosen before any learning or identification can begin. The choice of
model is not always trivial and must be guided by application-specific goals. Are
we estimating a model to learn something about the system or to predict future
output of the system? Do we want to use the model for simulation or control
synthesis?
Linear models offer a strong baseline, they are easy to fit and provide excellent
interpretability. While few systems are truly linear, many systems are described
well locally by a linear model [Åström and Murray, 2010; Glad and Ljung, 2014]. A
system actively regulated to stay around an operating point is, for instance, often
well described by a linear model. Linear models further facilitate easy control
design thanks to the very well-developed theory for linear control system analysis
and synthesis.
When a linear model is inadequate, we might consider first principles and
specify a gray-box model, a model with well motivated structure but unknown
parameters [Johansson, 1993]. The parameters are then chosen so as to agree with
observed data. Specification of a gray-box model requires insight into the physics
of the system. Complicated systems might defy our efforts to write down simple
governing equations, making gray-box modeling hard. However, when we are able
to use them, we are often rewarded with further insight into the system provided
by the identification of the model parameters.
A third modeling approach is what is often referred to as black-box modeling.
We refer to the model as a black box since it offers little or no insight into how the
system is actually working. It does, however, offer potentially unlimited modeling
flexibility, the ability to fit any data-generating system [Sjöberg et al., 1995]. The
structure of the black-box model is chosen so as to promote both flexibility, but
also ease of learning. Alongside giving up interpretability1 of the resulting model,
the fitting of black-box models is associated with the risk of overfitting—a failure
to capture the true governing mechanisms of the system [Murphy, 2012]. An overfit
model agrees very well with the data used for training, but fails to generalize to
novel data. A common explanation for the phenomenon is the flexible model being
deceived by noise present in the data. Combatting overfitting has been a major
research topic for a long time and remains so today. Oftentimes, regularization—a
restriction of flexibility—is employed, a concept this thesis will explore in detail
and make great use of.

3.1 Models of Dynamical Systems

For control design and analysis, Linear Time-Invariant (LTI) models have been
hugely important, mainly motivated by their simplicity and the fact that both
performance and robustness properties are well understood. The identification
of linear models shares these properties in many regards, and has been a staple
1 Interpretable machine learning is an emerging field trying to provide insight into the workings of
black-box models.

24
3.1 Models of Dynamical Systems

of system identification since the early beginning [Ljung, 1987]. Not only are the
theory and properties of linear identification well understood, the computational
complexity of many of the linear identification algorithms is also favorable.
Methods that have been made available by decades of progression of Moore’s
law are, however, often underappreciated among system identification practi-
tioners. With the computational power available today, one can solve large op-
timization problems and high dimensional integrals, leading to the emergence
of the fields of deep learning [Goodfellow et al., 2016], large-scale convex opti-
mization [Boyd and Vandenberghe, 2004] and Bayesian nonparametrics [Hjort
et al., 2010; Gershman and Blei, 2011]. In this thesis, we hope to contribute to
bridging some of the gap between the system-identification literature and modern
machine learning. We believe that the interchange will be bidirectional, because
even though new powerful methods have been developed in the learning com-
munities, classical system identification has both useful domain knowledge and a
strong systems-theoretical background, with well developed concepts such as sta-
bility, identifiability and input design, that are seldom talked about in the learning
community.

Prediction error vs. simulation error


Common for all models linear in the parameters, is that paired with a quadratic
cost function, the solution to the prediction error problem is available on closed
form [Johansson, 1993]. Linear time-invariant (LTI) dynamic models on the form
(3.1) are no exceptions and they can indeed be estimated from data by solving the
normal equations, provided that the full state is measured.

x t +1 = Ax t + Bu t + v t
yt = xt + e t (3.1)

In (3.1), x ∈ Rn , y ∈ Rp and u ∈ Rm are the state, measurement and input respec-


tively.
The name prediction error method (PEM) refers to the minimization of the
prediction errors

x + − x̂ + = v
x̂ + = Ax + Bu (3.2)

and PEM constitutes the optimal method if all errors are equation errors [Ljung,
1987], i.e., e = 0. If we instead adopt the model v = 0, we arrive at the output-error
or simulation-error method [Ljung, 1987; Sjöberg et al., 1995], where we minimize

y + − x̂ + = e
x̂ + = A x̂ + Bu (3.3)

The difference between (3.2) and (3.3) may seem subtle, but has big consequences.
In (3.3), no measurements of x are ever used to form the predictions x̂. Instead, the

25
Chapter 3. Introduction—System Identification and Machine Learning

model is applied recursively with previous predictions as inputs. In (3.2), however,


a measurement of x is used as input to the model to form the prediction x̂ + . While
the prediction error can be minimized with standard LS, output error minimiza-
tion is a highly nonlinear problem that requires additional care. Sophisticated
methods based on matrix factorizations exist for solving the OE problem for linear
models [Verhaegen and Dewilde, 1992], but in general, the problem is hard. The
difficulty stems from the recursive application of the model parameters, intro-
ducing the risk for exploding/vanishing gradients and nonconvex loss surfaces.
The system-identification literature is full of methods to mitigate these issues, the
more common of which include multiple shooting and collocation [Stoer and
Bulirsch, 2013].
Optimization of the simulation-error metric leads to long-standing challenges
that have resurfaced recently in the era of deep learning [Goodfellow et al., 2016].
The notion of backpropagation through time for training of modern recurrent
neural networks and all its associated computational challenges are very much the
same challenges as those related to solving the simulation-error problem. When
simulating the system more than one step forward in time, the state sequence
becomes a product of both parameters and previous states, which in turn are
functions of the parameters. While an LTI model is linear in the parameters, the
resulting optimization problem is not. Both classical and recent research have
made strides towards mitigating some of these issues [Hochreiter and Schmidhu-
ber, 1997; Stoer and Bulirsch, 2013; Xu et al., 2015], but the fundamental problem
remains [Pascanu et al., 2013]. One of the classical approaches, multiple shooting
[Stoer and Bulirsch, 2013], successfully mitigates the problem with a deep com-
putational graph by breaking it up and introducing constraints on the boundary
conditions between the breakpoints. While methods like multiple shooting work
well also for training of recurrent neural networks, they are seldom used, and the
deep-learning community has invented its own solutions [Pascanu et al., 2013].

3.2 Stability

An important theoretical aspect of dynamical systems is the notion of stability


[Khalil, 1996; Åström and Murray, 2010]. Loosely speaking, a stable system is one
where neither the output nor the internal state of the system goes to infinity
unless we supply an infinite input. When estimating a model for a system known
to be stable, one would ideally like to obtain a stable model. Some notions of
stability imply the convergence of system trajectories to, e.g., an equilibrium point
or a limit cycle. The effect of perturbations, noise or small model errors will for
a stable model have an eventually vanishing effect. For unstable systems and
models, small perturbations in initial conditions or perturbations to the trajectory
can have an unbounded effect. For simulation, obtaining a stable model of a
stable system is thus important. Many model sets, including the set of LTI models
of a particular dimension, include unstable models. If the feasible set contains
unstable models, the search for the model that best agrees with the data is not

26
3.3 Inductive Bias and Prior Knowledge

guaranteed to return a stable model. One can imagine many ways of dealing
with this issue. A conceptually simple way is to search only among stable models.
This strategy is in general hard, but successful approaches include [Manchester
et al., 2012]. Model classes that include only stable models may unfortunately be
restrictive and limit the use of intuition in choosing a model architecture. Another
strategy is to project the found model onto a subset of stable models, provided that
such a projection is available. There is, however, no guarantee that the projection
is the optimal model in the set of stable models. A hybrid approach is to, in each
iteration of an optimization problem, project the model onto the set of stable
models, a technique that in general gradient-based optimization is referred to as
projected gradient descent [Goldstein, 1964]. The hope with such a strategy is that
the optimization procedure will stay close to the desired target set and thus seek
out favorable points within this set, whereas projection of only the final solution
might allow the optimization procedure to stray far away from good solutions
within the desired target set. A closely related approach will be used in Chap. 13,
where the optimization variable is a rotation matrix in SO(3), a space which is
easy to project onto but harder to optimize over directly.
The set of stable discrete-time LTI models is easy to describe; as long as the A
matrix in (3.1) has eigenvalues no greater than 1, the model is stable [Åström and
Murray, 2010; Glad and Ljung, 2014]. If the eigenvalues are strictly less than one,
the model is exponentially stable and all energy contained within the system will
eventually decay to zero. For nonlinear models, characterizing the set of stable
models is in general much harder. One way of proving that a nonlinear system is
stable is to find a Lyapunov function. Systematic ways of finding such a function
are unfortunately lacking.

3.3 Inductive Bias and Prior Knowledge


In the control literature, it is well known that a continuous-time linear system
with long time constants correspond to small eigenvalues of the dynamics matrix,
or eigenvalues close to 1 in the discrete-time case. The success of the LSTM
(Long Short-Term Memory), a form of recurrent neural network [Hochreiter and
Schmidhuber, 1997], in learning long time dependencies seem natural in this light.
The LSTM is essentially introducing an inductive bias towards models with long
time constants.
In fact, many success stories in the deep-learning field can be traced back to
the invention of a model architecture with appropriate inductive bias for a specific
task. The perhaps most prominent example of this is the success of convolutional
neural networks (CNN) for computer vision tasks. Ulyanov et al. (2017) showed
that a CNN can be used remarkably successfully for computer vision tasks such
as de-noising and image in-painting completely without pre-training. The CNN
architecture simply learns to fit the desirable structure in a single natural image
much faster and better than it fits, say, random noise. Given enough training
epochs, the complex neural network manages to fit also the noise, showing that

27
Chapter 3. Introduction—System Identification and Machine Learning

the capacity is there. The inductive bias, however, is clearly more towards natural
images.
Closely related to inductive bias are the concepts of statistical priors and
regularization, both of which are explicit attempts at endowing the model with
inductive bias [Murphy, 2012]. The concept of using regularization to encode prior
knowledge will be used extensively in the thesis.
A different approach to encoding prior knowledge is intelligent initialization
of overparameterized models. It is well known that the gradient descent algorithm
converges to the minimum-norm solution for overparametereized convex prob-
lems if initialized near zero [Wilson et al., 2017]. This can be seen as an implicit
bias or regularization, encoded by the initialization. Similarly, known time con-
stants can be encoded by initialization of matrices in recurrent mappings with
well chosen eigenvalues, or as differentiation chains etc. This topics will not be
discussed much further in this thesis, but may be worthwhile considering during
modeling and training.
Can the problem of estimating models for dynamical control system be re-
duced to that of finding an architecture with the appropriate inductive bias? We
argue that it is at least beneficial to have the model architecture working with us
rather than against us. The question then becomes: How can we construct our
models such that they want to learn good dynamics models? Decades of research
in classical control theory and system identification hopefully become useful in
answering these questions. We hope that the classical control perspective and the
modern machine learning perspective come together in this thesis, helping us
finding good models for dynamical systems.

28
4
State Estimation

The state of a system is a collection of information that summarizes everything


one needs to know in addition to the model in order to predict the future state of
the system. As an example, consider a point mass—a suitable state-representation
for this system is its position and velocity. A dynamics model for the movement
of the point mass might be a double integrator with acceleration as input. We
refer to the function of the dynamics model that evolves the state in time as the
state-transition function.
The notion of state is well developed in control. Recurrent neural networks
introduce and learn their own state representation, similar to how subspace-based
identification [Van Overschee and De Moor, 1995] can identify both a state repre-
sentation and a model for linear systems. LSTMs [Hochreiter and Schmidhuber,
1997] were introduced to mitigate vanishing/exploding gradients and to allow
the model to learn a state representation that remembers information on longer
time-scales. Unfortunately, also LSTMs forget; to mitigate this, the attention mech-
anism was introduced by the deep learning community [Xu et al., 2015]. The
attention vector is essentially containing the entire input history, but the use of
it is gated by a nonlinear, learned, model. Attention as used in the literature is a
sequence-to-sequence model, often in a smoothing fashion, where the input is
encoded both forwards and backwards. Use of smoothing is feasible for reasoning
about a system on an episode basis, but not for prediction.
While mechanisms such as attention [Xu et al., 2015] have been very successful
in tasks such as natural language translation, the classical notion of state provides
a terser representation of the information content that can give insight into the
modeled system. Given a state-space model of the system, state estimation refers
to the act of identifying the sequence of states that best agrees with both the
specified model and with observations made of the system. Observations might
come at equidistant or nonequidistant points in time and consist of parts of the
state, the whole state or, in general, a function of the state. We refer to this function
as an observation model.

29
Chapter 4. State Estimation

4.1 General State Estimation


The state-estimation problem is conceptually simple; solve an optimization prob-
lem for the state-sequence that minimizes residuals between model predictions
and observations. How the size of the residuals is measured is often determined
by either practical considerations or statistical assumptions on the noise acting
on the system and the observations. The complexity of this straightforward ap-
proach naturally grows with the length of the data collected.1 Potential mitigations
include moving-horizon estimation [Rawlings and Mayne, 2009], where the op-
timization problem is solved for a fixed-length data record which is updated at
each time step.
It is often desirable to estimate not only the most likely state sequence, but
also the uncertainty in the estimate. Given a generative model of the data, one
can estimate the full posterior density of the state sequence after having seen the
data. Full posterior density estimation is a powerful concept, but exact calculation
is unfortunately only tractable in a very restricted setting, namely the linear-
Gaussian case. In this case, the optimal estimate is given exactly by the Kalman
filter [Åström, 2012], which we will touch upon in Sec. 4.3. Outside the linear and
Gaussian world, one is left with approximate solution strategies, one particularly
successful one being the particle filter.

4.2 The Particle Filter

The particle filter is a sequential Monte-Carlo method for approximate solution of


high dimensional integrals with a sequential structure [Gustafsson, 2010]. We will
not develop much of the theory of particle filters here, but will instead give a brief
intuitive introduction.
We begin by associating a statistical model with the state-transition function
x + ∼ p(x + |x). One example is x + = Ax + v, where v ∼ N (0, 1). At time t = 0, we
may summarize our belief of the state in some distribution p(x 0 ). At the next time
instance t = 1, the distribution of the state will then be given by
Z Z
p(x 1 ) = p(x 1 , x 0 ) dx 0 = p(x 1 |x 0 )p 0 (x 0 ) dx 0 (4.1)

Unfortunately, very few pairs of distributions p(x + |x) and p 0 will lead to a tractable
integral in (4.1) and a distribution p(x + ) that we can represent on closed form. The
particle filter therefore approximates p 0 with a collection of samples or particles
© i ªN
x̂ i =1 , where each particle can be seen as a distinct hypothesis of the correct
state. Particles are easily propagated through p(x + |x) to obtain a new collection
at time t = 1, forming a sampled representation of p(x 1 ).
When a measurement y becomes available, we associate each particle with a
weight given by the likelihood of the measurement given the particle state and the
1 A notable exception to this is recursive least-squares estimation of a linear combination of parame-
ters [Ljung and Söderström, 1983; Åström and Wittenmark, 2013b].

30
4.3 The Kalman Filter

observation model p(y|x). Particles that represent state hypotheses that yield a
high likelihood are determined more likely to be correct, and are given a higher
weight.
The collection of particles will spread out more and more with each application
of the dynamics model f . This is a manifestation of the curse of dimensionality,
since the dimension of the space that the density p(x 0:t ) occupies grows with t .
To mitigate this, a re-sampling step is performed. The re-sampling favors particles
with higher weights and thus focuses the attention of the finite collection of
particles to areas of the state-space with high posterior density. We can thus think
of the particle filter as a continuous analogue to the approximate branch-and-
bound method beam search [Zhang, 1999].
The recent popularity of particle filters, fueled by the increase in available
computational power, has led to a corresponding increase in publications describ-
ing the subject. Interested readers may refer to one of such publications for a
more formal description of the subject, e.g., [Gustafsson, 2010; Thrun et al., 2005;
Rawlings and Mayne, 2009].
We summarize the particle filter algorithm in Algorithm 1

Algorithm 1 A simple particle filter algorithm.


Initialize particles using a prior distribution
repeat
Assign weights to particles using likelihood under observation model p(y|x)
(Optional) Calculate a state estimate based on the weighted collection of
particles
Re-sample particles based on weights
Propagate particles forward using p(x + |x)
until End of time

4.3 The Kalman Filter

The Kalman filter is a well-known algorithm to estimate the sequence of state


distributions in a linear Gaussian state-space system, given noisy measurements
[Åström, 2012]. The Kalman filter operates in one of the very few settings where
the posterior density is available in closed form. Since both the state-transition
function and the observation model are affine transformations of the state, the
Gaussian distribution of the initial state remains Gaussian, both after a time
update with Gaussian noise and after incorporating measurements corrupted
with Gaussian noise. Instead of representing densities with a collection of particles
as we did in the particle filter, we can now represent them exactly by a mean vector
and a covariance matrix.
We will now proceed to derive the Kalman filter to establish the foundation for
extensions provided later in the thesis. To facilitate the derivation, we provide two
well-known lemmas regarding normal distributions:

31
Chapter 4. State Estimation

L EMMA 1
The affine transformation of a normally distributed random variable is normally
distributed with the following mean and variance

x ∼ N (µ, Σ) (4.2)
y = c +Bx (4.3)
y ∼ N (c + B µ, B ΣB T) (4.4)
2

L EMMA 2
When both prior and likelihood are Gaussian, the posterior distribution is Gaus-
sian with

N (µ̄, Σ̄) = N (µ0 , Σ0 ) · N (µ1 , Σ1 ) (4.5)


Σ̄ = (Σ−1 −1 −1
0 + Σ1 ) (4.6)
µ̄ = Σ̄(Σ−1 −1
0 µ0 + Σ1 µ1 ) (4.7)

Proof By multiplying the two probability-denisity functions in (4.5), we obtain


(constants omitted)
³ 1 1 ´
exp − (x − µ0 )TΣ−1
0 (x − µ 0 ) − (x − µ1 )T −1
Σ1 (x − µ1 )
2 2
³ 1 ´
= exp − (x − µ̄)TΣ̄−1 (x − µ̄) (4.8)
2
Σ̄ = (Σ−1 −1 −1
0 + Σ1 ) (4.9)
µ̄ = Σ̄(Σ−1 −1
0 µ0 + Σ1 µ1 ) (4.10)

where the terms in the first equation were expanded, all terms including x col-
lected and the square completed. Terms not including x become part of the
normalization constant and do not determine the mean or covariance. 2

C OROLLARY 1
The equations for the posterior mean and covariance can be written in update
form according to

µ̄ = µ0 + K (µ1 − µ0 ) (4.11)
Σ̄ = Σ0 − K Σ0 (4.12)
−1
K = Σ0 (Σ0 + Σ1 ) (4.13)

Proof The expression for Σ̄ is obtained from the matrix inversion lemma applied
to (4.6) and µ̄ is obtained by expanding Σ̄, first in front of µ0 using (4.12), and
then in front of µ1 using (4.6) together with the identity (Σ−1
0 + Σ1 )
−1 −1
= Σ0 (Σ0 +
Σ1 )−1 Σ1 . 2

32
4.3 The Kalman Filter

We now consider a state-space model of the form

x t +1 = Ax t + Bu t + v t (4.14)
yt = C xt + e t (4.15)

where the noise terms v and e are independent2 and Gaussian with mean zero and
covariance R 1 and R 2 , respectively. The estimation begins with an initial estimate
of the state, x 0 , with covariance P 0 . By iteratively applying (4.14) to x 0 , we obtain

x̂ t |t −1 = A x̂ t −1|t −1 + Bu t −1 (4.16)
T
P t |t −1 = AP t −1|t −1 A + R 1 (4.17)

where both equations follow from Lemma 1 and the notation x̂ i | j denotes the
estimate of x at time i , given information available at time j . Equation (4.17)
clearly illustrates that the covariance after a time update is the sum of a term due
to the covariance from the previous time step and the added term R 1 , which is
the uncertainty added by the state-transition noise v. We further note that the
properties of A determine whether or not these equations alone are stable. For
stable A and u ≡ 0, the mean estimate of x converges to zero with a stationary
covariance given by the solution to the discrete-time Lyapunov equation P =
AP AT + R 1 .
Equations (4.16) and (4.17) constitute the prediction step, we will now proceed
to incorporate also a measurement of the state in the measurement update step.
By Lemma 1, the mean and covariance of the expected measurement is given
by

ŷ t |t −1 = C x̂ t |t −1 (4.18)
y T
P t |t −1 = C P t |t −1C (4.19)

We can now, using Corollary 1, write the posterior measurement as


y
ŷ t |t = C x̂ t |t −1 + K t (y t −C x̂ t |t −1 ) (4.20)
y y y y
P t |t = P t |t −1 − K t P t |t −1 (4.21)
y T T −1
Kt = C P t |t −1C (C P t |t −1C + R 2 ) (4.22)

which, if we drop C in front of both ŷ and P y , and C T at the end of P y , turns into
¡ ¢
x̂ t |t = x̂ t |t −1 + K t y t −C x̂ t |t −1 (4.23)
P t |t = P t |t −1 − K t C P t |t −1 (4.24)
¢−1
K t = P t |t −1C T C P t |t −1C T + R 2
¡
(4.25)

where K is the Kalman gain.


2 The case of correlated state-transition and measurement noise requires only a minor modification,
but is left out for simplicity.

33
5
Dynamic Programming

Dynamic programming (DP) is a general strategy due to Bellman (1953) for solving
problems that enjoy a particular structure, often referred to as optimal substruc-
ture. In DP, the problem is broken down recursively into overlapping sub-problems,
the simplest of which is easy to solve. While DP is used to solve problems in a
diverse set of applications, such as sequence alignment, matrix-chain multipli-
cation and scheduling, we will focus our introduction on the application to op-
timization problems where the sequential structure arises due to time, such as
state-estimation, optimal control and reinforcement learning.

5.1 Optimal Control


A particularly common application of DP is optimal control [Åström, 2012; Bert-
sekas et al., 2005]. Given a cost function c(x, u), a dynamics model x + = f (x, u),
and a fixed controller µ generating u, the sum of future costs at time t can be
written as a sum of the cost in the current time step c t = c(x t , u t ), and the sum of
future costs c t +1 + c t +2 + ... + c T . We call this quantity the value function V µ (x t ) of
the current policy µ, and note that it can be defined recursively as

V µ (x t ) = c t + V µ x t +1 = c t + V µ f (x t , u t )
¡ ¢ ¡ ¢
(5.1)

Of particular interest is the optimal value function V ∗ , i.e., the value function of
the optimal controller µ∗ :

V ∗ (x t ) = min c(x t , u) + V ∗ f (x t , u)
¡ ¡ ¢¢
(5.2)
u

which defines the optimal controller µ∗ = arg minu c(x t , u) + V ∗ f (x t , u) . Thus, if


¡ ¢

we could somehow determine V ∗ , we would be one step closer to having found


the optimal controller (solving for arg minu could still be a difficult problem).
Determining V ∗ is in general hard and the literature is full of methods for both the
general and special cases. We will refrain from discussing the general case here
and only comment on some special cases.

34
5.1 Optimal Control

Linear Quadratic Regulation


Just as the state-estimation problem enjoyed a particularly simple solution when
the dynamics were linear and the noise was Gaussian, the optimal control problem
has a particularly simple solution when the same conditions apply [Åström, 2012].
The value function in the last time step is simply VT∗ = minu c(x T , u) and is thus a
quadratic function in x T . The real magic happens when we note that the set of
convex quadratic functions is closed under summation and minimization, mean-
ing that VT∗−1 = minu c T −1 + VT∗ is also a quadratic function, this time in x t −1 .1
¡ ¢

We can thus both solve for VT∗−1 and represent it efficiently using a single positive
definite matrix. The algorithm for calculating the optimal V ∗ and the optimal
controller µ∗ is in this case called the Linear-Quadratic Regulator (LQR) [Åström,
2012].
The similarity with the Kalman filter is no coincidence. The Kalman filter
essentially solves the maximum-likelihood problem, which when the noise is
Gaussian is equivalent to solving a quadratic optimization problem. The LQR
algorithm and the Kalman filter are thus dual to each other. This duality between
linear control and estimation problems is well known and most classical control
texts discuss it. In Chap. 7, we will explore the similarities further and let them
guide us to efficient algorithms for identification problems.

Iterative LQR
The LQR algorithm is incredibly powerful in the restricted setting where it applies.
In O (T ) time it calculates both the optimal policy and the optimal value function.
Its applicability is unfortunately limited to linear systems, but these systems may
be time varying. An algorithm that makes use of LQR for nonlinear systems is
Iterative LQR (iLQR) [Todorov and Li, 2005]. By linearizing the nonlinear system
along the trajectory, the LQR algorithm can be employed to estimate an optimal
control signal sequence. This sequence can be applied to the nonlinear system
in simulation to obtain a new trajectory along which we can linearize the system
and repeat the procedure. This algorithm is a special case of a more general
algorithm, Differential Dynamic Programming (DDP) [Mayne, 1966], where a
quadratic approximation to both a general cost function and a nonlinear dynamics
model is formed along a trajectory, and the dynamic-programming problem is
solved.
Since both DDP, and the special case iLQR, make use of linear approximations
of the dynamics, a line search or trust region must be employed in order to ensure
convergence. We will revisit this topic in Chap. 11, where we employ iLQR to
solve an reinforcement-learning problem using estimation techniques developed
in Chap. 7.

1 Showing this involves algebra remarkably similar to the derivations in Sec. 4.3.

35
Chapter 5. Dynamic Programming

5.2 Reinforcement Learning


The field of Reinforcement Learning (RL) has grown tremendously in recent years
as the first RL methods making use of deep learning made significant strides to
solving problems that were previously thought to be decades away from a solution.
Noteworthy examples include the victory of the RL system AlphaGO against the
human world champion in the game of GO [Silver et al., 2016].
When both cost function and dynamics are known, solving for V ∗ is referred to
as optimal control [Bertsekas et al., 2005]. If either of the two functions is unknown,
the situation is made considerably more difficult. If the cost is known but the
dynamics are unknown, one common strategy is to perform system identification
and use the estimated model for optimal control. The same can be done with a
cost function that is only available through sampling. Oftentimes, however, the
state space is too large, and one can not hope to obtain globally accurate models of
c and f from identification. In this setting, we may instead resort to reinforcement
learning.
Reinforcement learning is, in all essence, a trial-and-error approach in which
a controller interacts with the environment and uses the observed outcome to
guide future interaction. The goal is still the same, to minimize a cumulative cost.
The way we make use of the observed data to guide future interaction to reach this
goal is what distinguishes different RL methods from each other. RL is very closely
related to the field of adaptive control [Åström and Wittenmark, 2013b], although
the terminology and motivating problems often differ. The RL community often
considers a wider range of problems, such as online advertising and complex
games with a discrete action set, while the adaptive control community long has
had an emphasis on control using continuous action sets and low-complexity
controllers, one of the main areas in which RL techniques have yet to prove
effective.
RL algorithms can be broadly classified using a number of dichotomies; some
methods try to estimate the value function, whereas some methods estimate the
policy directly. Some methods estimate a dynamics model, we call these model-
based methods, whereas some are model free. We indicate how some examples
from the literature fit into this framework in Table 5.1.
Algorithms that try to estimate the value function can further be subdivided
into two major camps; some use the Bellman equation and hence a form of
dynamic programming, whereas some estimate the value function based on ob-
served samples of the cost function alone in a Monte-Carlo fashion.
The failure of RL methods in continuous domains can often be traced back
to their inefficient use of data. Many state-of-the-art methods require on the
order of millions of episodic interactions with the environment in order to learn
a successful controller [Mnih et al., 2015]. A fundamental problem with data
efficiency in many modern RL methods stems from what they choose to model
and learn. Methods that learn the value function are essentially trying to use the
incoming data to hit a moving target. In the early stages of learning, the estimate
of the value function and the controller are sub-optimal. In this early stage, the

36
5.2 Reinforcement Learning

Table 5.1 The RL landscape. Methods marked with * or (*) estimate (may estimate)
a value function and methods marked with a † or (†) estimate (may estimate) an
explicit policy 1 [Levine and Koltun, 2013], 2 [Sutton et al., 2000], 3 [Watkins and
Dayan, 1992], 4 [Sutton, 1991], 5 [Silver et al., 2014], 6 [Rummery and Niranjan, 1994],
7 [Williams, 1988], 8 [Schulman et al., 2015].

Model based Model free


Dynamics known Optimal control (*,†) If simulation/experiments
Policy/Value iteration are very fast
Dynamics unknown Guided Policy Search2 *† Policy gradient3 †
Model free methods Q-learning4 *
with simulation (DYNA5 ) (*,†) DPG6 *†
SARSA7 *
REINFORCE8 †
TRPO9 †

incoming data does not always hold any information regarding the optimal value
function, which is the ultimate goal of learning. Model-based methods, on the
other hand, use the incoming data to learn about the dynamics of the agent and
the environment. While it is possible to imagine an environment with evolving
dynamics, the dynamics are often laws of nature and do not change, or at least
change much slower than the value function and the policy, quantities we are
explicitly modifying continuously. This is one of the main reasons model-based
methods tend to be more data efficient than model-free methods.
Model-based methods are not without problems though. Optimization un-
der an inaccurate model might cause the RL algorithm to diverge. In Chap. 11,
we will make use of models and identification methods developed in Part I for
reinforcement-learning purposes. The strategy will be based on trajectory op-
timization under estimated models and an estimate of the uncertainty in the
estimated model will be taken into account during the optimization.

37
6
Linear Quadratic
Estimation and
Regularization

This chapter introduces a number of well-known topics and serves as an introduc-


tion to the reader unfamiliar with concepts such as singular value decomposition,
linear least-squares, regularization and basis-function expansions. These methods
will be used extensively in this work, where they are only briefly introduced as
needed. Readers familiar with these topics can skip this chapter.

6.1 Singular Value Decomposition


The singular value decomposition (SVD) [Golub and Van Loan, 2012] was first
developed in the late 1800s for bilinear forms, and later extended to rectangular
matrices by [Eckart and Young, 1936]. The SVD is a factorization of a matrix
A ∈ RN ×M on the form
A = U SV T
where the matrices U ∈ RN ×N and V ∈ RM ×M are orthonormal, such that
U TU = UU T = I N and V TV = V V T = I M , and S = diag(σ1 , ..., σm ) ∈ RN ×M is a rect-
angular, diagonal matrix with the singular values on the diagonal. The singular
values are the square roots of the eigenvalues of the matrices A AT and AT A and are
always nonnegative and real. The orthonormal matrices U and V can be shown
to have columns consisting of a set of orthonormal eigenvectors of A AT and AT A
respectively.
One of many applications of the SVD that will be exploited in this thesis is
to find the equation for a plane that minimizes the sum of squared distances
between the plane and a set of points. The normal to this plane is simply the
singular vector corresponding to the smallest singular value of a matrix composed
of all point coordinates. The smallest singular value will in this case correspond to
the mean squared distance between the points and the plane, i.e., the variance of
the residuals.

38
6.2 Least-Squares Estimation

Finding the closest orthonormal matrix


A matrix R is said to be orthonormal if R TR = RR T = I . If the additional fact
det(R) = 1 holds, the matrix is said to be a rotation matrix, an element of the
n-dimensional special orthonormal group SO(n) [Murray et al., 1994; Mooring
et al., 1991].
Given an arbitrary matrix R̃ ∈ R3×3 , the closest rotation matrix in SO(3), in
the sense ||R − R̃||F , can be found by Singular Value Decomposition according to
[Eggert et al., 1997]

R̃ = U SV T (6.1)
 
1
R =U  1 V T (6.2)
T
det(UV )

6.2 Least-Squares Estimation


This thesis will frequently deal with the estimation of models which are linear in
the parameters, and thus can be written on the form

y = Ak (6.3)

where A denotes the regressor matrix and k denotes a vector of coefficients to be


identified. Models on the form (6.3) are commonly identified with the well-known
least-squares procedure [Johansson, 1993]. As an example, we consider the model
y n = k 1 u n + k 2 v n , where a measured signal y is a linear combination of two input
signals u and v. The identification task is to identify the parameters k 1 and k 2 . In
this case, the procedure amounts to arranging the data according to

y1 u1 v1
   
· ¸
 .   . ..  N ×2 k1
y =  ..  , A =  .. . ∈R , k=
k2
yN uN vN

and solving the optimization problem of Eq. (6.4) with solution (6.5).

T HEOREM 1
The vector k ∗ of parameters that solves the optimization problem
°2
k ∗ = arg min ° y − Ak °2
°
(6.4)
k

is given by the closed-form expression


¢−1
k ∗ = ATA AT y
¡
(6.5)

39
Chapter 6. Linear Quadratic Estimation and Regularization

Proof Completion of squares in the least-squares cost function J yields

° °2
J = ° y − Ak °2 = (y − Ak)T(y − Ak)
= y T y − y TAk − k TAT y + k TATAk
³ ¢−1 ´T ³ ¢−1 ´ ¢−1
= k − ATA AT y ATA k − ATA AT y + y T(I − A ATA AT)y
¡ ¡ ¡

where we identify the last expression as a sum of two terms, one that does not
depend on k, and a term which is a positive definite quadratic form (ATA is always
positive (semi)definite). The estimate k ∗ that minimizes J is thus the value that
makes the quadratic form equal to zero. 2
The expression (6.5) is known as the least-squares solution and the full-rank matrix
(ATA)−1 AT is commonly referred to as the pseudo inverse of A. If A is a square
matrix, the pseudo inverse reduces to the standard matrix inverse. If A, however,
is a tall matrix, the equation y = Ak is over determined and (6.5) produces the
solution k ∗ that minimizes (6.4). We emphasize that the important property of
the model y n = k 1 u n + k 2 v n that allows us to find the solution to the optimization
problem on closed-form is that the parameters to be estimated enter linearly. The
signals u and v may be arbitrarily complex functions of some other variable, as
long as these are known.

Consistency
A consistent estimate is one which is asymptotically unbiased and has vanishing
variance as the number of data points grows. The consistency of the least-squares
estimate can be analyzed by calculating the bias and variance properties. Consider
the standard model, with an added noise term v, for which consistency is given by
the following theorem:
T HEOREM 2 ¡ ¢−1
The closed-form expression k̂ = ATA AT y is an unbiased and consistent estimate
of k in the model

y = Ak + v
v ∼ N (0, σ2 )
E A v =0
© T ª

Proof The bias and variance of the resulting least-squares based estimate are:

Bias We begin be rewriting the expression for the estimate k̂ as


¢−1
k̂ = ATA AT y
¡
¢−1
= ATA AT(Ak + v)
¡
¢−1
= k + ATA ATv
¡

40
6.2 Least-Squares Estimation
n¡ ¢−1 o
If the regressors are uncorrelated with the noise, E ATA ATv = 0, we can con-
clude that E k̂ = k and the estimate is unbiased.
© ª

Variance The variance is given by


n¡ ¢−1 o
E (k̂ − k)(k̂ − k)T = E ATA ATv v TA(ATA)−T
© ª

= σ2 E (ATA)−1
© ª

= σ2 (ATA)−1

where the second equality holds if v and A are uncorrelated. As N → ∞, we have


σ2 (ATA)−1 → 0, provided that the Euclidean length of all columns in A increases
as N increases. 2

Other loss functions


The least-squares loss function
°2
k ∗ = arg min ° y − Ak °2
°
k

is convex and admits a particularly simple, closed-form expression for the min-
imum. If another norm is used instead of the L 2 norm, the estimate will have
different properties. The choice of other norms will, in general, not admit a solu-
tion on closed form, but for many norms of interest, the optimization problem
remains convex. This fact will in practice guarantee that a global minimum can
be found easily using iterative methods. Many of the methods described in this
thesis could equally well be solved with another convex loss function, such as
the L 1 norm for increased robustness, or the L ∞ norm for a minimum worst-case
scenario. For an introduction to convex optimization and a description of the
properties of different convex loss functions, see [Boyd and Vandenberghe, 2004].

Computation
Although the solutions to the least-squares problems are available in closed form,
¡ ¢−1
it is ill-advised to actually perform the calculation k = ATA AT y [Golub and Van
Loan, 2012]. Numerically more robust strategies include

• performing a Cholesky factorization of the symmetric matrix ATA.

• performing a QR-decomposition of A.

• performing a singular value decomposition (SVD) of A.

where the latter two methods avoid the calculation of ATA altogether, which can
be subject to numerical difficulties if A has a high condition number [Golub and
Van Loan, 2012]. In fact, the method of performing a Cholesky decomposition of

41
Chapter 6. Linear Quadratic Estimation and Regularization

ATA can be implemented using the QR-decomposition since triangular matrix R


obtained by a QR-decomposition of A is a Cholesky factor of ATA:

ATA = (QR)T(QR) = R TR

Many numerical computation tools, including Julia, Matlab and numpy, pro-
vide numerically robust methods to calculate the solution to the least-squares
problem, indicated in Algorithm 2. These methods typically analyze the matrix
A and choose a suitable numerical algorithm to execute based on its properties
[Julialang, 2017].

¡ ¢−1
Algorithm 2 Syntax for solving the least-squares problem k = ATA AT y in differ-
ent programming languages.

k = A\y # Julia
k = A\y % Matlab
k = numpy. l i n a l g . solve (A , y ) # Python with numpy
k <− solve (A , y ) # R

If the number of features, and thus the matrix ATA, is too large for the problem
to be solved by factorizing A, an iterative method such as conjugate gradients or
GMRES [Saad and Schultz, 1986] can solve the problem by performing matrix-
vector products only.

6.3 Basis-Function Expansions


When estimating a functional relationship between two or more variables, i.e.,
y = f (v), a standard initial approach is linear regression using the least-squares
procedure. A strong motivation for this is the fact that the optimal linear combi-
nation of the chosen basis functions, or regressors, is available in closed form. A
typical choice of basis functions is low-order monomials, e.g., a decomposition of
a signal y according to

y = φ(v)k = k 0 + k 1 v 1 + k 2 v 2 + ... + k J v J (6.6)

where φ(v) = [v 0 v 1 ... v J ] is the set of basis function activations. The function
f (v) = φ(v)k can be highly nonlinear and even discontinuous in v, but is linear in
the parameters, making it easy to fit to data.
While the low order monomials v i are easy to work with and provide reason-
able fit when the relationship between y and v is simple, they tend to perform
worse when the relationship is complex.
Intuitively, a basis-function expansion (BFE) decomposes an intricate function
or signal as a linear combination of simple basis functions. The Fourier transform
can be given this interpretation, where an arbitrary signal is decomposed as a sum

42
6.3 Basis-Function Expansions

Gaussian Rectangular Triangular

Signal y = f (v) Reconstruction ŷ = φ(v)k

Figure 6.1 Reconstructions of a sampled signal y = f (v) using different sets of


basis functions. The basis functions used for the decomposition of y is shown in
the background.

of complex-valued sinusoids. Similarly, a stair function can be decomposed as a


sum of step functions.
In many situations, there is no a priori information regarding the relationship
between the free variable v and the dependent variable y, and it might be hard
to choose a suitable set of basis functions to use for a decomposition of the sig-
nal y. In such situations, an alternative is to choose a set of functions with local
support, spread out to cover the domain of v. Some examples of basis ¡ functions¢
with local support are: Gaussian radial basis functions κ(v) = exp −γ(v − µ)2
which carry the implicit assumption of a smooth function f , triangular functions
κ(v) = max(0, 1 − γ|v − µ|) resulting in a piecewise affine ŷ and rectangular func-
tions κ(v) = |v − µ| < γ where γ = ∆µ, resulting in piecewise constant ŷ. In the
last example, we interpret the Boolean values true/false as 1/0. In all cases, µ
determines the center of the basis function and γ determines the width. Examples
of decompositions using these basis functions are shown in Fig. 6.1. In many
situations, there is no a priori information regarding the relationship between
the free variable v and the dependent variable y, and it might be hard to choose
a suitable set of basis functions to use for a decomposition of the signal y. In
such situations, an alternative is to choose a set of functions with local support,
spread out to cover the domain of v. Some examples of basis functions with local
support are: radial basis functions κ(v) = exp −γ(v − µ)2 which carry the implicit
¡ ¢

assumption of a smooth function f , triangular functions κ(v) = max(0, 1−γ|v −µ|)


resulting in a piecewise affine ŷ and rectangular functions κ(v) = |v − µ| < γ where
γ = ∆µ, resulting in piecewise constant ŷ. In the last example, we interpret the
Boolean values true/false as 1/0. In all cases, µ determines the center of the basis
function and γ determines the width. Examples of decompositions using these
basis functions are shown in Fig. 6.1. Other motivations for considering basis
functions with local support include making the modeling more intuitive and the
result easier to interpret. This in contrast to basis functions with global support,

43
Chapter 6. Linear Quadratic Estimation and Regularization

Data f (v) Non normalized Normalized

f (v) 1

−1
−4 −3 −2 −1 0 1 2 3 4
v

Figure 6.2 Basis-function expansions fit to noisy data from the function
f (v) = 0.3v 2 − 0.5 using normalized (-) and nonnormalized (- -) basis functions.
Non-normalized basis functions are shown mirrored in the vertical axis.

such as sigmoid-type functions or monomials.


Basis-function expansions are related to Gaussian processes [Rasmussen,
2004]. We will not make this connection explicit in this work, but the interested
reader might find the Bayesian interpretation provided by a Gaussian process
worthwhile.
The concept of basis-function expansions will be used extensively in the thesis.
The open-source software accompanying many of the papers in this thesis makes
use of basis-function expansions. This functionality has been externalized into the
software package [BasisFunctionExpansions.jl, B.C., 2016], which provides many
convenient methods for working with basis-function expansions.

Normalization
For some applications, it may be beneficial to normalize the kernel vector for each
input point [Bugmann, 1998] such that
à !−1
K
φ̄(v) = κ(v, µi , γi ) φ(v)
X
i =1

One major difference between a standard BFE and a normalized BFE (NBFE) is
the behavior far (in terms of the width of the basis functions) from the training
data. The prediction of a BFE will tend towards zero, whereas the prediction from
an NBFE tends to keep its value close to the boundary of the data. Figure 6.2
shows the fit of two BFEs of the function f (v) = 0.3v 2 − 0.5 together with the basis
functions used. The BFE tends towards zero both outside the data points and in
the interval of missing data in the center. The NBFE on the other hand generalizes
better and keeps its current prediction trend outside the data. The performance
of NBFEs is studied in detail in [Bugmann, 1998].

44
6.4 Regularization

||k||22
P
||k||1 t ||k t ||2

Figure 6.3 Illustration of the effect of different regularization terms on the struc-
ture of the solution vector. The squared 2-norm promotes small components, the
1-norm promotes sparse components (few nonzero elements) and the nonsquared,
group-wise 2-norm promotes sparse groups (clusters of elements).

6.4 Regularization

Regularization is a general concept which can, as a result of several overloaded


meanings of the term, be hard to define. The purposes of regularization include
preventing overfitting by penalizing complexity, improving numerical robustness
at the cost of bias and imposing a special structure on the solution such as sparsity
[Murphy, 2012].
Regularization will in our case amount to adding a term to the cost function of
the optimization problem we are solving according to
°2
minimize ° y − ŷ(k)° + λ2 f (k)
°
k

Examples of f include
° °2
• °k °2 to promote small k
° °
• °k °1 to promote sparse k
° °
• °k t °2 to promote group-sparse k t

with effects illustrated in Fig. 6.3. We will detail the effects of the mentioned
example regularization terms in the following sections.

Squared L 2 -regularized regression


For certain problems, it might be desirable to add a term to the cost function (6.4)
that penalizes the size of the estimated parameter vector, for some notion of
size. This might be the case if the problem is ill-posed, or if we have the a priori
knowledge that the parameter vector is small. Classically, methods such as Akaike
Information Criterion (AIC) and Bayesian Information Criterion (BIC) have penal-
ized the dimensionality of the parameter vector [Ljung, 1987; Johansson, 1993].
These methods are, however, impractical in modern machine learning where the

45
Chapter 6. Linear Quadratic Estimation and Regularization

number of model parameters is very large and therefore seldom used. In machine
learning it is more common to penalize a convex function of the parameter vector,
such as its norm. Depending on the norm in which we measure the size of the
parameter vector, this procedure has many names. For the common L 2 norm,
the resulting method is commonly referred to as Tikhonov regularized regression,
ridge regression or weight decay if one adopts an optimization perspective, or
maximum a posteriori (MAP) estimation with a Gaussian prior, if one adopts a
Bayesian view on the estimation problem [Murphy, 2012]. If the problem is linear
in the parameters, the solution to the resulting optimization problem remains
on a closed form, as indicated by the following theorem. Here, we demonstrate
an alternative way of proving the least-squares solution, based on differentiation
instead of completion of squares.
T HEOREM 3
The vector k ∗ of parameters that solves the optimization problem

k ∗ = arg min
1°° y − Ak °2 + λ °k °2
° ° °
2 2 (6.7)
k 2 2

is given by the closed-form expression

k ∗ = (ATA + λI )−1 AT y (6.8)

Proof Differentiation of the cost function yields

1°° y − Ak °2 + λ °k °2 = 1 (y − Ak)T(y − Ak) + λ k Tk


° ° °
J= 2 2
2 2 2 2
dJ
= −AT(y − Ak) + λk
dk
If we equate this last expression to zero we get

dJ
= −AT(y − Ak) + λk = 0
dk
(ATA + λI )k = AT y
k = (ATA + λI )−1 AT y

Since ATA is positive semi-definite, both first- and second-order conditions for a
minimum are satisfied by k ∗ = (ATA + λI )−1 AT y. 2

R EMARK 1
When deriving the expression for k ∗ by differentiation, the terms 1/2 appearing in
(6.7) are commonly inserted for aesthetical purposes, they do not affect k ∗ . 2

We immediately notice that the solution to the regularized problem (6.7) reduces
to the solution of the ordinary least-squares problem (6.4) in the case λ = 0. The

46
6.4 Regularization

k2 k2

° y − ŷ(k)°2
° °
° y − ŷ(k)°2
° °
° °2
°k ° ° °
°k °
2
1

k1 k1

Figure 6.4 The sparsity-promoting effect of the L 1 norm can be understood by


considering the level surfaces of the penalty function (dashed). The level surface to
the objective function (solid) is likely to hit a particular level surface of ||k||1 on a
corner, whereas this does not hold for the ||k||22 penalty function.

regularization adds the positive term λ to all diagonal elements of ATA, which
reduces the condition number of the matrix to be inverted and ensures that the
problem is well posed [Golub and Van Loan, 2012]. The regularization reduces the
variance in the estimate at the expense of the introduction of a bias.
For numerically robust methods of solving the ridge-regression problem, see,
e.g., the excellent manual by Hansen (1994).

L 1 -regularized regression
L 1 regularization is commonly used to promote sparsity in the solution [Boyd
and Vandenberghe, 2004; Murphy, 2012]. The sparsity promoting effect can be
understood by considering the gradient of the penalty function, which remains
large even for small values of the argument. In contrast, the squared L 2 norm
has a vanishing gradient for small arguments. Further intuition for the sparsity
promoting quality of the L 1 norm is gained by considering the level curves of the
function, see Fig. 6.4. A third way of understanding the properties of the L 1 norm
penalty is as the convex relaxation of the L 0 penalty i I(k i ), i.e., the number of
P
nonzero entries in k.
The L 1 norm is a convex function, but L 1 -regularized problems do not admit a
solution on closed form and worse yet, the L 1 is nonsmooth. When this kind of
problems arises in this thesis, the ADMM algorithm [Parikh and Boyd, 2014] will
be employed to efficiently find a solution.1

Group L 2 -regularized regression


In some applications, the parameter vector k of an optimization problem is natu-
rally divided into groups {k t }Tt=1 . One such situation is the estimation of a linear
1 Our implementation of ADMM makes use of the software package [ProximalOperators.jl,
S. et al., 2016] for efficient and convenient proximal operations.

47
Chapter 6. Linear Quadratic Estimation and Regularization

time-varying model x + = A t x t + B t u t , where k t = vec( ATt B tT ) forms a natural


£ ¤

grouping of variables.
If the parameter vector is known to be group sparse, i.e., some of the groups
are exactly zero, one way to encourage a solution with this property is to add the
group-lasso penalty [Yuan and Lin, 2006]
X° °
°k t ° (6.9)
2
t

The addition of (6.9) to the cost function will promote a solution where the length
of some k t is exactly zero. To understand why this is the case, one can interpret
(6.9) as the L 1 -norm of lengths of vectors k t . Of key importance in (6.9) is that the
norm is nonsquared, as the sum of squared L 2 norms over groups coincides with
the standard squared L 2 norm penalty without groups
X ° °2 ° °2
°k t ° = °k °
2 2
t

The group L 2 -regularization term is also convex but nonsmooth. We employ


the linearized ADMM algorithm [Parikh and Boyd, 2014] to find a solution.

Trend filtering
An important class of signal-reconstruction methods that has been popularized
lately is trend filtering methods [Kim et al., 2009; Tibshirani et al., 2014]. Trend
filtering methods work by specifying a fitness criterion that determines the good-
ness of fit, as well as a regularization term, often chosen with sparsity promoting
qualities. As a simple example, consider the reconstruction ŷ of a noisy signal
y = {y t ∈ R }Tt=1 with piecewise constant segments. To this end, we may formulate
and solve the convex optimization problem
°2
minimize ° y − ŷ °2 + λ | ŷ t +1 − ŷ t |
° X
(6.10)
ŷ t

y

Figure 6.5 Example of signal reconstruction by means of trend filtering with a


sparsity promoting regularization term (6.11).

48
6.5 Estimation of LTI Models

We note that (6.10) can be written on the form


°2
minimize ° y − ŷ °2 + λ °D 1 ŷ °1
° ° °
(6.11)

where D 1 is the first-order difference operator, and we thus realize that the solution
will have a sparse first order time difference, see Fig. 6.5 for an example application.
We remark that trend filtering is a noncasual operation and would with the
terminology employed in this thesis technically be referred to as a smoothing
operation.

6.5 Estimation of LTI Models

Linear time-invariant models are fundamental within the field of control, and
decades of research have been devoted to their identification. We do not intend to
cover much of this research here, but instead limit ourselves to establish notation
and show how an LTI model lends itself to estimation by means of LS if the full
state-sequence is known.
A general LTI model takes the form

x t +1 = Ax t + Bu t + v t , t ∈ [1, T ] (6.12)
y t = C x t + Du t + e t (6.13)

where x ∈ Rn , y ∈ Rp and u ∈ Rm are the state, measurement and input respec-


tively. A discussion around the noise terms v t and e t is deferred until Sec. 7.5,
where we indicate how statistical assumptions on v t influence the cost function
and the properties of the estimate. We further limit ourselves to the case where
the state and input sequences are measured, i.e., C = I . This makes a plethora of
methods for estimating the parameters available. A common method for iden-
tification of systems that are linear in the parameters is the least-squares (LS),
prediction-error method (PEM), which in case of Gaussian noise, v, coincides
with the maximum likelihood (ML) estimate. To facilitate estimation using the LS
method, we write the model on the form y = Ak, and arrange the data according
to

x1
 
 . 
y =  ..  ∈ RT n
xT
k = vec ( AT B T ) ∈ RK
£ ¤

I n ⊗ x 0T I n ⊗ uT0
 
.. ..
A= ∈ RT n×K
 
. . 
T T
I n ⊗ x T −1 I n ⊗ u T −1

49
Chapter 6. Linear Quadratic Estimation and Regularization

where ⊗ denotes the Kronecker product and K = n 2 + nm is the number of model


parameters. We then solve the optimization problem (6.4) with closed-form solu-
tion (6.5).
Any linear system of order n can be represented by a minimal representation
where the matrix A has n free parameters [Ljung, 1987]. The measured state
sequence might, however, not correspond to this minimal representation. Known
sparsity structure in A given a measured state representation can be enforced
using constrained optimization or regularization.
In the next chapter, we will extend our view to linear time-varying models,
where the parameters of the matrices A and B evolve as functions of time.

50
6.5 Estimation of LTI Models

51
7
Estimation of LTV Models

7.1 Introduction
Time-varying systems and models arise in many situations. A common case is the
linearization of a nonlinear system along a trajectory [Ljung, 1987]. A linear time-
varying (LTV) model obtained from such a procedure facilitates control design
and trajectory optimization using linear methods, which are in many respects
better understood than nonlinear control synthesis.
The difficulty of the task of identifying time-varying dynamical models varies
greatly with the model considered and the availability of measurements of the
state sequence. For smoothly changing dynamics, linear in the parameters, the
recursive least-squares algorithm with exponential forgetting (RLSλ) is a common
option. If a Gaussian random-walk model for the parameters is assumed, a Kalman
filtering/smoothing algorithm [Rauch et al., 1965] gives the filtering/smoothing
densities of the parameters in closed form. However, the assumption of Brownian-
walk dynamics is often restrictive. Discontinuous dynamics changes occur, for
instance, when an external controller changes operation mode, when a sudden
contact between a robot and its environment is established, when an unmodeled
disturbance enters the system or when a component in the system suddenly fails.
Identification of systems with nonsmooth dynamics evolution has been stud-
ied extensively. The book by Costa et al. (2006) treats the case where the dynamics
are known, but the state sequence unknown, i.e., state estimation. Nagarajaiah
and Li (2004) examine the residuals from an initial constant dynamics fit to deter-
mine regions in time where improved fit is needed, addressing this need by the
introduction of additional constant dynamics models. Results on identifiability
and observability in jump-linear systems in the noncontrolled (autonomous) set-
ting are available due to Vidal et al. (2002). The main result on identifiability in
[Vidal et al., 2002] was a rank condition on a Hankel matrix constructed from the
collected output data, similar to classical results on the least-squares identifica-
tion of ARX models which appears as rank constraints on the, typically Toeplitz or
block-Toeplitz, regressor matrix. Identifiability of the problems proposed in this
chapter is discussed in Sec. 7.3.
In this work, we draw inspiration from the trend-filtering literature to develop
new system-identification methods for LTV models. In trend filtering, a curve is

52
7.2 Model and Identification Problems

decomposed into a set of polynomial segments. In the identification methods


proposed in this work, we instead consider a multivariable state sequence as the
output of an LTV model, where the model coefficients evolve as polynomial func-
tions of time. We start by defining a set of optimization problems of trend-filtering
character, with a least-squares loss function and carefully chosen regularization
terms. Similar inspiration was seen in [Ohlsson, 2010] for SISO systems, where
generic solvers were employed to find a solution. We proceed to establish a con-
nection between the proposed optimization problems and a statistical model of
the evolution of the parameters constituting the LTV model, and use this model
to guide us to efficient algorithms for solving the problems. We then discuss how
prior information can be utilized to increase the accuracy of the identification in
situations with poor excitation provided by the input signal, and end the chapter
with two examples. The identification methods developed in this chapter are later
used for model-based reinforcement learning in Chap. 11, where we note that
an LTV model can be seen as a first-order approximation of the dynamics of a
nonlinear system around a trajectory. We emphasize that such an approximation
will, in general, fail to generalize far from this trajectory, but many methods in
reinforcement learning and control make efficient use of the linearized dynamics
for optimization, while ensuring validity of the approximation by constraints or
penalty terms. A significant part of this chapter will be devoted to developing
efficient solvers to the proposed methods and in some cases, provide a statistical
interpretation that allows us to quantify the uncertainty in the model coefficients.
This uncertainty estimate will be of importance in Chap. 11 where it allows us to
adaptively constrain the step size of a trajectory optimization algorithm.

7.2 Model and Identification Problems

Linear time-varying models can be formulated in a number of different ways. We


limit the scope of this work to models on the form

x t +1 = A t x t + B t u t + v t
£ ¤T
k t = vec ( A t B t ) (7.1)

where the state-sequence x t is measured, possibly corrupted by measurement


noise, and where the parameters k are assumed to evolve according to the dynam-
ical system

k t +1 = H t k t + w t
y t = I n ⊗ x Tt uTt k t + e t
¡ £ ¤¢
(7.2)

The model (7.1)-(7.2) is limited by its lack of noise models. However, this simple
model will allow us to develop very efficient algorithms for identification. We defer
the discussion on measurement noise to Sec. 7.9.
Upon inspection of (7.2), the connection between the present model identifi-
cation problem and the state-estimation problem of Chap. 4 should be apparent.

53
Chapter 7. Estimation of LTV Models

The model (7.2) implies that the coefficients of the LTV model themselves evolve
according to a linear dynamical system, and are thus amenable to estimation us-
ing state-estimation techniques. If no prior knowledge is available, the dynamics
matrix H t can be taken as the identity matrix, H = I , implying that the model
coefficients follow a random walk dictated by the properties of w t , i.e., the state
transition density function p w (k t +1 |k t ). This particular choice of H corresponds
to the optimization problem we will consider in the following section. The emis-
sion density function is given by p e (y t |x t , u t , k t ). Particular choices of p e and p w
emit data likelihoods concave in the parameters and hence amenable to convex
optimization, a point that will be elaborated upon further in this chapter. We em-
phasize here that the state in the parameter evolution model refers to the current
parameters k t and not the system state x t of (7.1).
The following sections will introduce a number of optimization problems
with different regularization functions, corresponding to different choices of p w ,
and different regularization arguments, corresponding to different choices of H .
We also discuss the properties of the identification resulting from the different
modeling choices. We divide our exposition into a number of cases characterized
by the qualitative properties of the evolution of the parameter state k t .

Low-frequency time evolution


Many systems of practical interest exhibit slowly varying dynamics. Examples
include friction varying due to temperature change or wear and tear, and electricity
demand varying with season.
A slowly varying signal is characterized by small first-order time differences. To
identify slowly varying dynamics parameters, we thus formulate an optimization
problem where we in addition to penalizing the prediction errors of the model
also penalize the squared 2-norm of the first-order time difference of the model
parameters:
°2 °2
minimize ° y − ŷ °2 + λ2 °k t +1 − k t °2
° X°
(7.3)
k t
P
where t denotes the sum over relevant indices t , in this case t ∈ [1, T − 1]. This
optimization problem has a closed-form solution given by

k̃ ∗ = (ÃTÃ + λ2 D T1 D 1 )−1 ÃTỸ (7.4)


k̃ = vec(k 1 , ... , k T )

where à and Ỹ are appropriately constructed matrices and the first-order dif-
°2
ferentiation operator matrix D 1 is constructed such that λ2 °D 1 k̃ °2 equals the
°

second term in (7.3). The computational complexity of computing k ∗ using the


closed-form solution (7.4), O (T K )3 where K = n 2 + nm, becomes prohibitive for
¡ ¢

all but toy problems. An important observation to make to allow for an efficient
method for solving (7.3) is that the cost function is the negative data log-likelihood
of the Brownian random-walk parameter model (7.2) with H = I , which motivates
us to develop a dynamic programming algorithm based on a Kalman smoother.
Details on the estimation algorithms are deferred until Sec. 7.5.

54
7.2 Model and Identification Problems

A system with low-pass character is often said to have a long time constant
[Åström and Murray, 2010]. For a discrete-time linear system, long time constants
correspond to the dynamics matrix having eigenvalues close to the point 1 in
the complex plane. The choice H = I have all eigenvalues at the point 1, reinforc-
ing the intuition of (7.3) promoting a low-frequency evolution of the dynamics.
The connection between eigenvalues, small time-differences and long time con-
stants will be explored further in Chap. 8, where inspiration from (7.3) is drawn to
enhance dynamics models in the deep-learning setting.
An example of identification by solving Eq. (7.3) is provided in Sec. 7.7, where
the influence of λ is illustrated.

Smooth time evolution


Additional assumptions can be put on the qualitative evolution of k t , such as its
evolution being smooth and differentiable. A smoothly varying signal is charac-
terized by small second-order time differences. To identify smoothly time-varying
dynamics parameters, we thus penalize the squared 2-norm of the second-order
time difference of the model parameters, and solve the optimization problem
°2 °2
minimize ° y − ŷ °2 + λ2 °k t +2 − 2k t +1 + k t °2
° X°
(7.5)
k t

Also this optimization problem has a closed-form solution on the form (7.4) with
the corresponding second-order differentiation operator D 2 . Equation (7.5) is
the negative data log-likelihood of a Brownian random-walk parameter model
with added momentum. The matrix H corresponding to this model is derived
in Sec. 7.4, where a Kalman smoother with augmented state is developed to find
the optimal solution. We also extend problem (7.5) to more general regularization
terms in Sec. 7.5.

Piecewise constant time evolution


Thus far, we have considered situations in which the parameter state k t evolves
in a continuous fashion. In the presence of discontinuous or abrupt changes in
the dynamics, estimation by solving (7.3) might perform poorly. A signal which
is mostly flat, with a small number of distinct level changes, is characterized by
a sparse first-order time difference. To encourage a solution where k t remains
approximately constant most of the time, but exhibits sudden changes in dynam-
ics at a few but unspecified number of time steps, we formulate and solve the
problem
°2
minimize ° y − ŷ ° + λ °k t +1 − k t °
° X° °
2 2 (7.6)
k t

We can give (7.6) an interpretation as a grouped-lasso cost function, where instead


of groups being formed out of variables, our groups are defined by differences
between variables. The group-lasso is a sparsity-promoting penalty, hence a solu-
tion in which only a small number of nonzero first-order time differences in the

55
Chapter 7. Estimation of LTV Models

model parameters is favored, i.e., a piecewise constant dynamics evolution. At a


first glance, one might consider the formulation
°2
minimize ° y − ŷ °2 + λ °k t +1 − k t °1
° X° °
(7.7)
k t

which results in a dynamics evolution with sparse changes in the coefficients, but
changes to different entries of k t are not necessarily occurring at the same time
instants. The formulation (7.6), however, promotes a solution in which the change
occurs at the same time instants for all coefficients in A and B , i.e., k t +1 = k t for
most t .
Equations (7.3) and (7.5) admitted simple interpretations as the likelihood of a
dynamical model on the form (7.2). Unfortunately, Eq. (7.6) does not admit an as
simple interpretation. The solution hence requires an iterative solver and is dis-
cussed in Sec. 7.A. Example usage of this optimization problem for identification
is illustrated in Sec. 7.6.

Piecewise constant time evolution with known number of steps


If the maximum number of switches in dynamics parameters, M , is known in
advance, the optimal problem to solve is

° y − ŷ °2
° °
minimize 2
k
X
subject to 1{k t +1 6= k t } ≤ M (7.8)
t

where 1{·} is the indicator function. This problem is nonconvex and we propose
solving it using dynamic programming (DP). The proposed algorithm is outlined
in Sec. 7.B.

Piecewise linear time evolution


The group-sparsity promoting effects of the group-lasso can be explored further. A
piecewise linear signal is characterized by a sparse second-order time difference, i.e.,
it has a small number of changes in the slope. A piecewise linear time-evolution of
the dynamics parameters is hence obtained if we solve the optimization problem
°2
minimize ° y − ŷ °2 + λ °k t +2 − 2k t +1 + k t °2
° X° °
(7.9)
k t

The solution to this problem is discussed in Sec. 7.A.

Summary
The qualitative results of solving the proposed optimization problems are summa-
rized in Table 7.1. The table illustrates how the choice of regularizer and order of
time-differentiation of the parameter vector affect the resulting solution.

56
7.3 Well-Posedness and Identifiability

Table 7.1 Summary of optimization problem formulations. D n refers to parame-


ter vector time-differentiation of order n.

Norm Dn Result
1 1 Small number of steps (piecewise constant)
1 2 Small number of bends (piecewise affine)
2 1 Small steps (slowly varying)
2 2 Small bends (smooth)

Two-step refinement
Since many of the proposed formulations of the optimization problem penalize
the size of the changes to the parameters in order to promote sparsity, a bias
is introduced and solutions in which the changes are slightly underestimated
are favored. To mitigate this issue, a two-step procedure can be implemented
wherein the first step, time instances where the parameter state vector k t changes
significantly are identified, we call these time instances knots. To identify the
knots,°we observe ° the argument° inside the °sum of the regularization term, i.e.,
a t 1 = °k t +1 − k t °2 or a t 2 = °k t +2 − 2k t +1 + k t °2 . Time instances where a t is taking
large values indicate suitable time indices for knots.
In the second step, the sparsity-promoting penalty is removed and equality
constraints are introduced between the knots. The second step can be computed
very efficiently by noticing that the problem can be split into several identical
sub-problems, which each has a closed-form solution on the form

k ∗ = (ATA + λ2 D Tn D n )−1 ATỸ (7.10)

See Sec. 6.5 for additional details on LTI-model identification.

7.3 Well-Posedness and Identifiability

To assess the well-posedness of the proposed identification methods, we start


by noting that the problem of finding A in x t +1 = Ax t given a pair (x t +1 , x t ) is an
ill-posed problem in the sense that the solution is non unique. If we are given
several pairs (x t +1 , x t ), for different t , while A remains constant, the problem
becomes over-determined and well-posed in the least-squares sense, provided
that the vectors of state components {x t(i ) }Tt=1 span Rn . The LTI-case in Sec. 6.5 is
well posed according to classical results when A has full column rank.
When we extend our view to LTV models, the number of free parameters is
increased significantly, and the corresponding regressor matrix à will never have
full column rank, necessitating the introduction of a regularization term. Infor-
mally, for every n measurements, the problem has K = n 2 + nm free parameters.
If we consider the identification problem of Eq. (7.6) and let λ → ∞, the regular-
izer terms essentially become equality constraints. This will enforce a solution in
which all parameters in k are constant over time, and the problem reduces to the

57
Chapter 7. Estimation of LTV Models

LTI-problem. As λ decreases, the effective number of free parameters increases


until the problem gets ill-posed for λ = 0. We formalize the above arguments as

P ROPOSITION 1
Optimization problems (7.3) and (7.6) have unique global minima for λ > 0 if and
only if the corresponding LTI optimization problem has a unique solution. 2

Proof The cost function is a sum of two convex terms. For a global minimum to
be nonunique, the Hessians of the two terms must have intersecting nullspaces.
In the limit λ → ∞ the problem reduces to the LTI problem. The nullspace of the
regularization Hessian, which is invariant to λ, does thus not share any directions
with the nullspace of ÃTÃ which establishes the equivalence of identifiability
between the LTI problem and the LTV problems. 2

P ROPOSITION 2
Optimization problems (7.5) and (7.9) with higher order differentiation in the
regularization term have unique global minima for λ > 0 if and only if there does
not exist any vector v 6= 0 ∈ Rn+m such that

x xT x t uTt
· ¸
C txu v = t tT v = 0 ∀t 2
ut xt u t uTt

Proof Again, the cost function is a sum of two convex terms and for a global
minimum to be nonunique, the Hessians of the two terms must have in-
tersecting nullspaces. In the limit λ → ∞ the regularization term reduces
to a linear constraint set, allowing only parameter vectors that lie along a
line through time. Let ṽ 6= 0 be such a vector, parameterized by t as ṽ =
¤T
2v̄ T · · · t v̄ T · · · T v̄ T ∈ RT K where v̄ = vec({v}n1 ) ∈ RK and v is an ar-
£ T

bitrary vector ∈ Rn+m ; ṽ ∈ null (ÃTÃ) implies that the loss is invariant to the pertur-
ªT
bation αṽ to k̃ for an arbitrary α ∈ R . (ÃTÃ) is given by blkdiag( I n ⊗C txu 1 ) which
©

means that ṽ ∈ null (ÃTÃ) ⇐⇒ αt (I n ⊗C txu )v̄ = 0 ∀(α, t ) ⇐⇒ v̄ ∈ null (I n ⊗C txu ) ∀t ,


which implies v ∈ nullC txu due to the block-diagonal nature of I n ⊗C txu . 2

R EMARK 2
If we restrict our view to constant systems with stationary Gaussian inputs with
covariance Σu , we have as T → ∞, (1/T ) xx T approaching the stationary con-
P
trollability Gramian given by the solution to Σ = AΣAT + B Σu B T. Not surprisingly,
the well-posedness of the optimization problem is thus linked to the excitation of
the system modes through the controllability of the system. 2

For the LTI problem to be well-posed, the system must be identifiable and the
input u must be persistently exciting of sufficient order [Johansson, 1993].

58
7.4 Kalman Smoother for Identification

7.4 Kalman Smoother for Identification


We now elaborate on the connection between the proposed optimization prob-
lems and the dynamical system governing the evolution of the parameter state
k t (7.2).
We note that (7.3) is the negative log-likelihood of the dynamics model (7.2)
with state vector k t . The identification problem is thus reduced to a standard
state-estimation problem.
To develop a Kalman smoother-based algorithm for solving (7.5), we augment
the model (7.2) with the state variable k t0 = k t − k t −1 and note that k t0 +1 − k t0 =
k t +1 − 2k t + k t −1 . We thus introduce the augmented-state model
· ¸ · ¸· ¸ · ¸
k t +1 IK IK kt 0
= + (7.11)
k t0 +1 0K I K k t0 wt
· ¸
¤ kt
y t = I n ⊗ x Tt uTt
£¡ £ ¤¢
0 (7.12)
k t0

which is on a form suitable for filtering/smoothing with the machinery developed


in Sec. 4.3.
General case The Kalman smoother-based identification method can be gen-
eralized to solving optimization problems where the argument in the regularizer
appearing in (7.5) is replaced by a general linear operation on the parameter
vector, P (z)k °2 °2
minimize ° y − ŷ °2 + λ2 °P (z)k t °2
° X°
(7.13)
k t

where P (z) is a polynomial of degree n > 0 in the time-difference operator z.


For (7.5), P (z) equals z 2 − 2z + 1 and P −1 (z) has a realization on the form (7.11).

7.5 Dynamics Priors

As discussed in Sec. 7.3, the identifiability of the parameters in a dynamics model


hinges on the observability of the dynamical system (7.2), or more explicitly, only
modes excited by the input u will be satisfactorily identified. In many practical
scenarios, such as if the identification is part of an iterative learning and control
scheme, e.g., Iterative Learning Control (ILC) or reinforcement learning, it might
be undesirable to introduce excessive noise in the input to improve excitation
for identification. This section will introduce additional information about the
dynamics in the form of priors, which mitigate the issue of poor excitation of the
system modes. The prior information might come from, e.g., a nominal model
known to be inaccurate, or an estimated global model such as a Gaussian mixture
model (GMM). A statistical model of the joint density p(x t +1 , x t , u t ) constructed
from previously collected tuples (x t +1 , x t , u t ), for instance, provides a dynami-
cal model of the system through the conditional probability-density function
p(x t +1 |x t , u t ).

59
Chapter 7. Estimation of LTV Models

We will see that for priors from certain families, the resulting optimization
problem remains convex. For the special case of a Gaussian prior over the dynam-
ics parameters or the output, the posterior mean of the parameter vector is once
again conveniently obtained from a Kalman-smoothing algorithm, modified to
include the prior.

General case
A general prior over the parameter-state variables k t can be specified as p(k t |z t ),
where the variable z t is a placeholder for whatever signal might be of relevance,
for instance, the time index t or state x t . The data log-likelihood of (7.2) with the
prior p(k t |z t ) added takes the form

T
X TX
−1 T
X
log p(k, y|x, z)1:T = log p(y t |k t , x t ) + log p(k t +1 |k t ) + log p(k t |z t ) (7.14)
t =1 t =1 t =1

which factors conveniently due to the Markov property of a state-space model. For
particular choices of density functions in (7.14), notably Gaussian and Laplacian,
the negative log-likelihood function becomes convex. The next section will elab-
orate on the Gaussian case and introduce a recursive algorithm that efficiently
solves for the full posterior. The Laplacian case, while convex, does not admit an
equally efficient algorithm. The Laplacian likelihood is, however, more robust to
outliers in the data, making the trade-off worth consideration.

Gaussian case
If all densities in (7.14) are Gaussian and k is modeled with the Brownian random
walk model (7.2) (Gaussian v t ), (7.14) can be written on the form (scaling constants
omitted)

T °
° y t − ŷ(k t , x t )°2 −1
X °
− log p(k, y|x, z)1:T = Σe
t =1
TX
−1 ° T °
°k t +1 − k t °2 −1 + °µ0 (z t ) − k t °2 −1
° X °
+ Σ Σ (z (7.15)
w 0 t)
t =1 t =1

for some function µ0 (z t ) which produces the prior mean of k t given z t .


Σe , Σw , Σ0 (z t ) are the covariance matrices of the measurement noise, param-
° °2
eter drift and prior respectively and °x °Σ−1 = x TΣ−1 x.
For this special case, we devise a modified Kalman-smoothing algorithm,
where the conditional mean of the state is updated with the prior, to solve the
estimation problem. The standard Kalman-filtering equations for a general linear
system are derived in Sec. 4.3—the modification required to incorporate a Gaus-
sian prior on the state variable p(x t |z t ) = N (µ0 (z t ), Σ0 (z t )) involves a repeated

60
7.6 Example—Jump-Linear System

correction step and takes the form


¢−1
K̄ t = P t |t P t |t + Σ0 (z t )
¡
(7.16)
x̄ t |t = x̂ t |t + K̄ t µ0 (z t ) − x̂ t |t
¡ ¢
(7.17)
P̄ t |t = P t |t − K̄ t P t |t (7.18)

where ¯· denotes the posterior value. This additional correction can be interpreted
as receiving a second measurement µ0 (z t ) with covariance Σ0 (z t ). For the Kalman-
smoothing algorithm, x̂ t |t and P t |t in (7.17) and (7.18) are replaced with x̂ t |T and
P t |T .
A prior over the output of the system, or a subset thereof, is straightforward
to include in the estimation by means of an extra update step, with C , R 2 and y
being replaced with their appropriate values according to the prior.
We remark that although all optimization problems proposed in this chapter
could be solved by generic solvers, the statistical interpretation and solution pro-
vided by the Kalman-smoothing algorithm provide us not only with the optimal
solution, but also an uncertainty estimate. The covariance matrix of the state,
Eq. (4.24), is an estimate of the covariance of the parameter vector for each time
step. This uncertainty estimate may be of importance in downstream tasks using
the estimated model. One such example is trajectory optimization, where it is
useful to limit the step length based on the uncertainty in the model. We will make
use of this when performing trajectory optimization in a reinforcement-learning
setting in Chap. 11, where model uncertainties are taken into account by enforc-
ing a constraint on the Kullback-Leibler divergence between two consecutive
trajectory distributions.

7.6 Example—Jump-Linear System


To demonstrate how the proposed optimization problems may be used for iden-
tification, we now consider a simulated example. We generate a state sequence
from the following LTV system, where the dynamics change from
· ¸ · ¸
0.95 0.1 0.2
At = , Bt =
0.0 0.95 1.0

to · ¸ · ¸
0.5 0.05 0.2
At = , Bt =
0.0 0.5 1.0
at t = 200. The input was Gaussian noise of zero mean and unit variance, state
transition noise and measurement noise (y t = x t +1 +e t ) of zero mean and σe = 0.2
were added. In this problem the parameters change abruptly, a suitable choice of
identification algorithm is thus (7.6). Figure 7.1 depicts the estimated coefficients
in the dynamics matrices after solving (7.6), for a value of λ chosen using the
L-curve method [Hansen, 1994]. The figure indicates that the algorithm correctly

61
Chapter 7. Estimation of LTV Models

Model coefficients k t 0.8

0.6

0.4

0.2

0.0
50 100 150 200 250 300 350 400
Time index

Figure 7.1 Piecewise constant state-space dynamics. True values are shown with
dashed, black lines. Gaussian state-transition and measurement noise with σ = 0.2
were added. At t = 200 the dynamics of
° the
¢ system change abruptly. A suitable
choice of regularization term λ °k + − k °2 allows us to estimate a dynamics model
¡ °

that exhibit an abrupt change in the coefficients, without specifying the number
of such changes a priori. Please note that this figure shows the coefficients of k
corresponding the A-matrix of Eq. (7.1) only.

identifies the abrupt change in the system parameters at t = 200 and maintains
an otherwise near constant parameter vector. This example highlights how the
sparsity-promoting penalty can be used to indicate whether or not something has
abruptly changed the dynamics, without specifying the number of such changes
a priori. The methods briefly discussed in Sec. 7.2 can be utilized, should it be
desirable to have two separate LTI-models describing the system.

7.7 Example—Low-Frequency Evolution

In this example, we simulate the system (7.1)-(7.2) with H = I to generate an LTV


system with low-frequency time evolution of the dynamics. We let the state x ∈ R3
drift with isotropic covariance of σ2v = 0.012 , and let the parameters k drift with
isotropic covariance of σ2w = 0.0012 . The input u ∈ R2 was Gaussian with isotropic
covariance 1. The evolution of the true parameters in the A-matrix is shown in
black in Fig. 7.2.
°2
We estimate LTV models by solving Eq. (7.3) (regularization term λ °k + − k °2 )
°

62
7.7 Example—Low-Frequency Evolution

True parameters
λ2 = 1 λ2 = 100
2
0.4 λ = 10000 λ2 = 1.0e6
Model coefficients k t

0.2

0.0

−0.2
0 100 200 300 400 500
Time index

Figure 7.2 Trajectories of the system (7.2) (black) together with models estimated
by solving (7.3) with varying values of λ. The smoothing effect of a high λ is illus-
trated, and as λ → ∞, the estimated model converges to an LTI model. The optimal
value is given by λ ≈ 10. Please note that this figure shows the coefficients of k
corresponding the A-matrix of (7.1) only. The coefficients of the B -matrix evolve
similarly but are omitted for clarity.

using the Kalman-smoothing algorithm and 4 different values of λ, shown in color


in Fig. 7.2. The optimal value of λ is approximately given by λ = σv /σw = 10. In
Fig. 7.2 we see how a too small value of the regularization parameter leads to noisy
estimates of the time-varying parameters, whereas a too high value leads to overly
conservative changes. Arguably, the optimal value of λ = 10 performs best. In
Fig. 7.3, we illustrate the log-distributions of prediction errors in the top-left pane,
and model errors in the top-right pane, in both cases on the training data. It is
clear that the prediction error is an increasing function of the regularization pa-
rameter, as expected. However, the model error, calculated as the sum of squared
differences between the true system parameters and the estimated parameters,
is minimized by the optimal value of λ. In reality, this optimal value is unknown
and determining it is not always easy. In the bottom four panes, we show quantile-
quantile plots [Wilk and Gnanadesikan, 1968] of the prediction errors. The optimal

63
Chapter 7. Estimation of LTV Models

λ2 = 1 λ2 = 100 λ2 = 10000 λ2 = 1e6

Prediction errors Model errors

102 102

101 101

100 100

10−1 10−1 −2
−0.1 0 0.1 −5 · 10 0 5 · 10−2
qq-plot λ2 = 1 2
qq-plot λ = 100

0.010 0.02
0.005 0.01
0.00
0.000
−0.01
−0.005 −0.02

−5 · 10−3 0 5 · 10−3 −2 · 10−2 0 2 · 10−2


2 2
qq-plot λ = 10000 qq-plot λ = 1e6

0.06
0.02
0.03
0.00 0.00

−0.02 −0.03
−0.06
−0.04
−4 · 10−2
−2 · 10−2 0 2 · 10 −2
−5 · 10−2 0 5 · 10−2

Figure 7.3 The top panes show (log) error distributions on training data when
solving Eq. (7.3) on data generated by (7.1)-(7.2). The prediction error is a strictly
increasing function of λ, whereas the model error is minimized by the optimal
value for λ. The bottom four panes show quantile-quantile plots of prediction
errors on the training data, for the different choices of λ.

64
7.8 Example—Nonsmooth Robot Arm with Stiff Contact

·104

9.0

8.0

10−2 10−1 100 101 102 103 104


λ

Figure 7.4 Maximum-likelihood estimation to determine a suitable value of λ.


The curve illustrates the likelihood under the model (7.1)-(7.2) for different choices
of λ, the maximum is marked with a dot.

value for λ produces normal residuals, whereas other choices for λ produce heavy-
tailed distributions. While this analysis is available even when the true system is
not known, it might produce less clear outcomes when the data is not generated
by a system included in the model set being searched over. Alternative ways of
setting values for λ include cross validation and maximum-likelihood estimation
under the statistical model (7.1)-(7.2), illustrated in Fig. 7.4. The likelihood of the
data given a model on the form (7.2) is easily calculated during the forward-pass
of the Kalman algorithm.
Yet another option for determining the value of λ is to consider it as a relative
time-constant between the evolution of the state and the evolution of the model
parameters. A figure like Fig. 7.2, together with prior knowledge of the system, is
often useful in determining λ.

7.8 Example—Nonsmooth Robot Arm with Stiff Contact


To illustrate the ability of the proposed models to represent the nonsmooth dy-
namics along a trajectory of a robot arm, we simulate a two-link robot with dis-
continuous Coulomb friction. We also let the robot establish a stiff contact with
the environment to illustrate both strengths and weaknesses of the modeling
approach.
The state of the robot arm consists of two joint coordinates, q, and their time
derivatives, q̇. The control signal trajectory was computed using an inverse dy-
namics model of the robot, and Gaussian noise was superimposed the computed
torque trajectory. Figure 7.5 illustrates the state trajectories, control torques and
simulations of a model estimated by solving (7.6). The figure clearly illustrates that

65
Chapter 7. Estimation of LTV Models

End-effector positions Control torques

0.75 1.0
0.50
0.5
0.25
0.00 0.0
-0.25 -0.5
50 100 150 200 50 100 150 200
x y Constraint Stiff contact Velocity sign change
q̇ 1 q̇ 2
1.5 1.5
1.0 1.0
0.5 0.5
0.0 0.0
-0.5
-0.5
-1.0
-1.0 -1.5
50 100 150 200 50 100 150 200
States Velocity sign change Stiff contact Simulation
q1 q2

1.0
0.5
0.5
0.0
0.0
-0.5
-0.5
-1.0
-1.0
-1.5
50 100 150 200 50 100 150 200

Figure 7.5 Simulation of nonsmooth robot dynamics with stiff contact—training


data vs. sample time index. A sign change in velocity, and hence a discontinuous
change in friction torque, occurs in the time interval 50-100 and the contact is
established in the time interval 100-150. For numerical stability, all time-series
are normalized to zero mean and unit variance, hence, the original velocity zero
crossing is explicitly marked with a dashed line. The control signal plot clearly
indicates the discontinuity in torque around the unnormalized zero crossing of q̇ 2 .

66
7.9 Discussion

q̇ 1 q̇ 2
1.5 1.5
1.0 1.0
0.5 0.5
0.0 0.0
-0.5
-0.5 -1.0
-1.0 -1.5
50 100 150 200 50 100 150 200
States Orig. Velocity sign change Orig. Stiff contact Simulation
q1 q2
1.0
1.0
0.5
0.5
0.0
0.0
-0.5
-0.5
-1.0
-1.0
-1.5
50 100 150 200 50 100 150 200

Figure 7.6 Simulation of nonsmooth robot dynamics with stiff contact—


validation data vs. sample time index. The dashed lines indicate the event times
as occurring for the original training data, highlighting that the model is able to
deal effortlessly with the nonsmooth friction, but inaccurately predicts the time
evolution around the contact event, which now occurs at a slightly different time
instance.

the model is able to capture the dynamics both during the nonsmooth sign change
of the velocity, and also during the establishment of the stiff contact. The learned
dynamics of the contact is, however, time-dependent. This time-dependence is,
in some situations, a drawback of LTV-models. This drawback is illustrated in
Fig. 7.6, where the model is used on a validation trajectory where a different noise
sequence was added to the control torque. Due to the novel input signal, the
contact is established at a different time-instance and as a consequence, there is
an error transient in the simulated data.

7.9 Discussion

This chapter presents methods for estimation of linear, time-varying models. The
methods presented extend directly to nonlinear models that remain linear in the
parameters.
When estimating an LTV model from a trajectory obtained from a nonlinear
system, one is effectively estimating the linearization of the system around that
trajectory. A first-order approximation to a nonlinear system is not guaranteed to

67
Chapter 7. Estimation of LTV Models

generalize well as deviations from the trajectory become large. Many nonlinear sys-
tems are, however, approximately locally linear, such that they are well described
by a linear model in a small neighborhood around the linearization/operating
point. For certain methods, such as iterative learning control and trajectory cen-
tric reinforcement learning, a first-order approximation to the dynamics is used
for efficient optimization, while the validity of the approximation is ensured by
incorporating penalties or constraints on the deviation between two consecutive
trajectories [Levine and Koltun, 2013]. We explore this concept further using the
methods proposed in this chapter, in Chap. 11.
The methods presented allow very efficient learning of this first-order ap-
proximation due to the postulated prior belief over the nature of the change in
dynamics parameters, encoded by the regularization terms. Prior knowledge en-
coded this way puts less demand on the data required for successful identification.
The identification process will thus not be as invasive as when excessive noise
is added to the input for identification purposes, allowing learning of flexible,
overparameterized models that fit available data well. This makes the proposed
identification methods attractive in applications such as guided policy search
(GPS) [Levine and Koltun, 2013; Levine et al., 2015] and nonlinear iterative learning
control (ILC) [Bristow et al., 2006], where they can lead to dramatically decreased
sample complexity.
The proposed methods that lend themselves to estimation through the Kalman
smoother-based algorithm could find use as a layer in a neural network. Amos
and Kolter (2017) showed that the solution to a quadratic program is differentiable
and can be incorporated as a layer in a deep-learning model. The forward pass
through such a network involves solving the optimization problem, making the
proposed methods attractive due to the O (T ) solution.
When faced with a system where time-varying dynamics is suspected and
no particular knowledge regarding the dynamics evolution is available, or when
the dynamics are known to vary slowly, a reasonable first choice of algorithm
is (7.5). This algorithm is also, by far, the fastest of the proposed methods due
to the Kalman-smoother implementation of Sec. 7.5.1 As a consequence of the
ease of solution, finding a good value for the regularization parameter λ is also
significantly easier. Example use cases include when dynamics are changing with a
continuous auxiliary variable, such as temperature, altitude or velocity. If a smooth
parameter drift is found to correlate with an auxiliary variable, LPV-methodology
can be employed to model the dependency explicitly, something that will not be
elaborated upon further in this thesis but was implemented and tested in [Basis-
FunctionExpansions.jl, B.C., 2016].
Dynamics may change abruptly as a result of, e.g., system failure, change of
operating mode, or when a sudden disturbance enters the system, such as a policy
change affecting a market or a window opening, affecting the indoor temperature.
The identification method (7.6) can be employed to identify when such changes

1 The Kalman-smoother implementation is often several orders of magnitude faster than solving the
optimization problems with an iterative solver.

68
7.10 Conclusions

occur, without specifying a priori how many changes are expected.


For simplicity, the regularization weights were kept as simple scalars in
° °2
this chapter. However, all terms λ °∆k °2 = (∆k)T(λI )(∆k) can be generalized to
(∆k)TΛ(∆k), where Λ is an arbitrary positive definite matrix. This allows incorpo-
ration of different scales for different variables with little added implementation
complexity. Known structure, such as sparsity patterns in the matrices A and B , is
easily incorporated into Λ or the covariance matrix of w for the Kalman-smoother
solver.

Measurement-noise model
The identification algorithms developed in this chapter were made available by
the simple nature of the dynamical model Eq. (7.1). In practice, measurements
are often corrupted by noise, in particular if parts of the state-sequence is derived
from time differentiation of measured quantities. We will not cover the topic of
noise-model estimation in depth here, but will provide a few suggested treatments
from the literature that could be considered in a scenario with poor signal-to-noise
ratio.
A very general approach to estimation of noise models is pseudo-linear regres-
sion [Ljung and Söderström, 1983; Ljung, 1987]. The general idea is to estimate
the noise components and include them in the model. In the present context, this
could amount to estimating a model using any of the methods described above,
calculate the model residuals e t = y t − ŷ t , and build a model ê t = ρ(e t −1 , e t −2 , ...).
The combined problem of estimating both states x and parameters k can be
cast as a nonlinear filtering problem [Ljung and Söderström, 1983]. The nonlinear
nature of the resulting problem necessitates a nonlinear filtering approach, such
as the extended Kalman filter [Ljung and Söderström, 1983] or the particle filter
(see Sec. 4.2). The literature on iterated filtering [Ionides et al., 2006; Lindström
et al., 2012] considers this nonlinear filtering problem in the context of constant
dynamics.

7.10 Conclusions

We have proposed a framework for identification of linear, time-varying models


using convex optimization. We showed how a Kalman smoother can be used to
estimate the dynamics efficiently in a few special cases, and demonstrated the
use of the proposed LTV models on two examples, highlighting their efficiency for
jump-linear system identification and learning the linearization of a nonlinear
system along a trajectory. We further demonstrated the ability of the models to
handle nonsmooth friction dynamics as well as analyzed the identifiability of the
models.
Implementations of all discussed methods are made available in [LTVModels.jl,
B.C., 2017].

69
Chapter 7. Estimation of LTV Models

Appendix A. Solving (7.6)


P ° °
Due to the nonsquared norm penalty t °k t +1 − k t °2 , the problems (7.6) and (7.9)
are significantly harder to solve than (7.3). An efficient implementation using
the linearized ADMM algorithm [Parikh and Boyd, 2014] is made available in the
accompanying repository.
The linearized ADMM algorithm [Parikh and Boyd, 2014] solves the problem

minimize f (k) + g (Ak) (7.19)


k

where f and g are convex functions and A is a matrix. The optimization problems
with the group-lasso penalty (7.6) and (7.9) can be written on the form (7.19) by
constructing A such
° ° that it performs the computations k t +1 −k t or k t +2 −2k t +1 +k t
and letting g be °·°2 .

Appendix B. Solving (7.8)

The optimization problem of Eq. (7.8) is nonconvex and harder to solve than the
other problems proposed in this chapter. To solve small-scale instances of the
problem, we modify the algorithm developed in [Bellman, 1961], an algorithm fre-
quently referred to as segmented least-squares [Bellman and Roth, 1969]. Bellman
(1961) approximates a curve by piecewise linear segments. We instead associate
each segment (set of consecutive time indices during which the parameters are
constant) with a dynamics model, as opposed to a simple straight line.2
The algorithm relies on the key fact that the value function for a sequential
optimization problem with quadratic cost and parameters entering linearly, is
quadratic. This allows us to find the optimal solution in O (T 2 ) time instead of the
µ ¶
¡ T ¢
O
M

complexity of the naive solution. A simplified implementation is provided in


Algorithm 3.3
Unfortunately, the computational complexity of the dynamic-programming
solution, O (T 2 K 3 M ), becomes prohibitive for large T , in which case an approx-
imate solution can be obtained by, e.g., solving (7.6) and then projecting the
solution onto the constraint set by only keeping the M largest parameter changes.

2 Indeed, if a simple integrator is chosen as dynamics model and a constant input is assumed, the
result of our extended algorithm reduces to the segmented least-squares solution.
3 The help of Pontus Giselsson in developing this algorithm is gratefully acknowledged.

70
7.B Solving (7.8)

Algorithm 3 Simple dynamic-programming solver without memoization. Effi-


cient solver provided in [LTVModels.jl, B.C., 2017]. The algorithm can be used
for the original purpose of [Bellman, 1961] by letting costfun(y,a,b) and
argmin(y,a,b) calculate the cost and optimal estimate of an affine approxi-
mation to y between a and b. By instead letting these functions calculate the
optimal cost and solution to the LTI identification problem of Sec. 6.5, we obtain
the optimal solution to (7.8). Return values V, t , a constitute the value function,
breakpoints and optimal parameters, respectively.
function seg_bellman(y,M)
T = length(y)
B = zeros(Int, M-1, T) # back-pointer matrix
fi = [costfun(y,j,T) for j = 1:T] # initialize Bellman iteration
# Bellman iteration
fnext = Vector{Float64}(T)
for j = M-1:-1:1
for k = j:T-(M-j)
opt, optl = Inf, 0
for l = k+1:T-(M-j-1)
cost = costfun(y,k,l-1) + fi[l]
if cost < opt; opt, optl = cost, l; end
end
fnext[k] = opt
B[j,k] = optl-1
end
fi .= fnext
end
V = [costfun(y,1,j)+fi[j+1] for j = 1:T-M] # last Bellman iterate
# Backward pass
t = Vector{Int}(M)
a = Vector{typeof(argmin(y,1,2))}(M+1)
_,t[1] = findmin(V[1:end-M]) # t = index of minimum
a[1] = argmin(y,1,t[1])
for j = 2:M
t[j] = B[j-1,t[j-1]]
a[j] = argmin(y,t[j-1]+1,t[j])
end
a[M+1] = argmin(y,t[M]+1,T)
return V,t,a
end

71
8
Identification and
Regularization of Nonlinear
Black-Box Models

8.1 Introduction

Dynamical control systems are often described in continuous time by differential


state equations on the form
¡ ¢
ẋ(t ) = f c x(t ), u(t ) (8.1)

where x is a Markovian state vector, u is the input and f c is a function that maps
the current state and input to the state time derivative. An example of such a
model is a rigid-body dynamical model of a robot
· ¸
q
q̈ = −M −1 (q) C (q, q̇)q̇ +G(q) + F (q̇) − u , x =
¡ ¢
(8.2)

where M ,C ,G and F model phenomena such as inertia, Coriolis, gravity and


friction [Spong et al., 2006] and q are the joint coordinates.
In the discrete time domain, we often consider models on the form
¡ ¢
x t +1 = f x t , u t (8.3)

where f is a function that maps the current state and input to the state at the next
time-instance [Åström and Wittenmark, 2013a]. We have previously discussed the
case where f is a linear function of the state and the control input, and we now
extend our view to nonlinear functions f .
Learning a globally valid dynamics model fˆ of an arbitrary nonlinear system f
with little or no prior information is a challenging problem. Although in principle,
any sufficiently complex function approximator, such as a deep neural network
(DNN), could be employed, high demands are put on the amount of data required
to prevent overfitting and to obtain a faithful representation of the dynamics over

72
8.1 Introduction

the entire state space. If prior knowledge is available, it can often be used to reduce
the demands on the amount of data required to learn an accurate model [Sjöberg
et al., 1995].
Early efforts in nonlinear modeling include Volterra-Wiener models that make
use of basis-function expansions to model nonlinearities [Johansson, 1993]. This
type of models exhibit several drawbacks and are seldom used in practice, one of
which is the difficulty of incorporating prior knowledge into the model. Oftentimes,
this can also be hard to incorporate into a flexible black-box model such as a deep
neural network. This chapter will highlight a few general attempts at doing so,
compatible with a wide class of function approximators.
In many applications, the linearization of f is important, a typical example
being linear control design [Glad and Ljung, 2014]. In applications such as iterative
learning control (ILC) [Bristow et al., 2006] and trajectory centric, episode-based
reinforcement learning (TCRL) [Levine and Koltun, 2013], the linearization of the
nonlinear dynamics along a trajectory is often needed for optimization. Identi-
fication of f must thus not only yield a good model for prediction/simulation,
but also the Jacobian J fˆ of fˆ must be close to the true system Jacobian J . The
linearization of a model around a trajectory returns a Linear Time-Varying (LTV)
model on the form
x t +1 = A t x t + B t u t

where the matrices A and B constitute the output Jacobian. This kind of model
was learned efficiently using dynamic programming in Chap. 7. However, not
all situations allow for accurate learning of an LTV model around a trajectory. A
potential problem that can arise is insufficient excitation provided by the control
input [Johansson, 1993]. Prior knowledge regarding the evolution of the dynamics,
encoded in form of carefully designed regularization, was utilized in Chap. 7 in
order to obtain a well-posed optimization problem and a meaningful result. While
this proved to work well in many circumstances, it might fail if excitation is small
relative to how fast the dynamics changes along a trajectory. When model identifi-
cation is a subtask in an outer algorithm that optimizes a control-signal trajectory
or a feedback policy, adding excessive noise for identification purposes may be
undesirable, making regularization solely over time as in Chap. 7 insufficient. A
step up in sophistication from LTV models is to learn a nonlinear model that is
valid globally. A nonlinear model can be learned from several consecutive trajecto-
ries and is thus able to incorporate more data than an LTV model. Unfortunately,
a black-box nonlinear model also requires more data to learn a high fidelity model
and not suffer from overfitting.
The discussion so far indicates two issues; 1) Complex nonlinear models have
the potential to be valid globally, but may suffer from overfitting and thus not learn
a function that generalizes and learns the correct linearization. 2) LTV models can
be learned efficiently and can represent the linearized dynamics well, but require
sufficient excitation, are time-based and valid only locally.
In this chapter, we draw inspiration from the regularization methods detailed
in Chap. 7 for learning of a general, nonlinear black-box model, fˆ. Since an LTV

73
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

model lives in the tangent-space of a nonlinear model, we call the procedure


tangent-space regularization. In this work, we will model fˆ using a deep neural
network (DNN), a class of models which is well suited to learn arbitrarily complex
static mappings. We will explore how modern techniques for training of DNNs
interact with the particular structure chosen to represent fˆ and compare this reg-
ularization to a common choice in the literature, L 2 regularization or weight decay.
We investigate how weight decay affects the fidelity and eigenvalue spectrum of
the Jacobian of the learned model and reason about this from a control-theoretical
viewpoint.
We proceed to introduce the problem of learning a dynamics model fˆ in
Sec. 8.3. We then discuss the influence of weight decay on different formulations
of the learning problem and introduce tangent-space regularization formally in
Sec. 8.5. Finally, we conduct numerical evaluations.

8.2 Computational Aspects

A problem very much related to that of simulation error minimization discussed


in Chap. 3, long considered insurmountable, is that of training deep neural net-
works. It was believed that the propagation of gradients through deep models was
either too sensitive and unstable, or would inevitably lead to either exploding or
vanishing gradients. Many incremental steps have been taken recently to mitigate
this problem, allowing training of ever deeper models. A major contribution was
the notion of residual connections [He et al., 2015]. A residual connection is a
simple idea: to a complicated function, add a parallel identity y = f (x) = g (x) + x.
The function g is called a residual function, acting around the nominal identity
function x, where x is sometimes called a skip connection. Looking at the Jacobian
of these two functions, the impact of the modeling difference is obvious:

∇x y = ∇x f (x) = ∇x g (x) + I (8.4)

no matter the complicated nature of g , the Jacobian of the output with respect
to the input will contain the identity component. This allows gradients to flow ef-
fortlessly through deep architectures composed of stacked residual units, making
training of models as deep as 1000 layers possible [He et al., 2015]. In many cases,
learning a function g around the identity, is vastly easier than learning the full
function f . Nguyen et al. (2018) even proved that under certain circumstances and
with enough skip connections to the output layer, a DNN has no local minima and
a continuous path of decreasing loss exists from any starting point to the global
optimum.
Consider what it takes for a network with one hidden layer to learn the identity
mapping. If we use tanh as the hidden layer activation function, the incoming
weights must be small to make sure the tanh is operating in its linear region, while
the outgoing weights have to be the reciprocal of the incoming. The number of
neurons required is the same as the input dimension. For an activation function
that does not have a near-linear region around zero, such as the relu or the sigmoid

74
8.2 Computational Aspects

functions [Goodfellow et al., 2016], help from the bias term is further required to
center the activation in the linear region. For a deep network, this has to happen
for every layer.
While a deep model is perfectly capable of learning the identity mapping,
it is needless to say that incorporating this mapping explicitly can sometimes
be beneficial. In particular, if the input and output of the function live in the
same domain, e.g., image in—image out, state in—state out, it is often easier
to learn the residual around the identity. To quote Sjöberg et al. (1995), "Even if
nonlinear structures are to be applied there is no reason to waste parameters to
estimate facts that are already known". The identity can in this case be considered
a nominal model. Whenever available, simple nominal models provide excellent
starting points, and modeling and learning the residuals of such a model may
be easier than learning the combined effect of the nominal model and residual
effects.
While we argue for the use of prior knowledge where available, in particular
when modeling physical systems where identification data can be hard to acquire,
we would also like to offer a counter example highlighting the need to do so
wisely. Frederick Jelinek, a researcher in natural language processing, famously
said1 "Every time I fire a linguist, the performance of the speech recognizer goes
up". The quote indicates that the data—natural language as used by people—
did not follow the grammatical and syntactical rules laid down by the linguists.
Prior knowledge might in this case have introduced severe bias that held the
performance of the model back. Natural language processing is a domain in which
data is often easy to obtain and plentiful enough to allow fitting of very flexible
models without negative consequences.
A different notion of residual connection is that found in a form of recur-
rent neural network called a Long Short-Term Memory (LSTM) network [Hochre-
iter and Schmidhuber, 1997]. While the resnet [He et al., 2015] was deep in the
sense of multiple consecutive layers, an RNN is deep in the sense that a func-
tion is applied recursively in time. LSTMs were invented to mitigate the issue
of vanishing/exploding gradients while backpropagating through time to train
recurrent neural networks. When a model, or more generally a function f with
Jacobian ∇x f (x) = J (x), is applied to a state recursively, the Jacobian of the re-
cursive mapping grows as ∇x f (n) (x) ∼ J n (x), where f (n) (x) denotes the n times
recursive application of f , f (... f ( f (x))). Any eigenvalues of J greater than 1 will
grow exponentially—exploding gradients—and eigenvalues smaller than 1 will
decay exponentially—vanishing gradients. If we model f as
f (x) = g (x) + x (8.5)
the Jacobian will be the identity plus some small deviation, effectively helping the
eigenvalues of J stay close to 1. Slightly simplified, LSTMs effectively propagate
the state with an identity function according to (8.5).
1 The exact wording and circumstances around this quote constitute the topic of a very lengthy
footnote on the Wikipedia page of Jelinek https://en.wikipedia.org/wiki/Frederick_
Jelinek.

75
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

Other efforts at mitigating issues with the training of deep models include
careful weight initialization. If, for instance, all weights are initialized to be unitary
matrices, the norm of vectors propagated through the network stays constant. The
initialization of the network can further be used to our advantage. Prior knowledge
of, e.g., resonances etc. can be encoded into the initial model weights. In this
chapter, we will explore this concept further, along with the effects of identity
connections, and motivate them from a control-theoretic perspective.

8.3 Estimating a Nonlinear Black-Box Model

To frame the learning problem, we let the dynamics of a system be described by a


neural network fˆ to be fitted to input-output data τ = {x t , u t }Tt=1 according to

L EARNING O BJECTIVE 1

x t +1 = fˆ(x t , u t ), f : Rn × Rm 7→ Rn 2

which we will frequently write on the form x + = fˆ(x, u) by omitting the time index
t and letting ·+ indicate ·t +1 . We further consider the linearization of fˆ around a
trajectory τ

x t +1 = A t x t + B t u t
k t = vec ( ATt B tT )
£ ¤
(8.6)

where the matrices A and B constitute the input-output Jacobian J f of f

∇x f 1T ∇u f 1T
 
 . ..  n×(n+m)
. ∈R
£ ¤
J f =  .. = A B
∇x f nT ∇u f nT

and ∇x f i denotes the gradient of the i :th output of f with respect to x. Our
estimate fˆ(x, u, w) of f (x, u) will be parameterized by a vector w.2 The distinction
between f and fˆ will, however, be omitted unless required for clarity.
We frame the learning problem as an optimization problem with the goal of
adjusting the parameters w of fˆ to minimize a cost function V (w) by means of
gradient descent. The cost function V (w) can take many forms, but we will limit
the scope of this work to quadratic loss functions of the one-step prediction error,
i.e.,
1 X¡ + ˆ ¢T¡
x − f (x, u, w) x + − fˆ(x, u, w)
¢
V (w) =
2 t

2 We use w to denote all the parameters of the neural network, i.e., weight matrices and bias vectors.

76
8.3 Estimating a Nonlinear Black-Box Model

Sampling of continuous-time models


The functions f and f c in (8.1) and (8.3) are quite different from each other [Åström
and Wittenmark, 2013a]. It is well known that the Jacobian of the discrete-time
model f has eigenvalues different from that of the continuous-time counterpart
f c . The sampling procedure warps the space of Jacobian eigenvalues such that
the origin is moved to the point 1 in the complex plane, and the imaginary axis is
wrapped around the unit circle, with multiples of the Nyquist frequency occurring
at the point -1. If the system is sampled with a high sample rate, or viewed differ-
ently, if the dynamics of the system is slow in relation to the time scale at which
we are observing it, most eigenvalues3 of f c are close to 0. The eigenvalues for the
discrete-time f , however, tend to cluster around 1 when the sample rate is high.
This disparity between continuous-time and discrete-time models has been of
interest historically. It was found that controllers and models with slow dynamics
relative to the sample rate suffered from implementations with low precision
arithmetic [Åström and Wittenmark, 2011]. Middleton and Goodwin (1986) refor-
mulated the discrete-time model using the so-called δ-operator so that in the limit
of infinite sample rate, the poles of the discrete-time model would coincide with
the continuous-time counterpart. Related improvements in accuracy was found
by Lennartson et al. (2012) when using the δ-operator for solving linear-matrix
inequalities related to systems with fast sampling. The reformulation is based on
the definition of the derivative as the limit of a finite difference
x+ − x
lim = f c (x, u) (8.7)
∆t →0 ∆t

where a discrete-time model, with poles similar to the continuous-time poles, is


obtained for a sufficiently small ∆t . The increase in numerical accuracy from the
reformulation can be understood from the Taylor expansion of the dynamics. If
∆t is small, the term ∆t f c (x, u) will be insignificant next to the x in the expression
x + = x + ∆t · f c (x, u).
We will, alongside the naive discrete-time formulation f , also explore this
reformulation of the discrete-time model, and introduce a new learning objective:
L EARNING O BJECTIVE 2

x + − x = ∆x = g (x, u), g : Rn × Rm 7→ Rn
f (x, u) = g (x, u) + x 2
where the second equation is equivalent to the first, but highlights a convenient
implementation form that does not require transformation of the data.
To gain insight into how this seemingly trivial change in representation may
affect learning, we note that this transformation will alter the Jacobian according
to £ ¤
J g = A − In B (8.8)
3 We take the eigenvalues of a function to refer to the eigenvalues of the function Jacobian.

77
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

with a corresponding unit reduction of the eigenvalues of A. For systems with


integrators, or slow dynamics in general, this transformation leads to a better
conditioned estimation problem, something we will discuss in the next section.
In Sec. 8.6 we investigate whether or not this transformation leads to a better
prediction result and whether modern neural-network training techniques such
as the ADAM optimizer [Kingma and Ba, 2014] and batch normalization [Ioffe and
Szegedy, 2015] render this transformation superfluous. We further investigate how
weight decay and choice of activation function affect the eigenvalue spectrum of
the Jacobian of f and g , and hence system stability.
From the expression x + = g (x, u) + x, the similarity with skip-connections as
introduced in resnet [He et al., 2015] and LSTMs [Hochreiter and Schmidhuber,
1997] should be apparent, and we now have a control-theoretical, dynamical
systems view of this transformation.

Optimization landscape
To gain insight into the training of f and g , we analyze the expressions for the
gradient and Hessian of the respective cost functions. For a linear model x + =
Ax + Bu, rewritten on regressor form y = Ak with all parameters of A and B
concatenated into the vector k, and a least-squares cost function V (k) = 12 (y −
Ak)T(y − Ak), the gradient and Hessian are given by
∇k V = −AT(y − Ak)
∇2k V = ATA
The Hessian is clearly independent of both the output y and the parameters k
and differentiating the output, i.e., learning a map to ∆x instead of to x + , does not
have any major impact on gradient-based learning. For a nonlinear model, this is
not necessarily the case:
1 X¡ + ¢T¡
x − f (x, u, w) x + − f (x, u, w)
¢
V (w) =
2 t
T X
n
− x i+ − f i (x, u, w) ∇w f i
X ¡ ¢
∇w V =
t =1 i =1
T X
n
∇2w V = ∇w f i ∇w f iT − x i+ − f i (x, u, w) ∇2w f i
X ¡ ¢
t =1 i =1

where x i+ − f i (x, u, w)
constitute the prediction error. In this case, the Hessian
depends on both the parameters and the target x + . The transformation from f to
g changes the gradients and Hessians according to
T X
n
− ∆x i − g i (x, u, w) ∇w g i
X ¡ ¢
∇w V =
t =1 i =1
T X
n
∇2w V = ∇w g i ∇w g iT − ∆x i − g i (x, u, w) ∇2w g i
X ¡ ¢
t =1 i =1

78
8.4 Weight Decay

When ° ° begins, both f and g are initialized with small random weights, and
° ° training
° f ° and °g ° will typically be small. If the system we are modeling is of low-pass
character, i.e., °∆x ° is small, the prediction error of g will be closer to zero com-
° °

pared to f . The transformation from f to g can thus be seen as preconditioning


the problem by decreasing the influence of the term ∇2w g = ∇2w f in the Hessian.
With only the positive semi-definite term ∇w (g )∇w (g )T = ∇w ( f )∇w ( f )T, corre-
sponding to ATA in the linear case, remaining, the optimization problem might
become easier. Similarly, g starts out closer to a critical point ∇w V = 0, which
might make convergence faster. The last two claims will be investigated in Sec. 8.6.
Motivation for the statement that the optimization of g is better conditioned
is obtained by considering the derivation of the Gauss-Newton (GN) optimization
algorithm [Nocedal and Wright,¡1999]. GN ¢resembles the Newton algorithm, but
ignores the second-order term x + − f (w) ∇2w f by considering only a linear ap-
proximation to the prediction error. This algorithm is likely to converge ¡ if either¢
the Hessian ∇2w f is small, or, more importantly, the prediction errors x + − f (w)
are small [Nocedal and Wright, 1999, p. 259].
An approximation to the GN algorithm, suitable for training of neural net-
works, was developed by Botev et al. (2017), but although they showed improved
convergence over ADAM [Kingma and Ba, 2014] in terms of iteration count, ADAM
outperformed the approximate GN algorithm in terms of wall-clock time on a
GPU.

8.4 Weight Decay


Weight decay is commonly an integral part of training procedure in the deep learn-
ing setting, used to combat overfitting [Murphy, 2012; Goodfellow et al., 2016].
We can think of weight decay as either penalizing complexity of the model, or
as encoding prior knowledge about the size of the model coefficients. L 2 weight
decay is, however, a blunt weapon. While often effective at mitigating overfitting,
the added penalty term might introduce a severe bias in the estimate. Since the
bias always is directed towards smaller weights, it can have different consequences
depending on what small weights imply for a particular model architecture. For a
discrete-time model f , small weights intuitively imply small eigenvalues and a
small output. For x + = g (x, u) + x, on the other hand, small weights imply eigen-
values closer to 1. Weight decay might thus have different effects on learning f
and g +x. With a high sample rate and consequently eigenvalues close to 1, weight
decay is likely to bias the result of learning g + x in a milder way as compared to f .
The intuition of a small w implying small eigenvalues of the network Jacobian
is, unfortunately, somewhat deceiving. ° ° The largest eigenvalue λmax (∇x f (x, u, w))
is not constant on a level surface of °w °. As we move to a level surface with smaller
° ° of w , the biggest λmax (∇x f ) we can find is reduced, so an upper bound on
° °
value ° °
°w ° effectively puts an upper bound on λmax (∇x f ), but the connection between
them is complicated and depends on the model of f .
A natural question to ask is if weight decay can bias the eigenvalues of the

79
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

learned function to arbitrarily chosen locations. A generalized form of model


formulation is
x + = h(x, u) + Ax + Bu
where A and B can be seen as a nominal linear model around which we learn
the nonlinear behavior. Weight decay will for this formulation bias the Jacobian
towards A and B which can be chosen arbitrarily. Obviously, choosing a nominal
linear model is not always easy, and may in some cases not make sense. One can
however limit the scope to damped formulations like x + = h(x, u) + γx, where γ
is a scalar or a diagonal matrix that shifts the nominal eigenvalues along the real
axis to, e.g., encourage stability.
The use of regularization to promote stability of the learned dynamical sys-
tem is not the goal of this work, and weight decay is not well suited for the task.
Oberman and Calder (2018) discuss Lipschitz regularization and show how the
Lipschitz constant of a DNN can easily be upper bounded. Although Oberman
and Calder (2018) do not consider learning of dynamical systems, we conjecture
that their Lipschitz regularization would be better suited for the task. Oberman
and Calder (2018) that most activation functions° have a Lipschitz °constant of
1, rendering the product of all weight matrices, °WL WL−1 · · ·W2 W1 °, an upper
bound on the Lipschitz constant of the entire network. In the next section, we
show how the input-output Jacobian of a standard neural network is similarly
straightforward to compute and also this matrix could prove useful in regularizing
the Lipschitz constant of the model.
In Sec. 8.6, we investigate the influence of weight decay on the learning of
DNNs representing f and g and in the next section, we highlight a way of encod-
ing prior knowledge of the system to be modeled and introduce a new form of
regularization that, if the system enjoys certain properties, introduces less bias as
compared to weight decay.

8.5 Tangent-Space Regularization

The topic of Chap. 7 was learning heavily overparameterized models by restricting


the flexibility of the model by means of appropriate regularization. For systems
where the function f is known to be smooth, the Jacobian J f (t ) will vary slowly.
In the rigid-body dynamical model in (8.2), for instance, the inertial and grav-
itational forces are changing smoothly with the joint configuration. Similar to
the penalty function in (7.3), a natural addition to the cost function of the op-
timization problem would thus be a tangent-space regularization term on the
form
° Jˆt +1 − Jˆt °
X° °
(8.9)
t

which penalizes changes in the input-output Jacobian of the model over time, a
strategy we refer to as Jacobian propagation or Jacprop.
Taking the gradient of terms depending on the model Jacobian requires calcu-
lation of higher order derivatives. Depending on the framework used for optimiza-

80
8.5 Tangent-Space Regularization

tion, this can limit the applicability of the method. We thus proceed to describe
how we implemented the penalty term of (8.9).

Implementation details
The inclusion of (8.9) in the cost function implies the presence of nested differen-
tiation in the gradient of the cost function with respect to the parameters, ∇w V .
The complications arise in the calculation of the term

∇w ° Jˆt +1 (w) − Jˆt (w)°


X° °
(8.10)
t

where J is composed of ∇x f (x, u, w) and ∇u f (x, u, w). Many, but not all, deep-
learning frameworks unfortunately lack support for nested differentiation. Further,
most modern deep-learning frameworks employ reverse-mode automatic differ-
entiation (AD) for automatic calculation of gradients of cost functions with respect
to network weight matrices [Merriënboer et al., 2018]. Reverse-mode AD is very
well suited for scalar functions of many parameters. For vector-valued functions,
however, reverse-mode AD essentially requires separate differentiation of each
output of the function. This scales poorly and is a seemingly unusual use case
in deep learning; most AD frameworks have no explicit support for calculating
Jacobians. These two obstacles together might make implementing the suggested
regularization hard in practice. Indeed, an attempt was made at implementing the
regularization with nested automatic differentiation, where ∇w V was calculated
using reverse-mode AD and J (w) using forward-mode AD. This required find-
ing AD software capable of nested differentiation and was made very difficult by
subtleties regarding closing over the correct variables for the inner differentiation.
The resulting code was also very slow to execute.
The examples detailed later in this chapter instead make use of handwritten
inner differentiation, where ∇w V once again is calculated using reverse-mode AD,
but the calculation of J (w) is done manually. For a DNN composed of affine trans-
formations (W x + b) followed by elementwise nonlinearities (σ), this calculation
can be expressed recursively as

a 1 = W1 x + b 1
a i = Wi l i −1 + b i , i = 2 ... L
l i = σ(a i )
∇x l i = ∇x {σ(a i )} ∇x l i −1
= (∇x σ)¯a · ∇x a i · ∇x l i −1
¯
¯ i
= (∇x σ)¯a · Wi · ∇x l i −1
i

where σ denotes the activation function.


In a practical implementation, the Jacobian can be calculated at the same time
as the forward pass of the network. An implementation is given in Algorithm 4.
The function defined in Algorithm 4 is suitable for use in an outer cost function
which is differentiated with respect to w using reverse mode AD.

81
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

Algorithm 4 Julia code for calculation of both forward pass and input-output
Jacobian of neural network f . The code uses the tanh activation function and
assumes that the weight matrices are stored according to w = {W1 b 1 ...WL b L }
function forward_jac(w,x)
l = x
J = Matrix{eltype(w[1])}(I,length(x),length(x)) # Initial J = I n
for i = 1:2:length(w)-2
W,b = w[i], w[i+1]
a = W*l .+ b
l = σ.(a)
∇a = W
∇σ = ∇σ(a)
J = ∇σ * ∇a * J
end
J = w[end-1] * J # Linear output layer
return w[end-1]*l .+ w[end] , J
end
∇σ(a) = Matrix(Diagonal((sech.(a).^2)[:])) # ∇a tanh(a)

8.6 Evaluation

The previously described methods were evaluated on two benchmark problems.


We compared performance on one-step prediction and further compared the
fidelity of the Jacobian of the estimated models, an important property for op-
timization algorithms making use of the models. The benchmarks consist of a
pendulum on a cart, and randomized, stable linear systems.
We initially describe a baseline neural-network model used in the experimen-
tal evaluation, which we use to draw conclusions regarding the different learning
objectives. We describe how deviations from this baseline model alter the conclu-
sions drawn in Sec. 8.B.

Nominal model
Both functions f and g are modeled as neural networks. For the linear-system
task, the networks had 1 hidden layer with 20 neurons; in the pendulum task, the
networks had 3 hidden layers with 30 neurons each.
A comparative study of 6 different activation functions, presented in Sec. 8.A,
indicated that some unbounded activation functions, such as the relu and leaky
relu functions [Ramachandran et al., 2017], are less suited for the task at hand,
and generally, the tanh activation function performed best and was chosen for the
evaluation.
We train the models using the ADAM optimizer with a fixed step-size and
fixed number of epochs. The framework for training, including all simulation
experiments reported in this chapter, is published at [JacProp.jl, B.C., 2018] and is
implemented in the Julia programming language [Bezanson et al., 2017] and the

82
8.6 Evaluation

Algorithm 5 Generation of random, stable linear systems.


A 0 = 10 ×10 matrix of random coefficients
A = A 0 − AT0 skew-symmetric = pure imaginary eigenvalues
A = A − ∆t I make ’slightly’ stable
A = exp(∆t A) discrete time, sample time ∆t
B = 10 ×10 matrix of random coefficients

Prediction RMS Jacobian Error


4 0.4

3 0.3

2 0.2
1 0.1
0 0.0
Standard Jacprop Standard Jacprop

Figure 8.1 Left: Distribution of prediction errors on the validation data for g + x
trained on the linear-system task. Each violin represents 12 Monte-Carlo runs.
Right: Distribution of errors in estimated Jacobians. The figure indicates that
tangent-space regularization through Jacobian propagation is effective and re-
duces the error in the estimated Jacobian without affecting the prediction error
performance.

Flux machine learning library [Innes, 2018].

Randomized linear system


To assess the effectiveness of Jacobian propagation we create random, stable linear
systems according to Algorithm 5 and evaluate the Jacobian of the learned model
for points sampled randomly in the state space. Learning a linear system allows
us to easily visualize the learned Jacobian eigenvalues and relate them to the
eigenvalues of the true system, which remain constant along a trajectory. While
the proposed regularization penalty is theoretically optimal for a linear system,
this simple setting allows us to verify that the additional penalty term does not
introduce any undesired difficulty in the optimization that results in convergence
to local optima, etc.
We train two models, the first model is trained using weight decay and the
second using Jacprop. The regularization parameters are in both cases tuned such
that prediction error on the test data is minimized. The models are trained for
300 epochs on two trajectories of 200 time steps each. The input was low-pass
filtered Gaussian noise where the cutoff frequency was high in relation to the time
constant of the system so as to excite all modes of the system.
The results are illustrated in Fig. 8.1 and Fig. 8.2. During training, the model
trained without tangent-space regularization reaches a far lower training error, but

83
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

Figure 8.2 Learned Jacobian eigenvalues of g + x for points sampled randomly


in the state space (blue) together with the eigenvalues of the true model (red).
Tangent-space regularization (right) leads to better estimation of the Jacobian with
eigenvalues in a tighter cluster around the true eigenvalues close to the unit circle.

validation data indicates that overfitting has occurred. The number of parameters
in the models was 6.2 times larger than the number of parameters in the true
linear system and overfitting is thus a concern in this scenario.
To calculate the Jacobian error, we calculate the shortest distance between
each eigenvalue in the true Jacobian to any of the eigenvalues of the estimated
Jacobians, and vice versa. We then sum these distances and take the mean over
all the data points in the validation set. This allows us to penalize both failure to
place an eigenvalue close to a true eigenvalue, and placing an eigenvalue without
a true eigenvalue nearby. The model trained with tangent-space regularization
learns better Jacobians while producing the same prediction error, indicated in
Fig. 8.1 and Fig. 8.2.
The effect of weight decay on the learned Jacobian is illustrated in Fig. 8.3.
Due to overparameterization, heavy overfitting is expected without adequate
regularization. Not only is it clear that learning of g + x has been more successful
than learning of f in the absence of weight decay, but we also see that weight
decay has had a deteriorating effect on learning f , whereas it has been beneficial in
learning g + x. This indicates that the choice of architecture interacts with the use
of standard regularization techniques and must be considered while modeling.

Pendulum-on-cart task
A pendulum attached to a moving cart is simulated to assess the effectiveness of
Jacprop and weight decay on a system with nonlinear dynamics and thus a chang-
ing Jacobian along a trajectory. An example trajectory of the system described
by (8.11)-(8.12), which has 4 states (θ, θ̇, p, v) and one control input u, is shown
in Fig. 8.4. This task demonstrates the utility of tangent-space regularization for
systems where the regularization term is not the theoretically perfect choice, as
was the case with the linear system. Having access to the true system model and
state representation also allows us to compare the learned Jacobian to the true
system Jacobian. We simulate the system with a superposition of sine waves of

84
8.6 Evaluation

Figure 8.3 Eigenvalues of learned Jacobians for the linear system task. True eigen-
values are shown in red, and eigenvalues of the learned model for points sampled
randomly in the state space are shown in blue. The top/bottom rows show models
trained without/with weight decay, left/right columns show f /g . Weight decay
has a deteriorating effect on learning f , pulling some eigenvalues towards 0 while
causing others to become much larger than 1, resulting in a very unstable system.
Weight decay is beneficial for learning g + x, keeping the eigenvalues close to 1.

States Control signal


θ θ̇ 3
5.0 p v u
2
2.5
1
0.0
0
−2.5
−1
−5.0
−2
−7.5
0 50 100 150 200 0 50 100 150 200
Time Time

Figure 8.4 Example trajectory of pendulum on a cart.

85
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

Prediction RMS Jacobian Error


100.0
0.9
0.8 10−0.5
0.7 10−1.0
0.6 10−1.5
f g f g

Figure 8.5 Left: Distribution of prediction errors on the validation data for the
pendulum on a cart task using tanh activation functions. Each violin represents 30
Monte-Carlo runs. The figure indicates that tangent-space regularization through
Jacobian propagation is effective and reduces prediction error for f , but not g ,
where weight decay performs equally well. Right: Distribution of errors in estimated
Jacobians. Jacprop is effective at improving the fidelity of the model Jacobians for
both f and g .

different frequencies and random noise as input and compare prediction error as
well as the error in the estimated Jacobian. The dynamical equations of the system
are given by

g u
θ̈ = − sin(θ) + cos(θ) − d θ̇ (8.11)
l l
v̇ = p̈ = u (8.12)

where g , l , d denote the acceleration of gravity, the length of the pendulum and
the damping, respectively.
Once again we train two models, one with weight decay and one with Jacprop.
The regularization parameters were in both cases chosen such that prediction
error on test data was minimized. The models were trained for 2500 epochs on
two trajectories of 200 time steps, approximately an order of magnitude fewer
data points than the number of parameters in the models.
The prediction and Jacobian errors for validation data, i.e., trajectories not
seen during training, are shown in Fig. 8.5. The results indicate that while learning
f , tangent-space regularization leads to reduced prediction errors compared to
weight decay, with lower mean error and smaller spread, indicating more robust
learning. Learning of g + x did not benefit much from Jacobian propagation in
terms of prediction performance compared to weight decay, and both training
methods perform on par and reach a much lower prediction error than the f
models.
To assess the fidelity of the learned Jacobian, we compare it to the ground-truth
Jacobian of the simulator. We display the distribution of errors in the estimated
Jacobians in Fig. 8.5. The results show a significant benefit of tangent-space regu-
larization over weight decay for learning both f and g + x, with a reduction of the

86
8.7 Discussion

Figure 8.6 Eigenvalues of the pendulum system using the tanh activation func-
tion on validation data.

mean error as well as a smaller spread of errors (please note that the figure has
logarithmic y-axis).
The individual entries of the model Jacobian along a trajectory for one instance
of the trained models are visualized as functions of time in Fig. 8.7. The figure
illustrates the smoothing effect of the tangent-space regularization and verifies
the smoothness assumption on the pendulum system. We also note that the
regularization employed does not restrict the ability of the learned model to
change its Jacobian along the trajectory, tracking the true system Jacobian. This is
particularly indicated in the (1, 2) and (4, 3) entries in Fig. 8.7. The figure also shows
how weight decay tuned for optimal prediction performance allows a rapidly
changing Jacobian, indicating overfitting. If a higher weight-decay penalty is used,
this overfitting is reduced, at the expense of prediction performance, hinting at
the heavily biasing properties of excessive weight decay.
The eigenvalues of the true system and learned models are visualized in Fig. 8.6.

8.7 Discussion

Throughout experiments, we note that g + x generally trains faster, reaches a lower


value of the objective function compared to f and learns a Jacobian that is closer
to the Jacobian of the true system model. In a reinforcement-learning setting
where available data is limited and Jacobians of the learned model are used for
optimization, this property is of great importance.

87
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

0.04
1.00 0.2 0.04
0.025
0.1 0.02 0.02
0.75
0.00 0.000
0.0 0.00
0.50 −0.02
−0.1 −0.025 −0.02
0.25 −0.04
−0.2
−0.06 −0.050 −0.04
0.00 −0.3
1.00
0.075 0.100
0.09 0.050
0.75 0.050 0.075
0.06 0.025
0.050
0.50 0.025 0.03 0.000
0.025
0.25 0.000 0.00 −0.025 0.000

−0.025 −0.03 −0.050 −0.025


0.00
0.05
1.00
0.04 0.03 0.10
0.75 0.00
0.00
0.02 0.05
0.50 −0.05 −0.03
0.00
−0.06 0.00
0.25 −0.10
−0.02 −0.09
−0.05
0.00
1.00 0.025
0.06
0.025 0.02
0.75 0.03 0.000
0.000 0.00
0.50 0.00
−0.025 −0.025
−0.02
−0.03
0.25 −0.050 −0.050
−0.04
−0.06
0.00 −0.075 −0.075
−0.06

Figure 8.7 Individual entries in the Jacobian as functions of time step along a
trajectory of the pendulum system (Ground truth -- , Weight decay — , Jacprop
— ). The figure verifies the smoothness assumption on the system and indicates
that Jacprop is successful in promoting smoothness of the Jacobian of the estimated
model. The entries (1, 2) and (4, 3) change the most along the trajectory, and the
Jacprop-regularized model tracks these changes well without excessive smoothing.

The structure of g + x resembles that of a residual network [He et al., 2016],


where a skip connection is added between the input and a layer beyond the first
adjacent layer, in our case, directly to the output. While skip connections have
helped to enable successful training of very deep architectures for tasks such as
image recognition, we motivated the benefit of the skip connection with classical
theory for sampling of continuous-time systems [Middleton and Goodwin, 1986]
and an analysis of the model Hessian, where we compared learning of g + x to
the Gauss-Newton algorithm. Exploring the similarities with residual networks
remains an interesting avenue for future work. The notion of skip connections is
seen also in LSTMs [Hochreiter and Schmidhuber, 1997], once again motivated
by gradient flow. LSTMs are often used to model and generate natural language.

88
8.8 Conclusions

In this domain, the notion of time constants is less well defined. The state of a
recurrent neural network for natural language modeling can change very abruptly
depending on which word is input. LSTMs thus incorporate also a gating mecha-
nism to allow components of the state to be "forgotten". For mechanical systems,
an example of an analogous situation is a state constraint. A dynamical model of a
bouncing ball, for instance, must learn to forget the velocity component of the
state when the position reaches the constraint.
The scope of this chapter was limited to settings where a state sequence is
known. This allowed us to reason about eigenvalues of Jacobians and compare
the learned Jacobians to those of the ground-truth model of a simulator. In a more
general setting, learning the transformation of past measurements and inputs
to a state representation is required, e.g., using a network with recurrence or an
auto-encoder [Karl et al., 2016]. Initial results indicate that the conclusions drawn
regarding the formulation ( f vs. g + x) of the model and the effect of weight decay
remain valid in the RNN setting, but a more detailed analysis is the target of future
work. The concept of tangent-space regularization applies equally well to the
Jacobian from input to hidden state in an RNN, and potential benefits of this kind
of regularization in the general RNN setting remain to be investigated.
We also restricted our exposition to the simplest possible noise model, cor-
responding to the equation-error problem discussed in Sec. 3.1. In a practical
scenario, estimating a more sophisticated noise model may be desirable. Noise
models in the deep-learning setting add complexity to the estimation in the same
way as for linear models. A practical approach, inspired by pseudo-linear regres-
sion [Ljung, 1987], is to train a model without noise model and use this model to
estimate the noise sequence through the prediction errors. This noise sequence
can then be used to train a noise model. If this noise model is trained together
with the dynamics model, back-propagation through time is required and the
computational complexity is increased. Deep-learning examples including noise
models are found in [Karl et al., 2016].

8.8 Conclusions
We investigated different architectures of a neural-network model for modeling
of dynamical systems and found that the relationship between sample time and
system bandwidth affects the preferred choice of architecture, where an approx-
imator architecture incorporating an identity element similar to that of LSTMs
and resnet, train faster and generally generalize better in terms of all metrics if the
sample rate is high. An analysis of gradient and Hessian expressions motivated
the difference and conclusions were reinforced by experiments.
The effect of including L 2 weight decay was investigated and shown to vary
greatly with the model architecture. Implications on the stability and tangent-
space eigenvalues of the learned model highlight the need to consider the archi-
tecture choice carefully.
We further demonstrated how tangent-space regularization by means of Ja-

89
Chapter 8. Identification and Regularization of Nonlinear Black-Box Models

cobian propagation can be used to incorporate prior knowledge of the modeled


system and regularize the learning of a neural network model of a smooth dynam-
ical system with an increase in prediction performance as well as increasing the
fidelity of the learned Jacobians as result.

Appendix A. Comparison of Activation Functions


Figure 8.8 displays the distribution of prediction errors over 200 Monte-Carlo
runs with different random seeds and different activation functions. The results
indicate that the relu and leaky relu functions are worse suited for the considered
task and are thus left out from the set of selected bootstrap ensemble activation
functions. For a definition of the activation functions, see, e.g., [Ramachandran
et al., 2017].

Appendix B. Deviations from the Nominal Model

Number of neurons Doubling or halving the number of neurons generally led


to worse performance.
Number of layers Adding a fully connected layer with the same number of neu-
rons did not change any conclusions.
Dropout Inclusion of dropout (20%) increased prediction and simulation per-
formance for g while performance was decreased for f . Performance on Jacobian
estimation was in general worse.
Layer normalization Prediction and simulation performance remained similar
or slightly worse, whereas Jacobian estimation broke down completely.
Measurement noise Although not a hyper parameter, we investigated the in-
fluence of measurement noise on the identification results. Increasing degree of
measurement noise degraded all performance metrics in predictable ways and
did not change any qualitative conclusions.
Optimizer, Batch normalization, Batch size Neither of these modifications al-
tered the performance significantly.

90
8.B Deviations from the Nominal Model

f
1.0
0.0
-1.0
-2.0

g
1.0
0.0
-1.0
-2.0
σ

σ
lu

nh

lu

nh
u

swelu
h

swelu
h

nh
u

swelu
h
el

is

el

is

el

is
re

re

re
ta

ta

ta
yr

yr

r
ky
ak

ak

a
le

le

le
Figure 8.8 Distributions (cropped at extreme values) of log-prediction errors on
the validation data after 20, 500 and 1500 (left, middle, right) epochs of training for
different activation functions. Every violin is representing 200 Monte-Carlo runs
and is independently normalized such that the width is proportional to the density,
with a fixed maximum width.

91
9
Friction Modeling and
Estimation

9.1 Introduction

All mechanical systems with moving parts are subject to friction. The friction force
is a product of interaction forces on an atomic level and is always resisting relative
motion between two elements in contact. Because of the complex nature of the
interaction forces, friction is usually modeled based on empirical observations.
The simplest model of friction is the Coulomb model, (9.1), which assumes a
constant friction force acting in the reverse direction of motion
F f = k c sign (v) (9.1)
where k c is the Coulomb friction constant and v is the relative velocity between
the interacting surfaces.
A slight extension to the Coulomb model includes also velocity dependent
terms
F f = k v v + k c sign (v) (9.2)
where k v is the viscous friction coefficient. The Coulomb model and the viscous
model are illustrated in Fig. 9.1. If the friction is observed to vary with the direction
of motion, sign (v), the model (9.2) can be extended to
F f = k v v + k c+ sign (v + ) + k c− sign (v − ) (9.3)

Coulomb Viscous Stiction


Ff Ff Ff

v v v

Figure 9.1 Illustrations of simple friction models.

93
Chapter 9. Friction Modeling and Estimation

where the sign operator is defined to be zero for v = 0, v + = max(0, v) and


v − = min(0, v).
It is commonly observed that the force needed to initiate movement from a
resting position is higher than the force required to maintain a low velocity. This
phenomenon, called stiction, is illustrated in Fig. 9.1. The friction for zero velocity
and an external force F e can be modeled as
½
Fe if v = 0 and |F e | < k s
Ff = (9.4)
k s sign F e if v = 0 and |F e | ≥ k s

where k s is the stiction friction coefficient. An external force greater than the
stiction force will, according to model (9.4), cause an instantaneous acceleration
and a discontinuity in the friction force.
The models above suffice for many purposes but can not explain several
commonly observed friction-related phenomena, such as the Stribeck effect and
dynamical behavior, etc. [Olsson et al., 1998]. To explain more complicated behav-
ior, dynamical models such as the Dahl model [Dahl, 1968] and the LuGre model
[De Wit et al., 1995] have been proposed.
Most proposed friction models include velocity-dependent effects, but no
position dependence. A dependence upon position is however often observed,
and may stem from, for instance, imperfect assembly, irregularities in the contact
surfaces or application of lubricant, etc. [Armstrong et al., 1994]. Modeling of the
position dependence is unfortunately nontrivial due to an often irregular relation-
ship between the position and the friction force. Several authors have however
made efforts in the area. Armstrong (1988) used accurate friction measurements
to implement a look-up table for the position dependence and Huang et al. (1998)
adaptively identified a sinusoidal position dependence.
More recent endeavors by Kruif and Vries (2002) used an Iterative Learning
Control approach to learn a feedforward model including position-dependent
friction terms.
In [Bittencourt and Gunnarsson, 2012], no significant positional dependence
of the friction in a robot joint was found. However, a clear dependence upon the
temperature of contact region was reported. To allow for temperature sensing, the
grease in the gear box was replaced by an oil-based lubricant, which allowed for
temperature sensing in the oil flow circuit.
A standard approach in dealing with systems with varying parameters is recur-
sive identification during normal operation [Johansson, 1993]. Recursive identifi-
cation of the models (9.1) and (9.2) could account for both position- and tempera-
ture dependence. Whereas straight forward in theory, it is often hard to perform in
a robust manner in practical situations. Presence of external forces, accelerating
motions, etc. require either a break in the adaptation, or an accurate model of the
additional dynamics. Many control programs, such as time-optimal programs,
never exhibit zero acceleration, and thus no chance for parameter adaptation.
To see why unmodeled dynamics cause a bias in the estimated parameters,
consider the system
f = ma + f ext + F f (9.5)

94
9.2 Models and Identification Procedures

Figure 9.2 Dual-arm robot and industrial manipulator IRB140 used for experi-
mental verification of proposed models and identification procedures.

with mass m and externally applied force f ext . If we model this system as
fˆ = F f (9.6)
we effectively have the disturbance terms
f − fˆ = ma + f ext (9.7)
These terms do not constitute uncorrelated random noise with zero mean and
will thus introduce a bias in the estimate. It is therefore of importance to obtain as
accurate models as possible offline, where conditions can be carefully controlled
to minimize the influence of unmodeled dynamics.
This chapter develops a model that incorporates positional friction depen-
dence as well as an implicitly temperature-dependent term. The proposed ad-
ditions can be combined or used independently as appropriate. Since many in-
dustrially relevant systems lack temperature sensing in areas of importance for
friction modeling, a sensor-less approach is proposed. Both models are used for
identification of friction in the joint of an industrial collaborative-style robot, see
Fig. 9.2, and special aspects of position dependence are verified on a traditional
industrial manipulator.

9.2 Models and Identification Procedures

This section first introduces a general identification procedure for friction models
linear in the parameters, based on the least-squares method, followed by the
introduction of a model that allows for the friction to vary with position. Third, a
model that accounts for temperature-varying friction phenomena is introduced.
Here, a sensor-less approach where the power loss due to friction is used as an
input to a first-order system, is adopted.
As the models are equally suited for friction due to linear and angular move-
ments, the terms force and torque are here used interchangeably.

95
Chapter 9. Friction Modeling and Estimation

Least-squares friction identification


A standard model of the torques in rigid-body dynamical systems, such as indus-
trial robots, is [Spong et al., 2006]

τ = M (p)a +C (p, v)v +G(p) + F (v) (9.8)

where a = v̇ = p̈ is the acceleration, τ the control torque, M ,C ,G are matrices


representing inertia-, Coriolis-, centrifugal- and gravitational forces and F is a
friction model. If a single joint at the time is operated, at constant velocity, Coriolis
effects disappear [Spong et al., 2006] and

C (p, v) = 0
¾
⇒ τ = G(p) + F (v) (9.9)
a =0

To further simplify the presentation, it is assumed that G(p) = 0. This can easily be
achieved by either aligning the axis of rotation with the gravitational vector such
that gravitational forces vanish, by identifying and compensating for a gravity
model1 or, as in [Bittencourt and Gunnarsson, 2012], performing a symmetric
experiment with both positive and negative velocities and calculating the torque
difference.
As a result of the discontinuity of the Coulomb model and the related uncer-
tainty in estimating the friction force at zero velocity, datapoints where the velocity
is not significantly different from zero must be removed from the dataset used
for estimation. Since there is a large probability that these points will have the
wrong sign, inclusion of these points might lead to severe bias in the estimate of
the friction parameters.

Estimation in multi-joint robots


When performing friction estimation for multi-link robots, such as a serial manip-
ulator, one can often choose to perform experiments on individual joints one at a
time. This approach, while structured and systematic, might require additional
time to complete. Another alternative is to perform an experiment on all joints
simultaneously. The simplifying assumption G(q) = 0 is then invalid and one must
simultaneously consider full gravity model estimation.
The simple models described in Sec. 9.1 are all linear in the parameters and
can be estimated efficiently using the methods described in Sec. 6.2.
Gear coupling Friction occurs between all sliding surfaces. In an electrical mo-
tor with gears, this implies the presence of friction on both sides of the gear box
as well as in the gear itself. In a simple gear, it is not possible, nor necessary, to
distinguish between the sources of friction, only the sum is observable. However,
many multi-link manipulators exhibit gear coupling between consecutive motors.
This complicates friction estimation since the movement of a motor with coupled
gears causes friction between parts that are also affected by the movement of the
1 For a single joint, this simply amounts to appending the regressor matrix A with sin(p)
£ ¤
cos(p)

96
9.3 Position-Dependent Model

coupled motors. For instance, some manipulators have a triangular gear-ratio


matrix for the spherical wrist. As an example, the gear-ratio matrix for the ABB
IRB140 has the following structure
 


 ∗ 

 ∗ 
G =  (9.10)

 ∗ 

 ∗ ∗ 
∗ ∗ ∗

with the relations between relevant motor- and arm-side quantities given by

q a = G qm
τa = G −1 τm (9.11)

This causes cross couplings between the arm-side and motor-side friction for the
last three joints. For the first three joints, only the sum of arm- and motor-side
friction is visible, but for joint 5 and 6, friction models of the motor-side friction
for joint 4 and joints 4 and 5, respectively, are needed. Experiments on an ABB
IRB2400 robot indicate that the friction on both sides of the motor is of roughly
equal importance for the overall result, and special treatment of the wrist joints is
therefore crucial.
For the example above, a friction model for joint 6 will thus have to contain
terms dependent on the velocities of the fourth and fifth motors as well, e.g.:

F f 6 = k c6 sign q̇ 6 + k v6 q̇ 6 + k c5 sign q̇ 5 + k v5 q̇ 5 + k c4 sign q̇ 4 + k v4 q̇ 4 (9.12)

9.3 Position-Dependent Model

As mentioned in Sec. 9.1, a positional, repeatable friction dependence is often


observed in mechanical systems. This section extends the simple nominal models
presented in Sec. 9.1 with position-dependent terms, where the position depen-
dence is modeled with a basis-function expansion. Througout the chapter, we
will make use of Gaussian basis functions, with the implicit assumption that the
estimated function is smooth and differentiable. This choice is, however, not im-
portant for the development of the method, and the choice of basis functions
should be made with considerations mentioned in Sec. 6.3. One such consider-
ation is the discontinuity of Coulomb friction at zero velocity, which we handle
explicitly by truncating half of the basis functions for positive velocities, and vice
versa. We further explicitly estimate the mean Coulomb coefficient even when esti-
mating position dependent models, and let the basis-function expansion estimate
deviations around a simple nominal model.

97
Chapter 9. Friction Modeling and Estimation

We define the Gaussian RBF kernel κ and the kernel vector φ

(p − µ)2
µ ¶
κ(p, µ, σ) = exp − (9.13)
2σ2
φ(p) : (p ∈ P ) 7→ R1×K
φ(p) = κ(p, µ1 , σ), · · · , κ(p, µK , σ)
£ ¤
(9.14)

where µi ∈ P , i = 1, ..., K is a set of K evenly spaced centers. For each input position
p ∈ P ⊆ R, the kernel vector φ(p) will have activated (>0) entries for the kernels
with centers close to p. Refer to Fig. 6.2 for an illustration of RBFs. The kernel
vector is included in the regressor-matrix A from Sec. 6.2 such that if used together
with a nominal, viscous friction model, A and the parameter vector k are given by

v1 sign (v 1 ) φ(p 1 )
   
kv
 . .. ..  N ×(2+K )
A =  .. . .  ∈ R , k = kc  (9.15)
vN sign (v N ) φ(p N ) kκ

where k κ ∈ RK denotes the parameters corresponding to the kernel vector entries.


The number of RBFs to include and the bandwidth σ is usually chosen based on
evidence maximization or cross validation [Murphy, 2012].
The position-dependent model can now be summarized as

F f = F n + φ(p)k κ (9.16)

where F n is one of the nominal models from Sec. 9.1. The attentive reader might
expect that for appropriate choices of basis functions, the regressor matrix A will
be rank-deficient. This is indeed the case since the sign(v) column lies in the
span of φ. However, the interpretation of the resulting model coefficients is more
intuitive if the Coulomb level is included as a baseline around which the basis-
function expansion models the residuals. To mitigate the issue of rank deficiency,
one could either estimate the nominal model first, and then fit the BFE to the
residuals, or include a slight ridge-regression penalty on k κ .
The above method is valid for position-varying Coulomb friction. It is conceiv-
able that the position dependence is affected by the velocity, in which case the
model (9.16) will produce a sub-optimal result. The RBF network can, however, be
designed to cover the space (P × V ) ⊆ R2 . The inclusion of velocity dependence
comes at the cost of an increase in the number of parameters from K p to K p K v ,
where K p and K v denote the number of basis-function centers in the position and
velocity input spaces, respectively.
The expression for the RBF kernel will in this extended model assume the form
µ ¶
1 T −1
κ(x, µ, Σ) = exp − (x − µ) Σ (x − µ) (9.17)
2

98
9.4 Energy-Dependent Model
¤T
where x = p v ∈ P × V , µ ∈ P × V and Σ is the covariance matrix determining
£

the bandwidth. The kernel vector will be

φ(x) : (x ∈ P × V ) → R1×(K p K v )
φ(x) = κ(x, µ1 , Σ), · · · , κ(x, µK p K v , Σ)
£ ¤
(9.18)

This concept extends to higher dimensions, at the cost of an exponential growth


in the number of model parameters.

9.4 Energy-Dependent Model

Friction is often observed to vary with the temperature of the contact surfaces
and lubricants involved [Bittencourt and Gunnarsson, 2012]. Many systems of
industrial relevance lack the sensors needed to measure the temperature of the
contact regions, thus rendering temperature-dependent models unusable.
The main contributor to the rise in temperature that occurs during operation
is heat generated by friction. This section introduces a model that estimates the
generated energy, and also estimates its influence on the friction.
A simple model for the temperature change in a system with temperature T ,
surrounding temperature T s , and power input W , is given by

d T (t ) ¡ ¢
= k s T s − T (t ) + kW W (t ) (9.19)
dt
for some constants k s > 0, kW > 0. After the variable change ∆T (t ) = T (t )−T s , and
transformation to the Laplace domain, the model (9.19) can be written

kW
∆Tc (s) = Wc (s) (9.20)
s + ks

where the power input generated by friction losses is equal to the product of the
friction force and the velocity

W (t ) = |F f (t )v(t )| (9.21)
We propose to include the estimated power lost due to friction, and its influ-
ence on friction itself, in the friction model according to

F f = F n + sign (v)E (9.22)


k̄ e
E c (s) = G(s)Wc (s) = Wc (s) (9.23)
1 + s τ̄e

where the friction force F f has been divided into the nominal friction F n and the
signal E , corresponding to the influence of the thermal energy stored in the joint.
The nominal model F n can be chosen as any of the models previously introduced,
including (9.16). The energy is assumed to be supplied by the instantaneous power

99
Chapter 9. Friction Modeling and Estimation

due to friction, W , and is dissipating as a first order system with time constant
τ̄e . A discrete representation is obtained after Zero-Order-Hold (ZOH) sampling
[Wittenmark et al., 2002] according to
ke
E d (z) = H (z)Wd (z) = Wd (z) (9.24)
z − τe
In the suggested model form, (9.22) to (9.24), the transfer function H (z) in-
corporates both the notion of energy being stored and dissipated, as well as the
influence of the stored energy on the friction.
Denote by τ̂n the output of the nominal model F n . Estimation of the signal E
can now be done by rewriting (9.22) in two different ways

Ê = (τ − τ̂n ) sign (v) (9.25)


F n = τ − sign (v)Ê (9.26)

Estimating the model


The joint estimation of the parameters in the nominal model and in H (z) in (9.24)
can be carried out in a fixed-point iteration scheme. This amounts to iteratively
finding an estimate F̂ n of the nominal model, using F̂ n to find an estimate Ê of
E according to (9.25), using Ê to estimate H (z) in (9.24) and, using H (z), filter
Ê = H (z)W . An algorithm for the estimation of all parameters in (9.22) to (9.24) is
given in Algorithm 6. The estimation of Ĥ (z) in (9.24) can be done with, e.g., the
Output Error Method [Ljung, 1987; Johansson, 1993] and the estimation of the
nominal model is carried out using the LS procedure from Sec. 9.2.

Algorithm 6 Estimation of the parameters and the signal E in the energy-


dependent friction model.
Require: Initial estimate Ĥ (z, k e , τe );
repeat
Ê ← Ĥ (z)W . Filter W through Ĥ (z);
Update estimate of F n according to (9.26) using (6.5);
Calculate Ê according to (9.25);
Update Ĥ (z) using (9.24) . E.g., command oe() in Matlab;
until Convergence

The proposed model suggests that the change in friction due to the tempera-
ture change occurs in the Coulomb friction. This assumption is always valid for
the nominal model (9.1), and a reasonable approximation for the model (9.2) if
k c À k v v or if the system is both operated and identified in a small interval of
velocities. If, however, the temperature change has a large effect on the viscous
friction or on the position dependence, a 3D basis-function expansion can be
performed in the space P × V × E , E ∈ E . This general model can handle arbitrary
nonlinear dependencies between position, velocity and estimated temperature.
The energy signal E can then be estimated using a simple nominal model, and

100
9.4 Energy-Dependent Model

15 τ
W
10 E
τ [Nm] 5

0
0.63k̄ e E(W )
k̄ e E(W )
-12
0 τ̄e 30 40 50 60
Time [minutes]

Figure 9.3 A realization of simulated signals. The figure shows how the envelope
of the applied torque approximately decays as the signal E . Dashed, blue lines are
drawn to illustrate the determination of initial guesses for the time constant τ̄e and
the gain k̄ e .

included in the kernel expansion for an extended model. Further discussion on


this is held in Sec. 9.7.

Initial guess
For this scheme to work, an initial estimate of the parameters in H (z) is needed.
This can be easily obtained by observing the raw torque data from an experiment.
Consider for example Fig. 9.3, where the system (9.22) and (9.23) has been sim-
ulated. The figure depicts the torque signal as well as the energy signal E . The
envelope of the torque signal decays approximately as the signal E , which allows
for easy estimation of the gain k̄ e and the time constant τ̄e . The time constant
τ̄e is determined by the time it takes for the signal to reach (1 − e −1 ) ≈ 63 % of
its final value. Since G(s) is essentially a low-pass filter, the output E = G(s)W
will approximately reach E ∞ = G(0)E(W ) = k̄ e E(W ) if sent a stationary, stochastic
input W with fast enough time constant (¿ τ̄e ). Here, E(·) denotes the statistical
expectation operator and E ∞ is the final value of the signal E . An initial estimate
of the gain k̄ e can thus be obtained from the envelope of the torque signal as

E∞ E∞
k̄ e ≈ ≈ (9.27)
E(W ) 1 P
N n Wn

We refer to Fig. 9.3 for an illustration, where dashed guides have been drawn to
illustrate the initial guesses.
The discrete counterpart to G(s) can be obtained by discretization with rele-
vant sampling time [Wittenmark et al., 2002].

101
Chapter 9. Friction Modeling and Estimation

9.5 Simulations
To analyze the validity of the proposed technique for estimation of the energy-
dependent model, a simulation was performed. The system described by (9.22)
and (9.23) was simulated to create 50 realizations of the relevant signals, and the
proposed method was run for 50 iterations to identify the model parameters. The
parameters used in the simulation are provided in Table 9.1. Initial guesses were
chosen at random from the uniform distributions k̄ˆ e ∼ U (0, 3k̄ e ) τ̄ˆe ∼ U (0, 3τ̄).

Table 9.1 Parameter values used in simulation. Values given on the format x/y
represent continuous/discrete values.

Parameter Value
kv 5
kc 15
ke -3/-0.5
τe 10/0.9983
Measurement noise στ 0.5 Nm
Sample time h 1s
Duration 3600 s
Iterations 50

Figure 9.4 shows that the estimated parameters converge rapidly to their true
values, and Fig. 9.5 indicates that the Root Mean Square output Error (RMSE)
converges to the level of the added measurement noise. Figure 9.5 further shows
that the errors in the parameter estimates, as defined by (9.28), were typically
below 5 % of the parameter values.
v
uN µ
u Xp x̂ − x ¶2
i i
NPE = t (9.28)
i =1 |x i |

9.6 Experiments
The proposed models and identification procedures were applied to data from
experiments with the dual-arm and the IRB140 industrial robots, see Fig. 9.2.

Procedure
For IRB140, the first joint was used. The rest of the arms were positioned so as to
minimize the moment of inertia. For the dual-arm robot, joint four in one of the
arms was positioned such that the influence of gravity vanished.
A program that moved the selected joint at piecewise constant velocities
between the two joint limits was executed for approximately 20 min. Torque-,

102
9.6 Experiments

kv kc
6 20

15
4
10
2
5

0 0
0 25 50 0 25 50
ke τe
·10−2
0.0 −0.990

−0.5 −0.992
−0.994
−1.0
−0.996
−1.5 −0.998
−2.0 −1.000
0 25 50 0 25 50

Figure 9.4 Estimated parameters during 50 simulations. The horizontal axis dis-
plays the iteration number and the vertical axis the current parameter value. True
parameter values are indicated with dashed lines.

Normalized Parameter Error RMSE

0.4 1.2

0.3 1.0

0.2 0.8

0.1 0.6

0 0.4
0 25 50 0 25 50

Figure 9.5 Evolution of errors during the simulations, the horizontal axis displays
the iteration number. The left plot shows normalized norms of parameter errors,
defined in (9.28), and the right plot shows the RMS output error using the estimated
parameters. The standard deviation of the added measurement noise is shown
with a dashed line.

103
Chapter 9. Friction Modeling and Estimation

30
Torque [Nm] 20
10
0
0 π π 3π 2π
2 2
Motor position [rad]

Figure 9.6 Illustration of the torque dependence upon the motor position for the
IRB140 robot.

velocity-, and position data were sampled and filtered at 250 Hz and subsequently
sub-sampled and stored at 20 Hz, resulting in 25 000 data points. Points approxi-
mately satisfying (9.9) were selected for identification, resulting in a set of 16 000
data points.

Nominal Model The viscous model (9.3) was fit using the ordinary LS procedure
from Sec. 9.2. This model was also used as the nominal model in the subsequent
fitting of position model (9.16) and energy model, (9.22) to (9.24).

Position model For the position-dependent model, the number of basis func-
tions and their bandwidth was determined using cross validation. A large value
of σ has a strong regularizing effect and resulted in a model that generalized well
outside the training data. The model was fit using normalized basis functions as
discussed in Sec. 6.3.
Due to the characteristics of the gear box and electrical motor in many indus-
trial robots, there is a clear dependence not only on the arm position, but also on
the motor position. Figure 9.6 shows the torque versus the motor position when
the joint is operated at constant velocity. This is especially strong on the IRB140
and results are therefore illustrated for this robot. Both arm and motor positions
are available through the simple relationship p mot or = mod 2π (g · p ar m ), where g
denotes the gear ratio. This allows for a basis-function expansion also in the space
of motor positions. To illustrate this, p mot or was expanded into K p m K v = 36 × 6
basis functions, corresponding to the periodicity observed in Fig. 9.6. The results
for the model with motor-position dependence are reported separately. Further
modeling and estimation of the phenomena observed in Fig. 9.6 is carried out
in Chap. 10, where a spectral estimation technique is developed, motivated by the
observation that the spectrum of the signal in Fig. 9.6 is modulated by the velocity
of the motor.
To reduce variance in the estimated kernel parameters, all position-dependent
models were estimated using ridge regression (Sec. 6.4), where a L 2 -penalty was
put on the kernel parameters. The strength of the penalty was determined using
cross validation. All basis-function expansions were performed with normalized
basis functions.

104
9.6 Experiments

kv kc
·10−2
5.272 0.50

0.49
5.270
v >0
0.48
v <0
5.268
0.47

5.266 0.46
0 10 20 30 40 0 10 20 30 40
ke τ̄e [minutes]
·10−5
−5.2 3.2

3.1
−5.4
3.0
−5.6
2.9

−5.8 2.8
0 10 20 30 40 0 10 20 30 40

Figure 9.7 Estimated parameters from experimental data. The horizontal axis
displays the iteration number and the vertical axis displays the current parameter
value.

Table 9.2 Performance indicators for the three different models identified on the
dual-arm robot.

Nominal Position Position + Energy


Fit 86.968 93.193 96.674
FPE 3.63e-03 1.03e-03 2.65e-04
RMSE 6.03e-02 3.15e-02 1.54e-02
MAE 4.71e-02 2.36e-02 1.22e-02

Energy model The energy-dependent model was identified for the dual-arm
robot using the procedure described in Algorithm 6. The initial guesses for H (z)
were τ̄e = 10 min and k̄ e = −0.1. The nominal model was chosen as the viscous fric-
tion model (9.3). Once the signal E was estimated, a kernel expansion in the space
P × V × E with 40 × 6 × 3 basis functions was performed to capture temperature-
dependent effects in both the Coulomb and viscous friction parameters.

Results
The convergence of the model parameters is shown in Fig. 9.7. Figure 9.8 and
Fig. 9.9 illustrate how the models identified for the dual-arm robot fit the exper-

105
Chapter 9. Friction Modeling and Estimation

Torque Nominal model Pos model Pos + Energy model

0.5
Torque [Nm]

−0.5

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.5
Torque [Nm]

−0.5

8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9


Time [minutes]

Figure 9.8 Model fit to experimental data (dual-arm). Upper plot shows an early
stage of the experiment when the joint is cold. Lower plot a later stage, when the
joint has been warmed up.

imental data. The upper plot in Fig. 9.8 shows an early stage of the experiment
when the joint is cold. At this stage, the model without the energy term under-
estimates the torque needed, whereas the energy model does a better job. The
lower plot shows a later stage of the experiment where the mean torque level is
significantly lower. Here, the model without energy term is instead slightly overes-
timating the friction torque. The observed behavior is expected, since the model
without energy dependence will fit the average friction level during the entire
experiment. The two models correspond well in the middle of the experiments
(not shown). Figure 9.9 illustrates the friction torque predicted by the estimated
model as a function of position and velocity. The visible rise in the surface at large
positive positions and positive velocities corresponds to the increase in friction
torque observed in Fig. 9.8 at, e.g., time t = 0.3 min.
The nominal model (9.3), can not account for any of the positional effects
and produces an overall, much worse fit than the position dependent models.
Different measures of model fit for the three models are presented in Table 9.2 and
Fig. 9.11 (Fit (%), Final Prediction Error, Root Mean Square Error, Mean Absolute
Error). For definitions, see e.g., [Johansson, 1993].

106
9.6 Experiments

0.5
Torque [Nm]

−0.5

−1 4
2 2
0 0
−2
Velocity [rad/s] −2 −4
Position [rad]

Figure 9.9 Estimated position-dependent model for dual-arm (discontinuous


surface) together with the datapoints used for estimation. A Coloumb + viscous
model would consist of two flat surfaces, whereas the position-dependent model
has uncovered a more complicated structure.

Torque Pos + Motorpos Model Nominal Model Pos Model


20
Torque [Nm]

−20
0 5 10 15 20 25 30 35 40
Time [s]

Figure 9.10 Model fit including kernel expansion for motor position on IRB140.
During t = [0 s, 22 s], the joint traverses a full revolution of 2π rad. The same dis-
tance was traversed backwards with a higher velocity during t = [22 s, 33 s]. Notice
the repeatable pattern as identified by the position-dependent models.

107
Chapter 9. Friction Modeling and Estimation

Normalized Performance Criteria


1
0.8
Nominal model
0.6
Pos model
0.4
Pos + Energy model
0.2
0
FPE RMS

Figure 9.11 Performance indicators for the identified models, dual-arm.

Normalized Performance Criteria


1
0.8
Nominal model
0.6
Pos model
0.4
Pos + Motorpos Model
0.2
0
FPE RMS

Figure 9.12 Performance indicators for the identified models, IRB140.

For the IRB140, three models are compared. The nominal model (9.3), a model
with a basis-function expansion in the space Par m , and a model with an additional
basis-function expansion in the space Pmot or × V . The resulting model fits are
shown in Fig. 9.10. What may seem like random measurement noise in the torque
signal is in fact predictable using a relatively small set of parameters. Figure 9.12
illustrates that the large dependence of the torque on the motor position results
in large errors. The inclusion of a basis-function expansion of the motor position
in the model reduces the error significantly.

9.7 Discussion

The proposed models try to increase the predictive power of common friction mod-
els, and thereby increase their utility for model-based filtering, by incorporating
position- and temperature dependence into the friction model. Systems with vary-
ing parameters can, in theory, be estimated with recursive algorithms, so called
online identification. As elaborated upon in Sec. 9.1, online or observer-based
identification of friction models is often difficult in practice due to the presence of
additional dynamics or external forces. The proposed models are identified offline,
during a controlled experiment, and are thus not subject to the problems asso-
ciated with online identification. However, apart from the temperature-related

108
9.7 Discussion

parameters, all suggested models are linear in the parameters, and could be
updated recursively using, for instance, the well-known recursive least-squares
algorithm or the Kalman-smoothing algorithms in Chap. 7.
This paper makes use of standard and well-known models for friction, com-
bined with a basis-function expansion to model position dependence. This choice
was motivated by the large increase in model accuracy achieved for a relatively
small increase in model complexity. Linear models are easy to estimate and the
solution to the least-squares optimization problem is well understood. Depending
on the intended use of the friction model, the most fruitful avenue to investigate
in order to increase the model accuracy further varies. To the purpose of force
estimation, accurate models of the stiction force are likely important. Stationary
joints impose a fundamental limitation in the accuracy of the force estimate, and
the maximum stiction force determines the associated uncertainty of the estimate.
Preliminary work shows that the problem of indeterminacy of the friction force
for static joints of redundant manipulators can be mitigated by superposition of a
periodic motion in the nullspace of the manipulator Jacobian. Exploration of this
remains an interesting avenue for future work.
Although outside the scope of this work, effects of joint load on the friction be-
havior can be significant [Bittencourt and Gunnarsson, 2012]. Such dependencies
could be incorporated in the proposed models using the same RBF approach as
for the incorporation of position dependence, i.e., through an RBF expansion in
the joint load (l ∈ L) dimension according to φ(x) : (x ∈ P × E × L) 7→ R1×(K p K e K l ) ,
with K l basis-function centers along dimension L. This strategy would capture
possible position and temperature dependencies in the load-friction interaction.
The temperature-dependent part of the proposed model originates from the
most simple possible model for energy storage, a generic first order differential
equation. Since the generated energy is initially unknown, incorporating it in the
model is not straight forward. We rely on the assumption that a simple initial
friction model can be estimated without this effect and subsequently be used
to estimate the generated energy loss. The energy loss estimated by this model
can then be incorporated in a more complex model. Iterating this scheme was
shown to converge in simulations, but depending on the conditions, the scheme
might diverge. This might happen if, e.g., the friction varies significantly with
temperature, where significantly is taken as compared to the nominal friction
value at room temperature. In such situations, the initially estimated model will
be far from the optimum, reducing the chance of convergence. In practice, this
issue is easily mitigated by estimating the initial model only on data that comes
from the joint at room temperature.
In its simplest form, the proposed energy-dependent model assumes that the
change in friction occurs in the Coulomb friction level. This is always valid for the
Coulomb model, and a reasonable approximation for the viscous friction model
if k c À k v v or if the system is both operated and identified in a small interval of
velocities. If the viscous friction k v v is large, the approximation will be worse. This

109
Chapter 9. Friction Modeling and Estimation

suggests modeling the friction as

F f = k v (E )v + k c (E ) sign (v) (9.29)

where the Coulomb and viscous constants are seen as functions of the estimated
energy signal E , i.e., a Linear Parameter-Varying model (LPV). To accomplish
this, a kernel expansion including the estimated energy signal was suggested and
evaluated experimentally.
Although models based on the internally generated power remove the need for
temperature sensing in some scenarios, they do not cover significant variations in
the surrounding temperature. The power generated in, for instance, an industrial
robot is, however, often high enough to cause a much larger increase in tempera-
ture than the expected temperature variations of its surrounding [Bittencourt and
Gunnarsson, 2012].

9.8 Conclusions

The modeling of both position and temperature dependence in systems with


friction has been investigated. To model position varying friction, a basis-function
expansion approach was adopted. It has been experimentally verified that taking
position dependence into account can significantly reduce the model output error.
It has also been reported that friction phenomena on both sides of a gearbox can
be modeled using the proposed approach.
Further, the influence of an increase in temperature due to power generated
by friction has been modeled and estimated. The proposed approach was based
on a first-order temperature input-output model where the power generated by
friction was used as input. The model together with the proposed identification
procedure was shown to capture the decrease in friction seen in an industrial
robot during a long-term experiment, this was accomplished without the need of
temperature sensing.

110
9.8 Conclusions

111
10
Spectral Estimation

10.1 Introduction
Spectral estimation refers to a family of methods that analyze the frequency con-
tents of a sampled signal by means of decomposition into a linear combination
of periodic basis functions. Armed with an estimate of the spectrum of a signal,
it is possible to determine the distribution of power among frequencies, identify
disturbance components and design filters, etc. [Stoica and Moses, 2005]. The
spectrum also serves as a powerful feature representation for many classifica-
tion algorithms [Bishop, 2006], e.g., by looking at the spectrum of a human voice
recording it is often trivial to distinguish male and female speakers from each
other, or tell if a saw-blade is in need of replacement by analyzing the sound it
makes while cutting.
Standard spectral density-estimations techniques such as the discrete Fourier
transform (DFT) exhibit several well-known limitations. These methods are typ-
ically designed for data sampled equidistantly in time or space. Whenever this
fails to hold, typical approaches employ some interpolation technique in order to
perform spectral estimation on equidistantly sampled data. Other possibilities in-
clude employing a method suitable for nonequidistant data, such as least-squares
spectral analysis [Wells et al., 1985]. Fourier transform-based methods further suf-
fer from spectral leakage due to the assumption that all sinusoidal basis functions
are orthogonal over the data window [Puryear et al., 2012]. Least-squares spectral
estimation takes the correlation of the basis functions into account and further
allows for estimation of arbitrary/known frequencies without modification [Wells
et al., 1985].
In some applications, the spectral content is varying with an external variable,
for instance, a controlled input. As a motivating example, we consider the torque
ripple induced by the rotation of an electrical motor. Spectral analysis of the
torque signal is made difficult by the spectrum varying with the velocity of the
motor, both due to the frequency of the ripple being directly proportional to the
velocity, but also due to the properties of an electric DC-motor. A higher velocity
both induces higher magnitude torque ripple, but also a higher filtering effect due
to the inertia of the rotating parts. The effect of a sampling delay on the phase of
the measured ripple is similarly proportional to the velocity.

112
10.2 LPV Spectral Decomposition

Time-frequency analysis traditionally employs windowing techniques [Johans-


son, 1993] in order to reduce spectral leakage [Harris, 1978; Stoica and Moses,
2005], mitigate effects of non-stationarity, reduce the influence of ill-posed auto-
correlation estimates [Stoica and Moses, 2005], and allow for time-varying spectral
estimates [Puryear et al., 2012]. The motivating example considers estimation of
the spectral content of a signal that is periodic over the space of angular positions
X . The spectral content will vary with time solely due to the fact that the veloc-
ity is varying with time. Time does thus not hold any intrinsic meaning to the
modification of the spectrum, and the traditional windowing in time is no longer
essential.
This chapter develops a spectral-estimation technique using basis-function
expansions that allows the spectral properties (phase and amplitude) of the an-
alyzed signal to vary with an auxiliary signal. Apart from a standard spectrum,
functional relationships between the scheduling signal and the amplitude and
phase of each frequency will be identified. We further consider the task of sparse
spectral estimation and show how the proposed method extends also to this
setting.

10.2 LPV Spectral Decomposition


In order to decompose the spectrum along an external dimension, we consider
basis-function expansions, a topic that was introduced in detail in Sec. 6.3 and
used for friction-modeling in Chap. 9. Intuitively, a basis-function expansion de-
composes an intricate function or signal as a linear combination of simple basis
functions. The Fourier transform can be given this interpretation, where an ar-
bitrary signal is decomposed as a sum of complex-valued sinusoids. With this
intuition, we aim for a method that allows decomposition of the spectrum of a
signal along an external dimension, in LPV terminology called the scheduling
dimension, V .1 If we consider a single sinusoid in the spectrum, the function
decomposed by the basis-function expansion will thus be the complex-valued
coefficient k in ke i ω as a function of the scheduling variable, v, i.e., k = k(v). In the
motivating example, v is the angular velocity of the motor. Using complex-valued
calculations, we simultaneously model the dependence of both amplitude and
phase of a real frequency by considering the complex frequency. This parameteri-
zation will also result in an estimation problem that is linear in the parameters,
which is not true for the problem of estimating phase and amplitude directly.
For additional details on the topic of basis-function expansions, the reader is
referred to Sec. 6.3.

1 We limit our exposition to V ⊆ R for clarity, but higher dimensional scheduling spaces are possible.

113
Chapter 10. Spectral Estimation

Least-squares identification of periodic signals


Spectral estimation amounts to estimation of models of a signal y on the form

y(n) = k 1 sin(ωn) + k 2 cos(ωn) (10.1)

which are linear in the parameters k. Identification of linear models using the
least-squares procedure was described in Sec. 6.2. The model (10.1) can be written
in compact form by noting that e i ω = cos ω+i sin ω, which will be used extensively
throughout the chapter to simplify notation.2
We will now proceed to formalize the proposed spectral-decomposition
method.

Signal model
Our exposition in this chapter will make use of Gaussian basis functions. The
method is, however, not limited to this choice and extends readily to any other set
of basis functions. A discussion on different choices is held in Sec. 6.3.
We start by establishing some notation. Let k denote the Fourier-series co-
efficients3 of interest. The kernel activation vector φ(v i ) : (v ∈ V ) 7→ R J maps the
input to a set of basis-function activations and is given by
¤T
φ(v i ) = κ(v i , θ1 ) · · · κ(v i , θ J ) ∈ R J
£
(10.2)
(v − µ)2
µ ¶
κ(v, θ j ) = κ j (v) = exp − (10.3)
2σ2

where κ j is a basis function parameterized by θ j = (µ j , σ j ), µ ∈ V is the center of


the kernel and σ2 is determining the width.
Let y denote the signal to be decomposed and denote the location of the
sampling of y i by x i ∈ X . The space X is commonly time or space; in the motivat-
ing example of the electrical motor, X is the space of motor positions.4 Let the
intensities of a set of complex frequencies i ω ∀ ω ∈ Ω be given by basis-function
expansions along V , according to

J
k ω, j κ j (v i )e −i ωxi = φ(v i )e −i ωxi , kω ∈ C J
X X X T
ŷ i = kω (10.4)
ω∈Ω j =1 ω∈Ω

The complex coefficients to be estimated, k ∈ CO×J , O = card(Ω), constitute the


Fourier-series coefficients, where the intensity of each coefficient is decomposed

2 Note that solving the complex LS problem using complex regressors e i ω is not equivalent to solving
the real LS problem using sin/cos regressors.
3 We use the term Fourier-series coefficients to represent the parameters in the spectral decomposi-
tion, even if the set of basis functions are not chosen so as to constitute a true Fourier series.
4 We note at this stage that x ∈ X can be arbitrarily sampled and are not restricted to lie on an
equidistant grid, as is the case for, e.g., Fourier transform-based methods.

114
10.2 LPV Spectral Decomposition

over V through the BFE. This formulation reduces to the standard Fourier-style
spectral model (10.5) in the case φ(v) ≡ 1
k ω e −i ωx = Φk
X
ŷ = (10.5)
ω∈Ω
−i ω1 x −i ωO x
where Φ = [e ... e ]. If the number J of basis functions equals the number
of data points N , the model will exactly interpolate the signal, i.e., ŷ = y. If in
addition to J = N , the basis-function centers are placed at µ j = v j , we obtain a
Gaussian process regression interpretation where κ is the covariance function.
Owing to the numerical properties of the analytical solution of the least-squares
problem, it is often beneficial to reduce the number of parameters significantly,
so that J ¿ N . If the chosen basis functions are suitable for the signal of interest,
the error induced by this dimensionality reduction is small. In a particular case,
the number of RBFs to include, J , and the bandwidth Σ is usually chosen based
on evidence maximization or cross validation [Murphy, 2012].
To facilitate estimation of the parameters in (10.4), we rewrite the model by
stacking the regressor vectors in a regressor matrix A, see Sec. 10.2, such that
An,: = vec φ(v n )ΦT ∈ CO·J , n = 1...N
¡ ¢T

We further define à by expanding the regressor matrix into its real and imaginary
parts
à = ℜA ℑA ∈ RN ×2O J
£ ¤

such that routines for real-valued least-squares problems can be used. The com-
plex coefficients are, after solving the real-valued least-squares problem5 retrieved
as k = k ℜ + i k ℑ where
T T T
° °
[k ℜ kℑ ] = arg min °Ãk̃ − y °

Since the purpose of the decomposition is spectral analysis, it is important to


normalize the basis-function activations such that the total activation over V for
each data point is unity. To this end, the expressions (10.4) are modified to
J
k ω, j κ̄ j (v)e −i ωx = φ̄(v)e −i ωx
X X X T
ŷ = kω
ω∈Ω j =1 ω∈Ω
κ j (v) φ(v)
κ̄ j (v) = P , φ̄(v) = P (10.6)
j κ j (v) φ(v)
This ensures that the spectral content for a single frequency ω is a convex com-
bination of contributions from each basis function in the scheduling dimension.
Without this normalization, the power of the spectrum would be ill-defined and
depend on an arbitrary scaling of the basis functions. The difference between a set
of Gaussian functions and a set of normalized Gaussian functions is demonstrated
in Fig. 10.1. The normalization performed in (10.6) can be viewed as the kernel
function being made data adaptive by normalizing φ(v) to sum to one.
5 See Sec. 6.2 for details on the least-squares procedure.

115
Chapter 10. Spectral Estimation

Activation κ(v, θ)
0.5

−0.5

−1
-2 -1 0 1 2
Scheduling variable v

Figure 10.1 Gaussian (dashed) and normalized Gaussian (solid) windows. Regu-
lar windows are shown mirrored in the x-axis for clarity.

Amplitude and phase functions


In spectral analysis, two functions of the Fourier-series coefficients are typically of
interest, the amplitude and phase functions. These are easily obtained through
elementary trigonometry and are stated here as a lemma, while a simple proof is
deferred until Sec. 10.A:

L EMMA 3
Let a signal y be composed by the linear combination y = k 1 cos(x) + k 2 sin(x),
then y can be written on the form

y = A cos(x − ϕ)

with
k2
q µ ¶
A= k 12 + k 22 ϕ = arctan 2
k1

From this we obtain the following two functions for a particular frequency ω
q
A(ω) = |k ω | = ℜk ω 2 + ℑk ω 2
ϕ(ω) = arg(k ω ) = arctan (ℑk ω /ℜk ω )

In the proposed spectral-decomposition method, these functions further depend


on v, and are approximated by
¯ ¯
¯X J ¯ ¯
A(ω, v) = ¯ k ω, j κ̄(v)¯ = ¯k ω φ̄(v)¯
T
¯
(10.7)
¯ ¯
¯ j =1 ¯
à !
J
ϕ(ω, v) = arg k ω, j κ̄(v) = arg k ω φ̄(v)
X ¡ T ¢
(10.8)
j =1

116
10.2 LPV Spectral Decomposition

Covariance properties
We will now investigate and prove that (10.7) and (10.8) lead to asymptotically
unbiased and consistent estimates of A and ϕ, and will provide a strategy to obtain
confidence intervals. We will initially consider a special case for which analysis
is simple, whereafter we invoke the RBF universal approximation results of Park
and Sandberg (1991) to show that the estimators are well motivated for a general
class of functions. We start by considering signals on the form (10.9), for which
unbiased and consistent estimates of the parameters are readily available:
P ROPOSITION 3
Let a signal y be given by

y = a cos(x) + b sin(x) + e e ∈ N (0, σ2 )


a = αTφ b = βTφ (10.9)

with φ = φ(v) and let α̂ and β̂ denote unbiased estimates of α and β. Then
q
¢2 ¡ ¢2
Â(α̂, β̂) = α̂Tφ + β̂Tφ
¡
(10.10)

has an expected value with upper and lower bounds given by


© ª q
A < E Â < A 2 + φTΣα φ + φTΣβ φ (10.11)

Proof Since α, β and e appear linearly in (10.9), unbiased and consistent esti-
mates α̂ and β̂ are available from the least-squares procedure (see Sec. 6.2). The
expected value of  2 is given by
n¡ ¢2 ¡ ¢2 o
E Â 2 = E α̂Tφ + β̂Tφ
© ª
n¡ ¢2 o n¡ ¢2 o
= E α̂Tφ + E β̂Tφ (10.12)

We further have
n¡ ¢2 o ª2
E α̂Tφ = E α̂Tφ + V α̂Tφ
© © ª

¢2
= αTφ + φTΣα φ
¡
(10.13)

where Σα and Σβ are the covariance matrices of α̂ and β̂ respectively. Calculations


for β are analogous. From (10.12) and (10.13) we deduce
¢2 ¡ ¢2
E Â 2 = αTφ + βTφ + φTΣα φ + φTΣβ φ
© ª ¡

= A 2 + φTΣα φ + φTΣβ φ (10.14)

Now, due to Jensen’s inequality, we have


np o q © ª
E Â = E Â 2 < E Â 2
© ª
(10.15)

117
Chapter 10. Spectral Estimation

which provides the upper bound on the expectation of Â. The lower bound is
obtained by writing  on the form
q
¢2 ¡ ¢2 ° °
α̂Tφ + β̂Tφ = °k̂ °
¡
Â(k) = (10.16)

with k̂ = [α̂Tφ β̂Tφ]. From Jensen’s inequality we have

E Â = E °k̂ ° > °E k̂ ° = °k ° = A
© ª ©° °ª ° © ª° ° °
(10.17)

which concludes the proof. 2

C OROLLARY 2
q
¢2 ¡ ¢2
α̂Tφ + β̂Tφ
¡
 = (10.18)
is an asymptotically unbiased and consistent estimate of A.

Proof Since the least-squares estimate, upon which the estimated quantity is
based, is unbiased and consistent, the variances in the upper bound in (10.11)
will shrink as the number of datapoints increases and both the upper and lower
bounds will become tight, hence

E Â → A as N → ∞
© ª

Analogous bounds for the phase function are harder to obtain, but the simple
estimator ϕ̂ = arg(k̂) based on k̂ obtained from the least-squares procedure is still
asymptotically consistent [Kay, 1993].
Estimates using the least-squares method (6.5) are, under the assumption
of uncorrelated Gaussian residuals of variance σ2 , associated with a posterior
parameter covariance σ2 (ATA)−1 . This will in a straightforward manner produce
confidence intervals for a future prediction of y as a linear combination of the
estimated parameters. Obtaining unbiased estimates of the confidence intervals
for the functions A(v, ω) and ϕ(v, ω) is made difficult by their nonlinear nature.
We therefore proceed to establish an approximation strategy.
The estimated parameters k̂ are distributed according to a complex-normal
distribution CN (ℜz + i ℑz, Γ,C ), where Γ and C are obtained through

Γ = Σℜℜ + Σℑℑ + i (Σℑℜ − Σℜℑ )


C = Σℜℜ − Σℑℑ + i (Σℑℜ + Σℜℑ )
Σ Σℜℑ
· ¸
Σ = ℜℜ = σ2 (ÃTÃ)−1 (10.19)
Σℑℜ Σℑℑ

For details on the CN -distribution, see, e.g., [Picinbono, 1996]. A linear combina-
tion of squared variables distributed according to a complex normal (CN ) distribu-
tion, is distributed according to a generalized χ2 distribution, a special case of the

118
10.2 LPV Spectral Decomposition

gamma distribution. Expressions for sums of dependent gamma-distributed vari-


ables exist, see, e.g., [Paris, 2011], but no expressions for the distribution of linear
combinations of norms of Gaussian vectors, e.g., (10.7), are known to the author.
In order to establish estimates of confidence bounds on the spectral functions,
one is therefore left with high-dimensional integration or Monte-Carlo techniques.
Monte-Carlo estimates will be used in the results presented in this chapter. The
sampling from a CN -distribution is outlined in Proposition 4, with a proof given
in Sec. 10.A:
P ROPOSITION 4
The vector
z = ℜz̃ + i ℑz̃ ∈ CD
where · ¸ · ¸
ℜz̃ ℜz
=L , ℜz, ℑz ∼ N (0, I ) ∈ RD
ℑz̃ ℑz
and Σ = LLT is a Cholesky decomposition of the matrix
· ¸
1 ℜ(Γ +C ) ℑ(−Γ +C )
Σ= ∈ R2D×2D
2 ℑ(Γ +C ) ℜ(Γ −C )
is a sample from the complex normal distribution CN (0, Γ,C ). 2
By sampling from the posterior distribution p(k ω |y) and propagating the
samples through the nonlinear functions A(ω, v) and ϕ(ω, v), estimates of relevant
confidence intervals are easily obtained.
The quality of the estimate thus hinges on the ability of the basis-function
expansion to approximate the given functions a and b in (10.9). Park and Sandberg
(1991) provide us with the required result that establishes RBF expansions as
universal function approximators for well-behaved functions.

Line-spectral estimation
In many application, finding a sparse spectral estimate with only a few nonzero
frequency components is desired. Sparsity-promoting regularization can be em-
ployed when solving for the Fourier coefficients in order to achieve this. This
procedure is sometimes referred to as line-spectral estimation [Stoica and Moses,
2005] or L 1 -regularized spectral estimation. While this technique only requires
° °
the addition of a regularization term to the cost function in (6.5) on the form °k °1 ,
the resulting problem no longer has a solution on closed form, necessitating an it-
erative solver. Along with the standard periodogram and Welch spectral estimates,
we compare L 1 -regularized spectral estimation to the proposed approach on a
sparse estimation problem in the next section. We further incorporate group-lasso
regularization for the proposed approach. The group-lasso, described in Sec. 6.4,
amounts to adding the term X° °
°k ω ° (10.20)
2
ω∈Ω
to the cost function. We solve the resulting lasso and group-lasso regularized
spectral-estimation problems using the ADMM algorithm [Parikh and Boyd, 2014].

119
Chapter 10. Spectral Estimation

10.3 Experimental Results

Simulated signals
To assess the qualities of the proposed spectral-decomposition method, a test
signal y t is generated as follows

A(ω, v) cos ωx − ϕ(ω, v) + e e ∈ N (0, 0.12 )


X ¡ ¢
yt =
ω∈Ω
v t = linspace(0, 1, N )
x = sort(U (0, 10))

where Ω = {4π, 20π, 100π}, the scheduling variable v t is generated as N = 500


equidistantly sampled points between 0 and 1 and x is a sorted vector of uniform
random numbers. The sorting is carried out for visualization purposes and for
the Fourier-based methods to work, but this property is not a requirement for the
proposed method. The functions A and ϕ are defined as follows

A(4π, v) = 2v 2
A(20π, v) = 2/(5v + 1)
2
A(100π, v) = 3e −10(v−0.5)
ϕ(ω, v) = 0.5A(ω, v) (10.21)

where the constants are chosen to allow for convenient visualization. The signals
y t and v t are visualized as functions of the sampling points x in Fig. 10.2 and the
functions A and ϕ together with the resulting estimates and confidence intervals
using J = 50 basis functions are shown in Fig. 10.2. The traditional power spectral
density can be calculated from the estimated coefficients as
¯ ¯2
¯X J ¯
P (ω) = ¯ k̂ ω, j ¯ (10.22)
¯ ¯
¯ j =1 ¯

and is compared to the periodogram and Welch spectral estimates in Fig. 10.3.
This figure illustrates how the periodogram and Welch methods fail to clearly iden-
tify the frequencies present in the signal due to the dependence on the scheduling
variable v. The LPV spectral method, however, correctly identifies all three fre-
quencies present. Incorporation of L 1 regularization or group-lasso regularization
introduces a bias. The L 1 -regularized periodogram is severely biased, but manages
to identify the three frequency components present. The difference between the
LPV method and the group-lasso LPV method is small, where the regularization
correctly sets all non-present frequencies to zero, but at the expense of a small
bias.

120
10.3 Experimental Results

Est. ω = 4π True ω = 4π
Est. ω = 20π True ω = 20π Test signal y t and
Est. ω = 100π True ω = 100π scheduling signal v t
3
yt
vt 2
2
A(ω, v)

1
−2

0 −4
0.0 0.2 0.4 0.6 0.8 1.0 0 2 4 6 8 10
v x (sampling points)

Figure 10.2 Left: True and estimated functional dependencies with 95% confi-
dence intervals. Right: Test signal with N = 500 datapoints. The signal contains
three frequencies, where the amplitude and phase are modulated by the func-
tions (10.21) depicted in the left panel. For visualization purposes, v is chosen
as an increasing signal. We can clearly see how the signal y t on the right has a
higher amplitude when v ≈ 0.5 and is dominated by a low frequency when v ≈ 1,
corresponding to the amplitude functions on the left.

Periodogram Welch L1 LPV L 1 LPV


Spectral density

100

10−2

10−4
0 20 40 60 80 100 120 140
ω [rad/s]

Figure 10.3 Estimated spectra, test signal. The periodogram and Welch meth-
ods fail to identify the frequencies present in the signal due to the dependence
on the scheduling variable v. The L 1 -regularized periodogram correctly identi-
fies the frequency components presents, but is severely biased. The LPV spectral
method correctly identifies all three frequencies present and the group-lasso LPV
method identifies the correct spectrum with a small bias and correctly sets all other
frequencies to 0.

121
Chapter 10. Spectral Estimation

Measured signals
The proposed method was used to analyze measurements obtained from an ABB
dual-arm robot (Fig. 9.2). Due to torque ripple and other disturbances, there is
a velocity-dependent periodic signal present in the velocity control error, which
will serve as the subject of analysis. The analyzed signal is shown in Fig. 10.4.
The influence of Coulomb friction on the measured signal is mitigated by
limiting the support of half of the basis functions to positive velocities and vice
versa. A total number of 10 basis functions was used and the model was identi-
fied with ridge regression. The regularization parameter was chosen using the
L-curve method [Hansen, 1994]. The identified spectrum is depicted in Fig. 10.5,
where the dominant frequencies are identified. These frequencies correspond
well with a visual inspection of the data. Figure 10.5 further illustrates the result of
applying the periodogram and Welch spectral estimators to data that has been
sorted and interpolated to an equidistant grid. These methods correctly identify
the main frequency, 4 rev−1 , but fail to identify the lower-amplitude frequencies
at 7 rev−1 and 9 rev−1 visible in the signal. The amplitude functions for the three
strongest frequencies are illustrated in Fig. 10.6, where it is clear that the strongest
frequency, 4 rev−1 , has most of its power distributed over the lower-velocity data-
points, whereas the results indicate a slight contribution of frequencies at 7 rev−1
and 9 rev−1 at higher velocities, corresponding well with a visual inspection of the
signal. Figure 10.6 also displays a histogram of the velocity values of the analyzed
data. The confidence intervals are narrow for velocities present in the data, while
they become wider outside the represented velocities.

Measured signal y t
·10−2
2
1
1

0 0

−1
−1 −2
−π 0 π
x [rad] (sampling points)

Figure 10.4 Measured signal as a function of sampling location, i.e., motor posi-
tion. The color information indicates the value of the velocity/scheduling variable
in each datapoint. Please note that this is not a plot of the measured data sequen-
tially in time. This figure indicates that there is a high amplitude periodicity of
4 rev−1 for low velocities, and slightly higher frequencies but lower-amplitude
signals at 7 rev−1 and 9 rev−1 for higher velocities.

122
10.3 Experimental Results

Periodogram Welch LPV


−3
10
Spectral density

10−5

10−7

10−9
0 2 4 6 8 10 12
f [rev−1 ]

Figure 10.5 Estimated spectra, measured signal. The dominant frequencies are
identified by the proposed method, while the Fourier-based methods correctly
identify the main frequency, 4 rev−1 , but fail to identify the lower-amplitude fre-
quencies at 7 rev−1 and 9 rev−1 visible in the signal in Fig. 10.4.

f =1 f =4 f =7 f =9
·104
·10−3
Number of datapoints in V

6
2 A(v)

1
2

0 0
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
v

Figure 10.6 Estimated functional dependencies with 99% confidence intervals.


The left axis and histogram illustrates the number of datapoints available at each
velocity v. The right axis illustrate the estimated amplitude functions together with
their confidence intervals.

123
Chapter 10. Spectral Estimation

10.4 Discussion
In this chapter, we make further use of basis-function expansions, this time in the
context of spectral estimation. The common denominator is the desire to model
a functional relationship where the function is low-dimensional and can have
an arbitrary complicated form. The goal is to estimate how the amplitude and
phase of sinusoidal basis functions that make up a signal vary with an auxiliary
signal. Due to the phase variable entering nonlinearly, the estimation problem
is rephrased as the estimation of linear parameter-varying coefficients of sines
and cosines of varying frequency. The amplitude and phase functions are then
calculated using nonlinear transforms of the estimated coefficients. While it was
shown that the simple estimators of the amplitude and phase functions are biased,
this bias vanishes as the number of datapoints increases.
From the expression for the expected value of the amplitude function
© ª q
A < E Â < A 2 + φTΣα φ + φTΣβ φ (10.23)

we see that the bias vanishes as Σα and Σβ are reduced. Further insight into this
inequality can be gained by considering the scalar, nonlinear transform

f (x) = |x|, x ∼ N (µ, σ2 ) (10.24)

If µ is several standard deviations away from zero, the nonlinear aspect of the
absolute-value function will have negligible effect. When µ/σ becomes smaller,
say less than 2, the effect starts becoming significant. Hence, if the estimated
coefficients are significantly different from zero, the bias is small. This is apparent
also from the figures indicating the estimated functional relationship with esti-
mated confidence bounds. For areas where data is sparse, the confidence bounds
become wider and the estimate of the mean seems to be inflated in these areas.
The leakage present in the standard Fourier-based methods is usually unde-
sired. The absence of leakage might, however, be problematic when the number of
estimated frequencies is low, and the analyzed signal contains a very well defined
frequency. If this frequency is not included in the set of basis frequencies, the ab-
sence of leakage might lead to this component being left unnoticed. Introduction
of leakage is technically straightforward, but best practices for doing so remain to
be investigated.

10.5 Conclusions

This chapter developed a spectral-estimation method that can decompose the


spectrum of a signal along an external dimension, which allows estimation of the
amplitude and phase of the frequency components as functions of the external
variable. The method is linear in the parameters which allows for straightforward
calculation of the spectrum through solving a set of linear equations. The method
does not impose limitations such as equidistant sampling, does not suffer from

124
10.A Proofs

leakage and allows for estimation of arbitrary chosen frequencies. The closed-form
calculation of the spectrum requires O (J 3 O 3 ) operations due to the matrix inver-
sion associated with solving the LS-problem, which serves as the main drawback
of the method if the number of frequencies to estimate is large (the product JO
greater than a few thousands). For larger problems, an iterative solution method
must be employed. Iterative solution further enables the addition of a group-lasso
regularization, which was shown to introduce sparsity in the estimated spectrum.
Implementations of all methods and examples discussed in this chapter are
made available in [LPVSpectral.jl, B.C., 2016].

Appendix A. Proofs

Proof Lemma 3
The amplitude A is given by two trigonometric identities

A cos(x − ϕ) = A cos(ϕ) cos(x) + A sin(ϕ) sin(x) (10.25)


= k 1 cos(x) + k 2 sin(x) (10.26)
k 12 + k 22 2 2 2
= A (cos(ϕ) + sin(ϕ) ) = A 2
(10.27)

and the phase ϕ by

k2 A sin ϕ
µ ¶ µ ¶
arctan = arctan =ϕ
k1 A cos ϕ

where k 1 = A cos(ϕ), k 2 = A sin(ϕ) is identified from (10.25). 2

Proof Proposition¤ 4
Let v T = x T y T . The mean and variance of ṽ = Lv is given by
£

E {ṽ} = L E {v} = 0
E ṽ ṽ T = E Lv v TLT = LI LT = Σ
© ª © ª

The complex vector z = x + i y ∈ CD composed of the elements of v is then


CN (0, Γ,C )-distributed according to [Picinbono, 1996, Proposition 1]. 2

125
11
Model-Based Reinforcement
Learning

In this chapter, we will briefly demonstrate how the models developed in Part
I of the thesis can be utilized for reinforcement-learning purposes. As alluded
to in Sec. 5.2, there are many approaches to RL, some of which learn a value
function, some a policy directly and some that learn the dynamics of the system.
While value-function-based methods are very general and can be applied to a
very wide range of problems without making many assumptions, they are terribly
inefficient and require a vast amount of interaction with the environment in
order to learn even simple policies. Model-based methods trade off some of the
generality by making more assumptions, but in return offer greater data efficiency.
In line with the general topic of this thesis, we will demonstrate the use of the LTV
models and identification algorithms from Chap. 7 together with the black-box
models from Chap. 8 in an example of model-based reinforcement learning. The
learning algorithm used is derived from [Levine and Koltun, 2013], where it was
shown to have impressive data-efficiency when applied to physical systems. In this
chapter, we show how the data-efficiency can be improved further by estimating
the dynamical models using the methods from Chap. 7.

11.1 Iterative LQR—Differential Dynamic Programming

Trajectory optimization refers simply to an optimization problem with sequential


structure and dynamical equations among the constraints. As an example, con-
sider the problem of finding a sequence of joint torques that moves a robot arm
from one pose to another while minimizing a weighted sum of the required time
and energy. In solving this problem, the dynamics of the robot have to be satis-
fied by any feasible trajectory, and it is thus reasonable to include the dynamical
equations as constraints in the optimization problem.
In Sec. 5.1 we briefly mentioned that the LQR algorithm [Todorov and Li, 2005]
can be used for trajectory optimization with nonlinear systems and nonquadratic
cost functions by linearizing the system and approximating the cost function

127
Chapter 11. Model-Based Reinforcement Learning

with a quadratic. The solution will, due to the approximations, not be exact, and
the procedure must thus be iterated, where new approximations are obtained
around the previous solution trajectory. This algorithm can be used for model-
based, trajectory centric reinforcement learning by, after each optimization pass,
performing a rollout on the real system and updating the system model using
the new data. We will begin with an outline of the algorithm [Todorov and Li,
2005] and then provide a detailed description on how we use it for model-based
reinforcement learning.

The algorithm
The standard LQR algorithm uses dynamic programming to find the optimal linear,
time-varying feedback gain along a trajectory of an, in general, linear time-varying
system. To extend this to the nonlinear, non-Gaussian setting, we linearize the
system and cost function c along a trajectory τi = {x̂ t , û t }Tt=1 , where i denotes the
learning iteration starting at 1 and s denotes the superstate [x T uT]T:
x + = Ax + Bu (11.1)
1
c(x, u) = s Tc s + s Tc ss s (11.2)
2
where subscripts denote partial derivatives. Using the linear quadratic model
(11.1), the optimal state- and state-action value functions V and Q can be calcu-
lated recursively starting at t = T using the expressions
Q ss = c s + fˆsTVxx
+ ˆ
fs (11.3)
Qs = cs + ˆT
f s Vx+ (11.4)
T −1
Vxx = Q xx −Q ux Q uu Q ux (11.5)
−1
T
Vx = Q x −Q uxQ uu Qu (11.6)
+
where all quantities are given at time t and · denotes a quantity at t +1. Subscripts
Q s and Q ux , etc., denote partial derivatives ∇s Q and ∇ux Q, respectively. The
calculation of value functions is started at time t = T with VT +1 = 0. Given the
calculated value functions, the optimal update to the control signal trajectory, k,
and feedback gain, K , are given by
−1
K = −Q uu Q ux (11.7)
−1
k = −Q uu Qu (11.8)
These are used to update the nominal trajectory τ as
u i +1 = u i + k i (11.9)
C (x) = K x̄ (11.10)
where x̄ = x − x̂ denotes deviations from the nominal trajectory. The update of
the state trajectory in τ is done by forward simulation of the system using u i +1 as
input. In the reinforcement-learning setting, this corresponds to performing an
experiment on the real system.

128
11.1 Iterative LQR—Differential Dynamic Programming

Stochastic controller
For learning purposes, it might be beneficial to let the controller be stochastic in
order to more efficiently explore the statespace around the current trajectory. A
Gaussian controller
p(u|x) = N (K x̄, Σ) (11.11)
can be calculated with the above framework, where the choice Σ = Q uu
−1
was shown
by Levine and Koltun (2013) to optimize a maximum entropy cost function

p(τ) = arg min E {c(τ)} − H(p(τ)) (11.12)


p(u|x)

subject to p(x + |x, u) = N (Ax + Bu, Σ f ) (11.13)

−1
To explore with Gaussian noise with a covariance matrix of Q uu makes intuitive
sense. It allows us to add large noise in the directions where the value function does
not increase much. Directions where the value function seems to increase sharply,
however, should not be explored as much as this would lead the optimization to
high-cost areas of the state-space.
The term Σ f in (11.13) corresponds to the uncertainty in the model of f along
the trajectory. If f is modeled as an LTV model and estimated using the methods
in Chap. 7, this term can be derived from quantities calculated by the Kalman
smoothing algorithm, making these identification methods a natural fit to the
current setting.

Staying close to the last trajectory


In a reinforcement-learning setting, the model under which the algorithm opti-
mizes, fˆ, is estimated from a small amount of data. Optimization can in general
lead arbitrarily far away from the starting point, i.e., the initial trajectory. A nonlin-
ear model or LTV model estimated around a trajectory is not likely to accurately
represent the dynamics of the system in regions of the state space far away from
where it was estimated. To mitigate this issue and make sure that the optimization
procedure does not suggest trajectories too far away from the trajectory along
which the model was estimated, a constraint can be added to the optimization
problem. The constraint suggested by Levine and Abbeel (2014) takes the form

D KL p(τ) || p̂(τ) ≤ ²
¡ ¢
(11.14)

where µ ¶
Y
p(τ) = p(x 0 ) p(x t +1 |x t , u t )p(u t |x t ) (11.15)
t

and p̂(τ) denotes the previous trajectory distribution. The addition of (11.14)
allows the resulting constrained optimization problem to be solved efficiently
using a small modification to the algorithm, reminiscent of the Lagrangian method.
The simplicity of the resulting problem is a product of the linear-Gaussian nature
of the models involved.

129
Chapter 11. Model-Based Reinforcement Learning

11.2 Example—Reinforcement Learning


In this example, we use methods from Chap. 7 to identify LTV dynamics models
for reinforcement learning. We will consider the system used for simulations in
Chap. 8, the pendulum on a cart. Owing to the nonlinear nature of the pendulum
dynamics, linear approximations of the dynamics in the upward (initial) position
have an unstable pole, and imaginary poles in the downward (final) position. This
simple system is well understood and often used as a reinforcement learning
benchmark. The traditional goal of the benchmark is to find a policy that is able
to swing up the pendulum from its downward hanging position to the unstable
equilibrium where the pendulum is upright. We will consider the reverse task
of optimally dampen the swinging pendulum after it has fallen from its upright
position, since this task allows us to use a trajectory-based RL algorithm. We can
thus compare the results to those obtained by using an optimal control algorithm
equipped with the ground-truth system model. This task is also robust to the
choice of cost function. We simply penalize states and control signals quadratically
and constrain the control signal to ±10.
To find an optimal dampening policy, we employ a reinforcement learning
framework inspired by Levine and Koltun (2013), summarized in Algorithm 7. In

Algorithm 7 A simple reinforcement learning algorithm using trajectory optimiza-


tion.
repeat
Perform rollout
Fit dynamics model
Optimize trajectory and controller
until Convergence

the third step of Algorithm 7, we employ the iLQR algorithm with a bound on the
KL divergence between two consecutive trajectory distributions. To incorporate
control signal bounds, we employ a variant of iLQR due to Tassa et al. (2014). Our
implementation of iLQR, allowing control signal contraints and constraints on the
KL divergence between trajectory distributions, is made available at [Differential-
DynamicProgramming.jl, B.C., 2016].
We compare three different models; the ground-truth system model, an LTV
model obtained by solving (7.5) using the Kalman smoothing algorithm, and an
LTI model. The total cost over T = 400 time steps is shown as a function of learning
iteration in Fig. 11.1. The figure illustrates how the learning procedure reaches
the optimal cost of the ground-truth model when an LTV model is used, whereas
when using an LTI model, the learning diverges. The figure further illustrates that
if the LTV model is fit using a prior (Sec. 7.5), the learning speed is increased. We
obtain the prior in two different ways. In one experiment, we fit a neural network
model on the form x + = g (x, u) + x with the same structure and parameters as
in Sec. 8.6, with tangent-space regularization. After every rollout, the black-box

130
11.2 Example—Reinforcement Learning

LTV LTI LTV, successive prior


LTV, prebuilt prior True model

Median cost over 10 Monte-Carlo runs

80

60
Cost

40

20
0 2 4 6 8 10 12 14 16 18 20
Learning Iteration

Figure 11.1 Reinforcement learning example. Three different model types are
used to iteratively optimize the trajectory of a pendulum on a cart. Colored bands
show standard deviation around the median. Linear expansions of the dynamics
are unstable in the upward position and stable in the downward position. The
algorithm thus fails with an LTI model. A standard LTV model learns the task well
and reaches the optimal cost in approximately 15 learning iterations. Building a
successive DNN model for use as prior to the LTV model improves convergence
slightly. Using a DNN model from a priori identification as prior improves conver-
gence at the onset of training when the amount of information obtained about
the system is low. Interestingly, the figure highlights how use of this model ham-
pers performance near convergence. This illustrates that successively updating
the DNN model with new data along the current trajectory improves the model in
the areas of the state space important to the task. The optimal control algorithm
queried the true system model a total of 52 times.

model is refit using all available data up to that point.1 No uncertainty estimate is
available from such a model, and we thus heuristically decay the covariance of
the prior obtained from this model as the inverse of the learning iteration. This
prior information is easily incorporated into the learning of the LTV model using
the method outlined in Sec. 7.5, and the precision of the prior acts as a hyper
parameter weighing the LTV model against the neural network model. In the
second experiment with priors, we pretrain a neural network model of the system
in a standard system-identification fashion, and use this model as a prior without
updating it with new data during the learning iterations. The figure indicates
that use of a successively updated prior model is beneficial to convergence, but
1 Retraining in every iteration is computationally costly, but ensures that the model does not con-
verge to a minimum only suitable for the first seen trajectories.

131
Chapter 11. Model-Based Reinforcement Learning

θ θ̇
3

2 0

1
−5
0

0 100 200 300 400 0 100 200 300 400


p v

0.2 1

0 0

−0.2 −1

−0.4 −2
0 100 200 300 400 0 100 200 300 400
u
10
iLQR with true model LTV
5 LTV successive prior

−5

−10
0 50 100 150 200 250 300 350 400

Figure 11.2 Optimized trajectories of pendulum system after learning using three
different methods. The state consists of angle θ, angular velocity θ̇, position p and
velocity v. DDP denotes the optimal control procedure with access to the true
system model. The noise in the control signal trajectories of the RL algorithms is
added for exploration purposes.

132
11.2 Example—Reinforcement Learning

it takes a few trajectories of data before the difference is noticeble. If the prior
model is pretrained the benefit is immediate, but failure to update the prior model
hampers convergence towards the end of learning. This behavior indicates that
it was beneficial to update the model with data following the state-visitation
distribution of the policy under optimization, as this focuses the attention of the
neural network, allowing it to accurately represent the system in relevant areas of
the state space. In taking this experiment further, a Bayesian prior model, allowing
calculation of the true posterior over the Jacobian for use in (7.16), would be
beneficial. Example trajectories found by the learning procedure using different
models are illustrated in Fig. 11.2. Due to the constraints on the control signal, all
algorithms converge to similar policies of bang-bang type.
Although a simple example, Fig. 11.1 illustrates how an appropriate choice of
model class can allow a trajectory-based reinforcement learning problem to be
solved using very few experiments on the real process. We emphasize that one
iteration of Algorithm 7 corresponds to one experiment on the process. Levine
and Koltun (2013), from where inspiration to this approach was drawn, require a
handful of experiments on the process per learning iteration and policy update,
just to fit an adequate dynamics model.

133
Part II

Robot State Estimation


12
Introduction—Friction Stir
Welding

Friction stir welding (FSW) is becoming an increasingly popular joining technique


capable of producing stronger joints than fusion welding, allowing for a reduction
of material thickness and weight of the welded components [Midling et al., 1998;
De Backer, 2014]. Conventional, custom-made FSW machines of gantry type are
built to support the large forces inherent in the FSW process. The high stiffness
required has resulted in expensive and inflexible machinery, which has limited
the number of feasible applications of FSW as well as the adaptation of FSW as a
joining technique [De Backer, 2014]. Recently, the use of robotic manipulators in
FSW applications has gained significant interest due to the lower cost compared
to conventional FSW machinery as well as the much increased flexibility of an
articulated manipulator [De Backer, 2014; Guillo and Dubourg, 2016]. The down-
sides of the use of robots include the comparatively low stiffness, which causes
significant deflections during welding, with a lower quality weld as result.
A typical approach adopted to reduce the uncertainty introduced by deflec-
tions is stiffness/compliance modeling [Guillo and Dubourg, 2016; Lehmann et al.,
2013; De Backer and Bolmsjö, 2014]. This amounts to finding models of the joint
deflections ∆q or of the Cartesian deflections ∆x on one of the forms

∆q = C j (τ) (12.1)
∆X = CC (f) (12.2)

where τ and f are the joint torques and external forces, respectively, X is some
notion of Cartesian pose and C denotes some, possibly nonlinear, compliance
function. The corresponding inverse relations are typically referred to as stiffness
models. Robotic compliance modeling has been investigated by many authors,
where the most straightforward approach is based on linear models obtained by
measuring the deflections under application of known external loads. To avoid the
dependence on expensive equipment capable of accurately measuring the deflec-
tions, techniques such as the clamping method have been proposed [Bennett et al.,
1992; Lehmann et al., 2013; Sörnmo, 2015; Olofsson, 2015] for the identification of

137
Chapter 12. Introduction—Friction Stir Welding

models on the form (12.1). This approach makes the assumption that deflections
only occur in the joints, in the direction of movement. Hence, deflections occur-
ring in the links or in the joints orthogonally to the movement cause model errors,
limiting the resulting accuracy of the model obtained [Sörnmo, 2015]. In [Guillo
and Dubourg, 2016], the use of arm-side encoders was investigated to allow for
direct measurement of the joint deflections. As of today, arm-side encoders are
not available in the vast majority of robots, and the modification required to install
them is yet another obstacle to the adaptation of robotic FSW. The method further
suffers from the lack of modeling of link- and orthogonal joint deflections.
Cartesian models like (12.2) have been investigated in the FSW context by [De
Backer, 2014; Guillo and Dubourg, 2016; Abele et al., 2008]. The proposed Cartesian
deflection models are local in nature and not valid globally. This requires separate
models to be estimated throughout the workspace, which is time consuming and
limits the flexibility of the setup.
Although the use of compliance models leads to a reduction of the uncertainty
introduced by external forces, it is difficult to obtain compliance models accurate
enough throughout the entire workspace. This fact serves as the motivation for
complementing the compliance modeling with sensor-based feedback. Sensor-
based feedback is standard in conventional robotic arc and spot welding, where
the crucial task of the feedback action is to align the weld torch with the seam
along the transversal direction, with the major uncertainty being the placement
of the work pieces. During FSW, however, the uncertainties in the robot pose are
significant, while the tilt angle of the tool in addition to its position is of great
importance [De Backer et al., 2012]. This requires a state estimator capable of
estimating accurately at least four DOF, with slightly lower performance required
in the tool rotation axis and the translation along the weld seam. Conventional
seam-tracking sensors are capable of measuring 1-3 DOF only [Nayak and Ray,
2013; Gao et al., 2012], limiting the information available to a state estimator and
thus maintaining the need for, e.g., compliance modeling.
Motivated by the concerns raised above, we embark on developing calibra-
tion methods, a state estimator and a framework for simulation of robotic seam
tracking under the influence of large external process forces. Chapter 13 develops
methods for calibration of 6 DOF force/torque sensors and a seam-tracking laser
sensor, while a particle-filter based state-estimator and simulation framework is
developed in Chap. 14. Notation, coordinate frames and variables used in this part
of the thesis are listed in Table 12.1.

12.1 Kinematics

Kinematics refer to the motion of objects without concerns for any forces involved
(as opposed to dynamics). This section will briefly introduce some notation and
concepts within kinematics that will become important in subsequent chapters,
and simultaneously introduce the necessary notation. Most of the discussion will
focus around the representation and manipulation of coordinate frames.

138
12.1 Kinematics

Rotation and transformation matrices


SO(3) denotes the special orthogonal group 3 and is the set of all 3D orientation
matrices. Matrices in SO(3) have determinant one and orthogonal columns and
rows with norm one, R TR = RR T = I [Spong et al., 2006].
SE (3) denotes the special Euclidean group 3 and is the set of all 3D rigid body
transformation matrices [Spong et al., 2006]. Transformation matrices in SE (3)
have the form · D
tCD
¸
D RC
TC ∈ SE (3) = (12.3)
01×3 1
where the rotation matrix R ∈ SO(3) and the translation vector t ∈ R3 . TCD denotes
the rigid-body transformation from frame D to frame C , such that pC = TCD pD for
¤T £ ¤T
a point pD = p TD 1 = x y z 1 D and vC = TCD vD for a vector on the form
£
£ ¤T
vD = x y z 0 D . The inverse of a transformation matrix is given by
· T
−R Tt
· ¸ ¸
R t R
T= ∈ SE (3), T −1 = (12.4)
0 1 0 1
We say that p is the homogeneous form of p, which allows Rp + t to be repre-
sented compactly as T p.

Table 12.1 Definition and description of coordinate frames and variables.

RB Robot base frame.


T Tool frame, attached to the (TCP).
S Sensor frame, specified according to Fig. 14.1.
q ∈ Rn Joint Coordinate
X ∈ SE (3) Tool pose (State)
τ ∈ Rn /R3 Joint torque or external torque
f ∈ R6 External force/torque wrench
m ∈ R2 Laser measurement in S
ma ∈ R1 Laser angle measurement in S
e ∈ R2 Measurement error
T AB ∈ SE (3) Transformation matrix from B to A
F k (q) ∈ SE (3) Robot forward kinematics at pos. q
J (q) ∈ R6×n Manipulator Jacobian at pos. q
〈s〉 ∈so(3) Skew-symmetric matrix with parameters s ∈ R3
â Estimate of variable a
a+ a at the next sample instant
ā Reference for variable a
ai : j Elements i , i + 1, ..., j of a
T∨ ∈ R6 The twist coordinate representation of T

For additional details on kinematics and transforms in the robotics context,


the reader is referred to, e.g., [Murray et al., 1994; Spong et al., 2006].

139
13
Calibration

The field of robotics offers a wide range of calibration problems, the solutions to
which are oftentimes crucial for the accuracy or success of a robotic application.
For our purposes, we will loosely define calibration as the act of finding a trans-
formation of a measurement from the space it is measured in, to another space
in which it is more useful to us. For example, a sensor measuring a quantity in its
intrinsic coordinate system must be calibrated for us to know how to interpret the
measurements in an external coordinate system, such as that of the robot or its
tool. This chapter will describe a number of calibration algorithms developed in
order to make use of various sensors in the seam-tracking application, but their
usefulness outside this application will also be highlighted.
The calibration problems considered in this chapter are all geometric in nature
and related to the kinematics of a mechanical system. In kinematic calibration,
we estimate the kinematic parameters of a structure. Oftentimes, and in all cases
considered here, we have a well defined kinematic equation defining the kinematic
structure, but this equation contains unknown parameters. To calibrate those
parameters, one has to devise a method to collect data that allow us to solve for
the parameters of interest. Apart from solving the equations given the collected
data, the data collection itself is what sets different calibration methods apart
from each other, and from system identification in general. Many methods require
special-purpose equipment, either for precise perturbations of the system, or for
precise measurements of the system. Use of special-purpose equipment limits the
availability of a calibration method. We therefore propose calibration algorithms
that solve for the desired kinematic parameters without any special-purpose
equipment, making them widely applicable.

13.1 Force/Torque Sensor Calibration

A 6 DOF force/torque sensor is a device capable of measuring the complete wrench


of forces and torques applied to the sensor. They are commonly mounted on the
tool flange of a manipulator to endow it with force/torque sensing capabilities,
useful for, e.g., accurate control in contact situations.

140
13.1 Force/Torque Sensor Calibration

In order to make use of a force/torque sensor, the rotation matrix R TSF between
the tool flange and the sensor coordinate systems, the mass m held by the force
sensor at rest, and the translational vector r ∈ R3 from the sensor origin to the
center of mass are required. Methods from the literature typically involve fixing
the force/torque sensor in a jig and applying known forces/torques to the sensor
[Song et al., 2007; Chen et al., 2015]. In the following, we will develop and analyze
a calibration method that only requires movement of the sensor attached to the
tool flange in free air, making it very simple to use.

Method
The relevant force and torque equations are given by

f S = R ST F R TRBF (mg ) (13.1)


τS = R S 〈r 〉 R
TF RB
TF (mg ) (13.2)

where g is the gravity vector given in the robot base-frame and f , τ are the force
and torque measurements, respectively, such that f = [ f T τT]T. At first glance, this
is a hard problem to solve. The equation for the force relation does not appear to
allow us to solve for both m and R ST F , the constraint R ∈ SO(3) is difficult to handle,
and the equation for the torque contains the nonlinear term R ST F 〈r 〉. Fortunately,
however, the problem can be separated into two steps, and the constraint R ST F ∈
SO(3) will allow us to distinguish R ST F from m.
A naive approach to the stated calibration problem is to formulate an optimiza-
tion problem where R is parameterized using, e.g., Euler angles. A benefit of this
approach is its automatic and implicit handling of the constraint R ∈ SO(3). One
then minimizes the residuals of (13.1) and (13.2) with respect to all parameters us-
ing a local optimization method. This approach is, however, prone to convergence
to local optima and is hard to conduct in practice.
Instead, we start by noting that multiplying a matrix with a scalar only affects
its singular values, but not its singular vectors, mR = U (mS)V T. Thus, if we solve a
convex relaxation to the problem and estimate the product mR, we can recover
R by projecting mR onto SO(3) using the procedure in Sec. 6.1. Given R we can
easily recover m. Equation (13.1) is linear in mR and the minimization step can
readily be conducted using the least-squares approach of Sec. 6.2. To facilitate this
estimation, we write (13.1) on the equivalent form

( f ST ⊗ I 3 ) vec(mR) = R TRBF g (13.3)

where vec(mR) ∈ R9 is a vector of parameter to be estimated.


Once R ST F and m are estimated using measured forces only, we can estimate r
using the torque relation by noting that

τS = R ST F 〈r 〉 R TRBF (mg ) (13.4)


R TSF τS = 〈R TRBF (mg )〉 r (13.5)

141
Chapter 13. Calibration

where the second equation is linear in the unknown parameter-vector r .


When solving a relaxed problem, there is in general no guarantee that a good
solution to the original problem will be found. To verify that relaxing the problem
does not introduce any numerical issues or problems in the presence of noise,
etc., we introduce a second algorithm for finding R ST F . If the mass m is known in
advance, the problem can be reformulated using the Cayley transform [Tsiotras
et al., 1997] and a technique similar to the attitude estimation algorithm found
in [Mortari et al., 2007] can be used to solve for the rotation matrix, without
constraints or relaxations.
The Cayley transform of a matrix R ∈ SO(3) is given by

R = (I + Σ)−1 (I − Σ) = (I − Σ)(I + Σ)−1 (13.6)

where Σ = 〈s〉 is a skew-symmetric matrix of the Cayley-Gibbs-Rodrigues parame-


ters s ∈ R3 [Tsiotras et al., 1997]. Applying the Cayley transform to the force relation
yields

f S = R ST F R TRBF (mg ) (13.7)


−1
f S = (I + Σ) (I − Σ)R RB
TF (mg ) (13.8)
(I + Σ) f S = (I − Σ)R (mg )
RB
TF (13.9)
Σ f S + R TRBF (mg ) = − f S − R TRBF (mg )
¡ ¢ ¡ ¢
(13.10)
h i
Σ = 〈s〉 , 〈a〉 b = 〈b〉 a
〈 f S + R TRBF (mg )〉 s = − f S − R TRBF (mg )
¡ ¢
(13.11)

which is a linear equation in the parameters s that can be solved using the standard
least-squares procedure. The least-squares solution to this problem was, however,
found during experiments to be very sensitive to measurement noise in f S . This is
due to the fact that f S appears not only in the dependent variable on the right-hand
side, but also in the regressor 〈 f S + R TRBF (mg )〉. This is thus an errors-in-variables
problem for which the solution is given by the total least-squares procedure [Golub
and Van Loan, 2012], which we make use of in the following evaluation.

Numerical evaluation
The two algorithms, the relaxation-based and the Cayley-transform based, were
compared on the problem of finding R ST F by simulating a force-calibration scenario
where a random R ST F and 100 random poses R RB TF
were generated. In one simulation,
we let the first algorithm find R ST F and m with an error in the initial estimate of m
by a factor of 2, while the Cayley algorithm was given the correct mass. The results,
depicted in the left panel of Fig. 13.1 indicate that the two methods performed on
par with each other. The figure shows the error in the estimated rotation matrix as a
function of the added measurement noise in f S . In a second experiment, depicted
in the right panel of Fig. 13.1, we started both algorithms with a mass estimate
with 10 % error. Consequently, the Cayley algorithm performed significantly worse

142
13.2 Laser-Scanner Calibration

10−1 10−1

||R e ||2F 10−5 10−5


Relaxation Relaxation
10−9 Cayley 10−9 Cayley

10−4 10−3 10−2 10−1 100 101 10−4 10−3 10−2 10−1 100 101
σ [N] σ [N]

Figure 13.1 The error in the estimated rotation matrix is shown as a function
of the added measurement noise for two force-calibration methods, relaxation
based and Cayley-transform based. On the left, the relaxation-based method was
started with an initial mass estimate m 0 = 2m whereas the Cayley-transform based
method was given the correct mass. On the right, both algorithms were given
m 0 = 1.1m

for low noise levels, while the difference was negligible when the measurement
noise was large enough to dominate the final result.
The experiment showed that not only does the relaxation-based method per-
form on par with the unconstrained Cayley-transform based method, it also allows
us to estimate the mass, reducing the room for potential errors due to an error in
the mass estimate. It is thus safe to conclude that the relaxation-based algorithm
is superior to the Cayley algorithm in all situations. Implementations of both
algorithms are provided in [Robotlib.jl, B.C., 2015].

13.2 Laser-Scanner Calibration

Laser scanners have been widely used for many years in the field of robotics.
A large group of laser scanners, such as 2D laser range finders and laser stripe
profilers, provide accurate distance measurements confined to a plane. By moving
either the scanner or the scanned object, a 2D laser scanner can be used to build a
3D representation of an object or the environment. To this purpose, laser scanners
are commonly mounted on mobile platforms, aerial drones or robots.
This section considers the calibration of such a sensor, and as a motivating
example, we consider a wrist-mounted laser scanner for robotic 3D scanning
and weld-seam-tracking applications. The method does, however, work equally
well in any situation where a similar sensor is mounted on a drone or mobile
platform, as long as the location of the platform is known in relation to a fixed
coordinate system. We will use the term robot to refer to any such system, and the
robot coordinate system to refer to either the robot base coordinate system, or the
coordinate system of a tracking device measuring the location of the tool flange.
To relate the measurements of the scanner to the robot coordinate system, the
rigid transformation between the scanner coordinate system and the tool flange

143
Chapter 13. Calibration

Figure 13.2 ABB IRB140 used for experimental verification. The sensor (blue)
is mounted on the wrist and is plane of the laser light is intersecting a white flat
surface.

of the robot, TTSF , is needed.


A naive approach to the stated calibration problem is to make use of the 4/5/6-
point tool calibration routines commonly found in industrial robot systems. This
amounts to several times positioning the origin of the sensor at a fixed point in the
workspace, with varying orientation. These methods suffer from the fact that the
origin and the axes of the sensor coordinate system are invisible to the operator,
which must rely on visual feedback from both the workspace and a computer
monitor simultaneously. Further, the accuracy of these methods is very much
dependent on the skill of the operator and data collection for even a small amount
of points is very tedious.
Sensor manufacturers use special-purpose calibration objects and jigs that are
machined to high precision [Meta Vision Systems, 2014; SICK IVP, 2011], a method
largely unavailable to a user.
Other well known algorithms for eye-to-hand calibration include [Daniilidis,
1999; Tsai and Lenz, 1989; Horaud and Dornaika, 1995], which are all adopted
for calibration of a wrist-mounted camera using a calibration pattern such as a
checkerboard pattern. A laser scanner is fundamentally different in the informa-
tion it captures and can not determine the pose of a checkerboard pattern. This
must be considered by the calibration algorithm employed.
The proposed method will make use of data collected from planar surfaces
as a very simple form of calibration object. Kinematic calibration of robotic ma-
nipulator using planar constraints in various formats has been considered before.
Zhuang et al. (1999), propose a method that begins with an initial estimate of
the desired parameters, which is improved with a non-linear optimization algo-
rithm. They also discuss observability issues related to identification using planar
constraints. The method focuses on improving parameter estimates in the kine-

144
13.2 Laser-Scanner Calibration

matic model of the robot, and convergence results are therefore only presented
for initial guesses very close to their true values (0.01mm/0.01°). Initial estimates
this accurate are very hard to obtain unless a very precise CAD model of all parts
involved is available.
Zhang and Pless (2004) found the transformation between a camera and a
laser range finder using a checkerboard pattern and used computer vision to esti-
mate the location of the calibration planes. With the equations of the calibration
planes known, the desired transformation matrix was obtained from a set of linear
equations.
Planar constraints have also been considered by Ikits and Hollerbach (1997)
who employed a non-linear optimization technique to estimate the kinematic
parameters. The method requires careful definition of the planes and can not
handle arbitrary frame assignments.
A wrist mounted sensor can be seen as an extension of the kinematic chain of
the robot. Initial guesses can be poor, especially if based on visual estimates. This
section presents a method based solely on solving linear sets of equations. The
method accepts a very crude initial estimate of the desired kinematic parameters,
which is refined in an iterative procedure. The placement of the calibration planes
is assumed unknown, and their locations are found automatically together with
the desired transformation matrix. The exposition will be focused on sensors
measuring distances in a plane, but extensions to the proposed method to 3D
laser scanners such as LIDARs and 1D point lasers will be provided towards the
end.

Preliminaries
Throughout this section, the kinematic notation presented in Sec. 12.1 will be
used. The normal of a plane from which measurement point i is taken, given in
frame A, will be denoted n iA .
A plane is completely specified by
° °
nTp = d , °n °2 = 1 (13.12)

where d is the orthogonal distance from the origin to the plane, n the plane normal
and p is any point on the plane.

Laser-scanner characteristics
The laser scanner consists of an optical sensor and a laser source emitting light in a
plane that intersects a physical plane in a line, see Fig. 13.2. The three dimensional
location of a point along the projected laser line may be calculated by triangula-
tion, based on the known geometry between the camera and the laser emitter. A
single measurement from the laser scanner typically yields the coordinates of a
large number of points in the laser plane. Alternatively, a measurement consists of
a single point and the angle of the surface, which is easily converted to two points
by sampling a second point on the line through the given point. Comments on
statistical considerations for this sampling are provided in Sec. 13.A.

145
Chapter 13. Calibration

Initial estimate of TTS F


Measured data End

Find plane equations Find transformation


n Tp = d matrix TTS F

Figure 13.3 Illustration of the two-step, iterative method.

Method
The objective of the calibration is to find the transformation matrix TTSF ∈ SE (3)
that relates the measurements of the laser scanner to the coordinate frame of the
tool flange of the robot.
The kinematic chain of a robot will here consist of the transformation between
TF
the robot base frame and the tool flange TRB i , given by the forward kinematics1 in
pose i , and the transformation between the tool flange and the sensor TTSF . The
sensor, in turn, projects laser light onto a plane with unknown equation. A point
observed by the sensor can be translated to the robot base frame by
TF
pRBi = TRB i TTSF pS i (13.13)
where i denotes the index of the pose.
To find TTSF , an iterative two-step method is proposed, which starts with an
initial guess of the matrix. In each iteration, the equations for the planes are found
using eigendecomposition, whereafter a set of linear equations is solved for an
improved estimate of the desired transformation matrix. The scheme, illustrated
in Fig. 13.3, is iterated until convergence.
Finding the calibration planes Consider initially a set of measurements, PS =
[p 1 , ..., p Np ]S , gathered from a single plane. The normal can be found by Principal
Component Analysis (PCA), which amounts to performing an eigendecomposi-
tion of the covariance matrix C of the points [Pearson, 1901]. The eigenvector
corresponding to the smallest eigenvalue of C will be the desired estimate of the
plane normal.2 To this purpose, all points are transformed to a common frame,
the robot base frame, using (13.13) and the current estimate of TTSF .
To fully specify the plane equation, the center of mass µ of PRB is calculated.
The distance d to the plane is then calculated as the length of the projection of
the vector µ onto the plane normal
d = °n̄(n̄Tµ)°
° °
(13.14)
where n̄ is a normal with unit
° °length given by PCA. This distance can be encoded
into the normal by letting °n ° = d . The normal is then simply found by

n = n̄(n̄Tµ) (13.15)
1 Or alternatively, given by an external tracking system.
2 This eigenvalue will correspond to the mean squared distance from the points to the plane.

146
13.2 Laser-Scanner Calibration

This procedure is repeated for all measured calibration planes and results in a
set of normals that will be used to find the optimal TTSF .
Solving for TTSF All measured points should fulfill the equation for the plane they
were obtained from. This means that for a single point p, the following must hold
° °2
n̄Tp = d ⇐⇒ nTp = °n ° (13.16)

A measurement point obtained from the sensor in the considered setup should
thus fulfill the following set of linear equations
TF
pRBi = TRB i TTSF pS i (13.17)
° °2
nTp RBi = °n ° (13.18)
¤T £ ¤T
pS i = p TS i 1 = x S i
£
y Si zSi 1 (13.19)

where bold-face notation denotes a point expressed in homogeneous coordinates


according to (13.19). Without loss of generality, the points p S can be assumed
to lie in the plane z S = 0. As a result, the third column in TTSF can not be solved
for directly. The constraints on R TS F to belong to SO(3), will however allow for
reconstruction of the third column in R TS F from the first two columns.
Let T̃ denote the remainder of TTSF after removing the third column and the
last row. The linear equations (13.17)-(13.18) can be expressed as
° °
A i k = °n i ° − q i (13.20)

where k = vec(T̃ ) ∈ R9×1 consists of the stacked columns of T̃ and


TF TF TF ¤
∈R1×9
£
A i = nTi R RB i x S i nTi R RB i y S i nTi R RB i (13.21)
T Fi
T
qi = ni p RB ∈R (13.22)

Since (13.17) and (13.18) are linear in the parameters, all elements of TTSF can be
extracted into k, and A i can be obtained by performing the matrix multiplications
in (13.17) and (13.18) and identifying terms containing any of the elements of k.
Terms which do not include any parameter to be identified are associated with q i .
The final expressions for A i and q i given above can then be obtained by identifying
matrix-multiplication structures among the elements of A i and q i .
Equation (13.20) does not have a unique solution. A set of at least nine points
gathered from at least three planes is required in order to obtain a unique solution
to the vector k. This can be obtained by stacking the entries in (13.20) according
to  ° °

A1
 °n 1 ° − q 1 
 A2   ° °
 °n 2 ° − q 2 

Ak = Y, A =  .  , Y =  (13.23)
 
..

 ..  
 .


A Np
° °
°n Np ° − q N
p

147
Chapter 13. Calibration

The resulting problem is linear in the parameters, and the vector k ∗ of param-
eters that minimizes
k ∗ = arg min °Ak − Y°2
° °
(13.24)
k
can then be obtained from the least-squares procedure of Sec. 6.2. We make a note
at this point that while solving (13.24) is the most straightforward way of obtaining
an estimate, the problem contains errors in both A and Y and is therefore of errors-
in-variables type and a candidate for the total least-squares procedure. We discuss
the two solution methods further in Sec. 13.B.
Since k only contains the first two columns of R TS F , the third column is formed
as
R3 = R1 × R2 (13.25)
where × denotes the cross product between R 1 and R 2 , which produces a vector
orthogonal to both R 1 and R 2 . When solving (13.24) we are actually solving a
problem with the same°relaxation
° ° °as the one employed in Sec. 13.1, since the
constraints R 1TR 2 = 0, °R 1 ° = °R 2 ° = 1 are not enforced. The resulting R TS F will
thus in general not belong to SO(3). The closest valid rotation matrix can be found
by projection onto SO(3) according to the procedure outlined in Sec. 6.1.
This projection will change the corresponding entries in k ∗ and the result-
ing coefficients will no longer solve the problem (13.24). A second optimization
problem can thus be formed to re-estimate the translational part of k, given the
orthogonalized rotational part. Let k be decomposed according to
k = R̃ ∗ p R̃ ∗ ∈ R1×6 , p ∈ R1×3
£ ¤
(13.26)
and denote by An:k columns n to k of A. The optimal translational vector, given
the orthonormal rotation matrix, is found by solving the following optimization
problem
Ỹ = Y − A1:6 R̃ ∗ (13.27)
p ∗ = arg min °A7:9 p − Ỹ°
° °
2
(13.28)
p

Final refinement As noted by Zhang and Pless (2004), solving an optimization


problem like (13.24) is equivalent to minimizing the algebraic distance between
the matrix, parameterized by k, and the data. There is no direct minimization of
the distances from measurements to planes involved. Given the result from the
above procedure as initial guess, any suitable, iterative minimization strategy can
be employed to further minimize a cost function on the form
Np
J (TTSF ) = (nTi p RBi (TTSF ) − °n i °)2
X ° °
(13.29)
i =1
which is the sum of squared distances from the measurement points to the planes.
Here, p RBi is seen as a function of TTSF according to (13.17).
The simulation experiments presented in the next section did not show any sig-
nificant improvement to the estimated transformation matrix from this additional
refinement, and we thus refrain from exploring this topic further.

148
13.2 Laser-Scanner Calibration

° °
°T − T̂ ° RMS distance from points to plane [m]
F
100 100

10−1 10−1

10−2 10−2

10−3 10−3

10−4 10−4
0 10 20 30 0 10 20 30
Number of iterations Number of iterations

Figure 13.4 Convergence results for simulated data during 100 realizations, each
with 10 poses sampled from each of 3 planes. Measurement noise level σ = 0.5mm
is marked with a dashed line. On the left, the Frobenius norm between the true
matrix and the estimated, on the right, the RMS distance between measurement
points and the estimated plane.

Results
The performance of the method was initially studied in simulations, which allows
for a comparison between the obtained estimate and the ground truth. The sim-
ulation study is followed by experimental verification using a real laser scanner
mounted on the wrist of an industrial manipulator.

Simulations To study the convergence properties of the proposed approach, a


simulation study was conducted. A randomly generated TTSF was used together
with measurements from a set of random poses. The initial guess of TTSF was
chosen as the true matrix corrupted with an error distributed uniformly, with
x, y, z ∼ U (−200mm, 200mm) and roll, pitch, yaw∼ U (−30°, 30°). The measure-
ments were obtained from three orthogonal planes and corrupted with Gaussian
white noise with standard deviation σ = 0.5mm.
Figure 13.4 illustrates the convergence for 100 realizations of the described
procedure. Most realizations converged to the true matrix within 15 iterations.
Analysis shows that careful selection of poses results in faster convergence. The
random pose-selection strategy employed in the simulation study suffers the risk
of co-linearity between measurement poses, which slows down convergence.
Figure 13.5 illustrates the final results in terms of the accuracy in both the
translational and rotational part of the estimate of TTSF .

149
Chapter 13. Calibration

Distance error [m] Rotation error [degrees]


0
10
101
10−1

10−2 100

10−3 10−1

10−4 10−2

10−5 10−3
Before After Before After
Figure 13.5 Errors in TTSF before and after calibration for 100 realizations with 30
calibration iterations. For each realization, 10 poses were sampled from each of 3
planes. The measurement noise level σ = 0.5mm is marked with a dashed line. On
the left, the translational error between the true matrix and the estimated, on the
right, the rotational error.

Figure 13.6 A visualization of the reconstructed planes used for data collection.
The planes were placed so as to be close to orthogonal to each other, surrounding
the robot. Figure 13.2 presents a photo of the setup.

Experiments Experimental verification of the proposed method was conducted


with an ABB IRB140 robot equipped with a Meta SLS 25 [Meta Vision Systems,
2014] weld seam-tracking sensor, see Fig. 13.2. A flat whiteboard was placed on
different locations surrounding the robot, see Fig. 13.6, and several measurements
of each plane were recorded.

150
13.2 Laser-Scanner Calibration

RMS distance from points to plane [m]


10−2
Mean
Individual planes

10−3

10−4
0 10 20 30 40 50
Number of iterations

Figure 13.7 Convergence results for experimental data gathered from 5 planes.
The RMS distance between measurement points and the estimated planes are
shown together with the mean over all planes.

The algorithm was started with the initial guess


 
1 0 0 0
S
 0 1 0 0.15 
TT F = 
  (13.30)
0 0 1 0.15 
0 0 0 1

and returned the final estimate


 
0.9620 0.2710 0.0010 0.0850
−0.2710 0.9620 −0.0240 0.1170 
TTSF = 

 (13.31)
 −0.0070 0.0230 1.0000 0.1610 
0 0 0 1

The translational part of the initial guess was obtained by estimating the distance
from the tool flange to the origin of the laser scanner, whereas the rotation matrix
was obtained by estimating the projection of the coordinate axes of the scanner
onto the axes of the tool flange.3
The convergence behavior, illustrated in Fig. 13.7, is similar to that in the
simulation, and the final error was on the same level as the noise in the sensor
data. A histogram of the final errors is shown in Fig. 13.8, indicating a symmetric
but heavy-tailed distribution. We remark that if a figure like Fig. 13.8 indicate the
presence of outliers or errors with a highly non-Gaussian distribution, one could
consider alternative distance metrics in (13.24), such as the L 1 norm, for a more
robust performance.
3 The fact that the initial estimate of the rotation matrix was the identity matrix is a coincidence.

151
Chapter 13. Calibration

Histogram of errors

0.6

Normalized frequency 0.4

0.2

0
−1 −0.5 0 0.5 1
Error [m]
·10−3

Figure 13.8 Histogram of errors Y − Aw ∗ for the experimental data.

Discussion
The calibration method described is highly practically motivated. Calibration is
often tedious and an annoyance to the operator of any technical equipment. The
method described tries to mitigate this problem by making use of data that is easy
to acquire. In its simplest form, the algorithm requires some minor bookkeeping in
order to associate the recorded points with the plane they were collected from. An
extension to the algorithm that would make it even more accessible is automatic
identification of planes using a RANSAC [Fischler and Bolles, 1981] or clustering
procedure.
While the method was shown to be robust against large initial errors in the
estimated transform, the effect of a nonzero curvature of the planes from which
the data is gathered remains to be investigated.
In practice, physical limitations may limit the poses from which data can be
gathered, i.e., neither the sensor nor the robot must collide with the planes. The
failure of the collected poses to adequately span the relevant space may make
the algorithm sensitive to errors in certain directions in the space of estimated
parameters. Fortunately, however, directions that are difficult to span when gath-
ering calibration data are also unlikely to be useful when the calibrated sensor is
deployed. The found transformation will consequently be accurate for relevant
poses that were present in the calibration data.

Conclusions
This section has presented a robust, iterative method composed of linear sub
problems for the kinematic calibration of a 2D laser sensor. Large uncertainties in
the initial estimates are handled and the estimation error converges to below the
level of the measurement noise. The calibration routine can be used for any type of

152
13.A Point Sampling

laser sensor that measures distances in a plane, as long as the forward kinematics
is known, such as when the sensor is mounted on the flange of an industrial robot
or on a mobile platform or aerial drone, tracked by an external tracking system.
Extensions to other kinds of laser sensors are provided in Sections 13.C and 13.D.
An implementation of the proposed method is made available in [Robotlib.jl,
B.C., 2015].

Appendix A. Point Sampling

Some laser sensors for weld-seam tracking do not provide all the measured points,
but instead provide the location of the weld seam and the angle of the surface.
In this situation one must sample a second point along the line implied by the
provided measurements in order for the proposed algorithm to work. The sam-
pling of the second point, p 2 , is straight forward, but entails a trade-off between
noise sensitivity and numerical accuracy. Given measurements of a point and an
angle, p m , αm , corrupted with measurement noise e p , e α , respectively, p 2 can be
calculated as

p 2 = p m + γl (13.32)
cos αm
· ¸
l= (13.33)
sin αm
αm = α + e α , e α ∼ N (0, σ2α ) (13.34)
pm = p + e p , e p ∼ N (0, Σp ) (13.35)

where γ is a constant determining the offset between p m and p 2 . If this offset is


very small, the condition number of A will be large and the estimation accuracy
will be impacted. If γ is chosen very large, the variance in p m and p 2 will be very
different. The variance of p 2 is given by

E p 2 p T2 = V p 2 = V p m + γl = V p m + γ2 E l l T
© ª © ª © ª © ª © ª
(13.36)
2 2¯
≈ Σp + γ σα ∇α l ∇α l
¯ ¯

(13.37)
2
sin αm − sin αm cos αm ¯¯
¯· ¸¯
= Σp + γ2 σ2α ¯¯
¯
(13.38)
− sin αm cos αm cos2 αm ¯

where (13.36) holds if E e α e p = 0. We thus see that the variance in the sampled
© ª

point will be strictly larger than the variance in the measured point, and we should
ideally trust this second point less when we form the estimate of both the plane
equations and TTSF . Since the original problem is of errors-in-variables type, the
optimal solution to the problem with unequal variances is given by the weighted
total least-squares procedure [Fang, 2013]. Experiments outlined ° in°the Sec. 13.B
indicated that as long as γ was chosen on the same scale as °p S °, taking this
uncertainty into account had no effect on the fixed point to which the algorithm
converged.

153
Chapter 13. Calibration

Appendix B. Least-Squares vs. Total Least-Squares


In the main text, we only made a brief comment on the potential methods of
TF
estimating TRB given the matrices A and Y. While the least-squares procedure
corresponds to the ML estimate for a model where only the right-hand side Y is
corrupted with Gaussian noise, this does not hold true when also A is corrupted
with noise. In the present context we have

° ° TF
A i k = °n i ° − nTi p RB i (13.39)
£ T T Fi TF T Fi ¤
A i = n i R RB x S i nTi R RB i y S i nTi R RB (13.40)

where significant measurement errors appear in x S and y S . The variables n i are


estimated using the current estimate of TTSF and will thus be corrupted by both
estimation errors and measurement errors. Unfortunately, estimating the vari-
ance of this error is non-trivial, and correct application of the total least-squares
procedure is thus hard. We propose two alternative strategies:

Alt. 1 An approximate strategy is obtained if we assume that the variance of


x S and y S is negligible in comparison to the variance in n i . We further make the
simplifying assumption that n i is normally distributed with the covariance matrix
Σni . To obtain an estimate of Σni one can, e.g., perform statistical bootstrapping.
This approach requires as many bootstrapping procedures as the number of
planes data is sampled from.

Alt. 2 Another strategy is obtained by estimating the complete covariance matrix


¤T
Σ Ay of A i y i using statistical bootstrapping [Murphy, 2012]. This approach
£

makes less assumptions than the first approach. While conceptually simple, this
approach requires N p bootstrapping procedures and is thus more computation-
ally expensive. One can, however, run the complete algorithm until convergence
using estimates based on regular LS, and first after convergence switch to the
more accurate WTLS procedure to improve accuracy.

Evaluation
The first strategy was implemented and tested on the simulated data from
Sec. 13.2. We used 500 bootstrap samples and 15 iterations in the WTLS algo-
rithm. No appreciable difference in fixed points was detected between solving for
TTSF using the standard LS procedure and the WTLS procedure. We thus concluded
that the considerable additional complexity involved in estimating the covariance
matrices of the data and solving the optimization problem using the more compli-
cated WTLS algorithm is not warranted. The evaluation can be reproduced using
code available in [Robotlib.jl, B.C., 2015] and the WTLS algorithm implementation
is made available in [TotalLeastSquares.jl, B.C., 2018].

154
13.C Calibration of Point Lasers

Appendix C. Calibration of Point Lasers


In this section, we briefly consider the changes required to the proposed algorithm
for sensors measuring the distance to a single point only.
We can for a point laser only determine the equation for the laser line, as
opposed to before when we could find the laser light plane and thus the entire
calibration matrix in SE (3).
Once more, data is collected from three planar surfaces. The required modifi-
cations to the proposed algorithm are listed below

Eq. (13.19) We now assume, without loss of generality, that the laser point lies
along the line y S = z S = 0. As a result, the second and third columns of TTSF
can not be solved for. These two vectors can be set to zero.

Eq. (13.20) The truncated vector k ∈ R6 will now consist of the first column of R TSF
and the translation vector p TS F .
TF
Eq. (13.21) The three middle elements of A i , corresponding to nTi R RB i y S i are
removed.

Orthogonalization The orthogonalization procedure reduces to the normaliza-


tion of the first three elements of k to have norm one.

The rest of the algorithm proceeds according to the original formulation.

Appendix D. Calibration of 3D Lasers and LIDARs

The proposed algorithm can be utilized also for 3D distance sensors, such as
LIDARs. This type of sensor provides richer information about the environment,
and thus allows a richer set of calibration algorithms to be employed. We make no
claims regarding the effectiveness of the proposed algorithm in this scenario, and
simply note the adjustments needed to employ it.
To use the proposed algorithm, we modify it according to

Eq. (13.19) We no longer assume that the laser line lies in the plane z S = 0. As a
result, the full matrix TTSF can now be solved for immediately, without the
additional step of forming R 3 = R 1 × R 2 .

Eq. (13.20) The vector k ∈ R12 will now consist of all the columns of R TSF and the
translation vector p TS F .

Eq. (13.21) The regressor will now consist of


TF TF TF TF ¤
nTi R RB i ∈ R1×12
£
A i = nTi R RB i x S i nTi R RB i y S i nTi R RB i z S i

To make use of this algorithm in practice, one has to consider the problem of
assigning measured points to the correct planes. This can be hard when employing

155
Chapter 13. Calibration

3D sensors with a large field-of-view. Potential strategies include pre-estimation


of planes in the sensor coordinate system using, e.g., the RANSAC algorithm, or
manual segmentation.

156
14
State Estimation for FSW

In this chapter, we will consider the problem of state estimation in the context
of friction stir welding. The state estimator we develop is able to incorporate
measurements from the class of laser sensors that was considered in the previous
chapter, as well as compliance models and force-sensor measurements. As alluded
to in Chap. 12, the FSW process requires accurate control of the full 6 DOF pose
of the robot relative to the seam. The problem of estimating the pose of the
welding tool during welding is both nonlinear and non-Gaussian, motivating a
state estimator beyond the standard Kalman filter. The following sections will
highlight the unique challenges related to state estimation associated with friction
stir welding and propose a particle-filter based estimator, a method that was
introduced in Sec. 4.2.
We also develop a framework for seam-tracking simulation in Sec. 14.2, where
the relation between sources of error and estimation performance is analyzed.
Through geometric reasoning, we show that some situations call for additional
sensing on top of what is provided by a single laser sensor. The framework is
intended to assist the user in selection of an optimal sensor configuration for
a given seam, where sensor configurations vary in, e.g., the number of sensors
applied and their distance from the tool center point (TCP). The framework also
helps the user tune the state estimator, a problem which is significantly harder for
a particle-filter based estimator compared to a Kalman filter.

14.1 State Estimator


We will start our exposition by choosing a state representation, describing the
probability density functions used in the state transition and measurement update
steps as well as cover some practical implementation details.
A natural state to consider in robotics is the set of joint angles, q, and their
velocities, q̇. In the context of robotic machining in general, and FSW in partic-
ular, deflections in the kinematic structure due to process forces unfortunately
invalidate the joint angles as an accurate description of the robot pose. The pro-
posed state estimator will therefore work in the space SE (3), represented as 4 × 4
transformation matrices, which further allows for a natural inclusion of sensor

157
Chapter 14. State Estimation for FSW

measurements that naturally occupy the same Cartesian space as the tool pose
we are ultimately interested in estimating. Although subject to errors, the sensor
information available from the robot is naturally transformed to SE (3) by means of
the forward kinematics function F k (q). This information will be used to increase
the observability in directions of the state-space where the external sensing leaves
us blind.
The velocities and accelerations present during FSW are typically very low
and we therefore chose to not include velocities in the state to be estimated.
This reduces the state dimension and computational burden significantly, while
maintaining a high estimation performance.

Preliminaries
This section briefly introduces a number of coordinate frames and variables used
in the later description of the method. For a general treatment of coordinate
frames and transformations, see [Murray et al., 1994].
The following text will reference a number of different coordinate frames. We
list them all in Table 12.1 and provide a visual guide to relate them to each other
in Fig. 14.1. Table 12.1 further introduces a number of variables and some special
notation that will be referred to in the following description. All Cartesian-space
variables are given in the robot base frame RB unless otherwise noted.

T
S

Figure 14.1 Coordinate frames (x, y, z) = (red, green, blue). The origin of the
sensor frame S is located in the laser plane at the desired seam intersection point.
The tool frame is denoted by T .

158
14.1 State Estimator

Nominal trajectory
Before we describe the details of the state estimator, we will establish the concept
of the nominal trajectory. In the linear-Gaussian case, the reference trajectory of a
control system is of no importance while estimating the state, this follows from
(4.24), which shows that the covariance of the state estimate is independent of
the control signal. In the present context, however, we make use of the reference
trajectory for two purposes. 1.) It provides prior information regarding the state
transition. This lets us bypass a lot of modeling complexity by assuming that
the robot controller will do a good job following the velocities specified in the
reference trajectory. We know, however, that the robot controller will follow this
reference with a potentially large position error, due to deflections, etc., outlined
above. 2.) The reference trajectory provides the nominal seam geometry needed
to determine the likelihood of a measurement from the laser sensor given a state
hypothesis X̂ .
To get a suitable representation of the nominal trajectory used to propagate
the particles forward, we can, e.g., perform a simulation of the robot program
using a simulation software, often provided by the robot manufacturer. This pro-
cedure eliminates the need to reverse engineer the robot path planner. During the
simulation, a stream of joint angles is saved, which, when run through the forward
kinematics, returns the nominal trajectory in Cartesian space. The simulation
framework outlined in the following sections provides a number of methods for
generating a nominal trajectory for simulation experiments.
© ªN
The nominal trajectory will consist of a sequence of joint coordinates q̄ t t =1
which, if run through the forward kinematic function, yields a sequence of points
© ªN
specifying the nominal seam geometry p t = F k (q̄ t ) t =1 .

Density functions
To employ a particle filter for state estimation, specification of a statistical model
of the state evolution and the measurement generating process is required, see
Sec. 4.2. This section introduces and motivates the various modeling choices in
the form of probability density functions used in the proposed particle filter.

State transition

p(X + |X , f) (14.1)

The state-transition function is a model of the state at the next time instance, given
the state at the current time instance. We model the mean of the state-transition
density (14.1) using the robot reference trajectory. The reference trajectory is
generated by, e.g., the robot controller or FSW path planner, and consists of a
sequence of poses which the robot should traverse. We assume that a tracking
controller can make corrections to this nominal trajectory, based on the state
estimates from the state estimator.

159
Chapter 14. State Estimation for FSW

We denote by T + the incremental transformation from F k (q̄) to F k (q̄ + ) such


that
F k (q̄ + ) = T + F k (q̄)

The mean of the state transition density is thus given by

µ p(X + |X , f) = T + = F k (q̄ + ) F k (q̄)−1


© ª

The shape of the density should encode the uncertainty in the update of
the robot state from one sample to another. For a robot moving in free space,
this uncertainty is usually very small. Under the influence of varying external
process forces, however, significant uncertainty is introduced [De Backer, 2014;
Sörnmo, 2015; Olofsson, 2015]. Based on this assumption, we may choose a density
where the width is a function of the process force. For example, we may chose a
multivariate Gaussian distribution and let the covariance matrix be a function of
the process force.

Robot measurement update


p(q, f|X ) (14.2)

When the robot is subject to large external forces applied at the tool, the measure-
ments provided by the robot will not provide an accurate estimate of the tool pose
through the forward-kinematics function. If a compliance model C j (τ) is available,
we may use it to reduce the uncertainty induced by kinematic deflections, a topic
explored in detail in [Lehmann et al., 2013; Sörnmo, 2015; Olofsson, 2015]. We
thus choose the following model for the mean of the robot measurement density
(14.2)
µ p(q, f|X ) = F k (q +C j (τ))
© ª
(14.3)

The uncertainty in the robot measurement comes from several sources. The
joint resolvers/encoders are affected by noise, which is well modeled as a Gaus-
sian random variable. When Gaussian errors, e q , in the joint measurements are
propagated through the linearized forward-kinematics function, the covariance
matrix ΣC of the resulting Cartesian-space errors eC is obtained by approximating
e q = dq as

q m = q + e q = q + dq
e q ∼ N (0, Σq )
eC ∼ N (0, J Σq J T)

where q m is the measured value. The corresponding Cartesian-space covariance


matrix is given by

160
14.1 State Estimator

d 〈F k (q)〉∨
eC = dq = J dq = J e q
dq
n o n o
ΣC = E eC eCT = E J e q e Tq J T = J E e q e Tq J T
© ª

where the approximation J (q + e q ) ≈ J (q) has been made. The twist coordinate
representation 〈F k (q)〉∨ is obtained by taking the logarithm of the transformation
matrix log(F k (q)), which produces a twist ξ ∈ se(3), and the operation ξ∨ ∈ R6
returns the twist coordinates [Murray et al., 1994].1
Except for the measurement noise e q , the errors in the robot measurement
update density are not independent between samples. The error in both the for-
ward kinematics and the compliance model is configuration dependent. Since
the velocity of the robot is bounded, the configuration will change slowly and
configuration-dependent errors will thus be highly correlated in time. The stan-
dard derivation of the particle filter relies on the assumption that the measurement
errors constitute a sequence of independent, identically distributed (i.i.d.) random
variables. Independent measurement errors can be averaged between samples
to obtain a more accurate estimate, which is not possible with correlated errors,
where several consecutive measurements all suffer from the same error.
Time-correlated errors are in general hard to handle in the particle filtering
framework and no systematic way to cope with this problem has been found.
One potential approach is to incorporate the correlated error as a state to be
estimated [Evensen, 2003; Åström and Wittenmark, 2013a]. This is feasible only if
there exist a way to differentiate between the different sources of error, something
which in the present context would require additional sensing. State augmen-
tation further doubles the state dimension, with a corresponding increase in
computational complexity.
Since only a combination of the tracking error, the kinematic error and the
dynamic error is measurable, we propose to model the time-correlated uncer-
tainties as a uniform random variable with a width d chosen as the maximum
expected error. When performing the measurement update with the densities of
several perfectly correlated uniform random variables, the posterior distribution
equals the prior distribution. The uniform distribution is thus invariant under
the measurement update. We illustrate this in Fig. 14.2, where the effect of the
measurement update is displayed for a hybrid between the Gaussian and uniform
distributions.
The complete robot measurement density function with the above modeling
choices, (14.2), is formed by the convolution of the densities for a Gaussian, pG ,
and a uniform, pU , random variable, according to
Z
p(q, f|X ) = pU (x − y) pG (y) d y (14.4)
Rk
1 If the covariance of the measurements q
m are is obtainednon the
o motor-side of the gearbox, the
Cartesian-space covariance will take the form ΣC = JG E e q e T T T
q G J where G is the gear-ratio
matrix of (9.11).

161
Chapter 14. State Estimation for FSW

0.8
Gaussian
Gaussian + Uniform
0.6
2 meas.
2 meas.
0.4 50 meas.
50 meas.
0.2

0
−4σ −3σ −2σ −σ 0 σ 2σ 3σ 4σ

Figure 14.2 Illustration of measurement densities and the posterior densities


after several performed measurement updates. The figure illustrates the difference
between a Gaussian distribution, for which the variance is reduced by each addi-
tional measurement update, and the proposed hybrid distribution with d = σ, for
which the uniform part maintains its uncertainty after a measurement update.

where k is the dimensionality of the state x. This integral has no closed-form


solution, but can technically be evaluated numerically. Instead of evaluating
(14.4), which is computationally expensive and must be done for every particle at
every time step, we propose the following approximation


C
 if |∆x| ≤ d
p(q, f|X ) ≈ 2¶ (14.5)
(|∆x| − d )
µ
C exp −
 if |∆x| > d
2σ2

with ∆x taken to be the element-wise difference between the positional coordi-


nates of X and a mean vector µ ∈ R3 , ∆x = x − µ, and the normalization constant

1
C=p
2πσ + 2d

This approximation is a hybrid between the Gaussian and uniform distributions


and reduces to the Gaussian distribution as the width of the uniform part d → 0,
and reduces to the uniform distribution as σ → 0. Equation (14.5) is given for the
one-dimensional case and one possible extension to higher dimensions is given

162
14.1 State Estimator

·10−2
2
p(q, f|X )

0
10 10
5 5
0 0
−5 −5
−10 −10
x 2 [mm] x 1 [mm]

Figure 14.3 Illustration of the multivariate version of the robot measurement


density, (14.6).

by
if °∆x °2 ≤ d
° °
D

p(q, f|X ) = (14.6)


µ ¶
D exp − 1 δx TΣ−1 δx if °∆x ° > d
° °
2 2
à !
d
δx = 1 − ° ° ∆x
°∆x °
2
1
D= p
k
(2π) 2det(Σ) + V (d , k)
where k is the state dimension and V (d , k) is the volume of a k-dimensional
sphere with radius d .
The univariate distribution, and the posterior distribution after several fused
measurements, are shown in Fig. 14.2. An illustration of the multivariate case with
· ¸
4 0
Σ= , d =3
0 1
is shown in Fig. 14.3.
The kinematic deflections that cause the uncertainty in the robot measure-
ment are proportional to the external forces applied to the tool. The width of the
uniform random variable d = d (f) is therefore chosen as a function of the process
force °°
d (f) = d 0 + k d °f°
where d 0 is a nominal uncertainty chosen with respect to the maximum abso- °°
lute positioning error of the robot in the relevant work-space volume and k d °f°
reflects the increase in uncertainty with the magnitude of the process force.
The discussion on the errors associated with the robot measurements is con-
tinued in more detail in Sec. 14.4.

163
Chapter 14. State Estimation for FSW

p(m| X̂ )

p2
v m̂ Sor i g i n Laser line
pi
e m
p1
TTSF
p(q| X̂ )
Seam

Figure 14.4 Illustration of the relations between a particle X̂ (TCP hypothesis),


its belief about the location of the laser line and the laser measurement m (14.7).
Particles for which the distance, e, between the measurement hypothesis m̂ and
the seam intersection point p i is small, in terms of the distribution p(m|X ), are
more likely to be correct estimates of the current state X . The points p 1 and p 2 are
found by searching for the seam points closest to m̂.

Laser sensor measurement update


p(m|X )
A model for the measurement of the laser sensor is straightforward to describe,
but harder to evaluate since it involves evaluating the distance between a mea-
surement location and the nominal seam, i.e., determining how likely obtaining
a particular measurement is given a state hypothesis and the prespecified seam
geometry. Given a state hypothesis X̂ , the corresponding hypothesis for the inter-
section point between the seam and the laser plane, m̂, is calculated using TTSF
according to · ¸
m
m̂ = ( X̂ TTSF )1:3,4 + (14.7)
0
If the prespecified seam geometry is given as a sequence of points p t , we can
evaluate the distance e between m̂ and the seam by performing a search for the
closest nominal trajectory points. The error e is then calculated as the projection
of m̂ onto the seam along the laser plane. The projected point p i lies in the laser
plane on the line v between the closest seam points on each side on the laser
plane, p 1 and p 2 , refer to Fig. 14.4 for an illustration. Formally, the intersection
point p i must satisfy the following two equations
p i = p 1 + γv
)
nT(m̂ − p 1 )
⇒ γ =
0 = nT(p i − m̂) n Tv

164
14.1 State Estimator

where n is the normal of the laser light plane.


The mean of p(m|X ) is thus equal to

µ p(m|X ) = p i
© ª

and the shape should be chosen to reflect the error distribution of the laser sensor,
here modeled as a normal distribution according to
µ ¶
3 1 1
p(m|X ) = (2π)− 2 |Σ|− 2 exp − e TΣ−1 e , e = m̂ − p i
2

Many seam-tracking sensors are capable of measuring also the angle of the
weld surface around the normal of the laser plane. An angle measurement is easily
compared to the corresponding angle hypothesis of a particle using standard roll,
pitch, yaw calculations. Using the convention in Fig. 14.1, the angle around the
normal of the laser plane corresponds to the yaw angle. Roll and pitch angles are
unfortunately not directly measurable by this type of sensor. If, however, a sensor
with two or more laser planes is used, it is possible to estimate the full orientation
of the sensor. This will be analyzed further in Sec. 14.2.
Reduction of computational time The evaluation of p(m|X ) can be compu-
tationally expensive due to the search procedure. We can reduce this cost by
reducing the number of points to search over. This can be achieved by approxi-
mating the trajectory with a piecewise affine function. Since the intersection point
between the nominal seam line and the laser light plane is calculated, this does
not affect the accuracy of the evaluation of p(m|X ) much. To this end, we solve
the following convex optimization problem

NX 3
−2 X
° y − z °2 + λ
° °
minimize F |w t , j |
z,w
t =1 j =1
(14.8)
≤²
° °
subject to °y − z°

w t , j = z t , j − 2z t +1, j + z t +2, j

where y ∈ RN ×3 are the positions of the nominal trajectory points, z is the approx-
imation of y, and ² is the maximum allowed approximation error. The nonzero
elements of w will determine the location of the knots in the piecewise affine
approximation and λ will influence the number of knots.2
The proposed optimization problem does not incorporate constraints on the
orientation error of the approximation. This error will, however, be small if the
trajectory is smooth with bounded curvature and a constraint is put on the error
in the translational approximation, as in (14.8).
Optimization problem (14.8) can be seen as a multivariable trend-filtering
problem, a topic which was discussed in greater detail in Sec. 6.4.
2 w = z − 2z
t t t +1 + z t +2 is a discrete second-order differentiation of z.

165
Chapter 14. State Estimation for FSW

14.2 Simulation Framework


The PF algorithm (Algorithm 1), detailed in the previous sections, has been im-
plemented in an open-source framework, publicly available [PFSeamTracking.jl,
B.C. et al., 2016]. The framework provides, apart from the state estimator, conve-
nience methods for plotting, trajectory generation, optimization, simulation of
laser-, joint-, and force sensor readings and perturbations due to process forces
and kinematic model errors as well as tools for visualization of particle distribu-
tions.

Visualization
An often time-consuming part during the implementation of a particle filtering
framework is the tuning of the filter parameters. Due to the highly nonlinear
nature of the present filtering problem, this is not as straightforward as in the
Kalman-filtering scenario. A poorly tuned Kalman filter manifests itself as either
too noisy, or too slow. A poorly tuned particle filter may, however, suffer from
catastrophic failures such as mode collapse or particle degeneracy [Gustafsson,
2010; Thrun et al., 2005; Rawlings and Mayne, 2009].
To identify the presence of mode collapse or particle degeneracy and to assist
in the tuning of the filter, we provide a visualization tool that displays the true
trajectory as traversed by the robot together with the distribution of the particles,
as well as each particle’s hypothesis measurement location. An illustrative example
is shown in Fig. 14.5, where one dimension in the filter state is shown as a function
of time in a screen shot of the visualizer.
To further aid the tuning of the filter, we perform several simulations in par-
allel with nominal filter parameters perturbed by samples from a prespecified
distribution and perform statistical significance tests to determine the parameters
of most importance to the result for a certain sensor/trajectory configuration.
Figure 14.6 displays the statistical significance of various filter parameters for a
certain trajectory and sensor configuration. The color coding indicates the log(P)-
values for the corresponding parameters in a linear model predicting the errors
in Cartesian directions and orientation. As an example, the figure indicates that
the parameter σW 2 , corresponding to the orientation noise in the state update,
has a significant impact on the errors in all Cartesian directions. The sign and
value of the underlying linear model can then serve as a guide to fine tuning of
this parameter.

14.3 Analysis of Sensor Configurations

One of the main goals of this work was to enable analysis of the optimal sensor
configuration for a given seam geometry. On the one hand, not all seam geometries
allow for accurate estimation of the state in all directions, and on the other hand,
not all seam geometries require accurate control in all directions. The optimal
sensor configuration depends heavily on the amount of features present in the

166
14.3 Analysis of Sensor Configurations

3
y2

50
2

20 40 60 80 100
0
Figure 14.5 Visualization of a particle distribution as a function of time during
a simulation. The black line indicates the evolution of one coordinate of the true
state as a function of the time step and the heatmap illustrates the density of the
particles. This figure illustrates how the uncertainty of the estimate is reduced as a
feature in the trajectory becomes visible for the sensor at time step 50. The sensor is
located slightly ahead of the tool, hence, the distribution becomes narrow slightly
before the tool reaches the feature. The feature is in this case a sharp bend in the
otherwise straight seam.

R
z
-2
y
x
-4
σf
(I c
Np

fr1 5
fr2
fr3
fr4
fr5
f
sen t
et0

σm
σm
σm
σW
σW
σW

σV
σV
σV
σV
σV 1
σV 2
σV 3
σV 4

r6
pt

1
2
3
FK
FK
FK
FK
FK
ar
s

1
2
3
1
2
3
)

Figure 14.6 An illustration of how the various parameters in the software frame-
work can be tuned. By fitting linear models, with tuning parameters as factors,
that predict various errors as linear combinations of parameter values, parameters
with significant effect on the performance can be identified using the log(P)-values
(color coded). The x-axis indicates the factors and the y-axis indicates the pre-
dicted errors in orientation and translation. The parameters are described in detail
in the software framework.

167
Chapter 14. State Estimation for FSW

X2

X1
seam

Figure 14.7 A sensor with a single laser stripe is not capable of distinguishing
between wrong translation and wrong orientation. The two hypotheses X 1 , X 2 both
share the closest measurement point on the seam. The second laser stripe invali-
dates the erroneous hypothesis X 2 which would have the second measurement
point far from the seam. Without the second laser stripe it is clear that the available
sensor information can not distinguish X 1 and X 2 from each other.

trajectories, where a feature is understood as a localizable detail in the trajectory.


The estimation performance is also critically dependent on the number of laser
light planes that intersect the seam. A single laser sensor can measure three
degrees of freedom, two translations and one orientation. The remaining three
DOFs are in general not observable. This is illustrated in the planar case in Fig. 14.7.
All particles lying on a capsule manifold, generated by the spherical movement
around the measurement location, together with a sliding motion along the seam,
are equally likely given the measurement. A second measurement eliminates the
spherical component of the capsule, leaving only the line corresponding to the
sliding motion along the seam unobservable. The unobservable subspace left
when two or more laser planes are used can only be reduced by features in the
seam, breaking the line symmetry. One example of a reduction in uncertainty
due to a feature in the trajectory is illustrated in Fig. 14.5. The forward kinematics
measurement from the robot will, however, ensure that the uncertainty stays
bounded within a region of the true state.
To illustrate the importance of seam geometry for the estimation performance,
we consider Fig. 14.8, where the resulting errors for two trajectory types and several
sensor configurations (0,1,2 sensors) are displayed. The trajectories referred to in
the figure are generated as follows. The x y-trajectory lies entirely in the x y-plane
of the tool frame T , with a linear movement of 200 mm along the y-direction
and a smooth, 20 mm amplitude, triangle-wave motion in the x-direction. The
y z-trajectory lies in the y z-plane, with a linear movement of 200 mm along the
y-direction and a 100 mm amplitude, sinusoidal, motion along the z-direction.
The trajectories are depicted in Fig. 14.9. It is clear that the type of trajectory is
important for the resulting estimation error, in this case, the filter was tuned for
trajectory type x y.
Figure 14.8 illustrates the difficulties in determining the translation along
the direction of movement when no features are present, as well as the benefit
of sensor feedback in the measurable dimensions. The provided visualization
tools assist in re-tuning the filter for a new trajectory, and can suggest optimal
configurations of the available seam-tracking sensors.

168
14.4 Discussion

Error x [mm] Error y [mm]


1.0 2

0.8
1.5
0.6
1
0.4
0.5
0.2

0.0 0
Error z [mm] Error rot [deg]
2 0.8

1.5 0.6

1 0.4

0.5 0.2

0 0.0
0 sens xy

1 sens xy

2 sens xy

0 sens yz

1 sens yz

2 sens yz

0 sens xy

1 sens xy

2 sens xy

0 sens yz

1 sens yz

2 sens yz
Figure 14.8 Error distributions for various sensor configurations (0-2 sensors)
and two different trajectory types (xy,yz). In both trajectory cases, y is the major
movement direction along which the laser sensors obtain little or no information.
The same filter parameters, tuned for the x y-trajectory, were used in all experi-
ments.

14.4 Discussion

We have discussed several sources of kinematic uncertainties. To improve the


performance of a state estimator, one can consider two approaches. Reducing
uncertainties through modeling, or by introduction of additional sensing. Along
the first avenue, we note that the kinematic model of the robot used in the forward
kinematics calculations is often inaccurate, and errors in the absolute positioning
accuracy of an industrial robot can often be in the order of 1 mm or more [Mooring
et al., 1991; Nubiola and Bonev, 2013], even after additional, costly calibration
performed by the robot manufacturer [ABB, 2005]. To characterize this uncer-
tainty without performing a full kinematic calibration is usually hard, since it is a
nonlinear function of the errors in link lengths, offsets etc. in the kinematic model.
Possibilities include modeling this uncertainty as a Gaussian distribution with a
variance corresponding to the average error in the considered work-space volume,
or as a uniform distribution with a width corresponding to the maximum error. A

169
Chapter 14. State Estimation for FSW

20

10

0
0 10 20 30 40 50 60 70 80 90 100

200

100

0
0 10 20 30 40 50 60 70 80 90 100

100

50

0
0 10 20 30 40 50 60 70 80 90 100

Figure 14.9 Trajectories x y (solid) and y z (dashed). Distance [mm] along each
axis (x, y, z) is depicted as a function of time step.

number for the maximum error in the forward kinematics under no load is usually
provided by the robot manufacturer, or can be obtained using, e.g., an external
optical tracking system.
A major source of uncertainty is compliance in the structure of the robot.
Deflections in the robot joints and links caused by large process forces result in
an uncertainty in the measured tool position. This problem can be mitigated
by a compliance model, C j (τ) in (14.3), reducing the uncertainty to the level
of the model uncertainty [Lehmann et al., 2013]. Although several authors have
considered compliance modeling, the large range of possible seam geometries and
the large range of possible process forces make finding a globally valid, sufficiently
accurate compliance model very difficult.
The reduction of uncertainty through additional sensing offers the possibility
of reducing the remaining errors greatly, potentially eliminating the need for a
state estimator altogether. Sensors capable of measuring the full 6DOF pose, such
as optical tracking systems, are unfortunately very expensive. They further require
accurate measurements also of the workpiece, potentially placing additional bur-
den on the operator. Relative sensing, such as the laser sensors considered in
this work, directly measure the relevant distance between the seam and the tool.
Unfortunately, they suffer from a number of weaknesses. They can for obvious

170
14.4 Discussion

reasons not measure the location of the seam at the tool center point, and must
thus measure the seam at a different location where it is visible. In a practical
scenario, this might cause the sensor to measure the seam up to 50 mm from the
TCP. Since the location of the TCP relative to the seam must be inferred through
geometry from this sensor measurement, the full 6 DOF pose becomes relevant,
even if does not have to be accurately controlled. A second weakness of the con-
sidered relative sensing is the lack of observability along the seam direction. While
dual laser sensors allow measuring more degrees of freedom than a single laser
sensor, no amount of laser sensors can infer the position along a straight seam.
The proposed state estimator tries to infer as much as possible about the state
by requiring knowledge of the seam geometry. This is necessary for the estimator
to know what sensor measurements to expect. The particle filter maintains a
representation of the full filtering density, it is thus possible to determine in
simulation whether or not the uncertainty is such that the worst-case tracking
performance is sufficient. The uncertainty is in general highest along the direction
of movement since no sensor information is available in this direction. Fortunately,
however, this direction is also the direction with lowest required tracking accuracy.
Situations that require higher tracking accuracy in this direction luckily coincide
with the situations that allow for higher estimation accuracy, when a feature is
present in the seam. An example of this was demonstrated in Fig. 14.5.
While the proposed framework is intended for simulation in order to aid the
design of a specialized state estimator, some measures were taken to reduce the
computational time and at the same time reducing the number of parameters
the operator has to tune. The most notable such measure was the choice to not
include velocities in the state. The velocities typically present in the FSW context
are fairly low, while forces are high. The acceleration in the transverse direction
can thus be high enough to render the estimation of velocities impossible on
the short time-scale associated with vibrations in the process. The bandwidth of
the controller is further far from enough for compensation to be feasible. In the
directions along the seam, the velocity is typically well controlled by the robot
controller apart from during the transient occurring when contact is established.
Once again, the bandwidth is not sufficient to compensate for errors occurring at
the frequencies present during the transient.
Lastly, the method does not include estimation of errors in the location of the
work piece. Without assumptions on either the error in the work-piece location or
the error in the forward kinematics of the robot, these two sources of error can not
be distinguished. Hence, augmenting the state with a representation of the work-
piece error will not be fruitful. If significant variation in work-piece placement
is suspected we instead propose to add a scanning phase prior to welding. This
would allow for using the laser sensor to, under no load, measure the location of
sufficiently many points along the seam to be able to estimate the location of the
work piece in the coordinate system of the robot. This procedure, which could
be easily automated, would compensate for errors in both work-piece placement
and the kinematic chain of the robot.

171
Chapter 14. State Estimation for FSW

14.5 Conclusions
We have suggested a particle-filter based state estimator capable of estimating
the full 6 DOF pose of the tool relative to the seam in a seam-tracking scenario.
Sensor fusion is carried out between the robot internal measurements, propagated
through a forward kinematics model with large uncertainties due to the applied
process forces, and measurements from a class of seam-tracking laser sensors.
We have highlighted some of the difficulties related to state estimation where
accurate measurements come in a reduced-dimensional space, together with
highly uncertain measurements of the full state space, where the uncertainties are
highly correlated in time.
The presented framework is available as open-source [PFSeamTracking.jl,
B.C. et al., 2016] and the algorithm has been successfully implemented at The
Welding Institute (TWI) in Sheffield, UK, and is capable of executing in approxi-
mately 1000 Hz using 500 particles on a standard desktop PC.

172
Conclusions and Future
Work

We have presented a wide range of problems and methods within estimation for
physical systems. Common to many parts of the thesis is the use of ideas from
machine learning to solve classical identification problems. Many of the problems
considered could, in theory, be solved by gathering massive amounts of data and
training a deep neural network. Instead, we have opted for developing methods
that make use of prior knowledge where available, and flexibility where not. This
has resulted in practical methods that require a practical amount of data. The
proof of this has in many cases been provided by experimental application on
physical systems, the very systems that motivated the work.
Looking forward, we see robust uncertainty quantification as a very interesting
and important direction for future work. Some of the developed methods have a
probabilistic interpretation and lend themselves well to maximum a posteriori
inference, in restricted settings. Lifting this restriction is straightforward in theory
but most often computationally challenging. Work on approximate methods has
recently made great strides in the area, but the field requires further attention
before its application as a robust technology.

173
Bibliography

ABB (2005). Absolute accuracy: industrial robot option. URL: https://library.


e.abb.com/public/0f879113235a0e1dc1257b130056d133/Absolute%
20Accuracy%20EN_R4%20US%2002_05.pdf (visited on 2018-11-22).
Abele, E., S. Rothenbücher, and M. Weigold (2008). “Cartesian compliance model
for industrial robots using virtual joints”. Production Engineering 2:3, pp. 339–
343.
Amos, B. and J. Z. Kolter (2017). Optnet: differentiable optimization as a layer in
neural networks. eprint: arXiv:1703.00443.
Armstrong, B. (1988). “Friction: experimental determination, modeling and com-
pensation”. In: Robotics and Automation, Proc. 1988 IEEE Int. Conf. Pennsylva-
nia, pp. 1422–1427.
Armstrong, B., P. Dupont, and C. C. De Wit (1994). “A survey of models, analysis
tools and compensation methods for the control of machines with friction”.
Automatica 30:7, pp. 1083–1138.
Åström, K. and B. Wittenmark (2013a). Computer-Controlled Systems: Theory and
Design, Third Edition. Dover Publications, Minola, NY. ISBN: 9780486284040.
Åström, K. J. (2012). Introduction to stochastic control theory. Courier Corporation,
New York.
Åström, K. J. and B. Wittenmark (2013b). Adaptive control. Courier Corporation,
New York.
Åström, K. J. and R. M. Murray (2010). Feedback systems: an introduction for
scientists and engineers. Princeton University Press, New Jersey.
Åström, K. J. and B. Wittenmark (2011). Computer-controlled systems: theory and
design. Dover, New York.
Bagge Carlson, F. (2015). Robotlib.jl. Dept. Automatic Control, Lund University,
Sweden. URL: https://github.com/baggepinnen/Robotlib.jl.
Bagge Carlson, F. (2016a). BasisFunctionExpansions.jl. Dept. Automatic Con-
trol, Lund University, Sweden. URL: https://github.com/baggepinnen/
BasisFunctionExpansions.jl.

174
Bibliography

Bagge Carlson, F. (2016b). DifferentialDynamicProgramming.jl. Dept. Auto-


matic Control, Lund University, Sweden. URL: https : / / github . com /
baggepinnen/DifferentialDynamicProgramming.jl.
Bagge Carlson, F. (2016c). LPVSpectral.jl. Dept. Automatic Control, Lund Univer-
sity, Sweden. URL: https://github.com/baggepinnen/LPVSpectral.jl.
Bagge Carlson, F. (2017). LTVModels.jl. Dept. Automatic Control, Lund University,
Sweden. URL: https://github.com/baggepinnen/LTVModels.jl.
Bagge Carlson, F. (2018a). JacProp.jl. Dept. Automatic Control, Lund University,
Sweden. URL: https://github.com/baggepinnen/JacProp.jl.
Bagge Carlson, F. (2018b). LowLevelParticleFilters.jl. Dept. Automatic Control,
Lund University, Sweden. URL: https : / / github . com / baggepinnen /
LowLevelParticleFilters.jl.
Bagge Carlson, F. (2018c). TotalLeastSquares.jl. Dept. Automatic Control,
Lund University, Sweden. URL: https : / / github . com / baggepinnen /
TotalLeastSquares.jl.
Bagge Carlson, F. and M. Haage (2017). YuMi low-level motion guidance using the
Julia programming language and Externally Guided Motion Research Inter-
face. Technical report TFRT-7651. Department of Automatic Control, Lund
University, Sweden.
Bagge Carlson, F., R. Johansson, and A. Robertsson (2015a). “Six DOF eye-to-hand
calibration from 2D measurements using planar constraints”. In: Int. Conf.
Intelligent Robots and Systems (IROS), Hamburg. IEEE.
Bagge Carlson, F., R. Johansson, and A. Robertsson (2018a). “Tangent-space regu-
larization for neural-network models of dynamical systems”. arXiv preprint
arXiv:1806.09919.
Bagge Carlson, F. and M. Karlsson (2016a). DynamicMovementPrimitives.jl. Dept.
Automatic Control, Lund University, Sweden. URL: https://github.com/
baggepinnen/DynamicMovementPrimitives.jl.
Bagge Carlson, F. and M. Karlsson (2016b). PFSeamTracking.jl. Dept. Auto-
matic Control, Lund University, Sweden. URL: https : / / github . com /
baggepinnen/PFSeamTracking.jl.
Bagge Carlson, F., M. Karlsson, A. Robertsson, and R. Johansson (2016). “Particle
filter framework for 6D seam tracking under large external forces using 2D
laser sensors”. In: Int. Conf. Intelligent Robots and Systems (IROS), Daejeong,
South Korea.
Bagge Carlson, F., A. Robertsson, and R. Johansson (2015b). “Modeling and identi-
fication of position and temperature dependent friction phenomena without
temperature sensing”. In: Int. Conf. Intelligent Robots and Systems (IROS),
Hamburg. IEEE.
Bagge Carlson, F., A. Robertsson, and R. Johansson (2017). “Linear parameter-
varying spectral decomposition”. In: 2017 American Control Conf (ACC), Seat-
tle.

175
Bibliography

Bagge Carlson, F., A. Robertsson, and R. Johansson (2018b). “Identification of LTV


dynamical models with smooth or discontinuous time evolution by means
of convex optimization”. In: IEEE Int. Conf. Control and Automation (ICCA),
Anchorage, AK.
Bao, Y., L. Tang, and D. Shah (2017). “Robotic 3d plant perception and leaf probing
with collision-free motion planning for automated indoor plant phenotyping”.
In: 2017 ASABE Annual International Meeting. American Society of Agricultural
and Biological Engineers, p. 1.
Bellman, R. (1953). An Introduction to the Theory of Dynamic Programming. R-245.
Rand Corporation, Santa Monica, CA.
Bellman, R. (1961). “On the approximation of curves by line segments using dy-
namic programming”. Communications of the ACM 4:6, p. 284.
Bellman, R. and R. Roth (1969). “Curve fitting by segmented straight lines”. Journal
of the American Statistical Association 64:327, pp. 1079–1084.
Bennett, D. J., J. M. Hollerbach, and P. D. Henri (1992). “Kinematic calibration by
direct estimation of the jacobian matrix”. In: Proceedings. IEEE International
Conference on Robotics and Automation, Nice. Pp. 351–357.
Bertsekas, D. P., D. P. Bertsekas, D. P. Bertsekas, and D. P. Bertsekas (2005). Dynamic
programming and optimal control. Vol. 1. 3. Athena scientific Belmont, MA.
Bezanson, J., A. Edelman, S. Karpinski, and V. B. Shah (2017). “Julia: a fresh ap-
proach to numerical computing”. SIAM Review 59:1, pp. 65–98.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, New
York.
Bittencourt, A. C. and S. Gunnarsson (2012). “Static friction in a robot joint - mod-
eling and identification of load and temperature effects”. Journal of Dynamic
Systems, Measurement, and Control 134:5.
Botev, A., H. Ritter, and D. Barber (2017). Practical Gauss-Newton optimisation for
deep learning. eprint: arXiv:1706.03662.
Boyd, S. and L. Vandenberghe (2004). Convex optimization. Cambridge University
Press, Cambridge, UK.
Bristow, D. A., M. Tharayil, and A. G. Alleyne (2006). “A survey of iterative learning
control”. IEEE Control Systems 26:3, pp. 96–114.
Bugmann, G. (1998). “Normalized gaussian radial basis function networks”. Neu-
rocomputing 20:1-3, pp. 97–110. ISSN: 0925-2312.
Chalus, M. and J. Liska (2018). “Calibration and using a laser profile scanner for 3d
robotic welding”. International Journal of Computational Vision and Robotics
8:4, pp. 351–374.
Chen, D., A. Song, and A. Li (2015). “Design and calibration of a six-axis force/-
torque sensor with large measurement range used for the space manipulator”.
Procedia Engineering 99:1, pp. 1164–1170.
Costa, O. L. V., M. D. Fragoso, and R. P. Marques (2006). Discrete-time Markov jump
linear systems. Springer Science & Business Media, London.

176
Bibliography

Dahl, P. (1968). A solid friction model. Tech. rep. TOR-0158 (3107-18)-1. Aerospace
Corp, El Segundo, CA.
Daniilidis, K. (1999). “Hand-eye calibration using dual quaternions”. The Interna-
tional Journal of Robotics Research 18:3, pp. 286–298.
De Backer, J. (2014). Feedback Control of Robotic Friction Stir Welding. PhD thesis.
ISBN 978-91-87531-00-2, University West, Trollhättan, Sweden.
De Backer, J. and G. Bolmsjö (2014). “Deflection model for robotic friction stir
welding”. Industrial Robot: An International Journal 41:4, pp. 365–372.
De Backer, J., A.-K. Christiansson, J. Oqueka, and G. Bolmsjö (2012). “Investigation
of path compensation methods for robotic friction stir welding”. Industrial
Robot: An International Journal 39:6, pp. 601–608.
De Wit, C. C., H. Olsson, K. J. Åström, and P. Lischinsky (1995). “A new model
for control of systems with friction”. Automatic Control, IEEE Trans. on 40:3,
pp. 419–425.
Eckart, C. and G. Young (1936). “The approximation of one matrix by another of
lower rank”. Psychometrika 1:3, pp. 211–218.
Eggert, D. W., A. Lorusso, and R. B. Fisher (1997). “Estimating 3-d rigid body
transformations: a comparison of four major algorithms”. Machine Vision and
Applications 9:5-6, pp. 272–290.
Evensen, G. (2003). “The ensemble Kalman filter: theoretical formulation and
practical implementation”. Ocean Dynamics 53:4, pp. 343–367. ISSN: 1616-
7341. DOI: 10.1007/s10236-003-0036-9.
Fang, X. (2013). “Weighted total least squares: necessary and sufficient conditions,
fixed and random parameters”. Journal of geodesy 87:8, pp. 733–749.
Fischler, M. A. and R. C. Bolles (1981). “Random sample consensus: a paradigm for
model fitting with applications to image analysis and automated cartography”.
Communications of the ACM 24:6, pp. 381–395.
Gao, X., D. You, and S. Katayama (2012). “Seam tracking monitoring based on adap-
tive kalman filter embedded elman neural network during high-power fiber
laser welding”. Industrial Electronics, IEEE Transactions on 59:11, pp. 4315–
4325.
Gershman, S. J. and D. M. Blei (2011). A tutorial on Bayesian nonparametric models.
eprint: arXiv:1106.2697.
Glad, T. and L. Ljung (2014). Control theory. CRC press, Boca Raton, Florida.
Goldstein, A. A. (1964). “Convex programming in hilbert space”. Bulletin of the
American Mathematical Society 70:5, pp. 709–710.
Golub, G. H. and C. F. Van Loan (2012). Matrix computations. Vol. 3. Johns Hopkins
University Press, Baltimore.
Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. http://www.
deeplearningbook.org. MIT Press, Cambridge MA.

177
Bibliography

Guillo, M. and L. Dubourg (2016). “Impact & improvement of tool deviation in


friction stir welding: weld quality & real-time compensation on an industrial
robot”. Robotics and Computer-Integrated Manufacturing 39, pp. 22–31.
Gustafsson, F. (2010). “Particle filter theory and practice with positioning applica-
tions”. Aerospace and Electronic Systems Magazine, IEEE 25:7, pp. 53–82.
Hansen, P. C. (1994). “Regularization tools: a matlab package for analysis and
solution of discrete ill-posed problems”. Numerical algorithms 6:1, pp. 1–35.
URL : http://www2.compute.dtu.dk/~pcha/Regutools/RTv4manual.
pdf (visited on 2017-01).
Harris, F. J. (1978). “On the use of windows for harmonic analysis with the discrete
fourier transform”. Proceedings of the IEEE 66:1, pp. 51–83.
He, K., X. Zhang, S. Ren, and J. Sun (2015). Deep residual learning for image recog-
nition. eprint: arXiv:1512.03385.
He, K., X. Zhang, S. Ren, and J. Sun (2016). “Deep residual learning for image
recognition”. In: IEEE Conf. on Computer Vision and Pattern Recognition, Las
Vegas, pp. 770–778.
Hjort, N. L., C. Holmes, P. Müller, and S. G. Walker (2010). Bayesian nonparametrics.
Vol. 28. Cambridge University Press, Cambridge, UK.
Hochreiter, S. and J. Schmidhuber (1997). “Long short-term memory”. Neural
computation 9:8, pp. 1735–1780.
Horaud, R. and F. Dornaika (1995). “Hand-eye calibration”. The International
Journal of Robotics Research 14:3, pp. 195–210.
Huang, P.-Y., Y.-Y. Chen, and M.-S. Chen (1998). “Position-dependent friction
compensation for ballscrew tables”. In: Control Applications, 1998. Proc. 1998
IEEE Int. Conf., Trieste, Italy. Vol. 2, pp. 863–867.
Ikits, M. and J. Hollerbach (1997). “Kinematic calibration using a plane constraint”.
In: Robotics and Automation, 1997. Proceedings., 1997 IEEE International Con-
ference on, Pittsburgh. Vol. 4, 3191–3196 vol.4. DOI: 10.1109/ROBOT.1997.
606774.
Innes, M. (2018). “Flux: elegant machine learning with julia”. Journal of Open
Source Software. DOI: 10.21105/joss.00602.
Ioffe, S. and C. Szegedy (2015). “Batch normalization: accelerating deep network
training by reducing internal covariate shift”. In: International Conference on
Machine Learning, Lille, pp. 448–456.
Ionides, E. L., C. Bretó, and A. A. King (2006). “Inference for nonlinear dynamical
systems”. Proceedings of the National Academy of Sciences 103:49, pp. 18438–
18443.
Johansson, R. (1993). System modeling & identification. Prentice-Hall, Englewood
Cliffs, NJ.
Julialang (2017). Julia standard library. URL: http://docs.julialang.org/en/
stable/stdlib/linalg/ (visited on 2017-01).

178
Bibliography

Karl, M., M. Soelch, J. Bayer, and P. van der Smagt (2016). Deep variational Bayes fil-
ters: unsupervised learning of state space models from raw data. eprint: arXiv:
1605.06432.
Karlsson, M., F. Bagge Carlson, A. Robertsson, and R. Johansson (2017). “Two-
degree-of-freedom control for trajectory tracking and perturbation recovery
during execution of dynamical movement primitives”. In: 20th IFAC World
Congress, Toulouse.
Kay, S. M. (1993). Fundamentals of statistical signal processing, volume I: estima-
tion theory. Prentice Hall, Englewood Cliffs, NJ.
Khalil, H. K. (1996). “Nonlinear systems”. Prentice-Hall, New Jersey 2:5.
Kim, S.-J., K. Koh, S. Boyd, and D. Gorinevsky (2009). “`1 trend filtering”. SIAM
review 51:2, pp. 339–360.
Kingma, D. and J. Ba (2014). “Adam: a method for stochastic optimization”. arXiv
preprint arXiv:1412.6980.
Kruif, B. J. de and T. J. de Vries (2002). “Support-vector-based least squares for
learning non-linear dynamics”. In: Decision and Control, 2002, Proc. IEEE Conf.,
Las Vegas. Vol. 2, pp. 1343–1348.
Lehmann, C., B. Olofsson, K. Nilsson, M. Halbauer, M. Haage, A. Robertsson, O.
Sörnmo, and U. Berger (2013). “Robot joint modeling and parameter identifi-
cation using the clamping method”. In: 7th IFAC Conference on Manufacturing
Modelling, Management,and Control. Saint Petersburg, Russia, pp. 843–848.
Lennartson, B., R. H. Middleton, and I. Gustafsson (2012). “Numerical sensitivity
of linear matrix inequalities using shift and delta operators”. IEEE Transactions
on Automatic Control 57:11, pp. 2874–2879.
Levine, S. and P. Abbeel (2014). “Learning neural network policies with guided
policy search under unknown dynamics”. In: Advances in Neural Information
Processing Systems, Montreal, pp. 1071–1079.
Levine, S. and V. Koltun (2013). “Guided policy search”. In: Int. Conf. Machine
Learning (ICML), Atlanta, pp. 1–9.
Levine, S., N. Wagener, and P. Abbeel (2015). “Learning contact-rich manipulation
skills with guided policy search”. In: Robotics and Automation (ICRA), IEEE Int.
Conf., Seattle. IEEE, pp. 156–163.
Lindström, E., E. Ionides, J. Frydendall, and H. Madsen (2012). “Efficient iterated
filtering”. IFAC Proceedings Volumes 45:16, pp. 1785–1790.
Ljung, L. (1987). System identification: theory for the user. Prentice-hall, Englewood
Cliffs, NJ.
Ljung, L. and T. Söderström (1983). Theory and practice of recursive identification.
MIT press, Cambridge, MA.
Manchester, I. R., M. M. Tobenkin, and A. Megretski (2012). “Stable nonlinear sys-
tem identification: convexity, model class, and consistency”. IFAC Proceedings
Volumes 45:16, pp. 328–333. DOI: 10.3182/20120711-3-BE-2027.00405.

179
Bibliography

Mayne, D. (1966). “A second-order gradient method for determining optimal tra-


jectories of non-linear discrete-time systems”. International Journal of Control
3:1, pp. 85–95.
Merriënboer, B. van, O. Breuleux, A. Bergeron, and P. Lamblin (2018). Automatic
differentiation in ML: where we are and where we should be going. eprint:
arXiv:1810.11530.
Meta Vision Systems (2014). Meta FSW. URL: http://www.meta-mvs.com/fsw
(visited on 2014-01-12).
Middleton, R. and G. Goodwin (1986). “Improved finite word length characteris-
tics in digital control using delta operators”. IEEE Transactions on Automatic
Control 31:11, pp. 1015–1021.
Midling, O. T., E. J. Morley, and A. Sandvik (1998). Friction stir welding. US Patent
5,813,592.
Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015). “Human-level control
through deep reinforcement learning”. Nature 518:7540, p. 529.
Mooring, B., Z. Roth, and M. Driels (1991). Fundamentals of manipulator calibra-
tion. J. Wiley, New York. ISBN: 9780471508649.
Mortari, D., F. L. Markley, and P. Singla (2007). “Optimal linear attitude estimator”.
Journal of Guidance, Control, and Dynamics 30:6, pp. 1619–1627.
Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press,
Cambridge, MA.
Murray, R. M., Z. Li, and S. S. Sastry (1994). A mathematical introduction to robotic
manipulation. CRC Press, Boca Raton, Florida.
Nagarajaiah, S. and Z. Li (2004). “Time segmented least squares identification
of base isolated buildings”. Soil Dynamics and Earthquake Engineering 24:8,
pp. 577–586.
Nayak, N. R. and A. Ray (2013). Intelligent seam tracking for robotic welding.
Springer Science & Business Media, London, UK.
Nguyen, Q., M. C. Mukkamala, and M. Hein (2018). On the loss landscape of a class
of deep neural networks with no bad local valleys. eprint: arXiv:1809.10749.
Nocedal, J. and S. Wright (1999). Numerical optimization. Springer-Verlag New
York, Inc.
Nubiola, A. and I. A. Bonev (2013). “Absolute calibration of an ABB IRB 1600 robot
using a laser tracker”. Robotics and Computer-Integrated Manufacturing 29:1,
pp. 236–245. ISSN: 0736-5845.
Oberman, A. M. and J. Calder (2018). Lipschitz regularized deep neural networks
converge and generalize. eprint: arXiv:1808.09540.
Ohlsson, H. (2010). Regularization for Sparseness and Smoothness: Applications
in System Identification and Signal Processing. PhD thesis 1351. Linköping
University Electronic Press, Linköping, Sweden.

180
Bibliography

Olofsson, B. (2015). Topics in Machining with Industrial Robot Manipulators and


Optimal Motion Control. PhD thesis. ISRN TFRT–1108–SE, Lund University,
Lund, Sweden.
Olsson, H., K. J. Åström, C. C. de Wit, M. Gäfvert, and P. Lischinsky (1998). “Friction
models and friction compensation”. European Journal of Control 4:3, pp. 176–
195.
Parikh, N. and S. Boyd (2014). “Proximal algorithms”. Foundations and Trends in
Optimization 1:3, pp. 127–239.
Paris, J. F. (2011). “A note on the sum of correlated gamma random variables”. CoRR
abs/1103.0505. URL: http://arxiv.org/abs/1103.0505.
Park, J. and I. W. Sandberg (1991). “Universal approximation using radial-basis-
function networks”. Neural computation 3:2, pp. 246–257.
Pascanu, R., T. Mikolov, and Y. Bengio (2013). “On the difficulty of training re-
current neural networks”. In: International Conference on Machine Learning,
Atlanta, pp. 1310–1318.
Pearson, K. (1901). “On lines and planes of closest fit to systems of points in space”.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of
Science 2:11, pp. 559–572.
Picinbono, B. (1996). “Second-order complex random vectors and normal distri-
butions”. IEEE Transactions on Signal Processing 44:10, pp. 2637–2640.
Puryear, C. I., O. N. Portniaguine, C. M. Cobos, and J. P. Castagna (2012). “Con-
strained least-squares spectral analysis: application to seismic data”. Geo-
physics 77:5, pp. V143–V167.
Ramachandran, P., B. Zoph, and Q. V. Le (2017). Searching for activation functions.
eprint: arXiv:1710.05941.
Rasmussen, C. E. (2004). “Gaussian processes in machine learning”. In: Advanced
lectures on machine learning. Springer, New York, pp. 63–71.
Rauch, H. E., F. Tung, C. T. Striebel, et al. (1965). “Maximum likelihood estimates
of linear dynamic systems”. AIAA journal 3:8, pp. 1445–1450.
Rawlings, J. and D. Mayne (2009). Model Predictive Control: Theory and Design.
Nob Hill Pub. Madison, Wisconsin. ISBN: 9780975937709.
Rummery, G. A. and M. Niranjan (1994). On-line Q-learning using connectionist
systems. Vol. 37. University of Cambridge, Department of Engineering, Cam-
bridge, England.
Saad, Y. and M. H. Schultz (1986). “Gmres: a generalized minimal residual al-
gorithm for solving nonsymmetric linear systems”. Journal on scientific and
statistical computing 7:3, pp. 856–869.
Schulman, J., S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015). “Trust region
policy optimization”. In: International Conference on Machine Learning, Lille,
pp. 1889–1897.
SICK IVP (2011). SICK Ranger. URL: http : / / www . chronos - vision . de /
downloads/FAQ_Summary_3D_Camera_V1.13.pdf (visited on 2018-08-29).

181
Bibliography

Silver, D., A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J.


Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016).
“Mastering the game of go with deep neural networks and tree search”. Nature
529:7587, p. 484.
Silver, D., G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014).
“Deterministic policy gradient algorithms”. In: Proceedings of the 31st Interna-
tional Conference on International Conference on Machine Learning (ICML)
Beijing. JMLR.
Sjöberg, J., Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjal-
marsson, and A. Juditsky (1995). “Nonlinear black-box modeling in system
identification: a unified overview”. Automatica 31:12, pp. 1691–1724.
Song, A., J. Wu, G. Qin, and W. Huang (2007). “A novel self-decoupled four degree-
of-freedom wrist force/torque sensor”. Measurement 40:9-10, pp. 883–891.
Sörnmo, O. (2015). Adaptation and Learning for Manipulators and Machining.
PhD thesis. ISRN TFRT–1110–SE, Lund University, Lund, Sweden.
Spong, M. W., S. Hutchinson, and M. Vidyasagar (2006). Robot modeling and
control. Vol. 3. Wiley, New York.
Stella, L., N. Antonello, and M. Fält (2016). ProximalOperators.jl. URL : https :
//github.com/kul-forbes/ProximalOperators.jl.
Stoer, J. and R. Bulirsch (2013). Introduction to numerical analysis. Vol. 12. Springer
Science & Business Media, New York.
Stoica, P. and R. L. Moses (2005). Spectral analysis of signals. Pearson/Prentice Hall
Upper Saddle River, NJ.
Sutton, R. S. (1991). “Dyna, an integrated architecture for learning, planning, and
reacting”. ACM SIGART Bulletin 2:4, pp. 160–163.
Sutton, R. S., D. A. McAllester, S. P. Singh, and Y. Mansour (2000). “Policy gra-
dient methods for reinforcement learning with function approximation”. In:
Advances in neural information processing systems, Denver, pp. 1057–1063.
Tassa, Y., N. Mansard, and E. Todorov (2014). “Control-limited differential dynamic
programming”. In: Robotics and Automation (ICRA), 2014 IEEE International
Conference on, Hong Kong. DOI: 10.1109/ICRA.2014.6907001.
Thrun, S., W. Burgard, and D. Fox (2005). Probabilistic Robotics. Intelligent robotics
and autonomous agents. MIT Press, Cambridge, MA. ISBN: 9780262201629.
Tibshirani, R. J. et al. (2014). “Adaptive piecewise polynomial estimation via trend
filtering”. The Annals of Statistics 42:1, pp. 285–323.
Todorov, E. and W. Li (2005). “A generalized iterative lqg method for locally-optimal
feedback control of constrained nonlinear stochastic systems”. In: American
Control Conference. IEEE, pp. 300–306.
Tsai, R. Y. and R. K. Lenz (1989). “A new technique for fully autonomous and
efficient 3d robotics hand/eye calibration”. Robotics and Automation, IEEE
Transactions on 5:3, pp. 345–358.

182
Bibliography

Tsiotras, P., J. L. Junkins, and H. Schaub (1997). “Higher-order cayley transforms


with applications to attitude representations”. Journal of Guidance, Control,
and Dynamics 20:3, pp. 528–534.
Ulyanov, D., A. Vedaldi, and V. Lempitsky (2017). “Deep image prior”. arXiv preprint
arXiv:1711.10925.
Van Overschee, P. and B. De Moor (1995). “A unifying theorem for three subspace
system identification algorithms”. Automatica 31:12, pp. 1853–1864.
Verhaegen, M. and P. Dewilde (1992). “Subspace model identification
part 1. the output-error state-space model identification class of al-
gorithms”. International Journal of Control 56:5, pp. 1187–1210. DOI:
10 . 1080 / 00207179208934363. URL: https : / / doi . org / 10 . 1080 /
00207179208934363.
Vidal, R., A. Chiuso, and S. Soatto (2002). “Observability and identifiability of jump
linear systems”. In: IEEE Conf. Decision and Control (CDC), Las Vegas. Vol. 4.
IEEE, pp. 3614–3619.
Watkins, C. J. and P. Dayan (1992). “Q-learning”. Machine learning 8:3-4, pp. 279–
292.
Wells, D. E., P. Vanícek, and S. D. Pagiatakis (1985). Least squares spectral anal-
ysis revisited. 84. Department of Surveying Engineering, University of New
Brunswick Fredericton, Canada.
Wilk, M. B. and R. Gnanadesikan (1968). “Probability plotting methods for the
analysis of data”. Biometrika 55:1, pp. 1–17. DOI: 10.1093/biomet/55.1.1.
URL : https://doi.org/10.1093/biomet/55.1.1.
Williams, R. (1988). Toward a Theory of Reinforcement-learning Connectionist
Systems. B00072BHM6. Northeastern University, Boston, MA.
Wilson, A. C., R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017). “The marginal
value of adaptive gradient methods in machine learning”. In: Advances in
Neural Information Processing Systems, pp. 4148–4158.
Wittenmark, B., K. J. Åström, and K.-E. Årzén (2002). “Computer control: an
overview”. IFAC Professional Brief. URL: https://www.ifac-control.org/
publications/list-of-professional-briefs/pb_wittenmark_etal_
final.pdf/view.
Xu, K., J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and
Y. Bengio (2015). “Show, attend and tell: neural image caption generation
with visual attention”. CoRR abs/1502.03044. arXiv: 1502.03044. URL: http:
//arxiv.org/abs/1502.03044.
Yongsheng, W., W. Tianqi, L. Liangyu, L. Jinzhong, and D. Boyu (2017). “Auto-
matic path planning technology of stitching robot for composite fabric with
curved surface (translated from Chinese)”. Materials Science and Technology
(translated from Chinese) 25:2, pp. 16–21.

183
Bibliography

Yuan, M. and Y. Lin (2006). “Model selection and estimation in regression with
grouped variables”. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 68:1, pp. 49–67.
Zhang, Q. and R. Pless (2004). “Extrinsic calibration of a camera and laser range
finder (improves camera calibration)”. In: Intelligent Robots and Systems, 2004.
(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, Sendai,
Japan. Vol. 3, 2301–2306 vol.3. DOI: 10.1109/IROS.2004.1389752.
Zhang, W. (1999). State-space search: Algorithms, complexity, extensions, and ap-
plications. Springer Science & Business Media, New York.
Zhuang, H., S. Motaghedi, and Z. S. Roth (1999). “Robot calibration with pla-
nar constraints”. In: Robotics and Automation, 1999. Proceedings. 1999 IEEE
International Conference on, Detroit, Michigan. Vol. 1, 805–810 vol.1. DOI:
10.1109/ROBOT.1999.770073.

184

You might also like