
A Hierarchical Bayesian Linear Regression Model With Local Features For Stochastic Dynamics Approximation

This document summarizes a research paper that presents a hierarchical Bayesian linear regression model with local features to learn the dynamics of stochastic dynamical systems. The model is hierarchical since it considers non-stationary priors for the model parameters, which increases its flexibility. The authors use variational expectation maximization to estimate the maximum likelihood of this hierarchical model. They apply their method to two illustrative systems and a micro-robotic system, demonstrating accurate and fast predictions for approximating stochastic dynamics.


Preprint · August 2018


All content following this page was uploaded by Behnoosh Parsa on 15 August 2018.


A Hierarchical Bayesian Linear Regression Model with Local Features for Stochastic Dynamics Approximation

Behnoosh Parsa · Keshav Rajasekaran · Franziska Meier · Ashis G. Banerjee



arXiv:1807.03931v2 [cs.LG] 1 Aug 2018

Abstract One of the challenges with model-based control of stochastic dynamical systems is that the state transition dynamics are involved, making it difficult and inefficient to make good-quality predictions of the states. Moreover, there are not many representational models for the majority of autonomous systems, as it is not easy to build a compact model that captures all the subtleties and uncertainties in the system dynamics. In this work, we present a hierarchical Bayesian linear regression model with local features to learn the dynamics of such systems. The model is hierarchical since we consider non-stationary priors for the model parameters, which increases its flexibility. To solve the maximum likelihood (ML) estimation problem for this hierarchical model, we use the variational expectation maximization (EM) algorithm, and enhance the procedure by introducing hidden target variables. The algorithm is guaranteed to converge to the optimal log-likelihood values under certain reasonable assumptions. It also yields parsimonious model structures, and consistently provides fast and accurate predictions for all our examples, including two illustrative systems and a challenging micro-robotic system, involving large training and test sets. These results demonstrate the effectiveness of the method in approximating stochastic dynamics, which makes it suitable for future use in a paradigm, such as model-based reinforcement learning, to compute optimal control policies in real time.

B. Parsa
Department of Mechanical Engineering, University of Washington, Seattle, WA, USA
E-mail: [email protected]
K. Rajasekaran
Department of Mechanical Engineering, University of Maryland, College Park, MD, USA
E-mail: [email protected]
F. Meier
Autonomous Motion Department
Max Planck Institute for Intelligent Systems, Tübingen, Germany
E-mail: [email protected]
A. G. Banerjee
Department of Industrial & Systems Engineering and Department of Mechanical Engineering
University of Washington, Seattle, WA, USA
E-mail: [email protected]
2 Behnoosh Parsa et al.

Keywords Hierarchical Bayesian Model · Variational EM Algorithm · Locally Weighted Regression · Stochastic Dynamics

1 Introduction

Analyzing the dynamics of nonlinear systems in various ways has always been an important area of research. This field includes nonlinear system identification [45, 44], modal analysis of dynamical systems [50, 56], system identification using neural networks [40, 25], and supervised learning of time series data, which is also known as function approximation or regression [2, 53, 58, 13, 29].
Most of the dynamical systems we want to understand and control are either
stochastic, or they have some inherent noise components, which makes learning
their models challenging. Moreover, in model-based control [5, 1], or optimal con-
trol [21], often the model must be learned online as we get some measurement
data throughout the process. Hence, the learning process not only has to result in
accurate and precise approximations of the real-world behavior of the system but
also must be fast enough so that it will not delay the control process. For exam-
ple, in model-based reinforcement learning [5, 1], an agent first learns a model of
the environment and then uses that model to decide which action is best to take
next. If the computation of the transition dynamics takes very long, the predicted
policy is no longer useful. On the other hand, if the control policy is learned based
on an imprecise transition model, the policy cannot be used without significant
modifications as shown in Atkeson (1994) [4]. These learning procedures can be
combined with Bayesian inference techniques to capture the stochasticities and/or
uncertainties in the dynamical systems as well as the nonlinearities [38, 41, 32].
One of the popular methods to learn the transition dynamics is supervised
learning. Usually, there are three different ways to construct the learning criteria:
global, local, or a combination of both. From another perspective, these methods
are classified into memory-based (lazy) or memory-less (eager) methods based on
whether they use the training sets in the prediction process. An example of the
former methodology is the k-nearest neighbor algorithm, and for the latter, the
artificial neural network is a good representative. In an artificial neural network,
the target function is approximated globally during training, implying that we do
not need the training set to run an inference for a new query. Therefore, it requires
much less memory than a lazy learning system. Moreover, the post-training queries
have no effect on the learned model itself, and we get the same result every time
for a given query. On the other hand, in lazy learning, the training set expands
for any new query, and, thus, the model’s prediction changes over time.

1.1 Global Regression Methods for Learning Dynamics

Most regression algorithms are global in that they minimize a global loss function. Gaussian processes (GPs) are popular global methods [49] that allow us to estimate the hyperparameters under uncertainties. However, their computational cost ($O(N^3)$ for $N$ observations) limits their applications in learning the transition dynamics for robot control. There have been efforts both in sparsifying Gaussian
Bayesian Local Linear Regression 3

process regression (GPR) [52, 55, 34, 37, 24] and developing online [33, 60] and incremental [17, 14] algorithms for the sparse GPR to make it applicable for robotic systems. Another challenging question is how to construct these processes, for instance, how to initialize the hyperparameters or define a reasonable length scale. [59] presents the convolutional Gaussian processes, in which a variational framework is adopted for approximation in GP models. The variational objective minimizes the KL divergence across the entire latent process [36], which guarantees an exact model approximation given enough resources. The computational complexity of this algorithm is $O(NM^2)$ ($M \ll N$) through sparse approximation [55]. Moreover, it optimizes a non-Gaussian likelihood function [46].

1.2 Dynamics Learning with Spatially Localized Basis Functions

Receptive field-weighted regression (RFWR) in [51] is one of the prominent algorithms that motivates many incremental learning procedures. Receptive fields in nonparametric regression are constructed online and discarded right after prediction. Locally weighted regression (LWR), and a more advanced version of it, termed locally weighted projection regression (LWPR) [58], are two popular variants of RFWR, especially in control and robotics. The idea is that a local model is trained incrementally within each receptive field independent of the other receptive fields. Prediction is then made by blending the results of all the local models. The size and shape of the receptive fields and the bias on the relevance of the individual input dimensions are the parameters to be learned in these models. In addition, the learning algorithm allocates new receptive fields and prunes the extra ones as needed when new data are observed. The local models are usually simple, like low-order polynomials [22]. These methods are robust to negative interference since the local models are trained independently. This contrasts with neural network models and is discussed in [51].
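The blending step described above can be sketched in a few lines. This is a minimal illustration, assuming scalar inputs, Gaussian receptive-field activations, and local linear models that have already been fitted; the function name and all numerical values are hypothetical and are not taken from [51] or [58].

```python
import numpy as np

def blend_local_predictions(x, centers, widths, slopes, offsets):
    """Blend K local linear models at a scalar query x.

    centers, widths: receptive-field centres and length scales (assumed given).
    slopes, offsets: local models y_k(x) = slopes[k] * (x - centers[k]) + offsets[k].
    """
    centers = np.asarray(centers, dtype=float)
    widths = np.asarray(widths, dtype=float)
    # Gaussian activation of each receptive field at the query point
    w = np.exp(-0.5 * ((x - centers) / widths) ** 2)
    # Prediction of each local model, expressed relative to its own centre
    y_local = np.asarray(slopes) * (x - centers) + np.asarray(offsets)
    # Normalised weighted average of the local predictions
    return float(np.sum(w * y_local) / np.sum(w))

# Two hypothetical local models that both represent the line y = 2x + 1
y_hat = blend_local_predictions(0.3, centers=[0.0, 1.0], widths=[0.5, 0.5],
                                slopes=[2.0, 2.0], offsets=[1.0, 3.0])
```

Because every local model is fitted on its own, changing one model only changes the prediction where its receptive field is active, which is the source of the robustness mentioned above.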
Another natural and interesting comparison is with the mixture of experts (ME) model [26, 28]. ME models are global learning systems where the experts compete globally to cover the training data, and they address the bias-variance problem [23] by finding the right number of local experts and also the optimal distance metric for computing the weights in locally weighted regression [3]. In [62], a mixture of Gaussians is defined as the gating network, and both the gating network and the local models are trained with a single-loop analytical version of the expectation-maximization (EM) algorithm. Schaal et al. [51] compare these two algorithms, and conclude that the performance of ME depends on the number of experts and their distributions in the input space. Moreover, ME needs a larger training set than RFWR to provide comparably accurate results. The results of the two algorithms become indistinguishable when we increase the amount of training data and/or lower the noise. However, it is difficult to achieve these conditions since a large enough training set is not always available and the signal-to-noise ratio is not manually defined. Nevertheless, LWR and LWPR require many tuning parameters whose optimal values are highly data dependent. Therefore, we cannot always obtain robust and accurate results due to the highly localized training strategy of these methods, which does not allow the local models to benefit from the other models in their vicinities.

1.3 Combining Local and Global Regression Methods for Dynamics Learning

Fig. 1: Graphical model for a global linear regression problem solved using direct maximum likelihood (ML) estimation without any prior. $y^{(n)}$ is the observed random variable and $w$ is the unknown model parameter. In this model, no prior knowledge is available about the unknown parameter.

Usually, the online performance of global methods depends on how well the dis-
tance metric is learned offline, which is based on how well the training data rep-
resent the input space. There are several concerns with linear models as discussed
above. Therefore, one should develop better methods to tackle these problems.
Local Gaussian regression (LGR) in [38] is a probabilistic alternative to LWR,
which transforms it into a localized inference procedure by employing variational
approximation. This method combines the best of global and local regression
frameworks by combining the well-known Bayesian regression framework with ra-
dial basis function (RBF) features for nonlinear function approximation. This is
a top-down approach that takes advantage of the efficiency of the local regression
and the accuracy of the global Bayesian regression and incorporates local features.
In this method, maximizing the likelihood function is facilitated by the introduction of the hidden targets for every basis function, which act as links that connect observations to the unknown parameters via Bayes' law. In addition, the likelihood function is maximized using the variational expectation maximization (EM) algorithm [43, 27, 57]. The EM algorithm is a popular method that iteratively maximizes the likelihood function without explicitly computing it. The variational EM is an alternative algorithm that approximates the posterior distribution with a factorized function of the hidden variables in the model [57]. The use of variational EM makes LGR a fast learning algorithm. A graphical model representation of this algorithm is shown in Fig. 1.
Another useful characteristic of LGR is optimizing the RBF length scale so
that the minimum number of local models is used for learning. This optimization
reduces the size of the model significantly. In addition, it adopts the idea of Auto-
matic Relevance Determination (ARD) [35, 54, 41] that results in a sparse model
as described later in section 2.2. Both these adjustments result in a compact model
that does not occupy much memory as compared to the space needed to store the
training set.

1.4 Validity of the EM Algorithm

The EM algorithm is a popular iterative method for solving maximum likelihood estimation problems, and its performance guarantees have been shown for the most generic EM algorithms [6, 61]. However, there are very few theorems that prove the general convergence properties of EM algorithms, especially when applied to complex maximum likelihood problems. In problems with bounded or convex loss functions, it is possible to find a generalization bound on the approximations made by the EM algorithm using Rademacher complexity [12, 30, 18, 39]. This is a line of research that links the Bayesian and frequentist approaches together [16]. The idea is that the Bayes approach is powerful in constructing an estimator, whereas the frequentist approach provides a methodology for better evaluation. Therefore, by putting them together, one can come up with a more robust model design. However, these methods are limited to simple classes of problems [47], and are usually not applicable to hierarchical Bayesian models. This lack of applicability is primarily due to the fact that the likelihood function cannot be described using a bounded or convex function.
Moreover, most of the bounds found using the VC dimension or the Rademacher complexity are data-dependent, making them less popular for model (algorithm) evaluation. For instance, [39] establishes data-dependent bounds for mixture algorithms with convex constraints and unbounded log-likelihood function. However, this is rarely the case in building complex models for high-dimensional datasets. Another effort has been to investigate whether the parameters updated by the EM algorithm converge to a stationary point in the log-likelihood function [19, 20]. [19] restate the EM algorithm as a generalized alternating minimization procedure and use an information geometric framework for describing such algorithms and analyzing their convergence properties. They show that the optimal parameters found by the variational EM procedure are a stationary point in the likelihood function if and only if the parameters locally minimize the variational approximation error. This can only happen in two ways. The first way is if the variational error has stationary points in the likelihood function, which is guaranteed only if we know those stationary points prior to estimation. The other way is if the variational error is independent of the parameters, which is not possible if the variational family makes independence assumptions to ensure tractability. Therefore, such models have lower variational errors than those with independence assumptions [19].
Dempster et al. [15] prove a general convergence theorem for the EM algorithm. However, the proof is not constructed based on correct assumptions. [61] and [10] point out flaws in the proof, and present a counterexample for which the convergence of the generalized EM algorithm to the maxima of the likelihood function does not result in a single set of parameters. The convergence of these parameters is, therefore, highly dependent on the initial guess points and the properties of the likelihood function.

1.5 Summary

In this paper, we adopt the LGR model presented in [38], and modify it suitably for batch-wise learning of stochastic dynamics. We term this model the Batch-hierarchical Bayesian linear regression (Batch-HBLR) model. Moreover, we describe all the steps to derive the Bayes update equations for the posterior distributions of the model. We then analyze the convergence of the variational EM algorithm that is at the core of training the model. Subsequently, we evaluate its performance experimentally on three different dynamical systems, including a challenging external force-field actuated micro-robot. Results indicate good performance on all the systems in terms of approximating the dynamics closely with a parsimonious model structure leading to fast computation times during testing. We, therefore, anticipate our Batch-HBLR model to provide a foundation for online model-based reinforcement learning of robot motion control under uncertainty.

2 Hierarchical Bayesian Regression

2.1 Statistical Background

Prior to formulating the regression model, it is useful to provide some statistical background. The multivariate normal (Gaussian) distribution for $x \in \mathbb{R}^d$ and the univariate gamma distribution probability density functions (pdfs) are written in (1) and (3), respectively.
\[
p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right). \tag{1}
\]

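We use the following notation in the rest of the manuscript to refer to a normal distribution, $\mathcal{N}(x; \mu, \Sigma)$, where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix. The log-likelihood for $N$ i.i.d. samples drawn from $\mathcal{N}(x; \mu, \Sigma)$ is:
\[
\begin{aligned}
\log p(X; \mu, \Sigma) &= -\frac{N}{2} \log\left((2\pi)^d |\Sigma|\right) - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu) + \text{const.} \\
&\propto -\frac{N}{2} \log\left((2\pi)^d |\Sigma|\right) - \frac{1}{2} \sum_{n=1}^{N} \mathrm{Trace}\left( \Sigma^{-1} (x_n - \mu)(x_n - \mu)^\top \right) \\
&= -\frac{N}{2} \log\left((2\pi)^d |\Sigma|\right) - \frac{1}{2} \mathrm{Trace}\left( \Sigma^{-1} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^\top \right).
\end{aligned} \tag{2}
\]
Here, $X \in \mathbb{R}^{N \times d}$ is a matrix containing all the samples, and $x_n \in \mathbb{R}^d$ represents an individual sample. Moreover, in this manuscript, we use log for the natural logarithm.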
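As a numerical check of the trace identity used in the Gaussian log-likelihood, the sketch below evaluates it once as a sum of per-sample quadratic forms and once through the scatter matrix. The function names and the random test data are assumptions made for illustration only.

```python
import numpy as np

def gaussian_loglik_sum(X, mu, Sigma):
    """Log-likelihood of N i.i.d. samples as a sum of quadratic forms."""
    N, d = X.shape
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    # sum_n (x_n - mu)^T Sigma^{-1} (x_n - mu)
    quad = np.einsum("ni,ij,nj->", diff, np.linalg.inv(Sigma), diff)
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet) - 0.5 * quad

def gaussian_loglik_trace(X, mu, Sigma):
    """Same quantity written with the trace of Sigma^{-1} times the scatter matrix."""
    N, d = X.shape
    diff = X - mu
    scatter = diff.T @ diff  # sum_n (x_n - mu)(x_n - mu)^T
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * N * (d * np.log(2 * np.pi) + logdet)
            - 0.5 * np.trace(np.linalg.inv(Sigma) @ scatter))
```

The trace form needs only the $d \times d$ scatter matrix, so the samples do not have to be revisited when the likelihood is re-evaluated for a new $\Sigma$.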
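For a random variable $x$ drawn from the gamma distribution ($x \sim \mathcal{G}(\alpha, \beta)$), we use the following pdf,
\[
p(x; \alpha, \beta) = \frac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}. \tag{3}
\]
We use the following notation in the rest of the manuscript to refer to a gamma distribution, $\mathcal{G}(x; \alpha, \beta)$. Using Stirling's formula for the gamma function, we approximate $\log \Gamma(\alpha)$ for $\mathrm{Re}(\alpha) > 0$ with $\alpha \log(\alpha) - \alpha$. Since $\log \Gamma(\alpha)$ enters the log-density with a negative sign, we get the following log-likelihood function for the gamma distribution,
\[
\log p(x; \alpha, \beta) = \alpha \log(\beta) + (\alpha - 1) \log(x) - \beta x + \alpha - \alpha \log(\alpha). \tag{4}
\]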
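To get a feeling for the quality of the leading-order Stirling approximation, the sketch below compares it with the exact log-gamma function from the Python standard library. The function names are hypothetical. The terms dropped by the approximation are of order $\log \alpha$, so the absolute gap does not vanish, while the relative error shrinks as $\alpha$ grows.

```python
import math

def stirling_log_gamma(alpha):
    """Leading-order Stirling approximation: log Gamma(alpha) ~ alpha*log(alpha) - alpha."""
    return alpha * math.log(alpha) - alpha

def log_gamma_pdf(x, alpha, beta, approximate=False):
    """Log-density of the gamma distribution with shape alpha and rate beta."""
    log_norm = stirling_log_gamma(alpha) if approximate else math.lgamma(alpha)
    return alpha * math.log(beta) + (alpha - 1) * math.log(x) - beta * x - log_norm

def relative_error(alpha):
    """Relative error of the approximation with respect to math.lgamma."""
    exact = math.lgamma(alpha)
    return abs(stirling_log_gamma(alpha) - exact) / abs(exact)
```

For example, `relative_error(50.0)` is below one percent, whereas `relative_error(5.0)` is several percent, so the approximation is best suited to large shape parameters.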

At this point, we also want to clarify the difference between the two notations
p(x; θ) and p(x | θ). We use the former when θ is a vector of parameters, and the
probability is a function of θ and we call it the likelihood function. In contrast, the
latter denotes the conditional probability of x when θ is a random variable.

2.2 Bayesian Linear Regression Model with Local Features

In the Bayesian framework, consider the function $h(x) \in \mathbb{R}$ and the variable $x \in \Omega \subseteq \mathbb{R}^d$. We want to predict the function value $y^* = h(x^*)$ at an arbitrary location $x^* \in \Omega$, using a set of $N$ noisy observations, $(X, Y) = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$, where $y^{(n)} = h(x^{(n)}) + \epsilon^{(n)}$. The noise terms $\epsilon^{(n)}$ are independently drawn from a zero-mean Gaussian distribution,
\[
p(\epsilon) = \mathcal{N}(0, \beta_y^{-1} I) \tag{5}
\]
where $\beta_y$ is the precision and $\epsilon = [\epsilon^{(1)}, \dots, \epsilon^{(N)}]^\top$. Therefore, one can assume $(X, Y)$ is randomly sampled from a data generating distribution represented by $\mathcal{D}$, and denote $(X, Y) \sim \mathcal{D}^N$ as the i.i.d. observations of $N$ elements, $x^{(n)} \in \mathbb{R}^d$, and $y^{(n)} \in \mathbb{R}$.
In basic (global) Bayesian regression, the function $h(x)$ is modeled as a linear combination of $P$ basis functions, and if these functions also include a bias term, then $P = d + 1$.
\[
h(x^{(n)}) = w^\top \phi(x^{(n)}) \tag{6}
\]
where $w = [w_1, \dots, w_P]^\top$ are the weights of the linear combination, and $\phi(x^{(n)}) = \phi^{(n)} = (x^{(n)\top}, 1)^\top$. By definition, $y(x^{(n)}) = w^\top \phi(x^{(n)}) + \epsilon^{(n)}$. Therefore, we write the likelihood function in the following form
\[
p(Y; w) = \prod_{n=1}^{N} \mathcal{N}\left( y^{(n)}; w^\top \phi^{(n)}, \beta_y^{-1} \right). \tag{7}
\]

In local regression, we introduce kernel functions, such as the Radial Basis Function (RBF), to emphasize the local observations about the point of interest. This emphasis results in more accurate predictions, while the computational complexity remains the same as the global regression problem. A better methodology is to use a Bayesian linear regression comprising local linear models, with a Gaussian likelihood for the regression parameters ($\tilde{w} = [w_1, w_2, \dots, w_M]$) written as:
\[
p(Y; \tilde{w}) = \prod_{n=1}^{N} \mathcal{N}\left( y^{(n)}; \sum_{m=1}^{M} w_m^\top \phi_m^{(n)}, \beta_y^{-1} \right), \tag{8}
\]
where $Y = [y^{(1)}, \dots, y^{(N)}]^\top \in \mathbb{R}^{N \times 1}$, and $w_m \in \mathbb{R}^{P \times 1}$. $M$ is the dimensionality of the weighted feature vector $\phi_m^{(n)} = \phi_m(x^{(n)}) = \eta_m(x^{(n)}) \xi_m \in \mathbb{R}^P$, or the number of linear models. Here, we build the feature vector with linear bases plus a bias term, and define the $m$th feature vector as $\xi_m = \left( (x - c_m)^\top, 1 \right)^\top$. The weights for these $M$ spatially localized basis functions are defined by the Radial Basis Function (RBF) kernel, whereby the $m$th RBF feature weight is given by
\[
\eta_m(x) = \exp\left( -\frac{1}{2} (x - c_m)^\top \Lambda_m^{-1} (x - c_m) \right). \tag{9}
\]
The RBF function is parameterized by its center $c_m \in \mathbb{R}^d$ and a positive definite matrix $\Lambda_m \in \mathbb{R}^{d \times d}$, both of which are estimated given the observations $O = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$. Here, we consider the diagonal distance matrix $\Lambda_m = \mathrm{diag}\left( \lambda_{m1}^2, \dots, \lambda_{md}^2 \right)$.
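The construction of the localized features can be sketched with array broadcasting. This is only an illustration of the RBF-weighted features defined by (8) and (9): the function names, the array layout, and the choice to store the length scales $\lambda_{mk}$ row-wise per local model are assumptions, not part of the original algorithm description.

```python
import numpy as np

def local_features(X, centers, lengthscales):
    """Return (phi, eta) for inputs X of shape (N, d).

    centers:      (M, d) array of RBF centres c_m.
    lengthscales: (M, d) array of lambda_{mk}, so Lambda_m = diag(lambda_{m1}^2, ...).
    phi[n, m] = eta_m(x_n) * [(x_n - c_m), 1] has length P = d + 1.
    """
    diff = X[:, None, :] - centers[None, :, :]                        # (N, M, d)
    # RBF weight: exp(-0.5 * (x - c)^T Lambda^{-1} (x - c)) for diagonal Lambda
    eta = np.exp(-0.5 * np.sum((diff / lengthscales[None]) ** 2, axis=-1))
    # Centred linear basis plus a bias term
    xi = np.concatenate([diff, np.ones(diff.shape[:2] + (1,))], axis=-1)
    return eta[..., None] * xi, eta

def predict_mean(phi, W):
    """Mean of the likelihood (8): sum over m of w_m^T phi_m^(n); W has shape (M, P)."""
    return np.einsum("nmp,mp->n", phi, W)
```

A query that sits exactly on a centre gets weight one from that model and the feature vector $(0, \dots, 0, 1)^\top$, so the bias weight of a local model is its prediction at its own centre.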

In this methodology, we implicitly assume independence among the local models, as the correlations between the distant models are almost zero (due to the low value of the RBF). However, we have not yet used this assumption to reduce the computational cost of the EM algorithm (mentioned earlier in Section 1 and described later in Section 2.3), which iteratively solves the maximum likelihood estimation problem for the model parameters. To do so, we introduce a hidden variable $f_m^{(n)}$ for every local model, and infer the parameters of each local model independently. Fig. 2 is a graphical model that illustrates how the hidden variables participate in both the global and the local loops. Therefore, the complete likelihood of this graphical model is
\[
p(Y, f; \theta) = \prod_{n=1}^{N} \left[ \mathcal{N}\left( y^{(n)}; \mathbf{1}^\top \tilde{f}^{(n)}, \beta_y^{-1} \right) \prod_{m=1}^{M} \mathcal{N}\left( f_m^{(n)}; w_m^\top \phi_m^{(n)}, \beta_{f_m}^{-1} \right) \right]. \tag{10}
\]
Here, $\tilde{f}^{(n)} = [f_1^{(n)}, f_2^{(n)}, \dots, f_M^{(n)}]^\top \in \mathbb{R}^M$, so $f = [\tilde{f}^{(1)}, \tilde{f}^{(2)}, \dots, \tilde{f}^{(N)}]^\top \in \mathbb{R}^{N \times M}$, and $Y \in \mathbb{R}^N$. Later, in Algorithm 1, we use $f_m$ to denote the $m$th column of $f$.

Fig. 2: Graphical model illustrating the algorithm (adapted from [38]). The random variables inside the dark circles are observed. The variables inside the white circles are unknown. The variables represented with small black circles are the model parameters. This is a model with a hierarchical prior, since it includes priors on the parameters of the priors. That is why we use a variational EM algorithm to solve the maximum likelihood estimation problem efficiently. The priors on the precision parameters are stationary, while those on the model weights are non-stationary.

We place Gaussian priors on the regression parameters $w_m$ (11) and gamma priors on the precision parameters $\{\beta_{f_m}, \alpha_m\}$. It is worth mentioning that $A_m$ is a diagonal matrix with the precision parameters along the main diagonal, $A_m = \mathrm{diag}\left( \alpha_m^{(1)}, \alpha_m^{(2)}, \dots, \alpha_m^{(P)} \right)$. When we write $\alpha_m = \left[ \alpha_m^{(1)}, \dots, \alpha_m^{(P)} \right]^\top$, we refer to the vector that contains all the precision parameters of the local model $m$. Due to this diagonal structure of $A_m$, the model learns the precision parameters for each feature independently. This property is very useful when adopting the idea of automatic relevance determination (ARD) to automatically select the significant parameters by comparing the precision with a threshold value [35, 54, 41]. ARD is an elegant method for sparsifying a learning model with many features. The prior on the weights (11) is a normal distribution; since it is a conjugate prior, the posterior is also a normal distribution.
\[
p(\tilde{w}; \alpha) = \prod_{m=1}^{M} \mathcal{N}(w_m; 0, A_m^{-1}). \tag{11}
\]

After training, usually, some of these precision parameters converge to large numbers, which indicates that the corresponding feature does not play a significant role in representing the data. Thus, it would be reasonable to ignore those features in order to reduce the dimensionality of the regression model. This technique is called pruning, which largely alleviates the problem arising from the curse of dimensionality.
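A pruning rule of this kind might be sketched as follows. The function name and the threshold value are hypothetical; the text only states that the learned precisions are compared with a threshold, so the specific number below is a placeholder that would need tuning.

```python
import numpy as np

def prune_by_precision(alpha_hat, mu_w, threshold=1e3):
    """Zero out weights whose expected ARD precision reached the threshold.

    alpha_hat: (M, P) expected precisions of the weights of M local models.
    mu_w:      (M, P) posterior means of the corresponding weights.
    threshold: hypothetical cut-off; a large precision pins a weight near zero.
    """
    keep = alpha_hat < threshold
    return np.where(keep, mu_w, 0.0), keep

# Illustrative values only
alpha_hat = np.array([[0.5, 2e4], [10.0, 1.0]])
mu_w = np.array([[1.0, 0.3], [0.2, -0.4]])
pruned_w, keep = prune_by_precision(alpha_hat, mu_w)
```

A local model whose weights are all pruned no longer contributes to the prediction and could be dropped altogether.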
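The joint prior distribution of all the model parameters is:
\[
\begin{aligned}
p(\tilde{w}, \beta_f, A_m^{-1}) &= \prod_{m=1}^{M} p(w_m \mid \alpha_m)\, p(\beta_{f_m}) \prod_{m=1}^{M} p(\alpha_m) \\
&= \prod_{m=1}^{M} \mathcal{N}(w_m; 0, A_m^{-1})\, \mathcal{G}(\beta_{f_m}; a_0^{\beta}, b_0^{\beta}) \prod_{m=1}^{M} \prod_{p=1}^{P} \mathcal{G}\left( \alpha_m^{(p)}; a_0^{\alpha^{(p)}}, b_0^{\alpha^{(p)}} \right).
\end{aligned} \tag{12}
\]
To summarize, the model has hidden variables $\left\{ w_m, \{f_m^{(n)}\}_{n=1}^{N} \right\}_{m=1}^{M}$ and parameters $\theta = \left\{ \beta_y, \{\beta_{f_m}, \alpha_m, \lambda_m\}_{m=1}^{M} \right\}$.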

2.3 Variational Inference of Hidden Variables and Model Parameters

We now want to infer the posterior probability $p(\theta \mid Y)$. One of the most popular methods to do so is maximum likelihood (ML). According to this approach, the ML estimate is obtained as
\[
\theta_{\mathrm{ML}} = \operatorname*{argmax}_{\theta}\; p(Y; \theta). \tag{13}
\]
As mentioned earlier, we use the EM algorithm to solve the ML estimation problem, and follow the exposition in [42] and [8]. More specifically, we employ the variational EM framework as described below.
We rewrite the log-likelihood function as
\[
\log p(Y; \theta) = F(q, \theta) + \mathrm{KL}(q \,\|\, p), \tag{14}
\]
with
\[
F(q, \theta) = \int q(z) \log\left( \frac{p(Y, z; \theta)}{q(z)} \right) dz, \tag{15}
\]
and
\[
\mathrm{KL}(q \,\|\, p) = -\int q(z) \log\left( \frac{p(z \mid Y; \theta)}{q(z)} \right) dz, \tag{16}
\]
where $q(z)$ is an arbitrary probability function for the hidden variable $z$, and $\mathrm{KL}(q \,\|\, p)$ is the Kullback-Leibler (KL) divergence between $p(z \mid Y; \theta)$ and $q(z)$. Since $\mathrm{KL}(q \,\|\, p) \geq 0$, $\log p(Y; \theta) \geq F(q, \theta)$. Hence, $F(q, \theta)$ is a lower bound of the log-likelihood function. The EM algorithm maximizes the lower bound using a two-step iterative procedure. Assume that the current value of the parameters is $\theta_{old}$. The E-step maximizes the lower bound with respect to $q(z)$, which happens when $q(z) = p(z \mid Y; \theta_{old})$. In this case, the lower bound is equal to the log-likelihood because $\mathrm{KL}(q \,\|\, p) = 0$. In the M-step, we keep $q(z)$ constant, and maximize the lower bound with respect to the parameters to find a new value $\theta_{new}$.
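The decomposition in (14) can be verified numerically on a toy problem with a discrete hidden variable, where the integrals become sums. The joint values below are hypothetical numbers chosen for illustration; they stand in for $p(Y, z; \theta)$ at a fixed $Y$ and $\theta$.

```python
import numpy as np

# Hypothetical values of p(Y, z; theta) for z = 0, 1, 2 at fixed Y and theta
joint = np.array([0.10, 0.25, 0.05])
evidence = joint.sum()            # p(Y; theta)
posterior = joint / evidence      # p(z | Y; theta)

def lower_bound(q):
    """F(q, theta) = sum_z q(z) log(p(Y, z; theta) / q(z))."""
    return float(np.sum(q * np.log(joint / q)))

def kl_to_posterior(q):
    """KL(q || p) = -sum_z q(z) log(p(z | Y; theta) / q(z))."""
    return float(-np.sum(q * np.log(posterior / q)))

q = np.array([0.5, 0.3, 0.2])     # an arbitrary distribution over z
gap = np.log(evidence) - lower_bound(q)
```

Here `gap` equals `kl_to_posterior(q)`, and choosing `q = posterior` closes the gap, which is exactly what the E-step does.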

Now, if we substitute $q(z) = p(z \mid Y; \theta_{old})$ in the lower bound and expand (15), we get
\[
\begin{aligned}
F(q, \theta) &= \int p(z \mid Y; \theta_{old}) \log p(Y, z; \theta)\, dz - \int p(z \mid Y; \theta_{old}) \log p(z \mid Y; \theta_{old})\, dz \\
&= Q(\theta, \theta_{old}) + \text{constant},
\end{aligned} \tag{17}
\]
where the second term on the right-hand side is the entropy of $p(z \mid Y; \theta_{old})$ and does not depend on $\theta$. The first term, which is usually named $Q(\theta, \theta_{old})$, is the expectation of the log-likelihood of the complete data ($E_{p(z \mid Y; \theta_{old})}[\log p(Y, z; \theta)]$). $Q(\theta, \theta_{old})$ is commonly maximized in the M-step of the EM algorithm in the signal processing literature. To summarize, the two steps of the EM algorithm are,
\[
\text{E-step: Compute } p(z \mid Y; \theta_{old}) \tag{18}
\]
\[
\text{M-step: Find } \theta_{new} = \operatorname*{argmax}_{\theta}\; Q(\theta, \theta_{old}). \tag{19}
\]

In the variational EM algorithm, we approximate the posterior $p(z \mid Y; \theta_{old})$ with a factorized function. In this approximation, the hidden variable $z$ is partitioned into $L$ partitions $z_l$ with $l = 1, \dots, L$. $q(z)$ is also factorized in the same way, $q(z) = \prod_{l=1}^{L} q_l(z_l)$. Given this assumption, the lower bound can be formulated as
\[
F(q, \theta) = -\mathrm{KL}(q_j \,\|\, \tilde{p}) - \sum_{i \neq j} \int q_i \log q_i \, dz_i \tag{20}
\]
where $q_j = q(z_j)$, and
\[
\log \tilde{p}(Y, z_j; \theta) = E_{i \neq j}\left[ \log p(Y, z; \theta) \right] = \int \log p(Y, z; \theta) \prod_{i \neq j} q_i \, dz_i. \tag{21}
\]
The bound in (20) is maximized when the KL distance becomes zero, which is the case for $q_j(z_j) = \tilde{p}(Y, z_j; \theta)$, up to normalization. In other words, the optimal distribution is obtained from the following equation,
\[
\log q_j^{*}(z_j) = E_{i \neq j}\left[ \log p(Y, z; \theta) \right] + \text{constant} \quad \forall j = 1, \dots, L. \tag{22}
\]

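To summarize, the two steps of the variational EM algorithm are given by,
\[
\text{E-step: Find } q_{new}(z) = \operatorname*{argmax}_{q(z)}\; F(q, \theta_{old}) \tag{23}
\]
\[
\text{M-step: Find } \theta_{new} = \operatorname*{argmax}_{\theta}\; F(q_{new}(z), \theta) = \operatorname*{argmax}_{\theta}\; Q(\theta, \theta_{old}). \tag{24}
\]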

These equations are the set of consistency conditions for the maximum of the lower bound for the factorized approximation of the posterior. They do not give us an explicit solution as they also depend on other factors. Hence, we need to iterate through all the factors and replace each of them one-by-one with its revised version. We now derive the updates for every factor of the hidden variables by considering the factorized posterior distribution,
\[
q(f, \tilde{w}, \tilde{\beta}_f, \tilde{\alpha}) = q(f)\, q(\tilde{w})\, q(\tilde{\beta}_f)\, q(\tilde{\alpha}), \tag{25}
\]
where $\tilde{\beta}_f = [\beta_{f_1}, \dots, \beta_{f_M}]$ and $\tilde{\alpha} = [\alpha_1, \dots, \alpha_M]$.
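The one-factor-at-a-time iteration can be illustrated on a standard toy problem that is not part of this model: a factorized approximation $q_1(z_1) q_2(z_2)$ to a correlated bivariate Gaussian with mean `mu` and precision matrix `Lam`. For this target, each optimal factor is Gaussian with variance $1/\Lambda_{jj}$ and a mean that depends on the current mean of the other factor, so the updates have to be cycled until they stop changing. All numbers below are assumptions chosen for illustration.

```python
import numpy as np

mu = np.array([1.0, -2.0])                 # mean of the toy target (assumed)
Lam = np.array([[2.0, 0.8],
                [0.8, 1.0]])               # precision matrix of the toy target (assumed)

m = np.zeros(2)                            # current means of q_1(z_1) and q_2(z_2)
for _ in range(50):
    # Revise q_1 with q_2 held fixed, then revise q_2 using the new q_1
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]

factor_variances = 1.0 / np.diag(Lam)      # variance of each optimal factor
```

The factor means converge to the true mean, while the factor variances ignore the correlation between $z_1$ and $z_2$, which is the price paid for the independence assumption.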


Note that assuming factorization between f and w̃ automatically results in a
factorized posterior distribution over all fm and wm , and every element of αm . To
approximate the natural logarithm of the posterior distribution for the factors, we
take the expected value of q(f , w, βf , α) over all the variables except the ones that
we are solving for. In the resulting expression, we only keep the terms containing
the variable of interest. By doing so, the variational Bayes update equations for the
posterior mean and covariance of the parameters of each linear model are found
in Equations (26) through (43).

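\[
\begin{aligned}
\log q(w_m) &= E_{f_m, \beta_{f_m}, \alpha_m}\left[ \sum_{n=1}^{N} \log \mathcal{N}\left( f_m^{(n)}; w_m^\top \phi_m^{(n)}, \beta_{f_m}^{-1} \right) + \log \mathcal{N}\left( w_m; 0, A_m^{-1} \right) \right] \\
&= E_{f_m, \beta_{f_m}, \alpha_m}\left[ \sum_{n=1}^{N} \left( \frac{1}{2} \log \beta_{f_m} - \frac{\beta_{f_m}}{2} \left( f_m^{(n)} - w_m^\top \phi_m^{(n)} \right)^\top \left( f_m^{(n)} - w_m^\top \phi_m^{(n)} \right) \right) + \frac{1}{2} \log |A_m| - \frac{1}{2} w_m^\top A_m w_m \right] + \text{const.} \\
&= E_{f_m, \beta_{f_m}, \alpha_m}\left[ -\frac{\beta_{f_m}}{2} \sum_{n=1}^{N} \left( \left( f_m^{(n)} \right)^2 + \phi_m^{(n)\top} w_m w_m^\top \phi_m^{(n)} - 2 f_m^{(n)} w_m^\top \phi_m^{(n)} \right) - \frac{1}{2} w_m^\top A_m w_m \right] + \text{const.} \\
&= -\frac{1}{2} w_m^\top \left( E[A_m] + E[\beta_{f_m}] \sum_{n=1}^{N} \phi_m^{(n)} \phi_m^{(n)\top} \right) w_m + w_m^\top E[\beta_{f_m}] \sum_{n=1}^{N} \phi_m^{(n)} E[f_m^{(n)}] + \text{const.} \\
&= \log \mathcal{N}(w_m; \mu_{w_m}, \Sigma_{w_m}).
\end{aligned} \tag{26}
\]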
In the rest of the manuscript, we use ( ˆ ) symbol to refer to the first moment of the
approximation of the parameters. For instance, β̂fm = E[βfm ] and Âm = E[Am ].
(n) (n)
Moreover, E[fm ] = µ> wm φm .
The posterior distribution of wm is a Normal distribution; therefore, log q(wm )
is a quadratic function of wm , which we refer as J(wm ). From (2), we know that
the negative inverse of the covariance matrix is equal to the second derivative of
J(wm ) with respect to wm . The derivatives of the right hand side of (11) are:
N
! N
∂J(wm ) X (n) (n) >
X
= − Âm + β̂fm φm φm wm + β̂fm φ(n) (n)
m E[fm ] (27)
∂wm n=1 n=1

N
!
∂ 2 J(wm ) X (n) >
2
= − Â m + β̂ f m
φ(n)
m φm . (28)
∂wm n=1

Moreover, $\partial J(w_m) / \partial w_m = 0$ at the mean; hence, by setting (27) equal to zero and solving for $w_m$, we get the mean of the posterior:
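\[
\Sigma_{w_m} = \left(\hat{A}_m + \hat{\beta}_{f_m} \sum_{n=1}^{N} \phi_m^{(n)} \phi_m^{(n)\top}\right)^{-1}
\tag{29}
\]

\[
\mu_{w_m} = \hat{\beta}_{f_m}\, \Sigma_{w_m} \sum_{n=1}^{N} \phi_m^{(n)}\, \mathbb{E}[f_m^{(n)}].
\tag{30}
\]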
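Equations (29) and (30) can be checked numerically on a small case. The sketch below is only an illustration, not the authors' implementation: it assumes a local model with two parameters (p = 2), and the function names `weight_posterior` and `gradient_J` are hypothetical. It builds the precision matrix, inverts it in closed form, and confirms that the gradient (27) vanishes at the posterior mean.

```python
def _precision_and_rhs(phis, targets, alpha_hat, beta_hat):
    """Return the 2x2 precision matrix entries and the linear term of J(w_m)."""
    p00 = alpha_hat[0] + beta_hat * sum(ph[0] * ph[0] for ph in phis)
    p01 = beta_hat * sum(ph[0] * ph[1] for ph in phis)
    p11 = alpha_hat[1] + beta_hat * sum(ph[1] * ph[1] for ph in phis)
    r0 = beta_hat * sum(ph[0] * f for ph, f in zip(phis, targets))
    r1 = beta_hat * sum(ph[1] * f for ph, f in zip(phis, targets))
    return (p00, p01, p11), (r0, r1)


def weight_posterior(phis, targets, alpha_hat, beta_hat):
    """Posterior mean and covariance of w_m for p = 2, following (29)-(30)."""
    (p00, p01, p11), (r0, r1) = _precision_and_rhs(phis, targets, alpha_hat, beta_hat)
    det = p00 * p11 - p01 * p01
    # Closed-form inverse of the symmetric 2x2 precision matrix, Eq. (29).
    sigma = [[p11 / det, -p01 / det], [-p01 / det, p00 / det]]
    # Posterior mean, Eq. (30); beta_hat is already folded into r0 and r1.
    mu = [sigma[0][0] * r0 + sigma[0][1] * r1,
          sigma[1][0] * r0 + sigma[1][1] * r1]
    return mu, sigma


def gradient_J(w, phis, targets, alpha_hat, beta_hat):
    """First derivative of J(w_m), Eq. (27), evaluated at w."""
    (p00, p01, p11), (r0, r1) = _precision_and_rhs(phis, targets, alpha_hat, beta_hat)
    return [-(p00 * w[0] + p01 * w[1]) + r0,
            -(p01 * w[0] + p11 * w[1]) + r1]


# Usage: targets generated by w = [2, 1]; a weak prior barely shrinks the mean.
phis = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
targets = [2.0 * ph[0] + 1.0 * ph[1] for ph in phis]
mu, sigma = weight_posterior(phis, targets, alpha_hat=[1e-6, 1e-6], beta_hat=1.0)
```

With a nearly flat prior the posterior mean approaches the least-squares solution, which is the behavior one would expect from (30).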
Bayesian Local Linear Regression 13

$\Sigma_{w_m} \in \mathbb{R}^{P \times P}$ is a diagonal matrix, and we refer to the elements on its main diagonal by $\sigma_{w_m}^{(p)}$, where p ∈ {1, . . . , P}.
Similarly, we find the variational Bayes update for every element of the precision vector $\alpha_m$.
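\[
\begin{aligned}
\log q(\alpha_m^{(p)}) &= \mathbb{E}_{w_m}\left[\log \mathcal{N}\left(w_m^{(p)};\, 0,\, \left(\alpha_m^{(p)}\right)^{-1}\right) + \log \mathcal{G}\left(\alpha_m^{(p)} \mid a_0^{\alpha(p)},\, b_0^{\alpha(p)}\right)\right] \\
&= \mathbb{E}_{w_m}\left[\log \left(\alpha_m^{(p)}\right)^{1/2} - \frac{\alpha_m^{(p)}}{2}\left(w_m^{(p)}\right)^2 + \left(a_0^{\alpha(p)} - 1\right)\log \alpha_m^{(p)} - b_0^{\alpha(p)}\, \alpha_m^{(p)}\right] + \text{const} \\
&= \log \mathcal{G}\left(\alpha_m^{(p)} \mid a_{Nm}^{\alpha(p)},\, b_{Nm}^{\alpha(p)}\right).
\end{aligned}
\tag{31}
\]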
Hence,

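\[
a_{Nm}^{\alpha(p)} = a_0^{\alpha(p)} + \frac{1}{2}
\tag{32}
\]

\[
b_{Nm}^{\alpha(p)} = b_0^{\alpha(p)} + \frac{1}{2}\, \mathbb{E}\left[\left(w_m^{(p)}\right)^2\right] = b_0^{\alpha(p)} + \frac{1}{2}\left(\left(\mu_{w_m}^{(p)}\right)^2 + \sigma_{w_m}^{(p)}\right).
\tag{33}
\]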

We observe that $a_{Nm}^{\alpha(p)}$ is the same for all the models (∀m) and every individual element (p) of the precision vector $\alpha_m$. Therefore, we use $a_N^{\alpha}$ instead of $a_{Nm}^{\alpha(p)}$ later in Algorithm 1.
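The updates (32) and (33) reduce to a few arithmetic operations per weight. The following sketch assumes the posterior mean and the diagonal of the posterior covariance are passed as plain lists; the function name `update_alpha` is hypothetical and not taken from the paper.

```python
def update_alpha(mu_w, sigma_w_diag, a0, b0):
    """Gamma posterior parameters and expected precision for each weight.

    mu_w: posterior means of the weights of one local model
    sigma_w_diag: diagonal elements of the posterior covariance
    a0, b0: shape and rate of the Gamma prior (shared by all elements here)
    """
    a_n = a0 + 0.5                                          # Eq. (32)
    b_n = [b0 + 0.5 * (m * m + s)                           # Eq. (33)
           for m, s in zip(mu_w, sigma_w_diag)]
    alpha_hat = [a_n / b for b in b_n]                      # mean of the Gamma posterior
    return a_n, b_n, alpha_hat


# Usage: a weight with larger second moment receives a smaller expected precision.
a_n, b_n, alpha_hat = update_alpha([1.0, 0.0], [1.0, 0.0], a0=1.0, b0=1.0)
```

A weight whose posterior mass sits near zero ends up with a large expected precision, which is the mechanism that later allows features to be pruned.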
Similarly, we derive the variational Bayes update for the posterior of the pre-
cision parameter βf by computing
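\[
\begin{aligned}
\log q(\beta_{f_m}) &= \mathbb{E}_{w_m,f_m}\left[\sum_{n=1}^{N} \log \mathcal{N}\left(f_m^{(n)};\, w_m^\top \phi_m^{(n)},\, \beta_{f_m}^{-1}\right) + \log \mathcal{G}\left(\beta_{f_m} \mid a_0^{\beta},\, b_0^{\beta}\right)\right] \\
&= \mathbb{E}_{w_m,f_m}\left[\log \left(\beta_{f_m}\right)^{N/2} - \frac{\beta_{f_m}}{2}\sum_{n=1}^{N}\left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^{\top}\left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right) + \left(a_0^{\beta} - 1\right)\log \beta_{f_m} - b_0^{\beta}\, \beta_{f_m}\right] + \text{const} \\
&= \log \mathcal{G}\left(\beta_{f_m} \mid a_{Nm}^{\beta},\, b_{Nm}^{\beta}\right).
\end{aligned}
\tag{34}
\]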
The updates are

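\[
a_{Nm}^{\beta} = a_0^{\beta} + \frac{N}{2}
\tag{35}
\]

\[
\begin{aligned}
b_{Nm}^{\beta} &= b_0^{\beta} + \frac{1}{2}\sum_{n=1}^{N} \mathbb{E}\left[\left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^{\top}\left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)\right] \\
&= b_0^{\beta} + \frac{1}{2}\sum_{n=1}^{N}\left[\left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right)^{\top}\left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right) + \sigma_{f_m} + \mathrm{Trace}\left(\phi_m^{(n)\top}\, \Sigma_{w_m}\, \phi_m^{(n)}\right)\right].
\end{aligned}
\tag{36}
\]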

In (36), $\sigma_{f_m}$ is the approximate variance of the mth local model. Again, since $a_{Nm}^{\beta}$ is identical for all the local models, we use $a_N^{\beta}$ instead.
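The update (35)-(36) accumulates, for every sample, the squared residual plus the two variance terms. A minimal sketch follows, assuming list-based vectors and a full covariance matrix given as nested lists; `update_beta` and `_dot` are hypothetical names introduced for illustration.

```python
def _dot(u, v):
    """Inner product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


def update_beta(mu_f, phis, mu_w, sigma_w, sigma_f, a0, b0):
    """Gamma posterior parameters and expected precision of one local model.

    mu_f: posterior means of the hidden targets f_m^(n), one per sample
    phis: feature vectors phi_m^(n), one per sample
    mu_w, sigma_w: posterior mean and covariance of the weights
    sigma_f: approximate variance of the local model target
    """
    a_n = a0 + len(mu_f) / 2.0                              # Eq. (35)
    b_n = b0
    for f, phi in zip(mu_f, phis):
        residual = f - _dot(mu_w, phi)
        # phi^T Sigma_w phi is a scalar, so its trace is the value itself.
        weight_var = _dot(phi, [_dot(row, phi) for row in sigma_w])
        b_n += 0.5 * (residual * residual + sigma_f + weight_var)  # Eq. (36)
    return a_n, b_n, a_n / b_n


# Usage with two samples and an identity weight covariance.
a_n, b_n, beta_hat = update_beta(
    mu_f=[2.0, 1.0], phis=[[1.0, 0.0], [0.0, 1.0]], mu_w=[1.0, 1.0],
    sigma_w=[[1.0, 0.0], [0.0, 1.0]], sigma_f=0.5, a0=1.0, b0=1.0)
```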
Finally, the variational Bayes update equations for the posterior mean and covariance of each local model target $f_m = [f_m^{(1)}, \ldots, f_m^{(N)}]$ are found through the following steps:
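\[
\begin{aligned}
\log q(\tilde{f}^{(n)}) &= \mathbb{E}_{\tilde{w},\tilde{\beta}_f}\left[\log \mathcal{N}\left(y^{(n)} \mid 1^\top \tilde{f}^{(n)},\, \beta_y^{-1}\right) + \log \mathcal{N}\left(\tilde{f}^{(n)} \mid \tilde{F}^{(n)},\, B^{-1}\right)\right] \\
&= \mathbb{E}_{\tilde{w},\tilde{\beta}_f}\left[-\frac{\beta_y}{2}\left(y^{(n)} - 1^\top \tilde{f}^{(n)}\right)^{\top}\left(y^{(n)} - 1^\top \tilde{f}^{(n)}\right) + \frac{1}{2}\log |B| - \frac{1}{2}\left(\tilde{f}^{(n)} - \tilde{F}^{(n)}\right)^{\top} B \left(\tilde{f}^{(n)} - \tilde{F}^{(n)}\right)\right] + \text{const} \\
&= \log \mathcal{N}\left(\tilde{f}^{(n)} \mid \mu_{\tilde{f}^{(n)}},\, \Sigma_{\tilde{f}^{(n)}}\right),
\end{aligned}
\tag{37}
\]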
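where $\tilde{F}^{(n)} = \left[w_1^\top \phi_1^{(n)},\, w_2^\top \phi_2^{(n)},\, \ldots,\, w_M^\top \phi_M^{(n)}\right]^\top$, $B = \mathrm{diag}(\beta_{f_1}, \ldots, \beta_{f_M})$, $\hat{\beta}_{f_m} = \mathbb{E}_{\beta_{f_m}}[\beta_{f_m}] = a_N^{\beta} / b_{N,m}^{\beta}$, and $\phi_m^{(n)} \in \mathbb{R}^p$.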

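\[
\mathcal{N}\left(\tilde{f}^{(n)} \mid \mu_{\tilde{f}^{(n)}},\, \Sigma_{\tilde{f}^{(n)}}\right) = \mathcal{N}\left(y^{(n)} \mid 1^\top \tilde{f}^{(n)},\, \beta_y^{-1}\right) \mathcal{N}\left(\tilde{f}^{(n)} \mid \tilde{F}^{(n)},\, B^{-1}\right).
\tag{38}
\]

We re-write the right-hand side of (38) as an exponential function, $\exp(-J(\tilde{f}^{(n)}))$, where $J(\tilde{f}^{(n)})$ is a quadratic function of $\tilde{f}^{(n)}$ and is defined as

\[
J(\tilde{f}^{(n)}) = \mathbb{E}_{\tilde{w},\tilde{\beta}_f}\left[\frac{\beta_y}{2}\left(1^\top \tilde{f}^{(n)} - y^{(n)}\right)^{\top}\left(1^\top \tilde{f}^{(n)} - y^{(n)}\right) + \frac{1}{2}\left(\tilde{f}^{(n)} - \tilde{F}^{(n)}\right)^{\top} B \left(\tilde{f}^{(n)} - \tilde{F}^{(n)}\right)\right].
\tag{39}
\]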
To find the parameters of N (f̃ (n) | µf̃ (n) , Σf̃ (n) ), we compute the first and second
derivatives of J(f̃ (n) ):
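\[
\frac{\partial J(\tilde{f}^{(n)})}{\partial \tilde{f}^{(n)}} = \beta_y 1 \left(1^\top \tilde{f}^{(n)} - y^{(n)}\right) + \hat{B}\left(\tilde{f}^{(n)} - \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]\right)
\tag{40}
\]

\[
\frac{\partial^2 J(\tilde{f}^{(n)})}{\partial \tilde{f}^{(n)\,2}} = \beta_y 1 1^\top + \hat{B}.
\tag{41}
\]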
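Hence, the covariance matrix $\Sigma_{\tilde{f}^{(n)}}$ is $\left(\beta_y 1 1^\top + \hat{B}\right)^{-1}$. Using the Sherman-Morrison formula, we reformulate the covariance matrix as

\[
\Sigma_{\tilde{f}^{(n)}} = \hat{B}^{-1} - \frac{\hat{B}^{-1} 1 1^\top \hat{B}^{-1}}{\beta_y^{-1} + 1^\top \hat{B}^{-1} 1}.
\tag{42}
\]

Let us suppose $s = \beta_y^{-1} + 1^\top \hat{B}^{-1} 1$. Then, the diagonal elements of $\Sigma_{\tilde{f}^{(n)}}$ are $\hat{\beta}_{f_m}^{-1} - \left(\hat{\beta}_{f_m}^{-1}\right)^2 / s$. Note that the covariance matrix of the local models does not depend on the individual samples.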
The minimum of J(f̃ (n) ) is attained when the first derivative is zero. Setting
(40) to zero and solving for f̃ (n) gives us
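\[
\begin{aligned}
\mu_{\tilde{f}^{(n)}} &= \left(\beta_y 1 1^\top + \hat{B}\right)^{-1}\left(\beta_y y^{(n)} 1 + \hat{B}\, \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]\right) \\
&= \left(\hat{B}^{-1} - \frac{\hat{B}^{-1} 1 1^\top \hat{B}^{-1}}{\beta_y^{-1} + 1^\top \hat{B}^{-1} 1}\right)\left(\beta_y y^{(n)} 1 + \hat{B}\, \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]\right) \\
&= \Sigma_{\tilde{f}^{(n)}}\left(\beta_y y^{(n)} 1 + \hat{B}\, \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]\right),
\end{aligned}
\tag{43}
\]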
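where $\mathbb{E}_{\{\beta_{f_m}\}\,\forall m}[B]$ is written as $\hat{B}$. Expanding the product in (43), the update for the individual target values is $\mu_{f_m^{(n)}} = \mathbb{E}[f_m^{(n)}] + \frac{\hat{\beta}_{f_m}^{-1}}{s}\left(y^{(n)} - Y_{pre}^{(n)}\right)$, where $Y_{pre}^{(n)} = 1^\top \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]$ is the sum of the local model predictions for the nth sample.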
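The Sherman-Morrison form (42) and the resulting mean can be verified numerically without any matrix inversion routine. The sketch below is an illustration under stated assumptions: `target_posterior` and `precision_matrix` are hypothetical helper names, the inputs are plain lists, and the mean is computed by expanding (43) with (42), which gives a per-model correction of $\hat{\beta}_{f_m}^{-1}/s$ times the residual of the summed prediction.

```python
def target_posterior(y, f_pred, beta_f_hat, beta_y):
    """Posterior mean and covariance of the hidden targets of one sample.

    y: observed response y^(n)
    f_pred: expected local predictions E[w_m]^T phi_m^(n), one per model
    beta_f_hat: expected precisions of the local models (diagonal of B_hat)
    beta_y: precision of the observation noise
    """
    inv_b = [1.0 / b for b in beta_f_hat]
    s = 1.0 / beta_y + sum(inv_b)
    m = len(inv_b)
    # Eq. (42): B_hat^-1 minus a rank-one correction.
    cov = [[(inv_b[i] if i == j else 0.0) - inv_b[i] * inv_b[j] / s
            for j in range(m)] for i in range(m)]
    # Expanding (43): each model absorbs a share of the joint residual.
    residual = y - sum(f_pred)
    mu = [f + ib / s * residual for f, ib in zip(f_pred, inv_b)]
    return mu, cov


def precision_matrix(beta_f_hat, beta_y):
    """Direct form beta_y * 1 1^T + B_hat from Eq. (41)."""
    m = len(beta_f_hat)
    return [[beta_y + (beta_f_hat[i] if i == j else 0.0) for j in range(m)]
            for i in range(m)]


# Usage: three local models and one observation.
mu, cov = target_posterior(3.0, [1.0, 0.5, 0.2], [2.0, 4.0, 5.0], beta_y=4.0)
```

Multiplying `precision_matrix(...)` by `cov` should return the identity up to rounding, which is a quick way to confirm that (42) really is the inverse of (41).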

2.4 Length Scale Optimization

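To optimize the length scale λ = [λ1 , . . . , λM ], we maximize the expected complete log likelihood under the variational bound

\[
\tilde{\lambda}^{opt} = \arg\max_{\lambda}\; \mathbb{E}_{f,\tilde{w},\tilde{\beta}_f,\tilde{\alpha}} \log p\left(Y, f, \tilde{w}, \alpha, \tilde{\beta}_f \mid \Phi, \lambda\right).
\tag{44}
\]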

Here, by Φ, we refer to all the feature vectors $\phi_m^{(n)}$ for every n and m. (44) factorizes into independent maximization problems for each local model

\[
\lambda_m^{opt} = \arg\max_{\lambda_m}\; \mathbb{E}_{w_m,f_m,\beta_{f_m}} \sum_{n=1}^{N} \log \mathcal{N}\left(f_m^{(n)} \mid w_m^\top \phi_m^{(n)},\, \beta_{f_m}^{-1}\right) = \arg\max_{\lambda_m} K(\lambda_m),
\tag{45}
\]
which are optimized via gradient ascent. Simplifying (45), we have
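\[
\begin{aligned}
\lambda_m^{opt} &= \arg\max_{\lambda_m}\; \mathbb{E}_{w_m,f_m}\left[\log \left(\beta_{f_m}\right)^{N/2} - \frac{\beta_{f_m}}{2}\sum_{n=1}^{N}\left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^{\top}\left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)\right] \\
&= \arg\max_{\lambda_m}\; \mathbb{E}_{w_m,f_m}\left[-\frac{\beta_{f_m}}{2}\sum_{n=1}^{N}\left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^2\right] \\
&= \arg\max_{\lambda_m}\left[-\frac{\beta_{f_m}}{2}\sum_{n=1}^{N}\left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right)^2\right].
\end{aligned}
\tag{46}
\]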
We evaluate the gradient at a single point each time and update the variables with
the approximation of the gradient based on that single data point.
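\[
\begin{aligned}
\nabla_{\lambda_m} K &= \nabla_{\lambda_m} \sum_{n=1}^{N} -\frac{\beta_{f_m}}{2}\left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right)^2 \\
&= -\sum_{n=1}^{N} \frac{\beta_{f_m}}{2}\left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right) \eta_m^{(n)}\, \xi_m^{(n)\top} \mu_{w_m}\, \mathrm{power}\left(x^{(n)} - c_m,\, 2\right)^{\top} \Lambda^{-2},
\end{aligned}
\tag{47}
\]

where power(., 2) indicates the element-wise power operator. Also, $\nabla_{\lambda_m} K \in \mathbb{R}^d$, and the update for $\lambda_m$ is $\lambda_m^{new} = \lambda_m^{old} + \kappa \nabla_{\lambda_m} K$, where κ is the learning rate.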
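The ascent update on the length scale can be illustrated on a one-dimensional input. The sketch below is a stand-in rather than the paper's procedure: it uses a central finite-difference gradient over the whole batch instead of the analytic single-sample gradient in (47), and it assumes the squared length scale enters the RBF as in the initialization of Algorithm 1. All function names are hypothetical.

```python
import math


def model_output(x, c, lam, mu_w):
    """mu_w^T phi(x) for a 1-D input, with phi = eta * [x - c, 1]."""
    eta = math.exp(-0.5 * (x - c) ** 2 / lam ** 2)
    return eta * (mu_w[0] * (x - c) + mu_w[1])


def objective(lam, xs, mu_f, mu_w, c, beta_hat):
    """K(lambda_m) from the last line of (46)."""
    return -0.5 * beta_hat * sum(
        (f - model_output(x, c, lam, mu_w)) ** 2 for x, f in zip(xs, mu_f))


def ascent_step(lam, xs, mu_f, mu_w, c, beta_hat, kappa=1e-2, h=1e-6):
    """One gradient-ascent update using a finite-difference gradient."""
    grad = (objective(lam + h, xs, mu_f, mu_w, c, beta_hat)
            - objective(lam - h, xs, mu_f, mu_w, c, beta_hat)) / (2.0 * h)
    return lam + kappa * grad


# Usage: targets generated with length scale 0.6, ascent started from 0.3.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
mu_w = [1.0, 0.5]
mu_f = [model_output(x, 0.0, 0.6, mu_w) for x in xs]
lam_new = ascent_step(0.3, xs, mu_f, mu_w, c=0.0, beta_hat=1.0)
```

In this toy setting a single step moves the length scale toward the value that generated the targets and increases K.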

2.5 Batch-wise Learning and Model Prediction

The batch-wise learning of the regression model is summarized as Algorithm 1, and is, henceforth, referred to as the Batch-HBLR model. wgen is used as an activation threshold for the local models. If xn does not activate any of the available local models, the algorithm picks xn as the center of a new local model. Moreover, we add a small positive number ε into the calculation of Σwm to avoid numerical issues during matrix inversion. The output is the set of local models learned for a batch of data. Note that all the update equations are local with the exception of µfm , the posterior mean of the hidden target.
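The rule for creating new local models can be sketched in a few lines. This is an illustrative reading of the initialization step, assuming an isotropic RBF with squared length scale and Python lists as inputs; `rbf_activation` and `allocate_centers` are hypothetical names.

```python
import math


def rbf_activation(x, center, lam):
    """Isotropic RBF activation exp(-0.5 * ||x - c||^2 / lam^2)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-0.5 * sq_dist / lam ** 2)


def allocate_centers(samples, w_gen=0.5, lam_init=0.3):
    """Pick the centers of the local models for one batch.

    The first sample seeds the first model; any later sample that activates
    none of the existing models above w_gen becomes a new center.
    """
    centers = [samples[0]]
    for x in samples:
        if all(rbf_activation(x, c, lam_init) < w_gen for c in centers):
            centers.append(x)
    return centers


# Usage: nearby samples share a model, distant samples spawn new ones.
centers = allocate_centers([[0.0], [0.1], [1.0], [1.05], [2.0]])
```

Because a sample only needs to be compared against the current centers, the number of local models depends on how the inputs spread out, not on the responses.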
To find the predictive distribution, we marginalize the complete likelihood of
the model first over the hidden variables f
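\[
\int \mathcal{N}\left(y^*;\, 1^\top \tilde{f}^*,\, \beta_y^{-1}\right) \mathcal{N}\left(\tilde{f}^*;\, W^\top \phi^*,\, B^{-1}\right) d\tilde{f}^* = \mathcal{N}\left(y^*;\, \sum_{m=1}^{M} w_m^\top \phi_m^*,\; \beta_y^{-1} + 1^\top B^{-1} 1\right),
\tag{48}
\]

and then over the regression parameters w

\[
\int \mathcal{N}\left(y^*;\, \sum_{m=1}^{M} w_m^\top \phi_m^*,\; \beta_y^{-1} + 1^\top B^{-1} 1\right) \mathcal{N}\left(w;\, \mu_w,\, \Sigma_w\right) dw = \mathcal{N}\left(y^*;\, \sum_{m=1}^{M} \mu_{w_m}^\top \phi_m^*,\; \sigma^2(x^*)\right).
\tag{49}
\]

(49) represents the posterior with predictive mean $\sum_{m=1}^{M} \mu_{w_m}^\top \phi_m^*$, which is the sum of all the local models' predictions, and the predictive variance $\sigma^2(x^*) = \beta_y^{-1} + 1^\top B^{-1} 1 + \sum_{m=1}^{M} \phi_m^{*\top} \Sigma_{w_m} \phi_m^*$.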
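A prediction from the learned local models then amounts to summing the local means and adding up three variance contributions. The sketch below assumes posterior means are used for the weights and that the weight-uncertainty term is summed over the local models; both choices, and the name `predict`, are assumptions of this illustration.

```python
def _dot(u, v):
    """Inner product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


def predict(phis_star, mu_ws, sigma_ws, beta_f_hat, beta_y):
    """Predictive mean and variance at a new input, following (49).

    phis_star: feature vector of the new input for each local model
    mu_ws, sigma_ws: posterior mean and covariance of each model's weights
    beta_f_hat: expected precision of each local model
    beta_y: precision of the observation noise
    """
    mean = sum(_dot(mu, phi) for mu, phi in zip(mu_ws, phis_star))
    noise_var = 1.0 / beta_y                       # observation noise
    model_var = sum(1.0 / b for b in beta_f_hat)   # 1^T B^-1 1
    weight_var = sum(_dot(phi, [_dot(row, phi) for row in sigma])
                     for phi, sigma in zip(phis_star, sigma_ws))
    return mean, noise_var + model_var + weight_var


# Usage with two local models of two parameters each.
mean, var = predict(
    phis_star=[[1.0, 0.0], [0.0, 2.0]],
    mu_ws=[[2.0, 5.0], [7.0, 3.0]],
    sigma_ws=[[[0.1, 0.0], [0.0, 0.1]], [[0.2, 0.0], [0.0, 0.2]]],
    beta_f_hat=[2.0, 4.0], beta_y=10.0)
```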
For better prediction, we divide the training set into mini-batches when the
training set is very large, and train the model for each mini-batch separately.
To make the prediction for a new location x∗ , we find the correct mini-batch
model(s) that x∗ belongs to, and then predict x∗ using (49). The prediction from
each model is averaged for the points that lie within the intersection of two or
more mini-batches. The procedure for preparing data and training the model on
all the batches is described in Algorithm 2. The procedure for testing the models
is stated in Algorithm 3.
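The averaging over overlapping mini-batches can be sketched as follows. The representation is an assumption made for illustration: each mini-batch is described by a closed interval of a scalar index (for instance time), and each trained model is a callable returning its prediction. The names `active_segments` and `predict_averaged` are hypothetical.

```python
def active_segments(t, segments):
    """Indices of the segments whose interval [lo, hi] contains t."""
    return [i for i, (lo, hi) in enumerate(segments) if lo <= t <= hi]


def predict_averaged(t, x, segments, models):
    """Average the predictions of all mini-batch models that cover t."""
    active = active_segments(t, segments)
    if not active:
        raise ValueError("no mini-batch model covers this location")
    predictions = [models[i](x) for i in active]
    return sum(predictions) / len(predictions)


# Usage: two overlapping segments with constant stand-in models.
segments = [(0.0, 5.0), (4.0, 10.0)]
models = [lambda x: 1.0, lambda x: 3.0]
inside_overlap = predict_averaged(4.5, None, segments, models)
```

Points in the overlap receive the mean of both models, while points covered by a single segment use that segment's model alone.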

3 Theoretical Analysis

We have explained how to solve an MLE problem for the graphical model in Fig. 2 using a variational EM algorithm, and derived the corresponding update equations for the unknown model parameters in the previous section. An important question now is whether the variational EM algorithm has the desired performance. As discussed in the Introduction, there have been many efforts toward
finding convergence guarantees of various EM algorithms. Here, we show that,
assuming a factorized posterior and considering the monotonicity of the EM algo-
rithm as established in Theorem 2.1 of [20], the (t + 1)th update is never less likely
than the tth update. Alternatively, this means that improving the Q-function in
(24) never makes the log-likelihood function worse. Let us restate the Theorem for
a part of the Batch-HBLR model:

Theorem 1 Let random variables X and Y have parametric densities with parameter θ ∈ Ω. Suppose the support of X does not depend on θ, and the Markov relationship θ → X → Y , that is,

p(y|x, θ) = p(y|x)

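Algorithm 1 Locally weighted Bayesian linear regression models implemented batch-wise

procedure Batch-HBLR($(X, Y) = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, $T$, $\kappa$, $\epsilon$, $\delta$, $a_0^{\alpha}$, $b_0^{\alpha}$, $a_0^{\beta}$, $b_0^{\beta}$, $\beta_y$, $w_{gen}$, $\lambda_{init}$)
    LM ← InitializeLM($X$, $a_0^{\alpha}$, $b_0^{\alpha}$, $a_0^{\beta}$, $b_0^{\beta}$, $w_{gen}$, $\lambda_{init}$)
    $a_N^{\alpha} = a_0^{\alpha} + \frac{1}{2}$    ▷ p is the number of LM parameters (equal to d + 1, where $x \in \mathbb{R}^d$)
    $a_N^{\beta} = a_0^{\beta} + \frac{N}{2}$
    t = 0
    while t <= T do
        t = t + 1
        for m = 1, . . . , M do
            $B = \mathrm{diag}(\hat{\beta}_{f_1}, \ldots, \hat{\beta}_{f_M})$
            $s = \beta_y^{-1} + 1^\top B^{-1} 1$
            $\sigma_{f_m} = \hat{\beta}_{f_m}^{-1} - \left(\hat{\beta}_{f_m}^{-1}\right)^2 / s$
            for n = 1, . . . , N do
                $\eta_m(x^{(n)}) = \exp\left[-\frac{1}{2}(x^{(n)} - c_m)^\top \Lambda_m^{-1} (x^{(n)} - c_m)\right]$
                $\phi_m^{(n)} = \eta_m(x^{(n)}) \left[(x^{(n)} - c_m)^\top,\, 1\right]^\top$    ▷ $\phi_m^{(n)} \in \mathbb{R}^p$
            end for
            ▷ Variational E-step to update the target variables and the moments of the posterior distributions of the regression models' weights
            $\mathbb{E}[f_m] = \left[\mu_{w_m}^\top \phi_m^{(1)}, \ldots, \mu_{w_m}^\top \phi_m^{(N)}\right]^\top$    ▷ $f_m \in \mathbb{R}^N$ is the vector of the mth local model evaluations at the N samples; we refer to the elements of this vector as $\mathbb{E}[f_m^{(n)}]$
            $Y_{pre} = \left[\sum_{m'=1}^{M} \mu_{w_{m'}}^\top \phi_{m'}^{(1)}, \ldots, \sum_{m'=1}^{M} \mu_{w_{m'}}^\top \phi_{m'}^{(N)}\right]^\top$
            $\mu_{f_m} = \mathbb{E}[f_m] + \frac{\hat{\beta}_{f_m}^{-1}}{s}\,(Y - Y_{pre})$    ▷ $\mu_{f_m} \in \mathbb{R}^N$
            $\hat{A}_m = \mathrm{diag}\left(\hat{\alpha}_m^{(1)}, \ldots, \hat{\alpha}_m^{(p)}\right)$    ▷ $\hat{A}_m \in \mathbb{R}^{p \times p}$
            $\Sigma_{w_m} = \left((1 + \epsilon)\hat{A}_m + \hat{\beta}_{f_m} \sum_{n=1}^{N} \phi_m^{(n)} \phi_m^{(n)\top}\right)^{-1}$    ▷ $\Sigma_{w_m} \in \mathbb{R}^{p \times p}$
            $\mu_{w_m} = \hat{\beta}_{f_m}\, \Sigma_{w_m} \sum_{n=1}^{N} \phi_m^{(n)}\, \mathbb{E}[f_m^{(n)}]$    ▷ $\mu_{w_m} \in \mathbb{R}^p$
            ▷ Variational M-step to update the regression models' hyperparameters
            $b_{Nm}^{\beta} = b_0^{\beta} + \frac{1}{2}\sum_{n=1}^{N}\left[\left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right)^{\top}\left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right) + \sigma_{f_m} + \mathrm{Trace}\left(\phi_m^{(n)\top} \Sigma_{w_m} \phi_m^{(n)}\right)\right]$
            $b_{Nm}^{\alpha(p)} = b_0^{\alpha(p)} + \frac{1}{2}\left(\left(\mu_{w_m}^{(p)}\right)^2 + \sigma_{w_m}^{(p)}\right)$
            $\hat{\alpha}_m = \left[\frac{a_N^{\alpha}}{b_{Nm}^{\alpha(1)}}, \ldots, \frac{a_N^{\alpha}}{b_{Nm}^{\alpha(p)}}\right]^\top$    ▷ $\hat{\alpha}_m \in \mathbb{R}^p$
            $\hat{\beta}_{f_m} = \frac{a_N^{\beta}}{b_{Nm}^{\beta}}$    ▷ $\hat{\beta}_{f_m} \in \mathbb{R}$
            Update $\lambda_m$ using gradient ascent with κ as the learning rate
            $LM_m = \left\{c_m, \lambda_m, \hat{\beta}_{f_m}, \hat{\alpha}_m, \mu_{w_m}, \Sigma_{w_m}\right\}$
        end for
        $Y_{pre} = \left[\sum_{m=1}^{M} \mu_{w_m}^\top \phi_m^{(1)}, \ldots, \sum_{m=1}^{M} \mu_{w_m}^\top \phi_m^{(N)}\right]^\top$
        $nMSE[t] = \frac{\| Y - Y_{pre} \|_2^2}{\mathrm{Var}(Y)}$
        if $\| nMSE[t] - nMSE[t-1] \| <= \delta$ then break    ▷ Iteration threshold
        end if
    end while
    return LM
end procedure

function InitializeLM($X$, $a_0^{\alpha}$, $b_0^{\alpha}$, $a_0^{\beta}$, $b_0^{\beta}$, $w_{gen}$, $\lambda_{init}$)
    M = 1    ▷ M is the index of the local models
    $LM_1 = \left\{c_1 = x^{(1)},\; \lambda_1 = \lambda_{init},\; \hat{\beta}_{f_1} = \frac{a_0^{\beta}}{b_0^{\beta}},\; \hat{\alpha}_1 = \frac{a_0^{\alpha}}{b_0^{\alpha}} 1_p,\; \mu_{w_1} = 0_p,\; \Sigma_{w_1} = 0_{p \times p}\right\}$
    for n = 1, . . . , N do
        for m = 1, . . . , M do
            $\Lambda_m = \lambda_{init}^2 I_d$    ▷ $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix
            $\eta_m(x^{(n)}) = \exp\left[-\frac{1}{2}(x^{(n)} - c_m)^\top \Lambda_m^{-1} (x^{(n)} - c_m)\right]$
        end for
        if $\eta_m(x^{(n)}) < w_{gen}$, ∀m = 1, . . . , M then
            M = M + 1
            $LM_M = \left\{c_M = x^{(n)},\; \lambda_M = \lambda_{init},\; \hat{\beta}_{f_M} = \frac{a_0^{\beta}}{b_0^{\beta}},\; \hat{\alpha}_M = \frac{a_0^{\alpha}}{b_0^{\alpha}} 1_p,\; \mu_{w_M} = 0_p,\; \Sigma_{w_M} = 0_{p \times p}\right\}$
        end if
    end for
    return LM
end function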

Algorithm 2 Data preparation and model training

procedure Training($(X, Y) = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$)
    Find all the instances corresponding to the change in the control signal ($T^*$)
    If possible, evenly divide the training input space into S segments using $T^*$. If not, divide such that only the size of the last segment is different ($(X, Y) = \cup_{s=1}^{S} (X_s, Y_s)$)
    if Var($Y_s$) == 0.0 then
        $\epsilon \sim \mathcal{N}(0.0, 10^{-8})$
        $Y_s = Y_s + \epsilon$
    end if
    for ∀s do
        $LM_s$ = Batch-HBLR($X_s$, $Y_s$)
    end for
    return LM
end procedure

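holds for all θ ∈ Ω, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Then, for θ ∈ Ω and any $y \in \mathcal{Y}$ with $\mathcal{X} \neq \emptyset$,

$l(\theta) \geq l(\theta^{(t)})$,

if

$Q(\theta \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})$.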
Theorem 1 can also be extended to the maximum a posteriori (MAP) EM, where
we have a prior on θ. This extension is established using the following Theorem.
Theorem 2 Let random variables X and Y have parametric densities with parameter θ ∈ Ω, where θ is distributed according to the density p(θ) on Ω. Suppose the support of X does not depend on θ, and the Markov relationship θ → X → Y , that is,

p(y|x, θ) = p(y|x)

Algorithm 3 Model testing

1: procedure Test($(X, Y) = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$)
2:
3:     for i = 1, . . . , N do
4:         Determine the segments $x^{(i)}$ belongs to (store their indices in S), and use the model learned for that segment ($LM_s$) to make prediction (Eq. 49)
5:
6:         if there is only one segment to which $x^{(i)}$ belongs then
7:             $Y_{pre}^{(i)} = \sum_{m=1}^{M} \mu_{w_m}^\top \phi_m^{(i)}$
8:
9:         else if there are more segments then
10:            for s ∈ S do
11:                $y(s) = \sum_{m=1}^{M} \mu_{w_m}^\top \phi_m^{(i)}$    ▷ Using $LM_s$
12:            end for
13:
14:            $Y_{pre}^{(i)} = \frac{\sum_{s=1}^{|S|} y(s)}{|S|}$    ▷ |S| is the number of active segments
15:        end if
16:
17:    end for
18:
19:    $nMSE = \frac{\| Y - Y_{pre} \|_2^2}{\mathrm{Var}(Y)}$
20:
21:    return nMSE
22: end procedure

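holds for all θ ∈ Ω, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Then, for θ ∈ Ω and any $y \in \mathcal{Y}$ with $\mathcal{X} \neq \emptyset$,

$l(\theta) + \log p(\theta) \geq l(\theta^{(t)}) + \log p(\theta^{(t)})$,

if

$Q(\theta \mid \theta^{(t)}) + \log p(\theta) \geq Q(\theta^{(t)} \mid \theta^{(t)}) + \log p(\theta^{(t)})$.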
Since both the Theorems are taken directly from [20], we do not include their
proofs here, but provide them in Appendix A for the sake of completeness. Now,
recall that we assume factorized priors to approximate the posterior. Moreover,
consider the graphical model representation of the algorithm in Fig. 2. We decom-
pose the graphical model into chains that satisfy the assumptions in the two Theo-
rems. These chains are w → f → Y, α → w → f , α → w → βf , and βf → f → Y.
In some of these chains, we also have priors on the parameters. Therefore, we use
Theorem 2 to prove the enhancement of their log-likelihood function by using
the EM updates. The corresponding likelihood functions for these chains generate
factors in the complete likelihood function (10), and all these factors are bounded
in their respective domains. We also know that if two functions are monotonic in
their domains, the product of these functions is monotonic in its domain. Hence,
the overall log-likelihood function is bounded, and its value increases when we
apply the updates in each iteration until it converges to the optimal log-likelihood
value [20].
However, this result does not necessarily guarantee the convergence of the pa-
rameters to an optimal value. Indeed, there has not been any general convergence
theorem for the EM algorithm. To study the convergence of the parameters, we
need more information about the family of the desired distribution and the struc-
ture of the log-likelihood function [19]. However, even knowing the family of the
desired distribution is not sufficient when we are dealing with the variational EM
algorithm with a factorized posterior (as discussed in the Introduction) since there
is no guarantee that the estimated parameters correspond to the stationary points
of the likelihood function.

4 Experimental Results

In this section, we present the implementation details and characterize the per-
formance of the Batch-HBLR method on three different stochastic dynamical sys-
tems. First, we evaluate our method on a stochastic mass-spring-damper (MSD)
system, and then on a more complex double inverted pendulum on a cart (DIPC)
system. Last, we validate its effectiveness on the synthetic version of a real-world
micro-robotic system.

4.1 Implementation
In order to construct non-informative priors, we choose $a_0^{\alpha} = b_0^{\alpha} = a_0^{\beta} = b_0^{\beta} = 10^{-6}$ and $\beta_y = 10^{9}$ as the hyperparameters. We assume similar initial precision ($\alpha_m$) for all the models' weights, and set the initial value equal to $\frac{a_0^{\alpha}}{b_0^{\alpha}} 1_p$ since they are generated by a gamma distribution. The matrix inversion instability avoidance parameter, ε, is chosen to be $10^{-10}$. We use wgen = 0.5 as the RBF activation
threshold. If the RBF value for a sample (x(n) ) is less than wgen , the algorithm
adds a new local model with that sample as the center, and initializes its parame-
ters in the INITIALIZELM function in Algorithm 1. These parameters comprise
the length scale (λinit = 0.3), the mean µwm = 0, and the covariance Σwm = 0p×p
of the models weights. We only retain those features with precision values less than
1000. If the precision becomes larger than this threshold, we set the corresponding
weight to zero and prune the corresponding feature. The learning rate, κ, of λ is
selected to be 0.0001 and the iteration threshold, δ, is also chosen as 0.0001.
In the two illustrative examples, we report the normalized mean squared error (nMSE), which is computed in line 19 of Algorithm 3. When the response variable (y) is a constant function, Var(y) is zero; unless we add noise to the data, we cannot use nMSE and compute MSE instead. For the micro-robotic system, some of the states remain constant. Therefore, we use MSE both as a termination criterion in the training algorithm and as a performance measure in the test algorithm. We allow a maximum of 200 iterations for training the models. However, the algorithm stops if the difference between the previous and current nMSE values is less than 0.0001. It is worth mentioning that the models are learned before 200 iterations in all the examples.
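The error measure and the stopping test described above can be sketched as follows. This is an illustration with assumptions stated explicitly: the squared error is averaged over the samples before dividing by the population variance, and the fallback to plain MSE for a constant response is handled inside the same helper. The names `error_measure` and `has_converged` are hypothetical.

```python
def error_measure(y_true, y_pred):
    """Return nMSE, or plain MSE when the response has zero variance."""
    n = len(y_true)
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    variance = sum((yt - mean) ** 2 for yt in y_true) / n
    if variance == 0.0:
        return mse          # constant response: normalization is undefined
    return mse / variance


def has_converged(previous, current, delta=1e-4):
    """Stop when the error changes by no more than delta between iterations."""
    return abs(current - previous) <= delta


# Usage: predicting the mean of the data gives a normalized error of one.
y = [1.0, 2.0, 3.0, 4.0]
baseline = error_measure(y, [2.5, 2.5, 2.5, 2.5])
```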
Algorithms 1 to 3 are written in Python 2.7 and tested on both Windows 10 and Ubuntu 16.04 LTS operating systems. We randomly select the training (67% of the data) and test (33% of the data) sets using the train_test_split function from the scikit-learn library in Python. The simulator for generating the micro-robot trajectories in Section 4.3 is written in the C++ programming language, and is run on the Ubuntu 16.04 LTS operating system. The source code and the examples will be made available on GitHub later on.

4.2 Illustrative Examples

4.2.1 Stochastic Mass and Spring Damper system

We build a stochastic mass-spring damper system by adding white noise to the input of a standard mass-spring damper system (see Fig. 3). Its linearized state equation is then given by

\[
\begin{bmatrix} \dot{x}_1(t) \\ \dot{x}_2(t) \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -\nu^2 & -\gamma \end{bmatrix} \begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} w(t).
\tag{50}
\]

Equation (50) is an Itô equation [31], where x1 and x2 are the position and velocity of the mass, respectively (“ ˙ ” represents the time derivative), and w ∼ N (0, 1). In Fig. 3, x(t) is equivalent to x1 in the state equation, and k and c are the stiffness of the spring and the damping coefficient, respectively. In (50), $\nu = \sqrt{k/m}$ is the natural frequency of the system, and $\gamma = c/m$ is the damping ratio. It is solved using itoint, a numerical integration method in the sdeint Python library.
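To make the simulation setup concrete, the following sketch integrates (50) with a plain Euler-Maruyama scheme. It is a stand-in written with the standard library only and is not the integrator used by sdeint; the function name `simulate_msd` and its parameters are hypothetical, while the default values follow the system described in the text.

```python
import math
import random


def simulate_msd(nu=3.0, gamma=1.0, x0=(3.0, 0.0), t_end=10.0, dt=0.005,
                 noise=1.0, seed=0):
    """Euler-Maruyama integration of the stochastic mass-spring damper.

    Returns a list of (position, velocity) tuples, one per time step,
    including the initial state.
    """
    rng = random.Random(seed)
    steps = int(round(t_end / dt))
    x1, x2 = x0
    trajectory = [(x1, x2)]
    for _ in range(steps):
        # Brownian increment over one step has standard deviation sqrt(dt).
        dw = rng.gauss(0.0, math.sqrt(dt))
        x1_next = x1 + x2 * dt
        x2_next = x2 + (-nu * nu * x1 - gamma * x2) * dt + noise * dw
        x1, x2 = x1_next, x2_next
        trajectory.append((x1, x2))
    return trajectory


# Usage: 10 s at 0.005 s per step yields 2000 transitions.
trajectory = simulate_msd()
```

Setting `noise=0.0` recovers the deterministic damped oscillator, which is a convenient sanity check for the drift term.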

Fig. 3: Stochastic Mass and Spring Damper system. The damping ratio is γ = 1,
and the natural frequency is ν = 3 for this system.

The system starts from $x = [3.0, 0.0]^\top$ and is simulated for 10 seconds in increments of 0.005 second. The effect of measurement noise is included by adding white noise in the form $x(t) = x(t) + s_d\, w$, where w ∼ N (0, 1), with $s_d = 0.1$ for the first state and $s_d = 0.4$ for the second one. The training set for the Batch-HBLR method has size N = 1340, a 3-dimensional input ($X = [x[0 : N - 1]^\top, \text{Time}]$), and a 2-dimensional response ($Y = x[1 : N]$). We train the model on every dimension of the response separately. The predicted results for the test set (N = 660) using the trained model are shown in Fig. 4. They closely match the “true” state values regardless of the amount of noise added into the stochastic system.
The important quantitative performance measures of the training and test algorithms are summarized in Table 1. The reported prediction time is the time taken by the algorithm to make a prediction for one new sample¹. We observe that not only is the algorithm fast in making predictions, it also converges well before reaching the maximum iteration limit during training. The number of local models (# of LMs) depends only on the inputs and not on the responses (Y). Therefore, it is the same for both the response states. This number is more than two orders of magnitude smaller than the number of samples, making the model extremely parsimonious. The training and test errors are comparable, and very small. The actual expressions of the learned local models for both x(t) and ẋ(t) are reported in Tables 4 and 5 in Appendix B.

¹ Only the average time is reported as the standard error is negligible.

Fig. 4: Prediction results of our Batch-HBLR method on a stochastic mass-spring damper (MSD) system with varying amounts of noise.

Table 1: Performance measures of our Batch-HBLR method on the simulated data of a noisy MSD system.

# of training samples: 1340 & # of testing samples: 660
Windows 10 - 64 bit

States   nMSE (training)   nMSE (test)   # of LMs   # of iter.   Prediction time (ms)
x(t)     0.04300           0.04759       3          57           9
ẋ(t)     0.08262           0.08189       3          30           9

4.2.2 Stochastic double inverted pendulum on a cart

Our second example is a stochastic double inverted pendulum on a cart, which is depicted in Fig. 5. The differential equations of the system dynamics are adopted from the model presented in [9].

Fig. 5: Stochastic double inverted pendulum on a cart with masses, m0 = 1.5, m1 = 0.5, and m2 = 0.75. The length of the first pendulum is l1 = 0.5 m and that of the second pendulum is l2 = 0.75 m. The moment of inertia is $J_i = \frac{1}{3} m_i l_i^2$ for i = 1, 2. F is the input force applied to the cart.

We now provide a brief description of the dynamics model. The state vector is defined as $x = [\theta_0, \theta_1, \theta_2, \dot{\theta}_0, \dot{\theta}_1, \dot{\theta}_2]^\top$, where the position of the cart is in meters and the angles are in radians. For the system in Fig. 5 the state-space equation is

ẋ(t) = Ax(t) + BF(t) + Dw(t) (51)


y(t) = x(t) (52)
where,

\[
A = \begin{bmatrix}
0.0 & 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\
0.0 & 0.0 & 0.0 & 0.0 & 1.0 & 0.0 \\
0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 1.0 \\
0.0 & 0.0 & -7.49 & 0.798 & 0.0 & 0.0 \\
0.0 & 0.0 & 74.93 & -33.71 & 0.0 & 0.0 \\
0.0 & 0.0 & -59.94 & 52.12 & 0.0 & 0.0
\end{bmatrix},
\quad
B = \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ -0.61 \\ 1.5 \\ -0.3 \end{bmatrix},
\quad
D = \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 0.1 \\ 0.1 \\ 0.1 \end{bmatrix},
\]

and w ∼ N (0, 1). The stochastic tracking error dynamics is written as

ė(t) = (A + BF(t))e(t) + Dw(t), (53)

where, e(t) = x(t) − xd (t) is the error calculated with respect to the desired
trajectory xd (t).
We consider a Linear Quadratic Regulator (LQR) controller, F(t) = −Kx, and find it by minimizing a quadratic cost J in the form of $x^\top Q x + F^\top R F$. Here, K is obtained from the solution of the algebraic Riccati equation. Defining

Q = diag(10, 100, 100, 700, 700, 700),

and R = 1, we get

K = [−3.162, 589.127, −842.986, −29.493, 4.469, −133.079].
The system starts from $x = [0.0, 0.175, -0.175, 0.0, 0.0, 0.0]^\top$ and follows a desired trajectory for 200 seconds in increments of 0.01 second. The effect of measurement noise is incorporated by adding white noise in the form $x(t) = x(t) + s_d\, w$, where $s_d = [0.5, 0.3, 0.25, 0.8, 0.3, 0.2]^\top$ is a vector of the standard deviations of the noise we add to the states, and w ∼ N (0, 1). The training set for the Batch-HBLR method has N = 13400 paired samples. The samples have an 8-dimensional input space, $X = [x[0 : N - 1]^\top, F, \text{Time}]$, which consists of the 6 states, the input F, and the simulation time Time. The response is a 6-dimensional vector $Y = [x[1 : N]]$, which captures all the states of the system. We train the model on every dimension of the target separately, and report the results in Table 2. The simulated trajectories of the pendulum and cart used for testing the model are shown in Fig. 6. The tabulated performance measures and the illustrated prediction results are very similar to those for the stochastic MSD system, indicating the general applicability of our method. The learned local models for one of the response variables, θ0 (t), are reported in Table 6 in Appendix B.

Fig. 6: Prediction results of our Batch-HBLR method on a stochastic double inverted pendulum on a cart (SDIP) system.

Table 2: Performance measures of our Batch-HBLR method on the simulated data of a noisy SDIP system.

# of training samples: 13400 & # of testing samples: 6600
Windows 10 - 64 bit

States   nMSE (training)   nMSE (test)   # of LMs   # of iter.   Prediction time (ms)
θ0       0.0141            0.0146        3          80           3
θ1       0.5828            0.5821        3          99           3
θ2       0.9827            0.6619        3          9            3
θ̇0       0.6479            0.5687        3          13           3
θ̇1       0.5718            0.5643        3          33           3
θ̇2       0.9847            0.9880        3          8            3

4.3 Micro-Robotic System

The final and most important example of the paper is a micro-robotic system, in
which a microscopic object (robot) is manipulated (actively controlled) in a fluid
medium using the optical force produced by shining a tightly focused laser beam
at the object. The dynamics of this system is stochastic due to the Langevin force
that gives rise to Brownian motion-based diffusion. Rajasekaran et al. [48] describe
the details of the dynamics model, which also includes optical forces, viscous drag,
and buoyancy, using a tensor stochastic differential equation. We directly use the
OTGO toolbox [11] to generate look-up tables of the optical trapping forces Fo .
The dynamics model for one optical trap (local controller generating the optical
forces) affecting n spherical micro-robots and other freely diffusing objects, termed
as obstacles, is then represented as

Mẍ(t) = Fo − Bd ẋ(t) − Bo + Fη. (54)

The symbols in (54) are defined as follows.

– M is the 3n × 3n diagonal mass matrix of the micro-objects.
– x is a 3n × 1 combined vector of the object locations (center coordinates).
– Bd is the viscous drag coefficient matrix of dimension 3n × 3n populated by $6\pi r \mu$ if the object is not close to the cover slide, and by $\dfrac{6\pi r \mu}{1 - \frac{9r}{16h} + \frac{r^3}{8h^3} - \frac{45r^4}{256h^4} - \frac{r^5}{16h^5}}$ if the object is near the cover slide. µ is the viscosity of the fluid medium and h is the distance between the cover slide and the object center.
– Bo is the buoyancy force; it is a 3n × 3n diagonal matrix with V ρl g as the diagonal element in every third row. V is the volume of the displaced fluid, ρl is the density of the fluid, and g is the acceleration due to gravity.
– F is the diagonal disturbance coupling matrix with dimension 3n × 3n, where each term is $\sqrt{2 k_b T \gamma}$ with $k_b$ being the Boltzmann constant and γ being the viscous drag coefficient.
– η is the standard Gaussian variable.
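The drag and disturbance terms in the list above are simple closed-form expressions. The sketch below evaluates them for a single bead; it assumes r denotes the bead radius, uses SI units, and takes a water-like viscosity and room temperature as illustrative values that are not given in the text. The function names are hypothetical.

```python
import math


def drag_coefficient(r, mu, h=None):
    """Viscous drag coefficient of a sphere of radius r in a fluid of viscosity mu.

    With h=None the plain Stokes value is returned; otherwise the wall
    correction for a bead whose center is at distance h from the cover slide
    is applied.
    """
    stokes = 6.0 * math.pi * r * mu
    if h is None:
        return stokes
    q = r / h
    correction = (1.0 - 9.0 / 16.0 * q + q ** 3 / 8.0
                  - 45.0 / 256.0 * q ** 4 - q ** 5 / 16.0)
    return stokes / correction


def disturbance_amplitude(gamma, temperature, k_b=1.380649e-23):
    """Diagonal term sqrt(2 k_b T gamma) of the disturbance coupling matrix."""
    return math.sqrt(2.0 * k_b * temperature * gamma)


# Usage: 2 um diameter bead; viscosity 1e-3 Pa s and 300 K are assumed values.
far = drag_coefficient(1e-6, 1e-3)
near = drag_coefficient(1e-6, 1e-3, h=2e-6)
amplitude = disturbance_amplitude(far, 300.0)
```

The drag grows as the bead approaches the cover slide, so the same optical force produces a slower motion near the wall.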

We develop a high-fidelity simulator, akin to the one presented in [7], to generate the trajectories for various instances of a basic scenario involving exactly one micro-robot and one obstacle in its neighborhood. Note here that this basic scenario, under varying robot velocities and relative displacements of the robot-obstacle pair, is replicable for more complex environments with a larger number of robots and obstacles. We consider a small 3D volume as our environment, which is discretized into a regular 4 × 4 × 4 grid with its size equal to the robot diameter (Fig. 7). Both the robot and the obstacle consist of a 2 µm diameter amorphous silica bead. In all the experiments, the robot starts from the bottom corner of the environment (the origin) and is moved along a straight line with constant speed until it leaves the environment’s boundary. The maximum speed at which one can manipulate this robot in a stable manner using a 0.5 W laser beam in an aqueous medium at room temperature is 1 µm/s. We then consider the following discrete set of speeds (in µm/s): S = {0.2, 0.4, 0.6, 0.8, 1.0}.

Fig. 7: In the left figure, the yellow part of the cube shows the volume used for
running the experiments (the remaining volume is ignored due to force symmetry
considerations). The green arrow is a sample trajectory followed by the controlled
micro-robot. θx and θz show the angle made by the trajectory in the X-Y plane
with the X-axis and the angle with the Z-axis in the plane defined by the Z-axis
and the trajectory vector, respectively. The obstacle is located at the nodes of the
3D grid (see the right figure for the locations on the X-Y plane).

The optical effect on a micro-robot is symmetric (direction insensitive) in the horizontal plane (X-Y). Therefore, we only simulate the movement in half of the
X-Y plane (the shaded region in Fig. 7). The trajectories are generated such that
they cover the specified regions within the environment. The angle between the
trajectory and the Z-axis is called θz and the angle between the projection of the
trajectory in the X-Y plane and the X-axis is termed as θx . Five representative
directions are used for training and testing the model. Defining the direction of a
trajectory with the tuple of (θx , θz ), the trajectories (., 0), (0, 45), (45, 45), (0, 90),
and (30, 75) are used in this experiment. Note that when θz = 0, the micro-robot is
moving along the Z axis; thus, θx is undefined, and we show its direction as (., 0).
For every direction of the movement of the micro-robot, we put an obstacle in the
environment as shown in Fig. 7. Considering the 3D nature of the environment, we
have 14 × 5 = 70 obstacle positions. Therefore, we have 70 × 5 = 350 simulations
for every direction (5 is the size of speed set).
Figures 8 and 9 show the test results for two experiments. In Fig. 8, the micro-
robot is moving with the maximum speed (1µm/s) along the Z-axis and the obsta-
cle is at [4, 4, 4]. The obstacle is far enough in this experiment; therefore, it stays
at its initial position. The micro-robot moves along the Z-axis until it leaves the
cube. In Fig. 9, the robot is moving with speed (0.6µm/s) along (30◦ , 75◦ ), and
the obstacle is at [4, 2, 2]. It gets attracted to the trap after 4 seconds from the
beginning of the experiment, and then the two beads travel together until they
leave the environment.
The micro-robot is moved by controlling the position of the optical trap. The
simulator updates the trap position every 500ms and stores 100 samples for every
control action. As a result, the size of the training set is large. Hence, we partition it
into smaller sections with the overlap of one control action between the sections.
The number of partitions used for each speed in S = {0.2, 0.4, 0.6, 0.8, 1.0} is
P = {20, 10, 6, 5, 4}, respectively.
The average quantitative results with standard errors are presented in Table 3. Since we have 350 experiments for every direction, the mean squared error (MSE) is reported along with its statistics. The MSE for the training set is first averaged over the segments and its statistics are then computed for the 350 experiments. The statistics for the total number of local models obtained in each segment are reported in the table under the heading of “# of LMs”. The number of iterations (# of iter.) is computed similarly to the number of local models.
In addition to accurate training and test results (low MSE), what is significant
about this learning algorithm is that it only stores very few samples as the centers
of the RBF models. For instance, in the first sub-table in Table 3, the average
size of the training set is 37200 samples, whereas the average number of local
models/RBF centers is 60. This significantly reduces the amount of memory needed
for recovering the results and making predictions for new data streams. Moreover,
if in some problems one needs to keep updating the model as new data arrives, it
is easy to transform this algorithm to operate in an incremental fashion. In this
manner, we can continuously expand the model by adding new local models and
updating the parameters of all the models accordingly.
Last, but not the least, the algorithm results in fast predictions. The last col-
umn of Table 3 corresponds to the time elapsed while the model makes prediction
for a new location. The average is less than 0.5 milliseconds for all the states. This
is a very important advantage for a learning algorithm, especially if it is used for
learning the dynamics of a robotic system (or the transition probabilities). Fast
state estimation not only makes robot motion control more precise but also more
adept in reacting to environmental changes.
Bayesian Local Linear Regression 29

Fig. 8: Ground truth (simulated) and predicted X, Y, Z trajectories of a micro-robot
(left column) starting from the origin and moving along the Z axis with a
speed of 1 µm/s. The obstacle (right column) is initially located at the [4, 4, 4]
grid node, and remains unaffected by the moving robot. The bottom left figure
also shows the moving window for selecting the four training data batches.
30 Behnoosh Parsa et al.

Fig. 9: Ground truth and predicted X, Y, Z trajectories of a micro-robot (left
column) starting from the origin and moving along the (30, 75) direction with a
speed of 0.6 µm/s. The obstacle is initially located at the [4, 2, 2] grid node, and
gets dragged by the moving robot as it passes close by (right column). The top left
figure also shows the moving window for selecting the six training data batches.

Table 3: Performance measures of our Batch-HBLR method for the optically actuated
micro-robotic system. The reported values are the averages over 350 experiments
in which the controlled micro-robot is moved along a particular specified direction.
As we partition the training data into S segments and train the HBLR model for
each segment separately, the reported training MSE values are the averages of
the MSEs observed for all the segments.

θz = 0°, θx is not defined
Avg. # of training samples: 37200
Avg. # of test samples: 18322
Ubuntu 16.4

States  Avg. MSE (training)          MSE (test)                   # of LMs  # of iter.  Prediction time (ms)
X       9.93×10^-11 ± 2.09×10^-13    3.01×10^-14 ± 6.17×10^-15    60 ± 35   2 ± 0       0.27 ± 0.07
Y       1.00×10^-10 ± 1.00×10^-12    3.84×10^-14 ± 1.27×10^-14    60 ± 35   2 ± 0       0.27 ± 0.07
Z       1.07×10^-3 ± 1.94×10^-4      1.06×10^-3 ± 1.81×10^-4      60 ± 35   62 ± 3      0.27 ± 0.07
Xobs    1.27×10^-3 ± 1.16×10^-3      1.24×10^-3 ± 1.13×10^-3      60 ± 35   59 ± 23     0.27 ± 0.07
Yobs    1.10×10^-3 ± 1.32×10^-3      1.08×10^-3 ± 1.29×10^-3      60 ± 35   41 ± 28     0.27 ± 0.07
Zobs    1.30×10^-3 ± 1.18×10^-3      1.28×10^-3 ± 1.15×10^-3      60 ± 35   55 ± 24     0.27 ± 0.07

θz = 90°, θx = 0°
Avg. # of training samples: 37400
Avg. # of test samples: 18421
Ubuntu 16.4 - 64 bit

States  Avg. MSE (training)          MSE (test)                   # of LMs  # of iter.  Prediction time (ms)
X       1.90×10^-3 ± 7.76×10^-4      1.90×10^-3 ± 7.67×10^-4      58 ± 37   60 ± 4      0.29 ± 0.09
Y       9.98×10^-11 ± 9.47×10^-13    6.28×10^-14 ± 3.11×10^-14    58 ± 37   2 ± 0       0.29 ± 0.09
Z       5.90×10^-6 ± 1.84×10^-7      6.01×10^-6 ± 3.37×10^-7      58 ± 37   2 ± 0       0.29 ± 0.09
Xobs    3.60×10^-3 ± 2.36×10^-3      3.60×10^-3 ± 2.35×10^-3      58 ± 37   75 ± 9      0.29 ± 0.09
Yobs    3.33×10^-3 ± 2.50×10^-3      3.32×10^-3 ± 2.49×10^-3      58 ± 37   51 ± 28     0.29 ± 0.09
Zobs    3.52×10^-3 ± 2.48×10^-3      3.51×10^-3 ± 2.47×10^-3      58 ± 37   61 ± 25     0.29 ± 0.09

θz = 45°, θx = 0°
Avg. # of training samples: 52200
Avg. # of test samples: 25710
Ubuntu 16.4

States  Avg. MSE (training)          MSE (test)                   # of LMs  # of iter.  Prediction time (ms)
X       1.78×10^-3 ± 4.31×10^-4      1.75×10^-3 ± 4.08×10^-4      81 ± 46   85 ± 5      0.36 ± 0.07
Y       9.97×10^-11 ± 7.495×10^-13   7.40×10^-14 ± 7.67×10^-14    81 ± 46   2 ± 0       0.36 ± 0.07
Z       1.23×10^-3 ± 9.41×10^-5      1.21×10^-3 ± 8.97×10^-5      81 ± 46   85 ± 5      0.36 ± 0.07
Xobs    3.90×10^-3 ± 2.30×10^-3      3.87×10^-3 ± 2.26×10^-3      81 ± 46   103 ± 10    0.36 ± 0.07
Yobs    3.54×10^-3 ± 2.60×10^-3      3.51×10^-3 ± 2.57×10^-3      81 ± 46   65 ± 39     0.36 ± 0.07
Zobs    3.94×10^-3 ± 2.32×10^-3      3.91×10^-3 ± 2.31×10^-3      81 ± 46   92 ± 22     0.36 ± 0.07

θz = 45°, θx = 45°
Avg. # of training samples: 74000
Avg. # of test samples: 36448
Windows 10 - 64 bit

States  Avg. MSE (training)          MSE (test)                   # of LMs  # of iter.  Prediction time (ms)
X       1.99×10^-3 ± 2.24×10^-4      1.99×10^-3 ± 2.31×10^-4      111 ± 65  108 ± 8     0.47 ± 1.96
Y       2.00×10^-3 ± 2.24×10^-4      1.99×10^-3 ± 2.30×10^-4      11 ± 65   108 ± 8     0.47 ± 1.96
Z       1.63×10^-3 ± 1.19×10^-4      1.16×10^-3 ± 1.16×10^-4      111 ± 65  117 ± 8     0.47 ± 1.96
Xobs    4.15×10^-3 ± 1.90×10^-3      4.19×10^-3 ± 1.93×10^-3      111 ± 65  130 ± 14    0.47 ± 1.96
Yobs    4.04×10^-3 ± 1.90×10^-3      4.07×10^-3 ± 1.92×10^-3      111 ± 65  100 ± 37    0.47 ± 1.96
Zobs    4.24×10^-3 ± 1.95×10^-3      4.26×10^-3 ± 1.97×10^-3      111 ± 65  118 ± 26    0.47 ± 1.96

θz = 75°, θx = 30°
Avg. # of training samples: 44000
Avg. # of test samples: 21672
Ubuntu 16.4

States  Avg. MSE (training)          MSE (test)                   # of LMs  # of iter.  Prediction time (ms)
X       1.79×10^-3 ± 7.05×10^-4      1.77×10^-3 ± 6.81×10^-4      66 ± 40   69 ± 1      0.32 ± 0.06
Y       1.30×10^-3 ± 2.15×10^-4      1.26×10^-3 ± 1.95×10^-4      66 ± 40   61 ± 2      0.32 ± 0.06
Z       1.26×10^-3 ± 5.13×10^-5      1.21×10^-3 ± 5.77×10^-5      66 ± 40   50 ± 2      0.32 ± 0.06
Xobs    3.22×10^-3 ± 1.72×10^-3      3.18×10^-3 ± 1.69×10^-3      66 ± 40   86 ± 8      0.32 ± 0.06
Yobs    3.11×10^-3 ± 1.76×10^-3      3.07×10^-3 ± 1.72×10^-3      66 ± 40   64 ± 25     0.32 ± 0.06
Zobs    3.24×10^-3 ± 1.77×10^-3      3.19×10^-3 ± 1.75×10^-3      66 ± 40   76 ± 18     0.32 ± 0.06

5 Conclusions

Learning the state transition dynamics of a robotic system and making fast,
accurate predictions are critical for any real-time motion planning
and control problem. The performance of the learning algorithm becomes even
more important when stochasticity plays an important role in the system. In this
paper, we present a hierarchical Bayesian linear regression model with local fea-
tures for stochastic dynamics approximation, and demonstrate its performance in
three different systems. The model is based on a top-down approach that benefits
both from the efficiency of local models and the accuracy of global models.
As the data collected from the interactions of robots with their environment
are typically enormous, we need a parsimonious learning model for compact stor-
age. In our case, only a few samples are stored and used as the local RBF feature
centers. This is done by optimizing the RBF length scale to obtain the minimum
number of local models that best represent the data. Moreover, usually the robots’
state vectors are large, and the curse of dimensionality slows down learning and
prediction substantially. This emphasizes the importance of a sparse learning algo-
rithm, meaning that the weights of the model are pruned if they do not contribute
significantly toward the predictions. The presented algorithm adopts the idea of
automatic relevance determination and only retains those weights with significant
values. The algorithm is also guaranteed to converge to the optimal values of
the log-likelihood function for a factorized representation of the posterior and a
Markov relationship among the predictor variables and their parameters.
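The pruning step in automatic relevance determination can be sketched as follows. This is a minimal illustration assuming that a weight is discarded once its precision hyperparameter α exceeds a large threshold; the threshold `alpha_max` and the function name `ard_prune` are hypothetical and are not taken from our implementation.

```python
import numpy as np

def ard_prune(w, Sigma, alpha, alpha_max=1e6):
    # A very large precision alpha_i pins weight i at zero, so it is pruned.
    keep = alpha < alpha_max
    w_pruned = np.where(keep, w, 0.0)
    # The posterior covariance is kept only for the retained weights.
    Sigma_kept = Sigma[np.ix_(keep, keep)]
    return w_pruned, Sigma_kept, alpha[keep]

# Hypothetical three-weight local model whose last weight is irrelevant.
w = np.array([-2.7, 8.9, 1e-9])
Sigma = np.diag([5e-6, 2e-5, 1e-12])
alpha = np.array([13.75, 1.27, 1e9])
w_pruned, Sigma_kept, alpha_kept = ard_prune(w, Sigma, alpha)
print(w_pruned, Sigma_kept.shape, alpha_kept)
```

This mirrors the layout of the tables in Appendix B, where a pruned weight appears as 0 and the covariance matrix shrinks accordingly.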
These characteristics result in satisfactory performance (low prediction errors,
small prediction times, few local models, and consistent convergence in a reasonable
number of iterations) on all three test systems. Hence, we believe that our
model would be suitable for optimal control policy generation using a paradigm
such as model-based reinforcement learning. Such use would require us to construct
the regression models for a denser set of trajectories that would cover the entire
operating environments of the robots with additional controlled robots and obsta-
cles. The challenges then would be to train all the models in a reasonable amount
of time (with acceptable sample complexity) and identify the correct prediction
model efficiently at run time.

6 Appendix A

Theorem 1 is proved as follows.


Proof Consider the log-likelihood function l(θ) = log p(y|θ):

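\begin{align}
l(\theta) &= \log p(y \mid \theta) \notag \\
&= \log \int_{\mathcal{X}(y)} p(x, y \mid \theta)\, dx \notag \\
&= \log \int_{\mathcal{X}(y)} \frac{p(x, y \mid \theta)}{p(x \mid y, \theta^{(t)})}\, p(x \mid y, \theta^{(t)})\, dx \tag{55} \\
&= \log \mathbb{E}_{X \mid y, \theta^{(t)}} \left[ \frac{p(X, y \mid \theta)}{p(X \mid y, \theta^{(t)})} \right]
  && \text{(rewrite the integral as an expectation)} \notag \\
&\geq \mathbb{E}_{X \mid y, \theta^{(t)}} \left[ \log \frac{p(X, y \mid \theta)}{p(X \mid y, \theta^{(t)})} \right]
  && \text{(by Jensen's inequality)} \notag \\
&= \mathbb{E}_{X \mid y, \theta^{(t)}} \left[ \log \frac{p(X \mid \theta)\, p(y \mid X)}{p(X \mid \theta^{(t)})\, p(y \mid X) / p(y \mid \theta^{(t)})} \right]
  && \text{(by Bayes' rule and Markov property)} \notag \\
&= \mathbb{E}_{X \mid y, \theta^{(t)}} \left[ \log \frac{p(X \mid \theta)\, p(y \mid \theta^{(t)})}{p(X \mid \theta^{(t)})} \right] \notag \\
&= \mathbb{E}_{X \mid y, \theta^{(t)}} \left[ \log p(X \mid \theta) \right] - \mathbb{E}_{X \mid y, \theta^{(t)}} \left[ \log p(X \mid \theta^{(t)}) \right] + \log p(y \mid \theta^{(t)}) \notag \\
&= Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + l(\theta^{(t)}), \tag{56}
\end{align}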

where the Q-function is defined in (24). The assumptions of the base X being
independent of θ and Markov property are needed for the proof. The lower bound
of the log-likelihood, therefore, is,

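\begin{equation}
l(\theta) \geq l(\theta^{(t)}) + Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}). \tag{57}
\end{equation}

Based on the assumption $Q(\theta \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})$, we conclude that,

\begin{equation}
l(\theta) \geq l(\theta^{(t)}) + Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \geq l(\theta^{(t)}), \tag{58}
\end{equation}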

which completes the proof.
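The monotonicity established above can be checked numerically on a toy problem. The sketch below is an illustration on a model that is not the one used in this paper: a one-dimensional mixture of two Gaussians with equal weights and known unit variance, where only the means are estimated. All names and the synthetic data are assumptions.

```python
import numpy as np

def log_likelihood(y, mu, sigma=1.0):
    # l(theta) for an equal-weight two-component Gaussian mixture.
    comp = np.exp(-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2)
    comp /= sigma * np.sqrt(2.0 * np.pi)
    return float(np.sum(np.log(0.5 * comp.sum(axis=1))))

def em_step(y, mu, sigma=1.0):
    # E-step: responsibilities under the current parameters theta^(t).
    comp = np.exp(-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2)
    resp = comp / comp.sum(axis=1, keepdims=True)
    # M-step: the means that maximize Q(theta | theta^(t)).
    return (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
mu = np.array([0.0, 0.5])
history = [log_likelihood(y, mu)]
for _ in range(20):
    mu = em_step(y, mu)
    history.append(log_likelihood(y, mu))

# Each M-step satisfies Q(theta | theta^(t)) >= Q(theta^(t) | theta^(t)),
# so the log-likelihood should never decrease, up to floating-point error.
print(all(b >= a - 1e-9 for a, b in zip(history, history[1:])))
```

Because the M-step here maximizes the Q-function exactly, the assumption of the theorem holds at every iteration.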



Theorem 2 is proved as follows.

Proof Add log p(θ) to both sides of (57),

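\begin{align*}
l(\theta) + \log p(\theta) &\geq l(\theta^{(t)}) + Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \log p(\theta) \\
&= l(\theta^{(t)}) + \log p(\theta^{(t)}) + Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \log p(\theta) - \log p(\theta^{(t)}) \\
&\geq l(\theta^{(t)}) + \log p(\theta^{(t)}),
\end{align*}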

where the last line follows from the assumption made in the Theorem.


7 Appendix B

Table 4: Local model parameters for the target state x(t) of the MSD system. In
models 2 and 3, as we adopt ARD, the bias term is pruned out; therefore, the
covariance matrix is in R^(2×2).

Parameter  Learned value

c1         [0.16434, −0.58072]^T
λ1         [0.33351, 0.27799]^T
w1         [−0.2818, 0.87123, 0.08370]^T
Σ1         [  1.46×10^-5    4.22×10^-6   −1.43×10^-6 ]
           [  4.22×10^-6    5.28×10^-5    4.38×10^-6 ]
           [ −1.43×10^-6    4.38×10^-6    1.28×10^-6 ]
α1         [12.59, 1.32, 142.69]^T
βf1        14281.53

c2         [−0.02749, −0.06403]^T
λ2         [0.56134, 0.31700]^T
w2         [−2.69628, 8.88278, 0]^T
Σ2         [  5.07×10^-6   −9.40×10^-7 ]
           [ −9.40×10^-7    1.67×10^-5 ]
α2         [13.75, 1.27]^T
βf2        14250.26

c3         [−0.07787, 0.35885]^T
λ3         [0.33493, 0.25643]^T
w3         [−0.05474, 0.32631, 0]^T
Σ3         [  1.03×10^-5    2.70×10^-5 ]
           [  2.70×10^-5    1.12×10^-4 ]
α3         [332.40, 9.38]^T
βf3        14371.72

Table 5: Local model parameters for the response state ẋ(t) of the MSD system. In
model 2, as we adopt ARD, the bias term is pruned out; therefore, the covariance
matrix is in R^(2×2).

Parameter  Learned value

c1         [0.16434, −0.58072]^T
λ1         [0.32576, 0.28048]^T
w1         [0.96775, 0.30136, −0.55395]^T
Σ1         [  2.11×10^-4    6.38×10^-5   −2.10×10^-5 ]
           [  6.38×10^-5    7.81×10^-4    6.40×10^-5 ]
           [ −2.10×10^-5    6.40×10^-5    1.88×10^-5 ]
α1         [1.07, 10.92, 3.26]^T
βf1        974.10

c2         [−0.02749, −0.06403]^T
λ2         [0.31075, 0.32949]^T
w2         [0.52709, −0.28383, 0]^T
Σ2         [  7.51×10^-5   −2.70×10^-5 ]
           [ −2.70×10^-5    3.32×10^-4 ]
α2         [3.60, 12.36]^T
βf2        983.51

c3         [−0.07787, 0.35885]^T
λ3         [0.30229, 0.26645]^T
w3         [0.79691, −0.25168, 0.24750]^T
Σ3         [  2.60×10^-4    2.73×10^-4    4.56×10^-5 ]
           [  2.73×10^-4    1.81×10^-3   −4.95×10^-5 ]
           [  4.56×10^-5   −4.95×10^-5    1.80×10^-5 ]
α3         [1.57, 5.34, 16.32]^T
βf3        976.02

Table 6: Local model parameters for the response state θ0 of the SDIP system.
The bias term is pruned out in models 1 and 2, and both the bias and the first
linear feature are pruned in model 3.

Parameter  Learned value

c1         [2.584×10^-2, 7.47×10^-4, −5.01×10^-3, −3.64×10^-3, −1.32×10^-3, −5.13×10^-3, 1.55]^T
λ1         [0.41, 0.30, 0.27, 0.28, 0.29, 0.27, 12.27]^T
w1         [0, −0.59, −54.63, 0.13, 19.07, 12.26, 17.96]^T
Σ1         [  6.00×10^-7    5.90×10^-6   −6.70×10^-6    1.32×10^-6    1.65×10^-7    3.60×10^-6 ]
           [  5.90×10^-6    1.52×10^-4   −1.03×10^-4    2.02×10^-5   −7.80×10^-6    5.02×10^-5 ]
           [ −6.70×10^-6   −1.03×10^-4    3.43×10^-4   −5.95×10^-5    1.01×10^-5   −2.78×10^-4 ]
           [  1.32×10^-6    2.02×10^-5   −5.95×10^-5    7.28×10^-5    5.51×10^-5   −1.39×10^-5 ]
           [  1.65×10^-7   −7.80×10^-6    1.01×10^-5    5.51×10^-5    1.50×10^-5   −9.23×10^-5 ]
           [  3.60×10^-6    5.02×10^-5   −2.78×10^-4   −1.39×10^-5   −9.24×10^-5    3.09×10^-4 ]
α1         [2.84, 3.35×10^-4, 6.22×10^1, 2.75×10^-3, 6.65×10^-4, 3.10×10^-3]^T
βf1        622729.59

c2         [0.050, 0.003, 0.005, 0.003, 0.001, 0.003, −0.102]^T
λ2         [0.47, 0.30, 0.29, 0.30, 0.32, 0.29, 10.72]^T
w2         [0, 1.65, 31.83, 60.51, −7.44, −2.166, −69.39]^T
Σ2         [  5.66×10^-7    6.32×10^-6   −8.49×10^-6    1.16×10^-6    2.14×10^-7    4.77×10^-6 ]
           [  6.32×10^-6    1.89×10^-4   −1.65×10^-4    1.69×10^-5    2.91×10^-6    7.49×10^-5 ]
           [ −8.49×10^-6   −1.65×10^-4    2.65×10^-4   −5.83×10^-5   −1.75×10^-5   −1.67×10^-4 ]
           [  1.16×10^-6    1.69×10^-5   −5.83×10^-5    6.79×10^-5    5.04×10^-5   −9.07×10^-6 ]
           [  2.14×10^-7    2.91×10^-6   −1.75×10^-5    5.04×10^-5    1.41×10^-4   −7.02×10^-5 ]
           [  4.77×10^-6    7.49×10^-5   −1.67×10^-4   −9.07×10^-6   −7.02×10^-5    1.91×10^-4 ]
α2         [3.68×10^-1, 9.87×10^-4, 2.73×10^-4, 1.81×10^-2, 2.13×10^-1, 2.08×10^-4]^T
βf2        661996.37

c3         [0.021, 0.004, 0.007, 0.003, 0.003, 0.007, 0.757]^T
λ3         [0.27, 0.30, 0.30, 0.30, 0.289, 0.31, 6.67]^T
w3         [0, 0, 21.99, −60.59, −11.75, −10.16, 51.15]^T
Σ3         [  1.22×10^-4   −5.85×10^-5    4.884×10^-6   3.230×10^-6   7.690×10^-6 ]
           [ −5.852×10^-5   2.67×10^-4   −4.86×10^-5   −6.793×10^-6  −2.04×10^-4  ]
           [  4.884×10^-6  −4.859×10^-5   5.42×10^-5    3.59×10^-5    1.000×10^-5 ]
           [  3.230×10^-6  −6.793×10^-6   3.59×10^-5    1.28×10^-4   −5.357×10^-5 ]
           [  7.69×10^-4   −2.04×10^-4    1.000×10^-5  −5.357×10^-5   2.01×10^-4  ]
α3         [0.002, 0.000, 0.007, 0.010, 0.000]^T
βf3        657070.27

References

1. Reinforcement Learning in Robotics: A Survey. Springer Tracts in Advanced Robotics 97,


9–67 (2014)
2. Alaeddini, A., Alemzadeh, S., Mesbahi, A., Mesbahi, M.: Linear model regression
on time-series data: Non-asymptotic error bounds and applications. arXiv preprint
arXiv:1807.06611 (2018)
3. Atkeson, C.G.: Using local models to control movement. In: Advances in Neural Informa-
tion Processing Systems, pp. 316–323 (1990)
4. Atkeson, C.G.: Using local trajectory optimizers to speed up global optimization in dy-
namic programming. In: Advances in Neural Information Processing Systems, pp. 663–670
(1994)
5. Atkeson, C.G.: Nonparametric model-based reinforcement learning. In: Advances in Neural
Information Processing Systems, pp. 1008–1014 (1998)
6. Balakrishnan, S., Wainwright, M.J., Yu, B., et al.: Statistical guarantees for the EM algo-
rithm: From population to sample-based analysis. The Annals of Statistics 45(1), 77–120
(2017)
7. Banerjee, A.G., Rajasekaran, K., Parsa, B.: A step toward learning to control tens of
optically actuated microrobots in three dimensions. In: IEEE International Conference on
Automation Science and Engineering (2018)
42 Behnoosh Parsa et al.

8. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer-Verlag, New York,
NY (2006)
9. Bogdanov, A.: Optimal control of a double inverted pendulum on a cart. Oregon Health
and Science University, Tech. Rep. CSE-04-006, OGI School of Science and Engineering,
Beaverton, OR (2004)
10. Boyles, R.A.: On the convergence of the EM algorithm. Journal of the Royal Statistical
Society. Series B (Methodological) pp. 47–50 (1983)
11. Callegari, A., Mijalkov, M., Gököz, A.B., Volpe, G.: Computational toolbox for optical
tweezers in geometrical optics. Journal of the Optical Society of America B 32(5), B11–B19
(2015)
12. Cherkassky, V., Shao, X., Mulier, F.M., Vapnik, V.N.: Model complexity control for re-
gression using VC generalization bounds. IEEE Transactions on Neural Networks 10(5),
1075–1089 (1999)
13. Cleveland, W.S., Devlin, S.J.: Locally weighted regression: an approach to regression anal-
ysis by local fitting. Journal of the American Statistical Association 83(403), 596–610
(1988)
14. Csató, L., Opper, M.: Sparse on-line Gaussian processes. Neural Computation 14(3),
641–668 (2002)
15. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) pp.
1–38 (1977)
16. Germain, P., Bach, F., Lacoste, A., Lacoste-Julien, S.: PAC-Bayesian theory meets
Bayesian inference. In: Advances in Neural Information Processing Systems, pp. 1884–1892
(2016)
17. Gijsberts, A., Metta, G.: Real-time model learning using incremental sparse spectrum
Gaussian process regression. Neural Networks 41, 59–69 (2013)
18. Grünwald, P.D., Mehta, N.A.: A tight excess risk bound via a unified PAC-Bayesian-
Rademacher-Shtarkov-MDL complexity. arXiv preprint arXiv:1710.07732 (2017)
19. Gunawardana, A., Byrne, W.: Convergence theorems for generalized alternating minimiza-
tion procedures. Journal of Machine Learning Research 6, 2049–2073 (2005)
20. Gupta, M.R., Chen, Y.: Theory and use of the EM algorithm. Foundations and Trends R
in Signal Processing 4(3), 223–296 (2011)
21. Ha, J.S., Choi, H.L.: Multiscale abstraction, planning and control using diffusion wavelets
for stochastic optimal control problems. In: IEEE International Conference on Robotics
and Automation, pp. 687–694 (2017)
22. Hastie, T., Loader, C.: Local regression: Automatic kernel carpentry. Statistical Science
pp. 120–129 (1993)
23. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining,
inference, and prediction. Springer-Verlag, New York, NY (2009)
24. Hensman, J., Matthews, A.G., Filippone, M., Ghahramani, Z.: MCMC for variationally
sparse Gaussian processes. In: Advances in Neural Information Processing Systems, pp.
1648–1656 (2015)
25. Hunt, K.J., Sbarbaro, D., Żbikowski, R., Gawthrop, P.J.: Neural networks for control
systemsa survey. Automatica 28(6), 1083–1112 (1992)
26. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts.
Neural Computation 3(1), 79–87 (1991)
27. Jordan, M.I.: Learning in graphical models (Adaptive computation and machine learning).
MIT Press, Cambridge, MA (1999)
28. Jordan, M.I., Jacobs, R.A.: Hierarchical mixtures of experts and the EM algorithm. Neural
Computation 6(2), 181–214 (1994)
29. Júnior, A.H.S., Barreto, G.A., Corona, F.: Regional models: A new approach for nonlinear
system identification via clustering of the self-organizing map. Neurocomputing 147, 31–
46 (2015)
30. Kakade, S.M., Sridharan, K., Tewari, A.: On the complexity of linear prediction: Risk
bounds, margin bounds, and regularization. In: Advances in Neural Information Processing
Systems, pp. 793–800 (2009)
31. Karatzas, I., Steven, E.S.: Brownian motion and stochastic calculus. Springer Science &
Business Media, New York, NY (2012)
32. Kononenko, I.: Bayesian neural networks. Biological Cybernetics 61(5), 361–370 (1989)
33. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
(2012)
Bayesian Local Linear Regression 43

34. Lázaro-Gredilla, M., Quiñonero Candela, J., Rasmussen, C.E., Figueiras-Vidal, A.R.:
Sparse spectrum Gaussian process regression. Journal of Machine Learning Research
11(Jun), 1865–1881 (2010)
35. MacKay, D.J.: Bayesian interpolation. Neural Computation 4(3), 415–447 (1992)
36. Matthews, A.: Scalable Gaussian process inference using variational methods. Ph.D. thesis,
University of Cambridge (2016)
37. Matthews, A.G.d.G., Hensman, J., Turner, R., Ghahramani, Z.: On sparse variational
methods and the Kullback-Leibler divergence between stochastic processes. Journal of
Machine Learning Research 51, 231–239 (2016)
38. Meier, F., Hennig, P., Schaal, S.: Efficient bayesian local model learning for control. In:
Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference
on, pp. 2244–2249 (2014)
39. Meir, R., Zhang, T.: Generalization error bounds for Bayesian mixture algorithms. Journal
of Machine Learning Research 4(Oct), 839–860 (2003)
40. Narendra, K.S., Parthasarathy, K.: Identification and control of dynamical systems using
neural networks. IEEE Transactions on Neural Networks 1(1), 4–27 (1990)
41. Neal, R.M.: Bayesian learning for neural networks, vol. 118. Springer Science & Business
Media (2012)
42. Neal, R.M., Hinton, G.E.: In: M.I. Jordan (ed.) Learning in Graphical Models, chap. A
View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, pp.
355–368. Springer, Dordrecht (1998)
43. Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse,
and other variants. In: Learning in Graphical Models, pp. 355–368. Springer (1998)
44. Nelles, O.: Nonlinear system identification: from classical approaches to neural networks
and fuzzy models. Springer Science & Business Media (2013)
45. Ogunnaike, B.A., Ray, W.H.: Process dynamics, modeling, and control. Oxford University
Press New York (1994)
46. Opper, M., Archambeau, C.: The variational Gaussian approximation revisited. Neural
Computation 21(3), 786–792 (2009)
47. Opper, M., Vivarelli, F.: General bounds on Bayes errors for regression with Gaussian
processes. In: Advances in Neural Information Processing Systems, pp. 302–308 (1999)
48. Rajasekaran, K., Bollavaram, M., Banerjee, A.G.: Toward automated formation of micro-
sphere arrangements using multiplexed optical tweezers. In: SPIE Nanoscience + Engi-
neering, p. 99222Y. San Diego, CA (2016)
49. Rasmussen, C.E.: Gaussian processes in machine learning. In: Advanced Lectures on
Machine Learning, pp. 63–71. Springer (2004)
50. Rudy, S., Alla, A., Brunton, S.L., Kutz, J.N.: Data-driven identification of parametric
partial differential equations. arXiv preprint arXiv:1806.00732 (2018)
51. Schaal, S., Atkeson, C.G.: Constructive incremental learning from only local information.
Neural Computation 10(8), 2047–2084 (1998)
52. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances
in Neural Information Processing Systems, pp. 1257–1264 (2006)
53. Ting, J.A., Vijayakumar, S., Schaal, S.: Locally weighted regression for control. In: Ency-
clopedia of Machine Learning, pp. 613–624. Springer (2011)
54. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. Journal of
Machine Learning Research 1, 211–244 (2001)
55. Titsias, M.: Variational learning of inducing variables in sparse Gaussian processes. In:
Artificial Intelligence and Statistics, pp. 567–574 (2009)
56. Tu, J.H., Rowley, C.W., Luchtenburg, D.M., Brunton, S.L., Kutz, J.N.: On dynamic mode
decomposition: theory and applications. arXiv preprint arXiv:1312.0041 (2013)
57. Tzikas, D.G., Likas, A.C., Galatsanos, N.P.: The variational approximation for Bayesian
inference: Life after the EM algorithm. IEEE Signal Processing Magazine 25(6), 131–146
(2008)
58. Vijayakumar, S., Schaal, S.: Locally weighted projection regression: An O(n) algorithm
for incremental real time learning in high dimensional space. In: International Conference
on Machine Learning, pp. 288–293 (2000)
59. van der Wilk, M., Rasmussen, C.E., Hensman, J.: Convolutional Gaussian processes. In:
Advances in Neural Information Processing Systems, pp. 2845–2854 (2017)
60. Wilson, A.G., Hu, Z., Salakhutdinov, R.R., Xing, E.P.: Stochastic variational deep kernel
learning. In: Advances in Neural Information Processing Systems, pp. 2586–2594 (2016)
44 Behnoosh Parsa et al.

61. Wu, C.J.: On the convergence properties of the EM algorithm. The Annals of Statistics
pp. 95–103 (1983)
62. Xu, L., Jordan, M.I., Hinton, G.E.: An alternative model for mixtures of experts. In:
Advances in Neural Information Processing Systems, pp. 633–640 (1995)
