A Hierarchical Bayesian Linear Regression Model With Local Features For Stochastic Dynamics Approximation
All content following this page was uploaded by Behnoosh Parsa on 15 August 2018.
B. Parsa
Department of Mechanical Engineering, University of Washington, Seattle, WA, USA
E-mail: [email protected]
K. Rajasekaran
Department of Mechanical Engineering, University of Maryland, College Park, MD, USA
E-mail: [email protected]
F. Meier
Autonomous Motion Department
Max Planck Institute for Intelligent Systems, Tübingen, Germany
E-mail: [email protected]
A. G. Banerjee
Department of Industrial & Systems Engineering and Department of Mechanical Engineering
University of Washington, Seattle, WA, USA
E-mail: [email protected]
2 Behnoosh Parsa et al.
1 Introduction
Analyzing the dynamics of nonlinear systems has always been an
important area of research. This field includes nonlinear system identification [45,
44], modal analysis of dynamical systems [50, 56], system identification using neural
networks [40, 25], and supervised learning of time series data, which is also known
as function approximation or regression [2, 53, 58, 13, 29].
Most of the dynamical systems we want to understand and control are either
stochastic, or they have some inherent noise components, which makes learning
their models challenging. Moreover, in model-based control [5, 1], or optimal con-
trol [21], often the model must be learned online as we get some measurement
data throughout the process. Hence, the learning process not only has to result in
accurate and precise approximations of the real-world behavior of the system but
also must be fast enough so that it will not delay the control process. For exam-
ple, in model-based reinforcement learning [5, 1], an agent first learns a model of
the environment and then uses that model to decide which action is best to take
next. If the computation of the transition dynamics takes very long, the predicted
policy is no longer useful. On the other hand, if the control policy is learned based
on an imprecise transition model, the policy cannot be used without significant
modifications as shown in Atkeson (1994) [4]. These learning procedures can be
combined with Bayesian inference techniques to capture the stochasticities and/or
uncertainties in the dynamical systems as well as the nonlinearities [38, 41, 32].
One of the popular methods to learn the transition dynamics is supervised
learning. Usually, there are three different ways to construct the learning criteria:
global, local, or a combination of both. From another perspective, these methods
are classified into memory-based (lazy) or memory-less (eager) methods based on
whether they use the training sets in the prediction process. An example of the
former methodology is the k-nearest neighbor algorithm, and for the latter, the
artificial neural network is a good representative. In an artificial neural network,
the target function is approximated globally during training, implying that we do
not need the training set to run an inference for a new query. Therefore, it requires
much less memory than a lazy learning system. Moreover, the post-training queries
have no effect on the learned model itself, and we get the same result every time
for a given query. On the other hand, in lazy learning, the training set expands
for any new query, and, thus, the model’s prediction changes over time.
Most regression algorithms are global in that they minimize a global loss func-
tion. Gaussian processes (GPs) are popular global methods [49] that allow us to
estimate the hyperparameters under uncertainties. However, their computational
cost ($O(N^3)$ for $N$ observations) limits their applications in learning the transition
dynamics for robot control. There have been efforts both in sparsifying Gaussian
Bayesian Local Linear Regression 3
process regression (GPR) [52, 55, 34, 37, 24] and developing online [33, 60] and in-
cremental [17, 14] algorithms for the sparse GPR to make it applicable for robotic
systems. Another challenging question is how to construct these processes, for
instance, how to initialize the hyperparameters or define a reasonable length scale.
The convolutional Gaussian processes presented in [59] adopt a variational
framework for approximation in GP models. The variational objective minimizes
the KL divergence across the entire latent process [36], which guarantees an
exact model approximation given enough resources. The computational complexity
of this algorithm is $O(NM^2)$ ($M \ll N$) through sparse approximation [55].
Moreover, it optimizes a non-Gaussian likelihood function [46].
1.3 Combining Local and Global Regression Methods for Dynamics Learning
Fig. 1: Graphical model for a global linear regression problem solved using direct
maximum likelihood (ML) estimation without any prior. y (n) is the observed ran-
dom variable and w is the unknown model parameter. In this model, no prior
knowledge is available about the unknown parameter.
Usually, the online performance of global methods depends on how well the dis-
tance metric is learned offline, which is based on how well the training data rep-
resent the input space. There are several concerns with linear models as discussed
above. Therefore, one should develop better methods to tackle these problems.
Local Gaussian regression (LGR) in [38] is a probabilistic alternative to LWR,
which transforms it into a localized inference procedure by employing variational
approximation. This method combines the best of global and local regression
frameworks by combining the well-known Bayesian regression framework with ra-
dial basis function (RBF) features for nonlinear function approximation. This is
a top-down approach that takes advantage of the efficiency of the local regression
and the accuracy of the global Bayesian regression and incorporates local features.
In this method, maximizing the likelihood function is facilitated by the introduc-
tion of the hidden targets for every basis function, which act as links that connect
observations to the unknown parameters via Bayes’ law. In addition, the likelihood
function is maximized using the variational Expectation Maximization (EM) algorithm
[43, 27, 57]. The EM algorithm is a popular method that iteratively maximizes
the likelihood function without explicitly computing it. The variational EM is an
alternative algorithm that approximates the posterior distribution with a factor-
ized function of the hidden variables in the model [57]. The use of variational EM
makes LGR a fast learning algorithm. A graphical model representation of this
algorithm is shown in Fig. 1.
Another useful characteristic of LGR is optimizing the RBF length scale so
that the minimum number of local models is used for learning. This optimization
reduces the size of the model significantly. In addition, it adopts the idea of Auto-
matic Relevance Determination (ARD) [35, 54, 41] that results in a sparse model
as described later in section 2.2. Both these adjustments result in a compact model
that does not occupy much memory as compared to the space needed to store the
training set.
1.5 Summary
In this paper, we adopt the LGR model presented in [38], and modify it suitably
for batch-wise learning of stochastic dynamics. We term this model the Batch-hierarchical
Bayesian linear regression (Batch-HBLR) model. Moreover, we describe all the steps to
derive the Bayes update equations for the posterior distributions of the model. We then
analyze the convergence of the variational EM
algorithm that is at the core of training the model. Subsequently, we evaluate
its performance experimentally on three different dynamical systems, including a
challenging external force-field actuated micro-robot. Results indicate good per-
formance on all the systems in terms of approximating the dynamics closely with
a parsimonious model structure leading to fast computation times during testing.
We, therefore, anticipate our Batch-HBLR model to provide a foundation for online
model-based reinforcement learning of robot motion control under uncertainty.
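We use the notation $\mathcal{N}(x; \mu, \Sigma)$ in the rest of the manuscript to refer to a normal
distribution, where $\mu$ is the mean vector and $\Sigma$ is the covariance
matrix. The log-likelihood for $N$ i.i.d. samples drawn from $\mathcal{N}(x; \mu, \Sigma)$ is:
\[
\begin{aligned}
\log p(X; \mu, \Sigma) &= -\frac{N}{2} \log\left((2\pi)^d |\Sigma|\right) - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu) + \text{const.} \\
&\propto -\frac{N}{2} \log\left((2\pi)^d |\Sigma|\right) - \frac{1}{2} \sum_{n=1}^{N} \mathrm{Trace}\left(\Sigma^{-1} (x_n - \mu)(x_n - \mu)^\top\right) \\
&= -\frac{N}{2} \log\left((2\pi)^d |\Sigma|\right) - \frac{1}{2} \mathrm{Trace}\left(\Sigma^{-1} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^\top\right).
\end{aligned}
\tag{2}
\]
Here, $X \in \mathbb{R}^{N \times d}$ is a matrix containing all the samples, and $x_n \in \mathbb{R}^d$ represents
an individual sample. Moreover, in this manuscript, we use log for the natural
logarithm.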
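The equivalence of the sum form and the trace form in (2) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names are illustrative and are not part of the original implementation.

```python
import numpy as np

def gaussian_loglik_sum(X, mu, Sigma):
    """Log-likelihood written as a sum of per-sample quadratic forms."""
    N, d = X.shape
    diff = X - mu
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    # sum_n (x_n - mu)^T Sigma^{-1} (x_n - mu)
    quad = np.einsum("ni,ij,nj->", diff, Sigma_inv, diff)
    return -0.5 * N * (d * np.log(2.0 * np.pi) + logdet) - 0.5 * quad

def gaussian_loglik_trace(X, mu, Sigma):
    """Same quantity written with the trace of Sigma^{-1} times the scatter matrix."""
    N, d = X.shape
    diff = X - mu
    scatter = diff.T @ diff  # sum_n (x_n - mu)(x_n - mu)^T
    _, logdet = np.linalg.slogdet(Sigma)
    trace_term = np.trace(np.linalg.inv(Sigma) @ scatter)
    return -0.5 * N * (d * np.log(2.0 * np.pi) + logdet) - 0.5 * trace_term

# Illustrative data (hypothetical values, used only to compare the two forms).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
mu = X.mean(axis=0)
Sigma = np.cov(X.T) + 0.1 * np.eye(3)
print(gaussian_loglik_sum(X, mu, Sigma), gaussian_loglik_trace(X, mu, Sigma))
```

The trace form is convenient in practice because the scatter matrix is computed once and reused for any candidate covariance.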
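For a random variable $x$ drawn from the gamma distribution ($x \sim \mathcal{G}(\alpha, \beta)$),
we use the following pdf,
\[
p(x; \alpha, \beta) = \frac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}. \tag{3}
\]
We use the notation $\mathcal{G}(x; \alpha, \beta)$ in the rest of the manuscript to refer to a gamma
distribution. Using Stirling's formula for the gamma function, we approximate
$\log \Gamma(\alpha)$ for $\mathrm{Re}(\alpha) > 0$ with $\alpha \log(\alpha) - \alpha$. Since $\log \Gamma(\alpha)$ enters the
logarithm of (3) with a negative sign, we get the following
log-likelihood function for the gamma distribution,
\[
\log p(x; \alpha, \beta) = \alpha \log(\beta) + (\alpha - 1) \log(x) - \beta x + \alpha - \alpha \log(\alpha). \tag{4}
\]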
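To give a feel for the quality of this approximation, the sketch below compares the exact gamma log-density, computed with the standard library function math.lgamma, against the version that replaces log Γ(α) with α log(α) − α. The approximation enters with a minus sign because log Γ(α) is subtracted in the logarithm of (3). The function names are illustrative only.

```python
import math

def gamma_logpdf_exact(x, alpha, beta):
    """Exact log-density of a gamma distribution with shape alpha and rate beta."""
    return (alpha * math.log(beta) + (alpha - 1.0) * math.log(x)
            - beta * x - math.lgamma(alpha))

def gamma_logpdf_stirling(x, alpha, beta):
    """Log-density with log Gamma(alpha) replaced by alpha*log(alpha) - alpha."""
    log_gamma_approx = alpha * math.log(alpha) - alpha
    return (alpha * math.log(beta) + (alpha - 1.0) * math.log(x)
            - beta * x - log_gamma_approx)

# The gap between the two versions is the error of the crude Stirling formula,
# which grows only logarithmically with alpha.
for alpha in (10.0, 100.0, 1000.0):
    gap = gamma_logpdf_exact(1.0, alpha, alpha) - gamma_logpdf_stirling(1.0, alpha, alpha)
    print(alpha, gap)
```

The gap does not depend on x or β, so it shifts the log-likelihood by a term that depends on α only.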
At this point, we also want to clarify the difference between the two notations
$p(x; \theta)$ and $p(x \mid \theta)$. We use the former when $\theta$ is a vector of parameters; the
probability is then a function of $\theta$, and we call it the likelihood function. In contrast, the
latter denotes the conditional probability of $x$ when $\theta$ is a random variable.
In the Bayesian framework, consider the function $h(x) \in \mathbb{R}$ and the variable $x \in
\Omega \subseteq \mathbb{R}^d$. We want to predict the function value $y^* = h(x^*)$ at an arbitrary location
$x^* \in \Omega$, using a set of $N$ noisy observations, $(X, Y) = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$, where
$y^{(n)} = h(x^{(n)}) + \epsilon^{(n)}$. The noise terms $\epsilon^{(n)}$ are independently drawn from a zero-mean Gaussian
distribution,
\[
p(\epsilon) = \mathcal{N}(0, \beta_y^{-1} I) \tag{5}
\]
where $\beta_y$ is the precision and $\epsilon = [\epsilon^{(1)}, \ldots, \epsilon^{(N)}]^\top$. Therefore, one can assume
$(X, Y)$ is randomly sampled from a data generating distribution represented by
$\mathcal{D}$, and denote $(X, Y) \sim \mathcal{D}^N$ as the i.i.d. observations of $N$ elements, $x^{(n)} \in \mathbb{R}^d$,
and $y^{(n)} \in \mathbb{R}$.
In basic (global) Bayesian regression, the function $h(x)$ is modeled as a linear
combination of $P$ basis functions, and if these functions also include a bias term,
then $P = d + 1$.
\[
h(x^{(n)}) = w^\top \phi(x^{(n)}) \tag{6}
\]
where $w = [w_1, \ldots, w_P]^\top$ are the weights of the linear combination, and $\phi(x^{(n)}) =
\phi^{(n)} = (x^{(n)\top}, 1)^\top$. By definition, $y(x^{(n)}) = w^\top \phi(x^{(n)}) + \epsilon^{(n)}$. Therefore, we
write the likelihood function in the following form
\[
p(Y; w) = \prod_{n=1}^{N} \mathcal{N}\left(y^{(n)}; w^\top \phi^{(n)}, \beta_y^{-1}\right). \tag{7}
\]
\[
p(Y; \tilde{w}) = \prod_{n=1}^{N} \mathcal{N}\left(y^{(n)}; \sum_{m=1}^{M} w_m^\top \phi_m^{(n)}, \beta_y^{-1}\right), \tag{8}
\]
where $Y = [y^{(1)}, \ldots, y^{(N)}]^\top \in \mathbb{R}^{N \times 1}$, and $w_m \in \mathbb{R}^{P \times 1}$. $M$ is the dimensionality
of the weighted feature vector $\phi_m^{(n)} = \phi_m(x^{(n)}) = \eta_m(x^{(n)}) \xi_m \in \mathbb{R}^P$, or the number
of linear models. Here, we build the feature vector with linear bases plus a bias
term, and define the $m$th feature vector as $\xi_m = ((x - c_m)^\top, 1)^\top$. The weights
for these $M$ spatially localized basis functions are defined by the Radial Basis
Function (RBF) kernel, whereby the $m$th RBF feature weight is given by
\[
\eta_m(x) = \exp\left[-\frac{1}{2} (x - c_m)^\top \Lambda_m^{-1} (x - c_m)\right]. \tag{9}
\]
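The construction of one weighted local feature from (9) can be sketched as follows. This is an illustration only: it assumes that $\Lambda_m$ is diagonal with entries given by a length-scale vector (called lam_m here), and the function names are hypothetical.

```python
import numpy as np

def rbf_weight(x, c_m, lam_m):
    """RBF activation eta_m(x), assuming Lambda_m = diag(lam_m)."""
    diff = x - c_m
    return np.exp(-0.5 * np.sum(diff ** 2 / lam_m))

def local_feature(x, c_m, lam_m):
    """Weighted feature phi_m(x) = eta_m(x) * xi_m with xi_m = ((x - c_m)^T, 1)^T."""
    xi = np.append(x - c_m, 1.0)  # centered linear terms plus a bias term
    return rbf_weight(x, c_m, lam_m) * xi

c_m = np.array([0.5, -1.0])
lam_m = np.array([0.3, 0.3])
print(local_feature(c_m, c_m, lam_m))                          # at the center only the bias is active
print(local_feature(c_m + np.array([0.2, 0.0]), c_m, lam_m))   # slightly away from the center
```

At the center of a local model the activation equals one and the linear terms vanish, so the local prediction there reduces to the bias weight.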
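\[
p(Y, f; \theta) = \prod_{n=1}^{N} \mathcal{N}\left(y^{(n)}; \mathbf{1}^\top \tilde{f}^{(n)}, \beta_y^{-1}\right) \prod_{m=1}^{M} \mathcal{N}\left(f_m^{(n)}; w_m^\top \phi_m^{(n)}, \beta_{f_m}^{-1}\right). \tag{10}
\]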
Fig. 2: Graphical model illustrating the algorithm (adapted from [38]). The random
variables inside the dark circles are observed. The variables inside the white circles
are unknown. The variables represented with small black circles are the model
parameters. This is a model with a hierarchical prior, since it includes priors on
the parameters of the priors. That is why we use a variational EM algorithm to
solve the maximum likelihood estimation problem efficiently. The priors on the
precision parameters are stationary, while those on the model weights are non-
stationary.
\[
p(\tilde{w}; \alpha) = \prod_{m=1}^{M} \mathcal{N}(w_m; 0, A_m^{-1}). \tag{11}
\]
After training, some of these precision parameters usually converge to large
numbers, which indicates that the corresponding features do not play a significant
role in representing the data. Thus, it is reasonable to ignore those features
in order to reduce the dimensionality of the regression model. This technique
is called pruning, which largely alleviates the problem arising from the curse of
dimensionality.
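The joint prior distribution of all the model parameters is:
\[
\begin{aligned}
p(\tilde{w}, \beta_f, A_m^{-1}) &= \prod_{m=1}^{M} p(w_m \mid \alpha_m)\, p(\beta_{f_m}) \prod_{m=1}^{M} p(\alpha_m) \\
&= \prod_{m=1}^{M} \mathcal{N}(w_m; 0, A_m^{-1})\, \mathcal{G}(\beta_{f_m}; a_0^{\beta}, b_0^{\beta}) \prod_{m=1}^{M} \prod_{p=1}^{P} \mathcal{G}(\alpha_m^{(p)}; a_0^{\alpha(p)}, b_0^{\alpha(p)}).
\end{aligned}
\tag{12}
\]
To summarize, the model has hidden variables $\{w_m, \{f_m^{(n)}\}_{n=1}^{N}\}_{m=1}^{M}$ and parameters
$\theta = \{\beta_y, \{\beta_{f_m}, \alpha_m, \lambda_m\}_{m=1}^{M}\}$.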
We now want to infer the posterior probability p(θ|Y). One of the most popular
methods to do so is maximum likelihood (ML). According to this approach, the
ML estimate is obtained as
with
\[
F(q, \theta) = \int q(z) \log \frac{p(Y, z; \theta)}{q(z)}\, dz, \tag{15}
\]
and
\[
\mathrm{KL}(q \,\|\, p) = -\int q(z) \log \frac{p(z \mid Y; \theta)}{q(z)}\, dz, \tag{16}
\]
where $q(z)$ is an arbitrary probability distribution for the hidden variable $z$, and
$\mathrm{KL}(q \,\|\, p)$ is the Kullback-Leibler (KL) divergence between $q(z)$ and $p(z \mid Y; \theta)$.
Since $\mathrm{KL}(q \,\|\, p) \geq 0$, $\log p(Y; \theta) \geq F(q, \theta)$. Hence, $F(q, \theta)$ is a lower bound of
the log-likelihood function. The EM algorithm maximizes the lower bound using
a two-step iterative procedure. Assume that the current value of the parameters is
$\theta^{old}$. The E-step maximizes the lower bound with respect to $q(z)$, which happens
when $q(z) = p(z \mid Y; \theta^{old})$. In this case, the lower bound is equal to the log-likelihood
because $\mathrm{KL}(q \,\|\, p) = 0$. In the M-step, we keep $q(z)$ constant, and maximize the
lower bound with respect to the parameters to find a new value $\theta^{new}$.
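The relationship between the log-likelihood, the lower bound, and the KL divergence can be illustrated on a toy model. The sketch below is an assumption-laden example rather than part of the Batch-HBLR model: it uses a discrete hidden variable with two states, so that the integrals in (15) and (16) become sums, and a univariate Gaussian observation model with made-up parameters.

```python
import math

def normal_pdf(y, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bound_terms(y, q, prior, means, var=1.0):
    """Return log p(y), the lower bound F(q), KL(q || p(z|y)), and p(z|y)."""
    joint = [p * normal_pdf(y, m, var) for p, m in zip(prior, means)]  # p(y, z)
    evidence = sum(joint)                                              # p(y)
    posterior = [j / evidence for j in joint]                          # p(z | y)
    lower_bound = sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint) if qk > 0)
    kl = -sum(qk * math.log(pk / qk) for qk, pk in zip(q, posterior) if qk > 0)
    return math.log(evidence), lower_bound, kl, posterior

# Arbitrary q: the bound is strict and the gap is exactly the KL divergence.
log_ev, F, kl, post = bound_terms(0.3, [0.5, 0.5], [0.3, 0.7], [-1.0, 2.0])
print(log_ev, F + kl, kl)

# E-step choice q = p(z | y): the KL term vanishes and the bound is tight.
log_ev, F_tight, kl_tight, _ = bound_terms(0.3, post, [0.3, 0.7], [-1.0, 2.0])
print(log_ev - F_tight, kl_tight)
```

For any q the first two printed numbers agree, and the gap closes only when q equals the exact posterior.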
Now, if we substitute $q(z) = p(z \mid Y; \theta^{old})$ in the lower bound and expand (15),
we get
\[
F(q, \theta) = \int p(z \mid Y; \theta^{old}) \log p(Y, z; \theta)\, dz - \int p(z \mid Y; \theta^{old}) \log p(z \mid Y; \theta^{old})\, dz.
\]
\[
F(q, \theta) = -\mathrm{KL}(q_j \,\|\, \tilde{p}) - \sum_{i \neq j} \int q_i \log q_i\, dz \tag{20}
\]
The bound in (20) is maximized when the KL divergence becomes zero, which is the
case for $q_j(z_j) = \tilde{p}(Y, z_j; \theta)$. In other words, the optimal distribution is obtained
from the following equation,
To summarize, the two steps of the variational EM algorithm are given by,
\[
\text{M-step: Find } \theta^{new} = \operatorname*{argmax}_{\theta} F(q^{new}(z), \theta) = \operatorname*{argmax}_{\theta} Q(\theta, \theta^{old}). \tag{24}
\]
These equations are the set of consistency conditions for the maximum of the
lower bound for the factorized approximation of the posterior. They do not give us
an explicit solution as they also depend on other factors. Hence, we need to iterate
through all the factors and replace each of them one-by-one with its revised version.
We now derive the updates for every factor of the hidden variables by considering
the factorized posterior distribution,
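\[
\begin{aligned}
\log q(w_m) &= \mathbb{E}_{f_m, \beta_{f_m}, \alpha_m}\left[\sum_{n=1}^{N} \log \mathcal{N}\left(f_m^{(n)}; w_m^\top \phi_m^{(n)}, \beta_{f_m}^{-1}\right) + \log \mathcal{N}\left(w_m; 0, A_m^{-1}\right)\right] \\
&= \mathbb{E}_{f_m, \beta_{f_m}, \alpha_m}\left[\sum_{n=1}^{N} \left(\frac{1}{2}\log \beta_{f_m} - \frac{\beta_{f_m}}{2} \left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^\top \left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)\right) + \frac{1}{2}\log(|A_m|) - \frac{1}{2} w_m^\top A_m w_m\right] + \text{const.} \\
&= \mathbb{E}_{f_m, \beta_{f_m}, \alpha_m}\left[-\frac{\beta_{f_m}}{2} \sum_{n=1}^{N} \left(f_m^{(n)2} + \phi_m^{(n)\top} w_m w_m^\top \phi_m^{(n)} - 2 f_m^{(n)} w_m^\top \phi_m^{(n)}\right) - \frac{1}{2} w_m^\top A_m w_m\right] + \text{const.} \\
&= -\frac{1}{2} w_m^\top \left(\mathbb{E}[A_m] + \mathbb{E}[\beta_{f_m}] \sum_{n=1}^{N} \phi_m^{(n)} \phi_m^{(n)\top}\right) w_m + w_m^\top\, \mathbb{E}[\beta_{f_m}] \sum_{n=1}^{N} \phi_m^{(n)}\, \mathbb{E}[f_m^{(n)}] + \text{const.} \\
&= \log \mathcal{N}(w_m; \mu_{w_m}, \Sigma_{w_m}).
\end{aligned}
\tag{26}
\]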
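In the rest of the manuscript, we use the hat symbol ($\hat{\ }$) to refer to the first moment of the
approximation of the parameters. For instance, $\hat{\beta}_{f_m} = \mathbb{E}[\beta_{f_m}]$ and $\hat{A}_m = \mathbb{E}[A_m]$.
Moreover, $\mathbb{E}[f_m^{(n)}] = \mu_{w_m}^\top \phi_m^{(n)}$.
The posterior distribution of $w_m$ is a normal distribution; therefore, $\log q(w_m)$
is a quadratic function of $w_m$, which we refer to as $J(w_m)$. From (2), we know that
the negative inverse of the covariance matrix is equal to the second derivative of
$J(w_m)$ with respect to $w_m$. The derivatives of the right-hand side of (26) are:
\[
\frac{\partial J(w_m)}{\partial w_m} = -\left(\hat{A}_m + \hat{\beta}_{f_m} \sum_{n=1}^{N} \phi_m^{(n)} \phi_m^{(n)\top}\right) w_m + \hat{\beta}_{f_m} \sum_{n=1}^{N} \phi_m^{(n)}\, \mathbb{E}[f_m^{(n)}] \tag{27}
\]
\[
\frac{\partial^2 J(w_m)}{\partial w_m^2} = -\left(\hat{A}_m + \hat{\beta}_{f_m} \sum_{n=1}^{N} \phi_m^{(n)} \phi_m^{(n)\top}\right). \tag{28}
\]
Moreover, $\frac{\partial J(w_m)}{\partial w_m} = 0$ at the mean; hence, by setting (27) equal to zero and solving
for $w_m$, we get the mean of the posterior:
\[
\Sigma_{w_m} = \left(\hat{A}_m + \hat{\beta}_{f_m} \sum_{n=1}^{N} \phi_m^{(n)} \phi_m^{(n)\top}\right)^{-1} \tag{29}
\]
\[
\mu_{w_m} = \hat{\beta}_{f_m} \Sigma_{w_m} \sum_{n=1}^{N} \phi_m^{(n)}\, \mathbb{E}[f_m^{(n)}]. \tag{30}
\]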
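The updates (29) and (30) for one local model map directly onto a few lines of linear algebra. The following is a minimal sketch, assuming NumPy and a dense feature matrix whose rows are the weighted features of the samples; the function and argument names are illustrative, and the explicit inverse is used only for readability.

```python
import numpy as np

def update_weight_posterior(Phi_m, f_mean, A_hat, beta_f_hat):
    """Posterior covariance (29) and mean (30) of the weights of one local model.

    Phi_m      : (N, P) array, row n is the weighted feature phi_m^(n)
    f_mean     : (N,) array of expected hidden targets E[f_m^(n)]
    A_hat      : (P, P) expected prior precision matrix of the weights
    beta_f_hat : scalar expected precision of the local model
    """
    precision = A_hat + beta_f_hat * Phi_m.T @ Phi_m
    Sigma_w = np.linalg.inv(precision)
    mu_w = beta_f_hat * Sigma_w @ (Phi_m.T @ f_mean)
    return mu_w, Sigma_w

# Hypothetical example: targets generated exactly by known weights.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
mu_w, Sigma_w = update_weight_posterior(Phi, Phi @ w_true, 1e-6 * np.eye(3), 10.0)
print(mu_w)
```

With a nearly flat prior, as in this example, the posterior mean approaches the least-squares solution.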
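\[
a_{Nm}^{\alpha(p)} = a_0^{\alpha(p)} + \frac{1}{2} \tag{32}
\]
\[
b_{Nm}^{\alpha(p)} = b_0^{\alpha(p)} + \frac{1}{2}\, \mathbb{E}\left[\left(w_m^{(p)}\right)^2\right] = b_0^{\alpha(p)} + \frac{1}{2}\left(\left(\mu_{w_m}^{(p)}\right)^2 + \sigma_{w_m}^{(p)}\right). \tag{33}
\]
We observe that $a_{Nm}^{\alpha(p)}$ is the same for all the models ($\forall m$) and every individual
element ($p$) of the precision vector $\alpha_m$. Therefore, we use $a_N^{\alpha}$ instead of $a_{Nm}^{\alpha(p)}$ later
in Algorithm 1.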
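A sketch of the updates (32) and (33) is given below. It assumes that $\sigma_{w_m}^{(p)}$ denotes the $p$th diagonal element of $\Sigma_{w_m}$ and that the gamma distribution is parameterized by shape and rate as in (3), so that its mean is the ratio of the two parameters. The function name is illustrative.

```python
import numpy as np

def update_alpha_posterior(mu_w, Sigma_w, a0, b0):
    """Gamma posterior parameters of the weight precisions of one local model.

    mu_w    : (P,) posterior mean of the weights
    Sigma_w : (P, P) posterior covariance of the weights
    a0, b0  : prior shape and rate (scalars or (P,) arrays)
    """
    a_N = a0 + 0.5
    # E[(w^(p))^2] = mean^2 + variance, with the variance taken from the diagonal
    b_N = b0 + 0.5 * (mu_w ** 2 + np.diag(Sigma_w))
    alpha_hat = a_N / b_N  # mean of a gamma distribution with shape a_N and rate b_N
    return a_N, b_N, alpha_hat

a_N, b_N, alpha_hat = update_alpha_posterior(np.array([0.0, 2.0]), np.eye(2), 1e-6, 1e-6)
print(alpha_hat)
```

Weights whose posterior mass stays close to zero receive a larger expected precision, which is what drives the sparsity of the model.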
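Similarly, we derive the variational Bayes update for the posterior of the
precision parameter $\beta_f$ by computing
\[
\begin{aligned}
\log q(\beta_{f_m}) &= \mathbb{E}_{w_m, f_m}\left[\sum_{n=1}^{N} \log \mathcal{N}\left(f_m^{(n)}; w_m^\top \phi_m^{(n)}, \beta_{f_m}^{-1}\right) + \log \mathcal{G}\left(\beta_{f_m} \mid a_0^{\beta}, b_0^{\beta}\right)\right] \\
&= \mathbb{E}_{w_m, f_m}\left[\log(\beta_{f_m})^{N/2} - \frac{\beta_{f_m}}{2} \sum_{n=1}^{N} \left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^\top \left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right) + \left(a_0^{\beta} - 1\right) \log(\beta_{f_m}) - b_0^{\beta} \beta_{f_m}\right] \\
&= \log \mathcal{G}\left(\beta_{f_m} \mid a_{Nm}^{\beta}, b_{Nm}^{\beta}\right).
\end{aligned}
\tag{34}
\]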
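The updates are
\[
a_{Nm}^{\beta} = a_0^{\beta} + \frac{N}{2} \tag{35}
\]
\[
\begin{aligned}
b_{Nm}^{\beta} &= b_0^{\beta} + \frac{1}{2} \sum_{n=1}^{N} \mathbb{E}\left[\left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^\top \left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)\right] \\
&= b_0^{\beta} + \frac{1}{2} \sum_{n=1}^{N} \left[\left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right)^\top \left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right) + \sigma_{f_m} + \mathrm{Trace}\left(\phi_m^{(n)\top} \Sigma_{w_m} \phi_m^{(n)}\right)\right].
\end{aligned}
\tag{36}
\]
In (36), $\sigma_{f_m}$ is the approximate variance of the $m$th local model. Again, since $a_{Nm}^{\beta}$
is identical for all the local models, we use $a_N^{\beta}$ instead.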
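The updates (35) and (36) can be sketched as below, assuming NumPy, the shape and rate parameterization of (3), and a scalar variance for the hidden targets of the local model. The function and argument names are illustrative.

```python
import numpy as np

def update_beta_f_posterior(Phi_m, mu_f, sigma_f, mu_w, Sigma_w, a0, b0):
    """Gamma posterior parameters of the precision of one local model.

    Phi_m   : (N, P) array, row n is phi_m^(n)
    mu_f    : (N,) posterior means of the hidden targets f_m^(n)
    sigma_f : scalar posterior variance of the hidden targets
    mu_w    : (P,) posterior mean of the weights
    Sigma_w : (P, P) posterior covariance of the weights
    """
    N = Phi_m.shape[0]
    resid = mu_f - Phi_m @ mu_w
    # phi^T Sigma_w phi for every sample (each term is already a scalar)
    weight_uncertainty = np.einsum("np,pq,nq->n", Phi_m, Sigma_w, Phi_m)
    a_N = a0 + 0.5 * N
    b_N = b0 + 0.5 * np.sum(resid ** 2 + sigma_f + weight_uncertainty)
    return a_N, b_N, a_N / b_N  # last entry is the expected precision

a_N, b_N, beta_hat = update_beta_f_posterior(
    np.eye(2), np.array([2.0, 3.0]), 0.5, np.array([1.0, 1.0]), np.eye(2), 1.0, 1.0)
print(a_N, b_N, beta_hat)
```

In this small example the residuals are 1 and 2, so the rate becomes 1 + 0.5 * (2.5 + 5.5) = 5 and the expected precision is 2 / 5.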
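Finally, the variational Bayes update equations for the posterior mean and
covariance of each local model target $f_m = [f_m^{(1)}, \ldots, f_m^{(N)}]$ are found through the
following steps:
\[
\begin{aligned}
\log q(\tilde{f}^{(n)}) &= \mathbb{E}_{\tilde{w}, \tilde{\beta}_f}\left[\log \mathcal{N}\left(y^{(n)} \mid \mathbf{1}^\top \tilde{f}^{(n)}, \beta_y^{-1}\right) + \log \mathcal{N}\left(\tilde{f}^{(n)} \mid \tilde{F}^{(n)}, B^{-1}\right)\right] \\
&= \mathbb{E}_{\tilde{w}, \tilde{\beta}_f}\left[-\frac{\beta_y}{2} \left(y^{(n)} - \mathbf{1}^\top \tilde{f}^{(n)}\right)^\top \left(y^{(n)} - \mathbf{1}^\top \tilde{f}^{(n)}\right) + \frac{1}{2} \log(|B|) - \frac{1}{2} \left(\tilde{f}^{(n)} - \tilde{F}^{(n)}\right)^\top B \left(\tilde{f}^{(n)} - \tilde{F}^{(n)}\right)\right] + \text{const.} \\
&= \log \mathcal{N}\left(\tilde{f}^{(n)} \mid \mu_{\tilde{f}^{(n)}}, \Sigma_{\tilde{f}^{(n)}}\right),
\end{aligned}
\tag{37}
\]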
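where $\tilde{F}^{(n)} = \left[w_1^\top \phi_1^{(n)}, w_2^\top \phi_2^{(n)}, \ldots, w_M^\top \phi_M^{(n)}\right]^\top$, $B = \mathrm{diag}(\beta_{f_1}, \ldots, \beta_{f_M})$, $\hat{\beta}_{f_m} =
\mathbb{E}_{\beta_{f_m}}[\beta_{f_m}] = \frac{a_N^{\beta}}{b_{N,m}^{\beta}}$, and $\phi_m^{(n)} \in \mathbb{R}^P$.
\[
\mathcal{N}\left(\tilde{f}^{(n)} \mid \mu_{\tilde{f}^{(n)}}, \Sigma_{\tilde{f}^{(n)}}\right) \propto \mathcal{N}\left(y^{(n)} \mid \mathbf{1}^\top \tilde{f}^{(n)}, \beta_y^{-1}\right) \mathcal{N}\left(\tilde{f}^{(n)} \mid \tilde{F}^{(n)}, B^{-1}\right). \tag{38}
\]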
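We rewrite the right-hand side of (38) as an exponential function ($\exp(-J(\tilde{f}^{(n)}))$),
where $J(\tilde{f}^{(n)})$ is a quadratic function of $\tilde{f}^{(n)}$ and is defined as
\[
J(\tilde{f}^{(n)}) = \mathbb{E}_{\tilde{w}, \tilde{\beta}_f}\left[\frac{\beta_y}{2} \left(\mathbf{1}^\top \tilde{f}^{(n)} - y^{(n)}\right)^\top \left(\mathbf{1}^\top \tilde{f}^{(n)} - y^{(n)}\right) + \frac{1}{2} \left(\tilde{f}^{(n)} - \tilde{F}^{(n)}\right)^\top B \left(\tilde{f}^{(n)} - \tilde{F}^{(n)}\right)\right]. \tag{39}
\]
To find the parameters of $\mathcal{N}(\tilde{f}^{(n)} \mid \mu_{\tilde{f}^{(n)}}, \Sigma_{\tilde{f}^{(n)}})$, we compute the first and second
derivatives of $J(\tilde{f}^{(n)})$:
\[
\frac{\partial J(\tilde{f}^{(n)})}{\partial \tilde{f}^{(n)}} = \beta_y \mathbf{1} \left(\mathbf{1}^\top \tilde{f}^{(n)} - y^{(n)}\right) + \hat{B} \left(\tilde{f}^{(n)} - \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]\right) \tag{40}
\]
\[
\frac{\partial^2 J(\tilde{f}^{(n)})}{\partial \tilde{f}^{(n)2}} = \beta_y \mathbf{1}\mathbf{1}^\top + \hat{B}. \tag{41}
\]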
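Hence, the covariance matrix $\Sigma_{\tilde{f}^{(n)}}$ is $\left(\beta_y \mathbf{1}\mathbf{1}^\top + \hat{B}\right)^{-1}$. Using the Sherman-Morrison
formula, we reformulate the covariance matrix as
\[
\Sigma_{\tilde{f}^{(n)}} = \hat{B}^{-1} - \frac{\hat{B}^{-1} \mathbf{1}\mathbf{1}^\top \hat{B}^{-1}}{\beta_y^{-1} + \mathbf{1}^\top \hat{B}^{-1} \mathbf{1}}. \tag{42}
\]
Let us suppose $s = \beta_y^{-1} + \mathbf{1}^\top \hat{B}^{-1} \mathbf{1}$. Then, the diagonal elements of $\Sigma_{\tilde{f}^{(n)}}$ are
$\hat{\beta}_{f_m}^{-1} - \frac{(\hat{\beta}_{f_m}^{-1})^2}{s}$. Note that the covariance matrix of the local models does not depend
on the individual samples.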
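The minimum of $J(\tilde{f}^{(n)})$ is attained when the first derivative is zero. Setting
(40) to zero and solving for $\tilde{f}^{(n)}$ gives us
\[
\begin{aligned}
\mu_{\tilde{f}^{(n)}} &= \left(\beta_y \mathbf{1}\mathbf{1}^\top + \hat{B}\right)^{-1} \left(\beta_y y^{(n)} \mathbf{1} + \hat{B}\, \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]\right) \\
&= \left(\hat{B}^{-1} - \frac{\hat{B}^{-1} \mathbf{1}\mathbf{1}^\top \hat{B}^{-1}}{\beta_y^{-1} + \mathbf{1}^\top \hat{B}^{-1} \mathbf{1}}\right) \left(\beta_y y^{(n)} \mathbf{1} + \hat{B}\, \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]\right) \\
&= \Sigma_{\tilde{f}^{(n)}} \left(\beta_y y^{(n)} \mathbf{1} + \hat{B}\, \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]\right),
\end{aligned}
\tag{43}
\]
where $\mathbb{E}_{\{\beta_{f_m}\}_{\forall m}}[B]$ is written as $\hat{B}$. The update for the individual target values
is $\mu_{f_m} = f_m + \frac{\hat{\beta}_{f_m}^{-1}}{s} (Y - Y^{pre})$; $Y^{pre} = \mathbb{E}_{w_m}\left[\tilde{F}^{(n)}\right]$.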
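The Sherman-Morrison form avoids inverting an M-by-M matrix for every sample. The sketch below computes the posterior of the hidden targets of one sample in closed form, under the assumption that the residual in the mean update is the observation minus the sum of the expected local predictions, which is what multiplying out (43) with (42) gives. The function and argument names are illustrative.

```python
import numpy as np

def target_posterior(y_n, F_n, beta_f_hat, beta_y):
    """Posterior mean and covariance of the hidden targets of one sample.

    y_n        : scalar observation y^(n)
    F_n        : (M,) expected local predictions E[w_m^T phi_m^(n)]
    beta_f_hat : (M,) expected precisions of the local models (diagonal of B_hat)
    beta_y     : scalar observation precision
    """
    B_inv = 1.0 / beta_f_hat                 # diagonal of B_hat^{-1}
    s = 1.0 / beta_y + B_inv.sum()           # s = beta_y^{-1} + 1^T B_hat^{-1} 1
    Sigma_f = np.diag(B_inv) - np.outer(B_inv, B_inv) / s
    mu_f = F_n + (B_inv / s) * (y_n - F_n.sum())
    return mu_f, Sigma_f

# Hypothetical values, compared against the direct inverse in (43).
beta_f = np.array([2.0, 5.0, 10.0])
F_n = np.array([0.1, -0.4, 1.2])
mu_f, Sigma_f = target_posterior(1.5, F_n, beta_f, 50.0)
direct = np.linalg.inv(50.0 * np.ones((3, 3)) + np.diag(beta_f))
print(np.allclose(Sigma_f, direct))
```

Each local target is shifted toward the observation in proportion to the variance of its local model, so less certain models absorb more of the residual.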
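\[
\tilde{\lambda}^{opt} = \operatorname*{argmax}_{\lambda}\; \mathbb{E}_{f, \tilde{w}, \tilde{\beta}_f, \tilde{\alpha}} \log p(Y, f, \tilde{w}, \alpha, \tilde{\beta}_f \mid \Phi, \lambda). \tag{44}
\]
Here, by $\Phi$, we refer to all the feature vectors $\phi_m^{(n)}$ for every $n$ and $m$. (44) nicely
factorizes into independent maximization problems for each local model
\[
\lambda_m^{opt} = \operatorname*{argmax}_{\lambda_m}\; \mathbb{E}_{w_m, f_m, \beta_{f_m}} \sum_{n=1}^{N} \log \mathcal{N}\left(f_m^{(n)} \mid w_m^\top \phi_m^{(n)}, \beta_{f_m}^{-1}\right) = \operatorname*{argmax}_{\lambda_m} K(\lambda_m), \tag{45}
\]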
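which are optimized via gradient ascent. Simplifying (45), we have
\[
\begin{aligned}
\lambda_m^{opt} &= \operatorname*{argmax}_{\lambda_m}\; \mathbb{E}_{w_m, f_m}\left[\log(\beta_{f_m})^{N/2} - \frac{\beta_{f_m}}{2} \sum_{n=1}^{N} \left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^\top \left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)\right] \\
&= \operatorname*{argmax}_{\lambda_m}\; \mathbb{E}_{w_m, f_m}\left[-\frac{\beta_{f_m}}{2} \sum_{n=1}^{N} \left(f_m^{(n)} - w_m^\top \phi_m^{(n)}\right)^2\right] \\
&= \operatorname*{argmax}_{\lambda_m} \left[-\frac{\beta_{f_m}}{2} \sum_{n=1}^{N} \left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right)^2\right].
\end{aligned}
\tag{46}
\]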
We evaluate the gradient at a single point each time and update the variables with
the approximation of the gradient based on that single data point.
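\[
\begin{aligned}
\nabla_{\lambda_m} K &= \nabla_{\lambda_m} \sum_{n=1}^{N} -\frac{\beta_{f_m}}{2} \left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right)^2 \\
&= -\sum_{n=1}^{N} \frac{\beta_{f_m}}{2} \left(\mu_{f_m^{(n)}} - \mu_{w_m}^\top \phi_m^{(n)}\right) \eta_m^{(n)}\, \xi_m^{(n)\top} \mu_{w_m}\, \mathrm{power}(x - c_m, 2)^\top \Lambda^{-2},
\end{aligned}
\tag{47}
\]
where $\mathrm{power}(\cdot, 2)$ indicates the element-wise power operator. Also, $\nabla_{\lambda_m} K \in \mathbb{R}^d$,
and the update for $\lambda_m$ is $\lambda_m^{new} = \lambda_m^{old} + \kappa \nabla_{\lambda_m} K$, where $\kappa$ is the learning rate.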
batch of data. Note that all the update equations are local with the exception of
µfm , the posterior mean of the hidden target.
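To find the predictive distribution, we marginalize the complete likelihood of
the model first over the hidden variables $f$
\[
\int \mathcal{N}\left(y^*; \mathbf{1}^\top \tilde{f}^*, \beta_y^{-1}\right) \mathcal{N}\left(f^*; W^\top \phi^*, B^{-1}\right) df^* = \mathcal{N}\left(y^*; \sum_{m}^{M} w_m^\top \phi_m^*, \beta_y^{-1} + \mathbf{1}^\top B^{-1} \mathbf{1}\right), \tag{48}
\]
(49) represents the posterior with predictive mean $\sum_{m}^{M} w_m^\top \phi_m^*$, which is the sum
of all the local models' predictions, and the predictive variance $\sigma^2(x^*) = \beta_y^{-1} +
\mathbf{1}^\top B^{-1} \mathbf{1} + \phi_m^{*\top} \Sigma_{w_m} \phi_m^*$.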
For better prediction, we divide the training set into mini-batches when the
training set is very large, and train the model for each mini-batch separately.
To make the prediction for a new location x∗ , we find the correct mini-batch
model(s) that x∗ belongs to, and then predict x∗ using (49). The prediction from
each model is averaged for the points that lie within the intersection of two or
more mini-batches. The procedure for preparing data and training the model on
all the batches is described in Algorithm 2. The procedure for testing the models
is stated in Algorithm 3.
3 Theoretical Analysis
We have explained how to solve an MLE problem for the graphical model in Fig. 2
using a variational EM algorithm, and derived the corresponding update equations
for the unknown model parameters in the previous Section. Therefore, now, an
important question is whether the variational EM algorithm has the desired per-
formance. As discussed in the Introduction, there have been many efforts toward
finding convergence guarantees of various EM algorithms. Here, we show that,
assuming a factorized posterior and considering the monotonicity of the EM algo-
rithm as established in Theorem 2.1 of [20], the (t + 1)th update is never less likely
than the tth update. Alternatively, this means that improving the Q-function in
(24) never makes the log-likelihood function worse. Let us restate the Theorem for
a part of the Batch-HBLR model:
Theorem 1 Let random variables X and Y have parametric densities with pa-
rameter θ ∈ Ω. Suppose the support of X does not depend on θ, and the Markov
relationship θ → X → Y , that is,
p(y|x, θ) = p(y|x)
32:       Update $\lambda_m$ using gradient ascent with $\kappa$ as the learning rate
33:       $LM_m = \{c_m, \lambda_m, \hat{\beta}_{f_m}, \hat{\alpha}_m, \mu_{w_m}, \Sigma_{w_m}\}$
34:     end for
35:     $Y^{pre} = \left[\mu_{w_m}^\top \phi_m^{(1)}, \ldots, \mu_{w_m}^\top \phi_m^{(N)}\right]^\top$
36:     $nMSE[t] = \frac{\|Y - Y^{pre}\|_2^2}{\mathrm{Var}(Y)}$
37:     if $\| nMSE[t] - nMSE[t-1] \| \leq \delta$ then break    ▷ Iteration threshold
38:     end if
39:   end while
40:
41:   return LM
42: end procedure
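The stopping rule at the end of the training procedure can be sketched in a few lines. This is an illustration under one assumption that the listing does not settle: the squared error is averaged over the samples before it is divided by the variance of the responses, which is the common definition of the normalized mean squared error. The function names are hypothetical.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: mean squared residual over response variance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def has_converged(history, delta=1e-4):
    """True when the last two nMSE values differ by at most delta."""
    return len(history) >= 2 and abs(history[-1] - history[-2]) <= delta

y = np.array([1.0, 2.0, 3.0, 4.0])
print(nmse(y, np.full_like(y, y.mean())))   # predicting the mean gives an nMSE of one
print(has_converged([0.5, 0.49995]))
```

Because the variance is in the denominator, this measure is undefined for a constant response, which is why the text falls back to the plain MSE in that case.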
desired distribution is not sufficient when we are dealing with the variational EM
algorithm with a factorized posterior (as discussed in the Introduction) since there
is no guarantee that the estimated parameters correspond to the stationary points
of the likelihood function.
4 Experimental Results
In this section, we present the implementation details and characterize the per-
formance of the Batch-HBLR method on three different stochastic dynamical sys-
tems. First, we evaluate our method on a stochastic mass-spring-damper (MSD)
system, and then on a more complex double inverted pendulum on a cart (DIPC)
system. Last, we validate its effectiveness on the synthetic version of a real-world
micro-robotic system.
4.1 Implementation
In order to construct non-informative priors, we choose $a_0^{\alpha} = b_0^{\alpha} = a_0^{\beta} = b_0^{\beta} = 10^{-6}$
and $\beta_y = 10^{9}$ as the hyperparameters. We assume similar initial precision ($\alpha_m$)
for all the models' weights, and set the initial value equal to $\frac{a_0^{\alpha}}{b_0^{\alpha}} \mathbf{1}_p$ since they are
generated by a gamma distribution. The matrix inversion instability avoidance
parameter is chosen to be $10^{-10}$. We use $w_{gen} = 0.5$ as the RBF activation
threshold. If the RBF value for a sample ($x^{(n)}$) is less than $w_{gen}$, the algorithm
adds a new local model with that sample as the center, and initializes its parameters
in the INITIALIZELM function in Algorithm 1. These parameters comprise
the length scale ($\lambda_{init} = 0.3$), the mean $\mu_{w_m} = 0$, and the covariance $\Sigma_{w_m} = 0_{p \times p}$
of the models' weights. We only retain those features with precision values less than
1000. If the precision becomes larger than this threshold, we set the corresponding
weight to zero and prune the corresponding feature. The learning rate, $\kappa$, of $\lambda$ is
selected to be 0.0001 and the iteration threshold, $\delta$, is also chosen as 0.0001.
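The two thresholding rules described above can be sketched as follows. The sketch makes two assumptions that the text leaves open: a new local model is added only when the activation of every existing model falls below the threshold, and the RBF kernel is diagonal with one length-scale vector per model. The function names are hypothetical and do not correspond to the functions in Algorithm 1.

```python
import numpy as np

def needs_new_local_model(x, centers, length_scales, w_gen=0.5):
    """True when no existing local model is activated above w_gen at x.

    centers, length_scales : (M, d) arrays, one row per local model
    """
    if len(centers) == 0:
        return True
    diffs = x - np.asarray(centers, dtype=float)
    eta = np.exp(-0.5 * np.sum(diffs ** 2 / np.asarray(length_scales, dtype=float), axis=1))
    return bool(eta.max() < w_gen)

def prune_weights(mu_w, alpha_hat, threshold=1000.0):
    """Zero the weights whose expected precision is not below the threshold."""
    keep = alpha_hat < threshold
    return np.where(keep, mu_w, 0.0), keep

centers = np.array([[0.0, 0.0]])
scales = np.array([[0.3, 0.3]])
print(needs_new_local_model(np.array([2.0, 2.0]), centers, scales))   # far from the only center
print(prune_weights(np.array([1.0, 2.0, 3.0]), np.array([10.0, 5000.0, 999.0])))
```

The returned mask can be used to drop the pruned columns from the feature matrix in later iterations.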
In the two illustrative examples, we report the normalized mean squared error
(nM SE), which is computed in line 19 of Algorithm 3. When we have a response
variable (y) that is a constant function, Var(y) is zero, and unless we add noise to
the data, we are not able to use nM SE and compute M SE instead. For the micro-
robotic system, some of the states remain constant. Therefore, we use M SE both
as a termination criterion in the training algorithm and as a performance measure
in the test algorithm. We allow a maximum iteration limit of 200 for training the
models. However, the algorithm stops if the difference between the previous and
current nM SE values is less than 0.0001. It is worth mentioning the models are
learned before 200 iterations in all the examples.
Algorithms 1 to 3 are written in Python 2.7 and tested on both Windows 10
and Ubuntu 16.04 LTS operating systems. We randomly select the training (67%
of the data) and test (33% of the data) sets using the train_test_split function
from the scikit-learn library in Python. The simulator for generating the micro-robot
trajectories in Section 4.3 is written in the C++ programming language, and
is run on the Ubuntu 16.04 LTS operating system. The source code and the
examples will be made available on GitHub later on.
Equation (55) is an Itô equation [31], where $x_1$ and $x_2$ are the position and velocity
of the mass, respectively (the overdot represents the time derivative), and $w \sim \mathcal{N}(0, 1)$. In
Fig. 3, $x(t)$ is equivalent to $x_1$ in the state equation, and $k$ and $c$ are the stiffness
of the spring and the damping coefficient, respectively. In (55), $\nu = \sqrt{\frac{k}{m}}$ is the
natural frequency of the system, and $\gamma = \frac{c}{m}$ is the damping ratio. It is solved
using itoint, a numerical integration method in the sdeint Python library.
Fig. 3: Stochastic Mass and Spring Damper system. The damping ratio is γ = 1,
and the natural frequency is ν = 3 for this system.
The system starts from $x = [3.0, 0.0]^\top$ and is simulated for 10 seconds in
increments of 0.005 second. The effect of measurement noise is included by adding
white noise in the form of $x(t) = x(t) + s_d w$, where $w \sim \mathcal{N}(0, 1)$; $s_d = 0.1$ for the
first state and $s_d = 0.4$ for the second one. The training set for the Batch-HBLR
method has size $N = 1340$, 3-dimensional input ($X = [x[0 : N - 1]^\top, \text{Time}]$), and
2-dimensional response ($Y = x[1 : N]$). We train the model on every dimension of
the response separately. The predicted results for the test set ($N = 660$) using
the trained model are shown in Fig. 4. They closely match the "true" state values
regardless of the amount of noise added into the stochastic system.
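The construction of the one-step-ahead input and response pairs described above can be sketched as follows. The trajectory used in the example is a hypothetical damped oscillation that only stands in for the simulated states; it is not the output of the stochastic simulator, and the function name is illustrative.

```python
import numpy as np

def build_one_step_dataset(states, times, noise_std, rng):
    """Pair each noisy state (plus time) with the noisy state one step later.

    states    : (T, d) array of simulated states
    times     : (T,) array of simulation times
    noise_std : (d,) standard deviations of the measurement noise per state
    """
    noisy = states + noise_std * rng.standard_normal(states.shape)
    X = np.column_stack([noisy[:-1], times[:-1]])   # inputs: x[0 : N-1] and Time
    Y = noisy[1:]                                   # responses: x[1 : N]
    return X, Y

# Placeholder trajectory: 10 seconds in increments of 0.005 second.
t = np.arange(2001) * 0.005
position = 3.0 * np.exp(-0.5 * t) * np.cos(3.0 * t)
velocity = np.gradient(position, t)
states = np.column_stack([position, velocity])
X, Y = build_one_step_dataset(states, t, np.array([0.1, 0.4]), np.random.default_rng(0))
print(X.shape, Y.shape)
```

The resulting pairs can then be split randomly into training and test sets in the proportions stated in Section 4.1, and one model is trained per column of Y.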
The important quantitative performance measures of the training and test
algorithms are summarized in Table 1. The reported prediction time is the time
taken by the algorithm to make a prediction for one new sample (only the average
time is reported, as the standard error is negligible). We observe that
not only is the algorithm fast in making predictions, it also converges well before
reaching the maximum iteration limit during training. The number of local models
(# of LMs) depends only on the inputs and not on the responses (Y). Therefore,
it is the same for both the response states. This number is more than two orders
of magnitude smaller than the number of samples, making the model extremely
parsimonious. The training and test errors are comparable and very small. The
actual expressions of the learned local models for both $x(t)$ and $\dot{x}(t)$ are reported
in Tables 4 and 5 in Appendix B.
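We now provide a brief description of the dynamics model. The state vector is
defined as $x = [\theta_0, \theta_1, \theta_2, \dot{\theta}_0, \dot{\theta}_1, \dot{\theta}_2]^\top$, where the position of the cart is in meters
and the angles are in radians. For the system in Fig. 5 the state-space equation is
where,
\[
A = \begin{bmatrix}
0.0 & 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\
0.0 & 0.0 & 0.0 & 0.0 & 1.0 & 0.0 \\
0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 1.0 \\
0.0 & 0.0 & -7.49 & 0.798 & 0.0 & 0.0 \\
0.0 & 0.0 & 74.93 & -33.71 & 0.0 & 0.0 \\
0.0 & 0.0 & -59.94 & 52.12 & 0.0 & 0.0
\end{bmatrix},
\quad
B = \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ -0.61 \\ 1.5 \\ -0.3 \end{bmatrix},
\quad
D = \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 0.1 \\ 0.1 \\ 0.1 \end{bmatrix},
\]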
where, e(t) = x(t) − xd (t) is the error calculated with respect to the desired
trajectory xd (t).
We consider a Linear Quadratic Regulator (LQR) controller, $F(t) = -Kx$, and
find it by minimizing a quadratic cost $J$ in the form of $x^\top Q x + F^\top R F$. Here, the gain $K$
is obtained from the solution of the algebraic Riccati equation. Defining
\[
Q = \mathrm{diag}(10, 100, 100, 700, 700, 700),
\]
and $R = 1$, we get
\[
K = [-3.162, 589.127, -842.986, -29.493, 4.469, -133.079].
\]
The system starts from $x = [0.0, 0.175, -0.175, 0.0, 0.0, 0.0]^\top$ and follows a desired
trajectory for 200 seconds in increments of 0.01 second. The effect of measurement
noise is incorporated by adding white noise in the form of $x(t) = x(t) + s_d w$,
where $s_d = [0.5, 0.3, 0.25, 0.8, 0.3, 0.2]^\top$ is a vector of the standard deviations of the
noise we add to the states, and $w \sim \mathcal{N}(0, 1)$. The training set for the Batch-HBLR
method has $N = 13400$ paired samples. The samples have an 8-dimensional input
space ($X = [x[0 : N - 1]^\top, F, \text{Time}]$), which consists of the 6 states, the input
$F$, and the simulation time Time. The response is a 6-dimensional vector
$Y = [x[1 : N]]$, which captures all the states of the system. We train the model on
every dimension of the target separately, and report the results in Table 2. The
simulated trajectories of the pendulum and cart used for testing the model are
shown in Fig. 6. The tabulated performance measures and the illustrated prediction
results are very similar to those for the stochastic MSD system, indicating
the general applicability of our method. The learned local models for one of the
response variables, $\theta_0(t)$, are reported in Table 6 in Appendix B.
The final and most important example of the paper is a micro-robotic system, in
which a microscopic object (robot) is manipulated (actively controlled) in a fluid
medium using the optical force produced by shining a tightly focused laser beam
at the object. The dynamics of this system is stochastic due to the Langevin force
that gives rise to Brownian motion-based diffusion. Rajasekaran et al. [48] describe
the details of the dynamics model, which also includes optical forces, viscous drag,
and buoyancy, using a tensor stochastic differential equation. We directly use the
OTGO toolbox [11] to generate look-up tables of the optical trapping forces Fo .
The dynamics model for one optical trap (local controller generating the optical
forces) affecting n spherical micro-robots and other freely diffusing objects, termed
as obstacles, is then represented as
Fig. 7: In the left figure, the yellow part of the cube shows the volume used for
running the experiments (the remaining volume is ignored due to force symmetry
considerations). The green arrow is a sample trajectory followed by the controlled
micro-robot. θx and θz show the angle made by the trajectory in the X-Y plane
with the X-axis and the angle with the Z-axis in the plane defined by the Z-axis
and the trajectory vector, respectively. The obstacle is located at the nodes of the
3D grid (see the right figure for the locations on the X-Y plane).
The number of partitions used for each speed in S = {0.2, 0.4, 0.6, 0.8, 1.0} is
P = {20, 10, 6, 5, 4}, respectively.
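The speed-dependent partitioning can be sketched as below. The pairing of speeds with partition counts comes from the text; the helper name split_into_segments and the assumption that segments are contiguous and of near-equal size are illustrative only.

```python
import numpy as np

# Speeds and partition counts as listed in the text.
PARTITIONS = {0.2: 20, 0.4: 10, 0.6: 6, 0.8: 5, 1.0: 4}


def split_into_segments(X, Y, speed):
    """Split one trajectory's samples into contiguous (X, Y) segments.

    Assumes contiguous, near-equal segments; the paper may segment differently.
    """
    p = PARTITIONS[speed]
    return list(zip(np.array_split(X, p), np.array_split(Y, p)))


# Example with placeholder data: 50 samples recorded at speed 0.6.
segments = split_into_segments(np.zeros((50, 2)), np.zeros(50), 0.6)
```

A separate model would then be trained on each returned segment.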
The average quantitative results with standard errors are presented in Table
3. Since we run 350 experiments for every direction, the mean squared error (MSE)
is reported along with its statistics. The MSE for the training set is first averaged
over the segments, and its statistics are then computed over the 350 experiments.
The statistics for the total number of local models obtained in each segment are
reported in the table under the heading "# of LMs". The number of iterations
("# of iter.") is computed similarly to the number of local models.
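The two-stage aggregation of the training MSE can be sketched as follows. The function name and the input layout are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of experiments.

```python
import numpy as np


def summarize_training_mse(segment_mse):
    """Average per-segment MSEs within each experiment, then across experiments.

    segment_mse : list with one 1-D array per experiment, holding the MSE of
                  every segment in that experiment (hypothetical layout).
    Returns the mean over experiments and its standard error.
    """
    per_experiment = np.array([np.mean(m) for m in segment_mse])
    mean = per_experiment.mean()
    # Assumed definition: sample standard deviation over sqrt(n).
    std_err = per_experiment.std(ddof=1) / np.sqrt(per_experiment.size)
    return mean, std_err


# Three toy experiments with different numbers of segments.
mean, std_err = summarize_training_mse(
    [np.array([1.0, 3.0]), np.array([2.0, 2.0]), np.array([4.0])]
)
```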
In addition to accurate training and test results (low MSE), what is significant
about this learning algorithm is that it stores only a few samples as the centers
of the RBF models. For instance, in the first sub-table in Table 3, the average
size of the training set is 37200 samples, whereas the average number of local
models/RBF centers is 60. This significantly reduces the amount of memory needed
for recovering the results and making predictions for new data streams. Moreover,
if one needs to keep updating the model as new data arrive, it
is easy to transform this algorithm to operate in an incremental fashion. In this
manner, we can continuously expand the model by adding new local models and
updating the parameters of all the models accordingly.
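To illustrate why only the local model parameters need to be stored, the sketch below predicts one target dimension from M local models. It assumes Gaussian RBF activations with per-dimension length scales, local linear features centered at each RBF center with the bias as the last weight, and an unnormalized sum over models, in the style of local Gaussian regression [38]. These choices and the function name predict are assumptions; the model definition given earlier in the paper takes precedence.

```python
import numpy as np


def predict(x, centers, length_scales, weights):
    """Predict one target dimension from M stored local models.

    centers, length_scales : (M, d) arrays
    weights                : (M, d + 1) array, last column is the bias weight
    """
    diff = x - centers  # (M, d) offsets from each center
    # Assumed Gaussian RBF activation of every local model at x.
    act = np.exp(-0.5 * np.sum((diff / length_scales) ** 2, axis=1))
    # Local linear features: centered inputs followed by a constant 1.
    feats = np.hstack([diff, np.ones((len(centers), 1))])
    local = np.sum(feats * weights, axis=1)  # each model's linear prediction
    return float(np.sum(act * local))


# Toy example with a single local model in two dimensions.
centers = np.array([[0.0, 0.0]])
length_scales = np.array([[1.0, 1.0]])
weights = np.array([[2.0, 3.0, 0.5]])
y_at_center = predict(np.array([0.0, 0.0]), centers, length_scales, weights)
y_offset = predict(np.array([1.0, 0.0]), centers, length_scales, weights)
```

Under these assumptions, prediction touches only arrays with M rows (M = 60 on average in the first sub-table of Table 3) rather than the 37200 training samples.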
Last but not least, the algorithm results in fast predictions. The last column
of Table 3 corresponds to the time elapsed while the model makes a prediction
for a new location. The average is less than 0.5 milliseconds for all the states. This
is a very important advantage for a learning algorithm, especially if it is used for
learning the dynamics of a robotic system (or the transition probabilities). Fast
state estimation makes robot motion control not only more precise but also more
adept at reacting to environmental changes.
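A per-query prediction time of this kind could be measured along the following lines. This is a generic timing sketch, not the procedure used to produce Table 3; the function name, the repeats parameter, and the stand-in predictor are hypothetical.

```python
import time


def mean_prediction_time_ms(predict_fn, queries, repeats=100):
    """Return the average wall-clock time per prediction in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for q in queries:
            predict_fn(q)
    elapsed = time.perf_counter() - start
    # Convert seconds to milliseconds and divide by the number of calls.
    return 1e3 * elapsed / (repeats * len(queries))


# Stand-in predictor; a trained model's predict function would go here.
t_ms = mean_prediction_time_ms(lambda q: 2.0 * q, [1.0, 2.0], repeats=10)
```

Averaging over repeated calls reduces the influence of timer resolution, which matters for sub-millisecond predictions.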
Table 3: Performance measures of our Batch-HBLR method for the optically actuated
micro-robotic system. The reported values are the average of 350 experiments
in which the controlled micro-robot is moved along a particular specified direction.
As we partition the training data into S segments and train the HBLR model for
each segment separately, the reported training MSE values are the averages of
the MSEs observed for all the segments.
θz = 0°, θx is not defined
Avg. # of training samples: 37200
Avg. # of test samples: 18322
Ubuntu 16.4
State  Avg. MSE (training)             MSE (test)                      # of LMs  # of iter.  Prediction time (ms)
X      9.93 × 10^-11 ± 2.09 × 10^-13   3.01 × 10^-14 ± 6.17 × 10^-15   60 ± 35   2 ± 0       0.27 ± 0.07
Y      1.00 × 10^-10 ± 1.00 × 10^-12   3.84 × 10^-14 ± 1.27 × 10^-14   60 ± 35   2 ± 0       0.27 ± 0.07
Z      1.07 × 10^-3 ± 1.94 × 10^-4     1.06 × 10^-3 ± 1.81 × 10^-4     60 ± 35   62 ± 3      0.27 ± 0.07
Xobs   1.27 × 10^-3 ± 1.16 × 10^-3     1.24 × 10^-3 ± 1.13 × 10^-3     60 ± 35   59 ± 23     0.27 ± 0.07
Yobs   1.10 × 10^-3 ± 1.32 × 10^-3     1.08 × 10^-3 ± 1.29 × 10^-3     60 ± 35   41 ± 28     0.27 ± 0.07
Zobs   1.30 × 10^-3 ± 1.18 × 10^-3     1.28 × 10^-3 ± 1.15 × 10^-3     60 ± 35   55 ± 24     0.27 ± 0.07
θz = 90°, θx = 0°
Avg. # of training samples: 37400
Avg. # of test samples: 18421
Ubuntu 16.4 - 64 bit
State  Avg. MSE (training)             MSE (test)                      # of LMs  # of iter.  Prediction time (ms)
X      1.90 × 10^-3 ± 7.76 × 10^-4     1.90 × 10^-3 ± 7.67 × 10^-4     58 ± 37   60 ± 4      0.29 ± 0.09
Y      9.98 × 10^-11 ± 9.47 × 10^-13   6.28 × 10^-14 ± 3.11 × 10^-14   58 ± 37   2 ± 0       0.29 ± 0.09
Z      5.90 × 10^-6 ± 1.84 × 10^-7     6.01 × 10^-6 ± 3.37 × 10^-7     58 ± 37   2 ± 0       0.29 ± 0.09
Xobs   3.60 × 10^-3 ± 2.36 × 10^-3     3.60 × 10^-3 ± 2.35 × 10^-3     58 ± 37   75 ± 9      0.29 ± 0.09
Yobs   3.33 × 10^-3 ± 2.50 × 10^-3     3.32 × 10^-3 ± 2.49 × 10^-3     58 ± 37   51 ± 28     0.29 ± 0.09
Zobs   3.52 × 10^-3 ± 2.48 × 10^-3     3.51 × 10^-3 ± 2.47 × 10^-3     58 ± 37   61 ± 25     0.29 ± 0.09
θz = 45°, θx = 0°
Avg. # of training samples: 52200
Avg. # of test samples: 25710
Ubuntu 16.4
State  Avg. MSE (training)             MSE (test)                      # of LMs  # of iter.  Prediction time (ms)
X      1.78 × 10^-3 ± 4.31 × 10^-4     1.75 × 10^-3 ± 4.08 × 10^-4     81 ± 46   85 ± 5      0.36 ± 0.07
Y      9.97 × 10^-11 ± 7.495 × 10^-13  7.40 × 10^-14 ± 7.67 × 10^-14   81 ± 46   2 ± 0       0.36 ± 0.07
Z      1.23 × 10^-3 ± 9.41 × 10^-5     1.21 × 10^-3 ± 8.97 × 10^-5     81 ± 46   85 ± 5      0.36 ± 0.07
Xobs   3.90 × 10^-3 ± 2.30 × 10^-3     3.87 × 10^-3 ± 2.26 × 10^-3     81 ± 46   103 ± 10    0.36 ± 0.07
Yobs   3.54 × 10^-3 ± 2.60 × 10^-3     3.51 × 10^-3 ± 2.57 × 10^-3     81 ± 46   65 ± 39     0.36 ± 0.07
Zobs   3.94 × 10^-3 ± 2.32 × 10^-3     3.91 × 10^-3 ± 2.31 × 10^-3     81 ± 46   92 ± 22     0.36 ± 0.07
θz = 45°, θx = 45°
Avg. # of training samples: 74000
Avg. # of test samples: 36448
Windows 10 - 64 bit
State  Avg. MSE (training)             MSE (test)                      # of LMs  # of iter.  Prediction time (ms)
X      1.99 × 10^-3 ± 2.24 × 10^-4     1.99 × 10^-3 ± 2.31 × 10^-4     111 ± 65  108 ± 8     0.47 ± 1.96
Y      2.00 × 10^-3 ± 2.24 × 10^-4     1.99 × 10^-3 ± 2.30 × 10^-4     11 ± 65   108 ± 8     0.47 ± 1.96
Z      1.63 × 10^-3 ± 1.19 × 10^-4     1.16 × 10^-3 ± 1.16 × 10^-4     111 ± 65  117 ± 8     0.47 ± 1.96
Xobs   4.15 × 10^-3 ± 1.90 × 10^-3     4.19 × 10^-3 ± 1.93 × 10^-3     111 ± 65  130 ± 14    0.47 ± 1.96
Yobs   4.04 × 10^-3 ± 1.90 × 10^-3     4.07 × 10^-3 ± 1.92 × 10^-3     111 ± 65  100 ± 37    0.47 ± 1.96
Zobs   4.24 × 10^-3 ± 1.95 × 10^-3     4.26 × 10^-3 ± 1.97 × 10^-3     111 ± 65  118 ± 26    0.47 ± 1.96
θz = 75°, θx = 30°
Avg. # of training samples: 44000
Avg. # of test samples: 21672
Ubuntu 16.4
State  Avg. MSE (training)             MSE (test)                      # of LMs  # of iter.  Prediction time (ms)
X      1.79 × 10^-3 ± 7.05 × 10^-4     1.77 × 10^-3 ± 6.81 × 10^-4     66 ± 40   69 ± 1      0.32 ± 0.06
Y      1.30 × 10^-3 ± 2.15 × 10^-4     1.26 × 10^-3 ± 1.95 × 10^-4     66 ± 40   61 ± 2      0.32 ± 0.06
Z      1.26 × 10^-3 ± 5.13 × 10^-5     1.21 × 10^-3 ± 5.77 × 10^-5     66 ± 40   50 ± 2      0.32 ± 0.06
Xobs   3.22 × 10^-3 ± 1.72 × 10^-3     3.18 × 10^-3 ± 1.69 × 10^-3     66 ± 40   86 ± 8      0.32 ± 0.06
Yobs   3.11 × 10^-3 ± 1.76 × 10^-3     3.07 × 10^-3 ± 1.72 × 10^-3     66 ± 40   64 ± 25     0.32 ± 0.06
Zobs   3.24 × 10^-3 ± 1.77 × 10^-3     3.19 × 10^-3 ± 1.75 × 10^-3     66 ± 40   76 ± 18     0.32 ± 0.06
5 Conclusions
Learning the state transition dynamics of a robotic system and making fast and
accurate predictions are critical for any real-time motion planning and control
problem. The performance of the learning algorithm becomes even more important
when stochasticity plays a significant role in the system. In this paper, we
present a hierarchical Bayesian linear regression model with local features for
stochastic dynamics approximation and demonstrate its performance on three
different systems. The model is based on a top-down approach that benefits both
from the efficiency of local models and the accuracy of global models.
As the data collected from the interactions of robots with their environment
are typically enormous, we need a parsimonious learning model for compact storage.
In our case, only a few samples are stored and used as the local RBF feature
centers. This is done by optimizing the RBF length scale to obtain the minimum
number of local models that best represent the data. Moreover, the robots' state
vectors are usually large, and the curse of dimensionality slows down learning and
prediction substantially. This emphasizes the importance of a sparse learning
algorithm, in which the weights of the model are pruned if they do not contribute
significantly toward the predictions. The presented algorithm adopts the idea of
automatic relevance determination (ARD) and retains only those weights with
significant values. The algorithm is also guaranteed to converge to the optimal
values of the log-likelihood function for a factorized representation of the
posterior and a Markov relationship among the predictor variables and their
parameters.
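The pruning idea behind ARD can be illustrated with a short sketch. In ARD, each weight has a zero-mean Gaussian prior with its own precision, and a weight whose learned precision grows very large is effectively forced to zero. The threshold alpha_max and the function name prune_weights below are hypothetical and are not taken from the paper.

```python
import numpy as np


def prune_weights(w, alpha, alpha_max=1e3):
    """Zero out weights whose ARD precision exceeds a (hypothetical) threshold.

    w     : (k,) array of weight means
    alpha : (k,) array of learned prior precisions, one per weight
    Returns the pruned weights and a boolean mask of retained weights.
    """
    keep = alpha < alpha_max
    return np.where(keep, w, 0.0), keep


# Toy example: the second weight has a very large precision and is pruned.
w_pruned, keep = prune_weights(
    np.array([1.0, -2.0, 0.5]), np.array([0.1, 1e6, 10.0])
)
```

After pruning, the posterior covariance only needs to cover the retained weights, which is consistent with the reduced covariance dimensions noted in the Appendix B table captions.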
These characteristics result in satisfactory performance (low prediction errors,
small prediction times, few local models, and consistent convergence in a
reasonable number of iterations) on all three test systems. Hence, we believe that our
model would be suitable for optimal control policy generation using a paradigm
such as model-based reinforcement learning. Such use would require us to construct
the regression models for a denser set of trajectories that would cover the entire
operating environments of the robots with additional controlled robots and obstacles.
The challenges then would be to train all the models in a reasonable amount
of time (with acceptable sample complexity) and identify the correct prediction
model efficiently at run time.
6 Appendix A
where the Q-function is defined in (24). The assumptions of the base X being
independent of θ and Markov property are needed for the proof. The lower bound
of the log-likelihood, therefore, is,
where the last line follows from the assumption made in the Theorem.
7 Appendix B
Table 4: Local model parameters for the target state x(t) of the MSD system. In
models 2 and 3, as we adopt ARD, the bias term is pruned out; therefore, the
covariance matrix is in R^{2×2}.
Parameter  Learned value
c1         [0.16434, −0.58072]
λ1         [0.33351, 0.27799]
w1         [−0.2818, 0.87123, 0.08370]
βf1        14281.53
c2         [−0.02749, −0.06403]
λ2         [0.56134, 0.31700]
w2         [−2.69628, 8.88278, 0]
βf2        14250.26
c3         [−0.07787, 0.35885]
λ3         [0.33493, 0.25643]
w3         [−0.05474, 0.32631, 0]
βf3        14371.72
Table 5: Local model parameters for the response state ẋ(t) of the MSD system. In
model 2, as we adopt ARD, the bias term is pruned out; therefore, the covariance
matrix is in R^{2×2}.
Parameter  Learned value
c1         [0.16434, −0.58072]
λ1         [0.32576, 0.28048]
w1         [0.96775, 0.30136, −0.55395]
βf1        974.10
c2         [−0.02749, −0.06403]
λ2         [0.31075, 0.32949]
w2         [0.52709, −0.28383, 0]
βf2        983.51
c3         [−0.07787, 0.35885]
βf3        976.02
Table 6: Local model parameters for the response state θ0 of the SDIP system.
The bias term is pruned out in models 1 and 2, and both the bias and the first
linear feature are pruned in model 3.
Parameter  Learned value
c1         [2.584 × 10^-2, 7.47 × 10^-4, −5.01 × 10^-3, −3.64 × 10^-3, −1.32 × 10^-3, −5.13 × 10^-3, 1.55]
λ1         [0.41, 0.30, 0.27, 0.28, 0.29, 0.27, 12.27]
w1         [0, −0.59, −54.63, 0.13, 19.07, 12.26, 17.96]
Σ1         [ 6.00 × 10^-7,   5.90 × 10^-6,  −6.70 × 10^-6,   1.32 × 10^-6,   1.65 × 10^-7,   3.60 × 10^-6]
           [ 5.90 × 10^-6,   1.52 × 10^-4,  −1.03 × 10^-4,   2.02 × 10^-5,  −7.80 × 10^-6,   5.02 × 10^-5]
           [−6.70 × 10^-6,  −1.03 × 10^-4,   3.43 × 10^-4,  −5.95 × 10^-5,   1.01 × 10^-5,  −2.78 × 10^-4]
           [ 1.32 × 10^-6,   2.02 × 10^-5,  −5.95 × 10^-5,   7.28 × 10^-5,   5.51 × 10^-5,  −1.39 × 10^-5]
           [ 1.65 × 10^-7,  −7.80 × 10^-6,   1.01 × 10^-5,   5.51 × 10^-5,   1.50 × 10^-5,  −9.23 × 10^-5]
           [ 3.60 × 10^-6,   5.02 × 10^-5,  −2.78 × 10^-4,  −1.39 × 10^-5,  −9.24 × 10^-5,   3.09 × 10^-4]
α1         [2.84, 3.35 × 10^-4, 6.22 × 10^1, 2.75 × 10^-3, 6.65 × 10^-4, 3.10 × 10^-3]
βf1        622729.59
Σ2         [ 5.66 × 10^-7,   6.32 × 10^-6,  −8.49 × 10^-6,   1.16 × 10^-6,   2.14 × 10^-7,   4.77 × 10^-6]
           [ 6.32 × 10^-6,   1.89 × 10^-4,  −1.65 × 10^-4,   1.69 × 10^-5,   2.91 × 10^-6,   7.49 × 10^-5]
           [−8.49 × 10^-6,  −1.65 × 10^-4,   2.65 × 10^-4,  −5.83 × 10^-5,  −1.75 × 10^-5,  −1.67 × 10^-4]
           [ 1.16 × 10^-6,   1.69 × 10^-5,  −5.83 × 10^-5,   6.79 × 10^-5,   5.04 × 10^-5,  −9.07 × 10^-6]
           [ 2.14 × 10^-7,   2.91 × 10^-6,  −1.75 × 10^-5,   5.04 × 10^-5,   1.41 × 10^-4,  −7.02 × 10^-5]
           [ 4.77 × 10^-6,   7.49 × 10^-5,  −1.67 × 10^-4,  −9.07 × 10^-6,  −7.02 × 10^-5,   1.91 × 10^-4]
α2         [3.68 × 10^-1, 9.87 × 10^-4, 2.73 × 10^-4, 1.81 × 10^-2, 2.13 × 10^-1, 2.08 × 10^-4]
βf2        661996.37
c3         [0.021, 0.004, 0.007, 0.003, 0.003, 0.007, 0.757]
Σ3         [ 1.22 × 10^-4,   −5.85 × 10^-5,    4.884 × 10^-6,   3.230 × 10^-6,   7.690 × 10^-6]
           [−5.852 × 10^-5,   2.67 × 10^-4,   −4.86 × 10^-5,   −6.793 × 10^-6,  −2.04 × 10^-4]
           [ 4.884 × 10^-6,  −4.859 × 10^-5,   5.42 × 10^-5,    3.59 × 10^-5,    1.000 × 10^-5]
           [ 3.230 × 10^-6,  −6.793 × 10^-6,   3.59 × 10^-5,    1.28 × 10^-4,   −5.357 × 10^-5]
           [ 7.69 × 10^-4,   −2.04 × 10^-4,    1.000 × 10^-5,  −5.357 × 10^-5,   2.01 × 10^-4]
α3         [0.002, 0.000, 0.007, 0.010, 0.000]
βf3        657070.27
References
8. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer-Verlag, New York,
NY (2006)
9. Bogdanov, A.: Optimal control of a double inverted pendulum on a cart. Oregon Health
and Science University, Tech. Rep. CSE-04-006, OGI School of Science and Engineering,
Beaverton, OR (2004)
10. Boyles, R.A.: On the convergence of the EM algorithm. Journal of the Royal Statistical
Society. Series B (Methodological) pp. 47–50 (1983)
11. Callegari, A., Mijalkov, M., Gököz, A.B., Volpe, G.: Computational toolbox for optical
tweezers in geometrical optics. Journal of the Optical Society of America B 32(5), B11–B19
(2015)
12. Cherkassky, V., Shao, X., Mulier, F.M., Vapnik, V.N.: Model complexity control for
regression using VC generalization bounds. IEEE Transactions on Neural Networks 10(5),
1075–1089 (1999)
13. Cleveland, W.S., Devlin, S.J.: Locally weighted regression: an approach to regression
analysis by local fitting. Journal of the American Statistical Association 83(403), 596–610
(1988)
14. Csató, L., Opper, M.: Sparse on-line Gaussian processes. Neural Computation 14(3),
641–668 (2002)
15. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) pp.
1–38 (1977)
16. Germain, P., Bach, F., Lacoste, A., Lacoste-Julien, S.: PAC-Bayesian theory meets
Bayesian inference. In: Advances in Neural Information Processing Systems, pp. 1884–1892
(2016)
17. Gijsberts, A., Metta, G.: Real-time model learning using incremental sparse spectrum
Gaussian process regression. Neural Networks 41, 59–69 (2013)
18. Grünwald, P.D., Mehta, N.A.: A tight excess risk bound via a unified PAC-Bayesian-
Rademacher-Shtarkov-MDL complexity. arXiv preprint arXiv:1710.07732 (2017)
19. Gunawardana, A., Byrne, W.: Convergence theorems for generalized alternating
minimization procedures. Journal of Machine Learning Research 6, 2049–2073 (2005)
20. Gupta, M.R., Chen, Y.: Theory and use of the EM algorithm. Foundations and Trends®
in Signal Processing 4(3), 223–296 (2011)
21. Ha, J.S., Choi, H.L.: Multiscale abstraction, planning and control using diffusion wavelets
for stochastic optimal control problems. In: IEEE International Conference on Robotics
and Automation, pp. 687–694 (2017)
22. Hastie, T., Loader, C.: Local regression: Automatic kernel carpentry. Statistical Science
pp. 120–129 (1993)
23. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining,
inference, and prediction. Springer-Verlag, New York, NY (2009)
24. Hensman, J., Matthews, A.G., Filippone, M., Ghahramani, Z.: MCMC for variationally
sparse Gaussian processes. In: Advances in Neural Information Processing Systems, pp.
1648–1656 (2015)
25. Hunt, K.J., Sbarbaro, D., Żbikowski, R., Gawthrop, P.J.: Neural networks for control
systems - a survey. Automatica 28(6), 1083–1112 (1992)
26. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts.
Neural Computation 3(1), 79–87 (1991)
27. Jordan, M.I.: Learning in graphical models (Adaptive computation and machine learning).
MIT Press, Cambridge, MA (1999)
28. Jordan, M.I., Jacobs, R.A.: Hierarchical mixtures of experts and the EM algorithm. Neural
Computation 6(2), 181–214 (1994)
29. Júnior, A.H.S., Barreto, G.A., Corona, F.: Regional models: A new approach for nonlinear
system identification via clustering of the self-organizing map. Neurocomputing 147,
31–46 (2015)
30. Kakade, S.M., Sridharan, K., Tewari, A.: On the complexity of linear prediction: Risk
bounds, margin bounds, and regularization. In: Advances in Neural Information Processing
Systems, pp. 793–800 (2009)
31. Karatzas, I., Shreve, S.E.: Brownian motion and stochastic calculus. Springer Science &
Business Media, New York, NY (2012)
32. Kononenko, I.: Bayesian neural networks. Biological Cybernetics 61(5), 361–370 (1989)
33. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
(2012)
34. Lázaro-Gredilla, M., Quiñonero Candela, J., Rasmussen, C.E., Figueiras-Vidal, A.R.:
Sparse spectrum Gaussian process regression. Journal of Machine Learning Research
11(Jun), 1865–1881 (2010)
35. MacKay, D.J.: Bayesian interpolation. Neural Computation 4(3), 415–447 (1992)
36. Matthews, A.: Scalable Gaussian process inference using variational methods. Ph.D. thesis,
University of Cambridge (2016)
37. Matthews, A.G.d.G., Hensman, J., Turner, R., Ghahramani, Z.: On sparse variational
methods and the Kullback-Leibler divergence between stochastic processes. Journal of
Machine Learning Research 51, 231–239 (2016)
38. Meier, F., Hennig, P., Schaal, S.: Efficient Bayesian local model learning for control. In:
Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference
on, pp. 2244–2249 (2014)
39. Meir, R., Zhang, T.: Generalization error bounds for Bayesian mixture algorithms. Journal
of Machine Learning Research 4(Oct), 839–860 (2003)
40. Narendra, K.S., Parthasarathy, K.: Identification and control of dynamical systems using
neural networks. IEEE Transactions on Neural Networks 1(1), 4–27 (1990)
41. Neal, R.M.: Bayesian learning for neural networks, vol. 118. Springer Science & Business
Media (2012)
42. Neal, R.M., Hinton, G.E.: In: M.I. Jordan (ed.) Learning in Graphical Models, chap. A
View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, pp.
355–368. Springer, Dordrecht (1998)
43. Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse,
and other variants. In: Learning in Graphical Models, pp. 355–368. Springer (1998)
44. Nelles, O.: Nonlinear system identification: from classical approaches to neural networks
and fuzzy models. Springer Science & Business Media (2013)
45. Ogunnaike, B.A., Ray, W.H.: Process dynamics, modeling, and control. Oxford University
Press New York (1994)
46. Opper, M., Archambeau, C.: The variational Gaussian approximation revisited. Neural
Computation 21(3), 786–792 (2009)
47. Opper, M., Vivarelli, F.: General bounds on Bayes errors for regression with Gaussian
processes. In: Advances in Neural Information Processing Systems, pp. 302–308 (1999)
48. Rajasekaran, K., Bollavaram, M., Banerjee, A.G.: Toward automated formation of
microsphere arrangements using multiplexed optical tweezers. In: SPIE Nanoscience +
Engineering, p. 99222Y. San Diego, CA (2016)
49. Rasmussen, C.E.: Gaussian processes in machine learning. In: Advanced Lectures on
Machine Learning, pp. 63–71. Springer (2004)
50. Rudy, S., Alla, A., Brunton, S.L., Kutz, J.N.: Data-driven identification of parametric
partial differential equations. arXiv preprint arXiv:1806.00732 (2018)
51. Schaal, S., Atkeson, C.G.: Constructive incremental learning from only local information.
Neural Computation 10(8), 2047–2084 (1998)
52. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances
in Neural Information Processing Systems, pp. 1257–1264 (2006)
53. Ting, J.A., Vijayakumar, S., Schaal, S.: Locally weighted regression for control. In:
Encyclopedia of Machine Learning, pp. 613–624. Springer (2011)
54. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. Journal of
Machine Learning Research 1, 211–244 (2001)
55. Titsias, M.: Variational learning of inducing variables in sparse Gaussian processes. In:
Artificial Intelligence and Statistics, pp. 567–574 (2009)
56. Tu, J.H., Rowley, C.W., Luchtenburg, D.M., Brunton, S.L., Kutz, J.N.: On dynamic mode
decomposition: theory and applications. arXiv preprint arXiv:1312.0041 (2013)
57. Tzikas, D.G., Likas, A.C., Galatsanos, N.P.: The variational approximation for Bayesian
inference: Life after the EM algorithm. IEEE Signal Processing Magazine 25(6), 131–146
(2008)
58. Vijayakumar, S., Schaal, S.: Locally weighted projection regression: An O(n) algorithm
for incremental real time learning in high dimensional space. In: International Conference
on Machine Learning, pp. 288–293 (2000)
59. van der Wilk, M., Rasmussen, C.E., Hensman, J.: Convolutional Gaussian processes. In:
Advances in Neural Information Processing Systems, pp. 2845–2854 (2017)
60. Wilson, A.G., Hu, Z., Salakhutdinov, R.R., Xing, E.P.: Stochastic variational deep kernel
learning. In: Advances in Neural Information Processing Systems, pp. 2586–2594 (2016)
61. Wu, C.J.: On the convergence properties of the EM algorithm. The Annals of Statistics
pp. 95–103 (1983)
62. Xu, L., Jordan, M.I., Hinton, G.E.: An alternative model for mixtures of experts. In:
Advances in Neural Information Processing Systems, pp. 633–640 (1995)