Scalable Bayesian inference for time series via divide-and-conquer
2Google)
Abstract
Bayesian computational algorithms tend to scale poorly as data size increases. This has motivated divide-and-conquer-based approaches for scalable inference. These divide the data into subsets, perform inference for each subset in parallel, and then combine these inferences. While appealing theoretical properties and practical performance have been demonstrated for independent observations, scalable inference for dependent data remains challenging. In this work, we study the problem of Bayesian inference from very long time series. The literature in this area focuses mainly on approximate approaches that usually lack rigorous theoretical guarantees and may provide arbitrarily poor accuracy in practice. We propose a simple and scalable divide-and-conquer method, and provide accuracy guarantees. Numerical simulations and real data applications demonstrate the effectiveness of our approach.
Keywords Dependent data; Dynamic models; Embarrassingly parallel; Markov chain Monte Carlo; Scalable Bayes; Wasserstein barycenter.
1 Introduction
Massive amounts of data are routinely collected in a variety of settings. This has necessitated the development of algorithms for statistical inference that scale well with data size. While variational (Beal, 2003; Blei et al., 2017) and sequential Monte Carlo (Del Moral et al., 2006) methods are popular, Markov chain Monte Carlo (MCMC) algorithms remain the default approach for most Bayesian statisticians. Unfortunately, standard MCMC scales at least linearly with the size of the data, which is inadequate for truly massive datasets. This has motivated the development of scalable versions of MCMC algorithms.
One approach is to use sub-sampling-based algorithms for scalable MCMC (Ma et al., 2015; Quiroz et al., 2018; Nemeth and Fearnhead, 2020). The main idea is to use random subsets of the data to approximate likelihoods and gradients at each MCMC iteration. This includes algorithms based on Langevin (Welling and Teh, 2011) and Hamiltonian (Chen et al., 2014) dynamics. Such algorithms were initially developed for independent and identically distributed observations, and have since been extended to hidden Markov models (HMMs) (Ma et al., 2017b; Aicher et al., 2019), and more recently to general stationary time series models (Salomone et al., 2020; Villani et al., 2022). These algorithms incorporate approximations to the transition kernel that can have an unclear impact on the stationary distribution of the Markov chain, and theoretical guarantees require verifying challenging conditions on a case-by-case basis.
A promising recent approach based on stochastic gradients for independent observations has been the use of non-reversible continuous-time Markov processes, in particular, piecewise-deterministic Markov processes (Bouchard-Côté et al., 2018; Bierkens et al., 2019). These are appealing because, unlike other stochastic gradient MCMC algorithms, they preserve the exact posterior distribution as the invariant distribution of the process. While promising, current algorithms require the construction of certain upper bounds for gradients, which limits their practical applicability. Moreover, recent results by Johndrow et al. (2020) have shown that sub-sampling-based posterior sampling algorithms have fundamental limitations on scalable performance in big data settings.
An alternative class of methods relies instead on divide-and-conquer-based approaches for scalable Bayesian inference. In this case, the entire dataset is divided into subsets, inference is carried out using MCMC for each subset in parallel, and finally the so-called subset posteriors are combined. A key focus in the literature has been on how to combine the subset posteriors. An early proposal was the consensus Monte Carlo algorithm of Scott et al. (2016), which is based on averaging draws from each subset posterior – a strategy that can be justified theoretically when the subset posteriors are approximately Gaussian. Subsequent work has exploited the product form of the joint posterior to obtain combining algorithms, including using kernel smoothing (Neiswanger et al., 2014), multi-scale histograms (Wang et al., 2015), or a Weierstrass approximation (Wang and Dunson, 2013). Alternatively, Minsker et al. (2014) proposed to combine via the geometric median of subset posteriors.
A recent thread has focused on using the Wasserstein barycenter (WB) of subset posteriors to combine them (Li et al., 2017; Srivastava et al., 2018). Each subset posterior is raised to an appropriate power to obtain a subset-based approximation to the full data posterior. Then one can ‘average’ these approximations to obtain a provably accurate approximation to the full data posterior. The Wasserstein barycenter provides an appropriate geometric notion of average for probability measures. In addition to impressive practical performance, it has also been shown that WB-based combining algorithms result in optimal posterior contraction rates and give the correct coverage probabilities of credible sets in certain settings (Szabó and Van Zanten, 2019).
The approaches described in the previous two paragraphs are termed ‘embarrassingly parallel’ procedures, where communication between cores is limited to a single unification step. These algorithms introduce an additional source of error beyond the MCMC sampling error by combining posterior samples from the subset posteriors in an approximate manner. This means that the samples produced from the posterior are approximate even if one is able to produce exact samples from the subset posteriors. In contrast, Dai et al. (2023) have developed an algorithm which, while not embarrassingly parallel, does not induce this additional source of error for essentially independent data (see also Dai et al., 2019, for an earlier related paper). In this paper, we focus on embarrassingly parallel approaches for simplicity.
The above literature on divide-and-conquer algorithms focuses on independent data settings. In this article, we are motivated by the considerable practical and theoretical success of the WB-based algorithms to study appropriate modifications for very long time series. To our knowledge, although Guhaniyogi et al. (2017) developed a related approach for massive spatial datasets, there has been no consideration of divide-and-conquer Bayesian algorithms in the time series setting. For very long time series, usual MCMC and sequential Monte Carlo algorithms are impractical to implement. There is a rich literature, mostly in machine learning, on alternatives – ranging from variational approximations (Johnson and Willsky, 2014; Foti et al., 2014) to assumed density filtering (Lauritzen, 1992). In general, these approaches lack any theoretical guarantees on accuracy in approximating the true posterior, and indeed it is well known that they can perform arbitrarily poorly in practice – for example, substantially under-estimating posterior uncertainty. When parallel computing resources are available, the divide-and-conquer approach can have major practical advantages over such stochastic-gradient-based methods.
In this article, we develop a simple, broadly applicable, and theoretically supported class of WB-based divide-and-conquer methods for massive time series – considering general models and not just HMMs. We call this divide-and-conquer for Bayesian time series (DC-BATS). Our approach is based on dividing the time series sequentially over time. One could alternatively devise a divide-and-conquer approach based on the Whittle likelihood (Whittle, 1951), which consists of transforming a stationary time series to its frequency domain; Salomone et al. (2020) and Villani et al. (2022) devise sub-sampling MCMC algorithms based on this. However, such an approach would be unable to handle things like state space models and missing data as naturally as the standard time domain approach, and we do not pursue it further here. Moreover, Guhaniyogi et al. (2017) considered a divide-and-conquer approach for large scale kriging. Two differences between their work and ours are that (a) they only consider dependence using Gaussian processes, whereas we consider arbitrary time series models, and (b) we provide error guarantees for the difference between the Wasserstein posterior and the true posterior (Theorem 1), which in turn implies error rates for the bias and variance of the Wasserstein posterior (Theorem 2), whereas they provide guarantees on the mean and bias (Theorem 3.1 in their paper), which in turn implies error rates for the risk.
In work independent of ours, Wang and Srivastava (2023) propose a divide-and-conquer approach for finite state-space HMMs. Three differences between their work and ours are that (a) we use the Wasserstein barycenter to combine subset posteriors, whereas they use an extension of the double-parallel Monte Carlo algorithm of Xue and Liang (2019), (b) we consider generic time series models and not just finite state-space HMMs, and (c) they consider likelihoods which can be calculated exactly for finite state-space HMMs using the forward-backward filtering algorithm, whereas we are more flexible as described in Section 2.2.
The rest of the article is organized as follows. We introduce the proposed DC-BATS method in Section 2. Section 3 is devoted to a theoretical analysis of the method. In particular, we show that the proposed method returns asymptotically exact estimates of the posterior distribution. We demonstrate the proposed method on a variety of time series models with synthetic data in Section 4. We apply the proposed method to a Los Angeles particulate matter dataset in Section 5. Finally, Section 6 concludes the article. All proofs are contained in the Supplementary Material.
2 Divide-and-conquer for time series
2.1 Generic time series model
We are interested in observations indexed by time in this article. We follow the definition of Cinlar (2011) to define such a time-indexed stochastic process. We let be the set of integers. For each , let be a random variable taking values in . The collection is a stochastic process with state space . For any two integers , we use the notation to denote . Of primary interest in this article will be a time series ; these will be our observations. We assume that the conditional distribution of is independent of for any . This means that there is indeed a temporal dependence among the observations, and it rules out structures such as spatial dependence, where an observation depends on both “past” and “future” observations. Divide-and-conquer approaches for spatial data using Gaussian processes have been proposed by Guhaniyogi et al. (2017), and we do not focus on this here. We assume that the conditional distribution of is parameterised by for every , and the marginal distribution of is also parameterised by the same .
We assume in the sequel that all measures we consider on admit densities with respect to a reference measure corresponding to the Lebesgue measure on . These will include the conditional distributions and the marginal distribution defined above, as well as posterior distributions to be defined below.
In general, the log-likelihood of the observations can be written as
(1) |
where denotes the likelihood of given the previous values ; when , is the marginal density of . The likelihood of any temporal sequence of observations can be written as equation 1, including independent observations. In this article, we are specifically interested in the situation where the observations are not independent. This includes a variety of commonly-used statistical models for time series analysis such as autoregressive moving average models and autoregressive conditional heteroskedasticity models. More generally, this includes hidden Markov models as well.
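As an illustration of the decomposition described above, the following minimal sketch evaluates the log-likelihood of a stationary Gaussian AR(1) series as a sum of one marginal term and a sequence of conditional terms. The AR(1) model, the parameter names `phi` and `sigma2`, and the helper names are assumptions chosen for illustration; they are not taken from the article.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of a N(mean, var) distribution evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_loglik(y, phi, sigma2):
    """Log-likelihood of a stationary AR(1) series (requires |phi| < 1).

    The first term is the stationary marginal density of y[0]; every later
    term is the conditional density of y[t] given the past, which for an
    AR(1) process depends on the past only through y[t - 1].
    """
    total = normal_logpdf(y[0], 0.0, sigma2 / (1.0 - phi ** 2))
    for t in range(1, len(y)):
        total += normal_logpdf(y[t], phi * y[t - 1], sigma2)
    return total


y = [0.3, -0.1, 0.4, 0.2]
print(ar1_loglik(y, phi=0.5, sigma2=1.0))
```

Setting `phi = 0` reduces every conditional term to the same marginal density, which recovers the independent-observations case mentioned above.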
Bayesian inference for involves placing a prior distribution on and computing the posterior distribution
(2) |
We call this the full posterior as it is the posterior conditional on the entire dataset. Samples from the posterior (2) can be drawn using Markov chain Monte Carlo (MCMC) algorithms such as the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), including efficient gradient-based algorithms such as the Metropolis-adjusted Langevin algorithm (MALA; Roberts and Tweedie, 1996) and Hamiltonian Monte Carlo (HMC; Duane et al., 1987). Unfortunately, the log-likelihood (1) needs to be evaluated at every iteration of MCMC algorithms, making it computationally intractable when is large. Moreover, for large , memory constraints may make it infeasible to store and manipulate the entire dataset on a single computer, thus precluding standard MCMC on the entire dataset. In this paper, we propose an embarrassingly parallel divide-and-conquer strategy to tackle this issue.
2.2 Divide-and-conquer algorithm
A generic divide-and-conquer strategy proceeds by dividing the observations into disjoint subsets of sizes , respectively, such that . To keep things simple, we assume that the subset sizes are equal, that is, . While independent observations can be divided arbitrarily into subsets, in the time series setup we divide them sequentially over time; we thus consider subsequences instead of subsets. We denote the observations within the th subsequence by and the complete dataset by . In particular, we have that .
For each subsequence , , we first define pseudo likelihoods by ignoring past observations and assuming that starts at time one. We require some additional notation to define this formally. Let denote the marginal density of , and for , let denote the conditional density of . Using this notation, the density of the first subsequence is . We define , and for , we define
(3) |
While the notation here is somewhat clunky, all we do is simply ignore observations before each subsequence and treat it as if it starts at time one. This is easy to implement in practice.
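The construction above can be sketched in a few lines: split the series into contiguous blocks and evaluate each block's likelihood as if the block started at time one. The sketch below again uses a stationary Gaussian AR(1) model as a stand-in; the model and all function names are illustrative assumptions rather than the article's implementation.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of a N(mean, var) distribution evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_loglik(y, phi, sigma2):
    """AR(1) log-likelihood that treats y[0] as a draw from the stationary marginal."""
    total = normal_logpdf(y[0], 0.0, sigma2 / (1.0 - phi ** 2))
    for t in range(1, len(y)):
        total += normal_logpdf(y[t], phi * y[t - 1], sigma2)
    return total


def split_sequential(y, K):
    """Split y into K contiguous subsequences of equal length."""
    if K <= 0 or len(y) % K != 0:
        raise ValueError("len(y) must be divisible by K")
    m = len(y) // K
    return [y[k * m:(k + 1) * m] for k in range(K)]


def pseudo_logliks(y, K, phi, sigma2):
    """Pseudo log-likelihood of each block, ignoring all observations before it."""
    return [ar1_loglik(block, phi, sigma2) for block in split_sequential(y, K)]


y = [0.3, -0.1, 0.4, 0.2, -0.5, 0.1]
print(pseudo_logliks(y, K=3, phi=0.5, sigma2=1.0))
```

Each block's first observation is scored with the marginal density rather than the conditional density given the previous block, which is exactly the information that the pseudo likelihood discards.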
Using these pseudo likelihoods, we define subsequence posteriors for every subsequence as
(4) |
where . In other words, we define subsequence posteriors for each subsequence by ignoring observations before it. We have used the subscript in to signify that the lengths of the subsequences are equal to . For a more compact notation, we shall use to denote .
The quantities in equation 4 control the interplay between the prior and the likelihood, as well as the relative importance of each subsequence. Consider the situation where we divide the time series into only one subsequence, that is, . In this case, choosing in equation 4 gives the full posterior (2). More generally, for subsequences, choosing results in equation 4 giving the usual posteriors for the subsequences. This results, however, in the subsequence posteriors having a different scale than the full posterior as they are based on only a fraction of observations as compared to the full posterior.
We shall assume for our theoretical analysis that the series is stationary (defined formally in Assumption 1), which means that the joint distribution of is the same for all . Heuristically, this suggests that the amount of ‘information’ contained within each subsequence is the same, and it also suggests that the ‘information’ contained in the entire time series of length is intuitively times more than that contained within each subsequence. This leads us to choose . Indeed, our theoretical results in Section 3 and numerical experiments in Section 4 indicate that this is an appropriate choice in our setting.
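The scale argument can be made concrete with a deliberately simple worked example. The notation here is assumed for illustration: $\gamma$ denotes the power applied to the pseudo likelihood, $K$ the number of subsequences, $m$ their common length, and $T = Km$ the total length. The toy model is i.i.d. Gaussian with known variance and a flat prior, which is not one of the article's time series models but makes the effect of the power explicit.

```latex
% Toy model (assumed): y_i ~ N(theta, sigma^2) i.i.d., flat prior on theta.
% \bar{y}_k is the sample mean of the k-th subsequence of length m.
\begin{align*}
\pi_\gamma(\theta \mid y_{[k]})
  &\propto \Big\{ \prod_{i \in [k]} \mathcal{N}(y_i;\, \theta, \sigma^2) \Big\}^{\gamma}
   \propto \exp\!\Big\{ -\frac{\gamma m}{2\sigma^2} (\theta - \bar{y}_k)^2 \Big\}, \\
\gamma = 1:&\quad \theta \mid y_{[k]} \sim \mathcal{N}\!\big(\bar{y}_k,\ \sigma^2 / m\big), \\
\gamma = K:&\quad \theta \mid y_{[k]} \sim \mathcal{N}\!\big(\bar{y}_k,\ \sigma^2 / (Km)\big)
            = \mathcal{N}\!\big(\bar{y}_k,\ \sigma^2 / T\big).
\end{align*}
```

With $\gamma = 1$ the subsequence posterior is wider than the full posterior by a factor of $\sqrt{K}$ in standard deviation, whereas $\gamma = K$ restores the full-data scale $\sigma^2 / T$.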
A key question is how to combine these subsequence posteriors to approximate the full posterior. We use the Wasserstein barycenter of the set of subsequence posteriors in this article. Let denote the set of all probability measures on with finite second moments. In such a setting, the Wasserstein- () distance between probability measures is defined as
where denotes the set of all probability measures on with marginals and , respectively, and denotes the Euclidean norm. Convergence in distance on is equivalent to weak convergence plus convergence of the second moment (Bickel and Freedman, 1981, Lemma 8.3). The Wasserstein barycenter of the subsequence posteriors is defined as
(5) |
where we use the subscript to denote that the Wasserstein barycenter is based on total observations. We also assume that admits a density with respect to the Lebesgue measure, that is, .
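For reference, the standard textbook forms of the two objects described above can be written as follows. The symbols are assumptions made for this sketch: $\Theta$ is the parameter space, $\mathcal{P}_2(\Theta)$ the set of probability measures on $\Theta$ with finite second moments, $\Gamma(\mu, \nu)$ the set of couplings of $\mu$ and $\nu$, and $\Pi_1, \dots, \Pi_K$ the subsequence posteriors; the article's own notation may differ.

```latex
% Wasserstein-2 distance between mu and nu in P_2(Theta):
\[
W_2(\mu, \nu)
  = \Big( \inf_{\gamma \in \Gamma(\mu, \nu)}
      \int_{\Theta \times \Theta} \lVert x - y \rVert^2 \, \mathrm{d}\gamma(x, y) \Big)^{1/2}.
\]
% Wasserstein barycenter of K measures with equal weights:
\[
\overline{\Pi}
  = \operatorname*{arg\,min}_{\Pi \in \mathcal{P}_2(\Theta)}
      \frac{1}{K} \sum_{k=1}^{K} W_2^2(\Pi, \Pi_k).
\]
```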
Remark 1.
Although other distances between probability measures can potentially be used in place of in equation 5, Wasserstein-2 is particularly appealing because the resulting barycenter is the geometric center of the subsequence posteriors (Agueh and Carlier, 2011; Srivastava et al., 2015). Agueh and Carlier (2011) established desirable properties of the Wasserstein barycenter such as existence and uniqueness, and Srivastava et al. (2015) provided a strong consistency result for it. Moreover, it has been shown theoretically that Wasserstein-based posteriors have appealing asymptotic properties as well (Szabó and Van Zanten, 2019).
Li et al. (2017) used this approach to combine subset posteriors in the independent observations setting. Their approach cannot be directly applied to the time series setting because it leverages the fact that the full data likelihood can be factorized as a product of likelihoods for each observation – this is the case for independent data but not for time series. We show that, perhaps surprisingly, this is not a problem: defining subsequence posteriors as above and combining them via equation 5 results in provably accurate approximations to the true posterior.
Computing the Wasserstein barycenter exactly is computationally very expensive and remains an open area of research. However, efficient numerical algorithms have been developed to approximate the barycenter (Cuturi and Doucet, 2014; Dvurechenskii et al., 2018). For one-dimensional functionals of the parameter , the Wasserstein barycenter can be obtained by simply averaging quantiles. Let and , and let be a one-dimensional linear functional of . We abuse notation to let and denote the cumulative distribution functions corresponding to and , respectively. For any , the quantile function of the subsequence posterior distribution of is
and the quantile function of its Wasserstein posterior is
where and are the posteriors induced by making the linear transformation from and , respectively. Given samples from the subsequence posteriors, this can be used to straightforwardly obtain credible intervals for each component of the Wasserstein posterior.
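The quantile-averaging step described above can be sketched as follows, assuming that posterior draws of a scalar functional are already available for each subsequence. The function names and the linear-interpolation quantile rule are illustrative choices, not details taken from the article.

```python
import math


def empirical_quantile(samples, u):
    """Empirical quantile at level u in [0, 1], with linear interpolation."""
    xs = sorted(samples)
    pos = u * (len(xs) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


def barycenter_quantile(subposterior_samples, u):
    """Quantile of the one-dimensional barycenter: the average of the
    subsequence-posterior quantiles at the same level u."""
    qs = [empirical_quantile(s, u) for s in subposterior_samples]
    return sum(qs) / len(qs)


def barycenter_interval(subposterior_samples, level=0.95):
    """Equal-tailed credible interval of the combined posterior."""
    alpha = (1.0 - level) / 2.0
    return (barycenter_quantile(subposterior_samples, alpha),
            barycenter_quantile(subposterior_samples, 1.0 - alpha))


draws = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]  # toy draws from two subsequence posteriors
print(barycenter_quantile(draws, 0.5))
print(barycenter_interval(draws, level=0.5))
```

In practice `draws` would hold the post-burn-in MCMC samples of one parameter component from each machine, and the combination step costs only a sort per subsequence.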
The proposed divide-and-conquer for Bayesian time series (DC-BATS) method is summarized in Algorithm 1.
Remark 2.
If there is only finite-order dependence in the observations (for example, dependence of order , meaning that depends only on ), then one can write the full log-likelihood exactly as a sum over subsets, where each subset conditions on the last observations from the previous subset. In such a case, one can use the log-likelihood split as described above rather than our proposed strategy which ignores the last observations from the previous subsequence, and then use Wasserstein averages. For a fixed , such an approach would have similar computational cost as our proposed method, and we conjecture that it is also possible to provide similar theoretical results as the ones we provide in Section 3.2. However, we do not pursue this here as our proposed method can also handle situations when it is not possible to write down the dependence structure as a finite-order dependence (for example, for hidden Markov models).
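For the finite-order case discussed in this remark, the exact split can be checked numerically. The sketch below uses a stationary Gaussian AR(1) model (dependence of order one) as an assumed example: each block after the first conditions on the last observation of the previous block, and the block log-likelihoods sum to the full log-likelihood. All names are hypothetical.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of a N(mean, var) distribution evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_block_loglik(block, prev, phi, sigma2):
    """Log-likelihood of one block of an AR(1) series.

    prev is the last observation of the previous block, or None for the
    first block, in which case the stationary marginal is used.
    """
    total = 0.0
    last = prev
    for y_t in block:
        if last is None:
            total += normal_logpdf(y_t, 0.0, sigma2 / (1.0 - phi ** 2))
        else:
            total += normal_logpdf(y_t, phi * last, sigma2)
        last = y_t
    return total


def exact_split_loglik(y, K, phi, sigma2):
    """Sum of block log-likelihoods, each conditioning on the preceding observation."""
    if K <= 0 or len(y) % K != 0:
        raise ValueError("len(y) must be divisible by K")
    m = len(y) // K
    total = 0.0
    for k in range(K):
        prev = None if k == 0 else y[k * m - 1]
        total += ar1_block_loglik(y[k * m:(k + 1) * m], prev, phi, sigma2)
    return total


y = [0.3, -0.1, 0.4, 0.2, -0.5, 0.1]
print(exact_split_loglik(y, 3, 0.5, 1.0), exact_split_loglik(y, 1, 0.5, 1.0))
```

The two printed values agree up to floating-point rounding, in contrast to the pseudo-likelihood split, which drops the conditioning on the previous block.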
3 Theoretical guarantees
3.1 Notations and main assumptions
For notational convenience, we assume that there exists an infinitely long stationary stochastic process , and that is the observed sequence from said series. Our proposed methodology does not involve time indices outside , and indeed we are interested in the posterior distribution as given by equation 2.
We assume that is the “true” model parameter, and that is generated by , the probability measure induced by . We use to denote expectation with respect to the probability measure . We use to denote a normal distribution with mean and variance .
We define to be the pseudo log-likelihood of the th subsequence, where is as defined in equation 3. We use and to denote the gradient and Hessian matrix of with respect to , respectively. We use to denote the maximum likelihood estimator (MLE) of the th subsequence under , that is, , and to be the average MLE across the subsequences. We let be the MLE based on the complete dataset, that is, , where is the log-likelihood of the entire dataset as given in equation 1. We assume that the MLEs are unique.
Throughout this paper, we denote the norm by . We use Euclidean and Frobenius norms for vectors and matrices, respectively, that is, if and if . For a sequence of real-valued random variables , where each is measurable with respect to the sigma-algebra generated by , and a sequence of real numbers , we say that if for any , there exists a real number and a positive integer such that for any . Moreover, we say that if for any , . We let represent the sigma-algebra generated by a (potentially infinite) collection of random variables . We use as a shorthand for , that is, to represent the gradient with respect to .
We describe the main assumptions related to the time series nature of the observations in this section. There are additional assumptions which are similar to those made by Li et al. (2017), but we defer these to Appendix A to simplify the exposition.
We consider time series which are stationary and ergodic in this article, as formalised in Assumption 1.
Assumption 1 (Stationarity).
is a strictly stationary and ergodic process.
Moreover, we make the following assumption on the mixing time of the process.
Assumption 2 (Mixing time).
The -mixing coefficient of , a measure of dependency defined as
is such that for a real number .
Assumption 2 precludes long-range dependence. For processes with large , a slow mixing rate will be sufficient to guarantee . Moreover, this assumption always holds for geometrically ergodic processes as, for such processes, there exists a such that for all sufficiently large , and thus the sum of interest is upper bounded by .
Finally, we make an assumption on the score function as follows.
Assumption 3 (Geometrically decaying score function).
There exists a constant , a sufficiently large integer , and a constant such that
Assumption 3 states that the dependence of score functions on the history decays geometrically as the number of time steps increases. Such an assumption is automatically true for finite order Markov processes. For an -order Markov process, we can choose so that for all . Therefore, we have
It also holds for finite state-space hidden Markov models under some additional regularity assumptions (for example, Bickel et al., 1998, Lemma 6). In this lemma, there exists an such that , which implies that there exists such that
While this may appear as a strong assumption, we have seen in our numerical experiments that DC-BATS has good performance even when this is not necessarily satisfied.
3.2 Main results
We present the main theoretical results of the paper in this section. Our results are based on asymptotic theory as the total length of the time series increases to infinity. We prove that the error due to combining the subsequence posteriors using DC-BATS is asymptotically negligible as . This will require increasing the size of each subsequence , but at a potentially much slower rate than . The proofs are in the Supplementary Material, and build on results of Li et al. (2017), extended to the time series setting. We first present the following Lemma 1, which is novel to this work and is instrumental in proving our later results.
Lemma 1.
Under Assumptions 1–3 in the main text and Assumptions 4–9 in Appendix A, . If we assume further that is an unbiased estimator for , that is, , and is at least , then .
Based on Lemma 1, the following Theorem 1 is the first main theoretical result of this paper, where we recall that is the MLE of the th subsequence and is the average MLE across the subsequences.
Theorem 1 (Error due to combining subsequence posteriors).
Suppose Assumptions 1–3 in the main text and Assumptions 4–9 in Appendix A hold. Let for some fixed and . Let , , and . Then we have the following.
1. As ,
where the convergences are in -probability.
2. If
(a) is an unbiased estimator for , that is, , and
(b) is at least ,
then in -probability as .
For the first part of the theorem to hold, it suffices to let at a much slower rate than . The second part of the theorem is more interesting and guarantees that the error between the Wasserstein posterior and the full posterior has the asymptotically optimal rate , in the special case where the MLE is unbiased and is at least . While we require to be at least , this is not restrictive in practice, as one typically divides the entire dataset into a fairly small number of subsequences in divide-and-conquer algorithms (on the order of tens or hundreds), whereas is much larger by comparison (on the order of hundreds of thousands or more). For instance, if is large, say , then we require the number of subsequences to be less than , which is a reasonable requirement in practice. Moreover, we have observed good performance of DC-BATS even when and are not very large. It is possible to leverage Theorem 1 to obtain accuracy guarantees on moments of the posterior. These are provided in the following Theorem 2.
Theorem 2 (Guarantees on first and second moments).
Let for some fixed and , and let be the domain of under the transformation. Since is the truth, we define the “bias” of a distribution as , where . Under Assumptions 1–3 in Section 3.1 and Assumptions 4–9 in Appendix A, we have the following.
1. and .
2. and .
Remark 3.
Theorem 2 quantifies the order of the biases and variances of both the Wasserstein posterior and the full posterior. The difference in the biases of these two posteriors is controlled by up to an asymptotically negligible term . Lemma 1 further shows that the bias difference is in general, and is improved to when the MLE is unbiased and is at least . In terms of the posterior variance, both posteriors align on the dominating term and the difference is only up to an asymptotically negligible term .
Remark 4.
We note that our theory focuses on the exact Wasserstein barycenter, while in practice we will instead calculate the Wasserstein barycenter between Monte Carlo approximations to the subset posteriors. Conceptually, our theory can be easily extended to also account for the Monte Carlo error component following a similar approach to Li et al. (2017); however, we do not consider that extension in this article.
4 Synthetic data experiments
We demonstrate DC-BATS on different time series models. We have noticed that stochastic gradient MCMC algorithms tend to under-estimate posterior variances and are therefore not accurate in quantifying posterior uncertainty. This has also been noticed in the literature (for example, Figure 2 of Nemeth and Fearnhead, 2020). Furthermore, it has been established that stochastic gradient MCMC algorithms have fundamental limitations in terms of their scalability versus accuracy (Johndrow et al., 2020). For these reasons, we compare the proposed method with running MCMC to sample from the full posterior. Since we have proven theoretically that DC-BATS performs well for stationary models for large and , we also consider non-stationary models, as well as models with low/moderate and , to test the method outside of idealized cases. All numerical experiments have been performed on a 2018 i7-8700 CPU with 3.20 GHz processing power. Code for all numerical experiments is available online at https://github.com/deborsheesen/DC-BATS-deborshee. Averaging the credible intervals produced by the subsequence posteriors takes negligible time compared to sampling from them, and we do not report this in our experiments.
4.1 Linear regression with auto-regressive errors
We first consider a linear regression model with auto-regressive errors as follows:
(6) |
and , where , and ; here denotes observations and denotes covariates. We set and . We choose and generate observations from this model.
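To make the data-generating process concrete, here is a minimal simulation sketch for a linear regression with autoregressive errors. Since the exact specification and parameter values are not reproduced in the text above, this sketch assumes first-order errors, standard normal covariates, and hypothetical parameter names (`beta`, `phi`, `sigma`); treat it as an illustration rather than the article's setup.

```python
import math
import random


def simulate_regression_ar1(T, beta, phi, sigma, seed=0):
    """Simulate y_t = x_t' beta + eps_t with eps_t = phi * eps_{t-1} + noise_t.

    Assumed details: covariates are i.i.d. standard normal, noise_t is
    N(0, sigma^2), and the error process starts from its stationary
    distribution when |phi| < 1 (and from zero otherwise).
    """
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(T)]
    if abs(phi) < 1.0:
        eps = rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2))
    else:
        eps = 0.0
    y = []
    for t in range(T):
        if t > 0:
            eps = phi * eps + rng.gauss(0.0, sigma)
        y.append(sum(b * v for b, v in zip(beta, x[t])) + eps)
    return x, y


x, y = simulate_regression_ar1(T=1000, beta=[1.0, -2.0], phi=0.5, sigma=1.0, seed=1)
print(len(y), y[:3])
```

Values of `phi` close to one produce slowly mixing errors, and `phi = 1` gives a unit-root error process, which is the kind of regime in which the stationarity assumption of the theory no longer holds.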
We choose independent priors on , and on each component of . We also choose an inverse-gamma prior on . We choose and draw samples from each subsequence posterior as well as from the full posterior using the no-U-turn sampler (NUTS; Hoffman and Gelman, 2014) as implemented in Stan (Carpenter et al., 2017), of which the first half are discarded as burn-in in each case. It took ten minutes to sample from the full posterior, about a minute to sample from each subsequence posterior for , and about half a minute to sample from each subsequence posterior for . We plot 95% credible intervals for in Figure 1, and observe that the credible intervals obtained by DC-BATS are virtually indistinguishable from those obtained by full data MCMC. The frequentist coverage of the credible intervals for for DC-BATS is 94% for and 92% for , and is 94% when we sample from the full posterior.
In addition to this single simulation example, we also run simulations under multiple settings of to represent different degrees of mixing in the time series, and with different numbers of machines ; the results are reported in Table 1. We consider (i) the i.i.d. case , (ii) the fast mixing case , (iii) the slow mixing case , (iv) the unit root case I when , and (v) the unit root case II when . We set , and simulate 100 datasets for every setting. For each setting, we thus obtain 100 credible intervals and compute their frequentist coverage. The results in Table 1 suggest that DC-BATS achieves frequentist coverage comparable to that of the full MCMC method when is small. However, the performance deteriorates as increases.
| | i.i.d. (DC) | i.i.d. (Full) | fast mixing (DC) | fast mixing (Full) | slow mixing (DC) | slow mixing (Full) | unit root case I (DC) | unit root case I (Full) | unit root case II (DC) | unit root case II (Full) |
|---|---|---|---|---|---|---|---|---|---|---|
| | 94 | 95 | 92 | 91 | 92 | 92 | 87 | 88 | 88 | 86 |
| | 92 | 95 | 90 | 91 | 88 | 92 | 85 | 88 | 85 | 86 |
| | 92 | 95 | 88 | 91 | 88 | 92 | 86 | 88 | 85 | 86 |
| | 92 | 95 | 85 | 91 | 82 | 92 | 82 | 88 | 81 | 86 |
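The frequentist coverage reported in Table 1 is the fraction of replicated credible intervals that contain the true parameter value. A minimal sketch of that computation, with a hypothetical function name and toy intervals, is:

```python
def coverage(intervals, truth):
    """Fraction of (lower, upper) intervals that contain the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)


# Toy example: four intervals from four simulated datasets, true value 0.55.
toy_intervals = [(0.0, 1.0), (2.0, 3.0), (-1.0, 2.0), (0.5, 0.6)]
print(coverage(toy_intervals, truth=0.55))
```

With 100 simulated datasets per setting, the resulting proportion has a Monte Carlo standard error of a few percentage points, which is worth keeping in mind when comparing neighbouring table entries.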
![Figure 1](x1.png)
4.2 Generalized autoregressive conditional heteroskedasticity (GARCH) model
GARCH models (Bollerslev, 1986) are very popular for modelling financial time series. These assume that the variance of the error term follows an autoregressive moving average process. Apart from finance, GARCH models have also been used in other domains such as healthcare (Nkalu and Edeme, 2019) and engineering (Ma et al., 2017a). We consider the following GARCH model with covariates
(7) |
where denotes covariates, denotes coefficients, , , and are coefficients. We let and set .
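As a rough illustration of how such a series can be generated, the sketch below simulates a GARCH(1,1) model with a linear covariate term in the mean. The orders, the standard normal covariates, and the names `omega`, `alpha`, `beta`, and `coef` are assumptions made for this example; the orders and parameter values used in the article's experiment are not reproduced here.

```python
import math
import random


def simulate_garch11(T, omega, alpha, beta, coef, seed=0):
    """Simulate y_t = x_t' coef + eps_t with eps_t ~ N(0, var_t) and
    var_t = omega + alpha * eps_{t-1}^2 + beta * var_{t-1}.

    The recursion starts from the unconditional variance, which requires
    alpha + beta < 1.
    """
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")
    rng = random.Random(seed)
    var = omega / (1.0 - alpha - beta)
    x, y, variances = [], [], []
    for _ in range(T):
        x_t = [rng.gauss(0.0, 1.0) for _ in coef]
        eps = math.sqrt(var) * rng.gauss(0.0, 1.0)
        x.append(x_t)
        y.append(sum(c * v for c, v in zip(coef, x_t)) + eps)
        variances.append(var)
        var = omega + alpha * eps ** 2 + beta * var  # variance for the next step
    return x, y, variances


x, y, variances = simulate_garch11(T=1000, omega=0.1, alpha=0.2, beta=0.7, coef=[1.0], seed=1)
print(min(variances), max(variances))
```

The conditional variance can swing over a wide range when `alpha + beta` is close to one, which makes consecutive subsequences look quite different from each other.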
We choose small values of and to test DC-BATS. We generate a time series of length from model (7) for , , and . We set , and . The observations and corresponding variances are plotted in Figure 2. We observe that the variances vary significantly across time (up to several orders of magnitude), as do the observations. We divide observations into subsequences as before, and it is evident that this results in a lot of variation among the observations across subsequences.
![Figure 2](x2.png)
We place a prior on , independent priors on each component of , and independent half-normal priors (that is, normal distributions restricted to the positive real line) on each component of and . We draw samples from each subsequence posterior as well as from the full posterior using NUTS, of which the first half are discarded as burn-in in each case. We present boxplots of the effective sample size (ESS) for each parameter in Figure 3, where we record the distribution of the ESS across the different machines. The ESS is satisfactory for each parameter: the median ESS is over of the total sample size for every parameter. In comparison, we recorded effective samples out of the total sample size. We conclude that the full MCMC method and the divide-and-conquer MCMC are not significantly different regarding effective sample size. It took around nine minutes to sample from the full posterior, and about a minute on average to sample from each subsequence posterior.
![Refer to caption](x3.png)
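The ESS summaries above can be illustrated with a simple autocorrelation-based estimator. The sketch below is an assumption about how ESS might be computed for a single scalar chain: it uses n / (1 + 2 Σ ρ_k), truncating the sum at the first non-positive sample autocorrelation. Probabilistic programming packages typically use more refined estimators, so values from this sketch need not match theirs exactly.

```python
import numpy as np

def effective_sample_size(chain):
    """Crude ESS estimate for a 1-D chain via truncated autocorrelation sum."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    var = np.dot(x, x) / n
    if var == 0.0:
        # A constant chain carries no autocorrelation information.
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)
        if rho <= 0.0:
            break  # stop at the first non-positive autocorrelation
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

rng = np.random.default_rng(1)
n = 2000
iid = rng.standard_normal(n)

# AR(1) chain with strong positive autocorrelation (phi = 0.9).
ar1 = np.empty(n)
ar1[0] = rng.standard_normal()
for t in range(1, n):
    ar1[t] = 0.9 * ar1[t - 1] + rng.standard_normal()

ess_iid = effective_sample_size(iid)
ess_ar1 = effective_sample_size(ar1)
```

For independent draws the estimate should be close to the chain length, while a strongly autocorrelated chain yields a much smaller value.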
We compare the credible intervals produced by DC-BATS with those obtained by running MCMC on the full dataset, as well as those obtained using the double parallel Monte Carlo (DPMC) algorithm of Wang and Srivastava, (2023), Laplace’s approximation (Kass et al.,, 1991), and automatic differentiation variational inference (ADVI; Kucukelbir et al.,, 2017) in Figure 4. We observe that DC-BATS produces more accurate estimates of the credible intervals than those using the DPMC algorithm, Laplace’s approximation, or ADVI.
![Refer to caption](x4.png)
In addition to a single simulation example, we also include a simulation example under multiple sets of to represent different degrees of mixing in the time series, and different numbers of machines . In this simulation, we fix , , , , . We let , where controls the mixing of . We vary and . The process is stationary when , and a unit root exists when . The results are presented in Table 2. We observe that, as expected, the empirical posterior produced by the DC-BATS method is closer in Wasserstein-2 distance to the empirical posterior based on the full dataset than those produced by the other methods.
DC-BATS | |||||||||
---|---|---|---|---|---|---|---|---|---|
DPMC | |||||||||
Laplace’s approximation | |||||||||
ADVI |
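The Wasserstein-2 comparisons between empirical posteriors can be illustrated for a single scalar parameter. In one dimension, the Wasserstein-2 distance equals the L2 distance between quantile functions, so it can be approximated on a grid of probability levels. The function name `wasserstein2_1d` and the grid size are assumptions made for this sketch; it is not necessarily how the distances in the table were computed.

```python
import numpy as np

def wasserstein2_1d(a, b, n_grid=1000):
    """Approximate W2 between two 1-D samples via their empirical quantiles."""
    # Midpoint grid of probability levels in (0, 1).
    u = (np.arange(n_grid) + 0.5) / n_grid
    qa = np.quantile(a, u)
    qb = np.quantile(b, u)
    return float(np.sqrt(np.mean((qa - qb) ** 2)))

rng = np.random.default_rng(2)
full_draws = rng.standard_normal(5000)   # stand-in for full-posterior draws
shifted_draws = full_draws + 2.0         # stand-in for an approximation

w_same = wasserstein2_1d(full_draws, full_draws)
w_shift = wasserstein2_1d(full_draws, shifted_draws)
```

A pure location shift moves every quantile by the same amount, so the distance in that case equals the size of the shift.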
4.3 Hidden Markov models
We consider inference for continuous state-space hidden Markov models (HMMs; Rabiner and Juang,, 1986) in this section. An HMM is a process , where is an unobserved Markov chain, and each observation is conditionally independent of the rest of the process given . We consider a linear Gaussian model having the form:
(8) |
Closely related models have numerous applications ranging from guidance and navigation to robotics (Musoff and Zarchan,, 2009).
We consider a two-dimensional latent Markov chain (that is, ) and a two-dimensional observation space (that is, ). In particular, we generate observations from model (8) with true parameter values
with . We plot the observed process in Figure 5, where we see that it does not appear to be stationary. We nonetheless test DC-BATS on this model.
![Refer to caption](x5.png)
We fix at its true value and consider inference for , , and . We place independent priors on each component of the matrix . We also place independent priors on and where denotes a log-normal distribution. We choose subsequences. We write code in Python and collect posterior samples using an adaptive random walk Metropolis-Hastings algorithm (Haario et al.,, 2001) from each subsequence posterior, as well as from the full posterior. It took around 48 minutes for the MCMC algorithm to sample from the full posterior, and 11 minutes on average for each subsequence posterior. We display 95% credible intervals for DC-BATS and MCMC on the full dataset in Table 3, where we observe that the credible intervals provided by DC-BATS are extremely accurate.
DC-BATS | ||
---|---|---|
MCMC on full posterior |
DC-BATS | ||||
---|---|---|---|---|
MCMC on full posterior |
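The adaptive random walk Metropolis-Hastings sampler used in this section can be sketched as follows. This is a minimal illustration in the spirit of Haario et al., (2001): after an initial period of length `t0`, the Gaussian proposal covariance is set to a scaled version of the empirical covariance of the chain so far. The function name, the default values of `t0` and `eps`, and the toy Gaussian target are assumptions for illustration. For clarity the covariance is recomputed at every iteration rather than updated recursively.

```python
import numpy as np

def adaptive_metropolis(log_post, theta0, n_iter, rng, t0=200, eps=1e-6):
    """Adaptive random walk Metropolis sketch; requires t0 >= 2."""
    theta = np.array(theta0, dtype=float)
    d = theta.size
    s_d = 2.4 ** 2 / d                 # scaling suggested for Gaussian targets
    chain = np.empty((n_iter, d))
    lp = log_post(theta)
    cov = np.eye(d)                    # fixed proposal covariance before adaptation
    n_accept = 0
    for t in range(n_iter):
        if t >= t0:
            # Empirical covariance of past states, regularised to stay positive definite.
            cov = s_d * (np.cov(chain[:t].T) + eps * np.eye(d))
        proposal = rng.multivariate_normal(theta, cov)
        lp_prop = log_post(proposal)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
            n_accept += 1
        chain[t] = theta
    return chain, n_accept / n_iter

def log_std_normal(theta):
    """Toy target: standard bivariate Gaussian log-density up to a constant."""
    return -0.5 * float(theta @ theta)

rng = np.random.default_rng(3)
chain, acc_rate = adaptive_metropolis(log_std_normal, np.zeros(2), 4000, rng)
post_mean = chain[1000:].mean(axis=0)
post_sd = chain[1000:].std(axis=0)
```

In a divide-and-conquer setting, a sampler of this kind would be run independently on each subsequence posterior, with `log_post` replaced by the corresponding log-density.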
4.4 Binary auto-regressive model
We consider a model with binary observations modelled as
(9) |
where , denotes covariates at time , and denote coefficients of covariates. In this example, we choose and . We set and . We generate synthetic observations. We choose the covariates to be non-stationary. We plot the observations and the corresponding success probabilities in Figure 6.
![Refer to caption](x6.png)
We consider independent priors on and on each component of and . We draw samples from the posterior using DC-BATS as well as MCMC on the full dataset. We use NUTS and discard half of the samples as burn-in in each case. It took around twelve minutes to sample from the full posterior, and a little over a minute on average to sample from each subsequence posterior. We compare the credible intervals produced by DC-BATS, full data MCMC, the double parallel Monte Carlo (DPMC) algorithm of Wang and Srivastava, (2023), Laplace’s approximation and ADVI in Figure 7. The credible intervals from DC-BATS are virtually indistinguishable from those for full data MCMC, while the intervals for DPMC deviate for certain parameters. We also present the boxplots of effective sample size (ESS) for each parameter in Figure 8, where we record the distribution of the effective sample sizes across different machines. We observe that the ESS is satisfactory for each parameter. The median effective sample size is over of the total sample size for every parameter. In comparison, we recorded effective samples out of the total sample size for the full MCMC method. We again conclude that the full MCMC method and the divide-and-conquer MCMC are not significantly different regarding effective sample size.
![Refer to caption](x7.png)
![Refer to caption](x8.png)
In addition to a single simulation example, we also include a simulation example under multiple sets of to represent different degrees of mixing in the time series, and different numbers of machines . In this simulation, we fix , , . We let with ; the parameter regulates the level of autocorrelation of the model. We use the Wasserstein-2 distance between the obtained posterior distribution and the full posterior distribution as a metric to compare the performance. The results are presented in Table 4. We observe slightly better performance for DPMC and Laplace’s approximation than for DC-BATS, and significantly better performance for DC-BATS than for ADVI.
DC-BATS | |||||||||
---|---|---|---|---|---|---|---|---|---|
DPMC | |||||||||
Laplace’s approximation | |||||||||
ADVI |
5 Application to Los Angeles particulate matter data
It is well understood that aerosol particulates have a significant impact on human health, and hence understanding the dynamics of particulate matter (PM) is important in public health decision making. Modern sampling technologies have made high-resolution air monitoring possible. This makes the data produced by such monitors massive, and hence challenging to analyze with Bayesian methods. To tackle these computational challenges, we apply DC-BATS to analyze a Los Angeles air quality dataset obtained from the Environmental Protection Agency (EPA). The dataset is available online at https://www.epa.gov/outdoor-air-quality-data.
This dataset consists of hourly measurements of particulates including (1% missingness) and (3.5% missingness) in Los Angeles during 2017. This dataset has some clearly invalid measurements, which we treat as missing data. We apply Kalman smoothing imputation as suggested in Hyndman and Khandakar, (2008) to simplify handling of the missing data. After imputation, we transform both PM observations by . Our overarching goal is to build an interpretable model that can capture the dynamics of these particulates.
![Refer to caption](x9.png)
We plot the values of the particulates over time after preprocessing in Figure 9. It is clear that the variance of the observations changes over time. In order to capture the evolution of variance within a series and correlation across series, we consider a bivariate GARCH model with constant conditional correlation (Bollerslev,, 1990) as follows:
(10) |
for , where is the (, ) levels at time . We assume is the sum of a time-independent mean and a time-dependent innovation whose covariance matrix evolves with time . We assume that each variance follows a univariate GARCH process, regressed on an intercept term , a lag- innovation , and the lag- variance through coefficients for . We also assume the correlation between particulates is time-independent, which is captured by .
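The constant conditional correlation structure can be illustrated with a short simulation. The sketch below assumes lag-one GARCH recursions for each variance and Gaussian innovations, and builds the conditional covariance as D_t R D_t, where D_t is the diagonal matrix of conditional standard deviations and R is the fixed correlation matrix. All names and parameter values are hypothetical and are not the fitted values for the particulate data.

```python
import numpy as np

def conditional_covariance(sigma2_t, rho):
    """Covariance D R D for two series with variances sigma2_t and correlation rho."""
    d = np.sqrt(sigma2_t)
    corr = np.array([[1.0, rho], [rho, 1.0]])
    return np.outer(d, d) * corr

def simulate_ccc_garch(n, mu, omega, a, b, rho, rng):
    """Simulate a bivariate CCC-GARCH(1,1) series (assumed lag-one recursions)."""
    sigma2 = np.empty((n, 2))
    eps = np.empty((n, 2))
    sigma2[0] = omega / (1.0 - a - b)      # unconditional variances
    for t in range(n):
        if t > 0:
            sigma2[t] = omega + a * eps[t - 1] ** 2 + b * sigma2[t - 1]
        cov_t = conditional_covariance(sigma2[t], rho)
        # Draw the innovation from N(0, cov_t) using a Cholesky factor.
        eps[t] = np.linalg.cholesky(cov_t) @ rng.standard_normal(2)
    return mu + eps, sigma2

rng = np.random.default_rng(4)
mu = np.array([2.0, 3.0])
y, sigma2 = simulate_ccc_garch(
    5000, mu,
    omega=np.array([0.1, 0.2]),
    a=np.array([0.10, 0.15]),
    b=np.array([0.80, 0.70]),
    rho=0.6, rng=rng,
)

# Standardised residuals should have correlation close to rho at every time.
z = (y - mu) / np.sqrt(sigma2)
empirical_rho = np.corrcoef(z[:, 0], z[:, 1])[0, 1]
```

Because the correlation matrix is fixed, only the two variance recursions evolve with time, which keeps the number of parameters small.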
We adopt a diffuse prior distribution for every , respectively, and prior distribution for every . Since it is well known a priori that these particulates are positively correlated, we adopt a prior distribution for . We draw samples from the posterior , where and using DC-BATS with subsequences as well as MCMC on the full dataset. We use NUTS and discard half of the samples as burn-in in each case. It took around minutes to sample from the full posterior, and minutes on average to sample from each subsequence posterior. We compare the credible intervals produced by DC-BATS and full data MCMC in Table 5, where it is evident that the credible intervals provided by both methods are well aligned with each other.
DC-BATS | ||
---|---|---|
MCMC on full posterior |
DC-BATS | ||
---|---|---|
MCMC on full posterior |
DC-BATS | ||
---|---|---|
MCMC on full posterior |
DC-BATS | |||
---|---|---|---|
MCMC on full posterior |
6 Discussion
We have proposed a simple divide-and-conquer approach for Bayesian inference from stationary time series. There are several natural follow-up directions. In our theoretical development, we have assumed that the time series is stationary and mixes fast; it would be interesting to relax these assumptions and develop scalable posterior inference algorithms for non-stationary time series, as well as for series with long-range dependence. Although our current algorithm has promising empirical results in certain simulation experiments with non-stationarity, we expect long-range dependence to present a more challenging problem.
We have not considered the problem of defining a ‘best’ choice of the number and lengths of the subsequences, and have instead focused on experiments testing our algorithm in challenging cases in which the subset sizes are modest and/or assumptions of our theory are violated. In practice, for truly massive datasets, one should ideally run MCMC in parallel for the different subsequences; the best choice of and depends on a tradeoff between statistical accuracy and one’s computational budget in terms of wall clock time, number of nodes in a distributed computing network, and the capacity of each node. As a rule of thumb, approximation accuracy should improve with subsequence length as long as one’s computational budget allows sufficient MCMC draws per subsequence posterior. Our simulation results are promising in suggesting that, at least in certain cases, accuracy is high even with short subsequences. However, this depends on the model and data.
Two additional important future directions include (1) modifying the simple divide-and-conquer algorithm we are proposing to allow communication between nodes; and (2) modifying the algorithm and/or theory to allow guarantees for fixed finite subsequence sizes. There has been work on both threads outside of the time series setting; for example, refer to Dai et al., (2023).
Acknowledgement
DS and DD acknowledge support from National Science Foundation grant 1546130. DS acknowledges support from grant DMS-1638521 from SAMSI. The authors thank Cheng Li for his insightful comments that led to improvements in this paper.
Appendix A Additional assumptions
We make the following additional assumptions to prove the theoretical results. These are similar to assumptions made by Li et al., (2017).
Assumption 4 (Support).
For all and all , all possible conditional distributions have the same support as the stationary distribution of .
Many classes of time series models satisfy Assumption 4, including the ones that are considered in this paper. However, exceptions to this assumption include time-varying time series models where the support of the conditional densities changes with respect to time .
Assumption 5 (Envelope).
This consists of three parts.
1. is three times differentiable with respect to in a neighbourhood of for some constant .
2. There exist functions , , such that
for for all .
3. , where is the same as that in Assumption 2.
Assumptions 2 and 5 together imply a trade-off between moments and the mixing rate of the process . A higher value of leads to greater restriction on the moments on one hand; on the other hand, a slower decay rate of is required for to hold, thus leading to less restriction on the mixing rate of the process .
Assumption 5 can be verified by checking whether the log-likelihood function of depends only on a finite number of past observations, that is, whether there exists an integer such that . In this case, it suffices to find an envelope function such that , where is under the stationary distribution. By the Ergodic Theorem of Von Neumann (Neumann,, 1932), we have
Assumption 6.
The interchange of order of integration with respect to at is justified. The score function is a martingale at for . Moreover,
where is a positive definite matrix. Further, for all sufficiently large , is positive definite with eigenvalues bounded below and above by constants for all and all values of .
Assumption 7.
For any , there exists an such that
We assume that the prior has finite second moment; this is a fairly relaxed assumption and is required as we use the distance to combine the subsequence posteriors.
Assumption 8 (Prior).
The prior density is continuous at . Moreover, . The second moment of the prior exists, that is, .
Assumption 9 (Uniform integrability).
Let , where is the expectation with respect to under the posterior Then there exists an integer such that is uniformly integrable under In other words,
where is the indicator function.
Assumptions 4, 5, 6 and 9 are generalizations of Assumptions 2, 3, 4, 7 of Li et al., (2017), respectively. These can be verified more straightforwardly if there exists a finite integer such that . In this case, is also an ergodic sequence. To verify Assumption 5, by the -ergodic theorem, it suffices to find a such that , where the expectation is with respect to the stationary distribution of . Assumption 6 is a generalization of a common regularity condition to dependent processes. Again, in view of Assumption 5 that bounds the moment of the second order derivative of the log density function, the ergodic theorem will hold automatically to guarantee that
Assumption 7 will also hold if for any , there exists an such that for all sufficiently large , where denotes the Kullback-Leibler divergence. Assumption 9 mirrors Assumption 7 in Li et al., (2017).
References
- Agueh and Carlier, (2011) Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924.
- Aicher et al., (2019) Aicher, C., Ma, Y.-A., Foti, N. J., and Fox, E. B. (2019). Stochastic gradient MCMC for state space models. SIAM Journal on Mathematics of Data Science, 1(3):555–587.
- Beal, (2003) Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, UCL (University College London).
- Bickel and Freedman, (1981) Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196–1217.
- Bickel et al., (1998) Bickel, P. J., Ritov, Y., and Ryden, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. The Annals of Statistics, 26(4):1614–1635.
- Bierkens et al., (2019) Bierkens, J., Fearnhead, P., and Roberts, G. (2019). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320.
- Bini and Capovani, (1983) Bini, D. and Capovani, M. (1983). Spectral and computational properties of band symmetric Toeplitz matrices. Linear Algebra and its Applications, 52:99–126.
- Blei et al., (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
- Bollerslev, (1986) Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327.
- Bollerslev, (1990) Bollerslev, T. (1990). Modelling the coherence in short-run nominal exchange rates: a multivariate generalized ARCH model. The Review of Economics and Statistics, 72(3):498–505.
- Bouchard-Côté et al., (2018) Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113(522):855–867.
- Carpenter et al., (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.
- Chen et al., (2014) Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691. PMLR.
- Cinlar, (2011) Cinlar, E. (2011). Probability and Stochastics. Springer.
- Cuturi and Doucet, (2014) Cuturi, M. and Doucet, A. (2014). Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693. PMLR.
- Dai et al., (2019) Dai, H., Pollock, M., and Roberts, G. (2019). Monte Carlo fusion. Journal of Applied Probability, 56(1):174–191.
- Dai et al., (2023) Dai, H., Pollock, M., and Roberts, G. O. (2023). Bayesian fusion: scalable unification of distributed statistical analyses. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(1):84–107.
- Del Moral et al., (2006) Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436.
- Dharmadhikari et al., (1968) Dharmadhikari, S., Fabian, V., and Jogdeo, K. (1968). Bounds on the moments of martingales. The Annals of Mathematical Statistics, 39(5):1719–1723.
- Duane et al., (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.
- Dvurechenskii et al., (2018) Dvurechenskii, P., Dvinskikh, D., Gasnikov, A., Uribe, C., and Nedich, A. (2018). Decentralize and randomize: Faster algorithm for Wasserstein barycenters. Advances in Neural Information Processing Systems, 31.
- Foti et al., (2014) Foti, N., Xu, J., Laird, D., and Fox, E. (2014). Stochastic variational inference for hidden Markov models. In Advances in Neural Information Processing Systems, pages 3599–3607.
- Guhaniyogi et al., (2017) Guhaniyogi, R., Li, C., Savitsky, T. D., and Srivastava, S. (2017). A divide-and-conquer Bayesian approach to large-scale kriging. arXiv preprint arXiv:1712.09767.
- Haario et al., (2001) Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242.
- Hastings, (1970) Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
- Hoffman and Gelman, (2014) Hoffman, M. D. and Gelman, A. (2014). The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
- Hyndman and Khandakar, (2008) Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3):1–22.
- Johndrow et al., (2020) Johndrow, J. E., Pillai, N. S., and Smith, A. (2020). No free lunch for approximate MCMC. arXiv preprint arXiv:2010.12514.
- Johnson and Willsky, (2014) Johnson, M. and Willsky, A. (2014). Stochastic variational inference for Bayesian time series models. In International Conference on Machine Learning, pages 1854–1862. PMLR.
- Kass et al., (1991) Kass, R. E., Tierney, L., and Kadane, J. B. (1991). Laplace’s method in Bayesian analysis. Contemporary Mathematics, 115:89–99.
- Kucukelbir et al., (2017) Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research.
- Lauritzen, (1992) Lauritzen, S. L. (1992). Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108.
- Li et al., (2017) Li, C., Srivastava, S., and Dunson, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika, 104(3):665–680.
- Ma et al., (2017a) Ma, J., Xu, F., Huang, K., and Huang, R. (2017a). GNAR-GARCH model and its application in feature extraction for rolling bearing fault diagnosis. Mechanical Systems and Signal Processing, 93:175–203.
- Ma et al., (2015) Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925.
- Ma et al., (2017b) Ma, Y.-A., Foti, N. J., and Fox, E. B. (2017b). Stochastic gradient MCMC methods for hidden Markov models. In International Conference on Machine Learning, pages 2265–2274. PMLR.
- Metropolis et al., (1953) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.
- Minsker et al., (2014) Minsker, S., Srivastava, S., Lin, L., and Dunson, D. (2014). Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning, pages 1656–1664. PMLR.
- Musoff and Zarchan, (2009) Musoff, H. and Zarchan, P. (2009). Fundamentals of Kalman filtering: a practical approach. American Institute of Aeronautics and Astronautics.
- Neiswanger et al., (2014) Neiswanger, W., Wang, C., and Xing, E. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the 30th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 623–632.
- Nemeth and Fearnhead, (2020) Nemeth, C. and Fearnhead, P. (2020). Stochastic gradient Markov chain Monte Carlo. Journal of the American Statistical Association, 116(533):1–18.
- Neumann, (1932) Neumann, J. v. (1932). Proof of the quasi-ergodic hypothesis. Proceedings of the National Academy of Sciences, 18(1):70–82.
- Nkalu and Edeme, (2019) Nkalu, C. N. and Edeme, R. K. (2019). Environmental hazards and life expectancy in Africa: evidence from GARCH model. SAGE Open, 9(1):215–222.
- Quiroz et al., (2018) Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2018). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114(526):831–843.
- Rabiner and Juang, (1986) Rabiner, L. and Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16.
- Rio, (1993) Rio, E. (1993). Covariance inequalities for strongly mixing processes. Annales de l’IHP Probabilités et statistiques, 29(4):587–597.
- Roberts and Tweedie, (1996) Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.
- Salomone et al., (2020) Salomone, R., Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2020). Spectral subsampling MCMC for stationary time series. In International Conference on Machine Learning, pages 8449–8458. PMLR.
- Scott et al., (2016) Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88.
- Srivastava et al., (2015) Srivastava, S., Cevher, V., Dinh, Q., and Dunson, D. (2015). WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920. PMLR.
- Srivastava et al., (2018) Srivastava, S., Li, C., and Dunson, D. B. (2018). Scalable Bayes via barycenter in Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346.
- Szabó and Van Zanten, (2019) Szabó, B. and Van Zanten, H. (2019). An asymptotic analysis of distributed nonparametric methods. Journal of Machine Learning Research, 20(87):1–30.
- Villani, (2009) Villani, C. (2009). Optimal Transport: Old and New, volume 338. Springer.
- Villani et al., (2022) Villani, M., Quiroz, M., Kohn, R., and Salomone, R. (2022). Spectral subsampling MCMC for stationary multivariate time series with applications to vector ARTFIMA processes. Econometrics and Statistics.
- Wang and Srivastava, (2023) Wang, C. and Srivastava, S. (2023). Divide-and-conquer Bayesian inference in hidden Markov models. Electronic Journal of Statistics, 17(1):895–947.
- Wang and Dunson, (2013) Wang, X. and Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
- Wang et al., (2015) Wang, X., Guo, F., Heller, K. A., and Dunson, D. B. (2015). Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pages 451–459.
- Welling and Teh, (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688. PMLR.
- Whittle, (1951) Whittle, P. (1951). Hypothesis testing in time series analysis. Almquist and Wiksell, Uppsala.
- Xue and Liang, (2019) Xue, J. and Liang, F. (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Statistics and Computing, 29(1):23–32.
Supplementary Material
Appendix S1 Main proofs
Simplifying notation: we use to denote the observations within the th subsequence. We also use and to denote and , respectively; this notation will be handy as we shall consider and at different values of .
Lemma S1.
Suppose that Assumptions 1–9 hold. Then the following are true.
1. There exists a weakly consistent estimator that is measurable with respect to solving the score equation . Moreover, this estimator is consistent, that is, as .
2. Let be a weakly consistent estimator of based on the complete dataset solving the score equation . Then as .
3. Let be a local parameter for the th subsequence, and be a local parameter for the complete dataset. Let be the th subsequence posterior induced by and be the posterior of induced by the overall posterior . Then
where denotes the total variation of second moment distance.
Lemma S2 (Lemma 3 of Li et al.,, 2017).
Let and . Then
Lemma S1 mirrors Lemma 2 of Li et al., (2017), and its proof can be straightforwardly modified from their proof. We therefore focus our attention on the proof of Lemma 1, which is novel.
S1.1 Proof of Lemma 1
Proof.
We use the first order Taylor expansion of ,
where lies between and , and lies between and . Therefore,
(S1) | ||||
After rearranging, we obtain the difference between and as
(S2) |
where
and
The second term on the right hand side of equation S2 is . The convergence of to in distribution is established by the martingale central limit theorem. Therefore, is . Moreover, we have and thus by the continuous mapping theorem. Hence .
We show that the third term on the right hand side of equation S2 is . Define
and thus . Therefore we write
where
by the fact that we can write
Assumption 3 bounds the error between and
, and thus the error between and .
We have that
by Markov’s inequality, since for any ,
where the second inequality is by Assumption 3. Moreover, by Assumption 6, . Therefore . Furthermore, if , then .
The convergence order of the first term on the right hand side of equation S2 is established by Markov’s inequality. We define and prove later that as . Then
(S3) |
The convergence in equation S3 is established by that we will prove below. Further, when is an unbiased estimator for , then because
recalling equation S1. We will prove that in the case of unbiased , expression (S2) converges in the order of . By Markov’s inequality, for every ,
(S4) |
where we write , and denotes covariance under . The last inequality of equation S4 is by and . To prove that the last term of equation S4 converges to zero as , it suffices to prove that for every ,
(S5) |
The stationarity of the process (Assumption 1) allows us to denote by .
We employ the following α-mixing inequality of Rio, (1993):
where is the α-mixing coefficient as stated in Assumption 2 and is the quantile function of . Further, by Markov’s inequality, , where is a positive real number as stated in Assumption 2, and so we have . Equation (S5) then follows as
as , where the second inequality is by Assumption 1 and is a constant independent of . The final inequality is by Assumption 2 and , which we will prove below.
Proof of as for every :
We first show that , and thus also converges to zero by Jensen’s inequality. Since and , is . The remaining part of the proof is to show that is dominated by some under . The dominated convergence theorem will then imply that converges to zero.
By Assumption 6, let be the lower bound of for all and sufficiently large . We have
(S6) |
Moreover, Assumption 5 implies that
(S7) |
It follows from equations (S6) and (S7) that for all sufficiently large ,
where and are constants that are independent of . Now define
We will show that and then apply the dominated convergence theorem to . We first apply the Cauchy-Schwarz inequality to and obtain
(S8) |
The first term of equation S8 is bounded by Assumption 5 as
To bound the second term of equation S8, recall that is a martingale at for by Assumption 6. We denote the th component of as , so that . Further, by Assumption 6, for every integer and integer . Then
where is a constant independent of . The third inequality is by Jensen’s inequality, the fourth inequality is by Dharmadhikari et al., (1968), and the fifth inequality is by Assumption 5.
Therefore, we find a that dominates with . Moreover, and thus by the dominated convergence theorem, . ∎
Appendix S2 Other proofs
The proof of Theorem 1 is similar to that of Theorem 1 of Li et al., (2017), but is included for concreteness. The proof of Theorem 2 is the same as the proof of Theorem 2 of Li et al., (2017), and is mainly based on Theorem 1 and does not involve features of dependent data. The proof of Lemma S1 is along the lines of the proof of Lemma 2 of Li et al., (2017), but is included for completeness.
For any two probability measures , the total variation of second moment distance is defined as . By Villani, (2009), .
Proof of Theorem 1.
By the triangle inequality,
(S9) |
We show that the first term of the right hand side of equation S9 is , the second term is in general and is when the MLE is unbiased and is at least , and the third term is . Therefore, the distance between and is in general and when the MLE is unbiased and is at least .
We first estimate the order of the first term in equation S9. For any ,
where the first inequality is by Lemma S2, the third inequality is by Jensen’s inequality, and the fourth inequality is by stationarity. The convergence to zero is by Lemma S1.
For the second term, note that the Wasserstein-2 distance between two -dimensional Gaussians and is given by . Therefore, Lemma 1 yields that in general and
when the MLE is unbiased and is at least .
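The Gaussian case used in this step can be checked numerically. The sketch below relies on the standard closed form for the squared Wasserstein-2 distance between N(m1, S1) and N(m2, S2), namely ||m1 - m2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2}). This expression is restated here as an assumption about the form intended in the proof, and the function names are illustrative.

```python
import numpy as np

def sqrtm_psd(a):
    """Symmetric square root of a positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)          # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def gaussian_w2(m1, s1, m2, s2):
    """Wasserstein-2 distance between N(m1, s1) and N(m2, s2)."""
    r = sqrtm_psd(s2)
    inner = r @ s1 @ r
    inner = 0.5 * (inner + inner.T)    # symmetrise against round-off
    cross = sqrtm_psd(inner)
    w2_sq = np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross)
    return float(np.sqrt(max(w2_sq, 0.0)))

m = np.zeros(2)
s = np.array([[2.0, 0.5], [0.5, 1.0]])

d_same = gaussian_w2(m, s, m, s)                           # identical Gaussians
d_shift = gaussian_w2(m, s, m + np.array([3.0, 4.0]), s)   # pure mean shift
d_scale = gaussian_w2(m, np.eye(2), m, 4.0 * np.eye(2))    # pure scale change
```

When the covariances coincide, the distance reduces to the Euclidean distance between the means, which is why a bound on the difference between centring points translates directly into a bound on the Wasserstein-2 distance.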
The third term of equation S9 is a special case of the first term when and , and is hence .
This proves Theorem 1.
∎
Proof of Lemma S1.
We first show the existence of a weakly consistent estimator of such that and in -probability. By Assumption 5, is differentiable in a neighbourhood . By Assumption 6, . With -probability arbitrarily close to , there exists a root that solves in . We denote such a root by . By Assumption 7, in -probability. It suffices to show , as the equation is just a special case when and . We first show that .
Define and . The subsequence posterior for can be written as . The posterior of induced by the posterior above is
Define
We define . If we can show that , then , and thus
The second term in the previous equation converges to zero in -probability.
We now show that .
To achieve this, we divide into three regions: , , and ,
where and are constants that will be chosen later. Since , to prove , it suffices to prove that for .
1. Proof of : we have
(S10) |
The second term on the right hand side of equation S10 converges to zero as . We use Assumption 5 to prove that the first term of equation S10 also converges to zero as . For some , with probability approaching one and all sufficiently large , we have . Moreover, with -probability approaching 1, the consistency of guarantees that where is some positive constant. Therefore, as
where the convergence is by Assumption 8 that bounds the second moment of the prior.
2. Proof of : note that
(S11)
The second integral on the right hand side of equation S11 is
which converges to zero by choosing large enough. The following step bounds the first term on the right hand side of equation S11.
We use Taylor expansion: . Because , , where ,
where
is a three-dimensional array with as its th element.
Moreover, is a -dimensional vector between and . By Assumption 5,
Also by Assumption 5, we have . Therefore, for all sufficiently large , by Markov’s inequality, is . If is chosen to be small enough, with probability approaching one in since Assumption 6 bounds the eigenvalues of for from below. Hence, with probability approaching one, for all sufficiently large and , by . Moreover, given that is consistent for and is continuous at , we have . Hence, to bound the first term on the right hand side of equation S11, we have
(S12)
The right hand side of equation S12 is arbitrarily close to zero if is chosen to be large enough. Both terms on the right hand side of equation S11 thus converge to zero, and thus .
3. Proof of : note that
(S13)
(S14)
To show , it suffices to show that terms (S13) and (S14) both converge to zero as . For all , . Therefore, for all sufficiently large , with -probability approaching one,
Moreover, with probability approaching one, for all , . By the dominated convergence theorem, (S13) converges to zero with -probability approaching one. Next, we prove that (S14) also converges to zero. For all ,
Hence, for all , . Moreover, with -probability approaching one. By the dominated convergence theorem, term (S14) also converges to zero with -probability approaching one.
We have thus proved that . The final step of the proof is to show that this can be strengthened to convergence, that is, as in -probability. We have
(S15)
The third term of (S15) is a finite constant and the second term is as defined in Assumption 9. By Assumption 9, we have that is uniformly integrable in and thus is also uniformly integrable. ∎
Appendix S3 Verification of Assumption 9
We consider the following AR(2) normal linear model based on independent and identically distributed observations:
where . We write , , and the true parameter is . We assume that the roots of the characteristic polynomial of lie outside of the unit circle of the complex plane, so that is a stationary process. We impose the following conjugate prior on the parameter :
Let . Let the eigenvalues of and be lower bounded by and upper bounded by . Let . The subset posterior distributions of and are given by
By the Yule-Walker equations, for every , we have the following recursive relationship for the entries of :
This is a second-order linear recurrence equation. We have where , since we require the roots of the characteristic equation to lie outside the unit circle. Therefore, is a Toeplitz matrix whose entries decay at a geometric rate as increases.
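The geometric decay can be checked numerically. The sketch below iterates the standard Yule-Walker recursion for the autocorrelations of a stationary AR(2) process, rho(h) = phi1 * rho(h-1) + phi2 * rho(h-2) with rho(0) = 1 and rho(1) = phi1 / (1 - phi2). The names `ar2_autocorrelation`, `ar2_decay_rate`, `phi1` and `phi2` are illustrative assumptions and need not match the paper's notation.

```python
import numpy as np


def ar2_autocorrelation(phi1, phi2, max_lag):
    """Autocorrelations rho(0), ..., rho(max_lag) of a stationary AR(2) process."""
    rho = np.empty(max_lag + 1)
    rho[0] = 1.0
    if max_lag >= 1:
        rho[1] = phi1 / (1.0 - phi2)  # Yule-Walker equation at lag 1
    for h in range(2, max_lag + 1):
        rho[h] = phi1 * rho[h - 1] + phi2 * rho[h - 2]
    return rho


def ar2_decay_rate(phi1, phi2):
    """Largest modulus among the roots of z^2 - phi1 * z - phi2.

    These are the reciprocals of the roots of the AR polynomial
    1 - phi1 * z - phi2 * z^2, so stationarity corresponds to a value below 1.
    """
    return float(np.max(np.abs(np.roots([1.0, -phi1, -phi2]))))
```

For example, `ar2_decay_rate(0.5, 0.3)` is below one, and the values returned by `ar2_autocorrelation(0.5, 0.3, 50)` shrink towards zero at roughly that rate per lag.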
We seek to bound the smallest eigenvalue of
To this end, we write the matrix as a sum of a -banded Toeplitz matrix and a perturbation matrix , where will be chosen later to derive the lower bound for the smallest eigenvalue of , , where
and
We will lower-bound the smallest eigenvalue of and upper-bound the largest eigenvalue . To lower-bound the smallest eigenvalue of , we use Proposition 4.5 of Bini and Capovani (1983). When , where is an integer, the eigenvalue of is given by , where , and are constants independent of . We will choose . We have
(S16)
for every .
The largest eigenvalue of is upper bounded by the Frobenius norm of , that is, . Hence, we can upper bound on and hence provide an upper bound for the largest eigenvalue. When , we have
(S17)
Combining equations (S16) and (S17), when is sufficiently large and , the smallest eigenvalue of is lower bounded by a positive constant. Therefore, the largest eigenvalue of is upper bounded by a positive constant, which we denote by .
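The banded-plus-perturbation argument can be illustrated numerically. The sketch below assumes a generic symmetric Toeplitz matrix with geometrically decaying entries (not the specific matrix of this appendix), splits it into a banded part and a remainder, and evaluates the lower bound that follows from Weyl's inequality: the smallest eigenvalue of the full matrix is at least the smallest eigenvalue of the banded part minus the Frobenius norm of the remainder. All function names and the bandwidth argument are hypothetical.

```python
import numpy as np


def toeplitz_from_sequence(seq):
    """Symmetric Toeplitz matrix whose (i, j) entry is seq[|i - j|]."""
    seq = np.asarray(seq, dtype=float)
    lags = np.abs(np.subtract.outer(np.arange(seq.size), np.arange(seq.size)))
    return seq[lags]


def banded_split(matrix, bandwidth):
    """Split a matrix into its bandwidth-banded part and the remaining perturbation."""
    n = matrix.shape[0]
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    banded = np.where(lags <= bandwidth, matrix, 0.0)
    return banded, matrix - banded


def smallest_eigenvalue_bound(matrix, bandwidth):
    """Lower bound: lambda_min(banded) - ||perturbation||_F (Weyl's inequality)."""
    banded, perturbation = banded_split(matrix, bandwidth)
    lam_min_banded = np.linalg.eigvalsh(banded)[0]  # eigenvalues come sorted ascending
    return float(lam_min_banded - np.linalg.norm(perturbation, "fro"))
```

With entries `0.5 ** h` at lag `h`, dimension 30 and bandwidth 10, the bound is positive and sits just below the true smallest eigenvalue, because the discarded entries are geometrically small.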
We will show that . The MLE of is given by . It is clear that
where denotes the trace of a generic square matrix . The posterior variance of can be bounded as
Appendix S4 ACF and PACF of time series used in synthetic data examples
S4.1 Linear regression with auto-regressive errors (see main text)
We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the error terms generated using model (6) in Figure 10; recall that this is for observations. The ACF or PACF is highest when . The autocorrelation is mild for this simulation.
![Refer to caption](x10.png)
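For readers who want to reproduce this kind of diagnostic, the following is a minimal sketch of the usual sample autocorrelation function, computed with the biased (divide by n) autocovariance estimator. The function name `sample_acf` is an illustrative assumption, and the code is not the authors' plotting routine.

```python
import numpy as np


def sample_acf(series, max_lag):
    """Sample autocorrelations at lags 0, ..., max_lag (biased estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()  # centre the series
    n = x.size
    c0 = np.dot(x, x) / n  # lag-0 sample autocovariance
    return np.array(
        [np.dot(x[: n - h], x[h:]) / (n * c0) for h in range(max_lag + 1)]
    )
```

The value at lag 0 is always one, so a mildly autocorrelated series shows up as small values at all positive lags.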
S4.2 GARCH model (see main text)
We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the error terms and observations generated using model equation 7 in Figure 11; recall that this is for observations. The variance series exhibits a strong autocorrelation, but the observation series exhibits a negligible autocorrelation.
![Refer to caption](x11.png)
![Refer to caption](x12.png)
S4.3 Binary auto-regressive model (see main text)
We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the success probabilities generated using model equation 9 in Figure 12; recall that this is for observations. The series exhibits weak autocorrelation, as we observe an autocorrelation of less than at every lag.
![Refer to caption](x13.png)