Scalable Bayesian inference for time series via divide-and-conquer

Rihui Ou¹*, Deborshee Sen²*†, David Dunson¹

¹Department of Statistical Science, Duke University
²Google

*The two authors contributed equally to this paper.
†Corresponding author; [email protected].

Abstract

Bayesian computational algorithms tend to scale poorly as data size increases. This has motivated divide-and-conquer-based approaches for scalable inference. These divide the data into subsets, perform inference for each subset in parallel, and then combine these inferences. While appealing theoretical properties and practical performance have been demonstrated for independent observations, scalable inference for dependent data remains challenging. In this work, we study the problem of Bayesian inference from very long time series. The literature in this area focuses mainly on approximate approaches that usually lack rigorous theoretical guarantees and may provide arbitrarily poor accuracy in practice. We propose a simple and scalable divide-and-conquer method, and provide accuracy guarantees. Numerical simulations and real data applications demonstrate the effectiveness of our approach.

Keywords Dependent data; Dynamic models; Embarrassingly parallel; Markov chain Monte Carlo; Scalable Bayes; Wasserstein barycenter.

1 Introduction

Massive amounts of data are routinely collected in a variety of settings. This has necessitated the development of algorithms for statistical inference that scale well with data size. While variational (Beal, 2003; Blei et al., 2017) and sequential Monte Carlo (Del Moral et al., 2006) methods are popular, Markov chain Monte Carlo (MCMC) algorithms remain the default approach for most Bayesian statisticians. Unfortunately, the cost of standard MCMC scales at least linearly with the size of the data, which is inadequate for truly massive datasets. This has motivated the development of scalable versions of MCMC algorithms.

One approach is to use sub-sampling-based algorithms for scalable MCMC (Ma et al., 2015; Quiroz et al., 2018; Nemeth and Fearnhead, 2020). The main idea is to use random subsets of the data to approximate likelihoods and gradients at each MCMC iteration. This includes algorithms based on Langevin (Welling and Teh, 2011) and Hamiltonian (Chen et al., 2014) dynamics. Such algorithms were initially developed for independent and identically distributed observations, and have since been extended to hidden Markov models (HMMs) (Ma et al., 2017b; Aicher et al., 2019), and more recently to general stationary time series models (Salomone et al., 2020; Villani et al., 2022). These algorithms incorporate approximations to the transition kernel that can have an unclear impact on the stationary distribution of the Markov chain, and theoretical guarantees require verifying challenging conditions on a case-by-case basis.

A promising recent approach based on stochastic gradients for independent observations has been the use of non-reversible continuous-time Markov processes, in particular, piecewise-deterministic Markov processes (Bouchard-Côté et al., 2018; Bierkens et al., 2019). These are appealing because, unlike other stochastic gradient MCMC algorithms, they preserve the exact posterior distribution as the invariant distribution of the process. While promising, current algorithms require the construction of certain upper bounds for gradients, which limits their practical applicability. Moreover, recent results by Johndrow et al. (2020) have shown that sub-sampling-based posterior sampling algorithms have fundamental limitations on scalable performance in big data settings.

An alternative class of methods relies instead on divide-and-conquer-based approaches for scalable Bayesian inference. In this case, the entire data are divided into subsets, inference is carried out using MCMC for each subset in parallel, and finally the so-called subset posteriors are combined. A key focus in the literature has been on how to combine the subset posteriors. An early proposal was the consensus Monte Carlo algorithm of Scott et al. (2016), which is based on averaging draws from each subset posterior – a strategy that can be justified theoretically when the subset posteriors are approximately Gaussian. Subsequent work has exploited the product form of the joint posterior to obtain combining algorithms, including using kernel smoothing (Neiswanger et al., 2014), multi-scale histograms (Wang et al., 2015), or a Weierstrass approximation (Wang and Dunson, 2013). Alternatively, Minsker et al. (2014) proposed to combine via the geometric median of subset posteriors.

A recent thread has focused on using the Wasserstein barycenter (WB) of subset posteriors to combine them (Li et al., 2017; Srivastava et al., 2018). Each subset posterior is raised to an appropriate power to obtain a subset-based approximation to the full data posterior. Then one can ‘average’ these approximations to obtain a provably accurate approximation to the full data posterior. The Wasserstein barycenter provides an appropriate geometric notion of average for probability measures. In addition to impressive practical performance, it has also been shown that WB-based combining algorithms result in optimal posterior contraction rates and give the correct coverage probabilities of credible sets in certain settings (Szabó and Van Zanten, 2019).

The approaches described in the previous two paragraphs are termed ‘embarrassingly parallel’ procedures where communication between cores is limited to a single unification step. These algorithms introduce an additional source of error beyond the MCMC sampling error(s) by combining posterior samples from the subset posteriors in an approximate manner. This means that the samples produced from the posterior are approximate even if one is able to produce exact samples from the subset posteriors. In contrast, Dai et al. (2023) have developed an algorithm which, while not embarrassingly parallel, does not induce this additional source of error for essentially independent data (see also Dai et al., 2019, for an earlier related paper). In this paper, we focus on embarrassingly parallel approaches for simplicity.

The above literature on divide-and-conquer algorithms focuses on independent data settings. In this article, we are motivated by the considerable practical and theoretical success of the WB-based algorithms to study appropriate modifications for very long time series. To our knowledge, although Guhaniyogi et al. (2017) developed a related approach for massive spatial datasets, there has been no consideration of divide-and-conquer Bayesian algorithms in the time series setting. For very long time series, usual MCMC and sequential Monte Carlo algorithms are impractical to implement. There is a rich literature, mostly in machine learning, on alternatives – ranging from variational approximations (Johnson and Willsky, 2014; Foti et al., 2014) to assumed density filtering (Lauritzen, 1992). In general, these approaches lack any theoretical guarantees on accuracy in approximating the true posterior, and indeed it is well known that they can perform arbitrarily poorly in practice – for example, substantially under-estimating posterior uncertainty. When parallel computing resources are available, the divide-and-conquer approach can have major practical advantages over such stochastic-gradient-based methods.

In this article, we develop a simple, broadly applicable, and theoretically supported class of WB-based divide-and-conquer methods for massive time series – considering general models and not just HMMs. We call this divide-and-conquer for Bayesian time series (DC-BATS). Our approach is based on dividing the time series sequentially over time. One could alternatively devise a divide-and-conquer approach based on the Whittle likelihood (Whittle, 1951), which consists of transforming a stationary time series to its frequency domain; Salomone et al. (2020) and Villani et al. (2022) devise sub-sampling MCMC algorithms based on this. However, such an approach would be unable to handle things like state space models and missing data as naturally as the standard time domain approach, and we do not pursue this further here. Moreover, Guhaniyogi et al. (2017) considered a divide-and-conquer approach for large scale kriging. Two differences between their work and ours are that (a) they only consider dependence using Gaussian processes, whereas we consider arbitrary time series models, and (b) we provide error guarantees for the difference between the Wasserstein posterior and the true posterior (Theorem 1), which in turn implies error rates for the bias and variance of the Wasserstein posterior (Theorem 2), whereas they provide guarantees on the mean and bias (Theorem 3.1 in their paper), which in turn implies error rates for the $L_{2}$ risk.

In a work independent from ours, Wang and Srivastava (2023) propose a divide-and-conquer approach for finite state-space HMMs. Three differences between their work and ours are that (a) we use the Wasserstein barycenter to combine subset posteriors, whereas they use an extension of the double-parallel Monte Carlo algorithm of Xue and Liang (2019), (b) we consider generic time series models and not just finite state-space HMMs, and (c) they consider likelihoods which can be calculated exactly for finite state-space HMMs using the forward-backward filtering algorithm, whereas we are more flexible as described in Section 2.2.

The rest of the article is organized as follows. We introduce the proposed DC-BATS method in Section 2. Section 3 is devoted to a theoretical analysis of the method. In particular, we show that the proposed method returns asymptotically exact estimates of the posterior distribution. We demonstrate the proposed method on a variety of time series models with synthetic data in Section 4. We apply the proposed method to a Los Angeles particulate matter dataset in Section 5. Finally, Section 6 concludes the article. All proofs are contained in the Supplementary Material.

2 Divide-and-conquer for time series

2.1 Generic time series model

We are interested in observations indexed by time in this article. We follow the definition of Cinlar (2011) to define such a time-indexed stochastic process. We let $\mathbb{Z}$ be the set of integers. For each $t\in\mathbb{Z}$, let $X_{t}$ be a random variable taking values in $(\mathbb{X},\mathcal{X})$. The collection $\{X_{t}:t\in\mathbb{Z}\}$ is a stochastic process with state space $(\mathbb{X},\mathcal{X})$. For any two integers $t_{1}\leq t_{2}$, we use the notation $X_{t_{1}:t_{2}}$ to denote $(X_{t_{1}},\dots,X_{t_{2}})$. Of primary interest in this article will be a time series $X_{1:T}=(X_{1},\dots,X_{T})$; these will be our observations. We assume that the conditional distribution of $X_{t}\mid(X_{1:(t-1)},X_{s})$ is independent of $X_{s}$ for any $s>t$. This means that the dependence among the observations is temporal, running from past to future, and precludes things like spatial dependence where an observation $X_{t}$ depends on both “past” and “future” observations. Divide-and-conquer approaches for spatial data using Gaussian processes have been proposed by Guhaniyogi et al. (2017), and we do not focus on this here. We assume that the conditional distribution of $X_{t}\mid X_{1:(t-1)}$ is parameterised by $\theta=(\theta_{1},\dots,\theta_{d})\in\Theta\subseteq\mathbb{R}^{d}$ for every $t=2,\dots,T$, and the marginal distribution of $X_{1}$ is also parameterised by the same $\theta$.

We assume in the sequel that all measures we consider on $\Theta$ admit densities with respect to a reference measure corresponding to the Lebesgue measure on $\mathbb{R}^{d}$ . These will include the conditional distributions and the marginal distribution defined above, as well as posterior distributions to be defined below.

In general, the log-likelihood of the observations $X_{1:T}$ can be written as

\ell(\theta)=\sum_{t=1}^{T}\log p_{\theta}(X_{t}\mid X_{1:(t-1)}),

(1)

where $p_{\theta}(X_{t}\mid X_{1:(t-1)})$ denotes the likelihood of $X_{t}$ given the previous values $X_{1:(t-1)}$ ; when $t=1$ , $p_{\theta}(X_{t}\mid X_{1:(t-1)})=p_{\theta}(X_{1})$ is the marginal density of $X_{1}$ . The likelihood of any temporal sequence of observations can be written as equation 1, including independent observations. In this article, we are specifically interested in the situation where the observations are not independent. This includes a variety of commonly-used statistical models for time series analysis such as autoregressive moving average models and autoregressive conditional heteroskedasticity models. More generally, this includes hidden Markov models as well.
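To make equation 1 concrete, the following minimal sketch evaluates it for a stationary Gaussian AR(1) model, $X_{t}=\phi X_{t-1}+\varepsilon_{t}$ with $\varepsilon_{t}\sim N(0,\sigma^{2})$ and $|\phi|<1$. The choice of model and the function names `normal_logpdf` and `ar1_loglik` are illustrative assumptions and do not come from the paper.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_loglik(x, phi, sigma2):
    """Log-likelihood (equation 1) of a stationary Gaussian AR(1) series.

    Assumed model (illustrative only):
    X_t = phi * X_{t-1} + eps_t, eps_t ~ N(0, sigma2), |phi| < 1.
    """
    # t = 1: marginal (stationary) density of X_1.
    total = normal_logpdf(x[0], 0.0, sigma2 / (1.0 - phi ** 2))
    # t >= 2: conditional density of X_t given the past, which for AR(1)
    # depends on the past only through X_{t-1}.
    for prev, curr in zip(x, x[1:]):
        total += normal_logpdf(curr, phi * prev, sigma2)
    return total
```

Each call visits every observation once, so a single evaluation costs on the order of $T$ operations.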

Bayesian inference for $\theta$ involves placing a prior distribution $\Pi_{0}(\mathrm{d}\theta)$ on $\theta$ and computing the posterior distribution

\Pi_{T}(\mathrm{d}\theta\mid X_{1:T})\propto p_{\theta}(X_{1})\left\{\prod_{t=2}^{T}p_{\theta}(X_{t}\mid X_{1:(t-1)})\right\}\Pi_{0}(\mathrm{d}\theta).

(2)

We call this the full posterior as it is the posterior conditional on the entire dataset. Samples from the posterior (2) can be drawn using Markov chain Monte Carlo (MCMC) algorithms such as the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), including efficient gradient-based algorithms such as the Metropolis-adjusted Langevin algorithm (MALA; Roberts and Tweedie, 1996) and Hamiltonian Monte Carlo (HMC; Duane et al., 1987). Unfortunately, the log-likelihood (1) needs to be evaluated at every iteration of these MCMC algorithms, which becomes computationally prohibitive when $T$ is large. Moreover, for large $T$, memory constraints may make it infeasible to store and manipulate the entire dataset on a single computer, thus precluding standard MCMC on the entire dataset. In this paper, we propose an embarrassingly parallel divide-and-conquer strategy to tackle this issue.

2.2 Divide-and-conquer algorithm

A generic divide-and-conquer strategy proceeds by dividing the $T$ observations into $K$ disjoint subsets of sizes $m_{1},\dots,m_{K}$, respectively, such that $\sum_{k=1}^{K}m_{k}=T$. To keep things simple, we assume that the subset sizes are equal, that is, $m_{1}=\cdots=m_{K}=m=T/K$. While independent observations can be divided arbitrarily into subsets, in the time series setup we divide them sequentially over time; we thus consider subsequences instead of subsets. We denote the observations within the $k$th subsequence by $\mathbf{X}_{[k]}$ and the complete dataset by $\mathbf{X}=(\mathbf{X}_{[1]},\dots,\mathbf{X}_{[K]})$. In particular, we have that $\mathbf{X}_{[1]}=X_{1:m},\mathbf{X}_{[2]}=X_{(m+1):2m},\dots,\mathbf{X}_{[K]}=X_{((K-1)m+1):T}$.

For each subsequence $\mathbf{X}_{[k]}$, $k=1,\dots,K$, we first define pseudo likelihoods $\widetilde{p}_{\theta}(\mathbf{X}_{[k]})$ by ignoring past observations and assuming that $\mathbf{X}_{[k]}$ starts at time one. We require some additional notation to define this formally. Let $p_{1,\,\theta}$ denote the marginal density of $X_{1}$, and for $t=2,\dots,T$, let $p_{t\mid-,\,\theta}$ denote the conditional density of $X_{t}\mid X_{1:(t-1)}$. Using this notation, the density of the first subsequence $\mathbf{X}_{[1]}$ is $p_{\theta}(\mathbf{X}_{[1]})=p_{1,\,\theta}(X_{1})\times\prod_{t=2}^{m}p_{t\mid-,\,\theta}(X_{t}\mid X_{1:(t-1)})$. We define $\widetilde{p}_{\theta}(\mathbf{X}_{[1]})=p_{\theta}(\mathbf{X}_{[1]})$, and for $k=2,\dots,K$, we define

\widetilde{p}_{\theta}(\mathbf{X}_{[k]})=p_{1,\,\theta}(X_{(k-1)m+1})\times\prod_{t=2}^{m}p_{t\mid-,\,\theta}\big(X_{(k-1)m+t}\mid X_{((k-1)m+1):((k-1)m+t-1)}\big).

(3)

While the notation here is somewhat clunky, all we do is simply ignore observations before each subsequence and treat it as if it starts at time one. This is easy to implement in practice.
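As an illustration of how little is involved, the sketch below splits a series sequentially and evaluates the pseudo log-likelihood of equation 3 for each subsequence. It again assumes a stationary Gaussian AR(1) model purely for illustration; `split_sequentially` and `pseudo_logliks` are hypothetical helper names, not part of the paper.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_loglik(x, phi, sigma2):
    """Stationary Gaussian AR(1) log-likelihood of a series starting at time one."""
    total = normal_logpdf(x[0], 0.0, sigma2 / (1.0 - phi ** 2))
    for prev, curr in zip(x, x[1:]):
        total += normal_logpdf(curr, phi * prev, sigma2)
    return total


def split_sequentially(x, K):
    """Split x into K contiguous subsequences of equal length m = T / K."""
    T = len(x)
    if T % K != 0:
        raise ValueError("T must be divisible by K for equal-sized subsequences")
    m = T // K
    return [x[k * m:(k + 1) * m] for k in range(K)]


def pseudo_logliks(x, K, phi, sigma2):
    """Pseudo log-likelihood of each subsequence (equation 3).

    Each block is scored as if it started at time one: its first observation
    uses the marginal density, and all earlier blocks are ignored.
    """
    return [ar1_loglik(block, phi, sigma2) for block in split_sequentially(x, K)]
```

Because every block is handled by the same function as a full series, an existing likelihood implementation can be reused unchanged on each subsequence.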

Using these pseudo likelihoods, we define subsequence posteriors $\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})$ for every subsequence $\mathbf{X}_{[k]}$ as

\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})=\frac{{\widetilde{p}_{\theta}(\mathbf{X}_{[k]})}^{\gamma_{k}}\Pi_{0}(\mathrm{d}\theta)}{\int_{\Theta}{\widetilde{p}_{\theta}(\mathbf{X}_{[k]})}^{\gamma_{k}}\Pi_{0}(\mathrm{d}\theta)},

(4)

where $\gamma_{1},\dots,\gamma_{K}>0$ . In other words, we define subsequence posteriors for each subsequence $\mathbf{X}_{[k]}$ by ignoring observations before it. We have used the subscript $m$ in $\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})$ to signify that the lengths of the subsequences are equal to $m$ . For a more compact notation, we shall use $\Pi_{m,k}(\mathrm{d}\theta)$ to denote $\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})$ .

The quantities $\gamma_{1},\dots,\gamma_{K}$ in equation 4 control the interplay between the prior and the likelihood, as well as the relative importance of each subsequence. Consider the situation where we divide the time series into only one subsequence, that is, $K=1$ . In this case, choosing $\gamma_{1}=1$ in equation 4 gives the full posterior (2). More generally, for $K$ subsequences, choosing $\gamma_{1}=\cdots=\gamma_{K}=1$ results in equation 4 giving the usual posteriors for the subsequences. This results, however, in the subsequence posteriors having a different scale than the full posterior as they are based on only a fraction of observations as compared to the full posterior.

We shall assume for our theoretical analysis that the series is stationary (defined formally in Assumption 1), which means that the joint distribution of $\mathbf{X}_{[k]}$ is the same for all $k$. Heuristically, this suggests that the amount of ‘information’ contained within each subsequence is the same, and it also suggests that the ‘information’ contained in the entire time series of length $T$ is intuitively $K$ times more than that contained within each subsequence. This leads us to choose $\gamma_{1}=\cdots=\gamma_{K}=T/m=K$. Indeed, our theoretical results in Section 3 and numerical experiments in Section 4 indicate that this is an appropriate choice in our setting.
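The scaling argument above can be written out under a heuristic Gaussian approximation to each posterior. In the sketch below, $I(\theta_{0})$ is taken to be the Fisher information per observation; this interpretation of the symbol, and the approximation itself, are assumptions made for illustration rather than statements proved here.

```latex
% Heuristic large-sample approximation; I(\theta_0) is assumed to be the
% per-observation Fisher information.
\begin{align*}
\gamma_k = 1: &\quad \operatorname{Var}(\theta \mid \mathbf{X}_{[k]}) \approx \{m\, I(\theta_0)\}^{-1}, \\
\gamma_k = K: &\quad \operatorname{Var}(\theta \mid \mathbf{X}_{[k]}) \approx \{K m\, I(\theta_0)\}^{-1} = \{T\, I(\theta_0)\}^{-1}, \\
\text{full posterior}: &\quad \operatorname{Var}(\theta \mid X_{1:T}) \approx \{T\, I(\theta_0)\}^{-1}.
\end{align*}
```

Raising the pseudo likelihood to the power $K$ multiplies the log-likelihood, and hence its curvature, by $K$, which is what brings the spread of each subsequence posterior onto the scale of the full posterior.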

A key question is how to combine these subsequence posteriors to approximate the full posterior. We use the Wasserstein barycenter of the set of subsequence posteriors in this article. Let $\mathcal{P}_{2}(\Theta)$ denote the set of all probability measures on $\Theta\subseteq\mathbb{R}^{d}$ with finite second moments. In such a setting, the Wasserstein- $2$ ( $\mathrm{W}_{2}$ ) distance between probability measures $\mu,\nu\in\mathcal{P}_{2}(\Theta)$ is defined as

\mathrm{W}_{2}(\mu,\nu)=\left\{\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{\Theta\times\Theta}\|\theta_{1}-\theta_{2}\|^{2}\gamma(\mathrm{d}\theta_{1}\,\mathrm{d}\theta_{2})\right\}^{1/2},

where $\Gamma(\mu,\nu)$ denotes the set of all probability measures on $\Theta\times\Theta$ with marginals $\mu$ and $\nu$, respectively, and $\|\cdot\|$ denotes the Euclidean norm. Convergence in $\mathrm{W}_{2}$ distance on $\mathcal{P}_{2}(\Theta)$ is equivalent to weak convergence plus convergence of the second moment (Bickel and Freedman, 1981, Lemma 8.3). The Wasserstein barycenter of the subsequence posteriors is defined as

\overline{\Pi}_{T}(\mathrm{d}\theta\mid\mathbf{X})=\underset{\mu\in\mathcal{P}_{2}(\Theta)}{\operatorname*{argmin}}\sum_{k=1}^{K}\mathrm{W}_{2}^{2}(\mu,\Pi_{m,k}),

(5)

where we use the subscript $T$ to denote that the Wasserstein barycenter is based on $T$ total observations. We also assume that $\overline{\Pi}_{T}(\mathrm{d}\theta\mid\mathbf{X})$ admits a density $\overline{\pi}_{T}(\theta\mid\mathbf{X})$ with respect to the Lebesgue measure, that is, $\overline{\Pi}_{T}(\mathrm{d}\theta\mid\mathbf{X})=\overline{\pi}_{T}(\theta\mid\mathbf{X})\,\mathrm{d}\theta$.
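For intuition about the $\mathrm{W}_{2}$ distance that enters equation 5, the one-dimensional case is easy to compute. The sketch below is an illustration only: it uses the standard facts that, in one dimension, the optimal coupling pairs order statistics, and that the $\mathrm{W}_{2}$ distance between $N(\mu_{1},\sigma_{1}^{2})$ and $N(\mu_{2},\sigma_{2}^{2})$ is $\{(\mu_{1}-\mu_{2})^{2}+(\sigma_{1}-\sigma_{2})^{2}\}^{1/2}$. The function names are hypothetical.

```python
import math
import random


def empirical_w2(xs, ys):
    """Wasserstein-2 distance between two equal-size one-dimensional samples.

    In one dimension the optimal coupling pairs order statistics, so the
    distance is the root mean squared difference of the sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))


def gaussian_w2(mean1, sd1, mean2, sd2):
    """Closed-form W2 distance between N(mean1, sd1^2) and N(mean2, sd2^2)."""
    return math.sqrt((mean1 - mean2) ** 2 + (sd1 - sd2) ** 2)


# Compare the sample-based estimate with the closed form on simulated draws.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
ys = [rng.gauss(1.0, 2.0) for _ in range(20000)]
estimate = empirical_w2(xs, ys)
exact = gaussian_w2(0.0, 1.0, 1.0, 2.0)
print(estimate, exact)
```

The two printed values should be close but not identical, since the first is a Monte Carlo estimate.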

Remark 1.

Although other distances between probability measures can potentially be used in place of $\mathrm{W}_{2}^{2}$ in equation 5, the Wasserstein-2 distance is particularly appealing because the resulting barycenter is the geometric center of the subsequence posteriors (Agueh and Carlier, 2011; Srivastava et al., 2015). Agueh and Carlier (2011) established desirable properties of the Wasserstein barycenter such as existence and uniqueness, and Srivastava et al. (2015) provided a strong consistency result. Moreover, it has been shown theoretically that Wasserstein-based posteriors have appealing asymptotic properties as well (Szabó and Van Zanten, 2019).

Li et al. (2017) used this approach to combine subset posteriors for the independent observations setting. Their approach cannot be directly applied to the time series setting because it leverages the fact that the full data likelihood can be factorized as a product of likelihoods for each observation; this is the case for independent data but not for time series. Perhaps surprisingly, we show that this is not a problem, and that defining subsequence posteriors as above and combining them via equation 5 results in provably accurate approximations to the true posterior.

Computing the Wasserstein barycenter exactly is computationally very expensive and remains an active area of research. However, efficient numerical algorithms have been developed to approximate the barycenter (Cuturi and Doucet, 2014; Dvurechenskii et al., 2018). For one-dimensional functionals of the parameter $\theta$, the Wasserstein barycenter can be obtained by simply averaging quantiles. Let $a\in\mathbb{R}^{d}$ and $b\in\mathbb{R}$, and let $\xi=a^{\top}\theta+b$ be a one-dimensional linear functional of $\theta$. We abuse notation to let $\Pi_{m}(\theta\mid\mathbf{X}_{[k]})$ and $\overline{\Pi}_{T}(\theta\mid\mathbf{X})$ denote the cumulative distribution functions corresponding to $\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})$ and $\overline{\Pi}_{T}(\mathrm{d}\theta\mid\mathbf{X})$, respectively. For any $u\in(0,1)$, the quantile function of the subsequence posterior distribution of $\xi$ is

\Pi_{m}^{-1}(u\mid\mathbf{X}_{[k]})=\inf\{\xi\in\mathbb{R}\,:\,u\leq\Pi_{m}(\xi\mid\mathbf{X}_{[k]})\},

and the quantile function of its Wasserstein posterior is

\overline{\Pi}_{T}^{-1}(u\mid\mathbf{X})=\inf\{\xi\in\mathbb{R}\,:\,u\leq\overline{\Pi}_{T}(\xi\mid\mathbf{X})\}=\frac{1}{K}\sum_{k=1}^{K}\Pi_{m}^{-1}(u\mid\mathbf{X}_{[k]}),

where $\Pi_{m}(\xi\mid\mathbf{X}_{[k]})$ and $\overline{\Pi}_{T}(\xi\mid\mathbf{X})$ are the posteriors induced by making the linear transformation $\xi=a^{\top}\theta+b$ from $\Pi_{m}(\theta\mid\mathbf{X}_{[k]})$ and $\overline{\Pi}_{T}(\theta\mid\mathbf{X})$, respectively. Given samples from the subsequence posteriors, this can be used to straightforwardly obtain credible intervals for each component of the Wasserstein posterior.
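The quantile-averaging identity above translates directly into code. The following minimal sketch takes draws of a scalar functional $\xi$ from each subsequence posterior, computes empirical quantiles using the $\inf$ definition, and averages them across subsequences. The function names and the small floating-point tolerance are illustrative choices, not part of the paper.

```python
import math


def quantile(sorted_sample, u):
    """Empirical quantile inf{x : u <= F_n(x)} of a sorted sample, 0 < u < 1."""
    n = len(sorted_sample)
    # F_n at the i-th order statistic equals i / n, so the smallest i with
    # u <= i / n is ceil(u * n); the tolerance guards against rounding in u * n.
    index = math.ceil(u * n - 1e-12) - 1
    return sorted_sample[min(max(index, 0), n - 1)]


def barycenter_quantiles(subsequence_samples, levels):
    """Quantiles of the one-dimensional Wasserstein barycenter.

    subsequence_samples: list of K lists, each holding draws of the scalar
    functional xi from one subsequence posterior.
    levels: probabilities u in (0, 1) at which to evaluate the quantiles.
    """
    sorted_samples = [sorted(s) for s in subsequence_samples]
    K = len(sorted_samples)
    return [sum(quantile(s, u) for s in sorted_samples) / K for u in levels]
```

For example, `barycenter_quantiles(samples, [0.025, 0.975])` would give the end points of an equal-tailed 95% credible interval for $\xi$ under the Wasserstein posterior.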

The proposed divide-and-conquer for Bayesian time series (DC-BATS) method is summarized in Algorithm 1.

Algorithm 1 Divide-and-conquer for Bayesian time series (DC-BATS).

Input: time series $X_{1:T}$, prior $\Pi_{0}(\theta)$, subsequence size $m$, number of subsequences $K=T/m$.

1: Divide the time series sequentially as $\mathbf{X}_{[1]}=X_{1:m}$, $\mathbf{X}_{[2]}=X_{(m+1):2m}$, $\dots$, $\mathbf{X}_{[K]}=X_{((K-1)m+1):T}$.

2: Compute subsequence posteriors as $\Pi_{m,k}(\mathrm{d}\theta)=\{\widetilde{p}_{\theta}(\mathbf{X}_{[k]})^{K}\Pi_{0}(\mathrm{d}\theta)\}/\{\int_{\Theta}\widetilde{p}_{\theta}(\mathbf{X}_{[k]})^{K}\Pi_{0}(\mathrm{d}\theta)\}$ for $k=1,\dots,K$.

3: for $k=1,\dots,K$ do

4: Obtain posterior samples from $\Pi_{m,k}(\mathrm{d}\theta)$ using a desired sampling algorithm.

5: end for

6: Combine samples from the subsequence posteriors using their Wasserstein barycenter.

Output: samples from the Wasserstein posterior.
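The steps of Algorithm 1 can be sketched end to end for a toy problem. The example below is a minimal illustration under several assumptions that are not from the paper: a stationary Gaussian AR(1) model with known unit innovation variance, a Uniform$(-1,1)$ prior on the autoregressive coefficient $\phi$, a plain random-walk Metropolis sampler standing in for the "desired sampling algorithm", and a serial loop where a real implementation would run the subsequences in parallel. All function names are hypothetical.

```python
import math
import random


def simulate_ar1(T, phi, rng, sigma=1.0):
    """Simulate a stationary Gaussian AR(1) series of length T."""
    x = [rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2))]
    for _ in range(T - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x


def pseudo_loglik(block, phi, sigma2=1.0):
    """AR(1) log-likelihood of a block treated as if it started at time one."""
    var1 = sigma2 / (1.0 - phi ** 2)
    total = -0.5 * (math.log(2.0 * math.pi * var1) + block[0] ** 2 / var1)
    for prev, curr in zip(block, block[1:]):
        total -= 0.5 * (math.log(2.0 * math.pi * sigma2) + (curr - phi * prev) ** 2 / sigma2)
    return total


def sample_subsequence_posterior(block, K, n_iter, rng, step=0.02):
    """Random-walk Metropolis targeting pseudo-likelihood^K x Uniform(-1, 1) prior."""
    phi = 0.0
    log_target = K * pseudo_loglik(block, phi)
    draws = []
    for _ in range(n_iter):
        proposal = phi + rng.gauss(0.0, step)
        if abs(proposal) < 1.0:  # the prior density is zero outside (-1, 1)
            log_target_prop = K * pseudo_loglik(block, proposal)
            # 1 - random() lies in (0, 1], so the logarithm is always defined.
            if math.log(1.0 - rng.random()) < log_target_prop - log_target:
                phi, log_target = proposal, log_target_prop
        draws.append(phi)
    return draws[n_iter // 2:]  # discard the first half as burn-in


def dc_bats_1d(x, K, n_iter=1000, seed=1):
    """Toy DC-BATS for a scalar parameter; returns atoms of the Wasserstein posterior."""
    rng = random.Random(seed)
    m = len(x) // K
    blocks = [x[k * m:(k + 1) * m] for k in range(K)]  # step 1
    # Steps 2-5: sample each powered subsequence posterior (serially here).
    draws = [sorted(sample_subsequence_posterior(b, K, n_iter, rng)) for b in blocks]
    # Step 6: in one dimension, averaging sorted draws averages empirical quantiles.
    return [sum(d[i] for d in draws) / K for i in range(len(draws[0]))]


data = simulate_ar1(2000, 0.6, random.Random(0))
wasserstein_draws = dc_bats_1d(data, K=4)
```

The returned list is sorted and represents an empirical approximation to the Wasserstein posterior of $\phi$; its mean and quantiles can be read off directly. The sampler settings (step size, number of iterations, burn-in) are arbitrary and would need tuning in any real application.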

Remark 2.

If there is only finite-order dependence in the observations (for example, dependence of order $r$ , meaning that $X_{t}$ depends only on $X_{(t-r):(t-1)}$ ), then one can write the full log-likelihood exactly as a sum over subsets, where each subset conditions on the $r$ last observations from the previous subset. In such a case, one can use the log-likelihood split as described above rather than our proposed strategy which ignores the $r$ last observations from the previous subsequence, and then use Wasserstein averages. For a fixed $r$ , such an approach would have similar computational cost as our proposed method, and we conjecture that it is also possible to provide similar theoretical results as the ones we provide in Section 3.2. However, we do not pursue this here as our proposed method can also handle situations when it is not possible to write down the dependence structure as a finite-order dependence (for example, for hidden Markov models).
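The exact split mentioned in this remark can be checked numerically for the simplest case $r=1$. The sketch below assumes a stationary Gaussian AR(1) model for illustration; it conditions each block on the final observation of the previous block and recovers the full log-likelihood. This is the alternative described in the remark, not the DC-BATS pseudo likelihood, and the function names are hypothetical.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_loglik(x, phi, sigma2):
    """Full stationary Gaussian AR(1) log-likelihood."""
    total = normal_logpdf(x[0], 0.0, sigma2 / (1.0 - phi ** 2))
    for prev, curr in zip(x, x[1:]):
        total += normal_logpdf(curr, phi * prev, sigma2)
    return total


def ar1_conditional_loglik(block, last_prev, phi, sigma2):
    """Log-density of a block given the final observation of the previous block."""
    total = 0.0
    prev = last_prev
    for curr in block:
        total += normal_logpdf(curr, phi * prev, sigma2)
        prev = curr
    return total


def exact_split_loglik(x, K, phi, sigma2):
    """Sum of block log-likelihoods, each conditioning on r = 1 past observation."""
    m = len(x) // K
    blocks = [x[k * m:(k + 1) * m] for k in range(K)]
    total = ar1_loglik(blocks[0], phi, sigma2)
    for k in range(1, K):
        total += ar1_conditional_loglik(blocks[k], blocks[k - 1][-1], phi, sigma2)
    return total
```

Up to floating-point rounding, `exact_split_loglik` agrees with `ar1_loglik` on the whole series, whereas summing the pseudo log-likelihoods of equation 3 would replace the first conditional term of each later block with a marginal term.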

3 Theoretical guarantees

3.1 Notations and main assumptions

For notational convenience, we assume that there exists an infinitely long stationary stochastic process $X_{-\infty:\infty}=(\dots,X_{-2},X_{-1},X_{0},X_{1},X_{2},\dots)$ , and that $X_{1:T}$ is the observed sequence from said series. Our proposed methodology does not involve time indices outside $\{1,\dots,T\}$ , and indeed we are interested in the posterior distribution $\Pi_{T}(\mathrm{d}\theta\mid X_{1:T})$ as given by equation 2.

We assume that $\theta_{0}\in\Theta\subseteq\mathbb{R}^{d}$ is the “true” model parameter, and that $X_{-\infty:\infty}$ is generated by $\mathbb{P}_{\theta_{0}}$ , the probability measure induced by $\theta_{0}$ . We use $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}$ to denote expectation with respect to the probability measure $\mathbb{P}_{\theta_{0}}$ . We use $\Phi(\cdot;\mu,\Sigma)$ to denote a normal distribution with mean $\mu$ and variance $\Sigma$ .

We define $\ell_{k}(\theta)=\log\widetilde{p}_{\theta}(\mathbf{X}_{[k]})$ to be the pseudo log-likelihood of the $k$th subsequence, where $\widetilde{p}_{\theta}(\mathbf{X}_{[k]})$ is as defined in equation 3. We use $\nabla_{\theta}\ell_{k}(\theta)$ and $\nabla^{2}_{\theta}\ell_{k}(\theta)$ to denote the gradient and Hessian matrix of $\ell_{k}(\theta)$ with respect to $\theta$, respectively. We use $\widehat{\theta}_{k}$ to denote the maximum likelihood estimator (MLE) of the $k$th subsequence under $\ell_{k}(\theta)$, that is, $\widehat{\theta}_{k}=\operatorname*{argmax}_{\theta\in\Theta}\ell_{k}(\theta)$, and $\overline{\theta}=\sum_{k=1}^{K}\widehat{\theta}_{k}/K$ to be the average MLE across the $K$ subsequences. We let $\widehat{\theta}$ be the MLE based on the complete dataset, that is, $\widehat{\theta}=\operatorname*{argmax}_{\theta\in\Theta}\ell(\theta)$, where $\ell(\theta)$ is the log-likelihood of the entire dataset as given in equation 1. We assume that the MLEs are unique.

Throughout this paper, we denote the $L^{p}$ norm by $\|\cdot\|_{p}=\{\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(|\cdot|^{p})\}^{1/p}$ . We use Euclidean and Frobenius norms for vectors and matrices, respectively, that is, $\|v\|=(\sum_{i=1}^{d}v^{2}_{i})^{1/2}$ if $v\in\mathbb{R}^{d}$ and $\|V\|=(\sum_{i=1}^{d}\sum_{j=1}^{d}V^{2}_{ij})^{1/2}$ if $V\in\mathbb{R}^{d\times d}$ . For a sequence of real-valued random variables $\{Z_{n}\}_{n\geq 1}$ , where each $Z_{n}$ is measurable with respect to the sigma-algebra generated by $X_{-\infty:\infty}$ , and a sequence of real numbers $\{a_{n}\}_{n\geq 1}$ , we say that $Z_{n}=O_{\mathbb{P}_{\theta_{0}}}(a_{n})$ if for any $\epsilon>0$ , there exists a real number $s>0$ and a positive integer $N$ such that $\mathbb{P}_{\theta_{0}}(|Z_{n}/a_{n}|>s)<\epsilon$ for any $n>N$ . Moreover, we say that $Z_{n}=o_{\mathbb{P}_{\theta_{0}}}(a_{n})$ if for any $\epsilon>0$ , $\lim_{n\to\infty}\mathbb{P}_{\theta_{0}}(|Z_{n}/a_{n}|>\epsilon)=0$ . We let $\sigma(\cdot)$ represent the sigma-algebra generated by a (potentially infinite) collection of random variables $\cdot$ . We use $\nabla_{\theta}$ as a shorthand for $\partial/(\partial\theta)$ , that is, to represent the gradient with respect to $\theta$ .

We describe the main assumptions related to the time series nature of the observations in this section. There are additional assumptions which are similar to those made by Li et al. (2017), but we defer these to Appendix A to simplify the exposition.

We consider time series which are stationary and ergodic in this article, as formalised in Assumption 1.

Assumption 1 (Stationarity).

$X_{-\infty:\infty}$ is a strictly stationary and ergodic process.

Moreover, we make the following assumption on the mixing time of the process.

Assumption 2 (Mixing time).

The $\alpha$ -mixing coefficient of $X_{-\infty:\infty}$ , a measure of dependency defined as

\alpha(n)=\sup_{A\in\sigma(\dots,X_{-1},X_{0}),\,B\in\sigma(X_{n},X_{n+1},\dots)}|\mathbb{P}_{\theta_{0}}(A)\mathbb{P}_{\theta_{0}}(B)-\mathbb{P}_{\theta_{0}}(A\cap B)|,

is such that $\sum_{j=1}^{\infty}\alpha(j)^{\delta/(2+\delta)}<\infty$ for a real number $\delta>0$ .

Assumption 2 precludes long-range dependence. For processes with large $\delta$, a slow mixing rate will be sufficient to guarantee $\sum_{j=1}^{\infty}\alpha(j)^{\delta/(2+\delta)}<\infty$. Moreover, this assumption always holds for geometrically ergodic processes as, for such processes, there exists a $0<\rho<1$ such that $\alpha(j)<\rho^{j}$ for all sufficiently large $j$, and thus the sum of interest is finite because its tail is dominated by the convergent geometric series $\sum_{j=1}^{\infty}(\rho^{j})^{\delta/(2+\delta)}=\sum_{j=1}^{\infty}\{\rho^{\delta/(2+\delta)}\}^{j}<\infty$.
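The bound for geometrically ergodic processes can be made explicit. In the sketch below, $J$ is a hypothetical index (introduced here for illustration) beyond which $\alpha(j)<\rho^{j}$ holds, and $r=\rho^{\delta/(2+\delta)}\in(0,1)$.

```latex
% J: index beyond which alpha(j) < rho^j (illustrative notation);
% r = rho^{delta / (2 + delta)} lies in (0, 1).
\begin{align*}
\sum_{j=1}^{\infty}\alpha(j)^{\delta/(2+\delta)}
&\leq \sum_{j=1}^{J}\alpha(j)^{\delta/(2+\delta)} + \sum_{j=J+1}^{\infty} r^{j} \\
&= \sum_{j=1}^{J}\alpha(j)^{\delta/(2+\delta)} + \frac{r^{J+1}}{1-r} < \infty .
\end{align*}
```

The first term is a finite sum of finite quantities, and the second is the tail of a geometric series with ratio $r<1$.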

Finally, we make an assumption on the score function as follows.

Assumption 3 (Geometrically decaying score function).

There exists a constant $\rho_{0}\in(0,1)$ , a sufficiently large integer $N$ , and a constant $C_{0}>0$ such that

\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\|\nabla_{\theta}\log p_{\theta_{0}}(X_{1}\mid X_{-j:0})-\nabla_{\theta}\log p_{\theta_{0}}(X_{1}\mid X_{-j^{\prime}:0})\|_{1}\leq C_{0}\,\rho_{0}^{N}\quad\text{for all}\ j,j^{\prime}>N.

Assumption 3 states that the dependence of score functions on the history decays geometrically as the number of time steps increases. Such an assumption is automatically true for finite order Markov processes. For an $n$-order Markov process, we can choose $N=n$ so that $\nabla_{\theta}\log p_{\theta_{0}}\left(X_{1}\mid X_{-j:0}\right)\equiv\nabla_{\theta}\log p_{\theta_{0}}\left(X_{1}\mid X_{-n},\ldots,X_{0}\right)$ for all $j>N$. Therefore, we have

\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\|\nabla_{\theta}\log p_{\theta_{0}}\left(X_{1}\mid X_{-j:0}\right)-\nabla_{\theta}\log p_{\theta_{0}}\left(X_{1}\mid X_{-j^{\prime}:0}\right)\|_{1}=0.

It also holds for finite state-space hidden Markov models under some additional regularity assumptions (for example, Bickel et al., 1998, Lemma 6). That lemma gives an $\eta_{1}$ such that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\|\nabla_{\theta}\log p_{\theta_{0}}(X_{1}\mid X_{-j:0})-\eta_{1}\|_{1}\leq C_{0}\,\rho_{0}^{N}$, which, by the triangle inequality, implies that there exists $N>0$ such that

$\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\|\nabla_{\theta}\log p_{\theta_{0}}\left(X_{1}\mid X_{-j:0}\right)-\nabla_{\theta}\log p_{\theta_{0}}(X_{1}\mid X_{-j^{\prime}:0})\|_{1}$
$\leq\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\|\nabla_{\theta}\log p_{\theta_{0}}(X_{1}\mid X_{-j:0})-\eta_{1}\|_{1}+\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\|\nabla_{\theta}\log p_{\theta_{0}}\left(X_{1}\mid X_{-j^{\prime}:0}\right)-\eta_{1}\|_{1}\leq 2C_{0}\,\rho_{0}^{N}.$

While this may appear as a strong assumption, we have seen in our numerical experiments that DC-BATS has good performance even when this is not necessarily satisfied.

3.2 Main results

We present the main theoretical results of the paper in this section. Our results are based on asymptotic theory as the total length $T=Km$ of the time series increases to infinity. We prove that the error due to combining the subsequence posteriors using DC-BATS is asymptotically negligible as $T\to\infty$. This will require increasing the size of each subsequence $m$, but at a potentially much slower rate than $T$. The proofs are in the Supplementary Material, and build on the results of Li et al. (2017), extended to the time series setting. We first present the following Lemma 1, which is novel to this work and is instrumental in proving our later results.

Lemma 1.

Under Assumptions 1–3 in the main text and Assumptions 4–9 in Appendix A, $\|\overline{\theta}-\widehat{\theta}\|=o_{\mathbb{P}_{\theta_{0}}}(m^{-1/2})$ . If we assume further that $\widehat{\theta}_{1}$ is an unbiased estimator for $\theta$ , that is, $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\widehat{\theta}_{1})=\theta_{0}$ , and $m$ is at least $\mathcal{O}(T^{1/2})$ , then $\|\overline{\theta}-\widehat{\theta}\|=o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ .

Based on Lemma 1, the following Theorem 1 is the first main theoretical result of this paper, where we recall that $\widehat{\theta}_{k}$ is the MLE of the $k$ th subsequence and $\overline{\theta}=\sum_{k=1}^{K}\widehat{\theta}_{k}/K$ is the average MLE across the $K$ subsequences.

Theorem 1 (Error due to combining subsequence posteriors).

Suppose Assumptions 1–3 in the main text and Assumptions 4–9 in Appendix A hold. Let $\xi=a^{\top}\theta+b$ for some fixed $a\in\mathbb{R}^{d}$ and $b\in\mathbb{R}$ . Let $I_{\xi}(\theta_{0})=[a^{\top}\{I^{-1}(\theta_{0})\}a]^{-1}$ , $\overline{\xi}=a^{\top}\overline{\theta}+b$ , and $\widehat{\xi}=a^{\top}\widehat{\theta}+b$ . Then we have the following.

1.

As $m\to\infty$,

	$\displaystyle T^{1/2}\mathrm{W}_{2}\big(\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T}),\Phi\big[\mathrm{d}\xi;\overline{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}\big]\big)$	$\displaystyle\to 0,$
	$\displaystyle T^{1/2}\mathrm{W}_{2}\big(\Pi_{T}(\mathrm{d}\xi\mid X_{1:T}),\Phi\big[\mathrm{d}\xi;\widehat{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}\big]\big)$	$\displaystyle\to 0,$
	$\displaystyle m^{1/2}\mathrm{W}_{2}\left\{\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T}),\Pi_{T}(\mathrm{d}\xi\mid X_{1:T})\right\}$	$\displaystyle\to 0,$

where the convergences are in $\mathbb{P}_{\theta_{0}}$ -probability.

2.

If (a) $\widehat{\theta}_{1}$ is an unbiased estimator for $\theta$, that is, $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\widehat{\theta}_{1})=\theta_{0}$, and (b) $m$ is at least $\mathcal{O}(T^{1/2})$, then $T^{1/2}\mathrm{W}_{2}\left\{\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T}),\Pi_{T}(\mathrm{d}\xi\mid X_{1:T})\right\}\to 0$ in $\mathbb{P}_{\theta_{0}}$-probability as $m\to\infty$.

For the first part of the theorem to hold, it suffices to let $m\to\infty$ at a much slower rate than $T$. The second part of the theorem is more interesting and guarantees that the error between the Wasserstein posterior $\overline{\Pi}_{T}$ and the full posterior $\Pi_{T}$ has the asymptotically optimal rate $T^{-1/2}$ in the special case where the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$. This requirement on $m$ is not restrictive in practice, as one typically divides the entire dataset into a fairly small number $K$ of subsequences in divide-and-conquer algorithms (of the order of tens or hundreds), whereas $T$ is much larger by comparison (of the order of hundreds of thousands or more). For instance, if $T=10^{6}$, then $m\geq T^{1/2}=10^{3}$ requires the number of subsequences $K=T/m$ to be at most $10^{3}$, which is a reasonable requirement in practice. Moreover, we have observed good performance of DC-BATS even when $T$ and $m$ are not very large. It is possible to leverage Theorem 1 to obtain accuracy guarantees on moments of the posterior. These are provided in the following Theorem 2.
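The arithmetic behind the bound on $K$ can be sketched in a few lines of Python. This is only an illustration: the constant hidden in $\mathcal{O}(T^{1/2})$ is taken to be one, which is an assumption, and `max_num_subsequences` is a hypothetical helper name rather than part of DC-BATS.

```python
import math


def max_num_subsequences(T: int) -> int:
    """Largest K for which m = T // K is still at least sqrt(T).

    Assumes the constant in the O(T^{1/2}) requirement equals 1.
    """
    m_min = math.isqrt(T)        # floor(sqrt(T))
    if m_min * m_min < T:
        m_min += 1               # round up to ceil(sqrt(T))
    return T // m_min


T = 10**6
K_max = max_num_subsequences(T)
print(K_max, T // K_max)         # largest admissible K and the resulting m
```

For $T=10^{6}$ this gives at most $10^{3}$ subsequences of length $10^{3}$ each, matching the example in the text.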

Theorem 2 (Guarantees on first and second moments).

Let $\xi=a^{\top}\theta+b$ for some fixed $a\in\mathbb{R}^{d}$ and $b\in\mathbb{R}$, and let $\Xi\subseteq\mathbb{R}$ be the domain of $\xi$ under the transformation. Taking $\xi_{0}=a^{\top}\theta_{0}+b$ as the true value, we define the “bias” of a distribution $\Pi(\mathrm{d}\xi\mid X_{1:T})$ as $\operatorname*{bias}[\Pi(\mathrm{d}\xi\mid X_{1:T})]=\int_{\Xi}\xi\,\Pi(\mathrm{d}\xi\mid X_{1:T})-\xi_{0}$. Under Assumptions 1–3 in Section 3.1 and Assumptions 4–9 in Appendix A, we have the following.

1.

$\operatorname*{bias}\{\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T})\}=\overline{\xi}-\xi_{0}+o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ and $\operatorname*{bias}\{\Pi_{T}(\mathrm{d}\xi\mid X_{1:T})\}=\widehat{\xi}-\xi_{0}+o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$.
2.

$\mathrm{var}\{\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T})\}=T^{-1}I^{-1}_{\xi}(\theta_{0})+o_{\mathbb{P}_{\theta_{0}}}(T^{-1})$ and $\mathrm{var}\{\Pi_{T}(\mathrm{d}\xi\mid X_{1:T})\}=T^{-1}I^{-1}_{\xi}(\theta_{0})+o_{\mathbb{P}_{\theta_{0}}}(T^{-1})$.

Remark 3.

The definition of bias in Theorem 2 is adopted from Li et al., (2017). The bias in Theorem 2 is defined to be the difference between the posterior mean and $\xi_{0}$ . Unlike the usual definition of bias (which is a fixed quantity), the term refers to a random quantity here.

Theorem 2 quantifies the order of the biases and variances of both the Wasserstein posterior and the full posterior. The difference in the biases of these two posteriors is controlled by $\overline{\xi}-\widehat{\xi}$ up to an asymptotically negligible term $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ . Lemma 1 further shows that the bias difference is $o_{\mathbb{P}_{\theta_{0}}}(m^{-1/2})$ in general, and is improved to $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ when the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$ . In terms of the posterior variance, both posteriors align on the dominating term $T^{-1}I^{-1}_{\xi}(\theta_{0})$ and the difference is only up to an asymptotically negligible term $o_{\mathbb{P}_{\theta_{0}}}(T^{-1})$ .
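To illustrate the quantity $\overline{\xi}-\widehat{\xi}$ that controls the bias difference, the following Python sketch compares the average of the subsequence MLEs with the full-data MLE. It is an illustration under assumptions that are not taken from the paper: a zero-mean AR(1) model with unit innovation variance, the conditional (least-squares) MLE, and $a=1$, $b=0$ so that $\xi=\theta$. The helper names `simulate_ar1` and `ar1_conditional_mle` are hypothetical.

```python
import numpy as np


def simulate_ar1(T, phi, rng):
    """Simulate x_t = phi * x_{t-1} + eps_t with eps_t ~ N(0, 1), started at x_0 = 0."""
    x = np.zeros(T)
    eps = rng.standard_normal(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + eps[t]
    return x


def ar1_conditional_mle(x):
    """Conditional (least-squares) MLE of phi for a zero-mean Gaussian AR(1)."""
    return float(np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1]))


rng = np.random.default_rng(0)
T, K, phi0 = 100_000, 10, 0.5
x = simulate_ar1(T, phi0, rng)

theta_hat = ar1_conditional_mle(x)                        # full-data MLE
theta_bar = float(np.mean([ar1_conditional_mle(block)     # average of subsequence MLEs
                           for block in np.array_split(x, K)]))
print(theta_hat, theta_bar, abs(theta_bar - theta_hat))
```

In runs of this kind the gap between the two estimates is typically much smaller than the sampling error of either one, which is the behaviour that Lemma 1 formalizes.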

Remark 4.

We note that our theory focuses on the exact Wasserstein barycenter, while in practice we will instead calculate the Wasserstein barycenter between Monte Carlo approximations to the subset posteriors. Conceptually, our theory can be easily extended to also account for the Monte Carlo error component following a similar approach to Li et al., (2017); however, we do not consider that extension in this article.
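For a scalar functional $\xi$, the quantile function of the $\mathrm{W}_{2}$ barycenter of one-dimensional distributions is the average of the individual quantile functions, so a Monte Carlo version can be obtained by averaging order statistics. The sketch below assumes that an equal number of draws of $\xi$ from each subsequence posterior (as defined earlier in the paper) is already available; the synthetic draws and the name `barycenter_1d` are purely illustrative.

```python
import numpy as np


def barycenter_1d(subset_draws):
    """Equal-weight W2 barycenter of K one-dimensional empirical measures.

    subset_draws has shape (K, S): S draws from each of K subsequence posteriors.
    Returns S atoms obtained by averaging the sorted draws across subsequences.
    """
    sorted_draws = np.sort(np.asarray(subset_draws, dtype=float), axis=1)
    return sorted_draws.mean(axis=0)


rng = np.random.default_rng(1)
K, S = 5, 4000
centers = np.linspace(-0.2, 0.2, K)                         # synthetic posterior centers
draws = centers[:, None] + 0.1 * rng.standard_normal((K, S))

bary = barycenter_1d(draws)
lower, upper = np.quantile(bary, [0.025, 0.975])            # 95% interval of the barycenter
print(lower, upper)
```

Because quantiles are averaged, the endpoints of a credible interval of the barycenter are the averages of the corresponding endpoints across subsequences, up to Monte Carlo and interpolation error.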

4 Synthetic data experiments

We demonstrate DC-BATS on different time series models. We have noticed that stochastic gradient MCMC algorithms tend to under-estimate posterior variances and are therefore not accurate in quantifying posterior uncertainty. This has also been noticed in the literature (for example, Figure 2 of Nemeth and Fearnhead, 2020). Furthermore, it has been established that stochastic gradient MCMC algorithms have fundamental limitations in terms of their scalability versus accuracy (Johndrow et al., 2020). Given these reasons, we compare the proposed method with running MCMC to sample from the full posterior. Since we have proven theoretically that DC-BATS performs well for stationary models for large $T$ and $m$, we also consider non-stationary models, as well as models with low/moderate $T$ and $m$, to test the method outside of idealized cases. All numerical experiments have been performed on a 2018 i7-8700 CPU with 3.20 GHz processing power. Code for all numerical experiments is available online at https://github.com/deborsheesen/DC-BATS-deborshee. Averaging the credible intervals produced by the subsequence posteriors takes negligible time as compared to sampling from them, and we do not report this in our experiments.

4.1 Linear regression with auto-regressive errors

We first consider a linear regression model with auto-regressive errors as follows:

\displaystyle\begin{aligned} X_{t}&=\alpha+\beta^{\top}Z_{t}+\varepsilon_{t},\\ \varepsilon_{t}&=\varphi_{1}\varepsilon_{t-1}+\varphi_{2}\varepsilon_{t-2}+\xi_{t},\quad\text{with}\ \xi_{t}\stackrel{\text{i.i.d.}}{\sim}\mathrm{N}(0,\sigma^{2}),\end{aligned}

(6)

and $\varepsilon_{1},\varepsilon_{2}\stackrel{\text{ind}}{\sim}\mathrm{N}(0,\sigma^{2})$, where $X_{t},\alpha,\varepsilon_{t}\in\mathbb{R}$, and $\beta,Z_{t}\in\mathbb{R}^{p}$; here $X_{1:T}$ denotes observations and $Z_{1:T}$ denotes covariates. We set $\varphi_{1}=0.4$ and $\varphi_{2}=-0.6$. We choose $p=50$ and generate $T=10^{5}$ observations from this model.
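Data from model (6) can be generated along the following lines. This is a minimal sketch rather than the authors' code: the paper does not state the true values of $\alpha$, $\beta$, and $\sigma$ or the distribution of the covariates, so the values below and the standard normal covariates are assumptions, and `simulate_regression_ar2` is a hypothetical name.

```python
import numpy as np


def simulate_regression_ar2(T, alpha, beta, phi1, phi2, sigma, rng):
    """Simulate X_t = alpha + beta' Z_t + eps_t with AR(2) errors eps_t."""
    p = len(beta)
    Z = rng.standard_normal((T, p))              # covariates (assumed standard normal)
    xi = sigma * rng.standard_normal(T)          # i.i.d. N(0, sigma^2) innovations
    eps = np.empty(T)
    eps[:2] = sigma * rng.standard_normal(2)     # eps_1, eps_2 ~ N(0, sigma^2)
    for t in range(2, T):
        eps[t] = phi1 * eps[t - 1] + phi2 * eps[t - 2] + xi[t]
    X = alpha + Z @ beta + eps
    return X, Z


rng = np.random.default_rng(2)
p = 50
X, Z = simulate_regression_ar2(T=10**5, alpha=1.0, beta=np.ones(p),
                               phi1=0.4, phi2=-0.6, sigma=1.0, rng=rng)
print(X.shape, Z.shape)
```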

We choose independent $\mathrm{N}(0,10^{2})$ priors on $\alpha,\varphi_{1},\varphi_{2}$, and on each component of $\beta$. We also choose an inverse-gamma$(3,10)$ prior on $\sigma^{2}$. We choose $K\in\{10,20\}$ and draw $10^{4}$ samples from each subsequence posterior as well as from the full posterior using the no-U-turn sampler (NUTS; Hoffman and Gelman, 2014) as implemented in Stan (Carpenter et al., 2017), of which the first half are discarded as burn-in in each case. It took ten minutes to sample from the full posterior, about a minute to sample from each subsequence posterior for $K=10$, and about half a minute to sample from each subsequence posterior for $K=20$. We plot 95% credible intervals for $\beta$ in Figure 1, and observe that the credible intervals obtained by DC-BATS are virtually indistinguishable from those obtained by full data MCMC. The frequentist coverage of the credible intervals for $\beta$ for DC-BATS is 94% for $K=10$ and 92% for $K=20$, and is 94% when we sample from the full posterior.

In addition to this single simulation example, we also run simulations under multiple values of $(\varphi_{1},\varphi_{2})$, representing different degrees of mixing in the time series, and with different numbers of machines $K$; the results are in Table 1. We consider (i) the i.i.d. case $(\varphi_{1},\varphi_{2})=(0,0)$, (ii) the fast mixing case $(\varphi_{1},\varphi_{2})=(0.2,0)$, (iii) the slow mixing case $(\varphi_{1},\varphi_{2})=(0.8,0)$, (iv) the unit root case I when $(\varphi_{1},\varphi_{2})=(1,0)$, and (v) the unit root case II when $(\varphi_{1},\varphi_{2})=(-1,-2)$. We set $T=10^{5}$, $p=5$ and simulate 100 datasets for every setting. For each setting, we thus obtain 100 credible intervals and record the percentage of them that contain the true parameter value, that is, their frequentist coverage. The results in Table 1 suggest that DC-BATS achieves frequentist coverage comparable to that of the full MCMC method when $K$ is small. However, the performance deteriorates as $K$ increases.

	i.i.d.		fast mixing		slow mixing		unit root case I		unit root case II
	DC	Full	DC	Full	DC	Full	DC	Full	DC	Full
$K=5$	94	95	92	91	92	92	87	88	88	86
$K=10$	92	95	90	91	88	92	85	88	85	86
$K=20$	92	95	88	91	88	92	86	88	85	86
$K=50$	92	95	85	91	82	92	82	88	81	86

Table 1: Frequentist coverage (in %) of the credible intervals from DC-BATS (DC) and the full MCMC method (Full) under different sets of parameters.
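The coverage figures in Table 1 are percentages of credible intervals that contain the true parameter value. A minimal sketch of that computation is given below; the toy model (a normal mean with known unit variance and a flat prior) and the helper names `credible_interval` and `coverage` are assumptions chosen only to make the example self-contained, and are not the models used in the paper.

```python
import numpy as np


def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from posterior draws."""
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(samples, [tail, 1.0 - tail])
    return float(lower), float(upper)


def coverage(intervals, truth):
    """Percentage of intervals (lower, upper) that contain the true value."""
    hits = [lower <= truth <= upper for lower, upper in intervals]
    return float(100.0 * np.mean(hits))


rng = np.random.default_rng(3)
truth, n, reps = 0.5, 50, 100
intervals = []
for _ in range(reps):
    y = truth + rng.standard_normal(n)                           # one replicate dataset
    draws = y.mean() + rng.standard_normal(5000) / np.sqrt(n)    # posterior under a flat prior
    intervals.append(credible_interval(draws))

cov = coverage(intervals, truth)
print(cov)
```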

Figure 1: Credible intervals for DC-BATS for the linear regression with auto-regressive errors model (6) for $T=10^{5}$.

4.2 Generalized autoregressive conditional heteroskedasticity (GARCH) model

GARCH models (Bollerslev,, 1986) are very popular for modelling financial time series. These assume that the variance of the error term follows an autoregressive moving average process. Apart from finance, GARCH models have also been used in other domains such as healthcare (Nkalu and Edeme,, 2019) and engineering (Ma et al., 2017a, ). We consider the following GARCH model with covariates

\displaystyle\begin{aligned} X_{t}&=Z_{t}^{\top}b+\varepsilon_{t},\quad\varepsilon_{t}\sim\mathrm{N}(0,\sigma_{t}^{2}),\\ \sigma_{t}^{2}&=\omega^{2}+\sum_{i=1}^{q}\alpha_{i}\varepsilon_{t-i}^{2}+\sum_{j=1}^{p}\beta_{j}\sigma_{t-j}^{2},\end{aligned}

(7)

where $Z_{t}\in\mathbb{R}^{d}$ denotes covariates, $b\in\mathbb{R}^{d}$ denotes coefficients, $\alpha=(\alpha_{1},\dots,\alpha_{q})^{\top}\in\mathbb{R}_{+}^{q}$ , $\beta=(\beta_{1},\dots,\beta_{p})^{\top}\in\mathbb{R}_{+}^{p}$ , and $\omega\in\mathbb{R}_{+}$ are coefficients. We let $r=\max(p,q)$ and set $\sigma_{1}^{2}=\cdots=\sigma_{r}^{2}=1$ .

We choose small values of $T$ and $m$ to test DC-BATS. We generate a time series of length $T=2\times 10^{5}$ from model (7) for $d=5$ , $p=2$ , and $q=2$ . We set $\alpha=(0.16,0.16)^{\top},\beta=(0.16,0.16)^{\top}$ , and $\omega=1$ . The observations $X_{1:T}$ and corresponding variances $\sigma_{1:T}^{2}$ are plotted in Figure 2. We observe that the variances $\sigma_{t}^{2}$ vary significantly across time (up to several orders of magnitude), as do the observations. We divide observations into $K=10$ subsequences as before, and it is evident that this results in a lot of variation among the observations across subsequences.
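The data-generating process in model (7) can be sketched as follows. The sketch follows the initialization $\sigma_{1}^{2}=\cdots=\sigma_{r}^{2}=1$ stated above; the value of $b$ and the standard normal covariates are assumptions (the paper does not specify them for this experiment), and `simulate_garch` is a hypothetical name.

```python
import numpy as np


def simulate_garch(T, b, alpha, beta, omega, rng):
    """Simulate X_t = Z_t' b + eps_t with GARCH variance sigma_t^2 as in model (7)."""
    q, p = len(alpha), len(beta)
    r = max(p, q)
    Z = rng.standard_normal((T, len(b)))     # covariates (assumed standard normal)
    z = rng.standard_normal(T)               # standardized innovations
    sigma2 = np.ones(T)                      # sigma_1^2 = ... = sigma_r^2 = 1
    eps = np.empty(T)
    eps[:r] = z[:r]                          # eps_t ~ N(0, 1) for the first r steps
    for t in range(r, T):
        arch = sum(alpha[i] * eps[t - 1 - i] ** 2 for i in range(q))
        garch = sum(beta[j] * sigma2[t - 1 - j] for j in range(p))
        sigma2[t] = omega ** 2 + arch + garch
        eps[t] = np.sqrt(sigma2[t]) * z[t]
    return Z @ b + eps, Z, sigma2


rng = np.random.default_rng(4)
X, Z, sigma2 = simulate_garch(T=2 * 10**5, b=np.ones(5),
                              alpha=np.array([0.16, 0.16]),
                              beta=np.array([0.16, 0.16]),
                              omega=1.0, rng=rng)
print(sigma2.min(), sigma2.max())
```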

We place a $\mathrm{Gamma}(3,10)$ prior on $\omega$, independent $\mathrm{N}(0,10^{2})$ priors on each component of $b$, and independent half-normal $\mathrm{N}_{+}(0,10^{2})$ priors (that is, normal distributions restricted to the positive real line) on each component of $\alpha$ and $\beta$. We draw $10^{4}$ samples from each subsequence posterior as well as from the full posterior using NUTS, of which the first half are discarded as burn-in in each case. We present boxplots of the effective sample size (ESS) for each parameter in Figure 3, which show the distribution of the ESS across the different machines. The ESS is satisfactory for each parameter: the median ESS is over $40\%$ of the total sample size for every parameter. In comparison, the full MCMC method yields an ESS of $53.6\%$ of the total sample size. We conclude that the full MCMC method and the divide-and-conquer MCMC are not significantly different regarding effective sample size. It took around nine minutes to sample from the full posterior, and about a minute on average to sample from each subsequence posterior.

We compare the credible intervals produced by DC-BATS with those obtained by running MCMC on the full dataset, as well as those obtained using the double parallel Monte Carlo (DPMC) algorithm of Wang and Srivastava, (2023), Laplace’s approximation (Kass et al.,, 1991), and automatic differentiation variational inference (ADVI; Kucukelbir et al.,, 2017) in Figure 4. We observe that DC-BATS produces more accurate estimates of the credible intervals than those using the DPMC algorithm, Laplace’s approximation, or ADVI.

In addition to this single simulation example, we also run simulations under multiple values of $(\alpha,\beta)$, representing different degrees of mixing in the time series, and with different numbers of machines $K$. In this simulation, we fix $\omega^{2}=1$, $b=(1,\dots,1)^{\top}$, $d=5$, $p=2$, $q=2$. We let $\alpha_{1}=\dots=\alpha_{q}=\beta_{1}=\dots=\beta_{p}=\gamma^{-1}(p+q)^{-1}$, where $\gamma\geq 1$ controls the mixing of $\sigma_{t}^{2}$. We vary $\gamma\in\{1,2,5\}$ and $K\in\{5,10,20\}$. The process is stationary when $\gamma>1$, and has a unit root when $\gamma=1$. The results are presented in Table 2. We observe that, as expected, the empirical posterior produced by DC-BATS is closer in Wasserstein-2 distance to the empirical posterior based on the full dataset than those produced by the other methods.

	$K=5$			$K=10$			$K=20$
	$\gamma=1$	$\gamma=2$	$\gamma=5$	$\gamma=1$	$\gamma=2$	$\gamma=5$	$\gamma=1$	$\gamma=2$	$\gamma=5$
DC-BATS	$0.025$	$0.016$	$0.012$	$0.023$	$0.015$	$0.012$	$0.021$	$0.015$	$0.012$
DPMC	$0.040$	$0.027$	$0.020$	$0.038$	$0.027$	$0.019$	$0.037$	$0.027$	$0.019$
Laplace’s approximation	$0.124$	$0.087$	$0.081$	$0.124$	$0.087$	$0.081$	$0.124$	$0.087$	$0.081$
ADVI	$0.032$	$0.023$	$0.018$	$0.032$	$0.023$	$0.018$	$0.032$	$0.023$	$0.018$

Table 2: Wasserstein-2 distance between the full MCMC posterior distribution and the approximations using different methods.
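The paper does not spell out how the distances in Table 2 are computed or aggregated over parameters. As a hedged illustration of the basic building block, the Wasserstein-2 distance between two one-dimensional empirical distributions with the same number of draws can be computed from sorted samples; the function name `w2_empirical_1d` and the synthetic draws are assumptions.

```python
import numpy as np


def w2_empirical_1d(x, y):
    """W2 distance between two 1-D empirical measures with equally many atoms."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal coupling matches order statistics.
    return float(np.sqrt(np.mean((x - y) ** 2)))


rng = np.random.default_rng(5)
full = rng.normal(0.0, 1.0, size=5000)     # stand-in for full-posterior draws
approx = full + 0.3                        # stand-in for an approximation shifted by 0.3
print(w2_empirical_1d(full, approx))       # a pure shift of 0.3 gives distance 0.3
```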

4.3 Hidden Markov models

We consider inference for continuous state-space hidden Markov models (HMMs; Rabiner and Juang, 1986) in this section. An HMM is a process $\{(Z_{t},X_{t})\}_{t=0}^{T}$, where $\{Z_{t}\}_{t=0}^{T}$ is an unobserved Markov chain, and each observation $X_{t}\in\mathbb{X}$ is conditionally independent of the rest of the process given $Z_{t}\in\mathbb{Z}$. We consider a linear Gaussian model having the form:

\displaystyle\begin{aligned} Z_{0}&\sim\mathrm{N}(\mu_{0},\Sigma_{0}),\\ Z_{t}\mid Z_{t-1}&\sim\mathrm{N}(AZ_{t-1},\Sigma_{z}),\quad t\geq 1,\\ X_{t}\mid Z_{t}&\sim\mathrm{N}(CZ_{t},\Sigma_{x}),\quad t\geq 0.\end{aligned}

(8)

Closely related models have numerous applications ranging from guidance and navigation to robotics (Musoff and Zarchan,, 2009).

We consider a two-dimensional latent Markov chain (that is, $\mathbb{Z}=\mathbb{R}^{2}$ ) and a two-dimensional observation space (that is, $\mathbb{X}=\mathbb{R}^{2}$ ). In particular, we generate $T=10^{3}$ observations from model (8) with true parameter values

\displaystyle A=\begin{pmatrix}0.9&-0.3\\ 0.2&1\end{pmatrix},\quad C=\begin{pmatrix}-1.1&0.5\\ -0.3&0.8\end{pmatrix},\quad\Sigma_{x}=\begin{pmatrix}\sigma_{x}^{2}&0\\ 0&\sigma_{x}^{2}\end{pmatrix},\quad\Sigma_{z}=\begin{pmatrix}\sigma_{z}^{2}&0\\ 0&\sigma_{z}^{2}\end{pmatrix},

with $\sigma_{x}^{2}=\sigma_{z}^{2}=0.5$ . We plot the observed process in Figure 5, where we see that it does not appear to be stationary. We nonetheless test DC-BATS on this model.
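A sketch of simulating from model (8) with these parameter values is given below. The initial distribution ($\mu_{0}=0$, $\Sigma_{0}=I$) is not stated in the paper and is an assumption, as is the helper name `simulate_lgssm`; the code also reports the spectral radius of $A$ as a simple diagnostic of how persistent the latent chain is.

```python
import numpy as np

A = np.array([[0.9, -0.3], [0.2, 1.0]])
C = np.array([[-1.1, 0.5], [-0.3, 0.8]])
sigma2_x = sigma2_z = 0.5


def simulate_lgssm(T, A, C, sigma2_z, sigma2_x, mu0, Sigma0, rng):
    """Simulate latent states Z_0..Z_T and observations X_0..X_T from model (8)."""
    dz, dx = A.shape[0], C.shape[0]
    Z = np.empty((T + 1, dz))
    X = np.empty((T + 1, dx))
    Z[0] = rng.multivariate_normal(mu0, Sigma0)
    X[0] = C @ Z[0] + np.sqrt(sigma2_x) * rng.standard_normal(dx)
    for t in range(1, T + 1):
        Z[t] = A @ Z[t - 1] + np.sqrt(sigma2_z) * rng.standard_normal(dz)
        X[t] = C @ Z[t] + np.sqrt(sigma2_x) * rng.standard_normal(dx)
    return Z, X


rng = np.random.default_rng(6)
# Indexing t = 0, ..., T as in model (8) yields T + 1 rows.
Z, X = simulate_lgssm(10**3, A, C, sigma2_z, sigma2_x,
                      mu0=np.zeros(2), Sigma0=np.eye(2), rng=rng)

spectral_radius = float(np.max(np.abs(np.linalg.eigvals(A))))
print(spectral_radius)
```

For the matrix $A$ above the eigenvalues form a complex-conjugate pair with modulus $\sqrt{0.96}\approx 0.98$, so the latent chain is strongly persistent and oscillatory over a window of this length.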

We fix $C$ at its true value and consider inference for $A$, $\sigma_{x}^{2}$, and $\sigma_{z}^{2}$. We place independent $\mathrm{N}(0,10^{2})$ priors on each component of the matrix $A\in\mathbb{R}^{2\times 2}$. We also place independent $\log\mathrm{N}(0,10^{2})$ priors on $\sigma_{x}^{2}$ and $\sigma_{z}^{2}$, where $\log\mathrm{N}$ denotes a log-normal distribution. We choose $K=5$ subsequences. We write code in Python and collect $10^{4}$ posterior samples using an adaptive random walk Metropolis-Hastings algorithm (Haario et al., 2001) from each subsequence posterior, as well as from the full posterior. It took around 48 minutes for the MCMC algorithm to sample from the full posterior, and 11 minutes on average for each subsequence posterior. We display 95% credible intervals for DC-BATS and MCMC on the full dataset in Table 3, where we observe that the credible intervals provided by DC-BATS closely match those from MCMC on the full dataset.

	$\sigma_{x}^{2}$	$\sigma_{z}^{2}$
DC-BATS	$(0.463,0.572)$	$(0.396,0.540)$
MCMC on full posterior	$(0.464,0.576)$	$(0.398,0.538)$

	$A_{11}$	$A_{12}$	$A_{21}$	$A_{22}$
DC-BATS	$(0.900,0.924)$	$(-0.316,-0.288)$	$(0.191,0.215)$	$(0.979,1.007)$
MCMC on full posterior	$(0.895,0.925)$	$(-0.327,-0.289)$	$(0.182,0.209)$	$(0.964,1.000)$

Table 3: 95% posterior credible intervals for parameters of the hidden Markov model (8).

4.4 Binary auto-regressive model

We consider a model with binary observations $X_{t}\in\{0,1\}$ modelled as

\displaystyle\mathbb{P}(X_{t}=1)=\frac{1}{1+\exp\{-(c+\sum_{i=1}^{p}\alpha_{i}X_{t-i}+Z_{t}^{\top}b)\}},

(9)

where $\alpha=(\alpha_{1},\dots,\alpha_{p})^{\top}\in\mathbb{R}^{p}$, $Z_{t}\in\mathbb{R}^{q}$ denotes covariates at time $t$, and $b=(b_{1},\dots,b_{q})^{\top}\in\mathbb{R}^{q}$ denotes the coefficients of the covariates. In this example, we choose $p=5$ and $q=5$. We set $\alpha=(0.25,0.25,0.25,0.25,0.25)^{\top}$ and $b=(0.5,0.5,0,-0.5,-0.5)^{\top}$. We generate $T=2\times 10^{5}$ synthetic observations. We choose the covariates $Z_{t}$ to be non-stationary. We plot the observations $X_{1:T}$ and the corresponding success probabilities $(\mathbb{P}(X_{1}=1),\dots,\mathbb{P}(X_{T}=1))$ in Figure 6.
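Model (9) can be simulated as sketched below. Several ingredients are assumptions because the paper does not specify them: the intercept $c=0$, the initial values $X_{t}=0$ for $t\leq 0$, and the particular form of non-stationary covariates (standard normal noise around a linear drift). The name `simulate_binary_ar` is hypothetical.

```python
import numpy as np


def simulate_binary_ar(c, alpha, b, Z, rng):
    """Simulate binary X_t with success probability given by model (9)."""
    T, p = Z.shape[0], len(alpha)
    X = np.zeros(T + p, dtype=int)        # first p entries are the initial values (assumed 0)
    probs = np.empty(T)
    zb = Z @ b                            # covariate contribution Z_t' b for every t
    u = rng.random(T)
    for t in range(T):
        lags = X[t:t + p][::-1]           # X_{t-1}, ..., X_{t-p}
        eta = c + alpha @ lags + zb[t]
        probs[t] = 1.0 / (1.0 + np.exp(-eta))
        X[t + p] = int(u[t] < probs[t])
    return X[p:], probs


rng = np.random.default_rng(7)
T, q = 2 * 10**5, 5
drift = np.linspace(-1.0, 1.0, T)[:, None]          # assumed source of non-stationarity
Z = rng.standard_normal((T, q)) + drift
alpha = np.full(5, 0.25)
b = np.array([0.5, 0.5, 0.0, -0.5, -0.5])

X, probs = simulate_binary_ar(c=0.0, alpha=alpha, b=b, Z=Z, rng=rng)
print(X.mean(), probs.min(), probs.max())
```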

We consider independent $\mathrm{N}(0,10^{2})$ priors on $c$ and on each component of $\alpha$ and $b$. We draw $10^{4}$ samples from the posterior $p(c,\alpha,b\mid X_{1:T})$ using DC-BATS as well as MCMC on the full dataset. We use NUTS and discard half of the $10^{4}$ samples as burn-in in each case. It took around twelve minutes to sample from the full posterior, and a little over a minute on average to sample from each subsequence posterior. We compare the credible intervals produced by DC-BATS, full data MCMC, the double parallel Monte Carlo (DPMC) algorithm of Wang and Srivastava (2023), Laplace’s approximation and ADVI in Figure 7. The credible intervals from DC-BATS are virtually indistinguishable from those for full data MCMC, while the intervals for DPMC deviate for certain parameters. We also present boxplots of the effective sample size (ESS) for each parameter in Figure 8, which show the distribution of the ESS across the different machines. We observe that the ESS is satisfactory for each parameter: the median ESS is over $40\%$ of the total sample size for every parameter. In comparison, the full MCMC method yields an ESS of $48.1\%$ of the total sample size. We again conclude that the full MCMC method and the divide-and-conquer MCMC are not significantly different regarding effective sample size.

In addition to this single simulation example, we also run simulations under multiple values of $(\alpha,b)$, representing different degrees of mixing in the time series, and with different numbers of machines $K$. In this simulation, we fix $c=0$, $p=5$, $q=5$. We let $\alpha_{1}=\dots=\alpha_{p}=b_{1}=\dots=b_{q}=\gamma^{-1}(p+q)^{-1}$ with $\gamma\geq 1$; the parameter $\gamma$ regulates the level of autocorrelation of the model. We use the Wasserstein-2 distance between the obtained posterior distribution and the full posterior distribution as a metric to compare the performance. The results are presented in Table 4. Laplace’s approximation performs slightly better than DC-BATS in every setting, and DPMC performs at least as well as DC-BATS for $K=10$ and $K=20$, although not for $K=5$. DC-BATS performs significantly better than ADVI throughout.

	$K=5$			$K=10$			$K=20$
	$\gamma=1$	$\gamma=2$	$\gamma=5$	$\gamma=1$	$\gamma=2$	$\gamma=5$	$\gamma=1$	$\gamma=2$	$\gamma=5$
DC-BATS	$0.173$	$0.123$	$0.119$	$0.176$	$0.119$	$0.115$	$0.172$	$0.119$	$0.121$
DPMC	$0.181$	$0.132$	$0.121$	$0.175$	$0.112$	$0.111$	$0.161$	$0.116$	$0.121$
Laplace’s approximation	$0.161$	$0.115$	$0.102$	$0.161$	$0.115$	$0.102$	$0.161$	$0.115$	$0.102$
ADVI	$0.223$	$0.246$	$0.212$	$0.223$	$0.246$	$0.212$	$0.223$	$0.246$	$0.212$

Table 4: Average Wasserstein-2 distance between the full MCMC posterior distribution and the approximations of different methods for the binary auto-regressive model (9).

5 Application to Los Angeles particulate matter data

It is well understood that aerosol particulates have a significant impact on human health, and hence understanding the dynamics of particulate matter (PM) is important in public health decision making. Modern sampling technologies have made high-resolution air monitoring possible. This makes the data produced by such monitors massive, and hence challenging to analyze with Bayesian methods. To tackle this computational challenge, we apply DC-BATS to analyze a Los Angeles air quality dataset obtained from the Environmental Protection Agency (EPA); the dataset is available online at https://www.epa.gov/outdoor-air-quality-data.

This dataset consists of $T=8760$ hourly measurements of particulates including $\text{PM}_{\text{10}}$ (1% missingness) and $\text{PM}_{\text{2.5}}$ (3.5% missingness) in Los Angeles during 2017. This dataset has some clearly invalid measurements, which we treat as missing data. We apply Kalman smoothing imputation as suggested in Hyndman and Khandakar, (2008) to simplify handling of the missing data. After imputation, we transform both PM observations by $\log(0.1+\text{PM})$ . Our overarching goal is to build an interpretable model that can capture the dynamics of these particulates.

We plot the values of the particulates over time after preprocessing in Figure 9. It is clear that the variance of the observations changes over time. In order to capture the evolution of variance within a series and correlation across series, we consider a bivariate GARCH model with constant conditional correlation (Bollerslev,, 1990) as follows:

\displaystyle\begin{aligned} X_{t}&=\mu+v_{t},\quad v_{t}\sim\mathrm{N}_{2}\left(0,H_{t}\right),\\ H_{t,ii}&=w_{i}+a_{i}v^{2}_{t-1,ii}+b_{i}H_{t-1,ii},\\ H_{t,ij}&=rH_{t,ii}^{1/2}H_{t,jj}^{1/2},\quad i,j=1,2,\end{aligned}

(10)

for $t=1,\dots,T$ , where $X_{t}\in\mathbb{R}^{2}$ is the ( $\text{PM}_{\text{10}}$ , $\text{PM}_{\text{2.5}}$ ) levels at time $t$ . We assume $X_{t}$ is the sum of a time-independent mean $\mu\in\mathbb{R}^{2}$ and a time-dependent innovation $v_{t}\in\mathbb{R}^{2}$ whose covariance matrix $H_{t}$ evolves with time $t$ . We assume that each variance $H_{t,ii}$ follows a univariate GARCH process, regressed on an intercept term $w_{i}\in\mathbb{R}_{+}$ , a lag- $1$ innovation $v^{2}_{t-1,ii}$ , and the lag- $1$ variance $H_{t-1,ii}$ through coefficients $a_{i},b_{i}\in\mathbb{R}_{+}$ for $i=1,2$ . We also assume the correlation between particulates is time-independent, which is captured by $r\in[-1,1]$ .
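The variance recursion and Gaussian likelihood implied by model (10) can be sketched as follows. The sketch reads $v^{2}_{t-1,ii}$ as the squared $i$th component of $v_{t-1}$ and applies the correlation formula to the off-diagonal entry only; both readings, the initial variances `h0`, the synthetic data, and the function names are assumptions rather than details given in the paper.

```python
import numpy as np


def ccc_garch_covariances(X, mu, w, a, b, r, h0):
    """Conditional covariance matrices H_t of the bivariate CCC-GARCH model (10)."""
    T = X.shape[0]
    v = X - mu                                    # innovations v_t, shape (T, 2)
    h = np.empty((T, 2))                          # diagonal entries H_{t,11}, H_{t,22}
    h[0] = h0                                     # initial variances (assumed)
    for t in range(1, T):
        h[t] = w + a * v[t - 1] ** 2 + b * h[t - 1]
    H = np.empty((T, 2, 2))
    H[:, 0, 0] = h[:, 0]
    H[:, 1, 1] = h[:, 1]
    H[:, 0, 1] = H[:, 1, 0] = r * np.sqrt(h[:, 0] * h[:, 1])
    return H


def gaussian_loglik(X, mu, H):
    """Sum over t of log N_2(X_t; mu, H_t)."""
    v = X - mu
    _, logdet = np.linalg.slogdet(H)              # batched log-determinants
    quad = np.einsum("ti,tij,tj->t", v, np.linalg.inv(H), v)
    return float(-0.5 * np.sum(2.0 * np.log(2.0 * np.pi) + logdet + quad))


rng = np.random.default_rng(8)
mu = np.array([3.0, 2.0])
X = mu + rng.standard_normal((500, 2))            # synthetic stand-in for the PM series
H = ccc_garch_covariances(X, mu, w=np.array([0.1, 0.1]), a=np.array([0.5, 0.7]),
                          b=np.array([0.1, 0.01]), r=0.25, h0=np.ones(2))
print(gaussian_loglik(X, mu, H))
```

A likelihood of this form, combined with the priors described next, is what a sampler such as NUTS would target on each subsequence.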

We adopt a diffuse prior distribution $\mathrm{N}(0.5,10^{6})$ for each of $a_{i},b_{i},\mu_{i}$, and a $\mathrm{N}(1.0,10^{6})$ prior distribution for each $w_{i}$. Since it is well known a priori that these particulates are positively correlated, we adopt a $\mathrm{Uniform}(0,1)$ prior distribution for $r$. We draw $10^{4}$ samples from the posterior $p(a,b,w,\mu,r\mid X_{1:T})$, where $a=(a_{1},a_{2})$ and $b=(b_{1},b_{2})$, using DC-BATS with $K=10$ subsequences as well as MCMC on the full dataset. We use NUTS and discard half of the $10^{4}$ samples as burn-in in each case. It took around $24$ minutes to sample from the full posterior, and $3.8$ minutes on average to sample from each subsequence posterior. We compare the credible intervals produced by DC-BATS and full data MCMC in Table 5. The intervals from the two methods are of similar magnitude and width for every parameter, although several of them (for example, those for $a_{2}$, $\mu_{1}$, and $\mu_{2}$) do not overlap.

	$a_{1}$	$a_{2}$
DC-BATS	$(5.12\times 10^{-1},5.85\times 10^{-1})$	$(6.50\times 10^{-1},7.30\times 10^{-1})$
MCMC on full posterior	$(5.33\times 10^{-1},6.19\times 10^{-1})$	$(8.76\times 10^{-1},9.76\times 10^{-1})$

	$b_{1}$	$b_{2}$
DC-BATS	$(6.20\times 10^{-2},1.32\times 10^{-1})$	$(8.59\times 10^{-5},1.09\times 10^{-2})$
MCMC on full posterior	$(1.21\times 10^{-1},2.12\times 10^{-1})$	$(6.00\times 10^{-5},7.46\times 10^{-3})$

	$w_{1}$	$w_{2}$
DC-BATS	$(1.22\times 10^{-1},1.42\times 10^{-1})$	$(2.01\times 10^{-1},2.24\times 10^{-1})$
MCMC on full posterior	$(9.07\times 10^{-2},1.10\times 10^{-1})$	$(1.23\times 10^{-1},1.40\times 10^{-1})$

	$\mu_{1}$	$\mu_{2}$	$r$
DC-BATS	$(3.12,3.14)$	$(1.95,1.98)$	$(2.57\times 10^{-1},2.78\times 10^{-1})$
MCMC on full posterior	$(3.25,3.28)$	$(2.10,2.12)$	$(2.32\times 10^{-1},2.54\times 10^{-1})$

Table 5: 95% posterior credible intervals for parameters of the bivariate GARCH model (10) applied to the Los Angeles particulate matter (PM) dataset.

6 Discussion

We have proposed a simple divide-and-conquer approach for Bayesian inference from stationary time series. There are several natural follow-up directions. In our theoretical development, we have assumed that the time series is stationary and mixes fast; it would be interesting to relax these assumptions and develop scalable posterior inference algorithms for non-stationary time series, as well as for series with long-range dependence. Although our current algorithm has promising empirical results in certain simulation experiments with non-stationarity, we expect long-range dependence to present a more challenging problem.

We have not considered the problem of defining a ‘best’ choice of the number and lengths of the subsequences, and have instead focused on experiments testing our algorithm in challenging cases in which the subset sizes are modest and/or assumptions of our theory are violated. In practice, for truly massive datasets, one should ideally run MCMC in parallel for the different subsequences; the best choice of $m$ and $K$ depends on a tradeoff between statistical accuracy and one's computational budget in terms of wall clock time, number of nodes in a distributed computing network, and the capacity of each node. As a rule of thumb, approximation accuracy should improve with subsequence length as long as one's computational budget allows sufficient MCMC draws per subsequence posterior. Our simulation results are promising in suggesting that, at least in certain cases, accuracy is high even with short subsequences. However, this depends on the model and data.

Two additional important future directions include (1) modifying the simple divide-and-conquer algorithm we are proposing to allow communication between nodes; and (2) modifying the algorithm and/or theory to allow guarantees for fixed finite subsequence sizes. There has been work on both threads outside of the time series setting; for example, refer to Dai et al., (2023).

Acknowledgement

DS and DD acknowledge support from National Science Foundation grant 1546130. DS acknowledges support from grant DMS-1638521 from SAMSI. The authors thank Cheng Li for his insightful comments that led to improvements in this paper.

Appendix A Additional assumptions

We make the following additional assumptions to prove the theoretical results. These are similar to assumptions made by Li et al., (2017).

Assumption 4 (Support).

For all $t\geq 1$ and all $\theta\in\Theta$ , all possible conditional distributions $X_{t}\mid X_{1:(t-1)}$ have the same support as the stationary distribution of $X_{t}$ .

Many classes of time series models satisfy Assumption 4, including the ones that are considered in this paper. However, exceptions to this assumption include time-varying time series models where the support of the conditional densities changes with time $t$.

Assumption 5 (Envelope).

This consists of three parts.

1.

$\log p_{\theta}(x_{t}\mid x_{1:(t-1)})$ is three times differentiable with respect to $\theta$ in a neighbourhood $B_{\delta_{0}}(\theta_{0})=\{\theta\in\Theta:\|\theta-\theta_{0}\|\leq\delta_{0}\}$ of $\theta_{0}$ for some constant $\delta_{0}>0$.

2.

There exist functions $M_{t}(x_{1:t})$, $t\geq 1$, such that

	$\displaystyle\sup_{\theta\in B_{\delta_{0}}(\theta_{0})}\left\|\frac{\partial}{\partial\theta_{l_{1}}}\log p_{\theta}(x_{t}\mid x_{1:(t-1)})\right\|$	$\displaystyle\leq M_{t}(x_{1:t}),$
	$\displaystyle\sup_{\theta\in B_{\delta_{0}}(\theta_{0})}\left\|\frac{\partial^{2}}{\partial\theta_{l_{1}}\partial\theta_{l_{2}}}\log p_{\theta}(x_{t}\mid x_{1:(t-1)})\right\|$	$\displaystyle\leq M_{t}(x_{1:t}),$
	$\displaystyle\sup_{\theta\in B_{\delta_{0}}(\theta_{0})}\left\|\frac{\partial^{3}}{\partial\theta_{l_{1}}\partial\theta_{l_{2}}\partial\theta_{l_{3}}}\log p_{\theta}(x_{t}\mid x_{1:(t-1)})\right\|$	$\displaystyle\leq M_{t}(x_{1:t}),$

for $l_{1},l_{2},l_{3}=1,\dots,d$ for all $\{x_{t}\}_{t\geq 1}$ .

3.

$\limsup\limits_{T\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{T^{-1}\sum_{t=1}^{T}M_{t}(X_{1:t})^{4+2\delta}\}<\infty$, where $\delta$ is the same as that in Assumption 2.

Assumptions 2 and 5 together imply a trade-off between moments and the mixing rate of the process $\{X_{t}\}_{t\geq 1}$. On the one hand, a higher value of $\delta$ imposes a greater restriction on the moments; on the other hand, a slower decay rate of $\alpha(j)$ then suffices for $\sum_{j=1}^{\infty}\alpha(j)^{\delta/(2+\delta)}<\infty$ to hold, thus imposing less restriction on the mixing rate of the process $\{X_{t}\}_{t\geq 1}$.

Assumption 5 can be verified when the conditional log-likelihood of $x_{t}$ depends only on a finite number of past observations, that is, when there exists an integer $p>0$ such that $\log p_{\theta}(x_{t}\mid x_{1:(t-1)})=\log p_{\theta}(x_{t}\mid x_{(t-p):(t-1)})$. In this case, it suffices to find an envelope function $M(X_{1:p})$ such that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}M^{4+2\delta}(X_{1:p})<+\infty$, where $X_{1:p}$ is distributed according to the stationary distribution. By the $L^{p}$ ergodic theorem of von Neumann (Neumann, 1932), we have

\displaystyle\limsup\limits_{T\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\left\{T^{-1}\sum_{t=1}^{T}M_{t}(X_{1:t})^{4+2\delta}\right\}=\mathbb{E}_{\mathbb{P}_{\theta_{0}}}M^{4+2\delta}(X_{1:p})<+\infty.

Assumption 6.

The interchange of the order of integration with respect to $\mathbb{P}_{\theta_{0}}$ and differentiation with respect to $\theta$ at $\theta_{0}$ is justified. The score function $\nabla_{\theta}\ell_{k}(\theta)$ is a martingale at $\theta=\theta_{0}$ for $m\geq 1$ . Moreover,

-T^{-1}\frac{\partial^{2}}{\partial\theta\partial\theta^{\top}}\log p_{\theta_{0}}(X_{1:T})\stackrel{{\scriptstyle\text{a.s.}}}{{\to}}I(\theta_{0})\ \text{in}\ \mathbb{P}_{\theta_{0}}\text{-probability as }T\to\infty,

where $I(\theta_{0})$ is a positive definite matrix. Further, for all sufficiently large $m$ , $-m^{-1}\nabla_{\theta}^{2}\ell_{k}(\theta)$ is positive definite with eigenvalues bounded below and above by constants for all $\theta\in B_{\delta_{0}}(\theta_{0})$ and all values of $\mathbf{X}_{[k]}$ .

Assumption 7.

For any $\delta>0$ , there exists an $\epsilon>0$ such that

\lim_{m\to\infty}\mathbb{P}_{\theta_{0}}\left(\sup_{\theta\in\Theta\,:\,\|% \theta-\theta_{0}\|\geq\delta}\frac{\ell_{1}(\theta)-\ell_{1}(\theta_{0})}{m}% \leq-\epsilon\right)=1.

We assume that the prior $\pi_{0}(\theta)$ has a finite second moment; this is a mild assumption and is required because we use the $\mathrm{W}_{2}$ distance to combine the subsequence posteriors.

Assumption 8 (Prior).

The prior density $\pi_{0}(\theta)$ is continuous at $\theta_{0}$ . Moreover, $0<\pi_{0}(\theta_{0})<\infty$ . The second moment of the prior exists, that is, $\int_{\theta}\|\theta\|^{2}\,\pi_{0}(\theta)\,\mathrm{d}\theta<\infty$ .

Assumption 9 (Uniform integrability).

Let $\psi(\mathbf{X}_{[1]})=\mathbb{E}_{\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[1]% })}\{Km\|\theta-\widehat{\theta}_{1}\|^{2}\}$ , where $\mathbb{E}_{\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[1]})}$ is the expectation with respect to $\theta$ under the posterior $\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[1]}).$ Then there exists an integer $m_{0}\geq 1,$ such that $\{\psi(\mathbf{X}_{[1]}):m\geq m_{0},K\geq 1\}$ is uniformly integrable under $\mathbb{P}_{\theta_{0}}.$ In other words,

\lim_{C\to\infty}\sup_{m\geq m_{0},\,K\geq 1}\mathbb{E}_{\mathbb{P}_{\theta_{0% }}}\left[\psi(\mathbf{X}_{[1]})\mathbb{I}\{\psi(\mathbf{X}_{[1]})\geq C\}% \right]=0,

where $\mathbb{I}(\cdot)$ is the indicator function.

Assumptions 4, 5, 6 and 9 are generalizations of Assumptions 2, 3, 4, 7 of Li et al., (2017), respectively. These can be verified more straightforwardly if there exists a finite integer $n$ such that $p_{\theta}(x_{t}\mid x_{1:(t-1)})\equiv p_{\theta}(x_{t}\mid x_{(t-n-1):(t-1)})$ . In this case, $M_{t}\equiv M(X_{(t-n-1):(t-1)})$ is also an ergodic sequence. To verify Assumption 5, by the $L^{p}$ -ergodic theorem, it suffices to find a $\delta$ such that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{M(X_{(t-n-1):(t-1)})^{4+2\delta}\}<\infty$ , where the expectation is with respect to the stationary distribution of $X_{t-n-1},\dots,X_{t-1}$ . Assumption 6 is a generalization of a common regularity condition to dependent processes. Again, in view of Assumption 5 that bounds the moment of the second order derivative of the log density function, the ergodic theorem will hold automatically to guarantee that

-T^{-1}\frac{\partial^{2}}{\partial\theta\partial\theta^{\top}}\log p_{\theta_{0}}(x_{1:T})\stackrel{{\scriptstyle\text{a.s.}}}{{\to}}I(\theta_{0})\quad\text{in }\mathbb{P}_{\theta_{0}}\text{-probability}.

Assumption 7 will also hold if for any $\delta>0$ , there exists an $\epsilon>0$ such that $m^{-1}\inf_{\theta\in\Theta\,:\,\|\theta-\theta_{0}\|\geq\delta}\mathrm{KL}(p_% {\theta}\,\|\,p_{\theta_{0}})>\epsilon$ for all sufficiently large $m$ , where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. Assumption 9 mirrors Assumption 7 in Li et al., (2017).

References

Agueh and Carlier, (2011) Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924.
Aicher et al., (2019) Aicher, C., Ma, Y.-A., Foti, N. J., and Fox, E. B. (2019). Stochastic gradient MCMC for state space models. SIAM Journal on Mathematics of Data Science, 1(3):555–587.
Beal, (2003) Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, UCL (University College London).
Bickel and Freedman, (1981) Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196–1217.
Bickel et al., (1998) Bickel, P. J., Ritov, Y., and Ryden, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. The Annals of Statistics, 26(4):1614–1635.
Bierkens et al., (2019) Bierkens, J., Fearnhead, P., and Roberts, G. (2019). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320.
Bini and Capovani, (1983) Bini, D. and Capovani, M. (1983). Spectral and computational properties of band symmetric Toeplitz matrices. Linear Algebra and its Applications, 52:99–126.
Blei et al., (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
Bollerslev, (1986) Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327.
Bollerslev, (1990) Bollerslev, T. (1990). Modelling the coherence in short-run nominal exchange rates: a multivariate generalized ARCH model. The Review of Economics and Statistics, 72(3):498–505.
Bouchard-Côté et al., (2018) Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113(522):855–867.
Carpenter et al., (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.
Chen et al., (2014) Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691. PMLR.
Cinlar, (2011) Cinlar, E. (2011). Probability and Stochastics. Springer.
Cuturi and Doucet, (2014) Cuturi, M. and Doucet, A. (2014). Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693. PMLR.
Dai et al., (2019) Dai, H., Pollock, M., and Roberts, G. (2019). Monte Carlo fusion. Journal of Applied Probability, 56(1):174–191.
Dai et al., (2023) Dai, H., Pollock, M., and Roberts, G. O. (2023). Bayesian fusion: scalable unification of distributed statistical analyses. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(1):84–107.
Del Moral et al., (2006) Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436.
Dharmadhikari et al., (1968) Dharmadhikari, S., Fabian, V., and Jogdeo, K. (1968). Bounds on the moments of martingales. The Annals of Mathematical Statistics, 39(5):1719–1723.
Duane et al., (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.
Dvurechenskii et al., (2018) Dvurechenskii, P., Dvinskikh, D., Gasnikov, A., Uribe, C., and Nedich, A. (2018). Decentralize and randomize: Faster algorithm for Wasserstein barycenters. Advances in Neural Information Processing Systems, 31.
Foti et al., (2014) Foti, N., Xu, J., Laird, D., and Fox, E. (2014). Stochastic variational inference for hidden Markov models. In Advances in Neural Information Processing Systems, pages 3599–3607.
Guhaniyogi et al., (2017) Guhaniyogi, R., Li, C., Savitsky, T. D., and Srivastava, S. (2017). A divide-and-conquer Bayesian approach to large-scale kriging. arXiv preprint arXiv:1712.09767.
Haario et al., (2001) Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242.
Hastings, (1970) Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
Hoffman and Gelman, (2014) Hoffman, M. D. and Gelman, A. (2014). The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
Hyndman and Khandakar, (2008) Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3):1–22.
Johndrow et al., (2020) Johndrow, J. E., Pillai, N. S., and Smith, A. (2020). No free lunch for approximate MCMC. arXiv preprint arXiv:2010.12514.
Johnson and Willsky, (2014) Johnson, M. and Willsky, A. (2014). Stochastic variational inference for Bayesian time series models. In International Conference on Machine Learning, pages 1854–1862. PMLR.
Kass et al., (1991) Kass, R. E., Tierney, L., and Kadane, J. B. (1991). Laplace’s method in Bayesian analysis. Contemporary Mathematics, 115:89–99.
Kucukelbir et al., (2017) Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research.
Lauritzen, (1992) Lauritzen, S. L. (1992). Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108.
Li et al., (2017) Li, C., Srivastava, S., and Dunson, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika, 104(3):665–680.
Ma et al., (2017a) Ma, J., Xu, F., Huang, K., and Huang, R. (2017a). GNAR-GARCH model and its application in feature extraction for rolling bearing fault diagnosis. Mechanical Systems and Signal Processing, 93:175–203.
Ma et al., (2015) Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925.
Ma et al., (2017b) Ma, Y.-A., Foti, N. J., and Fox, E. B. (2017b). Stochastic gradient MCMC methods for hidden Markov models. In International Conference on Machine Learning, pages 2265–2274. PMLR.
Metropolis et al., (1953) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.
Minsker et al., (2014) Minsker, S., Srivastava, S., Lin, L., and Dunson, D. (2014). Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning, pages 1656–1664. PMLR.
Musoff and Zarchan, (2009) Musoff, H. and Zarchan, P. (2009). Fundamentals of Kalman filtering: a practical approach. American Institute of Aeronautics and Astronautics.
Neiswanger et al., (2014) Neiswanger, W., Wang, C., and Xing, E. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the 30th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 623–632.
Nemeth and Fearnhead, (2020) Nemeth, C. and Fearnhead, P. (2020). Stochastic gradient Markov chain Monte Carlo. Journal of the American Statistical Association, 116(533):1–18.
Neumann, (1932) Neumann, J. v. (1932). Proof of the quasi-ergodic hypothesis. Proceedings of the National Academy of Sciences, 18(1):70–82.
Nkalu and Edeme, (2019) Nkalu, C. N. and Edeme, R. K. (2019). Environmental hazards and life expectancy in Africa: evidence from GARCH model. SAGE Open, 9(1):215–222.
Quiroz et al., (2018) Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2018). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114(526):831–843.
Rabiner and Juang, (1986) Rabiner, L. and Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16.
Rio, (1993) Rio, E. (1993). Covariance inequalities for strongly mixing processes. Annales de l’IHP Probabilités et statistiques, 29(4):587–597.
Roberts and Tweedie, (1996) Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.
Salomone et al., (2020) Salomone, R., Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2020). Spectral subsampling MCMC for stationary time series. In International Conference on Machine Learning, pages 8449–8458. PMLR.
Scott et al., (2016) Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88.
Srivastava et al., (2015) Srivastava, S., Cevher, V., Dinh, Q., and Dunson, D. (2015). WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920. PMLR.
Srivastava et al., (2018) Srivastava, S., Li, C., and Dunson, D. B. (2018). Scalable Bayes via barycenter in Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346.
Szabó and Van Zanten, (2019) Szabó, B. and Van Zanten, H. (2019). An asymptotic analysis of distributed nonparametric methods. Journal of Machine Learning Research, 20(87):1–30.
Villani, (2009) Villani, C. (2009). Optimal Transport: Old and New, volume 338. Springer.
Villani et al., (2022) Villani, M., Quiroz, M., Kohn, R., and Salomone, R. (2022). Spectral subsampling MCMC for stationary multivariate time series with applications to vector ARTFIMA processes. Econometrics and Statistics.
Wang and Srivastava, (2023) Wang, C. and Srivastava, S. (2023). Divide-and-conquer Bayesian inference in hidden Markov models. Electronic Journal of Statistics, 17(1):895–947.
Wang and Dunson, (2013) Wang, X. and Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
Wang et al., (2015) Wang, X., Guo, F., Heller, K. A., and Dunson, D. B. (2015). Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pages 451–459.
Welling and Teh, (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688. PMLR.
Whittle, (1951) Whittle, P. (1951). Hypothesis testing in time series analysis. Almqvist and Wiksell, Uppsala.
Xue and Liang, (2019) Xue, J. and Liang, F. (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Statistics and Computing, 29(1):23–32.

Supplementary Material

Appendix S1 Main proofs

Simplifying notation: we use $(X_{1k},\dots,X_{mk})=\mathbf{X}_{[k]}$ to denote the observations within the $k$ th subsequence. We also use $\ell_{k}^{\prime}(\theta)$ and $\ell_{k}^{\prime\prime}(\theta)$ to denote $\nabla_{\theta}\ell_{k}(\theta)$ and $\nabla_{\theta}^{2}\ell_{k}(\theta)$ , respectively; this notation will be handy as we shall consider $\ell_{k}^{\prime}(\theta)$ and $\ell_{k}^{\prime\prime}(\theta)$ at different values of $\theta$ .

The proofs of Theorems 1 and 2 rely on the following lemmas in addition to Lemma 1.

Lemma S1.

Suppose that Assumptions 1–9 hold. Then the following are true.

1.

There exists an estimator $\widehat{\theta}_{k}$ , measurable with respect to $\sigma(\mathbf{X}_{[k]})$ , that solves the score equation $\ell_{k}^{\prime}(\widehat{\theta}_{k})=0$ and is weakly consistent, that is, $\widehat{\theta}_{k}\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}\theta_{0}$ as $m\to\infty$ .
2.

Let $\widehat{\theta}$ be a weakly consistent estimator of $\theta_{0}$ based on the complete dataset $\mathbf{X}$ solving the score equation $\ell^{\prime}(\widehat{\theta})=0$ . Then $\widehat{\theta}\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}\theta_% {0}$ as $T\to\infty$ .

Let $\zeta=T^{1/2}(\theta-\widehat{\theta}_{k})$ be a local parameter for the $k$ th subsequence, and $\vartheta=T^{1/2}(\theta-\widehat{\theta})$ be a local parameter for the complete dataset. Let $\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]})$ be the $k$ th subsequence posterior induced by $\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})$ and $\Pi_{T,\vartheta}(\mathrm{d}\vartheta\mid X_{1:T})$ be the posterior of $\vartheta$ induced by the overall posterior $\Pi_{T}(\mathrm{d}\theta\mid X_{1:T})$ . Then

	$\displaystyle\lim_{m\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{TV}_% {2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}% \zeta;0,I^{-1}(\theta_{0})\}\right]$	$\displaystyle=0,$
	$\displaystyle\lim_{m\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{TV}_% {2}\left[\Pi_{T,\vartheta}(\mathrm{d}\vartheta\mid X_{1:T}),\Phi\{\mathrm{d}% \vartheta;0,I^{-1}(\theta_{0})\}\right]$	$\displaystyle=0,$

where $\mathrm{TV}_{2}$ denotes the total variation of second moment distance.

Lemma S2 (Lemma 3 of Li et al.,, 2017).

Let $\widehat{\xi}_{k}=a^{\top}\widehat{\theta}_{k}+b$ and $\xi=a^{\top}\theta+b$ . Then

W_{2}\left(\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T}),\Phi[\mathrm{d}\xi;% \overline{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}]\right)\leq\frac{1}{K}\sum_{k=1}^% {K}W_{2}\big{(}\Pi_{m}(\mathrm{d}\xi\mid\mathbf{X}_{[k]}),\Phi[\mathrm{d}\xi;% \widehat{\xi}_{k},\{TI_{\xi}(\theta_{0})\}^{-1}]\big{)}.
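In one dimension the inequality in Lemma S2 reduces to the triangle inequality for quantile functions, because the $W_{2}$ barycenter of distributions on the real line averages their quantile functions. The following Python sketch illustrates this numerically. It is not the authors' code: the skewed stand-in posteriors, the common variance `scale**2` (playing the role of $\{TI_{\xi}(\theta_{0})\}^{-1}$ ), and all function names are hypothetical, and $\overline{\xi}$ is assumed to be the average of the $\widehat{\xi}_{k}$ .

```python
import numpy as np
from statistics import NormalDist


def w2_from_quantiles(q1, q2):
    """W2 distance between two 1-D distributions given on a common quantile grid."""
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))


def check_lemma_s2(K=8, n=2000, scale=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical subsequence posterior draws of xi: centred gamma noise,
    # so each stand-in posterior is skewed rather than exactly Gaussian.
    centers = rng.normal(0.0, 0.1, size=K)  # stand-ins for xi_hat_k
    draws = centers[:, None] + scale * (rng.gamma(4.0, 0.5, size=(K, n)) - 2.0)
    post_q = np.sort(draws, axis=1)  # empirical quantiles per subsequence

    # Quantiles of the Gaussian reference N(xi_hat_k, scale^2) on the same grid.
    grid = (np.arange(n) + 0.5) / n
    z = np.array([NormalDist().inv_cdf(u) for u in grid])
    ref_q = centers[:, None] + scale * z[None, :]

    # In one dimension the W2 barycenter averages quantile functions.
    bary_q = post_q.mean(axis=0)
    bary_ref_q = ref_q.mean(axis=0)  # quantiles of N(mean of centers, scale^2)

    lhs = w2_from_quantiles(bary_q, bary_ref_q)
    rhs = float(np.mean([w2_from_quantiles(post_q[k], ref_q[k]) for k in range(K)]))
    return lhs, rhs
```

Calling `check_lemma_s2()` returns the left and right hand sides of the inequality for the simulated draws; on a common quantile grid the first cannot exceed the second, by Minkowski's inequality.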

Lemma S1 mirrors Lemma 2 of Li et al., (2017), and its proof can be straightforwardly modified from their proof. We therefore focus our attention on the proof of Lemma 1, which is novel.

S1.1 Proof of Lemma 1

Proof.

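We use first-order Taylor expansions of the score functions $\ell_{k}^{\prime}$ and $\ell^{\prime}$ around $\theta_{0}$ , evaluated at $\widehat{\theta}_{k}$ and $\widehat{\theta}$ , respectively,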

	$\displaystyle 0$	$\displaystyle=\ell_{k}^{\prime}(\widehat{\theta}_{k})=\ell_{k}^{\prime}(\theta% _{0})+\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})(\widehat{\theta}_{k}-% \theta_{0}),$
	$\displaystyle 0$	$\displaystyle=\ell^{\prime}(\widehat{\theta})=\ell^{\prime}(\theta_{0})+\ell^{% \prime\prime}(\widetilde{\theta})(\widehat{\theta}-\theta_{0}),$

where $\widetilde{\theta}$ lies between $\widehat{\theta}$ and $\theta_{0}$ , and $\widetilde{\theta}_{k}$ lies between $\widehat{\theta}_{k}$ and $\theta_{0}$ . Therefore,

	$\displaystyle\widehat{\theta}_{k}$	$\displaystyle=\theta_{0}-\left\{\frac{1}{m}\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})\right\}^{-1}\frac{\ell_{k}^{\prime}(\theta_{0})}{m}=\theta_{0}+\frac{1}{m}I^{-1}(\theta_{0})\ell_{k}^{\prime}(\theta_{0})+Z_{k}\frac{\ell_{k}^{\prime}(\theta_{0})}{m},$		(S1)
	$\displaystyle\widehat{\theta}$	$\displaystyle=\theta_{0}-\left\{\frac{1}{T}\ell^{\prime\prime}(\widetilde{\theta})\right\}^{-1}\frac{\ell^{\prime}(\theta_{0})}{T}=\theta_{0}+\frac{1}{T}I^{-1}(\theta_{0})\ell^{\prime}(\theta_{0})+Z\frac{\ell^{\prime}(\theta_{0})}{T}.$

After rearranging, we obtain the difference between $\overline{\theta}$ and $\widehat{\theta}$ as

\displaystyle\overline{\theta}-\widehat{\theta}=\frac{1}{K}\sum_{k=1}^{K}Z_{k}\frac{\ell_{k}^{\prime}(\theta_{0})}{m}-Z\frac{\ell^{\prime}(\theta_{0})}{T}+Q,

(S2)

where

Z_{k}=\left\{-\frac{1}{m}\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})\right% \}^{-1}-I^{-1}(\theta_{0}),\quad Z=\left\{-\frac{1}{T}\ell^{\prime\prime}(% \widetilde{\theta})\right\}^{-1}-I^{-1}(\theta_{0}),

and

Q=I^{-1}(\theta_{0})\left\{\frac{\sum_{k=1}^{K}\ell_{k}^{\prime}(\theta_{0})-% \ell^{\prime}(\theta_{0})}{T}\right\}.
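The quantity $\overline{\theta}-\widehat{\theta}$ decomposed in equation S2 can be inspected numerically. The sketch below is an illustration only, not the authors' experiment: it assumes a Gaussian AR(1) model, uses the conditional maximum likelihood estimator of the autoregressive coefficient on each subsequence and on the full series, and takes $\overline{\theta}$ to be the average of the $\widehat{\theta}_{k}$ , as the rearrangement leading to equation S2 suggests. All function names and default values are hypothetical.

```python
import numpy as np


def simulate_ar1(T, phi, sigma=1.0, seed=0):
    """Simulate a stationary Gaussian AR(1) series of length T."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=T)
    x = np.empty(T)
    x[0] = eps[0] / np.sqrt(1.0 - phi ** 2)  # draw from the stationary law
    for t in range(1, T):
        x[t] = phi * x[t - 1] + eps[t]
    return x


def ar1_mle(x):
    """Conditional MLE of phi: root of sum_t (x_t - phi x_{t-1}) x_{t-1} = 0."""
    return float(np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1]))


def compare_estimators(T=100_000, K=100, phi=0.5, seed=1):
    """Return (theta_bar, theta_hat, m) for K consecutive subsequences of length m."""
    x = simulate_ar1(T, phi, seed=seed)
    m = T // K
    theta_hat = ar1_mle(x)  # estimator based on the complete series
    theta_k = [ar1_mle(x[k * m:(k + 1) * m]) for k in range(K)]
    theta_bar = float(np.mean(theta_k))  # average of subsequence estimators
    return theta_bar, theta_hat, m
```

Comparing `abs(theta_bar - theta_hat)` with `m ** -0.5` and `T ** -0.5` for a few choices of `K` gives a rough feel for the orders discussed in the proof; a single simulated series cannot, of course, verify an asymptotic statement.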

The second term on the right hand side of equation S2 is $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ . Indeed, the martingale central limit theorem establishes that $T^{-1/2}\ell^{\prime}(\theta_{0})$ converges in distribution to $\mathrm{N}\{0,I(\theta_{0})\}$ , so $\ell^{\prime}(\theta_{0})/T$ is $O_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ . Moreover, we have $-\ell^{\prime\prime}(\widetilde{\theta})/T\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}I(\theta_{0})$ and thus $Z\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}0$ by the continuous mapping theorem. Hence $Z\ell^{\prime}(\theta_{0})/T=o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ .

We show that the third term $Q$ on the right hand side of equation S2 is $O_{\mathbb{P}_{\theta_{0}}}(m^{-1})$ . Define

\displaystyle\ell^{\prime}_{k\mid-k}(\theta_{0})=\nabla_{\theta}\log p_{\theta}(\mathbf{X}_{[k]}\mid\mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]})\big|_{\theta={\theta_{0}}},

and thus $\ell^{\prime}(\theta_{0})=\sum_{k=1}^{K}\ell^{\prime}_{k\mid-k}(\theta_{0})$ . Therefore we write

	$\displaystyle Q$	$\displaystyle=I^{-1}(\theta_{0})\left\{\frac{\sum_{k=1}^{K}\ell_{k}^{\prime}(% \theta_{0})-\ell^{\prime}(\theta_{0})}{T}\right\}$
		$\displaystyle=I^{-1}(\theta_{0})\frac{\sum_{k=1}^{K}\{\ell_{k}^{\prime}(\theta% _{0})-\ell^{\prime}_{k\mid-k}(\theta_{0})\}}{T}=I^{-1}(\theta_{0})Q_{1},$

where

	$\displaystyle Q_{1}$	$\displaystyle=\frac{\sum_{k=1}^{K}\{\ell_{k}^{\prime}(\theta_{0})-\ell^{\prime}_{k\mid-k}(\theta_{0})\}}{T}$
		$\displaystyle=\frac{1}{T}\sum_{k=1}^{K}\sum_{i=1}^{m}\big\{\nabla_{\theta}\log p_{\theta}(X_{ik}\mid X_{1k},\dots,X_{(i-1)k})$
		$\displaystyle\hskip 79.49744pt-\nabla_{\theta}\log p_{\theta}(X_{ik}\mid\mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]},X_{1k},\dots,X_{(i-1)k})\big\}\big|_{\theta=\theta_{0}}$

by the fact that we can write

	$\displaystyle\ell_{k}^{\prime}(\theta_{0})$	$\displaystyle=\sum_{i=1}^{m}\nabla_{\theta}\log p_{\theta}(X_{ik}\mid X_{1k},\dots,X_{(i-1)k})\big|_{\theta=\theta_{0}}\quad\text{and}$
	$\displaystyle\ell^{\prime}_{k\mid-k}(\theta_{0})$	$\displaystyle=\sum_{i=1}^{m}\nabla_{\theta}\log p_{\theta}(X_{ik}\mid\mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]},X_{1k},\dots,X_{(i-1)k})\big|_{\theta=\theta_{0}}.$

Assumption 3 bounds the error between $\nabla_{\theta}\log p_{\theta}(X_{ik}\mid X_{1k},\dots,X_{(i-1)k})|_{\theta=\theta_{0}}$ and $\nabla_{\theta}\log p_{\theta}(X_{ik}\mid\mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]},X_{1k},\dots,X_{(i-1)k})|_{\theta=\theta_{0}}$ , and thus the error between $\ell_{k}^{\prime}(\theta_{0})$ and $\ell^{\prime}_{k\mid-k}(\theta_{0})$ . We have that $Q_{1}=O_{\mathbb{P}_{\theta_{0}}}(m^{-1})$ by Markov’s inequality, since for any $s>0$ ,

	$\displaystyle\mathbb{P}_{\theta_{0}}\left(m\left\|Q_{1}\right\|_{1}>s\right)$
	$\displaystyle\leq\frac{m}{sT}\sum_{k=1}^{K}\sum_{i=1}^{m}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\big\|\nabla_{\theta}\log p_{\theta}(X_{ik}\mid X_{1k},\dots,X_{(i-1)k})\big|_{\theta=\theta_{0}}$
	$\displaystyle\hskip 83.11005pt-\nabla_{\theta}\log p_{\theta}(X_{ik}\mid\mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]},X_{1k},\dots,X_{(i-1)k})\big|_{\theta=\theta_{0}}\big\|_{1}$
	$\displaystyle\leq\frac{m\,C_{0}\sum_{k=1}^{K}\sum_{i=0}^{m}\rho_{0}^{i}}{sT}\leq\frac{C_{0}}{s(1-\rho_{0})}\to 0\quad\text{as}\ s\to\infty,$

where the second inequality is by Assumption 3. Moreover, by Assumption 6, $I^{-1}(\theta_{0})=O_{\mathbb{P}_{\theta_{0}}}(1)$ . Therefore $Q=O_{\mathbb{P}_{\theta_{0}}}(1)\times O_{\mathbb{P}_{\theta_{0}}}(m^{-1})=O_{\mathbb{P}_{\theta_{0}}}(m^{-1})$ . Furthermore, if $m$ grows faster than $T^{1/2}$ , that is, $T^{1/2}/m\to 0$ , then $Q=o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ .

The convergence order of the first term on the right hand side of equation S2 is established by Markov’s inequality. We define $W_{k}=Z_{k}\ell_{k}^{\prime}(\theta_{0})/m^{1/2}$ , so that $Z_{k}\ell_{k}^{\prime}(\theta_{0})/m=W_{k}/m^{1/2}$ , and prove later that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2})\to 0$ as $m\to\infty$ . Then, for every $c>0$ ,

	$\displaystyle\mathbb{P}_{\theta_{0}}\bigg(\bigg\|\frac{1}{K}\sum_{k=1}^{K}\frac{W_{k}}{m^{1/2}}\bigg\|\geq cm^{-1/2}\bigg)$	$\displaystyle\leq\frac{m\,\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\|(1/K)\sum_{k=1}^{K}W_{k}/m^{1/2}\|^{2}}{c^{2}}$
		$\displaystyle\leq\frac{1}{c^{2}K}\sum_{k=1}^{K}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2})=\frac{1}{c^{2}}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{1}\|^{2})\to 0.$		(S3)

The convergence in equation S3 is established by $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2})\to 0$ that we will prove below. Further, when $\widehat{\theta}_{k}$ is an unbiased estimator for $\theta_{0}$ , then $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(W_{k})=0$ because

\displaystyle\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\widehat{\theta}_{k})=\theta_{0}+\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{\frac{1}{m}I^{-1}(\theta_{0})\ell_{k}^{\prime}(\theta_{0})\bigg\}+m^{-1/2}\,\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(W_{k})=\theta_{0}+m^{-1/2}\,\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(W_{k})=\theta_{0},

recalling equation S1. We now prove that, when $\widehat{\theta}_{k}$ is unbiased, the first term on the right hand side of equation S2 is $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ . By Markov’s inequality, for every $c>0$ ,

	$\displaystyle\quad\mathbb{P}_{\theta_{0}}\bigg(\bigg\|\frac{1}{K}\sum_{k=1}^{K}\frac{W_{k}}{m^{1/2}}\bigg\|>cT^{-1/2}\bigg)=\mathbb{P}_{\theta_{0}}\bigg(\bigg\|\frac{1}{K}\sum_{k=1}^{K}W_{k}\bigg\|>cK^{-1/2}\bigg)$
	$\displaystyle\leq\frac{K\,\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\|(1/K)\sum_{k=1}^{K}W_{k}\|^{2}}{c^{2}}=\frac{1}{Kc^{2}}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg(\sum_{1\leq k_{1},k_{2}\leq K}W_{k_{1}}^{\top}W_{k_{2}}\bigg)$
	$\displaystyle\leq\frac{1}{Kc^{2}}\sum_{1\leq k_{1},k_{2}\leq K}\sum_{l=1}^{d}|\mathrm{cov}_{\mathbb{P}_{\theta_{0}}}(W_{k_{1}}^{l},W_{k_{2}}^{l})|,$		(S4)

where we write $W_{k}=(W_{k}^{1},\dots,W_{k}^{d})$ , and $\mathrm{cov}_{\mathbb{P}_{\theta_{0}}}$ denotes covariance under $\mathbb{P}_{\theta_{0}}$ . The last inequality of equation S4 is by $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(W_{k})=(0,\dots,0)$ and $|\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(W_{k_{1}}^{\top}W_{k_{2}})|=|\sum_{l=1}^{d}\mathrm{cov}_{\mathbb{P}_{\theta_{0}}}(W_{k_{1}}^{l},W_{k_{2}}^{l})|$ . To prove that the last term of equation S4 converges to zero as $m\to\infty$ , it suffices to prove that for every $1\leq l\leq d$ ,

\frac{1}{K}\sum_{1\leq k_{1},k_{2}\leq K}|\mathrm{cov}_{\mathbb{P}_{\theta_{0}% }}(W_{k_{1}}^{l},W_{k_{2}}^{l})|\to 0\quad\text{as}\leavevmode\nobreak\ % \leavevmode\nobreak\ m\to\infty.

(S5)

The stationarity of the process (Assumption 1) allows us to denote $|\mathrm{cov}_{\mathbb{P}_{\theta_{0}}}(W_{k_{1}}^{l},W_{k_{2}}^{l})|$ by $\delta^{l}(k_{1}-k_{2})$ .

We employ the following $\alpha$ -mixing inequality of Rio, (1993):

|\mathrm{cov}_{\mathbb{P}_{\theta_{0}}}(W_{k_{1}}^{l},W_{k_{2}}^{l})|=\delta^{% l}(k_{1}-k_{2})\leq 2\int_{0}^{2\alpha(|(k_{1}-k_{2})m|)}Q_{W^{l}_{k_{1}}}(u)% \,Q_{W^{l}_{k_{2}}}(u)\,\mathrm{d}u,

where $\alpha(\cdot)$ is the $\alpha$ -mixing coefficient as stated in Assumption 2 and $Q_{W_{k}^{l}}(u)=\inf\{t:\mathbb{P}_{\theta_{0}}(|W_{k}^{l}|>t)\leq u\}$ is the quantile function of $|W_{k}^{l}|$ . Further, by Markov’s inequality, $\mathbb{P}_{\theta_{0}}(|W_{k}^{l}|>t)\leq\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(|W_{k}^{l}|^{2+\delta})/t^{2+\delta}$ , where $\delta>0$ is as stated in Assumption 2, and so we have $Q_{W^{l}_{k}}(u)\leq u^{-1/(2+\delta)}\|W_{k}^{l}\|_{2+\delta}$ . Equation (S5) then follows as
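The quantile bound obtained from Markov’s inequality can be checked on a distribution whose quantile function and moments are known in closed form. The sketch below does so for a standard normal variable standing in for $W_{k}^{l}$ ; this choice and the function names are illustrative assumptions, not part of the proof.

```python
import math
from statistics import NormalDist


def abs_normal_moment(p):
    """E|Z|^p for Z ~ N(0, 1)."""
    return 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)


def quantile_abs_normal(u):
    """Q(u) = inf{t : P(|Z| > t) <= u} for Z ~ N(0, 1) and 0 < u < 1."""
    # P(|Z| > t) = 2 {1 - Phi(t)}, so the infimum is attained at Phi^{-1}(1 - u/2).
    return NormalDist().inv_cdf(1.0 - u / 2.0)


def markov_quantile_bound(u, delta):
    """Upper bound u^{-1/(2+delta)} * ||Z||_{2+delta} implied by Markov's inequality."""
    p = 2.0 + delta
    return u ** (-1.0 / p) * abs_normal_moment(p) ** (1.0 / p)
```

For any `u` in (0, 1) and `delta > 0`, `quantile_abs_normal(u)` should not exceed `markov_quantile_bound(u, delta)`; the gap shows how conservative the bound is for light-tailed variables.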

	$\displaystyle\quad\frac{1}{K}\sum_{1\leq k_{1},k_{2}\leq K}|\mathrm{cov}_{\mathbb{P}_{\theta_{0}}}(W_{k_{1}}^{l},W_{k_{2}}^{l})|=\frac{1}{K}\sum_{1\leq k_{1},k_{2}\leq K}\delta^{l}(k_{1}-k_{2})=\sum_{k=1}^{K}\frac{K-k}{K}\delta^{l}(k)$
	$\displaystyle\leq\sum_{k=1}^{K}\delta^{l}(k)\leq 2\sum_{k=1}^{K}\int_{0}^{2\alpha(km)}Q^{2}_{W^{l}_{1}}(u)\,\mathrm{d}u\leq 2\sum_{k=1}^{K}\int_{0}^{2\alpha(km)}u^{-2/(2+\delta)}\|W_{1}^{l}\|^{2}_{2+\delta}\,\mathrm{d}u$
	$\displaystyle\leq C_{0}\bigg\{\sum_{k=1}^{K}\alpha(km)^{\delta/(2+\delta)}\bigg\}\|W_{1}^{l}\|^{2}_{2+\delta}\leq C_{0}\bigg\{\sum_{k=1}^{\infty}\alpha(km)^{\delta/(2+\delta)}\bigg\}\|W_{1}^{l}\|^{2}_{2+\delta}\to 0,$

as $m\to\infty$ , where the second inequality is by Assumption 1 and $C_{0}$ is a constant independent of $m$ . The final inequality is by Assumption 2 and $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})\to 0$ , which we will prove below.
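The constant $C_{0}$ in the display above can be made explicit by evaluating the integral. A short worked step, writing $\alpha$ for $\alpha(km)$ and using $1-2/(2+\delta)=\delta/(2+\delta)$ :

```latex
\int_{0}^{2\alpha} u^{-2/(2+\delta)}\,\mathrm{d}u
  = \frac{2+\delta}{\delta}\,(2\alpha)^{\delta/(2+\delta)},
\qquad\text{so one may take}\qquad
C_{0} = 2\cdot\frac{2+\delta}{\delta}\cdot 2^{\delta/(2+\delta)} .
```

This choice depends only on $\delta$ , which is consistent with $C_{0}$ being independent of $m$ .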

Proof of $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})\to 0$ as $m\to\infty$ for every $k=1,\dots,K$ :
We show that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})\to 0$ , and thus $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2})$ also converges to zero by Jensen’s inequality. Since $\ell_{k}^{\prime}(\theta_{0})/m^{1/2}=O_{\mathbb{P}_{\theta_{0}}}(1)$ and $Z_{k}=o_{\mathbb{P}_{\theta_{0}}}(1)$ , $\|W_{k}\|=\|Z_{k}\,\ell_{k}^{\prime}(\theta_{0})/m^{1/2}\|$ is $o_{\mathbb{P}_{\theta_{0}}}(1)$ . The remaining part of the proof is to show that $\|W_{k}\|^{2+\delta}$ is dominated by some $V_{k}\in L_{1}$ under $\mathbb{P}_{\theta_{0}}$ . The dominated convergence theorem will then imply that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})$ converges to zero.

By Assumption 6, let $\underline{\lambda}>0$ be a lower bound on the eigenvalues of $-\ell_{k}^{\prime\prime}(\theta)/m$ for all $\theta\in B_{\delta_{0}}(\theta_{0})$ and all sufficiently large $m$ . We have

\bigg{\|}\bigg{\{}-\frac{1}{m}\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})% \bigg{\}}^{-1}\bigg{\|}\leq d^{1/2}\underline{\lambda}^{-1}\quad\text{and}% \quad\|I(\theta_{0})^{-1}\|\leq d^{1/2}\underline{\lambda}^{-1}.

(S6)

Moreover, Assumption 5 implies that

\bigg{\|}-\frac{1}{m}\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})\bigg{\|}^% {2}\leq\frac{d^{2}}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2}.

(S7)

It follows from equations (S6) and (S7) that for all sufficiently large $m$ ,

	$\displaystyle\left\|Z_{k}\right\|^{2+\delta}$	$\displaystyle=\bigg\|\bigg\{-\frac{1}{m}\frac{\partial^{2}\ell_{k}(\widetilde{\theta}_{k})}{\partial\theta\partial\theta^{\top}}\bigg\}^{-1}\bigg\{-\frac{1}{m}\frac{\partial^{2}\ell_{k}(\widetilde{\theta}_{k})}{\partial\theta\partial\theta^{\top}}-I(\theta_{0})\bigg\}I^{-1}(\theta_{0})\bigg\|^{2+\delta}$
		$\displaystyle\leq\frac{1}{2}d^{(2+\delta)/2}\underline{\lambda}^{-(2+\delta)}\bigg\{\frac{d^{2}}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+\|I(\theta_{0})\|^{2+\delta}\bigg\}d^{(2+\delta)/2}\underline{\lambda}^{-(2+\delta)}$
		$\displaystyle\leq c_{1}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2},$

where $c_{1}$ and $c_{2}$ are constants that are independent of $m$ . Now define

V_{k}=\bigg{\{}c_{1}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+% \delta}+c_{2}\bigg{\}}\times\|m^{-1/2}\ell_{k}^{\prime}(\theta_{0})\|^{2+% \delta}.

We will show that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(V_{k})<\infty$ and then apply the dominated convergence theorem to $\|W_{k}\|^{2+\delta}$ . We first apply the Cauchy-Schwarz inequality to $V_{k}$ and obtain

\displaystyle\begin{aligned} \mathbb{E}_{\mathbb{P}_{\theta_{0}}}(V_{k})&=% \mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg{\{}c_{1}\frac{1}{m}\sum_{i=1}^{m}M_{% i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}\bigg{\}}\|m^{-1/2}\ell_{k}^{\prime}(% \theta_{0})\|^{2+\delta}\\ &\leq\bigg{[}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg{\{}c_{1}\frac{1}{m}\sum% _{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}\bigg{\}}^{2}\bigg{]}^{1/% 2}\times\big{\{}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|m^{-1/2}\ell_{k}^{% \prime}(\theta_{0})\|^{4+2\delta})\big{\}}^{1/2}.\end{aligned}

(S8)

The first term of equation S8 is bounded by Assumption 5 as

	$\displaystyle\quad\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg{\{}c_{1}\frac{1}{m% }\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}\bigg{\}}^{2}$
	$\displaystyle=\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg{[}c_{1}^{2}\bigg{\{}% \frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}\bigg{\}}^{2}+2c% _{1}c_{2}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}^% {2}\bigg{]}$
	$\displaystyle\leq\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg{\{}c_{1}^{2}\frac{1% }{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{4+2\delta}+2c_{1}c_{2}\frac{1}{m% }\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}^{2}\bigg{\}}<\infty.$

To bound the second term of equation S8, recall that $\ell_{k}^{\prime}(\theta)=\sum_{t=1}^{m}p_{\theta}^{\prime}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})/p_{\theta}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})$ is a martingale at $\theta=\theta_{0}$ for $m\geq 1$ by Assumption 6. We denote the $l$ th component of $p_{\theta}^{\prime}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})/p_{\theta}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})$ as $U_{tl}$ , so that $p_{\theta}^{\prime}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})/p_{\theta}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})=(U_{t1},\dots,U_{td})^{\top}$ . Further, by Assumption 6, $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(U_{tl})=0$ for every integer $1\leq t\leq m$ and integer $1\leq l\leq d$ . Then

	$\displaystyle\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{\|m^{-1/2}\ell_{k}^{\prime}(\theta_{0})\|^{4+2\delta}\}$	$\displaystyle\leq m^{-(2+\delta)}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{\|\ell_{k}^{\prime}(\theta_{0})\|^{4+2\delta}\}$
		$\displaystyle\leq m^{-(2+\delta)}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{\sum_{l=1}^{d}\bigg(\sum_{t=1}^{m}U_{tl}\bigg)^{2}\bigg\}^{2+\delta}$
		$\displaystyle\leq\bigg(\frac{d^{1/2}}{m}\bigg)^{2+\delta}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{\sum_{l=1}^{d}\bigg(\sum_{t=1}^{m}U_{tl}\bigg)^{4+2\delta}\bigg\}$
		$\displaystyle\leq\bigg(\frac{d^{1/2}}{m}\bigg)^{2+\delta}m^{2+\delta}\sum_{l=1}^{d}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{c(4+2\delta)\frac{1}{m}\sum_{t=1}^{m}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|U_{tl}\|^{4+2\delta})\bigg\}$
		$\displaystyle\leq d^{1+\delta/2}\sum_{l=1}^{d}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg[c(4+2\delta)\frac{1}{m}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{4+2\delta}\bigg\}\bigg]<\infty,$

where $c(\nu)=\{8(\nu-1)\times\max(1,2^{\nu-3})\}^{\nu}$ is a constant independent of $m$ . The third inequality is by Jensen’s inequality, the fourth inequality is by Dharmadhikari et al., (1968), and the fifth inequality is by Assumption 5.
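To give a feel for the size of the constant $c(\nu)$, the following Python sketch evaluates it and compares both sides of the moment bound in the simplest possible case. The function names, the choice $\nu=4$, and the use of independent standard normal increments (for which the fourth moment of the sum is known in closed form) are illustrative assumptions and are not part of the proof.

```python
def dharmadhikari_constant(nu):
    """c(nu) = {8 (nu - 1) max(1, 2^(nu - 3))}^nu, as defined in the text."""
    return (8.0 * (nu - 1.0) * max(1.0, 2.0 ** (nu - 3.0))) ** nu


def moment_bound(nu, m, mean_abs_moment):
    """Right-hand side c(nu) * m^(nu/2) * (1/m) sum_t E|U_t|^nu."""
    return dharmadhikari_constant(nu) * m ** (nu / 2.0) * mean_abs_moment


# Illustrative case (an assumption): U_t i.i.d. N(0, 1) and nu = 4.
# Then sum_t U_t ~ N(0, m), so E|sum_t U_t|^4 = 3 m^2 and E|U_t|^4 = 3.
nu, m = 4, 1000
lhs = 3.0 * m ** 2
rhs = moment_bound(nu, m, mean_abs_moment=3.0)
print(lhs <= rhs, dharmadhikari_constant(nu))
```

The constant is very conservative, which is harmless here because the argument only requires a bound that does not depend on $m$.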

Therefore, we find a $V_{k}$ that dominates $\|W_{k}\|^{2+\delta}$ with $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(V_{k})<\infty$ . Moreover, $W_{k}\overset{\mathbb{P}_{\theta_{0}}}{\to}0$ and thus by the dominated convergence theorem, $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})\to 0$ . ∎

Appendix S2 Other proofs

The proof of Theorem 1 is similar to that of Theorem 1 of Li et al., (2017), but is included for concreteness. The proof of Theorem 2 is the same as the proof of Theorem 2 of Li et al., (2017), and is mainly based on Theorem 1 and does not involve features of dependent data. The proof of Lemma S1 is along the lines of the proof of Lemma 2 of Li et al., (2017), but is included for completeness.

For any two probability measures $\mu,\nu\in\mathcal{P}_{2}(\Theta)$ , the total variation of second moment distance $\mathrm{TV}_{2}$ is defined as $\mathrm{TV}_{2}(\mu,\nu)=\int_{\Theta}(1+\|\theta\|^{2})\,|\mu(\mathrm{d}% \theta)-\nu(\mathrm{d}\theta)|$ . By Villani, (2009), $\mathrm{W}^{2}_{2}(\mu,\nu)\leq 2\mathrm{TV}_{2}(\mu,\nu)$ .
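The inequality $\mathrm{W}^{2}_{2}\leq 2\mathrm{TV}_{2}$ can be checked numerically in a simple case. The sketch below is an illustration only: it assumes two univariate Gaussians with hypothetical parameters, uses the closed form $(m_{1}-m_{2})^{2}+(s_{1}-s_{2})^{2}$ for the squared Wasserstein-2 distance between them, and approximates $\mathrm{TV}_{2}$ by a Riemann sum on a wide grid.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def tv2_distance_1d(mean1, sd1, mean2, sd2, half_width=40.0, n_grid=400_001):
    """Riemann-sum approximation of int (1 + x^2) |p(x) - q(x)| dx."""
    x = np.linspace(-half_width, half_width, n_grid)
    dx = x[1] - x[0]
    diff = np.abs(normal_pdf(x, mean1, sd1) - normal_pdf(x, mean2, sd2))
    return float(np.sum((1.0 + x ** 2) * diff) * dx)


def w2_squared_1d(mean1, sd1, mean2, sd2):
    """Closed-form squared Wasserstein-2 distance between univariate Gaussians."""
    return (mean1 - mean2) ** 2 + (sd1 - sd2) ** 2


# Hypothetical parameters chosen only for illustration.
w2_sq = w2_squared_1d(0.0, 1.0, 0.5, 1.2)
tv2 = tv2_distance_1d(0.0, 1.0, 0.5, 1.2)
print(w2_sq, 2.0 * tv2)
```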

Proof of Theorem 1.

By the triangle inequality,

\displaystyle\begin{aligned} &\mathrm{W}_{2}\{\overline{\Pi}_{T}(\mathrm{d}\xi% \mid X_{1:T}),\Pi_{T}(\mathrm{d}\xi\mid X_{1:T})\}\\ &\leq\mathrm{W}_{2}\{\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T}),\Phi[% \mathrm{d}\xi;\overline{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}]\}\\ &\quad+\mathrm{W}_{2}(\Phi[\mathrm{d}\xi;\overline{\xi},\{TI_{\xi}(\theta_{0})% \}^{-1}],\Phi[\mathrm{d}\xi;\widehat{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}])\\ &\quad+\mathrm{W}_{2}\{\Phi[\mathrm{d}\xi;\widehat{\xi},\{TI_{\xi}(\theta_{0})% \}^{-1}],\Pi_{T}(\mathrm{d}\xi\mid X_{1:T})\}.\end{aligned}

(S9)

We show that the first term of the right hand side of equation S9 is $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ , the second term is $o_{\mathbb{P}_{\theta_{0}}}(m^{-1/2})$ in general and is $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ when the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$ , and the third term is $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ . Therefore, the $\mathrm{W}_{2}$ distance between $\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T})$ and $\Pi_{T}(\mathrm{d}\xi\mid X_{1:T})$ is $o_{\mathbb{P}_{\theta_{0}}}(m^{-1/2})$ in general and $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ when the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$ .

We first estimate the order of the first term in equation S9. For any $c>0$ ,

	$\displaystyle\quad\mathbb{P}_{\theta_{0}}\{\mathrm{W}_{2}(\overline{\Pi}_{T}(% \mathrm{d}\xi\mid X_{1:T}),\Phi[\mathrm{d}\xi;\overline{\xi},\{TI_{\xi}(\theta% _{0})\}^{-1}])\geq cT^{-1/2}\}$
	$\displaystyle\leq\mathbb{P}_{\theta_{0}}\bigg{\{}\frac{1}{K}\sum_{k=1}^{K}% \mathrm{W}_{2}\big{(}\Pi_{m}(\mathrm{d}\xi\mid\mathbf{X}_{[k]}),\Phi[\mathrm{d% }\xi;\widehat{\xi}_{k},\{TI_{\xi}(\theta_{0})\}^{-1}]\big{)}\geq cT^{-1/2}% \bigg{\}}$
	$\displaystyle\leq\frac{T}{c^{2}}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{\frac{1}{K}\sum_{k=1}^{K}\mathrm{W}_{2}\big(\Pi_{m}(\mathrm{d}\xi\mid\mathbf{X}_{[k]}),\Phi[\mathrm{d}\xi;\widehat{\xi}_{k},\{TI_{\xi}(\theta_{0})\}^{-1}]\big)\bigg\}^{2}$
	$\displaystyle\leq\frac{T}{c^{2}K}\sum_{k=1}^{K}\mathbb{E}_{\mathbb{P}_{\theta_% {0}}}\mathrm{W}_{2}^{2}\big{(}\Pi_{m}(\mathrm{d}\xi\mid\mathbf{X}_{[k]}),\Phi[% \mathrm{d}\xi;\widehat{\xi}_{k},\{TI_{\xi}(\theta_{0})\}^{-1}]\big{)}$
	$\displaystyle\leq\frac{T}{c^{2}}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{W}% _{2}^{2}\big{(}\Pi_{m}(\mathrm{d}\xi\mid\mathbf{X}_{[1]}),\Phi[\mathrm{d}\xi;% \widehat{\xi}_{1},\{TI_{\xi}(\theta_{0})\}^{-1}]\big{)}$
	$\displaystyle\leq\frac{2T}{c^{2}}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{% TV}_{2}\big{(}\Pi_{m}(\mathrm{d}\xi\mid\mathbf{X}_{[1]}),\Phi[\mathrm{d}\xi;% \widehat{\xi}_{1},\{TI_{\xi}(\theta_{0})\}^{-1}]\big{)}\to 0,$

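where the first inequality is by Lemma S2, the second inequality is by Markov's inequality applied to the squared quantity, the third inequality is by Jensen's inequality, the fourth inequality is by stationarity, and the fifth inequality is by the bound $\mathrm{W}^{2}_{2}\leq 2\mathrm{TV}_{2}$ stated above. The convergence to zero is by Lemma S1.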

For the second term, note that the Wasserstein-2 distance between two $d$-dimensional Gaussians $\Phi(m_{1},\Sigma)$ and $\Phi(m_{2},\Sigma)$ with a common covariance matrix is given by $\mathrm{W}_{2}\{\Phi(m_{1},\Sigma),\Phi(m_{2},\Sigma)\}=\|m_{1}-m_{2}\|$. Therefore, Lemma 1 yields that $\mathrm{W}_{2}(\Phi[\mathrm{d}\xi;\overline{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}],\Phi[\mathrm{d}\xi;\widehat{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}])\leq|\overline{\xi}-\widehat{\xi}|=o_{\mathbb{P}_{\theta_{0}}}(m^{-1/2})$ in general and
$\mathrm{W}_{2}(\Phi[\mathrm{d}\xi;\overline{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}],\Phi[\mathrm{d}\xi;\widehat{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}])\leq|\overline{\xi}-\widehat{\xi}|=o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$ when the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$. The third term of equation S9 is a special case of the first term when $K=1$ and $m=T$, and is hence $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$. This proves Theorem 1. ∎
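The Gaussian fact used for the second term can be verified with the general closed form for the Wasserstein-2 distance between Gaussians, $\mathrm{W}_{2}^{2}=\|m_{1}-m_{2}\|^{2}+\mathrm{tr}\{\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2})^{1/2}\}$. The sketch below is a minimal illustration with hypothetical means and covariance; the helper names are not from the paper. When the two covariances coincide, the trace term vanishes and the distance reduces to the distance between the means.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def gaussian_w2(mean1, cov1, mean2, cov2):
    """Wasserstein-2 distance between N(mean1, cov1) and N(mean2, cov2)."""
    root1 = psd_sqrt(cov1)
    cross = psd_sqrt(root1 @ cov2 @ root1)
    trace_term = np.trace(cov1 + cov2 - 2.0 * cross)
    # Guard against tiny negative values caused by rounding.
    squared = np.sum((mean1 - mean2) ** 2) + max(trace_term, 0.0)
    return float(np.sqrt(squared))


# Hypothetical inputs: common covariance, means at Euclidean distance 5.
cov = np.array([[2.0, 0.3], [0.3, 1.0]])
m1 = np.array([0.0, 0.0])
m2 = np.array([3.0, 4.0])
print(gaussian_w2(m1, cov, m2, cov))
```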

Proof of Lemma S1.

We first show the existence of a weakly consistent estimator $\widehat{\theta}_{k}$ of $\theta_{0}$ such that $\ell^{\prime}_{k}(\widehat{\theta}_{k})=0$ and $\widehat{\theta}_{k}\to\theta_{0}$ in $\mathbb{P}_{\theta_{0}}$-probability. By Assumption 5, $\ell_{k}(\theta)$ is differentiable in a neighbourhood $B_{\delta_{0}}(\theta_{0})$. By Assumption 6, $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{\ell^{\prime}_{k}(\theta_{0})\}=0$. With $\mathbb{P}_{\theta_{0}}$-probability arbitrarily close to $1$, there exists a root of $\ell^{\prime}_{k}(\theta)=0$ in $B_{\delta_{0}}(\theta_{0})$. We denote such a root by $\widehat{\theta}_{k}$. By Assumption 7, $\widehat{\theta}_{k}\to\theta_{0}$ in $\mathbb{P}_{\theta_{0}}$-probability. It suffices to show $\lim_{m\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right]=0$, as the equation $\lim_{m\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{TV}_{2}[\Pi_{T,\vartheta}(\mathrm{d}\vartheta\mid X_{1:T}),\Phi\{\mathrm{d}\vartheta;0,I^{-1}(\theta_{0})\}]=0$ is just a special case when $m=T$ and $K=1$. We first show that $\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right]\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}0$.

Define $w(\zeta)=\ell_{k}\{\widehat{\theta}_{k}+\zeta/(Km)^{1/2}\}-\ell_{k}(\widehat{% \theta}_{k})$ and $C_{m}=\int\exp\{Kw(z)\}\pi_{0}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}\mathrm{d}z$ . The subsequence posterior for $\mathbf{X}_{[k]}$ can be written as $\pi_{m}(\theta\mid\mathbf{X}_{[k]})=\exp\{K\ell_{k}(\theta)\}\pi_{0}(\theta)/[% \int_{\theta}\exp\{K\ell_{k}(\theta)\}\pi_{0}(\theta)\,\mathrm{d}\theta]$ . The posterior of $\zeta$ induced by the posterior above is

\pi_{m}(\zeta\mid\mathbf{X}_{[k]})=C_{m}^{-1}\exp\{Kw(\zeta)\}\pi_{0}\{% \widehat{\theta}_{k}+\zeta/(Km)^{1/2}\}.

Define

g_{m}(\zeta)=(1+\|\zeta\|^{2})\left[\exp\{Kw(\zeta)\}\pi_{0}\bigg{\{}\widehat{% \theta}_{k}+\frac{\zeta}{(Km)^{1/2}}\bigg{\}}-\exp\bigg{\{}-\frac{1}{2}\zeta^{% \top}I(\theta_{0})\zeta\bigg{\}}\pi_{0}(\theta_{0})\right].

We define $\mathcal{T}=\{\zeta\,:\,\zeta=T^{1/2}(\theta-\widehat{\theta}_{k}),\theta\in\Theta\}$ . If we can show that $\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{% \theta_{0}}}}{{\to}}0$ , then $C_{m}\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}\int_{\mathbb{R}^{% d}}\exp\{-\zeta^{\top}I(\theta_{0})\zeta/2\}\pi_{0}(\theta_{0})\,\mathrm{d}% \zeta=(2\pi)^{d/2}\{\det I(\theta_{0})\}^{-1/2}\pi_{0}(\theta_{0})$ , and thus

	$\displaystyle\quad\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right]$
	$\displaystyle=\int_{\mathcal{T}}(1+\|\zeta\|^{2})\bigg|\frac{\exp\{Kw(\zeta)\}\pi_{0}\{\widehat{\theta}_{k}+\zeta/(Km)^{1/2}\}}{C_{m}}-\frac{\det\{I(\theta_{0})\}^{1/2}}{(2\pi)^{d/2}}\exp\bigg\{-\frac{1}{2}\zeta^{\top}I(\theta_{0})\zeta\bigg\}\bigg|\,\mathrm{d}\zeta$
	$\displaystyle\leq\frac{1}{C_{m}}\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z$
	$\displaystyle\quad+\left|\frac{(2\pi)^{d/2}\{\det I(\theta_{0})\}^{-1/2}\pi_{0}(\theta_{0})}{C_{m}}-1\right|\times\int_{\mathbb{R}^{d}}\frac{(1+\|z\|^{2})}{(2\pi)^{d/2}\left\{\det I(\theta_{0})\right\}^{-1/2}}\exp\bigg\{-\frac{1}{2}z^{\top}I(\theta_{0})z\bigg\}\mathrm{d}z.$

The second term in the previous equation converges to zero in $\mathbb{P}_{\theta_{0}}$ -probability.

We now show that $\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{% \theta_{0}}}}{{\to}}0$ . To achieve this, we divide $\mathcal{T}$ into three regions: $A_{1}=\{z:\|z\|\geq\delta_{1}(Km)^{1/2}\}$ , $A_{2}=\{z:\delta_{2}\leq\|z\|\leq\delta_{1}(Km)^{1/2}\}$ , and $A_{3}=\{z:\|z\|<\delta_{2}\}$ , where $\delta_{1}$ and $\delta_{2}$ are constants that will be chosen later. Since $\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z\leq\int_{A_{1}}|g_{m}(z)|\,\mathrm{d% }z+\int_{A_{2}}|g_{m}(z)|\,\mathrm{d}z+\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z$ , to prove $\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{% \theta_{0}}}}{{\to}}0$ , it suffices to prove that $\int_{A_{i}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{\theta_{% 0}}}}{{\to}}0$ for $i=1,2,3$ .

1. Proof of $\int_{A_{1}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{\theta_{% 0}}}}{{\to}}0$ : we have

\displaystyle\begin{aligned} \int_{A_{1}}|g_{m}(z)|\,\mathrm{d}z&\leq\int_{A_{% 1}}(1+\|z\|^{2})\exp\{Kw(z)\}\pi_{0}\bigg{\{}\widehat{\theta}_{k}+\frac{z}{(Km% )^{1/2}}\bigg{\}}\mathrm{d}z\\ &\quad+\int_{A_{1}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z% \right\}\pi_{0}(\theta_{0})\,\mathrm{d}z.\end{aligned}

(S10)

The second term on the right hand side of equation S10 converges to zero as $m\to\infty$. We use Assumption 5 to prove that the first term of equation S10 also converges to zero as $m\to\infty$. For some $\epsilon>0$, with probability approaching one, for all sufficiently large $m$ and all $z\in A_{1}$, we have $\exp\{Kw(z)\}=\exp[K\{\ell_{k}(\widehat{\theta}_{k}+z/(Km)^{1/2})-\ell_{k}(\widehat{\theta}_{k})\}]\leq\exp(-Km\epsilon)$. Moreover, with $\mathbb{P}_{\theta_{0}}$-probability approaching one, the consistency of $\widehat{\theta}_{k}$ guarantees that $\|\widehat{\theta}_{k}\|^{2}\leq C_{1}$, where $C_{1}\in\mathbb{R}_{+}$ is a constant. Therefore, as $m\to\infty$,

	$\displaystyle\quad\int_{A_{1}}(1+\|z\|^{2})\exp\{Kw(z)\}\pi_{0}\bigg\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\bigg\}\mathrm{d}z$
	$\displaystyle\leq\exp(-Km\epsilon)\bigg\{1+(Km)^{d/2}\int_{\Theta}2(\|\theta\|^{2}+\|\widehat{\theta}_{k}\|^{2})\pi_{0}(\theta)\,\mathrm{d}\theta\bigg\}$
	$\displaystyle\leq\exp(-Km\epsilon)\bigg\{1+2(Km)^{d/2}C_{1}+(Km)^{d/2}\int_{\Theta}2\|\theta\|^{2}\pi_{0}(\theta)\,\mathrm{d}\theta\bigg\}\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}0,$

where the convergence is by Assumption 8 that bounds the second moment of the prior.

2. Proof of $\int_{A_{2}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{\theta_{% 0}}}}{{\to}}0$ : note that

\displaystyle\begin{aligned} \int_{A_{2}}|g_{m}(z)|\,\mathrm{d}z&\leq\int_{A_{% 2}}(1+\|z\|^{2})\exp\{Kw(z)\}\pi_{0}\bigg{\{}\widehat{\theta}_{k}+\frac{z}{(Km% )^{1/2}}\bigg{\}}\mathrm{d}z\\ &\quad+\int_{A_{2}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z% \right\}\pi_{0}(\theta_{0})\,\mathrm{d}z.\end{aligned}

(S11)

The second integral on the right hand side of equation S11 is

	$\displaystyle\quad\int_{\{z\,:\,\delta_{2}\leq\|z\|\leq\delta_{1}(Km)^{1/2}\}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\pi_{0}(\theta_{0})\,\mathrm{d}z$
	$\displaystyle\leq\int_{\{z\,:\,\|z\|\geq\delta_{2}\}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\pi_{0}(\theta_{0})\,\mathrm{d}z,$

which converges to zero by choosing $\delta_{2}$ large enough. The following step bounds the first term on the right hand side of equation S11. We use a Taylor expansion of $w(z)=\ell_{k}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}-\ell_{k}(\widehat{\theta}_{k})$ around $\widehat{\theta}_{k}$. Because $\ell_{k}^{\prime}(\widehat{\theta}_{k})=0$, we have $w(z)=-1/(2K)z^{\top}\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})z/m+R_{m}(z)$, where $R_{m}(z)=(1/6)\partial^{3}\ell_{k}(\widetilde{\theta})/(\partial\theta^{3})\{z/(Km)^{1/2},z/(Km)^{1/2},z/(Km)^{1/2}\}$ and
$\{\partial^{3}\ell_{k}(\widetilde{\theta})\}/(\partial\theta^{3})$ is a three-dimensional array with $\{\partial^{3}\ell_{k}(\widetilde{\theta})\}/(\partial\theta_{i}\partial\theta_{j}\partial\theta_{k})$ as its $(i,j,k)$th element. Moreover, $\widetilde{\theta}$ is a $d$-dimensional vector between $\widehat{\theta}_{k}$ and $\widehat{\theta}_{k}+z/(Km)^{1/2}$. By Assumption 5,

\left|R_{m}(z)\right|\leq\frac{d^{3}}{6}\left\|\frac{z}{(Km)^{1/2}}\right\|^{3}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})\leq\frac{d^{3}\delta_{1}}{6K}\|z\|^{2}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik}).

Also by Assumption 5, we have $\limsup_{m\to\infty}{\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{m^{-1}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})\}}<\infty$. Therefore, for all sufficiently large $m$, by Markov's inequality, $m^{-1}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})$ is $O_{\mathbb{P}_{\theta_{0}}}(1)$. If $\delta_{1}$ is chosen to be small enough, $|R_{m}(z)|\leq 1/(4K)z^{\top}\{\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})/m\}z$ with probability approaching one in $\mathbb{P}_{\theta_{0}}$, since Assumption 6 bounds the eigenvalues of $\ell_{k}^{\prime\prime}(\theta)/m$ for $\theta\in B_{\delta_{0}}(\theta_{0})$ from below. Hence, with probability approaching one, for all sufficiently large $m$ and $z\in A_{2}$, $w(z)=-1/(2K)z^{\top}\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})z/m+R_{m}(z)\leq-1/(4K)z^{\top}\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})z/m\leq-1/(8K)z^{\top}I(\theta_{0})z$ by $\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})/m\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}-I(\theta_{0})$. Moreover, given that $\widehat{\theta}_{k}$ is consistent for $\theta_{0}$ and $\pi_{0}(\theta)$ is continuous at $\theta_{0}$, we have $\pi_{0}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}\pi_{0}(\theta_{0})$. Hence, to bound the first term on the right hand side of equation S11, we have

\displaystyle\begin{aligned} \int_{A_{2}}(1+\|z\|^{2})\exp\{Kw(z)\}\pi_{0}% \bigg{\{}\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\bigg{\}}\mathrm{d}z&\leq% \int_{A_{2}}2(1+\|z\|^{2})\exp\bigg{\{}-\frac{1}{8}z^{\top}I(\theta_{0})z\bigg% {\}}\pi_{0}(\theta_{0})\,\mathrm{d}z.\end{aligned}

(S12)

The right hand side of equation S12 is arbitrarily close to zero if $\delta_{2}$ is chosen to be large enough. Both terms on the right hand side of equation S11 thus converge to zero, and thus $\int_{A_{2}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{\theta_{% 0}}}}{{\to}}0$ .

3. Proof of $\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{\theta_{% 0}}}}{{\to}}0$ : note that

$\displaystyle\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z$	$\displaystyle\leq\int_{A_{3}}(1+\|z\|^{2})\bigg|\exp\{Kw(z)\}\pi_{0}\left\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\right\}$
	$\displaystyle\quad-\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\pi_{0}(\theta_{0})\bigg|\,\mathrm{d}z$
	$\displaystyle\leq\int_{A_{3}}(1+\|z\|^{2})\exp\{Kw(z)\}\left|\pi_{0}\left\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\right\}-\pi_{0}(\theta_{0})\right|\,\mathrm{d}z$	(S13)
	$\displaystyle\quad+\int_{A_{3}}(1+\|z\|^{2})\pi_{0}(\theta_{0})\left|\exp\{Kw(z)\}-\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\right|\,\mathrm{d}z$	(S14)

To show $\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{\theta_{% 0}}}}{{\to}}0$ , it suffices to show that terms (S13) and (S14) both converge to zero as $m\to\infty$ . For all $z\in A_{3}=\left\{z:\|z\|<\delta_{2}\right\}$ , $\sup_{K\geq 1}|KR_{m}(z)|\leq d^{3}K/6\times\|z/(Km)^{1/2}\|^{3}\times\sum_{i=% 1}^{m}M_{i}(X_{1k},\dots,X_{ik})=o_{\mathbb{P}_{\theta_{0}}}(1)$ . Therefore, for all sufficiently large $m$ , with $\mathbb{P}_{\theta_{0}}$ -probability approaching one,

	$\displaystyle\quad\int_{A_{3}}(1+\|z\|^{2})\exp\{Kw(z)\}\left|\pi_{0}\left\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\right\}-\pi_{0}(\theta_{0})\right|\,\mathrm{d}z$
	$\displaystyle\leq 2\int_{A_{3}}(1+\|z\|^{2})\exp\{Kw(z)\}\pi_{0}(\theta_{0})\,\mathrm{d}z$
	$\displaystyle=2\int_{A_{3}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}\frac{\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})}{m}z+KR_{m}(z)\right\}\pi_{0}(\theta_{0})\,\mathrm{d}z$
	$\displaystyle\leq 2\int_{A_{3}}(1+\|z\|^{2})\exp\{z^{\top}I(\theta_{0})z\}\pi_{0}(\theta_{0})\,\mathrm{d}z<\infty.$

Moreover, for all $z\in A_{3}$, the integrand satisfies $(1+\|z\|^{2})\exp\{Kw(z)\}|\pi_{0}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}-\pi_{0}(\theta_{0})|\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}0$. By the dominated convergence theorem, (S13) converges to zero with $\mathbb{P}_{\theta_{0}}$-probability approaching one. Next, we prove that (S14) also converges to zero. For all $z\in A_{3}$,

	$\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \left\|\exp\{Kw(z)\}-% \exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\right\|$
	$\displaystyle=\left\|\exp\left\{-\frac{1}{2}z^{\top}\frac{\ell_{k}^{\prime% \prime}(\widehat{\theta}_{k})}{m}z+KR_{m}(z)\right\}-\exp\left\{-\frac{1}{2}z^% {\top}I(\theta_{0})z\right\}\right\|\stackrel{{\scriptstyle\mathbb{P}_{\theta_{% 0}}}}{{\to}}0.$

Hence, for all $z\in A_{3}$ , $(1+\|z\|^{2})\pi_{0}(\theta_{0})|\exp\{Kw(z)\}-\exp\{-z^{\top}I(\theta_{0})z/2% \}|\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}0$ . Moreover, $\int_{A_{3}}(1+\|z\|^{2})\pi_{0}(\theta_{0})|\exp\{Kw(z)\}-\exp\{-z^{\top}I(% \theta_{0})z/2\}|\,\mathrm{d}z<\infty$ with $\mathbb{P}_{\theta_{0}}$ -probability approaching one. By the dominated convergence theorem, term (S14) also converges to zero with $\mathbb{P}_{\theta_{0}}$ -probability approaching one.

We have thus proved that $\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z\stackrel{{\scriptstyle\mathbb{P}_{\theta_{0}}}}{{\to}}0$. The final step of the proof is to show that this can be strengthened to $L_{1}$ convergence, that is, $\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right]\stackrel{{\scriptstyle L_{1}}}{{\to}}0$ as $m\to\infty$ in $\mathbb{P}_{\theta_{0}}$-probability. We have

	$\displaystyle\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right]$
	$\displaystyle=\int_{\mathcal{T}}(1+\|z\|^{2})\left|\frac{\exp\{Kw(z)\}\pi_{0}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}}{C_{m}}-\frac{1}{(2\pi)^{d/2}\left\{\det I(\theta_{0})\right\}^{-1/2}}\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\right|\mathrm{d}z$
	$\displaystyle\leq\int_{\Theta}\{1+\|(Km)^{1/2}(\theta-\widehat{\theta}_{k})\|^{2}\}\pi_{m}(\theta\mid\mathbf{X}_{[k]})\,\mathrm{d}\theta+\int_{\mathbb{R}^{d}}\frac{(1+\|z\|^{2})}{(2\pi)^{d/2}\left\{\det I(\theta_{0})\right\}^{-1/2}}\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\mathrm{d}z$
	$\displaystyle=1+\mathbb{E}_{\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})}\{Km\|\theta-\widehat{\theta}_{k}\|^{2}\}+\int_{\mathbb{R}^{d}}\frac{(1+\|z\|^{2})}{(2\pi)^{d/2}\left\{\det I(\theta_{0})\right\}^{-1/2}}\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\mathrm{d}z.$		(S15)

The third term of (S15) is a finite constant and the second term is $\psi(\mathbf{X}_{[k]})$ as defined in Assumption 9. By Assumption 9, we have that $\psi(\mathbf{X}_{[k]})$ is uniformly integrable in $\mathbb{P}_{\theta_{0}}$ and thus $\mathrm{TV}_{2}[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}]$ is also uniformly integrable. Since this quantity converges to zero in $\mathbb{P}_{\theta_{0}}$-probability, as shown above, uniform integrability implies that it also converges to zero in $L_{1}$. ∎

Appendix S3 Verification of Assumption 9

We consider the following AR(2) normal linear model based on independent and identically distributed observations:

	$\displaystyle X_{t}$	$\displaystyle=\beta^{\top}Z_{t}+\varepsilon_{t},$
	$\displaystyle\varepsilon_{t}$	$\displaystyle=\varphi_{1}\varepsilon_{t-1}+\varphi_{2}\varepsilon_{t-2}+\xi_{t% },\quad\varepsilon_{1},\varepsilon_{2},\xi_{t}\stackrel{{\scriptstyle\text{i.i% .d.}}}{{\sim}}\mathrm{N}(0,\sigma^{2}),$

where $\beta\in\mathbb{R}^{p}$ . We write $y=(y_{1},\ldots,y_{T})^{\top},Z=(Z_{1},\ldots,Z_{T})^{\top}$ , $\varepsilon=(\varepsilon_{1},\ldots,\varepsilon_{T})^{\top}$ , and the true parameter is $\theta_{0}=(\beta_{0}^{\top},\sigma_{0}^{2})^{\top}$ . We assume that the roots of the characteristic polynomial of $\varepsilon_{t}$ lie outside of the unit circle of the complex plane, so that $\varepsilon_{t}$ is a stationary process. We impose the following conjugate prior on the parameter $\beta$ :

\beta\mid(\sigma^{2},\mu^{*},\Omega)\sim\mathrm{N}(\mu^{*},\sigma^{2}\Omega)

Let $\left\|\beta_{0}\right\|,\left\|\mu^{*}\right\|\leq c_{1}<+\infty$ . Let the eigenvalues of $\Omega$ and $m^{-1}Z^{\top}Z$ be lower bounded by $c_{2}>0$ and upper bounded by $c_{3}>0$ . Let $\mathbb{E}(\varepsilon_{i}^{4})=c_{4}<\infty$ . The subset posterior distributions of $\beta$ and $\sigma^{2}$ are given by

\beta\mid(y,Z,\mu^{*},\Omega,a,b)\sim\mathrm{N}\left\{\beta^{*},\frac{b^{*}}{a% +Km}(KZ^{\top}\Sigma^{-1}Z+\Omega^{-1})^{-1}\right\}.

By the Yule-Walker equations, for every $i\geq j+2$, we have the following recursive relationship for the entries of $\Sigma$:

\Sigma_{i,j}:=\gamma_{i-j}=\varphi_{1}\gamma_{i-j-1}+\varphi_{2}\gamma_{i-j-2}.

This is a second-order linear recurrence equation. We have $\Sigma_{i,j}=\mathcal{O}(c_{4}^{|i-j|})$ where $0<c_{4}<1$, since we require the roots of the characteristic equation to lie outside the unit circle. Therefore, $\Sigma$ is a Toeplitz matrix whose entries decay at a geometric rate as $|i-j|$ increases.
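The geometric decay can be illustrated numerically. The sketch below works with autocorrelations $\rho_{h}=\gamma_{h}/\gamma_{0}$ of a stationary AR(2) process, starts the recursion from the lag-one Yule-Walker equation $\rho_{1}=\varphi_{1}/(1-\varphi_{2})$, and compares $|\rho_{h}|$ with $r^{h}$, where $r$ is the largest modulus among the roots of $z^{2}-\varphi_{1}z-\varphi_{2}=0$ (the reciprocals of the roots of the characteristic polynomial). The coefficient values and function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def ar2_autocorrelations(phi1, phi2, max_lag):
    """Autocorrelations rho_h = gamma_h / gamma_0 of a stationary AR(2) process."""
    rho = np.empty(max_lag + 1)
    rho[0] = 1.0
    if max_lag >= 1:
        rho[1] = phi1 / (1.0 - phi2)  # Yule-Walker equation at lag 1
    for h in range(2, max_lag + 1):
        rho[h] = phi1 * rho[h - 1] + phi2 * rho[h - 2]
    return rho


def decay_rate(phi1, phi2):
    """Largest modulus among the roots of z^2 - phi1 z - phi2 = 0."""
    return float(np.max(np.abs(np.roots([1.0, -phi1, -phi2]))))


# Hypothetical coefficients satisfying the stationarity condition.
phi1, phi2 = 0.5, 0.3
rho = ar2_autocorrelations(phi1, phi2, max_lag=30)
rate = decay_rate(phi1, phi2)
print(rate, rho[:5])
```

For these coefficients the rate is strictly below one, so $|\rho_{h}|$ is dominated by a geometric sequence, which is the role played by $c_{4}$ in the bound above.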

We seek to bound the smallest eigenvalue of

\Sigma=\begin{pmatrix}\gamma_{0}&\gamma_{1}&\cdots&\gamma_{T-1}\\ \gamma_{1}&\gamma_{0}&\cdots&\gamma_{T-2}\\ \vdots&\vdots&\ddots&\vdots\\ \gamma_{T-1}&\gamma_{T-2}&\cdots&\gamma_{0}\end{pmatrix}.

To this end, we write the $T\times T$ matrix $\Sigma$ as $\Sigma=\Sigma_{K}+E_{K}$, the sum of a $K$-banded Toeplitz matrix $\Sigma_{K}$ and a perturbation matrix $E_{K}$, where $K$ will be chosen later to derive the lower bound for the smallest eigenvalue of $\Sigma$. The two matrices are

\Sigma_{K}=\begin{pmatrix}\gamma_{0}&\gamma_{1}&\cdots&\gamma_{K}&0&\cdots&0\\ \gamma_{1}&\gamma_{0}&\gamma_{1}&\ddots&\gamma_{K}&\ddots&\vdots\\ \vdots&\gamma_{1}&\ddots&\ddots&\ddots&\ddots&0\\ \gamma_{K}&\ddots&\ddots&\gamma_{0}&\gamma_{1}&\ddots&\gamma_{K}\\ 0&\ddots&\ddots&\gamma_{1}&\gamma_{0}&\gamma_{1}&\vdots\\ \vdots&\ddots&\ddots&\ddots&\gamma_{1}&\gamma_{0}&\gamma_{1}\\ 0&\cdots&0&\gamma_{K}&\cdots&\gamma_{1}&\gamma_{0}\end{pmatrix}

and

E_{K}=\begin{pmatrix}0&0&\cdots&\gamma_{K+1}&\gamma_{K+2}&\cdots&\gamma_{T-1}\\ 0&0&0&\ddots&\gamma_{K+1}&\ddots&\vdots\\ \vdots&0&\ddots&\ddots&\ddots&\ddots&\gamma_{K+2}\\ \gamma_{K+1}&\ddots&\ddots&0&0&\ddots&\gamma_{K+1}\\ \gamma_{K+2}&\ddots&\ddots&0&0&0&\vdots\\ \vdots&\ddots&\ddots&\ddots&0&0&0\\ \gamma_{T-1}&\cdots&\gamma_{K+2}&\gamma_{K+1}&\cdots&0&0\end{pmatrix}.

We will lower bound the smallest eigenvalue of $\Sigma_{K}$ and upper bound the largest eigenvalue of $E_{K}$. To lower bound the smallest eigenvalue of $\Sigma_{K}$, we use Proposition 4.5 of Bini and Capovani, (1983). When $K=4k+3$, where $k\geq 1$ is an integer, the eigenvalues of $\Sigma_{K}$ are given by $P_{2k+1}(\mu_{i})=\sum_{j=0}^{2k+1}\mu^{j}_{i}C_{j+1}$, where $1\leq i\leq T$, $\mu_{i}=2\cos[\pi i/(T+8k+1)]$, and $C_{j+1}>0$ are constants independent of $i$. We choose $k=3T/16$. We have

P_{2k+1}(\mu_{i})=\sum_{j=0}^{2k+1}\mu^{j}_{i}C_{j+1}\geq C_{2}\cos\left(\frac% {\pi i}{T+8k+1}\right)\geq C_{2}\cos\left(\frac{\pi T}{T+8k+1}\right)\geq C_{2% }\cos\left(\frac{2}{5}\pi\right)

(S16)

for every $1\leq i\leq T$ .

The largest eigenvalue of $E_{K}$ is upper bounded by the Frobenius norm of $E_{K}$, that is, $\|E_{K}\|$. Hence, an upper bound on $\|E_{K}\|$ provides an upper bound for the largest eigenvalue. When $k=3T/16$, we have

\|E_{K}\|^{2}=\mathrm{tr}(E_{K}^{\top}E_{K})\leq T\sum_{i=K+1}^{T}\gamma_{i}^{2}=T\sum_{i=4k+4}^{T}\gamma_{i}^{2}=\mathcal{O}\left(Tc_{4}^{(3/2)T}\right).

(S17)

Combining equations (S16) and (S17), when $T$ is sufficiently large and $k=3T/16$ , the smallest eigenvalue of $\Sigma$ is lower bounded by a positive constant. Therefore, the largest eigenvalue of $\Sigma^{-1}$ is upper bounded by a positive constant, which we denote by $c_{5}$ .
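The conclusion that the smallest eigenvalue of $\Sigma$ stays bounded away from zero as $T$ grows can be checked numerically. The sketch below is illustrative only: it assumes hypothetical AR(2) coefficients, rescales so that $\gamma_{0}=1$, builds the Toeplitz matrix from the Yule-Walker recursion, and reports the smallest eigenvalue for several values of $T$. It does not use the banded decomposition of the proof.

```python
import numpy as np


def ar2_correlation_matrix(phi1, phi2, size):
    """Toeplitz matrix with entries rho_{|i-j|} for a stationary AR(2) process."""
    rho = np.empty(size)
    rho[0] = 1.0
    if size > 1:
        rho[1] = phi1 / (1.0 - phi2)  # Yule-Walker equation at lag 1
    for h in range(2, size):
        rho[h] = phi1 * rho[h - 1] + phi2 * rho[h - 2]
    idx = np.arange(size)
    return rho[np.abs(idx[:, None] - idx[None, :])]


def smallest_eigenvalue(mat):
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(mat)[0])


# Hypothetical coefficients and matrix sizes, chosen only for illustration.
sizes = [50, 100, 200, 400]
min_eigs = [smallest_eigenvalue(ar2_correlation_matrix(0.5, 0.3, T)) for T in sizes]
print(dict(zip(sizes, min_eigs)))
```

The smallest eigenvalue cannot increase with $T$, because each matrix is a leading principal submatrix of the next, but it levels off at a positive value, in line with the bound derived above.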

We will show that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathbb{E}_{\Pi_{m}(\cdot\mid y,Z)}Km\|% \beta-\widehat{\beta}\|^{2}<\infty$ . The MLE of $\beta$ is given by $\widehat{\beta}=(Z^{\top}\Sigma^{-1}Z)^{-1}Z^{\top}\Sigma^{-1}y$ . It is clear that

\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathbb{E}_{\Pi_{m}(\cdot\mid y,Z)}Km\|% \beta-\widehat{\beta}\|^{2}=Km\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{tr}% \left\{\operatorname{var}_{\pi_{m}(\cdot\mid y,Z)}(\beta)\right\}+Km\mathbb{E}% _{\mathbb{P}_{\theta_{0}}}\|\mathbb{E}_{\Pi_{m}(\cdot\mid y,Z)}\beta-\widehat{% \beta}\|^{2},

where $\mathrm{tr}(A)$ denotes the trace of a generic square matrix $A$ . The posterior variance of $\beta$ can be bounded as

\displaystyle\begin{aligned} &Km\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{tr% }\{\mathrm{var}_{\pi_{m}(\cdot\mid y,Z)}(\beta)\}\\ &=Km\frac{a+Km+p}{a+Km+p-2}\times\mathrm{tr}\left\{\mathbb{E}_{\mathbb{P}_{% \theta_{0}}}\frac{b^{*}}{a+Km}(KZ^{\top}\Sigma^{-1}Z+\Omega^{-1})^{-1}\right\}% \\ &\leq 2\mathrm{tr}\left\{\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(b+c_{1}^{2}c_{2}% ^{-1}+Ky^{\top}y)(KZ^{\top}\Sigma^{-1}Z+\Omega^{-1})^{-1}\right\}\\ &\leq 2\mathrm{tr}\left\{\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(b+c_{1}^{2}c_{2}% ^{-1}+Kmc_{1}^{2}c_{2}+Km\sigma_{0}^{2})(Kmc_{2}c_{5}I_{p}+c_{3}^{-1}I_{p})^{-% 1}\right\}\\ &=2p\frac{Km(c_{1}^{2}c_{2}+\sigma_{0}^{2})+b+c_{1}^{2}c_{2}^{-1}}{Kmc_{2}c_{5% }+c_{3}^{-1}}\rightarrow\frac{2p(c_{1}^{2}c_{2}+\sigma_{0}^{2})}{c_{2}c_{5}}% \text{ as }m\rightarrow\infty.\end{aligned}

Appendix S4 ACF and PACF of time series used in synthetic data examples

S4.1 Linear regression with auto-regressive errors (LABEL:{sec.app.ar.error} of main text)

We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the error terms $\varepsilon_{1:T}$ generated using model (6) in Figure 10; recall that this is for $T=10^{5}$ observations. The ACF or PACF is highest when $\text{lag}=2$ . The autocorrelation is mild for this simulation.
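For readers who wish to reproduce this kind of diagnostic, the following Python sketch computes the sample ACF directly and the sample PACF through the Durbin-Levinson recursion. The AR(2) coefficients, series length, and function names are hypothetical and do not correspond to the parameters of model (6); the simulated series is only meant to show the expected cut-off of the PACF after lag 2 for AR(2) errors.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - h], x[h:]) / denom for h in range(max_lag + 1)])


def pacf_from_acf(rho):
    """Partial autocorrelations at lags 0..len(rho)-1 via Durbin-Levinson."""
    max_lag = len(rho) - 1
    pacf = np.zeros(max_lag + 1)
    pacf[0] = 1.0
    prev = np.zeros(0)  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            cur = np.array([phi_kk])
        else:
            num = rho[k] - np.dot(prev, rho[k - 1:0:-1])
            den = 1.0 - np.dot(prev, rho[1:k])
            phi_kk = num / den
            cur = np.append(prev - phi_kk * prev[::-1], phi_kk)
        pacf[k] = phi_kk
        prev = cur
    return pacf


# Hypothetical AR(2) error series (coefficients are assumptions, not model (6)).
rng = np.random.default_rng(0)
phi1, phi2, n = 0.5, 0.3, 20_000
noise = rng.standard_normal(n)
eps = np.zeros(n)
for t in range(2, n):
    eps[t] = phi1 * eps[t - 1] + phi2 * eps[t - 2] + noise[t]

acf = sample_acf(eps, max_lag=10)
pacf = pacf_from_acf(acf)
print(np.round(acf[:4], 3), np.round(pacf[:4], 3))
```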

S4.2 GARCH model (LABEL:{sec.app.garch} of main text)

We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the error terms $\varepsilon_{1:T}$ and observations $X_{1:T}$ generated using model equation 7 in Figure 11; recall that this is for $T=2\times 10^{5}$ observations. The variance series $\sigma_{t}^{2}$ exhibits a strong autocorrelation, but the observation series $X_{1:T}$ exhibits a negligible autocorrelation.

S4.3 Binary auto-regressive model (LABEL:{sec.binary_ar_model} of main text)

We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the success probabilities $\mathbb{P}(Y_{t}=1)$ generated using model equation 9 in Figure 12; recall that this is for $T=2\times 10^{5}$ observations. The series $\mathbb{P}(Y_{t}=1)$ exhibits a weak autocorrelation, as we observe an auto-correlation of less than $0.2$ for every lag size.
