Scalable Bayesian inference for time series via divide-and-conquer

Rihui Ou (1,*)    Deborshee Sen (2,*,†)    David Dunson (1)
(1) Department of Statistical Science, Duke University
(2) Google
(*) The two authors contributed equally to this paper.
(†) Corresponding author; [email protected].
Abstract

Bayesian computational algorithms tend to scale poorly as data size increases. This has motivated divide-and-conquer-based approaches for scalable inference. These divide the data into subsets, perform inference for each subset in parallel, and then combine these inferences. While appealing theoretical properties and practical performance have been demonstrated for independent observations, scalable inference for dependent data remains challenging. In this work, we study the problem of Bayesian inference from very long time series. The literature in this area focuses mainly on approximate approaches that usually lack rigorous theoretical guarantees and may provide arbitrarily poor accuracy in practice. We propose a simple and scalable divide-and-conquer method, and provide accuracy guarantees. Numerical simulations and real data applications demonstrate the effectiveness of our approach.

Keywords   Dependent data; Dynamic models; Embarrassingly parallel; Markov chain Monte Carlo; Scalable Bayes; Wasserstein barycenter.

1 Introduction

Massive amounts of data are routinely collected in a variety of settings. This has necessitated the development of algorithms for statistical inference that scale well with data size. While variational (Beal, 2003; Blei et al., 2017) and sequential Monte Carlo (Del Moral et al., 2006) methods are popular, Markov chain Monte Carlo (MCMC) algorithms remain the default approach for most Bayesian statisticians. Unfortunately, usual MCMC scales at least linearly with the size of the data, which is inadequate for truly massive datasets. This has motivated the development of scalable versions of MCMC algorithms.

One approach is to use sub-sampling-based algorithms for scalable MCMC (Ma et al., 2015; Quiroz et al., 2018; Nemeth and Fearnhead, 2020). The main idea is to use random subsets of the data to approximate likelihoods and gradients at each MCMC iteration. This includes algorithms based on Langevin (Welling and Teh, 2011) and Hamiltonian (Chen et al., 2014) dynamics. Such algorithms were initially developed for independent and identically distributed observations, and have since been extended to hidden Markov models (HMMs) (Ma et al., 2017b; Aicher et al., 2019), and more recently to general stationary time series models (Salomone et al., 2020; Villani et al., 2022). These algorithms incorporate approximations to the transition kernel that can have an unclear impact on the stationary distribution of the Markov chain, and theoretical guarantees require verifying challenging conditions on a case-by-case basis.

A promising recent approach based on stochastic gradients for independent observations has been the use of non-reversible continuous-time Markov processes, in particular, piecewise-deterministic Markov processes (Bouchard-Côté et al., 2018; Bierkens et al., 2019). These are appealing because, unlike other stochastic gradient MCMC algorithms, they preserve the exact posterior distribution as the invariant distribution of the process. While promising, current algorithms require the construction of certain upper bounds for gradients, which limits their practical applicability. Moreover, recent results by Johndrow et al. (2020) have shown that sub-sampling-based posterior sampling algorithms have fundamental limitations on scalable performance in big data settings.

An alternative class of methods relies instead on divide-and-conquer based approaches for scalable Bayesian inference. In this case, the entire data are divided into subsets, inference is carried out using MCMC for each subset in parallel, and finally the so-called subset posteriors are combined. A key focus in the literature has been on how to combine the subset posteriors. An early proposal was the consensus Monte Carlo algorithm of Scott et al. (2016), which is based on averaging draws from each subset posterior – a strategy that can be justified theoretically when the subset posteriors are approximately Gaussian. Subsequent work has exploited the product form of the joint posterior to obtain combining algorithms, including using kernel smoothing (Neiswanger et al., 2014), multi-scale histograms (Wang et al., 2015), or a Weierstrass approximation (Wang and Dunson, 2013). Alternatively, Minsker et al. (2014) proposed to combine via the geometric median of subset posteriors.

A recent thread has focused on using the Wasserstein barycenter (WB) of subset posteriors to combine them (Li et al., 2017; Srivastava et al., 2018). Each subset posterior is raised to an appropriate power to obtain a subset-based approximation to the full data posterior. Then one can ‘average’ these approximations to obtain a provably accurate approximation to the full data posterior. The Wasserstein barycenter provides an appropriate geometric notion of average for probability measures. In addition to impressive practical performance, it has also been shown that WB-based combining algorithms result in optimal posterior contraction rates and give the correct coverage probabilities of credible sets in certain settings (Szabó and Van Zanten, 2019).

The approaches described in the previous two paragraphs are termed ‘embarrassingly parallel’ procedures where communication between cores is limited to a single unification step. These algorithms introduce an additional source of error beyond the MCMC sampling error(s) by combining posterior samples from the subset posteriors in an approximate manner. This means that the samples produced from the posterior are approximate even if one is able to produce exact samples from the subset posteriors. In contrast, Dai et al. (2023) have developed an algorithm which, while not embarrassingly parallel, does not induce this additional source of error for essentially independent data (see also Dai et al., 2019, for an earlier related paper). In this paper, we focus on embarrassingly parallel approaches for simplicity.

The above literature on divide-and-conquer algorithms focuses on independent data settings. In this article, we are motivated by the considerable practical and theoretical success of the WB-based algorithms to study appropriate modifications for very long time series. To our knowledge, although Guhaniyogi et al. (2017) developed a related approach for massive spatial datasets, there has been no consideration of divide-and-conquer Bayesian algorithms in the time series setting. For very long time series, usual MCMC and sequential Monte Carlo algorithms are impractical to implement. There is a rich literature, mostly in machine learning, on alternatives – ranging from variational approximations (Johnson and Willsky, 2014; Foti et al., 2014) to assumed density filtering (Lauritzen, 1992). In general, these approaches lack any theoretical guarantees on accuracy in approximating the true posterior, and indeed it is well known that they can perform arbitrarily poorly in practice – for example, substantially under-estimating posterior uncertainty. When parallel computing resources are available, the divide-and-conquer approach can have major practical advantages over such stochastic-gradient based methods.

In this article, we develop a simple, broadly applicable, and theoretically supported class of WB-based divide-and-conquer methods for massive time series – considering general models and not just HMMs. We call this divide-and-conquer for Bayesian time series (DC-BATS). Our approach is based on dividing the time series sequentially over time. One could alternatively devise a divide-and-conquer approach based on the Whittle likelihood (Whittle, 1951), which consists of transforming a stationary time series to its frequency domain; Salomone et al. (2020) and Villani et al. (2022) devise sub-sampling MCMC algorithms based on this. However, this will be unable to handle things like state space models and missing data as naturally as using the standard time domain approach, and we do not pursue this further here. Moreover, Guhaniyogi et al. (2017) considered a divide-and-conquer approach for large scale kriging, and a couple of differences between their work and ours are that (a) they only consider dependence using Gaussian processes, whereas we consider arbitrary time series models, and (b) we provide error guarantees for the difference between the Wasserstein posterior and the true posterior (Theorem 1), which in turn implies error rates for the bias and variance of the Wasserstein posterior (Theorem 2), whereas they provide guarantees on the mean and bias (Theorem 3.1 in their paper), which in turn implies error rates for the $L_2$ risk.

In work independent of ours, Wang and Srivastava (2023) propose a divide-and-conquer approach for finite state-space HMMs. Three differences between their work and ours are that (a) we use the Wasserstein barycenter to combine subset posteriors, whereas they use an extension of the double-parallel Monte Carlo algorithm of Xue and Liang (2019), (b) we consider generic time series models and not just finite state-space HMMs, and (c) they consider likelihoods which can be calculated exactly for finite state-space HMMs using the forward-backward filtering algorithm, whereas we are more flexible as described in Section 2.2.

The rest of the article is organized as follows. We introduce the proposed DC-BATS method in Section 2. Section 3 is devoted to a theoretical analysis of the method. In particular, we show that the proposed method returns asymptotically exact estimates of the posterior distribution. We demonstrate the proposed method on a variety of time series models with synthetic data in Section 4. We apply the proposed method to a Los Angeles particulate matter dataset in Section 5. Finally, Section 6 concludes the article. All proofs are contained in the Supplementary Material.

2 Divide-and-conquer for time series

2.1 Generic time series model

We are interested in observations indexed by time in this article. We follow the definition of Cinlar (2011) to define such a time-indexed stochastic process. We let $\mathbb{Z}$ be the set of integers. For each $t \in \mathbb{Z}$, let $X_t$ be a random variable taking values in $(\mathbb{X}, \mathcal{X})$. The collection $\{X_t : t \in \mathbb{Z}\}$ is a stochastic process with state space $(\mathbb{X}, \mathcal{X})$. For any two integers $t_1 \leq t_2$, we use the notation $X_{t_1:t_2}$ to denote $(X_{t_1}, \dots, X_{t_2})$. Of primary interest in this article will be a time series $X_{1:T} = (X_1, \dots, X_T)$; these will be our observations.
We assume that the conditional distribution of $X_t \mid (X_{1:(t-1)}, X_s)$ is independent of $X_s$ for any $s > t$. This means that there is indeed a temporal dependence among the observations and precludes things like spatial dependence where an observation $X_t$ depends on both “past” and “future” observations. Divide-and-conquer approaches for spatial data using Gaussian processes have been proposed by Guhaniyogi et al. (2017), and we do not focus on this here. We assume that the conditional distribution of $X_t \mid X_{1:(t-1)}$ is parameterised by $\theta = (\theta_1, \dots, \theta_d) \in \Theta \subseteq \mathbb{R}^d$ for every $t = 2, \dots, T$, and the marginal distribution of $X_1$ is also parameterised by the same $\theta$.

We assume in the sequel that all measures we consider on $\Theta$ admit densities with respect to a reference measure corresponding to the Lebesgue measure on $\mathbb{R}^d$. These will include the conditional distributions and the marginal distribution defined above, as well as posterior distributions to be defined below.

In general, the log-likelihood of the observations $X_{1:T}$ can be written as

$$\ell(\theta) = \sum_{t=1}^{T} \log p_\theta(X_t \mid X_{1:(t-1)}), \tag{1}$$

where $p_\theta(X_t \mid X_{1:(t-1)})$ denotes the likelihood of $X_t$ given the previous values $X_{1:(t-1)}$; when $t = 1$, $p_\theta(X_t \mid X_{1:(t-1)}) = p_\theta(X_1)$ is the marginal density of $X_1$. The likelihood of any temporal sequence of observations can be written as equation 1, including independent observations. In this article, we are specifically interested in the situation where the observations are not independent. This includes a variety of commonly-used statistical models for time series analysis such as autoregressive moving average models and autoregressive conditional heteroskedasticity models. More generally, this includes hidden Markov models as well.
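To make equation 1 concrete, the following minimal sketch evaluates it for a stationary Gaussian AR(1) model, $X_t = \phi X_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$ and $|\phi| < 1$. The choice of model, the function names and the simulated data are assumptions made purely for illustration; they are not taken from the article.

```python
import math
import random

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def ar1_loglik(x, phi, sigma2):
    """Equation (1) for a stationary Gaussian AR(1) model (illustrative choice).

    The t = 1 term uses the stationary marginal N(0, sigma2 / (1 - phi^2));
    each later term conditions on the past, which for AR(1) reduces to x[t-1].
    """
    total = normal_logpdf(x[0], 0.0, sigma2 / (1.0 - phi ** 2))
    for t in range(1, len(x)):
        total += normal_logpdf(x[t], phi * x[t - 1], sigma2)
    return total

def simulate_ar1(T, phi, sigma2, seed=0):
    """Draw X_{1:T} from the stationary AR(1) process."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, math.sqrt(sigma2 / (1.0 - phi ** 2)))]
    for _ in range(T - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, math.sqrt(sigma2)))
    return x

x = simulate_ar1(T=2000, phi=0.6, sigma2=1.0)
ll_true = ar1_loglik(x, phi=0.6, sigma2=1.0)    # log-likelihood at the data-generating value
ll_wrong = ar1_loglik(x, phi=-0.6, sigma2=1.0)  # log-likelihood at a poor value
```

Each evaluation is a single pass over all $T$ observations, so its cost grows linearly with the length of the series.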

Bayesian inference for $\theta$ involves placing a prior distribution $\Pi_0(\mathrm{d}\theta)$ on $\theta$ and computing the posterior distribution

$$\Pi_T(\mathrm{d}\theta \mid X_{1:T}) \propto p_\theta(X_1) \left\{ \prod_{t=2}^{T} p_\theta(X_t \mid X_{1:(t-1)}) \right\} \Pi_0(\mathrm{d}\theta). \tag{2}$$

We call this the full posterior as it is the posterior conditional on the entire dataset. Samples from the posterior (2) can be drawn using Markov chain Monte Carlo (MCMC) algorithms such as the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), including efficient gradient-based algorithms such as the Metropolis-adjusted Langevin algorithm (MALA; Roberts and Tweedie, 1996) and Hamiltonian Monte Carlo (HMC; Duane et al., 1987). Unfortunately the log-likelihood (1) needs to be evaluated at every iteration of MCMC algorithms, making it computationally intractable when $T$ is large. Moreover, for large $T$, memory constraints may make it infeasible to store and manipulate the entire dataset on a single computer, thus precluding standard MCMC on the entire dataset. In this paper, we propose an embarrassingly parallel divide-and-conquer strategy to tackle this issue.
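The per-iteration cost can be seen in a minimal random-walk Metropolis sketch for the full posterior (2). The AR(1) model with known unit innovation variance, the Uniform$(-1, 1)$ prior on $\phi$, the step size and all function names are assumptions made for illustration; this is not the sampler used in the article.

```python
import math
import random

def ar1_loglik(x, phi, sigma2=1.0):
    """Log-likelihood (1) for a stationary Gaussian AR(1) model; costs O(T)."""
    var1 = sigma2 / (1.0 - phi ** 2)
    total = -0.5 * (math.log(2.0 * math.pi * var1) + x[0] ** 2 / var1)
    for t in range(1, len(x)):
        resid = x[t] - phi * x[t - 1]
        total += -0.5 * (math.log(2.0 * math.pi * sigma2) + resid ** 2 / sigma2)
    return total

def random_walk_metropolis(x, n_iter, step, rng):
    """Sample phi under a Uniform(-1, 1) prior (an illustrative choice)."""
    phi = 0.0
    current = ar1_loglik(x, phi)
    draws, n_loglik_evals = [], 1
    for _ in range(n_iter):
        proposal = phi + rng.gauss(0.0, step)
        if abs(proposal) < 1.0:  # the prior has zero mass outside (-1, 1)
            candidate = ar1_loglik(x, proposal)  # full pass over the data
            n_loglik_evals += 1
            # Accept with probability min(1, posterior ratio).
            if rng.random() < math.exp(min(0.0, candidate - current)):
                phi, current = proposal, candidate
        draws.append(phi)
    return draws, n_loglik_evals

rng = random.Random(1)
true_phi, T = 0.6, 300
x = [rng.gauss(0.0, math.sqrt(1.0 / (1.0 - true_phi ** 2)))]
for _ in range(T - 1):
    x.append(true_phi * x[-1] + rng.gauss(0.0, 1.0))

draws, n_evals = random_walk_metropolis(x, n_iter=2000, step=0.1, rng=rng)
kept = draws[500:]  # discard an initial burn-in period
posterior_mean = sum(kept) / len(kept)
```

Every proposal inside the support triggers one pass over all $T$ observations, so the total work is of order (number of iterations) $\times$ $T$.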

2.2 Divide-and-conquer algorithm

A generic divide-and-conquer strategy proceeds by dividing the $T$ observations into $K$ disjoint subsets of sizes $m_1, \dots, m_K$, respectively, such that $\sum_{k=1}^{K} m_k = T$. To keep things simple, we assume that the subset sizes are equal, that is, $m_1 = \cdots = m_K = m = T/K$. While independent observations can be divided arbitrarily into subsets, in the time series setup we divide them sequentially over time; we thus consider subsequences instead of subsets. We denote the observations within the $k$th subsequence by $\mathbf{X}_{[k]}$ and the complete dataset by $\mathbf{X} = (\mathbf{X}_{[1]}, \dots, \mathbf{X}_{[K]})$.
In particular, we have that $\mathbf{X}_{[1]} = X_{1:m}$, $\mathbf{X}_{[2]} = X_{(m+1):2m}$, $\dots$, $\mathbf{X}_{[K]} = X_{((K-1)m+1):T}$.
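The sequential split can be sketched in a few lines. The function name is hypothetical, and the sketch assumes, as in the text, that $K$ divides $T$.

```python
def split_into_subsequences(x, K):
    """Split x into K consecutive blocks of equal length m = T / K."""
    T = len(x)
    if K <= 0 or T % K != 0:
        raise ValueError("this sketch assumes K divides T")
    m = T // K
    # Block k (0-based) holds observations k*m + 1, ..., (k+1)*m in 1-based time.
    return [x[k * m:(k + 1) * m] for k in range(K)]

x = list(range(1, 13))  # stand-in for X_1, ..., X_12
blocks = split_into_subsequences(x, K=3)
```

Unlike a random partition, the temporal order inside every block is preserved, which is what allows a time series model to be fitted to each block separately.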

For each subsequence $\mathbf{X}_{[k]}$, $k = 1, \dots, K$, we first define pseudo likelihoods $\widetilde{p}_\theta(\mathbf{X}_{[k]})$ by ignoring past observations and assuming that $\mathbf{X}_{[k]}$ starts at time one. We require some additional notation to define this formally. Let $p_{1,\theta}$ denote the marginal density of $X_1$, and for $t = 2, \dots, T$, let $p_{t \mid -,\theta}$ denote the conditional density of $X_t \mid X_{1:(t-1)}$.
Using this notation, the density of the first subsequence $\mathbf{X}_{[1]}$ is $p_\theta(\mathbf{X}_{[1]}) = p_{1,\theta}(X_1) \times \prod_{t=2}^{m} p_{t \mid -,\theta}(X_t \mid X_{1:(t-1)})$. We define $\widetilde{p}_\theta(\mathbf{X}_{[1]}) = p_\theta(\mathbf{X}_{[1]})$, and for $k = 2, \dots, K$, we define

$$\widetilde{p}_\theta(\mathbf{X}_{[k]}) = p_{1,\theta}(X_{(k-1)m+1}) \times \prod_{t=2}^{m} p_{t \mid -,\theta}\big(X_{(k-1)m+t} \mid X_{((k-1)m+1):((k-1)m+t-1)}\big). \tag{3}$$

While the notation here is somewhat clunky, all we do is simply ignore observations before each subsequence and treat it as if it starts at time one. This is easy to implement in practice.
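As an illustration of how little changes in practice, the sketch below applies the same likelihood routine to the whole series and to each subsequence. It assumes a stationary Gaussian AR(1) model with unit innovation variance; the model, the names and the simulated data are illustrative assumptions rather than the article's examples.

```python
import math
import random

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def ar1_loglik(block, phi, sigma2=1.0):
    """Log-likelihood of a block treated as if it started at time one.

    Applied to the whole series this is equation (1); applied to the k-th
    subsequence it is the log of the pseudo likelihood in equation (3).
    """
    stationary_var = sigma2 / (1.0 - phi ** 2)
    total = normal_logpdf(block[0], 0.0, stationary_var)
    for t in range(1, len(block)):
        total += normal_logpdf(block[t], phi * block[t - 1], sigma2)
    return total

rng = random.Random(0)
phi, T, K = 0.6, 1200, 4
m = T // K
x = [rng.gauss(0.0, math.sqrt(1.0 / (1.0 - phi ** 2)))]
for _ in range(T - 1):
    x.append(phi * x[-1] + rng.gauss(0.0, 1.0))

blocks = [x[k * m:(k + 1) * m] for k in range(K)]
full_loglik = ar1_loglik(x, phi)
pseudo_loglik = sum(ar1_loglik(b, phi) for b in blocks)

# For this first-order Markov model, only the K - 1 block-initial terms differ:
# the full likelihood conditions on the previous observation, whereas the
# pseudo likelihood uses the stationary marginal instead.
boundary_gap = sum(
    normal_logpdf(x[k * m], phi * x[k * m - 1], 1.0)
    - normal_logpdf(x[k * m], 0.0, 1.0 / (1.0 - phi ** 2))
    for k in range(1, K)
)
```

In this particular model the pseudo log-likelihoods sum to the full log-likelihood up to $K - 1$ boundary terms. That exact accounting is special to first-order Markov models; for models with longer memory, more terms in each block are affected.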

Using these pseudo likelihoods, we define subsequence posteriors $\Pi_m(\mathrm{d}\theta \mid \mathbf{X}_{[k]})$ for every subsequence $\mathbf{X}_{[k]}$ as

$$\Pi_m(\mathrm{d}\theta \mid \mathbf{X}_{[k]}) = \frac{\widetilde{p}_\theta(\mathbf{X}_{[k]})^{\gamma_k} \, \Pi_0(\mathrm{d}\theta)}{\int_\Theta \widetilde{p}_\theta(\mathbf{X}_{[k]})^{\gamma_k} \, \Pi_0(\mathrm{d}\theta)}, \tag{4}$$

where $\gamma_1, \dots, \gamma_K > 0$. In other words, we define subsequence posteriors for each subsequence $\mathbf{X}_{[k]}$ by ignoring observations before it. We have used the subscript $m$ in $\Pi_m(\mathrm{d}\theta \mid \mathbf{X}_{[k]})$ to signify that the lengths of the subsequences are equal to $m$. For a more compact notation, we shall use $\Pi_{m,k}(\mathrm{d}\theta)$ to denote $\Pi_m(\mathrm{d}\theta \mid \mathbf{X}_{[k]})$.

The quantities $\gamma_1, \dots, \gamma_K$ in equation 4 control the interplay between the prior and the likelihood, as well as the relative importance of each subsequence. Consider the situation where we divide the time series into only one subsequence, that is, $K = 1$. In this case, choosing $\gamma_1 = 1$ in equation 4 gives the full posterior (2). More generally, for $K$ subsequences, choosing $\gamma_1 = \cdots = \gamma_K = 1$ results in equation 4 giving the usual posteriors for the subsequences. This results, however, in the subsequence posteriors having a different scale than the full posterior as they are based on only a fraction of observations as compared to the full posterior.

We shall assume for our theoretical analysis that the series is stationary (defined formally in Assumption 1), which means that the joint distribution of $\mathbf{X}_{[k]}$ is the same for all $k$. Heuristically, this suggests that the amount of ‘information’ contained within each subsequence is the same, and it also suggests that the ‘information’ contained in the entire time series of length $T$ is intuitively $K$ times more than that contained within each subsequence. This leads us to choose $\gamma_1 = \cdots = \gamma_K = T/m = K$. Indeed, our theoretical results in Section 3 and numerical experiments in Section 4 indicate that this is an appropriate choice in our setting.
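The effect of the exponent on the scale of a subsequence posterior can be checked by hand in a toy setting. The calculation below assumes independent $\mathcal{N}(\theta, \sigma^2)$ observations with known $\sigma^2$ and a flat (improper) prior, with $\bar{X}_{[k]}$ the mean of the $k$th block and $\bar{X}$ the overall mean. These are simplifying assumptions for illustration only, not the time series setting analysed in the article.

```latex
% Toy i.i.d. Gaussian location model with known variance \sigma^2 and a flat
% prior (assumptions made only for this illustration).
\begin{align*}
\widetilde{p}_\theta(\mathbf{X}_{[k]})^{\gamma}
  &\propto \exp\!\left\{ -\frac{\gamma m}{2\sigma^2}\,(\theta - \bar{X}_{[k]})^2 \right\}
  \quad\Longrightarrow\quad
  \Pi_{m,k} = \mathcal{N}\!\left(\bar{X}_{[k]},\ \frac{\sigma^2}{\gamma m}\right), \\
\gamma = 1:&\quad \operatorname{Var} = \frac{\sigma^2}{m} = K \cdot \frac{\sigma^2}{T}, \\
\gamma = K:&\quad \operatorname{Var} = \frac{\sigma^2}{mK} = \frac{\sigma^2}{T},
  \quad\text{matching the full posterior } \mathcal{N}\!\left(\bar{X},\ \frac{\sigma^2}{T}\right).
\end{align*}
```

With $\gamma = 1$ each subsequence posterior is $K$ times too dispersed relative to the full posterior, whereas $\gamma = K$ restores the correct variance in this toy model.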

A key question is how to combine these subsequence posteriors to approximate the full posterior. We use the Wasserstein barycenter of the set of subsequence posteriors in this article. Let 𝒫2(Θ)subscript𝒫2Θ\mathcal{P}_{2}(\Theta)caligraphic_P start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( roman_Θ ) denote the set of all probability measures on ΘdΘsuperscript𝑑\Theta\subseteq\mathbb{R}^{d}roman_Θ ⊆ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT with finite second moments. In such a setting, the Wasserstein-2222 (W2subscriptW2\mathrm{W}_{2}roman_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT) distance between probability measures μ,ν𝒫2(Θ)𝜇𝜈subscript𝒫2Θ\mu,\nu\in\mathcal{P}_{2}(\Theta)italic_μ , italic_ν ∈ caligraphic_P start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( roman_Θ ) is defined as

W2(μ,ν)={infγΓ(μ,ν)Θ×Θθ1θ22γ(dθ1dθ2)}1/2,subscriptW2𝜇𝜈superscriptsubscriptinfimum𝛾Γ𝜇𝜈subscriptΘΘsuperscriptnormsubscript𝜃1subscript𝜃22𝛾dsubscript𝜃1dsubscript𝜃212\mathrm{W}_{2}(\mu,\nu)=\left\{\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{\Theta% \times\Theta}\|\theta_{1}-\theta_{2}\|^{2}\gamma(\mathrm{d}\theta_{1}\,\mathrm% {d}\theta_{2})\right\}^{1/2},roman_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_μ , italic_ν ) = { roman_inf start_POSTSUBSCRIPT italic_γ ∈ roman_Γ ( italic_μ , italic_ν ) end_POSTSUBSCRIPT ∫ start_POSTSUBSCRIPT roman_Θ × roman_Θ end_POSTSUBSCRIPT ∥ italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_γ ( roman_d italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT roman_d italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) } start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT ,

where Γ(μ,ν)Γ𝜇𝜈\Gamma(\mu,\nu)roman_Γ ( italic_μ , italic_ν ) denotes the set of all probability measures on Θ×ΘΘΘ\Theta\times\Thetaroman_Θ × roman_Θ with marginals μ𝜇\muitalic_μ and ν𝜈\nuitalic_ν, respectively, and \|\cdot\|∥ ⋅ ∥ denotes the Euclidean norm. Convergence in W2subscriptW2\mathrm{W}_{2}roman_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT distance on 𝒫2(Θ)subscript𝒫2Θ\mathcal{P}_{2}(\Theta)caligraphic_P start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( roman_Θ ) is equivalent to weak convergence plus convergence of the second moment (Bickel and Freedman,, 1981, Lemma 8.3). The Wasserstein barycenter of the subsequence posteriors is defined as

Π¯T(dθ𝐗)=argminμ𝒫2(Θ)k=1KW22(μ,Πm,k),subscript¯Π𝑇conditionald𝜃𝐗𝜇subscript𝒫2Θargminsuperscriptsubscript𝑘1𝐾superscriptsubscriptW22𝜇subscriptΠ𝑚𝑘\overline{\Pi}_{T}(\mathrm{d}\theta\mid\mathbf{X})=\underset{\mu\in\mathcal{P}% _{2}(\Theta)}{\operatorname*{argmin}}\sum_{k=1}^{K}\mathrm{W}_{2}^{2}(\mu,\Pi_% {m,k}),over¯ start_ARG roman_Π end_ARG start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( roman_d italic_θ ∣ bold_X ) = start_UNDERACCENT italic_μ ∈ caligraphic_P start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( roman_Θ ) end_UNDERACCENT start_ARG roman_argmin end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_μ , roman_Π start_POSTSUBSCRIPT italic_m , italic_k end_POSTSUBSCRIPT ) , (5)

where we use the subscript $T$ to denote that the Wasserstein barycenter is based on $T$ total observations. We also assume that $\overline{\Pi}_T(\mathrm{d}\theta\mid\mathbf{X})$ admits a density $\overline{\pi}_T(\theta\mid\mathbf{X})$ with respect to the Lebesgue measure, that is, $\overline{\Pi}_T(\mathrm{d}\theta\mid\mathbf{X})=\overline{\pi}_T(\theta\mid\mathbf{X})\,\mathrm{d}\theta$.

Remark 1.

Although other distances between probability measures can potentially be used in place of $\mathrm{W}_2^2$ in equation 5, the Wasserstein-2 distance is particularly appealing because the resulting barycenter is the geometric center of the subsequence posteriors (Agueh and Carlier, 2011; Srivastava et al., 2015). Agueh and Carlier (2011) established desirable properties of the Wasserstein barycenter, such as existence and uniqueness, and Srivastava et al. (2015) provided a strong consistency result for it. Moreover, it has been shown theoretically that Wasserstein-based posteriors have appealing asymptotic properties (Szabó and Van Zanten, 2019).

Li et al. (2017) used this approach to combine subset posteriors in the setting of independent observations. Their approach cannot be applied directly to time series because it relies on the full-data likelihood factorizing as a product of likelihoods for each observation, which holds for independent data but not for time series. We show that, perhaps surprisingly, this is not an obstacle: defining subsequence posteriors as above and combining them via equation 5 results in provably accurate approximations to the true posterior.

Computing the Wasserstein barycenter exactly is computationally very expensive and remains an open area of research. However, efficient numerical algorithms have been developed to approximate the barycenter (Cuturi and Doucet, 2014; Dvurechenskii et al., 2018). For one-dimensional functionals of the parameter $\theta$, the Wasserstein barycenter can be obtained by simply averaging quantiles. Let $a\in\mathbb{R}^d$ and $b\in\mathbb{R}$, and let $\xi=a^\top\theta+b$ be a one-dimensional linear functional of $\theta$. We abuse notation to let $\Pi_m(\theta\mid\mathbf{X}_{[k]})$ and $\overline{\Pi}_T(\theta\mid\mathbf{X})$ denote the cumulative distribution functions corresponding to $\Pi_m(\mathrm{d}\theta\mid\mathbf{X}_{[k]})$ and $\overline{\Pi}_T(\mathrm{d}\theta\mid\mathbf{X})$, respectively. For any $u\in(0,1)$, the quantile function of the $k$th subsequence posterior distribution of $\xi$ is

$$\Pi_m^{-1}(u\mid\mathbf{X}_{[k]})=\inf\{\xi\in\mathbb{R}\,:\,u\leq\Pi_m(\xi\mid\mathbf{X}_{[k]})\},$$

and the quantile function of its Wasserstein posterior is

$$\overline{\Pi}_T^{-1}(u\mid\mathbf{X})=\inf\{\xi\in\mathbb{R}\,:\,u\leq\overline{\Pi}_T(\xi\mid\mathbf{X})\}=\frac{1}{K}\sum_{k=1}^{K}\Pi_m^{-1}(u\mid\mathbf{X}_{[k]}),$$

where $\Pi_m(\mathrm{d}\xi\mid\mathbf{X}_{[k]})$ and $\overline{\Pi}_T(\xi\mid\mathbf{X})$ are the posteriors induced by making the linear transformation $\xi=a^\top\theta+b$ from $\Pi_m(\theta\mid\mathbf{X}_{[k]})$ and $\overline{\Pi}_T(\theta\mid\mathbf{X})$, respectively. Given samples from the subsequence posteriors, this can be used to straightforwardly obtain credible intervals for each component of the Wasserstein posterior.
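The quantile-averaging identity above can be illustrated with a minimal sketch. It assumes that draws of the scalar functional $\xi$ from each subsequence posterior are already available; the function names, the probability grid, and the Gaussian stand-in draws are hypothetical choices made for illustration, not part of the method as stated.

```python
import numpy as np

def wasserstein_barycenter_1d(subset_samples, grid_size=999):
    """Approximate the 1-D W2 barycenter of K empirical distributions
    by averaging their empirical quantile functions on a common grid."""
    # Probabilities strictly inside (0, 1): 1/(grid_size+1), ..., grid_size/(grid_size+1).
    u = np.linspace(0.0, 1.0, grid_size + 2)[1:-1]
    # Row k is the empirical quantile function of the k-th subsequence posterior.
    quantiles = np.stack([np.quantile(np.asarray(s), u) for s in subset_samples])
    return u, quantiles.mean(axis=0)

def credible_interval(u, barycenter_quantiles, level=0.95):
    """Equal-tailed credible interval read off the averaged quantile function."""
    lo, hi = (1.0 - level) / 2.0, 1.0 - (1.0 - level) / 2.0
    return (float(np.interp(lo, u, barycenter_quantiles)),
            float(np.interp(hi, u, barycenter_quantiles)))

rng = np.random.default_rng(0)
# Hypothetical stand-ins for MCMC draws of xi from K = 3 subsequence posteriors.
draws = [rng.normal(loc=mu, scale=1.0, size=20_000) for mu in (-0.3, 0.0, 0.3)]
u, q_bar = wasserstein_barycenter_1d(draws)
interval = credible_interval(u, q_bar)
```

In this toy case the three inputs are unit-variance Gaussians whose means average to zero, so the averaged quantile function should be close to that of a standard normal distribution.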

The proposed divide-and-conquer for Bayesian time series (DC-BATS) method is summarized in Algorithm 1.

Input: time series $X_{1:T}$, prior $\Pi_0(\theta)$, subsequence size $m$, number of subsequences $K=T/m$.
1:  Divide the time series sequentially as $\mathbf{X}_{[1]}=X_{1:m}$, $\mathbf{X}_{[2]}=X_{(m+1):2m}$, $\dots$, $\mathbf{X}_{[K]}=X_{((K-1)m+1):T}$.
2:  Compute subsequence posteriors as $\Pi_{m,k}(\mathrm{d}\theta)=\{\widetilde{p}_\theta(\mathbf{X}_{[k]})^K\,\Pi_0(\mathrm{d}\theta)\}/\{\int_\Theta\widetilde{p}_\theta(\mathbf{X}_{[k]})^K\,\Pi_0(\mathrm{d}\theta)\}$, $k=1,\dots,K$.
3:  for $k=1,\dots,K$ do
4:     Obtain posterior samples from $\Pi_{m,k}(\mathrm{d}\theta)$ using a desired sampling algorithm.
5:  end for
6:  Combine samples from the subsequence posteriors using their Wasserstein barycenter.
Output: samples from the Wasserstein posterior.
Algorithm 1 Divide-and-conquer for Bayesian time series (DC-BATS).
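The steps of Algorithm 1 can be sketched on a toy example. The sketch below makes several assumptions that are not part of the algorithm itself: the model is an AR(1) process $X_t=\phi X_{t-1}+\varepsilon_t$ with known unit noise variance, the prior on $\phi$ is Gaussian, the pseudo-likelihood $\widetilde{p}_\theta(\mathbf{X}_{[k]})$ is taken to be the likelihood of the block conditional on its first observation, and each subsequence posterior is sampled directly from its closed-form Gaussian rather than by MCMC. All function names are hypothetical.

```python
import numpy as np

def simulate_ar1(T, phi, rng):
    """Simulate a stationary AR(1) series with unit innovation variance."""
    x = np.empty(T)
    x[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - phi**2))  # stationary start
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

def subsequence_posterior_draws(block, K, prior_var, n_draws, rng):
    """Draws of phi from a N(0, prior_var) prior times the block likelihood
    raised to the power K (Gaussian in closed form for this toy model)."""
    prev, curr = block[:-1], block[1:]
    precision = 1.0 / prior_var + K * np.sum(prev**2)
    mean = K * np.sum(prev * curr) / precision
    return rng.normal(mean, 1.0 / np.sqrt(precision), size=n_draws)

def dc_bats_ar1(x, m, prior_var=100.0, n_draws=5000, seed=0):
    rng = np.random.default_rng(seed)
    K = len(x) // m  # assumes m divides T; any remainder is dropped
    # Step 1: split the series into K consecutive blocks of length m.
    blocks = [x[k * m:(k + 1) * m] for k in range(K)]
    # Steps 2-5: sample each subsequence posterior (parallelisable over k).
    draws = [subsequence_posterior_draws(b, K, prior_var, n_draws, rng) for b in blocks]
    # Step 6: for a scalar parameter, averaging order statistics across blocks
    # is quantile averaging, which gives draws from the W2 barycenter.
    return np.mean([np.sort(d) for d in draws], axis=0)

x = simulate_ar1(T=20_000, phi=0.6, rng=np.random.default_rng(1))
wasserstein_draws = dc_bats_ar1(x, m=2_000)
```

Raising each block likelihood to the power $K$ is what gives each subsequence posterior roughly the same spread as the full-data posterior; without it, the combined posterior would be too wide by a factor of about $\sqrt{K}$ in this toy model.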
Remark 2.

If there is only finite-order dependence in the observations (for example, dependence of order $r$, meaning that $X_t$ depends only on $X_{(t-r):(t-1)}$), then one can write the full log-likelihood exactly as a sum over subsets, where each subset conditions on the last $r$ observations from the previous subset. In such a case, one can use the log-likelihood split described above, rather than our proposed strategy which ignores the last $r$ observations from the previous subsequence, and then combine the results using Wasserstein averages. For a fixed $r$, such an approach would have a computational cost similar to that of our proposed method, and we conjecture that theoretical results similar to those in Section 3.2 can also be established for it. However, we do not pursue this here, as our proposed method can also handle situations in which the dependence structure cannot be written as a finite-order dependence (for example, hidden Markov models).
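The exact split mentioned in this remark can be checked numerically in the simplest case $r=1$. The sketch below assumes a Gaussian AR(1) conditional density with unit noise variance and uses a hypothetical helper `ar1_cond_loglik`; it compares the full conditional log-likelihood with the boundary-conditioned split and with the split that ignores the boundaries.

```python
import numpy as np

def ar1_cond_loglik(x, phi, x_prev=None):
    """Sum of log N(x_t; phi * x_{t-1}, 1) over the block.
    If x_prev is given, the first observation conditions on it."""
    if x_prev is not None:
        x = np.concatenate(([x_prev], x))
    resid = x[1:] - phi * x[:-1]
    return -0.5 * np.sum(resid**2) - 0.5 * resid.size * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
x = rng.normal(size=12)  # any sequence works: the identity is purely algebraic
phi, m = 0.4, 4
blocks = [x[i:i + m] for i in range(0, len(x), m)]

# Full conditional log-likelihood of X_{2:T} given X_1.
full = ar1_cond_loglik(x, phi)

# Exact split: every block after the first conditions on the last
# observation of the previous block.
exact_split = ar1_cond_loglik(blocks[0], phi) + sum(
    ar1_cond_loglik(b, phi, x_prev=blocks[k][-1]) for k, b in enumerate(blocks[1:])
)

# Split that ignores the boundaries: the K - 1 cross-block terms are dropped.
ignoring_boundaries = sum(ar1_cond_loglik(b, phi) for b in blocks)
```

Here `exact_split` agrees with `full` up to floating-point error, while `ignoring_boundaries` differs by exactly the $K-1$ dropped boundary terms.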

3 Theoretical guarantees

3.1 Notations and main assumptions

For notational convenience, we assume that there exists an infinitely long stationary stochastic process $X_{-\infty:\infty}=(\dots,X_{-2},X_{-1},X_0,X_1,X_2,\dots)$, and that $X_{1:T}$ is the observed sequence from said series. Our proposed methodology does not involve time indices outside $\{1,\dots,T\}$, and indeed we are interested in the posterior distribution $\Pi_T(\mathrm{d}\theta\mid X_{1:T})$ as given by equation 2.

We assume that $\theta_0\in\Theta\subseteq\mathbb{R}^d$ is the “true” model parameter, and that $X_{-\infty:\infty}$ is generated by $\mathbb{P}_{\theta_0}$, the probability measure induced by $\theta_0$. We use $\mathbb{E}_{\mathbb{P}_{\theta_0}}$ to denote expectation with respect to the probability measure $\mathbb{P}_{\theta_0}$. We use $\Phi(\cdot;\mu,\Sigma)$ to denote a normal distribution with mean $\mu$ and covariance matrix $\Sigma$.

We define $\ell_k(\theta)=\log\widetilde{p}_\theta(\mathbf{X}_{[k]})$ to be the pseudo log-likelihood of the $k$th subsequence, where $\widetilde{p}_\theta(\mathbf{X}_{[k]})$ is as defined in equation 3. We use $\nabla_\theta\ell_k(\theta)$ and $\nabla^2_\theta\ell_k(\theta)$ to denote the gradient and Hessian matrix of $\ell_k(\theta)$ with respect to $\theta$, respectively.
We use $\widehat{\theta}_k$ to denote the maximum likelihood estimator (MLE) of the $k$th subsequence under $\ell_k(\theta)$, that is, $\widehat{\theta}_k=\operatorname*{argmax}_{\theta\in\Theta}\ell_k(\theta)$, and $\overline{\theta}=\sum_{k=1}^{K}\widehat{\theta}_k/K$ to denote the average MLE across the $K$ subsequences. We let $\widehat{\theta}$ be the MLE based on the complete dataset, that is, $\widehat{\theta}=\operatorname*{argmax}_{\theta\in\Theta}\ell(\theta)$, where $\ell(\theta)$ is the log-likelihood of the entire dataset as given in equation 1. We assume that the MLEs are unique.

Throughout this paper, we denote the $L^p$ norm by $\|\cdot\|_p=\{\mathbb{E}_{\mathbb{P}_{\theta_0}}(|\cdot|^p)\}^{1/p}$. We use Euclidean and Frobenius norms for vectors and matrices, respectively, that is, $\|v\|=(\sum_{i=1}^{d}v_i^2)^{1/2}$ if $v\in\mathbb{R}^d$ and $\|V\|=(\sum_{i=1}^{d}\sum_{j=1}^{d}V_{ij}^2)^{1/2}$ if $V\in\mathbb{R}^{d\times d}$.
For a sequence of real-valued random variables $\{Z_n\}_{n\geq 1}$, where each $Z_n$ is measurable with respect to the sigma-algebra generated by $X_{-\infty:\infty}$, and a sequence of real numbers $\{a_n\}_{n\geq 1}$, we say that $Z_n=O_{\mathbb{P}_{\theta_0}}(a_n)$ if for any $\epsilon>0$, there exist a real number $s>0$ and a positive integer $N$ such that $\mathbb{P}_{\theta_0}(|Z_n/a_n|>s)<\epsilon$ for any $n>N$.
Moreover, we say that $Z_n=o_{\mathbb{P}_{\theta_0}}(a_n)$ if for any $\epsilon>0$, $\lim_{n\to\infty}\mathbb{P}_{\theta_0}(|Z_n/a_n|>\epsilon)=0$. We let $\sigma(\cdot)$ represent the sigma-algebra generated by a (potentially infinite) collection of random variables $\cdot$. We use $\nabla_\theta$ as a shorthand for $\partial/(\partial\theta)$, that is, to represent the gradient with respect to $\theta$.

We describe the main assumptions related to the time series nature of the observations in this section. Additional assumptions, similar to those made by Li et al. (2017), are deferred to Appendix A to simplify the exposition.

We consider time series which are stationary and ergodic in this article, as formalised in Assumption 1.

Assumption 1 (Stationarity).

$X_{-\infty:\infty}$ is a strictly stationary and ergodic process.

Moreover, we make the following assumption on the mixing time of the process.

Assumption 2 (Mixing time).

The $\alpha$-mixing coefficient of $X_{-\infty:\infty}$, a measure of dependency defined as

$$\alpha(n)=\sup_{A\in\sigma(\dots,X_{-1},X_0),\,B\in\sigma(X_n,X_{n+1},\dots)}|\mathbb{P}_{\theta_0}(A)\,\mathbb{P}_{\theta_0}(B)-\mathbb{P}_{\theta_0}(A\cap B)|,$$

is such that $\sum_{j=1}^{\infty}\alpha(j)^{\delta/(2+\delta)}<\infty$ for a real number $\delta>0$.

Assumption 2 precludes long-range dependence. For processes with large $\delta$, a slow mixing rate will be sufficient to guarantee $\sum_{j=1}^{\infty}\alpha(j)^{\delta/(2+\delta)}<\infty$. Moreover, this assumption always holds for geometrically ergodic processes as, for such processes, there exists a $0<\rho<1$ such that $\alpha(j)<\rho^j$ for all sufficiently large $j$, and thus the sum of interest is upper bounded by $\sum_{j=1}^{\infty}(\rho^j)^{\delta/(2+\delta)}=\sum_{j=1}^{\infty}\{\rho^{\delta/(2+\delta)}\}^j<\infty$.
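As a small numerical illustration of the summability condition, the sketch below evaluates the bounding geometric series for hypothetical values $\rho=0.9$ and $\delta=1$, and records the standard $p$-series threshold for a polynomially decaying coefficient $\alpha(j)=j^{-a}$ (an illustrative case not discussed in the text). The helper names are assumptions made for this example.

```python
import numpy as np

def mixing_sum(alpha, delta, n_terms=200_000):
    """Partial sum of alpha(j)^(delta / (2 + delta)) for j = 1, ..., n_terms."""
    j = np.arange(1, n_terms + 1, dtype=float)
    return float(np.sum(alpha(j) ** (delta / (2.0 + delta))))

rho, delta = 0.9, 1.0
r = rho ** (delta / (2.0 + delta))   # ratio of the bounding geometric series
geometric_partial = mixing_sum(lambda j: rho ** j, delta)
geometric_limit = r / (1.0 - r)      # closed form of sum_{j >= 1} r^j

def min_polynomial_rate(delta):
    """For alpha(j) = j^(-a), the sum is finite if and only if
    a * delta / (2 + delta) > 1, that is, a > (2 + delta) / delta."""
    return (2.0 + delta) / delta
```

The threshold $(2+\delta)/\delta$ decreases towards 1 as $\delta$ grows, which is one way to read the statement that a slower mixing rate suffices when $\delta$ is large.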

Finally, we make an assumption on the score function as follows.

Assumption 3 (Geometrically decaying score function).

There exist a constant $\rho_0\in(0,1)$, a sufficiently large integer $N$, and a constant $C_0>0$ such that

$$\mathbb{E}_{\mathbb{P}_{\theta_0}}\|\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j:0})-\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j':0})\|_1\leq C_0\,\rho_0^N\quad\text{for all}~j,j'>N.$$

Assumption 3 states that the dependence of score functions on the history decays geometrically as the number of time steps increases. Such an assumption is automatically true for finite-order Markov processes. For an $n$th-order Markov process, we can choose $N=n$ so that $\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j:0})\equiv\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-n},\ldots,X_0)$ for all $j>N$. Therefore, we have

$$\mathbb{E}_{\mathbb{P}_{\theta_0}}\|\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j:0})-\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j':0})\|_1=0.$$

It also holds for finite state-space hidden Markov models under some additional regularity assumptions (for example, Bickel et al., 1998, Lemma 6). In this lemma, there exists an $\eta_1$ such that $\mathbb{E}_{\mathbb{P}_{\theta_0}}\|\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j:0})-\eta_1\|_1\leq C_0\,\rho_0^N$, which implies that there exists $N>0$ such that

$$\begin{aligned}
&\mathbb{E}_{\mathbb{P}_{\theta_0}}\|\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j:0})-\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j':0})\|_1\\
&\quad\leq\mathbb{E}_{\mathbb{P}_{\theta_0}}\|\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j:0})-\eta_1\|_1+\mathbb{E}_{\mathbb{P}_{\theta_0}}\|\nabla_\theta\log p_{\theta_0}(X_1\mid X_{-j':0})-\eta_1\|_1\leq 2C_0\,\rho_0^N.
\end{aligned}$$

While this may appear as a strong assumption, we have seen in our numerical experiments that DC-BATS has good performance even when this is not necessarily satisfied.

3.2 Main results

We present the main theoretical results of the paper in this section. Our results are based on asymptotic theory as the total length $T = Km$ of the time series increases to infinity. We prove that the error due to combining the subsequence posteriors using DC-BATS is asymptotically negligible as $T \to \infty$. This requires increasing the size $m$ of each subsequence, but at a potentially much slower rate than $T$. The proofs are in the Supplementary Material; they leverage results of Li et al. (2017), extended to the time series setting. We first present Lemma 1, which is novel to this work and is instrumental in proving our later results.

Lemma 1.

Under Assumptions 1–3 in the main text and Assumptions 4–9 in Appendix A, $\|\overline{\theta} - \widehat{\theta}\| = o_{\mathbb{P}_{\theta_0}}(m^{-1/2})$. If we assume further that $\widehat{\theta}_1$ is an unbiased estimator for $\theta$, that is, $\mathbb{E}_{\mathbb{P}_{\theta_0}}(\widehat{\theta}_1) = \theta_0$, and $m$ is at least $\mathcal{O}(T^{1/2})$, then $\|\overline{\theta} - \widehat{\theta}\| = o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$.
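As an informal numerical illustration of the quantities in Lemma 1, the following sketch compares the full-data MLE $\widehat{\theta}$ with the average $\overline{\theta}$ of the subsequence MLEs. It rests on assumptions that are not part of the paper: a zero-mean AR(1) model, the conditional (least-squares) MLE of the autoregressive coefficient, and hypothetical sizes `T`, `K` and true value `phi0`. It is a sanity check on one simulated series, not a verification of the lemma.

```python
import random


def simulate_ar1(T, phi, rng):
    """Simulate x_t = phi * x_{t-1} + e_t with e_t ~ N(0, 1), started at 0."""
    x, prev = [], 0.0
    for _ in range(T):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x


def ar1_mle(x):
    """Conditional MLE (least squares) of phi from one stretch of the series."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


rng = random.Random(0)
T, K, phi0 = 20_000, 10, 0.5   # hypothetical sizes, far smaller than in the paper
m = T // K                     # subsequence length, so that T = K * m
x = simulate_ar1(T, phi0, rng)

theta_hat = ar1_mle(x)                                        # full-data MLE
theta_k = [ar1_mle(x[k * m:(k + 1) * m]) for k in range(K)]   # subsequence MLEs
theta_bar = sum(theta_k) / K                                  # averaged MLE

gap = abs(theta_bar - theta_hat)
print(f"theta_hat={theta_hat:.4f}  theta_bar={theta_bar:.4f}  gap={gap:.5f}")
print(f"m^(-1/2)={m ** -0.5:.4f}  T^(-1/2)={T ** -0.5:.4f}")
```

Comparing the printed gap with $m^{-1/2}$ and $T^{-1/2}$ gives a rough sense of the two rates in the lemma; a single run cannot, of course, establish an asymptotic statement.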

Based on Lemma 1, the following Theorem 1 is the first main theoretical result of this paper, where we recall that $\widehat{\theta}_k$ is the MLE of the $k$th subsequence and $\overline{\theta} = \sum_{k=1}^{K} \widehat{\theta}_k / K$ is the average MLE across the $K$ subsequences.

Theorem 1 (Error due to combining subsequence posteriors).

Suppose Assumptions 1–3 in the main text and Assumptions 4–9 in Appendix A hold. Let $\xi = a^{\top}\theta + b$ for some fixed $a \in \mathbb{R}^{d}$ and $b \in \mathbb{R}$. Let $I_{\xi}(\theta_0) = [a^{\top}\{I^{-1}(\theta_0)\}a]^{-1}$, $\overline{\xi} = a^{\top}\overline{\theta} + b$, and $\widehat{\xi} = a^{\top}\widehat{\theta} + b$. Then we have the following.

  1. As $m \to \infty$,

    $$
    \begin{aligned}
    T^{1/2}\, \mathrm{W}_2\big( \overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T}),\, \Phi\big[\mathrm{d}\xi; \overline{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}\big] \big) &\to 0, \\
    T^{1/2}\, \mathrm{W}_2\big( \Pi_T(\mathrm{d}\xi \mid X_{1:T}),\, \Phi\big[\mathrm{d}\xi; \widehat{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}\big] \big) &\to 0, \\
    m^{1/2}\, \mathrm{W}_2\big\{ \overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T}),\, \Pi_T(\mathrm{d}\xi \mid X_{1:T}) \big\} &\to 0,
    \end{aligned}
    $$

    where the convergences are in $\mathbb{P}_{\theta_0}$-probability.

  2. If

    (a) $\widehat{\theta}_1$ is an unbiased estimator for $\theta$, that is, $\mathbb{E}_{\mathbb{P}_{\theta_0}}(\widehat{\theta}_1) = \theta_0$, and

    (b) $m$ is at least $\mathcal{O}(T^{1/2})$,

    then $T^{1/2}\, \mathrm{W}_2\big\{ \overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T}),\, \Pi_T(\mathrm{d}\xi \mid X_{1:T}) \big\} \to 0$ in $\mathbb{P}_{\theta_0}$-probability as $m \to \infty$.

For the first part of the theorem to hold, it suffices to let $m \to \infty$ at a much slower rate than $T$. The second part of the theorem is more interesting: it guarantees that the error between the Wasserstein posterior $\overline{\Pi}_T$ and the full posterior $\Pi_T$ has the asymptotically optimal rate $T^{-1/2}$ in the special case where the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$. This requirement on $m$ is not restrictive in practice, as divide-and-conquer algorithms typically split the dataset into a fairly small number $K$ of subsequences (on the order of tens or hundreds), whereas $T$ is much larger by comparison (on the order of hundreds of thousands or more). Since $K = T/m$, requiring $m \geq T^{1/2}$ is the same as requiring $K \leq T^{1/2}$; for instance, if $T = 10^{6}$, then the number of subsequences $K$ must be at most $10^{3}$, which is a reasonable requirement in practice. Moreover, we have observed good performance of DC-BATS even when $T$ and $m$ are not very large. It is possible to leverage Theorem 1 to obtain accuracy guarantees on moments of the posterior. These are provided in the following Theorem 2.

Theorem 2 (Guarantees on first and second moments).

Let $\xi = a^{\top}\theta + b$ for some fixed $a \in \mathbb{R}^{d}$ and $b \in \mathbb{R}$, and let $\Xi \subseteq \mathbb{R}$ be the domain of $\xi$ under the transformation. Since $\xi_0$ is the truth, we define the “bias” of a distribution $\Pi(\mathrm{d}\xi \mid X_{1:T})$ as $\operatorname{bias}[\Pi(\mathrm{d}\xi \mid X_{1:T})] = \int_{\Xi} \xi\, \Pi(\mathrm{d}\xi \mid X_{1:T}) - \xi_0$, where $\xi_0 = a^{\top}\theta_0 + b$. Under Assumptions 1–3 in Section 3.1 and Assumptions 4–9 in Appendix A, we have the following.

  1. $\operatorname{bias}\{\overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T})\} = \overline{\xi} - \xi_0 + o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$ and $\operatorname{bias}\{\Pi_T(\mathrm{d}\xi \mid X_{1:T})\} = \widehat{\xi} - \xi_0 + o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$.

  2. $\mathrm{var}\{\overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T})\} = T^{-1} I^{-1}_{\xi}(\theta_0) + o_{\mathbb{P}_{\theta_0}}(T^{-1})$ and $\mathrm{var}\{\Pi_T(\mathrm{d}\xi \mid X_{1:T})\} = T^{-1} I^{-1}_{\xi}(\theta_0) + o_{\mathbb{P}_{\theta_0}}(T^{-1})$.

Remark 3.

The definition of bias in Theorem 2 is adopted from Li et al. (2017): it is the difference between the posterior mean and $\xi_0$. Unlike the usual definition of bias (which is a fixed quantity), the term refers to a random quantity here.

Theorem 2 quantifies the order of the biases and variances of both the Wasserstein posterior and the full posterior. The difference in the biases of these two posteriors is controlled by $\overline{\xi} - \widehat{\xi}$ up to an asymptotically negligible term $o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$. Lemma 1 further shows that the bias difference is $o_{\mathbb{P}_{\theta_0}}(m^{-1/2})$ in general, and is improved to $o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$ when the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$. In terms of the posterior variance, both posteriors align on the dominating term $T^{-1} I^{-1}_{\xi}(\theta_0)$ and the difference is only up to an asymptotically negligible term $o_{\mathbb{P}_{\theta_0}}(T^{-1})$.

Remark 4.

We note that our theory focuses on the exact Wasserstein barycenter, while in practice we will instead calculate the Wasserstein barycenter between Monte Carlo approximations to the subset posteriors. Conceptually, our theory can be easily extended to also account for the Monte Carlo error component following a similar approach to Li et al. (2017); however, we do not consider that extension in this article.
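To make the Monte Carlo version of the barycenter concrete, the sketch below treats the simplest case: a scalar functional $\xi$, equal weights, and the same number of draws from every subsequence posterior. In one dimension the $\mathrm{W}_2$ barycenter has a quantile function equal to the average of the input quantile functions, so it can be computed by averaging order statistics. The helper names `barycenter_1d` and `w2_1d`, the Gaussian stand-in draws, and all numerical values are hypothetical and are not taken from the paper's code.

```python
import math
import random


def barycenter_1d(samples):
    """Equal-weight W2 barycenter of 1-D empirical measures of equal size.

    In one dimension the barycenter's quantile function is the average of the
    input quantile functions, so it suffices to average order statistics.
    """
    ordered = [sorted(s) for s in samples]
    n = len(ordered[0])
    if any(len(s) != n for s in ordered):
        raise ValueError("all sample sets must have the same size")
    return [sum(s[i] for s in ordered) / len(ordered) for i in range(n)]


def w2_1d(a, b):
    """W2 distance between two 1-D empirical measures of equal size."""
    if len(a) != len(b):
        raise ValueError("both sample sets must have the same size")
    pairs = zip(sorted(a), sorted(b))
    return math.sqrt(sum((u - v) ** 2 for u, v in pairs) / len(a))


rng = random.Random(1)
n, K = 5_000, 10
# Hypothetical draws of xi from K subsequence posteriors: N(mu_k, 0.1^2).
centres = [1.0 + 0.02 * (k - (K - 1) / 2) for k in range(K)]
draws = [[rng.gauss(mu, 0.1) for _ in range(n)] for mu in centres]

combined = barycenter_1d(draws)                       # sorted barycenter draws
lower, upper = combined[int(0.025 * n)], combined[int(0.975 * n)]
reference = [rng.gauss(1.0, 0.1) for _ in range(n)]   # stand-in "full posterior"
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
print(f"W2 to reference: {w2_1d(combined, reference):.4f}")
```

Because the barycenter is built from averaged quantiles, its credible interval endpoints are the averages of the corresponding subsequence interval endpoints, which is why combining is cheap relative to sampling.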

4 Synthetic data experiments

We demonstrate DC-BATS on different time series models. We have noticed that stochastic gradient MCMC algorithms tend to under-estimate posterior variances and are therefore not accurate in quantifying posterior uncertainty. This has also been noticed in the literature (for example, Figure 2 of Nemeth and Fearnhead, 2020). Furthermore, it has been established that stochastic gradient MCMC algorithms have fundamental limitations in terms of their scalability versus accuracy (Johndrow et al., 2020). For these reasons, we compare the proposed method with running MCMC to sample from the full posterior. Since we have proven theoretically that DC-BATS performs well for stationary models for large $T$ and $m$, we also consider non-stationary models, as well as models with low/moderate $T$ and $m$, to test the method outside of idealized cases. All numerical experiments have been performed on a 2018 i7-8700 CPU with a 3.20 GHz clock speed. Code for all numerical experiments is available online at https://github.com/deborsheesen/DC-BATS-deborshee. Averaging the credible intervals produced by the subsequence posteriors takes negligible time compared to sampling from them, and we do not report this in our experiments.

4.1 Linear regression with auto-regressive errors

We first consider a linear regression model with auto-regressive errors as follows:

$$
\begin{aligned}
X_t &= \alpha + \beta^{\top} Z_t + \varepsilon_t, \\
\varepsilon_t &= \varphi_1 \varepsilon_{t-1} + \varphi_2 \varepsilon_{t-2} + \xi_t, \quad \text{with } \xi_t \stackrel{\text{i.i.d.}}{\sim} \mathrm{N}(0, \sigma^2),
\end{aligned}
\tag{6}
$$

and $\varepsilon_1, \varepsilon_2 \stackrel{\text{ind}}{\sim} \mathrm{N}(0, \sigma^2)$, where $X_t, \alpha, \varepsilon_t \in \mathbb{R}$, and $\beta, Z_t \in \mathbb{R}^{p}$; here $X_{1:T}$ denotes observations and $Z_{1:T}$ denotes covariates. We set $\varphi_1 = 0.4$ and $\varphi_2 = -0.6$. We choose $p = 50$ and generate $T = 10^{5}$ observations from this model.
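The data-generating process in (6) can be sketched in a few lines. The version below is only illustrative: the covariate distribution, the values of $\alpha$, $\beta$ and $\sigma$, and the reduced sizes `T` and `p` are assumptions made here, since the text fixes only $\varphi_1$, $\varphi_2$, $p$ and $T$. As a sanity check it compares the sample lag-1 autocorrelation of the errors with the Yule-Walker value $\varphi_1 / (1 - \varphi_2)$ for a stationary AR(2) process.

```python
import random


def simulate_regression_ar2(T, alpha, beta, phi1, phi2, sigma, rng):
    """Draw (X, Z, eps) from model (6) with AR(2) errors."""
    p = len(beta)
    # Assumption: covariates are i.i.d. N(0, 1); the text does not specify them.
    Z = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(T)]
    # The first two errors are independent N(0, sigma^2), as in the text.
    eps = [rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)]
    for t in range(2, T):
        eps.append(phi1 * eps[t - 1] + phi2 * eps[t - 2] + rng.gauss(0.0, sigma))
    X = [alpha + sum(b * z for b, z in zip(beta, Z[t])) + eps[t] for t in range(T)]
    return X, Z, eps


def lag1_autocorrelation(e):
    """Sample lag-1 autocorrelation of a sequence."""
    mean = sum(e) / len(e)
    c = [v - mean for v in e]
    return sum(c[t] * c[t - 1] for t in range(1, len(c))) / sum(v * v for v in c)


rng = random.Random(2)
T, p = 20_000, 3                                   # smaller than T = 10^5, p = 50
alpha, beta, sigma = 0.5, [1.0, -2.0, 0.5], 1.0    # hypothetical true values
X, Z, eps = simulate_regression_ar2(T, alpha, beta, 0.4, -0.6, sigma, rng)

r1 = lag1_autocorrelation(eps)
print(f"lag-1 autocorrelation of errors: {r1:.3f}")
print(f"Yule-Walker value phi1 / (1 - phi2): {0.4 / (1 - (-0.6)):.3f}")
```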

We choose independent $\mathrm{N}(0, 10^{2})$ priors on $\alpha, \varphi_1, \varphi_2$, and on each component of $\beta$. We also choose an inverse-gamma$(3, 10)$ prior on $\sigma^{2}$. We choose $K \in \{10, 20\}$ and draw $10^{4}$ samples from each subsequence posterior as well as from the full posterior using the no-U-turn sampler (NUTS; Hoffman and Gelman, 2014) as implemented in Stan (Carpenter et al., 2017), of which the first half are discarded as burn-in in each case. It took ten minutes to sample from the full posterior, about a minute to sample from each subsequence posterior for $K = 10$, and about half a minute to sample from each subsequence posterior for $K = 20$. We plot 95% credible intervals for $\beta$ in Figure 1, and observe that the credible intervals obtained by DC-BATS are virtually indistinguishable from those obtained by full data MCMC. The frequentist coverage of the credible intervals for $\beta$ for DC-BATS is 94% for $K = 10$ and 92% for $K = 20$, and is 94% when we sample from the full posterior.

In addition to this single simulation example, Table 1 reports results for multiple settings of $(\varphi_1, \varphi_2)$, representing different degrees of mixing in the time series, and for different numbers of machines $K$. We consider (i) the i.i.d. case $(\varphi_1, \varphi_2) = (0, 0)$, (ii) the fast mixing case $(\varphi_1, \varphi_2) = (0.2, 0)$, (iii) the slow mixing case $(\varphi_1, \varphi_2) = (0.8, 0)$, (iv) the unit root case I with $(\varphi_1, \varphi_2) = (1, 0)$, and (v) the unit root case II with $(\varphi_1, \varphi_2) = (-1, -2)$. We set $T = 10^{5}$ and $p = 5$, and simulate 100 datasets for every setting. For each setting, we thus obtain 100 credible intervals and check their frequentist coverage across all the credible intervals. The results in Table 1 suggest that DC-BATS achieves frequentist coverage comparable to that of the full MCMC method when $K$ is small. However, the performance deteriorates as $K$ increases.

| | i.i.d. (DC) | i.i.d. (Full) | fast mixing (DC) | fast mixing (Full) | slow mixing (DC) | slow mixing (Full) | unit root case I (DC) | unit root case I (Full) | unit root case II (DC) | unit root case II (Full) |
|---|---|---|---|---|---|---|---|---|---|---|
| $K = 5$ | 94 | 95 | 92 | 91 | 92 | 92 | 87 | 88 | 88 | 86 |
| $K = 10$ | 92 | 95 | 90 | 91 | 88 | 92 | 85 | 88 | 85 | 86 |
| $K = 20$ | 92 | 95 | 88 | 91 | 88 | 92 | 86 | 88 | 85 | 86 |
| $K = 50$ | 92 | 95 | 85 | 91 | 82 | 92 | 82 | 88 | 81 | 86 |

Table 1: Frequentist coverage (%) of DC-BATS (DC) and the full MCMC method (Full) under different sets of parameters.
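The coverage figures above are proportions of credible intervals that contain the true parameter value. A minimal sketch of that calculation follows; the intervals here come from a hypothetical stand-in problem (a normal mean with known variance), not from the regression model of this section, and the function name `coverage` is introduced only for illustration. The sketch also computes the binomial standard error of a coverage estimate based on 100 replicates, under the assumption that the replicates are independent.

```python
import math
import random


def coverage(intervals, truth):
    """Fraction of (lower, upper) intervals that contain the true value."""
    return sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)


rng = random.Random(3)
truth, n_obs, n_datasets = 2.0, 50, 100

# Stand-in for per-dataset credible intervals: a 95% interval for a normal mean
# with known unit variance (hypothetical; not the paper's regression model).
intervals = []
for _ in range(n_datasets):
    xbar = sum(rng.gauss(truth, 1.0) for _ in range(n_obs)) / n_obs
    half_width = 1.96 / math.sqrt(n_obs)
    intervals.append((xbar - half_width, xbar + half_width))

cov = coverage(intervals, truth)
# Binomial standard error of a coverage estimate based on n_datasets intervals.
mc_se = math.sqrt(0.95 * 0.05 / n_datasets)
print(f"coverage: {100 * cov:.0f}%  (Monte Carlo s.e. about {100 * mc_se:.1f} points)")
```

With 100 replicates and nominal 95% coverage, the standard error is roughly two percentage points, which gives a sense of scale when reading differences between table entries.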
Figure 1: Credible intervals for DC-BATS for the linear regression with auto-regressive errors model (6) for $T = 10^{5}$.

4.2 Generalized autoregressive conditional heteroskedasticity (GARCH) model

GARCH models (Bollerslev, 1986) are very popular for modelling financial time series. These assume that the variance of the error term follows an autoregressive moving average process. Apart from finance, GARCH models have also been used in other domains such as healthcare (Nkalu and Edeme, 2019) and engineering (Ma et al., 2017a). We consider the following GARCH model with covariates:

$$
\begin{aligned}
X_t &= Z_t^{\top} b + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{N}(0, \sigma_t^{2}), \\
\sigma_t^{2} &= \omega^{2} + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^{2} + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^{2},
\end{aligned}
\tag{7}
$$

where $Z_t \in \mathbb{R}^{d}$ denotes covariates, $b \in \mathbb{R}^{d}$ denotes coefficients, and $\alpha = (\alpha_1, \dots, \alpha_q)^{\top} \in \mathbb{R}_{+}^{q}$, $\beta = (\beta_1, \dots, \beta_p)^{\top} \in \mathbb{R}_{+}^{p}$, and $\omega \in \mathbb{R}_{+}$ are coefficients. We let $r = \max(p, q)$ and set $\sigma_1^{2} = \cdots = \sigma_r^{2} = 1$.

We choose small values of $T$ and $m$ to test DC-BATS. We generate a time series of length $T = 2 \times 10^{5}$ from model (7) for $d = 5$, $p = 2$, and $q = 2$. We set $\alpha = (0.16, 0.16)^{\top}$, $\beta = (0.16, 0.16)^{\top}$, and $\omega = 1$. The observations $X_{1:T}$ and corresponding variances $\sigma_{1:T}^{2}$ are plotted in Figure 2. We observe that the variances $\sigma_t^{2}$ vary significantly across time (up to several orders of magnitude), as do the observations. We divide observations into $K = 10$ subsequences as before, and it is evident that this results in a lot of variation among the observations across subsequences.
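A minimal simulation of the variance recursion in (7) with the coefficient values above might look as follows. The covariate distribution, the regression coefficients `b`, the reduced length `T`, and the choice to draw the first $r$ errors with unit variance are assumptions made for this sketch; `simulate_garch` is a hypothetical helper, not the authors' code.

```python
import math
import random


def simulate_garch(T, b, alpha, beta, omega, rng):
    """Draw (X, Z, sigma2) from model (7); alpha has q entries, beta has p."""
    q, p = len(alpha), len(beta)
    r = max(p, q)
    # Assumption: covariates are i.i.d. N(0, 1); the text does not specify them.
    Z = [[rng.gauss(0.0, 1.0) for _ in b] for _ in range(T)]
    sigma2 = [1.0] * r                       # sigma_1^2 = ... = sigma_r^2 = 1
    eps = [rng.gauss(0.0, 1.0) for _ in range(r)]
    for t in range(r, T):
        arch = sum(alpha[i] * eps[t - 1 - i] ** 2 for i in range(q))
        garch = sum(beta[j] * sigma2[t - 1 - j] for j in range(p))
        sigma2.append(omega ** 2 + arch + garch)
        eps.append(rng.gauss(0.0, math.sqrt(sigma2[t])))
    X = [sum(bi * zi for bi, zi in zip(b, Z[t])) + eps[t] for t in range(T)]
    return X, Z, sigma2


rng = random.Random(4)
T = 20_000                                   # shorter than in the text, for speed
b = [1.0, -1.0, 0.5, 0.0, 2.0]               # hypothetical regression coefficients
alpha, beta, omega = [0.16, 0.16], [0.16, 0.16], 1.0
X, Z, sigma2 = simulate_garch(T, b, alpha, beta, omega, rng)

mean_sigma2 = sum(sigma2) / T
long_run = omega ** 2 / (1.0 - sum(alpha) - sum(beta))
print(f"average sigma_t^2: {mean_sigma2:.3f}, long-run value: {long_run:.3f}")
```

For these coefficients the sum of the entries of $\alpha$ and $\beta$ is 0.64, which is below one, so the average conditional variance can be compared with the long-run value $\omega^{2} / (1 - \sum_i \alpha_i - \sum_j \beta_j)$ as a basic check on the recursion.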

Figure 2: Observations $X_{1:T}$ and variances $\sigma_{1:T}^2$ for the GARCH model (7) with $T = 2 \times 10^5$.

We place a $\mathrm{Gamma}(3, 10)$ prior on $\omega$, independent $\mathrm{N}(0, 10^2)$ priors on each component of $b$, and independent half-normal $\mathrm{N}_+(0, 10^2)$ priors (that is, normal distributions restricted to the positive real line) on each component of $\alpha$ and $\beta$. We draw $10^4$ samples from each subsequence posterior as well as from the full posterior using NUTS, of which the first half are discarded as burn-in in each case. We present boxplots of the effective sample size (ESS) for each parameter in Figure 3, which shows the distribution of the ESS across machines. The ESS is satisfactory for each parameter: the median ESS is over $40\%$ of the total sample size for every parameter. In comparison, the ESS for MCMC on the full dataset was $53.6\%$ of the total sample size. We conclude that full MCMC and divide-and-conquer MCMC do not differ substantially in terms of effective sample size. It took around nine minutes to sample from the full posterior, and about a minute on average to sample from each subsequence posterior.

Figure 3: Effective sample size for the GARCH model (7) with $T = 2 \times 10^5$.
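To make the ESS percentages concrete, the following is a minimal sketch of one simple ESS estimator, $n / (1 + 2\sum_k \rho_k)$, with the autocorrelation sum truncated at the first non-positive lag. The function name `effective_sample_size` and the synthetic chains are illustrative assumptions; the estimator used to produce the figures in this paper is not specified here and may differ.

```python
import numpy as np

def effective_sample_size(chain):
    """Estimate ESS as n / (1 + 2 * sum of autocorrelations).

    The sum is truncated at the first non-positive sample autocorrelation,
    a simple heuristic; samplers such as NUTS implementations typically
    report more careful estimators.
    """
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    var = np.dot(x, x) / n
    if var == 0.0:
        return float(n)
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau

rng = np.random.default_rng(0)
n_draws = 5000

# Independent draws: ESS should be close to the number of draws.
iid = rng.normal(size=n_draws)

# Synthetic AR(1) chain with strong autocorrelation: ESS should be much smaller.
ar = np.zeros(n_draws)
for t in range(1, n_draws):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

ess_iid = effective_sample_size(iid)
ess_ar = effective_sample_size(ar)
print(f"ESS fraction, iid chain:   {ess_iid / n_draws:.2f}")
print(f"ESS fraction, AR(1) chain: {ess_ar / n_draws:.2f}")
```

Reporting ESS as a fraction of the retained draws, as in the text, makes chains of different lengths comparable.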

We compare the credible intervals produced by DC-BATS with those obtained by running MCMC on the full dataset, as well as those obtained using the double parallel Monte Carlo (DPMC) algorithm of Wang and Srivastava (2023), Laplace’s approximation (Kass et al., 1991), and automatic differentiation variational inference (ADVI; Kucukelbir et al., 2017) in Figure 4. We observe that DC-BATS reproduces the full-data credible intervals more accurately than the DPMC algorithm, Laplace’s approximation, or ADVI.

Figure 4: Posterior credible intervals for the GARCH model (7) with $T = 2 \times 10^5$.
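As a rough illustration of how credible intervals can be read off a combined posterior, recall that in one dimension the Wasserstein-2 barycenter of several distributions has a quantile function equal to the average of their quantile functions. The sketch below applies this to synthetic draws of a single scalar parameter from $K$ hypothetical subsequence posteriors. The function name `combine_marginal`, the synthetic draws, and the restriction to one marginal are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np

def combine_marginal(subset_draws):
    """Average sorted draws across subsets.

    subset_draws has shape (K, n): n draws of one scalar parameter from
    each of K subsequence posteriors. With equal n, averaging the order
    statistics gives draws from the 1-D Wasserstein-2 barycenter of the
    K empirical distributions.
    """
    draws = np.sort(np.asarray(subset_draws, dtype=float), axis=1)
    return draws.mean(axis=0)

rng = np.random.default_rng(1)
K, n = 10, 5000

# Synthetic stand-ins for subsequence posterior draws (not real output).
centres = rng.normal(0.16, 0.01, size=K)
subset_draws = centres[:, None] + 0.02 * rng.normal(size=(K, n))

combined = combine_marginal(subset_draws)
lo, hi = np.quantile(combined, [0.025, 0.975])
print(f"95% credible interval from combined draws: ({lo:.3f}, {hi:.3f})")
```

Averaging order statistics preserves the overall mean of the draws while producing a single distribution whose spread reflects a typical subsequence posterior rather than the between-subset scatter.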

In addition to this single simulation example, we also consider multiple settings of $(\alpha, \beta)$, representing different degrees of mixing in the time series, and different numbers of machines $K$. In this simulation, we fix $w^2 = 1$, $b = (1, \dots, 1)^\top$, $d = 5$, $p = 2$, and $q = 2$. We let $\alpha_1 = \dots = \alpha_p = \beta_1 = \dots = \beta_q = \gamma^{-1} (p+q)^{-1}$, where $\gamma \geq 1$ controls the mixing of $\sigma_t^2$. We vary $\gamma \in \{1, 2, 5\}$ and $K \in \{5, 10, 20\}$. The process is stationary when $\gamma > 1$, and has a unit root when $\gamma = 1$. The results are presented in Table 2. We observe that, as expected, the empirical posterior produced by DC-BATS is closer in Wasserstein-2 distance to the empirical posterior based on the full dataset than those produced by the other methods.

| Method | $K=5$, $\gamma=1$ | $K=5$, $\gamma=2$ | $K=5$, $\gamma=5$ | $K=10$, $\gamma=1$ | $K=10$, $\gamma=2$ | $K=10$, $\gamma=5$ | $K=20$, $\gamma=1$ | $K=20$, $\gamma=2$ | $K=20$, $\gamma=5$ |
|---|---|---|---|---|---|---|---|---|---|
| DC-BATS | 0.025 | 0.016 | 0.012 | 0.023 | 0.015 | 0.012 | 0.021 | 0.015 | 0.012 |
| DPMC | 0.040 | 0.027 | 0.020 | 0.038 | 0.027 | 0.019 | 0.037 | 0.027 | 0.019 |
| Laplace’s approximation | 0.124 | 0.087 | 0.081 | 0.124 | 0.087 | 0.081 | 0.124 | 0.087 | 0.081 |
| ADVI | 0.032 | 0.023 | 0.018 | 0.032 | 0.023 | 0.018 | 0.032 | 0.023 | 0.018 |

Table 2: Wasserstein-2 distance between the full MCMC posterior distribution and the approximations using different methods.
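For readers unfamiliar with the metric, the following sketch computes the Wasserstein-2 distance between two equal-size sets of draws for a single scalar parameter, where the optimal coupling simply matches sorted draws. The function name `wasserstein2_1d` and the synthetic draws are assumptions; the distances in the table are presumably computed on the joint posterior, which requires a multivariate optimal transport solver not shown here.

```python
import numpy as np

def wasserstein2_1d(x, y):
    """W2 distance between two equal-size 1-D empirical distributions.

    In one dimension the optimal transport plan pairs the i-th smallest
    draw of x with the i-th smallest draw of y.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal numbers of draws")
    return float(np.sqrt(np.mean((x - y) ** 2)))

rng = np.random.default_rng(0)
full = rng.normal(0.16, 0.02, size=5000)   # stand-in for full-data draws
shifted = full + 0.5                        # same shape, shifted by 0.5

# A pure location shift of size 0.5 gives a W2 distance of 0.5
# (up to floating-point error).
print(wasserstein2_1d(full, shifted))
```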

4.3 Hidden Markov models

We consider inference for continuous state-space hidden Markov models (HMMs; Rabiner and Juang, 1986) in this section. An HMM is a process $\{(Z_t, X_t)\}_{t=0}^{T}$, where $\{Z_t\}_{t=0}^{T}$ is an unobserved Markov chain, and each observation $X_t \in \mathbb{X}$ is conditionally independent of the rest of the process given $Z_t \in \mathbb{Z}$. We consider a linear Gaussian model having the form:

$$
\begin{aligned}
Z_0 &\sim \mathrm{N}(\mu_0, \Sigma_0), \\
Z_t \mid Z_{t-1} &\sim \mathrm{N}(A Z_{t-1}, \Sigma_z), \quad t \geq 1, \\
X_t \mid Z_t &\sim \mathrm{N}(C Z_t, \Sigma_x), \quad t \geq 0.
\end{aligned}
\tag{8}
$$

Closely related models have numerous applications ranging from guidance and navigation to robotics (Musoff and Zarchan, 2009).

We consider a two-dimensional latent Markov chain (that is, $\mathbb{Z} = \mathbb{R}^2$) and a two-dimensional observation space (that is, $\mathbb{X} = \mathbb{R}^2$). In particular, we generate $T = 10^3$ observations from model (8) with true parameter values

$$
A = \begin{pmatrix} 0.9 & -0.3 \\ 0.2 & 1 \end{pmatrix}, \quad
C = \begin{pmatrix} -1.1 & 0.5 \\ -0.3 & 0.8 \end{pmatrix}, \quad
\Sigma_x = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_x^2 \end{pmatrix}, \quad
\Sigma_z = \begin{pmatrix} \sigma_z^2 & 0 \\ 0 & \sigma_z^2 \end{pmatrix},
$$

with $\sigma_x^2 = \sigma_z^2 = 0.5$. We plot the observed process in Figure 5, where we see that it does not appear to be stationary. We nonetheless test DC-BATS on this model.
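The data-generating process can be sketched as follows. The initial mean $\mu_0$ and covariance $\Sigma_0$ are not stated in the text, so the values used below are placeholders, and the function name `simulate_lgssm` is hypothetical. The sketch also reports the spectral radius of $A$: its eigenvalues are complex with modulus $\sqrt{0.96} \approx 0.98$, so the latent dynamics are oscillatory and close to the unit circle.

```python
import numpy as np

# True parameter values from the text.
A = np.array([[0.9, -0.3], [0.2, 1.0]])
C = np.array([[-1.1, 0.5], [-0.3, 0.8]])
sigma2_x = sigma2_z = 0.5

def simulate_lgssm(T, A, C, sigma2_z, sigma2_x, mu0, Sigma0, rng):
    """Draw (Z_0, ..., Z_T) and (X_0, ..., X_T) from the linear Gaussian model."""
    d_z, d_x = A.shape[0], C.shape[0]
    Z = np.empty((T + 1, d_z))
    X = np.empty((T + 1, d_x))
    Z[0] = rng.multivariate_normal(mu0, Sigma0)
    for t in range(T + 1):
        if t > 0:
            # Latent transition with isotropic noise Sigma_z = sigma2_z * I
            Z[t] = A @ Z[t - 1] + np.sqrt(sigma2_z) * rng.normal(size=d_z)
        # Observation with isotropic noise Sigma_x = sigma2_x * I
        X[t] = C @ Z[t] + np.sqrt(sigma2_x) * rng.normal(size=d_x)
    return Z, X

rng = np.random.default_rng(2)
# mu0 and Sigma0 are assumptions (not given in the text).
Z, X = simulate_lgssm(1000, A, C, sigma2_z, sigma2_x,
                      np.zeros(2), np.eye(2), rng)

spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
print(f"spectral radius of A: {spectral_radius:.4f}")
```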

Figure 5: Observed process $X_{1:T}$ for the hidden Markov model (8).

We fix $C$ at its true value and consider inference for $A$, $\sigma_x^2$, and $\sigma_z^2$. We place independent $\mathrm{N}(0, 10^2)$ priors on each component of the matrix $A \in \mathbb{R}^{2 \times 2}$. We also place independent $\log\mathrm{N}(0, 10^2)$ priors on $\sigma_x^2$ and $\sigma_z^2$, where $\log\mathrm{N}$ denotes a log-normal distribution. We choose $K = 5$ subsequences. We write code in Python and collect $10^4$ posterior samples using an adaptive random walk Metropolis-Hastings algorithm (Haario et al., 2001) from each subsequence posterior, as well as from the full posterior. It took around 48 minutes for the MCMC algorithm to sample from the full posterior, and 11 minutes on average for each subsequence posterior. We display 95% credible intervals for DC-BATS and MCMC on the full dataset in Table 3, where we observe that the credible intervals provided by DC-BATS closely match those from MCMC on the full dataset.

| Method | $\sigma_x^2$ | $\sigma_z^2$ | $A_{11}$ | $A_{12}$ | $A_{21}$ | $A_{22}$ |
|---|---|---|---|---|---|---|
| DC-BATS | $(0.463, 0.572)$ | $(0.396, 0.540)$ | $(0.900, 0.924)$ | $(-0.316, -0.288)$ | $(0.191, 0.215)$ | $(0.979, 1.007)$ |
| MCMC on full posterior | $(0.464, 0.576)$ | $(0.398, 0.538)$ | $(0.895, 0.925)$ | $(-0.327, -0.289)$ | $(0.182, 0.209)$ | $(0.964, 1.000)$ |

Table 3: 95% posterior credible intervals for parameters of the hidden Markov model (8).
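Sampling $A$, $\sigma_x^2$, and $\sigma_z^2$ with a Metropolis-Hastings algorithm requires the likelihood of the observations with the latent chain integrated out. For a linear Gaussian model this can be computed exactly with a Kalman filter; the sketch below shows one way to do so. Whether the experiments used this particular recursion is an assumption, as are the function name `kalman_loglik`, the toy observations, and the placeholder values for $\mu_0$ and $\Sigma_0$.

```python
import numpy as np

def kalman_loglik(X, A, C, Sigma_z, Sigma_x, mu0, Sigma0):
    """Log marginal likelihood log p(X_0, ..., X_T) under model (8)."""
    m = np.asarray(mu0, dtype=float)
    P = np.asarray(Sigma0, dtype=float)
    loglik = 0.0
    for t, x in enumerate(X):
        if t > 0:
            # Predict: distribution of Z_t given X_{0:t-1}
            m = A @ m
            P = A @ P @ A.T + Sigma_z
        # Innovation and its covariance
        resid = x - C @ m
        S = C @ P @ C.T + Sigma_x
        _, logdet = np.linalg.slogdet(S)
        loglik -= 0.5 * (len(x) * np.log(2.0 * np.pi) + logdet
                         + resid @ np.linalg.solve(S, resid))
        # Update: distribution of Z_t given X_{0:t}
        gain = np.linalg.solve(S, C @ P).T  # Kalman gain P C^T S^{-1}
        m = m + gain @ resid
        P = P - gain @ S @ gain.T
    return loglik

A = np.array([[0.9, -0.3], [0.2, 1.0]])
C = np.array([[-1.1, 0.5], [-0.3, 0.8]])
# Toy observations and initial distribution, for illustration only.
X_toy = np.array([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]])
ll = kalman_loglik(X_toy, A, C, 0.5 * np.eye(2), 0.5 * np.eye(2),
                   np.zeros(2), np.eye(2))
print(ll)
```

Each evaluation costs time linear in the series length, which is why a subsequence likelihood is cheaper to evaluate than the full-data likelihood.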

4.4 Binary auto-regressive model

We consider a model with binary observations $X_t \in \{0, 1\}$ modelled as

$$
\mathbb{P}(X_t = 1) = \frac{1}{1 + \exp\{-(c + \sum_{i=1}^{p} \alpha_i X_{t-i} + Z_t^\top b)\}},
\tag{9}
$$

where $\alpha = (\alpha_1, \dots, \alpha_p)^\top \in \mathbb{R}^p$, $Z_t \in \mathbb{R}^q$ denotes covariates at time $t$, and $b = (b_1, \dots, b_q)^\top \in \mathbb{R}^q$ denotes the coefficients of the covariates. In this example, we choose $p = 5$ and $q = 5$. We set $\alpha = (0.25, 0.25, 0.25, 0.25, 0.25)^\top$ and $b = (0.5, 0.5, 0, -0.5, -0.5)^\top$. We generate $T = 2 \times 10^5$ synthetic observations. We choose the covariates $Z_t$ to be non-stationary. We plot the observations $X_{1:T}$ and the corresponding success probabilities $(\mathbb{P}(X_1 = 1), \dots, \mathbb{P}(X_T = 1))$ in Figure 6.

Figure 6: Observations $X_t$ and success probabilities $\mathbb{P}(X_t = 1)$ for model (9) with $T = 2 \times 10^5$.
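The following sketch simulates from model (9) and evaluates its conditional log-likelihood. Several ingredients are assumptions made for illustration: the intercept $c = 0$, the form of the non-stationary covariates (a slow linear drift plus noise), the initialisation of the first $p$ observations at zero, the shortened series length, and the function names `simulate_binary_ar` and `binary_ar_loglik`.

```python
import numpy as np

def simulate_binary_ar(T, Z, c, alpha, b, rng):
    """Simulate X_t from model (9); the first p values are set to 0."""
    p = len(alpha)
    X = np.zeros(T)
    prob = np.full(T, np.nan)
    for t in range(p, T):
        # X[t-p:t][::-1] is (X_{t-1}, ..., X_{t-p}), matching alpha_1..alpha_p
        eta = c + np.dot(alpha, X[t - p:t][::-1]) + Z[t] @ b
        prob[t] = 1.0 / (1.0 + np.exp(-eta))
        X[t] = rng.random() < prob[t]
    return X, prob

def binary_ar_loglik(X, Z, c, alpha, b):
    """Log-likelihood of X_{p+1:T} conditional on the first p observations."""
    X = np.asarray(X, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    b = np.asarray(b, dtype=float)
    p, T = alpha.size, X.size
    # Column i-1 holds the lag-i values X_{t-i} for each modelled time t
    lags = np.column_stack([X[p - i:T - i] for i in range(1, p + 1)])
    eta = c + lags @ alpha + np.asarray(Z, dtype=float)[p:] @ b
    y = X[p:]
    # log P(X_t = y_t) = y_t * eta_t - log(1 + exp(eta_t)), computed stably
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

rng = np.random.default_rng(3)
T, p, q = 2000, 5, 5   # the text uses T = 2 * 10**5; shortened here
alpha = np.full(p, 0.25)
b = np.array([0.5, 0.5, 0.0, -0.5, -0.5])
c = 0.0                # hypothetical intercept
# Hypothetical non-stationary covariates: linear drift plus noise
Z = np.linspace(-1.0, 1.0, T)[:, None] + 0.5 * rng.normal(size=(T, q))

X, prob = simulate_binary_ar(T, Z, c, alpha, b, rng)
ll = binary_ar_loglik(X, Z, c, alpha, b)
print(ll)
```

Because each term depends only on the previous $p$ observations and the current covariates, the log-likelihood of a subsequence can be evaluated from that subsequence plus its $p$ preceding values.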

We consider independent $\mathrm{N}(0, 10^2)$ priors on $c$ and on each component of $\alpha$ and $b$. We draw $10^4$ samples from the posterior $p(c, \alpha, b \mid X_{1:T})$ using DC-BATS as well as MCMC on the full dataset. We use NUTS and discard half of the $10^4$ samples as burn-in in each case. It took around twelve minutes to sample from the full posterior, and a little over a minute on average to sample from each subsequence posterior. We compare the credible intervals produced by DC-BATS, full data MCMC, the double parallel Monte Carlo (DPMC) algorithm of Wang and Srivastava (2023), Laplace’s approximation, and ADVI in Figure 7. The credible intervals from DC-BATS are virtually indistinguishable from those for full data MCMC, while the intervals for DPMC deviate for certain parameters. We also present boxplots of the effective sample size (ESS) for each parameter in Figure 8, which shows the distribution of the ESS across machines. We observe that the ESS is satisfactory for each parameter: the median ESS is over $40\%$ of the total sample size for every parameter. In comparison, the ESS for the full MCMC method was $48.1\%$ of the total sample size. We again conclude that the full MCMC method and divide-and-conquer MCMC do not differ substantially in terms of effective sample size.

Figure 7: Posterior credible intervals for the binary auto-regressive model (9) for $T = 2 \times 10^5$ and $m = 10$.

Figure 8: Effective sample size for the binary auto-regressive model (9) with $T = 2 \times 10^5$.

In addition to this single simulation example, we also consider multiple settings of $(\alpha, b)$, representing different degrees of mixing in the time series, and different numbers of machines $K$. In this simulation, we fix $c = 0$, $p = 5$, and $q = 5$. We let $\alpha_1 = \dots = \alpha_p = b_1 = \dots = b_q = \gamma^{-1} (p+q)^{-1}$ with $\gamma \geq 1$; the parameter $\gamma$ regulates the level of autocorrelation of the model. We use the Wasserstein-2 distance between the obtained posterior distribution and the full posterior distribution as the performance metric. The results are presented in Table 4. Laplace’s approximation performs slightly better than DC-BATS throughout, and DPMC performs slightly better than DC-BATS in most settings with $K \geq 10$ but slightly worse when $K = 5$. DC-BATS performs significantly better than ADVI in every setting.

| Method | $K=5$, ratio $=1$ | $K=5$, ratio $=2$ | $K=5$, ratio $=5$ | $K=10$, ratio $=1$ | $K=10$, ratio $=2$ | $K=10$, ratio $=5$ | $K=20$, ratio $=1$ | $K=20$, ratio $=2$ | $K=20$, ratio $=5$ |
|---|---|---|---|---|---|---|---|---|---|
| DC-BATS | 0.173 | 0.123 | 0.119 | 0.176 | 0.119 | 0.115 | 0.172 | 0.119 | 0.121 |
| DPMC | 0.181 | 0.132 | 0.121 | 0.175 | 0.112 | 0.111 | 0.161 | 0.116 | 0.121 |
| Laplace’s approximation | 0.161 | 0.115 | 0.102 | 0.161 | 0.115 | 0.102 | 0.161 | 0.115 | 0.102 |
| ADVI | 0.223 | 0.246 | 0.212 | 0.223 | 0.246 | 0.212 | 0.223 | 0.246 | 0.212 |

Table 4: Average Wasserstein-2 distance between the full MCMC posterior distribution and the approximations of different methods for the binary auto-regressive model (9).

5 Application to Los Angeles particulate matter data

It is well understood that aerosol particulates have a significant impact on human health, and hence understanding the dynamics of particulate matter (PM) is important in public health decision making. Modern sampling technologies have made high-resolution air monitoring possible. The data produced by such monitors are therefore massive, and hence challenging to analyze with Bayesian methods. To tackle these computational challenges, we apply DC-BATS to analyze a Los Angeles air quality dataset obtained from the Environmental Protection Agency (EPA); the dataset is available online at https://www.epa.gov/outdoor-air-quality-data.

This dataset consists of $T = 8760$ hourly measurements of particulates including $\text{PM}_{10}$ (1% missingness) and $\text{PM}_{2.5}$ (3.5% missingness) in Los Angeles during 2017. The dataset has some clearly invalid measurements, which we treat as missing data. We apply Kalman smoothing imputation as suggested in Hyndman and Khandakar (2008) to simplify handling of the missing data. After imputation, we transform both PM observations by $\log(0.1 + \text{PM})$. Our overarching goal is to build an interpretable model that can capture the dynamics of these particulates.

Figure 9: Preprocessed observations $X_{1:T}$ for the Los Angeles particulate matter dataset.

We plot the values of the particulates over time after preprocessing in Figure 9. It is clear that the variance of the observations changes over time. In order to capture the evolution of variance within a series and correlation across series, we consider a bivariate GARCH model with constant conditional correlation (Bollerslev, 1990) as follows:

$$
\begin{aligned}
X_t &= \mu + v_t, \quad v_t \sim \mathrm{N}_2(0, H_t), \\
H_{t,ii} &= w_i + a_i v^2_{t-1,ii} + b_i H_{t-1,ii}, \\
H_{t,ij} &= r H_{t,ii}^{1/2} H_{t,jj}^{1/2}, \quad i, j = 1, 2,
\end{aligned}
\tag{10}
$$

for $t = 1, \dots, T$, where $X_t \in \mathbb{R}^2$ contains the ($\text{PM}_{10}$, $\text{PM}_{2.5}$) levels at time $t$. We assume $X_t$ is the sum of a time-independent mean $\mu \in \mathbb{R}^2$ and a time-dependent innovation $v_t \in \mathbb{R}^2$ whose covariance matrix $H_t$ evolves with time $t$. We assume that each variance $H_{t,ii}$ follows a univariate GARCH process, regressed on an intercept term $w_i \in \mathbb{R}_+$, the lag-$1$ squared innovation $v^2_{t-1,ii}$, and the lag-$1$ variance $H_{t-1,ii}$ through coefficients $a_i, b_i \in \mathbb{R}_+$ for $i = 1, 2$. We also assume the correlation between particulates is time-independent, which is captured by $r \in [-1, 1]$.
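The Gaussian log-likelihood implied by model (10) can be sketched as below. The sketch reads $v^2_{t-1,ii}$ as the squared $i$-th component of $v_{t-1}$ and takes the variances at $t = 1$ as an explicit input `h0`, since their initialisation is not stated in the text; both choices, along with the function name `ccc_garch_loglik` and the toy data and parameter values, are assumptions.

```python
import numpy as np

def ccc_garch_loglik(X, mu, w, a, b, r, h0):
    """Gaussian log-likelihood of a bivariate CCC-GARCH(1,1) model.

    X has shape (T, 2); mu, w, a, b, h0 have shape (2,); r is the constant
    correlation with |r| < 1 so that every H_t is invertible.
    """
    X = np.asarray(X, dtype=float)
    w, a, b = (np.asarray(u, dtype=float) for u in (w, a, b))
    v = X - np.asarray(mu, dtype=float)          # innovations v_t
    h = np.asarray(h0, dtype=float).copy()       # (H_{1,11}, H_{1,22})
    loglik = 0.0
    for t in range(X.shape[0]):
        if t > 0:
            # H_{t,ii} = w_i + a_i * v_{t-1,i}^2 + b_i * H_{t-1,ii}
            h = w + a * v[t - 1] ** 2 + b * h
        off = r * np.sqrt(h[0] * h[1])           # H_{t,12} = H_{t,21}
        H = np.array([[h[0], off], [off, h[1]]])
        _, logdet = np.linalg.slogdet(H)
        loglik -= 0.5 * (2.0 * np.log(2.0 * np.pi) + logdet
                         + v[t] @ np.linalg.solve(H, v[t]))
    return float(loglik)

rng = np.random.default_rng(4)
X_toy = rng.normal(size=(200, 2))                # toy data, not the PM series
ll = ccc_garch_loglik(X_toy, mu=X_toy.mean(axis=0),
                      w=[0.1, 0.1], a=[0.5, 0.6], b=[0.1, 0.01],
                      r=0.25, h0=X_toy.var(axis=0))
print(ll)
```

With $a_i = b_i = 0$ and $r = 0$ the recursion reduces to two independent Gaussian series with constant variances $w_i$, which provides a simple sanity check for any implementation.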

We adopt a diffuse prior distribution $\mathrm{N}(0.5, 10^6)$ for each of $a_i$, $b_i$, and $\mu_i$, and a prior distribution $\mathrm{N}(1.0, 10^6)$ for each $w_i$. Since it is well known a priori that these particulates are positively correlated, we adopt a $\mathrm{Uniform}(0, 1)$ prior distribution for $r$. We draw $10^4$ samples from the posterior $p(a, b, w, \mu \mid X_{1:T})$, where $a = (a_1, a_2)$ and $b = (b_1, b_2)$, using DC-BATS with $K = 10$ subsequences as well as MCMC on the full dataset. We use NUTS and discard half of the $10^4$ samples as burn-in in each case. It took around $24$ minutes to sample from the full posterior, and $3.8$ minutes on average to sample from each subsequence posterior. We compare the credible intervals produced by DC-BATS and full data MCMC in Table 5, where it is evident that the credible intervals provided by both methods are well aligned with each other.

| Parameter | DC-BATS | MCMC on full posterior |
|---|---|---|
| $a_1$ | $(5.12 \times 10^{-1}, 5.85 \times 10^{-1})$ | $(5.33 \times 10^{-1}, 6.19 \times 10^{-1})$ |
| $a_2$ | $(6.50 \times 10^{-1}, 7.30 \times 10^{-1})$ | $(8.76 \times 10^{-1}, 9.76 \times 10^{-1})$ |
| $b_1$ | $(6.20 \times 10^{-2}, 1.32 \times 10^{-1})$ | $(1.21 \times 10^{-1}, 2.12 \times 10^{-1})$ |
| $b_2$ | $(8.59 \times 10^{-5}, 1.09 \times 10^{-2})$ | $(6.00 \times 10^{-5}, 7.46 \times 10^{-3})$ |
| $w_1$ | $(1.22 \times 10^{-1}, 1.42 \times 10^{-1})$ | $(9.07 \times 10^{-2}, 1.10 \times 10^{-1})$ |
| $w_2$ | $(2.01 \times 10^{-1}, 2.24 \times 10^{-1})$ | $(1.23 \times 10^{-1}, 1.40 \times 10^{-1})$ |
| $\mu_1$ | $(3.12, 3.14)$ | $(3.25, 3.28)$ |
| $\mu_2$ | $(1.95, 1.98)$ | $(2.10, 2.12)$ |
| $r$ | $(2.57 \times 10^{-1}, 2.78 \times 10^{-1})$ | $(2.32 \times 10^{-1}, 2.54 \times 10^{-1})$ |
Table 5: 95% posterior credible intervals for parameters of the bivariate GARCH model (10) applied to the Los Angeles particulate matter (PM) dataset.
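
As an illustration of how marginal intervals of this kind can be assembled from subsequence output, the following minimal sketch averages subsequence-posterior quantiles, which in one dimension coincides with the $\mathrm{W}_2$ barycenter. It is a sketch under stated assumptions, not the implementation used for Table 5: the function name `combine_intervals` is hypothetical, the draws are synthetic stand-ins for NUTS output, and any adjustment applied to the subsequence posteriors before combination is outside its scope.

```python
import numpy as np

def combine_intervals(subset_draws, level=0.95):
    """Average subsequence-posterior quantiles into one credible interval.

    subset_draws: list of 1-D arrays, one array of posterior draws per subsequence.
    (Hypothetical helper for illustration; not taken from the paper's code.)
    """
    lo, hi = (1 - level) / 2, 1 - (1 - level) / 2
    # Row j holds the (lo, hi) quantiles of the j-th subsequence posterior.
    q = np.array([np.quantile(d, [lo, hi]) for d in subset_draws])
    # In one dimension, the quantile function of the W2 barycenter is the
    # average of the input quantile functions.
    return q.mean(axis=0)

# Synthetic stand-in for sampler output: 10 subsequences, 5000 retained draws each.
rng = np.random.default_rng(0)
centres = rng.normal(0.55, 0.01, size=10)
draws = [rng.normal(c, 0.02, size=5000) for c in centres]
interval = combine_intervals(draws)
print(interval)
```

Averaging quantiles rather than pooling draws keeps the combined interval from being widened by between-subsequence variation in location.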

6 Discussion

We have proposed a simple divide-and-conquer approach for Bayesian inference from stationary time series. There are several natural follow-up directions. In our theoretical development, we have assumed that the time series is stationary and mixes fast; it would be interesting to relax these assumptions and develop scalable posterior inference algorithms for non-stationary time series, as well as for series with long-range dependence. Although our current algorithm shows promising empirical results in certain simulation experiments with non-stationarity, we expect long-range dependence to present a more challenging problem.

We have not considered the problem of defining a ‘best’ choice of the number and lengths of the subsequences, and have instead focused on experiments testing our algorithm in challenging cases in which the subset sizes are modest and/or assumptions of our theory are violated. In practice, for truly massive datasets, one should ideally run MCMC in parallel for the different subsequences; the best choice of $m$ and $K$ depends on a tradeoff between statistical accuracy and one's computational budget in terms of wall clock time, number of nodes in a distributed computing network, and the capacity of each node. As a rule of thumb, approximation accuracy should improve with subsequence length as long as one's computational budget allows sufficient MCMC draws per subsequence posterior. Our simulation results are promising in suggesting that, at least in certain cases, accuracy is high even with short subsequences. However, this depends on the model and data.
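
The bookkeeping behind this tradeoff can be sketched as follows. This is a hedged illustration only: `split_series` is a hypothetical helper, dropping the trailing remainder is an illustrative choice rather than the paper's rule, and the speed-up figure simply reuses the timings reported for the particulate matter example (about 24 minutes for the full posterior and 3.8 minutes per subsequence), ignoring combination and communication costs.

```python
import numpy as np

def split_series(x, K):
    """Split a series into K consecutive, non-overlapping blocks of equal length m.

    Trailing observations that do not fill a whole block are dropped
    (an illustrative choice, not taken from the paper).
    """
    x = np.asarray(x)
    m = len(x) // K
    if m == 0:
        raise ValueError("K exceeds the series length")
    return [x[k * m:(k + 1) * m] for k in range(K)]

blocks = split_series(np.arange(1003), K=10)
print(len(blocks), len(blocks[0]))

# Idealised wall-clock speed-up if all K subsequence chains run in parallel.
full_minutes, subsequence_minutes = 24.0, 3.8
speedup = full_minutes / subsequence_minutes
print(round(speedup, 1))
```

With a fixed series length, increasing $K$ shortens each block and so reduces per-node cost, at the price of a shorter subsequence length $m$ for each subsequence posterior.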

Two additional important future directions include (1) modifying the simple divide-and-conquer algorithm we are proposing to allow communication between nodes; and (2) modifying the algorithm and/or theory to allow guarantees for fixed finite subsequence sizes. There has been work on both threads outside of the time series setting; for example, refer to Dai et al., (2023).

Acknowledgement

DS and DD acknowledge support from National Science Foundation grant 1546130. DS acknowledges support from grant DMS-1638521 from SAMSI. The authors thank Cheng Li for his insightful comments that led to improvements in this paper.

Appendix A Additional assumptions

We make the following additional assumptions to prove the theoretical results. These are similar to assumptions made by Li et al., (2017).

Assumption 4 (Support).

For all $t \geq 1$ and all $\theta \in \Theta$, all possible conditional distributions $X_t \mid X_{1:(t-1)}$ have the same support as the stationary distribution of $X_t$.

Many classes of time series models satisfy Assumption 4, including the ones that are considered in this paper. However, exceptions to this assumption include time-varying time series models in which the support of the conditional densities changes with time $t$.

Assumption 5 (Envelope).

This consists of three parts.

  1. $\log p_{\theta}(x_t \mid x_{1:(t-1)})$ is three times differentiable with respect to $\theta$ in a neighbourhood $B_{\delta_0}(\theta_0) = \{\theta \in \Theta : \|\theta - \theta_0\| \leq \delta_0\}$ of $\theta_0$ for some constant $\delta_0 > 0$.

  2. There exist functions $M_t(x_{1:t})$, $t \geq 1$, such that

    $$\begin{aligned}
    \sup_{\theta \in B_{\delta_0}(\theta_0)} \left| \frac{\partial}{\partial \theta_{l_1}} \log p_{\theta}(x_t \mid x_{1:(t-1)}) \right| &\leq M_t(x_{1:t}), \\
    \sup_{\theta \in B_{\delta_0}(\theta_0)} \left| \frac{\partial^{2}}{\partial \theta_{l_1} \partial \theta_{l_2}} \log p_{\theta}(x_t \mid x_{1:(t-1)}) \right| &\leq M_t(x_{1:t}), \\
    \sup_{\theta \in B_{\delta_0}(\theta_0)} \left| \frac{\partial^{3}}{\partial \theta_{l_1} \partial \theta_{l_2} \partial \theta_{l_3}} \log p_{\theta}(x_t \mid x_{1:(t-1)}) \right| &\leq M_t(x_{1:t}),
    \end{aligned}$$

    for $l_1, l_2, l_3 = 1, \dots, d$ and for all $\{x_t\}_{t \geq 1}$.

  3. $\limsup\limits_{T \to \infty} \mathbb{E}_{\mathbb{P}_{\theta_0}} \{T^{-1} \sum_{t=1}^{T} M_t(X_{1:t})^{4+2\delta}\} < \infty$, where $\delta$ is the same as that in Assumption 2.

Assumptions 2 and 5 together imply a trade-off between moments and the mixing rate of the process $\{X_t\}_{t \geq 1}$. On one hand, a higher value of $\delta$ places a greater restriction on the moments; on the other hand, a slower decay rate of $\alpha(j)$ then suffices for $\sum_{j=1}^{\infty} \alpha(j)^{\delta/(2+\delta)} < \infty$ to hold, which places less restriction on the mixing rate of the process $\{X_t\}_{t \geq 1}$.
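
To make this trade-off concrete, consider a hypothetical polynomial mixing rate $\alpha(j) = j^{-\beta}$ with $\beta > 0$; this rate is chosen purely for illustration and is not a case analysed in the paper. Taking the summability condition in the form quoted above,

$$\sum_{j=1}^{\infty} \alpha(j)^{\delta/(2+\delta)} = \sum_{j=1}^{\infty} j^{-\beta\delta/(2+\delta)} < \infty \iff \frac{\beta\delta}{2+\delta} > 1 \iff \beta > 1 + \frac{2}{\delta}.$$

Under this illustrative rate, $\delta = 1$ requires $\beta > 3$ together with envelope moments of order $4 + 2\delta = 6$, whereas $\delta = 2$ requires only $\beta > 2$ but envelope moments of order $8$.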

Assumption 5 can be verified by checking whether the log-likelihood function of $x_t$ depends only on a finite number of past observations, that is, whether there exists an integer $p > 0$ such that $\log p_{\theta}(x_t \mid x_{1:(t-1)}) = \log p_{\theta}(x_t \mid x_{(t-p):(t-1)})$. In this case, it suffices to find an envelope function $M(X_{1:p})$ such that $\mathbb{E}_{\mathbb{P}_{\theta_0}} M^{4+2\delta}(X_{1:p}) < +\infty$, where $X_{1:p}$ follows the stationary distribution. By the $L^{p}$ ergodic theorem of von Neumann (Neumann, 1932), we have

$$\limsup_{T \to \infty} \mathbb{E}_{\mathbb{P}_{\theta_0}} \left\{ T^{-1} \sum_{t=1}^{T} M_t(X_{1:t})^{4+2\delta} \right\} = \mathbb{E}_{\mathbb{P}_{\theta_0}} M^{4+2\delta}(X_{1:p}) < +\infty.$$

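
As a worked illustration of this verification route, consider a Gaussian AR(1) model $X_t = \phi X_{t-1} + \varepsilon_t$ with independent $\varepsilon_t \sim \mathrm{N}(0, 1)$, a single parameter $\theta = \phi$, and true value $|\phi_0| < 1$. This example, including the unit innovation variance, is an assumption made here for illustration and is not worked out in the paper. Then

$$\log p_{\phi}(x_t \mid x_{t-1}) = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(x_t - \phi x_{t-1})^{2},$$

$$\frac{\partial}{\partial \phi} \log p_{\phi} = x_{t-1}(x_t - \phi x_{t-1}), \qquad \frac{\partial^{2}}{\partial \phi^{2}} \log p_{\phi} = -x_{t-1}^{2}, \qquad \frac{\partial^{3}}{\partial \phi^{3}} \log p_{\phi} = 0.$$

For $|\phi - \phi_0| \leq \delta_0$, all three derivatives are bounded in absolute value by

$$M(x_{t-1}, x_t) = |x_t x_{t-1}| + (1 + |\phi_0| + \delta_0)\, x_{t-1}^{2}.$$

Under the stationary distribution, $(X_{t-1}, X_t)$ is bivariate Gaussian and $M$ is bounded by a quadratic polynomial in its arguments, so $\mathbb{E}_{\mathbb{P}_{\phi_0}} M^{4+2\delta} < \infty$ for every $\delta > 0$. Note that the envelope here depends on the current observation as well as the single lagged value.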
Assumption 6.

The interchange of order of integration with respect to $\mathbb{P}_{\theta_0}$ at $\theta_0$ is justified. The score function $\nabla_{\theta} \ell_k(\theta)$ is a martingale at $\theta = \theta_0$ for $m \geq 1$. Moreover,

$$-T^{-1} \frac{\partial^{2}}{\partial \theta\, \partial \theta^{\top}} p_{\theta_0}(X_{1:T}) \stackrel{\text{a.s.}}{\to} I(\theta_0) \quad \text{in } \mathbb{P}_{\theta_0}\text{-probability as } T \to \infty,$$

where $I(\theta_0)$ is a positive definite matrix. Further, for all sufficiently large $m$, $-m^{-1} \nabla_{\theta}^{2} \ell_k(\theta)$ is positive definite with eigenvalues bounded below and above by constants for all $\theta \in B_{\delta_0}(\theta_0)$ and all values of $\mathbf{X}_{[k]}$.

Assumption 7.

For any $\delta > 0$, there exists an $\epsilon > 0$ such that

$$\lim_{m \to \infty} \mathbb{P}_{\theta_0} \left( \sup_{\theta \in \Theta \,:\, \|\theta - \theta_0\| \geq \delta} \frac{\ell_1(\theta) - \ell_1(\theta_0)}{m} \leq -\epsilon \right) = 1.$$

We assume that the prior $\pi_0(\theta)$ has finite second moment; this is a fairly relaxed assumption and is required as we use the $\mathrm{W}_2$ distance to combine the subsequence posteriors.

Assumption 8 (Prior).

The prior density $\pi_0(\theta)$ is continuous at $\theta_0$. Moreover, $0 < \pi_0(\theta_0) < \infty$. The second moment of the prior exists, that is, $\int_{\theta} \|\theta\|^{2}\, \pi_0(\theta)\, \mathrm{d}\theta < \infty$.

Assumption 9 (Uniform integrability).

Let $\psi(\mathbf{X}_{[1]}) = \mathbb{E}_{\Pi_m(\mathrm{d}\theta \mid \mathbf{X}_{[1]})} \{Km \|\theta - \widehat{\theta}_1\|^{2}\}$, where $\mathbb{E}_{\Pi_m(\mathrm{d}\theta \mid \mathbf{X}_{[1]})}$ is the expectation with respect to $\theta$ under the posterior $\Pi_m(\mathrm{d}\theta \mid \mathbf{X}_{[1]})$. Then there exists an integer $m_0 \geq 1$ such that $\{\psi(\mathbf{X}_{[1]}) : m \geq m_0, K \geq 1\}$ is uniformly integrable under $\mathbb{P}_{\theta_0}$. In other words,

$$\lim_{C \to \infty} \sup_{m \geq m_0,\, K \geq 1} \mathbb{E}_{\mathbb{P}_{\theta_0}} \left[ \psi(\mathbf{X}_{[1]})\, \mathbb{I}\{\psi(\mathbf{X}_{[1]}) \geq C\} \right] = 0,$$

where $\mathbb{I}(\cdot)$ is the indicator function.

Assumptions 4, 5, 6 and 9 are generalizations of Assumptions 2, 3, 4, 7 of Li et al., (2017), respectively. These can be verified more straightforwardly if there exists a finite integer $n$ such that $p_{\theta}(x_t \mid x_{1:(t-1)}) \equiv p_{\theta}(x_t \mid x_{(t-n-1):(t-1)})$. In this case, $M_t \equiv M(X_{(t-n-1):(t-1)})$ is also an ergodic sequence.
To verify Assumption 5, by the $L^{p}$-ergodic theorem, it suffices to find a $\delta$ such that $\mathbb{E}_{\mathbb{P}_{\theta_0}} \{M(X_{(t-n-1):(t-1)})^{4+2\delta}\} < \infty$, where the expectation is with respect to the stationary distribution of $X_{t-n-1}, \dots, X_{t-1}$. Assumption 6 is a generalization of a common regularity condition to dependent processes. Again, in view of Assumption 5, which bounds the moments of the second-order derivatives of the log density function, the ergodic theorem will hold automatically to guarantee that

$$-T^{-1} \frac{\partial^{2}}{\partial \theta\, \partial \theta^{\top}} p_{\theta_0}(x_{1:T}) \stackrel{\text{a.s.}}{\to} I(\theta_0) \quad \text{in } \mathbb{P}_{\theta_0}\text{-probability}.$$

Assumption 7 will also hold if, for any $\delta > 0$, there exists an $\epsilon > 0$ such that $m^{-1} \inf_{\theta \in \Theta \,:\, \|\theta - \theta_0\| \geq \delta} \mathrm{KL}(p_{\theta} \,\|\, p_{\theta_0}) > \epsilon$ for all sufficiently large $m$, where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. Assumption 9 mirrors Assumption 7 in Li et al., (2017).
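
The ergodic-theorem argument for Assumption 6 can be checked numerically in a simple case. The sketch below is illustrative only and rests on assumptions not stated in the paper: it uses a Gaussian AR(1) model with unit innovation variance, interprets the displayed quantity as the negative averaged second derivative of the log-likelihood, and introduces the hypothetical helpers `simulate_ar1` and `averaged_observed_information`. For this model the limit is the stationary variance $1/(1 - \phi_0^{2})$.

```python
import numpy as np

def averaged_observed_information(x):
    """-T^{-1} d^2/dphi^2 of the conditional Gaussian AR(1) log-likelihood.

    With unit innovation variance, the second derivative of each term is
    -x_{t-1}^2, so the result is the mean of the squared lagged values.
    """
    x = np.asarray(x, dtype=float)
    return np.mean(x[:-1] ** 2)

def simulate_ar1(phi, T, rng):
    """Simulate a stationary Gaussian AR(1) path with unit innovation variance."""
    x = np.empty(T)
    x[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - phi ** 2))  # stationary start
    eps = rng.normal(size=T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + eps[t]
    return x

phi0 = 0.5
rng = np.random.default_rng(1)
x = simulate_ar1(phi0, T=200_000, rng=rng)
estimate = averaged_observed_information(x)
limit = 1.0 / (1.0 - phi0 ** 2)  # stationary variance, the Fisher information for phi
print(estimate, limit)
```

For a long simulated path the two printed values should be close, which is the behaviour the convergence statement describes in this special case.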

References

  • Agueh and Carlier, (2011) Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924.
  • Aicher et al., (2019) Aicher, C., Ma, Y.-A., Foti, N. J., and Fox, E. B. (2019). Stochastic gradient MCMC for state space models. SIAM Journal on Mathematics of Data Science, 1(3):555–587.
  • Beal, (2003) Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, UCL (University College London).
  • Bickel and Freedman, (1981) Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196–1217.
  • Bickel et al., (1998) Bickel, P. J., Ritov, Y., and Ryden, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. The Annals of Statistics, 26(4):1614–1635.
  • Bierkens et al., (2019) Bierkens, J., Fearnhead, P., and Roberts, G. (2019). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320.
  • Bini and Capovani, (1983) Bini, D. and Capovani, M. (1983). Spectral and computational properties of band symmetric Toeplitz matrices. Linear Algebra and its Applications, 52:99–126.
  • Blei et al., (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
  • Bollerslev, (1986) Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327.
  • Bollerslev, (1990) Bollerslev, T. (1990). Modelling the coherence in short-run nominal exchange rates: a multivariate generalized ARCH model. The Review of Economics and Statistics, 72(3):498–505.
  • Bouchard-Côté et al., (2018) Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113(522):855–867.
  • Carpenter et al., (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.
  • Chen et al., (2014) Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691. PMLR.
  • Cinlar, (2011) Cinlar, E. (2011). Probability and Stochastics. Springer.
  • Cuturi and Doucet, (2014) Cuturi, M. and Doucet, A. (2014). Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693. PMLR.
  • Dai et al., (2019) Dai, H., Pollock, M., and Roberts, G. (2019). Monte Carlo fusion. Journal of Applied Probability, 56(1):174–191.
  • Dai et al., (2023) Dai, H., Pollock, M., and Roberts, G. O. (2023). Bayesian fusion: scalable unification of distributed statistical analyses. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(1):84–107.
  • Del Moral et al., (2006) Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436.
  • Dharmadhikari et al., (1968) Dharmadhikari, S., Fabian, V., and Jogdeo, K. (1968). Bounds on the moments of martingales. The Annals of Mathematical Statistics, 39(5):1719–1723.
  • Duane et al., (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.
  • Dvurechenskii et al., (2018) Dvurechenskii, P., Dvinskikh, D., Gasnikov, A., Uribe, C., and Nedich, A. (2018). Decentralize and randomize: Faster algorithm for Wasserstein barycenters. Advances in Neural Information Processing Systems, 31.
  • Foti et al., (2014) Foti, N., Xu, J., Laird, D., and Fox, E. (2014). Stochastic variational inference for hidden Markov models. In Advances in Neural Information Processing Systems, pages 3599–3607.
  • Guhaniyogi et al., (2017) Guhaniyogi, R., Li, C., Savitsky, T. D., and Srivastava, S. (2017). A divide-and-conquer Bayesian approach to large-scale kriging. arXiv preprint arXiv:1712.09767.
  • Haario et al., (2001) Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242.
  • Hastings, (1970) Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
  • Hoffman and Gelman, (2014) Hoffman, M. D. and Gelman, A. (2014). The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
  • Hyndman and Khandakar, (2008) Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3):1–22.
  • Johndrow et al., (2020) Johndrow, J. E., Pillai, N. S., and Smith, A. (2020). No free lunch for approximate MCMC. arXiv preprint arXiv:2010.12514.
  • Johnson and Willsky, (2014) Johnson, M. and Willsky, A. (2014). Stochastic variational inference for Bayesian time series models. In International Conference on Machine Learning, pages 1854–1862. PMLR.
  • Kass et al., (1991) Kass, R. E., Tierney, L., and Kadane, J. B. (1991). Laplace’s method in Bayesian analysis. Contemporary Mathematics, 115:89–99.
  • Kucukelbir et al., (2017) Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research.
  • Lauritzen, (1992) Lauritzen, S. L. (1992). Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108.
  • Li et al., (2017) Li, C., Srivastava, S., and Dunson, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika, 104(3):665–680.
  • Ma et al., (2017a) Ma, J., Xu, F., Huang, K., and Huang, R. (2017a). GNAR-GARCH model and its application in feature extraction for rolling bearing fault diagnosis. Mechanical Systems and Signal Processing, 93:175–203.
  • Ma et al., (2015) Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925.
  • Ma et al., (2017b) Ma, Y.-A., Foti, N. J., and Fox, E. B. (2017b). Stochastic gradient MCMC methods for hidden Markov models. In International Conference on Machine Learning, pages 2265–2274. PMLR.
  • Metropolis et al., (1953) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.
  • Minsker et al., (2014) Minsker, S., Srivastava, S., Lin, L., and Dunson, D. (2014). Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning, pages 1656–1664. PMLR.
  • Musoff and Zarchan, (2009) Musoff, H. and Zarchan, P. (2009). Fundamentals of Kalman filtering: a practical approach. American Institute of Aeronautics and Astronautics.
  • Neiswanger et al., (2014) Neiswanger, W., Wang, C., and Xing, E. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the 30th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 623–632.
  • Nemeth and Fearnhead, (2020) Nemeth, C. and Fearnhead, P. (2020). Stochastic gradient Markov chain Monte Carlo. Journal of the American Statistical Association, 116(533):1–18.
  • Neumann, (1932) Neumann, J. v. (1932). Proof of the quasi-ergodic hypothesis. Proceedings of the National Academy of Sciences, 18(1):70–82.
  • Nkalu and Edeme, (2019) Nkalu, C. N. and Edeme, R. K. (2019). Environmental hazards and life expectancy in Africa: evidence from GARCH model. SAGE Open, 9(1):215–222.
  • Quiroz et al., (2018) Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2018). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114(526):831–843.
  • Rabiner and Juang, (1986) Rabiner, L. and Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16.
  • Rio, (1993) Rio, E. (1993). Covariance inequalities for strongly mixing processes. Annales de l’IHP Probabilités et statistiques, 29(4):587–597.
  • Roberts and Tweedie, (1996) Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.
  • Salomone et al., (2020) Salomone, R., Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2020). Spectral subsampling MCMC for stationary time series. In International Conference on Machine Learning, pages 8449–8458. PMLR.
  • Scott et al., (2016) Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88.
  • Srivastava et al., (2015) Srivastava, S., Cevher, V., Dinh, Q., and Dunson, D. (2015). WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920. PMLR.
  • Srivastava et al., (2018) Srivastava, S., Li, C., and Dunson, D. B. (2018). Scalable Bayes via barycenter in Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346.
  • Szabó and Van Zanten, (2019) Szabó, B. and Van Zanten, H. (2019). An asymptotic analysis of distributed nonparametric methods. Journal of Machine Learning Research, 20(87):1–30.
  • Villani, (2009) Villani, C. (2009). Optimal Transport: Old and New, volume 338. Springer.
  • Villani et al., (2022) Villani, M., Quiroz, M., Kohn, R., and Salomone, R. (2022). Spectral subsampling MCMC for stationary multivariate time series with applications to vector ARTFIMA processes. Econometrics and Statistics.
  • Wang and Srivastava, (2023) Wang, C. and Srivastava, S. (2023). Divide-and-conquer Bayesian inference in hidden Markov models. Electronic Journal of Statistics, 17(1):895–947.
  • Wang and Dunson, (2013) Wang, X. and Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
  • Wang et al., (2015) Wang, X., Guo, F., Heller, K. A., and Dunson, D. B. (2015). Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pages 451–459.
  • Welling and Teh, (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688. PMLR.
  • Whittle, (1951) Whittle, P. (1951). Hypothesis testing in time series analysis. Almquist and Wiksell, Uppsala.
  • Xue and Liang, (2019) Xue, J. and Liang, F. (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Statistics and Computing, 29(1):23–32.

Supplementary Material

Appendix S1 Main proofs

Simplifying notation: we use $(X_{1k},\dots,X_{mk})=\mathbf{X}_{[k]}$ to denote the observations within the $k$th subsequence. We also use $\ell_{k}^{\prime}(\theta)$ and $\ell_{k}^{\prime\prime}(\theta)$ to denote $\nabla_{\theta}\ell_{k}(\theta)$ and $\nabla_{\theta}^{2}\ell_{k}(\theta)$, respectively; this notation will be handy as we shall consider $\ell_{k}^{\prime}(\theta)$ and $\ell_{k}^{\prime\prime}(\theta)$ at different values of $\theta$.

The proofs of Theorems 1 and 2 rely on the following lemmas in addition to Lemma 1.

Lemma S1.

Suppose that Assumptions 1–9 hold. Then the following are true.

  1. There exists a weakly consistent estimator $\widehat{\theta}_{k}$ that is measurable with respect to $\sigma(\mathbf{X}_{[k]})$ solving the score equation $\ell_{k}^{\prime}(\widehat{\theta}_{k})=0$. Moreover, this estimator is consistent, that is, $\widehat{\theta}_{k}\stackrel{\mathbb{P}_{\theta_{0}}}{\to}\theta_{0}$ as $m\to\infty$.

  2. Let $\widehat{\theta}$ be a weakly consistent estimator of $\theta_{0}$ based on the complete dataset $\mathbf{X}$ solving the score equation $\ell^{\prime}(\widehat{\theta})=0$. Then $\widehat{\theta}\stackrel{\mathbb{P}_{\theta_{0}}}{\to}\theta_{0}$ as $T\to\infty$.

  3. Let $\zeta=T^{1/2}(\theta-\widehat{\theta}_{k})$ be a local parameter for the $k$th subsequence, and $\vartheta=T^{1/2}(\theta-\widehat{\theta})$ be a local parameter for the complete dataset. Let $\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]})$ be the $k$th subsequence posterior induced by $\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})$ and $\Pi_{T,\vartheta}(\mathrm{d}\vartheta\mid X_{1:T})$ be the posterior of $\vartheta$ induced by the overall posterior $\Pi_{T}(\mathrm{d}\theta\mid X_{1:T})$. Then

    \begin{align*}
    \lim_{m\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right] &=0,\\
    \lim_{m\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\mathrm{TV}_{2}\left[\Pi_{T,\vartheta}(\mathrm{d}\vartheta\mid X_{1:T}),\Phi\{\mathrm{d}\vartheta;0,I^{-1}(\theta_{0})\}\right] &=0,
    \end{align*}

    where $\mathrm{TV}_{2}$ denotes the total variation of second moment distance.

Lemma S2 (Lemma 3 of Li et al.,, 2017).

Let $\widehat{\xi}_{k}=a^{\top}\widehat{\theta}_{k}+b$ and $\xi=a^{\top}\theta+b$. Then

\[
W_{2}\left(\overline{\Pi}_{T}(\mathrm{d}\xi\mid X_{1:T}),\Phi[\mathrm{d}\xi;\overline{\xi},\{TI_{\xi}(\theta_{0})\}^{-1}]\right)\leq\frac{1}{K}\sum_{k=1}^{K}W_{2}\big(\Pi_{m}(\mathrm{d}\xi\mid\mathbf{X}_{[k]}),\Phi[\mathrm{d}\xi;\widehat{\xi}_{k},\{TI_{\xi}(\theta_{0})\}^{-1}]\big).
\]
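The inequality in Lemma S2 can be illustrated numerically in one dimension. The sketch below assumes that the combined posterior is the one-dimensional $W_2$ barycenter of the subsequence posteriors, so that its quantile function is the average of the subsequence quantile functions, and that $\overline{\xi}$ is the average of the $\widehat{\xi}_{k}$, following Li et al., (2017). The function names, the quantile grid, and the skewed draws used as stand-ins for subsequence posteriors are all hypothetical choices made for illustration; `sd` plays the role of $\{TI_{\xi}(\theta_{0})\}^{-1/2}$.

```python
import random
from statistics import NormalDist, fmean


def empirical_quantiles(draws, probs):
    """Empirical quantile function of `draws` evaluated at each level in `probs`."""
    s = sorted(draws)
    n = len(s)
    return [s[min(int(p * n), n - 1)] for p in probs]


def w2(q1, q2):
    """Discretised 1-D W2 distance: root-mean-square gap between two quantile vectors."""
    return fmean((a - b) ** 2 for a, b in zip(q1, q2)) ** 0.5


def lemma_s2_sides(subset_draws, centres, sd, n_grid=200):
    """Return (left-hand side, right-hand side) of the Lemma S2 inequality on a quantile grid."""
    probs = [(i + 0.5) / n_grid for i in range(n_grid)]
    subset_q = [empirical_quantiles(d, probs) for d in subset_draws]
    # In one dimension the W2 barycenter averages the quantile functions.
    bary_q = [fmean(q[i] for q in subset_q) for i in range(n_grid)]
    # Normal approximations centred at each xi_hat_k, and at their average xi_bar.
    normal_q = [[NormalDist(c, sd).inv_cdf(p) for p in probs] for c in centres]
    target_q = [NormalDist(fmean(centres), sd).inv_cdf(p) for p in probs]
    lhs = w2(bary_q, target_q)
    rhs = fmean(w2(q, g) for q, g in zip(subset_q, normal_q))
    return lhs, rhs


if __name__ == "__main__":
    rng = random.Random(1)
    centres = [0.9, 1.0, 1.2]  # hypothetical xi_hat_k values
    # Skewed (shifted gamma) draws standing in for three subsequence posteriors.
    draws = [[c + 0.05 * (rng.gammavariate(4.0, 1.0) - 4.0) for _ in range(2000)]
             for c in centres]
    lhs, rhs = lemma_s2_sides(draws, centres, sd=0.1)
    print(f"barycenter distance {lhs:.4f} <= average subset distance {rhs:.4f}")
```

On the grid, the left-hand side is the norm of an average of quantile gaps and the right-hand side is the average of their norms, so the inequality is an instance of the triangle inequality.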

Lemma S1 mirrors Lemma 2 of Li et al., (2017), and its proof can be straightforwardly modified from their proof. We therefore focus our attention on the proof of Lemma 1, which is novel.

S1.1 Proof of Lemma 1

Proof.

We use first-order Taylor expansions of $\ell_{k}^{\prime}(\widehat{\theta}_{k})$ and $\ell^{\prime}(\widehat{\theta})$ around $\theta_{0}$,

\begin{align*}
0 &=\ell_{k}^{\prime}(\widehat{\theta}_{k})=\ell_{k}^{\prime}(\theta_{0})+\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})(\widehat{\theta}_{k}-\theta_{0}),\\
0 &=\ell^{\prime}(\widehat{\theta})=\ell^{\prime}(\theta_{0})+\ell^{\prime\prime}(\widetilde{\theta})(\widehat{\theta}-\theta_{0}),
\end{align*}

where $\widetilde{\theta}$ lies between $\widehat{\theta}$ and $\theta_{0}$, and $\widetilde{\theta}_{k}$ lies between $\widehat{\theta}_{k}$ and $\theta_{0}$. Therefore,

\begin{align*}
\widehat{\theta}_{k} &=\theta_{0}-\left\{\frac{1}{m}\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})\right\}^{-1}\frac{\ell_{k}^{\prime}(\theta_{0})}{m}=\theta_{0}+\frac{1}{m}I^{-1}(\theta_{0})\ell_{k}^{\prime}(\theta_{0})+Z_{k}\frac{\ell_{k}^{\prime}(\theta_{0})}{m}, \tag{S1}\\
\widehat{\theta} &=\theta_{0}-\left\{\frac{1}{T}\ell^{\prime\prime}(\widetilde{\theta})\right\}^{-1}\frac{\ell^{\prime}(\theta_{0})}{T}=\theta_{0}+\frac{1}{T}I^{-1}(\theta_{0})\ell^{\prime}(\theta_{0})+Z\frac{\ell^{\prime}(\theta_{0})}{T}.
\end{align*}

After rearranging, we obtain the difference between $\overline{\theta}$ and $\widehat{\theta}$ as

\begin{equation*}
\overline{\theta}-\widehat{\theta}=\frac{1}{K}\sum_{k=1}^{K}Z_{k}\frac{\ell_{k}^{\prime}(\theta_{0})}{m}-Z\frac{\ell^{\prime}(\theta_{0})}{T}+Q, \tag{S2}
\end{equation*}

where

\[
Z_{k}=\left\{-\frac{1}{m}\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})\right\}^{-1}-I^{-1}(\theta_{0}),\quad Z=\left\{-\frac{1}{T}\ell^{\prime\prime}(\widetilde{\theta})\right\}^{-1}-I^{-1}(\theta_{0}),
\]

and

\[
Q=I^{-1}(\theta_{0})\left\{\frac{\sum_{k=1}^{K}\ell_{k}^{\prime}(\theta_{0})-\ell^{\prime}(\theta_{0})}{T}\right\}.
\]
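The rearrangement leading to (S2) can be spelled out as follows. This is a sketch that assumes $\overline{\theta}=K^{-1}\sum_{k=1}^{K}\widehat{\theta}_{k}$ and $T=Km$, which is what the form of (S2) requires: averaging (S1) over $k$ and subtracting the expansion of $\widehat{\theta}$ collects the terms in $I^{-1}(\theta_{0})$ into $Q$.

```latex
% Assumes \overline{\theta} = K^{-1} \sum_k \widehat{\theta}_k and T = K m.
\begin{align*}
\overline{\theta}
  &= \frac{1}{K}\sum_{k=1}^{K}\widehat{\theta}_{k}
   = \theta_{0}
     + \frac{1}{T}I^{-1}(\theta_{0})\sum_{k=1}^{K}\ell_{k}^{\prime}(\theta_{0})
     + \frac{1}{K}\sum_{k=1}^{K}Z_{k}\frac{\ell_{k}^{\prime}(\theta_{0})}{m},\\
\overline{\theta}-\widehat{\theta}
  &= \frac{1}{K}\sum_{k=1}^{K}Z_{k}\frac{\ell_{k}^{\prime}(\theta_{0})}{m}
     - Z\frac{\ell^{\prime}(\theta_{0})}{T}
     + \underbrace{I^{-1}(\theta_{0})\left\{\frac{\sum_{k=1}^{K}\ell_{k}^{\prime}(\theta_{0})-\ell^{\prime}(\theta_{0})}{T}\right\}}_{Q}.
\end{align*}
```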

The second term on the right-hand side of equation (S2) is $o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$. The convergence of $T^{-1/2}\ell^{\prime}(\theta_{0})$ to $\mathrm{N}\{0,I^{-1}(\theta_{0})\}$ in distribution is established by the martingale central limit theorem. Therefore, $\ell^{\prime}(\theta_{0})/T$ is $O_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$. Moreover, we have $-\ell^{\prime\prime}(\widetilde{\theta})/T\stackrel{\mathbb{P}_{\theta_{0}}}{\to}I(\theta_{0})$ and thus $Z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$ by the continuous mapping theorem. Hence $Z\ell^{\prime}(\theta_{0})/T=o_{\mathbb{P}_{\theta_{0}}}(T^{-1/2})$.
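The size of the remaining term $Q$ defined above can be explored numerically in a simple case. The sketch below assumes a stationary Gaussian AR(1) model with unit innovation variance, an illustrative choice rather than a model taken from the paper, for which $I^{-1}(\theta_{0})=1-\theta_{0}^{2}$. Under this Markov model, a subsequence score differs from its conditional counterpart only through the first observation of the subsequence, so $\sum_{k}\ell_{k}^{\prime}(\theta_{0})-\ell^{\prime}(\theta_{0})$ reduces to one boundary term per subsequence. All function names are hypothetical.

```python
import random


def simulate_ar1(T, theta, rng):
    """Draw a stationary path of X_t = theta * X_{t-1} + eps_t with eps_t ~ N(0, 1)."""
    x = [rng.gauss(0.0, (1.0 - theta ** 2) ** -0.5)]  # stationary initial draw
    for _ in range(T - 1):
        x.append(theta * x[-1] + rng.gauss(0.0, 1.0))
    return x


def marginal_score(x, theta):
    """d/dtheta of log N(x; 0, 1 / (1 - theta^2)), the stationary density."""
    return -theta / (1.0 - theta ** 2) + theta * x * x


def conditional_score(x, prev, theta):
    """d/dtheta of log N(x; theta * prev, 1), the transition density."""
    return prev * (x - theta * prev)


def block_score(block, theta):
    """Score of a stretch of data treated as its own stationary AR(1) series."""
    s = marginal_score(block[0], theta)
    for prev, cur in zip(block, block[1:]):
        s += conditional_score(cur, prev, theta)
    return s


def q_term(x, K, theta0):
    """Return Q computed from its definition and from the boundary-only formula."""
    T = len(x)
    m = T // K
    if K * m != T:
        raise ValueError("K must divide the series length")
    blocks = [x[k * m:(k + 1) * m] for k in range(K)]
    inv_info = 1.0 - theta0 ** 2  # I^{-1}(theta0) for this model
    full = block_score(x, theta0)  # score of the complete series
    direct = inv_info * (sum(block_score(b, theta0) for b in blocks) - full) / T
    # Only the first observation of blocks 2..K is treated differently.
    boundary = sum(
        marginal_score(blocks[k][0], theta0)
        - conditional_score(blocks[k][0], blocks[k - 1][-1], theta0)
        for k in range(1, K)
    )
    return direct, inv_info * boundary / T


if __name__ == "__main__":
    series = simulate_ar1(12000, 0.5, random.Random(0))
    for K in (10, 20, 40):
        q, _ = q_term(series, K, 0.5)
        print(K, len(series) // K, abs(q) * (len(series) // K))
```

In this setting the product $m\,|Q|$ is expected to stay bounded as $K$ varies for fixed $T$, in line with the $O_{\mathbb{P}_{\theta_{0}}}(m^{-1})$ rate.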

We show that the third term $Q$ on the right-hand side of equation (S2) is $O_{\mathbb{P}_{\theta_{0}}}(m^{-1})$. Define

\[
\ell^{\prime}_{k\mid-k}(\theta_{0})=\nabla_{\theta}\log p_{\theta}(\mathbf{X}_{[k]}\mid\mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]})\big|_{\theta=\theta_{0}},
\]

and thus $\ell^{\prime}(\theta_{0})=\sum_{k=1}^{K}\ell^{\prime}_{k\mid-k}(\theta_{0})$. Therefore we write

\begin{align*}
Q &=I^{-1}(\theta_{0})\left\{\frac{\sum_{k=1}^{K}\ell_{k}^{\prime}(\theta_{0})-\ell^{\prime}(\theta_{0})}{T}\right\}\\
&=I^{-1}(\theta_{0})\frac{\sum_{k=1}^{K}\{\ell_{k}^{\prime}(\theta_{0})-\ell^{\prime}_{k\mid-k}(\theta_{0})\}}{T}=I^{-1}(\theta_{0})Q_{1},
\end{align*}

where

\begin{align*}
Q_{1} &=\frac{\sum_{k=1}^{K}\{\ell_{k}^{\prime}(\theta_{0})-\ell^{\prime}_{k\mid-k}(\theta_{0})\}}{T}\\
&=\frac{1}{T}\sum_{k=1}^{K}\sum_{i=1}^{m}\big\{\nabla_{\theta}\log p_{\theta}(X_{ik}\mid X_{1k},\dots,X_{(i-1)k})\\
&\hspace{6em}-\nabla_{\theta}\log p_{\theta}(X_{ik}\mid\mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]},X_{1k},\dots,X_{(i-1)k})\big\}\big|_{\theta=\theta_{0}}
\end{align*}

by the fact that we can write

\begin{align*}
\ell_k'(\theta_0) &= \sum_{i=1}^{m}\nabla_\theta \log p_\theta(X_{ik}\mid X_{1k},\dots,X_{(i-1)k})\big|_{\theta=\theta_0} \quad\text{and} \\
\ell'_{k\mid -k}(\theta_0) &= \sum_{i=1}^{m}\nabla_\theta \log p_\theta(X_{ik}\mid \mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]},X_{1k},\dots,X_{(i-1)k})\big|_{\theta=\theta_0}.
\end{align*}

Assumption 3 bounds the error between $\nabla_\theta \log p_\theta(X_{ik}\mid X_{1k},\dots,X_{(i-1)k})|_{\theta=\theta_0}$ and $\nabla_\theta \log p_\theta(X_{ik}\mid \mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]},X_{1k},\dots,X_{(i-1)k})|_{\theta=\theta_0}$, and thus the error between $\ell_k'(\theta_0)$ and $\ell'_{k\mid -k}(\theta_0)$. We have that $Q_1 = O_{\mathbb{P}_{\theta_0}}(m^{-1})$ by Markov's inequality, since for any $s>0$,

\begin{align*}
\mathbb{P}_{\theta_0}\left(m\left\|Q_1\right\|_1 > s\right)
&\leq \frac{m}{sT}\sum_{k=1}^{K}\sum_{i=1}^{m}\big\|\nabla_\theta \log p_\theta(X_{ik}\mid X_{1k},\dots,X_{(i-1)k})\big|_{\theta=\theta_0} \\
&\hspace{6em} - \nabla_\theta \log p_\theta(X_{ik}\mid \mathbf{X}_{[1]},\dots,\mathbf{X}_{[k-1]},X_{1k},\dots,X_{(i-1)k})\big|_{\theta=\theta_0}\big\|_1 \\
&\leq \frac{m\,C_0\sum_{k=1}^{K}\sum_{i=0}^{m}\rho_0^{i}}{sT} \leq \frac{C_0}{s(1-\rho_0)} \to 0 \quad\text{as}~ s\to\infty,
\end{align*}

where the second inequality is by Assumption 3. Moreover, by Assumption 6, $I^{-1}(\theta_0) = O_{\mathbb{P}_{\theta_0}}(1)$. Therefore $Q = O_{\mathbb{P}_{\theta_0}}(1)\times O_{\mathbb{P}_{\theta_0}}(m^{-1}) = O_{\mathbb{P}_{\theta_0}}(m^{-1})$. Furthermore, if $m=\mathcal{O}(T^{1/2})$, then $Q = o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$.
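The geometric-series step in the bound above can be checked numerically. The following is a minimal sketch, not part of the original proof: the function name `markov_bound_terms` and the values chosen for $C_0$, $\rho_0$, $s$, $K$ and $m$ are illustrative assumptions. It evaluates the middle term $m\,C_0\sum_{k}\sum_{i=0}^{m}\rho_0^{i}/(sT)$ with $T = mK$ and compares it with the final bound $C_0/\{s(1-\rho_0)\}$, which does not depend on $m$ or $K$.

```python
def markov_bound_terms(m, K, C0, rho0, s):
    """Return (middle, final) terms of the Markov-inequality chain.

    middle = m * C0 * sum_{k=1}^K sum_{i=0}^m rho0^i / (s * T), with T = m * K
    final  = C0 / (s * (1 - rho0))
    All argument values are illustrative; they are not taken from the paper.
    """
    T = m * K
    # Inner sum over i does not depend on k, so the sum over k contributes a factor K.
    geometric_sum = sum(rho0 ** i for i in range(m + 1))
    middle = m * C0 * K * geometric_sum / (s * T)
    final = C0 / (s * (1.0 - rho0))
    return middle, final


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        middle, final = markov_bound_terms(m=m, K=50, C0=2.0, rho0=0.5, s=5.0)
        print(f"m={m:5d}  middle={middle:.6f}  final={final:.6f}")
```

Because the factor $mK/T$ cancels, the middle term increases with $m$ only through the partial geometric sum, which stays below $1/(1-\rho_0)$; this is why the bound on $m\|Q_1\|_1$ is uniform in $m$ and $K$.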

The convergence order of the first term on the right-hand side of equation S2 is established by Markov's inequality. We define $W_k = Z_k\ell_k'(\theta_0)/m$ and prove later that $\mathbb{E}_{\mathbb{P}_{\theta_0}}(\|W_k\|^2)\to 0$ as $m\to\infty$. Then

\begin{align*}
\mathbb{P}_{\theta_0}\bigg(\bigg\|\frac{1}{K}\sum_{k=1}^{K}\frac{W_k}{m^{1/2}}\bigg\| \geq c\,m^{-1/2}\bigg)
&\leq \frac{m\,\mathbb{E}_{\mathbb{P}_{\theta_0}}\|(1/K)\sum_{k=1}^{K}W_k/m^{1/2}\|^2}{c^2} \\
&\leq \frac{1}{c^2K}\sum_{k=1}^{K}\mathbb{E}_{\mathbb{P}_{\theta_0}}(\|W_k\|^2) = \frac{1}{c^2}\mathbb{E}_{\mathbb{P}_{\theta_0}}(\|W_1\|^2) \to 0. \tag{S3}
\end{align*}

The convergence in equation S3 follows from $\mathbb{E}_{\mathbb{P}_{\theta_0}}(\|W_k\|^2)\to 0$, which we prove below. Further, when $\widehat{\theta}_k$ is an unbiased estimator of $\theta_0$, we have $\mathbb{E}_{\mathbb{P}_{\theta_0}}(W_k)=0$ because

\begin{align*}
\mathbb{E}_{\mathbb{P}_{\theta_0}}(\widehat{\theta}_k) = \theta_0 + \mathbb{E}_{\mathbb{P}_{\theta_0}}\bigg\{\frac{1}{m}I^{-1}(\theta_0)\ell_k'(\theta_0)\bigg\} + \mathbb{E}_{\mathbb{P}_{\theta_0}}(W_k) = \theta_0 + \mathbb{E}_{\mathbb{P}_{\theta_0}}(W_k) = \theta_0,
\end{align*}

recalling equation S1. We will prove that, when $\widehat{\theta}_k$ is unbiased, expression (S2) is of order $o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$. By Markov's inequality, for every $c>0$,

\begin{align*}
&\mathbb{P}_{\theta_0}\bigg(\bigg\|\frac{1}{K}\sum_{k=1}^{K}\frac{W_k}{m^{1/2}}\bigg\| > c\,T^{-1/2}\bigg) = \mathbb{P}_{\theta_0}\bigg(\bigg\|\frac{1}{K}\sum_{k=1}^{K}W_k\bigg\| > c\,K^{-1/2}\bigg) \\
&\quad\leq \frac{K\,\mathbb{E}_{\mathbb{P}_{\theta_0}}\|(1/K)\sum_{k=1}^{K}W_k\|^2}{c^2} = \frac{1}{Kc^2}\mathbb{E}_{\mathbb{P}_{\theta_0}}\bigg(\sum_{1\leq k_1,k_2\leq K}W_{k_1}^{\top}W_{k_2}\bigg) \\
&\quad\leq \frac{1}{Kc^2}\sum_{1\leq k_1,k_2\leq K}\sum_{l=1}^{d}|\mathrm{cov}_{\mathbb{P}_{\theta_0}}(W_{k_1}^{l},W_{k_2}^{l})|, \tag{S4}
\end{align*}

where we write $W_k = (W_k^1,\dots,W_k^d)$, and $\mathrm{cov}_{\mathbb{P}_{\theta_0}}$ denotes covariance under $\mathbb{P}_{\theta_0}$. The last inequality of equation S4 is by $\mathbb{E}_{\mathbb{P}_{\theta_0}}(W_k) = (0,\dots,0)$ and $|\mathbb{E}_{\mathbb{P}_{\theta_0}}(W_1^{\top}W_2)| = |\sum_{l=1}^{d}\mathrm{cov}_{\mathbb{P}_{\theta_0}}(W_1^{l},W_2^{l})|$. To prove that the last term of equation S4 converges to zero as $m\to\infty$, it suffices to prove that for every $1\leq l\leq d$,

\begin{align*}
\frac{1}{K}\sum_{1\leq k_1,k_2\leq K}|\mathrm{cov}_{\mathbb{P}_{\theta_0}}(W_{k_1}^{l},W_{k_2}^{l})| \to 0 \quad\text{as}~ m\to\infty. \tag{S5}
\end{align*}

The stationarity of the process (Assumption 1) allows us to denote $|\mathrm{cov}_{\mathbb{P}_{\theta_0}}(W_{k_1}^{l},W_{k_2}^{l})|$ by $\delta^{l}(k_1-k_2)$.
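To illustrate why stationarity makes the block covariance a function of the lag $k_1-k_2$ alone, the following sketch uses a stationary AR(1) process as a stand-in. This is an assumption made purely for illustration: the block means computed here are not the quantities $W_k^{l}$ of the proof, and the function names `ar1_autocov` and `block_mean_cov` are hypothetical. The covariance between two block means is computed exactly from the AR(1) autocovariance $\gamma(h) = \sigma^2\phi^{|h|}/(1-\phi^2)$.

```python
def ar1_autocov(h, phi, sigma2=1.0):
    """Autocovariance at lag h of a stationary AR(1) process with |phi| < 1."""
    return sigma2 * phi ** abs(h) / (1.0 - phi ** 2)


def block_mean_cov(k1, k2, m, phi):
    """Exact covariance between the means of blocks k1 and k2, each of length m.

    Block k holds time points (k-1)*m + 1, ..., k*m, so the lag between the
    i-th point of block k1 and the j-th point of block k2 is (k1-k2)*m + i - j.
    """
    offset = (k1 - k2) * m
    total = 0.0
    for i in range(m):
        for j in range(m):
            total += ar1_autocov(offset + i - j, phi)
    return total / (m * m)


if __name__ == "__main__":
    phi = 0.5
    for m in (5, 20, 80):
        lag1 = block_mean_cov(1, 2, m, phi)
        lag2 = block_mean_cov(1, 3, m, phi)
        print(f"m={m:3d}  cov at block lag 1: {lag1:.3e}  at block lag 2: {lag2:.3e}")
```

In this toy setting the covariance depends on the block indices only through their difference, and it shrinks as the block length $m$ grows or as the blocks move further apart, which is the qualitative behaviour that the mixing argument below quantifies.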

We employ the following $\alpha$-mixing inequality of Rio (1993):

\begin{align*}
|\mathrm{cov}_{\mathbb{P}_{\theta_0}}(W_{k_1}^{l},W_{k_2}^{l})| = \delta^{l}(k_1-k_2) \leq 2\int_{0}^{2\alpha(|(k_1-k_2)m|)} Q_{W^{l}_{k_1}}(u)\,Q_{W^{l}_{k_2}}(u)\,\mathrm{d}u,
\end{align*}

where α()𝛼\alpha(\cdot)italic_α ( ⋅ ) is the alpha α𝛼\alphaitalic_α-mixing coefficient as stated in Assumption 2 and QWkl(u)=inf{t:(|Wkl|>t)u}subscript𝑄superscriptsubscript𝑊𝑘𝑙𝑢infimumconditional-set𝑡superscriptsubscript𝑊𝑘𝑙𝑡𝑢Q_{W_{k}^{l}}(u)=\inf\{t:\mathbb{P}(|W_{k}^{l}|>t)\leq u\}italic_Q start_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_u ) = roman_inf { italic_t : blackboard_P ( | italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT | > italic_t ) ≤ italic_u } is the quantile function of |Wkl|superscriptsubscript𝑊𝑘𝑙|W_{k}^{l}|| italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT |. Further, by Markov’s inequality, θ0(|Wkl|>t)𝔼θ0(|Wkl|2+δ)/t2+δsubscriptsubscript𝜃0superscriptsubscript𝑊𝑘𝑙𝑡subscript𝔼subscriptsubscript𝜃0superscriptsuperscriptsubscript𝑊𝑘𝑙2𝛿superscript𝑡2𝛿\mathbb{P}_{\theta_{0}}(|W_{k}^{l}|>t)\leq\mathbb{E}_{\mathbb{P}_{\theta_{0}}}% (|W_{k}^{l}|^{2+\delta})/t^{2+\delta}blackboard_P start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( | italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT | > italic_t ) ≤ blackboard_E start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( | italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT | start_POSTSUPERSCRIPT 2 + italic_δ end_POSTSUPERSCRIPT ) / italic_t start_POSTSUPERSCRIPT 2 + italic_δ end_POSTSUPERSCRIPT, where δ>0𝛿0\delta>0italic_δ > 0 is a positive real number as stated in Assumption 2, and so we have QWk2l(u)u1/(2+δ)Wkl2+δsubscript𝑄subscriptsuperscript𝑊𝑙subscript𝑘2𝑢superscript𝑢12𝛿subscriptnormsuperscriptsubscript𝑊𝑘𝑙2𝛿Q_{W^{l}_{k_{2}}}(u)\leq 
u^{-1/(2+\delta)}\|W_{k}^{l}\|_{2+\delta}italic_Q start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_u ) ≤ italic_u start_POSTSUPERSCRIPT - 1 / ( 2 + italic_δ ) end_POSTSUPERSCRIPT ∥ italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 + italic_δ end_POSTSUBSCRIPT. Equation (S5) then follows as

\begin{align*}
\frac{1}{K}\sum_{1\leq k_{1},k_{2}\leq K}\big|\mathrm{cov}_{\mathbb{P}_{\theta_{0}}}(W_{k_{1}}^{l},W_{k_{2}}^{l})\big|
&=\frac{1}{K}\sum_{1\leq k_{1},k_{2}\leq K}\delta^{l}(k_{1}-k_{2})=\sum_{k=1}^{K}\frac{K-k}{K}\,\delta^{l}(k)\\
&\leq\sum_{k=1}^{K}\delta^{l}(k)\leq 2\sum_{k=1}^{K}\int_{0}^{2\alpha(km)}Q^{2}_{W^{l}_{1}}(u)\,\mathrm{d}u\leq 2\sum_{k=1}^{K}\int_{0}^{2\alpha(km)}u^{-2/(2+\delta)}\|W_{1}^{l}\|^{2}_{2+\delta}\,\mathrm{d}u\\
&\leq C_{0}\bigg\{\sum_{k=1}^{K}\alpha(km)^{\delta/(2+\delta)}\bigg\}\|W_{1}^{l}\|^{2}_{2+\delta}\leq C_{0}\bigg\{\sum_{k=1}^{\infty}\alpha(km)^{\delta/(2+\delta)}\bigg\}\|W_{1}^{l}\|^{2}_{2+\delta}\to 0,
\end{align*}

as $m\to\infty$, where the second inequality is by Assumption 1 and $C_{0}$ is a constant independent of $m$. The final convergence to zero is by Assumption 2 and $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})\to 0$, which we prove below.
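
For concreteness, the constant $C_{0}$ can be made explicit by evaluating the integral in the third inequality. The following is a short worked computation, assuming $\delta>0$ so that the exponent $2/(2+\delta)$ is strictly less than one and the integrand is integrable at zero:

\begin{align*}
2\int_{0}^{2\alpha(km)}u^{-2/(2+\delta)}\,\mathrm{d}u
=2\,\frac{2+\delta}{\delta}\,\{2\alpha(km)\}^{\delta/(2+\delta)}
=C_{0}\,\alpha(km)^{\delta/(2+\delta)},
\qquad
C_{0}=\frac{2(2+\delta)}{\delta}\,2^{\delta/(2+\delta)}.
\end{align*}

In particular, $C_{0}$ depends only on $\delta$ and not on $k$ or $m$.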

Proof of $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})\to 0$ as $m\to\infty$ for every $k=1,\dots,K$:
We first show that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})\to 0$, and thus $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2})$ also converges to zero by Jensen's inequality. Since $\ell_{k}^{\prime}(\theta_{0})/m=O_{\mathbb{P}_{\theta_{0}}}(1)$ and $Z_{k}=o_{\mathbb{P}_{\theta_{0}}}(1)$, it follows that $\|W_{k}\|=\|Z_{k}\,\ell_{k}^{\prime}(\theta_{0})/m\|$ is $o_{\mathbb{P}_{\theta_{0}}}(1)$. The remaining part of the proof is to show that $\|W_{k}\|^{2+\delta}$ is dominated by some $V_{k}\in L_{1}$ under $\mathbb{P}_{\theta_{0}}$. The dominated convergence theorem will then imply that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|W_{k}\|^{2+\delta})$ converges to zero.
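
To see why a dominating variable is needed, and why convergence in probability alone does not suffice, consider the following standard counterexample. It is an illustration only and not part of the original argument; the variables $U$ and $Y_{m}$ are hypothetical. Let $U\sim\mathrm{Uniform}(0,1)$ and $Y_{m}=m\,\mathbb{1}(U<1/m)$. Then

\[
\mathbb{P}(|Y_{m}|>\varepsilon)\leq\frac{1}{m}\to 0\quad\text{for every }\varepsilon>0,
\qquad\text{but}\qquad
\mathbb{E}(Y_{m})=m\cdot\frac{1}{m}=1\not\to 0.
\]

No integrable variable dominates every $Y_{m}$, since $\sup_{m}Y_{m}\geq 1/U-1$, which has infinite expectation.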

By Assumption 6, let $\underline{\lambda}>0$ be the lower bound of $-\ell_{k}^{\prime\prime}(\theta)/m$ for all $\theta\in B_{\delta_{0}}(\theta_{0})$ and sufficiently large $m$. We have

\[
\bigg\|\bigg\{-\frac{1}{m}\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})\bigg\}^{-1}\bigg\|\leq d^{1/2}\underline{\lambda}^{-1}\quad\text{and}\quad\|I(\theta_{0})^{-1}\|\leq d^{1/2}\underline{\lambda}^{-1}. \tag{S6}
\]

Moreover, Assumption 5 implies that

\[
\bigg\|-\frac{1}{m}\ell_{k}^{\prime\prime}(\widetilde{\theta}_{k})\bigg\|^{2}\leq\frac{d^{2}}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2}. \tag{S7}
\]

It follows from equations (S6) and (S7) that for all sufficiently large $m$,

\begin{align*}
\|Z_{k}\|^{2+\delta}
&=\bigg\|\bigg\{-\frac{1}{m}\frac{\partial^{2}\ell_{k}(\widetilde{\theta}_{k})}{\partial\theta\,\partial\theta^{\top}}\bigg\}^{-1}\bigg\{-\frac{1}{m}\frac{\partial^{2}\ell_{k}(\widetilde{\theta}_{k})}{\partial\theta\,\partial\theta^{\top}}-I(\theta_{0})\bigg\}I^{-1}(\theta_{0})\bigg\|^{2+\delta}\\
&\leq\frac{1}{2}\,d^{(2+\delta)/2}\underline{\lambda}^{-(2+\delta)}\bigg\{\frac{d^{2}}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+\|I(\theta_{0})\|^{2+\delta}\bigg\}\,d^{(2+\delta)/2}\underline{\lambda}^{-(2+\delta)}\\
&\leq c_{1}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2},
\end{align*}

where $c_{1}$ and $c_{2}$ are constants that are independent of $m$. Now define

\[
V_{k}=\bigg\{c_{1}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}\bigg\}\times\|m^{-1/2}\ell_{k}^{\prime}(\theta_{0})\|^{2+\delta}.
\]

We will show that $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(V_{k})<\infty$ and then apply the dominated convergence theorem to $\|W_{k}\|^{2+\delta}$. We first apply the Cauchy-Schwarz inequality to $V_{k}$ and obtain

\[
\begin{aligned}
\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(V_{k})
&=\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg[\bigg\{c_{1}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}\bigg\}\|m^{-1/2}\ell_{k}^{\prime}(\theta_{0})\|^{2+\delta}\bigg]\\
&\leq\bigg[\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{c_{1}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}\bigg\}^{2}\bigg]^{1/2}\times\big\{\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(\|m^{-1/2}\ell_{k}^{\prime}(\theta_{0})\|^{4+2\delta})\big\}^{1/2}.
\end{aligned}
\tag{S8}
\]

The first term of equation (S8) is bounded by Assumption 5 as

\begin{align*}
&\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{c_{1}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}\bigg\}^{2}\\
&\quad=\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg[c_{1}^{2}\bigg\{\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}\bigg\}^{2}+2c_{1}c_{2}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}^{2}\bigg]\\
&\quad\leq\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{c_{1}^{2}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{4+2\delta}+2c_{1}c_{2}\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}+c_{2}^{2}\bigg\}<\infty.
\end{align*}
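
The inequality in the last line of this bound replaces the square of an average by the average of squares. As a brief sketch of that step, write $a_{i}=M_{i}(X_{1k},\dots,X_{ik})^{2+\delta}$; the shorthand $a_{i}$ is introduced here only for illustration. By Jensen's inequality for the convex function $x\mapsto x^{2}$, or equivalently by the Cauchy-Schwarz inequality,

\[
\bigg(\frac{1}{m}\sum_{i=1}^{m}a_{i}\bigg)^{2}\leq\frac{1}{m}\sum_{i=1}^{m}a_{i}^{2}=\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{4+2\delta}.
\]

The expectation of the right-hand side is finite provided the $M_{i}$ have uniformly bounded moments of order $4+2\delta$, which is how Assumption 5 appears to be used here.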

To bound the second term of equation (S8), recall that $\ell_{k}^{\prime}(\theta)=\sum_{t=1}^{m}p_{\theta}^{\prime}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})/p_{\theta}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})$ is a martingale at $\theta=\theta_{0}$ for $m\geq 1$ by Assumption 8. We denote the $l$th component of $p_{\theta}^{\prime}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})/p_{\theta}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})$ as $U_{tl}$, so that $p_{\theta}^{\prime}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})/p_{\theta}(X_{tk}\mid X_{1k},\dots,X_{(t-1)k})=(U_{t1},\dots,U_{td})^{\top}$. Further, by Assumption 6, $\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(U_{tl})=0$ for every integer $1\leq t\leq m$ and integer $1\leq l\leq d$. Then

\begin{align*}
\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{\|m^{-1/2}\ell_{k}^{\prime}(\theta_{0})\|^{4+2\delta}\}
&\leq m^{-(2+\delta)}\,\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{\|\ell_{k}^{\prime}(\theta_{0})\|^{4+2\delta}\}\\
&\leq m^{-(2+\delta)}\,\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{\sum_{l=1}^{d}\bigg(\sum_{t=1}^{m}U_{tl}\bigg)^{2}\bigg\}^{2+\delta}\\
&\leq\bigg(\frac{d^{1/2}}{m}\bigg)^{2+\delta}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{\sum_{l=1}^{d}\bigg(\sum_{t=1}^{m}U_{tl}\bigg)^{4+2\delta}\bigg\}\\
&\leq\bigg(\frac{d^{1/2}}{m}\bigg)^{2+\delta}m^{2+\delta}\sum_{l=1}^{d}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{c(4+2\delta)\frac{1}{m}\sum_{t=1}^{m}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}(|U_{tl}|^{4+2\delta})\bigg\}\\
&\leq d^{1+\delta/2}\sum_{l=1}^{d}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg[c(4+2\delta)\frac{1}{m}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\bigg\{\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})^{4+2\delta}\bigg\}\bigg]<\infty,
\end{align*}

where $c(\nu) = \{8(\nu-1) \times \max(1, 2^{\nu-3})\}^{\nu}$ is a constant independent of $m$. The third inequality is by Jensen's inequality, the fourth inequality is by Dharmadhikari et al. (1968), and the fifth inequality is by Assumption 5.
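
To make the role of the constant concrete, the following minimal Python sketch evaluates $c(\nu)$ at $\nu = 4 + 2\delta$ for a few values of $\delta$. The function names `dharmadhikari_constant` and `moment_order` and the chosen values of $\delta$ are illustrative assumptions, not part of the paper; the point is only that the constant depends on $\delta$ and never on the subsequence length $m$.

```python
def dharmadhikari_constant(nu):
    """c(nu) = {8 (nu - 1) max(1, 2^(nu - 3))}^nu, as quoted in the text."""
    return (8.0 * (nu - 1.0) * max(1.0, 2.0 ** (nu - 3.0))) ** nu


def moment_order(delta):
    """Moment order nu = 4 + 2 * delta used in the bound (hypothetical helper)."""
    return 4.0 + 2.0 * delta


# Illustrative values of delta; m does not enter the computation at all.
for delta in (0.0, 0.5, 1.0):
    nu = moment_order(delta)
    print(delta, nu, dharmadhikari_constant(nu))
```

The constant grows quickly with $\delta$, but for fixed $\delta$ it is a fixed number, so it does not affect the finiteness of the bound.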

Therefore, we find a $V_k$ that dominates $\|W_k\|^{2+\delta}$ with $\mathbb{E}_{\mathbb{P}_{\theta_0}}(V_k) < \infty$. Moreover, $W_k \overset{\mathbb{P}_{\theta_0}}{\to} 0$, and thus by the dominated convergence theorem, $\mathbb{E}_{\mathbb{P}_{\theta_0}}(\|W_k\|^{2+\delta}) \to 0$. ∎

Appendix S2 Other proofs

The proof of Theorem 1 is similar to that of Theorem 1 of Li et al. (2017), but is included for completeness. The proof of Theorem 2 is the same as the proof of Theorem 2 of Li et al. (2017); it relies mainly on Theorem 1 and does not involve features of dependent data. The proof of Lemma S1 follows the lines of the proof of Lemma 2 of Li et al. (2017), but is included for completeness.

For any two probability measures $\mu, \nu \in \mathcal{P}_2(\Theta)$, the total variation of second moment distance $\mathrm{TV}_2$ is defined as $\mathrm{TV}_2(\mu,\nu) = \int_{\Theta} (1 + \|\theta\|^2)\,|\mu(\mathrm{d}\theta) - \nu(\mathrm{d}\theta)|$. By Villani (2009), $\mathrm{W}_2^2(\mu,\nu) \leq 2\,\mathrm{TV}_2(\mu,\nu)$.
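
As a numerical sanity check of the bound $\mathrm{W}_2^2 \leq 2\,\mathrm{TV}_2$, the following minimal Python sketch evaluates both sides for two discrete laws on a common one-dimensional support. The helper names `tv2` and `w2_squared_1d` and the example weights are illustrative assumptions, not part of the paper. The sketch relies on the standard fact that the monotone (quantile) coupling is optimal in one dimension.

```python
def tv2(points, p, q):
    """TV_2 distance: sum of (1 + x^2) |p(x) - q(x)| over the common support."""
    return sum((1.0 + x * x) * abs(a - b) for x, a, b in zip(points, p, q))


def w2_squared_1d(points, p, q, tol=1e-15):
    """Squared W_2 between discrete laws p and q on sorted 1-D support points."""
    i = j = 0
    left_p, left_q = p[0], q[0]  # mass not yet transported at points[i], points[j]
    cost = 0.0
    n = len(points)
    while i < n and j < n:
        moved = min(left_p, left_q)  # transport as much mass as both sides allow
        cost += moved * (points[i] - points[j]) ** 2
        left_p -= moved
        left_q -= moved
        if left_p <= tol:  # source point exhausted, move to the next one
            i += 1
            left_p = p[i] if i < n else 0.0
        if left_q <= tol:  # target point filled, move to the next one
            j += 1
            left_q = q[j] if j < n else 0.0
    return cost


# Hypothetical example: two laws on the support {-1, 0, 2}.
points = [-1.0, 0.0, 2.0]
p = [0.5, 0.5, 0.0]
q = [0.2, 0.3, 0.5]

lhs = w2_squared_1d(points, p, q)
rhs = 2.0 * tv2(points, p, q)
print(lhs, rhs, lhs <= rhs)
```

The weight $1 + \|\theta\|^2$ is what lets a total-variation-type quantity control a transport cost: mass that must travel far from the origin is penalised in proportion to its squared distance.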

Proof of Theorem 1.

By the triangle inequality,

$$
\begin{aligned}
&\mathrm{W}_2\{\overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T}), \Pi_T(\mathrm{d}\xi \mid X_{1:T})\} \\
&\leq \mathrm{W}_2\{\overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T}), \Phi[\mathrm{d}\xi; \overline{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}]\} \\
&\quad + \mathrm{W}_2(\Phi[\mathrm{d}\xi; \overline{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}], \Phi[\mathrm{d}\xi; \widehat{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}]) \\
&\quad + \mathrm{W}_2\{\Phi[\mathrm{d}\xi; \widehat{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}], \Pi_T(\mathrm{d}\xi \mid X_{1:T})\}.
\end{aligned}
\tag{S9}
$$

We show that the first term on the right-hand side of equation S9 is $o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$, the second term is $o_{\mathbb{P}_{\theta_0}}(m^{-1/2})$ in general and is $o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$ when the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$, and the third term is $o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$.
Therefore, the $\mathrm{W}_2$ distance between $\overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T})$ and $\Pi_T(\mathrm{d}\xi \mid X_{1:T})$ is $o_{\mathbb{P}_{\theta_0}}(m^{-1/2})$ in general and $o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$ when the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$.

We first estimate the order of the first term in equation S9. For any $c > 0$,

$$
\begin{aligned}
&\mathbb{P}_{\theta_0}\{\mathrm{W}_2(\overline{\Pi}_T(\mathrm{d}\xi \mid X_{1:T}), \Phi[\mathrm{d}\xi; \overline{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}]) \geq c T^{-1/2}\} \\
&\leq \mathbb{P}_{\theta_0}\bigg\{\frac{1}{K}\sum_{k=1}^{K} \mathrm{W}_2\big(\Pi_m(\mathrm{d}\xi \mid \mathbf{X}_{[k]}), \Phi[\mathrm{d}\xi; \widehat{\xi}_k, \{T I_{\xi}(\theta_0)\}^{-1}]\big) \geq c T^{-1/2}\bigg\} \\
&\leq \frac{T}{c^2}\, \mathbb{E}_{\mathbb{P}_{\theta_0}}\bigg\{\frac{1}{K}\sum_{k=1}^{K} \mathrm{W}_2\big(\Pi_m(\mathrm{d}\xi \mid \mathbf{X}_{[k]}), \Phi[\mathrm{d}\xi; \widehat{\xi}_k, \{T I_{\xi}(\theta_0)\}^{-1}]\big)\bigg\}^2 \\
&\leq \frac{T}{c^2 K}\sum_{k=1}^{K} \mathbb{E}_{\mathbb{P}_{\theta_0}} \mathrm{W}_2^2\big(\Pi_m(\mathrm{d}\xi \mid \mathbf{X}_{[k]}), \Phi[\mathrm{d}\xi; \widehat{\xi}_k, \{T I_{\xi}(\theta_0)\}^{-1}]\big) \\
&\leq \frac{T}{c^2}\, \mathbb{E}_{\mathbb{P}_{\theta_0}} \mathrm{W}_2^2\big(\Pi_m(\mathrm{d}\xi \mid \mathbf{X}_{[1]}), \Phi[\mathrm{d}\xi; \widehat{\xi}_1, \{T I_{\xi}(\theta_0)\}^{-1}]\big) \\
&\leq \frac{2T}{c^2}\, \mathbb{E}_{\mathbb{P}_{\theta_0}} \mathrm{TV}_2\big(\Pi_m(\mathrm{d}\xi \mid \mathbf{X}_{[1]}), \Phi[\mathrm{d}\xi; \widehat{\xi}_1, \{T I_{\xi}(\theta_0)\}^{-1}]\big) \to 0,
\end{aligned}
$$

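where the first inequality is by Lemma S2, the second is by Markov's inequality applied to the squared average, the third is by Jensen's inequality, the fourth is by stationarity, and the fifth is by the bound $\mathrm{W}_2^2 \leq 2\,\mathrm{TV}_2$ stated above. The convergence to zero is by Lemma S1.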

For the second term, note that the Wasserstein-2 distance between two $d$-dimensional Gaussians $\Phi(m_1, \Sigma)$ and $\Phi(m_2, \Sigma)$ with a common covariance is given by $\mathrm{W}_2\{\Phi(m_1, \Sigma), \Phi(m_2, \Sigma)\} = \|m_1 - m_2\|$. Therefore, Lemma 1 yields that $\mathrm{W}_2(\Phi[\mathrm{d}\xi; \overline{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}], \Phi[\mathrm{d}\xi; \widehat{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}]) \leq |\overline{\xi} - \widehat{\xi}| = o_{\mathbb{P}_{\theta_0}}(m^{-1/2})$ in general and $\mathrm{W}_2(\Phi[\mathrm{d}\xi; \overline{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}], \Phi[\mathrm{d}\xi; \widehat{\xi}, \{T I_{\xi}(\theta_0)\}^{-1}]) \leq |\overline{\xi} - \widehat{\xi}| = o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$ when the MLE is unbiased and $m$ is at least $\mathcal{O}(T^{1/2})$. The third term of equation S9 is a special case of the first term when $K = 1$ and $m = T$, and is hence $o_{\mathbb{P}_{\theta_0}}(T^{-1/2})$. This proves Theorem 1. ∎
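
The second-term bound rests on the fact that the Wasserstein-2 distance between two Gaussians sharing a covariance equals the Euclidean distance between their means. The following minimal Python sketch checks this in one dimension, using the quantile representation $\mathrm{W}_2^2 = \int_0^1 \{F^{-1}(u) - G^{-1}(u)\}^2\,\mathrm{d}u$ and a midpoint rule. The function name `w2_gaussian_1d`, the grid size, and the example means are illustrative assumptions, not part of the paper.

```python
from statistics import NormalDist


def w2_gaussian_1d(m1, s1, m2, s2, grid=2000):
    """Approximate W_2 between N(m1, s1^2) and N(m2, s2^2) in one dimension.

    Uses W_2^2 = int_0^1 {F^{-1}(u) - G^{-1}(u)}^2 du with a midpoint rule.
    """
    f, g = NormalDist(m1, s1), NormalDist(m2, s2)
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) / grid  # midpoint of the i-th cell of (0, 1)
        total += (f.inv_cdf(u) - g.inv_cdf(u)) ** 2
    return (total / grid) ** 0.5


# Hypothetical example: common standard deviation, means 0 and 3.
same_cov = w2_gaussian_1d(0.0, 1.0, 3.0, 1.0)
print(same_cov)
```

With a common standard deviation the two quantile functions differ by the constant $m_1 - m_2$, so the quadrature is exact up to rounding and returns $|m_1 - m_2|$ rather than its square.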

Proof of Lemma S1.

We first show the existence of a weakly consistent estimator $\widehat{\theta}_k$ of $\theta_0$ such that $\ell'_k(\widehat{\theta}_k) = 0$ and $\widehat{\theta}_k \to \theta_0$ in $\mathbb{P}_{\theta_0}$-probability. By Assumption 5, $\ell_k(\theta)$ is differentiable in a neighbourhood $B_{\delta_0}(\theta_0)$. By Assumption 6, $\mathbb{E}_{\mathbb{P}_{\theta_0}}\{\ell'_k(\theta_0)\} = 0$.
With $\mathbb{P}_{\theta_0}$-probability arbitrarily close to $1$, there exists a root of $\ell'_k(\theta) = 0$ in $B_{\delta_0}(\theta_0)$. We denote such a root by $\widehat{\theta}_k$. By Assumption 7, $\widehat{\theta}_k \to \theta_0$ in $\mathbb{P}_{\theta_0}$-probability.
It suffices to show $\lim_{m\to\infty} \mathbb{E}_{\mathbb{P}_{\theta_0}} \mathrm{TV}_2\left[\Pi_{m,\zeta}(\mathrm{d}\zeta \mid \mathbf{X}_{[k]}), \Phi\{\mathrm{d}\zeta; 0, I^{-1}(\theta_0)\}\right] = 0$, as the equation $\lim_{m\to\infty} \mathbb{E}_{\mathbb{P}_{\theta_0}} \mathrm{TV}_2[\Pi_{T,\vartheta}(\mathrm{d}\vartheta \mid X_{1:T}), \Phi\{\mathrm{d}\vartheta; 0, I^{-1}(\theta_0)\}] = 0$ is just the special case $m = T$ and $K = 1$.
We first show that $\mathrm{TV}_2\left[\Pi_{m,\zeta}(\mathrm{d}\zeta \mid \mathbf{X}_{[k]}), \Phi\{\mathrm{d}\zeta; 0, I^{-1}(\theta_0)\}\right] \stackrel{\mathbb{P}_{\theta_0}}{\to} 0$.

Define $w(\zeta) = \ell_k\{\widehat{\theta}_k + \zeta/(Km)^{1/2}\} - \ell_k(\widehat{\theta}_k)$ and $C_m = \int \exp\{K w(z)\}\, \pi_0\{\widehat{\theta}_k + z/(Km)^{1/2}\}\, \mathrm{d}z$.
The subsequence posterior for $\mathbf{X}_{[k]}$ can be written as $\pi_m(\theta \mid \mathbf{X}_{[k]}) = \exp\{K \ell_k(\theta)\}\, \pi_0(\theta) / [\int_{\Theta} \exp\{K \ell_k(\theta)\}\, \pi_0(\theta)\, \mathrm{d}\theta]$. The posterior of $\zeta$ induced by the posterior above is

$$
\pi_m(\zeta \mid \mathbf{X}_{[k]}) = C_m^{-1} \exp\{K w(\zeta)\}\, \pi_0\{\widehat{\theta}_k + \zeta/(Km)^{1/2}\}.
$$

Define

$$
g_m(\zeta) = (1 + \|\zeta\|^2)\left[\exp\{K w(\zeta)\}\, \pi_0\bigg\{\widehat{\theta}_k + \frac{\zeta}{(Km)^{1/2}}\bigg\} - \exp\bigg\{-\frac{1}{2}\zeta^{\top} I(\theta_0) \zeta\bigg\}\, \pi_0(\theta_0)\right].
$$

We define $\mathcal{T}=\{\zeta\,:\,\zeta=T^{1/2}(\theta-\widehat{\theta}_{k}),\ \theta\in\Theta\}$. If we can show that $\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$, then $C_{m}\stackrel{\mathbb{P}_{\theta_{0}}}{\to}\int_{\mathbb{R}^{d}}\exp\{-\zeta^{\top}I(\theta_{0})\zeta/2\}\pi_{0}(\theta_{0})\,\mathrm{d}\zeta=(2\pi)^{d/2}\{\det I(\theta_{0})\}^{-1/2}\pi_{0}(\theta_{0})$, and thus

$$
\begin{aligned}
&\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\ \Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right]\\
&=\int_{\mathcal{T}}(1+\|\zeta\|^{2})\bigg|\frac{\exp\{Kw(\zeta)\}\,\pi_{0}\{\widehat{\theta}_{k}+\zeta/(Km)^{1/2}\}}{C_{m}}-\frac{\det\{I(\theta_{0})\}^{1/2}}{(2\pi)^{d/2}}\exp\bigg\{-\frac{1}{2}\zeta^{\top}I(\theta_{0})\zeta\bigg\}\bigg|\,\mathrm{d}\zeta\\
&\leq\frac{1}{C_{m}}\int_{\mathcal{T}}\left|g_{m}(z)\right|\mathrm{d}z\\
&\quad+\left|\frac{(2\pi)^{d/2}\{\det I(\theta_{0})\}^{-1/2}\pi_{0}(\theta_{0})}{C_{m}}-1\right|\times\int_{\mathbb{R}^{d}}\frac{(1+\|z\|^{2})}{(2\pi)^{d/2}\left\{\det I(\theta_{0})\right\}^{-1/2}}\exp\bigg\{-\frac{1}{2}z^{\top}I(\theta_{0})z\bigg\}\,\mathrm{d}z.
\end{aligned}
$$

The second term in the previous equation converges to zero in $\mathbb{P}_{\theta_{0}}$-probability.
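The limit of $C_{m}$ used above rests on the Gaussian integral identity $\int_{\mathbb{R}^{d}}\exp\{-\zeta^{\top}I(\theta_{0})\zeta/2\}\,\mathrm{d}\zeta=(2\pi)^{d/2}\{\det I(\theta_{0})\}^{-1/2}$. The following is a minimal numerical sketch of that identity, assuming $d=2$, a hypothetical symmetric positive definite matrix standing in for $I(\theta_{0})$, and a hypothetical value for $\pi_{0}(\theta_{0})$; none of these numbers come from the paper.

```python
import math

def closed_form_constant(info, prior_at_theta0):
    """(2*pi)^{d/2} * det(I)^{-1/2} * pi_0(theta_0) for a symmetric 2x2 matrix I (d = 2)."""
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    return 2.0 * math.pi * det ** -0.5 * prior_at_theta0

def quadrature_constant(info, prior_at_theta0, half_width=8.0, n=320):
    """Midpoint-rule approximation of the integral over [-half_width, half_width]^2."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        z1 = -half_width + (i + 0.5) * h
        for j in range(n):
            z2 = -half_width + (j + 0.5) * h
            # Quadratic form z^T I z for a symmetric 2x2 matrix.
            quad = info[0][0] * z1 * z1 + 2.0 * info[0][1] * z1 * z2 + info[1][1] * z2 * z2
            total += math.exp(-0.5 * quad)
    return total * h * h * prior_at_theta0

info = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical stand-in for I(theta_0)
prior_at_theta0 = 0.3             # hypothetical value of pi_0(theta_0)

closed = closed_form_constant(info, prior_at_theta0)
numeric = quadrature_constant(info, prior_at_theta0)
print(closed, numeric)
```

The two printed values should agree to many digits, since the integrand is smooth and negligible outside the truncated square.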

We now show that $\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$. To achieve this, we divide $\mathcal{T}$ into three regions: $A_{1}=\{z:\|z\|\geq\delta_{1}(Km)^{1/2}\}$, $A_{2}=\{z:\delta_{2}\leq\|z\|\leq\delta_{1}(Km)^{1/2}\}$, and $A_{3}=\{z:\|z\|<\delta_{2}\}$, where $\delta_{1}$ and $\delta_{2}$ are constants that will be chosen later. Since $\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z\leq\int_{A_{1}}|g_{m}(z)|\,\mathrm{d}z+\int_{A_{2}}|g_{m}(z)|\,\mathrm{d}z+\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z$, to prove $\int_{\mathcal{T}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$, it suffices to prove that $\int_{A_{i}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$ for $i=1,2,3$.
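The three regions behave differently as $m$ grows: the inner radius $\delta_{2}$ is fixed, while the outer radius $\delta_{1}(Km)^{1/2}$ expands, so a fixed point $z$ eventually leaves $A_{1}$. A small sketch of the partition follows; the function name `region` and all numerical values are hypothetical, and the boundary $\|z\|=\delta_{1}(Km)^{1/2}$ (a set of measure zero shared by $A_{1}$ and $A_{2}$) is assigned to $A_{1}$ here.

```python
import math

def region(z, delta1, delta2, K, m):
    """Return which of A1, A2, A3 contains the vector z."""
    norm = math.sqrt(sum(c * c for c in z))
    outer = delta1 * math.sqrt(K * m)   # outer radius grows like sqrt(K * m)
    if norm >= outer:
        return "A1"
    if norm >= delta2:
        return "A2"
    return "A3"

# Hypothetical constants: delta1 = 0.1, delta2 = 5, K = 10.
for m in (100, 10**4, 10**6):
    print(m, region((30.0, 40.0), 0.1, 5.0, 10, m))
```

With these illustrative constants the point $(30, 40)$ has norm 50, so it lies in $A_{1}$ until $\delta_{1}(Km)^{1/2}$ exceeds 50 and in $A_{2}$ afterwards.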

1. Proof of $\int_{A_{1}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$: we have

$$
\begin{aligned}
\int_{A_{1}}|g_{m}(z)|\,\mathrm{d}z&\leq\int_{A_{1}}(1+\|z\|^{2})\exp\{Kw(z)\}\,\pi_{0}\bigg\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\bigg\}\,\mathrm{d}z\\
&\quad+\int_{A_{1}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\pi_{0}(\theta_{0})\,\mathrm{d}z.
\end{aligned}
\tag{S10}
$$

The second term on the right-hand side of equation S10 converges to zero as $m\to\infty$. We use Assumption 5 to prove that the first term of equation S10 also converges to zero as $m\to\infty$. For some $\epsilon>0$, with probability approaching one and all sufficiently large $m$, we have $\exp\{Kw(z)\}=\exp[K\{\ell_{k}(\widehat{\theta}_{k}+z/(Km)^{1/2})-\ell_{k}(\widehat{\theta}_{k})\}]\leq\exp(-Km\epsilon)$ for $z\in A_{1}$. Moreover, with $\mathbb{P}_{\theta_{0}}$-probability approaching one, the consistency of $\widehat{\theta}_{k}$ guarantees that $\|\widehat{\theta}_{k}\|^{2}\leq C_{1}$, where $C_{1}\in\mathbb{R}_{+}$ is some positive constant. Therefore, as $m\to\infty$,

$$
\begin{aligned}
&\int_{A_{1}}(1+\|z\|^{2})\exp\{Kw(z)\}\,\pi_{0}\bigg\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\bigg\}\,\mathrm{d}z\\
&\leq\exp(-Km\epsilon)\bigg\{1+(Km)^{d/2}\int_{\Theta}2(\|\theta\|^{2}+\|\widehat{\theta}_{k}\|^{2})\,\pi_{0}(\theta)\,\mathrm{d}\theta\bigg\}\\
&\leq\exp(-Km\epsilon)\bigg\{1+2(Km)^{d/2}C_{1}+(Km)^{d/2}\int_{\Theta}2\|\theta\|^{2}\pi_{0}(\theta)\,\mathrm{d}\theta\bigg\}\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0,
\end{aligned}
$$

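where the convergence follows from Assumption 8, which bounds the second moment of the prior, so that the exponential factor $\exp(-Km\epsilon)$ dominates the polynomial factor $(Km)^{d/2}$.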
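To make the rate in the last display concrete, the bound $\exp(-Km\epsilon)\{1+2(Km)^{d/2}C_{1}+2(Km)^{d/2}\int_{\Theta}\|\theta\|^{2}\pi_{0}(\theta)\,\mathrm{d}\theta\}$ can be evaluated on the log scale. The sketch below uses hypothetical values for $K$, $\epsilon$, $d$, $C_{1}$ and the prior second moment; it only illustrates that the exponential term eventually wins, and that the bound may be large for small $m$.

```python
import math

def log_a1_bound(K, m, eps, d, c1, prior_second_moment):
    """Log of exp(-K*m*eps) * {1 + 2*(K*m)^{d/2} * (c1 + prior_second_moment)}."""
    poly = 2.0 * (K * m) ** (d / 2.0) * (c1 + prior_second_moment)
    # log1p(poly) = log(1 + poly); working on the log scale avoids underflow of exp(-K*m*eps).
    return -K * m * eps + math.log1p(poly)

# Hypothetical constants, chosen only for illustration.
K, eps, d, c1, second_moment = 10, 0.01, 5, 1.0, 1.0
log_bounds = [log_a1_bound(K, m, eps, d, c1, second_moment) for m in (10**2, 10**3, 10**4)]
print(log_bounds)
```

The log bound is positive at the smallest $m$ here and strongly negative afterwards, which matches the asymptotic nature of the statement.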

2. Proof of $\int_{A_{2}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$: note that

$$
\begin{aligned}
\int_{A_{2}}|g_{m}(z)|\,\mathrm{d}z&\leq\int_{A_{2}}(1+\|z\|^{2})\exp\{Kw(z)\}\,\pi_{0}\bigg\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\bigg\}\,\mathrm{d}z\\
&\quad+\int_{A_{2}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\pi_{0}(\theta_{0})\,\mathrm{d}z.
\end{aligned}
\tag{S11}
$$

The second integral on the right hand side of equation S11 is

$$
\begin{aligned}
&\int_{\{z\,:\,\delta_{2}\leq\|z\|\leq\delta_{1}(Km)^{1/2}\}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\pi_{0}(\theta_{0})\,\mathrm{d}z\\
&\leq\int_{\{z\,:\,\|z\|\geq\delta_{2}\}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\pi_{0}(\theta_{0})\,\mathrm{d}z,
\end{aligned}
$$

which converges to zero by choosing $\delta_{2}$ large enough. The following step bounds the first term on the right-hand side of equation S11. We use a Taylor expansion of $w(z)=\ell_{k}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}-\ell_{k}(\widehat{\theta}_{k})$. Because $\ell_{k}^{\prime}(\widehat{\theta}_{k})=0$, we have $w(z)=-\frac{1}{2K}z^{\top}\{-\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})/m\}z+R_{m}(z)$, where $R_{m}(z)=(1/6)\,\partial^{3}\ell_{k}(\widetilde{\theta})/(\partial\theta^{3})\{z/(Km)^{1/2},z/(Km)^{1/2},z/(Km)^{1/2}\}$ and $\partial^{3}\ell_{k}(\widetilde{\theta})/(\partial\theta^{3})$ is a three-dimensional array with $\partial^{3}\ell_{k}(\widetilde{\theta})/(\partial\theta_{i}\partial\theta_{j}\partial\theta_{l})$ as its $(i,j,l)$th element. Moreover, $\widetilde{\theta}$ is a $d$-dimensional vector between $\widehat{\theta}_{k}$ and $\widehat{\theta}_{k}+z/(Km)^{1/2}$. By Assumption 5,

$$
\left|R_{m}(z)\right|\leq\frac{d^{3}}{6}\left\|\frac{z}{(Km)^{1/2}}\right\|^{3}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})\leq\frac{d^{3}\delta_{1}}{6K}\|z\|^{2}\,\frac{1}{m}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik}).
$$

The second inequality uses $\|z\|\leq\delta_{1}(Km)^{1/2}$ for $z\in A_{2}$. Also by Assumption 5, we have $\limsup_{m\to\infty}\mathbb{E}_{\mathbb{P}_{\theta_{0}}}\{m^{-1}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})\}<\infty$. Therefore, for all sufficiently large $m$, by Markov's inequality, $m^{-1}\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})$ is $O_{\mathbb{P}_{\theta_{0}}}(1)$.
If $\delta_{1}$ is chosen to be small enough, $|R_{m}(z)|\leq\frac{1}{4K}z^{\top}\{-\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})/m\}z$ with probability approaching one in $\mathbb{P}_{\theta_{0}}$, since Assumption 6 bounds the eigenvalues of $-\ell_{k}^{\prime\prime}(\theta)/m$ for $\theta\in B_{\delta_{0}}(\theta_{0})$ from below.
Hence, with probability approaching one, for all sufficiently large $m$ and $z\in A_{2}$, $w(z)=-\frac{1}{2K}z^{\top}\{-\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})/m\}z+R_{m}(z)\leq-\frac{1}{4K}z^{\top}\{-\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})/m\}z\leq-\frac{1}{8K}z^{\top}I(\theta_{0})z$ by $\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})/m\stackrel{\mathbb{P}_{\theta_{0}}}{\to}-I(\theta_{0})$. Moreover, given that $\widehat{\theta}_{k}$ is consistent for $\theta_{0}$ and $\pi_{0}(\theta)$ is continuous at $\theta_{0}$, we have $\pi_{0}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}\stackrel{\mathbb{P}_{\theta_{0}}}{\to}\pi_{0}(\theta_{0})$. Hence, to bound the first term on the right-hand side of equation S11, we have

$$\int_{A_{2}}(1+\|z\|^{2})\exp\{Kw(z)\}\,\pi_{0}\bigg\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\bigg\}\,\mathrm{d}z\leq\int_{A_{2}}2(1+\|z\|^{2})\exp\bigg\{-\frac{1}{8}z^{\top}I(\theta_{0})z\bigg\}\,\pi_{0}(\theta_{0})\,\mathrm{d}z. \tag{S12}$$

The right hand side of equation S12 is arbitrarily close to zero if $\delta_{2}$ is chosen to be large enough. Both terms on the right hand side of equation S11 thus converge to zero, and thus $\int_{A_{2}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$.
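As a rough numerical illustration of why the right hand side of (S12) vanishes for large $\delta_{2}$, the sketch below evaluates the bound in closed form in the scalar case. It assumes $d=1$, treats $I(\theta_{0})$ and $\pi_{0}(\theta_{0})$ as given positive numbers, and replaces $A_{2}$ by the larger set $\{z:|z|\geq\delta_{2}\}$ (suggested by $A_{3}=\{z:\|z\|<\delta_{2}\}$, but an assumption here). The function name `tail_bound` and the numerical values are hypothetical.

```python
import math

def tail_bound(delta, fisher_info, prior_at_truth=1.0):
    """Integral of 2(1+z^2) exp(-I z^2 / 8) pi_0(theta_0) over |z| >= delta, for d = 1."""
    a = fisher_info / 8.0
    root = math.sqrt(a)
    # int_delta^inf exp(-a z^2) dz, expressed through the complementary error function
    i0 = 0.5 * math.sqrt(math.pi / a) * math.erfc(root * delta)
    # int_delta^inf z^2 exp(-a z^2) dz, obtained by integration by parts
    i2 = delta * math.exp(-a * delta ** 2) / (2.0 * a) + i0 / (2.0 * a)
    # one factor 2 comes from the integrand, the other from symmetry in z
    return 2.0 * 2.0 * prior_at_truth * (i0 + i2)

# Illustrative values only: the bound shrinks rapidly as delta grows.
for delta in (2.0, 5.0, 10.0, 20.0):
    print(delta, tail_bound(delta, fisher_info=1.0))
```

The Gaussian factor dominates the polynomial weight $(1+z^{2})$, which is why the second-moment weighting used in this proof does not affect the conclusion.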

3. Proof of $\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$: note that

\begin{align}
\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z
&\leq\int_{A_{3}}(1+\|z\|^{2})\bigg[\exp\{Kw(z)\}\,\pi_{0}\left\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\right\}-\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\pi_{0}(\theta_{0})\bigg]\mathrm{d}z \notag\\
&\leq\int_{A_{3}}(1+\|z\|^{2})\exp\{Kw(z)\}\left|\pi_{0}\left\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\right\}-\pi_{0}(\theta_{0})\right|\,\mathrm{d}z \tag{S13}\\
&\quad+\int_{A_{3}}(1+\|z\|^{2})\,\pi_{0}(\theta_{0})\left|\exp\{Kw(z)\}-\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\right|\,\mathrm{d}z. \tag{S14}
\end{align}

To show $\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$, it suffices to show that terms (S13) and (S14) both converge to zero as $m\to\infty$. For all $z\in A_{3}=\left\{z:\|z\|<\delta_{2}\right\}$, $\sup_{K\geq 1}|KR_{m}(z)|\leq d^{3}K/6\times\|z/(Km)^{1/2}\|^{3}\times\sum_{i=1}^{m}M_{i}(X_{1k},\dots,X_{ik})=o_{\mathbb{P}_{\theta_{0}}}(1)$. Therefore, for all sufficiently large $m$, with $\mathbb{P}_{\theta_{0}}$-probability approaching one,

\begin{align*}
&\int_{A_{3}}(1+\|z\|^{2})\exp\{Kw(z)\}\left|\pi_{0}\left\{\widehat{\theta}_{k}+\frac{z}{(Km)^{1/2}}\right\}-\pi_{0}(\theta_{0})\right|\,\mathrm{d}z\\
&\leq 2\int_{A_{3}}(1+\|z\|^{2})\exp\{Kw(z)\}\,\pi_{0}(\theta_{0})\,\mathrm{d}z\\
&=2\int_{A_{3}}(1+\|z\|^{2})\exp\left\{-\frac{1}{2}z^{\top}\frac{\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})}{m}z+KR_{m}(z)\right\}\pi_{0}(\theta_{0})\,\mathrm{d}z\\
&\leq 2\int_{A_{3}}(1+\|z\|^{2})\exp\{z^{\top}I(\theta_{0})z\}\,\pi_{0}(\theta_{0})\,\mathrm{d}z<\infty.
\end{align*}

Moreover, for all $z\in A_{3}$, the integrand converges pointwise: $(1+\|z\|^{2})\exp\{Kw(z)\}\,|\pi_{0}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}-\pi_{0}(\theta_{0})|\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$. By the dominated convergence theorem, (S13) converges to zero with $\mathbb{P}_{\theta_{0}}$-probability approaching one. Next, we prove that (S14) also converges to zero. For all $z\in A_{3}$,

\begin{align*}
&\left|\exp\{Kw(z)\}-\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\right|\\
&=\left|\exp\left\{-\frac{1}{2}z^{\top}\frac{\ell_{k}^{\prime\prime}(\widehat{\theta}_{k})}{m}z+KR_{m}(z)\right\}-\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\right|\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0.
\end{align*}

Hence, for all $z\in A_{3}$, $(1+\|z\|^{2})\,\pi_{0}(\theta_{0})\,|\exp\{Kw(z)\}-\exp\{-z^{\top}I(\theta_{0})z/2\}|\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$. Moreover, $\int_{A_{3}}(1+\|z\|^{2})\,\pi_{0}(\theta_{0})\,|\exp\{Kw(z)\}-\exp\{-z^{\top}I(\theta_{0})z/2\}|\,\mathrm{d}z<\infty$ with $\mathbb{P}_{\theta_{0}}$-probability approaching one. By the dominated convergence theorem, term (S14) also converges to zero with $\mathbb{P}_{\theta_{0}}$-probability approaching one.

We have thus proved that $\int_{A_{3}}|g_{m}(z)|\,\mathrm{d}z\stackrel{\mathbb{P}_{\theta_{0}}}{\to}0$. The final step of the proof is to show that this can be strengthened to $L_{1}$ convergence, that is, $\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right]\stackrel{L_{1}}{\to}0$ as $m\to\infty$ in $\mathbb{P}_{\theta_{0}}$-probability. We have

\begin{align}
&\mathrm{TV}_{2}\left[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}\right] \notag\\
&=\int_{\mathcal{T}}(1+\|z\|^{2})\left|\frac{\exp\{Kw(z)\}\,\pi_{0}\{\widehat{\theta}_{k}+z/(Km)^{1/2}\}}{C_{m}}-\frac{1}{(2\pi)^{d/2}\left\{\det I(\theta_{0})\right\}^{-1/2}}\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\right|\mathrm{d}z \notag\\
&\leq\int_{\Theta}\{1+\|T^{1/2}(\theta-\widehat{\theta}_{k})\|^{2}\}\,\pi(\theta\mid\mathbf{X}_{[k]})\,\mathrm{d}\theta+\int_{\mathbb{R}^{d}}\frac{(1+\|z\|^{2})}{(2\pi)^{d/2}\left\{\det I(\theta_{0})\right\}^{-1/2}}\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\mathrm{d}z \notag\\
&=1+\mathbb{E}_{\Pi_{m}(\mathrm{d}\theta\mid\mathbf{X}_{[k]})}\{Km\|\theta-\widehat{\theta}_{k}\|^{2}\}+\int_{\mathbb{R}^{d}}\frac{(1+\|z\|^{2})}{(2\pi)^{d/2}\left\{\det I(\theta_{0})\right\}^{-1/2}}\exp\left\{-\frac{1}{2}z^{\top}I(\theta_{0})z\right\}\mathrm{d}z. \tag{S15}
\end{align}

The third term of (S15) is a finite constant and the second term is $\psi(\mathbf{X}_{[k]})$ as defined in Assumption 9. By Assumption 9, we have that $\psi(\mathbf{X}_{[k]})$ is uniformly integrable in $\mathbb{P}_{\theta_{0}}$ and thus $\mathrm{TV}_{2}[\Pi_{m,\zeta}(\mathrm{d}\zeta\mid\mathbf{X}_{[k]}),\Phi\{\mathrm{d}\zeta;0,I^{-1}(\theta_{0})\}]$ is also uniformly integrable. Since convergence in probability together with uniform integrability implies convergence in $L_{1}$, the claim follows. ∎

Appendix S3 Verification of Assumption 9

We consider the following normal linear model with AR(2) errors driven by independent and identically distributed Gaussian innovations:

\begin{align*}
X_{t}&=\beta^{\top}Z_{t}+\varepsilon_{t},\\
\varepsilon_{t}&=\varphi_{1}\varepsilon_{t-1}+\varphi_{2}\varepsilon_{t-2}+\xi_{t},\quad\varepsilon_{1},\varepsilon_{2},\xi_{t}\stackrel{\text{i.i.d.}}{\sim}\mathrm{N}(0,\sigma^{2}),
\end{align*}

where $\beta\in\mathbb{R}^{p}$. We write $y=(y_{1},\ldots,y_{T})^{\top}$, $Z=(Z_{1},\ldots,Z_{T})^{\top}$, $\varepsilon=(\varepsilon_{1},\ldots,\varepsilon_{T})^{\top}$, and the true parameter is $\theta_{0}=(\beta_{0}^{\top},\sigma_{0}^{2})^{\top}$. We assume that the roots of the characteristic polynomial of $\varepsilon_{t}$ lie outside of the unit circle of the complex plane, so that $\varepsilon_{t}$ is a stationary process. We impose the following conjugate prior on the parameter $\beta$:

$$\beta\mid(\sigma^{2},\mu^{*},\Omega)\sim\mathrm{N}(\mu^{*},\sigma^{2}\Omega).$$

Let $\left\|\beta_{0}\right\|,\left\|\mu^{*}\right\|\leq c_{1}<+\infty$. Let the eigenvalues of $\Omega$ and $m^{-1}Z^{\top}Z$ be lower bounded by $c_{2}>0$ and upper bounded by $c_{3}>0$. Let $\mathbb{E}(\varepsilon_{i}^{4})=c_{4}<\infty$. The subset posterior distributions of $\beta$ and $\sigma^{2}$ are given by

$$\beta\mid(y,Z,\mu^{*},\Omega,a,b)\sim\mathrm{N}\left\{\beta^{*},\frac{b^{*}}{a+Km}(KZ^{\top}\Sigma^{-1}Z+\Omega^{-1})^{-1}\right\}.$$

By the Yule-Walker equations, for every $i\geq j+2$, we have the following recursive relationship for the entries of $\Sigma$:

$$\Sigma_{i,j}:=\gamma_{i-j}=\varphi_{1}\gamma_{i-j-1}+\varphi_{2}\gamma_{i-j-2}.$$

This is a second-order linear recurrence equation. We have $\Sigma_{i,j} = \mathcal{O}(c_4^{|i-j|})$ where $0 < c_4 < 1$, since we require the roots of the characteristic equation to lie outside the unit circle. Therefore, $\Sigma$ is a Toeplitz matrix whose entries decay at a geometric rate as $|i-j|$ increases.
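The geometric decay can be checked numerically. The following is a minimal sketch, assuming an AR(2) error process with illustrative coefficients $\varphi_1 = 0.5$ and $\varphi_2 = 0.2$ (hypothetical values, not taken from the paper); the function names are likewise illustrative. It iterates the recurrence above and compares $\gamma_h$ with $\gamma_0 r^h$, where $r$ is the largest modulus among the roots of $z^2 - \varphi_1 z - \varphi_2 = 0$. These roots are the reciprocals of the roots of the characteristic equation, so $r < 1$ exactly when the latter lie outside the unit circle.

```python
import cmath


def ar2_autocovariances(phi1, phi2, innovation_var, max_lag):
    """Return [gamma_0, ..., gamma_max_lag] for a stationary AR(2) process."""
    n = max(max_lag, 2)
    # rho_1 follows from the first Yule-Walker equation; every lag h >= 2
    # then follows from the second-order recurrence.
    rho = [1.0, phi1 / (1.0 - phi2)]
    for h in range(2, n + 1):
        rho.append(phi1 * rho[h - 1] + phi2 * rho[h - 2])
    # gamma_0 solves gamma_0 = phi1*gamma_1 + phi2*gamma_2 + innovation_var.
    gamma0 = innovation_var / (1.0 - phi1 * rho[1] - phi2 * rho[2])
    return [gamma0 * r for r in rho[: max_lag + 1]]


def decay_rate(phi1, phi2):
    """Largest modulus among the roots of z^2 - phi1*z - phi2 = 0."""
    disc = cmath.sqrt(phi1 * phi1 + 4.0 * phi2)
    return max(abs((phi1 + disc) / 2.0), abs((phi1 - disc) / 2.0))


if __name__ == "__main__":
    phi1, phi2 = 0.5, 0.2  # hypothetical coefficients, not from the paper
    gamma = ar2_autocovariances(phi1, phi2, 1.0, 40)
    rate = decay_rate(phi1, phi2)
    for h in (1, 5, 10, 20, 40):
        # Second column: exact autocovariance; third: geometric envelope.
        print(h, gamma[h], gamma[0] * rate ** h)
```

For these coefficients the ratio $\gamma_h / \gamma_{h-1}$ settles at $r$ for large $h$, which is the rate that plays the role of $c_4$ in the bound above.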

We seek to bound the smallest eigenvalue of

$$\Sigma = \begin{pmatrix}\gamma_0 & \gamma_1 & \cdots & \gamma_{T-1}\\ \gamma_1 & \gamma_0 & \cdots & \gamma_{T-2}\\ \vdots & \vdots & \ddots & \vdots\\ \gamma_{T-1} & \gamma_{T-2} & \cdots & \gamma_0\end{pmatrix}.$$

To this end, we write the $T \times T$ matrix $\Sigma$ as the sum $\Sigma = \Sigma_K + E_K$ of a $K$-banded Toeplitz matrix $\Sigma_K$ and a perturbation matrix $E_K$, where $K$ will be chosen later to derive the lower bound for the smallest eigenvalue of $\Sigma$. Here

$$\Sigma_K = \begin{pmatrix}\gamma_0 & \gamma_1 & \cdots & \gamma_K & 0 & \cdots & 0\\ \gamma_1 & \gamma_0 & \gamma_1 & \ddots & \gamma_K & \ddots & \vdots\\ \vdots & \gamma_1 & \ddots & \ddots & \ddots & \ddots & 0\\ \gamma_K & \ddots & \ddots & \gamma_0 & \gamma_1 & \ddots & \gamma_K\\ 0 & \ddots & \ddots & \gamma_1 & \gamma_0 & \gamma_1 & \vdots\\ \vdots & \ddots & \ddots & \ddots & \gamma_1 & \gamma_0 & \gamma_1\\ 0 & \cdots & 0 & \gamma_K & \cdots & \gamma_1 & \gamma_0\end{pmatrix}$$

and

$$E_K = \begin{pmatrix}0 & 0 & \cdots & \gamma_{K+1} & \gamma_{K+2} & \cdots & \gamma_T\\ 0 & 0 & 0 & \ddots & \gamma_{K+1} & \ddots & \vdots\\ \vdots & 0 & \ddots & \ddots & \ddots & \ddots & \gamma_{K+2}\\ \gamma_{K+1} & \ddots & \ddots & 0 & 0 & \ddots & \gamma_{K+1}\\ \gamma_{K+2} & \ddots & \ddots & 0 & 0 & 0 & \vdots\\ \vdots & \ddots & \ddots & \ddots & 0 & 0 & 0\\ \gamma_T & \cdots & \gamma_{K+2} & \gamma_{K+1} & \cdots & 0 & 0\end{pmatrix}.$$

We will lower bound the smallest eigenvalue of $\Sigma_K$ and upper bound the largest eigenvalue of $E_K$. To lower bound the smallest eigenvalue of $\Sigma_K$, we use Proposition 4.5 of Bini and Capovani (1983). When $K = 4k+3$, where $k \geq 1$ is an integer, the eigenvalues of $\Sigma_K$ are given by $P_{2k+1}(\mu_i) = \sum_{j=0}^{2k+1}\mu_i^j C_{j+1}$, where $1 \leq i \leq T$, $\mu_i = 2\cos[\pi i/(T+8k+1)]$, and $C_{j+1} > 0$ are constants independent of $i$. We choose $k = 3T/16$. We have

$$P_{2k+1}(\mu_i) = \sum_{j=0}^{2k+1}\mu_i^j C_{j+1} \geq C_2\cos\left(\frac{\pi i}{T+8k+1}\right) \geq C_2\cos\left(\frac{\pi T}{T+8k+1}\right) \geq C_2\cos\left(\frac{2}{5}\pi\right) \tag{S16}$$

for every $1 \leq i \leq T$.
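The last two inequalities in (S16) can be unpacked as follows; this is a short worked step added for readability, using only the stated choice $k = 3T/16$. Since the cosine is decreasing on $[0, \pi]$ and $i \leq T$,

$$
\frac{\pi i}{T+8k+1} \leq \frac{\pi T}{T+8k+1} = \frac{\pi T}{\tfrac{5}{2}T+1} < \frac{2\pi}{5},
\qquad\text{so}\qquad
\cos\left(\frac{\pi i}{T+8k+1}\right) \geq \cos\left(\frac{2\pi}{5}\right) \approx 0.309 > 0.
$$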

The largest eigenvalue of $E_K$ is upper bounded by the Frobenius norm of $E_K$, which we denote by $\|E_K\|$. Hence, an upper bound on $\|E_K\|$ provides an upper bound for the largest eigenvalue. When $k = 3T/16$, we have

$$\|E_K\|^2 = \mathrm{tr}(E_K^\top E_K) \leq T\sum_{i=K+1}^{T}\gamma_i^2 = T\sum_{i=4k+4}^{T}\gamma_i^2 = \mathcal{O}\left(Tc_4^{(3/2)T}\right). \tag{S17}$$

Combining equations (S16) and (S17), when $T$ is sufficiently large and $k = 3T/16$, the smallest eigenvalue of $\Sigma$ is lower bounded by a positive constant: by Weyl's inequality, the smallest eigenvalue of $\Sigma = \Sigma_K + E_K$ is at least the smallest eigenvalue of $\Sigma_K$ minus $\|E_K\|$, and $\|E_K\|$ vanishes as $T$ grows. Therefore, the largest eigenvalue of $\Sigma^{-1}$ is upper bounded by a positive constant, which we denote by $c_5$.
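The size of the perturbation term can be illustrated numerically. The sketch below is a check under stated assumptions, not part of the proof: it takes an AR(2) process with hypothetical coefficients $\varphi_1 = 0.5$, $\varphi_2 = 0.2$ and $T = 64$ (so $k = 12$ and $K = 51$), splits $\Sigma$ into its $K$-banded part and the remainder, and compares the sum of squared entries of $E_K$ with $T\sum_{i>K}\gamma_i^2$. The helper names `split_banded` and `frobenius_norm` are illustrative, and lags run up to $T-1$ as in the matrix $\Sigma$.

```python
import math


def ar2_autocovariances(phi1, phi2, innovation_var, max_lag):
    """Return [gamma_0, ..., gamma_max_lag] for a stationary AR(2) process."""
    n = max(max_lag, 2)
    rho = [1.0, phi1 / (1.0 - phi2)]
    for h in range(2, n + 1):
        rho.append(phi1 * rho[h - 1] + phi2 * rho[h - 2])
    gamma0 = innovation_var / (1.0 - phi1 * rho[1] - phi2 * rho[2])
    return [gamma0 * r for r in rho[: max_lag + 1]]


def split_banded(gamma, T, K):
    """Split the Toeplitz matrix with entries gamma[|i-j|] into its K-banded
    part Sigma_K (lags <= K) and the remainder E_K (lags > K)."""
    sigma_k = [[gamma[abs(i - j)] if abs(i - j) <= K else 0.0
                for j in range(T)] for i in range(T)]
    e_k = [[gamma[abs(i - j)] if abs(i - j) > K else 0.0
            for j in range(T)] for i in range(T)]
    return sigma_k, e_k


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


if __name__ == "__main__":
    T = 64                 # illustrative length, chosen so that 3T/16 is an integer
    k = 3 * T // 16
    K = 4 * k + 3
    gamma = ar2_autocovariances(0.5, 0.2, 1.0, T - 1)  # hypothetical AR(2)
    sigma_k, e_k = split_banded(gamma, T, K)
    fro = frobenius_norm(e_k)
    tail = T * sum(g * g for g in gamma[K + 1:])
    print("Frobenius norm of E_K:", fro)
    print("squared norm <= T * tail sum:", fro ** 2 <= tail)
```

Because only lags above $3T/4$ enter $E_K$, each such lag appears fewer than $T/2$ times in the matrix, which is why the comparison with $T\sum_{i>K}\gamma_i^2$ holds with room to spare in this example.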

We will show that $\mathbb{E}_{\mathbb{P}_{\theta_0}}\mathbb{E}_{\Pi_m(\cdot\mid y,Z)}Km\|\beta - \widehat{\beta}\|^2 < \infty$. The MLE of $\beta$ is given by $\widehat{\beta} = (Z^\top\Sigma^{-1}Z)^{-1}Z^\top\Sigma^{-1}y$. It is clear that

$$\mathbb{E}_{\mathbb{P}_{\theta_0}}\mathbb{E}_{\Pi_m(\cdot\mid y,Z)}Km\|\beta - \widehat{\beta}\|^2 = Km\,\mathbb{E}_{\mathbb{P}_{\theta_0}}\mathrm{tr}\left\{\operatorname{var}_{\pi_m(\cdot\mid y,Z)}(\beta)\right\} + Km\,\mathbb{E}_{\mathbb{P}_{\theta_0}}\|\mathbb{E}_{\Pi_m(\cdot\mid y,Z)}\beta - \widehat{\beta}\|^2,$$

where $\mathrm{tr}(A)$ denotes the trace of a generic square matrix $A$. The posterior variance of $\beta$ can be bounded as

$$
\begin{aligned}
&Km\,\mathbb{E}_{\mathbb{P}_{\theta_0}}\mathrm{tr}\{\mathrm{var}_{\pi_m(\cdot\mid y,Z)}(\beta)\}\\
&= Km\frac{a+Km+p}{a+Km+p-2}\times\mathrm{tr}\left\{\mathbb{E}_{\mathbb{P}_{\theta_0}}\frac{b^*}{a+Km}(KZ^\top\Sigma^{-1}Z+\Omega^{-1})^{-1}\right\}\\
&\leq 2\,\mathrm{tr}\left\{\mathbb{E}_{\mathbb{P}_{\theta_0}}(b+c_1^2c_2^{-1}+Ky^\top y)(KZ^\top\Sigma^{-1}Z+\Omega^{-1})^{-1}\right\}\\
&\leq 2\,\mathrm{tr}\left\{\mathbb{E}_{\mathbb{P}_{\theta_0}}(b+c_1^2c_2^{-1}+Kmc_1^2c_2+Km\sigma_0^2)(Kmc_2c_5I_p+c_3^{-1}I_p)^{-1}\right\}\\
&= 2p\,\frac{Km(c_1^2c_2+\sigma_0^2)+b+c_1^2c_2^{-1}}{Kmc_2c_5+c_3^{-1}}\rightarrow\frac{2p(c_1^2c_2+\sigma_0^2)}{c_2c_5}\text{ as }m\rightarrow\infty.
\end{aligned}
$$

Appendix S4 ACF and PACF of time series used in synthetic data examples

S4.1 Linear regression with auto-regressive errors (see the corresponding section of the main text)

We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the error terms $\varepsilon_{1:T}$ generated using model (6) in Figure 10; recall that this is for $T = 10^5$ observations. The ACF or PACF is highest when $\text{lag} = 2$. The autocorrelation is mild for this simulation.

Figure 10: ACF (left) and PACF (right) plots of the residuals $\varepsilon_t$ for model (6) with $T = 10^5$.
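For readers who wish to reproduce plots of this kind, the following is a minimal sketch of how a sample ACF and PACF can be computed. It assumes a simulated AR(2) series with hypothetical coefficients $\varphi_1 = 0.5$, $\varphi_2 = 0.2$ and unit-variance Gaussian innovations; these are not the parameters of model (6), so the resulting values will differ from Figure 10. The PACF is obtained from the ACF with the Durbin-Levinson recursion, and all function names are illustrative.

```python
import random


def simulate_ar2(phi1, phi2, n, seed=0, burn_in=500):
    """Simulate n observations of an AR(2) process with N(0, 1) innovations."""
    rng = random.Random(seed)
    prev2 = prev1 = 0.0
    series = []
    for t in range(n + burn_in):
        current = phi1 * prev1 + phi2 * prev2 + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the burn-in so the start-up values wash out
            series.append(current)
        prev2, prev1 = prev1, current
    return series


def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    denom = sum(c * c for c in centered)
    return [sum(centered[t] * centered[t + h] for t in range(n - h)) / denom
            for h in range(max_lag + 1)]


def pacf_from_acf(rho):
    """Partial autocorrelations via the Durbin-Levinson recursion.

    rho[0] must equal 1; the returned list has pacf[0] = 1 by convention."""
    pacf = [1.0]
    prev = []  # coefficients phi_{n-1,1}, ..., phi_{n-1,n-1}
    for n in range(1, len(rho)):
        num = rho[n] - sum(prev[k] * rho[n - 1 - k] for k in range(n - 1))
        den = 1.0 - sum(prev[k] * rho[k + 1] for k in range(n - 1))
        phi_nn = num / den
        prev = [prev[k] - phi_nn * prev[n - 2 - k] for k in range(n - 1)] + [phi_nn]
        pacf.append(phi_nn)
    return pacf


if __name__ == "__main__":
    x = simulate_ar2(0.5, 0.2, 20000, seed=1)  # hypothetical AR(2) errors
    acf = sample_acf(x, 10)
    pacf = pacf_from_acf(acf)
    for lag in range(1, 11):
        print(lag, round(acf[lag], 3), round(pacf[lag], 3))
```

For an AR(2) process the theoretical PACF vanishes beyond lag 2, so sample values at higher lags should be close to zero up to sampling noise.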

S4.2 GARCH model (see the corresponding section of the main text)

We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the error terms $\varepsilon_{1:T}$ and observations $X_{1:T}$ generated using model (7) in Figure 11; recall that this is for $T = 2 \times 10^5$ observations. The variance series $\sigma_t^2$ exhibits strong autocorrelation, but the observation series $X_{1:T}$ exhibits negligible autocorrelation.

Figure 11: ACF and PACF plots of $\sigma_{1:T}^2$ and $X_{1:T}$ for the GARCH model (7) with $T = 2 \times 10^5$.

S4.3 Binary auto-regressive model (see the corresponding section of the main text)

We plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the success probabilities $\mathbb{P}(Y_t = 1)$ generated using model (9) in Figure 12; recall that this is for $T = 2 \times 10^5$ observations. The series $\mathbb{P}(Y_t = 1)$ exhibits weak autocorrelation: the autocorrelation is less than $0.2$ at every lag.

Figure 12: ACF (left) and PACF (right) plots of success probabilities $\mathbb{P}(Y_t = 1)$ for model (9) with $T = 2 \times 10^5$.