Learning latent tree models with small query complexity

Luc Devroye, School of Computer Science, McGill University, Montreal
Gábor Lugosi, Department of Economics and Business, Pompeu Fabra University, Barcelona, Spain; ICREA, Pg. Lluís Companys 23, 08010 Barcelona, Spain; Barcelona School of Economics
Piotr Zwiernik, Department of Statistical Sciences, University of Toronto
(Date: August 28, 2024)
Abstract.

We consider the problem of structure recovery in a graphical model of a tree where some variables are latent. Specifically, we focus on the Gaussian case, which can be reformulated as a well-studied problem: recovering a semi-labeled tree from a distance metric. We introduce randomized procedures that achieve query complexity of optimal order. Additionally, we provide statistical analysis for scenarios where the tree distances are noisy. The Gaussian setting can be extended to other situations, including the binary case and non-paranormal distributions.

Key words and phrases:
latent tree models, phylogenetics, query complexity, probabilistic graphical models, structure learning, mathematical statistics, separation number
2010 Mathematics Subject Classification:
60E15, 62H99, 15B48

1. Introduction

First discussed by Judea Pearl as tree-decomposable distributions, a generalization of star-decomposable distributions such as the latent class model (Pearl, 1988, Section 8.3), latent tree models are probabilistic graphical models defined on trees in which only a subset of the variables is observed. Zhang (2004), Zhang & Poon (2017), and Mourad et al. (2013) extended our theoretical understanding of these models. We refer to Zwiernik (2018) for more details and references.

Latent tree models have found wide applications in various fields. They are used in phylogenetic analysis, network tomography, computer vision, causal modelling, and data clustering. They also contain other well-known classes of models, such as hidden Markov models, the Brownian motion tree model, the Ising model on a tree, and many popular models used in phylogenetics. In generic high-dimensional problems, latent tree models can be useful in various ways. They share many computational advantages of observed tree models but are more expressive. Latent tree models have been used for hierarchical topic detection (Côme et al., 2021) and clustering.

In phylogenetics, latent tree models have been used to reconstruct the tree of life from the genetic material of surviving species. They have also been used in bioinformatics and computer vision. Machine-learning methods for models with latent variables attract substantial attention from the research community. Some other applications include latent tree models and novel algorithms for high-dimensional data (Chen et al., 2019), and the design of low-rank tensor completion methods (Zhang et al., 2022).

Structure learning

The problem of structure or parameter learning for latent tree models has been studied extensively. A seminal work in this field is Choi et al. (2011), where two consistent and computationally efficient algorithms for learning latent trees were proposed. The main idea is to use the link between a broad class of latent tree models and tree metrics studied extensively in phylogenetics, a link first established by Pearl (1988) for binary and Gaussian distributions. This has been extended to symmetric discrete distributions by Choi et al. (2011). The proof in Zwiernik (2018, Section 2.3) makes it clear that the only essential assumption is that the conditional expectation of every node given its neighbour is a linear function of the neighbour.

From the statistical perspective, testing the corresponding algebraic restrictions on the correlation matrix was also studied in Shiers et al. (2016), and a more recent take on this problem is by Sturma et al. (2022). From the machine learning perspective, we should mention Jaffe et al. (2021), Zhou et al. (2020), Anandkumar et al. (2011), Huang et al. (2020), Aizenbud et al. (2021), and Kandiros et al. (2023). Most of these papers build on the idea of local recursive grouping as proposed by Choi et al. (2011). In particular, they all start by computing all distances between all observed nodes in the tree.

Our contributions

Our work is motivated by novel applications where the dimension $n$ is so large that computing all the distances (or all correlations between observed variables) is impossible. We propose a randomized algorithm that queries the distance oracle and show that the expected query time for our algorithm is $O(\Delta n\log_{\Delta}(n))$. A special case of our problem is the problem of phylogenetic tree recovery. In this case, our approach resembles other phylogenetic tree recovery methods that try to minimize query complexity (see Afshar et al. (2020) for references) and seems to be the first such method that is asymptotically optimal; see King et al. (2003) for the matching lower bound. More importantly, our algorithm can deal with general latent tree models and in this context, it is again the first such algorithm. As we show in the last section, our algorithm can be easily adjusted to the case of noisy oracles, which is relevant in statistical practice.

2. Preliminaries and basic results

2.1. Trees and semi-labelled trees

A tree $T=(V,E)$ is a connected undirected graph with no cycles. In particular, for any two $u,v\in V$ there is a unique path between them, which we denote by $\overline{uv}$. A vertex of $T$ with only one neighbour is called a leaf. A vertex of $T$ that is not a leaf is called an inner vertex or internal node. An edge of $T$ is inner if both ends are inner vertices; otherwise, it is called terminal. A connected subgraph of $T$ is called a subtree of $T$. A rooted tree $(T,\rho)$ is simply a tree $T=(V,E)$ with one distinguished vertex $\rho\in V$.

A tree $T$ is called a semi-labelled tree with labelled nodes $W\subseteq V$ if every vertex of $T$ of degree $\leq 2$ lies in $W$.¹ We say that $T$ is a phylogenetic tree if $T$ has no degree-$2$ nodes and $W$ is exactly the set of leaves of $T$. If $v\in W$, then we say that $v$ is regular and we depict it by a solid vertex. If $|W|=n$ then we typically label the vertices in $W$ with $[n]=\{1,\ldots,n\}$. A vertex that is not labelled is called latent. An example semi-labelled tree is shown in Figure 1.

¹ This differs slightly from the regular definition of semi-labelled trees (or X-trees) in phylogenetics, where regular nodes can get multiple labels; see Semple & Steel (2003).

Figure 1. A semi-labeled tree with five regular nodes labeled with $\{1,2,3,4,5\}$ and two latent nodes.

By definition, all latent nodes are internal and have degree $\geq 3$. Therefore, if $|W|=n$, then $|T|\leq 2n+1$.

2.2. Tree metrics

Consider metrics on the discrete set $W$ induced by semi-labeled trees in the following sense. Let $T=(V,E)$ be a tree and suppose that $\ell:E\rightarrow\mathbb{R}_{+}$ is a map that assigns lengths to the edges of $T$. For any pair $u,v\in V$, we denote by $\overline{uv}$ the path in $T$ joining $u$ and $v$. We now define the map $d_{T,\ell}:V\times V\rightarrow\mathbb{R}$ by setting, for all $u,v\in V$,

\[
d_{T,\ell}(u,v)=\begin{cases}\sum_{e\in\overline{uv}}\ell(e), & \text{if } u\neq v,\\ 0, & \text{otherwise.}\end{cases}
\]

Suppose now we are interested only in the distances between the regular vertices.

Definition 2.1.

A function $d:\,[n]\times[n]\rightarrow\mathbb{R}$ is called a tree metric if there exists a semi-labeled tree $T=(V,E)$ with $n$ regular nodes $W$ and a (strictly) positive length assignment $\ell:E\rightarrow\mathbb{R}_{+}$ such that for all $i,j\in W$

\[
d(i,j)\;=\;d_{T,\ell}(i,j).
\]
Example 2.2.

Consider a quartet tree with edge lengths as indicated on the left in Figure 2. The distance between vertices $1$ and $3$ is $d(1,3)=2+5+2.5=9.5$ and the whole distance matrix is given on the right in Figure 2, where the dots indicate that this matrix is symmetric.

[Left panel: quartet tree with leaves 1, 2, 3, 4 and edge lengths 5, 3.5, 2, 1, 2.5. Right panel: the distance matrix below.]

\[
\begin{bmatrix}0&5.5&9.5&8\\ \cdot&0&11&9.5\\ \cdot&\cdot&0&3.5\\ \cdot&\cdot&\cdot&0\end{bmatrix}
\]

Figure 2. A metric on a quartet tree.
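The distances in Example 2.2 can be reproduced with a short script. The following is a minimal sketch, not part of the paper: the assignment of edge lengths to edges is inferred from the distance matrix and from the computation $d(1,3)=2+5+2.5$, and the names `a` and `b` for the two latent inner nodes are hypothetical.

```python
from itertools import combinations

# Edge lengths of the quartet tree (assignment inferred from the matrix in
# Example 2.2); "a" and "b" are hypothetical names for the latent nodes.
edges = {
    (1, "a"): 2.0,
    (2, "a"): 3.5,
    ("a", "b"): 5.0,
    (3, "b"): 2.5,
    (4, "b"): 1.0,
}


def build_adjacency(edges):
    """Turn an edge-length dictionary into an undirected adjacency list."""
    adj = {}
    for (u, v), length in edges.items():
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))
    return adj


def tree_distance(adj, source, target):
    """Sum of edge lengths along the unique path from source to target."""
    stack = [(source, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == target:
            return dist
        for neighbour, length in adj[node]:
            if neighbour != parent:
                stack.append((neighbour, node, dist + length))
    raise ValueError("target is not reachable from source")


adj = build_adjacency(edges)
# Distances between the regular nodes only, as in Definition 2.1.
distances = {
    (i, j): tree_distance(adj, i, j) for i, j in combinations([1, 2, 3, 4], 2)
}
```

Only the distances between the regular nodes 1 to 4 are recorded, which mirrors the restriction of $d_{T,\ell}$ to $W$.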

It is easy to describe the set of all possible tree metrics.

Definition 2.3.

We say that a map $d:\,[n]\times[n]\to\mathbb{R}$ satisfies the four-point condition if for every four (not necessarily distinct) elements $i,j,k,l\in[n]$,

\[
d(i,j)+d(k,l)\;\leq\;\max\bigl\{\,d(i,k)+d(j,l),\; d(i,l)+d(j,k)\,\bigr\}.
\]

Since the elements $i,j,k,l\in[n]$ in Definition 2.3 need not be distinct, every such map is a metric on $[n]$ given that $d(i,i)=0$ and $d(i,j)=d(j,i)$ for all $i,j\in[n]$. The following fundamental theorem links tree metrics with the four-point condition.

Theorem 2.4 (Tree-metric theorem, Buneman (1974)).

Suppose that $d:\,[n]\times[n]\to\mathbb{R}$ is such that $d(i,i)=0$ and $d(i,j)=d(j,i)$ for all $i,j\in[n]$. Then, $d$ is a tree metric on $[n]$ if and only if it satisfies the four-point condition. Moreover, a tree metric uniquely determines the defining semi-labeled tree and edge lengths.

Note that the assumption about strictly positive lengths of each edge in Definition 2.1 is crucial for uniqueness in Theorem 2.4.
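The four-point condition of Definition 2.3 is easy to test by brute force on small inputs. The sketch below is illustrative only: the function name `satisfies_four_point`, the 0-based indexing, and the numerical tolerance `tol` are assumptions, and the $O(n^4)$ loop is meant for checking examples, not for large $n$.

```python
from itertools import product


def satisfies_four_point(d, tol=1e-9):
    """Check the four-point condition for a symmetric matrix d (list of lists).

    All 4-tuples are checked, including those with repeated indices, as in
    Definition 2.3. The tolerance `tol` is an assumption for float inputs.
    """
    n = len(d)
    for i, j, k, l in product(range(n), repeat=4):
        lhs = d[i][j] + d[k][l]
        rhs = max(d[i][k] + d[j][l], d[i][l] + d[j][k])
        if lhs > rhs + tol:
            return False
    return True


# Distance matrix of Example 2.2 (indices 0..3 stand for nodes 1..4).
quartet = [
    [0.0, 5.5, 9.5, 8.0],
    [5.5, 0.0, 11.0, 9.5],
    [9.5, 11.0, 0.0, 3.5],
    [8.0, 9.5, 3.5, 0.0],
]

# Same matrix with d(1,3) inflated from 9.5 to 12; no tree realizes it.
perturbed = [row[:] for row in quartet]
perturbed[0][2] = perturbed[2][0] = 12.0

is_tree_metric = satisfies_four_point(quartet)
is_perturbed_tree_metric = satisfies_four_point(perturbed)
```

For the quartet matrix the three pairwise sums are 9, 19 and 19, so the two largest coincide, as the condition requires; after the perturbation they become 9, 19 and 21.5 and the check fails.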

2.3. Recovering a tree from a tree metric

Our problem is to recover the semi-labelled tree $T$ based on a few queries of the distance matrix $D=[d(u,v)]$, which contains the distances between the regular nodes of $T$. We consider the query complexity, which measures the number of queries of the distance matrix, also called queries of the distance oracle. Trivially, we can reconstruct the tree with query complexity $\binom{n}{2}$. It is known that one can do better for trees with a bounded maximal degree

\[
\Delta\;:=\;\max\operatorname{degree}(T).
\]

When $T$ is a phylogenetic tree ($W$ is the set of leaves of $T$), the distance queries are sometimes called “additive queries” (Waterman et al., 1977). When $\Delta$ is bounded, Hein (1989) showed that this problem has a solution that uses $O(\Delta n\log_{\Delta}(n))$ distance queries, which is asymptotically optimal by the work of King et al. (2003). For phylogenetic trees, Kannan et al. (1996) proposed an algorithm that has $O(\Delta n\log(n))$ query complexity. See Afshar et al. (2020) for more references.

When the query complexity is jointly measured in terms of $n$ and $\Delta$, a lower bound for both worst-case and expected query complexity is $\Omega(n\Delta\log_{\Delta}(n))$ (King et al., 2003). Our objective is to exhibit a randomized algorithm that has matching expected query complexity $O(n\Delta\log_{\Delta}(n))$ for the semi-labelled tree reconstruction problem, regardless of how $\Delta$ varies with $n$. This problem generalizes the phylogenetic tree reconstruction problem, since internal non-latent nodes can have degree two and regular nodes are not necessarily leaves, and therefore methods for phylogenetic tree reconstruction are no longer directly applicable.
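To make the notion of query complexity concrete, one can wrap the distance matrix in an object that counts how many distinct pairs have been requested. This is a hypothetical sketch: the class `CountingOracle` and the convention that repeated or diagonal queries are free are assumptions, not the paper's definitions, and `target_query_order` evaluates $\Delta n\log_{\Delta}(n)$ with the constant hidden in the $O(\cdot)$ notation ignored.

```python
import math


class CountingOracle:
    """Distance oracle that counts distinct off-diagonal pairs queried."""

    def __init__(self, matrix):
        self._matrix = matrix
        self._seen = set()

    def __call__(self, u, v):
        if u != v:
            # Symmetric pairs count once; repeated queries are assumed free.
            self._seen.add((min(u, v), max(u, v)))
        return self._matrix[u][v]

    @property
    def queries(self):
        return len(self._seen)


def naive_query_count(n):
    """Number of queries needed to read the whole distance matrix."""
    return math.comb(n, 2)


def target_query_order(n, delta):
    """Value of delta * n * log_delta(n), ignoring constant factors."""
    return delta * n * math.log(n, delta)


oracle = CountingOracle([[0.0, 5.5], [5.5, 0.0]])
oracle(0, 1)
oracle(1, 0)  # same pair, not counted again
oracle(0, 0)  # diagonal, not counted
```

For example, with $n=1000$ and $\Delta=10$ the full matrix has 499500 entries above the diagonal, while $\Delta n\log_{\Delta}(n)$ is about 30000.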

3. The new algorithm

The proposed algorithm uses a randomized version of divide-and-conquer. We will use the notion of a bag $B\subseteq V$. The algorithm maintains a queue, consisting of sets of bags. Initially, there is only one bag, containing all regular nodes, that is, $B=W$. The procedure takes a bag $B$. One node in this set, denoted by $\rho(B)$, is marked as a representative, which can be thought of as a root. A set of edges that jointly form a tree is called a skeleton and is typically denoted by the mnemonic $S$. Our algorithm starts with an empty skeleton and incrementally constructs the skeleton of the sought tree, which we call the tree induced by $B$.

Let $\kappa=\Delta$;
Pick a node $u\in W$, and set $\rho(W)\leftarrow u$ (note: $W$ is now a bag);
Make an empty queue $Q$;
Add $W$ to $Q$;
Set $S\leftarrow\emptyset$;
while $|Q|>0$ do
    Remove bag $B$ from the front of $Q$;
    if $|B|\leq\kappa$ then
        Query all $\binom{|B|}{2}$ distances between nodes in $B$;
        Find $S^{*}$, the full skeleton for the tree induced by $B$;
        $S\leftarrow S\cup S^{*}$;
    else
        Sample uniformly at random and without replacement nodes $u_1,\ldots,u_\kappa$ from $B\setminus\{\rho(B)\}$;
        Apply procedure bigsplit$(B,u_1,\ldots,u_\kappa)$. This procedure outputs a skeleton $S^{*}$ (connecting nodes from $B$ and possibly latent nodes) and bags $B_1,\ldots,B_k$ where $B_i$ overlaps with the nodes of the skeleton in $\rho(B_i)$ only, and are non-overlapping otherwise (i.e., $(B_i\setminus\rho(B_i))\cap(B_j\setminus\rho(B_j))=\emptyset$);
        $S\leftarrow S\cup S^{*}$;
        Add $B_1,\ldots,B_k$ to the rear of $Q$;
Return the skeleton $S$;
Algorithm 1 Outline of our algorithm

The procedure bigsplit takes a bag $B$ and a random set of nodes in it, $u_1,\ldots,u_\kappa$, and forms the subtree that connects $u_1,\ldots,u_\kappa$ and $\rho(B)$. The edges of this subtree give the skeleton that is the output. The remaining nodes of $B$ are collected in bags that “hang” from the skeleton. The representatives of these bags are precisely those nodes where the bags overlap the skeleton. Note that the skeleton may contain latent nodes not originally in $B$. Within bigsplit, all representatives of the hanging bags have their distances to all nodes in their bags queried, so for all practical purposes, the newly discovered latent nodes act as regular nodes. The bags become smaller as the algorithm proceeds, which leads to a logarithmic number of rounds. The main result of the paper is the following theorem, whose proof is given in Section 4 below.

Theorem 3.1.

Given a distance oracle $D$ between the regular vertices of a semi-labeled tree $T$, Algorithm 1 with parameter $\kappa=\Delta$ correctly recovers the induced tree with expected query complexity $O(\Delta n\log_{\Delta}(n))$.

The procedure bigsplit uses two sub-operations, called basic and explode, that we describe next.

3.1. The “basic” operation

Let $B$ be a bag with representative $\rho=\rho(B)$, and let $\alpha\in B$ be a distinct regular vertex. In our basic step, we query $d(v,\alpha)$ for all $v\in B$, and set

\[
D(v)=d(v,\alpha)-d(v,\rho). \tag{1}
\]

We group all nodes $v$ according to the different values $D(v)$ that are observed. Ordering the sets in this partition of $B$ from small to large value of $D(\cdot)$, we obtain bags $B_1,\ldots,B_k$. It is clear that $\rho\in B_k$ and $\alpha\in B_1$. Within each $B_i$, we let $u_i$ be the node of $B_i$ closest to $\alpha$. If

\[
d(u_i,\alpha)+d(u_i,\rho)=d(\alpha,\rho),
\]

then $u_i$ (a regular node) is on the path from $\alpha$ to $\rho$ in the induced tree. We set $\rho(B_i)=u_i$. If, however,

\[
d(u_i,\alpha)+d(u_i,\rho)>d(\alpha,\rho),
\]

then we know that there must be a latent node $w_i$ that connects the $(\alpha,\rho)$ path to the nodes in $B_i$. In fact, for all $v\in B_i$, we have

\[
d(v,w_i)=\frac{1}{2}\bigl(d(v,\alpha)+d(v,\rho)-d(\alpha,\rho)\bigr). \tag{2}
\]

These values can be stored for further use. So, we add $w_i$ to $B_i$ and define $\rho(B_i)=w_i$. In this manner, we have identified $S^{*}$, the part of the final skeleton that connects $\alpha$ with $\rho$:

\[
(\rho(B_1),\rho(B_2)),\ldots,(\rho(B_{k-1}),\rho(B_k)).
\]
Figure 3. In a basic operation, a bag $B$ with root $\rho(B)$ and query node $\alpha$ is decomposed into several smaller bags, with root nodes on the skeleton.
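The identity (2) and the grouping by $D(v)$ both follow from additivity of path lengths. The short derivation below is a sketch under the setting of the basic operation: $w_i$ denotes the point of the $(\alpha,\rho)$ path at which the nodes of $B_i$ attach, so that every path from $v\in B_i$ to $\alpha$ or to $\rho$ passes through $w_i$.

```latex
% w_i lies on the path from \alpha to \rho, and every v in B_i
% reaches that path at w_i.
\begin{align*}
d(v,\alpha)    &= d(v,w_i) + d(w_i,\alpha),\\
d(v,\rho)      &= d(v,w_i) + d(w_i,\rho),\\
d(\alpha,\rho) &= d(\alpha,w_i) + d(w_i,\rho).
\end{align*}
% Adding the first two lines and subtracting the third gives (2):
\[
d(v,\alpha) + d(v,\rho) - d(\alpha,\rho) = 2\,d(v,w_i).
\]
% Subtracting the second line from the first shows that D is constant on B_i:
\[
D(v) = d(v,\alpha) - d(v,\rho) = d(w_i,\alpha) - d(w_i,\rho).
\]
```

The last line depends on $v$ only through its attachment point $w_i$, which is why nodes with equal $D(v)$ end up in the same bag, and why the value increases as the attachment point moves from $\alpha$ towards $\rho$.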
Input: a bag $B$, and $\alpha\in B$;
Set $\rho=\rho(B)$;
for $v\in B$ do
    Compute $D(v)$ in (1);
Partition the nodes $v\in B$ into bags $B_1,\ldots,B_k$ by their values $D(v)$, ordered by increasing value of $D(v)$ (so that $\alpha\in B_1$ and $\rho\in B_k$);
for $i=1,\ldots,k$ do
    $u_i=\arg\min_{v\in B_i}d(v,\alpha)$;
    if $d(\rho,u_i)+d(u_i,\alpha)=d(\rho,\alpha)$ then
        $\rho(B_i)=u_i$
    else
        Identify latent node $w_i$;
        Add $w_i$ to $B_i$;
        Set $\rho(B_i)=w_i$;
        Update the distance oracle by calculating $d(w_i,v)$ for all $v\in B_i$ using (2);
Return $B_1,\ldots,B_k$ and the skeleton $(\rho(B_1),\rho(B_2)),\ldots,(\rho(B_{k-1}),\rho(B_k))$;
Algorithm 2 basic$(B,\alpha)$

The query complexity of Algorithm 2 is bounded by $2|B|-3$.
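The basic operation can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: distances are kept in a dictionary keyed by `frozenset` pairs (a hypothetical storage choice), latent nodes get made-up tuple labels, values of $D(v)$ are rounded to group floating-point numbers, and queries are not counted.

```python
def basic(bag, rho, alpha, dist, tol=1e-9):
    """Split `bag` along the path from alpha to rho.

    `dist` maps frozenset({u, v}) to d(u, v) and is extended in place with
    distances from newly identified latent nodes, computed via equation (2).
    Returns the bags B_1..B_k, their representatives, and the skeleton edges.
    """

    def d(u, v):
        return 0.0 if u == v else dist[frozenset((u, v))]

    # Group nodes by D(v) = d(v, alpha) - d(v, rho), equation (1).
    groups = {}
    for v in bag:
        key = round(d(v, alpha) - d(v, rho), 9)
        groups.setdefault(key, []).append(v)

    bags, reps = [], []
    for key in sorted(groups):  # increasing D: alpha's bag first, rho's last
        group = list(groups[key])
        u = min(group, key=lambda v: d(v, alpha))
        if abs(d(u, alpha) + d(u, rho) - d(alpha, rho)) <= tol:
            rep = u  # a regular node on the (alpha, rho) path
        else:
            rep = ("latent", alpha, rho, key)  # hypothetical label for w_i
            for v in group:
                dist[frozenset((rep, v))] = 0.5 * (
                    d(v, alpha) + d(v, rho) - d(alpha, rho)
                )
            group.append(rep)
        bags.append(group)
        reps.append(rep)

    skeleton = list(zip(reps, reps[1:]))
    return bags, reps, skeleton


# Distances of Example 2.2, with rho = 1 and alpha = 3.
pairs = {(1, 2): 5.5, (1, 3): 9.5, (1, 4): 8.0,
         (2, 3): 11.0, (2, 4): 9.5, (3, 4): 3.5}
dist = {frozenset(p): value for p, value in pairs.items()}
bags, reps, skeleton = basic([1, 2, 3, 4], rho=1, alpha=3, dist=dist)
```

On the quartet of Example 2.2 this finds two latent nodes on the path from 3 to 1, at distance 1 from node 4 and 3.5 from node 2, which matches the terminal edge lengths inferred from the distance matrix.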

3.2. The operation “explode”

Another fundamental operation, called explode, decomposes a bag $B$ into smaller bags $B_1,\ldots,B_k$, all having the same representative $\rho(B)$, according to the different subtrees of $\rho(B)$ that are part of the tree induced by $B$. For arbitrary nodes $u\neq v\in B$, $u,v\neq\rho(B)$, we note that $u$ and $v$ are in the same subtree if and only if

\[
d(u,v)<d(u,\rho(B))+d(\rho(B),v).
\]

Thus, in query time $O(|B|)$, we can determine the nodes that are in the same subtree as $u$. Therefore, we can partition all nodes of $B\setminus\{\rho(B)\}$ into disjoint subtrees of $\rho(B)$ (without constructing these trees yet) in query time at most $|B|\Delta$ by peeling off each set in the partition in turn. These sets are output as bags denoted by $B_1,\ldots,B_k$.

Figure 4. In the explode operation, a bag $B$ with root $\rho(B)$ is decomposed into several smaller bags, each with root $\rho(B)$. All other nodes of $B$ end up in only one of the smaller bags.
Input: a bag $B$;
while $|B|>1$ do
    Take $u\in B$, $u\neq\rho(B)$;
    Set $B^{*}=\{v\in B,\ v\neq\rho(B): d(u,v)<d(u,\rho(B))+d(v,\rho(B))\}\cup\{\rho(B)\}$;
    Set $\rho(B^{*})=\rho(B)$;
    Output $B^{*}$;
    Set $B\leftarrow\{\rho(B)\}\cup(B\setminus B^{*})$;
Algorithm 3 explode$(B)$

The query complexity of Algorithm 3 is bounded by $(|B|-1)\Delta$.
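The peeling loop of explode translates directly into code. The sketch below is illustrative: the dictionary-based distance storage, the tolerance `tol`, and the five-node example tree (unit edges 1-2, 2-3, 2-4, 3-5, all nodes regular) are assumptions chosen for the demonstration, and queries are not counted.

```python
def explode(bag, rho, dist, tol=1e-9):
    """Split `bag` into one bag per subtree hanging from the representative rho.

    `dist` maps frozenset({u, v}) to d(u, v). Each returned bag starts with
    rho, which stays the representative of every piece.
    """

    def d(u, v):
        return 0.0 if u == v else dist[frozenset((u, v))]

    remaining = [v for v in bag if v != rho]
    pieces = []
    while remaining:
        u = remaining[0]
        # v shares a subtree with u iff the path from u to v avoids rho.
        same = [
            v for v in remaining
            if v == u or d(u, v) < d(u, rho) + d(v, rho) - tol
        ]
        pieces.append([rho] + same)
        remaining = [v for v in remaining if v not in same]
    return pieces


# Hypothetical tree with unit edges 1-2, 2-3, 2-4, 3-5 and rho = 2.
pairs = {(1, 2): 1.0, (2, 3): 1.0, (2, 4): 1.0, (3, 5): 1.0, (1, 3): 2.0,
         (1, 4): 2.0, (3, 4): 2.0, (2, 5): 2.0, (1, 5): 3.0, (4, 5): 3.0}
dist = {frozenset(p): value for p, value in pairs.items()}
pieces = explode([1, 2, 3, 4, 5], rho=2, dist=dist)
```

Each pass of the loop compares one node `u` against the remaining nodes and removes a whole subtree, so the number of passes is at most the degree of the representative, in line with the $(|B|-1)\Delta$ bound.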

3.3. The procedure bigsplit

We are finally ready to provide the details of the procedure bigsplit, which takes as input a bag $B$ and distinct nodes $u_1,\ldots,u_k$ in $B$ not equal to $\rho(B)$.

Input: a bag $B$ and distinct nodes $u_1,\ldots,u_k\in B$ not equal to $\rho(B)$;
Let $\mathcal{C}=\{B\}$;
Let $M=\max_{B'\in\mathcal{C}}|B'|$;
while $M>|B|/\sqrt{\Delta}$ do
      Let $\mathcal{B}$ be a collection of bags. Initially, $\mathcal{B}=\{B\}$;
      Let $S$ be a skeleton. Initially, $S=\emptyset$;
      for $i=1$ to $k$ do
            Find the bag $B^{*}$ in $\mathcal{B}$ to which $u_i$ belongs;
            if $u_i\neq\rho(B^{*})$ then
                  Apply basic$(B^{*},u_i)$, which outputs a skeleton $S^{*}$ and bags $B_1,\ldots,B_k$;
                  $\mathcal{B}\leftarrow\mathcal{B}\setminus\{B^{*}\}$;
                  $\mathcal{B}\leftarrow\mathcal{B}\cup\bigcup_{j=1}^{k}\{B_j\}$;
                  $S\leftarrow S\cup S^{*}$;

      Let $\mathcal{C}$ be an empty collection of bags;
      for all $B\in\mathcal{B}$ do
            explode$(B)$, which leaves output $B_1,\ldots,B_\ell$;
            Add $B_1,\ldots,B_\ell$ to $\mathcal{C}$;

      Let $M=\max_{B'\in\mathcal{C}}|B'|$;

Output $S$;
Output all bags in $\mathcal{C}$;
Algorithm 4 bigsplit$(B,u_1,\ldots,u_k)$
Figure 5. Illustration of bigsplit$(B,u_1,u_2,u_3)$. Each bag anchored on the skeleton is further exploded into smaller bags (not shown).

4. The query complexity

4.1. Proof of Theorem 3.1

For a bag $B$ and nodes $u_1,\ldots,u_\kappa$ taken uniformly at random without replacement from $B\setminus\{\rho(B)\}$, let $M$ be the size of the largest bag output by bigsplit$(B,u_1,\ldots,u_\kappa)$, where $|B|\geq\kappa+1$ and $\kappa=\Delta$. Lemma 4.1 below shows that

\[ \mathbb{P}\left\{M>\frac{|B|}{\sqrt{\Delta}}\right\}\leq\frac{1}{2}. \]

The while loop in bigsplit is repeated until $M\leq\frac{|B|}{\sqrt{\Delta}}$. Thus, we can visualize the algorithm in rounds, starting with $W$. In other words, in the $r$-th round, we apply bigsplit to all bags that have been through a bigsplit $r-1$ times. In every round, all bags of the previous round are reduced in size by a factor of $1/\sqrt{\Delta}$. Therefore, there are at most $\log_{\sqrt{\Delta}}(n)=2\log_{\Delta}(n)$ rounds. The query complexity of one bigsplit without the explode operations is at most $(\kappa+1)(2|B|-1)$. The total query complexity due to all explode operations is at most $\Delta(|B|-1)$, for a grand total bounded by

\[ \kappa+(|B|-1)\times(2+2\kappa+\Delta). \]

Then the (random) query complexity for splitting bag $B$ is bounded by

\[ \kappa+X(|B|-1)\times(2+2\kappa+\Delta)=\Delta+X(|B|-1)\times(2+3\Delta), \tag{3} \]

where $X$ is geometric $(1/2)$. The expected value of this is $\Delta+2(|B|-1)(2+3\Delta)$. Summing over all bags $B$ that participate in one round yields an expected value bound of $O(\Delta)$ times the sum of $|B|-1$ over all participating $B$. But the bags do not overlap, except possibly for their representatives. Hence the sum of all values $(|B|-1)$ is at most $n$, as each proper item in a bag is a regular node. As $\kappa=\Delta$, the expected cost of one round of splitting is at most $\Delta+n(4+6\Delta)$.
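As a sanity check on the expectation of (3), the following sketch compares the closed form with a Monte Carlo average. It assumes that $X$ counts attempts up to and including the first success, each succeeding with probability $1/2$, so that $\mathbb{E}X=2$; the function names and the sample values $|B|=101$, $\Delta=3$ are illustrative only.

```python
import random


def expected_split_cost(bag_size, delta):
    """Expectation of bound (3) with kappa = delta and E[X] = 2."""
    return delta + 2 * (bag_size - 1) * (2 + 3 * delta)


def simulated_split_cost(bag_size, delta, trials=200_000, seed=0):
    """Monte Carlo average of delta + X (|B| - 1)(2 + 3 delta), X ~ geometric(1/2)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = 1
        while rng.random() < 0.5:  # each attempt fails with probability 1/2
            x += 1
        total += delta + x * (bag_size - 1) * (2 + 3 * delta)
    return total / trials


exact = expected_split_cost(101, 3)
estimate = simulated_split_cost(101, 3)
```

The two numbers should agree up to sampling noise.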

There is another component of the query complexity due to the part in which we construct the induced tree for a bag $B$ when $|B|\leq\kappa$. A bag $B$ dealt with in this manner is called final. The query complexity of this part is the sum of $\binom{|B|}{2}$ computed over all final bags of size at least two. Let the sizes of the final bags be denoted by $n_i$. Noting that $\sum_i(n_i-1)\leq n$, we see that the query complexity due to the final bags is at most

\[ \sum_i\frac{n_i(n_i-1)}{2}\leq\max_i n_i\,\sum_i\frac{n_i-1}{2}\leq\frac{\kappa n}{2}<n\Delta. \tag{4} \]

The overall expected query complexity does not exceed

\[ n\Delta+(\Delta+n(4+6\Delta))\times 2\log_{\Delta}(n). \tag{5} \]

This finishes the proof of Theorem 3.1.
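To get a feel for the size of (5), the sketch below evaluates it for a few values of $n$ and $\Delta$ and compares it with $19\Delta n\log_{\Delta}(n)$, the simplified form used in Theorem 4.2. The function names and the tested range ($n\geq\Delta\geq 3$) are choices made for this illustration.

```python
import math


def bound_5(n, delta):
    """Right-hand side of (5)."""
    rounds = 2 * math.log(n, delta)          # at most 2 log_Delta(n) rounds
    per_round = delta + n * (4 + 6 * delta)  # expected cost of one round
    final_bags = n * delta                   # cost of the final bags, from (4)
    return final_bags + per_round * rounds


def simplified_bound(n, delta):
    """The simplified bound 19 * Delta * n * log_Delta(n)."""
    return 19 * delta * n * math.log(n, delta)


checks = [
    (n, delta, bound_5(n, delta), simplified_bound(n, delta))
    for delta in (3, 5, 10, 100)
    for n in (delta, 10 * delta, 10**6)
]
```

For every pair in `checks`, the value of (5) stays below the simplified bound.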

4.2. The main technical lemma

Lemma 4.1.

For a bag $B$, and random nodes $u_1,\ldots,u_\kappa$ taken uniformly at random without replacement from $B\setminus\{\rho(B)\}$, let $M$ be the size of the largest bag output by bigsplit$(B,u_1,\ldots,u_\kappa)$, where $|B|\geq\kappa+1$, and $\kappa=\Delta$. Then

\[ \mathbb{P}\left\{M\geq 1+\frac{|B|}{\sqrt{\Delta}}\right\}\leq\frac{1}{2}. \]
Figure 6. Illustration of bigsplit$(B,u_1,u_2,u_3,u_4)$. The skeleton induced by $\{\rho(B),u_1,u_2,u_3,u_4\}$ is shown in thick lines. The remainder of the tree induced by $B$ is shown in thin lines. The nodes traversed by dfs between visits of $\{\rho(B),u_1,u_2,u_3,u_4\}$ are grouped.
Proof.

Consider $\rho(B)$ as the root of the tree induced by $B$, on which we perform dfs (depth-first search). Note that each of the bags left after bigsplit$(B,u_1,\ldots,u_\kappa)$ corresponds to a subtree with root on the skeleton output by bigsplit and having root degree one. We list the regular nodes in dfs order and separate this list into $\kappa+1$ sublists, all separated by $\rho(B),u_1,\ldots,u_\kappa$. Call the sizes of the sublists $N_0,N_1,\ldots,N_\kappa$. Note that the bags consist of regular nodes, except possibly their representatives on the skeleton. Each bag is either contained in a sublist or a sublist plus one of the nodes $\rho(B),u_1,\ldots,u_\kappa$. Thus,

\[ M\leq\max_i N_i+1. \]

If we number the nodes by dfs order, starting at $\rho(B)$, then picking $\kappa$ nodes uniformly at random without replacement from all nodes, $\rho(B)$ excepted, shows that the $N_i$'s correspond to the cardinalities of the intervals defined by the selected nodes. Therefore, $N_0,\ldots,N_\kappa$ are identically distributed. In addition, for an arbitrary integer $k$,

\begin{align*}
\mathbb{P}\{N_0\geq k\} &= \frac{|B|-1-k}{|B|-1}\times\frac{|B|-2-k}{|B|-2}\times\cdots\times\frac{|B|-\kappa-k}{|B|-\kappa}\\
&\leq\left(1-\frac{k}{|B|-1}\right)^{\kappa}\\
&\leq\exp\left(-\frac{k\kappa}{|B|-1}\right).
\end{align*}
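The product formula and its two upper bounds can be checked with exact arithmetic. In the sketch below the product is compared with the equivalent binomial ratio $\binom{|B|-1-k}{\kappa}/\binom{|B|-1}{\kappa}$ (the probability that none of the $\kappa$ sampled nodes falls among the first $k$ non-root positions); the function names and the sample values $|B|=50$, $\kappa=5$, $k=7$ are illustrative, and the sketch assumes $k\leq|B|-1-\kappa$ so that all factors are non-negative.

```python
from fractions import Fraction
from math import comb, exp


def prob_first_gap_at_least(bag_size, kappa, k):
    """Product formula for P{N_0 >= k}."""
    p = Fraction(1)
    for j in range(1, kappa + 1):
        p *= Fraction(bag_size - j - k, bag_size - j)
    return p


def binomial_form(bag_size, kappa, k):
    """Same probability written as a ratio of binomial coefficients."""
    m = bag_size - 1  # number of non-root nodes
    return Fraction(comb(m - k, kappa), comb(m, kappa))


def exponential_bound(bag_size, kappa, k):
    """Final upper bound exp(-k * kappa / (|B| - 1))."""
    return exp(-k * kappa / (bag_size - 1))


p = prob_first_gap_at_least(50, 5, 7)
```

Using `Fraction` keeps the comparison with the binomial ratio exact rather than subject to rounding.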

Thus, by the union bound,

\begin{align*}
\mathbb{P}\left\{M\geq 1+\frac{|B|}{\sqrt{\Delta}}\right\}
&\leq(\kappa+1)\exp\left(-\frac{|B|\kappa}{(|B|-1)\sqrt{\Delta}}\right)\\
&\leq(\kappa+1)\exp\left(-\frac{\kappa+1}{\sqrt{\Delta}}\right) && \text{(as $|B|\geq\kappa+1$)}\\
&\leq(\Delta+1)\exp\left(-\frac{\Delta+1}{\sqrt{\Delta}}\right)\\
&\leq 5\exp\left(-\frac{5}{2}\right) && \text{(as, over integers $\Delta\geq 3$, the expression is largest at $\Delta=4$)}\\
&=0.41042\ldots\\
&<\frac{1}{2}.
\end{align*}
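The last numerical steps can be checked directly. The sketch below evaluates $(\Delta+1)\exp(-(\Delta+1)/\sqrt{\Delta})$ for integer $\Delta$ in a finite range and records where it is largest; the range $3\leq\Delta\leq 10^4$ and the names used are choices made for this illustration, not part of the proof.

```python
import math


def union_bound_value(delta):
    """(Delta + 1) * exp(-(Delta + 1) / sqrt(Delta))."""
    return (delta + 1) * math.exp(-(delta + 1) / math.sqrt(delta))


# Scan integer values of Delta and record where the expression is largest.
values = {delta: union_bound_value(delta) for delta in range(3, 10_001)}
worst_delta = max(values, key=values.get)
worst_value = values[worst_delta]
```

At $\Delta=3$ the expression evaluates to $0.39728\ldots$, and `worst_value` stays below $1/2$ over the scanned range.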

4.3. More refined probabilistic bounds

In this section we refine Theorem 3.1 by offering more detailed distributional estimates for the query complexity of Algorithm 1.

Theorem 4.2.

The query complexity $Z$ of Algorithm 1 with parameter $\kappa=\Delta$ satisfies

\[ \mathbb{E}\{Z\}\leq 19\Delta n\log_{\Delta}(n). \]

Furthermore, if $\log_{\Delta}(n)\to\infty$ as $n\to\infty$, then $\mathbb{P}\{Z\geq 2\mathbb{E}\{Z\}\}=o(1)$. Finally, if $\log_{\Delta}(n)=O(1)$, then $Z/(\Delta n)\stackrel{st.}{\leq}6X+o_p(1)$, where $\stackrel{st.}{\leq}$ denotes stochastic domination, $X$ is a geometric $(1/2)$ random variable, and $o_p(1)$ is a random variable tending to $0$ in probability.

Proof.

With the choice $\kappa=\Delta$, using (3) for the complexity due to bag splitting and (4) for the complexity due to the final treatment of bags, the query complexity of the algorithm can be bounded by the random variable

\[ Z=n\Delta+\sum_k\sum_{\text{bags $B$ split in round $k$}}\left(\Delta+X_B(2+3\Delta)(|B|-1)\right), \]

where all $X_B$ are i.i.d. geometric $(1/2)$ random variables. Using (5), we have

\[ \mathbb{E}\{Z\}\leq n\Delta+(\Delta+(4+6\Delta)n)\times 2\log_{\Delta}(n)\leq 19\Delta n\log_{\Delta}(n). \]

As argued in the proof of Theorem 3.1, in each round the bags get reduced by a factor of $1/\sqrt{\Delta}$. Thus, given all bags $B$ in all levels, we see that the variance $\mathbb{V}\{Z\}$ is not more than

\begin{align*}
\sum_k\sum_{\text{bags $B$ split in round $k$}}\mathbb{V}\{X_B\}(2+3\Delta)^2(|B|-1)^2
&\leq(2+3\Delta)^2\sum_k\frac{n}{\Delta^{k/2}}\sum_{\text{bags $B$ split in round $k$}}2(|B|-1)\\
&\leq 2n(2+3\Delta)^2\sum_k\frac{n}{\Delta^{k/2}}\\
&\leq\frac{32(n\Delta)^2}{1-1/\sqrt{\Delta}}\\
&\leq 90(n\Delta)^2.
\end{align*}
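The constants in the chain above can be traced step by step. The following is our reading of the individual estimates, assuming $\Delta\geq 3$ as in the proof of Lemma 4.1 and using that a geometric $(p)$ random variable has variance $(1-p)/p^2$; the rounded values are approximations.

```latex
\begin{align*}
\mathbb{V}\{X_B\} &= \frac{1-1/2}{(1/2)^2} = 2, \\
(|B|-1)^2 &\leq \frac{n}{\Delta^{k/2}}\,(|B|-1)
  \quad\text{for a bag $B$ split in round $k$}, \\
\sum_{\text{bags $B$ split in round $k$}} (|B|-1) &\leq n, \\
(2+3\Delta)^2 &\leq 16\Delta^2 \quad\text{for } \Delta\geq 2, \\
\sum_{k\geq 0}\Delta^{-k/2} &= \frac{1}{1-1/\sqrt{\Delta}}
  \leq \frac{1}{1-1/\sqrt{3}} \approx 2.37, \\
\frac{32}{1-1/\sqrt{3}} &\approx 75.7 \leq 90 .
\end{align*}
```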

There are two cases: if $\log_{\Delta}(n)\to\infty$, then by Chebyshev's inequality,

\[ \mathbb{P}\{Z>2\mathbb{E}\{Z\}\}\to 0, \]

where we used the fact that $\mathbb{E}\{Z\}=\Omega(\Delta n\log_{\Delta}(n))$ (King et al., 2003). If, on the other hand, $\log_{\Delta}(n)\leq K$ for a large constant $K$, i.e., $\Delta\geq n^{1/K}$, then

\[ Z\leq n\Delta+\Delta+X(2+3\Delta)n+Z'\leq 6n\Delta X+Z', \]

where $Z'$ is defined as $Z$ with level $k=0$ excluded. Arguing as above, we have

\[ \mathbb{V}\{Z'\}\leq\frac{32(n\Delta)^2}{\sqrt{\Delta}-1}=o\left((\mathbb{E}\{Z\})^2\right), \]

so that by Chebyshev's inequality,

\[ \frac{Z}{n\Delta}\leq 6X+o_p(1), \]

where $o_p(1)$ denotes a quantity that tends to zero in probability as $n\to\infty$. In other words, the sequence of random variables $Z/(n\Delta)$ is tight. ∎

5. Graphical models on semi-labelled trees

We present a family of probabilistic models over partially observed trees for which the distance-based Algorithm 1 recovers the underlying tree. Although the Gaussian and the binary tree models (also known as the Ising tree model) discussed in Section 5.1 are probably the most interesting, with little effort we can generalize this, which we do in Section 5.2 and Section 5.3.

Given a tree $T=(V,E)$ and a random vector $Y=(Y_v)_{v\in V}$ with values in the product space $\mathcal{Y}=\prod_{v\in V}\mathcal{Y}_v$, consider the underlying graphical model over $T$, that is, the family of density functions that factorize according to the tree

\[ f_Y(y)=\prod_{uv\in E}\psi_{uv}(y_u,y_v)\qquad\text{for } y\in\mathcal{Y}, \tag{6} \]

where $\psi_{uv}$ are non-negative potential functions (Lauritzen, 1996). The underlying latent tree model over the semi-labelled tree $T$ with the labelling set $W=[n]$ is a model for $X=(X_1,\ldots,X_n)$, which is the sub-vector of $Y=(X,H)$ associated to the regular vertices. The density of $X$ is obtained from the joint density of $Y$ by marginalizing out the latent variables $H$:

\[ f(x)=\int_{\mathcal{H}}f_Y(x,h)\,{\rm d}h. \]

Note that in the definition of a semi-labelled tree, we required that all nodes of $T$ of degree $\leq 2$ are regular. The restriction complies with this definition; cf. Section 11.1 in Zwiernik (2018).

5.1. The binary/Gaussian case

We single out two simple examples: the case where $Y$ is jointly Gaussian and the case where $Y$ is binary. In the second case, the model on a tree coincides with the binary Ising model on $T$. In both cases, we get a useful path product formula for the correlations, which states that the correlation between any two regular nodes can be written as the product of edge correlations for edges on the path joining these nodes:

\[ \rho_{ij}=\prod_{uv\in\overline{ij}}\rho_{uv}\qquad\text{for all } i,j\in[n]. \tag{7} \]

The advantage of this representation is that (7) gives a direct translation of correlation structures in these models to tree metrics via $d(i,j):=-\log|\rho_{ij}|$ for all $i\neq j$; see also (Pearl, 1988, Section 8.3.3). To make this explicit we formulate the following proposition.

Proposition 5.1.

Consider a latent tree model over a semi-labelled tree $T$. Whenever the correlations $\Sigma=[\rho_{ij}]$ in the underlying tree model admit the path-product formula (7), Algorithm 1, applied to the distance oracle defined by $d(i,j)=-\log|\rho_{ij}|$ for all $i\neq j$, recovers the underlying tree.

Note that our algorithm recovers all the edge lengths, so we can also find the absolute values of the correlations between the latent variables. In the Gaussian case, this yields parameter identification.

Remark 5.2.

In the Gaussian latent tree model over a semi-labelled tree $T$, Algorithm 1 recovers the underlying tree and the model parameters up to sign swapping of the latent variables.
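A small numerical illustration of the translation from correlations to distances may help. The sketch below uses a hypothetical star tree with one latent node $h$ joined to three observed leaves; the edge correlations in `EDGE_CORR` are made-up values, and the three-point identity used to recover an edge length is a standard property of tree metrics rather than a step quoted from Algorithm 1.

```python
import math

# Hypothetical edge correlations rho_{ih} between leaf i and the latent node h.
EDGE_CORR = {1: 0.9, 2: -0.5, 3: 0.7}


def leaf_correlation(i, j):
    """Path-product formula (7) on the star: the path i-j uses edges ih and hj."""
    return EDGE_CORR[i] * EDGE_CORR[j]


def distance(i, j):
    """Tree distance d(i, j) = -log |rho_ij|."""
    return -math.log(abs(leaf_correlation(i, j)))


def edge_length_to_latent(i, j, k):
    """Length of the edge i-h computed from distances between leaves only."""
    return 0.5 * (distance(i, j) + distance(i, k) - distance(j, k))


abs_corr_1 = math.exp(-edge_length_to_latent(1, 2, 3))
abs_corr_2 = math.exp(-edge_length_to_latent(2, 1, 3))
```

The recovered values are the absolute edge correlations, here $0.9$ and $0.5$; the sign of $\rho_{2h}$ is lost, which matches the identification up to sign swapping stated in the remark.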

It turns out that the basic binary/Gaussian setting can be substantially generalized. We discuss three such generalizations:

  (1) General Markov models

  (2) Linear models

  (3) Non-paranormal distributions

We briefly describe these models for completeness. The first two are dealt with in Section 11.2 in Zwiernik (2018).

5.2. General Markov models and linear models

By general Markov model, we mean a generalization of the binary latent tree models, where each $\mathcal{Y}_v=\{0,\ldots,d-1\}$ for some finite $d\geq 1$. Denoting by $P^{uv}$ the $d\times d$ matrix representing the joint distribution of $(Y_u,Y_v)$, and by $P^{vv}$ the diagonal $d\times d$ matrix with the marginal distribution of $Y_v$ on the diagonal, we can define for any two nodes $u,v$

\[ \tau_{uv}:=\frac{\det(P^{uv})}{\sqrt{\det(P^{uu})\det(P^{vv})}}. \tag{8} \]

It turns out that for these new quantities, an equation of type (7) still holds, namely, $\tau_{ij}=\prod_{uv\in\overline{ij}}\tau_{uv}$, so we again obtain the tree distance $d(i,j)=-\log|\tau_{ij}|$.

Proposition 5.3.

In a general Markov model over a semi-labelled tree $T$ we can recover $T$ using Algorithm 1 from the distances $d(i,j)=-\log|\tau_{ij}|$ with $\tau_{ij}$ defined in (8).
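The multiplicativity of $\tau$ along a path can be verified numerically in the smallest case. The sketch below takes $d=2$ and a chain $u - w - v$; the marginal `P_U` and the transition matrices `T_UW`, `T_WV` are made-up values, and all helper names are hypothetical.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def joint(marginal, transition):
    """Joint matrix P[a][b] = marginal[a] * transition[a][b]."""
    return [[marginal[a] * transition[a][b] for b in range(2)] for a in range(2)]


def column_marginal(joint_matrix):
    """Marginal distribution of the second variable."""
    return [joint_matrix[0][b] + joint_matrix[1][b] for b in range(2)]


def tau(joint_matrix):
    """Definition (8); the determinant of a diagonal marginal matrix is a product."""
    row = [sum(joint_matrix[a]) for a in range(2)]
    col = column_marginal(joint_matrix)
    return det2(joint_matrix) / math.sqrt(row[0] * row[1] * col[0] * col[1])


# Hypothetical chain u - w - v with binary states.
P_U = [0.3, 0.7]
T_UW = [[0.8, 0.2], [0.1, 0.9]]
T_WV = [[0.6, 0.4], [0.25, 0.75]]

P_uw = joint(P_U, T_UW)
P_wv = joint(column_marginal(P_uw), T_WV)
# Markov property: P(u, v) = sum_w P(u) T_UW[u][w] T_WV[w][v].
P_uv = [[sum(P_U[a] * T_UW[a][w] * T_WV[w][b] for w in range(2))
         for b in range(2)] for a in range(2)]

d_uv = -math.log(abs(tau(P_uv)))
d_uw = -math.log(abs(tau(P_uw)))
d_wv = -math.log(abs(tau(P_wv)))
```

Here `tau(P_uv)` equals `tau(P_uw) * tau(P_wv)` up to rounding, so `d_uv` is the sum of the two edge distances.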

More generally, suppose that $Y_v$ are now potentially vector-valued, all in $\mathbb{R}^k$, and it holds for every edge $uv$ of $T$ that $\mathbb{E}[Y_u|Y_v]$ is an affine function of $Y_v$. Let

\[ \Sigma_{uv}={\rm cov}(Y_u,Y_v)=\mathbb{E}Y_uY_v^{\top}-\mathbb{E}Y_u\mathbb{E}Y_v^{\top}. \]

In this case, defining $\tau_{uv}$ as

\[ \tau_{uv}:=\det(\Sigma_{uu}^{-1/2}\Sigma_{uv}\Sigma_{vv}^{-1/2}), \tag{9} \]

we reach the same conclusion as for general Markov models; see Section 11.2.3 in Zwiernik (2018) for details.

Proposition 5.4.

In a linear model over a semi-labelled tree $T$ we can recover $T$ using Algorithm 1 from the distances $d(i,j)=-\log|\tau_{ij}|$ with $\tau_{ij}$ defined in (9).

5.3. Non-paranormal distributions

Suppose that $Z=(Z_v)$ is a Gaussian vector with underlying latent tree $T=(V,E)$. Suppose that $Y$ is a monotone transformation of $Z$, so that $Y_v=f_v(Z_v)$ for strictly monotone functions $f_v$, $v\in V$. The conditional independence structures of $Y$ and $Z$ are the same. The problem is that the correlation structure of $Y$ may be quite complicated and may not satisfy the product path formula in (7). Suppose, however, that we have access to the Kendall-$\tau$ coefficients $K=[\kappa_{ij}]$ for $X$ with

(10) $\kappa_{ij} \;=\; \mathbb{E}[\mathbbm{1}(X_i>X_i')\,\mathbbm{1}(X_j>X_j')],$

where $X'$ is an independent copy of $X$. Then we can use the fact that for all $i\neq j$

$\mathbb{E}[\mathbbm{1}(X_i>X_i')\,\mathbbm{1}(X_j>X_j')] \;=\; \mathbb{E}[\mathbbm{1}(Z_i>Z_i')\,\mathbbm{1}(Z_j>Z_j')] \;=:\; \kappa^Z_{ij}.$

As observed by Liu et al., (2012), since $Z$ is Gaussian, we have a simple formula that relates $\kappa^Z_{ij}$ to the correlation coefficient $\rho_{ij}={\rm corr}(Z_i,Z_j)$, for all $i\neq j$

$\rho_{ij} \;=\; \sin(\tfrac{\pi}{2}\kappa^Z_{ij}) \;=\; \sin(\tfrac{\pi}{2}\kappa_{ij}).$

Applying a simple transformation to the oracle $K$ gives us access to the underlying Gaussian correlation pattern, which now can be used as in the Gaussian case.

Proposition 5.5.

In a non-paranormal distribution over a semi-labelled tree $T$ we can recover $T$ using Algorithm 1 from the distances

$d(i,j) \;:=\; -\log|\sin(\tfrac{\pi}{2}\kappa_{ij})|,$

where $\kappa_{ij}$ is the Kendall-$\tau$ coefficient defined in (10).
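The transformation in Proposition 5.5 can be illustrated on simulated data. The sketch below draws a correlated Gaussian pair, applies two strictly increasing transformations, and recovers the latent correlation from a rank statistic. It uses the classical sign-based Kendall coefficient (average of $\mathrm{sign}(x_a-x_b)\,\mathrm{sign}(y_a-y_b)$ over pairs); reading $\kappa_{ij}$ with this normalization is an assumption made here so that the sine formula applies directly. The sample size, seed, transformations, and function names are hypothetical.

```python
import math
import random

def sign(t):
    return (t > 0) - (t < 0)

def kendall_tau(xs, ys):
    """Sign-based Kendall coefficient: mean concordance sign over all pairs."""
    n = len(xs)
    total = 0
    for a in range(n):
        for b in range(a + 1, n):
            total += sign(xs[a] - xs[b]) * sign(ys[a] - ys[b])
    return 2.0 * total / (n * (n - 1))

rng = random.Random(0)
rho = 0.6   # latent Gaussian correlation (illustrative value)
n = 500

# Latent Gaussian pair (Z_1, Z_2) with correlation rho.
z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
z2 = [rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for a in z1]

# Observed variables: strictly increasing transformations of the latent ones.
y1 = [math.exp(a) for a in z1]
y2 = [b ** 3 for b in z2]

kappa_latent = kendall_tau(z1, z2)
kappa_observed = kendall_tau(y1, y2)   # ranks are unchanged by monotone maps

rho_hat = math.sin(0.5 * math.pi * kappa_observed)
d_hat = -math.log(abs(rho_hat))
print(kappa_latent, kappa_observed, rho_hat, d_hat)
```

The rank statistic is unaffected by the transformations, while the Pearson correlation of the transformed data would generally differ from the latent one.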

6. Statistical guarantees

In this section, we illustrate how the results developed in this paper can be applied in a more realistic scenario when the entries of the covariance matrix cannot be measured exactly. In particular, suppose a random sample of size $N$ is observed from a zero-mean distribution with covariance matrix $\Sigma$. Assume that $\Sigma_{ii}=1$ for all $i=1,\ldots,n$ and the correlations $\Sigma_{ij}=\rho_{ij}$ satisfy parametrization (7) for some semi-labelled tree. In this case $d(i,j)=-\log|\rho_{ij}|$ for all $i\neq j$ forms a tree metric.

Denote by $\widehat{\rho}_{ij}$ a suitable estimator of the correlations based on the sample and let $\widehat{d}(i,j)=-\log|\widehat{\rho}_{ij}|$ be the corresponding plug-in estimator of the distances. Since the $\widehat{d}(i,j)$ do not form a tree metric, we cannot apply Algorithm 1 directly. The algorithm uses distances at five places:

  (1) In Algorithm 2, to compute $D(v)$.

  (2) In Algorithm 2, to decide if $\rho(B_i)$ is latent or not.

  (3) In the case when $\rho(B_i)$ is latent, Algorithm 2 also uses the distances to calculate $d(\rho(B_i),v)$ for all $v\in B_i$.

  (4) In Algorithm 3, to group nodes in $B$ according to the connected components obtained by removing $\rho(B)$.

  (5) When recovering the skeleton of a small subtree in the case when $|B|\leq\kappa$.

In order to adapt the algorithm to the "noisy" distance oracle, we first propose the noisy versions of basic and explode. The procedure basic.noisy is outlined in Algorithm 5, and explode.noisy in Algorithm 6. The performance of the procedure depends on the following quantities

(11) $\ell \;:=\; \min_{i\neq j}d(i,j),\qquad u \;:=\; \max_{i\neq j}d(i,j),$

where the minimum and maximum are taken over all regular nodes.

The algorithms have an additional input parameter $\epsilon>0$ that is an upper bound for the noise level. More precisely, the algorithms work correctly whenever $\max_{i\neq j}|\widehat{d}(i,j)-d(i,j)|\leq\epsilon$.

Input: a bag $B$, $\alpha\in B$, and $\epsilon>0$;
for $v\in B$ do
       Compute $\widehat{D}(v)=\widehat{d}(v,\alpha)-\widehat{d}(v,\rho)$;
Order $v\in B$ according to the decreasing value of $\widehat{D}(v)$;
If $|\widehat{D}(u)-\widehat{D}(v)|\leq 4\epsilon$, assign $u,v$ to the same bag;
Denote the resulting bags by $B_1,\ldots,B_k$;
for $i=1,\ldots,k$ do
       $u_i=\arg\min_{v\in B_i}\widehat{d}(\alpha,v)$;
       if $|\widehat{d}(\rho,u_i)+\widehat{d}(u_i,\alpha)-\widehat{d}(\rho,\alpha)|\leq 3\epsilon$ then
             $\rho(B_i)=u_i$
       else
             Identify latent node $w_i$;
             Add $w_i$ to $B_i$;
             Set $\rho(B_i)=w_i$;
             Calculate $\widehat{d}(w_i,v)$ for all $v\in B_i$ using (2);
Return $B_1,\ldots,B_k$ and the skeleton $(\rho(B_1),\rho(B_2)),\ldots,(\rho(B_{k-1}),\rho(B_k))$;
Algorithm 5 basic.noisy$(B,\alpha,\epsilon)$
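The bagging step and the regular-versus-latent test of Algorithm 5 can be sketched on a toy example. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it groups nodes by cutting the sorted values of $\widehat{D}$ wherever consecutive values differ by more than $4\epsilon$ (one possible reading of the "same bag" rule), and it reports `None` for a bag whose path node would be latent instead of creating the latent node and updating distances via (2). The tree (edges r–x, x–z, x–h, h–y, h–a of length 1 with h latent), the noise model, and the name `basic_noisy_sketch` are hypothetical.

```python
import random

def basic_noisy_sketch(B, alpha, rho, d_hat, eps):
    """Return a list of (bag, path_node) pairs; path_node is None if latent."""
    D_hat = {v: d_hat(v, alpha) - d_hat(v, rho) for v in B}
    order = sorted(B, key=lambda v: D_hat[v], reverse=True)
    bags = [[order[0]]]
    for prev, v in zip(order, order[1:]):
        if abs(D_hat[prev] - D_hat[v]) <= 4 * eps:
            bags[-1].append(v)      # same bag: values agree up to the noise
        else:
            bags.append([v])        # gap too large: start a new bag
    out = []
    for bag in bags:
        u = min(bag, key=lambda v: d_hat(alpha, v))
        slack = d_hat(rho, u) + d_hat(u, alpha) - d_hat(rho, alpha)
        out.append((sorted(bag), u if abs(slack) <= 3 * eps else None))
    return out

# True tree distances between the regular nodes of the toy tree.
TRUE = {("r", "x"): 1, ("r", "z"): 2, ("r", "y"): 3, ("r", "a"): 3,
        ("x", "z"): 1, ("x", "y"): 2, ("x", "a"): 2,
        ("z", "y"): 3, ("z", "a"): 3, ("y", "a"): 2}

eps = 0.1                      # below ell / 4, since ell = 1 here
rng = random.Random(1)
noisy = {frozenset(k): v + rng.uniform(-eps, eps) for k, v in TRUE.items()}

def d_hat(u, v):
    """Noisy distance oracle; the distance from a node to itself is zero."""
    return 0.0 if u == v else noisy[frozenset((u, v))]

result = basic_noisy_sketch(["r", "x", "z", "y", "a"], "a", "r", d_hat, eps)
print(result)
```

In this example the nodes x and z share a bag whose path node is the regular node x, while the bag containing y attaches to the path at the latent node h and is therefore flagged.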

The next simple fact is used repeatedly below.

Lemma 6.1.

Suppose $\max_{i\neq j}|\widehat{d}(i,j)-d(i,j)|\leq\epsilon$ for all regular $i\neq j$ and let $a\in\mathbb{R}^{\binom{n}{2}}$. Then $|a^\top(\widehat{d}-d)|\leq\|a\|_1\epsilon$. In particular,

  (i) if $a^\top d=0$ then $|a^\top\widehat{d}|\leq\|a\|_1\epsilon$;

  (ii) if $a^\top d\geq\eta$ then $a^\top\widehat{d}\geq\eta-\|a\|_1\epsilon$.
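To see where thresholds such as $4\epsilon$ come from, note that $\widehat{D}(u)-\widehat{D}(v)$ is a linear combination of four distances with coefficients $\pm 1$, so $\|a\|_1=4$. The following sketch checks the bound of Lemma 6.1 on one such combination; the distance values, labels, and the helper name `l1_error_bound` are made up for illustration.

```python
import random

def l1_error_bound(a, eps):
    """Right-hand side of Lemma 6.1: ||a||_1 * eps."""
    return sum(abs(x) for x in a) * eps

rng = random.Random(0)
eps = 0.05

# Hypothetical true distances with D(u) = D(v) = 1.
d = {"u_alpha": 2.0, "u_rho": 1.0, "v_alpha": 3.0, "v_rho": 2.0}
d_hat = {k: val + rng.uniform(-eps, eps) for k, val in d.items()}

# Coefficients of D(u) - D(v) = d(u,alpha) - d(u,rho) - d(v,alpha) + d(v,rho).
a = {"u_alpha": 1.0, "u_rho": -1.0, "v_alpha": -1.0, "v_rho": 1.0}

exact = sum(a[k] * d[k] for k in d)        # a^T d, zero in this example
noisy = sum(a[k] * d_hat[k] for k in d)    # a^T d_hat
bound = l1_error_bound(a.values(), eps)    # 4 * eps
print(exact, noisy, bound)
```

Since $a^\top d=0$ here, part (i) of the lemma applies and the noisy combination stays within $4\epsilon$ of zero.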

Input: a bag $B$;
while $|B|>1$ do
       Take $u\in B$, $u\neq\rho(B)$;
       Set $B^*=\{v\in B, v\neq\rho(B): d(u,\rho(B))+d(v,\rho(B))-d(u,v)>3\epsilon\}\cup\{\rho(B)\}$;
       Set $\rho(B^*)=\rho(B)$;
       Output $B^*$;
       Set $B\leftarrow\{\rho(B)\}\cup(B\setminus B^*)$;
Algorithm 6 explode.noisy$(B,\epsilon)$
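A minimal sketch of the loop in Algorithm 6 follows, evaluated on noisy distances (an assumption of this illustration; the pseudocode above writes $d$). Two nodes are placed in the same output bag when the noisy quantity $\widehat{d}(u,\rho(B))+\widehat{d}(v,\rho(B))-\widehat{d}(u,v)$ exceeds $3\epsilon$, that is, when their paths to the root share an edge. The toy tree (root c with branches c–p–q, c–s, c–t–w, unit edge lengths), the noise model, and the name `explode_noisy_sketch` are hypothetical.

```python
import random

def explode_noisy_sketch(B, root, d_hat, eps):
    """Split B into bags, one per component of the tree with the root removed."""
    B = list(B)
    bags = []
    while len(B) > 1:
        u = next(v for v in B if v != root)
        star = [v for v in B if v != root
                and d_hat(u, root) + d_hat(v, root) - d_hat(u, v) > 3 * eps]
        if u not in star:
            # Safety guard for this sketch only: keeps the loop finite
            # if the noise assumption is violated.
            star.append(u)
        bags.append(sorted(star + [root]))
        B = [root] + [v for v in B if v != root and v not in star]
    return bags

TRUE = {("c", "p"): 1, ("c", "q"): 2, ("c", "s"): 1, ("c", "t"): 1, ("c", "w"): 2,
        ("p", "q"): 1, ("p", "s"): 2, ("p", "t"): 2, ("p", "w"): 3,
        ("q", "s"): 3, ("q", "t"): 3, ("q", "w"): 4,
        ("s", "t"): 2, ("s", "w"): 3, ("t", "w"): 1}

eps = 0.1
rng = random.Random(2)
noisy = {frozenset(k): v + rng.uniform(-eps, eps) for k, v in TRUE.items()}

def d_hat(u, v):
    """Noisy distance oracle; the distance from a node to itself is zero."""
    return 0.0 if u == v else noisy[frozenset((u, v))]

bags = explode_noisy_sketch(["c", "p", "q", "s", "t", "w"], "c", d_hat, eps)
print(bags)
```

Each output bag contains the root together with the nodes of one branch, matching the three components obtained by deleting c from the toy tree.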

In what follows, we condition on the random event

$\mathcal{E}(\epsilon) \;=\; \left\{\max_{i\neq j}|\widehat{d}(i,j)-d(i,j)|\leq\epsilon\;\;\mbox{for all regular } i\neq j\right\}$
$\phantom{\mathcal{E}(\epsilon)} \;=\; \left\{\max_{i\neq j}\left|\log\left|\frac{\widehat{\rho}_{ij}}{\rho_{ij}}\right|\right|\leq\epsilon\;\;\mbox{for all regular } i\neq j\right\}.$
Proposition 6.2.

Suppose that in the current round, the event $\mathcal{E}(\epsilon)$ holds with $\epsilon<\ell/4$. Then Algorithm 5 and Algorithm 6 applied to the noisy distances give the same output as Algorithm 2 and Algorithm 3 applied to their noiseless versions.

Proof.

In the first part, Algorithm 5 computes $\widehat{D}(v)$ for all $v$ and it uses this information to produce bags $B_1,\ldots,B_k$. The bags in Algorithm 2 are obtained by grouping nodes based on the increasing values of $D(v)$. By Lemma 6.1(i), if $D(u)=D(v)$ then $|\widehat{D}(u)-\widehat{D}(v)|\leq 4\epsilon$. Moreover, if $D(u)>D(v)$ then it must be that $D(u)-D(v)\geq 2\ell$ and so, by Lemma 6.1(ii), $\widehat{D}(u)-\widehat{D}(v)>2\ell-4\epsilon>4\epsilon$. This shows that this step of Algorithm 5 provides the same bags as Algorithm 2. In the second part of the algorithm, we decide whether or not the corresponding path nodes $\rho(B_i)$ are regular.
Let $\hat{u}_i=\arg\min_{v\in B_i}\widehat{d}(\alpha,v)$ and $u_i=\arg\min_{v\in B_i}d(\alpha,v)$. If $\hat{u}_i\neq u_i$ then $d(\rho,\hat{u}_i)-d(\rho,u_i)\geq\ell$. By Lemma 6.1(ii),

$\widehat{d}(\rho,\hat{u}_i)-\widehat{d}(\rho,u_i) \;\geq\; \ell-2\epsilon \;\geq\; 2\epsilon,$

which contradicts the fact that $\hat{u}_i=\arg\min_{v\in B_i}\widehat{d}(\alpha,v)$. We conclude that $\hat{u}_i=u_i$. Now consider the problem of deciding whether $\rho(B_i)=u_i$. Suppose first that $d(\rho,u_i)+d(\alpha,u_i)-d(\rho,\alpha)=0$ (i.e., $\rho(B_i)=u_i$). By Lemma 6.1(i),

$|\widehat{d}(\rho,u_i)+\widehat{d}(\alpha,u_i)-\widehat{d}(\rho,\alpha)| \;\leq\; 3\epsilon.$

If $d(\rho,u_i)+d(\alpha,u_i)-d(\rho,\alpha)>0$ then, since $D$ is a tree metric, it must be that $d(\rho,u_i)+d(\alpha,u_i)-d(\rho,\alpha)\geq 2\ell$. Then, by Lemma 6.1(ii) and by the fact that $\epsilon<\ell/4$,

$\widehat{d}(\rho,u_i)+\widehat{d}(\alpha,u_i)-\widehat{d}(\rho,\alpha) \;\geq\; 2\ell-3\epsilon \;>\; 5\epsilon.$

This shows the correctness of the second part of Algorithm 5.

Since $d(v,\alpha)+d(v,\rho)-d(\rho,\alpha)\geq 2\ell$, we get $\widehat{d}(v,\alpha)+\widehat{d}(v,\rho)-\widehat{d}(\rho,\alpha)\geq 2\ell-3\epsilon$.

We can similarly show that Algorithm 6 gives the same output as Algorithm 3. ∎

The problem with applying Proposition 6.2 recursively to each round is that the event $\mathcal{E}(\epsilon)$ only bounds the noise for distances between the $n$ originally regular nodes. As the procedure progresses, new nodes are made regular; if $\rho(B_i)$ is latent we make it "regular" by updating distances $\widehat{d}(w_i,v)$ for all $v\in B_i$ using (2).

Lemma 6.3.

Suppose that event $\mathcal{E}(\epsilon)$ holds. After $R$ rounds of the algorithm, we have $|\widehat{d}(i,j)-d(i,j)|\leq(1+\tfrac{R}{2})\epsilon$ for all $i,j\in B$ that are regular in the $R$-th round.

Proof.

Consider the first run of Algorithm 5. In this case, $\rho(B)$ and $\alpha$ are both regular. If $w_i$ is identified as a latent node, we calculate $\widehat{d}(w_i,v)$ for all $v\in B_i$ using (2). By the triangle inequality, $|\widehat{d}(w_i,v)-d(w_i,v)|\leq\tfrac{3}{2}\epsilon$ and this bound is sharp. This establishes the case $R=1$. Now suppose the bound in the lemma holds up to the $(R-1)$-st round. In the $R$-th round, $\rho(B)$ may be a latent node added in the previous call but $\alpha$ is still sampled from the originally regular nodes. If $\rho(B_i)$ is identified as a latent node, we calculate $\widehat{d}(\rho(B_i),v)$ for all $v\in B_i$. Since $v$ is also among the originally regular nodes, we conclude $|\widehat{d}(v,\alpha)-d(v,\alpha)|\leq\epsilon$.
By induction, $|\widehat{d}(v,\rho(B))-d(v,\rho(B))|\leq(1+\tfrac{R-1}{2})\epsilon$ and $|\widehat{d}(\alpha,\rho(B))-d(\alpha,\rho(B))|\leq(1+\tfrac{R-1}{2})\epsilon$. By the triangle inequality,

$|\widehat{d}(\rho(B_i),v)-d(\rho(B_i),v)| \;\leq\; \tfrac{1}{2}\left(\epsilon+(1+\tfrac{R-1}{2})\epsilon+(1+\tfrac{R-1}{2})\epsilon\right) \;=\; (1+\tfrac{R}{2})\epsilon.$

The result now follows by induction. ∎
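The induction in the proof amounts to the recursion $e_R=\tfrac{1}{2}(\epsilon+2e_{R-1})$ with $e_0=\epsilon$, whose solution is $e_R=(1+\tfrac{R}{2})\epsilon$. A short numerical check of this closed form is sketched below; the function name `propagated_error` and the value of $\epsilon$ are illustrative.

```python
def propagated_error(rounds, eps):
    """Worst-case error bound after the given number of rounds.

    Each round averages three terms: one distance between originally
    regular nodes (error eps) and two distances to the current root
    (error bound from the previous round).
    """
    err = eps
    for _ in range(rounds):
        err = 0.5 * (eps + err + err)
    return err

eps = 0.1
for R in range(6):
    print(R, propagated_error(R, eps), (1 + R / 2) * eps)
```

The bound therefore grows only linearly in the number of rounds, by $\epsilon/2$ per round.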

It is generally easy to show that the event (ϵ)italic-ϵ\mathcal{E}(\epsilon)caligraphic_E ( italic_ϵ ) holds with probability at least 1η1𝜂1-\eta1 - italic_η as long as the sample size N𝑁Nitalic_N is large enough. Let δ=1eϵ𝛿1superscript𝑒italic-ϵ\delta=1-e^{-\epsilon}italic_δ = 1 - italic_e start_POSTSUPERSCRIPT - italic_ϵ end_POSTSUPERSCRIPT and suppose that the following event holds

(δ):={|ρ^ijρij||ρij|δ for all regular ij}.assignsuperscript𝛿subscript^𝜌𝑖𝑗subscript𝜌𝑖𝑗subscript𝜌𝑖𝑗𝛿 for all regular 𝑖𝑗\mathcal{E}^{\prime}(\delta)\;:=\;\left\{|\widehat{\rho}_{ij}-\rho_{ij}|\leq|% \rho_{ij}|\delta\;\;\mbox{ for all regular }i\neq j\right\}.caligraphic_E start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_δ ) := { | over^ start_ARG italic_ρ end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT - italic_ρ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT | ≤ | italic_ρ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT | italic_δ for all regular italic_i ≠ italic_j } .

Since δ<1𝛿1\delta<1italic_δ < 1, under the event superscript\mathcal{E}^{\prime}caligraphic_E start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT the signs of ρ^ijsubscript^𝜌𝑖𝑗\widehat{\rho}_{ij}over^ start_ARG italic_ρ end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT and ρijsubscript𝜌𝑖𝑗\rho_{ij}italic_ρ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT are the same. It is easy to see that superscript\mathcal{E}^{\prime}\subseteq\mathcal{E}caligraphic_E start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ⊆ caligraphic_E. Indeed, under superscript\mathcal{E}^{\prime}caligraphic_E start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, ρ^ijρij>0subscript^𝜌𝑖𝑗subscript𝜌𝑖𝑗0\frac{\hat{\rho}_{ij}}{\rho_{ij}}>0divide start_ARG over^ start_ARG italic_ρ end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_ρ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG > 0 and for all 1δx1+δ1𝛿𝑥1𝛿1-\delta\leq x\leq 1+\delta1 - italic_δ ≤ italic_x ≤ 1 + italic_δ we have log(1δ)log(x)log(1+δ)1𝛿𝑥1𝛿\log(1-\delta)\leq\log(x)\leq\log(1+\delta)roman_log ( 1 - italic_δ ) ≤ roman_log ( italic_x ) ≤ roman_log ( 1 + italic_δ ). It follows that

$$\left|\log\frac{\widehat{\rho}_{ij}}{\rho_{ij}}\right|\;\leq\;\max\left\{\log(1+\delta),|\log(1-\delta)|\right\}\;=\;-\log(1-\delta)\;=\;\epsilon.$$
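The middle equality in the display above uses the fact that the lower deviation dominates the upper one. For completeness, a short derivation of that step (valid for any $0\leq\delta<1$) can be written as follows.

```latex
\[
\log(1+\delta)+\log(1-\delta)\;=\;\log(1-\delta^{2})\;\leq\;0
\quad\Longrightarrow\quad
\log(1+\delta)\;\leq\;-\log(1-\delta)\;=\;|\log(1-\delta)|,
\]
and, since $\delta=1-e^{-\epsilon}$,
\[
-\log(1-\delta)\;=\;-\log\!\left(e^{-\epsilon}\right)\;=\;\epsilon.
\]
```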

It is then enough to bound the probability of the event $\mathcal{E}'$. To illustrate how this can be done without going into unnecessary technicalities, suppose $\max_{i}\mathbb{E}X_{i}^{4}\leq\kappa$ for some $\kappa>0$. In this case $\operatorname{var}(X_{i}X_{j})\leq\kappa$, and therefore one may use the median-of-means estimator (see, e.g., Lugosi & Mendelson, (2019)) to estimate $\rho_{ij}=\mathbb{E}[X_{i}X_{j}]$. We get the following result.

Theorem 6.4.

Suppose a random sample of size $N$ is generated from a mean zero distribution with covariance matrix $\Sigma$ satisfying $\Sigma_{ii}=1$ and suppose $\max_{i}\mathbb{E}X_{i}^{4}\leq\kappa$ for some $\kappa>0$. Let $\ell,u$ be as defined in (11). Fix $\eta\in(0,1)$ and suppose

$$\epsilon\;\leq\;\frac{\ell}{4(1+\log_{\Delta}(n))}\qquad\mbox{and}\qquad N\;\geq\;\frac{64\kappa\log(n/\eta)}{e^{-2u}\,(1-e^{-\epsilon})^{2}}.$$

Then the noisy version of Algorithm 1 correctly recovers the underlying semi-labeled tree with probability at least $1-\eta$.
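The theorem relies on median-of-means estimates of $\rho_{ij}=\mathbb{E}[X_iX_j]$. The following is a minimal sketch of such an estimator, not the authors' implementation: the function names and the `num_blocks` parameter are illustrative assumptions, the data are assumed to be mean zero with unit variances as in the theorem, and the choice of the number of blocks is left to the caller.

```python
import numpy as np


def median_of_means(values, num_blocks):
    """Median of the means of `num_blocks` equal-sized consecutive blocks."""
    values = np.asarray(values, dtype=float)
    if not 1 <= num_blocks <= values.size:
        raise ValueError("num_blocks must be between 1 and the sample size")
    # Drop the trailing remainder so that every block has the same length.
    block_size = values.size // num_blocks
    trimmed = values[: block_size * num_blocks]
    block_means = trimmed.reshape(num_blocks, block_size).mean(axis=1)
    return float(np.median(block_means))


def mom_correlation(x_i, x_j, num_blocks):
    """Estimate rho_ij = E[X_i X_j] from paired samples.

    Assumes both coordinates have mean zero and unit variance, so the
    expectation of the product equals the correlation.
    """
    products = np.asarray(x_i, dtype=float) * np.asarray(x_j, dtype=float)
    return median_of_means(products, num_blocks)
```

A single extreme product contaminates at most one block mean, which is why the median over blocks remains stable under only a fourth-moment assumption.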

Proof.

Suppose the event $\mathcal{E}(\epsilon)$ holds. As in the proof of Theorem 3.1, we modify the procedure by repeating bigsplit.noisy until the largest bag is bounded in size by $|B|/\sqrt{\Delta}$. With this modification, the algorithm stops after each round. With our choice of $\epsilon$, by Lemma 6.3, we are guaranteed that all computed distances satisfy $|\widehat{d}(u,v)-d(u,v)|\leq\epsilon$ in the first $2\log_{\Delta}(n)$ rounds. By Proposition 6.2, all these subsequent calls of bigsplit.noisy return the same answer as bigsplit applied to noiseless distances. The proof is complete once we show that, with probability at least $1-\eta$, the event $\mathcal{E}(\epsilon)$ holds. We show that $\mathcal{E}'(\delta)$ with $\delta=1-e^{-\epsilon}$ holds, which is a stronger condition. Recall that $\mathbb{E}X_{i}^{2}=1$, $\mathbb{E}X_{i}^{4}\leq\kappa$ and, in consequence, $\operatorname{var}(X_{i}X_{j})\leq\kappa$.
We want to estimate $\mathbb{E}(X_{i}X_{j})=\rho_{ij}$. By Theorem 2 in Lugosi & Mendelson, (2019), the median-of-means estimator $\widehat{\rho}_{ij}$ (with an appropriately chosen number of blocks that depends on $\eta$ only) satisfies that, with probability at least $1-2\eta/\binom{n}{2}$,

$$|\widehat{\rho}_{ij}-\rho_{ij}|\;\leq\;\sqrt{\frac{32\kappa\log\left(\binom{n}{2}/\eta\right)}{N}}.$$

Thus, we get that with probability at least $1-\eta$, all $\widehat{\rho}_{ij}$ satisfy simultaneously that $|\widehat{\rho}_{ij}-\rho_{ij}|\leq|\rho_{ij}|\delta$ as long as

$$N\;\geq\;\frac{64\kappa\log(n/\eta)}{\delta^{2}\min_{i\neq j}\rho_{ij}^{2}}\;=\;\frac{64\kappa\log(n/\eta)}{e^{-2u}\delta^{2}}.$$

The inequality $\epsilon<\tfrac{\ell}{4(1+\log_{\Delta}(n))}$ is equivalent to $\delta<1-e^{-\tfrac{\ell}{4(1+\log_{\Delta}(n))}}$. Thus, we require

$$N\;\geq\;\frac{64\kappa\log(n/\eta)}{e^{-2u}\left(1-e^{-\tfrac{\ell}{4(1+\log_{\Delta}(n))}}\right)^{2}}.$$
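To get a feel for the magnitudes involved, the requirement $N\geq 64\kappa\log(n/\eta)/(e^{-2u}\delta^{2})$ with $\delta=1-e^{-\epsilon}$, as derived in the proof, can be evaluated numerically. The sketch below is only illustrative: the function and argument names are assumptions introduced here, and $\ell$, $u$ and $\Delta$ are taken as given inputs with the meaning they have in the theorem.

```python
import math


def max_epsilon(ell, n, Delta):
    """Largest epsilon allowed by the theorem: ell / (4 * (1 + log_Delta(n)))."""
    return ell / (4.0 * (1.0 + math.log(n, Delta)))


def required_sample_size(kappa, n, eta, u, epsilon):
    """Smallest integer N with N >= 64*kappa*log(n/eta) / (exp(-2u) * delta^2),
    where delta = 1 - exp(-epsilon), following the bound derived in the proof."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1)")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    delta = 1.0 - math.exp(-epsilon)
    bound = 64.0 * kappa * math.log(n / eta) / (math.exp(-2.0 * u) * delta ** 2)
    return math.ceil(bound)
```

For example, `required_sample_size(kappa, n, eta, u, max_epsilon(ell, n, Delta))` gives the sample size at the largest admissible $\epsilon$. The bound grows like $e^{2u}$ in the largest distance and, for small $\epsilon$, like $1/\epsilon^{2}$.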

Acknowledgements

LD was supported by NSERC grant A3456. GL acknowledges the support of Ayudas Fundación BBVA a Proyectos de Investigación Científica 2021 and the Spanish Ministry of Economy and Competitiveness grant PID2022-138268NB-I00, financed by MCIN/AEI/10.13039/501100011033, FSE+MTM2015-67304-P, and FEDER, EU. PZ was supported by NSERC grant RGPIN-2023-03481.

References

  • Afshar et al., (2020) Afshar, Ramtin, Goodrich, Michael T, Matias, Pedro, & Osegueda, Martha C. 2020. Reconstructing biological and digital phylogenetic trees in parallel. In: 28th Annual European Symposium on Algorithms (ESA 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
  • Aizenbud et al., (2021) Aizenbud, Yariv, Jaffe, Ariel, Wang, Meng, Hu, Amber, Amsel, Noah, Nadler, Boaz, Chang, Joseph T., & Kluger, Yuval. 2021. Spectral top-down recovery of latent tree models. arXiv preprint arXiv:2102.13276.
  • Anandkumar et al., (2011) Anandkumar, Animashree, Chaudhuri, Kamalika, Hsu, Daniel J., Kakade, Sham M., Song, Le, & Zhang, Tong. 2011. Spectral methods for learning multivariate latent tree structure. In: Advances in Neural Information Processing Systems, vol. 24.
  • Buneman, (1974) Buneman, Peter. 1974. A note on the metric properties of trees. Journal of Combinatorial Theory Series B, 17(1), 48–50.
  • Chen et al., (2019) Chen, Peixian, Chen, Zhourong, & Zhang, Nevin L. 2019. A novel document generation process for topic detection based on hierarchical latent tree models. Pages 265–276 of: Symbolic and Quantitative Approaches to Reasoning with Uncertainty: 15th European Conference, ECSQARU 2019, Belgrade, Serbia, September 18-20, 2019, Proceedings 15. Springer.
  • Choi et al., (2011) Choi, Myung Jin, Tan, Vincent Y.F., Anandkumar, Animashree, & Willsky, Alan S. 2011. Learning latent tree graphical models. Journal of Machine Learning Research, 12, 1771–1812.
  • Côme et al., (2021) Côme, Etienne, Jouvin, Nicolas, Latouche, Pierre, & Bouveyron, Charles. 2021. Hierarchical clustering with discrete latent variable models and the integrated classification likelihood. Advances in Data Analysis and Classification, 15(4), 957–986.
  • Hein, (1989) Hein, Jotun J. 1989. An optimal algorithm to reconstruct trees from additive distance data. Bulletin of Mathematical Biology, 51(5), 597–603.
  • Huang et al., (2020) Huang, Furong, Naresh, Niranjan Uma, Perros, Ioakeim, Chen, Robert, Sun, Jimeng, & Anandkumar, Anima. 2020. Guaranteed scalable learning of latent tree models. Pages 883–893 of: Uncertainty in Artificial Intelligence. PMLR.
  • Jaffe et al., (2021) Jaffe, Ariel, Amsel, Noah, Aizenbud, Yariv, Nadler, Boaz, Chang, Joseph T., & Kluger, Yuval. 2021. Spectral neighbor joining for reconstruction of latent tree models. SIAM Journal on Mathematics of Data Science, 3(1), 113–141.
  • Kandiros et al., (2023) Kandiros, Vardis, Daskalakis, Constantinos, Dagan, Yuval, & Choo, Davin. 2023. Learning and testing latent-tree Ising models efficiently. Pages 1666–1729 of: The Thirty Sixth Annual Conference on Learning Theory. PMLR.
  • Kannan et al., (1996) Kannan, Sampath K., Lawler, Eugene L., & Warnow, Tandy J. 1996. Determining the evolutionary tree using experiments. Journal of Algorithms, 21(1), 26–50.
  • King et al., (2003) King, Valerie, Zhang, Li, & Zhou, Yunhong. 2003. On the complexity of distance-based evolutionary tree reconstruction. Page 444 of: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM.
  • Lauritzen, (1996) Lauritzen, S. L. 1996. Graphical Models. Oxford, United Kingdom: Clarendon Press.
  • Liu et al., (2012) Liu, Han, Han, Fang, Yuan, Ming, Lafferty, John, & Wasserman, Larry. 2012. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4), 2293–2326.
  • Lugosi & Mendelson, (2019) Lugosi, Gábor, & Mendelson, Shahar. 2019. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5), 1145–1190.
  • Mourad et al., (2013) Mourad, Raphaël, Sinoquet, Christine, Zhang, Nevin Lianwen, Liu, Tengfei, & Leray, Philippe. 2013. A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47, 157–203.
  • Pearl, (1988) Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. The Morgan Kaufmann Series in Representation and Reasoning. Morgan Kaufmann, San Mateo, CA.
  • Semple & Steel, (2003) Semple, Charles, & Steel, Mike A. 2003. Phylogenetics. Vol. 24. Oxford University Press on Demand.
  • Shiers et al., (2016) Shiers, Nathaniel, Zwiernik, Piotr, Aston, John A.D., & Smith, James Q. 2016. The correlation space of Gaussian latent tree models and model selection without fitting. Biometrika, 103(3), 531–545.
  • Sturma et al., (2022) Sturma, Nils, Drton, Mathias, & Leung, Dennis. 2022. Testing many and possibly singular polynomial constraints. arXiv preprint arXiv:2208.11756.
  • Waterman et al., (1977) Waterman, Michael S, Smith, Temple F, Singh, Mona, & Beyer, William A. 1977. Additive evolutionary trees. Journal of Theoretical Biology, 64(2), 199–213.
  • Zhang et al., (2022) Zhang, Haipeng, Zhao, Jian, Wang, Xiaoyu, & Xuan, Yi. 2022. Low-voltage distribution grid topology identification with latent tree model. IEEE Transactions on Smart Grid, 13(3), 2158–2169.
  • Zhang & Poon, (2017) Zhang, Nevin, & Poon, Leonard. 2017. Latent tree analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31.
  • Zhang, (2004) Zhang, Nevin L. 2004. Hierarchical latent class models for cluster analysis. The Journal of Machine Learning Research, 5, 697–723.
  • Zhou et al., (2020) Zhou, Can, Wang, Xiaofei, & Guo, Jianhua. 2020. Learning mixed latent tree models. Journal of Machine Learning Research, 21(249), 1–35.
  • Zwiernik, (2018) Zwiernik, Piotr. 2018. Latent tree models. Pages 283–306 of: Handbook of Graphical Models. CRC Press.