Learning latent tree models with small query complexity

Luc Devroye (School of Computer Science, McGill University, Montreal), Gábor Lugosi (Department of Economics and Business, Pompeu Fabra University, Barcelona, Spain; ICREA, Pg. Lluís Companys 23, 08010 Barcelona, Spain; Barcelona School of Economics), and Piotr Zwiernik (Department of Statistical Sciences, University of Toronto)

(Date: August 28, 2024)

Abstract.

We consider the problem of structure recovery in a graphical model of a tree where some variables are latent. Specifically, we focus on the Gaussian case, which can be reformulated as a well-studied problem: recovering a semi-labeled tree from a distance metric. We introduce randomized procedures that achieve query complexity of optimal order. Additionally, we provide statistical analysis for scenarios where the tree distances are noisy. The Gaussian setting can be extended to other situations, including the binary case and non-paranormal distributions.

Key words and phrases:

latent tree models, phylogenetics, query complexity, probabilistic graphical models, structure learning, mathematical statistics, separation number

2010 Mathematics Subject Classification:

60E15, 62H99, 15B48

1. Introduction

First discussed by Judea Pearl as tree-decomposable distributions to generalize star-decomposable distributions such as the latent class model (Pearl, 1988, Section 8.3), latent tree models are probabilistic graphical models defined on trees, where only a subset of variables is observed. Zhang (2004), Zhang & Poon (2017), and Mourad et al. (2013) extended our theoretical understanding of these models. We refer to Zwiernik (2018) for more details and references.

Latent tree models have found wide applications in various fields. They are used in phylogenetic analysis, network tomography, computer vision, causal modelling, and data clustering. They also contain other well-known classes of models like hidden Markov models, the Brownian motion tree model, the Ising model on a tree, and many popular models used in phylogenetics. In generic high-dimensional problems, latent tree models can be useful in various ways. They share many computational advantages of observed tree models but are more expressive. Latent tree models have been used for hierarchical topic detection (Côme et al., 2021) and clustering.

In phylogenetics, latent tree models have been used to reconstruct the tree of life from the genetic material of surviving species. They have also been used in bioinformatics and computer vision. Machine-learning methods for models with latent variables attract substantial attention from the research community. Other applications include novel algorithms for high-dimensional data (Chen et al., 2019) and the design of low-rank tensor completion methods (Zhang et al., 2022).

Structure learning

The problem of structure or parameter learning for latent tree models has been studied extensively. A seminal work in this field is by Choi et al. (2011), where two consistent and computationally efficient algorithms for learning latent trees were proposed. The main idea is to use the link between a broad class of latent tree models and tree metrics studied extensively in phylogenetics, a link first established by Pearl (1988) for binary and Gaussian distributions. This has been extended to symmetric discrete distributions by Choi et al. (2011). The proof in Zwiernik (2018, Section 2.3) makes it clear that the only essential assumption is that the conditional expectation of every node given its neighbour is a linear function of the neighbour.

From the statistical perspective, testing the corresponding algebraic restrictions on the correlation matrix was also studied in Shiers et al. (2016), and a more recent take on this problem is by Sturma et al. (2022). From the machine learning perspective, we should mention Jaffe et al. (2021), Zhou et al. (2020), Anandkumar et al. (2011), Huang et al. (2020), Aizenbud et al. (2021), and Kandiros et al. (2023). Most of these papers build on the idea of local recursive grouping as proposed by Choi et al. (2011). In particular, they all start by computing all distances between all observed nodes in the tree.

Our contributions

Our work is motivated by novel applications where the dimension $n$ is so large that computing all the distances (or all correlations between observed variables) is impossible. We propose a randomized algorithm that queries the distance oracle and show that the expected query complexity of our algorithm is $O(\Delta n\log_{\Delta}(n))$ . A special case of our problem is the problem of phylogenetic tree recovery. In this case, our approach resembles other phylogenetic tree recovery methods that try to minimize query complexity (see Afshar et al. (2020) for references) and seems to be the first such method that is asymptotically optimal; see King et al. (2003) for the matching lower bound. More importantly, our algorithm can deal with general latent tree models and, in this context, it is again the first such algorithm. As we show in the last section, our algorithm can be easily adjusted to the case of noisy oracles, which is relevant in statistical practice.

2. Preliminaries and basic results

2.1. Trees and semi-labelled trees

A tree $T=(V,E)$ is a connected undirected graph with no cycles. In particular, for any two $u,v\in V$ there is a unique path between them, which we denote by $\overline{uv}$ . A vertex of $T$ with only one neighbour is called a leaf. A vertex of $T$ that is not a leaf is called an inner vertex or internal node. An edge of $T$ is inner if both ends are inner vertices; otherwise, it is called terminal. A connected subgraph of $T$ is called a subtree of $T$ . A rooted tree $(T,\rho)$ is simply a tree $T=(V,E)$ with one distinguished vertex $\rho\in V$ .

A tree $T$ is called a semi-labelled tree with labelled nodes $W\subseteq V$ if every vertex of $T$ of degree $\leq 2$ lies in $W$ .¹ We say that $T$ is a phylogenetic tree if $T$ has no degree- $2$ nodes and $W$ is exactly the set of leaves of $T$ . If $v\in W$ , then we say that $v$ is regular and we depict it by a solid vertex. If $|W|=n$ then we typically label the vertices in $W$ with $[n]=\{1,\ldots,n\}$ . A vertex that is not labelled is called latent. An example semi-labelled tree is shown in Figure 1.

¹ This differs slightly from the regular definition of semi-labelled trees (or X-trees) in phylogenetics, where regular nodes can get multiple labels; see Semple & Steel (2003).

Figure 1. A semi-labeled tree with five regular nodes labeled with $\{1,2,3,4,5\}$ and two latent nodes.

By definition, all latent nodes are internal and have degree $\geq 3$ . Therefore, if $|W|=n$ , then $|T|\leq 2n+1$ .

2.2. Tree metrics

Consider metrics on the discrete set $W$ induced by semi-labeled trees in the following sense. Let $T=(V,E)$ be a tree and suppose that $\ell:E\rightarrow\mathbb{R}_{+}$ is a map that assigns lengths to the edges of $T$ . For any pair $u,v\in V$ by $\overline{uv}$ we denote the path in $T$ joining $u$ and $v$ . We now define the map $d_{T,\ell}:V\times V\rightarrow\mathbb{R}$ by setting, for all $u,v\in V$ ,

d_{T,\ell}(u,v)=\begin{cases}\sum_{e\in\overline{uv}}\ell(e), & \mbox{if }u\neq v,\\ 0, & \mbox{otherwise.}\end{cases}

Suppose now we are interested only in the distances between the regular vertices.

Definition 2.1.

A function $d:\,[n]\times[n]\rightarrow\mathbb{R}$ is called a tree metric if there exists a semi-labeled tree $T=(V,E)$ with $n$ regular nodes $W$ and a (strictly) positive length assignment $\ell:E\rightarrow\mathbb{R}_{+}$ such that for all $i,j\in W$

d(i,j)\;=\;d_{T,\ell}(i,j).

Example 2.2.

Consider a quartet tree with edge lengths as indicated on the left in Figure 2. The distance between vertices $1$ and $3$ is $d({1,3})=2+5+2.5=9.5$ and the whole distance matrix is given on the right in Figure 2, where the dots indicate that this matrix is symmetric.

$\begin{bmatrix}0&5.5&9.5&8\\ \cdot&0&11&9.5\\ \cdot&\cdot&0&3.5\\ \cdot&\cdot&\cdot&0\end{bmatrix},$

Figure 2. A metric on a quartet tree.
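As an illustration of the map $d_{T,\ell}$, the following minimal Python sketch sums edge lengths along the unique path between labelled nodes and reproduces the matrix of Example 2.2. The function name `tree_distances` and the latent-node names `"a"` and `"b"` are hypothetical. The text only states the lengths 2, 5, and 2.5 on the path from 1 to 3; the two remaining lengths (3.5 and 1) are inferred here from the distance matrix.

```python
from collections import defaultdict


def tree_distances(edges, labelled):
    """Return {(u, v): d_{T,l}(u, v)} for all ordered pairs of labelled nodes.

    edges: iterable of (u, v, length) describing a tree with positive lengths.
    labelled: the regular nodes whose pairwise distances are wanted.
    """
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, length))
        adj[v].append((u, length))

    def from_source(src):
        # Depth-first traversal; in a tree every node is reached by a unique path.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            x = stack.pop()
            for y, length in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + length
                    stack.append(y)
        return dist

    out = {}
    for u in labelled:
        dist = from_source(u)
        for v in labelled:
            out[(u, v)] = dist[v]
    return out


# Quartet tree of Example 2.2 ("a" and "b" are assumed names for the latent nodes).
QUARTET_EDGES = [(1, "a", 2.0), (2, "a", 3.5), ("a", "b", 5.0), (3, "b", 2.5), (4, "b", 1.0)]
D = tree_distances(QUARTET_EDGES, [1, 2, 3, 4])
```

Only the restriction of $d_{T,\ell}$ to the regular nodes is returned, which is the object that Definition 2.1 below calls a tree metric.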

It is easy to describe the set of all possible tree metrics.

Definition 2.3.

We say that a map ${d}:\,[n]\times[n]\to\mathbb{R}$ satisfies the four-point condition if for every four (not necessarily distinct) elements $i,j,k,l\in[n]$ ,

{d}({i,j})+{d}({k,l})\;\leq\;\max\left\{{d}({i,k})+{d}({j,l}),\;{d}({i,l})+{d}({j,k})\right\}.

Since the elements $i,j,k,l\in[n]$ in Definition 2.3 need not be distinct, every such map is a metric on $[n]$ given that ${d}({i,i})=0$ and ${d}({i,j})={d}({j,i})$ for all $i,j\in[n]$ . The following fundamental theorem links tree metrics with the four-point condition.

Theorem 2.4 (Tree-metric theorem, Buneman (1974)).

Suppose that ${d}:\,[n]\times[n]\to\mathbb{R}$ is such that ${d}({i,i})=0$ and ${d}({i,j})=d({j,i})$ for all $i,j\in[n]$ . Then, ${d}$ is a tree metric on $[n]$ if and only if it satisfies the four-point condition. Moreover, a tree metric uniquely determines the defining semi-labeled tree and edge lengths.

Note that the assumption about strictly positive lengths of each edge in Definition 2.1 is crucial for uniqueness in Theorem 2.4.
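The four-point condition of Definition 2.3 can be checked by brute force on small examples. The sketch below (hypothetical function name `satisfies_four_point`, with a numerical tolerance `tol` that is our own assumption) tests the matrix of Example 2.2 and, for contrast, the shortest-path metric of a 4-cycle with unit edges, which is a metric but not a tree metric.

```python
from itertools import product


def satisfies_four_point(d, tol=1e-9):
    """Check Definition 2.3 for a symmetric matrix d given as a list of lists."""
    n = len(d)
    # The indices need not be distinct, so all n^4 quadruples are examined.
    for i, j, k, l in product(range(n), repeat=4):
        lhs = d[i][j] + d[k][l]
        rhs = max(d[i][k] + d[j][l], d[i][l] + d[j][k])
        if lhs > rhs + tol:
            return False
    return True


EXAMPLE_2_2 = [
    [0.0, 5.5, 9.5, 8.0],
    [5.5, 0.0, 11.0, 9.5],
    [9.5, 11.0, 0.0, 3.5],
    [8.0, 9.5, 3.5, 0.0],
]

# Shortest-path distances on a cycle 0-1-2-3-0 with unit edge lengths.
FOUR_CYCLE = [
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [1.0, 2.0, 1.0, 0.0],
]

is_tree_metric = satisfies_four_point(EXAMPLE_2_2)
cycle_is_tree_metric = satisfies_four_point(FOUR_CYCLE)
```

This check uses all entries of the matrix and is meant only to illustrate Theorem 2.4; it is not part of the query-efficient procedure developed below.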

2.3. Recovering a tree from a tree metric

Our problem is to recover the semi-labelled tree $T$ based on a few queries of the distance matrix $D=[d(u,v)]$ , which contains the distances between the regular nodes of $T$ . We consider the query complexity, which measures the number of queries of the distance matrix, also called queries of the distance oracle. Trivially, we can reconstruct the tree with query complexity ${n\choose 2}$ . It is known that one can do better for trees with a bounded maximal degree

\Delta\;:=\;\max{\rm degree}(T).

When $T$ is a phylogenetic tree ( $W$ is the set of leaves of $T$ ), the distance queries are sometimes called “additive queries” (Waterman et al., 1977). When $\Delta$ is bounded, Hein (1989) showed that this problem has a solution that uses $O(\Delta n\log_{\Delta}(n))$ distance queries, which is asymptotically optimal by the work of King et al. (2003). For phylogenetic trees, Kannan et al. (1996) proposed an algorithm that has $O(\Delta n\log(n))$ query complexity. See Afshar et al. (2020) for more references.

When the query complexity is jointly measured in terms of $n$ and $\Delta$ , a lower bound for both worst-case and expected query complexity is $\Omega(n\Delta\log_{\Delta}(n))$ (King et al., 2003). Our objective is to exhibit a randomized algorithm that has matching expected query complexity $O(n\Delta\log_{\Delta}(n))$ for the semi-labelled tree reconstruction problem, regardless of how $\Delta$ varies with $n$ . This problem generalizes the phylogenetic tree reconstruction problem, since internal non-latent nodes can have degree two and regular nodes are not necessarily leaves, and therefore methods for phylogenetic tree reconstruction are no longer directly applicable.
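To make the notion of query complexity concrete, one possible way to instrument a distance oracle is sketched below. The class name `CountingOracle` and the convention of counting each unordered pair once (repeated queries are served from a cache) are assumptions made for this illustration, not definitions from the text.

```python
class CountingOracle:
    """Wrap a distance function and count the distinct pairs that were queried."""

    def __init__(self, distance):
        self._distance = distance
        self._cache = {}

    def __call__(self, u, v):
        if u == v:
            return 0.0  # d(u, u) = 0 is known without a query
        key = frozenset((u, v))  # the oracle is symmetric
        if key not in self._cache:
            self._cache[key] = self._distance(u, v)
        return self._cache[key]

    @property
    def queries(self):
        return len(self._cache)


# Toy oracle: the path metric on the integers 0, ..., 5.
oracle = CountingOracle(lambda u, v: float(abs(u - v)))
for u in range(6):
    for v in range(6):
        oracle(u, v)
trivial_cost = oracle.queries  # all pairs: 6 choose 2
```

Reading the whole matrix in this toy case costs ${6\choose 2}=15$ queries, the trivial baseline mentioned above.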

3. The new algorithm

The proposed algorithm uses a randomized version of divide-and-conquer. We will use the notion of a bag $B\subseteq V$ . The algorithm maintains a queue of bags. Initially, there is only one bag, containing all regular nodes, that is, $B=W$ . The procedure takes a bag $B$ . One node in this set, denoted by $\rho(B)$ , is marked as a representative, which can be thought of as a root. A set of edges that jointly form a tree is called a skeleton and is typically denoted by the mnemonic $S$ . Our algorithm starts with an empty skeleton and incrementally constructs the skeleton of the sought tree, which we call the tree induced by $B$ .

Let $\kappa=\Delta$ ;
Pick a node $u\in W$ , and set $\rho(W)\leftarrow u$ (note: $W$ is now a bag);
Make an empty queue $Q$ ;
Add $W$ to $Q$ ;
Set $S\leftarrow\emptyset$ ;
while $|Q|>0$ do
    Remove bag $B$ from the front of $Q$ ;
    if $|B|\leq\kappa$ then
        Query all ${|B|\choose 2}$ distances between nodes in $B$ ;
        Find $S^{*}$ , the full skeleton for the tree induced by $B$ ;
        $S\leftarrow S\cup S^{*}$ ;
    else
        Sample uniformly at random and without replacement nodes $u_{1},\ldots,u_{\kappa}$ from $B\setminus\{\rho(B)\}$ ;
        Apply procedure bigsplit $(B,u_{1},\ldots,u_{\kappa})$ . This procedure outputs a skeleton $S^{*}$ (connecting nodes from $B$ and possibly latent nodes) and bags $B_{1},\ldots,B_{k}$ , where $B_{i}$ overlaps with the nodes of the skeleton in $\rho(B_{i})$ only, and the bags are non-overlapping otherwise (i.e., $(B_{i}\setminus\rho(B_{i}))\cap(B_{j}\setminus\rho(B_{j}))=\emptyset$ );
        $S\leftarrow S\cup S^{*}$ ;
        Add $B_{1},\ldots,B_{k}$ to the rear of $Q$ ;
Return the skeleton $S$ ;

Algorithm 1. Outline of our algorithm

The procedure bigsplit takes a bag $B$ and a random set of nodes in it, $u_{1},\ldots,u_{\kappa}$ , and forms the subtree that connects $u_{1},\ldots,u_{\kappa}$ and $\rho(B)$ . The edges of this subtree give the skeleton that is the output. The remaining nodes of $B$ are collected in bags that “hang” from the skeleton. The representatives of these bags are precisely those nodes where the bags overlap the skeleton. Note that the skeleton may contain latent nodes not originally in $B$ . Within bigsplit, all representatives of the hanging bags have their distances to all nodes in their bags queried, so for all practical purposes, the newly discovered latent nodes act as regular nodes. The bags become smaller as the algorithm proceeds, which leads to a logarithmic number of rounds. The main result of the paper is the following theorem, whose proof is given in Section 4 below.

Theorem 3.1.

Given a distance oracle $D$ between the regular vertices of a semi-labeled tree $T$ , Algorithm 1 with parameter $\kappa=\Delta$ correctly recovers the induced tree with expected query complexity $O(\Delta n\log_{\Delta}(n))$ .

The procedure bigsplit uses two sub-operations, called basic and explode that we describe next.

3.1. The “basic” operation

Let $B$ be a bag with representative $\rho=\rho(B)$ , and let $\alpha\in B$ be a distinct regular vertex. In our basic step, we query $d(v,\alpha)$ for all $v\in B$ , and set

(1)

D(v)=d(v,\alpha)-d(v,\rho).

We group all nodes $v$ according to the different values $D(v)$ that are observed. Ordering the sets in this partition of $B$ from small to large value of $D(\cdot)$ , we obtain bags $B_{1},\ldots,B_{k}$ . It is clear that $\rho\in B_{k}$ and $\alpha\in B_{1}$ . Within each $B_{i}$ , we let $u_{i}$ be the node of $B_{i}$ closest to $\alpha$ . If

d(u_{i},\alpha)+d(u_{i},\rho)=d(\alpha,\rho),

then $u_{i}$ (a regular node) is on the path from $\alpha$ to $\rho$ in the induced tree. We set $\rho(B_{i})=u_{i}$ . If, however,

d(u_{i},\alpha)+d(u_{i},\rho)>d(\alpha,\rho),

then we know that there must be a latent node $w_{i}$ that connects the $(\alpha,\rho)$ path to the nodes in $B_{i}$ . In fact, for all $v\in B_{i}$ , we have

(2)

d(v,w_{i})=\frac{1}{2}\left(d(v,\alpha)+d(v,\rho)-d(\alpha,\rho)\right).

These values can be stored for further use. So, we add $w_{i}$ to $B_{i}$ and define $\rho(B_{i})=w_{i}$ . In this manner, we have identified $S^{*}$ , the part of the final skeleton that connects $\alpha$ with $\rho$ :

(\rho(B_{1}),\rho(B_{2})),\ldots,(\rho(B_{k-1}),\rho(B_{k})).

Figure 3. In a basic operation, a bag $B$ with root $\rho(B)$ and query node $\alpha$ is decomposed into several smaller bags, with root nodes on the skeleton.

Input: a bag $B$ , and $\alpha\in B$ ;
Set $\rho=\rho(B)$ ;
for $v\in B$ do
    Compute $D(v)$ in (1);
Assign $v$ to bags $B_{1},\ldots,B_{k}$ according to increasing values of $D(v)$ ;
for $i=1,\ldots,k$ do
    $u_{i}=\arg\min_{v\in B_{i}}d(v,\alpha)$ ;
    if $d(\rho,u_{i})+d(u_{i},\alpha)=d(\rho,\alpha)$ then
        $\rho(B_{i})=u_{i}$
    else
        Identify latent node $w_{i}$ ;
        Add $w_{i}$ to $B_{i}$ ;
        Set $\rho(B_{i})=w_{i}$ ;
        Update the distance oracle by calculating $d(w_{i},v)$ for all $v\in B_{i}$ using (2);
Return $B_{1},\ldots,B_{k}$ and the skeleton $(\rho(B_{1}),\rho(B_{2})),\ldots,(\rho(B_{k-1}),\rho(B_{k}))$ ;

Algorithm 2. basic $(B,\alpha)$

The query complexity of Algorithm 2 is bounded by $2|B|-3$ .
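The basic step can be sketched in Python as follows. This is an illustrative reading of the description above, not the authors' implementation: the function name `basic`, the encoding of a latent representative as the tuple `("latent", index)`, and the rounding tolerance used to group equal values of $D(v)$ are all assumptions. The sketch uses the fact that the right-hand side of (2) vanishes for $u_{i}$ exactly when $d(u_{i},\alpha)+d(u_{i},\rho)=d(\alpha,\rho)$, so a single quantity decides whether the representative is regular or latent.

```python
def basic(bag, rho, alpha, d, tol=1e-9):
    """One 'basic' step on a bag with representative rho and query node alpha.

    d is a callable distance oracle. Returns (bags, skeleton), where bags is a
    list of (representative, members, offsets) ordered from alpha to rho and
    offsets[v] is the distance from v to the representative of its bag.
    """
    groups = {}
    for v in bag:
        key = round(d(v, alpha) - d(v, rho), 9)  # D(v) from (1), bucketed
        groups.setdefault(key, []).append(v)

    d_ar = d(alpha, rho)
    bags = []
    for idx, key in enumerate(sorted(groups)):  # small to large D(v)
        members = groups[key]
        # Formula (2): distance from v to the point where its group meets the path.
        offsets = {v: 0.5 * (d(v, alpha) + d(v, rho) - d_ar) for v in members}
        u = min(members, key=lambda v: d(v, alpha))
        # Zero offset means u lies on the (alpha, rho) path; otherwise a latent
        # node (hypothetical label) connects the group to the path.
        rep = u if offsets[u] <= tol else ("latent", idx)
        bags.append((rep, members, offsets))

    reps = [rep for rep, _, _ in bags]
    return bags, list(zip(reps, reps[1:]))


# Distance matrix of Example 2.2, with rho = 1 and alpha = 3.
M = {(1, 2): 5.5, (1, 3): 9.5, (1, 4): 8.0, (2, 3): 11.0, (2, 4): 9.5, (3, 4): 3.5}


def d(u, v):
    return 0.0 if u == v else M[(min(u, v), max(u, v))]


bags, skeleton = basic([1, 2, 3, 4], rho=1, alpha=3, d=d)
```

On the quartet of Example 2.2 this finds two latent nodes on the path from 3 to 1, at distance 1 from node 4 and at distance 3.5 from node 2.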

3.2. The operation “explode”

Another fundamental operation, called explode, decomposes a bag $B$ into smaller bags $B_{1},\ldots,B_{k}$ —all having the same representative $\rho(B)$ —according to the different subtrees of $\rho(B)$ that are part of the tree induced by $B$ . For arbitrary nodes $u\not=v\in B$ , $u,v\not=\rho(B)$ , we note that $u$ and $v$ are in the same subtree if and only if

d(u,v)<d(u,\rho(B))+d(\rho(B),v).

Thus, in query time $O(|B|)$ , we can determine the nodes that are in the same subtree as $u$ . Therefore, we can partition all nodes of $B\setminus\{\rho(B)\}$ into disjoint subtrees of $\rho(B)$ (without constructing these trees yet) in query time at most $|B|\Delta$ by peeling off each set in the partition in turn. These sets are output as bags denoted by $B_{1},\ldots,B_{k}$ .

Input: a bag $B$ ;
while $|B|>1$ do
    Take $u\in B$ , $u\not=\rho(B)$ ;
    Set $B^{*}=\{v\in B,v\not=\rho(B):d(u,v)<d(u,\rho(B))+d(v,\rho(B))\}\cup\{\rho(B)\}$ ;
    Set $\rho(B^{*})=\rho(B)$ ;
    Output $B^{*}$ ;
    Set $B\leftarrow\{\rho(B)\}\cup(B\setminus B^{*})$ ;

Algorithm 3. explode $(B)$

The query complexity of Algorithm 3 is bounded by $(|B|-1)\Delta$ .
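A minimal Python sketch of the explode operation, under the assumption that the representative's distances are available through a callable oracle `d`. The function name, the tolerance `tol`, and the small test tree (root 0 with hanging paths 0-1-2, 0-3, and 0-4-5) are hypothetical choices for illustration.

```python
def explode(bag, rho, d, tol=1e-9):
    """Split bag into the subtrees hanging from its representative rho.

    Each output bag contains rho together with the nodes of one subtree.
    """
    remaining = [v for v in bag if v != rho]
    parts = []
    while remaining:
        u = remaining[0]
        # v is in the same subtree as u iff the path from u to v avoids rho.
        same = [v for v in remaining if d(u, v) < d(u, rho) + d(v, rho) - tol]
        parts.append([rho] + same)
        remaining = [v for v in remaining if v not in same]
    return parts


# Tree with edges 0-1 (1), 1-2 (1), 0-3 (2), 0-4 (1), 4-5 (1); all nodes regular.
PAIRS = {
    (0, 1): 1.0, (0, 2): 2.0, (0, 3): 2.0, (0, 4): 1.0, (0, 5): 2.0,
    (1, 2): 1.0, (1, 3): 3.0, (1, 4): 2.0, (1, 5): 3.0,
    (2, 3): 4.0, (2, 4): 3.0, (2, 5): 4.0,
    (3, 4): 3.0, (3, 5): 4.0, (4, 5): 1.0,
}


def d(u, v):
    return 0.0 if u == v else PAIRS[(min(u, v), max(u, v))]


parts = explode([0, 1, 2, 3, 4, 5], rho=0, d=d)
```

Each pass of the loop peels off one subtree, so the number of passes is at most the degree of the representative.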

3.3. The procedure bigsplit

We are finally ready to provide the details of the procedure bigsplit, which takes as input a bag $B$ and distinct nodes $u_{1},\ldots,u_{k}$ in $B$ not equal to $\rho(B)$ .

Input: a bag $B$ and distinct nodes $u_{1},\ldots,u_{k}{\in B}$ not equal to $\rho(B)$ ;
Let $\mathcal{C}=\{B\}$ ;
Let $M=\max_{B^{\prime}\in\mathcal{C}}|B^{\prime}|$ ;
while $M>|B|/\sqrt{\Delta}$ do
    Let $\mathcal{B}$ be a collection of bags. Initially, $\mathcal{B}=\{B\}$ ;
    Let $S$ be a skeleton. Initially, $S=\emptyset$ ;
    for $i=1$ to $k$ do
        Find the bag $B^{*}$ in $\mathcal{B}$ to which $u_{i}$ belongs ;
        if $u_{i}\not=\rho(B^{*})$ then
            Apply basic $(B^{*},u_{i})$ , which outputs a skeleton $S^{*}$ and bags $B_{1},\ldots,B_{k}$ ;
            ${\mathcal{B}}\leftarrow{\mathcal{B}}\setminus\{B^{*}\}$ ;
            ${\mathcal{B}}\leftarrow{\mathcal{B}}\cup\cup_{j=1}^{k}\{B_{j}\}$ ;
            $S\leftarrow S\cup S^{*}$ ;
    Let $\mathcal{C}$ be an empty collection of bags ;
    for all $B\in{\mathcal{B}}$ do
        explode $(B)$ , which leaves output $B_{1},\ldots,B_{\ell}$ ;
        Add $B_{1},\ldots,B_{\ell}$ to $\mathcal{C}$ ;
    Let $M=\max_{B^{\prime}\in\mathcal{C}}|B^{\prime}|$ ;
Output $S$ ;
Output all bags in $\mathcal{C}$ ;

Algorithm 4. bigsplit $(B,u_{1},\ldots,u_{k})$

4. The query complexity

4.1. Proof of Theorem 3.1

For a bag $B$ , and nodes $u_{1},\ldots,u_{\kappa}$ taken uniformly at random without replacement from $B\setminus\{\rho(B)\}$ , let $M$ be the size of the largest bag output by bigsplit $(B,u_{1},\ldots,u_{\kappa})$ , where $|B|\geq\kappa+1$ , and $\kappa=\Delta$ . Lemma 4.1 below shows that

\mathbb{P}\left\{M>\frac{|B|}{\sqrt{\Delta}}\right\}\leq\frac{1}{2}.

The while loop in bigsplit is repeated until $M\leq\frac{|B|}{\sqrt{\Delta}}$ . Thus, we can visualize the algorithm in rounds, starting with $W$ . In other words, in the $r$ -th round, we apply bigsplit to all bags that have been through a bigsplit $r-1$ times. In every round, all bags of the previous round are reduced in size by a factor of $1/\sqrt{\Delta}$ . Therefore, there are $\leq\log_{\sqrt{\Delta}}(n)=2\log_{\Delta}(n)$ rounds. The query complexity of one bigsplit without the explode operations is at most $(\kappa+1)(2|B|-1)$ . The total query complexity due to all explode operations is at most $\Delta(|B|-1)$ , for a grand total bounded by

\kappa+(|B|-1)\times(2+2\kappa+\Delta).

Then the (random) query complexity for splitting bag $B$ is bounded by

(3)

\kappa+X(|B|-1)\times(2+2\kappa+\Delta)=\Delta+X(|B|-1)\times(2+3\Delta),

where $X$ is geometric $(1/2)$ . The expected value of this is $\Delta+2(|B|-1)(2+3\Delta)$ . Summing over all bags $B$ that participate in one round yields an expected value bound of $O(\Delta)$ times the sum of $|B|-1$ over all participating $B$ . But the bags do not overlap, except possibly for their representatives. Hence the sum of all values $(|B|-1)$ is at most $n$ , as each proper item in a bag is a regular node. As $\kappa=\Delta$ , the expected cost of one round of splitting is at most $\Delta+n(4+6\Delta)$ .

There is another component of the query complexity due to the part in which we construct the induced tree for a bag $B$ when $|B|\leq\kappa$ . A bag $B$ dealt with in this manner is called final. So the total query complexity becomes the sum of ${|B|\choose 2}$ computed over all final bags of size at least two. Let the sizes of the final bags be denoted by $n_{i}$ . Noting that $\sum_{i}(n_{i}-1)\leq n$ , we see that the query complexity due to the final bags is at most

(4)

\sum_{i}\frac{n_{i}(n_{i}-1)}{2}\leq\max_{i}n_{i}\,\sum_{i}\frac{n_{i}-1}{2}\leq\frac{\kappa n}{2}<n\Delta.

The overall expected query complexity does not exceed

(5)

n\Delta+(\Delta+n(4+6\Delta))\times 2\log_{\Delta}(n).

This finishes the proof of Theorem 3.1.
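For a sense of scale, the bound (5) can be evaluated numerically and compared with the trivial cost ${n\choose 2}$. The sketch below is plain arithmetic on the displayed formula; the function names are hypothetical.

```python
import math


def expected_query_bound(n, delta):
    """Right-hand side of (5): n*Delta + (Delta + n(4 + 6*Delta)) * 2*log_Delta(n)."""
    rounds = 2.0 * math.log(n) / math.log(delta)
    return n * delta + (delta + n * (4 + 6 * delta)) * rounds


def all_pairs(n):
    """Cost of querying the whole distance matrix."""
    return n * (n - 1) // 2


large = expected_query_bound(10**6, 3)
small = expected_query_bound(100, 3)
```

With these constants, the bound is far below ${n\choose 2}$ for $n=10^{6}$ and $\Delta=3$, but exceeds it for $n=100$; the guarantee is asymptotic and the constants in (5) are not optimized.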

4.2. The main technical lemma

Lemma 4.1.

For a bag $B$ , and random nodes $u_{1},\ldots,u_{\kappa}$ taken uniformly at random without replacement from $B\setminus\{\rho(B)\}$ , let $M$ be the size of the largest bag output by bigsplit $(B,u_{1},\ldots,u_{\kappa})$ , where $|B|\geq\kappa+1$ , and $\kappa=\Delta$ . Then

\mathbb{P}\left\{M\geq 1+\frac{|B|}{\sqrt{\Delta}}\right\}\leq\frac{1}{2}.

Proof..

Consider $\rho(B)$ as the root of the tree induced by $B$ , on which we perform dfs (depth-first-search). Note that each of the bags left after bigsplit $(B,u_{1},\ldots,u_{\kappa})$ corresponds to a subtree with root on the skeleton output by bigsplit and having root degree one. We list the regular nodes in dfs order and separate this list into $\kappa+1$ sublists, all separated by $\rho(B),u_{1},\ldots,u_{\kappa}$ . Call the sizes of the sublists $N_{0},N_{1},\ldots,N_{\kappa}$ . Note that the bags consist of regular nodes, except possibly their representatives on the skeleton. Each bag is either contained in a sublist or a sublist plus one of the nodes $\rho(B),u_{1},\ldots,u_{\kappa}$ . Thus,

M\leq\max_{i}N_{i}+1.

If we number the nodes by dfs order, starting at $\rho(B)$ , then picking $\kappa$ nodes uniformly at random without replacement from all nodes, $\rho(B)$ excepted, shows that the $N_{i}$ ’s correspond to the cardinalities of the intervals defined by the selected nodes. Therefore, $N_{0},\ldots,N_{\kappa}$ are identically distributed. In addition, for an arbitrary integer $k$ ,

\begin{aligned}
\mathbb{P}\left\{N_{0}\geq k\right\}&=\frac{|B|-1-k}{|B|-1}\times\frac{|B|-2-k}{|B|-2}\times\cdots\times\frac{|B|-\kappa-k}{|B|-\kappa}\\
&\leq\left(1-\frac{k}{|B|-1}\right)^{\kappa}\\
&\leq\exp\left(-\frac{k\kappa}{|B|-1}\right).
\end{aligned}

Thus, by the union bound,

\begin{aligned}
\mathbb{P}\left\{M\geq 1+\frac{|B|}{\sqrt{\Delta}}\right\}&\leq(\kappa+1)\exp\left(-\frac{|B|\kappa}{(|B|-1)\sqrt{\Delta}}\right)\\
&\leq(\kappa+1)\exp\left(-\frac{\kappa+1}{\sqrt{\Delta}}\right)\qquad\text{(as $|B|\geq\kappa+1$)}\\
&\leq(\Delta+1)\exp\left(-\frac{\Delta+1}{\sqrt{\Delta}}\right)\\
&\leq 4\exp\left(-\frac{4}{\sqrt{3}}\right)\qquad\text{(as the expression decreases for $\Delta\geq 3$)}\\
&=0.39728\ldots\\
&<\frac{1}{2}.
\end{aligned}

∎
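The tail formula for $N_{0}$ used in the proof above can be checked exactly on small cases. In the sketch below (hypothetical function names; `b` stands for $|B|$), the event $\{N_{0}\geq k\}$ is read as "none of the first $k$ of the $|B|-1$ non-root positions in dfs order is selected", which gives a ratio of binomial coefficients that should agree with the product in the proof and sit below both upper bounds.

```python
import math


def tail_exact(b, kappa, k):
    """P{N_0 >= k} as a ratio of binomial coefficients (requires k <= b - 1)."""
    return math.comb(b - 1 - k, kappa) / math.comb(b - 1, kappa)


def tail_product(b, kappa, k):
    """The product of kappa ratios appearing in the proof of Lemma 4.1."""
    p = 1.0
    for j in range(1, kappa + 1):
        p *= (b - j - k) / (b - j)
    return p


def tail_power_bound(b, kappa, k):
    return (1.0 - k / (b - 1)) ** kappa


def tail_exp_bound(b, kappa, k):
    return math.exp(-k * kappa / (b - 1))


# Example values (chosen arbitrarily): |B| = 50 and kappa = 5.
rows = [
    (k, tail_exact(50, 5, k), tail_product(50, 5, k),
     tail_power_bound(50, 5, k), tail_exp_bound(50, 5, k))
    for k in range(0, 21)
]
```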

4.3. More refined probabilistic bounds

In this section we refine Theorem 3.1 by offering more detailed distributional estimates for the query complexity of Algorithm 1.

Theorem 4.2.

The query complexity $Z$ of Algorithm 1 with parameter $\kappa=\Delta$ has

\mathbb{E}\{Z\}\;\leq\;19\Delta n\log_{\Delta}(n).

Furthermore, if $\log_{\Delta}(n)\to\infty$ as $n\to\infty$ , then $\mathbb{P}\{Z\geq 2\mathbb{E}\{Z\}\}=o(1)$ . Finally, if $\log_{\Delta}(n)=O(1)$ , then $Z/(\Delta n){\stackrel{{\scriptstyle st.}}{{\leq}}}6X+o_{p}(1)$ , where ${\stackrel{{\scriptstyle st.}}{{\leq}}}$ denotes stochastic domination, $X$ is a geometric $(1/2)$ random variable, and $o_{p}(1)$ is a random variable tending to $0$ in probability.

Proof..

With the choice $\kappa=\Delta$ , using (3) for the complexity due to bag splitting and (4) for the complexity due to the final treatment of bags, the query complexity of the algorithm can be bounded by the random variable

Z=n\Delta+\sum_{k}\sum_{\textrm{bags $B$ split in round $k$}}\left(\Delta+X_{B}(2+3\Delta)(|B|-1)\right),

where all $X_{B}$ are i.i.d. geometric $(1/2)$ random variables. Using (5), we have

\mathbb{E}\{Z\}\leq n\Delta+(\Delta+(4+6\Delta)n)\times 2\log_{\Delta}(n)\leq 19\Delta n\log_{\Delta}(n).

As argued in the proof of Theorem 3.1, in each round the bags get reduced by a factor of $1/\sqrt{\Delta}$ . Thus, given all bags $B$ in all levels, we see that the variance $\mathbb{V}\{Z\}$ is not more than

\begin{aligned}
\sum_{k}\sum_{\textrm{bags $B$ split in round $k$}}\mathbb{V}\{X_{B}\}(2+3\Delta)^{2}(|B|-1)^{2}&\leq(2+3\Delta)^{2}\,\sum_{k}\frac{n}{\Delta^{k/2}}\sum_{\textrm{bags $B$ split in round $k$}}2(|B|-1)\\
&\leq 2n(2+3\Delta)^{2}\,\sum_{k}\frac{n}{\Delta^{k/2}}\\
&\leq\frac{32(n\Delta)^{2}}{1-1/\sqrt{\Delta}}\\
&\leq 90(n\Delta)^{2}.
\end{aligned}

There are two cases: if $\log_{\Delta}(n)\to\infty$ , then by Chebyshev’s inequality,

\mathbb{P}\{Z>2\mathbb{E}\{Z\}\}\to 0,

where we used the fact that $\mathbb{E}(Z)=\Omega(\Delta n\log_{\Delta}(n))$ (King et al., 2003). If, on the other hand, $\log_{\Delta}(n)\leq K$ for a large constant $K$ , i.e., $\Delta\geq n^{1/K}$ , then

Z\;\leq\;n\Delta+\Delta+X(2+3\Delta)n+Z^{\prime}\;\leq\;6n\Delta X+Z^{\prime},

where $Z^{\prime}$ is defined as $Z$ with level $k=0$ excluded. Arguing as above, we have

\mathbb{V}\{Z^{\prime}\}\leq\frac{32(n\Delta)^{2}}{\sqrt{\Delta}-1}=o\left((\mathbb{E}\{Z\})^{2}\right),

so that by Chebyshev’s inequality,

\frac{Z}{n\Delta}\leq 6X+o_{p}(1),

where $o_{p}(1)$ denotes a quantity that tends to zero in probability as $n\to\infty$ . In other words, the sequence of random variables $Z/(n\Delta)$ is tight. ∎

5. Graphical models on semi-labelled trees

We present a family of probabilistic models over partially observed trees for which the distance-based Algorithm 1 recovers the underlying tree. Although the Gaussian and the binary tree models (also known as the Ising tree model) discussed in Section 5.1 are probably the most interesting, with little effort we can generalize this, which we do in Section 5.2 and Section 5.3.

Given a tree $T=(V,E)$ and a random vector $Y=(Y_{v})_{v\in V}$ with values in the product space $\mathcal{Y}=\prod_{v\in V}\mathcal{Y}_{v}$ , consider the underlying graphical model over $T$ , that is, the family of density functions that factorize according to the tree

(6)

f_{Y}(y)\;=\;\prod_{uv\in E}\psi_{uv}(y_{u},y_{v})\qquad\mbox{for }y\in\mathcal{Y},

where $\psi_{uv}$ are non-negative potential functions (Lauritzen, 1996). The underlying latent tree model over the semi-labelled tree $T$ with the labelling set $W=[n]$ is a model for $X=(X_{1},\ldots,X_{n})$ , which is the sub-vector of $Y=(X,H)$ associated to the regular vertices. The density of $X$ is obtained from the joint density of $Y$ by marginalizing out the latent variables $H$

f(x)\;=\;\int_{\mathcal{H}}f_{Y}(x,h){\rm d}h.

Note that in the definition of a semi-labelled tree, we required that all nodes of $T$ of degree $\leq 2$ are regular. The restriction complies with this definition; cf. Section 11.1 in Zwiernik (2018).

5.1. The binary/Gaussian case

We single out two simple examples where $Y$ is jointly Gaussian or when $Y$ is binary. In the second case, the model on a tree coincides with the binary Ising model on $T$ . In both cases, we get a useful path product formula for the correlations, which states that the correlation between any two regular nodes can be written as the product of edge correlations for edges on the path joining these nodes

(7)

\rho_{ij}\;=\;\prod_{uv\in\overline{ij}}\rho_{uv}\qquad\mbox{for all }i,j\in[n].

The advantage of this representation is that (7) gives a direct translation of correlation structures in these models to tree metrics via $d({i,j}):=-\log|\rho_{ij}|$ for all $i\neq j$ ; see also Pearl (1988, Section 8.3.3). To make this explicit we formulate the following proposition.

Proposition 5.1.

Consider a latent tree model over a semi-labelled tree $T$ . Whenever the correlations $\Sigma=[\rho_{ij}]$ in the underlying tree model admit the path-product formula (7), Algorithm 1, applied to the distance oracle defined by $d({i,j})=-\log|\rho_{ij}|$ for all $i\neq j$ , recovers the underlying tree.

Note that our algorithm recovers all the edge lengths, so we can also find the absolute values of the correlations between the latent variables. In the Gaussian case, this yields parameter identification.
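The translation from the path-product formula (7) to an additive distance can be seen in a few lines of Python. The edge correlations below are arbitrary example values and the function names are hypothetical.

```python
import math


def path_correlation(edge_correlations):
    """Path-product formula (7): multiply the edge correlations along the path."""
    return math.prod(edge_correlations)


def corr_distance(rho):
    """d = -log|rho|, the tree distance associated with a correlation."""
    return -math.log(abs(rho))


# Three edges on the path between two regular nodes i and j (example values).
edge_rhos = [0.8, -0.5, 0.9]
rho_ij = path_correlation(edge_rhos)
d_ij = corr_distance(rho_ij)
d_sum = sum(corr_distance(r) for r in edge_rhos)
```

Because the logarithm turns the product into a sum, `d_ij` equals the sum of the edge lengths $-\log|\rho_{uv}|$; the signs of the edge correlations are lost, which is consistent with identification only up to sign.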

Remark 5.2.

In the Gaussian latent tree model over a semi-labelled tree $T$ , Algorithm 1 recovers the underlying tree and the model parameters up to sign swapping of the latent variables.

It turns out that the basic binary/Gaussian setting can be largely generalized. We discuss three such generalizations:

(1) General Markov models
(2) Linear models
(3) Non-paranormal distributions

We briefly describe these models for completeness. The first two are dealt with in Section 11.2 in Zwiernik (2018).

5.2. General Markov models and linear models

By general Markov model, we mean a generalization of the binary latent tree models, where each $\mathcal{Y}_{v}=\{0,\ldots,d-1\}$ for some finite $d\geq 1$ . Denoting by $P^{uv}$ the $d\times d$ matrix representing the joint distribution of $(Y_{u},Y_{v})$ , and by $P^{vv}$ the diagonal $d\times d$ matrix with the marginal distribution of $Y_{v}$ on the diagonal, we can define for any two nodes $u,v$

(8)

\tau_{uv}\;:=\;\frac{\det(P^{uv})}{\sqrt{\det(P^{uu})\det(P^{vv})}}.

It turns out that for these new quantities, an equation of type (7) still holds, namely, $\tau_{ij}=\prod_{uv\in\overline{ij}}\tau_{uv}$ , so we again obtain the tree distance $d({i,j})=-\log|\tau_{ij}|$ .

Proposition 5.3.

In a general Markov model over a semi-labelled tree $T$ we can recover $T$ using Algorithm 1 from the distances $d({i,j})=-\log|\tau_{ij}|$ with $\tau_{ij}$ defined in (8).
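The multiplicativity of $\tau$ along a path can be verified numerically in the smallest case, a binary Markov chain $Y_{1}-Y_{2}-Y_{3}$. The sketch below uses only $2\times 2$ matrices in plain Python; the chain, its transition matrices, and all helper names are hypothetical example choices.

```python
import math


def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def diag(p):
    return [[p[0], 0.0], [0.0, p[1]]]


def push(p, m):
    """Marginal of the next variable: row vector p times transition matrix m."""
    return [p[0] * m[0][j] + p[1] * m[1][j] for j in range(2)]


def tau(joint, marg_u, marg_v):
    """Equation (8): det(P^{uv}) / sqrt(det(P^{uu}) det(P^{vv}))."""
    return det2(joint) / math.sqrt(det2(diag(marg_u)) * det2(diag(marg_v)))


pi1 = [0.3, 0.7]                # marginal of Y1
M12 = [[0.9, 0.1], [0.2, 0.8]]  # rows: P(Y2 | Y1)
M23 = [[0.6, 0.4], [0.3, 0.7]]  # rows: P(Y3 | Y2)

pi2 = push(pi1, M12)
pi3 = push(pi2, M23)
P12 = matmul2(diag(pi1), M12)                # joint of (Y1, Y2)
P23 = matmul2(diag(pi2), M23)                # joint of (Y2, Y3)
P13 = matmul2(diag(pi1), matmul2(M12, M23))  # joint of (Y1, Y3)

t12, t23, t13 = tau(P12, pi1, pi2), tau(P23, pi2, pi3), tau(P13, pi1, pi3)
```

Here `t13` agrees with `t12 * t23` up to rounding, so $-\log|\tau|$ is additive along the path.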

More generally, suppose that $Y_{v}$ are now potentially vector-valued, all in $\mathbb{R}^{k}$ , and it holds for every edge $uv$ of $T$ that $\mathbb{E}[Y_{u}|Y_{v}]$ is an affine function of $Y_{v}$ . Let

\Sigma_{uv}\;=\;{\rm cov}(Y_{u},Y_{v})\;=\;\mathbb{E}Y_{u}Y_{v}^{\top}-\mathbb{E}Y_{u}\mathbb{E}Y_{v}^{\top}.

In this case, defining $\tau_{uv}$ as

(9)

\tau_{uv}\;:=\;\det(\Sigma_{uu}^{-1/2}\Sigma_{uv}\Sigma^{-1/2}_{vv}),

we reach the same conclusion as for general Markov models; see Section 11.2.3 in Zwiernik (2018) for details.

Proposition 5.4.

In a linear model over a semi-labelled tree $T$ we can recover $T$ using Algorithm 1 from the distances $d({i,j})=-\log|\tau_{ij}|$ with $\tau_{ij}$ defined in (9).

5.3. Non-paranormal distributions

Suppose that $Z=(Z_{v})$ is a Gaussian vector with underlying latent tree $T=(V,E)$ . Suppose that $Y$ is a monotone transformation of $Z$ , so that $Y_{v}=f_{v}(Z_{v})$ for strictly monotone functions $f_{v}$ , $v\in V$ . The conditional independence structures of $Y$ and $Z$ are the same. The problem is that the correlation structure of $Y$ may be quite complicated and may not satisfy the path-product formula in (7). Suppose, however, that we have access to the Kendall- $\tau$ coefficients $K=[\kappa_{ij}]$ for $X$ with

(10)

\kappa_{ij}\;=\;\mathbb{E}[\mathbbm{1}(X_{i}>X_{i}^{\prime})\mathbbm{1}(X_{j}>X_{j}^{\prime})],

where $X^{\prime}$ is an independent copy of $X$ . Then we can use the fact that for all $i\neq j$

\mathbb{E}[\mathbbm{1}(X_{i}>X_{i}^{\prime})\mathbbm{1}(X_{j}>X_{j}^{\prime})]% \;=\;\mathbb{E}[\mathbbm{1}(Z_{i}>Z_{i}^{\prime})\mathbbm{1}(Z_{j}>Z_{j}^{% \prime})]\;=:\;\kappa^{Z}_{ij}.

As observed by Liu et al., (2012), since $Z$ is Gaussian, we have a simple formula that relates $\kappa^{Z}_{ij}$ to the correlation coefficient $\rho_{ij}={\rm corr}(Z_{i},Z_{j})$ , for all $i\neq j$

\rho_{ij}\;=\;\sin(\tfrac{\pi}{2}\kappa^{Z}_{ij})\;=\;\sin(\tfrac{\pi}{2}\kappa_{ij}).

Applying a simple transformation to the oracle $K$ gives us access to the underlying Gaussian correlation pattern, which now can be used as in the Gaussian case.

Proposition 5.5.

In a non-paranormal distribution over a semi-labelled tree $T$ we can recover $T$ using Algorithm 1 from the distances

d({i,j})\;:=\;-\log|\sin(\tfrac{\pi}{2}\kappa_{ij})|,

where $\kappa_{ij}$ is the Kendall- $\tau$ coefficient defined in (10).
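The transformation in Proposition 5.5 is a one-line map from the Kendall oracle to a distance oracle. The sketch below is only an illustration; the function name `kendall_to_distance` is hypothetical, and returning infinity when the implied correlation is zero is a convention chosen here.

```python
import math

def kendall_to_distance(kappa):
    """Map a Kendall coefficient kappa_ij to d(i,j) = -log|sin(pi/2 * kappa_ij)|."""
    rho = math.sin(0.5 * math.pi * kappa)  # implied Gaussian correlation
    if rho == 0.0:
        return math.inf                    # convention: uncorrelated pair
    return -math.log(abs(rho))

# kappa = 1/3 gives rho = sin(pi/6) = 1/2, hence d = log 2.
print(kendall_to_distance(1.0 / 3.0), math.log(2.0))
```

Because only $|\rho_{ij}|$ enters the distance, the sign of $\kappa_{ij}$ does not matter.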

6. Statistical guarantees

In this section, we illustrate how the results developed in this paper can be applied in a more realistic scenario when the entries of the covariance matrix cannot be measured exactly. In particular, suppose a random sample of size $N$ is observed from a zero-mean distribution with covariance matrix $\Sigma$ . Assume that $\Sigma_{ii}=1$ for all $i=1,\ldots,n$ and the correlations $\Sigma_{ij}=\rho_{ij}$ satisfy parametrization (7) for some semilabeled tree. In this case $d(i,j)=-\log|\rho_{ij}|$ for all $i\neq j$ forms a tree metric.
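To make the link between the path-product parametrization and the tree metric concrete, here is a minimal sketch on a hypothetical star tree with one latent centre and three regular leaves. The edge correlations in `edge_corr` are made up for illustration. Each $\rho_{ij}$ is the product of the two edge correlations on the path, so $d(i,j)=-\log|\rho_{ij}|$ is the sum of the two edge lengths.

```python
import math
from itertools import combinations

# Hypothetical star: latent centre, leaves 1, 2, 3, one correlation per edge.
edge_corr = {1: 0.9, 2: 0.8, 3: -0.5}

# Path-product formula: rho_ij is the product of edge correlations on the path.
rho = {(i, j): edge_corr[i] * edge_corr[j]
       for i, j in combinations(sorted(edge_corr), 2)}

# Distances between leaves and lengths of the individual edges.
d = {pair: -math.log(abs(r)) for pair, r in rho.items()}
edge_len = {i: -math.log(abs(c)) for i, c in edge_corr.items()}

# In a tree metric, the distance from leaf 1 to the centre is recovered as
# (d(1,2) + d(1,3) - d(2,3)) / 2.
leaf1_to_centre = 0.5 * (d[(1, 2)] + d[(1, 3)] - d[(2, 3)])
print(leaf1_to_centre, edge_len[1])
```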

Denote by $\widehat{\rho}_{ij}$ a suitable estimator of the correlations based on the sample and let $\widehat{d}(i,j)=-\log|\widehat{\rho}_{ij}|$ be the corresponding plug-in estimator of the distances. Since the $\widehat{d}(i,j)$ do not form a tree metric, we cannot apply Algorithm 1 directly. The algorithm uses distances at five places:

(1) In Algorithm 2, to compute $D(v)$.
(2) In Algorithm 2, to decide if $\rho(B_{i})$ is latent or not.
(3) In the case when $\rho(B_{i})$ is latent, Algorithm 2 also uses the distances to calculate $d(\rho(B_{i}),v)$ for all $v\in B_{i}$.
(4) In Algorithm 3, to group nodes in $B$ according to the connected components obtained by removing $\rho(B)$.
(5) When recovering the skeleton of a small subtree in the case when $|B|\leq\kappa$.

In order to adapt the algorithm to the “noisy” distance oracle, we first propose noisy versions of basic and explode. The procedure basic.noisy is outlined in Algorithm 5, and explode.noisy in Algorithm 6. The performance of the procedure depends on the following quantities:

(11)

\ell\;:=\;\min_{i\neq j}d({i,j}),\quad\qquad\mathit{u}\;:=\;\max_{i\neq j}d({i,j}),

where the minimum and maximum are taken over all regular nodes.

The algorithms have an additional input parameter $\epsilon>0$ that is an upper bound for the noise level. More precisely, the algorithms work correctly whenever $\max_{i\neq j}|\widehat{d}(i,j)-d(i,j)|\leq\epsilon$ .

Input: a bag $B$, $\alpha\in B$, and $\epsilon>0$;
for $v\in B$ do
    Compute $\widehat{D}(v)=\widehat{d}({v,\alpha})-\widehat{d}({v,\rho})$;
Order $v\in B$ according to the decreasing value of $\widehat{D}(v)$;
If $|\widehat{D}(u)-\widehat{D}(v)|\leq 4\epsilon$, assign $u,v$ to the same bag;
Denote the resulting bags by $B_{1},\ldots,B_{k}$;
for $i=1,\ldots,k$ do
    $u_{i}=\arg\min_{v\in B_{i}}\widehat{d}(\alpha,v)$;
    if $|\widehat{d}(\rho,u_{i})+\widehat{d}(u_{i},\alpha)-\widehat{d}(\rho,\alpha)|\leq 3\epsilon$ then
        $\rho(B_{i})=u_{i}$
    else
        Identify latent node $w_{i}$;
        Add $w_{i}$ to $B_{i}$;
        Set $\rho(B_{i})=w_{i}$;
        Calculate $\widehat{d}(w_{i},v)$ for all $v\in B_{i}$ using (2);
Return $B_{1},\ldots,B_{k}$ and the skeleton $(\rho(B_{1}),\rho(B_{2})),\ldots,(\rho(B_{k-1}),\rho(B_{k}))$;

Algorithm 5 basic.noisy$(B,\alpha,\epsilon)$
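The grouping step of basic.noisy can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' implementation: the names `make_dhat` and `noisy_bags` are hypothetical, the noisy distances are stored in a dictionary, and nodes are grouped by starting a new bag whenever two consecutive sorted values of $\widehat{D}$ differ by more than $4\epsilon$, which is one simple way to realize the rule stated in the algorithm. The example tree is a path $\rho - w_{1} - w_{2} - \alpha$ with unit edges, $x$ attached to $w_{1}$, and $y,z$ attached to $w_{2}$, each distance perturbed by at most $\epsilon=0.1$.

```python
def make_dhat(table):
    """Symmetric lookup for noisy distances, with dhat(v, v) = 0."""
    def dhat(u, v):
        if u == v:
            return 0.0
        return table[(u, v)] if (u, v) in table else table[(v, u)]
    return dhat

def noisy_bags(bag, alpha, rho, dhat, eps):
    """Group nodes of `bag` by the noisy statistic D(v) = dhat(v,alpha) - dhat(v,rho)."""
    score = {v: dhat(v, alpha) - dhat(v, rho) for v in bag}
    ordered = sorted(bag, key=lambda v: score[v], reverse=True)  # decreasing D
    bags = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if abs(score[prev] - score[cur]) <= 4 * eps:
            bags[-1].append(cur)   # same path node: stay in the current bag
        else:
            bags.append([cur])     # large jump in D: start a new bag
    return bags

# Hypothetical noisy distances to rho ("r") and alpha ("a").
noisy = {("r", "a"): 3.10, ("r", "x"): 1.90, ("r", "y"): 3.05, ("r", "z"): 2.95,
         ("a", "x"): 3.10, ("a", "y"): 1.90, ("a", "z"): 2.10}
bags = noisy_bags(["r", "x", "y", "z", "a"], "a", "r", make_dhat(noisy), eps=0.1)
print(bags)
```

In this example the nodes $y$ and $z$, which hang off the same path node, end up in one bag, while $\rho$, $x$, and $\alpha$ each form their own bag.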

The next simple fact is used repeatedly below.

Lemma 6.1.

Suppose $|\widehat{d}(i,j)-d(i,j)|\leq\epsilon$ for all regular $i\neq j$ and let $a\in\mathbb{R}^{\binom{n}{2}}$. Then $|a^{\top}(\widehat{d}-d)|\leq\|a\|_{1}\epsilon$. In particular,

(i)

if $a^{\top}d=0$ then $|a^{\top}\widehat{d}|\leq\|a\|_{1}\epsilon$ ;
(ii)

if $a^{\top}d\geq\eta$ then $a^{\top}\widehat{d}\geq\eta-\|a\|_{1}\epsilon$ .
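The lemma is stated without proof; the one-line argument, spelled out here for convenience, is the triangle inequality applied coordinate by coordinate:

```latex
|a^{\top}(\widehat{d}-d)|
  \;=\; \Big|\sum_{i<j} a_{ij}\,\big(\widehat{d}(i,j)-d(i,j)\big)\Big|
  \;\leq\; \sum_{i<j} |a_{ij}|\,\big|\widehat{d}(i,j)-d(i,j)\big|
  \;\leq\; \|a\|_{1}\,\epsilon .
```

Parts (i) and (ii) follow by writing $a^{\top}\widehat{d}=a^{\top}d+a^{\top}(\widehat{d}-d)$. For instance, a combination of three distances with coefficients $+1,+1,-1$ has $\|a\|_{1}=3$, which is where the thresholds of the form $3\epsilon$ in Algorithm 5 and Algorithm 6 come from.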

Input: a bag $B$;
while $|B|>1$ do
    Take $u\in B$ with $u\neq\rho(B)$;
    Set $B^{*}=\{v\in B,v\neq\rho(B):\widehat{d}(u,\rho(B))+\widehat{d}(v,\rho(B))-\widehat{d}(u,v)>3\epsilon\}\cup\{\rho(B)\}$;
    Set $\rho(B^{*})=\rho(B)$;
    Output $B^{*}$;
    Set $B\leftarrow\{\rho(B)\}\cup(B\setminus B^{*})$;

Algorithm 6 explode.noisy$(B,\epsilon)$
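A minimal sketch of explode.noisy follows, under the assumption that noisy distances are available through a lookup function. The names `make_dhat` and `explode_noisy` are hypothetical, and the sketch always keeps the pivot $u$ in its own group as a safeguard that guarantees termination. In the example, $p$ and $q$ share a latent neighbour of the root $w$, while $s$ lies in a different component.

```python
def make_dhat(table):
    """Symmetric lookup for noisy distances, with dhat(v, v) = 0."""
    def dhat(u, v):
        if u == v:
            return 0.0
        return table[(u, v)] if (u, v) in table else table[(v, u)]
    return dhat

def explode_noisy(bag, root, dhat, eps):
    """Split `bag` into groups, one per component left after removing `root`."""
    remaining = [v for v in bag if v != root]
    parts = []
    while remaining:
        u = remaining[0]  # pivot node
        # v shares a component with u when the paths u-root and v-root overlap,
        # i.e. when dhat(u,root) + dhat(v,root) - dhat(u,v) is clearly positive.
        star = [v for v in remaining
                if v == u or dhat(u, root) + dhat(v, root) - dhat(u, v) > 3 * eps]
        parts.append([root] + star)
        remaining = [v for v in remaining if v not in star]
    return parts

# Hypothetical noisy distances (true values 2, 2, 2, 1.5, 3.5, 3.5).
noisy = {("w", "p"): 2.05, ("w", "q"): 1.95, ("p", "q"): 2.10,
         ("w", "s"): 1.45, ("p", "s"): 3.40, ("q", "s"): 3.55}
parts = explode_noisy(["w", "p", "q", "s"], "w", make_dhat(noisy), eps=0.1)
print(parts)
```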

In what follows, we condition on the random event

\begin{aligned}
\mathcal{E}(\epsilon)\;&=\;\left\{\max_{i\neq j}|\widehat{d}(i,j)-d(i,j)|\leq\epsilon\;\;\mbox{for all regular }i\neq j\right\}\\
&=\;\left\{\max_{i\neq j}\left|\log\left|\frac{\widehat{\rho}_{ij}}{\rho_{ij}}\right|\right|\leq\epsilon\;\;\mbox{for all regular }i\neq j\right\}.
\end{aligned}

Proposition 6.2.

Suppose that in the current round, the event $\mathcal{E}(\epsilon)$ holds with $\epsilon<\ell/4$. Then Algorithm 5 and Algorithm 6 applied to the noisy distances give the same output as Algorithm 2 and Algorithm 3 applied to their noiseless versions.

Proof.

In the first part, Algorithm 5 computes $\widehat{D}(v)$ for all $v$ and it uses this information to produce bags $B_{1},\ldots,B_{k}$. The bags in Algorithm 2 are obtained by grouping nodes based on the increasing values of $D(v)$. By Lemma 6.1(i), if $D(u)=D(v)$ then $|\widehat{D}(u)-\widehat{D}(v)|\;\leq\;4\epsilon$. Moreover, if $D(u)>D(v)$ then it must be that $D(u)-D(v)\geq 2\ell$ and so, by Lemma 6.1(ii), $\widehat{D}(u)-\widehat{D}(v)\geq 2\ell-4\epsilon>4\epsilon$. This shows that this step of Algorithm 5 provides the same bags as Algorithm 2. In the second part of the algorithm, we decide whether or not the corresponding path nodes $\rho(B_{i})$ are regular. Let $\hat{u}_{i}=\arg\min_{v\in B_{i}}\widehat{d}(\alpha,v)$ and $u_{i}=\arg\min_{v\in B_{i}}d(\alpha,v)$. If $\hat{u}_{i}\neq u_{i}$ then $d({\alpha,\hat{u}_{i}})-d({\alpha,u_{i}})\geq\ell$. By Lemma 6.1(ii),

\widehat{d}({\alpha,\hat{u}_{i}})-\widehat{d}({\alpha,u_{i}})\;\geq\;\ell-2\epsilon\;>\;2\epsilon,

which contradicts the fact that $\hat{u}_{i}=\arg\min_{v\in B_{i}}\widehat{d}({\alpha,v})$ . We conclude that $\hat{u}_{i}=u_{i}$ . Now consider the problem of deciding whether $\rho(B_{i})=u_{i}$ . Suppose first that $d({\rho,u_{i}})+d({\alpha,u_{i}})-d({\rho,\alpha})=0$ (i.e., $\rho(B_{i})=u_{i}$ ). By Lemma 6.1(i),

|\widehat{d}({\rho,u_{i}})+\widehat{d}({\alpha,u_{i}})-\widehat{d}({\rho,\alpha})|\;\leq\;3\epsilon.

If $d({\rho,u_{i}})+d({\alpha,u_{i}})-d({\rho,\alpha})>0$ then, since $d$ is a tree metric, it must be that $d({\rho,u_{i}})+d({\alpha,u_{i}})-d({\rho,\alpha})\geq 2\ell$. Then, by Lemma 6.1(ii) and by the fact that $\epsilon<\ell/4$,

\widehat{d}({\rho,u_{i}})+\widehat{d}({\alpha,u_{i}})-\widehat{d}({\rho,\alpha})\;\geq\;2\ell-3\epsilon\;>\;5\epsilon.

This shows the correctness of the second part of Algorithm 5.

Since $d(v,\alpha)+d(v,\rho)-d(\alpha,\rho)\geq 2\ell$, we get $\widehat{d}(v,\alpha)+\widehat{d}(v,\rho)-\widehat{d}(\alpha,\rho)\geq 2\ell-3\epsilon$.

We can similarly show that Algorithm 6 gives the same output as Algorithm 3. ∎
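The second part of Algorithm 5, as analysed in the proof above, reduces to a single thresholded comparison. The sketch below uses a hypothetical helper `classify_path_node`; it also returns half of the tested quantity, which in an exact tree metric equals the distance from $u_{i}$ to the path between $\rho$ and $\alpha$. The numbers are illustrative noisy distances with $\epsilon=0.1$.

```python
def classify_path_node(d_rho_u, d_u_alpha, d_rho_alpha, eps):
    """Return (is_regular, offset) for the candidate node u.

    is_regular: True when u is judged to lie on the rho-alpha path, i.e.
                |dhat(rho,u) + dhat(u,alpha) - dhat(rho,alpha)| <= 3*eps.
    offset:     half of the tested quantity, an estimate of the distance
                from u to the rho-alpha path.
    """
    gap = d_rho_u + d_u_alpha - d_rho_alpha
    return abs(gap) <= 3 * eps, 0.5 * gap

# u on the path (true distances 1, 2, 3): the gap is within the noise.
on_path, _ = classify_path_node(1.05, 2.10, 2.95, eps=0.1)

# u hanging off a latent path node at distance 1 (true distances 2, 3, 3).
off_path, offset = classify_path_node(1.90, 3.10, 2.95, eps=0.1)
print(on_path, off_path, offset)
```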

The problem with applying Proposition 6.2 recursively to each round is that the event $\mathcal{E}(\epsilon)$ only bounds the noise for distances between the $n$ originally regular nodes. As the procedure progresses, new nodes are made regular; if $\rho(B_{i})$ is latent we make it “regular” by updating distances $\widehat{d}(w_{i},v)$ for all $v\in B_{i}$ using (2).

Lemma 6.3.

Suppose that event $\mathcal{E}(\epsilon)$ holds. After $R$ rounds of the algorithm, we have $|\widehat{d}(i,j)-d(i,j)|\leq(1+\tfrac{R}{2})\epsilon$ for all $i,j\in B$ that are regular in the $R$ -th round.

Proof.

Consider the first run of Algorithm 5. In this case, $\rho(B)$ and $\alpha$ are both regular. If $w_{i}$ is identified as a latent node, we calculate $\widehat{d}(w_{i},v)$ for all $v\in B_{i}$ using (2). By the triangle inequality, $|\widehat{d}(w_{i},v)-d(w_{i},v)|\leq\tfrac{3}{2}\epsilon$ and this bound is sharp. This establishes the case $R=1$ . Now suppose the bound in the lemma holds up to the $(R-1)$ -st round. In the $R$ -th round, $\rho(B)$ may be a latent node added in the previous call but $\alpha$ is still sampled from the originally regular nodes. If $\rho(B_{i})$ is identified as a latent node, we calculate $\widehat{d}(\rho(B_{i}),v)$ for all $v\in B_{i}$ . Since $v$ is also among the originally regular nodes, we conclude $|\widehat{d}(v,\alpha)-d(v,\alpha)|\leq\epsilon$ . By induction, $|\widehat{d}(v,\rho(B))-d(v,\rho(B))|\leq(1+\tfrac{R-1}{2})\epsilon$ and $|\widehat{d}(\alpha,\rho(B))-d(\alpha,\rho(B))|\leq(1+\tfrac{R-1}{2})\epsilon$ . By the triangle inequality,

|\widehat{d}(\rho(B_{i}),v)-d(\rho(B_{i}),v)|\;\leq\;\tfrac{1}{2}(\epsilon+(1+\tfrac{R-1}{2})\epsilon+(1+\tfrac{R-1}{2})\epsilon)\;=\;(1+\tfrac{R}{2})\epsilon.

The result now follows by induction. ∎
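The induction in the proof amounts to the recursion $e_{R}=\tfrac{1}{2}(\epsilon+2e_{R-1})$ with $e_{0}=\epsilon$. The short check below, using exact rational arithmetic and a hypothetical helper name, confirms the closed form $(1+\tfrac{R}{2})\epsilon$ for the first few rounds.

```python
from fractions import Fraction

def error_bound_after(rounds, eps=Fraction(1)):
    """Iterate e_R = (eps + 2 * e_{R-1}) / 2 starting from e_0 = eps."""
    bound = eps  # distances between originally regular nodes
    for _ in range(rounds):
        # one term with error eps and two terms with the previous bound
        bound = Fraction(1, 2) * (eps + bound + bound)
    return bound

for R in range(6):
    print(R, error_bound_after(R))  # in units of eps
```

The bound grows only linearly in the number of rounds, which is why a logarithmic number of rounds costs a logarithmic factor in the admissible noise level.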

It is generally easy to show that the event $\mathcal{E}(\epsilon)$ holds with probability at least $1-\eta$ as long as the sample size $N$ is large enough. Let $\delta=1-e^{-\epsilon}$ and suppose that the following event holds

\mathcal{E}^{\prime}(\delta)\;:=\;\left\{|\widehat{\rho}_{ij}-\rho_{ij}|\leq|\rho_{ij}|\delta\;\;\mbox{ for all regular }i\neq j\right\}.

Since $\delta<1$ , under the event $\mathcal{E}^{\prime}$ the signs of $\widehat{\rho}_{ij}$ and $\rho_{ij}$ are the same. It is easy to see that $\mathcal{E}^{\prime}\subseteq\mathcal{E}$ . Indeed, under $\mathcal{E}^{\prime}$ , $\frac{\hat{\rho}_{ij}}{\rho_{ij}}>0$ and for all $1-\delta\leq x\leq 1+\delta$ we have $\log(1-\delta)\leq\log(x)\leq\log(1+\delta)$ . It follows that

\left|\log\frac{\hat{\rho}_{ij}}{\rho_{ij}}\right|\;\leq\;\max\left\{\log(1+\delta),|\log(1-\delta)|\right\}\;=\;-\log(1-\delta)\;=\;\epsilon.
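The inclusion $\mathcal{E}^{\prime}\subseteq\mathcal{E}$ rests on the fact that, with $\delta=1-e^{-\epsilon}$, the larger of $\log(1+\delta)$ and $|\log(1-\delta)|$ is exactly $\epsilon$. A quick numerical check, with an illustrative value of $\epsilon$ and a hypothetical helper name:

```python
import math

def delta_from_eps(eps):
    """Relative accuracy delta = 1 - exp(-eps) used in the event E'(delta)."""
    return 1.0 - math.exp(-eps)

eps = 0.1
delta = delta_from_eps(eps)

# Worst-case |log(ratio)| over ratios in [1 - delta, 1 + delta].
worst = max(abs(math.log(1.0 - delta)), abs(math.log(1.0 + delta)))
print(delta, worst)
```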

It is then enough to bound the probability of the event $\mathcal{E}^{\prime}$ . To illustrate how this can be done without going into unnecessary technicalities, suppose $\max_{i}\mathbb{E}X_{i}^{4}\leq\kappa$ for some $\kappa>0$ . In this case ${\rm var}(X_{i}X_{j})\leq\kappa$ , and therefore one may use the median-of-means estimator (see, e.g., Lugosi & Mendelson, (2019)) to estimate $\rho_{ij}=\mathbb{E}[X_{i}X_{j}]$ . We get the following result.

Theorem 6.4.

Suppose a random sample of size $N$ is generated from a mean zero distribution with covariance matrix $\Sigma$ satisfying $\Sigma_{ii}=1$ and suppose $\max_{i}\mathbb{E}X_{i}^{4}\leq\kappa$ for some $\kappa>0$ . Let $\ell,\mathit{u}$ be as defined in (11). Fix $\eta\in(0,1)$ and suppose

\epsilon\;\leq\;\frac{\ell}{4(1+\log_{\Delta}(n))}\qquad\mbox{and}\qquad N\;\geq\;\frac{64\kappa\log(n/\eta)}{e^{-2\mathit{u}}(1-e^{-\epsilon})^{2}}.

Then the noisy version of Algorithm 1 correctly recovers the underlying semi-labeled tree with probability at least $1-\eta$.

Proof.

Suppose the event $\mathcal{E}(\epsilon)$ holds. As in the proof of Theorem 3.1, we modify the procedure by repeating bigsplit.noisy until the largest bag is bounded in size by $|B|/\sqrt{\Delta}$. With this modification, the algorithm stops after at most $2\log_{\Delta}(n)$ rounds. With our choice of $\epsilon$, by Lemma 6.3, we are guaranteed that all computed distances satisfy $|\widehat{d}(u,v)-d(u,v)|\leq(1+\log_{\Delta}(n))\epsilon\leq\ell/4$ in the first $2\log_{\Delta}(n)$ rounds. By Proposition 6.2, all these subsequent calls of bigsplit.noisy return the same answer as bigsplit applied to noiseless distances. The proof now follows if we can show that, with probability at least $1-\eta$, event $\mathcal{E}(\epsilon)$ holds. We show that $\mathcal{E}^{\prime}(\delta)$ with $\delta=1-e^{-\epsilon}$ holds, which is a stronger condition. Recall that $\mathbb{E}X_{i}^{2}=1$, $\mathbb{E}X_{i}^{4}\leq\kappa$ and, in consequence, $\operatorname{var}(X_{i}X_{j})\leq\kappa$. We want to estimate $\mathbb{E}(X_{i}X_{j})=\rho_{ij}$. By Theorem 2 in Lugosi & Mendelson, (2019), the median-of-means estimator $\widehat{\rho}_{ij}$ (with appropriately chosen number of blocks that depends on $\eta$ only) satisfies that with probability at least $1-2\eta/\binom{n}{2}$

|\widehat{\rho}_{ij}-\rho_{ij}|\;\leq\;\sqrt{\frac{32\kappa\log({\binom{n}{2}}/{\eta})}{N}}.

Thus, we get that with probability at least $1-\eta$ , all $\widehat{\rho}_{ij}$ satisfy simultaneously that $|\widehat{\rho}_{ij}-\rho_{ij}|\leq|\rho_{ij}|\delta$ as long as

N\;\geq\;\frac{64\kappa\log(n/\eta)}{\delta^{2}\min_{i\neq j}\rho_{ij}^{2}}\;=\;\frac{64\kappa\log(n/\eta)}{e^{-2\mathit{u}}\delta^{2}}.

The inequality $\epsilon<\tfrac{\ell}{4(1+\log_{\Delta}(n))}$ is equivalent to $\delta<1-e^{-\tfrac{\ell}{4(1+\log_{\Delta}(n))}}$ . Thus, we require

N\;\geq\;\frac{64\kappa\log(n/\eta)}{e^{-2\mathit{u}}\left(1-e^{-\tfrac{\ell}{4(1+\log_{\Delta}(n))}}\right)^{2}}.

∎
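The two ingredients of the proof, the median-of-means estimator and the sample-size condition $N\geq 64\kappa\log(n/\eta)/(\delta^{2}\min_{i\neq j}\rho_{ij}^{2})$ with $\delta=1-e^{-\epsilon}$, can be sketched as follows. This is an illustration only: the function names are hypothetical, the number of blocks is left as a free parameter rather than tuned to $\eta$, and the sample values are made up.

```python
import math
import statistics

def median_of_means(values, num_blocks):
    """Split `values` into equal blocks, average each block, return the median.

    Trailing values that do not fill a complete block are ignored.
    """
    block = len(values) // num_blocks
    means = [sum(values[b * block:(b + 1) * block]) / block
             for b in range(num_blocks)]
    return statistics.median(means)

def required_sample_size(kappa, n, eta, u, eps):
    """Smallest integer N with N >= 64*kappa*log(n/eta) / (exp(-2u) * delta^2)."""
    delta = 1.0 - math.exp(-eps)
    return math.ceil(64.0 * kappa * math.log(n / eta)
                     / (math.exp(-2.0 * u) * delta ** 2))

# A single outlier (100) moves the plain mean a lot but not the median of means.
products = [1, 2, 3, 4, 5, 6, 100, 8, 9]   # stand-ins for the values X_i * X_j
print(median_of_means(products, num_blocks=3))  # block means are 2, 5, 39

# The requirement grows as the tolerated noise level eps shrinks.
print(required_sample_size(kappa=3.0, n=10, eta=0.1, u=1.0, eps=0.2))
print(required_sample_size(kappa=3.0, n=10, eta=0.1, u=1.0, eps=0.1))
```

To estimate $\rho_{ij}$ one would pass the products $X_{i}X_{j}$ over the sample as `values`. The factor $e^{2\mathit{u}}$ in the bound reflects that the weakest correlation, $e^{-\mathit{u}}$, must be estimated to relative accuracy $\delta$.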

Acknowledgements

LD was supported by NSERC grant A3456. GL acknowledges the support of Ayudas Fundación BBVA a Proyectos de Investigación Científica 2021 and the Spanish Ministry of Economy and Competitiveness grant PID2022-138268NB-I00, financed by MCIN/AEI/10.13039/501100011033, FSE+MTM2015-67304-P, and FEDER, EU. PZ was supported by NSERC grant RGPIN-2023-03481.

References

Afshar et al., (2020) Afshar, Ramtin, Goodrich, Michael T, Matias, Pedro, & Osegueda, Martha C. 2020. Reconstructing biological and digital phylogenetic trees in parallel. In: 28th Annual European Symposium on Algorithms (ESA 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
Aizenbud et al., (2021) Aizenbud, Yariv, Jaffe, Ariel, Wang, Meng, Hu, Amber, Amsel, Noah, Nadler, Boaz, Chang, Joseph T., & Kluger, Yuval. 2021. Spectral top-down recovery of latent tree models. arXiv preprint arXiv:2102.13276.
Anandkumar et al., (2011) Anandkumar, Animashree, Chaudhuri, Kamalika, Hsu, Daniel J., Kakade, Sham M., Song, Le, & Zhang, Tong. 2011. Spectral methods for learning multivariate latent tree structure. In: Advances in Neural Information Processing Systems, vol. 24.
Buneman, (1974) Buneman, Peter. 1974. A note on the metric properties of trees. Journal of Combinatorial Theory Series B, 17(1), 48–50.
Chen et al., (2019) Chen, Peixian, Chen, Zhourong, & Zhang, Nevin L. 2019. A novel document generation process for topic detection based on hierarchical latent tree models. Pages 265–276 of: Symbolic and Quantitative Approaches to Reasoning with Uncertainty: 15th European Conference, ECSQARU 2019, Belgrade, Serbia, September 18-20, 2019, Proceedings 15. Springer.
Choi et al., (2011) Choi, Myung Jin, Tan, Vincent Y.F., Anandkumar, Animashree, & Willsky, Alan S. 2011. Learning latent tree graphical models. Journal of Machine Learning Research, 12, 1771–1812.
Côme et al., (2021) Côme, Etienne, Jouvin, Nicolas, Latouche, Pierre, & Bouveyron, Charles. 2021. Hierarchical clustering with discrete latent variable models and the integrated classification likelihood. Advances in Data Analysis and Classification, 15(4), 957–986.
Hein, (1989) Hein, Jotun J. 1989. An optimal algorithm to reconstruct trees from additive distance data. Bulletin of Mathematical Biology, 51(5), 597–603.
Huang et al., (2020) Huang, Furong, Naresh, Niranjan Uma, Perros, Ioakeim, Chen, Robert, Sun, Jimeng, & Anandkumar, Anima. 2020. Guaranteed scalable learning of latent tree models. Pages 883–893 of: Uncertainty in Artificial Intelligence. PMLR.
Jaffe et al., (2021) Jaffe, Ariel, Amsel, Noah, Aizenbud, Yariv, Nadler, Boaz, Chang, Joseph T., & Kluger, Yuval. 2021. Spectral neighbor joining for reconstruction of latent tree models. SIAM Journal on Mathematics of Data Science, 3(1), 113–141.
Kandiros et al., (2023) Kandiros, Vardis, Daskalakis, Constantinos, Dagan, Yuval, & Choo, Davin. 2023. Learning and testing latent-tree Ising models efficiently. Pages 1666–1729 of: The Thirty Sixth Annual Conference on Learning Theory. PMLR.
Kannan et al., (1996) Kannan, Sampath K., Lawler, Eugene L., & Warnow, Tandy J. 1996. Determining the evolutionary tree using experiments. Journal of Algorithms, 21(1), 26–50.
King et al., (2003) King, Valerie, Zhang, Li, & Zhou, Yunhong. 2003. On the complexity of distance-based evolutionary tree reconstruction. Page 444 of: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM.
Lauritzen, (1996) Lauritzen, S. L. 1996. Graphical Models. Oxford, United Kingdom: Clarendon Press.
Liu et al., (2012) Liu, Han, Han, Fang, Yuan, Ming, Lafferty, John, & Wasserman, Larry. 2012. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4), 2293–2326.
Lugosi & Mendelson, (2019) Lugosi, Gábor, & Mendelson, Shahar. 2019. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5), 1145–1190.
Mourad et al., (2013) Mourad, Raphaël, Sinoquet, Christine, Zhang, Nevin Lianwen, Liu, Tengfei, & Leray, Philippe. 2013. A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47, 157–203.
Pearl, (1988) Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. The Morgan Kaufmann Series in Representation and Reasoning. Morgan Kaufmann, San Mateo, CA.
Semple & Steel, (2003) Semple, Charles, & Steel, Mike A. 2003. Phylogenetics. Vol. 24. Oxford University Press on Demand.
Shiers et al., (2016) Shiers, Nathaniel, Zwiernik, Piotr, Aston, John A.D., & Smith, James Q. 2016. The correlation space of Gaussian latent tree models and model selection without fitting. Biometrika, 103(3), 531–545.
Sturma et al., (2022) Sturma, Nils, Drton, Mathias, & Leung, Dennis. 2022. Testing many and possibly singular polynomial constraints. arXiv preprint arXiv:2208.11756.
Waterman et al., (1977) Waterman, Michael S, Smith, Temple F, Singh, Mona, & Beyer, William A. 1977. Additive evolutionary trees. Journal of Theoretical Biology, 64(2), 199–213.
Zhang et al., (2022) Zhang, Haipeng, Zhao, Jian, Wang, Xiaoyu, & Xuan, Yi. 2022. Low-voltage distribution grid topology identification with latent tree model. IEEE Transactions on Smart Grid, 13(3), 2158–2169.
Zhang & Poon, (2017) Zhang, Nevin, & Poon, Leonard. 2017. Latent tree analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31.
Zhang, (2004) Zhang, Nevin L. 2004. Hierarchical latent class models for cluster analysis. The Journal of Machine Learning Research, 5, 697–723.
Zhou et al., (2020) Zhou, Can, Wang, Xiaofei, & Guo, Jianhua. 2020. Learning mixed latent tree models. Journal of Machine Learning Research, 21(249), 1–35.
Zwiernik, (2018) Zwiernik, Piotr. 2018. Latent tree models. Pages 283–306 of: Handbook of Graphical Models. CRC Press.

\begin{aligned}
\mathbb{P}\left\{N_{0}\geq k\right\}\;&=\;\frac{|B|-1-k}{|B|-1}\times\frac{|B|-2-k}{|B|-2}\times\cdots\times\frac{|B|-\kappa-k}{|B|-\kappa}\\
&\leq\;\left(1-\frac{k}{|B|-1}\right)^{\kappa}\\
&\leq\;\exp\left(-\frac{k\kappa}{|B|-1}\right).
\end{aligned}