Approximate learning of parsimonious
Bayesian context trees

Daniyar Ghani (ORCID 0000-0001-8611-9966), Department of Mathematics, Imperial College London
180 Queen’s Gate, London SW7 2AZ, United Kingdom
Nicholas A. Heard (ORCID 0000-0002-8767-0810), Department of Mathematics, Imperial College London
180 Queen’s Gate, London SW7 2AZ, United Kingdom
Francesco Sanna Passino (ORCID 0000-0002-4571-6681), Department of Mathematics, Imperial College London
180 Queen’s Gate, London SW7 2AZ, United Kingdom
Abstract

Models for categorical sequences typically assume exchangeable or first-order dependent sequence elements. These are common assumptions, for example, in models of computer malware traces and protein sequences. Although such simplifying assumptions lead to computational tractability, these models fail to capture long-range, complex dependence structures that may be harnessed for greater predictive power. To this end, a Bayesian modelling framework is proposed to parsimoniously capture rich dependence structures in categorical sequences, with memory efficiency suitable for real-time processing of data streams. Parsimonious Bayesian context trees are introduced as a form of variable-order Markov model with conjugate prior distributions. The novel framework requires fewer parameters than fixed-order Markov models by dropping redundant dependencies and clustering sequential contexts. Approximate inference on the context tree structure is performed via a computationally efficient model-based agglomerative clustering procedure. The proposed framework is tested on synthetic and real-world data examples, and it outperforms existing sequence models when fitted to real protein sequences and honeypot computer terminal sessions.

Keywords — categorical sequences, context trees, Markov models, model-based clustering.

1 Introduction

Sequences of categorical data are ubiquitous in application areas such as language processing, bioinformatics and cyber-security. Statistical models for categorical sequences typically involve Markov assumptions (see, for example, Lewis, 2001; Mächler and Bühlmann, 2004), which yield easily interpretable and powerful methods for sequential prediction, changepoint detection and classification. Fixed order-$D$ Markov models are the natural extension of order-1 Markov chains, where the prediction of a new sequence element depends on the values of the previous $D$ elements, known as the context. However, while fixed-order Markov models can theoretically learn rich dependence structures, they are limited by the exponential growth of the parameter space. Given a state space, or vocabulary, of $V$ elements, a fixed order-$D$ Markov model requires $V^D(V-1)$ parameters. The computational cost and storage requirements become excessively large as the vocabulary size $V$ or the Markov order $D$ grows. In contrast, models using short contexts are computationally tractable but can have inferior predictive performance.
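
To make this growth concrete, the following minimal Python sketch evaluates $V^D(V-1)$ for a few Markov orders. The function name `n_free_parameters` and the choice $V=20$ are illustrative assumptions and are not taken from the experiments in this article.

```python
def n_free_parameters(V: int, D: int) -> int:
    """Free parameters of a fixed order-D Markov model over V symbols.

    Each of the V**D contexts carries a categorical distribution over V
    symbols, which has V - 1 free probabilities.
    """
    return V**D * (V - 1)


# Illustrative vocabulary size (an assumption, chosen only for the example).
for D in range(5):
    print(f"D={D}: {n_free_parameters(20, D)} parameters")
```

Each additional unit of order multiplies the parameter count by $V$, which is the cost that the parsimonious models discussed next aim to avoid.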

To retain the ability to learn complex, long-range dependence structures while reducing the size of the parameter space, many parsimonious alternatives to fixed higher-order Markov models have been proposed in the literature. Rissanen (1983) introduces variable-order Markov models (VOMMs) for the purpose of data compression, where the Markov order depends on the context of the current element. Unlike fixed order-$D$ Markov models, the context lengths used for prediction in VOMMs can vary depending on the data. Inference algorithms for these models are explored in Begleiter et al. (2004).

Bourguignon and Robelin (2004) extend variable-order Markov models in the form of a parsimonious context tree (PCT) structure, where nodes are fused together so that leaves may correspond to multiple contexts sharing the same predictive distributions. Developments in PCT learning algorithms are centred around dynamic programming (Eggeling et al., 2013, 2019), and while the efficiency of PCT learning has improved, the combinatorially large context trees required at initialisation limit the feasible vocabulary to $V\approx 10$.

Sparse Markov models (Jääskinen et al., 2014; Xiong et al., 2016; Bennett et al., 2023) and minimal Markov models (García and González-López, 2011, 2017) are further generalisations of VOMMs with no hierarchical context tree structure; contexts of varying lengths are grouped into equivalence classes with the same predictive distributions. Such flexible modelling can uncover interesting dependence structures, yet inference remains challenging.

Bayesian context trees (BCT, Kontoyiannis et al., 2022; Papageorgiou and Kontoyiannis, 2024) are a recent area of research focusing on exact Bayesian inference for VOMMs. BCT inference algorithms can efficiently recover long-range dependencies, and enable computation of quantities such as the prior predictive likelihood and model posterior probabilities, useful for downstream analysis. However, BCTs are still limited to small vocabularies.

In this article, parsimonious Bayesian context trees (PBCT) are introduced as a new class of variable-order Markov model. The main difference between the proposed model and existing Bayesian context trees is the clustering of contexts: the vocabulary is repeatedly partitioned in the tree to yield a variable-order Markov model over clusters. The advantages of context clustering are two-fold: a significant reduction in dimensionality and the ability to borrow statistical strength across similar contexts. To generate the structure of a PBCT, a natural recursive partitioning procedure is developed using the Chinese restaurant process (CRP, Aldous, 1985). Dirichlet priors are placed on the unknown predictive distributions associated with each leaf of the tree to allow simple evaluation of the marginal likelihood of a sequence $\bm{x}$ given a context tree structure.

A novel algorithm is developed for efficient approximate inference of parsimonious Bayesian context trees, using model-based agglomerative clustering. The method proposed for inference enables the analysis of data with larger vocabulary sizes compared to existing inference schemes for similar context tree models. Results are obtained using synthetic sequences as well as real-world examples in cyber-security and bioinformatics, demonstrating the ability of the PBCT model to perfectly recover simulated model structures and outperform existing models for sequential prediction.

In Section 2, the parsimonious Bayesian context tree model is described, including the assumed generative process. Section 3 outlines the model-based clustering inference scheme for learning PBCTs. Results of a simulation study are presented in Section 4, and real data examples are investigated in Section 5.

2 Parsimonious Bayesian context trees

2.1 Markov models

Consider a discrete vocabulary $\mathcal{V}=\{1,2,\dots,V\}$, $V\geq 1$, and a potentially infinite sequence $x_1,x_2,\dots$, where each $x_i\in\mathcal{V}$. Suppose the first $N$ sequence elements have been observed as $\bm{x}=(x_1,\dots,x_N)\in\mathcal{V}^N$. For $D\geq 0$, an order-$D$ Markov model predicts the next element of the sequence $x_{N+1}$ conditional on the previous $D$ elements:

$$p(x_{N+1}\,|\,x_N,\dots,x_1)=\begin{cases}p(x_{N+1}), & \text{if } D=0,\\ p(x_{N+1}\,|\,x_N,\dots,x_{N-D+1}), & \text{if } D\geq 1.\end{cases} \tag{2.1}$$

Equivalently, the next element depends on its length-$D$ context $(x_N,\dots,x_{N-D+1})\in\mathcal{V}^D$.

Markov models can be represented naturally using context trees, whose formal definition is given in Section 2.2. The root node of the tree represents the unobserved next element $x_{N+1}$ of a sequence, and the non-root nodes represent possible values of observed elements. The depth of a node corresponds to context length. In context tree representations of standard Markov models, nodes contain single elements of the vocabulary $\mathcal{V}$. In this article, parsimonious context trees are introduced as representations of parsimonious Markov models (Bourguignon and Robelin, 2004), where nodes contain subsets of elements of $\mathcal{V}$. Each path from the root to a leaf node represents a collection of contexts which share the same predictive distribution for the next sequential element.

2.2 Context trees

A tree $\bm{\mathcal{T}}=(\bm{\mathcal{C}},\,\bm{\mathcal{E}})$ is an undirected graph with no cycles, where $\bm{\mathcal{C}}$ denotes the nodes of the graph and $\bm{\mathcal{E}}\subset\bm{\mathcal{C}}\times\bm{\mathcal{C}}$ is a set of edges. In predictive models for sequences over a vocabulary $\mathcal{V}$, the nodes $\bm{\mathcal{C}}$ of a parsimonious context tree $\bm{\mathcal{T}}$ each contain subsets of $\mathcal{V}$, and are indexed according to their position within the tree; specifically, each node is associated with an index in $\mathbb{N}^d$ for some $0\leq d\leq D$, whose length corresponds to its depth in the tree hierarchy, and $D\geq 0$ is the maximum depth of the tree.

Let $\mathbb{E}_D=\bigcup_{d=0}^{D}\mathbb{N}^d$ denote the set of all possible indices for nodes in $\bm{\mathcal{T}}$. Each node in $\bm{\mathcal{C}}$ will be uniquely referred to as $C_{\bm{e}}$, for an index $\bm{e}\in\mathbb{E}_D$. The root node $C\equiv C_{\emptyset}=\mathcal{V}$ is indexed by the empty tuple. For any node pair $(C_{\bm{e}},\,C_{\bm{e}'})\in\bm{\mathcal{E}}$, where $\bm{e},\bm{e}'\in\mathbb{E}_D$, the node $C_{\bm{e}'}$ is said to be a child of $C_{\bm{e}}$, and $\bm{e}'$ must be the concatenation $(\bm{e},k)$ for some $k\in\mathbb{N}$. Therefore, if $C_{\bm{e}}$ has $K\geq 1$ children, these are indexed by sequences in $\{\bm{e}\}\times\{1,\dots,K\}$, written $\mathcal{C}_{\bm{e}}=\text{children}(C_{\bm{e}})=\{C_{\bm{e}1},C_{\bm{e}2},\dots,C_{\bm{e}K}\}$. Crucially, every set $\mathcal{C}_{\bm{e}}$ of child nodes must form a partition of $\mathcal{V}$, so that $\bigcup_{k=1}^{K}C_{\bm{e}k}=\mathcal{V}$ and $C_{\bm{e}k}\cap C_{\bm{e}k'}=\emptyset$ for any $k\neq k'$. A leaf node has no children, and each leaf node is associated with a unique conditional probability distribution (2.1) for the next element in a sequence, given a context matching the corresponding path through the tree.
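
The partition constraint can be checked mechanically. The sketch below is one possible in-memory representation, not the authors' implementation: the class `Node` and the function `is_valid` are hypothetical names, and the example tree uses the node sets of the four-leaf parsimonious tree over $\{\texttt{A},\texttt{B},\texttt{C}\}$ that serves as the running example in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a parsimonious context tree: a subset of the vocabulary."""
    symbols: frozenset
    children: list = field(default_factory=list)  # ordered, so child k has index (e, k)


def is_valid(node, vocabulary, depth=0, max_depth=None):
    """Check that the children of every non-leaf node partition the vocabulary."""
    if max_depth is not None and depth > max_depth:
        return False  # node lies deeper than the maximum depth D
    if not node.children:
        return True  # a leaf has no children, so there is nothing to check
    seen = set()
    for child in node.children:
        if not child.symbols or (seen & child.symbols):
            return False  # empty block, or blocks that overlap
        seen |= child.symbols
    if seen != set(vocabulary):
        return False  # blocks do not cover the whole vocabulary
    return all(is_valid(c, vocabulary, depth + 1, max_depth) for c in node.children)


V = frozenset("ABC")
tree = Node(V, [
    Node(frozenset("AC"), [                      # C_1
        Node(frozenset("AB")),                   # C_11 (leaf)
        Node(frozenset("C"), [                   # C_12
            Node(frozenset("A")),                # C_121 (leaf)
            Node(frozenset("BC")),               # C_122 (leaf)
        ]),
    ]),
    Node(frozenset("B")),                        # C_2 (leaf)
])
print(is_valid(tree, V, max_depth=3))
```

Storing children in an ordered list means the index $\bm{e}$ of a node never needs to be stored explicitly; it is recovered from the positions visited on the way down from the root.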

[Figure 1: two panels, "Full order-2 tree" and "Parsimonious context tree".]
Figure 1: Comparison of tree structures for fixed order-2 Markov and parsimonious context tree models over the vocabulary $\mathcal{V}=\{\texttt{A},\texttt{B},\texttt{C}\}$.

Figure 1 compares the tree structures representing a fixed order-2 Markov model ($D=2$) and a parsimonious context tree with $D=3$. To understand how the tree structures determine sequential prediction, consider a sequence $\bm{x}\in\mathcal{V}^N$ over a 3-term vocabulary $\mathcal{V}=\{\texttt{A},\texttt{B},\texttt{C}\}$. Suppose the last three elements of the sequence $\bm{x}$ are observed as $(x_N,x_{N-1},x_{N-2})=(\texttt{A},\texttt{C},\texttt{B})$. In the order-2 model, the distribution of the next element is given by $p(x_{N+1}\,|\,\bm{x})=p(x_{N+1}\,|\,x_N=\texttt{A},\,x_{N-1}=\texttt{C})$, corresponding to the path: root $\to$ A $\to$ C. Instead, in the parsimonious model in Figure 1, the predictive distribution corresponds to the path: root $\to$ {A, C} $\to$ {C} $\to$ {B, C}, and is given by $p(x_{N+1}\,|\,x_N\in\{\texttt{A},\texttt{C}\},\,x_{N-1}\in\{\texttt{C}\},\,x_{N-2}\in\{\texttt{B},\texttt{C}\})$. The comparison illustrates the desired dimensionality reduction from fixed to parsimonious Markov models: a fixed order-3 model implies a tree with 27 leaves, each associated with a different predictive distribution, whereas the parsimonious model in Figure 1 defines only 4 predictive distributions.
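
The path lookup in this example can be sketched in a few lines of Python. The nested-pair encoding `TREE` and the helpers `leaf_index` and `count_leaves` are hypothetical names introduced only for illustration; the node sets are those of the parsimonious tree discussed above.

```python
# Each node is a (symbols, children) pair; children are listed in index order,
# so the k-th child of the node with index e has index (e, k).
TREE = ("ABC", [
    ("AC", [
        ("AB", []),
        ("C", [("A", []), ("BC", [])]),
    ]),
    ("B", []),
])


def leaf_index(tree, history):
    """Return the index tuple of the leaf matched by a sequence.

    The sequence is read backwards from its most recent element, descending
    at each step into the unique child whose subset contains that element.
    """
    _, children = tree
    index = []
    for symbol in reversed(history):
        if not children:
            break  # a leaf has been reached; older elements are irrelevant
        for k, (subset, grandchildren) in enumerate(children, start=1):
            if symbol in subset:
                index.append(k)
                children = grandchildren
                break
        else:
            raise ValueError(f"symbol {symbol!r} is not in the vocabulary")
    if children:
        raise ValueError("history too short to reach a leaf")
    return tuple(index)


def count_leaves(tree):
    """Number of leaves, i.e. of distinct predictive distributions."""
    _, children = tree
    return 1 if not children else sum(count_leaves(c) for c in children)


# History ending in x_{N-2}=B, x_{N-1}=C, x_N=A.
print(leaf_index(TREE, "BCA"))      # (1, 2, 2)
print(count_leaves(TREE), 3**3)     # 4 27
```

Because the children of every node partition the vocabulary, exactly one child matches at each step, so the lookup takes at most $D$ steps regardless of the number of leaves.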

2.3 Model specification

2.3.1 Conditional distributions

Given the parsimonious context tree $\bm{\mathcal{T}}$ with nodes $\bm{\mathcal{C}}$, let $E\subseteq\mathbb{E}_D$ denote the set of leaf node indices. Assign to each leaf node $C_{\bm{e}}$, $\bm{e}\in E$, a probability mass function $\bm{\phi}_{\bm{e}}$ over the vocabulary $\mathcal{V}$ with Dirichlet prior distributions:

$$\bm{\phi}_{\bm{e}}\sim\text{Dirichlet}(\bm{\eta}), \tag{2.2}$$

where $\bm{\eta}\in\mathbb{R}^V_+$. The tree $\bm{\mathcal{T}}$ and the implied conditional distributions $\{\bm{\phi}_{\bm{e}}\}_{\bm{e}\in E}$ define a parsimonious Bayesian context tree (PBCT) model.

For each leaf index $\bm{e}=(\epsilon_1,\epsilon_2,\dots,\epsilon_d)\in E$, define

$$\mathcal{P}_{\bm{e}}^{N}=\{\bm{x}\in\mathcal{V}^N : x_N\in C_{\epsilon_1},\,x_{N-1}\in C_{\epsilon_1\epsilon_2},\dots,\,x_{N-d+1}\in C_{\bm{e}}\} \tag{2.3}$$

to be the set of $\bm{x}\in\mathcal{V}^N$ such that the length-$d$ context $(x_N,x_{N-1},\dots,x_{N-d+1})$ matches the path from the root of $\bm{\mathcal{T}}$ to the leaf node indexed by $\bm{e}$. Suppose the leaf corresponding to sequence $\bm{x}$ has index $\bm{e}^{\bm{x}}\in E$, such that $\bm{x}\in\mathcal{P}_{\bm{e}^{\bm{x}}}^{N}$. Then the next element $x_{N+1}$ is sampled from the conditional distribution $\bm{\phi}_{\bm{e}^{\bm{x}}}$:

$$x_{N+1}\,|\,\bm{x},\bm{\mathcal{T}},\{\bm{\phi}_{\bm{e}}\}_{\bm{e}\in E}\sim\text{Categorical}(\bm{\phi}_{\bm{e}^{\bm{x}}}), \tag{2.4}$$

yielding the Markov property $p(x_{N+1}\,|\,\bm{x})=\bm{\phi}_{\bm{e}^{\bm{x}}}$ for sequential prediction. Figure 2 gives an illustrative example of next-element prediction given a PBCT $\bm{\mathcal{T}}$ and sequence $\bm{x}$.

[Figure 2: panel "Tree $\bm{\mathcal{T}}$" with root $C_{\emptyset}$ and nodes $C_1=\{\texttt{A},\texttt{C}\}$, $C_2=\{\texttt{B}\}$, $C_{11}=\{\texttt{A},\texttt{B}\}$, $C_{12}=\{\texttt{C}\}$, $C_{121}=\{\texttt{A}\}$, $C_{122}=\{\texttt{B},\texttt{C}\}$, where $\text{children}(C_{12})\equiv\mathcal{C}_{12}=\{C_{121},\,C_{122}\}$, and leaf distributions $\bm{\phi}_{11}$, $\bm{\phi}_{121}$, $\bm{\phi}_{122}$, $\bm{\phi}_{2}$; panel "Sequence $\bm{x}$" ending in $(\dots,\texttt{B},\texttt{C},\texttt{A})$, with $\bm{x}\in\mathcal{P}_{122}^{N}=C_1\times C_{12}\times C_{122}\times V^{N-3}$ and $x_{N+1}\sim\bm{\phi}_{122}$.]
Figure 2: Example workings of the proposed Markov model for predicting the next element given a length-$N$ sequence $\bm{x}$ and tree $\bm{\mathcal{T}}$ with 4 leaves. The vocabulary is $\mathcal{V}=\{\texttt{A},\texttt{B},\texttt{C}\}$. Read the sequence from right to left and traverse the tree starting from the root (in red). The sequence corresponds to a path from root to a leaf via nodes $C_1$, $C_{12}$ and $C_{122}$. This path admits the predictive distribution $\bm{\phi}_{122}$ from which the next element $x_{N+1}$ is sampled.
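
The two sampling steps (2.2) and (2.4) can be simulated with the Python standard library. This is a minimal sketch under stated assumptions: the helper `sample_dirichlet`, the symmetric hyperparameter $\bm{\eta}=(1,1,1)$ and the random seed are illustrative choices, and the leaf reached by the sequence is taken directly from the example in Figure 2 rather than computed.

```python
import random


def sample_dirichlet(eta, rng):
    """Draw from Dirichlet(eta) by normalising independent Gamma(eta_v, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in eta]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)                          # fixed seed, for reproducibility only
vocabulary = ["A", "B", "C"]
eta = [1.0, 1.0, 1.0]                           # assumed symmetric hyperparameter
leaves = [(1, 1), (1, 2, 1), (1, 2, 2), (2,)]   # leaf indices of the tree in Figure 2

# One probability mass function per leaf, as in (2.2).
phi = {e: sample_dirichlet(eta, rng) for e in leaves}

# A sequence ending in (..., B, C, A) maps to leaf (1, 2, 2), so the next
# element is drawn from Categorical(phi_122), as in (2.4).
x_next = rng.choices(vocabulary, weights=phi[(1, 2, 2)], k=1)[0]
print(phi[(1, 2, 2)], x_next)
```

Only the four leaf distributions are ever sampled, which is the source of the parameter saving relative to a fixed order-3 model with 27 leaves.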

2.3.2 Marginal likelihood

Consider a PBCT $\bm{\mathcal{T}}$ with leaf indices $E$. For $\bm{x}\in\mathcal{V}^N$, define $\bm{X}_{\bm{e}}=(X_{\bm{e},1},\dots,X_{\bm{e},V})$, where

$$X_{\bm{e},v}=\sum_{n=1}^{N}\mathds{1}_{\mathcal{P}^{n-1}_{\bm{e}}}(\bm{x}_{n-1})\,\mathds{1}_{\{v\}}(x_n). \tag{2.5}$$

The count $X_{\bm{e},v}$ is the number of times in $\bm{x}$ that $v\in\mathcal{V}$ follows a context indexed by $\bm{e}$, and the subsequence $\bm{x}_n=(x_1,\dots,x_n)$ denotes the first $n$ elements of $\bm{x}$. Due to the conjugacy of (2.2) with (2.4), the marginal likelihood of $\bm{x}$ under $\bm{\mathcal{T}}$ is obtained from the Dirichlet–Categorical distribution:

$$p(\bm{x}\,|\,\bm{\mathcal{T}})=\prod_{\bm{e}\in E}\frac{B(\bm{X}_{\bm{e}}+\bm{\eta})}{B(\bm{\eta})}, \tag{2.6}$$

where $B(\bm{\alpha})=\prod_i\Gamma(\alpha_i)/\Gamma(\sum_i\alpha_i)$ is the multivariate beta function.
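
Since (2.6) is a product of ratios of gamma functions, it is usually evaluated on the log scale to avoid overflow. The following sketch does so with `math.lgamma`; the function names and the dictionary layout (leaf index mapped to its count vector $\bm{X}_{\bm{e}}$) are assumptions made for illustration.

```python
from math import lgamma, log


def log_multivariate_beta(alpha):
    """log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def log_marginal_likelihood(counts, eta):
    """log p(x | T) from (2.6); counts maps each leaf index to its vector X_e."""
    log_b_eta = log_multivariate_beta(eta)
    return sum(
        log_multivariate_beta([x + h for x, h in zip(X_e, eta)]) - log_b_eta
        for X_e in counts.values()
    )


# Sanity check: a single leaf with V = 2, eta = (1, 1) and one observation
# gives B((2, 1)) / B((1, 1)) = 1/2.
example = log_marginal_likelihood({(): [1, 0]}, [1.0, 1.0])
assert abs(example - log(0.5)) < 1e-12
```

A leaf with no observations contributes $B(\bm{\eta})/B(\bm{\eta})=1$ to the product, so empty leaves leave the marginal likelihood unchanged.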

In practice, for the simulation study in Section 4 and the real-data applications in Section 5, the counts $\bm{X}_{\bm{e}}$ are calculated ignoring the first $D$ elements of a sequence, where $D$ is the specified maximum tree depth. In this way, the sum in (2.5) is computed over the values $D+1\leq n\leq N$, avoiding any cases where the current subsequence $\bm{x}_n$ does not have enough sequential history to map to a leaf in the context tree.

Another consideration for practical model inference is that discrete sequence data commonly occur as collections of many sequences over the same vocabulary. The computations of counts (2.5) and marginal likelihood (2.6) in this case are modified as follows: given a collection of observed sequences, the counts $X_{\bm{e},v}$ for each sequence are calculated separately and aggregated, and the marginal likelihood is computed using the aggregated counts.
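
Both practical conventions, skipping the first $D$ elements and aggregating over sequences, can be sketched as follows. The function `leaf_counts`, the callable argument `leaf_of` (mapping a most-recent-first context to a leaf index) and the two toy sequences are hypothetical; `figure2_leaf` hard-codes the tree of Figure 2 to keep the example short.

```python
from collections import defaultdict


def leaf_counts(sequences, leaf_of, vocabulary, D):
    """Aggregate the counts X_{e,v} of (2.5) over a collection of sequences.

    The first D elements of each sequence serve only as history, so every
    counted element has a full length-D context.
    """
    position = {v: i for i, v in enumerate(vocabulary)}
    counts = defaultdict(lambda: [0] * len(vocabulary))
    for x in sequences:
        for n in range(D, len(x)):                  # 0-based positions D, ..., len(x) - 1
            context = tuple(reversed(x[n - D:n]))   # most recent element first
            counts[leaf_of(context)][position[x[n]]] += 1
    return dict(counts)


def figure2_leaf(context):
    """Leaf index in the tree of Figure 2 for a context of length 3."""
    if context[0] == "B":
        return (2,)                                 # C_2 = {B}
    if context[1] in "AB":
        return (1, 1)                               # C_1 then C_11 = {A, B}
    return (1, 2, 1) if context[2] == "A" else (1, 2, 2)


counts = leaf_counts(["ABCAB", "CCAC"], figure2_leaf, "ABC", D=3)
print(counts)   # {(1, 1): [1, 0, 0], (1, 2, 2): [0, 1, 1]}
```

Leaves that never occur are simply absent from the returned dictionary, which is harmless because an all-zero count vector contributes a factor of one to (2.6).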

2.3.3 Generative process

This section describes a method for generating a parsimonious Bayesian context tree graph. For each parent node in a PBCT, it is assumed that the child nodes correspond to a partition of the vocabulary $\mathcal{V}$, as described in Section 2.2. A natural approach to randomly partitioning a set of integers is the Chinese restaurant process (CRP, Aldous, 1985), a common representation of the Dirichlet process (Ferguson, 1973). Suppose customers are to be seated at a restaurant with an infinite number of tables. Let $z_m\in\mathbb{N}$ denote the table allocation of customer $m$. The first customer sits at the first table, so $z_1=1$. For each successive customer $m\geq 2$, if $K$ tables have already been occupied, the customer sits at an occupied table $k\in\{1,\dots,K\}$ or a new table $K+1$ with probabilities:

\[
\mathbb{P}(z_m = k) = \frac{m_k}{\alpha + m - 1}, \quad \mathbb{P}(z_m = K+1) = \frac{\alpha}{\alpha + m - 1},
\]

where $\alpha > 0$ controls the rate at which new tables are formed and $m_k$ denotes the number of customers seated at table $k$. By taking a finite realisation of the CRP after $V$ customers are seated, each occupied table will be associated with a subset of integers which together form an exchangeable partition of $\mathcal{V} = \{1, \dots, V\}$. The proposed generative process assumes a CRP prior distribution with parameter $\alpha$ over the set of partitions of $\mathcal{V}$, denoted $\text{CRP}_V(\alpha)$.
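As an illustration, a draw from $\text{CRP}_V(\alpha)$ can be simulated by seating the $V$ customers one at a time. The sketch below is not taken from the paper: the function name `sample_crp_partition` is hypothetical, and the seating probabilities are implemented as weights proportional to the table sizes $m_k$ and to $\alpha$, normalised internally by `random.choices`.

```python
import random

def sample_crp_partition(V, alpha, rng=None):
    """Draw a partition of {1, ..., V} by seating V customers in a CRP."""
    rng = rng or random.Random()
    tables = [[1]]  # the first customer always opens the first table
    for m in range(2, V + 1):
        # Weights proportional to table sizes m_k, plus alpha for a new table.
        weights = [len(t) for t in tables] + [alpha]
        k = rng.choices(range(len(tables) + 1), weights=weights)[0]
        if k == len(tables):
            tables.append([m])
        else:
            tables[k].append(m)
    return [set(t) for t in tables]

partition = sample_crp_partition(V=10, alpha=1.0, rng=random.Random(0))
```

Smaller values of `alpha` make the final weight smaller relative to the table sizes, so new tables are opened less often and the partition tends to have fewer blocks.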

The following processes describe the generation of a PBCT $\bm{\mathcal{T}} = (\bm{\mathcal{C}}, \bm{\mathcal{E}})$. Let $\mathcal{A} = \{A_1, \dots, A_K\}$ be a random partition of the vocabulary $\mathcal{V}$, where each $A_k \neq \emptyset$. For an infinite sequence of sets $(C_1, C_2, \dots)$, write $(C_1, C_2, \dots) \sim H_V(\alpha)$ if:

\[
\mathcal{A} \sim \text{CRP}_V(\alpha), \quad C_k = \begin{cases} A_k, & \text{if } |\mathcal{A}| > 1 \text{ and } k \leq |\mathcal{A}|, \\ \emptyset, & \text{otherwise}, \end{cases} \quad \text{for } k \in \mathbb{N}.
\]

The process $H_V(\alpha)$ describes the construction of a sequence of sets starting with the components of a partition of $\mathcal{V}$ generated by the CRP, followed by a countably infinite number of empty sets. For $D \geq 1$, define $G_V(\alpha, D)$ as the process:

\begin{align}
C \equiv C_{\emptyset} &= \mathcal{V}, \tag{2.7} \\
(C_{\bm{e}1}, C_{\bm{e}2}, \ldots) &\sim \begin{cases} H_V(\alpha), & \text{if } C_{\bm{e}} \neq \emptyset, \\ \delta_{(\emptyset, \emptyset, \ldots)}, & \text{otherwise}, \end{cases} \quad \text{for } \bm{e} \in \mathbb{E}_{D-1}, \tag{2.8} \\
\bm{\mathcal{C}} &= \{C_{\bm{e}} \subseteq \mathcal{V} : C_{\bm{e}} \neq \emptyset,\, \bm{e} \in \mathbb{E}_{D}\}, \tag{2.9} \\
\bm{\mathcal{E}} &= \{(C_{\bm{e}}, C_{\bm{e}k}) \in \bm{\mathcal{C}} \times \bm{\mathcal{C}} : \bm{e} \in \mathbb{E}_{D-1},\, k \in \mathbb{N}\}, \notag
\end{align}

where $\delta_{(\emptyset, \emptyset, \ldots)}$ denotes a Dirac measure placing probability one on the sequence $(\emptyset, \emptyset, \ldots)$. Then $\bm{\mathcal{T}} \sim G_V(\alpha, D)$ is the generative process of the tree $\bm{\mathcal{T}}$ of maximum depth $D$ with nodes $\bm{\mathcal{C}}$ and edges $\bm{\mathcal{E}}$. The children of each non-leaf node of the tree form a partition of $\mathcal{V}$.

There are two ways to control the complexity of a generated tree: (i) vary the maximum depth of the tree, $D$, or (ii) vary the CRP rate parameter $\alpha$. Smaller values of $\alpha$ lead to partitions of $\mathcal{V}$ with fewer components, or clusters, constraining the complexity of the tree. Additionally, to limit tree size, $\alpha$ may be chosen to decay with depth, causing the expected number of generated child nodes to decrease with depth.
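The recursive structure of $G_V(\alpha, D)$ can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the nested-dictionary tree layout, the function names, and the geometric decay `alpha * decay ** depth` are all hypothetical choices, the last being only one possible way to let $\alpha$ decay with depth.

```python
import random

def crp_partition(V, alpha, rng):
    """Partition {1, ..., V} via the Chinese restaurant process."""
    tables = [[1]]
    for m in range(2, V + 1):
        weights = [len(t) for t in tables] + [alpha]
        k = rng.choices(range(len(tables) + 1), weights=weights)[0]
        if k == len(tables):
            tables.append([m])
        else:
            tables[k].append(m)
    return [frozenset(t) for t in tables]

def generate_pbct(V, alpha, D, rng, depth=0, symbols=None, decay=1.0):
    """Recursively generate a tree as nested dicts (hypothetical layout).

    decay < 1 shrinks the CRP rate with depth (alpha * decay ** depth),
    one possible way to give deeper nodes fewer children.
    """
    if symbols is None:
        symbols = frozenset(range(1, V + 1))  # root node is the vocabulary
    node = {"symbols": symbols, "children": []}
    if depth < D:
        blocks = crp_partition(V, alpha * decay ** depth, rng)
        if len(blocks) > 1:  # a single block means the node is a leaf
            node["children"] = [
                generate_pbct(V, alpha, D, rng, depth + 1, b, decay)
                for b in blocks
            ]
    return node

def max_depth(node):
    """Depth of the deepest leaf below the given node."""
    return 1 + max(map(max_depth, node["children"])) if node["children"] else 0

tree = generate_pbct(V=5, alpha=1.0, D=3, rng=random.Random(1))
```

Each non-leaf node draws a fresh partition of the whole vocabulary for its children, which mirrors the statement that the children of every non-leaf node partition $\mathcal{V}$.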

3 Inference

Section 2.3.3 described the assumed generative process of a parsimonious Bayesian context tree. In this section, an approximate Bayesian inference scheme is described to learn PBCT structure given data. The introduced method, based on agglomerative clustering, is applied in Section 4 to infer synthetic tree structures, and in Section 5 to estimate the dependence structures in real protein sequences and command-line data.

3.1 Agglomerative clustering

Agglomerative hierarchical clustering (Duda and Hart, 1973) is a general method for partitioning a discrete set of elements using a similarity metric. The procedure starts with each element in a singleton cluster, and successively merges clusters according to their similarities until all elements are grouped together in the same cluster. A nested sequence of clusterings is created, and the optimal clustering is chosen to maximise an objective function. Bayesian model-based agglomerative schemes (Heller and Ghahramani, 2005; Heard et al., 2006) use marginal posterior probabilities to define similarities between clusters.

Model-based agglomerative clustering is applied to infer the parsimonious Bayesian context tree $\bm{\mathcal{T}}$ of maximum depth $D$ given a sequence $\bm{x}$. The clustering process is initialised at the root node and repeated recursively to obtain the optimal children of each node: each recursion terminates if the optimal child configuration is $\{\mathcal{V}\}$, or the current node is at the maximum depth $D$, at which point the current node is set as a leaf of the tree.

Starting from the root node $C \equiv C_{\emptyset}$, let $\bm{e} \in \mathbb{E}_d$ be an index at tree depth $d \leq D$, and let $\mathcal{C}_{\bm{e}} = \{C_{\bm{e}1}, \ldots, C_{\bm{e}K}\}$ denote the children of node $C_{\bm{e}}$ which form a partition of the vocabulary $\mathcal{V}$. Denote by $\bm{\mathcal{T}}_d$ the tree cut at the current depth $d$. The cluster similarity at each step is the multiplicative increase in probability obtained by merging $C_{\bm{e}i}$ and $C_{\bm{e}j}$:

\begin{equation}
s_{i,j} = \frac{p(\bm{x} \mid \bm{\mathcal{T}}_d,\, \mathcal{C}_{\bm{e}}^{(i,j)})\, p(\mathcal{C}_{\bm{e}}^{(i,j)})}{p(\bm{x} \mid \bm{\mathcal{T}}_d,\, \mathcal{C}_{\bm{e}})\, p(\mathcal{C}_{\bm{e}})}, \tag{3.1}
\end{equation}

where $\mathcal{C}_{\bm{e}}^{(i,j)}$ denotes the cluster configuration in which clusters $C_{\bm{e}i}$ and $C_{\bm{e}j}$ are merged, and $p(\bm{x} \mid \bm{\mathcal{T}}_d,\, \mathcal{C}_{\bm{e}})$ is the marginal likelihood of $\bm{x}$ under the current tree $\bm{\mathcal{T}}_d$ with the proposed cluster configuration $\mathcal{C}_{\bm{e}}$ at depth $d+1$. Assuming the generative process in Section 2.3.3, the prior probability $p(\mathcal{C}_{\bm{e}})$ is available from the Chinese restaurant process.
The agglomerative clustering procedure for the children of each node $C_{\bm{e}}$ at depth $d$ starts with each element of $\mathcal{V}$ belonging to a singleton cluster, and is sequentially iterated until all elements are merged in the same group; the clustering structure $\mathcal{C}_{\bm{e}}$ which locally maximises the marginal posterior probability $\pi(\mathcal{C}_{\bm{e}}) = p(\bm{x} \mid \bm{\mathcal{T}}_d,\, \mathcal{C}_{\bm{e}})\, p(\mathcal{C}_{\bm{e}})$ is chosen as the optimal child configuration for the corresponding node in the tree. This scheme is denoted recursive agglomerative clustering (RAC), and it is outlined in Algorithm 1.

Algorithm 1 Recursive agglomerative clustering (RAC)
1:  Initialise: Set current node to root node: $C_{\bm{e}} \equiv C_{\emptyset} = \mathcal{V}$.
2:  while current node $C_{\bm{e}}$ is not a leaf do
3:     Initialise clustering $\mathcal{C}_{\bm{e}} = \{C_{\bm{e}1}, \ldots, C_{\bm{e}V}\}$, where each $C_{\bm{e}i} = \{i\}$.
4:     Set number of clusters $K = V$ and calculate $\pi_K = \pi(\mathcal{C}_{\bm{e}})$.
5:     while $K > 1$ do
6:         Merge clusters $C_{\bm{e}i}$ and $C_{\bm{e}j}$ with maximal similarity $s_{i,j}$.
7:         Set $K = K - 1$ and relabel clusters $i, j$ accordingly.
8:         Calculate $\pi_K = \pi(\mathcal{C}_{\bm{e}})$ for new clustering.
9:     Select the optimal clustering $\mathcal{C}_{\bm{e}}$ which maximises $\pi(\mathcal{C}_{\bm{e}})$.
10:     if $\mathcal{C}_{\bm{e}} = \{\mathcal{V}\}$ then
11:         Set current node $C_{\bm{e}}$ as a leaf.
12:     else
13:         for each child node $C_{\bm{e}'} \in \mathcal{C}_{\bm{e}}$ do
14:            Set current node to child node: $C_{\bm{e}} = C_{\bm{e}'}$.
15:            if $C_{\bm{e}}$ is at maximum depth then
16:               Set current node $C_{\bm{e}}$ as a leaf.
17:            else
18:               Repeat the procedure from line 2 to get optimal children of current node $C_{\bm{e}}$.
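The inner loop of Algorithm 1 (lines 3 to 9) can be sketched in Python for a single node. This is an illustrative sketch rather than the authors' implementation. The function names are hypothetical; symbols are assumed to be coded $0, \dots, V-1$; the input `counts[u][v]` is assumed to hold the number of times symbol `v` follows symbol `u` among the contexts reaching the current node; the marginal likelihood is assumed to be a product of ratios of multivariate beta functions with a symmetric hyperparameter `eta`; and the CRP prior is evaluated through its standard partition probability, proportional to $\alpha^K \prod_k (|A_k| - 1)!$. Likelihood terms belonging to other parts of the tree are omitted, since they cancel in (3.1).

```python
from itertools import combinations
from math import lgamma, log

def log_beta(a):
    """Log multivariate beta function of a positive vector a."""
    return sum(lgamma(ai) for ai in a) - lgamma(sum(a))

def log_crp_prior(blocks, alpha, V):
    """Log CRP probability of a partition of V items into the given blocks."""
    out = len(blocks) * log(alpha) + sum(lgamma(len(b)) for b in blocks)
    return out - sum(log(alpha + i) for i in range(V))

def log_post(blocks, counts, alpha, eta):
    """Unnormalised log posterior: pooled-count evidence plus CRP prior."""
    V = len(counts)
    ll = 0.0
    for b in blocks:
        pooled = [sum(counts[u][v] for u in b) + eta for v in range(V)]
        ll += log_beta(pooled) - log_beta([eta] * V)
    return ll + log_crp_prior(blocks, alpha, V)

def agglomerate(counts, alpha=1.0, eta=1.0):
    """Greedy merges from V singletons down to one cluster; return the best."""
    blocks = [frozenset([u]) for u in range(len(counts))]
    best = (log_post(blocks, counts, alpha, eta), blocks)
    while len(blocks) > 1:
        # The pair with maximal similarity s_ij is the pair whose merge
        # gives the largest posterior, since the denominator is shared.
        cands = []
        for i, j in combinations(range(len(blocks)), 2):
            merged = [b for k, b in enumerate(blocks) if k not in (i, j)]
            merged.append(blocks[i] | blocks[j])
            cands.append((log_post(merged, counts, alpha, eta), merged))
        score, blocks = max(cands, key=lambda c: c[0])
        if score > best[0]:
            best = (score, blocks)
    return best[1]

# Toy transition counts: rows 0 and 1 look alike, as do rows 2 and 3.
counts = [[50, 5, 5, 5], [48, 6, 4, 7], [5, 5, 50, 5], [4, 7, 47, 6]]
best = agglomerate(counts)
```

For clarity the sketch recomputes the full posterior for every candidate merge; only the two merged clusters change, so a practical implementation would update those terms alone. Recursing on each returned block, with counts restricted to the longer contexts, would give the outer loop of the algorithm.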

The RAC algorithm provides a fast and scalable procedure for estimating the tree which generated an observed sequence. In Section 4, the RAC algorithm is shown to recover the true underlying context trees from simulated data in a variety of settings. An advantage of the RAC algorithm over existing methods for context tree learning is its ability to handle significantly larger vocabulary sizes. For instance, the algorithms presented in Eggeling et al. (2019) require initialisation with extended trees containing all possible combinations of child node configurations up to the specified maximum tree depth. This limits feasible vocabulary size to $V \approx 10$, whereas RAC can learn PBCTs in reasonable computational time for data with $V \approx 100$, as shown in Section 5.

3.2 Model evaluation

Suppose the PBCT $\bm{\mathcal{T}}$ is inferred from the training sequence $\bm{x}^{\text{train}}$. If $\bm{\mathcal{T}}$ has leaf indices $E$, the predictive marginal likelihood of an unseen length-$N$ test sequence $\bm{x}^{\text{test}} \in \mathcal{V}^N$ under the trained model is

\begin{equation}
p(\bm{x}^{\text{test}} \mid \bm{x}^{\text{train}},\, \bm{\mathcal{T}}) = \prod_{\bm{e} \in E} \frac{B(\bm{X}_{\bm{e}}^{\text{test}} + \bm{X}_{\bm{e}}^{\text{train}} + \bm{\eta})}{B(\bm{X}_{\bm{e}}^{\text{train}} + \bm{\eta})}, \tag{3.2}
\end{equation}

where $\bm{X}_{\bm{e}}^{\cdot}$ are the counts (2.5) calculated separately over the training and test sequences. To evaluate the predictive performance of a trained model, define the marginal log-loss $\ell(\bm{x}^{\text{test}})$ to be the mean negative predictive log-likelihood of the test sequence,

\begin{equation}
\ell(\bm{x}^{\text{test}} \mid \bm{x}^{\text{train}},\, \bm{\mathcal{T}}) = -\frac{1}{N} \log p(\bm{x}^{\text{test}} \mid \bm{x}^{\text{train}},\, \bm{\mathcal{T}}). \tag{3.3}
\end{equation}

Lower marginal log-loss indicates superior predictive ability.
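The computation in (3.2) and (3.3) can be sketched in Python using log-gamma functions for numerical stability. The function names and the dictionary layout (leaf label mapped to a length-$V$ count vector) are hypothetical. When `N` is not supplied, the sketch divides by the number of counted test symbols, which is an assumption made for convenience and differs slightly from the full test length when the first $D$ elements are skipped.

```python
from math import exp, lgamma

def log_beta(a):
    """Log multivariate beta function of a positive vector a."""
    return sum(lgamma(ai) for ai in a) - lgamma(sum(a))

def marginal_log_loss(train_counts, test_counts, eta, N=None):
    """Mean negative predictive log-likelihood, following (3.2) and (3.3).

    train_counts, test_counts: dicts mapping a leaf label to a length-V
    count vector; eta: length-V Dirichlet hyperparameter vector.
    """
    if N is None:
        N = sum(sum(x) for x in test_counts.values())
    log_p = 0.0
    for leaf, x_test in test_counts.items():
        # Leaves unseen in training fall back to the prior alone.
        x_train = train_counts.get(leaf, [0] * len(eta))
        post = [a + b for a, b in zip(x_train, eta)]
        log_p += log_beta([a + b for a, b in zip(x_test, post)]) - log_beta(post)
    return -log_p / N

# One leaf, V = 2: training counts (3, 1), test counts (1, 1), eta = (1, 1).
loss = marginal_log_loss({"a": [3, 1]}, {"a": [1, 1]}, [1.0, 1.0])
prob = exp(-2 * loss)  # predictive probability of the two test symbols
```

In this toy case the predictive probability factorises sequentially as $(4/6) \times (2/7)$, which gives a quick sanity check on the beta-function ratio.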

To obtain the marginal log-loss (3.3) for a given tree, the predictive distributions for each leaf are marginalised. For simulation studies, it is also possible to define a true log-loss for a model for which the true conditional distributions, denoted $\{\bm{\phi}_{\bm{e}}^{*}\}$, are assumed known:

\begin{equation}
\ell(\bm{x}^{\text{test}} \mid \bm{x}^{\text{train}},\, \bm{\mathcal{T}},\, \{\bm{\phi}_{\bm{e}}^{*}\}_{\bm{e} \in E}) = -\frac{1}{N} \sum_{i=1}^{N} \log \phi^{*}_{\bm{e}^{\bm{x}_i},\, x_i}. \tag{3.4}
\end{equation}

Estimates of the predictive distributions can be useful for efficient downstream analysis: for each leaf $\bm{e} \in E$, denote by $\hat{\bm{\phi}}_{\bm{e}} = (\hat{\phi}_{\bm{e},1}, \dots, \hat{\phi}_{\bm{e},V})$ the posterior mean estimates of the corresponding Dirichlet–Categorical distribution,

\begin{equation}
\hat{\phi}_{\bm{e},v} = \frac{X_{\bm{e},v}^{\text{train}} + \eta_v}{\sum_{u=1}^{V} \left( X_{\bm{e},u}^{\text{train}} + \eta_u \right)}, \quad v \in \mathcal{V}. \tag{3.5}
\end{equation}
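As a small worked example of (3.5), the posterior mean for one leaf can be computed as below. The function name `posterior_mean` is hypothetical, and the denominator is read as the sum over $u$ of both the counts and the hyperparameters.

```python
def posterior_mean(x_train, eta):
    """Posterior mean of a Dirichlet-categorical leaf distribution, as in (3.5)."""
    total = sum(x + h for x, h in zip(x_train, eta))
    return [(x + h) / total for x, h in zip(x_train, eta)]

# Training counts (3, 1, 0) with eta = (1, 1, 1) give (4/7, 2/7, 1/7).
phi_hat = posterior_mean([3, 1, 0], [1.0, 1.0, 1.0])
```

The unseen third symbol still receives positive probability, which is the smoothing effect of the Dirichlet prior.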

A similarity metric based on the adjusted Rand index (ARI; Hubert and Arabie, 1985) is defined to compare tree structures; for example, to compare a fitted tree with a known, simulated tree. For each clustering at depth $d \leq D$ in the fitted tree, the metric obtains the ARI of the most similar clustering at depth $d$ in the simulated tree; these maximal ARIs are then weighted by the frequencies of contexts in the training sequence and averaged. In this way, more importance is placed on the parts of the trees which are commonly represented in the data, and errors are penalised less for rare contexts.

The tree similarity metric is defined as follows. Let $\bm{\mathcal{T}}_1$ and $\bm{\mathcal{T}}_2$ be trees of maximum depths $D_1$ and $D_2$, such that $D_1 \geq D_2$. For $m \in \{1, 2\}$, let $I_m \subseteq \mathbb{E}_{D_m}$ denote the indices of all nodes in $\bm{\mathcal{T}}_m$ and $E_m \subseteq I_m$ denote the indices of leaf nodes in $\bm{\mathcal{T}}_m$. For any node $\bm{e}_m \in I_m$ at depth $d_m$, let $\mathcal{C}_{\bm{e}_m}$ denote the partition of $\mathcal{V}$ comprising the children of node $C_{\bm{e}_m}$. If $\bm{e}_m \in E_m$ is a leaf, then let $\mathcal{C}_{\bm{e}_m} = \{\mathcal{V}\}$. Define the weight $w_{\bm{e}_1}$ as the proportion of contexts in $\bm{x}$ associated with node $C_{\bm{e}_1}$ over all nodes at the same depth in tree $\bm{\mathcal{T}}_1$, and let $\text{ARI}_{\bm{e}_1, \bm{e}_2}$ denote the adjusted Rand index between clusterings $\mathcal{C}_{\bm{e}_1}$ in $\bm{\mathcal{T}}_1$ and $\mathcal{C}_{\bm{e}_2}$ in $\bm{\mathcal{T}}_2$. Then, for $0 \leq d \leq D_1 - 1$, define the tree similarity at depth $d+1$ as

\begin{equation}
\sum_{\bm{e}_1 \in \mathbb{N}^d \,\cap\, I_1} w_{\bm{e}_1} \max_{\bm{e}_2 \in \mathbb{N}^d \,\cap\, I_2} \left\{ \text{ARI}_{\bm{e}_1, \bm{e}_2} \right\}. \tag{3.6}
\end{equation}
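The weighted best-match ARI in (3.6) can be sketched in Python at a single depth. The function names are hypothetical, partitions are represented as lists of sets over a vocabulary of at least two symbols, and the convention of returning 1 when both partitions are identical and degenerate (for example, two leaves, each with clustering $\{\mathcal{V}\}$) is an assumption of this sketch.

```python
from math import comb

def ari(p1, p2):
    """Adjusted Rand index between two partitions (lists of sets) of one set."""
    n = sum(len(a) for a in p1)
    sum_ij = sum(comb(len(a & b), 2) for a in p1 for b in p2)
    sum_a = sum(comb(len(a), 2) for a in p1)
    sum_b = sum(comb(len(b), 2) for b in p2)
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def depth_similarity(parts1, weights1, parts2):
    """Weighted best-match ARI at one depth, following (3.6).

    parts1, parts2: child partitions of the nodes at this depth in each tree;
    weights1: context proportions for the nodes of the first tree.
    """
    return sum(w * max(ari(p, q) for q in parts2)
               for p, w in zip(parts1, weights1))

# Tree 1 has two nodes at this depth (weights 0.75 and 0.25); tree 2 has one.
same = [{0, 1}, {2, 3}]
sim = depth_similarity([same, [{0, 1, 2, 3}]], [0.75, 0.25], [same])
```

Here the first node matches exactly (ARI of 1) while the leaf-like second node has ARI 0 against the only available clustering, so the weighted similarity equals the weight of the matching node.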

4 Simulation study

A simulation study enables assessment of the performance of the inference procedure for learning a parsimonious Bayesian context tree (PBCT). This section explores two experiments: (i) a hyperparameter study, and (ii) a model comparison as training lengths vary.

4.1 Setup

In both experiments, PBCT models are randomly generated over the vocabulary $\mathcal{V}$ as described in Section 2.3.3, and sequences are simulated using these trees. Each simulated sequence is split into a training sequence $\bm{x}^{\text{train}}$ and test sequence $\bm{x}^{\text{test}}$. Given a simulated sequence, the task is to recover the true generated tree. Trees are learned from the training sequence via the recursive agglomerative clustering (RAC) procedure described in Section 3.1, Algorithm 1. The evaluation metrics described in Section 3.2 – marginal log-loss (3.3), true log-loss (3.4) and tree similarity (3.6) – are used to evaluate model performance.

The first experiment investigates the sensitivity of RAC to the simulation parameter $\bm{\eta} \in \mathbb{R}^V_+$. Additionally, the simulated predictive distributions are modified to capture the influence of a background level of stochastic variation in a sequence. Let $\lambda \in [0,1]$ be a parameter to control the strength of the stochastic variation. Then, the conditional distribution $\bm{\phi}_{\bm{e}}^{\text{true}}$ for each leaf $\bm{e}$ of a simulated tree is considered to be a mixture of the $\text{Dirichlet}(\bm{\eta})$ distribution and the uniform distribution over $\mathcal{V}$: the simulated distribution is given by $\bm{\phi}_{\bm{e}}^{\text{true}} = (1-\lambda)\bm{\phi}_{\bm{e}} + \lambda \mathbf{1}_V / V$, $\bm{\phi}_{\bm{e}} \sim \text{Dirichlet}(\bm{\eta})$, where $\mathbf{1}_V$ is a length-$V$ vector of ones. In this “spike-and-slab” representation of $\bm{\phi}_{\bm{e}}^{\text{true}}$, $\lambda = 0$ yields the standard Dirichlet distribution, and $\lambda = 1$ yields the uniform distribution.
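The mixture construction above can be illustrated with a minimal sketch, which is not taken from the pbct library. The function names `sample_dirichlet` and `sample_leaf_distribution` are hypothetical, and the Dirichlet draw is obtained by normalising independent Gamma variates.

```python
import random

def sample_dirichlet(eta, rng):
    """Draw from Dirichlet(eta) by normalising independent Gamma(eta_v, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in eta]
    total = sum(draws)
    return [g / total for g in draws]

def sample_leaf_distribution(eta, lam, rng):
    """Hypothetical helper: return (1 - lam) * phi + lam * 1_V / V with phi ~ Dirichlet(eta)."""
    V = len(eta)
    phi = sample_dirichlet(eta, rng)
    return [(1.0 - lam) * p + lam / V for p in phi]

rng = random.Random(0)  # fixed seed, chosen only for reproducibility of the sketch
V = 10
phi_true = sample_leaf_distribution([0.1] * V, 0.25, rng)  # eta = 0.1, lambda = 0.25
phi_uniform = sample_leaf_distribution([1.0] * V, 1.0, rng)  # lambda = 1 gives the uniform
```

With $\lambda > 0$, every entry of the simulated distribution is bounded below by $\lambda / V$, which is the smoothing effect that the noise parameter is intended to provide.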

Synthetic parsimonious Bayesian context trees are generated over a vocabulary of size $V = 10$, and each partition of the vocabulary is drawn from $\text{CRP}_V(\alpha)$ with rate parameter $\alpha = 1$. The maximum depth of the PBCTs is specified as $D = 3$. Training sequences of length $N = 10{,}000$ and test sequences of length $N = 1{,}000$ are simulated from the generated trees. The Dirichlet hyperparameters are specified as $\bm{\eta} = \eta \mathbf{1}_V$, and the simulation parameters $\eta$ and $\lambda$ are varied. Results are obtained for 15 different model simulations. Model performance is evaluated by computing the ARI-based similarity metric (3.6) over the training sequences to compare tree structures, and marginal log-losses (3.3) are computed over the test sequences to evaluate predictive performance.
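As an illustration of how a vocabulary partition might be drawn, the following sketch samples from a Chinese restaurant process by sequential seating. It assumes the standard CRP construction; the function name `sample_crp_partition` is hypothetical, and the tree-generation procedure of Section 2.3.3 is not reproduced here.

```python
import random

def sample_crp_partition(items, alpha, rng):
    """Seat items one at a time: an item joins an existing cluster with
    probability proportional to the cluster size, or opens a new cluster
    with probability proportional to alpha."""
    clusters = []
    for item in items:
        weights = [len(c) for c in clusters] + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(clusters):
            clusters.append([item])  # open a new cluster
        else:
            clusters[choice].append(item)
    return clusters

rng = random.Random(1)  # fixed seed, chosen only for reproducibility of the sketch
vocabulary = list(range(10))  # V = 10 symbols, labelled 0..9
partition = sample_crp_partition(vocabulary, alpha=1.0, rng=rng)
```

Larger values of `alpha` favour more, smaller clusters, so the rate parameter controls how aggressively contexts are grouped in the simulated trees.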

In the second experiment, fixed and variable-order Markov models are considered for comparison with the PBCT. Let $D$ be the maximum tree depth, so that PBCT-$D$ denotes a parsimonious Bayesian context tree of maximum depth $D$. A fixed order-$d$ Bayesian Markov model (FBM-$d$), $d \in \{1, \dots, D\}$, is defined such that Dirichlet priors are placed on the conditional distributions for each order-$d$ context. Similarly, a variable-order Bayesian Markov model with maximum order $D$ (VBM-$D$; see, for example, Dimitrakakis, 2010) is defined with Dirichlet conditional distributions for each context. VBM structure is inferred using a modified version of RAC, Algorithm 1, in which only two outcomes are compared at each vocabulary partitioning step: (i) all elements in the same cluster, or (ii) $V$ singleton clusters.
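The two-outcome comparison used for the VBM can be sketched with Dirichlet-multinomial marginal likelihoods. This is a simplified illustration under stated assumptions: it compares only the likelihood terms for next-symbol counts, and omits any prior on partitions and the recursion over subtrees that Algorithm 1 may involve. The function names are hypothetical.

```python
from math import lgamma

def log_marginal(counts, eta):
    """Log Dirichlet-multinomial marginal likelihood of the next-symbol
    counts observed after one cluster of contexts."""
    n = sum(counts)
    a = sum(eta)
    out = lgamma(a) - lgamma(a + n)
    for c, e in zip(counts, eta):
        out += lgamma(e + c) - lgamma(e)
    return out

def compare_merge_vs_split(child_counts, eta):
    """Return (log_merged, log_split): one shared distribution for all
    context symbols versus one distribution per symbol."""
    merged = [sum(col) for col in zip(*child_counts)]
    log_merged = log_marginal(merged, eta)
    log_split = sum(log_marginal(c, eta) for c in child_counts)
    return log_merged, log_split

eta = [1.0, 1.0, 1.0]  # toy vocabulary with V = 3
# Each row holds the next-symbol counts following one context symbol.
similar = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
distinct = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
merged_sim, split_sim = compare_merge_vs_split(similar, eta)
merged_dis, split_dis = compare_merge_vs_split(distinct, eta)
```

In this toy example, the merged outcome has the larger marginal likelihood when the contexts behave alike, and the singleton outcome is preferred when each context has its own next-symbol behaviour.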

Predictive performance is investigated as the length of the training sequence varies. The hyperparameter pair $\bm{\eta} = \mathbf{1}_V$ and $\lambda = 0$ is used for sequential prediction so that the predictive distributions associated with each leaf are standard Dirichlet distributions. The synthetic models are generated with vocabulary size $V = 10$, CRP parameter $\alpha = 1$ and maximum depth $D = 3$. Predictive log-losses are averaged over 15 model simulations.
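To make the notion of a sequential predictive log-loss concrete, the sketch below computes the average negative log predictive probability for a single context with a symmetric Dirichlet prior, updating counts after each symbol. It is an order-0 simplification and an assumption about the general form of such a metric; it does not reproduce definition (3.3), and the name `sequential_log_loss` is hypothetical.

```python
from math import log

def sequential_log_loss(sequence, V, eta=1.0):
    """Average negative log predictive probability, where each symbol is
    predicted by the posterior mean (count + eta) / (n + V * eta) and the
    counts are updated after the symbol is observed."""
    counts = [0] * V
    total = 0.0
    for n, x in enumerate(sequence):
        p = (counts[x] + eta) / (n + V * eta)
        total -= log(p)
        counts[x] += 1
    return total / len(sequence)

# Toy sequence over V = 2 symbols; the predictive probabilities are 1/2, 2/3, 3/4.
loss = sequential_log_loss([0, 0, 0], V=2)
```

With natural logarithms, a uniform prediction over $V = 10$ symbols would give a log-loss of $\log 10 \approx 2.303$, which serves as a reference point for values obtained on a vocabulary of that size.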

Figure 3: Boxplots to compare structures of fitted and simulated trees. Synthetic trees are generated with vocabulary size $V = 10$, CRP parameter $\alpha = 1$ and maximum depth $D = 3$. $N = 10{,}000$ training samples are simulated using the trees for varying distributional hyperparameters $\eta$ and $\lambda$. Boxplots of weighted ARI are shown for each depth $d = 1, 2, 3$ over 15 different model simulations.

4.2 Results

Figure 3 illustrates boxplots of tree similarities for different simulation parameter configurations, and Table 1 contains the average log-losses of the simulated and fitted trees for the first experiment. The RAC algorithm for learning PBCTs generally performs well, and can perfectly recover synthetic context tree structures for several parameter configurations. The best and most consistent performance is achieved by the hyperparameter pair $\eta = 1$, $\lambda = 0$. The presence of random noise can lead to improved performance for small values of $\eta$. Small $\eta$ values produce highly sparse Dirichlet distributions, which cause issues with model identifiability. Adding noise smooths the conditional distributions to ensure greater variation of contexts in the simulated sequence, which helps RAC recover the simulated tree. On the other hand, as $\eta$ becomes large, the predictive distributions each become close to uniform, again leading to identifiability issues.
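The tree similarities discussed above are built on the adjusted Rand index (ARI) of Hubert and Arabie (1985). The sketch below computes the plain ARI between two partitions of the same items, given as label lists. The depth-wise weighting of metric (3.6) is not reproduced, and the function name is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Plain ARI between two labelings of the same items (at least two items)."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))  # contingency table cells
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Both partitions are all singletons, or both are a single cluster.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # identical up to relabelling
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # maximally disagreeing pairs
```

An ARI of one indicates that two partitions agree exactly, values near zero indicate agreement no better than chance, and negative values indicate less agreement than expected by chance.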

Table 1: Average log-losses for simulated and fitted PBCT models for varying distributional hyperparameter pairs $\eta$ and $\lambda$, repeated over 15 models. Standard deviations in parentheses. Synthetic trees generated with vocabulary size $V = 10$, CRP parameter $\alpha = 1$ and maximum depth $D = 3$. Sequences of length $N = 10{,}000$ are simulated from each tree. Best performance is given by the hyperparameters $\eta = 1$, $\lambda = 0$.

| $\eta$ | $\lambda$ | Simulated | Fitted | Difference |
|---|---|---|---|---|
| 1.0 | 0.0 | 1.9297 (0.0674) | 1.9407 (0.0667) | 0.0017 (0.0023) |
| 1.0 | 0.25 | 2.1158 (0.0501) | 2.1269 (0.0538) | 0.0093 (0.0134) |
| 1.0 | 0.5 | 2.2177 (0.0168) | 2.2290 (0.0211) | 0.0090 (0.0099) |
| 0.1 | 0.0 | 0.9045 (0.1314) | 1.0249 (0.2631) | 0.1236 (0.2117) |
| 0.1 | 0.25 | 1.5552 (0.1408) | 1.5921 (0.1988) | 0.0477 (0.1123) |
| 0.1 | 0.5 | 1.9397 (0.0337) | 1.9436 (0.0372) | 0.0042 (0.0054) |
| 0.5 | 0.0 | 1.6800 (0.0888) | 1.6979 (0.0993) | 0.0214 (0.0252) |
| 0.5 | 0.25 | 1.9587 (0.0693) | 1.9656 (0.0669) | 0.0088 (0.0076) |
| 0.5 | 0.5 | 2.1503 (0.0336) | 2.1681 (0.0456) | 0.0231 (0.0309) |
| 5.0 | 0.0 | 2.2165 (0.0262) | 2.2271 (0.0257) | 0.0110 (0.0109) |
| 5.0 | 0.25 | 2.2581 (0.0160) | 2.2716 (0.0161) | 0.0089 (0.0099) |
| 5.0 | 0.5 | 2.2871 (0.0054) | 2.2959 (0.0073) | 0.0094 (0.0067) |

Figure 4 displays average log-losses for a selection of fitted models, as well as the difference in log-losses with the simulated models, as training length varies. Figures 4(a) and 4(c) show the log-losses for training lengths up to $N = 5{,}000$, where the PBCT outperforms fixed and variable-order Markov models for all training lengths. Additionally, Figures 4(b) and 4(d) illustrate continuations of the same experimental simulations for longer training lengths, plotted on a log-scale, from $N = 5{,}000$ up to $N = 200{,}000$. The PBCT model consistently recovers the simulated trees given longer training sequences. The fixed and variable-order Markov models, FBM-3 and VBM-3, continue to improve with longer training lengths but do not reach the same performance as PBCT-3.

Figure 4: Average fitted log-losses and difference in log-losses between fitted and simulated models averaged over 15 simulations, for varying training lengths and a fixed test length of 1000 elements. Synthetic trees are generated with vocabulary size $V = 10$, CRP parameter $\alpha = 1$ and maximum depth $D = 3$. Panels (a) and (c) show that PBCT outperforms the other Markov models over all training lengths. Panels (b) and (d) show consistent recovery of simulated PBCT structure for longer training lengths.

5 Application to real data

The parsimonious Bayesian context tree model is applied to two real-world examples of categorical sequence data: (i) Imperial honeypot terminal sessions and (ii) UniProt protein sequences. Comparisons are made with fixed and variable-order Markov models.

Table 2: Sequence statistics for real-world datasets.

| Dataset | Mean length | Total training length | Total test length |
|---|---|---|---|
| Honeypot | 52 | 46,225 | 5,402 |
| UniProt | 69 | 31,167 | 3,370 |

Table 3: Predictive performance of each model evaluated on the honeypot and UniProt datasets. Best log-loss performance is achieved by PBCT for both datasets.

| Model | Honeypot $L$ | Honeypot log-loss | UniProt $L$ | UniProt log-loss |
|---|---|---|---|---|
| FBM-0 | 1 | 2.72327 | 1 | 2.83083 |
| FBM-1 | 93 | 1.04483 | 21 | 2.57862 |
| FBM-2 | 8,649 | 0.72451 | 441 | 1.89768 |
| FBM-3 | 804,357 | - | 9,261 | 1.48856 |
| FBM-4 | - | - | 194,481 | 1.52664 |
| VBM | 93 | 1.04483 | 5,641 | 1.48510 |
| BCT¹ | 1,312 | 0.70253 | 3,820 | 1.48539 |
| PBCT | 654 | 0.69012 | 1,076 | 1.47870 |

¹ BCTs were fitted to concatenations of sequences; the software did not accept multiple sequence inputs.

5.1 Honeypot terminal sessions

In this example, the data are sessions of command-line sequences collected on a honeypot at Imperial College London. A honeypot is a type of computer network host designed to observe the behaviour of cyber attackers when granted access to the network. The Imperial honeypot data are in the form of sessions, where each session is a sequence of Unix terminal commands executed by an attacker. A collection of 1000 sessions, recorded between May 2021 and January 2022, is considered for model training and evaluation: 90% of the sessions are used for training and the remaining 10% are used to evaluate predictive performance. Table 2 details the average session lengths and total number of commands in the data. Each command in the honeypot sessions is considered a word in the vocabulary $\mathcal{V}$. After preprocessing commands into common categories and key Unix commands, the vocabulary has $V = 93$ elements. The PBCT model is trained using CRP rate parameter $\alpha = 1$ and Dirichlet hyperparameter $\bm{\eta} = \mathbf{1}_V$. Fixed Bayesian Markov models of orders $D \in \{0, 1, 2, 3\}$ are defined (FBM-$D$, where FBM-0 is a bag-of-words model), and the variable-order Bayesian Markov (VBM) model is as described in Section 4. For comparison, a Bayesian context tree (BCT) is fitted using the publicly available software of Kontoyiannis et al. (2022). For each model, the conditional distributions $\bm{\phi}_{\bm{e}}$ used for prediction are Dirichlet with parameter $\bm{\eta} = \mathbf{1}_V$. A maximum depth of $D = 3$ is specified before training the variable-order models PBCT, VBM and BCT.

Table 3 reports the log-losses computed on the test sequences for each model. The number of conditional distributions defined by each model is denoted by $L \in \mathbb{N}$. The best predictive performance is achieved by the PBCT. The fitted PBCT has $L = 654$ and a maximum depth of 2, a substantial reduction in dimensionality that also achieves better predictive performance than the fixed order-2 Markov model (FBM-2), which has $L = 8{,}649$. The fitted BCT, also of maximum depth $D = 2$, has $L = 1{,}132$ and is slightly outperformed by the PBCT in terms of predictive log-loss. The log-loss for FBM-3 is omitted due to excessive computational cost ($L = 804{,}357$).
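The values of $L$ for the fixed-order models follow directly from the vocabulary size: an order-$d$ model has one conditional distribution per context, so $L = V^d$, and each distribution has $V - 1$ free parameters. The short check below reproduces this arithmetic; the function names are illustrative only.

```python
def num_contexts(V, d):
    """Number of conditional distributions L in a fixed order-d Markov model."""
    return V ** d

def num_free_parameters(V, d):
    """Free parameters of a fixed order-d model: V^d contexts, V - 1 each."""
    return V ** d * (V - 1)

honeypot_L = [num_contexts(93, d) for d in range(4)]  # FBM-0 to FBM-3, V = 93
uniprot_L = [num_contexts(21, d) for d in range(5)]   # FBM-0 to FBM-4, V = 21
```

For the honeypot vocabulary this gives 1, 93, 8,649 and 804,357 contexts, and for the protein vocabulary 1, 21, 441, 9,261 and 194,481, which shows how quickly the fixed-order parameter space grows with the order.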

[Figure 5 content: a learned context with $x_{N-1} \in \{$.S4Y, .LAYER, .PTMX, .HUMAN, .MISA, SH, WRITE_FILE, OTHER$\}$ and $x_N =$ cd, with a bar chart of $p(x_{N+1} \mid \bm{x})$ for the predicted next commands HOME_DIR and OTHER.]
Figure 5: Example of learned command contexts and prediction using the PBCT fitted to honeypot sessions.

Figure 5 illustrates an example of a learned order-2 context in the PBCT and the two largest predictive probabilities, calculated as posterior means. The example demonstrates the ability to predict the behaviour of a network intruder: an attacker first attempts installation of a MIRAI variant (such as .PTMX or .MISA; see Sanna Passino et al., 2023, for further discussion), then changes directory using the cd command. Following this context, the PBCT predicts the common behaviour of navigating back to the home directory.

5.2 Protein sequences

The UniProt knowledgebase (The UniProt Consortium, 2016) is a large collection of protein sequences. Each entry is a sequence with elements from a vocabulary of the 20 standard amino acids. An extra element is added to the vocabulary to represent a missing or erroneous amino acid, so the final vocabulary has size $V = 21$. A random sample of sequences of lengths between 10 and 100 is considered for computational feasibility. Table 2 details the average and total protein sequence lengths used for analysis.

The PBCT model is fitted to the protein sequences using a CRP rate parameter $\alpha = 1$ and Dirichlet hyperparameter $\bm{\eta} = \mathbf{1}_V$. The fixed and variable-order Bayesian Markov (FBM, VBM) and Bayesian context tree (BCT) models are considered for comparison. A maximum depth of $D = 6$ is specified for each variable-order model before training. Table 3 contains the results of fitting each model to the UniProt data. The fitted PBCT, VBM and BCT models each have maximum depth $D = 4$. The PBCT achieves the best predictive performance in terms of log-loss with fewer parameters than the comparable models FBM-3, VBM and BCT.

[Figure 6 content: a learned context with $x_{N-3} \in \{$C, L, S, R$\}$, $x_{N-2} \in \neg\,$C, $x_{N-1} \in \{$T, X$\}$ and $x_N =$ D, with a bar chart of $p(x_{N+1} \mid \bm{x})$ for the predicted next amino acids R, K and N.]
Figure 6: Example of protein motif discovery and prediction using the PBCT fitted to UniProt sequences. Here, the shorthand $\neg\,$C denotes all amino acids excluding C.

Figure 6 shows a learned order-4 context and the corresponding estimated probabilities for predicting the next amino acid. Aside from prediction, this example demonstrates an application of the PBCT to the discovery of protein sequence motifs (see, for example, Bailey, 2007). Sequence motifs are short subsequences of amino acids used to characterise groups of proteins by functional or structural similarity. Fitting PBCTs to collections of protein sequences may lead to the discovery of new motifs, represented by learned contexts.

6 Conclusion

A novel inference method has been introduced to efficiently capture complex dependence structures and make predictions in discrete sequences. The proposed parsimonious Bayesian context tree model has been verified using a simulation study for a range of model configurations, and outperforms existing context tree models in real-world examples while reducing the parameter space. Key advantages of the PBCT model include scalability to large vocabularies and interpretability of results. PBCTs can be applied in practice for prediction tasks, anomaly detection and changepoint analysis, in addition to the protein motif discovery application discussed in Section 5.2. An interesting extension of the inference method would involve updates of tree structures given streaming data.

Markov chain Monte Carlo (MCMC) provides another method for inference of PBCT structure given data. MCMC via Gibbs sampling can be implemented by iteratively resampling from the posterior distributions of parent-child node structures. At each vocabulary clustering step, the results of agglomerative clustering can be used to initialise a Gibbs sampler. In such an implementation of MCMC, preliminary testing showed that the fitted tree structures rarely improved on the results using only RAC, Algorithm 1.


SUPPLEMENTARY MATERIAL

The Python library pbct contains the code used to fit parsimonious Bayesian context trees, available in the GitHub repository at https://github.com/daniyarghani/pbct.

References

  • Aldous, D. J. (1985). Exchangeability and related topics. In Lecture Notes in Mathematics, pages 1–198. Springer Berlin Heidelberg.
  • Bailey, T. L. (2007). Discovering sequence motifs. Methods in Molecular Biology, 395:271–292.
  • Begleiter, R., El-Yaniv, R., and Yona, G. (2004). On prediction using variable order Markov models. Journal of Artificial Intelligence Research, 22:385–421.
  • Bennett, I., Martin, D. E. K., and Lahiri, S. N. (2023). Fitting sparse Markov models through a collapsed Gibbs sampler. Computational Statistics, 38:1977–1994.
  • Bourguignon, P. Y. and Robelin, D. (2004). Modèles de Markov parcimonieux : sélection de modèle et estimation. In Proceedings of the 5e édition des Journées Ouvertes en Biologie, Informatique et Mathématiques, Montréal.
  • Dimitrakakis, C. (2010). Bayesian variable order Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 161–168. PMLR.
  • Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis, volume 3. Wiley New York.
  • Eggeling, R., Gohr, A., Bourguignon, P.-Y., Wingender, E., and Grosse, I. (2013). Inhomogeneous parsimonious Markov models. In Machine Learning and Knowledge Discovery in Databases, pages 321–336. Springer Berlin Heidelberg.
  • Eggeling, R., Grosse, I., and Koivisto, M. (2019). Algorithms for learning parsimonious context trees. Machine Learning, 108(6):879–911.
  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.
  • García, J. E. and González-López, V. A. (2011). Minimal Markov models. In Fourth Workshop on Information Theoretic Methods in Science and Engineering: Proceedings, pages 25–28. University of Helsinki.
  • García, J. E. and González-López, V. A. (2017). Consistent estimation of partition Markov models. Entropy, 19(4):160.
  • Heard, N. A., Holmes, C. C., and Stephens, D. A. (2006). A quantitative study of gene regulation involved in the immune response of Anopheline mosquitoes. Journal of the American Statistical Association, 101(473):18–29.
  • Heller, K. A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 297–304.
  • Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.
  • Jääskinen, V., Xiong, J., Corander, J., and Koski, T. (2014). Sparse Markov chains for sequence data. Scandinavian Journal of Statistics, 41(3):639–655.
  • Kontoyiannis, I., Mertzanis, L., Panotopoulou, A., Papageorgiou, I., and Skoularidou, M. (2022). Bayesian context trees: Modelling and exact inference for discrete time series. Journal of the Royal Statistical Society Series B, 84(4):1287–1323.
  • Lewis, P. (2001). A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6):913–925.
  • Mächler, M. and Bühlmann, P. (2004). Variable Length Markov Chains: Methodology, Computing, and Software. Journal of Computational and Graphical Statistics, 13(2):435–455.
  • Papageorgiou, I. and Kontoyiannis, I. (2024). Posterior Representations for Bayesian Context Trees: Sampling, Estimation and Convergence. Bayesian Analysis, 19(2):501–529.
  • Rissanen, J. (1983). A universal data compression system. IEEE Transactions on Information Theory, 29(5):656–664.
  • Sanna Passino, F., Mantziou, A., Ghani, D., Thiede, P., Bevington, R., and Heard, N. A. (2023). Nested Dirichlet models for unsupervised attack pattern detection in honeypot data. arXiv e-prints, 2301.02505.
  • The UniProt Consortium (2016). UniProt: the universal protein knowledgebase. Nucleic Acids Research, 45(D1):D158–D169.
  • Xiong, J., Jääskinen, V., and Corander, J. (2016). Recursive learning for sparse Markov models. Bayesian Analysis, 11(1):247–263.