License: arXiv.org perpetual non-exclusive license
arXiv:2404.04201v1 [cs.PL] 05 Apr 2024

V-Star: Learning Visibly Pushdown Grammars from Program Inputs (Extended Version)

Xiaodong Jia (ORCID: 0000-0003-2493-9111) and Gang Tan (ORCID: 0000-0001-6109-6091), The Pennsylvania State University, 201 Old Main, State College, Pennsylvania, USA 16802
Abstract.

Accurate description of program inputs remains a critical challenge in the field of programming languages. Active learning, a well-established field, achieves exact learning for regular languages. We offer an innovative grammar inference tool, V-Star, based on the active learning of visibly pushdown automata (VPAs). V-Star deduces nesting structures of program input languages from sample inputs, employing a novel inference mechanism based on nested patterns. This mechanism identifies token boundaries and converts languages such as XML documents into visibly pushdown languages (VPLs). We then adapt Angluin’s L-Star, an exact learning algorithm, to VPA learning, which improves the precision of our tool. Our evaluation demonstrates that V-Star effectively and efficiently learns a variety of practical grammars, including S-Expressions, JSON, and XML, and outperforms other state-of-the-art tools.

grammar inference; visibly pushdown grammar
CCS Concepts: Theory of computation → Program analysis; Theory of computation → Grammars and context-free languages; Software and its engineering → Automatic programming

1. Introduction

In recent years, there has been a growing interest in learning grammars from a set of sample strings. This interest stems from a wide range of applications in fuzzing, program validation, and other areas (Arefin et al., 2024; Bastani et al., 2017; Kulkarni et al., 2022; Wu et al., 2019; Bendrissou et al., 2022). Despite significant progress, the challenge of learning the input grammars for black-box programs remains, particularly when considering grammars with inherent complexities. This challenge is part of a broader problem that has been extensively studied for regular languages but is significantly more difficult when dealing with broader classes of grammars.

Recently, GLADE (Bastani et al., 2017) (followed by a replication study (Bendrissou et al., 2022)) and ARVADA (Kulkarni et al., 2022) have been proposed to learn context-free grammars (CFGs) under active learning. Both approaches require positive seed inputs and utilize enumeration and heuristics to reduce the search space. However, neither approach fully exploits nesting structures, which could improve the accuracy of grammar learning.

Nesting structures are widely observed in practical languages, where recursions are explicitly delimited in their sentences. For example, an XML document’s open and close tags delimit a component of the document and can be nested within other open and close tags. These nesting structures often carry valuable insights into the grammars’ structure and could potentially be a powerful tool for learning grammars.

To achieve accurate learning, we model the nesting structures together with the target language as Visibly Pushdown Grammars (VPGs) (Alur and Madhusudan, 2009), a subclass of CFGs. VPGs formally specify nesting structures, and despite being slightly weaker than CFGs they can specify many practical format languages such as XML and JSON; they also enjoy all desirable closure properties; e.g., the set of visibly pushdown languages is closed under intersection, concatenation, and complement (Alur and Madhusudan, 2009). We posit that these properties position VPGs as an ideal mechanism for learning practical grammars, which is the focus of this paper.

Learning VPGs is a problem that fits nicely into the well-studied active learning field. In this field, Angluin (Angluin, 1987) first demonstrated that it is possible to efficiently learn regular languages from a minimally adequate teacher (MAT), which answers (1) whether a string is in the language, which is called a membership query, and (2) whether a finite state automaton accepts exactly the language held by the teacher, which is called an equivalence query; if not, the teacher provides a counterexample accepted by either the automaton or the language, but not both.

Despite advancements in VPG learning facilitated by a MAT, current techniques, as highlighted in previous work (Barbot et al., 2021; Michaliszyn and Otop, 2022; Isberner, 2015), are constrained by assumptions that do not match practical settings. Specifically, they assume the availability of known nesting patterns, or more technically, a predefined tagging function—a cornerstone of the VPG formalism detailed in Section 3. This tagging function defines a set of call and return symbols and is assumed to operate on individual characters. In contrast, in many practical settings the tagging must be inferred and it operates on sequences of characters (i.e., tokens) rather than on individual characters; a more detailed comparison with prior work and the limitations of these assumptions are discussed in Section 2.

To address these limitations, we introduce V-Star, a novel grammar-inference framework designed to learn VPGs from a black-box program using a collection of sample seed strings. V-Star’s algorithm is inspired by $L^{*}$ (Angluin, 1987), which learns a finite-state automaton using a MAT and seed strings. We develop V-Star in several pivotal steps. First, we develop an $L^{*}$-like algorithm for learning VPGs when tags are known and are on single characters; this algorithm utilizes $k$-SEVPA (Alur et al., 2005) to define a set of congruence relations, critical to the algorithm. Second, we develop a tag-inference algorithm, which utilizes a novel notion of nesting patterns to infer call and return symbols, assuming they are of single characters. In the third step, we lift the restriction that tags are on single characters and develop an algorithm for inferring token-based tags. Finally, we remove the requirement of equivalence queries by simulating them using membership queries via sampling. These steps all together result in a practical framework for learning VPGs from seed strings. The main contributions of V-Star are summarized as follows:

  1. (1)

    Innovative Tool and Methodology: V-Star is a novel tool for VPG inference. Its algorithm adapts Angluin’s $L^{*}$ algorithm and integrates a set of novel techniques such as nesting patterns to infer call and return tokens. To the best of our knowledge, this is the first VPG-learning algorithm that does not require the call/return tokens to be known a priori.

  2. (2)

    Theoretical Reasoning: We provide a theoretical analysis elucidating the conditions under which V-Star achieves accurate learning. We first prove that for any character-based visibly pushdown language with uniquely paired call/return symbols, there exists a finite set of seed strings from which V-Star can learn a tagging function that achieves exact learning. We further show that under some realistic assumptions V-Star can infer tagged tokens for token-based visibly pushdown languages.

  3. (3)

    Accuracy: Our evaluation of V-Star demonstrates higher accuracy in learning practical grammars than state-of-the-art grammar learning tools. The accuracy result highlights the benefit of utilizing the concept of nesting structures and V-Star’s ability to simulate the equivalence queries by sampling test strings from the seed strings.

2. Related Work

In the realm of automata learning, learning finite state automata is a well-studied field. The $L^{*}$ Algorithm by Angluin (Angluin, 1987), which learns a finite state automaton held by a MAT in polynomial time, is a seminal work in this area. Following $L^{*}$, various adaptations focusing on the active learning problem have been proposed, such as (Rivest and Schapire, 1993; Maler and Pnueli, 1991; Irfan et al., 2010; Howar, 2012; Isberner, 2015), to list a few.

Glade (Bastani et al., 2017), targeting learning a CFG from an oracle, employs a two-step algorithmic approach. Initially, it enumerates all substrings of seed strings and attempts to generalize these substrings into regular expressions. Subsequently, nonterminals are created and merged based on learned regular expressions. Arvada (Kulkarni et al., 2022) also aims to learn a CFG by using a technique that exchanges two substrings from seed strings. If two substrings are interchangeable, they are assigned the same nonterminal, and the process gradually constructs parse trees. Arvada employs heuristics to exchange substrings with similar contexts. In its evaluation, the context is considered to be the surrounding substrings of length four. While context strings bear some similarity to call and return symbols in VPGs, call and return symbols are more flexible than context strings in Arvada: they can wrap contexts of an arbitrary length. Moreover, call and return symbols have a stronger implication on the recursive structure of the oracle language, a feature that V-Star capitalizes on for better grammar-inference accuracy, which is evidenced by V-Star’s experimental comparison with Glade and Arvada discussed in Section 6. As an extension of GLADE, REINAM (Wu et al., 2019) refines the grammar learned by GLADE using reinforcement learning. This process allows for the potential replacement of GLADE by other learning tools such as Arvada (Kulkarni et al., 2022) or our tool V-Star.

Learning VPGs, a subset of CFGs, has been the focus of several systems. For example, assuming that the set of call and return symbols is known, VPL* (Barbot et al., 2021) learns VPGs with membership and equivalence queries. The approach taken by VPL* is indirect, first using TL* (Drewes and Högberg, 2007; Drewes et al., 2011) to learn a tree automaton, which is then converted to a VPG. Another work by Michaliszyn and Otop (2022) also assumes that the set of call and return symbols is known and attempts to learn a visibly pushdown automaton (VPA). Moreover, it requires a stronger teacher who, in addition to providing membership and equivalence queries, can also report the stack content during the VPA’s execution. TTT (Isberner, 2015) is another VPA-inference tool under the active-learning setting, based on discrimination trees. V-Star differentiates itself from these systems by learning call and return symbols from the oracle and seed strings.

In practice, a MAT is often instantiated by a black-box program and the oracle language comprises input strings that do not trigger program errors (assuming the program always terminates). Thus, membership queries require only program execution. However, equivalence queries are much harder to answer. Simulating equivalence queries has a long history, often under the name of conformance testing (Aichernig et al., 2024). Chow (Chow, 1978) first proposed the so-called W-method for Mealy machines; we note that the FSA version of the W-method is essentially a brute force approach of enumerating suffix strings that distinguish the representative prefix strings in the Nerode relation. The space of suffix strings is restricted from $\Sigma^{*}$ to $\Sigma^{k}$, under the assumption that the difference between the size of the oracle FSA and the size of the learned FSA is $k$. The W-method has many variants (Vasilevskii, 1973; Raffelt et al., 2005; Fujiwara et al., 1991). For more information, we refer to Aichernig et al. (2024).

3. Background

3.1. Grammar Inference

In grammar inference, we assume an oracle for the grammar being inferred, denoted as $\mathcal{O}$, which maps strings to booleans: true for valid strings and false for invalid ones. The set of valid strings according to the oracle forms the oracle language, denoted as $\mathcal{L}_{\mathcal{O}}$. When the oracle language is defined by a grammar, we refer to this grammar as the oracle grammar, denoted as $G_{\mathcal{O}}$. Similarly, when the oracle language can be recognized by a deterministic finite automaton (DFA), we refer to this DFA as the oracle automaton, denoted as $\mathcal{H}_{\mathcal{O}}$.

We define the active learning problem as follows:

Inputs: The problem takes two inputs:

  1. (1)

    A set $\Sigma$ of terminals, and

  2. (2)

    A minimally adequate teacher (MAT), which can answer both membership and equivalence queries. For an equivalence query with a hypothesis grammar, the teacher returns true when the hypothesis grammar is equivalent to the oracle grammar, meaning they generate the same language; if they are not equivalent, it provides a counterexample, which is a string accepted by either the hypothesis grammar or the oracle grammar, but not by both.

Output: The goal of the active learning problem is to construct a grammar, denoted as $\mathcal{G}$, such that the language it generates, denoted as $\mathcal{L}_{\mathcal{G}}$, is identical to the oracle language $\mathcal{L}_{\mathcal{O}}$.
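The two kinds of queries can be made concrete with a small sketch. The class below is hypothetical and not part of V-Star: `BoundedTeacher`, `max_len`, and the toy oracle `balanced` are names introduced here for illustration only. A hypothesis is represented simply by its membership predicate, and the equivalence query is only approximated by comparing the hypothesis with the oracle on all strings up to a fixed length, so a `None` answer means "no disagreement found", not a proof of equivalence.

```python
from itertools import product
from typing import Callable, Iterable, Optional

Membership = Callable[[str], bool]


class BoundedTeacher:
    """Hypothetical MAT: exact membership, bounded-length equivalence."""

    def __init__(self, oracle: Membership, alphabet: Iterable[str], max_len: int):
        self.oracle = oracle              # plays the role of the oracle O
        self.alphabet = sorted(alphabet)  # the terminal set Sigma
        self.max_len = max_len            # assumption: compare only up to this length

    def member(self, s: str) -> bool:
        """Membership query: is s in the oracle language?"""
        return self.oracle(s)

    def equivalent(self, hypothesis: Membership) -> Optional[str]:
        """Approximate equivalence query.

        Returns None if no disagreement is found up to max_len; otherwise
        returns a counterexample accepted by exactly one of the two.
        """
        for n in range(self.max_len + 1):
            for chars in product(self.alphabet, repeat=n):
                s = "".join(chars)
                if hypothesis(s) != self.oracle(s):
                    return s
        return None


def balanced(s: str) -> bool:
    """Toy oracle over the alphabet {'(', ')'}: balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0


teacher = BoundedTeacher(balanced, "()", max_len=6)
```

When the teacher is backed by a real black-box program, `oracle` would wrap a program run; the bounded enumeration is one simple stand-in for the equivalence query and grows exponentially with `max_len`.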

3.2. Visibly Pushdown Grammars

The expressive power of VPGs (Alur and Madhusudan, 2009) is between regular grammars and context-free grammars, and VPGs are sufficient for describing the syntax of many practical languages, such as JSON, XML, and HTML. Application-wise, VPGs have been used in program analysis, XML processing, and other fields (Harris et al., 2012; Heizmann et al., 2010; Chaudhuri and Alur, 2007; Nguyen and Sudholt, 2006; Kumar et al., 2007; Alur, 2007; Thomo and Venkatesh, 2011; Gauwin et al., 2008; Mozafari et al., 2010, 2012). Moreover, since they can be parsed efficiently, VPGs have also proven valuable for specifying practical languages (Jia et al., 2021, 2023).

A language is called a visibly pushdown language (VPL) if it can be generated by a VPG. VPLs enjoy the same appealing theoretical closure properties as regular languages; e.g., the set of VPLs is closed under intersection, concatenation, and complement (Alur and Madhusudan, 2009). Further, since VPLs are a subset of deterministic context-free languages, it is always possible to build a deterministic pushdown automaton from a VPL.

A VPG (Alur and Madhusudan, 2009) is formally defined as a tuple $(V, \Sigma, P, L_0)$, where $V$ is a set of nonterminals, $\Sigma$ a set of terminals, $P$ a set of production rules, and $L_0 \in V$ the start nonterminal. The set of terminals $\Sigma$ is partitioned into three kinds: $\Sigma_{\text{plain}}$, $\Sigma_{\text{call}}$, $\Sigma_{\text{ret}}$, which contain plain, call, and return symbols, respectively. The stack action associated with an input symbol is fully determined by the kind of the symbol: an action of pushing to the stack is always performed for a call symbol, an action of popping from the stack is always performed for a return symbol, and no stack action is performed for a plain symbol. Notation-wise, a terminal in $\Sigma_{\text{call}}$ is tagged with ‹ on the left, and a terminal in $\Sigma_{\text{ret}}$ is tagged with › on the right. For example, $\text{‹}a$ is a call symbol in $\Sigma_{\text{call}}$, and $b\text{›}$ is a return symbol in $\Sigma_{\text{ret}}$.

Well-matched VPGs produce strings where each call symbol is always paired with a return symbol. They are formally defined as follows:

Definition 3.0 (Well-matched VPGs).

A grammar $G = (V, \Sigma, P, L_0)$ is a well-matched VPG if every production rule in $P$ adheres to one of the following forms:

  1. (1)

    $L \to \epsilon$, where $\epsilon$ stands for the empty string.

  2. (2)

    $L \to c L_1$, where $c \in \Sigma_{\text{plain}}$.

  3. (3)

    $L \to \text{‹}a \, L_1 \, b\text{›} \, L_2$, where $\text{‹}a \in \Sigma_{\text{call}}$ and $b\text{›} \in \Sigma_{\text{ret}}$.

Note that in $L \to c L_1$ terminal $c$ must be a plain symbol, and in $L \to \text{‹}a \, L_1 \, b\text{›} \, L_2$ a call symbol must be matched with a return symbol; these requirements ensure that any derived string must be well-matched. This is useful in languages like XML, where tags open and close in a nested, well-matched manner. For instance, the grammar rule “$\text{element} \to \text{OpenTag content CloseTag Empty} \mid \text{SingleTag Empty}$” represents an XML element that either contains content within matched open and close tags or is an empty single tag.

In this paper, we consider only well-matched VPGs, and use the term VPGs for well-matched VPGs. We also call rules in the form of $L \to \text{‹}a \, L_1 \, b\text{›} \, L_2$ matching rules, and rules of the form $L \to c L_1$ linear rules.
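To illustrate how the three rule forms constrain derivations, the following is a minimal recognizer sketch for well-matched VPGs. The rule encoding is an assumption made here for illustration: `()` stands for an empty rule, `(c, L1)` for a linear rule, and `(a, L1, b, L2)` for a matching rule; `make_recognizer` and the toy grammar are hypothetical names, not the paper's implementation. The sketch first pairs every call with its return using a stack, then checks derivations by memoized recursion over substrings.

```python
from functools import lru_cache


def make_recognizer(rules, start, calls, rets):
    """Build a membership test for a well-matched VPG (illustrative encoding)."""

    def accepts(s):
        # match[i] = index of the return symbol paired with the call at index i.
        match, stack = {}, []
        for i, ch in enumerate(s):
            if ch in calls:
                stack.append(i)
            elif ch in rets:
                if not stack:
                    return False      # unmatched return symbol
                match[stack.pop()] = i
        if stack:
            return False              # unmatched call symbol

        @lru_cache(maxsize=None)
        def derives(L, i, j):
            """Can nonterminal L derive the substring s[i:j]?"""
            for rule in rules[L]:
                if len(rule) == 0:                       # L -> epsilon
                    if i == j:
                        return True
                elif len(rule) == 2:                     # L -> c L1
                    c, L1 = rule
                    if i < j and s[i] == c and derives(L1, i + 1, j):
                        return True
                else:                                    # L -> <a L1 b> L2
                    a, L1, b, L2 = rule
                    k = match.get(i)
                    if (i < j and s[i] == a and k is not None and k < j
                            and s[k] == b
                            and derives(L1, i + 1, k)
                            and derives(L2, k + 1, j)):
                        return True
            return False

        return derives(start, 0, len(s))

    return accepts


# Toy grammar: L -> epsilon | x L | <( L )> L
rules = {"L": [(), ("x", "L"), ("(", "L", ")", "L")]}
accepts = make_recognizer(rules, "L", calls={"("}, rets={")"})
```

Because the position of each matching return is fixed by the tagging alone, the recognizer never has to guess where a nested component ends, which is the property that distinguishes VPGs from general CFGs.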

3.3. Visibly Pushdown Automata

A Visibly Pushdown Automaton (VPA) (Alur and Madhusudan, 2004) on finite strings over symbols $\Sigma_{\text{call}}$, $\Sigma_{\text{ret}}$, and $\Sigma_{\text{plain}}$ is a tuple $\mathcal{H} = (Q, q_0, \Gamma, \delta, Q_F)$ where $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $\Gamma$ is a finite stack alphabet that contains a special bottom-of-stack symbol $\bot$, $\delta = \delta_{\text{call}} \cup \delta_{\text{ret}} \cup \delta_{\text{pln}}$ is the transition function, where $\delta_{\text{call}} : Q \times \Sigma_{\text{call}} \rightarrow Q \times (\Gamma \setminus \{\bot\})$, $\delta_{\text{ret}} : Q \times \Sigma_{\text{ret}} \times \Gamma \rightarrow Q$, and $\delta_{\text{pln}} : Q \times \Sigma_{\text{plain}} \rightarrow Q$, and $Q_F \subseteq Q$ is a set of final states.

The function $\delta_{\text{call}}(q, \text{‹}a) = (q', \gamma)$ means that upon reading $\text{‹}a$, state $q$ is changed to state $q'$ and $\gamma$ is pushed onto the stack. Similarly, $\delta_{\text{ret}}(q, b\text{›}, \gamma) = q'$ means that upon reading $b\text{›}$ when the stack top is $\gamma$, $q$ is changed to $q'$ and $\gamma$ is removed from the top of the stack (if $\gamma$ is $\bot$, the empty stack remains unaltered). Finally, $\delta_{\text{pln}}(q, c) = q'$ means state $q$ is changed to $q'$ upon reading symbol $c$.

A stack is a non-empty finite sequence over $\Gamma$ ending in the bottom-of-stack symbol $\bot$. The set of all stacks is denoted as $St = (\Gamma \setminus \{\bot\})^{*} \cdot \{\bot\}$. A configuration is a pair $(q, T)$ of state $q$ and stack $T \in St$. We define the single-step transition of configurations $\delta((q, T), i)$ as tuple $(q', T')$, based on the type of symbol $i$ and the transition functions:

  1. (1)

    If $i \in \Sigma_{\text{call}}$, then $(q', \gamma) = \delta_{\text{call}}(q, i)$ and $T' = \gamma \cdot T$ for certain $\gamma \in \Gamma$;

  2. (2)

    If $i \in \Sigma_{\text{ret}}$, then $q' = \delta_{\text{ret}}(q, i, \gamma)$ for certain $\gamma$ and either $T = \gamma \cdot T'$, or $\gamma = \bot$ and $T = T' = \bot$;

  3. (3)

    If $i \in \Sigma_{\text{plain}}$, then $q' = \delta_{\text{pln}}(q, i)$ and $T' = T$.

We extend the single-step transition for string $si$ as $\delta((q, T), si) = \delta(\delta((q, T), s), i)$. A string $s \in \Sigma^{*}$ is accepted by VPA $\mathcal{H}$ if $\delta((q_0, \bot), s) \in Q_F$. The language of $\mathcal{H}$ is the set of strings accepted by $\mathcal{H}$.
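The configuration semantics above can be sketched as a small simulator. This is an illustrative sketch rather than the paper's code: the class `VPA`, its dictionary-based transition tables, and the toy automaton are assumptions made here. Two deviations from the definition are deliberate and labeled in the comments: the transition tables are partial (a missing entry rejects the input), and an optional flag `require_empty_stack` additionally demands that the stack be back at the bottom symbol, which the stated acceptance condition does not require.

```python
BOTTOM = "⊥"


class VPA:
    """Illustrative VPA simulator; stacks are tuples with the top element first."""

    def __init__(self, calls, rets, q0, finals, d_call, d_ret, d_pln):
        self.calls, self.rets = set(calls), set(rets)
        self.q0, self.finals = q0, set(finals)
        # Assumption: tables are partial dicts; a missing key means "reject".
        self.d_call, self.d_ret, self.d_pln = d_call, d_ret, d_pln

    def step(self, config, i):
        """Single-step transition delta((q, T), i), or None if undefined."""
        q, T = config
        if i in self.calls:
            if (q, i) not in self.d_call:
                return None
            q2, gamma = self.d_call[(q, i)]
            return q2, (gamma,) + T                # push gamma
        if i in self.rets:
            gamma = T[0]
            if (q, i, gamma) not in self.d_ret:
                return None
            q2 = self.d_ret[(q, i, gamma)]
            # Reading a return on the bottom symbol leaves the stack unaltered.
            return q2, (T if gamma == BOTTOM else T[1:])
        if (q, i) not in self.d_pln:
            return None
        return self.d_pln[(q, i)], T               # plain symbol: stack untouched

    def run(self, s):
        """Extended transition from the initial configuration (q0, ⊥)."""
        config = (self.q0, (BOTTOM,))
        for i in s:
            config = self.step(config, i)
            if config is None:
                return None
        return config

    def accepts(self, s, require_empty_stack=False):
        config = self.run(s)
        if config is None:
            return False
        q, T = config
        # Extra check, not in the definition: insist the stack is just ⊥.
        if require_empty_stack and T != (BOTTOM,):
            return False
        return q in self.finals


# Toy automaton: one state, '(' is a call, ')' is a return, 'x' is plain.
vpa = VPA(
    calls="(", rets=")", q0="q", finals={"q"},
    d_call={("q", "("): ("q", "g")},
    d_ret={("q", ")", "g"): "q"},
    d_pln={("q", "x"): "q"},
)
```

In this toy automaton the string `((` ends in a final state with a non-empty stack, so it is accepted under the state-only condition and rejected once `require_empty_stack=True` is set; the flag is what restricts the toy language to well-matched strings.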

3.4. Angluin’s L-Star Algorithm

We next briefly discuss the $L^{*}$ algorithm (Angluin, 1987), which learns a finite state automaton from a MAT in polynomial time. We first introduce a notion of equivalence. We define two strings to be equivalent w.r.t. language $\mathcal{L}$ if extending them with any suffix $w$ yields the same membership result in $\mathcal{L}$:

$$s_1 \simeq s_2 \;\equiv\; \forall w,\; s_1 w \in \mathcal{L} \;\Leftrightarrow\; s_2 w \in \mathcal{L}.$$

We also introduce a notion of approximate equivalence relative to suffixes in a test-string set $T$:

$$s_1 \simeq_T s_2 \;\equiv\; \forall w \in T,\; s_1 w \in \mathcal{L} \;\Leftrightarrow\; s_2 w \in \mathcal{L}.$$
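A short sketch shows why the relation relative to a test set is only an approximation. The function `t_equivalent` and the toy language `ends_with_ab` are hypothetical names introduced for this illustration; membership is supplied as a plain predicate.

```python
def t_equivalent(s1, s2, tests, member):
    """Check s1 and s2 agree on membership after every suffix w in tests."""
    return all(member(s1 + w) == member(s2 + w) for w in tests)


def ends_with_ab(s):
    """Toy language: strings that end with 'ab'."""
    return s.endswith("ab")


# With only the empty suffix, "" and "a" look the same (neither is in the language).
coarse = t_equivalent("", "a", [""], ends_with_ab)

# Adding the suffix "b" separates them: "b" is rejected while "ab" is accepted.
fine = t_equivalent("", "a", ["", "b"], ends_with_ab)
```

Here `coarse` is true and `fine` is false: a larger test set can only split classes further, so the approximate relation moves toward the exact one as test strings are added.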

The $L^{*}$ algorithm operates in iterations and maintains two sets of strings: $Q$ and $T$, both of which start with $\{\epsilon\}$. The set $T$ contains a set of test strings. The set $Q$ contains a set of strings that are separable by $T$, which means any two different strings in $Q$ are not $T$-equivalent: $\forall s_1\ s_2 \in Q,\ s_1 \neq s_2 \;\Rightarrow\; s_1 \not\simeq_T s_2$. In addition, $(Q, T)$ is closed in the sense that for any $s \in Q$ and any symbol $c$ there exists $s' \in Q$ such that $sc \simeq_T s'$.

Given separable and closed $(Q, T)$, we can construct a hypothesis DFA: each string in $Q$ becomes a state and we add a transition from $s \in Q$ with input symbol $c$ to the unique state $s' \in Q$ such that $sc \simeq_T s'$; the initial state is the empty string $\epsilon$ and acceptance states are those $Q$ strings that are in $\mathcal{L}$. With the hypothesis DFA, we can ask the MAT to check if the DFA is equivalent to the oracle language. If it is, a DFA for the oracle language has been learned and the algorithm terminates. If not, the MAT gives a counterexample. With the counterexample, the algorithm can extend $Q$ and $T$ and then use membership queries provided by the MAT to make $Q$ and $T$ separable and closed again. Details can be found in the $L^{*}$ paper (Angluin, 1987).
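The loop described above can be sketched compactly. This is a simplified illustration, not Angluin's exact procedure: counterexamples are handled by adding all of their suffixes to the test set (the variant attributed to Maler and Pnueli), and the equivalence query is replaced by a bounded search. The names `lstar`, `bounded_counterexample`, and `ends_with_ab` are hypothetical.

```python
from itertools import product


def lstar(alphabet, member, find_counterexample):
    """Learn a DFA as (states, transitions, accepting); states are strings in Q."""
    Q, T = [""], [""]

    def row(s):
        # Membership signature of s under the current test strings T.
        return tuple(member(s + t) for t in T)

    while True:
        # Restore closedness: each one-symbol extension must match a state's row.
        changed = True
        while changed:
            changed = False
            rows = {row(s) for s in Q}
            for s, c in product(list(Q), alphabet):
                if row(s + c) not in rows:
                    Q.append(s + c)          # new state, separable from the others
                    rows.add(row(s + c))
                    changed = True

        # Build the hypothesis DFA from the separable and closed (Q, T).
        by_row = {row(s): s for s in Q}
        delta = {(s, c): by_row[row(s + c)] for s in Q for c in alphabet}
        accepting = {s for s in Q if member(s)}

        def hypothesis(w, delta=delta, accepting=accepting):
            state = ""
            for c in w:
                state = delta[(state, c)]
            return state in accepting

        cex = find_counterexample(hypothesis)
        if cex is None:
            return Q, delta, accepting
        # Assumption: add every suffix of the counterexample to T.
        for k in range(len(cex) + 1):
            if cex[k:] not in T:
                T.append(cex[k:])


def bounded_counterexample(alphabet, member, max_len):
    """Stand-in for an equivalence query: compare on all strings up to max_len."""
    def find(hypothesis):
        for n in range(max_len + 1):
            for chars in product(alphabet, repeat=n):
                w = "".join(chars)
                if hypothesis(w) != member(w):
                    return w
        return None
    return find


def ends_with_ab(w):
    return w.endswith("ab")


states, delta, accepting = lstar(
    "ab", ends_with_ab, bounded_counterexample("ab", ends_with_ab, 6)
)
```

For this toy language the run yields the three states `""`, `"a"`, and `"ab"`, with `"ab"` the only accepting state. Adding test strings never merges existing rows, which is why separability of $Q$ is preserved without an explicit check.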

4. V-Star for a Character-Based VPL

V-Star learns a Visibly Pushdown Automaton (VPA) using a MAT, which provides both membership and equivalence queries. For ease of exposition, we divide our discussion into two steps: in this section, we consider grammar inference for a character-based VPL, in which the tagging of call/return symbols is on individual characters; in the next section, we consider grammar inference for a token-based VPL, a more realistic setting in which the tagging is performed on tokens (sequences of characters).

For both steps, we develop an algorithm that not only infers the tagging but also constructs a VPA. We further prove that this constructed VPA achieves exact learning, meaning it recognizes the oracle language. We also provide an analysis of the algorithm’s time complexity.

This section proceeds as follows: we start with a precise problem statement in Section 4.1; we then introduce a new algorithm for learning a VPA in Section 4.2 assuming a given tagging function. Then, in Section 4.3, we study how to infer a tagging function that makes the VPA-learning algorithm terminate and achieve exact learning.

4.1. Problem Statement

V-Star seeks to infer a Visibly Pushdown Grammar (VPG) from a black-box oracle that knows a VPL. We next define the precise knowledge of the oracle and what queries it allows.

We assume $\Sigma$ is the alphabet set from which valid strings can draw characters. A VPL tags each character $i$ in $\Sigma$ as a call symbol ‹$i$, a return symbol $i$›, or a plain symbol. This is modeled by a tagging function $t:\Sigma\to\hat{\Sigma}$, which maps a character $i$ to either ‹$i$, $i$›, or $i$ itself. This function extends to strings: $t(s)=t(s[1])\ldots t(s[n])$, where $n$ is the length of $s$ and $s[j]$ is its $j$-th character. Given a tagging function $t$, we define the terminal set $\hat{\Sigma}_{t}$ (also denoted as $\hat{\Sigma}$ when $t$ is clear from the context) as the set of tagged characters: $\hat{\Sigma}_{t}=\Sigma_{\text{call}}\cup\Sigma_{\text{plain}}\cup\Sigma_{\text{ret}}$, where $\Sigma_{\text{call}}$, $\Sigma_{\text{ret}}$, and $\Sigma_{\text{plain}}$ include call, return, and plain symbols defined by $t$, respectively.
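As a concrete illustration, a tagging function and its extension to strings might be represented as follows. The encoding of a tagged character as a `(kind, char)` tuple and the helper names are assumptions made for this sketch only.

```python
CALL, RET, PLAIN = "call", "ret", "plain"

def make_tagger(calls, rets):
    """Return a tagging function t for the given call and return characters."""
    calls, rets = set(calls), set(rets)
    assert not (calls & rets), "a character cannot be both call and return"

    def t(ch):
        if ch in calls:
            return (CALL, ch)
        if ch in rets:
            return (RET, ch)
        return (PLAIN, ch)

    return t

def tag_string(t, s):
    """Extend t to strings: t(s) = t(s[1]) ... t(s[n])."""
    return [t(ch) for ch in s]

def untag(tagged):
    """Recover the untagged string from a tagged one."""
    return "".join(ch for _, ch in tagged)

def terminal_set(t, alphabet):
    """The set of tagged characters induced by t over the alphabet."""
    return {t(ch) for ch in alphabet}
```

With `t = make_tagger("[", "]")`, the string `[x]` is tagged as a call symbol, a plain symbol, and a return symbol, and `untag` restores the original string.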

An oracle $\mathcal{O}$ knows a language $\mathcal{L}\subseteq\Sigma^{*}$ and a tagging function $t_{\mathcal{O}}$ such that the tagged language $\hat{\mathcal{L}}_{\mathcal{O}}=\{t_{\mathcal{O}}(s)\mid s\in\mathcal{L}\}$ is a VPL over terminal set $\hat{\Sigma}_{t_{\mathcal{O}}}$. The oracle $\mathcal{O}$ can answer membership and equivalence queries in active learning. The oracle’s ability to answer these queries is modeled as two functions. The membership query function, $\chi_{\mathcal{L}}:\Sigma^{*}\to\{\text{True},\text{False}\}$, is defined as follows:

$$\chi_{\mathcal{L}}(s)=\begin{cases}\text{True}&\text{if }s\in\mathcal{L},\\ \text{False}&\text{otherwise}.\end{cases}$$

That is, it returns true iff the input string $s$ belongs to $\mathcal{L}$. Note that input strings to membership queries do not carry tags, which reflects the fact that existing oracles are typically recognizers/parsers that take untagged strings. An example oracle used in our experiments is an off-the-shelf JSON parser, which takes untagged JSON strings; the goal of V-Star is to learn the JSON grammar from this oracle. Also note that we sometimes abuse the notation and pretend that $\chi_{\mathcal{L}}$ can also take tagged strings, in which case it performs membership testing using the string after tagging is removed; i.e., for a tagging function $t$, $\chi_{\mathcal{L}}(t(s))$ is defined as $\chi_{\mathcal{L}}(s)$.
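A membership oracle of this kind can be sketched by wrapping a parser. The example below uses Python's standard `json` module as a stand-in; it is an assumption of this sketch, not the parser used in the paper's experiments, and it is slightly more permissive than strict JSON (for example, it accepts `NaN`). Tagged strings are assumed to be lists of `(kind, char)` tuples.

```python
import json

def chi(s: str) -> bool:
    """Membership query: True iff the parser accepts the untagged string s."""
    try:
        json.loads(s)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

def chi_tagged(tagged) -> bool:
    """Membership on a tagged string: strip the tags, then query as usual."""
    return chi("".join(ch for _, ch in tagged))
```

The second function mirrors the notational convention above: $\chi_{\mathcal{L}}(t(s))$ is answered by querying the oracle on $s$.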

We next define the equivalence query function, which checks the equivalence between the oracle language $\mathcal{L}$ and the language defined by a hypothesis VPA $\mathcal{H}$ proposed by some learning algorithm. One complication is that the tagging function produced by the learner might be different from the oracle tagging function, even if the underlying untagged language is the same as the oracle one. This is due to the inherent flexibility of VPL tagging. As an example, suppose the oracle language is $\{(\text{‹}a\,\text{‹}g)^{k}(h\text{›}\,b\text{›})^{k}\mid k\geq 0\}$, then its underlying untagged language is the same as the untagged language of $\{(\text{‹}a\,g)^{k}(h\,b\text{›})^{k}\mid k\geq 0\}$, which tags only $a$ and $b$, or of $\{(a\,\text{‹}g)^{k}(h\text{›}\,b)^{k}\mid k\geq 0\}$, which tags only $g$ and $h$.
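The example can be checked mechanically. The sketch below tags the untagged strings $(ag)^{k}(hb)^{k}$ under the three taggings and confirms that each tagged string is well matched; the helper names and the `(kind, char)` encoding are assumptions of this illustration.

```python
def tag(s, calls, rets):
    """Tag each character of s as call, return, or plain."""
    def kind(ch):
        if ch in calls:
            return "call"
        if ch in rets:
            return "ret"
        return "plain"
    return [(kind(ch), ch) for ch in s]

def is_well_matched(tagged):
    """True iff every call has a later return and every return an earlier call."""
    depth = 0
    for kind, _ in tagged:
        if kind == "call":
            depth += 1
        elif kind == "ret":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

# Three taggings of the same untagged language {(ag)^k (hb)^k | k >= 0}.
TAGGINGS = {
    "oracle": ("ag", "hb"),      # tags a, g as calls and h, b as returns
    "only a and b": ("a", "b"),
    "only g and h": ("g", "h"),
}

def all_taggings_well_matched(max_k):
    """Check every tagging on every sentence up to nesting depth max_k."""
    return all(
        is_well_matched(tag("ag" * k + "hb" * k, calls, rets))
        for k in range(max_k + 1)
        for calls, rets in TAGGINGS.values()
    )
```

All three taggings leave the untagged strings unchanged, so a learner that settles on any of them can still describe the same language.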

Since what is relevant is the underlying untagged language, we should allow a learner to learn a different tagging function. Therefore we assume that the learner produces a hypothesis Visibly Pushdown Automaton (VPA) $\mathcal{H}$, as well as a hypothesis tagging function $t_{\mathcal{H}}$. The learner should achieve exact learning, formally stated as $\forall s\in\Sigma^{*},\;\chi_{\mathcal{L}}(s)=\chi_{(\mathcal{H},t_{\mathcal{H}})}(s)$, where

$$\chi_{(\mathcal{H},t_{\mathcal{H}})}(s)=\begin{cases}\text{True}&\text{if }t_{\mathcal{H}}(s)\text{ is accepted by }\mathcal{H},\\ \text{False}&\text{otherwise}.\end{cases}$$

Now the equivalence query function $\mathcal{E}$ is defined as follows: $\mathcal{E}(\mathcal{H},t_{\mathcal{H}})$ returns none when the oracle language is equivalent to the untagged language recognized by $\mathcal{H}$ and otherwise returns some $s$ such that $\chi_{\mathcal{L}}(s)\neq\chi_{(\mathcal{H},t_{\mathcal{H}})}(s)$.
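To make the interface concrete, the sketch below approximates an equivalence query by exhaustively comparing two membership functions on all strings up to a length bound. This is only an illustration of the input/output contract: a bounded search cannot establish true equivalence, and it is not the simulation the paper describes in Section 6. The function name and parameters are hypothetical.

```python
from itertools import product

def bounded_equivalence_query(chi_oracle, chi_hypothesis, alphabet, max_len):
    """Return a shortest string (up to max_len) on which the two membership
    functions disagree, or None if they agree on all such strings."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if chi_oracle(s) != chi_hypothesis(s):
                return s
    return None
```

Here `chi_hypothesis` plays the role of $\chi_{(\mathcal{H},t_{\mathcal{H}})}$: it would tag its input with $t_{\mathcal{H}}$ and run $\mathcal{H}$ on the result.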

V-Star’s active learning goal is, with an oracle that provides membership and equivalence queries, to learn a tagging function $t$ and a VPA $\mathcal{H}$ so that exact learning is achieved.

The Unique Pairing assumption for oracle languages

To simplify the tagging inference algorithm that will be discussed in Section 4.3, we assume that in the oracle VPL $\hat{\mathcal{L}}_{\mathcal{O}}=\{t_{\mathcal{O}}(s)\mid s\in\mathcal{L}\}$, a call symbol is uniquely paired with a return symbol; i.e., if ‹$a$ is matched with $b$› in one sentence, then ‹$a$ can be matched with only $b$› in every sentence of the language. This assumption simplifies our algorithm design, and is satisfied by languages we experimented with (e.g., XML and JSON). We now represent pairs $(a,b)$ in a tagging function $t$ as a tagging $T\subseteq 2^{\Sigma\times\Sigma}$, where $t(a)=\text{‹}a$, $t(b)=b\text{›}$. While Algorithm 3 technically can be adjusted to operate without the above assumption, efficiency would be significantly decreased.
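The Unique Pairing assumption can be tested on a finite sample of tagged sentences. The sketch below recovers the matched call/return pairs of each sentence with a stack and checks that no call symbol is paired with two different return symbols. The `(kind, char)` encoding and function names are assumptions of this illustration, and a finite sample can refute the assumption but never prove it for the whole language.

```python
def matched_pairs(tagged):
    """Return the set of (call_char, return_char) pairs matched in a sentence."""
    stack, pairs = [], set()
    for kind, ch in tagged:
        if kind == "call":
            stack.append(ch)
        elif kind == "ret":
            if not stack:
                raise ValueError("unmatched return symbol")
            pairs.add((stack.pop(), ch))
    if stack:
        raise ValueError("unmatched call symbol")
    return pairs

def satisfies_unique_pairing(sentences):
    """True iff each call symbol is matched with a single return symbol
    across all the given tagged sentences."""
    partner = {}
    for tagged in sentences:
        for a, b in matched_pairs(tagged):
            if partner.setdefault(a, b) != b:
                return False
    return True
```

The union of the pair sets returned by `matched_pairs` corresponds to the tagging $T$ of pairs $(a,b)$ described above.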

4.2. Learning VPA with Known Tagging

This subsection outlines an algorithm for learning a Visibly Pushdown Automaton (VPA) using a MAT, assuming a tagging function $t$ as input. $\hat{\Sigma}_{t}$ is the tagged alphabet according to $t$ and $\hat{\mathcal{L}}_{t}=\{t(s)\mid s\in\mathcal{L}\}$ is the oracle language $\mathcal{L}$ tagged with $t$. We assume that, under $t$, $\hat{\mathcal{L}}_{t}$ contains only well-matched strings. To avoid clutter, we will write $\hat{\Sigma}$ for $\hat{\Sigma}_{t}$ and $\hat{\mathcal{L}}$ for $\hat{\mathcal{L}}_{t}$ in this subsection. While there were prior VPA-learning algorithms proposed under this setting, some required more information from the oracle, such as the stack content during VPA execution (Michaliszyn and Otop, 2022). Isberner (2015) used advanced discrimination tree structures to minimize the number of membership queries; however, both Isberner (2015) and Howar (2012) discussed that discrimination tree-based algorithms could significantly raise the number of equivalence queries. Since in our implementation we simulate equivalence queries using membership queries (see Section 6), increasing the number of equivalence queries would escalate the simulation effort.

In this section, we introduce a VPA learning algorithm based on $k$-SEVPA (Alur et al., 2005; Kumar et al., 2007) and demonstrate its polynomial-time efficiency in Theorem 4.7. Although polynomial-time VPA learning has been explored previously, as in the TTT-VPA algorithm of Isberner (2015), our approach differs by adopting a table-based methodology, inspired by the clarity and directness of the $L^{*}$ algorithm (Angluin, 1987). This shift not only simplifies the presentation but also makes it easy to interface with tag-inference algorithms that we will discuss later in this paper.

Our algorithm is outlined in Algorithm 1. At every iteration, it maintains a set of separable and closed equivalence states and test strings in $\mathcal{Q}$. The current states are used to produce a hypothesis VPA through $\mathbf{constructVPA}(\mathcal{Q})$. It then queries the oracle using an equivalence query. If the query does not produce a counterexample, then the iterative process terminates with the hypothesis VPA as the result; otherwise, the returned counterexample is tagged through the assumed tagging function $t$ and employed to refine the current set of equivalence states and test strings, through $\mathbf{update}(-,-,-)$ and $\mathbf{close}(-,-)$. Next we describe $\mathbf{constructVPA}(-)$, $\mathbf{close}(-,-)$, and $\mathbf{update}(-,-,-)$, starting with some background information.

Input: Oracle $\mathcal{O}$ with membership queries and equivalence queries $\mathcal{E}$, tagging function $t$, terminals $\hat{\Sigma}=\Sigma_{\text{call}}\cup\Sigma_{\text{ret}}\cup\Sigma_{\text{plain}}$.
Output: Learned VPA $\mathcal{H}_{\mathcal{Q}}$.
Initialize $Q_{i}$ as $\{\epsilon\}$ for $i=0..|\Sigma_{\text{call}}|$, $C_{0}$ as $\{\epsilon\}$, and $C_{j}$ as $\{(\text{‹}a_{j},\,b\text{›})\mid b\text{›}\in\Sigma_{\text{ret}}\}$ for $j=1..|\Sigma_{\text{call}}|$;
$\mathcal{Q}\leftarrow\mathbf{close}(\mathcal{O},\mathcal{Q})$;
while $\mathcal{E}(\mathbf{constructVPA}(\mathcal{Q}))$ produces a counterexample $s$ do
      $\mathcal{Q}'\leftarrow\mathbf{update}(\mathcal{O},\mathcal{Q},t(s))$;
      $\mathcal{Q}\leftarrow\mathbf{close}(\mathcal{O},\mathcal{Q}')$;
end while
return $\mathbf{constructVPA}(\mathcal{Q})$;
Algorithm 1 The $\mathbf{learn}(\mathcal{O},t,\hat{\Sigma})$ function that learns a VPA from a MAT.
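The control flow of Algorithm 1 can be written as a short Python skeleton. This sketch only captures the loop structure: the helpers `close`, `update`, `construct_vpa`, and `equivalence` are passed in as parameters because their definitions are given elsewhere in the paper, and all parameter names are hypothetical.

```python
def learn(oracle, tag, initial_q, close, update, construct_vpa, equivalence):
    """Skeleton of the learn loop.

    oracle        -- object answering membership queries
    tag           -- the assumed tagging function t, applied to counterexamples
    initial_q     -- the initial access words and test words
    close         -- close(oracle, Q): make Q separable and closed
    update        -- update(oracle, Q, tagged_counterexample): refine Q
    construct_vpa -- construct_vpa(Q): build the hypothesis VPA
    equivalence   -- equivalence(H): a counterexample string, or None
    """
    q = close(oracle, initial_q)
    while True:
        hypothesis = construct_vpa(q)
        counterexample = equivalence(hypothesis)
        if counterexample is None:
            return hypothesis
        q = close(oracle, update(oracle, q, tag(counterexample)))
```

Each loop iteration issues exactly one equivalence query, which is why the number of equivalence queries matters when they are simulated with membership queries.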

4.2.1. Background: $k$-SEVPA and Congruence Relations

Unlike regular languages, a VPL may not have a unique minimum-state deterministic pushdown recognizer. Nonetheless, partitioning the call symbols into $k$ distinct groups and mandating the following ensure the existence of a unique minimal VPA: (1) the states are partitioned into a set of $k+1$ modules (each is a set of states), with the $0$-th module as the base module with the initial state and the $i$-th module for the $i$-th group of call symbols with $i\in[1..k]$; (2) the machine stays in the same module when encountering a plain symbol; the machine transitions to a unique entry state in the $i$-th module when encountering a call symbol from the $i$-th group; the machine transitions back to the caller module when encountering a return symbol. Such a VPA is known as a $k$-Single Entry VPA ($k$-SEVPA (Alur et al., 2005)) and is similar to a control-flow graph with the $0$-th module for the main function and the $i$-th module for the $i$-th function in a program.

The minimal $k$-SEVPA can be defined with a set of congruence relations.

Definition 4.1 (Congruence Relations for the Minimal $k$-SEVPA (Alur et al., 2005)).

Given a VPL $\hat{\mathcal{L}}$ over $\hat{\Sigma}$, let $\Sigma_{\text{call}}^{i}$, $i\in[1..k]$, represent the $i$-th group of call symbols. Given well-matched strings $s_{1}$ and $s_{2}$, we introduce $k+1$ congruence relations:

(1) $s_{1}\sim_{0}s_{2}$ iff $\forall w\in\hat{\Sigma}^{*},\ s_{1}w\in\hat{\mathcal{L}}\iff s_{2}w\in\hat{\mathcal{L}}$;
(2) $s_{1}\sim_{i}s_{2}$ iff $\forall w,w'\in\hat{\Sigma}^{*},\ \forall\,\text{‹}a\in\Sigma_{\text{call}}^{i},\ w\,\text{‹}a\,s_{1}w'\in\hat{\mathcal{L}}\iff w\,\text{‹}a\,s_{2}w'\in\hat{\mathcal{L}}$, for $i\in[1..k]$.

Note that $\sim_{0}$ is the Myhill-Nerode right congruence and can be used to construct the minimal DFA for a regular language. For $\sim_{i}$ when $i\in[1..k]$, the context strings assume specialized forms: the left context string ends with a call symbol from the $i$-th group, and the string $w\,\text{‹}a\,w'$ is a well-matched string, since both $w\,\text{‹}a\,s_{1}w'\in\hat{\mathcal{L}}$ and $s_{1}$ are well matched. From the above congruence relations, we can construct the minimal $k$-SEVPA: the equivalence classes of $\sim_{i}$ become the states of the $i$-th module, with $[\epsilon]_{\sim_{i}}$ being the unique entry state of the $i$-th module; transition edges can also be added (see Alur et al. (2005)): e.g., $[s]_{\sim_{i}}$ transitions to $[si]_{\sim_{i}}$ for plain symbol $i$, and to $[\epsilon]_{\sim_{j}}$ for call symbol $\text{‹}a\in\Sigma_{\text{call}}^{j}$.

In our algorithm, we set $k$ to be the number of call symbols decided by the input tagging function so that each call symbol is in its own group. We write $\text{‹}a_{i}$ for the $i$-th call symbol. This partitioning is practical because call symbols often fulfill diverse roles and hence find themselves in separate contexts. Further, Proposition 2 in Alur et al. (2005) states that enlarging $k$ can lead to a more compact VPA.

4.2.2. Access Words and Test Words

At each step, V-Star maintains $\mathcal{Q}$, which contains

  (1) a set of $k+1$ modules $Q_{0}$ to $Q_{k}$, each containing the empty string $\epsilon$ and a set of well-matched access words in $\hat{\Sigma}^{*}$,

  (2) and a set of test words $C_{0}$ to $C_{k}$, with $C_{0}$ containing strings in $\hat{\Sigma}^{*}$ for testing $Q_{0}$ and $C_{i}$ containing strings in the form of $(w\,\text{‹}a_{i},w')$ for testing $Q_{i}$, where $w$ and $w'$ are in $\hat{\Sigma}^{*}$, $\text{‹}a_{i}$ is the $i$-th call symbol, and $w\,\text{‹}a_{i}\,w'$ is well matched.

Given test words $C_{i\in[0..k]}$, two well-matched strings $q_{1}$ and $q_{2}$ are $C_{i}$-equivalent, denoted as $q_{1}\sim_{C_{i}}q_{2}$, if (1) when $i=0$, $\forall w\in C_{0}$, $q_{1}w\in\hat{\mathcal{L}}$ iff $q_{2}w\in\hat{\mathcal{L}}$; and (2) when $i\in[1..k]$, $\forall(w\,\text{‹}a,w')\in C_{i}$, $w\,\text{‹}a\,q_{1}w'\in\hat{\mathcal{L}}$ iff $w\,\text{‹}a\,q_{2}w'\in\hat{\mathcal{L}}$. These are essentially the same equivalence relations as those in Definition 4.1, relative to test words in $C_{i}$.
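The $C_{i}$-equivalence test reduces to a handful of membership queries, as the following sketch shows. It assumes a fixed tagging, so tagged strings can be represented by ordinary Python strings; `C[0]` is a list of suffixes, `C[i]` for `i >= 1` is a list of `(left, right)` pairs whose left part ends with the $i$-th call symbol, and `member` is a hypothetical membership function for $\hat{\mathcal{L}}$.

```python
import re

def c_equivalent(i, q1, q2, C, member):
    """True iff q1 and q2 are C_i-equivalent under the test words in C[i]."""
    if i == 0:
        # Base module: test words are plain suffixes w.
        return all(member(q1 + w) == member(q2 + w) for w in C[0])
    # Module i >= 1: test words are contexts (w + call_i, w').
    return all(
        member(left + q1 + right) == member(left + q2 + right)
        for left, right in C[i]
    )

# Toy language for illustration: "(" followed by an even number of "x", then ")".
def toy_member(s):
    return re.fullmatch(r"\((xx)*\)", s) is not None

TOY_C = {0: [""], 1: [("(", ")")]}
```

In the toy language, the empty string and `xx` are $C_{1}$-equivalent because both `()` and `(xx)` are accepted, while `x` is distinguished from them by the same test word.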

We define the following two properties of $\mathcal{Q}=\{(Q_{i},C_{i})\,\mid\,i\in[0..k]\}$:

  (1) Separability: no two distinct strings in $Q_{i}$ are $C_{i}$-equivalent, meaning $\forall q,q'\in Q_{i},\ q\neq q'\;\Rightarrow\;q\not\sim_{C_{i}}q'$.

  (2) Closedness: for every $q\in Q_{i}$ and $m\in\Sigma_{M}$ (defined below), there is some $q'\in Q_{i}$ such that $qm\sim_{C_{i}}q'$.
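Both properties can be checked directly for one module. In the sketch below, `equiv(q1, q2)` is assumed to decide $C_{i}$-equivalence (for instance via membership queries), `sigma_m` is the extension set $\Sigma_{M}$ supplied by the caller, and strings are ordinary Python strings under a fixed tagging. The function names are hypothetical.

```python
from itertools import combinations

def is_separable(Q_i, equiv):
    """True iff no two distinct access words in Q_i are C_i-equivalent."""
    return not any(equiv(q1, q2) for q1, q2 in combinations(Q_i, 2))

def unclosed_extensions(Q_i, sigma_m, equiv):
    """List the extensions q + m that are not C_i-equivalent to any word
    in Q_i. The module is closed iff this list is empty."""
    return [
        q + m
        for q in Q_i
        for m in sigma_m
        if not any(equiv(q + m, q2) for q2 in Q_i)
    ]
```

A natural use of `unclosed_extensions` is as a source of new access words: adding a returned extension to $Q_{i}$ preserves separability, since that extension is not equivalent to any existing word.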

Definition 4.2 (Nested Words and $\Sigma_{M}$).

Given $\hat{\Sigma}=\Sigma_{\text{call}}\cup\Sigma_{\text{plain}}\cup\Sigma_{\text{ret}}$, along with $(Q_{i},C_{i})_{i\in[1..k]}$, we define the nested words for $(Q_{i},C_{i})$, denoted as $M_{i}$, as

$$M_{i}=\{\text{‹}a_{i}\,q\,b\text{›}\mid q\in Q_{i},\ b\text{›}\in\Sigma_{\text{ret}}\},$$

where $\text{‹}a_{i}$ is the $i$-th call symbol. We define $\Sigma_{M}=\bigcup_{i}M_{i}\cup\hat{\Sigma}$.
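The sets $M_{i}$ and $\Sigma_{M}$ are straightforward to enumerate. The sketch below follows the definition as written, again representing tagged strings as ordinary Python strings under a fixed tagging; `calls[i - 1]` is assumed to hold the $i$-th call symbol and `Q[i]` the access words of module $i$.

```python
def nested_words(call_i, Q_i, returns):
    """M_i: the i-th call symbol, an access word of module i, then a return."""
    return {call_i + q + b for q in Q_i for b in returns}

def sigma_m(calls, Q, returns, sigma_hat):
    """Sigma_M: the union of all M_i together with the tagged alphabet."""
    result = set(sigma_hat)
    for i, call_i in enumerate(calls, start=1):
        result |= nested_words(call_i, Q[i], returns)
    return result
```

Note that $M_{i}$ pairs the $i$-th call symbol with every return symbol, so $|M_{i}|=|Q_{i}|\cdot|\Sigma_{\text{ret}}|$ when the resulting strings are distinct.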

Our learning algorithm is then based on the following set of propositions.

Definition 4.0 (𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐕𝐏𝐀(𝒬)𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐕𝐏𝐀𝒬\mathbf{constructVPA}(\mathcal{Q})bold_constructVPA ( caligraphic_Q ) function).

For separable and closed $\mathcal{Q}=\{(Q_i,C_i)\mid i\in[0..k]\}$, we can construct a hypothesis $k$-SEVPA, denoted as $\mathcal{H}$, as follows. The set of states of $\mathcal{H}$ is $\bigcup_{i\in[0..k]}Q_i$. We write $q\in Q_i$ as $[q]_i$. The initial state is $[\epsilon]_0$. Define the set of acceptance states, $Q_F$, to be $\{[q]_0\mid q\in Q_0\cap\hat{\mathcal{L}}\}$, which can be constructed via membership queries. The transition function $\delta$ from the current state $[q]_i$, $i\in[0..k]$, and the next input symbol is defined as follows:

  1. For plain symbol $c$, the transition is $[q]_i\xrightarrow{c}[q']_i$, where $q'\sim_{C_i}qc$.

  2. For call symbol $\text{‹}a_j$, the transition is $[q]_i\xrightarrow{\text{‹}a_j,\ \text{push}([q]_i,\text{‹}a_j)}[\epsilon]_j$, the unique entry state for module $j$.

  3. For return symbol $b\text{›}$, the transition is $[q]_i\xrightarrow{b\text{›},\ \text{pop}([q']_j,\text{‹}a_i)}[q'']_j$, where $q''\sim_{C_j}q'\,\text{‹}a_i\,q\,b\text{›}$.

The target state in each transition exists by closedness and is unique by separability. To run the VPA on a string $s$, start with the initial state $[\epsilon]_0$ and an empty stack and use $\delta$ for transitions. The automaton accepts a string if it terminates in a configuration with a state within $Q_F$ and an empty stack.
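The three transition rules and the acceptance condition can be sketched as a small interpreter that computes each transition on demand with membership queries. This is only an illustration under stated assumptions, not the V-Star implementation: symbols are single characters, every test in $C_i$ is modeled as a (prefix, suffix) pair, two strings are taken to be $C_i$-equivalent when the oracle agrees on them under every test, and the names `member`, `equivalent`, `representative`, and `accepts` are hypothetical. The toy language (well-matched strings over `(`, `)`, `x` in which `x` never occurs at the top level) is likewise chosen for this example.

```python
def member(s):
    """Toy membership oracle: well-matched, with no 'x' at nesting depth 0."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif depth == 0:
            return False
    return depth == 0


def equivalent(member, tests, x, y):
    """Assumed C_i-equivalence: x and y agree under every (prefix, suffix) test."""
    return all(member(w + x + v) == member(w + y + v) for w, v in tests)


def representative(member, states, tests, x):
    """The state equivalent to x: exists by closedness, unique by separability."""
    matches = [q for q in states if equivalent(member, tests, q, x)]
    if len(matches) != 1:
        raise ValueError("Q is not separable and closed")
    return matches[0]


def accepts(member, Q, C, calls, returns, s):
    """Run the hypothesis on s; calls maps each call symbol to its module index."""
    state, module, stack = "", 0, []           # initial state [epsilon]_0
    for sym in s:
        if sym in calls:                        # rule 2: push and enter module j
            stack.append((state, module, sym))
            state, module = "", calls[sym]
        elif sym in returns:                    # rule 3: pop and go back to module j
            if not stack:
                return False
            prev, prev_module, call = stack.pop()
            word = prev + call + state + sym
            state = representative(member, Q[prev_module], C[prev_module], word)
            module = prev_module
        else:                                   # rule 1: plain symbol, same module
            state = representative(member, Q[module], C[module], state + sym)
    # Accept on an empty stack in a state of Q_0 that belongs to the language.
    return not stack and module == 0 and member(state)


Q = [["", "x"], [""]]                           # Q_0 and Q_1
C = [[("", "")], [("(", ")")]]                  # C_0 and C_1
calls, returns = {"(": 1}, {")"}
```

Here `accepts(member, Q, C, calls, returns, "(x)(())")` holds, while `"x(x)"` is rejected because the run gets stuck in the non-accepting state `[x]_0`.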

Proposition 4.4.

If $\mathcal{Q}=\{(Q_i,C_i)\mid i\in[0..k]\}$ is separable and language $\hat{\mathcal{L}}=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL, then the number of states in $\mathbf{constructVPA}(\mathcal{Q})$ is bounded above by the number of states in the minimal $k$-SEVPA for VPL $\hat{\mathcal{L}}$.

Proof.

For two strings $s_1$ and $s_2$, if $s_1\sim_i s_2$ (Definition 4.1), then $s_1\sim_{C_i}s_2$. Hence, the number of equivalence classes of $\sim_{C_i}$ is at most that of $\sim_i$, which corresponds to the number of states of the $i$-th module in the minimal $k$-SEVPA. Further, since $Q_i$ is separable, each element of $Q_i$ corresponds to a unique equivalence class of $\sim_{C_i}$. Therefore, $|Q_i|$ is bounded above by the number of equivalence classes of $\sim_{C_i}$, which is bounded above by the number of states of the $i$-th module in the minimal $k$-SEVPA.
Since the number of states in $\mathbf{constructVPA}(\mathcal{Q})$ is $\sum_{i\in[0..k]}|Q_i|$, it is bounded above by the number of states in the minimal $k$-SEVPA for $\hat{\mathcal{L}}$. ∎

Proposition 4.5.

If $\mathcal{Q}=\{(Q_i,C_i)\mid i\in[0..k]\}$ is separable but not closed, then using membership queries one can find $i$ and $q\in\hat{\Sigma}^*\setminus Q_i$ such that $(Q_i\cup\{q\},C_i)$ and the rest $(Q_j,C_j)_{j\neq i}$ remain separable.

Proof.

Since $(Q_i,C_i)_{i\in[0..k]}$ is not closed, there exist $q\in Q_i$ for some $i$ and $m\in\Sigma_M$ such that $qm$ is not $C_i$-equivalent to any state in $Q_i$. Using membership queries (by enumerating all test strings in $C_i$) we can find such a $q$ and $m$, and then add $qm$ to $Q_i$, which remains separable by construction. ∎

Algorithm 2 outlines the $\mathbf{close}(\mathcal{O},\mathcal{Q})$ function, which keeps applying Proposition 4.5 until $\mathcal{Q}$ becomes separable and closed.

Input: Oracle $\mathcal{O}$ and separable $\mathcal{Q}=\{(Q_i,C_i)\mid i\in[0..k]\}$.
Output: Separable and closed $\mathcal{Q}'$.
Initialize $\Sigma_M$ as $\bigcup_{i=1..k}\{\text{‹}a_i\,q\,b\text{›}\mid q\in Q_i,\ b\text{›}\in\Sigma_{\text{ret}}\}\cup\hat{\Sigma}$;
Initialize the work list $W$ as $\{(q,i,m)\mid q\in Q_i,\ i\in 0..k,\ m\in\Sigma_M\}$;
while $W$ is not empty do
      Take $(q,i,m)$ from $W$;
      if $\forall q'\in Q_i$, $qm\not\sim_{C_i}q'$ then
            $Q_i\leftarrow Q_i\cup\{qm\}$;
            $W\leftarrow W\cup\{(qm,i,m')\mid m'\in\Sigma_M\}$;
            if $i>0$ then
                  $\Sigma_M\leftarrow\Sigma_M\cup\{\text{‹}a_i\,qm\,b\text{›}\mid b\text{›}\in\Sigma_{\text{ret}}\}$;
                  $W\leftarrow W\cup\{(q'',j,\text{‹}a_i\,qm\,b\text{›})\mid q''\in Q_j,\ j\in[0..k],\ b\text{›}\in\Sigma_{\text{ret}}\}$;
            end if
      end if
end while
return $\{(Q_i,C_i)\mid i\in[0..k]\}$;
Algorithm 2 The $\mathbf{close}(\mathcal{O},\mathcal{Q})$ function.
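The work-list loop of Algorithm 2 can be sketched in Python as follows. The sketch makes several assumptions that are not part of the paper: symbols are single characters, each test in $C_i$ is a (prefix, suffix) pair, $C_i$-equivalence means agreement of the membership oracle under every test, and the module-$i$ call symbol is `calls[i - 1]`. The names `member`, `equivalent`, and `close`, as well as the toy language, are hypothetical.

```python
from collections import deque


def member(s):
    """Toy membership oracle: well-matched, with no 'x' at nesting depth 0."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif depth == 0:
            return False
    return depth == 0


def equivalent(member, tests, x, y):
    """Assumed C_i-equivalence: x and y agree under every (prefix, suffix) test."""
    return all(member(w + x + v) == member(w + y + v) for w, v in tests)


def close(member, Q, C, calls, plain, returns):
    """Extend the lists Q[i] in place until (Q, C) is closed; C is unchanged."""
    modules = range(len(Q))
    # Sigma_M: the tagged alphabet plus all nested words of modules 1..k.
    sigma_m = list(calls) + list(plain) + list(returns)
    sigma_m += [calls[i - 1] + q + r for i in modules if i > 0
                for q in Q[i] for r in returns]
    work = deque((q, i, m) for i in modules for q in Q[i] for m in sigma_m)
    while work:
        q, i, m = work.popleft()
        word = q + m
        if any(equivalent(member, C[i], word, p) for p in Q[i]):
            continue                         # word already has a representative
        Q[i].append(word)                    # new state; Q[i] stays separable
        work.extend((word, i, m2) for m2 in sigma_m)
        if i > 0:
            # A new state of module i yields new nested words for every module.
            new_words = [calls[i - 1] + word + r for r in returns]
            sigma_m.extend(new_words)
            work.extend((p, j, w) for j in modules for p in Q[j] for w in new_words)
    return Q


closed = close(member, [[""], [""]], [[("", "")], [("(", ")")]], ["("], ["x"], [")"])
```

Because a module with test set $C_i$ can hold at most $2^{|C_i|}$ pairwise inequivalent strings under this notion of equivalence, the loop terminates. In the toy run each module gains one extra state, `(`, which acts as a rejecting sink because $\Sigma_M$ as defined above also contains lone call and return symbols.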
Proposition 4.6.

Suppose that $\mathcal{Q}=\{(Q_i,C_i)\mid i\in[0..k]\}$ is separable and closed, and let $\mathcal{H}$ be the hypothesis automaton (Definition 4.3). Suppose the oracle returns a counterexample $s$ for an equivalence query with $\mathcal{H}$. Using $\log|s|$ membership queries, one can find $i\in[1..k]$ and $q\in\hat{\Sigma}^*\setminus Q_i$ and $(w\,\text{‹}a_i,w')\in\hat{\Sigma}^*\Sigma_{\text{call}}\times\hat{\Sigma}^*$ such that $(Q_i\cup\{q\},C_i\cup\{(w\,\text{‹}a_i,w')\})$ is separable, or find $w\in\hat{\Sigma}^*$ when $i=0$ such that $(Q_0\cup\{q\},C_0\cup\{w\})$ is separable.

Proof.

Let $n$ be the length of $s$. Let $q_0=[\epsilon]_0$ be the initial state of $\mathcal{H}$, and $\delta$ be the transition function of $\mathcal{H}$. For $i=1,\ldots,n$, define $q_i=\delta(q_0,s[1]\ldots s[i])$ to be the state in $Q_j$ reached by $\mathcal{H}$ after reading the prefix $s[1]\ldots s[i]$ of $s$, and define $T_i$ as the corresponding stack. For convenience, we write $[q_i]_j$ for the state $q_i$ in module $j$.

We define the context of $q_i$ as follows. When $T_i$ is empty, we define the context to be $(\epsilon,s[i+1]\ldots s[n])$. Otherwise, let $T_i=(q_{j_{n'}},\text{‹}a_{j_{n'}})\cdots(q_{j_1},\text{‹}a_{j_1})\cdot\bot$. We define the context as

$$(q_{j_1}\,\text{‹}a_{j_1}\dots q_{j_{n'}}\,\text{‹}a_{j_{n'}},\ s[i+1]\ldots s[n]).$$

We denote the context of $q_i$ as $(w_i,w_i')$. We say that state $q_i$ is correct if $\chi_{\mathcal{L}}(w_iq_iw_i')=\chi_{\mathcal{L}}(s)$.
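The context and the correctness test translate directly into code. In the following minimal sketch the stack is a Python list of (state, call symbol) pairs stored bottom first, so that joining the pairs in list order gives $w_i$; positions are 1-based in the text, hence `s[i:]` is $s[i+1]\ldots s[n]$. The names `context` and `is_correct`, and the one-string oracle used in the usage lines, are hypothetical.

```python
def context(stack, s, i):
    """Return (w_i, w_i') for the configuration reached after reading s[1..i].

    stack holds (state, call_symbol) pairs, bottom of the stack first.
    """
    prefix = "".join(state + call for state, call in stack)
    return prefix, s[i:]


def is_correct(member, stack, state, s, i):
    """State q_i is correct if the oracle treats w_i q_i w_i' exactly like s."""
    w, w_prime = context(stack, s, i)
    return member(w + state + w_prime) == member(s)


# Usage with a toy oracle that accepts exactly one string.
s = "a(b[cd]e)"
stack = [("a", "("), ("b", "[")]          # configuration after reading "a(b["
only_s = lambda t: t == s
```

For this configuration the context is `("a(b[", "cd]e)")`; the entry state `""` is correct, whereas replacing it by `"x"` is not.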

State $q_0=[\epsilon]_0$ is obviously correct since its context is $(\epsilon,s)$. However, state $q_n$ must be incorrect, for the following reason. First, state $q_n$ must be in module $0$: if the counterexample $s$ is accepted by $\mathcal{H}$, then $s$ is well-matched under $t$; otherwise, the counterexample $s$ is in $\hat{\mathcal{L}}$, which is a well-matched language. Therefore, we write $[q_n]_0$. Next, since $s$ is a counterexample, we have $\chi_{(\mathcal{H},t)}(s)\neq\chi_{\mathcal{L}}(s)$. By the construction of $\mathcal{H}$, we have $\chi_{(\mathcal{H},t)}(s)=\chi(q_n)$. Therefore, we have $\chi(q_n)\neq\chi_{\mathcal{L}}(s)$, which means state $[q_n]_0$ is incorrect.

We can then use binary search (using $\log|s|$ membership queries) to find $i$ such that $[q_i]_j$ is correct, while $[q_{i+1}]_{j'}$ is incorrect. We first show that $s[i+1]$ cannot be a call symbol. Otherwise, we must have $q_{i+1}=\epsilon$ and $T_{i+1}=(q_i,s[i+1])\cdots(q_{j_1},\text{‹}a_{j_1})\cdot\bot$. The context of $q_{i+1}$ is $(q_{j_1}\,\text{‹}a_{j_1}\dots q_i\,s[i+1],\ s[i+2]\ldots s[n])$. We have

$$\chi_{\mathcal{L}}(w_{i+1}q_{i+1}w_{i+1}')=\chi_{\mathcal{L}}(q_{j_1}\,\text{‹}a_{j_1}\dots q_i\,s[i+1]\,s[i+2]\ldots s[n])=\chi_{\mathcal{L}}(w_iq_iw_i')=\chi_{\mathcal{L}}(s),$$

but $\chi_{\mathcal{L}}(w_{i+1}q_{i+1}w_{i+1}')\neq\chi_{\mathcal{L}}(s)$, a contradiction.
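The binary search used above only relies on the two endpoints: $q_0$ is correct and $q_n$ is incorrect. A minimal sketch, assuming a hypothetical predicate `correct(i)` that issues one membership query for position $i$, is:

```python
def find_breakpoint(correct, n):
    """Return i such that correct(i) holds and correct(i + 1) does not.

    Requires correct(0) to be true and correct(n) to be false; the
    predicate is evaluated at most ceil(log2(n)) times.
    """
    lo, hi = 0, n              # invariant: correct(lo) and not correct(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if correct(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Correctness along the run need not be monotone for the search to work.
flags = [True, True, False, True, False, False]
calls_made = []


def probe(i):
    calls_made.append(i)
    return flags[i]


breakpoint_index = find_breakpoint(probe, len(flags) - 1)
```

The invariant guarantees that a correct state is followed immediately by an incorrect one at the returned index, even though other such positions may exist elsewhere in the run.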

Assume $s[i+1]$ is a plain symbol. Since $q_i$ is a state in module $j$, we let $Q_j'=Q_j\cup\{q_i\,s[i+1]\}$ and $C_j'=C_j\cup\{(w_{i+1},w_{i+1}')\}$. By definition of the transition function of $\mathcal{H}$, $q_{i+1}$ is the unique element of $Q_j$ that is $C_j$-equivalent to $q_i\,s[i+1]$. On the other hand, the test $(w_{i+1},w_{i+1}')$ distinguishes $q_{i+1}$ from $q_i\,s[i+1]$, since $\chi_{\mathcal{L}}(w_iq_iw_i')=\chi_{\mathcal{L}}(w_{i+1}\,q_i\,s[i+1]\,s[i+2]\ldots s[n])\neq\chi_{\mathcal{L}}(w_{i+1}\,q_{i+1}\,s[i+2]\ldots s[n])$.

Otherwise, $s[i+1]$ is a return symbol $b›$. Let $T_i = (q_{i'}, ‹a_j) \cdots (q_{i_1}, ‹a_{j_1}) \cdot \bot$. Recall that at state $q_i$, $\mathcal{H}$ reads $b›$ and transfers to $[q_{i+1}]_{j'}$ such that $q_{i+1} \sim_{C_{j'}} q_{i'}\,‹a_j\,q_i\,b›$.
Notice that $\chi_{\mathcal{L}}(w_{i+1}\,q_{i'}\,‹a_j\,q_i\,b›\,w_{i+1}') = \chi_{\mathcal{L}}(w_i q_i w_i') = \chi_{\mathcal{L}}(s)$. Therefore, $q_{i'}\,‹a_j\,q_i\,b›$ is correct.
We let $Q_{j'}' = Q_{j'} \cup \{q_{i'}\,‹a_j\,q_i\,b›\}$ and $C_{j'}' = C_{j'} \cup \{(w_{i+1}, w_{i+1}')\}$.
By definition of the transition function of $\mathcal{H}$, $q_{i+1}$ is the unique element of $Q_{j'}$ that is $C_{j'}$-equivalent to $q_{i'}\,‹a_j\,q_i\,b›$. On the other hand, the test $(w_{i+1}, w_{i+1}')$ distinguishes $q_{i+1}$ from $q_{i'}\,‹a_j\,q_i\,b›$.
We conclude that $q_{i'}\,‹a_j\,q_i\,b› \notin Q_{j'}$, and that $(Q_{j'}', C_{j'}')$ is separable. ∎

We call the procedure in Proposition 4.6 $\mathbf{update}(\mathcal{O}, \mathcal{Q}, t(s))$; it takes a separable and closed $\mathcal{Q}$ and a counterexample $s$, and returns a separable $\mathcal{Q}'$.

With these lemmas, we can prove the following theorem; its proof is provided in Appendix A.1.

Theorem 4.7.

Given a tagging function $t$ such that the language $\hat{\mathcal{L}} = \{t(s) \mid s \in \mathcal{L}\}$ is a VPL, the minimal $k$-SEVPA of $\hat{\mathcal{L}}$ can be learned with a polynomial number of equivalence and membership queries.

Therefore, if the language $\hat{\mathcal{L}} = \{t(s) \mid s \in \mathcal{L}\}$ is a VPL, we can use Algorithm 1 to learn the minimal $k$-SEVPA of $\hat{\mathcal{L}}$. In general, however, a tagging function $t$ does not necessarily induce a VPL $\hat{\mathcal{L}}$, even if each sentence in $\hat{\mathcal{L}}$ is well-matched. For example, consider the language $\hat{\mathcal{L}}_{\mathcal{O}} = \{a^k b^k \mid k > 0\}$ and the tagging $t$ that maps $a$ and $b$ to plain symbols. The resulting language is trivially well-matched (it has no call or return symbols), but it is not a VPL: without call and return symbols a VPA never uses its stack, so the language would have to be regular, which $\{a^k b^k \mid k > 0\}$ is not. In the next section, we discuss how to find a tagging function that makes $\mathcal{L}$ a VPL.

4.3. Tagging Inference

$L \to ‹a\ A\ b›\ L \mid c\ B \mid \epsilon$
$A \to ‹g\ L\ h›\ E$
$B \to d\ L$
$E \to \epsilon$
Seed strings $S = \{agcdcdhbcd\}$
Figure 1. An oracle VPG and a set of seed strings.

We next propose an algorithm that infers a tagging function $t$ so that the tagged language $\hat{\mathcal{L}}_t = \{t(s) \mid s \in \mathcal{L}\}$ is a VPL. Then, by Theorem 4.7, we can use the tagging function in Algorithm 1 to learn the VPL efficiently. Our algorithm takes a set of seed strings $S$ for inference. In practice, the seed strings can be collected from existing corpora (e.g., a corpus of XML strings) or from valid inputs to black-box implementations of oracles (e.g., an XML parser).

We will use the oracle VPG in Figure 1, which comes with a single seed string, as a running example. Note that seed strings are untagged: it is the task of our algorithm to infer the tagging. As discussed earlier, the inferred and the oracle tagging functions may differ. For the example VPG in Figure 1, we can remove the tags on either $(‹a, b›)$ or $(‹g, h›)$, and the resulting grammar is still a VPG that generates the same untagged language. As we will explain, in such a case the algorithm opts for the outermost tags in its inferred VPG (i.e., $‹a$ and $b›$), while treating $g$ and $h$ as plain symbols.
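To experiment with the running example, it helps to have a membership oracle for the untagged language of Figure 1. The following is a minimal Python sketch, assuming the grammar is flattened to $L \to ag\,L\,hb\,L \mid cd\,L \mid \epsilon$ (obtained by inlining $A$, $B$, and $E$). The names `parse_L` and `in_language` are illustrative and are not part of V-Star.

```python
def parse_L(s, i=0):
    """Consume the longest L-derivable prefix of s starting at index i.

    Returns the index just after that prefix, or None if an opened
    'ag' is never closed by a matching 'hb'.
    """
    while True:
        if s.startswith("ag", i):          # L -> ag L hb L
            j = parse_L(s, i + 2)
            if j is None or not s.startswith("hb", j):
                return None
            i = j + 2                      # continue with the trailing L
        elif s.startswith("cd", i):        # L -> cd L
            i += 2
        else:                              # L -> epsilon
            return i


def in_language(s):
    """Membership query for the untagged oracle language of Figure 1."""
    return parse_L(s) == len(s)


if __name__ == "__main__":
    for candidate in ["agcdcdhbcd", "", "ag", "cdcdhbcd"]:
        print(repr(candidate), in_language(candidate))
```

Because the three alternatives of $L$ start with different characters, a single left-to-right pass without backtracking suffices for this particular grammar.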

This section unfolds as follows. We first introduce a VPL pumping lemma, which enables a nesting test to filter out invalid taggings. We then present a tagging inference algorithm based on the nesting test and state theorems discussing properties of the algorithm. We leave most of the proofs of these theorems to Appendix B.

One straightforward observation is that a tagging $T$ is invalid if some seed string, after tagging through $T$, is not well-matched. However, this test alone does not eliminate many possibilities. We next introduce a nesting test to filter out more invalid taggings. For this, we first propose a pumping lemma for VPLs. This lemma differs from the traditional pumping lemmas for regular and context-free languages in that it accounts for the call and return symbols of VPLs.

Lemma 4.8 (Pumping Lemma for VPLs).

For any VPL $\hat{\mathcal{L}}$, there exists a positive number $l$ such that any string $s$ in $\hat{\mathcal{L}}$ with length greater than $l$ can be expressed according to one of the following conditions:

  (1) (Regular Pumping) We can partition $s$ into $s = uxv$ for strings $u$, $x$, and $v$, with $x$ non-empty, such that $ux^kv$ remains in $\hat{\mathcal{L}}$ for all $k \geq 0$.

  (2) (Nesting Pumping) We can partition $s$ into $s = uxzyv$ for strings $u$, $x$, $z$, $y$, and $v$, with $x$ and $y$ non-empty, $x$ containing a call symbol, and $y$ containing a return symbol, such that $ux^kzy^kv$ remains in $\hat{\mathcal{L}}$ for all $k \geq 1$.

For example, consider the VPG in Figure 1. Any string in $\hat{\mathcal{L}}$ with length greater than 6 can be decomposed in one of the above two ways. For instance, for the tagged seed string $s = ‹a\,‹g\,cdcd\,h›\,b›\,cd$, the strings $(‹a\,‹g)^k\,cdcd\,(h›\,b›)^k\,cd$ are in the language for $k \geq 1$; it happens that $s$ can also be decomposed through regular pumping, since $‹a\,‹g\,cd\,(cd)^j\,h›\,b›\,cd$ is in the language for $j \geq 0$. We now extend the concept of nesting pumping to untagged strings, calling the resulting partitionings nesting patterns.

Definition 4.9 (Nesting Patterns).

For an untagged string $s$ in the oracle language $\mathcal{L}$, a nesting pattern is a partitioning $s = uxzyv$ such that (1) $x$ and $y$ are non-empty, (2) $ux^kzy^kv \in \mathcal{L}$ for all $k \geq 1$, and (3) $ux^kzy^jv \notin \mathcal{L}$ for all $k \neq j$ (both $\geq 0$). The third condition precludes the possibility that $uxzyv$ represents a regular pumping, which would allow $ux^jzyv$ and $uxzy^jv$ for all $j \geq 0$. When $u$, $z$, and $v$ are not the focus, we may succinctly write a nesting pattern in a string as a pair $(x, y)$.
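Since the definition quantifies over all $k$ and $j$, it can only be checked up to a bound in practice. The sketch below tests the three conditions for exponents up to a bound `K`, using a small recognizer for the Figure 1 language as the membership oracle. The recognizer, the function name `is_candidate_nesting`, and the default `K=2` are assumptions made for illustration; the case $k = j = 0$ is skipped because condition (2) only requires $k \geq 1$.

```python
def parse_L(s, i=0):
    # Recognizer for the flattened Figure 1 grammar: L -> ag L hb L | cd L | eps.
    while True:
        if s.startswith("ag", i):
            j = parse_L(s, i + 2)
            if j is None or not s.startswith("hb", j):
                return None
            i = j + 2
        elif s.startswith("cd", i):
            i += 2
        else:
            return i


def in_language(s):
    return parse_L(s) == len(s)


def is_candidate_nesting(u, x, z, y, v, oracle, K=2):
    """Bounded check of the nesting-pattern conditions for s = u x z y v."""
    if not x or not y:                       # condition (1)
        return False
    for k in range(K + 1):
        for j in range(K + 1):
            if k == 0 and j == 0:
                continue                     # not constrained by the definition
            member = oracle(u + x * k + z + y * j + v)
            if k == j and not member:        # condition (2), bounded by K
                return False
            if k != j and member:            # condition (3), bounded by K
                return False
    return True


if __name__ == "__main__":
    # (ag, hb) in the seed string agcdcdhbcd.
    print(is_candidate_nesting("", "ag", "cdcd", "hb", "cd", in_language))
    # (cd, cd) is a regular pumping and should be rejected.
    print(is_candidate_nesting("ag", "cd", "", "cd", "hbcd", in_language))
```

A partitioning that passes this bounded check is only a candidate: a larger bound might still refute it.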

Definition 4.10 (Compatible Tagging).

We say that a tagging $t$ is compatible with a nesting pattern $s = uxzyv$ if there exists a pair $(‹a, b›)$ in $t$ such that (1) $x$ includes $a$ and $y$ includes $b$, and (2) $t(x)$ includes an unmatched $‹a$ and $t(y)$ includes an unmatched $b›$ in $t(s)$.

We say that a tagging $t$ is compatible with a set of seed strings $S$ if (1) the strings in $S$ are well-matched under $t$, and (2) $t$ is compatible with all nesting patterns of $S$. We say that a tagging $t$ is compatible with a language $\mathcal{L}$ if it is compatible with each string in the language.

Theorem 4.11.

Given an oracle language $\mathcal{L}$, for any tagging $t$ compatible with $\mathcal{L}$, the language $\hat{\mathcal{L}}_t = \{t(s) \mid s \in \mathcal{L}\}$ is a VPL.

For the example in Figure 1, the single seed string’s nesting patterns include

$\{(ag, hb), (agcd, hb), \dots, (ag, cdcdhbcd)\}.$

One compatible tagging is $\{(a, b)\}$: first, the tagging makes the seed string well-matched; second, each nesting pattern includes $(a, b)$. By Theorem 4.11, when we tag $a$ as a call symbol and $b$ as a return symbol, the oracle language becomes a VPL. Other compatible taggings include $\{(a, h)\}$, $\{(g, h)\}$, $\{(g, b)\}$, and $\{(a, b), (g, h)\}$. In contrast, the tagging $\{(a, h), (g, b)\}$ is incompatible, since it does not make the seed string well-matched.
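These compatibility claims can be checked mechanically. The sketch below represents a tagging as a set of (call, return) character pairs and implements one reading of the definition: a call is "unmatched" in $t(x)$ if no later return inside $x$ closes it, and symmetrically for returns in $t(y)$. This reading, the helper names, and the assumption that no character is both a call and a return are ours, not the paper's.

```python
def well_matched(s, tagging):
    """True if every call in s is closed by a return paired with it in tagging."""
    calls = {a for a, _ in tagging}
    rets = {b for _, b in tagging}
    stack = []
    for ch in s:
        if ch in calls:
            stack.append(ch)
        elif ch in rets:
            if not stack or (stack.pop(), ch) not in tagging:
                return False
    return not stack


def unmatched(fragment, tagging):
    """Calls left open and returns left dangling inside a fragment."""
    calls = {a for a, _ in tagging}
    rets = {b for _, b in tagging}
    open_calls, dangling_rets = [], []
    for ch in fragment:
        if ch in calls:
            open_calls.append(ch)
        elif ch in rets:
            if open_calls:
                open_calls.pop()
            else:
                dangling_rets.append(ch)
    return set(open_calls), set(dangling_rets)


def compatible(tagging, x, y):
    """Some pair (a, b) has an unmatched call a in x and an unmatched return b in y."""
    open_calls, _ = unmatched(x, tagging)
    _, dangling_rets = unmatched(y, tagging)
    return any(a in open_calls and b in dangling_rets for a, b in tagging)


if __name__ == "__main__":
    seed = "agcdcdhbcd"
    patterns = [("ag", "hb"), ("agcd", "hb"), ("ag", "cdcdhbcd")]
    for t in [{("a", "b")}, {("g", "h")}, {("a", "h"), ("g", "b")}]:
        ok = well_matched(seed, t) and all(compatible(t, x, y) for x, y in patterns)
        print(sorted(t), ok)
```

Only the three nesting patterns listed explicitly in the text are checked here; the full set elided by the ellipsis is not reproduced.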

Recall that Theorem 4.7 tells us that, if a tagging $t$ makes $\mathcal{L}$ a VPL, we can efficiently learn the VPL under active learning through Algorithm 1. Theorem 4.11 now tells us that a compatible tagging $t$ makes $\mathcal{L}$ a VPL. Therefore, what remains is to infer a compatible tagging. With such a tagging, we can use Algorithm 1 to learn a VPL whose untagged strings are the same as the oracle language.

Input: Oracle $\mathcal{O}$ and seed strings $S$.
Output: Some tagging $T$ compatible with $S$, or None if no compatible tagging is found.

Function $\mathbf{candidateNesting}(S, K)$:
    $N_{S,K} \leftarrow \emptyset$;
    foreach partitioning $uxzyv$ of $s \in S$ do
        if $\forall k \leq K,\ \chi_{\mathcal{L}}(ux^kzy^kv) = \mathrm{True}$ and $\forall k, j \leq K,\ k \neq j \Rightarrow \chi_{\mathcal{L}}(ux^kzy^jv) = \mathrm{False}$ then
            $N_{S,K} \leftarrow N_{S,K} \cup \{uxzyv\}$;
    end foreach
    return $N_{S,K}$;

Function $\mathbf{search}(N, N_{\mathrm{done}}, T)$:
    if $N$ is empty then return Some($T$);
    Take a nesting pattern $uxzyv$ from $N$;
    if $T$ is incompatible with $uxzyv$ then
        foreach character $a$ in $x$ and $b$ in $y$ do
            if all strings in $S$ are well-matched under $T \cup \{(a, b)\}$
            and $T \cup \{(a, b)\}$ is compatible with $N_{\mathrm{done}} \cup \{uxzyv\}$ then
                $T' \leftarrow \mathbf{search}(N \setminus \{uxzyv\},\ N_{\mathrm{done}} \cup \{uxzyv\},\ T \cup \{(a, b)\})$;
                if $T'$ is not None then return $T'$;
            end if
        end foreach
        return None; // No compatible tagging found
    end if
    else return $\mathbf{search}(N \setminus \{uxzyv\},\ N_{\mathrm{done}} \cup \{uxzyv\},\ T)$;

Initialize $K$ as $1$;
repeat
    $K \leftarrow K + 1$; $N_{S,K} \leftarrow \mathbf{candidateNesting}(S, K)$; $T \leftarrow \mathbf{search}(N_{S,K}, \emptyset, \emptyset)$;
until $T \neq$ None;
return $T$;

Algorithm 3. The $\mathbf{tagInfer}(\mathcal{O}, S)$ algorithm that infers tagging.

We next describe Algorithm 3, which infers a compatible tagging from an input set of seed strings. Its runtime complexity is exponential in the worst case; however, as discussed in our evaluation section, it runs efficiently on practical grammars. As an overview, starting with a bound $K = 2$, the algorithm (1) employs a bounded checking approach in the $\mathbf{candidateNesting}$ function to compute candidate nesting patterns $N_{S,K}$ for the seed strings $S$, and (2) for $N_{S,K}$, uses the $\mathbf{search}$ function to look for a compatible tagging with a search algorithm (which may backtrack). If $\mathbf{search}$ fails to find a compatible tagging, we increase $K$ by $1$ and start anew. In more detail, in $\mathbf{candidateNesting}$, for each disjoint substring pair $(x, y)$ in each seed string, we check that $ux^kzy^kv \in \mathcal{L}$ for $k \leq K$ and that $ux^kzy^jv \notin \mathcal{L}$ for $k, j \leq K$ with $k \neq j$. If so, $(x, y)$ is a candidate nesting pattern.
In the $\mathbf{search}$ function, we begin with an empty tagging $T$, which tags each character as a plain symbol. We then check if a candidate nesting pattern is already covered by the current tagging; if not, we treat a symbol in $x$ as a call symbol and a symbol in $y$ as a return symbol and continue the search process.

Returning to our example shown in Figure 1, the seed string includes the nesting pattern $(ag, hb)$. Our algorithm prioritizes the outermost characters for pairing. Consequently, the pair $(a, b)$ is selected and it covers all nesting patterns of the seed string, resulting in the compatible tagging $\{(a, b)\}$.
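The overall procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 3 and not the V-Star implementation: the oracle is a hand-written recognizer for the Figure 1 language, "outermost first" is realized by scanning $x$ left to right and $y$ right to left, the check for $k = j = 0$ is skipped, the notion of compatibility is our interpretation of the definition, and the cap `max_K` is an added safeguard that the paper's loop does not have.

```python
def parse_L(s, i=0):
    # Recognizer for the flattened Figure 1 grammar: L -> ag L hb L | cd L | eps.
    while True:
        if s.startswith("ag", i):
            j = parse_L(s, i + 2)
            if j is None or not s.startswith("hb", j):
                return None
            i = j + 2
        elif s.startswith("cd", i):
            i += 2
        else:
            return i


def in_language(s):
    return parse_L(s) == len(s)


def scan(fragment, T):
    """Return (calls left open, returns left dangling, pairing error seen)."""
    calls, rets = {a for a, _ in T}, {b for _, b in T}
    stack, dangling, bad = [], [], False
    for ch in fragment:
        if ch in calls:
            stack.append(ch)
        elif ch in rets:
            if not stack:
                dangling.append(ch)
            elif (stack.pop(), ch) not in T:
                bad = True
    return stack, dangling, bad


def well_matched(s, T):
    stack, dangling, bad = scan(s, T)
    return not (stack or dangling or bad)


def compatible(T, x, y):
    open_calls, dangling = scan(x, T)[0], scan(y, T)[1]
    return any(a in open_calls and b in dangling for a, b in T)


def candidate_nesting(seeds, oracle, K):
    """Enumerate partitionings u x z y v that pass the bounded nesting test."""
    found = []
    for s in seeds:
        n = len(s)
        for p1 in range(n):
            for p2 in range(p1 + 1, n):
                for p3 in range(p2, n):
                    for p4 in range(p3 + 1, n + 1):
                        u, x, z, y, v = s[:p1], s[p1:p2], s[p2:p3], s[p3:p4], s[p4:]
                        if all(oracle(u + x * k + z + y * j + v) == (k == j)
                               for k in range(K + 1) for j in range(K + 1)
                               if (k, j) != (0, 0)):
                            found.append((x, y))
    return found


def search(todo, done, T, seeds):
    """Backtracking search for a tagging covering every pattern in todo."""
    if not todo:
        return T
    (x, y), rest = todo[0], todo[1:]
    if compatible(T, x, y):
        return search(rest, done + [(x, y)], T, seeds)
    for a in x:                      # outermost call candidate first
        for b in reversed(y):        # outermost return candidate first
            T2 = T | {(a, b)}
            if (all(well_matched(s, T2) for s in seeds)
                    and all(compatible(T2, px, py) for px, py in done + [(x, y)])):
                result = search(rest, done + [(x, y)], T2, seeds)
                if result is not None:
                    return result
    return None


def tag_infer(seeds, oracle, max_K=4):
    for K in range(2, max_K + 1):    # the paper starts at K = 2
        T = search(candidate_nesting(seeds, oracle, K), [], frozenset(), seeds)
        if T is not None:
            return set(T)
    return None


if __name__ == "__main__":
    print(tag_infer(["agcdcdhbcd"], in_language))
```

On the running example the first candidate pattern already forces the pair $(a, b)$, and every remaining candidate is covered by it, so no backtracking is needed.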

The following theorem states that, for some bounded $K$, Algorithm 3 terminates and returns a compatible tagging.

Theorem 4.12 (Termination and Correctness of Algorithm 3).

Let $m$ be the number of states of the minimal $k$-SEVPA for the oracle VPL. There exists a number $K \leq ((m^2 + 2m)^2 + 1)^2$ such that $\mathbf{tagInfer}(\mathcal{O}, S)$ returns a tagging that is compatible with a finite set of seed strings $S$.
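To get a feel for the size of this bound, it can be evaluated for small $m$. The following worked arithmetic is ours and is only meant as an illustration of the formula in Theorem 4.12:

```latex
\begin{align*}
m = 1:&\quad ((1^2 + 2\cdot 1)^2 + 1)^2 = (3^2 + 1)^2 = 10^2 = 100,\\
m = 2:&\quad ((2^2 + 2\cdot 2)^2 + 1)^2 = (8^2 + 1)^2 = 65^2 = 4225.
\end{align*}
```

The bound grows as $O(m^8)$. It is a worst-case guarantee; Algorithm 3 itself starts from $K = 2$ and increases $K$ only when no compatible tagging is found.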

Note that the theorem is with respect to a finite set of seed strings $S$. It does not say whether the found tagging is compatible with all strings in the oracle language $\mathcal{L}$. We address this by showing that there exists a finite set of seed strings $S_0$ such that any tagging $T$ compatible with $S_0$ is also compatible with all strings in $\mathcal{L}$.

Theorem 4.13 (Finite and Sufficient Seed Strings).

For any given oracle language $\mathcal{L}$, there exists a finite set of seed strings, denoted as $S_0$, such that any tagging that is compatible with $S_0$ is also compatible with $\mathcal{L}$.

The proof of Theorem 4.13 is provided in Appendix B.7. We illustrate how $S_0$ is constructed using the VPG shown in Figure 1. In general, for each matching rule $L \to ‹a\,A\,b›\,B$ where the nonterminal $A$ can be recursively rewritten into $L$ via a sequence of derivations, we generate a string reflecting this recursion and add it to $S_0$. The example VPG includes the matching rules $L \to ‹a\,A\,b›\,L$ and $A \to ‹g\,L\,h›\,E$. We start by expanding $L$ to reveal $A$, followed by expanding $A$ to reveal $L$, resulting in the pattern $‹a\,‹g\,L\,h›\,b›\,L$. Then expanding each $L$ to $cd$ gives us the seed string $agcdhbcd$. We also generate a seed string witnessing the recursive transition from $A$ to $L$ and back to $A$, which leads to strings like $agagcdhbhbcd$.
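Written out step by step with the rules of Figure 1, the derivation of the first seed string looks as follows (a worked illustration; each remaining $L$ is expanded by $L \Rightarrow c\,B \Rightarrow c\,d\,L \Rightarrow c\,d$):

```latex
\begin{align*}
L &\Rightarrow ‹a\ A\ b›\ L                 && \text{by } L \to ‹a\ A\ b›\ L\\
  &\Rightarrow ‹a\ ‹g\ L\ h›\ E\ b›\ L      && \text{by } A \to ‹g\ L\ h›\ E\\
  &\Rightarrow ‹a\ ‹g\ L\ h›\ b›\ L         && \text{by } E \to \epsilon\\
  &\Rightarrow^{*} ‹a\ ‹g\ c\,d\ h›\ b›\ c\,d && \text{by expanding both } L \text{ to } cd
\end{align*}
```

Removing the tags from the last line yields the untagged seed string $agcdhbcd$.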

In conclusion, there exists a finite set of seed strings that enables our algorithm to identify a tagging $T$ compatible with the oracle language $\mathcal{L}$. This compatibility guarantees that the tagged oracle language is a VPL. With the learned tagging as input, Algorithm 1 can then be used to achieve exact learning.

5. V-Star for a Token-Based VPL

In Section 4, we assumed that tagging of the oracle language is on individual characters; i.e., each character is uniquely tagged. This assumption does not always align with real-world scenarios. For instance, in JSON, a curly bracket { may serve as a call symbol, yet it can also be a key, exemplified by {"{":true}; in XML documents, an opening tag such as <p> functions as a call symbol, but it is composed of multiple characters. In this section, we enhance V-Star to address these scenarios.

5.1. Problem Statement

The oracle language $\mathcal{L}$ is a VPL when sentences in $\mathcal{L}$ are converted to sequences of tokens determined by an oracle tokenizer. Formally, a tokenizer function $\tau : \Sigma^* \to H_\tau^*$ translates a string $s$ from $\mathcal{L}$ into a sequence of tagged tokens, where $H_\tau = H_{\text{call}} \cup H_{\text{plain}} \cup H_{\text{ret}}$ is the set of tagged tokens given by $\tau$; we write $H$ when $\tau$ is clear from the context. The language $\{\tau(s) \mid s \in \mathcal{L}\}$ is assumed to be a VPL over tokens in $H$. Each token $h$ is defined as a regular language, often specified by a regular expression. The notation $s \in h$ indicates that string $s$ belongs to token $h$. We use the metasymbols $h_a$, $h_c$, and $h_b$ for call, plain, and return tokens, respectively.

$$
\begin{aligned}
L &\to \texttt{OPEN}\ L\ \texttt{CLOSE} \mid \texttt{TEXT}\\
\texttt{OPEN} &\to \texttt{<p>}\\
\texttt{CLOSE} &\to \texttt{</p>}\\
\texttt{TEXT} &\to \texttt{[a..z]+}
\end{aligned}
$$

Figure 2. An example XML grammar and the associated lexical rules.

A toy XML grammar is shown in Figure 2, and we use the seed string $s=\texttt{<p><p>p</p></p>}$ as an example. The tokens are OPEN, TEXT, and CLOSE. The oracle tokenizer converts <p><p>p</p></p> into the token sequence [OPEN, OPEN, TEXT, CLOSE, CLOSE], where OPEN is a call symbol, TEXT is a plain symbol, and CLOSE is a return symbol.

The oracle still provides membership and equivalence queries. The membership query function $\chi_{\mathcal{L}}:\Sigma^{*}\to\{\text{True},\text{False}\}$ is as before. However, we change the form of equivalence queries. The reason for the change is to convert the oracle language to a character-based VPL so that we can reuse Algorithm 1 for learning a hypothesis VPA.

To model equivalence queries, we first define a converter function. A tokenizer $\tau$ identifies boundaries of call and return tokens for a string. We then use $\mathbf{conv}_{\tau}:\Sigma^{*}\to\tilde{\Sigma}_{\tau}^{*}$ to transform a valid string $s\in\mathcal{L}$ into a new string $\tilde{s}=\mathbf{conv}_{\tau}(s)$ by inserting artificial call and return symbols to mark token boundaries. This process is formalized next. Given a tokenizer $\tau$ with $H_{\tau}=H_{\text{call}}\cup H_{\text{plain}}\cup H_{\text{ret}}$, we first build an extended character set $\tilde{\Sigma}_{\tau}$: for the $i$-th pair of call and return tokens $h_{a_i}$ and $h_{b_i}$, we generate a pair of call and return symbols ‹$a_i$ and $b_i$› outside of $\Sigma$. We define $\tilde{\Sigma}_{\tau}$ as $\Sigma\cup\{\text{‹}a_i\mid i\in[1..|H_{\text{call}}|]\}\cup\{b_i\text{›}\mid i\in[1..|H_{\text{ret}}|]\}$. Then, the transformation of $s\in\mathcal{L}$ into a string of the language $\tilde{\mathcal{L}}$ over $\tilde{\Sigma}_{\tau}$ proceeds as follows. Let $\tau(s)=\tau(s_1\dots s_k)=t_1\dots t_k$, where $s_i\in t_i$ for each $i\in[1..k]$. We construct $\tilde{s}$ based on tokenization: for each $i\in[1..k]$, if $t_i$ is the call token $h_{a_j}$, the call symbol ‹$a_j$ is added before $s_i$ in $s$; if $t_i$ is the return token $h_{b_j}$, the return symbol $b_j$› is added after $s_i$ in $s$. For instance, for the XML grammar in Figure 2, with the call-return token pair being (OPEN, CLOSE), our extended character set $\tilde{\Sigma}_{\tau}$ has two additional characters, say $\triangleleft$ and $\triangleright$. The seed string $s=\texttt{<p><p>p</p></p>}$ is converted to $\triangleleft\texttt{<p>}\triangleleft\texttt{<p>}\texttt{p}\texttt{</p>}\triangleright\texttt{</p>}\triangleright$. Note that the resulting string after conversion is a well-matched string in a character-based VPL that has the call symbol $\triangleleft$ and return symbol $\triangleright$. This allows us to reuse the previous algorithm for learning a character-based VPA.
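The conversion can be illustrated with a minimal Python sketch for the toy XML grammar of Figure 2. The names `CALL_RETURN_PAIRS`, `CALL_SYMBOLS`, `RETURN_SYMBOLS`, and `conv` are hypothetical and chosen only for this illustration; the sketch assumes the call and return tokens are already known as regular expressions and, for simplicity, ignores the $k$-Repetition check discussed later.

```python
import re

# Hypothetical tokenizer for the toy XML grammar of Figure 2:
# one call/return token pair, each given as a regular expression.
CALL_RETURN_PAIRS = [(r"<p>", r"</p>")]

# Artificial call/return symbols, assumed to lie outside the original alphabet.
CALL_SYMBOLS = ["◁"]
RETURN_SYMBOLS = ["▷"]


def conv(s: str) -> str:
    """Insert an artificial call symbol before each call token and an
    artificial return symbol after each return token."""
    out = []
    i = 0
    while i < len(s):
        for j, (call_re, ret_re) in enumerate(CALL_RETURN_PAIRS):
            m = re.compile(call_re).match(s, i)
            if m and m.end() > i:
                out.append(CALL_SYMBOLS[j] + m.group())
                i = m.end()
                break
            m = re.compile(ret_re).match(s, i)
            if m and m.end() > i:
                out.append(m.group() + RETURN_SYMBOLS[j])
                i = m.end()
                break
        else:
            # No call/return token starts here: copy the character unchanged.
            out.append(s[i])
            i += 1
    return "".join(out)


print(conv("<p><p>p</p></p>"))
```

Under these assumptions, the seed string is mapped to the converted string given in the text, with every `<p>` preceded by the artificial call symbol and every `</p>` followed by the artificial return symbol.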

With the converter function defined, we can model an equivalence query. $\mathcal{E}(\mathcal{H}_{\tau},\tau)$ takes a hypothesis VPA $\mathcal{H}_{\tau}$ and a hypothesis tokenizer $\tau$ and returns none when the oracle language is equivalent to the unconverted language recognized by $\mathcal{H}_{\tau}$ and otherwise returns some $s$ such that $\chi_{\mathcal{L}}(s)\neq\chi_{(\mathcal{H}_{\tau},\tau)}(s)$, where

$$
\chi_{(\mathcal{H}_{\tau},\tau)}(s)=\begin{cases}\text{True}&\text{if }\mathbf{conv}_{\tau}(s)\text{ is accepted by }\mathcal{H}_{\tau},\\ \text{False}&\text{otherwise}.\end{cases}
$$

A learner achieves exact learning if $\forall s\in\Sigma^{*},\;\chi_{\mathcal{L}}(s)=\chi_{(\mathcal{H}_{\tau},\tau)}(s)$.

Similar to the method we discussed in Section 4.3, we utilize nesting patterns to identify the boundaries of call and return tokens. Our objective is to discover a compatible tokenizer, which ensures that the language $\tilde{\mathcal{L}}_{\tau}=\{\mathbf{conv}_{\tau}(s)\mid s\in\mathcal{L}\}$ is a VPL. We will demonstrate the existence of a finite set of substrings from which our algorithm can successfully learn a compatible tokenizer. Then the aforementioned converter function transforms the oracle language into a character-based VPL, which according to Theorem 4.7 can be efficiently learned by Algorithm 1.

Assumptions for oracle languages and tokenizers

We previously defined a tokenizer as a function that maps a string to a list of tokens. However, assuming an arbitrary tokenizer is insufficient, as it has been demonstrated that any CFG can be mapped to a VPG through some tagging (Alur and Madhusudan, 2004). Take, for instance, the CFG $\{L\to cLc\mid c\}$. A tokenizer might tag $c$ differently based on its position within a string, e.g., map the string $ccc$ to the token list [‹$c$, $c$, $c$›], where ‹$c$ and $c$› represent call and return tokens, respectively; the resulting language is a VPL. To simplify tag learning to the setting where tagging is context independent, the oracle tokenizer and the language are assumed to satisfy the following properties:

Tokenization Consistency. For a string $p=p_1\ldots p_k$, if each substring $p_i$ belongs to a token $t_i$, then $\tau(p)=t_1\ldots t_k$. For example, string <p></p> can be split into [<p>, </p>]; this assumption requires it to be tokenized as [OPEN, CLOSE].

Separation. Strings for different tokens do not overlap. For the previous example of $\{L\to cLc\mid c\}$, tagging the first $c$ as a call token and the last as a return token would violate this property.

Exclusivity. A prefix or suffix of a call or return token $h$ cannot serve as an infix of $h$. Exclusivity is not required for a token that contains only a single character.

Unique Pairing. Each call token is uniquely paired with a return token, similar to an assumption we made for the setting of character-based VPLs.

Token Fixed Prefix and Suffix. For each call or return token $h$, if $h$ contains more than a single character, we require that there exist a prefix $q$ and a suffix $g$ such that all strings of $h$ start with $q$ and end with $g$. Further, there exists a string $s_h$ of $h$ such that the combination of the prefixes and suffixes of $s_h$ constitutes a sufficient set of test strings for exact learning of the token using $L^{*}$, from the membership query function $\lambda s.\chi_{\mathcal{L}}(wsw')$, where $w,w'$ are any strings such that $ws_hw'\in\mathcal{L}$.

$k$-Repetition. Given a positive number $k$, for each valid string $s=w_1ww_3$ where $w$ is a nonempty substring, we say that $w$ is $k$-repeatable in $s$ if $w_1w^kw_3$ is also a valid string. A language $\mathcal{L}$ and its tokenizer $\tau$ are said to satisfy $k$-Repetition if, for any valid string $s\in\mathcal{L}$ and any substring $w$ in $s$, if $w$ belongs to a call or return token $h$, but is not tokenized as $h$ in $s$, then $w$ is $k$-repeatable in $s$.

For example, consider the JSON string {"{":true}. Suppose $w$ is the second {; since it is inside quotes, it is part of the token for a JSON key (i.e., "{"), even though { itself is a call token. For any $k$, $w$ is $k$-repeatable since the string $\texttt{\{"}w^k\texttt{":true\}}=\texttt{\{"\{}\ldots\texttt{\{":true\}}$ remains a valid JSON string. In our implementation, we set $k$ to 2.
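The $k$-repeatability test amounts to a single membership query. The following Python sketch illustrates it for the JSON example, using the standard-library `json.loads` as a stand-in membership oracle; the function names `is_valid_json` and `is_k_repeatable` are hypothetical, and positions are 0-based with an exclusive end, which is a convention of this sketch only.

```python
import json


def is_valid_json(s: str) -> bool:
    """Stand-in membership oracle: accepts exactly what json.loads parses."""
    try:
        json.loads(s)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def is_k_repeatable(s: str, start: int, end: int, k: int,
                    member=is_valid_json) -> bool:
    """Return True if replacing w = s[start:end] by k copies of w keeps
    the string in the language, as judged by the oracle `member`."""
    w = s[start:end]
    return member(s[:start] + w * k + s[end:])


s = '{"{":true}'
print(is_k_repeatable(s, 2, 3, k=2))  # the { inside the key
print(is_k_repeatable(s, 0, 1, k=2))  # the opening {
```

With this oracle, the `{` inside the quoted key is 2-repeatable, whereas duplicating the opening `{` makes the string invalid, so that occurrence is not 2-repeatable.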

While we make the aforementioned assumptions, our approach is still quite expressive, since the above properties are typically satisfied by practical grammars, including those in our evaluation.

5.2. Tagging Inference for Tokens

In this section, we define the compatibility of tokenizers and present a theorem about the relation between a compatible tokenizer and the converted language $\tilde{\mathcal{L}}_{\tau}$. Then, we discuss an algorithm that infers a compatible tokenizer from given seed strings.

To define compatible tokenizers, we introduce some additional definitions. Given a tokenizer $\tau$, recall that for a string $s$ in $\mathcal{L}$, $\mathbf{conv}_{\tau}(s)$ is built by inserting artificial call and return symbols into $s$. Now, let $s=s_1s_2s_3$. We define $\mathbf{conv}_{\tau,s}(s_2)$ as the substring in $\mathbf{conv}_{\tau}(s)$ that corresponds to $s_2$; additionally, if $\mathbf{conv}_{\tau}$ inserted a call symbol between $s_1$ and $s_2$, then $\mathbf{conv}_{\tau,s}(s_2)$ includes and starts with that call symbol; if $\mathbf{conv}_{\tau}$ inserted a return symbol between $s_2$ and $s_3$, then $\mathbf{conv}_{\tau,s}(s_2)$ includes and ends with that return symbol. For example, for the seed string $s=\texttt{<p><p>p</p></p>}$, $\mathbf{conv}_{\tau,s}(\texttt{<p>})=\triangleleft\,\texttt{<p>}$, and $\mathbf{conv}_{\tau,s}(\texttt{</p>})=\texttt{</p>}\,\triangleright$.

Definition 5.1 (Compatible Tokenizers).

We say that a tokenizer $\tau$ is compatible with a set of nesting patterns $N$ if, for each nesting pattern $s=uxzyv$ in $N$, there exists a pair of artificial call and return symbols $(\triangleleft,\triangleright)$ in $\tau$ such that (1) $\mathbf{conv}_{\tau,s}(x)$ includes $\triangleleft$ and $\mathbf{conv}_{\tau,s}(y)$ includes $\triangleright$, and (2) $\triangleleft$ is unmatched in $\mathbf{conv}_{\tau,s}(x)$ and $\triangleright$ is unmatched in $\mathbf{conv}_{\tau,s}(y)$.

We say that a tokenizer $\tau$ is compatible with a set of seed strings $S$ if (1) for each string $s$ in $S$, $\mathbf{conv}_{\tau}(s)$ is well-matched, and (2) $\tau$ is compatible with all nesting patterns of $S$.
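The two compatibility conditions can be checked with a simple stack scan over converted strings. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the artificial symbols are fixed to a single pair ($\triangleleft$, $\triangleright$), and the inputs are assumed to be the already converted fragments $\mathbf{conv}_{\tau,s}(x)$ and $\mathbf{conv}_{\tau,s}(y)$.

```python
def pending_symbols(fragment, pairs=(("◁", "▷"),)):
    """Return (unmatched calls, unmatched returns) among the artificial
    symbols of `fragment`; ordinary characters are ignored."""
    call_of = {ret: call for call, ret in pairs}
    calls = {call for call, _ in pairs}
    stack, pending_returns = [], []
    for ch in fragment:
        if ch in calls:
            stack.append(ch)
        elif ch in call_of:
            if stack and stack[-1] == call_of[ch]:
                stack.pop()  # this return closes the innermost call
            else:
                pending_returns.append(ch)
    return stack, pending_returns


def is_well_matched(converted, pairs=(("◁", "▷"),)):
    """A converted string is well-matched if nothing is left pending."""
    pending_calls, pending_returns = pending_symbols(converted, pairs)
    return not pending_calls and not pending_returns


def compatible_with_pattern(conv_x, conv_y, pairs=(("◁", "▷"),)):
    """Some pair (call, ret) has an unmatched call in conv_x and an
    unmatched return in conv_y (conditions (1) and (2) of the definition)."""
    calls_x, _ = pending_symbols(conv_x, pairs)
    _, returns_y = pending_symbols(conv_y, pairs)
    return any(c in calls_x and r in returns_y for c, r in pairs)


print(is_well_matched("◁<p>◁<p>p</p>▷</p>▷"))
print(compatible_with_pattern("◁<p>", "</p>▷"))
```

For the XML seed string, the converted string is well-matched and the fragments for `<p>` and `</p>` carry an unmatched call and an unmatched return, respectively; a tokenizer that inserts no artificial symbols fails the second check.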

Now we present Theorem 5.2 as the basis for exact learning.

Theorem 5.2.

Assume the oracle language and the oracle tokenizer satisfy the Tokenization Consistency and Separation properties. Given a tokenizer $\tau$ that is compatible with the oracle language $\mathcal{L}$, language $\tilde{\mathcal{L}}_{\tau}$ is a VPL.

Now, we propose Algorithm 4 to infer a hypothesis compatible tokenizer. Instead of finding a full-fledged tokenizer, the algorithm infers a partial tokenizer, which recognizes only call and return tokens in an input string; the syntax of plain tokens is instead learned during the VPA learning process. As a result of this choice, substrings between call/return tokens recognized by a partial tokenizer are implicitly treated as plain tokens. We represent a partial tokenizer as a set $D=\{(r_i,r_i')\mid i\in[1..|H_{\text{call}}|]\}$, where $r_i$ and $r_i'$ are the regular expressions for the $i$-th paired call and return token, respectively. Function $\mathbf{conv}_{D}(s)$ and compatibility are similarly defined for a partial tokenizer $D$; we omit them for brevity.

At a high level, Algorithm 4 identifies call/return tokens by enumerating potential prefixes and suffixes based on the Token Fixed Prefix and Suffix assumption. Further, under the Exclusivity assumption, we can prove that oracle call/return tokens must appear in $(x^2,y^2)$ for a certain nesting pattern $(x,y)$; the proof is provided in the appendix as Lemma C.2. Therefore, we restrict our enumeration to substrings within $(x^2,y^2)$. Our approach begins by searching within $(x,y)$ and then progressively expands the search space to $(x^2,y^2)$. Upon identifying a candidate prefix-suffix pair for a token, we learn the token's lexical rules as a regular expression within the prefix-suffix pair using $L^{*}$ at line 6; in this learning, we simulate the equivalence queries using test strings obtained by combining the prefixes and suffixes of $x$ and $y$, respectively. We then incorporate the tokens into the partial tokenizer and proceed to assess the tokenizer's compatibility with the nesting patterns of seed strings (line 7).

One compatibility condition is that the seed strings after tokenization should be well-matched; for that, Algorithm 5 is used to tokenize a string based on a given partial tokenizer. The main challenge of tokenization is that we have only a partial tokenizer and we need to rely on $k$-Repetition to deal with the case when a plain token string contains a call/return token as a substring. For example, in $s=\texttt{\{"\{":true\}}$, the second { is actually part of the plain token "{" and should not be treated as a call token. To demonstrate Algorithm 5, consider a partial tokenizer $D=\{(\texttt{\{},\texttt{\}})\}$ and an input JSON string $s=\texttt{\{"\{":true\}}$. Starting with string index $i=1$ and token list $l=[]$, Algorithm 5 matches the first { as $m_1$ and pushes it to $l$. Since $i=2$ does not result in any match, $i$ is updated to 3, where the second { is matched as $m_2$; however, it is not added to the token list since it is $k$-repeatable. Finally, the last } is matched as $m_3$. As a result, Algorithm 5 returns $[m_1,m_3]$.

Input: Oracle $\mathcal{O}$ and seed strings $S$.
Output: Some tokenizer $D$ compatible with $S$, or None if no compatible tokenizer is found.
1  Function $\mathbf{tokenSearch}(N, N_{\mathrm{done}}, D)$:
2      if $N$ is empty then return $\texttt{Some}(D)$;
3      Take a nesting pattern $s=uxzyv$ from $N$;
4      if $D$ is incompatible with $uxzyv$ then
5          foreach disjoint substrings $q$ and $g$ in $x$ and $x^2$, and $q'$ and $g'$ in $y$ and $y^2$ do
6              Based on $((q,g),(q',g'))$ learn a new call-return token pair $(r,r')$;
7              if $D\cup\{(r,r')\}$ is compatible with $N_{\mathrm{done}}\cup\{uxzyv\}$ then
8                  $D'\leftarrow\mathbf{tokenSearch}(N\setminus\{uxzyv\}, N_{\mathrm{done}}\cup\{uxzyv\}, D\cup\{(r,r')\})$;
9                  if $D'$ is not None then return $D'$;
11             end if
13         end foreach
           return None; // No compatible tokenizer found
15     end if
16     else return $\mathbf{tokenSearch}(N\setminus\{uxzyv\}, N_{\mathrm{done}}\cup\{uxzyv\}, D)$;
19 Initialize $K$ as 1;
20 repeat
21     $K\leftarrow K+1$; $N_{S,K}\leftarrow\mathbf{candidateNesting}(S,K)$; $D\leftarrow\mathbf{tokenSearch}(N_{S,K},\emptyset,\emptyset)$;
22 until $D\neq\texttt{None}$;
23 return $D$;
Algorithm 4 The $\mathbf{tokenInfer}(\mathcal{O},S)$ algorithm that infers call and return tokens. The $\mathbf{candidateNesting}(-,-)$ function is the same as the one in Algorithm 3.
Input: A partial tokenizer $D$ and a string $s$.
Output: A token list $l$.
Initialize the token list as $l\leftarrow[]$;
Initialize the current location of string $s$ as $i\leftarrow 1$;
while $i\leq|s|$ do
    if we find a first match $w=s[i]\ldots s[j]$ for token $h\in D$ and $w$ is not $k$-repeatable then
        Push new match $m=(h,i,j)$ to token list $l$;
        $i\leftarrow j+1$;
    end if
    else $i\leftarrow i+1$;
end while
return $l$;
Algorithm 5 The $\mathbf{tokenize}(D,s)$ algorithm that tokenizes a string.
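As an illustration of Algorithm 5, the following Python sketch tokenizes the JSON example with a partial tokenizer. It is a sketch under stated assumptions, not the V-Star implementation: the membership oracle is approximated by the standard-library `json.loads`, the names `is_valid_json` and `tokenize` are hypothetical, and matches are reported with 0-based, end-exclusive positions instead of the 1-based inclusive indices of the pseudocode.

```python
import json
import re


def is_valid_json(s):
    """Stand-in membership oracle based on json.loads."""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False


def tokenize(D, s, member=is_valid_json, k=2):
    """Scan s left to right; record a call/return token match unless the
    matched substring is k-repeatable in s (then it is treated as plain)."""
    patterns = [re.compile(r) for pair in D for r in pair]
    tokens = []
    i = 0
    while i < len(s):
        match = None
        for p in patterns:
            m = p.match(s, i)
            if m and m.end() > i:  # ignore empty matches
                match = m
                break
        if match is not None:
            w = match.group()
            repeated = s[:i] + w * k + s[match.end():]
            if not member(repeated):
                # Not k-repeatable: accept as a call/return token.
                tokens.append((match.re.pattern, i, match.end()))
                i = match.end()
                continue
        i += 1  # no usable match at position i
    return tokens


D = [(r"\{", r"\}")]
print(tokenize(D, '{"{":true}'))
```

On the input `{"{":true}`, the sketch keeps the first `{` and the final `}` and skips the `{` inside the key, mirroring the result $[m_1,m_3]$ described in the text.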

We next illustrate the steps of Algorithm 4 using our XML example. We start with an empty tokenizer $D$. We then iteratively select a nesting pattern $s=uxzyv$, tokenize $s$ using Algorithm 5, and verify the compatibility of $D$ with the tokenization. In our example, we start with the seed string $s=\texttt{<p><p>p</p></p>}$ and pick a nesting pattern. Suppose V-Star picks the outermost pattern $(\texttt{<p>},\texttt{</p>})$. The token list $\mathbf{tokenize}(D,s)$ of $s$ is empty, since $D$ does not yet contain any rule for finding a token. Clearly, this tokenizer $D$ is not compatible with $uxzyv$.

We then extend $D$ by a call-return token pair learned from $((q,g),(q',g'))$ derived from $(x,y)$ or $(x^2,y^2)$. By enumerating candidate prefixes and suffixes $((q,g),(q',g'))$ within $(x,y)=(\texttt{<p>},\texttt{</p>})$, we can build the first call-return token pair. From $x$, we first pick the outermost $(\texttt{<},\texttt{>})$ as $(q,g)$; then in $y$, we pick the outermost $(\texttt{<},\texttt{>})$ as $(q',g')$. By learning the tokens' lexical rules from the membership query functions $\lambda w.\chi_{\mathcal{L}}(w\,\texttt{<p>p</p></p>})$ and $\lambda w.\chi_{\mathcal{L}}(\texttt{<p><p>p</p>}\,w)$, we identify two regular expressions $\texttt{<p>}$ and $\texttt{</p>}$ for the call and return tokens, respectively. Note that if the open tag contained XML attributes, the learned lexical rules would encompass regular expressions that specify these attributes. To check if the partial tokenizer $D=\{(\texttt{<p>},\texttt{</p>})\}$ is compatible with $s=\texttt{<p><p>p</p></p>}$, we need to tokenize $s$ following Algorithm 5, which returns the token list $[\texttt{<p>},\texttt{<p>},\texttt{</p>},\texttt{</p>}]$. It can be shown that this partial tokenizer is compatible with all nesting patterns of string $s$. Therefore, Algorithm 4 ends here and returns this compatible tokenizer.
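The tokenization step in this example can be sketched in Python. This is only a minimal illustration under stated assumptions: the function names `tokenize` and `well_matched`, the representation of $D$ as a list of regular-expression pairs, and the greedy left-to-right scan are all hypothetical stand-ins, not the paper's Algorithm 5.

```python
import re

def tokenize(D, s):
    """Return the call/return tokens of s recognized by the partial tokenizer D.

    D is assumed to be a list of (call_regex, return_regex) pairs. Characters
    not covered by any rule are treated as plain text and skipped. This greedy
    scan is an illustrative stand-in, not the paper's Algorithm 5.
    """
    rules = []
    for call_re, ret_re in D:
        rules.append(("call", re.compile(call_re)))
        rules.append(("ret", re.compile(ret_re)))
    tokens, i = [], 0
    while i < len(s):
        for kind, rx in rules:
            m = rx.match(s, i)
            if m and m.end() > i:
                tokens.append((kind, m.group()))
                i = m.end()
                break
        else:
            i += 1  # no rule applies here: plain character
    return tokens

def well_matched(tokens):
    """Check that every call token is closed by a later return token."""
    depth = 0
    for kind, _ in tokens:
        depth += 1 if kind == "call" else -1
        if depth < 0:
            return False
    return depth == 0

seed = "<p><p>p</p></p>"
empty_D = []                     # the initial, empty tokenizer
D = [(r"<p>", r"</p>")]          # after learning the first call-return pair

print(tokenize(empty_D, seed))               # no rules, so no tokens
print([t for _, t in tokenize(D, seed)])     # ['<p>', '<p>', '</p>', '</p>']
```

With the empty tokenizer the token list is empty, mirroring the incompatibility noted above; with the learned pair, the token list is well-matched.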

Lemma 5.0 (Finite and Sufficient Seed Strings).

Assume the oracle language and the oracle tokenizer satisfy the Tokenization Consistency, Separation, Exclusivity, Unique Pairing, Token Fixed Prefix and Suffix, and $k$-Repetition properties. There exists a finite set of seed strings $S_0\subseteq\mathcal{L}$, with which we can find a tokenizer that is compatible with the oracle language $\mathcal{L}$ using Algorithm 4.

In summary, we can learn a compatible tokenizer from a certain finite set of seed strings. With a compatible tokenizer $\tau$, Theorem 5.2 gives us that $\tilde{\mathcal{L}}_{\tau}$ is a character-based VPL. Then by Theorem 4.7, we can use Algorithm 1 to learn $\tilde{\mathcal{L}}_{\tau}$ exactly under active learning.

6. Evaluation

In this section, we discuss V-Star’s implementation, evaluation and its comparison with two other state-of-the-art grammar inference tools, Glade (Bastani et al., 2017) and Arvada (Kulkarni et al., 2022), in the context of inferring grammars from program inputs.

Implementation

While black-box programs naturally support membership queries, direct support of equivalence queries is absent. To instantiate the MAT, we approximate equivalence queries through membership queries. In particular, we construct a set of strings by combining prefixes, infixes, and suffixes of the seed strings; for each such string $s$, if $\mathbf{conv}_{\tau}(s)$ is well-matched, we add it to a set of test strings. The set of test strings is then used to check the consistency between the hypothesis VPA and the oracle language. A test string becomes a counterexample if it witnesses inconsistency (i.e., either the hypothesis VPA or the oracle accepts the string, but not both). Similar ideas have appeared in conformance testing (Kumar et al., 2006; Aichernig et al., 2024).

Our previously discussed algorithm produces a visibly pushdown automaton (VPA), instead of a visibly pushdown grammar (VPG). Upon the successful learning of a VPA, we transform it into a VPG using methods outlined by Alur and Madhusudan (2004).

Datasets

For our experiments, we replicated the evaluation methodology of the Arvada study, utilizing their datasets (Kulkarni et al., 2022), including the oracle grammars, datasets for evaluating the recall (discussed later), and seed strings. We selected the grammars of JSON, LISP, XML, While, and MathExpr because each of them is a VPG.

Metrics

We evaluate the performance of V-Star using four key metrics: Recall, Precision, F-1 Score, and Number of Membership Queries. We define each metric as follows:

  1. Recall: This metric is the probability that a string of the oracle grammar is also a string of the learned grammar $G$ (Bastani et al., 2017). For finite languages, it can be defined as $\frac{|\mathcal{L}_{\mathcal{O}}\cap\mathcal{L}_{G}|}{|\mathcal{L}_{\mathcal{O}}|}$. Because the languages may be infinite, it may be impractical to compute recall directly. Instead, we approximate it by using a representative dataset from the oracle language and then calculating the proportion of this dataset that is accepted by the learned grammar.

  2. Precision: Conversely, precision is the probability that a string in the learned language is accepted by the oracle (Bastani et al., 2017). For finite languages, it can be defined as $\frac{|\mathcal{L}_{\mathcal{O}}\cap\mathcal{L}_{G}|}{|\mathcal{L}_{G}|}$. As with recall, we approximate precision by sampling strings from the learned grammar and calculating the percentage of strings that are accepted by the oracle. We adopt the same sampling method as Arvada (Kulkarni et al., 2022).

  3. F-1 Score: The F-1 score is the harmonic mean of precision and recall, defined as $\frac{2}{\frac{1}{R}+\frac{1}{P}}$, where $R$ is recall and $P$ is precision. The F-1 score serves as a measure of the overall accuracy, only reaching high values when both precision and recall are high.

  4. Number of Unique Membership Queries: This counts the number of unique membership queries, i.e., distinct oracle calls, made during the learning process. Since a particular string might be queried multiple times, we cache the result after the first query, and only count unique queries. This metric serves as an efficiency measure.
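The four metrics can be approximated along the following lines. This is a minimal sketch under stated assumptions: the function and class names (`recall`, `precision`, `f1`, `CachingOracle`) are hypothetical, membership tests are passed in as plain Python callables, and the sampling of strings from the learned grammar is assumed to happen elsewhere.

```python
def recall(oracle_samples, learned_accepts):
    """Fraction of sampled oracle strings accepted by the learned grammar."""
    return sum(1 for s in oracle_samples if learned_accepts(s)) / len(oracle_samples)

def precision(learned_samples, oracle_accepts):
    """Fraction of strings sampled from the learned grammar that the oracle accepts."""
    return sum(1 for s in learned_samples if oracle_accepts(s)) / len(learned_samples)

def f1(r, p):
    """Harmonic mean of recall r and precision p (0 if either is 0)."""
    if r == 0 or p == 0:
        return 0.0
    return 2 / (1 / r + 1 / p)

class CachingOracle:
    """Wrap a membership oracle so repeated queries hit a cache.

    The number of cached entries equals the number of unique membership queries.
    """
    def __init__(self, accepts):
        self._accepts = accepts
        self._cache = {}

    def __call__(self, s):
        if s not in self._cache:
            self._cache[s] = self._accepts(s)
        return self._cache[s]

    @property
    def unique_queries(self):
        return len(self._cache)

# Usage with a toy oracle that accepts strings of even length.
oracle = CachingOracle(lambda s: len(s) % 2 == 0)
for query in ["ab", "abc", "ab"]:
    oracle(query)
print(oracle.unique_queries)          # "ab" is only counted once
print(round(f1(0.42, 0.98), 2))       # recall and precision of Glade on json in Table 1
```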

Table 1. Evaluation on datasets where the oracle grammars are VPGs. “#Seeds” is the number of seed strings for each grammar. “#Queries” is the number of membership queries, while “%Q(Token)” and “%Q(VPA)” are the percentages of these queries attributed to token inference and VPA learning, respectively. “#TS” is the number of test strings sampled by V-Star. Results for Arvada are listed as the means over 10 runs ± the standard deviation (Kulkarni, 2023).

Glade

| Grammar | #Seeds | Recall | Precision | F1 | #Queries | Time |
|---|---|---|---|---|---|---|
| json | 71 | 0.42 | 0.98 | 0.59 | 11 K | 21 s |
| lisp | 26 | 0.23 | 1.00 | 0.38 | 3.8 K | 7 s |
| xml | 62 | 0.26 | 1.00 | 0.42 | 15 K | 21 s |
| while | 10 | 0.01 | 1.00 | 0.02 | 9.2 K | 13 s |
| mathexpr | 40 | 0.18 | 0.98 | 0.31 | 19 K | 42 s |

Arvada

| Grammar | Recall | Precision | F1 | #Queries | Time |
|---|---|---|---|---|---|
| json | 0.97 ± 0.09 | 0.92 ± 0.08 | 0.94 ± 0.05 | 6.8 K ± 394 | 25 s ± 2 s |
| lisp | 0.38 ± 0.26 | 0.95 ± 0.08 | 0.50 ± 0.18 | 2.2 K ± 307 | 8 s ± 2 s |
| xml | 0.99 ± 0.02 | 1.00 ± 0.00 | 1.00 ± 0.01 | 12 K ± 1 K | 61 s ± 5 s |
| while | 0.91 ± 0.20 | 1.00 ± 0.00 | 0.94 ± 0.14 | 5.4 K ± 563 | 15 s ± 1 s |
| mathexpr | 0.72 ± 0.24 | 0.96 ± 0.03 | 0.80 ± 0.16 | 6.6 K ± 421 | 24 s ± 2 s |

V-Star

| Grammar | Recall | Precision | F1 | #Queries | %Q(Token) | %Q(VPA) | #TS | Time |
|---|---|---|---|---|---|---|---|---|
| json | 1.00 | 1.00 | 1.00 | 541 K | 2.71 % | 97.29 % | 8043 | 33 min |
| lisp | 1.00 | 1.00 | 1.00 | 16 K | 1.37 % | 98.63 % | 693 | 77 s |
| xml | 1.00 | 1.00 | 1.00 | 208 K | 94.93 % | 5.07 % | 682 | 16 min |
| while | 1.00 | 1.00 | 1.00 | 1440 K | 9.40 % | 90.60 % | 119 | 1.5 h |
| mathexpr | 1.00 | 1.00 | 1.00 | 4738 K | 0.11 % | 99.89 % | 2602 | 6 h |

Results

Table 1 summarizes the performances of Glade, Arvada, and V-Star on oracle VPGs, with the results of Arvada and Glade assessed on the same platform as V-Star, utilizing the Arvada artifact (Kulkarni, 2023). The table shows that V-Star achieves exact learning for all oracles, exhibiting superior accuracy compared to other tools. However, V-Star issues a greater number of queries than Glade and Arvada, resulting in greater inference time. This primarily stems from (1) the substantial number of test strings used in approximating equivalence queries, and (2) the fact that V-Star consumes seed strings without pre-processing. In contrast, Glade and Arvada employ a pre-tokenization strategy, such as grouping digits or letters as a single terminal, which reduces seed string lengths. We take our approach since V-Star can learn tokens. Overall the evaluation shows that V-Star is more accurate but takes more time to infer grammars. In grammar learning, we believe that accuracy is a more important goal as a more accurate grammar benefits downstream applications greatly. Improving efficiency of V-Star (e.g., using heuristics of target grammars) while not decreasing accuracy is left for future work.

V-Star requires a considerable number of membership queries for the MathExpr grammar. This can be attributed in part to the large number of constant function names (26 in all) within the grammar, such as “sin” or “cos”. In its quest for high accuracy, V-Star explores various combinations of these constant names exhaustively. We acknowledge that this approach could be further optimized and propose this as an avenue for future improvement.

In Table 1, we include data on the percentage of membership queries allocated for token inference (“%Q(Token)”) and for learning VPA (“%Q(VPA)”). It can be observed that the majority of queries are utilized for VPA learning. This is mainly because seed strings tend to be short, leading to fewer potential nesting patterns. One exception is XML, where most queries are for token inference. This is because the XML grammar, primarily based on nested tag pairs, allows for easier inference of the overall grammar once the opening and closing tags (call and return tokens) are identified. Furthermore, many queries are required to infer the lexical rules of XML attributes. Additionally, the table provides information on the count of seed strings (“#Seeds”) used in our evaluation. For the grammars assessed, V-Star requires a relatively small number of seed strings to achieve exact learning. We attribute this to its strategy of combining a wide range of substrings into test strings that effectively simulate equivalence queries; column “#TS” shows the number of test strings constructed.

7. Future Work

We believe the performance of V-Star can be further improved with more advanced methods for generating counterexamples, such as using machine learning tools to infer counterexamples from seed strings and a VPA. A related direction is to investigate the potential adaptation of V-Star with discrimination trees. Other grammar inference tools that are based on discrimination trees such as TTT (Isberner, 2015) enhance inference efficiency by reducing counterexample lengths and minimizing membership queries. It remains to be seen how V-Star can be adapted in this manner and what improvements this can yield.

The present study focuses primarily on inferring well-matched VPGs using V-Star. However, our preliminary experience suggests that V-Star can also be effectively employed to learn general VPGs (Alur et al., 2005) with open call and return symbols. A general VPG can be used to specify streaming data. As such, the learning problem for general VPGs is a promising direction for future work.

Our method makes the assumption of unique call-return token pairing to ease tokenizer inference and reduce computational complexity, as matching one call token with multiple return tokens complicates tokenizer inference. It would be interesting future work to consider the implications of relaxing this assumption to enhance flexibility.

Experimentally, we focus on languages such as XML and JSON to align with benchmarks used by prior tools for a direct comparison. It would be interesting to evaluate V-Star on more complex programming language grammars to check its effectiveness on those grammars.

Improving the readability of VPGs inferred by V-Star is still a challenge. Currently, the grammars generated tend to be larger and less readable than oracle grammars, due to the inherent rigid requirements of VPG rules, the inclusion of lexical rules, and automatically named nonterminals. Although we have made attempts to refactor grammars using regular expressions, these solutions are largely heuristic and may not consistently yield optimal results. Exploring machine learning-based approaches presents a promising avenue to systematically enhance the clarity and conciseness of inferred grammars, making them potentially more accessible and understandable for users.

Finally, VPGs learned by V-Star may provide a valuable starting point for better inference algorithms of CFGs. For instance, similar to the CFGs learned by Glade (Bastani et al., 2017), the VPGs inferred by V-Star can serve as inputs for machine learning tools such as REINAM (Wu et al., 2019), which improves the input grammar with reinforcement learning. Comparing the improvements enabled by these different starting grammars would be an intriguing line of inquiry.

8. Conclusions

This paper introduces V-Star, an algorithm designed to take advantage of nesting structures in languages to achieve exact learning of visibly pushdown grammars. Through a set of novel techniques to infer token boundaries and tag call/return tokens, V-Star demonstrates its capability to learn a diverse array of practical languages. Our preliminary experiments are promising and show V-Star’s advantages of accurate learning.

Appendix A Proofs of Theorems in Section 4.2

Theorem A.1.

Given a tagging function $t$ such that language $\hat{\mathcal{L}}=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL, the minimal $k$-SEVPA of language $\hat{\mathcal{L}}$ can be learned with a polynomial number of equivalence and membership queries.

Proof.

We run Algorithm 1 with a target language $\hat{\mathcal{L}}$ over the alphabet $\hat{\Sigma}$. Define $m$ as the state count of the minimal $k$-SEVPA of $\hat{\mathcal{L}}$ and $n$ as the maximum length of counterexamples returned by equivalence queries. Proposition 4.4 establishes that the number of equivalence queries does not exceed $m$, as each iteration expands the number of states in $\mathcal{Q}$ by a minimum of one. This also shows that the algorithm must terminate.

A counterexample returned by an equivalence query causes at most $\log n$ membership queries as detailed in Proposition 4.6, resulting in no more than $m\log n$ membership queries during Step 4 of Algorithm 1. Membership queries in Steps 2 and 5 involve words either of form $wqw'$ or $wqmw'$, where $q\in Q_i$, $m\in\Sigma_M$, and $(w,w')\in C_i$. With $|C_i|$ bounded by $|Q_i|\leq m$ at completion, total queries amount to at most

$$\sum_{i=0}^{k}\bigl(|Q_i|+|Q_i||\Sigma_M|\bigr)|C_i| = \sum_{i=0}^{k}|Q_i||C_i|\bigl(1+|\Sigma_M|\bigr) \leq m^{2}\bigl(1+|\Sigma|+|\Sigma_{\text{call}}|\times m\times|\Sigma_{\text{ret}}|\bigr) \leq m^{3}|\Sigma|^{2}.$$

In conclusion, the number of queries is polynomially bounded in $n$, $m$, and $|\Sigma|$: it comprises $O(m^{3}|\Sigma|^{2}+m\log n)$ membership queries and $O(m)$ equivalence queries. ∎

Appendix B Proofs of Theorems in Section 4.3

Given tagging $t$, we say that string $s$ is $t$-well-matched if $t(s)$ is well-matched.

Definition B.0 (Parse Tree).

Given a grammar $G=(V,\Sigma,P,L_0)$, a parse tree with respect to grammar $G$ is an ordered tree where (1) the leaves of the tree are terminals in $\Sigma$ or $\epsilon$, and (2) each non-leaf node is a nonterminal $L$ in $V$, where the children of the node are either $\alpha_1, \alpha_2, \ldots, \alpha_n$ such that $L\to\alpha_1\alpha_2\ldots\alpha_n$ is a production rule in $P$, or $\epsilon$ such that $L\to\epsilon$ is a production rule. The root of the tree should be $L_0$, the start nonterminal of grammar $G$. A parse tree of a string $s\in\Sigma^{*}$ in the language of grammar $G$ is a parse tree whose leaves, when concatenated from left to right, form $s$.

Lemma B.0 (Pumping Lemma for VPLs).

For any VPL $\hat{\mathcal{L}}$, there exists a positive number $l$ such that, for any string $s$ in $\hat{\mathcal{L}}$ with length greater than $l$, it is possible to express $s$ according to one of the following conditions:

  1. (Regular Pumping) We can partition $s$ into $s=uxv$ for strings $u$, $x$, and $v$, with $x$ being non-empty, such that $ux^{k}v$ remains in $\hat{\mathcal{L}}$ for all $k\geq 0$.

  2. (Nesting Pumping) We can partition $s$ into $s=uxzyv$ for strings $u$, $x$, $z$, $y$, and $v$, with $x$ and $y$ being non-empty, $x$ containing a call symbol, and $y$ containing a return symbol, such that $ux^{k}zy^{k}v$ is valid for all $k\geq 1$.

Proof.

Let VPG $G=(\hat{\Sigma},V,P,L_0)$ be a grammar of $\hat{\mathcal{L}}$. We define $l$ as the length of the longest string $s$ that contains no recursion in any of its parse trees. Formally, in a parse tree of $s$, if the subtree rooted with a nonterminal node $A$ contains another appearance of $A$, we say there is recursion in the parse tree and we call $A$ a recursive nonterminal in the parse tree. $l$ is then defined as the length of the longest string $s$ whose parse trees do not contain recursion. This $l$ is well defined because the number of non-recursive parse trees is finite: any path that goes from the root to a leaf of a parse tree and exceeds the length of $|V|+2$ must have $|V|+1$ nonterminals and revisit at least one nonterminal twice.

For any string $s$ exceeding $l$ in length, one of its parse trees must have a recursive nonterminal; say it is $L$. The derivation of the parse tree can be written as $L_0\to^{*}uLv\to^{*}u(s_1Ls_2)v\to^{*}s$, where $L\to^{*}s_1Ls_2$. We have two cases:

  1. If $s_2$ is empty, then $s_1$ cannot be empty since $L\to L$ is not a valid VPG rule. Thus, $u(s_1)^{k}s_Lv$ remains valid, where $s_L$ is a terminal string derived from $L$. This satisfies regular pumping.

  2. If $s_2$ is not empty, then a matching rule is used somewhere in the derivation sequence that leads to the second appearance of $L$. This is because by the VPG rules, if only rules of the form $L_1\to cL_2$ or $L_1\to\epsilon$ were used, then $L$ must be the last symbol in the derived string $s_1Ls_2$, which contradicts the assumption that $s_2$ is not empty. This leads to:

     $$L\to^{*}s_1'\,\text{‹}a\,A\,b\text{›}\,B\,s_2'\to^{*}s_1'\,\text{‹}a\,s_1''Ls_2''\,b\text{›}\,B\,s_2'$$

     Here, $A\to^{*}s_1''Ls_2''$. We then select $x$ as $s_1'\,\text{‹}a\,s_1''$ and $y$ as $s_2''\,b\text{›}\,s_2'$ for nesting pumping.
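Nesting pumping can be observed concretely on the running XML example. The sketch below is only an illustration under stated assumptions: `oracle` is a hypothetical toy membership function for nested `<p>` elements (not one of the evaluated oracles), and `pump` is a hypothetical helper that builds $ux^{k}zy^{j}v$.

```python
import re

# Toy token shapes: a closing tag, an opening tag, or one lowercase text character.
TOKEN = re.compile(r"</p>|<p>|[a-z]")

def oracle(s):
    """Toy membership oracle: non-empty, well-nested <p>...</p> with text inside tags."""
    depth, pos = 0, 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if m is None:
            return False
        tok = m.group()
        if tok == "<p>":
            depth += 1
        elif tok == "</p>":
            depth -= 1
            if depth < 0:
                return False
        elif depth == 0:
            return False  # text outside any element
        pos = m.end()
    return depth == 0 and len(s) > 0

def pump(u, x, z, y, v, k, j=None):
    """Build u x^k z y^j v; j defaults to k (balanced pumping)."""
    j = k if j is None else j
    return u + x * k + z + y * j + v

# Decomposition of the seed <p><p>p</p></p> with the outermost pattern (x, y).
u, x, z, y, v = "", "<p>", "<p>p</p>", "</p>", ""

balanced = [oracle(pump(u, x, z, y, v, k)) for k in range(1, 5)]
unbalanced = [oracle(pump(u, x, z, y, v, k, j))
              for k in range(1, 5) for j in range(1, 5) if k != j]
print(all(balanced), any(unbalanced))
```

Pumping $x$ and $y$ the same number of times stays in the toy language, while pumping them unequally leaves it, which is the behavior a nesting pattern is expected to exhibit.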

Lemma B.0.

Consider an oracle VPL $\mathcal{L}$, a VPG $G=(\Sigma,V,P,L_0)$ for $\mathcal{L}$, and an oracle tagging $t_{\mathcal{O}}$. For each string $s\in\mathcal{L}$ and $s=uxzyv$, where $u,x,z,y,v$ are substrings, and $x,y$ are nonempty, if string $ux^{k}zy^{k}v$ is valid but string $ux^{k}zy^{j}v$ is invalid for $k,j\leq(|V|^{2}+1)^{2}$ and $k\neq j$, then $t_{\mathcal{O}}(x)$ contains an oracle call symbol, and $t_{\mathcal{O}}(y)$ contains an oracle return symbol, and the two symbols are matched with each other in $s$.

Proof.

In this proof, we abuse notation and write $x$ to also mean $t_{\mathcal{O}}(x)$.

We first show that $x$, as well as $y$, contains an unmatched symbol. Suppose otherwise; then $x$ and $y$ contain only plain symbols or well-matched call-return pairs. For each $k\geq 1$ and string $ux^{k}zy^{k}v$, the derivation path of $x^{k}$ can be written as $L_{k,1}\to^{*}x^{k}L_{k,2}$, where $L_{k,1},L_{k,2}$ are two nonterminals in $V$. This is because, if $x$ contains only plain symbols, then the derivation path of $x$ is of the form

$$L_{1}\to x[1]\,L_{2}\to x[1]\,x[2]\,L_{3}\to\cdots\to x\,L_{|x|+1}$$

for certain nonterminals $L_{i}$, $i\in[1..|x|+1]$. The case is similar when $x$ also contains well-matched substrings; we omit the discussion for brevity. Now, for each $k\in[1..|V|^{2}+1]$, we have the following derivations:

$$\begin{aligned}
L_{1,1} &\to^{*} x^{1}L_{1,2}\\
L_{2,1} &\to^{*} x^{2}L_{2,2}\\
&\dots\\
L_{|V|^{2}+1,1} &\to^{*} x^{|V|^{2}+1}L_{|V|^{2}+1,2}
\end{aligned}$$

Apparently, since there are at most $|V|^{2}$ distinct pairs of nonterminals, there exist $k'$ and $k_{1}\neq k_{2}$ such that $k',k_{1},k_{2}\leq|V|^{2}+1$ and the pair $(L_{k',1},L_{k',2})$ appears on both sides of two of these derivations, i.e.,

$$\begin{aligned}
&\dots\\
L_{k',1} &\to^{*} x^{k_{1}}L_{k',2}\\
&\dots\\
L_{k',1} &\to^{*} x^{k_{2}}L_{k',2}\\
&\dots
\end{aligned}$$

Thus, both strings $ux^{k_{1}}zy^{k_{1}}v$ and $ux^{k_{2}}zy^{k_{1}}v$ are valid. Given that $k_{1}\neq k_{2}$, this contradicts the assumption that $ux^{k}zy^{j}v$ is invalid whenever $k\neq j$.

Therefore, $x$ and $y$ must include unmatched symbols. Consider the type of the unmatched symbol in $x$. If $x$ includes a return symbol $b\text{›}$ whose matched $\text{‹}a$ is before $x$, then $ux^{2}zy^{2}v$ is invalid, because $u$ has no additional call symbol to match $b\text{›}$. Thus, $x$ includes a symbol $\text{‹}a$ whose matched symbol $b\text{›}$ is after $x$. If $b\text{›}$ is in $y$, we are done. Otherwise, $b\text{›}$ is either in $z$ or in $v$. Consider $ux^{2}zy^{2}v$. Since the string is valid, the unmatched $\text{‹}a$ in $x^{2}$ must match a return symbol in $y^{2}$.

In conclusion, $x$ contains an oracle call symbol, which matches a return symbol in $y$. ∎
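The premise of the lemma can be read as a membership-query test. The following is a minimal sketch of that test, not the paper's implementation: the function name `is_nesting_pattern`, the toy oracle `is_balanced`, and the choice to start the exponents at 1 are all assumptions made for illustration.

```python
def is_balanced(s: str) -> bool:
    """Toy oracle (hypothetical): balanced parentheses with plain symbol 'a'."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch != "a":
            return False
    return depth == 0


def is_nesting_pattern(oracle, u, x, z, y, v, bound):
    """Check the lemma's premise for all exponents 1 <= k, j <= bound.

    Returns True when u x^k z y^k v is accepted for every k, and
    u x^k z y^j v is rejected for every k != j.
    """
    if not x or not y:
        return False  # the lemma requires x and y to be nonempty
    for k in range(1, bound + 1):
        if not oracle(u + x * k + z + y * k + v):
            return False
        for j in range(1, bound + 1):
            if j != k and oracle(u + x * k + z + y * j + v):
                return False
    return True


# x = "(" and y = ")" pump together, so the premise holds.
print(is_nesting_pattern(is_balanced, "", "(", "a", ")", "", 4))
# x = "a" and y = "a" are plain symbols and pump independently.
print(is_nesting_pattern(is_balanced, "(", "a", "", "a", ")", 4))
```

In the first call the candidate pair carries a matched call and return symbol, which is the situation the lemma describes; in the second call the unbalanced exponents are still accepted, so the candidate is invalidated.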

Theorem B.4 (Termination and Correctness of Algorithm 3).

Let $m$ be the number of states of the minimal $k$-SEVPA for the oracle VPL. There exists a number $K\leq((m^{2}+2m)^{2}+1)^{2}$, with which function $\mathbf{tagInfer}(\mathcal{O},S)$ returns a tagging that is compatible with a finite set of seed strings $S$.

Proof.

First, we show that at least the oracle tagging can be found, which must be compatible with the pattern. This is because, by Lemma B.3, when $K>(|V|^{2}+1)^{2}$, each candidate nesting pattern $(x,y)$ must either be invalidated or contain an oracle call-return pair whose call and return symbols are unmatched in $x$ and $y$, respectively.

Therefore, since any VPG for the oracle VPL can be used for the checking, we pick the specific VPG converted from the minimal $k$-SEVPA by the method discussed in Alur and Madhusudan (2004), Theorem 5.3 (Visibly pushdown grammars), where $|V|$ is bounded by $m^{2}+2m$. ∎
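To give a sense of how quickly the bound in Theorem B.4 grows, the arithmetic can be sketched as below. The function names are hypothetical; the formulas are the ones stated above, with $|V|\leq m^{2}+2m$ substituted into $(|V|^{2}+1)^{2}$.

```python
def vpg_nonterminal_bound(m: int) -> int:
    """Upper bound on |V| for the VPG converted from an m-state k-SEVPA."""
    return m * m + 2 * m


def pumping_bound(m: int) -> int:
    """Upper bound ((m^2 + 2m)^2 + 1)^2 on K from Theorem B.4."""
    v = vpg_nonterminal_bound(m)
    return (v * v + 1) ** 2


for m in (1, 2, 3):
    print(m, vpg_nonterminal_bound(m), pumping_bound(m))
# 1 3 100
# 2 8 4225
# 3 15 51076
```

The bound is a worst-case guarantee for termination; the sketch says nothing about the values of $K$ that suffice in practice.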

For Lemmas B.5 and B.6, we first introduce another congruence relation by Alur et al. (2005). Two well-matched strings, $s_{1}$ and $s_{2}$, are deemed congruent, denoted as $s_{1}\sim s_{2}$, if their contexts coincide. Specifically,

$$\forall u,v\in\Sigma^{*},\quad us_{1}v\in\mathcal{L}\iff us_{2}v\in\mathcal{L}.$$

This congruence is an equivalence relation, and $\mathcal{L}$ is a VPL on $\hat{\Sigma}$ if and only if the congruence relation admits a finite number of equivalence classes.
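The congruence quantifies over all contexts, so it cannot be decided by finitely many membership queries in general. As an illustration only, the sketch below searches for a distinguishing context among contexts of bounded length; the names `distinguishing_context` and `toy_oracle`, the bound `max_len`, and the toy language $\{\,(^{n}\,a\,)^{n}\,\}$ are assumptions, not part of the paper.

```python
from itertools import product


def toy_oracle(s: str) -> bool:
    """Toy language (hypothetical): n opening parens, one 'a', n closing parens."""
    if len(s) % 2 == 0:
        return False
    half = len(s) // 2
    return s == "(" * half + "a" + ")" * half


def bounded_contexts(alphabet, max_len):
    """All pairs (u, v) over `alphabet` with len(u), len(v) <= max_len."""
    words = [""]
    for n in range(1, max_len + 1):
        words += ["".join(p) for p in product(alphabet, repeat=n)]
    return [(u, v) for u in words for v in words]


def distinguishing_context(oracle, s1, s2, alphabet, max_len):
    """Return a context (u, v) on which s1 and s2 disagree, or None.

    None only means that no short context separates the two strings;
    it does not prove that they are congruent.
    """
    for u, v in bounded_contexts(alphabet, max_len):
        if oracle(u + s1 + v) != oracle(u + s2 + v):
            return (u, v)
    return None


print(distinguishing_context(toy_oracle, "a", "(a)", "()a", 2))  # None
print(distinguishing_context(toy_oracle, "a", "", "()a", 2))     # ('', '')
```

In the toy language, `a` and `(a)` are accepted in exactly the same contexts, whereas `a` and the empty string are already separated by the empty context.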

Given a tagging $t$, denote the congruence relation over $\hat{\mathcal{L}}_{t}$ as $\sim_{t}$.

Given a compatible tagging function $t$ and a string $s$, the following Lemma B.5 shows that, if $t(s)$ is well-matched, then $t_{\mathcal{O}}(s)$ has a bounded number of unmatched symbols.
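To make "unmatched symbols under a tagging" concrete, here is a small sketch under a simplifying assumption: a tagging is modeled as a set of single-character call symbols and a set of single-character return symbols. This representation and the name `count_unmatched` are hypothetical and chosen only for illustration.

```python
def count_unmatched(s, calls, returns):
    """Count unmatched call and return symbols of `s` under a tagging.

    `calls` and `returns` are sets of characters tagged as call and
    return symbols; every other character is treated as plain.
    Returns the pair (unmatched_calls, unmatched_returns).
    """
    pending_calls = 0
    pending_returns = 0
    for ch in s:
        if ch in calls:
            pending_calls += 1
        elif ch in returns:
            if pending_calls > 0:
                pending_calls -= 1  # matches the nearest pending call
            else:
                pending_returns += 1  # no earlier call left to match
    return pending_calls, pending_returns


s = "[<]"
# Under a tagging that pairs '[' with ']', s is well-matched.
print(count_unmatched(s, {"["}, {"]"}))  # (0, 0)
# Under a tagging that pairs '<' with '>', s has one unmatched call.
print(count_unmatched(s, {"<"}, {">"}))  # (1, 0)
```

The same string can thus be well-matched under one tagging while carrying unmatched symbols under another, which is the quantity the lemma bounds.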

Lemma B.5.

Given oracle language $\mathcal{L}$ and oracle tagging $t_{\mathcal{O}}$, for each compatible tagging $t$, there exists a positive upper bound, denoted as $N_{t}$, such that for each string $s\in\Sigma^{*}$, if $s$ is $t$-well-matched and there exist context strings $(w,w')$ such that $wsw'\in\mathcal{L}$, then $t_{\mathcal{O}}(s)$ contains at most $N_{t}$ unmatched oracle call and return symbols.

Proof.

In this proof, to simplify the notation, we use “$s$” or “$w$” to also mean strings tagged by the oracle tagging function $t_{\mathcal{O}}$. For strings tagged by a compatible tagging function $t$, we use “$t(s)$” and “$t(w)$” explicitly.

To simplify the problem, let us assume there is an oracle VPG that includes only one matching rule; we denote the matching rule as $L\to\text{‹}a\,A\,b\text{›}\,B$. As an overview, we show that for a string $s$ that contains $K$ unmatched oracle call symbols $\text{‹}a$, we can construct $K$ equivalence classes for the oracle congruence relation. Therefore, $N_{t}$ is bounded by the number of oracle equivalence classes. The case for multiple matching rules can be similarly proved, and we omit it for brevity.

For a $t$-well-matched string $s$ with $wsw'\in\mathcal{L}$, if $s$ contains no unmatched oracle symbols, then we are done. Otherwise, assume $s$ contains no return symbols and $K$ unmatched oracle call symbols $\text{‹}a$ (the other cases are similar and we omit them for brevity). We can rewrite $wsw'$ to reflect the derivation of these oracle call and return symbols as follows:

$$wsw'=w+q_{0}(\text{‹}a\,s_{3}^{(1)})\dots(\text{‹}a\,s_{3}^{(K)})\,s_{L}^{(1)}+s_{L}^{(2)}\,(s_{4}^{(K)}\,b\text{›}\,s_{B}^{(K)})\dots(s_{4}^{(1)}\,b\text{›}\,s_{B}^{(1)})+w''$$

where $w$, $q_{0}$, $s_{2}^{(i)}$, $s_{3}^{(i)}$, $s_{L}^{(1)}$, $s_{L}^{(2)}$, $s_{4}^{(i)}$, $s_{B}^{(i)}$ and $w''$ for $i\in[1..K]$ are strings, and

$$\begin{aligned}
s &= q_{0}(\text{‹}a\,s_{3}^{(1)})\dots(\text{‹}a\,s_{3}^{(K)})\,s_{L}^{(1)},\\
w' &= s_{L}^{(2)}\,(s_{4}^{(K)}\,b\text{›}\,s_{B}^{(K)})\dots(s_{4}^{(1)}\,b\text{›}\,s_{B}^{(1)})+w''
\end{aligned}$$

and the string $q_{0}(\text{‹}a\,s_{3}^{(1)})\dots(\text{‹}a\,s_{3}^{(K)})\,s_{L}^{(1)}+s_{L}^{(2)}\,(s_{4}^{(K)}\,b\text{›}\,s_{B}^{(K)})\dots(s_{4}^{(1)}\,b\text{›}\,s_{B}^{(1)})$ is derived by:

$$\begin{aligned}
L\to\text{‹}a\,A\,b\text{›}\,B &\to(\text{‹}a\,s_{3}^{(1)})\,L\,(s_{4}^{(1)}\,b\text{›}\,s_{B}^{(1)})\\
&\to(\text{‹}a\,s_{3}^{(1)})\,\text{‹}a\,A\,b\text{›}\,B\,(s_{4}^{(1)}\,b\text{›}\,s_{B}^{(1)})\\
&\to(\text{‹}a\,s_{3}^{(1)})(\text{‹}a\,s_{3}^{(2)})\,L\,(s_{4}^{(2)}\,b\text{›}\,s_{B}^{(2)})(s_{4}^{(1)}\,b\text{›}\,s_{B}^{(1)})\\
&\to(\text{‹}a\,s_{3}^{(1)})\dots(\text{‹}a\,s_{3}^{(K)})\,L\,(s_{4}^{(K)}\,b\text{›}\,s_{B}^{(K)})\dots(s_{4}^{(1)}\,b\text{›}\,s_{B}^{(1)})\\
&\to(\text{‹}a\,s_{3}^{(1)})\dots(\text{‹}a\,s_{3}^{(K)})\,s_{L}\,(s_{4}^{(K)}\,b\text{›}\,s_{B}^{(K)})\dots(s_{4}^{(1)}\,b\text{›}\,s_{B}^{(1)})
\end{aligned}$$

where $s_{L}=s_{L}^{(1)}s_{L}^{(2)}$.

Let $(x_{i},y_{i})=(\text{‹}a\,s_{3}^{(i)},\;s_{4}^{(i)}\,b\text{›}\,s_{B}^{(i)})$, $i\in[1..K]$. Notice that $(x_{i},y_{i})$ are $K$ disjoint nesting patterns. Since those $(x_{i},y_{i})$ are exchangeable, we denote each of them as $(x,y)$ when their indices do not matter. With this notation, we can simplify the above formulae as

$$\begin{aligned}
wsw' &= w+q_{0}x^{K}s_{L}^{(1)}+s_{L}^{(2)}y^{K}w'',\text{ where} && (3)\\
s &= q_{0}x^{K}s_{L}^{(1)} && (4)\\
w' &= s_{L}^{(2)}y^{K}w'' && (5)
\end{aligned}$$

Since $t$ is compatible with $\mathcal{L}$, for $i\in[1..K]$, by definition, each pattern $(x_{i},y_{i})$ contains a call-return pair $(c_{i},d_{i})$ of the compatible tagging $t$, where $c_{i}$ and $d_{i}$ are unmatched in $x_{i}$ and $y_{i}$, respectively. Without loss of generality, let us assume these $(c_{i},d_{i})$ are the same, denoted as $(c,d)$.

Now consider Equations (3)-(5). For $i\in[1..K]$, each $x_{i}$ contains a symbol $c_{i}$ whose matched symbol, denoted as $d_{i}'$, is after $x_{i}$; similarly, each $y_{i}$ contains a symbol $d_{i}$ whose matched symbol $c_{i}'$ is before $y_{i}$. Since $s$ is $t$-well-matched, for $i\in[1..K]$, each symbol $d_{i}'$ can only be located in $s$, and each symbol $c_{i}'$ cannot be located in $s$.

Now, for $j\in[1..K]$, we construct $t_{\mathcal{O}}$-well-matched strings $s_{j}$ and their contexts $(\widehat{w}_{j},\widehat{w}_{j}')$ as follows:

$$\begin{aligned}
s_{j} &= x^{K-j}s_{L}^{(1)}+s_{L}^{(2)}y^{K-j}\\
\widehat{w}_{j} &= wq_{0}x^{j}\\
\widehat{w}_{j}' &= y^{j}w''
\end{aligned}$$

Now, we prove that each $s_j$ represents a different equivalence class. First, it is obvious that $s_j$ is $t_{\mathcal{O}}$-well-matched. Then, we show that $\widehat{w}_is_j\widehat{w}_i'$ is invalid for $i\neq j$. Let us expand $\widehat{w}_is_j\widehat{w}_i'$ as

$$
\widehat{w}_is_j\widehat{w}_i' = wq_0x^{K+i-j}s_L^{(1)}+s_L^{(2)}y^{K+i-j}w''.
$$

Denote the number of unmatched $d$ and unmatched $c$ in string $x$ as $n_d(x)$ and $n_c(x)$, respectively. There are three cases.

If $n_d(x)>n_c(x)$, then, when $i>j$, we have $n_d(wq_0x^{K+i-j})>n_d(wq_0x^K)=0$. Thus a prefix of $\widehat{w}_is_j\widehat{w}_i'$ contains a pending return symbol, and therefore the string is invalid.

If $n_d(x)<n_c(x)$, then, when $i<j$, we have

$$
n_c(\widehat{w}_i)=n_c(wq_0x^i)<n_c(wq_0x^j)=n_c(\widehat{w}_j). \tag{6}
$$

On the other hand, since $s$ is $t$-well-matched, we have

$$
n_c(wq_0x^j)=n_d(x^{K-j}s_L^{(1)}). \tag{7}
$$

Therefore, based on Equations (6)-(7), we have

$$
n_d(\widehat{w}_ix^{K-j}s_L^{(1)})=n_d(x^{K-j}s_L^{(1)})-n_c(\widehat{w}_i)=n_c(wq_0x^j)-n_c(\widehat{w}_i)>0.
$$

Similar to the first case, this shows $\widehat{w}_is_j\widehat{w}_i'$ is invalid.
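The counts $n_d$ and $n_c$ used in the two cases above can be computed with a single left-to-right scan. The following Python sketch is illustrative only: it assumes single-character symbols, with `(` standing in for the call symbol $c$ and `)` for the return symbol $d$, and the name `unmatched_counts` is hypothetical rather than part of V-Star.

```python
def unmatched_counts(s, call="(", ret=")"):
    """Return (n_d, n_c): the pending returns and pending calls in s."""
    pending_returns = 0  # returns with no earlier call to match
    open_calls = 0       # calls still waiting for a return
    for ch in s:
        if ch == call:
            open_calls += 1
        elif ch == ret:
            if open_calls > 0:
                open_calls -= 1       # matches the most recent open call
            else:
                pending_returns += 1  # nothing to match: a pending return
    return pending_returns, open_calls


# A block x with n_d(x) = 2 > n_c(x) = 1 (an assumed toy value):
# every extra copy of x adds one more pending return.
x = "))("
counts = [unmatched_counts(x * k) for k in (1, 2, 3)]
```

In this toy instance the number of pending returns grows with each repetition of $x$, which is the effect the first case relies on.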

In the last case, $n_d(x)=n_c(x)$. We show that this is impossible. We first assume $n_d(x)=n_c(x)=1$, and discuss the other case at the end. First, we rewrite each $x$ in $x^K$ as

$$
x=w_1dw_2cw_3,
$$

where $w_i$, $i=1,2,3$, does not contain unmatched $c$ nor $d$. Therefore, for two adjacent $x$, we have

$$
\begin{aligned}
&(w_1dw_2cw_3)(w_1dw_2cw_3)\\
&\quad= w_1dw_2(cw_3w_1dw_2)cw_3
\end{aligned}
$$

In string $cw_3w_1dw_2$, notice that $c$ is matched with $d$; therefore, string $w_3w_1$ is $t$-well-matched. Since $(x,y)$ is a nesting pattern of the oracle tagging function, for any $k>0$, we can rewrite $x^{k+1}$ as

$$
x^{k+1}=w_1dw_2(cw_3w_1dw_2)^kcw_3.
$$

With the corresponding $y^k$, we have a new nesting pattern

$$
(cw_3w_1dw_2,\,y).
$$
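The regrouping of $x^{k+1}$ is a pure string identity, so it can be sanity-checked mechanically. In the Python sketch below, the concrete values chosen for $w_1$, $w_2$, $w_3$, $c$, and $d$ are arbitrary placeholders, and `regrouped` is a hypothetical helper name.

```python
def regrouped(w1, d, w2, c, w3, k):
    """Build w1 d w2 (c w3 w1 d w2)^k c w3."""
    return w1 + d + w2 + (c + w3 + w1 + d + w2) * k + c + w3


# Placeholder values (assumptions for illustration only).
w1, d, w2, c, w3 = "p", ")", "q", "(", "r"
x = w1 + d + w2 + c + w3  # x = w1 d w2 c w3

# x^(k+1) agrees with the regrouped form for several values of k.
identity_holds = all(
    x * (k + 1) == regrouped(w1, d, w2, c, w3, k) for k in range(6)
)
```

The check only confirms the rebracketing; whether the inner block is the first part of a nesting pattern still depends on the language, as argued in the text.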

Since tagging $t$ is compatible and $cw_3w_1d$ is $t$-well-matched, $w_2$ contains an unmatched call symbol of tagging $t$, denoted as $g$, whose matched return symbol of tagging $t$, denoted as $h$, is after $w_2$.

Now we have come back to a similar situation of comparing $n_g(x)$ and $n_h(x)$. With a similar analysis, we can show that if $n_g(x)\neq n_h(x)$, then we can construct $K$ equivalence classes in $\sim_{t_{\mathcal{O}}}$. Therefore, we again must have $n_g(x)=n_h(x)$.

Then, we can rewrite $x$ by expanding $w_2$ as

$$
x=w_1dw_2cw_3=w_1d(w_1'hw_2'gw_3')cw_3.
$$

And, similarly, rewrite two adjacent $x$ as

$$
\begin{aligned}
xx &= (w_1d(w_1'hw_2'gw_3')cw_3)(w_1d(w_1'hw_2'gw_3')cw_3)\\
&= w_1dw_1'hw_2'(gw_3'cw_3w_1dw_1'hw_2')gw_3'cw_3
\end{aligned}
$$

Again, string $gw_3'cw_3w_1dw_1'hw_2'$ forms the first part of a nesting pattern, and thus must contain another unmatched call symbol in tagging $t$.

However, notice that $gw_3'cw_3w_1dw_1'h$ is $t$-well-matched. Therefore, the new unmatched symbol must appear in $w_2'$, which is strictly shorter than $w_2$. Repeating this argument, we can find substrings $w_2'$, $w_2''$, …, $w_2^{(i)}$, … with strictly decreasing lengths, each of which must contain an unmatched call symbol in $t$. However, $w_2^{(|w_2|)}$ must be empty and thus contains no symbol, which makes $t$ incompatible with $\mathcal{L}$, a contradiction.

The above covers the case where the numbers of unmatched $c$ and $d$ in $x$ are both $1$. When the numbers are greater than $1$ (recall that the two numbers must be the same), we rewrite $x$ as

$$
w_1dw_2cw_3,
$$

where $d$ is the last unmatched $d$, and $c$ is the first unmatched $c$. Expand $x^2$ again, and we can observe that string $w_3w_1$ is still $t$-well-matched. The rest of the reasoning is the same as above.

In conclusion, for each $K$ and $t$-well-matched string $s$ with $wsw'\in\mathcal{L}$, the number of unmatched oracle call symbols is at most the number of equivalence classes of $\sim_{t_{\mathcal{O}}}$. Given that $\mathcal{L}$ is a VPL under $t_{\mathcal{O}}$, the relation $\sim_{t_{\mathcal{O}}}$ has finitely many equivalence classes, so the number of unmatched oracle symbols in any such $s$ has an upper bound. ∎

Theorem B.6.

Given oracle language $\mathcal{L}$ and oracle tagging $t_{\mathcal{O}}$, if language $\{t_{\mathcal{O}}(s)\mid s\in\mathcal{L}\}$ is a VPL, then for any tagging $t$ compatible with $\mathcal{L}$, language $\hat{\mathcal{L}}_t=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL.

Proof.

By Lemma B.5, there exists a positive number $N_t$, such that any $t$-well-matched string contains at most $N_t$ unmatched oracle symbols.

For a given $t$-well-matched string $p$, without loss of generality, let us assume there are only $K<N_t$ unmatched oracle call symbols in $p$. We prove the theorem by showing that $t(p)$ is equivalent to a string whose length is within a fixed bound, denoted as $N$.

First, we can partition $t_{\mathcal{O}}(p)$ into $t_{\mathcal{O}}(p)=\hat{p}_1\,{\text{‹}a}_1\,\hat{p}_2\,{\text{‹}a}_2\dots\hat{p}_K\,{\text{‹}a}_K\,\hat{p}_{K+1}$, where each ${\text{‹}a}_i$ is an unmatched call symbol and each $\hat{p}_i$ is well-matched under $t_{\mathcal{O}}$. Let us denote the length of the longest representative under $t_{\mathcal{O}}$ as $l$. Writing $p_i$ for $\hat{p}_i$ with the tagging removed, and $[p_i]_{\mathcal{O}}$ for $p_i$'s representative in the oracle congruence relation, we have

$$
\forall w,w',\ wp_iw'\in\mathcal{L}\iff w[p_i]_{\mathcal{O}}w'\in\mathcal{L}.
$$
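The congruence above quantifies over all contexts, so a finite set of contexts can refute it but never establish it. The Python sketch below illustrates this with a toy membership oracle (balanced parentheses). That oracle is an assumption for illustration and is not the oracle language of the paper, and `agree_on_contexts` is a hypothetical helper.

```python
def is_balanced(s):
    """Toy membership oracle: the parentheses in s are balanced."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def agree_on_contexts(p, r, contexts, member):
    """True if w p w' and w r w' have equal membership for each (w, w')."""
    return all(member(w + p + v) == member(w + r + v) for w, v in contexts)


contexts = [("", ""), ("(", ")"), ("", ")"), ("((", ")")]
same = agree_on_contexts("(a)(b)", "", contexts, is_balanced)  # not refuted
different = agree_on_contexts("(", "", contexts, is_balanced)  # refuted
```

Here the well-matched string `(a)(b)` and the empty string agree on every sampled context, while `(` is separated from the empty string by the empty context alone.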

Now, we construct a shorter representative for $t(p)$, by replacing $p_i$ with $[p_i]_{\mathcal{O}}$. Formally, for all $w_1$ and $w_2$,

$$
\begin{aligned}
w_1pw_2 &= w_1(p_1a_1p_2a_2\dots p_Ka_Kp_{K+1})w_2\\
&= (w_1p_1a_1\dots a_{i-1})\,p_i\,(a_i\dots p_Ka_Kp_{K+1}w_2)\in\mathcal{L}\\
&\iff (w_1p_1a_1\dots a_{i-1})\,[p_i]_{\mathcal{O}}\,(a_i\dots p_Ka_Kp_{K+1}w_2)\in\mathcal{L}.
\end{aligned}
$$

Consequently, the length of any representative under tagging $t$ is limited by $K+(K+1)l\leq N_t+(N_t+1)l$, accounting for $K$ unmatched characters and $K+1$ substrings each with a length not exceeding $l$.
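The length accounting can be made concrete. The Python sketch below splits a string at its unmatched call symbols and evaluates the bound $K+(K+1)l$. It assumes single-character symbols with `(` as the call and `)` as the return, assumes there are no pending returns, and uses hypothetical function names.

```python
def split_at_unmatched_calls(s, call="(", ret=")"):
    """Split s into well-matched pieces p_1..p_{K+1} around unmatched calls."""
    stack = []  # positions of calls that are still open
    for i, ch in enumerate(s):
        if ch == call:
            stack.append(i)
        elif ch == ret:
            if not stack:
                raise ValueError("pending return: outside this sketch's scope")
            stack.pop()
    pieces, start = [], 0
    for pos in stack:  # what remains are the unmatched calls, left to right
        pieces.append(s[start:pos])
        start = pos + 1
    pieces.append(s[start:])
    return pieces


def representative_length_bound(k, l):
    """K unmatched symbols plus K + 1 pieces, each of length at most l."""
    return k + (k + 1) * l


pieces = split_at_unmatched_calls("a(b)c(d(e)f(g")
K = len(pieces) - 1
bound = representative_length_bound(K, l=5)
```

In this toy run every piece already has length at most 5, so the original string of length 13 fits within the bound of 17 without any replacement.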

In conclusion, language $\hat{\mathcal{L}}_t$ has a finite number of equivalence classes, and is therefore a VPL. ∎

Theorem B.7 (Finite and Sufficient Seed Strings).

For any given oracle language $\mathcal{L}$, there exists a finite set of seed strings, denoted as $S_0$, such that any tagging that is compatible with $S_0$ is also compatible with $\mathcal{L}$.

Proof.

The strategy is to first construct a set of seed strings that provides information about the oracle call and return symbols, then extend the set with more seed strings to exclude the incompatible tagging functions.

Initialize $S_0$ as an empty set. For each oracle call-return symbol pair $(\text{‹}a, b\text{›})$, pick a seed string $s$ that contains a nesting pattern $(x,y)$ where $\text{‹}a$ is unmatched in $x$, and $b\text{›}$ is unmatched in $y$. Incorporate $s$ into $S_0$. Note that if there is no such nesting pattern for $(\text{‹}a, b\text{›})$, then it is easy to show that $(\text{‹}a, b\text{›})$ are “redundant” in that they can be treated as plain symbols, which does not change the language with tagging removed.

Then, for each tagging $t$ that can be found in $S_0$, but is incompatible with a nesting pattern of a certain string $s$ in $\mathcal{L}$, include string $s$ in $S_0$. Given that the set of such tagging functions is finite, $S_0$ remains a finite set.

In conclusion, given such $S_0$, Algorithm 3 can at least find the oracle tagging (with redundant tagging removed), by iteratively selecting the oracle call-return pair for each nesting pattern. ∎
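The two-phase construction in this proof can be summarized as a short procedure. The Python sketch below is schematic: `witness_for`, `taggings_found_in`, and `refuting_string` are hypothetical callbacks standing in for choices that the proof makes non-constructively, and the toy data at the end are invented purely to exercise the control flow.

```python
def build_seed_set(pairs, witness_for, taggings_found_in, refuting_string):
    seeds = set()
    # Phase 1: one seed string per oracle call-return pair.
    for pair in pairs:
        s = witness_for(pair)
        if s is not None:  # None models a "redundant" pair with no nesting pattern
            seeds.add(s)
    # Phase 2: for each tagging found in the phase-1 seeds that some string
    # in the language refutes, add that refuting string.
    for t in taggings_found_in(frozenset(seeds)):
        s = refuting_string(t)
        if s is not None:
            seeds.add(s)
    return seeds


# Toy stand-ins (assumptions, not data from the paper).
witnesses = {("(", ")"): "(x)", ("[", "]"): "[x]", ("<", ">"): None}
refuters = {"t_bad1": "([x])", "t_bad2": "[(x)]", "t_oracle": None}

seeds = build_seed_set(
    pairs=witnesses,
    witness_for=witnesses.get,
    taggings_found_in=lambda current: sorted(refuters),
    refuting_string=refuters.get,
)
```

Both loops range over finite collections, which mirrors why $S_0$ stays finite. Adding seeds can only shrink the set of taggings compatible with $S_0$, so a single pass of the second phase is enough in this sketch.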

Appendix C Proofs of Theorems in Section 5

This section is organized as follows. Lemma C.1 shows that given a nesting pattern $uxzyv$, for sufficiently large $k$, string $ux^kzy^kv$ contains an unmatched oracle call token in $x^k$, and contains an unmatched oracle return token in $y^k$. Since such $k$ varies among strings, Lemma C.2 bounds $k$ by $k\leq 2$ with the help of Exclusivity. Moving on, Lemma C.3 shows that for oracle language $\mathcal{L}$ and oracle tokenizer $\tau_{\mathcal{O}}$, $\tilde{\mathcal{L}}_{\tau_{\mathcal{O}}}$ is a VPL over $\tilde{\Sigma}_{\mathcal{O}}$. Lemma C.4 shows that token-based matching rules lead to nesting patterns. Based on Lemma C.4, Lemma C.5 bounds the number of unmatched tokens between $\tau(s)$ and $\tau_{\mathcal{O}}(s)$. Following Lemmas C.3 and C.5, Theorem C.6 proves that a compatible tokenizer converts the oracle language into a VPL. We conclude this section with Lemma C.7, which shows that there exists a finite set of seed strings that allows V-Star to find a compatible tokenizer.

Lemma C.1 (Matched Tokens in Nesting Patterns, I).

Given oracle language $\mathcal{L}$, oracle tokenizer $\tau$, and a VPG $G=(\Sigma,V,P,L_0)$ for the oracle language, for each nesting pattern $s=uxzyv$, there exists an oracle matching rule in $P$, denoted as $L\to h_a A h_b B$, and there exists a positive number $k$, so that tokens $h_a$ and $h_b$ are included in $x^k$ and $y^k$, respectively.

Proof.

For each $k>0$, consider the tokenization of $ux^kzy^kv$, denoted as $l_k=\tau(ux^kzy^kv)$.

Let $l_k^x$ be the sequence of tokens in $l_k$ that overlap $x^k$, i.e., each token in $l_k^x$ overlaps $x^k$.

We say a token covers a string if the string is a substring of the string captured by the token. For example, if token $h$ in $l_k$ captures string $ux$, then $h$ covers $x^1$.

We show that there exists an upper bound, denoted as $N_x$, such that for each $k$ and each token $h\in l_k^x$, if $h$ covers $x^i$, then $i\leq N_x$. Otherwise, notice that there are only a finite number of tokens; therefore, there exists a token $h$ such that $h$ starts at a character in $ux^k$, and can end at a character in either infinitely many locations in $x^k$ for $k>0$, or infinitely many locations in $y^k$ for $k>0$ (could be both). In the first case, one can see that there exist two strings $s_1$, $s_2$ of $h$, such that

\begin{align*}
s_1 &= vx^ix_1\\
s_2 &= vx^jx_1
\end{align*}

where $v$ is a suffix of either $u$ or $x$, $i$ and $j$ are two numbers such that $i\neq j$, and $x_1$ is a prefix of $x$. Apparently, $s_1$ and $s_2$ can be exchanged, which violates that $(x,y)$ is a nesting pattern. The second case could be similarly invalidated.

Similarly, we can prove another upper bound $N_y$ for $y$. Let $N$ be $\max(N_x,N_y)$.

With upper bound $N$, we can consider a new nesting pattern $u'xz'yv'$, where $u'=ux^N$, $z'=x^Nzy^N$, and $v'=y^Nv$:

$$(ux^N)\,x^k\,(x^Nzy^N)\,y^k\,(y^Nv).$$

By doing this, we exclude the first and last tokens in $l_k$ that only partially overlap with $x^k$. From now on, we assume $l_k$ is contained in $x^k$ for each $k$.
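As an informal sanity check on this regrouping, the following Python sketch verifies that pumping the shifted pattern $k$ times produces the same string as pumping the original pattern $k+2N$ times. The helper names `pump` and `shifted_pattern` and the sample strings are illustrative assumptions, not part of V-Star.

```python
def pump(u: str, x: str, z: str, y: str, v: str, k: int) -> str:
    """Return u x^k z y^k v for a candidate nesting pattern (x, y)."""
    return u + x * k + z + y * k + v


def shifted_pattern(u: str, x: str, z: str, y: str, v: str, n: int):
    """Return (u', z', v') with u' = u x^n, z' = x^n z y^n, v' = y^n v."""
    return u + x * n, x * n + z + y * n, y * n + v


# Hypothetical sample values; any strings would do for this identity.
u, x, z, y, v = "{", "[", "1", "]", "}"
N = 3
u2, z2, v2 = shifted_pattern(u, x, z, y, v, N)

# Pumping the shifted pattern k times equals pumping the original k + 2N times.
for k in range(1, 5):
    assert pump(u2, x, z2, y, v2, k) == pump(u, x, z, y, v, k + 2 * N)
```

The check only concerns string concatenation; it says nothing about how a tokenizer would split the result, which is what the proof goes on to analyze.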

For each positive number $K$, define language $\mathcal{L}_K$ as $\{w\in l_k\mid k>K\}$, and token-based language $\mathcal{L}_K'$ as $\{l_k\mid k>K\}$. Apparently, $\mathcal{L}_K$ is not a regular language. Based on this, we can show that there exists a token list $l\in\mathcal{L}_K'$ that contains unmatched call or return tokens. Otherwise, each sequence $l\in\mathcal{L}_K'$ can only contain plain tokens or well-matched tokens. We then only need to show that the depth of the nested well-matched tokens is bounded for all $k$; in that case the language $\mathcal{L}_K$ would be a regular language, a contradiction.

To show that the depth is bounded, notice that otherwise, we would have a nesting pattern in $x^k$ for certain $k$. Denote the pattern as $(g,h)$. Nesting pattern $(g,h)$, by definition, can be replaced by $(g^i,h^i)$ for any $i>0$ in $x^k$. This replacement extends $x^k$ to $x^{k'}$ for certain $k'>k$. However, since $(x,y)$ is a nesting pattern, $ux^{k'}zy^kv$ should be invalid, a contradiction.

Therefore, we have shown that for each $K$, there exists $k$ and $l_k$, such that $k>K$ and $l_k$ contains a non-plain token. In other words, when $k$ goes to infinity, $x^k$ contains an infinite number of unmatched call or return tokens. However, the number of unmatched return tokens is bounded in $x^k$ for $k>0$; otherwise, since $u$ is fixed, not enough call tokens can match those return tokens. Therefore, the number of unmatched call tokens is unbounded in $x^k$ for $k>0$. This means for a sufficiently large $k$, a call token in $x^k$ must be matched with a return token in $y^k$. We thus have proven the lemma. ∎

Lemma C.2 (Matched Tokens in Nesting Patterns, II).

Based on Lemma C.1, assume Exclusivity. For any nesting pattern $uxzyv$, there exists a matching rule, denoted as $L\to h_aAh_bB$, such that $s_a\in h_a$ is a substring of $x^2$, and $s_b\in h_b$ is a substring of $y^2$.

Proof.

From Lemma C.1, we know that a token $h_a$ is in $x^k$ for certain $k$. We can therefore let $x=x_1x_2=x_1'x_2'$, so that $s_a=x_2x^ix_1'=x_2(x_1x_2)^ix_1'$ for certain $i$. Consider $x_2$; there are two cases.

If $x_2$ is empty, then $s_a=x_1^ix_1'=x^ix_1'=(x_1'x_2')^ix_1'$. Notice that we must have $i\leq 1$, otherwise either $x_1'$ or $x_2'$ becomes both a prefix and an infix of $s_a$, which violates Exclusivity. Therefore, $s_a=x_1'$ or $x_1'x_2'x_1'$. Apparently, $s_a$ is a substring of $x^2$.

If $x_2$ is nonempty, $i$ must be zero, otherwise $x_2$ is both a prefix and an infix, a violation of Exclusivity. In this case, we have $s_a=x_2x_1'$, also a substring of $x^2$.

In conclusion, we know that $s_a$ is a substring of $x^2$. The reasoning is quite similar for the return token, and we omit it for brevity. ∎

Lemma C.3.

For oracle language $\mathcal{L}$ and oracle tokenizer $\tau_{\mathcal{O}}$, $\tilde{\mathcal{L}}_{\tau_{\mathcal{O}}}$ is a VPL over $\tilde{\Sigma}_{\mathcal{O}}$.

Proof.

We prove this by building a VPA for language $\tilde{\mathcal{L}}_{\tau_{\mathcal{O}}}$.

First, since the language $\{\tau_{\mathcal{O}}(s)\mid s\in\mathcal{L}\}$ is a VPL over $T_{\tau_{\mathcal{O}}}$, we have a VPA for it, denoted as $\mathcal{H}_{\mathcal{O}}$.

Then, since each token $\tau$ is a regular language, denote its finite state automaton as $\mathcal{H}_t$.

Now we build a VPA $\tilde{\mathcal{H}}$ for language $\tilde{\mathcal{L}}_{\tau_{\mathcal{O}}}$ by replacing each token in $\mathcal{H}_{\mathcal{O}}$ with its FSA $\mathcal{H}_t$.

First, the set of states is the union of states in $\mathcal{H}_{\mathcal{O}}$ and $\mathcal{H}_t$ for $\tau\in T_{\tau_{\mathcal{O}}}$.

Then, the transitions are defined as follows.

  1. We retain transitions $p\xrightarrow{i}p'$ in each FSA $\mathcal{H}_{h_c}$ for plain token $h_c$.

  2. For transition $q\xrightarrow{h_c}q'$, we add transition $(q,\epsilon)\to S_{h_c,0}$ for the start state $S_{h_c,0}$ and transitions $(E_{h_c},\epsilon)\to q'$ for each acceptance state $E_{h_c}$ in $\mathcal{H}_{h_c}$.

  3. For transition $q\xrightarrow{h_a,\text{ push }}q'$, we add transition $q\xrightarrow{\text{\guilsinglleft}a,\text{ push }(q,\text{\guilsinglleft}a)}S_{h_a,0}$, where $\text{\guilsinglleft}a$ is the call token corresponding to $h_a$, and we add transitions $(E_{h_a},\epsilon)\to q'$ for each acceptance state $E_{h_a}$ in $\mathcal{H}_{h_a}$.

  4. For transition $q\xrightarrow{h_b,\text{ pop }(q'',h_a')}q'$, we add transition $q\xrightarrow{b\text{\guilsinglright},\text{ pop }(q'',\text{\guilsinglleft}a')}q'$, where $\text{\guilsinglleft}a'$ is the call token corresponding to $h_a'$, and we add transitions $(E_{h_b},\epsilon)\to q'$ for each acceptance state $E_{h_b}$ in $\mathcal{H}_{h_b}$.

Apparently, a string $s\in\mathcal{L}$ must be accepted by $\tilde{\mathcal{H}}$. On the other hand, if a string $s$ is accepted by $\tilde{\mathcal{H}}$, then it leads to a valid token sequence $l$ and $s\in l$; therefore, $s$ is also a valid string in $\mathcal{L}$.

In conclusion, $\tilde{\mathcal{H}}$ is a VPA that accepts $\tilde{\mathcal{L}}_{\tau_{\mathcal{O}}}$, which means language $\tilde{\mathcal{L}}_{\tau_{\mathcal{O}}}$ is a VPL. ∎
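To make the target of this construction more concrete, the following Python sketch simulates a small deterministic VPA that reads one character at a time: call characters push a stack symbol, return characters pop one, and plain characters leave the stack untouched. The class `VPA`, its dictionary encoding of transitions, and the parentheses example are assumptions made for illustration; the sketch does not perform the token-expansion construction from the proof, where each multi-character token would contribute the states and plain transitions of its own FSA.

```python
class VPA:
    """A deterministic visibly pushdown automaton over single characters.

    Assumes the call, return, and plain alphabets are pairwise disjoint.
    """

    def __init__(self, start, accepting, plain, call, ret):
        self.start = start          # initial state
        self.accepting = accepting  # set of accepting states
        self.plain = plain          # (state, char) -> state
        self.call = call            # (state, char) -> (state, stack symbol)
        self.ret = ret              # (state, char, stack symbol) -> state

    def accepts(self, s: str) -> bool:
        state, stack = self.start, []
        for ch in s:
            if (state, ch) in self.call:
                # Call symbol: move to the next state and push.
                state, sym = self.call[(state, ch)]
                stack.append(sym)
            elif stack and (state, ch, stack[-1]) in self.ret:
                # Return symbol: pop the matching stack symbol.
                state = self.ret[(state, ch, stack.pop())]
            elif (state, ch) in self.plain:
                # Plain symbol: the stack is untouched.
                state = self.plain[(state, ch)]
            else:
                return False
        # Accept only well-matched inputs ending in an accepting state.
        return state in self.accepting and not stack


# Hypothetical example: well-matched parentheses around plain 'a' characters.
vpa = VPA(
    start="q0",
    accepting={"q0"},
    plain={("q0", "a"): "q0"},
    call={("q0", "("): ("q0", "P")},
    ret={("q0", ")", "P"): "q0"},
)

print(vpa.accepts("(a(a))"))  # well-matched input
print(vpa.accepts("(a"))      # pending call left on the stack
```

Requiring an empty stack at the end is a design choice of this sketch that restricts it to well-matched inputs.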

Lemma C.4 (Call and Return Tokens in Nesting Patterns).

For any oracle VPL $\mathcal{L}$, for each string $s\in\mathcal{L}$, if $s$ is derived by repeatedly applying a matching rule which exposes a recursion, i.e.,

$$L_0\to^{*}pLq\to p(h_aAh_bB)q\to^{*}p(s_auLvs_bB)q\to^{*}ps_aus_Lvs_bs_Bq=s$$

where $s_a\in h_a$ and $s_b\in h_b$, and $s_L,s_B\in\mathcal{L}$ are strings derived from nonterminals $L$ and $B$, respectively, and $u,v\in\mathcal{L}$. Then there exists a nesting pattern $(x,y)$ for $s$ where $s_a$ is a prefix of $x$ and $s_b$ is a substring of $y$.

Proof.

Consider the iterative application of the derivation $pLq\to^{*}p(s_auLvs_bB)q$ to $L$. This leads to the deduction $pLq\to^{*}p(s_au)^kL(vs_bs_B)^kq$. To show that the pair $(s_au,vs_bs_B)$ represents a nesting pattern, we only need to prove that $ux^kzy^jv$ is invalid when $k\neq j$ (both $\geq 0$).
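The condition being proved can be observed experimentally on a concrete language. The sketch below is a hypothetical instance, not taken from the paper: Python's `json.loads` plays the role of the black-box membership oracle, the recursive rule is a JSON array with $s_a$ = `[`, $s_b$ = `]`, and $u$, $v$, $s_B$, $p$, $q$ empty, so the candidate pair is $(x,y)$ = (`[`, `]`) with $s_L$ = `1`.

```python
import json


def oracle(s: str) -> bool:
    """Membership oracle: True iff s parses as JSON (stand-in for a black-box program)."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False


def pumped(p: str, x: str, z: str, y: str, q: str, k: int, j: int) -> str:
    """Return p x^k z y^j q, pumping x and y independently."""
    return p + x * k + z + y * j + q


# Hypothetical instance: x = s_a u = "[", y = v s_b s_B = "]", z = s_L = "1".
p, x, z, y, q = "", "[", "1", "]", ""

# For small k and j, the pumped string is valid exactly when k == j.
for k in range(4):
    for j in range(4):
        assert oracle(pumped(p, x, z, y, q, k, j)) == (k == j)
```

A finite check like this can only fail to refute that $(x,y)$ is a nesting pattern; the proof establishes the property for all $k\neq j$ via the tokenization argument.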

Let the oracle tokenizer be $\tau$. For $k>0$, we tokenize string $s_3=p(s_au)^ks_L(vs_bs_B)^kq$ as:

\begin{align*}
\tau(s_3) &= \tau(p(s_au)^ks_L(vs_bs_B)^kq)\\
&= \tau(p)h_a(\tau(u)h_a)^{k-1}\tau(us_Lv)h_b(\tau(s_Bv)h_b)^{k-1}\tau(s_Bq)\quad\text{(By Tokenization Consistency)}
\end{align*}

Notice that $u$, $s_L$ and $v$ are independent strings of token list. Therefore, we have

$$\tau(us_Lv)=\tau(u)\tau(s_L)\tau(v).$$

We next tokenize string $s_4 = p(s_a u)^k s_L (v s_b s_B)^j q$, for $k \neq j$ (both $> 0$):

\begin{align*}
\tau(s_4) &= \tau(p(s_a u)^k s_L (v s_b s_B)^j q) \\
&= \begin{cases}
\tau(p(s_a u)^k s_L v s_b (s_B v s_b)^{j-1} s_B q) & \text{if } k > j \\
\tau(p(s_a u)^k s_L v s_b (s_B v s_b)^{k-1} s_B (v s_b s_B)^{j-k} q) & \text{if } k < j
\end{cases} \\
&= \begin{cases}
\tau(p(s_a u)^k s_L v s_b (s_B v s_b)^{j-1})\, \tau(s_B q) & \text{if } k > j \\
\tau(p(s_a u)^k s_L v s_b (s_B v s_b)^{k-1})\, \tau(s_B (v s_b s_B)^{j-k} q) & \text{if } k < j
\end{cases} \\
&= \tau(p)\, h_a (\tau(u) h_a)^{k-1}\, \tau(u s_L v)\, h_b (\tau(s_B v) h_b)^{j-1}\, \tau(s_B q)
\end{align*}

We applied Tokenization Consistency in the last step above. Apparently, $\tau(s_4)$ is invalid because its call and return tokens are imbalanced: the last line contains $k$ explicit occurrences of $h_a$ but $j \neq k$ explicit occurrences of $h_b$. There are two cases left, where either $k$ or $j$ equals $0$.

When $k = 0$ and $j > 0$, $s_4 = p s_L (v s_b s_B)^j q$. Assume $s_4$ is valid; we can tokenize $s_4$ as

\begin{align*}
\tau(s_4) &= \tau(p s_L (v s_b s_B)^j q) \\
&= \tau(p s_L v s_b (s_B v s_b)^{j-1} s_B q) \\
&= \tau(p)\,\tau(s_L)\,\tau(v)\, h_b (\tau(s_B v) h_b)^{j-1}\, \tau(s_B)\,\tau(q).
\end{align*}

Again, $\tau(s_4)$ is apparently invalid. The case of $k > 0$ and $j = 0$ is similar, and we omit it for brevity.

Therefore, we have shown that the pair $(s_a u, v s_b s_B)$ is a nesting pattern. ∎
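The imbalance argument used for $s_4$ can be checked mechanically. The following Python sketch abstracts the tokenizations derived in the proof: each explicit $h_a$ becomes a `call`, each explicit $h_b$ becomes a `ret`, and every remaining segment (such as $\tau(u)$ or $\tau(s_B v)$) is collapsed into a single `plain` token, which assumes those segments are themselves well-matched. The names `skeleton` and `is_well_matched` are illustrative and do not come from the paper.

```python
from typing import List


def skeleton(k: int, j: int) -> List[str]:
    """Abstract shape of tau(s_4): k call tokens h_a, then j return tokens h_b.

    Assumption: the segments between explicit tokens are well-matched,
    so each one is modeled as a single 'plain' token.
    """
    tokens = ["plain"]                 # stands for tau(p)
    for _ in range(k):
        tokens += ["call", "plain"]    # h_a followed by a plain segment
    for _ in range(j):
        tokens += ["ret", "plain"]     # h_b followed by a plain segment
    return tokens


def is_well_matched(tokens: List[str]) -> bool:
    """True iff no return precedes its call and no call is left pending."""
    depth = 0
    for t in tokens:
        if t == "call":
            depth += 1
        elif t == "ret":
            depth -= 1
            if depth < 0:              # a return with no pending call
                return False
    return depth == 0                  # every call has been closed


if __name__ == "__main__":
    for k, j in [(2, 2), (3, 1), (1, 3), (0, 2), (2, 0)]:
        print(k, j, is_well_matched(skeleton(k, j)))
```

Under this abstraction only the case $k = j$ is balanced, which mirrors the conclusion that repeating the two halves of the pair an unequal number of times yields an invalid tokenization.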

Lemma C.5.

Given a compatible tokenizer $\tau$, there exists an upper bound $N_\tau$ such that for each string $s \in \tilde{\Sigma}^*$, if $\tilde{s}$ is well-matched and there exist context strings $(w, w')$ such that $w s w' \in \mathcal{L}$, then $s$ contains at most $N_\tau$ unmatched oracle call or return tokens.

Proof.

The proof parallels that of Lemma B.5: we show that, otherwise, $\sim_{\tau_{\mathcal{O}}}$ would have an infinite number of equivalence classes, contradicting the fact that $\tilde{\mathcal{L}}_{\tau_{\mathcal{O}}}$ is a VPL, as proven by Lemma C.3.

Suppose that for any number $K$ there exists a $\tau$-well-matched string $s$ that contains at least $K$ unmatched call tokens. Then, by Lemma C.4, a set of nesting patterns $(x_i, y_i)$ appears in $w s w'$, and therefore in $\tilde{w}\tilde{s}\tilde{w}'$. Since $\tau$ is compatible with $\mathcal{L}$, a call-return token pair ‹$c$, $d$› appears in each $(\tilde{x_i}, \tilde{y_i})$. The rest of the analysis is similar to that of Lemma B.5, and we omit it for brevity. ∎
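The quantity bounded by the lemma, the number of unmatched call or return tokens in a string, can be computed in one left-to-right pass. Below is a minimal Python sketch, assuming the tokens have already been tagged as `call`, `ret`, or `plain`; the tagging scheme and the name `count_unmatched` are illustrative assumptions rather than part of the paper.

```python
from typing import Iterable, Tuple


def count_unmatched(kinds: Iterable[str]) -> Tuple[int, int]:
    """Return (unmatched_returns, unmatched_calls) for a sequence of token kinds.

    A return is unmatched if no call is pending when it is read;
    a call is unmatched if it is still pending at the end of the input.
    """
    pending_calls = 0
    unmatched_returns = 0
    for kind in kinds:
        if kind == "call":
            pending_calls += 1
        elif kind == "ret":
            if pending_calls > 0:
                pending_calls -= 1      # closes the most recent pending call
            else:
                unmatched_returns += 1  # nothing to close
        # 'plain' tokens do not affect nesting
    return unmatched_returns, pending_calls


if __name__ == "__main__":
    print(count_unmatched(["ret", "call", "plain", "ret", "call", "call"]))
```

A sequence is well-matched exactly when both counts are zero; the lemma states that, for substrings of valid strings that are well-matched under $\tau$, these counts with respect to the oracle tokens stay below $N_\tau$.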

Theorem C.6.

Given oracle language $\mathcal{L}$ and oracle tokenizer $\tau_{\mathcal{O}}$, for each tokenizer $\tau$ that is compatible with the oracle language $\mathcal{L}$, the language $\tilde{\mathcal{L}}_{\tau}$ is a VPL.

Proof.

By Lemma C.5, for some $k \leq N_\tau$, we can represent $\tilde{q}_{\mathcal{O}}$ as $q_1\, \text{‹}a_1\, q_2\, \text{‹}a_2 \dots \text{‹}a_{k-1}\, q_k$, where each $q_i$, $i \in [1..k]$, is well-matched under $\tau_{\mathcal{O}}$. Now, we replace each $q_i$, $i \in [1..k]$, with its representative in the equivalence class of $\sim_{\tau_{\mathcal{O}}}$, and obtain $q'$ such that $\tilde{q'} \sim_{\tau} \tilde{q}$.
Since $\sim_{\tau_{\mathcal{O}}}$ has a finite number of equivalence classes, let $l$ be the length of the longest representative; then the length of $q'$ is bounded by $l \times k + k + 1$.
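The length bound can be unpacked as follows. This is a sketch under two assumptions made here for illustration: each call token ‹$a_i$ counts as one symbol, and $q'_i$ denotes the representative substituted for $q_i$ (the notation $q'_i$ is not used in the original proof).

```latex
\begin{align*}
|q'| &= \sum_{i=1}^{k} |q'_i| + (k - 1)
      && \text{($k$ representatives and $k-1$ call tokens)} \\
     &\leq l \cdot k + (k - 1)
      && \text{(each $|q'_i| \leq l$)} \\
     &\leq l \times k + k + 1 .
\end{align*}
```

Since $k \leq N_\tau$, the right-hand side is at most $l \times N_\tau + N_\tau + 1$. Assuming a finite token alphabet, only finitely many strings have length below this fixed bound, which is what limits the number of candidates for $q'$.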

Therefore, the congruence relation $\sim_\tau$ has a finite number of equivalence classes, which shows that $\tilde{\mathcal{L}}_{\tau}$ is a VPL. ∎

Lemma C.7 (Finite and Sufficient Seed Strings).

There is a finite set of seed strings $S$, with which we can find a tokenizer that is compatible with the oracle language $\mathcal{L}$ using Algorithm 4.

Proof.

The construction of $S$ involves two phases: we first identify strings that reveal the oracle call and return tokens, then augment this set to exclude taggings incompatible with $\mathcal{L}$.

Starting with an empty set $S_0$, for each oracle call-return token pair $(h_a, h_b)$, we select a seed string $s$ that contains a nesting pattern $(x, y)$ where $h_a$ and $h_b$ are unmatched in $x$ and $y$, respectively, and include $s$ in $S_0$. We denote the resulting set as $S_1$. Note that, similar to Lemma B.7, if there is no such nesting pattern for $(h_a, h_b)$, then it is easy to show that $(h_a, h_b)$ are “redundant”, in that they can be treated as plain tokens without changing the language with tagging removed.

Subsequently, we extend $S_1$ with strings $s$ from $\mathcal{L}$ that are incompatible with some tokenizer that Algorithm 4 can find based on $S_1$. We denote the new set as $S_2$. $S_2$ is still finite, since Algorithm 4 can only find a finite number of tokenizers from $S_1$.

For each string $s$ in $S_2$, we modify $s$ by replacing call and return tokens $h_a$ and $h_b$ with strings $s_a$ and $s_b$ from Token Fixed Prefix and Suffix, respectively. The new set is our targeted $S$.
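The construction of $S$ described above can be summarized as a small pipeline. The Python sketch below is only an outline under stated assumptions: the oracle, Algorithm 4, and the compatibility check are not implemented but passed in as hypothetical callables (`pick_seed`, `candidate_tokenizers`, `witness_against`), and `lexeme` is a hypothetical mapping from each call or return token to its concrete string $s_a$ or $s_b$.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Set, Tuple

TokenString = Tuple[str, ...]          # a seed string as a sequence of tokens
Pair = Tuple[str, str]                 # an oracle (call, return) token pair


def build_seed_set(
    oracle_pairs: Iterable[Pair],
    pick_seed: Callable[[Pair], Optional[TokenString]],
    candidate_tokenizers: Callable[[Set[TokenString]], Iterable[Hashable]],
    witness_against: Callable[[Hashable], Optional[TokenString]],
    lexeme: Dict[str, str],
) -> Set[str]:
    # Step 1: one seed per oracle pair that occurs in some nesting pattern.
    # pick_seed returning None models a "redundant" pair.
    s1: Set[TokenString] = set()
    for pair in oracle_pairs:
        seed = pick_seed(pair)
        if seed is not None:
            s1.add(seed)

    # Step 2: for every tokenizer obtainable from S1 (finitely many),
    # add a valid string that exposes it if it is incompatible.
    s2 = set(s1)
    for tokenizer in candidate_tokenizers(s1):
        witness = witness_against(tokenizer)   # None means "compatible"
        if witness is not None:
            s2.add(witness)

    # Step 3: replace call/return tokens by their concrete strings.
    return {"".join(lexeme.get(tok, tok) for tok in s) for s in s2}


if __name__ == "__main__":
    seeds = build_seed_set(
        oracle_pairs=[("<a", "b>")],
        pick_seed=lambda pair: (pair[0], "x", pair[1]),
        candidate_tokenizers=lambda s1: ["good", "bad"],
        witness_against=lambda t: ("<a", "<a", "x", "b>", "b>") if t == "bad" else None,
        lexeme={"<a": "(", "b>": ")"},
    )
    print(sorted(seeds))
```

The result is finite whenever `oracle_pairs` and the output of `candidate_tokenizers` are finite, which is the point of the finiteness argument for $S_1$ and $S_2$.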

Given $S$, we now show that the oracle tokenizer $\tau$ is a possible return of Algorithm 4. Intuitively, the oracle token pairs can be incrementally added to the hypothesis partial tokenizer $D$, and the tokenization $\mathbf{tokenize}(D, s)$ of any valid string $s$ remains well-matched.

By Lemma C.2, each oracle token pair is contained in $(x^2, y^2)$ for some nesting pattern $(x, y)$. Therefore, from the first nesting pattern $(x, y)$, Algorithm 4 could find $((q, g), (q', g'))$, where $((q, g), (q', g'))$ belongs to an oracle token pair. Then, because of the Token Fixed Prefix and Suffix assumption, the lexical rules for the two paired tokens are learned accurately. Denote the tokenizer that contains only this token pair as $D_1$. Consider Algorithm 5. At line 4, we construct a new match $m$. We now show that $m$ must be the match given by the oracle tokenizer. First, $m$ must correspond to an oracle match; otherwise, it would be filtered out by $k$-Repetition. Then, by Unique Pairing and Separation, $m$ must correspond to the oracle call/return token. Therefore, $\mathbf{tokenize}(D_1, s)$ contains only well-matched oracle tokens and is thus well-matched.

A similar analysis shows that as long as $D$ contains only oracle token pairs, $\mathbf{tokenize}(D, s)$ is well-matched for any valid string $s$. We have thus proved that Algorithm 4 can at least find the oracle tokenizer given $S_0$ (with redundant oracle tokens removed), by iteratively selecting the oracle token pair for each nesting pattern.
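The invariant used in this argument, that tokenization stays well-matched on valid strings while oracle token pairs are added one at a time, can be illustrated on a toy bracket language. This Python sketch assumes each token is a single fixed string, a simplification of the paper's lexical rules; the `tokenize` function here is a hypothetical stand-in and not the paper's $\mathbf{tokenize}$ procedure.

```python
from typing import Dict, List, Sequence, Tuple


def tokenize(d: Dict[str, str], s: str) -> List[Tuple[str, str]]:
    """Tag s with partial tokenizer d (lexeme -> 'call' or 'ret').

    Longer lexemes are tried first; any other character is a plain token.
    """
    lexemes = sorted(d, key=len, reverse=True)
    out: List[Tuple[str, str]] = []
    i = 0
    while i < len(s):
        for lex in lexemes:
            if s.startswith(lex, i):
                out.append((d[lex], lex))
                i += len(lex)
                break
        else:
            out.append(("plain", s[i]))
            i += 1
    return out


def well_matched(tokens: Sequence[Tuple[str, str]]) -> bool:
    depth = 0
    for kind, _ in tokens:
        if kind == "call":
            depth += 1
        elif kind == "ret":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def grow_and_check(pairs: Sequence[Tuple[str, str]], valid: Sequence[str]) -> bool:
    """Add token pairs to D one by one; check the invariant after each step."""
    d: Dict[str, str] = {}
    for call, ret in pairs:
        d[call], d[ret] = "call", "ret"
        if not all(well_matched(tokenize(d, s)) for s in valid):
            return False
    return True


if __name__ == "__main__":
    valid_strings = ["([x])", "[()]()", "x"]
    print(grow_and_check([("(", ")"), ("[", "]")], valid_strings))  # oracle pairs
    print(grow_and_check([("(", "]")], valid_strings))              # a wrong pairing
```

In this toy setting, adding the true bracket pairs in any order keeps every valid string well-matched, whereas the mismatched pair `("(", "]")` is exposed by the string `[()]()`.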

In conclusion, there exists a finite set of seed strings from which Algorithm 4 can find a compatible tokenizer. ∎

References

  • Aichernig et al. (2024) Bernhard K. Aichernig, Martin Tappler, and Felix Wallner. 2024. Benchmarking Combinations of Learning and Testing Algorithms for Automata Learning. Form. Asp. Comput. 36, 1, Article 3 (mar 2024), 37 pages. https://doi.org/10.1145/3605360
  • Alur (2007) Rajeev Alur. 2007. Marrying words and trees. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Beijing, China) (PODS ’07). Association for Computing Machinery, New York, NY, USA, 233–242. https://doi.org/10.1145/1265530.1265564
  • Alur et al. (2005) Rajeev Alur, Viraj Kumar, P. Madhusudan, and Mahesh Viswanathan. 2005. Congruences for visibly pushdown languages. In Proceedings of the 32nd International Conference on Automata, Languages and Programming (Lisbon, Portugal) (ICALP’05). Springer-Verlag, Berlin, Heidelberg, 1102–1114. https://doi.org/10.1007/11523468_89
  • Alur and Madhusudan (2004) Rajeev Alur and P. Madhusudan. 2004. Visibly pushdown languages. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing (Chicago, IL, USA) (STOC ’04). Association for Computing Machinery, New York, NY, USA, 202–211. https://doi.org/10.1145/1007352.1007390
  • Alur and Madhusudan (2009) Rajeev Alur and P. Madhusudan. 2009. Adding nesting structure to words. J. ACM 56, 3, Article 16 (may 2009), 43 pages. https://doi.org/10.1145/1516512.1516518
  • Angluin (1987) Dana Angluin. 1987. Learning regular sets from queries and counterexamples. Inf. Comput. 75, 2 (nov 1987), 87–106. https://doi.org/10.1016/0890-5401(87)90052-6
  • Arefin et al. (2024) M. Arefin, S. Shetiya, Z. Wang, and C. Csallner. 2024. Fast Deterministic Black-box Context-free Grammar Inference. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 901–901. https://doi.ieeecomputersociety.org/
  • Barbot et al. (2021) Benoît Barbot, Benedikt Bollig, Alain Finkel, Serge Haddad, Igor Khmelnitsky, Martin Leucker, Daniel Neider, Rajarshi Roy, and Lina Ye. 2021. Extracting Context-Free Grammars from Recurrent Neural Networks using Tree-Automata Learning and A* Search. In Proceedings of the Fifteenth International Conference on Grammatical Inference (Proceedings of Machine Learning Research, Vol. 153), Jane Chandlee, Rémi Eyraud, Jeff Heinz, Adam Jardine, and Menno van Zaanen (Eds.). PMLR, 113–129. https://proceedings.mlr.press/v153/barbot21a.html
  • Bastani et al. (2017) Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing program input grammars. SIGPLAN Not. 52, 6 (jun 2017), 95–110. https://doi.org/10.1145/3140587.3062349
  • Bendrissou et al. (2022) Bachir Bendrissou, Rahul Gopinath, and Andreas Zeller. 2022. “Synthesizing input grammars”: a replication study. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (San Diego, CA, USA) (PLDI 2022). Association for Computing Machinery, New York, NY, USA, 260–268. https://doi.org/10.1145/3519939.3523716
  • Chaudhuri and Alur (2007) Swarat Chaudhuri and Rajeev Alur. 2007. Instrumenting C programs with nested word monitors. In Proceedings of the 14th International SPIN Conference on Model Checking Software (Berlin, Germany). Springer-Verlag, Berlin, Heidelberg, 279–283.
  • Chow (1978) T. S. Chow. 1978. Testing Software Design Modeled by Finite-State Machines. IEEE Trans. Softw. Eng. 4, 3 (may 1978), 178–187. https://doi.org/10.1109/TSE.1978.231496
  • Drewes and Högberg (2007) Frank Drewes and Johanna Högberg. 2007. Query Learning of Regular Tree Languages: How to Avoid Dead States. Theor. Comp. Sys. 40, 2 (feb 2007), 163–185. https://doi.org/10.1007/s00224-005-1233-3
  • Drewes et al. (2011) Frank Drewes, Johanna Högberg, and Andreas Maletti. 2011. MAT learners for tree series: an abstract data type and two realizations. Acta Informatica 48, 3 (May 2011), 165–189. https://doi.org/10.1007/s00236-011-0135-x
  • Fujiwara et al. (1991) S. Fujiwara, G. v. Bochmann, F. Khendek, M. Amalou, and A. Ghedamsi. 1991. Test selection based on finite state models. IEEE Transactions on Software Engineering 17, 6 (1991), 591–603. https://doi.org/10.1109/32.87284
  • Gauwin et al. (2008) Olivier Gauwin, Joachim Niehren, and Yves Roos. 2008. Streaming tree automata. Inform. Process. Lett. 109, 1 (2008), 13–17. https://doi.org/10.1016/j.ipl.2008.08.002
  • Harris et al. (2012) William R. Harris, Somesh Jha, and Thomas Reps. 2012. Secure programming via visibly pushdown safety games. In Proceedings of the 24th International Conference on Computer Aided Verification (Berkeley, CA) (CAV’12). Springer-Verlag, Berlin, Heidelberg, 581–598. https://doi.org/10.1007/978-3-642-31424-7_41
  • Heizmann et al. (2010) Matthias Heizmann, Jochen Hoenicke, and Andreas Podelski. 2010. Nested interpolants. SIGPLAN Not. 45, 1 (jan 2010), 471–482. https://doi.org/10.1145/1707801.1706353
  • Howar (2012) Falk Howar. 2012. Active learning of interface programs. Ph. D. Dissertation. https://doi.org/10.17877/DE290R-4817
  • Irfan et al. (2010) Muhammad Naeem Irfan, Catherine Oriat, and Roland Groz. 2010. Angluin style finite state machine inference with non-optimal counterexamples. In Proceedings of the First International Workshop on Model Inference In Testing (Trento, Italy) (MIIT ’10). Association for Computing Machinery, New York, NY, USA, 11–19. https://doi.org/10.1145/1868044.1868046
  • Isberner (2015) Malte Isberner. 2015. Foundations of active automata learning: an algorithmic perspective. https://api.semanticscholar.org/CorpusID:45690562
  • Jia et al. (2021) Xiaodong Jia, Ashish Kumar, and Gang Tan. 2021. A derivative-based parser generator for visibly Pushdown grammars. Proc. ACM Program. Lang. 5, OOPSLA, Article 151 (oct 2021), 24 pages. https://doi.org/10.1145/3485528
  • Jia et al. (2023) Xiaodong Jia, Ashish Kumar, and Gang Tan. 2023. A Derivative-based Parser Generator for Visibly Pushdown Grammars. ACM Trans. Program. Lang. Syst. 45, 2, Article 9 (may 2023), 68 pages. https://doi.org/10.1145/3591472
  • Kulkarni (2023) Neil Kulkarni. 2023. Arvada. https://github.com/neil-kulkarni/arvada.
  • Kulkarni et al. (2022) Neil Kulkarni, Caroline Lemieux, and Koushik Sen. 2022. Learning highly recursive input grammars. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (Melbourne, Australia) (ASE ’21). IEEE Press, 456–467. https://doi.org/10.1109/ASE51524.2021.9678879
  • Kumar et al. (2006) Viraj Kumar, P. Madhusudan, and Mahesh Viswanathan. 2006. Minimization, learning, and conformance testing of boolean programs. In Proceedings of the 17th International Conference on Concurrency Theory (Bonn, Germany) (CONCUR’06). Springer-Verlag, Berlin, Heidelberg, 203–217. https://doi.org/10.1007/11817949_14
  • Kumar et al. (2007) Viraj Kumar, P. Madhusudan, and Mahesh Viswanathan. 2007. Visibly pushdown automata for streaming XML. In Proceedings of the 16th International Conference on World Wide Web (Banff, Alberta, Canada) (WWW ’07). Association for Computing Machinery, New York, NY, USA, 1053–1062. https://doi.org/10.1145/1242572.1242714
  • Maler and Pnueli (1991) Oded Maler and Amir Pnueli. 1991. On the learnability of infinitary regular sets. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory (Santa Cruz, California, USA) (COLT ’91). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 128–138.
  • Michaliszyn and Otop (2022) Jakub Michaliszyn and Jan Otop. 2022. Learning Deterministic Visibly Pushdown Automata Under Accessible Stack. In 47th International Symposium on Mathematical Foundations of Computer Science (MFCS 2022) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 241), Stefan Szeider, Robert Ganian, and Alexandra Silva (Eds.). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 74:1–74:16. https://doi.org/10.4230/LIPIcs.MFCS.2022.74
  • Mozafari et al. (2010) Barzan Mozafari, Kai Zeng, and Carlo Zaniolo. 2010. From regular expressions to nested words: unifying languages and query execution for relational and XML sequences. Proc. VLDB Endow. 3, 1–2 (sep 2010), 150–161. https://doi.org/10.14778/1920841.1920865
  • Mozafari et al. (2012) Barzan Mozafari, Kai Zeng, and Carlo Zaniolo. 2012. High-performance complex event processing over XML streams. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD ’12). Association for Computing Machinery, New York, NY, USA, 253–264. https://doi.org/10.1145/2213836.2213866
  • Nguyen and Sudholt (2006) Dong Ha Nguyen and M. Sudholt. 2006. VPA-Based Aspects: Better Support for AOP over Protocols. In Fourth IEEE International Conference on Software Engineering and Formal Methods (SEFM’06). 167–176. https://doi.org/10.1109/SEFM.2006.39
  • Raffelt et al. (2005) Harald Raffelt, Bernhard Steffen, and Therese Berg. 2005. LearnLib: a library for automata learning and experimentation. In Proceedings of the 10th International Workshop on Formal Methods for Industrial Critical Systems (Lisbon, Portugal) (FMICS ’05). Association for Computing Machinery, New York, NY, USA, 62–71. https://doi.org/10.1145/1081180.1081189
  • Rivest and Schapire (1993) R.L. Rivest and R.E. Schapire. 1993. Inference of Finite Automata Using Homing Sequences. Information and Computation 103, 2 (1993), 299–347. https://doi.org/10.1006/inco.1993.1021
  • Thomo and Venkatesh (2011) A. Thomo and S. Venkatesh. 2011. Rewriting of visibly pushdown languages for XML data integration. Theoretical Computer Science 412, 39 (2011), 5285–5297. https://doi.org/10.1016/j.tcs.2011.05.047
  • Vasilevskii (1973) M. P. Vasilevskii. 1973. Failure diagnosis of automata. Cybernetics 9, 4 (July 1973), 653–665. https://doi.org/10.1007/BF01068590
  • Wu et al. (2019) Zhengkai Wu, Evan Johnson, Wei Yang, Osbert Bastani, Dawn Song, Jian Peng, and Tao Xie. 2019. REINAM: reinforcement learning for input-grammar inference. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Tallinn, Estonia) (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 488–498. https://doi.org/10.1145/3338906.3338958