V-Star: Learning Visibly Pushdown Grammars from Program Inputs (Extended Version)

Xiaodong Jia [email protected] 0000-0003-2493-9111 and Gang Tan [email protected] 0000-0001-6109-6091, The Pennsylvania State University, 201 Old Main, State College, Pennsylvania, USA 16802


Abstract.

Accurate description of program inputs remains a critical challenge in the field of programming languages. Active learning, as a well-established field, achieves exact learning for regular languages. We offer an innovative grammar inference tool, V-Star, based on the active learning of visibly pushdown automata. V-Star deduces nesting structures of program input languages from sample inputs, employing a novel inference mechanism based on nested patterns. This mechanism identifies token boundaries and converts languages such as XML documents into VPLs. We then adapt Angluin’s L-Star, an exact learning algorithm, for VPA learning, which improves the precision of our tool. Our evaluation demonstrates that V-Star effectively and efficiently learns a variety of practical grammars, including S-Expressions, JSON, and XML, and outperforms other state-of-the-art tools.

Keywords: grammar inference; visibly pushdown grammar

CCS Concepts: Theory of computation → Program analysis; Theory of computation → Grammars and context-free languages; Software and its engineering → Automatic programming

1. Introduction

In recent years, there has been a growing interest in learning grammars from a set of sample strings. This interest stems from a wide range of applications in fuzzing, program validation, and other areas (Arefin et al., 2024; Bastani et al., 2017; Kulkarni et al., 2022; Wu et al., 2019; Bendrissou et al., 2022). Despite significant progress, the challenge of learning the input grammars for black-box programs remains, particularly when considering grammars with inherent complexities. This challenge is part of a broader problem that has been extensively studied for regular languages but is significantly more difficult when dealing with broader classes of grammars.

Recently, Glade (Bastani et al., 2017) (followed by a replication study (Bendrissou et al., 2022)) and Arvada (Kulkarni et al., 2022) have been proposed to learn context-free grammars (CFGs) under active learning. Both approaches require positive seed inputs and utilize enumeration and heuristics to reduce the search space. However, neither Glade nor Arvada fully exploits nesting structures, which could improve the accuracy of grammar learning.

Nesting structures are widely observed in practical languages, where recursions are explicitly delimited in their sentences. For example, an XML document’s open and close tags delimit a component of the document and can be nested within other open and close tags. These nesting structures often carry valuable insights into the grammars’ structure and could potentially be a powerful tool for learning grammars.

To achieve accurate learning, we model the nesting structures together with the target language as Visibly Pushdown Grammars (VPGs) (Alur and Madhusudan, 2009), a subclass of CFGs. VPGs formally specify nesting structures, and despite being slightly weaker than CFGs they can specify many practical format languages such as XML and JSON; they also enjoy all desirable closure properties; e.g., the set of visibly pushdown languages is closed under intersection, concatenation, and complement (Alur and Madhusudan, 2009). We posit that these properties position VPGs as an ideal mechanism for learning practical grammars, which is the focus of this paper.

Learning VPGs is a problem that fits nicely into the well-studied active learning field. In this field, Angluin (Angluin, 1987) first demonstrated that it is possible to efficiently learn regular languages from a minimally adequate teacher (MAT), which answers (1) whether a string is in the language, which is called a membership query, and (2) whether a finite state automaton accepts exactly the language held by the teacher, which is called an equivalence query; if not, the teacher provides a counterexample accepted by either the automaton or the language, but not both.

Despite advancements in VPG learning facilitated by a MAT, current techniques, as highlighted in previous work (Barbot et al., 2021; Michaliszyn and Otop, 2022; Isberner, 2015), are constrained by assumptions that do not match practical settings. Specifically, they assume the availability of known nesting patterns, or more technically, a predefined tagging function—a cornerstone of the VPG formalism detailed in Section 3. This tagging function defines a set of call and return symbols and is assumed to operate on individual characters. In contrast, in many practical settings the tagging must be inferred and it operates on sequences of characters (i.e., tokens) rather than on individual characters; a more detailed comparison with prior work and the limitations of these assumptions are discussed in Section 2.

To address these limitations, we introduce V-Star, a novel grammar-inference framework designed to learn VPGs from a black-box program using a collection of sample seed strings. V-Star’s algorithm is inspired by $L^{*}$ (Angluin, 1987), which learns a finite-state automaton using a MAT and seed strings. We develop V-Star in several pivotal steps. First, we develop an $L^{*}$ -like algorithm for learning VPGs when tags are known and are on single characters; this algorithm utilizes $k$ -SEVPA (Alur et al., 2005) to define a set of congruence relations, critical to the algorithm. Second, we develop a tag-inference algorithm, which utilizes a novel notion of nesting patterns to infer call and return symbols, assuming they are of single characters. In the third step, we lift the restriction that tags are on single characters and develop an algorithm for inferring token-based tags. Finally, we remove the requirement of equivalence queries by simulating them using membership queries via sampling. These steps all together result in a practical framework for learning VPGs from seed strings. The main contributions of V-Star are summarized as follows:

(1) Innovative Tool and Methodology: V-Star is a novel tool for VPG inference. Its algorithm adapts Angluin’s $L^{*}$ algorithm and integrates a set of novel techniques such as nesting patterns to infer call and return tokens. To the best of our knowledge, this is the first VPG-learning algorithm that does not require knowing the call/return tokens a priori.

(2) Theoretical Reasoning: We provide a theoretical analysis elucidating the conditions under which V-Star achieves accurate learning. We first prove that for any character-based visibly pushdown language with uniquely paired call/return symbols, there exists a finite set of seed strings from which V-Star can learn a tagging function that achieves exact learning. We further show that under some realistic assumptions V-Star can infer tagged tokens for token-based visibly pushdown languages.

(3) Accuracy: Our evaluation of V-Star demonstrates its better accuracy in learning practical grammars, in comparison with state-of-the-art grammar learning tools. The accuracy result highlights the benefit of utilizing the concept of nesting structures and V-Star’s ability to simulate the equivalence queries by sampling test strings from the seed strings.

2. Related Work

In the realm of automata learning, learning finite state automata is a well-studied field. The $L^{*}$ Algorithm by Angluin (Angluin, 1987), which learns a finite state automaton held by a MAT in polynomial time, is a seminal work in this area. Following $L^{*}$ , various adaptations focusing on the active learning problem have been proposed, such as (Rivest and Schapire, 1993; Maler and Pnueli, 1991; Irfan et al., 2010; Howar, 2012; Isberner, 2015), to list a few.

Glade (Bastani et al., 2017), targeting learning a CFG from an oracle, employs a two-step algorithmic approach. Initially, it enumerates all substrings of seed strings and attempts to generalize these substrings into regular expressions. Subsequently, nonterminals are created and merged based on learned regular expressions. Arvada (Kulkarni et al., 2022) also aims to learn a CFG by using a technique that exchanges two substrings from seed strings. If two substrings are interchangeable, they are assigned the same nonterminal, and the process gradually constructs parse trees. Arvada employs heuristics to exchange substrings with similar contexts. In its evaluation, the context is considered to be the surrounding substrings of length four. While context strings bear some similarity to call and return symbols in VPGs, call and return symbols are more flexible than context strings in Arvada: they can wrap contexts of an arbitrary length. Moreover, call and return symbols have a stronger implication on the recursive structure of the oracle language, a feature that V-Star capitalizes on for better grammar-inference accuracy, which is evidenced by V-Star’s experimental comparison with Glade and Arvada discussed in Section 6. As an extension of Glade, REINAM (Wu et al., 2019) refines the grammar learned by Glade using reinforcement learning. This process allows for the potential replacement of Glade by other learning tools such as Arvada (Kulkarni et al., 2022) or our tool V-Star.

Learning VPGs, a subset of CFGs, has been the focus of several systems. For example, assuming that the set of call and return symbols is known, VPL* (Barbot et al., 2021) learns VPGs with membership and equivalence queries. The approach taken by VPL* is indirect, first using TL* (Drewes and Högberg, 2007; Drewes et al., 2011) to learn a tree automaton, which is then converted to a VPG. Another work by Michaliszyn and Otop (2022) also assumes that the set of call and return symbols is known and attempts to learn a visibly pushdown automaton (VPA). Moreover, it requires a stronger teacher who, in addition to providing membership and equivalence queries, can also report the stack content during the VPA’s execution. TTT (Isberner, 2015) is another VPA-inference tool under the active-learning setting, based on discrimination trees. V-Star differentiates itself from these systems by learning call and return symbols from the oracle and seed strings.

In practice, a MAT is often instantiated by a black-box program and the oracle language comprises input strings that do not trigger program errors (assuming the program always terminates). Thus, membership queries require only program execution. However, equivalence queries are much harder to answer. Simulating equivalence queries has a long history, often under the name of conformance testing (Aichernig et al., 2024). Chow (Chow, 1978) first proposed the so-called W-method for Mealy machines; we note that the FSA version of the W-method is essentially a brute force approach of enumerating suffix strings that distinguish the representative prefix strings in the Nerode relation. The space of suffix strings is restricted from $\Sigma^{*}$ to $\Sigma^{k}$ , under the assumption that the difference between the size of the oracle FSA and the size of the learned FSA is $k$ . The W-method has many variants (Vasilevskii, 1973; Raffelt et al., 2005; Fujiwara et al., 1991). For more information, we refer to Aichernig et al. (2024).

3. Background

3.1. Grammar Inference

In grammar inference, we assume an oracle for the grammar being inferred, denoted as $\mathcal{O}$ , which maps strings to booleans—true for valid strings and false for invalid ones. The set of valid strings according to the oracle forms the oracle language, denoted as $\mathcal{L}_{\mathcal{O}}$ . When the oracle language is defined by a grammar, we refer to this grammar as the oracle grammar, denoted as $G_{\mathcal{O}}$ . Similarly, when the oracle language can be recognized by a deterministic finite automaton (DFA), we refer to this DFA as the oracle automaton, denoted as $\mathcal{H}_{\mathcal{O}}$ .

We define the active learning problem as follows:

Inputs: The problem takes two inputs:

(1) A set $\Sigma$ of terminals, and

(2) A minimally adequate teacher (MAT), which can answer both membership and equivalence queries. For an equivalence query with a hypothesis grammar, the teacher returns true when the hypothesis grammar is equivalent to the oracle grammar, meaning they generate the same language; if they are not equivalent, it provides a counterexample, which is a string accepted by either the hypothesis grammar or the oracle grammar, but not by both.

Output: The goal of the active learning problem is to construct a grammar, denoted as $\mathcal{G}$ , such that the language it generates, denoted as $\mathcal{L}_{\mathcal{G}}$ , is identical to the oracle language $\mathcal{L}_{\mathcal{O}}$ .

3.2. Visibly Pushdown Grammars

The expressive power of VPGs (Alur and Madhusudan, 2009) is between regular grammars and context-free grammars, and VPGs are sufficient for describing the syntax of many practical languages, such as JSON, XML, and HTML. Application-wise, VPGs have been used in program analysis, XML processing, and other fields (Harris et al., 2012; Heizmann et al., 2010; Chaudhuri and Alur, 2007; Nguyen and Sudholt, 2006; Kumar et al., 2007; Alur, 2007; Thomo and Venkatesh, 2011; Gauwin et al., 2008; Mozafari et al., 2010, 2012). Moreover, because they can be parsed efficiently, VPGs are also valuable for specifying practical languages (Jia et al., 2021, 2023).

A language is called a visibly pushdown language (VPL) if it can be generated by a VPG. VPLs enjoy the same appealing theoretical closure properties as regular languages; e.g., the set of VPLs is closed under intersection, concatenation, and complement (Alur and Madhusudan, 2009). Further, since VPLs are a subset of deterministic context-free languages, it is always possible to build a deterministic pushdown automaton from a VPL.

A VPG (Alur and Madhusudan, 2009) is formally defined as a tuple $(V,\Sigma,P,L_{0})$ , where $V$ is a set of nonterminals, $\Sigma$ a set of terminals, $P$ a set of production rules, and $L_{0}\in V$ the start nonterminal. The set of terminals $\Sigma$ is partitioned into three kinds: $\Sigma_{\text{plain}}$ , $\Sigma_{\text{call}}$ , $\Sigma_{\text{ret}}$ , which contain plain, call, and return symbols, respectively. The stack action associated with an input symbol is fully determined by the kind of the symbol: an action of pushing to the stack is always performed for a call symbol, an action of popping from the stack is always performed for a return symbol, and no stack action is performed for a plain symbol. Notation-wise, a terminal in $\Sigma_{\text{call}}$ is tagged with \guilsinglleft on the left, and a terminal in $\Sigma_{\text{ret}}$ is tagged with \guilsinglright on the right. For example, \guilsinglleft $a$ is a call symbol in $\Sigma_{\text{call}}$ , and $b$ \guilsinglright is a return symbol in $\Sigma_{\text{ret}}$ .

Well-matched VPGs produce strings where each call symbol is always paired with a return symbol. They are formally defined as follows:

Definition 3.0 (Well-matched VPGs).

A grammar $G=(V,\Sigma,P,L_{0})$ is a well-matched VPG if every production rule in $P$ adheres to one of the following forms:

(1) $L\to\epsilon$ , where $\epsilon$ stands for the empty string.

(2) $L\to cL_{1}$ , where $c\in\Sigma_{\text{plain}}$ .

(3) $L\to{\text{\guilsinglleft{$a$}}{L_{1}}\text{{$b$}\guilsinglright}}L_{2}$ , where $\text{\guilsinglleft{$a$}}\in\Sigma_{\text{call}}$ and $\text{{$b$}\guilsinglright}\in\Sigma_{\text{ret}}$ .

Note that in $L\to cL_{1}$ terminal $c$ must be a plain symbol, and in $L\to{\text{\guilsinglleft{$a$}}{L_{1}}\text{{$b$}\guilsinglright}}L_{2}$ a call symbol must be matched with a return symbol; these requirements ensure that any derived string must be well-matched. This is useful in languages like XML, where tags open and close in a nested, well-matched manner. For instance, the grammar rule “ $\text{element}\to\text{OpenTag content CloseTag Empty}\mid\text{SingleTag Empty}$ ” represents an XML element that either contains content within matched open and close tags or is an empty single tag.

In this paper, we consider only well-matched VPGs, and use the term VPGs for well-matched VPGs. We also call rules in the form of $L\to{\text{\guilsinglleft{$a$}}{L_{1}}\text{{$b$}\guilsinglright}}L_{2}$ matching rules, and rules of the form $L\to cL_{1}$ linear rules.
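As a small worked illustration of the three rule forms, consider a hypothetical one-nonterminal grammar (chosen here for illustration only, not taken from the paper) with a matching rule, a linear rule, and the empty rule. The derivation below produces the well-matched string consisting of a call, a plain symbol, the matching return, and a trailing plain symbol:

```latex
% Hypothetical grammar with a single nonterminal L (an illustrative assumption):
%   L -> <a L b> L   (matching rule)
%   L -> c L         (linear rule)
%   L -> epsilon
\newcommand{\ca}{\text{\guilsinglleft{$a$}}}
\newcommand{\rb}{\text{{$b$}\guilsinglright}}
\begin{align*}
L &\Rightarrow \ca\, L\, \rb\, L          && \text{matching rule}\\
  &\Rightarrow \ca\, c\, L\, \rb\, L      && \text{linear rule inside the call}\\
  &\Rightarrow \ca\, c\, \rb\, L          && L \to \epsilon\\
  &\Rightarrow \ca\, c\, \rb\, c\, L      && \text{linear rule after the return}\\
  &\Rightarrow \ca\, c\, \rb\, c          && L \to \epsilon
\end{align*}
```

Every call symbol is introduced together with its return symbol by a single matching rule, which is why no derivation of this grammar can yield an unmatched call or return.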

3.3. Visibly Pushdown Automata

A Visibly Pushdown Automaton (VPA) (Alur and Madhusudan, 2004) on finite strings over symbols $\Sigma_{\text{call}}$ , $\Sigma_{\text{ret}}$ , and $\Sigma_{\text{plain}}$ is a tuple $\mathcal{H}=(Q,q_{0},\Gamma,\delta,Q_{F})$ where $Q$ is a finite set of states, $q_{0}\in Q$ is the initial state, $\Gamma$ is a finite stack alphabet that contains a special bottom-of-stack symbol $\bot$ , $\delta=\delta_{\text{call}}\cup\delta_{\text{ret}}\cup\delta_{\text{pln}}$ is the transition function, where $\delta_{\text{call}}:Q\times\Sigma_{\text{call}}\rightarrow Q\times(\Gamma\setminus\{\bot\})$ , $\delta_{\text{ret}}:Q\times\Sigma_{\text{ret}}\times\Gamma\rightarrow Q$ , and $\delta_{\text{pln}}:Q\times\Sigma_{\text{plain}}\rightarrow Q$ , and $Q_{F}\subseteq Q$ is a set of final states.

The function $\delta_{\text{call}}(q,\text{\guilsinglleft{$a$}})=(q^{\prime},\gamma)$ means that upon reading \guilsinglleft $a$ , state $q$ is changed to state $q^{\prime}$ and $\gamma$ is pushed onto the stack. Similarly, $\delta_{\text{ret}}(q,\text{{$b$}\guilsinglright},\gamma)=q^{\prime}$ means that upon reading $b$ \guilsinglright with $\gamma$ on top of the stack, state $q$ is changed to $q^{\prime}$ and $\gamma$ is removed from the top of the stack (if $\gamma$ is $\bot$ , the empty stack remains unaltered). Finally, $\delta_{\text{pln}}(q,c)=q^{\prime}$ means state $q$ is changed to $q^{\prime}$ upon reading symbol $c$ .

A stack is a non-empty finite sequence over $\Gamma$ ending in the bottom-of-stack symbol $\bot$ . The set of all stacks is denoted as $St=(\Gamma\setminus\{\bot\})^{*}\cdot\{\bot\}$ . A configuration is a pair $(q,T)$ of state $q$ and stack $T\in St$ . We define the single-step transition of configurations $\delta((q,T),i)$ as tuple $(q^{\prime},T^{\prime})$ , based on the type of symbol $i$ and the transition functions:

(1) If $i\in\Sigma_{\text{call}}$ , then $(q^{\prime},\gamma)=\delta_{\text{call}}(q,i)$ and $T^{\prime}=\gamma\cdot T$ for some $\gamma\in\Gamma$ ;

(2) If $i\in\Sigma_{\text{ret}}$ , then $q^{\prime}=\delta_{\text{ret}}(q,i,\gamma)$ for some $\gamma$ and either $T=\gamma\cdot T^{\prime}$ , or $\gamma=\bot$ and $T=T^{\prime}=\bot$ ;

(3) If $i\in\Sigma_{\text{plain}}$ , then $q^{\prime}=\delta_{\text{pln}}(q,i)$ and $T^{\prime}=T$ .

We extend the single-step transition to strings by $\delta((q,T),\epsilon)=(q,T)$ and $\delta((q,T),si)=\delta(\delta((q,T),s),i)$ . A string $s\in\Sigma^{*}$ is accepted by VPA $\mathcal{H}$ if $\delta((q_{0},\bot),s)=(q,T)$ for some $q\in Q_{F}$ and some stack $T$ . The language of $\mathcal{H}$ is the set of strings accepted by $\mathcal{H}$ .
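The configuration semantics above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in V-Star: the class name `VPA`, the string encoding of tagged symbols (`"<a"`, `"b>"`), the use of partial dictionaries for the transition functions (a missing entry is treated as rejection), and the example automaton are all assumptions made for this sketch.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

BOTTOM = "_|_"  # stands for the bottom-of-stack symbol
Config = Tuple[str, Tuple[str, ...]]  # (state, stack), stack top at index 0


class VPA:
    """Sketch of a VPA; transitions are partial dicts (missing = reject)."""

    def __init__(self, calls: Set[str], rets: Set[str],
                 delta_call: Dict[Tuple[str, str], Tuple[str, str]],
                 delta_ret: Dict[Tuple[str, str, str], str],
                 delta_pln: Dict[Tuple[str, str], str],
                 q0: str, finals: Set[str]) -> None:
        self.calls, self.rets = set(calls), set(rets)
        self.delta_call = delta_call  # (q, call symbol) -> (q', pushed gamma)
        self.delta_ret = delta_ret    # (q, return symbol, stack top) -> q'
        self.delta_pln = delta_pln    # (q, plain symbol) -> q'
        self.q0, self.finals = q0, set(finals)

    def step(self, config: Config, sym: str) -> Optional[Config]:
        q, stack = config
        if sym in self.calls:                      # push on a call symbol
            target = self.delta_call.get((q, sym))
            if target is None:
                return None
            q2, gamma = target
            return q2, (gamma,) + stack
        if sym in self.rets:                       # pop on a return symbol
            top = stack[0]
            q2 = self.delta_ret.get((q, sym, top))
            if q2 is None:
                return None
            # The bottom-of-stack symbol is never removed.
            return q2, (stack if top == BOTTOM else stack[1:])
        q2 = self.delta_pln.get((q, sym))          # plain: stack unchanged
        return None if q2 is None else (q2, stack)

    def accepts(self, symbols: Iterable[str]) -> bool:
        config: Optional[Config] = (self.q0, (BOTTOM,))
        for sym in symbols:
            config = self.step(config, sym)
            if config is None:
                return False
        return config[0] in self.finals


# Hypothetical example: n calls "<a" followed by n returns "b>", n >= 0.
# "Z" marks the outermost call so the last return can move to the final state.
example = VPA(
    calls={"<a"}, rets={"b>"},
    delta_call={("q0", "<a"): ("q1", "Z"), ("q1", "<a"): ("q1", "A")},
    delta_ret={("q1", "b>", "A"): "q2", ("q1", "b>", "Z"): "qf",
               ("q2", "b>", "A"): "q2", ("q2", "b>", "Z"): "qf"},
    delta_pln={}, q0="q0", finals={"q0", "qf"},
)
```

Because acceptance looks only at the final state, the example encodes "the stack is empty again" in the stack symbol `Z` pushed by the outermost call.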

3.4. Angluin’s L-Star Algorithm

We next briefly discuss the $L^{*}$ algorithm (Angluin, 1987), which learns a finite state automaton from a MAT in polynomial time. We first introduce a notion of equivalence. We define two strings to be equivalent w.r.t. language $\mathcal{L}$ if extending them with any suffix $w$ has the same membership result in $\mathcal{L}$ :

$$s_{1}\simeq s_{2}\equiv\forall w,\ s_{1}w\in\mathcal{L}\;\Leftrightarrow\;s_{2}w\in\mathcal{L}.$$

We also introduce a notion of approximate equivalence relative to suffixes in a test-string set $T$ :

$$s_{1}\simeq_{T}s_{2}\equiv\forall w\in T,\ s_{1}w\in\mathcal{L}\;\Leftrightarrow\;s_{2}w\in\mathcal{L}.$$

The $L^{*}$ algorithm operates in iterations and maintains two sets of strings: $Q$ and $T$ , both of which start with $\{\epsilon\}$ . The set $T$ contains a set of test strings. The set $Q$ contains a set of strings that are separable by $T$ , which means any two different strings in $Q$ are not $T$ -equivalent: $\forall s_{1},s_{2}\in Q,\ s_{1}\neq s_{2}\;\Rightarrow\;s_{1}\not\simeq_{T}s_{2}$ . In addition, $(Q,T)$ is closed in the sense that for any $s\in Q$ and any symbol $c$ there exists $s^{\prime}\in Q$ such that $sc\simeq_{T}s^{\prime}$ .

Given separable and closed $(Q,T)$ , we can construct a hypothesis DFA: each string in $Q$ becomes a state and we add a transition from $s\in Q$ with input symbol $c$ to the unique state $s^{\prime}\in Q$ such that $sc\simeq_{T}s^{\prime}$ ; the initial state is the empty string $\epsilon$ and the accepting states are those strings in $Q$ that are in $\mathcal{L}$ . With the hypothesis DFA, we can ask the MAT to check if the DFA is equivalent to the oracle language. If it is, a DFA for the oracle language has been learned and the algorithm terminates. If not, the MAT gives a counterexample. With the counterexample, the algorithm can extend $Q$ and $T$ and then use membership queries provided by the MAT to make $Q$ and $T$ separable and closed again. Details can be found in the $L^{*}$ paper (Angluin, 1987).
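The closing step and the hypothesis construction described above can be sketched as follows. This is a simplified illustration of those two steps only (counterexample processing is omitted); the function names, the representation of $Q$ and $T$ as Python lists, and the example language (strings over `a`, `b` with an even number of `a`) are assumptions of this sketch.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Member = Callable[[str], bool]  # membership query: string -> bool


def row(s: str, T: List[str], member: Member) -> Tuple[bool, ...]:
    """Membership results of s extended with each test string in T."""
    return tuple(member(s + w) for w in T)


def find_unclosed(Q: List[str], T: List[str], alphabet: str,
                  member: Member) -> Optional[str]:
    """Return some s+c that is T-equivalent to no string in Q, if any."""
    rows = {row(s, T, member) for s in Q}
    for s in Q:
        for c in alphabet:
            if row(s + c, T, member) not in rows:
                return s + c
    return None


def close(Q: List[str], T: List[str], alphabet: str,
          member: Member) -> List[str]:
    """Add strings to Q until (Q, T) is closed; separability is preserved
    because every added string has a row that no existing string has."""
    Q = list(Q)
    while True:
        new = find_unclosed(Q, T, alphabet, member)
        if new is None:
            return Q
        Q.append(new)


def build_dfa(Q: List[str], T: List[str], alphabet: str, member: Member
              ) -> Tuple[Dict[Tuple[str, str], str], Set[str]]:
    """Hypothesis DFA from separable and closed (Q, T); states are strings."""
    rep = {row(s, T, member): s for s in Q}
    trans = {(s, c): rep[row(s + c, T, member)] for s in Q for c in alphabet}
    accepting = {s for s in Q if member(s)}
    return trans, accepting


def run_dfa(trans: Dict[Tuple[str, str], str], accepting: Set[str],
            s: str) -> bool:
    state = ""  # the initial state is the empty string
    for c in s:
        state = trans[(state, c)]
    return state in accepting


def even_a(s: str) -> bool:
    """Example oracle language: an even number of 'a' characters."""
    return s.count("a") % 2 == 0


Q = close([""], [""], "ab", even_a)
trans, accepting = build_dfa(Q, [""], "ab", even_a)
```

For this example language the single test string $\epsilon$ already separates the two states, so the first hypothesis is correct; in general the hypothesis is only as good as the current $T$, which is why equivalence queries are needed.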

4. V-Star for a Character-Based VPL

V-Star learns a Visibly Pushdown Automaton (VPA) using a MAT, which provides both membership and equivalence queries. For ease of exposition, we divide our discussion into two steps: in this section, we consider grammar inference for a character-based VPL, in which the tagging of call/return symbols is on individual characters; in the next section, we consider grammar inference for a token-based VPL, a more realistic setting in which the tagging is performed on tokens (sequences of characters).

For both steps, we develop an algorithm that not only infers the tagging but also constructs a VPA. We further prove that this constructed VPA achieves exact learning, meaning it recognizes the oracle language. We also provide an analysis of the algorithm’s time complexity.

This section proceeds as follows: we start with a precise problem statement in Section 4.1; we then introduce a new algorithm for learning a VPA in Section 4.2 assuming a given tagging function. Then, in Section 4.3, we study how to infer a tagging function that makes the VPA-learning algorithm terminate and achieve exact learning.

4.1. Problem Statement

V-Star seeks to infer a Visibly Pushdown Grammar (VPG) from a black-box oracle that knows a VPL. We next define the precise knowledge of the oracle and what queries it allows.

We assume $\Sigma$ is the alphabet set from which valid strings can draw characters. A VPL tags each character $i$ in $\Sigma$ as a call symbol \guilsinglleft $i$ , a return symbol $i$ \guilsinglright, or a plain symbol. This is modeled by a tagging function $t:\Sigma\to\hat{\Sigma}$ , which maps a character $i$ to either \guilsinglleft $i$ , $i$ \guilsinglright, or $i$ itself. This function extends to strings: $t(s)=t(s[1])\ldots t(s[n])$ , where $n$ is the length of $s$ and $s[j]$ is its $j$ -th character. Given a tagging function $t$ , we define the terminal set $\hat{\Sigma}_{t}$ (also denoted as $\hat{\Sigma}$ when $t$ is clear from the context) as the set of tagged characters: $\hat{\Sigma}_{t}=\Sigma_{\text{call}}\cup\Sigma_{\text{plain}}\cup\Sigma_{\text{ret}}$ , where $\Sigma_{\text{call}}$ , $\Sigma_{\text{ret}}$ , and $\Sigma_{\text{plain}}$ include call, return, and plain symbols defined by $t$ , respectively.

An oracle $\mathcal{O}$ knows a language $\mathcal{L}\subseteq\Sigma^{*}$ and a tagging function $t_{\mathcal{O}}$ such that the tagged language $\hat{\mathcal{L}}_{\mathcal{O}}=\{t_{\mathcal{O}}(s)\mid s\in\mathcal{L}\}$ is a VPL over terminal set $\hat{\Sigma}_{t_{\mathcal{O}}}$ . The oracle $\mathcal{O}$ can answer membership and equivalence queries in active learning. The oracle’s ability to answer these queries is modeled as two functions. The membership query function, $\chi_{\mathcal{L}}:\Sigma^{*}\to\{\text{True},\text{False}\}$ , is defined as follows:

$$\chi_{\mathcal{L}}(s)=\begin{cases}\text{True}&\text{if }s\in\mathcal{L},\\ \text{False}&\text{otherwise}.\end{cases}$$

That is, it returns true iff the input string $s$ belongs to $\mathcal{L}$ . Note that input strings to membership queries do not carry tags, which reflects the fact that existing oracles are typically recognizers/parsers that take untagged strings. An example oracle used in our experiments is an off-the-shelf JSON parser, which takes untagged JSON strings; the goal of V-Star is to learn the JSON grammar from this oracle. Also note that we sometimes abuse the notation and pretend that $\chi_{\mathcal{L}}$ can also take tagged strings, in which case it performs membership testing using the string after tagging is removed; i.e., for a tagging function $t$ , $\chi_{\mathcal{L}}(t(s))$ is defined as $\chi_{\mathcal{L}}(s)$ .
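A membership oracle of this kind can be sketched by wrapping a parser. The sketch below uses Python's standard `json.loads` purely as a stand-in for "an off-the-shelf JSON parser"; the paper does not say which parser was used, and the names `chi` and `untag`, as well as the choice of the characters `‹` and `›` as tag markers (assumed not to occur in $\Sigma$), are assumptions made here.

```python
import json

CALL_MARK = "\u2039"  # the left tag marker (single left angle quote)
RET_MARK = "\u203a"   # the right tag marker (single right angle quote)


def untag(s: str) -> str:
    """Remove tag markers, assuming they never occur in untagged inputs."""
    return s.replace(CALL_MARK, "").replace(RET_MARK, "")


def chi(s: str) -> bool:
    """Membership query: True iff the untagged string parses as JSON.

    Accepting tagged strings mirrors the notational shortcut in the text:
    chi(t(s)) is defined as chi(s).
    """
    try:
        json.loads(untag(s))
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
```

Note that the language accepted here is whatever the wrapped parser accepts, which may differ in corner cases from the JSON specification; the learner can only observe the parser's behavior.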

We next define the equivalence query function, which checks the equivalence between the oracle language $\mathcal{L}$ and the language defined by a hypothesis VPA $\mathcal{H}$ proposed by some learning algorithm. One complication is that the tagging function produced by the learner might be different from the oracle tagging function, even if the underlying untagged language is the same as the oracle one. This is due to the inherent flexibility of VPL tagging. As an example, suppose the oracle language is $\{(\text{\guilsinglleft{$a$}}\text{\guilsinglleft{$g$}})^{k}(\text{{$h$}\guilsinglright}\text{{$b$}\guilsinglright})^{k}\mid k\geq 0\}$ , then its underlying untagged language is the same as the untagged language of $\{(\text{\guilsinglleft{$a$}}{g})^{k}({h}\text{{$b$}\guilsinglright})^{k}\mid k\geq 0\}$ , which tags only $a$ and $b$ , or of $\{(a\text{\guilsinglleft{$g$}})^{k}(\text{{$h$}\guilsinglright}b)^{k}\mid k\geq 0\}$ , which tags only $g$ and $h$ .
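The flexibility of tagging in this example can be checked mechanically. The sketch below tags the same untagged string $(ag)^{2}(hb)^{2}$ in the three ways mentioned and verifies that each tagged version is well nested. The representation of a tagged string as a list of `(character, kind)` pairs and the helper names are assumptions of this sketch; `well_matched` checks nesting depth only, not which return closes which call.

```python
from typing import List, Set, Tuple

Tagged = List[Tuple[str, str]]  # (character, kind), kind in call/ret/plain


def tag(s: str, calls: Set[str], rets: Set[str]) -> Tagged:
    """Apply a character-based tagging function to a string."""
    return [(c, "call" if c in calls else "ret" if c in rets else "plain")
            for c in s]


def untag(tagged: Tagged) -> str:
    return "".join(c for c, _ in tagged)


def well_matched(tagged: Tagged) -> bool:
    """True iff calls and returns nest properly (depth check only)."""
    depth = 0
    for _, kind in tagged:
        if kind == "call":
            depth += 1
        elif kind == "ret":
            depth -= 1
            if depth < 0:      # a return with no pending call
                return False
    return depth == 0          # no call left pending


s = "ag" * 2 + "hb" * 2        # the case k = 2
taggings = [
    ({"a", "g"}, {"h", "b"}),  # oracle tagging: a, g calls; h, b returns
    ({"a"}, {"b"}),            # tags only a and b
    ({"g"}, {"h"}),            # tags only g and h
]
results = [well_matched(tag(s, calls, rets)) for calls, rets in taggings]
```

All three taggings yield well-matched strings over the same untagged text, whereas a tagging such as "h is a call, a is a return" fails immediately because the first character would be an unmatched return.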

Since what is relevant is the underlying untagged language, we should allow a learner to learn a different tagging function. Therefore we assume that the learner produces a hypothesis Visibly Pushdown Automaton (VPA) $\mathcal{H}$ , as well as a hypothesis tagging function $t_{\mathcal{H}}$ . The learner should achieve exact learning, formally stated as $\forall s\in\Sigma^{*},\;\chi_{\mathcal{L}}(s)=\chi_{(\mathcal{H},t_{\mathcal{H}})}(s)$ , where

$$\chi_{(\mathcal{H},t_{\mathcal{H}})}(s)=\begin{cases}\text{True}&\text{if }t_{\mathcal{H}}(s)\text{ is accepted by }\mathcal{H},\\ \text{False}&\text{otherwise}.\end{cases}$$

Now the equivalence query function $\mathcal{E}$ is defined as follows: $\mathcal{E}(\mathcal{H},t_{\mathcal{H}})$ returns none when the oracle language is equivalent to the untagged language recognized by $\mathcal{H}$ and otherwise returns some $s$ such that $\chi_{\mathcal{L}}(s)\neq\chi_{(\mathcal{H},t_{\mathcal{H}})}(s)$ .
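When no true equivalence oracle is available, $\mathcal{E}$ can only be approximated by comparing $\chi_{\mathcal{L}}$ and $\chi_{(\mathcal{H},t_{\mathcal{H}})}$ on finitely many strings. The sketch below shows the general shape of such an approximation. It is not V-Star's sampling procedure: the candidate pool (seed strings plus their single-character deletions and insertions), the `budget` parameter, and all names are assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence

Chi = Callable[[str], bool]


def neighbors(s: str, alphabet: str) -> List[str]:
    """All strings obtained from s by one deletion or one insertion."""
    out = [s[:i] + s[i + 1:] for i in range(len(s))]
    out += [s[:i] + c + s[i:] for i in range(len(s) + 1) for c in alphabet]
    return out


def equivalence_query(chi_oracle: Chi, chi_hypothesis: Chi,
                      seeds: Sequence[str], alphabet: str,
                      budget: int = 1000,
                      rng: Optional[random.Random] = None) -> Optional[str]:
    """Return a string on which oracle and hypothesis disagree, else None.

    None means only "no disagreement found among the tested strings",
    not a proof of equivalence.
    """
    rng = rng or random.Random(0)
    pool = [n for s in seeds for n in neighbors(s, alphabet)]
    if len(pool) > budget:          # sample when the pool is too large
        pool = rng.sample(pool, budget)
    for s in list(seeds) + pool:    # seeds are always tested first
        if chi_oracle(s) != chi_hypothesis(s):
            return s
    return None


def balanced(s: str) -> bool:
    """Example oracle: balanced strings over '(' and ')'."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0
```

A hypothesis that accepts everything is refuted by any unbalanced neighbor of a seed, while a hypothesis identical to the oracle yields `None`; a wrong hypothesis that happens to agree on the whole pool would go undetected, which is the inherent limitation of simulating $\mathcal{E}$ with membership queries.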

V-Star’s active learning goal is, with an oracle that provides membership and equivalence queries, to learn a tagging function $t$ and a VPA $\mathcal{H}$ so that exact learning is achieved.

The Unique Pairing assumption for oracle languages

To simplify the tagging inference algorithm that will be discussed in Section 4.3, we assume that in the oracle VPL $\hat{\mathcal{L}}_{\mathcal{O}}=\{t_{\mathcal{O}}(s)\mid s\in\mathcal{L}\}$ , a call symbol is uniquely paired with a return symbol; i.e., if \guilsinglleft $a$ is matched with $b$ \guilsinglright in one sentence, then \guilsinglleft $a$ can be matched with only $b$ \guilsinglright in every sentence of the language. This assumption simplifies our algorithm design, and is satisfied by languages we experimented with (e.g., XML and JSON). We now represent the pairs $(a,b)$ of a tagging function $t$ as a tagging $T\subseteq\Sigma\times\Sigma$ , where $t(a)=\text{\guilsinglleft{$a$}}$ and $t(b)=\text{{$b$}\guilsinglright}$ for each $(a,b)\in T$ . Algorithm 3 could in principle be adjusted to operate without this assumption, but at a significant cost in efficiency.
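The Unique Pairing assumption can be checked on a finite sample of tagged sentences by matching calls and returns with a stack and recording which return closes which call. The sketch below does this for the call-to-return direction stated above; the `(character, kind)` representation and the function names are assumptions of this illustration, and a finite sample can of course only refute the assumption, never establish it for the whole language.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Tagged = List[Tuple[str, str]]  # (character, kind), kind in call/ret/plain


def tag(s: str, calls: str, rets: str) -> Tagged:
    """Tag each character as call, ret, or plain."""
    return [(c, "call" if c in calls else "ret" if c in rets else "plain")
            for c in s]


def matched_pairs(tagged: Tagged) -> Set[Tuple[str, str]]:
    """Pairs (call char, return char) matched in a well-matched string."""
    stack: List[str] = []
    pairs: Set[Tuple[str, str]] = set()
    for ch, kind in tagged:
        if kind == "call":
            stack.append(ch)
        elif kind == "ret":
            if not stack:
                raise ValueError("unmatched return symbol")
            pairs.add((stack.pop(), ch))
    if stack:
        raise ValueError("unmatched call symbol")
    return pairs


def unique_pairing(sentences: Iterable[Tagged]
                   ) -> Optional[Set[Tuple[str, str]]]:
    """Return the tagging as a set of pairs, or None if some call symbol
    is matched with two different return symbols in the sample."""
    partner: Dict[str, str] = {}
    for tagged in sentences:
        for a, b in matched_pairs(tagged):
            if partner.setdefault(a, b) != b:
                return None
    return set(partner.items())


json_like = [tag("{[]}", "{[", "}]"), tag("[{}{}]", "{[", "}]")]
violating = [tag("()", "(", ")]"), tag("(]", "(", ")]")]
```

On the JSON-like sample the result is the tagging with the two pairs for braces and brackets; on the second sample the call symbol `(` is closed once by `)` and once by `]`, so the check reports a violation.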

4.2. Learning VPA with Known Tagging

This subsection outlines an algorithm for learning a Visibly Pushdown Automaton (VPA) using a MAT, assuming a tagging function $t$ as input. $\hat{\Sigma}_{t}$ is the tagged alphabet according to $t$ and $\hat{\mathcal{L}}_{t}=\{t(s)\mid s\in\mathcal{L}\}$ is the oracle language $\mathcal{L}$ tagged with $t$ . We assume that $t$ makes every string in $\hat{\mathcal{L}}_{t}$ well-matched. To avoid clutter, we will write $\hat{\Sigma}$ for $\hat{\Sigma}_{t}$ and $\hat{\mathcal{L}}$ for $\hat{\mathcal{L}}_{t}$ in this subsection. While there were prior VPA-learning algorithms proposed under this setting, some required more information from the oracle, such as the stack content during VPA execution (Michaliszyn and Otop, 2022). Isberner (2015) used advanced discrimination tree structures to minimize the number of membership queries; however, both Isberner (2015) and Howar (2012) discussed that discrimination tree-based algorithms could significantly raise the number of equivalence queries. Since in our implementation we simulate equivalence queries using membership queries (see Section 6), increasing the number of equivalence queries would escalate the simulation effort.

In this section, we introduce a VPA learning algorithm based on $k$ -SEVPA (Alur et al., 2005; Kumar et al., 2007) and demonstrate its polynomial-time efficiency in Theorem 4.7. Although the concept of polynomial-time VPA learning has been previously explored, as in Isberner (2015)’s TTT-VPA, our approach differs by adopting a table-based methodology, inspired by the clarity and directness of the $L^{*}$ algorithm (Angluin, 1987). This shift not only simplifies the presentation but also makes it easy to interface with tag-inference algorithms that we will discuss later in this paper.

Our algorithm is outlined in Algorithm 1. At every iteration, it maintains a set of separable and closed equivalence states and test strings in $\mathcal{Q}$ . The current states are used to produce a hypothesis VPA through $\mathbf{constructVPA}(\mathcal{Q})$ . It then queries the oracle using an equivalence query. If the query does not produce a counterexample, then the iterative process terminates with the hypothesis VPA as the result; otherwise, the returned counterexample is tagged through the assumed tagging function $t$ and employed to refine the current set of equivalence states and test strings, through $\mathbf{update}(-,-,-)$ and $\mathbf{close}(-,-)$ . Next we describe $\mathbf{constructVPA}(-)$ , $\mathbf{close}(-,-)$ , and $\mathbf{update}(-,-,-)$ , starting with some background information.

Input: Oracle $\mathcal{O}$ with membership queries and equivalence queries $\mathcal{E}$ , tagging function $t$ , terminals $\hat{\Sigma}=\Sigma_{\text{call}}\cup\Sigma_{\text{ret}}\cup\Sigma_{\text{plain}}$

Output: Learned VPA $\mathcal{H}_{\mathcal{Q}}$

Initialize $Q_{i}$ to $\{\epsilon\}$ for $i=0..|\Sigma_{\text{call}}|$ , $C_{0}$ to $\{\epsilon\}$ , and $C_{j}$ to $\left\{\left(\text{\guilsinglleft{$a$}}_{j},\text{{$b$}\guilsinglright}\right)\mid\text{{$b$}\guilsinglright}\in\Sigma_{\text{ret}}\right\}$ for $j=1..|\Sigma_{\text{call}}|$ ;

$\mathcal{Q}\leftarrow\mathbf{close}(\mathcal{O},\mathcal{Q})$ ;

while $\mathcal{E}\left(\mathbf{constructVPA}(\mathcal{Q})\right)$ produces a counterexample $s$ do

    $\mathcal{Q}^{\prime}\leftarrow\mathbf{update}(\mathcal{O},\mathcal{Q},t(s))$ ;

    $\mathcal{Q}\leftarrow\mathbf{close}(\mathcal{O},\mathcal{Q}^{\prime})$ ;

end while

return $\mathbf{constructVPA}(\mathcal{Q})$ ;

Algorithm 1. The $\mathbf{learn}(\mathcal{O},t,\hat{\Sigma})$ function that learns a VPA from a MAT.
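The control flow of Algorithm 1 can be sketched as a generic loop in which the three subroutines are passed in as functions. Only the loop structure below follows the algorithm; the Python signature, the choice to pass the oracle through closures rather than as an explicit argument, and the toy instantiation at the end (where the "table" is simply a set of strings) are assumptions made to keep the sketch self-contained and runnable.

```python
from typing import Callable, Optional, TypeVar

Table = TypeVar("Table")        # stands for the data structure Q
Hypothesis = TypeVar("Hypothesis")


def learn(equiv_query: Callable[[Hypothesis], Optional[str]],
          tag: Callable[[str], str],
          construct_vpa: Callable[[Table], Hypothesis],
          close: Callable[[Table], Table],
          update: Callable[[Table, str], Table],
          initial_table: Table) -> Hypothesis:
    """Refine the table until the equivalence query finds no counterexample."""
    table = close(initial_table)
    while True:
        hypothesis = construct_vpa(table)
        counterexample = equiv_query(hypothesis)
        if counterexample is None:
            return hypothesis
        # The counterexample is untagged; tag it before refining the table.
        table = close(update(table, tag(counterexample)))


# Toy instantiation that exercises only the control flow: the "table" is a
# set of strings, and the "hypothesis" is that set frozen. It is not a VPA.
target = {"", "ab", "aabb"}
learned = learn(
    equiv_query=lambda hyp: next(
        (s for s in sorted(target) if s not in hyp), None),
    tag=lambda s: s,
    construct_vpa=frozenset,
    close=lambda table: table,
    update=lambda table, s: table | {s},
    initial_table=set(),
)
```

In the toy run each equivalence query returns one missing string, so the loop makes one iteration per element of `target` before the final query returns `None`.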

4.2.1. Background: $k$ -SEVPA and Congruence Relations

Unlike regular languages, a VPL may not have a unique minimum-state deterministic pushdown recognizer. Nonetheless, partitioning the call symbols into $k$ distinct groups and mandating the following ensures the existence of a unique minimal VPA: (1) the states are partitioned into a set of $k+1$ modules (each is a set of states), with the $0$ -th module as the base module with the initial state and the $i$ -th module for the $i$ -th group of call symbols with $i\in[1..k]$ ; (2) the machine stays in the same module when encountering a plain symbol; the machine transitions to a unique entry state in the $i$ -th module when encountering a call symbol from the $i$ -th group; the machine transitions back to the caller module when encountering a return symbol. Such a VPA is known as a $k$ -Single Entry VPA ( $k$ -SEVPA (Alur et al., 2005)) and is similar to a control-flow graph with the $0$ -th module for the main function and the $i$ -th module for the $i$ -th function in a program.

The minimal $k$ -SEVPA can be defined with a set of congruence relations.

Definition 4.1 (Congruence Relations for the Minimal $k$-SEVPA (Alur et al., 2005)).

Given a VPL $\hat{\mathcal{L}}$ over $\hat{\Sigma}$, let $\Sigma_{\text{call}}^{i}$, $i\in[1..k]$, represent the $i$-th group of call symbols. Given well-matched strings $s_{1}$ and $s_{2}$, we introduce $k+1$ congruence relations:

(1) $s_{1}\sim_{0}s_{2}\text{ iff }\forall w\in\hat{\Sigma}^{*},\ s_{1}w\in\hat{\mathcal{L}}\iff s_{2}w\in\hat{\mathcal{L}};$
(2) $s_{1}\sim_{i}s_{2}\text{ iff }\forall w,w^{\prime}\in\hat{\Sigma}^{*},\ \forall\text{\guilsinglleft{$a$}}\in\Sigma_{\text{call}}^{i},\ w\text{\guilsinglleft{$a$}}s_{1}w^{\prime}\in\hat{\mathcal{L}}\iff w\text{\guilsinglleft{$a$}}s_{2}w^{\prime}\in\hat{\mathcal{L}},\text{ for }i\in[1..k].$

Note that $\sim_{0}$ is the Myhill-Nerode right congruence and can be used to construct the minimal DFA for a regular language. For $\sim_{i}$ when $i\in[1..k]$, the context strings assume specialized forms: the left context string ends with a call symbol from the $i$-th group, and the string $w\text{\guilsinglleft{$a$}}w^{\prime}$ is a well-matched string, since both $w\text{\guilsinglleft{$a$}}s_{1}w^{\prime}\in\hat{\mathcal{L}}$ and $s_{1}$ are well matched. From the above congruence relations, we can construct the minimal $k$-SEVPA: the equivalence classes of $\sim_{i}$ become the states of the $i$-th module, with $[\epsilon]_{\sim_{i}}$ being the unique entry state of the $i$-th module; transition edges can also be added (see Alur et al. (2005)): e.g., $[s]_{\sim_{i}}$ transitions to $[sc]_{\sim_{i}}$ for plain symbol $c$, and to $[\epsilon]_{\sim_{j}}$ for call symbol $\text{\guilsinglleft{$a$}}\in\Sigma_{\text{call}}^{j}$.

In our algorithm, we set $k$ to be the number of call symbols decided by the input tagging function so that each call symbol is in its own group. We write $\text{\guilsinglleft{$a$}}_{i}$ for the $i$-th call symbol. This partitioning is practical because call symbols often fulfill diverse roles and hence appear in separate contexts. Further, Proposition 2 in Alur et al. (2005) states that enlarging $k$ can lead to a more compact VPA.

4.2.2. Access Words and Test Words

At each step, V-Star maintains $\mathcal{Q}$ , which contains

(1)

a set of $k+1$ modules $Q_{0}$ to $Q_{k}$ , each containing empty string $\epsilon$ and a set of well-matched access words in $\hat{\Sigma}^{*}$ ,
(2)

and a set of test words $C_{0}$ to $C_{k}$ , with $C_{0}$ containing strings in $\hat{\Sigma}^{*}$ for testing $Q_{0}$ and $C_{i}$ containing strings in the form of $(w\text{\guilsinglleft{$a$}}_{i},w^{\prime})$ for testing $Q_{i}$ , where $w$ and $w^{\prime}$ are in $\hat{\Sigma}^{*}$ , $\text{\guilsinglleft{$a$}}_{i}$ is the $i$ -th call symbol, and $w\text{\guilsinglleft{$a$}}_{i}w^{\prime}$ is well matched.

Given test words $C_{i\in[0..k]}$, two well-matched strings $q_{1}$ and $q_{2}$ are $C_{i}$-equivalent, denoted as $q_{1}\sim_{C_{i}}q_{2}$, if (1) when $i=0$, $\forall w\in C_{0}$, $q_{1}w\in\hat{\mathcal{L}}$ iff $q_{2}w\in\hat{\mathcal{L}}$; and (2) when $i\in[1..k]$, $\forall(w\text{\guilsinglleft{$a$}},w^{\prime})\in C_{i}$, $w\text{\guilsinglleft{$a$}}q_{1}w^{\prime}\in\hat{\mathcal{L}}$ iff $w\text{\guilsinglleft{$a$}}q_{2}w^{\prime}\in\hat{\mathcal{L}}$. These are essentially the same equivalence relations as those in Definition 4.1, relative to test words in $C_{i}$.

We define the following two properties of $\mathcal{Q}=\{(Q_{i},C_{i})\,\mid\,i\in[0..k]\}$ :

(1)

Separability: no two distinct strings in $Q_{i}$ are $C_{i}$-equivalent, meaning $\forall q,q^{\prime}\in Q_{i},\ q\neq q^{\prime}\;\Rightarrow\;q\not\sim_{C_{i}}q^{\prime}$.
(2)

Closedness: for every $q\in Q_{i}$ and $m\in\Sigma_{M}$ (defined below), there is some $q^{\prime}\in Q_{i}$ such that $qm\sim_{C_{i}}q^{\prime}$ .
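The $C_{i}$-equivalence test and the separability property can be sketched with membership queries only. The following Python sketch is illustrative and not the authors' implementation: the toy language $L\to(\,L\,)\,L\mid c\,d\,L\mid\epsilon$, the function names `member`, `c_equivalent`, and `separable`, and the dictionary layout of `Q` and `C` are all assumptions made for this example.

```python
from itertools import combinations

def member(s: str) -> bool:
    """Hypothetical membership oracle for L -> ( L ) L | c d L | epsilon."""
    depth, need_d = 0, False
    for ch in s:
        if need_d:                      # a 'c' must be followed by 'd'
            if ch != "d":
                return False
            need_d = False
        elif ch == "c":
            need_d = True
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        else:
            return False
    return depth == 0 and not need_d

def c_equivalent(i, q1, q2, C):
    """Check q1 ~_{C_i} q2 using membership queries on the test words in C[i]."""
    if i == 0:
        return all(member(q1 + w) == member(q2 + w) for w in C[0])
    # For i > 0, each test is a pair (w + call_i, w').
    return all(member(l + q1 + r) == member(l + q2 + r) for l, r in C[i])

def separable(Q, C):
    """True if no two distinct access words of any module are C_i-equivalent."""
    return all(
        not c_equivalent(i, q1, q2, C)
        for i in Q
        for q1, q2 in combinations(Q[i], 2)
    )

C = {0: [""], 1: [("(", ")")]}          # initial tests, as in Algorithm 1
C2 = {0: ["", "d"], 1: [("(", ")")]}    # one extra test word for module 0
print(c_equivalent(0, "c", "d", C))     # both rejected by the empty test
print(c_equivalent(0, "c", "d", C2))    # the test "d" tells them apart
print(separable({0: ["", "c"], 1: [""]}, C))
```

With only the empty test word, "c" and "d" look alike because both are rejected; adding the test word "d" separates them, which is the role that counterexample-derived tests play later in the section.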

Definition 4.2 (Nested Words and $\Sigma_{M}$).

Given $\hat{\Sigma}=\Sigma_{\text{call}}\cup\Sigma_{\text{plain}}\cup\Sigma_{\text{ret}}$, along with $(Q_{i},C_{i})_{i\in[1..k]}$, we define the nested words for $(Q_{i},C_{i})$, denoted as $M_{i}$, as

M_{i}=\{\text{\guilsinglleft{$a$}}_{i}q\text{{$b$}\guilsinglright}\mid q\in Q_{i},\text{{$b$}\guilsinglright}\in\Sigma_{\text{ret}}\},

where $\text{\guilsinglleft{$a$}}_{i}$ is the $i$-th call symbol. We define $\Sigma_{M}=\bigcup_{i}M_{i}\cup\hat{\Sigma}$.

Our learning algorithm is then based on the following set of propositions.

Definition 4.3 ($\mathbf{constructVPA}(\mathcal{Q})$ function).

For separable and closed $\mathcal{Q}=\{(Q_{i},C_{i})\,\mid\,i\in[0..k]\}$, we can construct a hypothesis $k$-SEVPA, denoted as $\mathcal{H}$, as follows. The set of states of $\mathcal{H}$ is $\bigcup_{i\in[0..k]}Q_{i}$. We write $q\in Q_{i}$ as $[q]_{i}$. The initial state is $[\epsilon]_{0}$. Define the set of accepting states, $Q_{F}$, to be $\{[q]_{0}\mid q\in Q_{0}\cap\hat{\mathcal{L}}\}$, which can be constructed via membership queries. The transition function $\delta$ from the current state $[q]_{i}$, $i\in[0..k]$, and the next input symbol is defined as follows:

(1)

For plain symbol $c$ , the transition is $[q]_{i}\xrightarrow{c}[q^{\prime}]_{i}$ , where $q^{\prime}\sim_{C_{i}}qc$ .
(2)

For call symbol $\text{\guilsinglleft{$a$}}_{j}$, the transition is $[q]_{i}\xrightarrow{\text{\guilsinglleft{$a$}}_{j},\text{ push }([q]_{i},\text{\guilsinglleft{$a$}}_{j})}[\epsilon]_{j}$, the unique entry state for module $j$.
(3)

For return symbol $\text{{$b$}\guilsinglright}$, the transition is $[q]_{i}\xrightarrow{\text{{$b$}\guilsinglright},\text{ pop }([q^{\prime}]_{j},\text{\guilsinglleft{$a$}}_{i})}[q^{\prime\prime}]_{j}$, where $q^{\prime\prime}\sim_{C_{j}}q^{\prime}\text{\guilsinglleft{$a$}}_{i}q\text{{$b$}\guilsinglright}$.

The target state in each transition exists by closedness and is unique by separability. To run the VPA on a string $s$ , start with the initial state $[\epsilon]_{0}$ and an empty stack and use $\delta$ for transitions. The automaton accepts a string if it terminates in a configuration with a state within $Q_{F}$ and an empty stack.
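To make the run of such a hypothesis automaton concrete, the sketch below executes a small hand-written 1-SEVPA on tagged strings. It is only an illustration under stated assumptions: the toy language $L\to(\,L\,)\,L\mid c\,d\,L\mid\epsilon$, the state encoding `(module, access_word)`, and the tables `CALLS`, `PLAIN`, `RET`, and `ACCEPTING` are hypothetical and are not produced by the paper's tool. Missing table entries stand for transitions into a rejecting sink.

```python
# States are written (module, access word), mirroring the notation [q]_i.
S0, S0C, S1, S1C = (0, ""), (0, "c"), (1, ""), (1, "c")

CALLS = {"(": 1}                                   # call symbol -> callee module
PLAIN = {(S0, "c"): S0C, (S0C, "d"): S0,           # plain moves stay in the module
         (S1, "c"): S1C, (S1C, "d"): S1}
RET = {(S1, ")", (S0, "(")): S0,                   # (state, return, popped frame) -> state
       (S1, ")", (S1, "(")): S1}
ACCEPTING = {S0}

def run_vpa(s: str) -> bool:
    """Run the toy automaton from [epsilon]_0 with an empty stack."""
    state, stack = S0, []
    for ch in s:
        if ch in CALLS:
            stack.append((state, ch))              # push ([q]_i, call symbol)
            state = (CALLS[ch], "")                # jump to the unique entry state
        elif (state, ch) in PLAIN:
            state = PLAIN[(state, ch)]
        elif stack and (state, ch, stack[-1]) in RET:
            state = RET[(state, ch, stack.pop())]  # return to the caller module
        else:
            return False                           # undefined move: reject
    return state in ACCEPTING and not stack

for word in ["", "(cd)cd", "(c)", "(()"]:
    print(repr(word), run_vpa(word))
```

The acceptance condition matches the text: the run must end in a state of $Q_{F}$ (here only $[\epsilon]_{0}$) with an empty stack.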

Proposition 4.4.

If $\mathcal{Q}=\{(Q_{i},C_{i})\,\mid\,i\in[0..k]\}$ is separable and language $\hat{\mathcal{L}}=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL, then the number of states in $\mathbf{constructVPA}(\mathcal{Q})$ is bounded above by the number of states in the minimal $k$ -SEVPA for VPL $\hat{\mathcal{L}}$ .

Proof.

For two strings $s_{1}$ and $s_{2}$, if $s_{1}\sim_{i}s_{2}$ (Definition 4.1), then $s_{1}\sim_{C_{i}}s_{2}$. Hence, the number of equivalence classes of $\sim_{C_{i}}$ is no greater than that of $\sim_{i}$, which corresponds to the number of states of the $i$-th module in the minimal $k$-SEVPA. Further, since $Q_{i}$ is separable, each element of $Q_{i}$ corresponds to a unique equivalence class of $\sim_{C_{i}}$. Therefore, $|Q_{i}|$ is bounded above by the number of equivalence classes of $\sim_{C_{i}}$, which is bounded above by the number of states of the $i$-th module in the minimal $k$-SEVPA. Since the number of states in $\mathbf{constructVPA}(\mathcal{Q})$ is $\sum_{i\in[0..k]}|Q_{i}|$, it is upper bounded by the number of states in the minimal $k$-SEVPA for $\hat{\mathcal{L}}$. ∎

Proposition 4.5.

If $\mathcal{Q}=\{(Q_{i},C_{i})\,\mid\,i\in[0..k]\}$ is separable but not closed, then using membership queries one can find $i$ and $q\in\hat{\Sigma}^{*}\setminus Q_{i}$ such that $(Q_{i}\cup\{q\},C_{i})$ and the rest $(Q_{j},C_{j})_{j\neq i}$ remain separable.

Proof.

Since $(Q_{i},C_{i})_{i\in[0..k]}$ is not closed, there exist some $i$, $q\in Q_{i}$, and $m\in\Sigma_{M}$ such that $qm$ is not $C_{i}$-equivalent to any state in $Q_{i}$. Using membership queries (by enumerating all test strings in $C_{i}$) we can find such a $q$ and $m$, and then add $qm$ to $Q_{i}$, which remains separable by construction. ∎

Algorithm 2 outlines the $\mathbf{close}(\mathcal{O},\mathcal{Q})$ function, which keeps applying Proposition 4.5 until $\mathcal{Q}$ becomes separable and closed.

Input: Oracle $\mathcal{O}$ and separable $\mathcal{Q}=\{(Q_{i},C_{i})\,\mid\,i\in[0..k]\}$

Output: Separable and closed $\mathcal{Q}^{\prime}$

Initialize $\Sigma_{M}\leftarrow\bigcup_{i=1..k}\{\text{\guilsinglleft{$a$}}_{i}q\text{{$b$}\guilsinglright}\mid q\in Q_{i},\text{{$b$}\guilsinglright}\in\Sigma_{\text{ret}}\}\cup\hat{\Sigma}$;
Initialize the work list $W\leftarrow\{(q,i,m)\mid q\in Q_{i},i\in 0..k,m\in\Sigma_{M}\}$;
while $W$ is not empty do
    Take $(q,i,m)$ from $W$;
    if $\forall q^{\prime}\in Q_{i}$, $qm\not\sim_{C_{i}}q^{\prime}$ then
        $Q_{i}\leftarrow Q_{i}\cup\{qm\}$;
        $W\leftarrow W\cup\{(qm,i,m^{\prime})\,\mid\,m^{\prime}\in\Sigma_{M}\}$;
        if $i>0$ then
            $\Sigma_{M}\leftarrow\Sigma_{M}\cup\{\text{\guilsinglleft{$a$}}_{i}qm\text{{$b$}\guilsinglright}\mid\text{{$b$}\guilsinglright}\in\Sigma_{\text{ret}}\}$;
            $W\leftarrow W\cup\{(q^{\prime\prime},j,\text{\guilsinglleft{$a$}}_{i}qm\text{{$b$}\guilsinglright})\,\mid\,q^{\prime\prime}\in Q_{j},j\in[0..k],\text{{$b$}\guilsinglright}\in\Sigma_{\text{ret}}\}$;
        end if
    end if
end while
return $\{(Q_{i},C_{i})\,\mid\,i\in[0..k]\}$;

Algorithm 2 The $\mathbf{close}(\mathcal{O},\mathcal{Q})$ function.
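The work-list structure of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the oracle `member` is a hypothetical recognizer for the toy language $L\to(\,L\,)\,L\mid c\,d\,L\mid\epsilon$, and the names `c_equiv` and `close` and the dictionary layout are invented for the example. One deliberate simplification is labeled here and in the code: the demo passes only the plain symbols as single-symbol extensions so that every access word stays well matched, whereas Algorithm 2 as written starts from all of $\hat{\Sigma}$; the same function accepts a larger symbol list.

```python
from collections import deque

def member(s):
    """Hypothetical membership oracle for L -> ( L ) L | c d L | epsilon."""
    depth, need_d = 0, False
    for ch in s:
        if need_d:
            if ch != "d":
                return False
            need_d = False
        elif ch == "c":
            need_d = True
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        else:
            return False
    return depth == 0 and not need_d

def c_equiv(member, C, i, q1, q2):
    """q1 ~_{C_i} q2, decided with membership queries only."""
    if i == 0:
        return all(member(q1 + w) == member(q2 + w) for w in C[0])
    return all(member(l + q1 + r) == member(l + q2 + r) for l, r in C[i])

def close(member, Q, C, calls, rets, symbols):
    """Extend the separable Q (in place) until it is also closed."""
    # Sigma_M: single-symbol extensions plus nested words call_i q ret.
    sigma_m = list(symbols) + [calls[i] + q + b
                               for i in calls for q in Q[i] for b in rets]
    work = deque((q, i, m) for i in Q for q in Q[i] for m in sigma_m)
    while work:
        q, i, m = work.popleft()
        qm = q + m
        if any(c_equiv(member, C, i, qm, q2) for q2 in Q[i]):
            continue                      # qm already has a representative
        Q[i].append(qm)                   # new access word for module i
        work.extend((qm, i, m2) for m2 in sigma_m)
        if i > 0:                         # new nested words extend Sigma_M
            nested = [calls[i] + qm + b for b in rets]
            sigma_m.extend(nested)
            work.extend((q2, j, n) for j in Q for q2 in Q[j] for n in nested)
    return Q

C = {0: [""], 1: [("(", ")")]}
# Assumption of this demo: only plain symbols are used as single-symbol extensions.
Q = close(member, {0: [""], 1: [""]}, C, calls={1: "("}, rets=[")"], symbols=["c", "d"])
print(Q)  # {0: ['', 'c'], 1: ['', 'c']}
```

Each module gains the access word "c", the representative of all strings rejected by the current tests. The loop terminates because a word is added only when it opens a new $\sim_{C_{i}}$ class, and the number of such classes is finite for a finite $C_{i}$.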

Proposition 4.6.

Suppose that $\mathcal{Q}=\{(Q_{i},C_{i})\,\mid\,i\in[0..k]\}$ is separable and closed, and let $\mathcal{H}$ be the hypothesis automaton (Definition 4.3). Suppose the oracle returns a counterexample $s$ for an equivalence query with $\mathcal{H}$. Using $\log|s|$ membership queries, one can find $i\in[1..k]$ and $q\in\hat{\Sigma}^{*}\setminus Q_{i}$ and $(w\text{\guilsinglleft{$a$}}_{i},w^{\prime})\in\hat{\Sigma}^{*}\Sigma_{\text{call}}\times\hat{\Sigma}^{*}$ such that $(Q_{i}\cup\{q\},C_{i}\cup\{(w\text{\guilsinglleft{$a$}}_{i},w^{\prime})\})$ is separable, or find $w\in\hat{\Sigma}^{*}$ when $i=0$ such that $(Q_{0}\cup\{q\},C_{0}\cup\{w\})$ is separable.

Proof.

Let $n$ be the length of $s$ . Let $q_{0}=[\epsilon]_{0}$ be the initial state of $\mathcal{H}$ , and $\delta$ be the transition function of $\mathcal{H}$ . For $i=1,\ldots,n$ , define $q_{i}=\delta(q_{0},s[1]\ldots s[i])$ to be the state in $Q_{j}$ reached by $\mathcal{H}$ after reading the prefix $s[1]\ldots s[i]$ of $s$ , and define $T_{i}$ as the corresponding stack. For convenience, we write $[q_{i}]_{j}$ for the state $q_{i}$ in module $j$ .

We define the context of $q_{i}$ as follows. When $T_{i}$ is empty, we define the context to be $(\epsilon,s[i+1]\ldots s[n])$. Otherwise, let $T_{i}=(q_{j_{n^{\prime}}},\text{\guilsinglleft{$a$}}_{j_{n^{\prime}}})\cdots(q_{j_{1}},\text{\guilsinglleft{$a$}}_{j_{1}})\cdot\bot$. We define the context as

(q_{j_{1}}\text{\guilsinglleft{$a$}}_{j_{1}}\dots q_{j_{n^{\prime}}}\text{\guilsinglleft{$a$}}_{j_{n^{\prime}}},s[i+1]\ldots s[n]).

We denote the context of $q_{i}$ as $(w_{i},w_{i}^{\prime})$ . We say that state $q_{i}$ is correct if $\chi_{\mathcal{L}}(w_{i}q_{i}w_{i}^{\prime})=\chi_{\mathcal{L}}(s)$ .

State $q_{0}=[\epsilon]_{0}$ is obviously correct since its context is $(\epsilon,s)$ . However, state $q_{n}$ must be incorrect because of the following. First, state $q_{n}$ must be in module $0$ : if the counterexample $s$ is accepted by $\mathcal{H}$ , then $s$ is well-matched under $t$ ; otherwise, the counterexample $s$ is in $\hat{\mathcal{L}}$ , which is a well-matched language. Therefore, we write $[q_{n}]_{0}$ . Next, since $s$ is a counterexample, we have $\chi_{(\mathcal{H},t)}(s)\neq\chi_{\mathcal{L}}(s)$ . By the construction of $\mathcal{H}$ , we have $\chi_{(\mathcal{H},t)}(s)=\chi(q_{n})$ . Therefore, we have $\chi(q_{n})\neq\chi_{\mathcal{L}}(s)$ , which means state $[q_{n}]_{0}$ is incorrect.

We can then use binary search (using $\log|s|$ membership queries) to find $i$ such that $[q_{i}]_{j}$ is correct, while $[q_{i+1}]_{j^{\prime}}$ is incorrect. We first show that $s[i+1]$ cannot be a call symbol. Otherwise, we must have $q_{i+1}=\epsilon$ and $T_{i+1}=(q_{i},s[i+1])\cdots(q_{j_{1}},\text{\guilsinglleft{$a$}}_{j_{1}})\cdot\bot$. The context of $q_{i+1}$ is $(q_{j_{1}}\text{\guilsinglleft{$a$}}_{j_{1}}\dots q_{i}s[i+1],s[i+2]\ldots s[n])$. We have

\chi_{\mathcal{L}}(w_{i+1}q_{i+1}w_{i+1}^{\prime})=\chi_{\mathcal{L}}(q_{j_{1}}\text{\guilsinglleft{$a$}}_{j_{1}}\dots q_{i}s[i+1]s[i+2]\ldots s[n])=\chi_{\mathcal{L}}(w_{i}q_{i}w_{i}^{\prime})=\chi_{\mathcal{L}}(s),

but $q_{i+1}$ is incorrect, that is, $\chi_{\mathcal{L}}(w_{i+1}q_{i+1}w_{i+1}^{\prime})\neq\chi_{\mathcal{L}}(s)$, a contradiction.

Assume $s[i+1]$ is a plain symbol. Since $q_{i}$ is a state in module $j$, we let $Q_{j}^{\prime}=Q_{j}\cup\{q_{i}s[i+1]\}$ and $C_{j}^{\prime}=C_{j}\cup\{(w_{i+1},w_{i+1}^{\prime})\}$. By definition of the transition function of $\mathcal{H}$, $q_{i+1}$ is the unique element of $Q_{j}$ that is $C_{j}$-equivalent to $q_{i}s[i+1]$. On the other hand, the test $(w_{i+1},w_{i+1}^{\prime})$ distinguishes $q_{i+1}$ from $q_{i}s[i+1]$, since we can get $\chi_{\mathcal{L}}(w_{i}q_{i}w^{\prime}_{i})=\chi_{\mathcal{L}}(w_{i+1}q_{i}s[i+1]s[i+2]\ldots s[n])\neq\chi_{\mathcal{L}}(w_{i+1}q_{i+1}s[i+2]\ldots s[n])$.

Otherwise, $s[i+1]$ is a return symbol $\text{{$b$}\guilsinglright}$. Let $T_{i}=(q_{i^{\prime}},\text{\guilsinglleft{$a$}}_{j})\cdots(q_{i_{1}},\text{\guilsinglleft{$a$}}_{j_{1}})\cdot\bot$. Recall that at state $q_{i}$, $\mathcal{H}$ reads $\text{{$b$}\guilsinglright}$ and transfers to $[q_{i+1}]_{j^{\prime}}$ such that $q_{i+1}\sim_{C_{j^{\prime}}}q_{i^{\prime}}\text{\guilsinglleft{$a$}}_{j}q_{i}\text{{$b$}\guilsinglright}$. Notice that $\chi_{\mathcal{L}}(w_{i+1}q_{i^{\prime}}\text{\guilsinglleft{$a$}}_{j}q_{i}\text{{$b$}\guilsinglright}w_{i+1}^{\prime})=\chi_{\mathcal{L}}(w_{i}q_{i}w_{i}^{\prime})=\chi_{\mathcal{L}}(s)$. Therefore, $q_{i^{\prime}}\text{\guilsinglleft{$a$}}_{j}q_{i}\text{{$b$}\guilsinglright}$ is correct. We let $Q_{j^{\prime}}^{\prime}=Q_{j^{\prime}}\cup\{q_{i^{\prime}}\text{\guilsinglleft{$a$}}_{j}q_{i}\text{{$b$}\guilsinglright}\}$ and $C_{j^{\prime}}^{\prime}=C_{j^{\prime}}\cup\{(w_{i+1},w_{i+1}^{\prime})\}$. By definition of the transition function of $\mathcal{H}$, $q_{i+1}$ is the unique element of $Q_{j^{\prime}}$ that is $C_{j^{\prime}}$-equivalent to $q_{i^{\prime}}\text{\guilsinglleft{$a$}}_{j}q_{i}\text{{$b$}\guilsinglright}$. On the other hand, the test $(w_{i+1},w_{i+1}^{\prime})$ distinguishes $q_{i+1}$ from $q_{i^{\prime}}\text{\guilsinglleft{$a$}}_{j}q_{i}\text{{$b$}\guilsinglright}$. We conclude that $q_{i^{\prime}}\text{\guilsinglleft{$a$}}_{j}q_{i}\text{{$b$}\guilsinglright}\notin Q_{j^{\prime}}$, and that $(Q_{j^{\prime}}^{\prime},C_{j^{\prime}}^{\prime})$ is separable. ∎

We call the procedure in Proposition 4.6 $\mathbf{update}(\mathcal{O},\mathcal{Q},t(s))$ , which takes a separable and closed $\mathcal{Q}$ and a counterexample $s$ and returns a separable $\mathcal{Q}^{\prime}$ .
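The binary search used in the proof of Proposition 4.6 relies only on the facts that position $0$ is correct and position $n$ is incorrect. The sketch below shows that search in isolation; the predicate `correct` is a hypothetical stand-in for the membership query $\chi_{\mathcal{L}}(w_{i}q_{i}w_{i}^{\prime})=\chi_{\mathcal{L}}(s)$, and the list `flags` is made-up data used only to exercise the function.

```python
def find_breakpoint(correct, n):
    """Return i with correct(i) true and correct(i + 1) false.

    Precondition: correct(0) is true and correct(n) is false, as for the
    states q_0 and q_n in the proof.
    """
    lo, hi = 0, n                  # invariant: correct(lo) and not correct(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if correct(mid):           # one membership query per iteration
            lo = mid
        else:
            hi = mid
    return lo

# Made-up correctness profile for a counterexample of length 5.
flags = [True, True, False, True, False, False]
queries = []

def correct(i):
    queries.append(i)
    return flags[i]

i = find_breakpoint(correct, len(flags) - 1)
print(i, flags[i], flags[i + 1], len(queries))
```

The correctness profile need not be monotone: the invariant alone guarantees that the returned index is a position where a correct state is followed by an incorrect one, which is all the proof needs.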

With these propositions, we can prove the following theorem; its proof is provided in Appendix A.1.

Theorem 4.7.

Given a tagging function $t$ such that language $\hat{\mathcal{L}}=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL, the minimal $k$-SEVPA of language $\hat{\mathcal{L}}$ can be learned with a polynomial number of equivalence and membership queries.

Therefore, if language $\hat{\mathcal{L}}=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL, then we can use Algorithm 1 to learn the minimal $k$-SEVPA of language $\hat{\mathcal{L}}$. However, in general, a tagging function $t$ does not necessarily induce a VPL $\hat{\mathcal{L}}$, even if each sentence in $\hat{\mathcal{L}}$ is well matched. For example, consider the language $\hat{\mathcal{L}}_{\mathcal{O}}=\{a^{k}b^{k}\mid k>0\}$ and the tagging $t$ that maps $a$ and $b$ to plain symbols. The resulting language is trivially well matched (as it does not have call/return symbols), but it is not a VPL. In the next section, we discuss the procedure to find a suitable tagging function that makes $\mathcal{L}$ a VPL.

4.3. Tagging Inference

$L\to\text{\guilsinglleft{$a$}}\ A\ \text{{$b$}\guilsinglright}\ L\mid c\ B\mid\epsilon$
$A\to\text{\guilsinglleft{$g$}}\ L\ \text{{$h$}\guilsinglright}\ E$
$B\to d\ L$
$E\to\epsilon$
Seed strings: $S=\{agcdcdhbcd\}$

Figure 1. An oracle VPG and a set of seed strings.


We next propose an algorithm that infers a tagging function $t$ so that the tagged language $\hat{\mathcal{L}}_{t}=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL. Then with Theorem 4.7, we can use the tagging function in Algorithm 1 to learn the VPL efficiently. Our algorithm takes a set of seed strings $S$ for inference. In practice, the seed strings can be collected via existing corpora of data (e.g., a corpus of XML strings) or via valid inputs to black-box implementations of oracles (e.g., an XML parser).

We will use the oracle VPG in Figure 1 as a running example, which includes a single seed string. Note that seed strings are untagged: it is the task of our algorithm to infer the tagging. As discussed earlier, the inferred and the oracle tagging functions may differ. For the example VPG in Figure 1, we can remove the tags on either $(\text{\guilsinglleft{$a$}},\text{{$b$}\guilsinglright})$ or $(\text{\guilsinglleft{$g$}},\text{{$h$}\guilsinglright})$ and the resulting grammar is still a VPG and generates the same untagged language. As we will explain, in such a case the algorithm opts for the outermost tags in its inferred VPG (i.e., \guilsinglleft $a$ and $b$ \guilsinglright), while treating ${g}$ and ${h}$ as plain symbols.

This section unfolds as follows. We first introduce a VPL pumping lemma, which enables a nesting test to filter out invalid taggings. We then present a tagging inference algorithm based on the nesting test and state theorems discussing properties of the algorithm. We leave most of the proofs of these theorems to Appendix B.

One straightforward observation is that a tagging $T$ is invalid if some seed string after tagging through $T$ is not well matched. However, this test alone would not eliminate many possibilities. We next introduce a nesting test to filter out more invalid taggings. For this, we first propose a pumping lemma for VPLs. This lemma diverges from traditional pumping lemmas for regular languages and context-free languages by focusing on the unique requirements of call and return symbols in VPLs.

Lemma 4.8 (Pumping Lemma for VPLs).

For any VPL $\hat{\mathcal{L}}$ , there exists a positive number $l$ such that, for any string $s$ in $\hat{\mathcal{L}}$ with length greater than $l$ , it is possible to express $s$ according to one of the following conditions:

(1)

(Regular Pumping) We can partition $s$ into $s=uxv$ for strings $u,x,$ and $v$ , with $x$ being non-empty, such that $ux^{k}v$ remains in $\hat{\mathcal{L}}$ for all $k\geq 0$ .
(2)

(Nesting Pumping) We can partition $s$ into $s=uxzyv$ for strings $u,x,z,y,$ and $v$, with $x$ and $y$ being non-empty, $x$ containing a call symbol, and $y$ containing a return symbol, such that $ux^{k}zy^{k}v$ remains in $\hat{\mathcal{L}}$ for all $k\geq 1$.

For example, consider the VPG in Figure 1. Any string in $\hat{\mathcal{L}}$ with length greater than 6 can be decomposed in one of the above two ways; e.g., for the tagged seed string $s=\text{\guilsinglleft{$a$}}\text{\guilsinglleft{$g$}}cdcd\text{{$h$}\guilsinglright}\text{{$b$}\guilsinglright}cd$, we have $(\text{\guilsinglleft{$a$}}\text{\guilsinglleft{$g$}})^{k}cdcd(\text{{$h$}\guilsinglright}\text{{$b$}\guilsinglright})^{k}cd$ in the language, for $k\geq 1$; it happens that it can also be decomposed through regular pumping: we have $\text{\guilsinglleft{$a$}}\text{\guilsinglleft{$g$}}cd(cd)^{j}\text{{$h$}\guilsinglright}\text{{$b$}\guilsinglright}cd$ in the language, for $j\geq 0$. We now extend the concept of nesting pumping to untagged strings, calling them nesting patterns.

Definition 4.9 (Nesting Patterns).

For an untagged string $s$ in the oracle language $\mathcal{L}$ , a nesting pattern is a partitioning of $s=uxzyv$ , where (1) $x$ and $y$ are non-empty, (2) $ux^{k}zy^{k}v\in\mathcal{L}$ for all $k\geq 1$ , (3) but for $k\neq j$ (both $\geq 0$ ), $ux^{k}zy^{j}v\not\in\mathcal{L}$ . The third condition precludes the possibility that $uxzyv$ represents a regular pumping, which allows $ux^{j}zyv$ and $uxzy^{j}v$ for all $j\geq 0$ . When $u$ , $z$ , and $v$ are not the focus, we may succinctly write a nesting pattern in a string as a pair $(x,y)$ .

Definition 4.10 (Compatible Tagging).

We say that tagging $t$ is compatible with a nesting pattern $s=uxzyv$ , if there exists a pair $(\text{\guilsinglleft{$a$}},\text{{$b$}\guilsinglright})$ in $t$ , such that (1) $x$ includes $a$ and $y$ includes $b$ , and (2) $t(x)$ includes an unmatched \guilsinglleft $a$ and $t(y)$ includes an unmatched $b$ \guilsinglright in $t(s)$ .

We say that tagging $t$ is compatible with a set of seed strings $S$ , if (1) strings in $S$ are well-matched under $t$ , and (2) $t$ is compatible with all nesting patterns of $S$ . We say that tagging $t$ is compatible with language $\mathcal{L}$ if it is compatible with each string in the language.

Theorem 4.11.

Given an oracle language $\mathcal{L}$ , for any tagging $t$ compatible with $\mathcal{L}$ , language $\hat{\mathcal{L}}_{t}=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL.

For the example in Figure 1, the single seed string’s nesting patterns include

\{(ag,hb),(agcd,hb),\dots,(ag,cdcdhbcd)\}.

One compatible tagging is $\{(a,b)\}$ : firstly, the tagging would make the seed string well matched; secondly, each nesting pattern includes $(a,b)$ . By Theorem 4.11, when we tag $a$ as a call symbol and $b$ as a return symbol, the oracle language becomes a VPL. Other compatible taggings include $\{(a,h)\}$ , $\{(g,h)\}$ , $\{(g,b)\}$ , and $\{(a,b),(g,h)\}$ . In contrast, the tagging $\{(a,h),(g,b)\}$ is incompatible since this tagging would not make the seed string well matched.
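The well-matchedness part of this example can be checked mechanically. The sketch below is a hedged illustration, not the authors' implementation: the function name `well_matched` is hypothetical, and it assumes that a return symbol may only close a call symbol with which it is paired in the tagging, which is the reading under which $\{(a,h),(g,b)\}$ fails on the seed string. It does not check compatibility with nesting patterns.

```python
def well_matched(s, tagging):
    """Check whether s is well matched under a set of (call, return) pairs.

    Assumption of this sketch: a return character must close the most recent
    open call character, and the two must form a pair in the tagging.
    Characters not mentioned in the tagging are plain.
    """
    calls = {a for a, _ in tagging}
    rets = {b for _, b in tagging}
    stack = []
    for ch in s:
        if ch in calls:
            stack.append(ch)
        elif ch in rets:
            if not stack or (stack.pop(), ch) not in tagging:
                return False
    return not stack

seed = "agcdcdhbcd"                      # seed string of Figure 1
candidates = [
    {("a", "b")},
    {("a", "h")},
    {("g", "h")},
    {("g", "b")},
    {("a", "b"), ("g", "h")},
    {("a", "h"), ("g", "b")},            # crossing pairs
]
for t in candidates:
    print(sorted(t), well_matched(seed, t))
```

Only the last candidate is rejected: under it, $h$ would have to close $g$, but the two are not paired.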

Recall that Theorem 4.7 tells us that, if a tagging $t$ makes $\mathcal{L}$ a VPL, we can efficiently learn the VPL under active learning through Algorithm 1. Now Theorem 4.11 tells us that a compatible tagging $t$ makes $\mathcal{L}$ a VPL. Therefore, what remains is to infer a compatible tagging. With such a tagging, we can use it in Algorithm 1 to learn a VPL whose untagged strings are the same as the oracle language.

Input: Oracle $\mathcal{O}$ and seed strings $S$

Output: Some tagging $T$ compatible with $S$, or None if no compatible tagging is found.

Function $\mathbf{candidateNesting}$($S,K$):
    $N_{S,K}\leftarrow\emptyset$;
    foreach partitioning $uxzyv$ of $s\in S$ do
        if $\forall k\leq K,\chi_{\mathcal{L}}(ux^{k}zy^{k}v)=\mathrm{True}$ and $\forall k,j\leq K,k\neq j\;\Rightarrow\;\chi_{\mathcal{L}}(ux^{k}zy^{j}v)=\mathrm{False}$ then
            $N_{S,K}\leftarrow N_{S,K}\cup\{uxzyv\}$;
    end foreach
    return $N_{S,K}$;

Function $\mathbf{search}$($N,N_{\mathrm{done}},T$):
    if $N$ is empty then return $\texttt{Some}(T)$;
    Take a nesting pattern $uxzyv$ from $N$;
    if $T$ is incompatible with $uxzyv$ then
        foreach character $a$ in $x$ and $b$ in $y$ do
            if all strings in $S$ are well-matched under $T\cup\{(a,b)\}$ and $T\cup\{(a,b)\}$ is compatible with $N_{\mathrm{done}}\cup\{uxzyv\}$ then
                $T^{\prime}\leftarrow\mathbf{search}(N\setminus\{uxzyv\},N_{\mathrm{done}}\cup\{uxzyv\},T\cup\{(a,b)\})$;
                if $T^{\prime}$ is not None then return $T^{\prime}$;
            end if
        end foreach
        return None; // No compatible tagging found
    end if
    else return $\mathbf{search}(N\setminus\{uxzyv\},N_{\mathrm{done}}\cup\{uxzyv\},T)$;

Initialize $K\leftarrow 1$;
repeat
    $K\leftarrow K+1$;
    $N_{S,K}\leftarrow\mathbf{candidateNesting}(S,K)$;
    $T\leftarrow\mathbf{search}(N_{S,K},\emptyset,\emptyset)$;
until $T\neq\texttt{None}$;
return $T$;

Algorithm 3 The $\mathbf{tagInfer}(\mathcal{O},S)$ algorithm that infers tagging.

We next describe Algorithm 3, which infers a compatible tagging from an input set of seed strings. Its runtime complexity is exponential in the worst case; however, as will be discussed in our evaluation section, it runs efficiently on practical grammars. As an overview, starting with a bound $K=2$, the algorithm (1) employs a bounded checking approach in the $\mathbf{candidateNesting}$ function to compute candidate nesting patterns $N_{S,K}$ for seed strings $S$, and (2) for $N_{S,K}$, the $\mathbf{search}$ function tries to find a compatible tagging using a search algorithm (which may backtrack). If the $\mathbf{search}$ function fails to find a compatible tagging, we increase $K$ by $1$ and start anew.

In more detail, in $\mathbf{candidateNesting}$, for each disjoint substring pair $(x,y)$ in each seed string, we check if $ux^{k}zy^{k}v\in\mathcal{L}$ for $k\leq K$ and check if $ux^{k}zy^{j}v\not\in\mathcal{L}$ for $k,j\leq K$ and $k\neq j$. If so, $(x,y)$ is a candidate nesting pattern. In the $\mathbf{search}$ function, we begin with an empty tagging $T$, which tags each character as a plain symbol. We then check if a candidate nesting pattern is already covered by the current tagging; if not, we treat a symbol in $x$ as a call symbol and a symbol in $y$ as a return symbol and continue the search process.
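The bounded check performed by $\mathbf{candidateNesting}$ can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `in_language` is a hypothetical membership oracle for the untagged language of the Figure 1 grammar (it reads the string as a balanced sequence of the two-character units `ag`, `hb`, and `cd`), and, following Definition 4.9, the sketch checks equal pumping for $1\leq k\leq K$ and unequal pumping for $0\leq k,j\leq K$.

```python
def in_language(s):
    """Hypothetical oracle for Figure 1, untagged: L -> a g L h b L | c d L | epsilon."""
    depth = 0
    for i in range(0, len(s), 2):
        unit = s[i:i + 2]
        if unit == "ag":
            depth += 1
        elif unit == "hb" and depth > 0:
            depth -= 1
        elif unit != "cd":
            return False
    return depth == 0

def candidate_nesting(seeds, K, member):
    """Collect partitions (u, x, z, y, v) that pass the bounded nesting test."""
    patterns = set()
    for s in seeds:
        n = len(s)
        # Cut points p1 < p2 <= p3 < p4 make x and y non-empty and disjoint.
        for p1 in range(n):
            for p2 in range(p1 + 1, n + 1):
                for p3 in range(p2, n):
                    for p4 in range(p3 + 1, n + 1):
                        u, x, z = s[:p1], s[p1:p2], s[p2:p3]
                        y, v = s[p3:p4], s[p4:]
                        equal_ok = all(member(u + x * k + z + y * k + v)
                                       for k in range(1, K + 1))
                        unequal_ok = not any(
                            member(u + x * k + z + y * j + v)
                            for k in range(K + 1)
                            for j in range(K + 1) if k != j)
                        if equal_ok and unequal_ok:
                            patterns.add((u, x, z, y, v))
    return patterns

patterns = candidate_nesting(["agcdcdhbcd"], 2, in_language)
print(("", "ag", "cdcd", "hb", "cd") in patterns)    # nesting pattern (ag, hb)
print(("ag", "cd", "", "cd", "hbcd") in patterns)    # regular pumping, filtered out
```

The enumeration is quartic in the seed length and issues a number of membership queries quadratic in $K$ per partition, so this sketch is only practical for short seed strings.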

Returning to our example shown in Figure 1, the seed string includes the nesting pattern $\{(ag,hb)\}$ . Our algorithm prioritizes the outermost characters for pairing. Consequently, the pair $(a,b)$ is selected and it covers all nesting patterns of the seed string, resulting in a compatible tagging $\{(a,b)\}$ .

The following theorem states that, for some bounded $K$, Algorithm 3 terminates and returns a compatible tagging.

Theorem 4.12 (Termination and Correctness of Algorithm 3).

Let $m$ be the number of states of the minimal $k$ -SEVPA for the oracle VPL. There exists a number $K\leq((m^{2}+2m)^{2}+1)^{2}$ , such that $\mathbf{tagInfer}(\mathcal{O},S)$ returns a tagging that is compatible with a finite set of seed strings $S$ .

Note that the theorem is with respect to a finite set of seed strings $S$ . It does not say whether the found tagging is compatible with all strings in the oracle language $\mathcal{L}$ . We address this by demonstrating the existence of a finite set of seed strings $S_{0}$ , for which a compatible tagging $T$ with $S_{0}$ ensures compatibility with all strings in $\mathcal{L}$ .

Theorem 4.13 (Finite and Sufficient Seed Strings).

For any given oracle language $\mathcal{L}$ , there exists a finite set of seed strings, denoted as $S_{0}$ , such that any tagging that is compatible with $S_{0}$ is also compatible with $\mathcal{L}$ .

The proof of Theorem 4.13 is provided in Appendix B.7. We illustrate how $S_{0}$ is constructed using the VPG shown in Figure 1. In general, for each matching rule $L\to{\text{\guilsinglleft{$a$}}{A}\text{{$b$}\guilsinglright}}{B}$ where the nonterminal $A$ can be recursively rewritten into $L$ via a sequence of derivations, we generate a string reflecting this recursion and incorporate it into $S_{0}$. The example VPG includes matching rules $L\to{\text{\guilsinglleft{$a$}}{A}\text{{$b$}\guilsinglright}}{L}$ and $A\to\text{\guilsinglleft{$g$}}L\text{{$h$}\guilsinglright}E$. We start by expanding $L$ to reveal $A$, followed by expanding $A$ to unveil $L$, resulting in the pattern ${\text{\guilsinglleft{$a$}}{\text{\guilsinglleft{$g$}}L\text{{$h$}\guilsinglright}}\text{{$b$}\guilsinglright}}L$. Then the expansion of $L$ to $cd$ gives us a seed string $agcdhbcd$. We also generate a seed string witnessing the recursive transition from $A$ to $L$ and back to $A$, which would lead to strings like $agagcdhbhbcd$.

In conclusion, it is established that a finite set of seed strings exists, enabling our algorithm to identify a tagging $T$ that is compatible with the oracle language $\mathcal{L}$ . This compatibility guarantees that the tagged oracle language is a VPL. With the learned tagging as an input, Algorithm 1 can be employed to achieve exact learning.

5. V-Star for a Token-Based VPL

In Section 4, we assumed that tagging of the oracle language is on individual characters; i.e., each character is uniquely tagged. This assumption does not always align with real-world scenarios. For instance, in JSON, a curly bracket { may serve as a call symbol, yet it can also appear inside a key, as in {"{":true}; in XML documents, an opening tag such as <p> functions as a call symbol, but it is composed of multiple characters. In this section, we enhance V-Star to address these scenarios.

5.1. Problem Statement

The oracle language $\mathcal{L}$ is a VPL when sentences in $\mathcal{L}$ are converted to sequences of tokens determined by an oracle tokenizer. Formally, a tokenizer function, ${\tau}:\Sigma^{*}\rightarrow H_{\tau}^{*}$ , translates a string $s$ from $\mathcal{L}$ into a sequence of tagged tokens, where $H_{\tau}=H_{\text{call}}\cup H_{\text{plain}}\cup H_{\text{ret}}$ represents the set of tagged tokens given by ${\tau}$ ; we write $H$ when ${\tau}$ is clear from the context. The language $\{{\tau}(s)\mid s\in\mathcal{L}\}$ is assumed to be a VPL over tokens in $H$ . Each category of token $h$ is defined as a regular language, often specified by a regular expression. The notation $s\in h$ indicates that string $s$ belongs to token $h$ . We use metasymbols ${h_{a}}$ , $h_{c}$ , or ${h_{b}}$ for call, plain, or return tokens, respectively.

$L\to\texttt{OPEN}\ L\ \texttt{CLOSE}\mid\texttt{TEXT}$
$\texttt{OPEN}\to\texttt{<p>}$
$\texttt{CLOSE}\to\texttt{</p>}$
$\texttt{TEXT}\to\text{[a..z]+}$

Figure 2. An example XML grammar and the associated lexical rules.


A toy XML grammar is shown in Figure 2, and we use the seed string $s=\texttt{<p><p>p</p></p>}$ as an example. The tokens are OPEN, TEXT, and CLOSE. The oracle tokenizer converts $s$ into the token sequence [OPEN,OPEN,TEXT,CLOSE,CLOSE], where OPEN is a call symbol, TEXT is a plain symbol, and CLOSE is a return symbol.

The oracle still provides membership and equivalence queries. The membership query function $\chi_{\mathcal{L}}:\Sigma^{*}\to\{\text{True},\text{False}\}$ is as before. However, we change the form of equivalence queries. The reason for the change is to convert the oracle language to a character-based VPL so that we can reuse Algorithm 1 for learning a hypothesis VPA.

To model equivalence queries, we first define a converter function. A tokenizer ${\tau}$ identifies boundaries of call and return tokens for a string. We then use ${\mathbf{conv}}_{\tau}:\Sigma^{*}\to\tilde{\Sigma}_{{\tau}}^{*}$ to transform a valid string $s\in\mathcal{L}$ into a new string $\tilde{s}={\mathbf{conv}}_{\tau}(s)$ by inserting artificial call and return symbols to mark token boundaries. This process is formalized next. Given a tokenizer ${\tau}$ with $H_{{\tau}}=H_{\text{call}}\cup H_{\text{plain}}\cup H_{\text{ret}}$, we first build an extended character set $\tilde{\Sigma}_{{\tau}}$: for the $i$-th pair of call and return tokens $h_{a_{i}}$ and $h_{b_{i}}$, we generate a pair of call and return symbols $\text{\guilsinglleft{$a$}}_{i}$ and $\text{{$b_{i}$}\guilsinglright}$ outside of $\Sigma$. We define $\tilde{\Sigma}_{{\tau}}$ as $\Sigma\cup\{\text{\guilsinglleft{$a$}}_{i}\mid i\in[1..|H_{\text{call}}|]\}\cup\{\text{{$b_{i}$}\guilsinglright}\mid i\in[1..|H_{\text{ret}}|]\}$. Then, the transformation of $s\in\mathcal{L}$ into a string $\tilde{s}$ over $\tilde{\Sigma}_{{\tau}}$ proceeds as follows. Let ${\tau}(s)={\tau}(s_{1}\dots s_{k})=t_{1}\dots t_{k}$, where $s_{i}\in t_{i}$ for each $i\in[1..k]$. We construct $\tilde{s}$ based on tokenization: for each $i\in[1..k]$, if $t_{i}$ is the $j$-th call token $h_{a_{j}}$, the call symbol $\text{\guilsinglleft{$a$}}_{j}$ is added before $s_{i}$ in $s$; if $t_{i}$ is the $j$-th return token $h_{b_{j}}$, the return symbol $\text{{$b_{j}$}\guilsinglright}$ is added after $s_{i}$ in $s$. For instance, for the XML grammar in Figure 2, with the call-return token pair being (OPEN,CLOSE), our extended character set $\tilde{\Sigma}_{{\tau}}$ has two additional characters, say $\triangleleft$ and $\triangleright$. The seed string $s=\texttt{<p><p>p</p></p>}$ is converted to $\triangleleft\texttt{<p>}\triangleleft\texttt{<p>}\texttt{p}\texttt{</p>}\triangleright\texttt{</p>}\triangleright$.
Note that the resulting string after conversion is a well-matched string in a character-based VPL that has the call symbol $\triangleleft$ and return symbol $\triangleright$ . This allows us to reuse the previous algorithm for learning a character-based VPA.
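The conversion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it handles a single call-return token pair, the two regular expressions and the tag names `<a>` and `<b>` are hypothetical stand-ins for a learned tokenizer, and the function name `conv` simply mirrors ${\mathbf{conv}}_{\tau}$.

```python
import re

CALL_MARK, RET_MARK = "◁", "▷"  # artificial symbols assumed to lie outside the alphabet


def conv(s, call_re, ret_re):
    """Insert CALL_MARK before each call token and RET_MARK after each return token.

    call_re and ret_re are hypothetical regular expressions for one
    call-return token pair; everything else is copied through unchanged.
    """
    call, ret = re.compile(call_re), re.compile(ret_re)
    out, i = [], 0
    while i < len(s):
        m = call.match(s, i)
        if m and m.end() > i:
            out.append(CALL_MARK + m.group())
            i = m.end()
            continue
        m = ret.match(s, i)
        if m and m.end() > i:
            out.append(m.group() + RET_MARK)
            i = m.end()
            continue
        out.append(s[i])  # plain character
        i += 1
    return "".join(out)


print(conv("<a><b>x</b></a>", r"<[a-z]+>", r"</[a-z]+>"))
```

The converted string contains the original characters in order, so removing the two artificial symbols recovers the input.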

With the converter function defined, we can model an equivalence query. $\mathcal{E}(\mathcal{H}_{\tau},{\tau})$ takes a hypothesis VPA $\mathcal{H}_{\tau}$ and a hypothesis tokenizer ${\tau}$ and returns none when the oracle language is equivalent to the unconverted language recognized by $\mathcal{H}_{\tau}$ and otherwise returns some $s$ such that $\chi_{\mathcal{L}}(s)\neq\chi_{(\mathcal{H}_{\tau},{\tau})}(s)$ , where

\chi_{(\mathcal{H}_{\tau},{\tau})}(s)=\begin{cases}\text{True}&\text{if }{\mathbf{conv}}_{\tau}(s)\text{ is accepted by }\mathcal{H}_{\tau},\\ \text{False}&\text{otherwise}.\end{cases}

A learner achieves exact learning if $\forall s\in\Sigma^{*},\;\chi_{\mathcal{L}}(s)=\chi_{(\mathcal{H}_{\tau},{\tau})}(s)$.

Similar to the method we discussed in Section 4.3, we utilize nesting patterns to identify the boundaries of call and return tokens. Our objective is to discover a compatible tokenizer, which ensures that the language $\tilde{\mathcal{L}}_{{\tau}}=\{{\mathbf{conv}}_{\tau}(s)\mid s\in\mathcal{L}\}$ is a VPL. We will demonstrate the existence of a finite set of substrings from which our algorithm can successfully learn a compatible tokenizer. Then the aforementioned converter function transforms the oracle language into a character-based VPL, which according to Theorem 4.7 can be efficiently learned by Algorithm 1.

Assumptions for oracle languages and tokenizers

We previously defined a tokenizer as a function that maps a string to a list of tokens. However, assuming an arbitrary tokenizer is insufficient, as it has been demonstrated that any CFG can be mapped to a VPG through some tagging (Alur and Madhusudan, 2004). Take, for instance, the CFG $\{L\to cLc\mid c\}$. A tokenizer might tag $c$ differently based on its position within a string, e.g., by mapping the string $ccc$ to the token list $[\text{‹}c,\,c,\,c\text{›}]$, where ‹$c$ and $c$› represent call and return tokens, respectively; the resulting language is a VPL. To simplify tag learning to the setting where tagging is context independent, the oracle tokenizer and the language are assumed to satisfy the following properties:

Tokenization Consistency. For a string $p=p_{1}\ldots p_{k}$ , if each substring $p_{i}$ belongs to a token $t_{i}$ , then ${\tau}(p)=t_{1}\ldots t_{k}$ . For example, string can be split into [,]; this assumption requires it to be tokenized as [OPEN, CLOSE].

Separation. Strings for different tokens do not overlap. For the previous example of $\{L\to cLc\mid c\}$ , tagging the first $c$ as a call token and the last as a return token would violate this property.

Exclusivity. A prefix or suffix of a call or return token $h$ cannot serve as an infix of $h$ . Exclusivity is not required for a token that contains only a single character.

Unique Pairing. Each call token is uniquely paired with a return token, similar to an assumption we made for the setting of character-based VPLs.

Token Fixed Prefix and Suffix. For each call or return token $h$, if $h$ contains more than a single character, we require that there exist a prefix $q$ and a suffix $g$ such that all strings of $h$ start with $q$ and end with $g$. Further, there exists a string $s_{h}$ of $h$ such that the combination of the prefixes and suffixes of $s_{h}$ constitutes a sufficient set of test strings for exact learning of the token using $L^{*}$, from the membership query function $\lambda s.\chi_{\mathcal{L}}(wsw^{\prime})$, where $w,w^{\prime}$ are any strings such that $ws_{h}w^{\prime}\in\mathcal{L}$.

$k$-Repetition. Given a positive number $k$, for each valid string $s=w_{1}ww_{3}$ where $w$ is a nonempty substring, we say that $w$ is $k$-repeatable in $s$ if $w_{1}w^{k}w_{3}$ is also a valid string. A language $\mathcal{L}$ and its tokenizer ${\tau}$ are said to satisfy $k$-Repetition if, for any valid string $s\in\mathcal{L}$ and any substring $w$ in $s$, if $w$ belongs to a call or return token $h$ but is not tokenized as $h$ in $s$, then $w$ is $k$-repeatable in $s$.

For example, consider the JSON string {"{":true}. Suppose $w$ is the second {; since it is inside quotes, it belongs to part of the token for a JSON key (i.e., "{"), even though $\{$ itself is a call token. For any $k$ , $w$ is $k$ -repeatable since the string $\texttt{\{"$w^{k}$":true\}}=\texttt{\{"\{}\ldots\texttt{\{":true\}}$ remains a valid JSON string. In our implementation, we set $k$ to $2$ .
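The JSON example can be checked mechanically. The sketch below assumes Python's standard `json` module as a stand-in membership oracle for $\chi_{\mathcal{L}}$; the helper names `is_valid_json` and `is_k_repeatable` are illustrative and not part of V-Star.

```python
import json


def is_valid_json(s):
    """Stand-in membership oracle: does s parse as JSON?"""
    try:
        json.loads(s)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def is_k_repeatable(oracle, s, start, end, k=2):
    """True if replacing w = s[start:end] with w^k keeps the string valid."""
    w = s[start:end]
    return oracle(s[:start] + w * k + s[end:])


s = '{"{":true}'
print(is_k_repeatable(is_valid_json, s, 2, 3))  # the brace inside the quoted key
print(is_k_repeatable(is_valid_json, s, 0, 1))  # the opening brace of the object
```

The brace inside the key is 2-repeatable, while the opening brace of the object is not, which is the distinction that $k$-Repetition relies on.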

While we make the aforementioned assumptions, our approach is still quite expressive, since the above properties are typically satisfied by practical grammars, including those in our evaluation.

5.2. Tagging Inference for Tokens

In this section, we define the compatibility of tokenizers and present a theorem about the relation between a compatible tokenizer and the converted language $\tilde{\mathcal{L}}_{{\tau}}$ . Then, we discuss an algorithm that infers a compatible tokenizer from given seed strings.

To define compatible tokenizers, we introduce some additional definitions. Given a tokenizer ${\tau}$ , recall that for a string $s$ in $\mathcal{L}$ , ${\mathbf{conv}}_{\tau}(s)$ is built by inserting artificial call and return symbols to $s$ . Now, let $s=s_{1}s_{2}s_{3}$ . We define ${\mathbf{conv}}_{{\tau},s}(s_{2})$ as the substring in ${\mathbf{conv}}_{\tau}(s)$ that corresponds to $s_{2}$ ; additionally, if ${\mathbf{conv}}_{\tau}$ inserted a call symbol between $s_{1}$ and $s_{2}$ , then ${\mathbf{conv}}_{{\tau},s}{(s_{2})}$ includes and starts with that call symbol; if ${\mathbf{conv}}_{\tau}$ inserted a return symbol between $s_{2}$ and $s_{3}$ , then ${\mathbf{conv}}_{{\tau},s}{(s_{2})}$ includes and ends with that return symbol. For example, for the seed string $s=\texttt{p}$ , ${\mathbf{conv}}_{{\tau},s}{(\texttt{})}=\triangleleft\ \texttt{}$ , and ${\mathbf{conv}}_{{\tau},s}{(\texttt{})}=\texttt{}\ \triangleright$ .

Definition 5.0 (Compatible Tokenizers).

We say that a tokenizer ${\tau}$ is compatible with a set of nesting patterns $N$ , if for each nesting pattern $s=uxzyv$ in $N$ , ${\mathbf{conv}}_{{\tau},s}{(x)}$ , and ${\mathbf{conv}}_{{\tau},s}{(y)}$ , there exists a pair of artificial call and return symbols $(\triangleleft,\triangleright)$ in ${\tau}$ , such that (1) ${\mathbf{conv}}_{{\tau},s}{(x)}$ includes $\triangleleft$ and ${\mathbf{conv}}_{{\tau},s}{(y)}$ includes $\triangleright$ , and (2) $\triangleleft$ is unmatched in ${\mathbf{conv}}_{{\tau},s}{(x)}$ and $\triangleright$ is unmatched in ${\mathbf{conv}}_{{\tau},s}{(y)}$ .

We say that a tokenizer ${\tau}$ is compatible with a set of seed strings $S$ , if (1) for each string $s$ in $S$ , ${\mathbf{conv}}_{\tau}{(s)}$ is well-matched, and (2) ${\tau}$ is compatible with all nesting patterns of $S$ .

Now we present Theorem 5.2 as the basis for exact learning.

Theorem 5.2 ().

Assume the oracle language and the oracle tokenizer satisfy the Tokenization Consistency and Separation properties. Given a tokenizer ${\tau}$ that is compatible with the oracle language $\mathcal{L}$ , language $\tilde{\mathcal{L}}_{{\tau}}$ is a VPL.

Now, we propose Algorithm 4 to infer a hypothesis compatible tokenizer. Instead of finding a full-fledged tokenizer, the algorithm infers a partial tokenizer, which recognizes only call and return tokens in an input string; the syntax of plain tokens is instead learned during the VPA learning process. As a result of this choice, substrings between call/return tokens recognized by a partial tokenizer are implicitly treated as plain tokens. We represent a partial tokenizer as a set $D=\{(r_{i},r_{i}^{\prime})\mid i\in[1..|H_{\text{call}}|]\}$, where $r_{i}$ and $r_{i}^{\prime}$ are the regular expressions for the $i$-th paired call and return token, respectively. Function ${\mathbf{conv}}_{D}{(s)}$ and compatibility are similarly defined for a partial tokenizer $D$; we omit them for brevity.

At a high level, Algorithm 4 identifies call/return tokens by enumerating potential prefixes and suffixes based on the Token Fixed Prefix and Suffix assumption. Further, under the Exclusivity assumption, we can prove that oracle call/return tokens must appear in $(x^{2},y^{2})$ for some nesting pattern $(x,y)$; the proof is provided in the appendix as Lemma C.2. Therefore, we restrict our enumeration to substrings within $(x^{2},y^{2})$. Our approach begins by searching within $(x,y)$ and then progressively expands the search space to $(x^{2},y^{2})$. Upon identifying a candidate prefix-suffix pair for a token, we learn the token’s lexical rules as a regular expression within the prefix-suffix pair using $L^{*}$ at line 6; in this learning, we simulate the equivalence queries using test strings obtained by combining the prefixes and suffixes of $x$ and $y$, respectively. We then incorporate the tokens into the partial tokenizer and proceed to assess the tokenizer’s compatibility with the nesting patterns of seed strings (line 7).

One compatibility condition is that the seed strings after tokenization should be well-matched; for that, Algorithm 5 is used to tokenize a string based on a given partial tokenizer. The main challenge of tokenization is that we have only a partial tokenizer and we need to rely on $k$-Repetition to deal with the case when a plain token string contains a call/return token as part of its substring. E.g., in $s=\texttt{\{"\{":true\}}$, the second { is actually part of the plain token "{" and should not be treated as a call token. To demonstrate Algorithm 5, consider a partial tokenizer $D=\{(\texttt{\{},\texttt{\}})\}$ and an input JSON string $s=\texttt{\{"\{":true\}}$. Starting with string index $i=1$ and token list $l=[]$, Algorithm 5 matches the first { as $m_{1}$ and pushes it to $l$. Since $i=2$ does not result in any match, $i$ is updated to $3$, where the second { is matched as $m_{2}$; however, it is not added to the token list since it is $k$-repeatable. Finally, the last } is matched as $m_{3}$. As a result, Algorithm 5 returns $[m_{1},m_{3}]$.

Input: Oracle $\mathcal{O}$ and seed strings $S$
Output: Some tokenizer $D$ compatible with $S$, or None if no compatible tokenizer is found.

1 Function $\mathbf{tokenSearch}$($N$, $N_{\mathrm{done}}$, $D$):
2     if $N$ is empty then return $\texttt{Some}(D)$;
3     Take a nesting pattern $s=uxzyv$ from $N$;
4     if $D$ is incompatible with $uxzyv$ then
5         foreach disjoint substrings $q$ and $g$ in $x$ and $x^{2}$, and $q^{\prime}$ and $g^{\prime}$ in $y$ and $y^{2}$ do
6             Based on $((q,g),(q^{\prime},g^{\prime}))$ learn a new call-return token pair $(r,r^{\prime})$;
7             if $D\cup\{(r,r^{\prime})\}$ is compatible with $N_{\mathrm{done}}\cup\{uxzyv\}$ then
8                 $D^{\prime}\leftarrow\mathbf{tokenSearch}(N\setminus\{uxzyv\},N_{\mathrm{done}}\cup\{uxzyv\},D\cup\{(r,r^{\prime})\})$;
9                 if $D^{\prime}$ is not None then return $D^{\prime}$;
10            end if
11        end foreach
12        return None; // No compatible tokenizer found
13    end if
14    else return $\mathbf{tokenSearch}(N\setminus\{uxzyv\},N_{\mathrm{done}}\cup\{uxzyv\},D)$;
15 Initialize $K\leftarrow 1$;
16 repeat
17     $K\leftarrow K+1$;
18     $N_{S,K}\leftarrow\mathbf{candidateNesting}(S,K)$;
19     $D\leftarrow\mathbf{tokenSearch}(N_{S,K},\emptyset,\emptyset)$;
20 until $D\neq\texttt{None}$;
21 return $D$;

Algorithm 4 The $\mathbf{tokenInfer}(\mathcal{O},S)$ algorithm that infers call and return tokens. The $\mathbf{candidateNesting}(-,-)$ function is the same as the one in Algorithm 3.
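The backtracking structure of $\mathbf{tokenSearch}$ can be sketched as follows. This is a heavily simplified illustration and not the V-Star implementation: it assumes every call/return token is a single character, so the $L^{*}$ learning step at line 6 collapses to picking one character from $x$ and one from $y$, and it makes no oracle queries. The helper names `scan` and `compatible` and the bracket patterns are hypothetical.

```python
from itertools import product


def scan(segment, D):
    """Return (unmatched call chars, unmatched return chars) within `segment`."""
    ret_of = dict(D)
    returns = set(ret_of.values())
    open_calls, stray_returns = [], []
    for ch in segment:
        if ch in ret_of:
            open_calls.append(ch)
        elif ch in returns:
            if open_calls and ret_of[open_calls[-1]] == ch:
                open_calls.pop()
            else:
                stray_returns.append(ch)
    return open_calls, stray_returns


def compatible(D, patterns):
    """Simplified compatibility: the whole string is well-matched, and some
    pair in D has its call unmatched in x and its return unmatched in y."""
    for u, x, z, y, v in patterns:
        if scan(u + x + z + y + v, D) != ([], []):
            return False
        calls_in_x, _ = scan(x, D)
        _, returns_in_y = scan(y, D)
        if not any(c in calls_in_x and r in returns_in_y for c, r in D):
            return False
    return True


def token_search(todo, done, D):
    """Backtracking search over nesting patterns (u, x, z, y, v)."""
    if not todo:
        return D
    pattern, rest = todo[0], todo[1:]
    if compatible(D, [pattern]):
        return token_search(rest, done + [pattern], D)
    _, x, _, y, _ = pattern
    used = {ch for pair in D for ch in pair}
    for call, ret in product(x, y):  # stands in for prefix/suffix enumeration
        if call == ret or call in used or ret in used:
            continue
        extended = D | {(call, ret)}
        if compatible(extended, done + [pattern]):
            found = token_search(rest, done + [pattern], extended)
            if found is not None:
                return found
    return None  # no compatible tokenizer found


patterns = [("", "(", "a", ")", ""), ("(", "[", "b", "]", ")")]
print(sorted(token_search(patterns, [], frozenset())))
```

Because this sketch never consults the oracle, it can accept pairs that merely look consistent with the given patterns; the learning and $k$-Repetition checks in the full algorithm are what rule such candidates out.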

Input: A partial tokenizer $D$ and a string $s$
Output: A token list $l$

1 Initialize the token list as $l\leftarrow[]$;
2 Initialize the current location of string $s$: $i\leftarrow 1$;
3 while $i\leq|s|$ do
4     if we find a first match $w=s[i]...s[j]$ for token $h\in D$ and $w$ is not $k$-repeatable then
5         Push new match $m=(h,i,j)$ to token list $l$;
6         $i\leftarrow j+1$;
7     end if
8     else $i\leftarrow i+1$;
9 end while
10 return $l$;

Algorithm 5 The $\mathbf{tokenize}(D,s)$ algorithm that tokenizes a string.
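A minimal Python rendering of Algorithm 5 is given below as a sketch. It assumes that the partial tokenizer is a list of (call, return) regular-expression pairs, that Python's `json` module serves as the membership oracle, and that indices are 0-based and half-open rather than 1-based as in the pseudocode; all function names are illustrative.

```python
import json
import re


def is_valid_json(s):
    """Stand-in membership oracle for a JSON language."""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False


def first_match(patterns, s, i):
    """Return the first nonempty regex match starting exactly at position i."""
    for p in patterns:
        m = p.match(s, i)
        if m and m.end() > i:
            return m
    return None


def tokenize(D, s, oracle, k=2):
    """Collect call/return token matches, skipping k-repeatable ones."""
    patterns = [re.compile(r) for pair in D for r in pair]
    tokens, i = [], 0
    while i < len(s):
        m = first_match(patterns, s, i)
        if m is not None:
            j = m.end()
            repeated = s[:i] + m.group() * k + s[j:]
            if not oracle(repeated):  # not k-repeatable: a genuine token
                tokens.append((m.re.pattern, i, j))
                i = j
                continue
        i += 1  # no match, or the match is part of a plain token
    return tokens


print(tokenize([(r"\{", r"\}")], '{"{":true}', is_valid_json))
```

On the JSON string from the running example, the brace inside the quoted key is matched but discarded, so only the outer braces are reported, in line with the result $[m_{1},m_{3}]$ described above.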

We next illustrate the steps of Algorithm 4 using our XML example. We start with an empty tokenizer $D$. We then iteratively select a nesting pattern $s=uxzyv$, tokenize $s$ using Algorithm 5, and verify the compatibility of $D$ with the tokenization. In our example, we start with the seed string $s=\texttt{p}$ and pick a nesting pattern. Suppose V-Star picks the outermost pattern $(\texttt{},\texttt{})$. The token list $\mathbf{tokenize}(D,s)$ of $s$ is empty, since there is no rule to find a token yet. Clearly, this tokenizer $D$ is not compatible with $uxzyv$.

We then extend $D$ by a call-return token pair learned from $((q,g),(q^{\prime},g^{\prime}))$ derived from $(x,y)$ or $\left(x^{2},y^{2}\right)$ . By enumerating candidate prefixes and suffixes $((q,g),(q^{\prime},g^{\prime}))$ within $(x,y)=(\texttt{},\texttt{})$ , we can build the first call-return token pair. From $x$ , we first pick the outmost $(\texttt{<},\texttt{>})$ as $(q,g)$ ; then in $y$ , we pick the outmost $(\texttt{<},\texttt{>})$ as $(q^{\prime},g^{\prime})$ . By learning the tokens’ lexical rules from membership query functions $\lambda w.\chi_{\mathcal{L}}(w\texttt{p})$ and $\lambda w.\chi_{\mathcal{L}}(\texttt{p}w)$ , we identify two regular expressions and for the call and return tokens, respectively. Note that if the open tag contained XML attributes, the learned lexical rules would encompass regular expressions that specify these attributes. To check if the partial tokenizer $D=\{(\texttt{}$ , $\texttt{})\}$ is compatible with $s=\texttt{p}$ , we need to tokenize $s$ following Algorithm 5, which returns the token list $[\texttt{},\texttt{},\texttt{},\texttt{}]$ . It can be shown that this partial tokenizer is compatible with all nesting patterns of string $s$ . Therefore, Algorithm 4 ends here and returns this compatible tokenizer.

Lemma 5.0 (Finite and Sufficient Seed Strings).

Assume the oracle language and the oracle tokenizer satisfy the Tokenization Consistency, Separation, Exclusivity, Unique Pairing, Token Fixed Prefix and Suffix, and $k$ -Repetition properties. There exists a finite set of seed strings $S_{0}\subseteq\mathcal{L}$ , with which we can find a tokenizer that is compatible with the oracle language $\mathcal{L}$ using Algorithm 4.

In summary, we can learn a compatible tokenizer from a certain finite set of seed strings. With a compatible tokenizer ${\tau}$, Theorem 5.2 gives us that $\tilde{\mathcal{L}}_{{\tau}}$ is a character-based VPL. Then by Theorem 4.7, we can use Algorithm 1 to learn $\tilde{\mathcal{L}}_{{\tau}}$ exactly under active learning.

6. Evaluation

In this section, we discuss V-Star’s implementation, evaluation and its comparison with two other state-of-the-art grammar inference tools, Glade (Bastani et al., 2017) and Arvada (Kulkarni et al., 2022), in the context of inferring grammars from program inputs.

Implementation

While black-box programs naturally support membership queries, direct support of equivalence queries is absent. To instantiate the MAT, we approximate equivalence queries through membership queries. In particular, we construct a set of strings by combining prefixes, infixes, and suffixes of the seed strings; for each such string $s$ , if ${\mathbf{conv}}_{\tau}(s)$ is well-matched, we add it to a set of test strings. The set of test strings is then used to check the consistency between the hypothesis VPA and the oracle language. A test string becomes a counterexample if it witnesses inconsistency (i.e., either the hypothesis VPA or the oracle accepts the string, but not both). Similar ideas have appeared in conformance testing (Kumar et al., 2006; Aichernig et al., 2024).
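The approximation of equivalence queries can be sketched as follows. The sketch is illustrative only: the toy oracle (balanced parentheses), the deliberately wrong hypothesis (nesting depth at most one), and all function names are assumptions, and well-matchedness is checked directly on the string because in this toy language each call/return token is a single character.

```python
from itertools import product


def substrings_by_role(seeds):
    """Collect all prefixes, infixes, and suffixes of the seed strings."""
    prefixes, infixes, suffixes = set(), set(), set()
    for s in seeds:
        for i in range(len(s) + 1):
            prefixes.add(s[:i])
            suffixes.add(s[i:])
            for j in range(i, len(s) + 1):
                infixes.add(s[i:j])
    return prefixes, infixes, suffixes


def find_counterexample(seeds, oracle, hypothesis, well_matched):
    """Return a test string on which oracle and hypothesis disagree, or None."""
    prefixes, infixes, suffixes = substrings_by_role(seeds)
    for p, m, t in product(sorted(prefixes), sorted(infixes), sorted(suffixes)):
        candidate = p + m + t
        if well_matched(candidate) and oracle(candidate) != hypothesis(candidate):
            return candidate
    return None


def balanced(s):
    """Toy oracle: parentheses are balanced."""
    depth = 0
    for ch in s:
        depth += {"(": 1, ")": -1}.get(ch, 0)
        if depth < 0:
            return False
    return depth == 0


def shallow(s):
    """Toy hypothesis: balanced, but nesting depth never exceeds one."""
    depth = 0
    for ch in s:
        depth += {"(": 1, ")": -1}.get(ch, 0)
        if depth > 1:
            return False
    return balanced(s)


print(find_counterexample(["(())"], balanced, shallow, balanced))
```

A return value of None does not prove equivalence; it only means that no disagreement was found among the constructed test strings.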

Our previously discussed algorithm produces a visibly pushdown automaton (VPA), instead of a visibly pushdown grammar (VPG). Upon the successful learning of a VPA, we transform it into a VPG using methods outlined by Alur and Madhusudan (2004).

Datasets

For our experiments, we replicated the evaluation methodology of the Arvada study, utilizing their datasets (Kulkarni et al., 2022), including the oracle grammars, datasets for evaluating the recall (discussed later), and seed strings. We selected the grammars of JSON, LISP, XML, While, and MathExpr because these grammars are VPGs.

Metrics

We evaluate the performance of V-Star using four key metrics: Recall, Precision, F-1 Score, and Number of Membership Queries. We define each metric as follows:

(1) Recall: This metric is the probability that a string of the oracle grammar is also a string of the learned grammar $G$ (Bastani et al., 2017). For finite languages, it can be defined as $\frac{|\mathcal{L}_{\mathcal{O}}\cap\mathcal{L}_{G}|}{|\mathcal{L}_{\mathcal{O}}|}$. Since the languages may be infinite, it may be impractical to compute recall directly. Instead, we approximate it by using a representative dataset from the oracle language and then calculating the proportion of this dataset that is accepted by the learned grammar.

(2) Precision: Conversely, precision is the probability that a string in the learned language is accepted by the oracle (Bastani et al., 2017). For finite languages, it can be defined as $\frac{|\mathcal{L}_{\mathcal{O}}\cap\mathcal{L}_{G}|}{|\mathcal{L}_{G}|}$. As with recall, we approximate precision by sampling strings from the learned grammar and calculating the percentage of strings that are accepted by the oracle. We adopt the same sampling method as Arvada (Kulkarni et al., 2022).

(3) F-1 Score: The F-1 score is the harmonic mean of precision and recall, defined as $\frac{2}{\frac{1}{R}+\frac{1}{P}}$, where $R$ is recall and $P$ is precision. The F-1 score serves as a measure of the overall accuracy, only reaching high values when both precision and recall are high.

(4) Number of Unique Membership Queries: This counts the number of unique membership queries, i.e., distinct oracle calls, made during the learning process. Since a particular string might be queried multiple times, we cache the result after the first query, and only count unique queries. This metric serves as an efficiency measure.
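The sampled versions of these metrics are straightforward to compute. The following sketch assumes that acceptance checks are available as Python predicates; the function and parameter names are illustrative. As a sanity check, it recomputes an F-1 score from a recall of 0.42 and a precision of 0.98, the values reported for Glade on JSON in Table 1.

```python
def recall(oracle_samples, learned_accepts):
    """Fraction of sampled oracle strings accepted by the learned grammar."""
    return sum(1 for s in oracle_samples if learned_accepts(s)) / len(oracle_samples)


def precision(learned_samples, oracle_accepts):
    """Fraction of strings sampled from the learned grammar accepted by the oracle."""
    return sum(1 for s in learned_samples if oracle_accepts(s)) / len(learned_samples)


def f1(r, p):
    """Harmonic mean of recall r and precision p (0 if either is 0)."""
    if r == 0 or p == 0:
        return 0.0
    return 2 / (1 / r + 1 / p)


print(round(f1(0.42, 0.98), 2))
```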

Table 1. Evaluation on datasets where the oracle grammars are VPGs. “#Seeds” is the number of seed strings for each grammar. “#Queries” is the number of membership queries, while “%Q(Token)” and “%Q(VPA)” are the percentages of these queries attributed to token inference and VPA learning, respectively. “#TS” is the number of test strings sampled by V-Star. Results for Arvada are listed as the means over 10 runs ± the standard deviation (Kulkarni, 2023).

Glade
Grammar	#Seeds	Recall	Precision	F1	#Queries	Time
json	71	0.42	0.98	0.59	11 K	21 s
lisp	26	0.23	1.00	0.38	3.8 K	7 s
xml	62	0.26	1.00	0.42	15 K	21 s
while	10	0.01	1.00	0.02	9.2 K	13 s
mathexpr	40	0.18	0.98	0.31	19 K	42 s

Arvada
Grammar	Recall	Precision	F1	#Queries	Time
json	0.97 ± 0.09	0.92 ± 0.08	0.94 ± 0.05	6.8 K ± 394	25 s ± 2 s
lisp	0.38 ± 0.26	0.95 ± 0.08	0.50 ± 0.18	2.2 K ± 307	8 s ± 2 s
xml	0.99 ± 0.02	1.00 ± 0.00	1.00 ± 0.01	12 K ± 1 K	61 s ± 5 s
while	0.91 ± 0.20	1.00 ± 0.00	0.94 ± 0.14	5.4 K ± 563	15 s ± 1 s
mathexpr	0.72 ± 0.24	0.96 ± 0.03	0.80 ± 0.16	6.6 K ± 421	24 s ± 2 s

V-Star
Grammar	Recall	Precision	F1	#Queries	%Q(Token)	%Q(VPA)	#TS	Time
json	1.00	1.00	1.00	541 K	2.71 %	97.29 %	8043	33 min
lisp	1.00	1.00	1.00	16 K	1.37 %	98.63 %	693	77 s
xml	1.00	1.00	1.00	208 K	94.93 %	5.07 %	682	16 min
while	1.00	1.00	1.00	1440 K	9.40 %	90.60 %	119	1.5 h
mathexpr	1.00	1.00	1.00	4738 K	0.11 %	99.89 %	2602	6 h

Results

Table 1 summarizes the performance of Glade, Arvada, and V-Star on oracle VPGs, with the results of Arvada and Glade assessed on the same platform as V-Star, utilizing the Arvada artifact (Kulkarni, 2023). The table shows that V-Star achieves exact learning for all oracles, exhibiting superior accuracy compared to the other tools. However, V-Star issues a greater number of queries than Glade and Arvada, resulting in greater inference time. This primarily stems from (1) the substantial number of test strings used in approximating equivalence queries, and (2) the fact that V-Star consumes seed strings without pre-processing. In contrast, Glade and Arvada employ a pre-tokenization strategy, such as grouping digits or letters as a single terminal, which reduces seed string lengths. We do not pre-tokenize, since V-Star can learn tokens itself. Overall, the evaluation shows that V-Star is more accurate but takes more time to infer grammars. In grammar learning, we believe that accuracy is the more important goal, as a more accurate grammar greatly benefits downstream applications. Improving the efficiency of V-Star (e.g., using heuristics of target grammars) without decreasing accuracy is left for future work.

V-Star requires a considerable number of membership queries for the MathExpr grammar. This can be attributed in part to the large number of constant function names (26 in all) within the grammar, such as “sin” or “cos”. In its quest for high accuracy, V-Star explores various combinations of these constant names exhaustively. We acknowledge that this approach could be further optimized and propose this as an avenue for future improvement.

In Table 1, we include data on the percentage of membership queries allocated for token inference (“%Q(Token)”) and for learning VPA (“%Q(VPA)”). It can be observed that the majority of queries are utilized for VPA learning. This is mainly because seed strings tend to be short, leading to fewer potential nesting patterns. One exception is XML, where most queries are for token inference. This is because the XML grammar, primarily based on nested tag pairs, allows for easier inference of the overall grammar once the opening and closing tags (call and return tokens) are identified. Furthermore, many queries are required to infer the lexical rules of XML attributes. Additionally, the table provides information on the count of seed strings (“#Seeds”) used in our evaluation. For the grammars assessed, V-Star requires a relatively small number of seed strings to achieve exact learning, which we attribute to its strategy of employing a wide range of substring combinations to construct test strings that effectively simulate equivalence queries; column “#TS” shows the number of test strings constructed.

7. Future Work

We believe the performance of V-Star can be further improved with more advanced methods for generating counterexamples, such as using machine learning tools to infer counterexamples from seed strings and a VPA. A related direction is to investigate the potential adaptation of V-Star with discrimination trees. Other grammar inference tools that are based on discrimination trees such as TTT (Isberner, 2015) enhance inference efficiency by reducing counterexample lengths and minimizing membership queries. It remains to be seen how V-Star can be adapted in this manner and what improvements this can yield.

The present study focuses primarily on inferring well-matched VPGs using V-Star. However, our preliminary experience suggests that V-Star can also be effectively employed to learn general VPGs (Alur et al., 2005) with open call and return symbols. A general VPG can be used to specify streaming data. As such, the learning problem for general VPGs is a promising direction for future work.

Our method makes the assumption of unique call-return token pairing to ease tokenizer inference and reduce computational complexity, as matching one call token with multiple return tokens complicates tokenizer inference. It would be interesting future work to consider the implications of relaxing this assumption to enhance flexibility.

Experimentally, we focus on languages such as XML and JSON to align with benchmarks used by prior tools for a direct comparison. It would be interesting to evaluate V-Star on more complex programming language grammars to check its effectiveness on those grammars.

Improving the readability of VPGs inferred by V-Star is still a challenge. Currently, the grammars generated tend to be larger and less readable than oracle grammars, due to the inherent rigid requirements of VPG rules, the inclusion of lexical rules, and automatically named nonterminals. Although we have made attempts to refactor grammars using regular expressions, these solutions are largely heuristic and may not consistently yield optimal results. Exploring machine learning-based approaches presents a promising avenue to systematically enhance the clarity and conciseness of inferred grammars, making them potentially more accessible and understandable for users.

Finally, VPGs learned by V-Star may provide a valuable starting point for better inference algorithms of CFGs. For instance, similar to the CFGs learned by Glade (Bastani et al., 2017), the VPGs inferred by V-Star can serve as inputs for machine learning tools such as REINAM (Wu et al., 2019), which improves the input grammar with reinforcement learning. Comparing the improvements enabled by these different starting grammars would be an intriguing line of inquiry.

8. Conclusions

This paper introduces V-Star, an algorithm designed to take advantage of nesting structures in languages to achieve exact learning of visibly pushdown grammars. Through a set of novel techniques to infer token boundaries and tag call/return tokens, V-Star demonstrates its capability to learn a diverse array of practical languages. Our preliminary experiments are promising and show V-Star’s advantages of accurate learning.

Appendix A Proofs of Theorems in Section 4.2

Theorem A.1 ().

Proof.

We run Algorithm 1 with a target language $\hat{\mathcal{L}}$ over the alphabet $\hat{\Sigma}$ . Define $m$ as the state count of the minimal $k$ -SEVPA of $\hat{\mathcal{L}}$ and $n$ as the maximum length of counterexamples returned by equivalence queries. Proposition 4.4 establishes that the number of equivalence queries does not exceed $m$ , as each iteration expands the number of states in $\mathcal{Q}$ by a minimum of one. This also shows that the algorithm must terminate.

A counterexample returned by an equivalence query causes at most $\log n$ membership queries as detailed in Proposition 4.6, resulting in no more than $m\log n$ membership queries during Step 4 of Algorithm 1. Membership queries in Steps 2 and 5 involve words either of form $wqw^{\prime}$ or $wqmw^{\prime}$, where $q\in Q_{i}$, $m\in\Sigma_{M}$, and $(w,w^{\prime})\in C_{i}$. With $|C_{i}|$ bounded by $|Q_{i}|\leq m$ at completion, total queries amount to at most $\sum_{i=0}^{k}(|Q_{i}|+|Q_{i}||\Sigma_{M}|)|C_{i}|=\sum_{i=0}^{k}|Q_{i}||C_{i}|(1+|\Sigma_{M}|)\leq m^{2}(1+|\Sigma|+|\Sigma_{\text{call}}|\times m\times|\Sigma_{\text{ret}}|)\leq m^{3}|\Sigma|^{2}$.

In conclusion, the number of queries is polynomially bounded in $n$, $m$, and $|\Sigma|$, comprising $O(m^{3}|\Sigma|^{2}+m\log n)$ membership queries and $O(m)$ equivalence queries. ∎

Appendix B Proofs of Theorems in Section 4.3

Given tagging $t$ , we say that string $s$ is $t$ -well-matched, if $t(s)$ is well-matched.

Definition B.0 ().

[Parse Tree] Given a grammar $G=(V,\Sigma,P,L_{0})$, a parse tree with respect to grammar $G$ is an ordered tree where (1) the leaves of the tree are terminals in $\Sigma$ or $\epsilon$, and (2) each non-leaf node is a nonterminal $L$ in $V$, where the children of the node are $\alpha_{1}$, $\alpha_{2}$, …, $\alpha_{n}$ such that $L\to\alpha_{1}\alpha_{2}\ldots\alpha_{n}$ is a production rule in $P$, or $\epsilon$, such that $L\to\epsilon$ is a production rule. The root of the tree should be $L_{0}$, the start nonterminal of grammar $G$. A parse tree of a string $s\in\Sigma^{*}$ in the language of grammar $G$ is a parse tree whose leaves, when concatenated from left to right, form $s$.

Lemma B.0 (Pumping Lemma for VPLs).

(1) (Regular Pumping) We can partition $s$ into $s=uxv$ for strings $u,x,$ and $v$, with $x$ being non-empty, such that $ux^{k}v$ remains in $\hat{\mathcal{L}}$ for all $k\geq 0$.

(2) (Nesting Pumping) We can partition $s$ into $s=uxzyv$ for strings $u,x,z,y,$ and $v$, with $x$ and $y$ being non-empty, $x$ containing a call symbol, and $y$ containing a return symbol, such that $ux^{k}zy^{k}v$ is valid for all $k\geq 1$.

Proof.

Let VPG $G=(\hat{\Sigma},V,P,L_{0})$ be a grammar of $\hat{\mathcal{L}}$ . We define $l$ as the length of the longest string $s$ that contains no recursion in any of its parse trees. Formally, in a parse tree of $s$ , if the subtree rooted with a nonterminal node $A$ contains another appearance of $A$ , we say there is recursion in the parse tree and we call $A$ a recursive nonterminal in the parse tree. $l$ is then defined as the length of the longest string $s$ whose parse trees do not contain recursion. This $l$ is well defined because the number of non-recursive parse trees is finite: any path that goes from the root to a leaf of a parse tree and exceeds the length of $|V|+2$ must have $|V|+1$ nonterminals and revisit at least one nonterminal twice.

For any string $s$ exceeding $l$ in length, one of its parse trees must have a recursive nonterminal; say it is $L$ . The derivation of the parse tree can be written as: $L_{0}\to^{*}uLv\to^{*}u(s_{1}Ls_{2})v\to^{*}s$ , where $L\to^{*}s_{1}Ls_{2}$ . We have two cases:

(1) If $s_{2}$ is empty, then $s_{1}$ cannot be empty since $L\to L$ is not a valid VPG rule. Thus, $u(s_{1})^{k}s_{L}v$ remains valid, where $s_{L}$ is a terminal string derived from $L$. This satisfies regular pumping.

(2) If $s_{2}$ is not empty, then a matching rule is used somewhere in the derivation sequence that leads to the second appearance of $L$. This is because, by the VPG rules, if only rules of the form $L_{1}\to cL_{2}$ or $L_{1}\to\epsilon$ were used, then $L$ would be the last symbol in the derived string $s_{1}Ls_{2}$, which contradicts the assumption that $s_{2}$ is not empty. This leads to:

L\to^{*}s_{1}^{\prime}\,\text{‹}a\,A\,b\text{›}\,B\,s_{2}^{\prime}\to^{*}s_{1}^{\prime}\,\text{‹}a\,s_{1}^{\prime\prime}Ls_{2}^{\prime\prime}\,b\text{›}\,B\,s_{2}^{\prime}

Here, $A\to^{*}s_{1}^{\prime\prime}Ls_{2}^{\prime\prime}$. We then select $x$ as $s_{1}^{\prime}\,\text{‹}a\,s_{1}^{\prime\prime}$ and $y$ as $s_{2}^{\prime\prime}\,b\text{›}\,s_{2}^{\prime}$ for nesting pumping.

∎
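Nesting pumping can be observed on a toy language. The sketch below assumes a balanced-parentheses oracle in which all other characters are plain symbols; the decomposition $(u,x,z,y,v)$ and the function names are illustrative. It checks that pumping $x$ and $y$ the same number of times preserves validity, while pumping them unequally does not.

```python
def balanced(s):
    """Toy oracle: parentheses are balanced; all other characters are plain."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def pump(u, x, z, y, v, k, j=None):
    """Build u x^k z y^j v, with j defaulting to k."""
    j = k if j is None else j
    return u + x * k + z + y * j + v


parts = ("", "(", "a", ")", "")  # s = "(a)" split as u, x, z, y, v

equal = all(balanced(pump(*parts, k)) for k in range(1, 6))
unequal = any(
    balanced(pump(*parts, k, j))
    for k in range(1, 6) for j in range(1, 6) if k != j
)
print(equal, unequal)
```

Here $x$ contains the call symbol and $y$ the matching return symbol, so only equal repetition counts keep the string valid.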

Lemma B.0 ().

Consider an oracle VPL $\mathcal{L}$, a VPG $G=(\Sigma,V,P,L_{0})$ for $\mathcal{L}$, and an oracle tagging $t_{\mathcal{O}}$. For each string $s\in\mathcal{L}$ and $s=uxzyv$, where $u,x,z,y,v$ are substrings, and $x,y$ are nonempty, if string $ux^{k}zy^{k}v$ is valid but string $ux^{k}zy^{j}v$ is invalid for $k,j\leq(|V|^{2}+1)^{2}$ and $k\neq j$, then $t_{\mathcal{O}}(x)$ contains an oracle call symbol, and $t_{\mathcal{O}}(y)$ contains an oracle return symbol, and the two symbols are matched with each other in $s$.

Proof.

In this proof, we abuse the notation $x$ to also mean $t_{\mathcal{O}}(x)$ .

We first show that both $x$ and $y$ contain an unmatched symbol. Otherwise, $x$ and $y$ contain only plain symbols or well-matched call-return pairs. For each $k\geq 1$ and string $ux^{k}zy^{k}v$, the derivation path of ${x}^{k}$ can then be written as $L_{k,1}\to^{*}{x}^{k}L_{k,2}$, where $L_{k,1},L_{k,2}$ are two nonterminals in $V$. To see this, suppose $x$ contains only plain symbols; then the derivation path of $x$ is of the form

L_{1}\to{x}[1]L_{2}\to{x}[1]{x}[2]L_{3}\to\cdots\to xL_{|x|+1}

for certain nonterminals $L_{i}$ with $i\in[1..|x|+1]$. The case is similar when $x$ also contains well-matched substrings; we omit the discussion for brevity. Now, for each $k\in[1..|V|^{2}+1]$, we have the following derivations:

	$\displaystyle L_{1,1}$	$\displaystyle\to^{*}{x}^{1}L_{1,2}$
	$\displaystyle L_{2,1}$	$\displaystyle\to^{*}{x}^{2}L_{2,2}$
		$\displaystyle\dots$
	$\displaystyle L_{|V|^{2}+1,1}$	$\displaystyle\to^{*}{x}^{|V|^{2}+1}L_{|V|^{2}+1,2}$

By the pigeonhole principle, since there are at most $|V|^{2}$ distinct pairs of nonterminals, there exist $k_{1}\neq k_{2}$ with $k_{1},k_{2}\leq|V|^{2}+1$ whose derivations share the same pair of nonterminals, which we denote as $(L_{k^{\prime},1},L_{k^{\prime},2})$, i.e.,

		$\displaystyle\dots$
	$\displaystyle L_{k^{\prime},1}$	$\displaystyle\to^{*}{x}^{k_{1}}L_{k^{\prime},2}$
		$\displaystyle\dots$
	$\displaystyle L_{k^{\prime},1}$	$\displaystyle\to^{*}{x}^{k_{2}}L_{k^{\prime},2}$
		$\displaystyle\dots$

Thus, both strings $ux^{k_{1}}zy^{k_{1}}v$ and $ux^{k_{2}}zy^{k_{1}}v$ are valid. Given that $k_{1}\neq k_{2}$, this contradicts the assumption that $ux^{k}zy^{j}v$ is invalid whenever $k\neq j$.

Therefore, $x$ and $y$ must include unmatched symbols. Consider the type of the unmatched symbol in $x$. If $x$ includes a return symbol $\text{{$b$}\guilsinglright}$, where the matched $\text{\guilsinglleft{$a$}}$ is before $x$, then $ux^{2}zy^{2}v$ is invalid, because $u$ has no additional call symbol to match $\text{{$b$}\guilsinglright}$. Thus, $x$ includes a symbol $\text{\guilsinglleft{$a$}}$, whose matched symbol $\text{{$b$}\guilsinglright}$ is after $x$. If $\text{{$b$}\guilsinglright}$ is in $y$, we are done. Otherwise, $\text{{$b$}\guilsinglright}$ is either in $z$ or in $v$. Consider $ux^{2}zy^{2}v$. Since the string is valid, the unmatched $\text{\guilsinglleft{$a$}}$ in $x^{2}$ must match a return symbol in $y^{2}$.

In conclusion, $x$ contains an oracle call symbol, which matches a return symbol in $y$ . ∎
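The hypothesis of the lemma above can be checked mechanically against a membership oracle. The following is a minimal sketch, not the tool's implementation: the oracle `balanced`, the function name `is_nesting_pattern`, and the parameter `bound` (standing in for $(|V|^{2}+1)^{2}$) are all hypothetical names introduced for illustration.

```python
from itertools import product
from typing import Callable


def balanced(s: str) -> bool:
    """Hypothetical oracle: balanced parentheses with 'a' as a plain symbol."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch != "a":
            return False
    return depth == 0


def is_nesting_pattern(oracle: Callable[[str], bool],
                       u: str, x: str, z: str, y: str, v: str,
                       bound: int) -> bool:
    """Check that u x^k z y^j v is valid exactly when k == j, for 1 <= k, j <= bound."""
    if not x or not y:
        return False  # the lemma requires x and y to be nonempty
    for k, j in product(range(1, bound + 1), repeat=2):
        valid = oracle(u + x * k + z + y * j + v)
        if valid != (k == j):
            return False
    return True
```

For the hypothetical oracle, `is_nesting_pattern(balanced, "", "(", "a", ")", "", 4)` holds, and `x` indeed contains the call symbol matched by the return symbol in `y`. The pair `("a", "a")` is rejected because pumping plain symbols unevenly keeps the string valid.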

Theorem B.4 (Termination and Correctness of Algorithm 3).

Let $m$ be the number of states of the minimal $k$ -SEVPA for the oracle VPL. There exists a number $K\leq((m^{2}+2m)^{2}+1)^{2}$ , with which function $\mathbf{tagInfer}(\mathcal{O},S)$ returns a tagging that is compatible with a finite set of seed strings $S$ .

Proof.

First, we show that at least the oracle tagging can be found, which must be compatible with the pattern. This is because, from Lemma B.3, when $K>(|V|^{2}+1)^{2}$, each candidate nesting pattern $(x,y)$ must either be invalidated or contain an oracle call-return pair whose call and return symbols are unmatched in $x$ and $y$, respectively.

Therefore, since any VPG for the oracle VPL can be used for the checking, we pick the specific VPG converted from the minimal $k$ -SEVPA by the method discussed in Alur and Madhusudan (2004), Theorem 5.3 (Visibly pushdown grammars), where $|V|$ is bounded by $m^{2}+2m$ . ∎
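To give a sense of the magnitude of this bound, the following sketch simply evaluates the two expressions from the theorem and its proof for small $m$. The function names are hypothetical; only the formulas $|V|\leq m^{2}+2m$ and $K\leq(|V|^{2}+1)^{2}$ come from the text.

```python
def vpg_nonterminal_bound(m: int) -> int:
    """Upper bound on |V| for the VPG converted from a k-SEVPA with m states."""
    return m * m + 2 * m


def repetition_bound(m: int) -> int:
    """Upper bound ((m^2 + 2m)^2 + 1)^2 on K from the theorem statement."""
    v = vpg_nonterminal_bound(m)
    return (v * v + 1) ** 2


if __name__ == "__main__":
    for m in (1, 2, 3):
        print(m, vpg_nonterminal_bound(m), repetition_bound(m))
```

The bound grows as $m^{8}$, so it is a worst-case guarantee for termination rather than a practical choice of $K$.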

For Lemma B.5 and Theorem B.6, we first introduce another congruence relation by Alur et al. (2005). Two well-matched strings, $s_{1}$ and $s_{2}$, are deemed congruent, denoted as $s_{1}\sim s_{2}$, if their contexts coincide. Specifically,

\forall u,v\in\Sigma^{*},us_{1}v\in\mathcal{L}\iff us_{2}v\in\mathcal{L}.

This congruence is an equivalence relation, and $\mathcal{L}$ is a VPL on $\hat{\Sigma}$ if and only if the congruence relation admits a finite number of equivalence classes.
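The congruence quantifies over all contexts, so it cannot be decided by enumeration. A bounded approximation, which can only refute congruence and never prove it, can nevertheless be sketched as follows. The oracle (balanced parentheses with an even number of plain symbols `a`), the alphabet, and all function names are assumptions made for this illustration.

```python
from itertools import product
from typing import Callable, List, Tuple


def oracle(s: str) -> bool:
    """Hypothetical language: balanced parentheses and an even number of 'a'."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch != "a":
            return False
    return depth == 0 and s.count("a") % 2 == 0


def contexts(alphabet: str, max_len: int) -> List[Tuple[str, str]]:
    """All context pairs (u, v) with |u|, |v| <= max_len."""
    words = [""]
    for n in range(1, max_len + 1):
        words.extend("".join(p) for p in product(alphabet, repeat=n))
    return [(u, v) for u in words for v in words]


def congruent_up_to(member: Callable[[str], bool], s1: str, s2: str,
                    alphabet: str, max_len: int) -> bool:
    """True if no context of bounded length distinguishes s1 from s2."""
    return all(member(u + s1 + v) == member(u + s2 + v)
               for u, v in contexts(alphabet, max_len))
```

In this hypothetical language, `"(a)"` and `"a"` are not separated by any bounded context, whereas `"a"` and `"aa"` are already separated by the empty context.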

Given a tagging $t$ , denote the congruence relation over $\hat{\mathcal{L}}_{t}$ as $\sim_{t}$ .

Given a compatible tagging function $t$ and a string $s$ , the following Lemma B.5 shows that, if $t(s)$ is well-matched, then $t_{\mathcal{O}}(s)$ has a bounded number of unmatched symbols.

Lemma B.5.

Given oracle language $\mathcal{L}$ and oracle tagging $t_{\mathcal{O}}$, for each compatible tagging $t$, there exists a positive upper bound, denoted as $N_{t}$, such that for each string $s\in\Sigma^{*}$, if $s$ is $t$-well-matched and there exist context strings $(w,w^{\prime})$ such that $wsw^{\prime}\in\mathcal{L}$, then $t_{\mathcal{O}}(s)$ contains at most $N_{t}$ unmatched oracle call and return symbols.

Proof.

In this proof, to simplify the notation, we use “ $s$ ” or “ $w$ ” to also mean strings tagged by the oracle tagging function $t_{\mathcal{O}}$ . For strings tagged by a compatible tagging function $t$ , we use “ $t(s)$ ” and “ $t(w)$ ” explicitly.

To simplify the problem, let us assume there is an oracle VPG that includes only one matching rule; we denote the matching rule as $L\to{\text{\guilsinglleft{$a$}}{A}\text{{$b$}\guilsinglright}}B$. As an overview, we show that for a string $s$ that contains $K$ unmatched oracle call symbols $\text{\guilsinglleft{$a$}}$, we can construct $K$ equivalence classes for the oracle congruence relation. Therefore, $N_{t}$ is bounded by the number of oracle equivalence classes. The case for multiple matching rules can be proved similarly, and we omit it for brevity.

For a $t$-well-matched string $s$ with $wsw^{\prime}\in\mathcal{L}$, if $s$ contains no unmatched oracle symbols, then we are done. Otherwise, assume $s$ contains no return symbols, and $K$ unmatched oracle call symbols $\text{\guilsinglleft{$a$}}$ (the other cases are similar and we omit them for brevity). We can rewrite $wsw^{\prime}$ to reflect the derivation of these oracle call and return symbols as follows:

wsw^{\prime}=w+q_{0}(\text{\guilsinglleft{$a$}}s_{3}^{(1)})\dots(\text{\guilsinglleft{$a$}}s_{3}^{(K)})s_{L}^{(1)}+s_{L}^{(2)}(s_{4}^{(K)}\text{{$b$}\guilsinglright}s_{B}^{(K)})\dots(s_{4}^{(1)}\text{{$b$}\guilsinglright}s_{B}^{(1)})+w^{\prime\prime}

where $w$, $q_{0}$, $s_{3}^{(i)}$, $s_{L}^{(1)}$, $s_{L}^{(2)}$, $s_{4}^{(i)}$, $s_{B}^{(i)}$ and $w^{\prime\prime}$ for $i\in[1..K]$ are strings, and

	$\displaystyle s$	$\displaystyle=q_{0}(\text{\guilsinglleft{$a$}}s_{3}^{(1)})\dots(\text{\guilsinglleft{$a$}}s_{3}^{(K)})s_{L}^{(1)},$
	$\displaystyle w^{\prime}$	$\displaystyle=s_{L}^{(2)}(s_{4}^{(K)}\text{{$b$}\guilsinglright}s_{B}^{(K)})\dots(s_{4}^{(1)}\text{{$b$}\guilsinglright}s_{B}^{(1)})+w^{\prime\prime}$

and string $q_{0}(\text{\guilsinglleft{$a$}}s_{3}^{(1)})\dots(\text{\guilsinglleft{$a$}}s_{3}^{(K)})s_{L}^{(1)}+s_{L}^{(2)}(s_{4}^{(K)}\text{{$b$}\guilsinglright}s_{B}^{(K)})\dots(s_{4}^{(1)}\text{{$b$}\guilsinglright}s_{B}^{(1)})$ is derived by:

	$\displaystyle L\to\text{\guilsinglleft{$a$}}A\text{{$b$}\guilsinglright}B$	$\displaystyle\to(\text{\guilsinglleft{$a$}}s_{3}^{(1)})L(s_{4}^{(1)}\text{{$b$}\guilsinglright}s_{B}^{(1)})$
		$\displaystyle\to(\text{\guilsinglleft{$a$}}s_{3}^{(1)})\text{\guilsinglleft{$a$}}A\text{{$b$}\guilsinglright}B(s_{4}^{(1)}\text{{$b$}\guilsinglright}s_{B}^{(1)})$
		$\displaystyle\to(\text{\guilsinglleft{$a$}}s_{3}^{(1)})(\text{\guilsinglleft{$a$}}s_{3}^{(2)})L(s_{4}^{(2)}\text{{$b$}\guilsinglright}s_{B}^{(2)})(s_{4}^{(1)}\text{{$b$}\guilsinglright}s_{B}^{(1)})$
		$\displaystyle\to(\text{\guilsinglleft{$a$}}s_{3}^{(1)})\dots(\text{\guilsinglleft{$a$}}s_{3}^{(K)})L(s_{4}^{(K)}\text{{$b$}\guilsinglright}s_{B}^{(K)})\dots(s_{4}^{(1)}\text{{$b$}\guilsinglright}s_{B}^{(1)})$
		$\displaystyle\to(\text{\guilsinglleft{$a$}}s_{3}^{(1)})\dots(\text{\guilsinglleft{$a$}}s_{3}^{(K)})s_{L}(s_{4}^{(K)}\text{{$b$}\guilsinglright}s_{B}^{(K)})\dots(s_{4}^{(1)}\text{{$b$}\guilsinglright}s_{B}^{(1)})$

where $s_{L}=s_{L}^{(1)}s_{L}^{(2)}$ .

Let $(x_{i},y_{i})=(\text{\guilsinglleft{$a$}}s_{3}^{(i)},s_{4}^{(i)}\text{{$b$}\guilsinglright}s_{B}^{(i)})$, $i\in[1..K]$. Notice that the pairs $(x_{i},y_{i})$ are $K$ disjoint nesting patterns. Since these pairs are interchangeable, we denote each of them as $(x,y)$ when their indices do not matter. With this notation, we can simplify the above formulae as

(3)	$\displaystyle wsw^{\prime}$	$\displaystyle=w+q_{0}x^{K}s_{L}^{(1)}+s_{L}^{(2)}y^{K}w^{\prime\prime},\text{ where}$
(4)	$\displaystyle s$	$\displaystyle=q_{0}x^{K}s_{L}^{(1)}$
(5)	$\displaystyle w^{\prime}$	$\displaystyle=s_{L}^{(2)}y^{K}w^{\prime\prime}$

Since $t$ is compatible with $\mathcal{L}$ , for $i\in[1..K]$ , by definition, each pattern $(x_{i},y_{i})$ contains a call-return pair $(c_{i},d_{i})$ of compatible tagging $t$ , where $c_{i}$ and $d_{i}$ are unmatched in $x_{i}$ and $y_{i}$ , respectively. Without loss of generality, let us assume these $(c_{i},d_{i})$ are the same, denoted as $(c,d)$ .

Now consider Equations (3)-(5). For $i\in[1..K]$, each $x_{i}$ contains a symbol $c_{i}$, whose matched symbol, denoted as $d_{i}^{\prime}$, is after $x_{i}$; similarly, each $y_{i}$ contains a symbol $d_{i}$, whose matched symbol $c_{i}^{\prime}$ is before $y_{i}$. Since $s$ is $t$-well-matched, for $i\in[1..K]$, each symbol $d_{i}^{\prime}$ must be located in $s$, and each symbol $c_{i}^{\prime}$ cannot be located in $s$.

Now, for $j\in[1..K]$ , we construct $t_{\mathcal{O}}$ -well-matched strings $s_{j}$ and their contexts $(\widehat{w}_{j},\widehat{w}_{j}^{\prime})$ , as follows:

	$\displaystyle s_{j}$	$\displaystyle=x^{K-j}s_{L}^{(1)}+s_{L}^{(2)}y^{K-j}$
	$\displaystyle\widehat{w}_{j}$	$\displaystyle=wq_{0}x^{j}$
	$\displaystyle\widehat{w}_{j}^{\prime}$	$\displaystyle=y^{j}w^{\prime\prime}$

Now, we prove that each $s_{j}$ represents a different equivalence class. First, it is obvious that $s_{j}$ is $t_{\mathcal{O}}$ -well-matched. Then, we show that $\widehat{w}_{i}s_{j}\widehat{w}_{i}^{\prime}$ is invalid for $i\neq j$ . Let us expand $\widehat{w}_{i}s_{j}\widehat{w}_{i}^{\prime}$ as

\widehat{w}_{i}s_{j}\widehat{w}_{i}^{\prime}=w+q_{0}x^{K+i-j}s_{L}^{(1)}+s_{L}^{(2)}y^{K+i-j}w^{\prime\prime}.

Denote the number of unmatched $d$ and unmatched $c$ in string $x$ as $n_{d}(x)$ and $n_{c}(x)$ , respectively. There are three cases.

If $n_{d}(x)>n_{c}(x)$, then, when $i>j$, we have $n_{d}(wq_{0}x^{K+i-j})>n_{d}(wq_{0}x^{K})=0$. Thus, a prefix of $\widehat{w}_{i}s_{j}\widehat{w}_{i}^{\prime}$ contains a pending return symbol, and therefore the string is invalid.

If $n_{d}(x)<n_{c}(x)$ , then, when $i<j$ , we have

(6)	$\displaystyle n_{c}(\widehat{w}_{i})=n_{c}(wq_{0}x^{i})<n_{c}(wq_{0}x^{j})=n_{c}(\widehat{w}_{j}).$

On the other hand, since $s$ is $t$ -well-matched, we have

(7)	$\displaystyle n_{c}(wq_{0}x^{j})=n_{d}(x^{K-j}s_{L}^{(1)}).$

Therefore, based on Equations (6) and (7), we have

n_{d}(\widehat{w}_{i}x^{K-j}s_{L}^{(1)})=n_{d}(x^{K-j}s_{L}^{(1)})-n_{c}(\widehat{w}_{i})=n_{c}(wq_{0}x^{j})-n_{c}(\widehat{w}_{i})>0.

Similar to the first case, this shows $\widehat{w}_{i}s_{j}\widehat{w}_{i}^{\prime}$ is invalid.

In the last case, $n_{d}(x)=n_{c}(x)$ . We show that this is impossible. We first assume $n_{d}(x)=n_{c}(x)=1$ , then discuss the other case in the end. First, we rewrite each $x$ in $x^{K}$ as

x=w_{1}dw_{2}cw_{3},

where each $w_{i}$, $i=1,2,3$, contains neither an unmatched $c$ nor an unmatched $d$. Therefore, for two adjacent copies of $x$, we have

	$\displaystyle(w_{1}dw_{2}cw_{3})(w_{1}dw_{2}cw_{3})$
	$\displaystyle=w_{1}dw_{2}(cw_{3}w_{1}dw_{2})cw_{3}$

In string $cw_{3}w_{1}dw_{2}$ , notice that $c$ is matched with $d$ , therefore, string $w_{3}w_{1}$ is $t$ -well-matched. Since $(x,y)$ is a nesting pattern of the oracle tagging function, for any $k>0$ , we can rewrite $x^{k+1}$ as

x^{k+1}=w_{1}dw_{2}(cw_{3}w_{1}dw_{2})^{k}cw_{3}.

With the corresponding $y^{k}$ , we have a new nesting pattern

(cw_{3}w_{1}dw_{2},y).

Since tagging $t$ is compatible and $cw_{3}w_{1}d$ is $t$ -well-matched, $w_{2}$ contains an unmatched call symbol of tagging $t$ , denoted as $g$ , whose matched return symbol of tagging $t$ , denoted as $h$ , is after $w_{2}$ .

Now we have come back to a situation similar to the earlier comparison of $n_{c}(x)$ and $n_{d}(x)$, this time comparing $n_{g}(x)$ and $n_{h}(x)$. With a similar analysis, we can show that if $n_{g}(x)\neq n_{h}(x)$, then we can construct $K$ equivalence classes in $\sim_{t_{\mathcal{O}}}$. Therefore, we again must have $n_{g}(x)=n_{h}(x)$.

Then, we can rewrite $x$ by expanding $w_{2}$ as

x=w_{1}dw_{2}cw_{3}=w_{1}d(w_{1}^{\prime}hw_{2}^{\prime}gw_{3}^{\prime})cw_{3}.

And, similarly, rewrite two adjacent $x$ as

	$\displaystyle xx$	$\displaystyle=(w_{1}d(w_{1}^{\prime}hw_{2}^{\prime}gw_{3}^{\prime})cw_{3})(w_{1}d(w_{1}^{\prime}hw_{2}^{\prime}gw_{3}^{\prime})cw_{3})$
		$\displaystyle=w_{1}dw_{1}^{\prime}hw_{2}^{\prime}(gw_{3}^{\prime}cw_{3}w_{1}dw_{1}^{\prime}hw_{2}^{\prime})gw_{3}^{\prime}cw_{3}$

Again, string $gw_{3}^{\prime}cw_{3}w_{1}dw_{1}^{\prime}hw_{2}^{\prime}$ forms the first part of a nesting pattern, thus must contain another unmatched call symbol in tagging $t$ .

However, notice that $gw_{3}^{\prime}cw_{3}w_{1}dw_{1}^{\prime}h$ is $t$-well-matched. Therefore, the new unmatched symbol must appear in $w_{2}^{\prime}$, which is strictly shorter than $w_{2}$. Subsequently, we can find substrings $w_{2}^{\prime}$, $w_{2}^{\prime\prime}$, …, $w_{2}^{(i)}$, … with strictly decreasing lengths, each of which must contain an unmatched call symbol in $t$. However, $w_{2}^{(|w_{2}|)}$ must be empty and thus contains no symbol, which makes $t$ incompatible with $\mathcal{L}$, a contradiction.

Above is the case where the numbers of unmatched $c$ and $d$ in $x$ are both $1$. When the numbers are greater than $1$ (recall that the two numbers must be the same), we rewrite $x$ as

w_{1}dw_{2}cw_{3},

where $d$ is the last unmatched $d$ , and $c$ is the first unmatched $c$ . Expand $x^{2}$ again, and we can observe that string $w_{3}w_{1}$ is still $t$ -well-matched. The rest of the reasoning is the same as above.

In conclusion, for each $K$ and $t$-well-matched string $s$ with $wsw^{\prime}\in\mathcal{L}$, the number of unmatched oracle call symbols is at most the number of equivalence classes of $\sim_{t_{\mathcal{O}}}$. Given that $\mathcal{L}$ is a VPL under $t_{\mathcal{O}}$, the number of unmatched oracle symbols in any such $s$ has an upper bound. ∎
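The quantities $n_{c}(\cdot)$ and $n_{d}(\cdot)$ used in the case analysis above are plain counts of pending symbols. A minimal sketch of how they could be computed is given below, under the simplifying assumption that the tagging has a single one-character call symbol and a single one-character return symbol; the function name and default symbols are hypothetical.

```python
from typing import Tuple


def unmatched_counts(s: str, call: str = "(", ret: str = ")") -> Tuple[int, int]:
    """Return (n_c, n_d): the numbers of unmatched call and return symbols in s."""
    pending_calls = 0    # calls still waiting for a return to their right
    pending_returns = 0  # returns with no call available to their left
    for ch in s:
        if ch == call:
            pending_calls += 1
        elif ch == ret:
            if pending_calls > 0:
                pending_calls -= 1
            else:
                pending_returns += 1
    return pending_calls, pending_returns
```

For instance, with `x = ")(("` we have $n_{d}(x)=1<n_{c}(x)=2$, and repeating `x` makes the number of pending calls grow with the number of copies, which is what the second case of the proof exploits. With `x = ")("` both counts stay at $1$ under repetition, which corresponds to the last case.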

Theorem B.6.

Given oracle language $\mathcal{L}$ and oracle tagging $t_{\mathcal{O}}$ , if language $\{t_{\mathcal{O}}(s)\mid s\in\mathcal{L}\}$ is a VPL, then for any tagging $t$ compatible with $\mathcal{L}$ , language $\hat{\mathcal{L}}_{t}=\{t(s)\mid s\in\mathcal{L}\}$ is a VPL.

Proof.

By Lemma B.5, there exists a positive number $N_{t}$ , such that any $t$ -well-matched string contains at most $N_{t}$ unmatched oracle symbols.

For a given $t$-well-matched string $p$, without loss of generality, let us assume there are only $K\leq N_{t}$ unmatched oracle call symbols in $t_{\mathcal{O}}(p)$. We prove the theorem by showing that $t(p)$ is equivalent to a string whose length is bounded by a fixed number, denoted as $N$.

First, we can partition $t_{\mathcal{O}}(p)$ into $t_{\mathcal{O}}(p)=\hat{p}_{1}\text{\guilsinglleft{$a$}}_{1}\hat{p}_{2}\text{\guilsinglleft{$a$}}_{2}\dots\hat{p}_{K}\text{\guilsinglleft{$a$}}_{K}\hat{p}_{K+1}$, where each $\text{\guilsinglleft{$a$}}_{i}$ is an unmatched call symbol and each $\hat{p}_{i}$ is well-matched under $t_{\mathcal{O}}$. Let us denote the length of the longest representative under $t_{\mathcal{O}}$ as $l$. Denoting by ${[p_{i}]}_{\mathcal{O}}$ the representative of $p_{i}$ in the oracle congruence relation, we have

\forall w,w^{\prime},\ wp_{i}w^{\prime}\in\mathcal{L}\iff w{[p_{i}]}_{\mathcal{O}}w^{\prime}\in\mathcal{L}.

Now, we construct a shorter representative for $t(p)$ , by replacing $p_{i}$ with $[p_{i}]_{\mathcal{O}}$ . Formally, for all $w_{1}$ and $w_{2}$ ,

	$\displaystyle w_{1}pw_{2}$	$\displaystyle=w_{1}(p_{1}{a}_{1}p_{2}{a}_{2}\dots p_{K}{a}_{K}p_{K+1})w_{2}$
		$\displaystyle=(w_{1}p_{1}{a}_{1}\dots a_{i-1})p_{i}({a}_{i}\dots p_{K}{a}_{K}p_{K+1}w_{2})\in\mathcal{L}$
		$\displaystyle\iff(w_{1}p_{1}{a}_{1}\dots a_{i-1}){[p_{i}]}_{\mathcal{O}}({a}_{i}\dots p_{K}{a}_{K}p_{K+1}w_{2})\in\mathcal{L}.$

Consequently, the length of any representative under tagging $t$ is limited by $K+(K+1)l\leq N_{t}+(N_{t}+1)l$ , accounting for $K$ unmatched characters and $K+1$ substrings each with a length not exceeding $l$ .

In conclusion, language $\hat{\mathcal{L}}_{t}$ has a finite number of equivalence classes and is therefore a VPL. ∎

Theorem B.7 (Finite and Sufficient Seed Strings).

Proof.

The strategy is to first construct a set of seed strings that provide information of the oracle call and return symbols, then extend the set with more seed strings to exclude the incompatible tagging functions.

Initialize $S_{0}$ as an empty set. For each oracle call-return symbol pair $(\text{\guilsinglleft{$a$}},\text{{$b$}\guilsinglright})$, pick a seed string $s$ that contains a nesting pattern $(x,y)$ where $\text{\guilsinglleft{$a$}}$ is unmatched in $x$, and $\text{{$b$}\guilsinglright}$ is unmatched in $y$. Incorporate $s$ into $S_{0}$. Note that if there is no such nesting pattern for $(\text{\guilsinglleft{$a$}},\text{{$b$}\guilsinglright})$, then it is easy to show that $(\text{\guilsinglleft{$a$}},\text{{$b$}\guilsinglright})$ are “redundant” in that they can be treated as plain symbols, which does not change the language with tagging removed.

Then, for each tagging $t$ that can be found in $S_{0}$ , but is incompatible with a nesting pattern of a certain string $s$ in $\mathcal{L}$ , include string $s$ in $S_{0}$ . Given that the set of such tagging functions is finite, $S_{0}$ remains a finite set.

In conclusion, given such $S_{0}$ , Algorithm 3 can at least find the oracle tagging (with redundant tagging removed), by iteratively selecting the oracle call-return pair for each nesting pattern. ∎

Appendix C Proofs of Theorems in Section 5

This section is organized as follows. Lemma C.1 shows that given a nesting pattern $uxzyv$ , for sufficiently large $k$ , string $ux^{k}zy^{k}v$ contains an unmatched oracle call token in $x^{k}$ , and contains an unmatched oracle return token in $y^{k}$ . Since such $k$ varies among strings, Lemma C.2 bounds $k$ with $k\leq 2$ with the help of Exclusivity. Moving on, Lemma C.3 shows that for oracle language $\mathcal{L}$ and oracle tokenizer ${\tau}_{\mathcal{O}}$ , $\tilde{\mathcal{L}}_{{\tau}_{\mathcal{O}}}$ is a VPL over $\tilde{\Sigma}_{\mathcal{O}}$ . Lemma C.4 shows that token-based matching rules lead to nesting patterns. Based on Lemma C.4, Lemma C.5 bounds the number of unmatched tokens between ${\tau}(s)$ and ${\tau}_{\mathcal{O}}(s)$ . Following Lemma C.3 and C.5, Theorem C.6 proves that a compatible tokenizer converts the oracle language into a VPL. We conclude this section with Lemma C.7, which shows that there exists a finite set of seed strings that allows V-Star to find a compatible tokenizer.

Lemma C.1 (Matched Tokens in Nesting Patterns, I).

Given oracle language $\mathcal{L}$, oracle tokenizer ${\tau}$, and a VPG $G=(\Sigma,V,P,L_{0})$ for the oracle language, for each nesting pattern $s=uxzyv$, there exist an oracle matching rule in $P$, denoted as $L\to h_{a}{A}h_{b}B$, and a positive number $k$, such that tokens $h_{a}$ and $h_{b}$ are included in $x^{k}$ and $y^{k}$, respectively.

Proof.

For each $k>0$ , consider the tokenization of $ux^{k}zy^{k}v$ , denoted as $l_{k}={\tau}(ux^{k}zy^{k}v)$ .

Let $l_{k}^{x}$ be the subsequence of tokens in $l_{k}$ that overlap $x^{k}$.

We say a token covers a string, if the string is a substring of the string captured by the token. For example, if token $h$ in $l_{k}$ captures string $ux$ , then $h$ covers $x^{1}$ .

We show that there exists an upper bound, denoted as $N_{x}$, such that for each $k$ and each token $h\in l_{k}^{x}$, if $h$ covers $x^{i}$, then $i\leq N_{x}$. Otherwise, since there are only a finite number of tokens, there exists a token $h$ that starts at a character in $ux^{k}$ and can end at infinitely many locations in $x^{k}$ for $k>0$, or at infinitely many locations in $y^{k}$ for $k>0$ (possibly both). In the first case, one can see that there exist two strings $s_{1}$, $s_{2}$ of $h$, such that

	$\displaystyle s_{1}$	$\displaystyle=vx^{i}x_{1}$
	$\displaystyle s_{2}$	$\displaystyle=vx^{j}x_{1}$

where $v$ is a suffix of either $u$ or $x$, $i$ and $j$ are two numbers such that $i\neq j$, and $x_{1}$ is a prefix of $x$. Apparently, $s_{1}$ and $s_{2}$ can be exchanged, which contradicts the fact that $(x,y)$ is a nesting pattern. The second case can be invalidated similarly.

Similarly, we can prove another upper bound $N_{y}$ for $y$ . Let $N$ be $\max(N_{x},N_{y})$ .

With upper bound $N$ , we can consider a new nesting pattern $u^{\prime}xz^{\prime}yv^{\prime}$ , where $u^{\prime}=ux^{N}$ , $z^{\prime}=x^{N}zy^{N}$ , and $v^{\prime}=y^{N}v$ :

(ux^{N})x^{k}(x^{N}zy^{N})y^{k}(y^{N}v).

By doing this, we exclude the first and last tokens in $l_{k}$ that only partially overlap with $x^{k}$ . From now on, we assume $l_{k}$ is contained in $x^{k}$ for each $k$ .

For each positive number $K$, define language $\mathcal{L}_{K}$ as $\{w\in l_{k}\mid k>K\}$, and token-based language $\mathcal{L}_{K}^{\prime}$ as $\{l_{k}\mid k>K\}$. Apparently, $\mathcal{L}_{K}$ is not a regular language. Based on this, we can show that there exists a token list $l\in\mathcal{L}_{K}^{\prime}$ that contains unmatched call or return tokens. Otherwise, each sequence $l\in\mathcal{L}_{K}^{\prime}$ can only contain plain tokens or well-matched tokens. We then only need to show that the depth of the nested well-matched tokens is bounded for all $k$; it follows that the language $\mathcal{L}_{K}$ is a regular language, a contradiction.

To show that the depth is bounded, notice that otherwise, we would have a nesting pattern in $x^{k}$ for certain $k$ . Denote the pattern as $(g,h)$ . Nesting pattern $(g,h)$ , by definition, can be replaced by $(g^{i},h^{i})$ for any $i>0$ in $x^{k}$ . This replacement extends $x^{k}$ to $x^{k^{\prime}}$ for certain $k^{\prime}>k$ . However, since $(x,y)$ is a nesting pattern, $ux^{k^{\prime}}zy^{k}v$ should be invalid, a contradiction.

Therefore, we have shown that for each $K$ , there exists $k$ and $l_{k}$ , such that $k>K$ and $l_{k}$ contains a non-plain token. In other words, when $k$ goes to infinity, $x^{k}$ contains an infinite number of unmatched call or return tokens. However, the number of unmatched return tokens is bounded in $x^{k}$ for $k>0$ , otherwise, since $u$ is fixed, not enough call tokens can match those return tokens. Therefore, the number of unmatched call tokens is unbounded in $x^{k}$ for $k>0$ . This means for a sufficiently large $k$ , a call token in $x^{k}$ must be matched with a return token in $y^{k}$ . We thus have proven the lemma. ∎

Lemma C.2 (Matched Tokens in Nesting Patterns, II).

Based on Lemma C.1, assume Exclusivity. For any nesting pattern $uxzyv$ , there exists a matching rule, denoted as $L\to{h_{a}}{A}{h_{b}}B$ , such that $s_{a}\in{h_{a}}$ is a substring of $x^{2}$ , and $s_{b}\in{h_{b}}$ is a substring of $y^{2}$ .

Proof.

From Lemma C.1, we know that a token $h_{a}$ is in $x^{k}$ for certain $k$ . We can therefore let $x=x_{1}x_{2}=x_{1}^{\prime}x_{2}^{\prime}$ , so that $s_{a}=x_{2}x^{i}x_{1}^{\prime}=x_{2}(x_{1}x_{2})^{i}x_{1}^{\prime}$ for certain $i$ . Consider $x_{2}$ ; there are two cases.

If $x_{2}$ is empty, then $s_{a}=x_{1}^{i}x_{1}^{\prime}=x^{i}x_{1}^{\prime}=(x_{1}^{\prime}x_{2}^{\prime})^{i}x_{1}^{\prime}$. Notice that we must have $i\leq 1$, otherwise either $x_{1}^{\prime}$ or $x_{2}^{\prime}$ becomes both a prefix and an infix of $s_{a}$, which violates Exclusivity. Therefore, $s_{a}=x_{1}^{\prime}$ or $s_{a}=x_{1}^{\prime}x_{2}^{\prime}x_{1}^{\prime}$. In either case, $s_{a}$ is a substring of $x^{2}$.

If $x_{2}$ is nonempty, $i$ must be zero, otherwise $x_{2}$ is both a prefix and an infix, a violation of Exclusivity. In this case, we have $s_{a}=x_{2}x_{1}^{\prime}$ , also a substring of $x^{2}$ .

In conclusion, we know that $s_{a}$ is a substring of $x^{2}$ . The reasoning is quite similar for the return token, and we omit it for brevity. ∎

Lemma C.3.

For oracle language $\mathcal{L}$ and oracle tokenizer ${\tau}_{\mathcal{O}}$ , $\tilde{\mathcal{L}}_{{\tau}_{\mathcal{O}}}$ is a VPL over $\tilde{\Sigma}_{\mathcal{O}}$ .

Proof.

We prove by building a VPA for language $\tilde{\mathcal{L}}_{{\tau}_{\mathcal{O}}}$ .

First, since the language $\{{\tau}_{\mathcal{O}}(s)\mid s\in\mathcal{L}\}$ is a VPL over $T_{{\tau}_{\mathcal{O}}}$ , we have a VPA for it, denoted as $\mathcal{H}_{\mathcal{O}}$ .

Then, since each token ${\tau}$ is a regular language, it is recognized by a finite state automaton, which we denote as $\mathcal{H}_{t}$.

Now we build a VPA $\tilde{\mathcal{H}}$ for language $\tilde{\mathcal{L}}_{{\tau}_{\mathcal{O}}}$ by replacing each token in $\mathcal{H}_{\mathcal{O}}$ with its FSA $\mathcal{H}_{t}$ .

First, the set of states is the union of states in $\mathcal{H}_{\mathcal{O}}$ and $\mathcal{H}_{t}$ for ${\tau}\in T_{{\tau}_{\mathcal{O}}}$ .

Then, the transitions are defined as follows.

(1)

We retain transitions $p\xrightarrow{i}p^{\prime}$ in each FSA $\mathcal{H}_{h_{c}}$ for plain token $h_{c}$ .
(2)

For transition $q\xrightarrow{h_{c}}q^{\prime}$ , we add transition $(q,\epsilon)\to S_{h_{c},0}$ for the start state $S_{h_{c},0}$ and transitions $(E_{h_{c}},\epsilon)\to q^{\prime}$ for each acceptance state $E_{h_{c}}$ in $\mathcal{H}_{h_{c}}$ .
(3)

For transition $q\xrightarrow{h_{a},\text{ push }}q^{\prime}$, we add transition $q\xrightarrow{\text{\guilsinglleft{$a$}},\text{ push }(q,\text{\guilsinglleft{$a$}})}S_{h_{a},0}$, where $\text{\guilsinglleft{$a$}}$ is the call token corresponding to $h_{a}$, and we add transitions $(E_{h_{a}},\epsilon)\to q^{\prime}$ for each acceptance state $E_{h_{a}}$ in $\mathcal{H}_{h_{a}}$.
(4)

For transition $q\xrightarrow{h_{b},\text{ pop }(q^{\prime\prime},h_{a}^{\prime})}q^{\prime}$, we add transition $q\xrightarrow{\text{{$b$}\guilsinglright},\text{ pop }(q^{\prime\prime},\text{\guilsinglleft{$a$}}^{\prime})}q^{\prime}$, where $\text{\guilsinglleft{$a$}}^{\prime}$ is the call token corresponding to $h_{a}^{\prime}$, and we add transitions $(E_{h_{b}},\epsilon)\to q^{\prime}$ for each acceptance state $E_{h_{b}}$ in $\mathcal{H}_{h_{b}}$.

Apparently, a string $s\in\mathcal{L}$ must be accepted by $\tilde{\mathcal{H}}$. On the other hand, if a string $s$ is accepted by $\tilde{\mathcal{H}}$, then it leads to a valid token sequence $l$ with $s\in l$; therefore, $s$ is also a valid string in $\mathcal{L}$.

In conclusion, $\tilde{\mathcal{H}}$ is a VPA that accepts $\tilde{\mathcal{L}}_{{\tau}_{\mathcal{O}}}$ , which means language $\tilde{\mathcal{L}}_{{\tau}_{\mathcal{O}}}$ is a VPL. ∎
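The idea behind this construction, namely running a token-level pushdown recognizer directly on characters by plugging in a recognizer for each token, can be illustrated informally as follows. This sketch is not the construction above: it uses Python's `re` patterns in place of the per-token FSAs, direct dispatch in place of $\epsilon$-transitions, and a single control state. The three token definitions are hypothetical.

```python
import re

# Hypothetical lexical rules: one call token, one return token, one plain token.
TOKENS = [
    ("call", re.compile(r"<[a-z]+>")),
    ("return", re.compile(r"</[a-z]+>")),
    ("plain", re.compile(r"[a-z ]+")),
]


def accepts(s: str) -> bool:
    """Character-level recognizer for well-matched token sequences."""
    stack = []
    pos = 0
    while pos < len(s):
        for kind, pattern in TOKENS:
            m = pattern.match(s, pos)  # run the token recognizer at this position
            if m:
                break
        else:
            return False  # no token starts here
        if kind == "call":
            stack.append(m.group())  # push on a call token
        elif kind == "return":
            if not stack:
                return False  # pending return token
            stack.pop()  # pop on a return token
        pos = m.end()  # every pattern matches at least one character
    return not stack  # accept only if no call token is left pending
```

As in the proof, a string is accepted exactly when it splits into a token sequence that the token-level automaton accepts; here that automaton only checks that call and return tokens are well-matched.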

Lemma C.4 (Call and Return Tokens in Nesting Patterns).

For any oracle VPL $\mathcal{L}$ , for each string $s\in\mathcal{L}$ , if $s$ is derived by repeatedly applying a matching rule which exposes a recursion, i.e.,

L_{0}\to^{*}pLq\to p({h_{a}}A{h_{b}}B)q\to^{*}p(s_{a}uLvs_{b}B)q\to^{*}ps_{a}us_{L}vs_{b}s_{B}q=s

where $s_{a}\in{h_{a}}$ and $s_{b}\in{h_{b}}$, $s_{L},s_{B}\in\mathcal{L}$ are strings derived from nonterminals $L$ and $B$, respectively, and $u,v\in\mathcal{L}$, then there exists a nesting pattern $(x,y)$ for $s$ where $s_{a}$ is a prefix of $x$ and $s_{b}$ is a substring of $y$.

Proof.

Consider the iterative application of the derivation $pLq\to^{*}p(s_{a}uLvs_{b}B)q$ to $L$. This leads to the derivation $pLq\to^{*}p(s_{a}u)^{k}L(vs_{b}s_{B})^{k}q$. To show that the pair $(s_{a}u,vs_{b}s_{B})$ represents a nesting pattern, we only need to prove that $p(s_{a}u)^{k}s_{L}(vs_{b}s_{B})^{j}q$ is invalid when $k\neq j$ (both $\geq 0$).

Let the oracle tokenizer be ${\tau}$. For $k>0$, we tokenize string $s_{3}=p(s_{a}u)^{k}s_{L}(vs_{b}s_{B})^{k}q$ as:

	$\displaystyle{\tau}(s_{3})$	$\displaystyle={\tau}(p(s_{a}u)^{k}s_{L}(vs_{b}s_{B})^{k}q)$
		$\displaystyle={\tau}(p){h_{a}}({\tau}(u){h_{a}})^{k-1}{\tau}(us_{L}v){h_{b}}({\tau}(s_{B}v){h_{b}})^{k-1}{\tau}(s_{B}q)\quad\text{(By Tokenization Consistency)}$

Notice that $u$, $s_{L}$ and $v$ each form a complete token list on their own. Therefore, we have

{\tau}(us_{L}v)={\tau}(u){\tau}(s_{L}){\tau}(v).

We next tokenize string $s_{4}=p(s_{a}u)^{k}s_{L}(vs_{b}s_{B})^{j}q$ , for $k\neq j$ (both $>0$ ):

	$\displaystyle{\tau}(s_{4})$	$\displaystyle={\tau}(p(s_{a}u)^{k}s_{L}(vs_{b}s_{B})^{j}q)$
		$\displaystyle=\begin{cases}{\tau}(p(s_{a}u)^{k}s_{L}vs_{b}(s_{B}vs_{b})^{j-1}s_{B}q)&\text{if }k>j\\ {\tau}(p(s_{a}u)^{k}s_{L}vs_{b}(s_{B}vs_{b})^{k-1}s_{B}(vs_{b}s_{B})^{j-k}q)&\text{if }k<j\end{cases}$
		$\displaystyle=\begin{cases}{\tau}(p(s_{a}u)^{k}s_{L}vs_{b}(s_{B}vs_{b})^{j-1}){\tau}(s_{B}q)&\text{if }k>j\\ {\tau}(p(s_{a}u)^{k}s_{L}vs_{b}(s_{B}vs_{b})^{k-1}){\tau}(s_{B}(vs_{b}s_{B})^{j-k}q)&\text{if }k<j\end{cases}$
		$\displaystyle={\tau}(p){h_{a}}({\tau}(u){h_{a}})^{k-1}{\tau}(us_{L}v){h_{b}}({\tau}(s_{B}v){h_{b}})^{j-1}{\tau}(s_{B}q)$

We applied Tokenization Consistency in the last step above. Apparently, ${\tau}(s_{4})$ is invalid because of imbalanced call and return tokens. There are two cases left, where either $k$ or $j$ equals $0$.

When $k=0$ and $j>0$ , $s_{4}=ps_{L}(vs_{b}s_{B})^{j}q$ . Assume $s_{4}$ is valid; we can tokenize $s_{4}$ as

	$\displaystyle{\tau}(s_{4})$	$\displaystyle={\tau}(ps_{L}(vs_{b}s_{B})^{j}q)$
		$\displaystyle={\tau}(ps_{L}vs_{b}(s_{B}vs_{b})^{j-1}s_{B}q)$
		$\displaystyle={\tau}(p){\tau}(s_{L}){\tau}(v){h_{b}}({\tau}(s_{B}v){h_{b}})^{j-1}{\tau}(s_{B}){\tau}(q).$

Again, ${\tau}(s_{4})$ is apparently invalid. The case of $k>0$ and $j=0$ is similar and we omit it for brevity.

Therefore, we have shown that the pair $(s_{a}u,vs_{b}s_{B})$ is a nesting pattern. ∎
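The imbalance argument can be made concrete on a small instance. In the sketch below, the lexical rules and the concrete choices $s_{a}=$ `<d>`, $u=v=$ `t`, $s_{L}=$ `x`, $s_{b}=$ `</d>`, with $p$, $s_{B}$ and $q$ empty, are all hypothetical; the code only counts the call and return tokens of $p(s_{a}u)^{k}s_{L}(vs_{b}s_{B})^{j}q$.

```python
import re
from typing import List, Tuple

# Hypothetical lexical rules, combined into one pattern with named groups.
TOKEN = re.compile(r"(?P<call><[a-z]+>)|(?P<ret></[a-z]+>)|(?P<plain>[a-z ]+)")


def token_kinds(s: str) -> List[str]:
    """Tokenize s left to right and return the kind of each token."""
    kinds, pos = [], 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if m is None:
            raise ValueError(f"no token at position {pos}")
        kinds.append(m.lastgroup)  # name of the alternative that matched
        pos = m.end()
    return kinds


def pumped(k: int, j: int, p: str = "", s_a: str = "<d>", u: str = "t",
           s_l: str = "x", v: str = "t", s_b: str = "</d>",
           s_big_b: str = "", q: str = "") -> str:
    """Build p (s_a u)^k s_L (v s_b s_B)^j q."""
    return p + (s_a + u) * k + s_l + (v + s_b + s_big_b) * j + q


def call_return_counts(k: int, j: int) -> Tuple[int, int]:
    """Numbers of call and return tokens in the pumped string."""
    kinds = token_kinds(pumped(k, j))
    return kinds.count("call"), kinds.count("ret")
```

On this instance the tokenization contains exactly $k$ call tokens and $j$ return tokens, so the token sequence can only be well-matched when $k=j$, including the boundary cases where $k$ or $j$ is $0$.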

Lemma C.5.

Given a compatible tokenizer ${\tau}$, there exists an upper bound $N_{\tau}$, such that for each string $s\in\tilde{\Sigma}^{*}$, if $\tilde{s}$ is well-matched and there exist context strings $(w,w^{\prime})$ such that $wsw^{\prime}\in\mathcal{L}$, then string $s$ contains at most $N_{\tau}$ unmatched oracle call or return tokens.

Proof.

The proof parallels the proof of Lemma B.5; we show that an unbounded number of unmatched tokens would result in an infinite number of equivalence classes for $\sim_{{\tau}_{\mathcal{O}}}$, contradicting the fact that $\tilde{\mathcal{L}}_{{\tau}_{\mathcal{O}}}$ is a VPL, as proven by Lemma C.3.

If for any number $K$, there exists a ${\tau}$-well-matched string $s$ that contains at least $K$ unmatched call tokens, then, by Lemma C.4, a set of nesting patterns $(x_{i},y_{i})$ appears in $wsw^{\prime}$, and therefore in $\tilde{w}\tilde{s}\tilde{w}^{\prime}$. Then, since ${\tau}$ is compatible with $\mathcal{L}$, a call-return token pair $\text{\guilsinglleft{$c$}},\text{{$d$}\guilsinglright}$ appears in each $(\tilde{x_{i}},\tilde{y_{i}})$. The rest of the analysis is similar to that of Lemma B.5, and we omit it for brevity. ∎

Theorem C.6.

Given oracle language $\mathcal{L}$ and oracle tokenizer ${\tau}_{\mathcal{O}}$ , for each tokenizer ${\tau}$ that is compatible with the oracle language $\mathcal{L}$ , language $\tilde{\mathcal{L}}_{{\tau}}$ is a VPL.

Proof.

By Lemma C.5, for certain $k\leq N_{\tau}$, we can represent $\tilde{q}_{\mathcal{O}}$ as $q_{1}\text{\guilsinglleft{$a$}}_{1}q_{2}\text{\guilsinglleft{$a$}}_{2}\dots\text{\guilsinglleft{$a$}}_{k-1}q_{k}$, where each $q_{i}$, $i\in[1..k]$, is well-matched under ${\tau}_{\mathcal{O}}$. Now, we replace each $q_{i}$ with its representative in the equivalence class of $\sim_{{\tau}_{\mathcal{O}}}$, and obtain a string $q^{\prime}$ such that $\tilde{q^{\prime}}\sim_{{\tau}}\tilde{q}$. Since $\sim_{{\tau}_{\mathcal{O}}}$ has a finite number of equivalence classes, letting the length of the longest representative be $l$, the length of $q^{\prime}$ is bounded by $l\times k+k+1$.

Therefore, the congruence relation $\sim_{\tau}$ has a finite number of equivalence classes, which shows $\tilde{\mathcal{L}}_{{\tau}}$ is a VPL. ∎

Lemma C.7 (Finite and Sufficient Seed Strings).

There exists a finite set of seed strings $S$ from which Algorithm 4 can find a tokenizer that is compatible with the oracle language $\mathcal{L}$.

Proof.

The construction of $S$ involves two phases: we first identify strings that reveal the oracle call and return tokens, then augment this set to exclude taggings incompatible with $\mathcal{L}$ .

Starting with an empty set $S_{0}$, for each oracle call-return token pair $({h_{a}},{h_{b}})$, we select a seed string $s$ that contains a nesting pattern $(x,y)$ in which ${h_{a}}$ and ${h_{b}}$ are unmatched in $x$ and $y$, respectively, and add $s$ to the set. We denote the resulting set as $S_{1}$. Note that, as in Theorem B.7, if no such nesting pattern exists for $({h_{a}},{h_{b}})$, then it is easy to show that ${h_{a}}$ and ${h_{b}}$ are “redundant”: they can be treated as plain tokens without changing the language obtained by removing the tagging.

Subsequently, we extend $S_{1}$ with strings from $\mathcal{L}$: for each tokenizer that Algorithm 4 can find from $S_{1}$ and that is incompatible with $\mathcal{L}$, we add a string $s\in\mathcal{L}$ that witnesses the incompatibility. We denote the resulting set as $S_{2}$. $S_{2}$ is still finite, since Algorithm 4 can find only a finite number of tokenizers from $S_{1}$.

For each string $s$ in $S_{2}$, we modify $s$ by replacing the call and return tokens ${h_{a}}$ and ${h_{b}}$ with the strings $s_{a}$ and $s_{b}$ given by the Token Fixed Prefix and Suffix assumption, respectively. The resulting set is our target set $S$.

Given $S$, we now show that the oracle tokenizer ${\tau}_{\mathcal{O}}$ is a possible return value of Algorithm 4. Intuitively, the oracle token pairs can be added incrementally to the hypothesis partial tokenizer $D$, and the tokenization $\mathbf{tokenize}(D,s)$ of any valid string $s$ remains well-matched.

By Lemma C.2, each oracle token pair is contained in $(x^{2},y^{2})$ for some nesting pattern $(x,y)$. Therefore, from the first nesting pattern $(x,y)$, Algorithm 4 can find a candidate $((q,g),(q^{\prime},g^{\prime}))$ that belongs to an oracle token pair. Then, by the Token Fixed Prefix and Suffix assumption, the lexical rules for the two paired tokens are learned accurately. Denote the tokenizer that contains only this token pair as $D_{1}$. Now consider Algorithm 5, which constructs a new match $m$ at line 4. We show that $m$ must be the match given by the oracle tokenizer. First, $m$ must correspond to an oracle match; otherwise it would be filtered out by $k$-Repetition. Second, by Unique Pairing and Separation, $m$ must correspond to the oracle call/return token. Therefore, $\mathbf{tokenize}(D_{1},s)$ contains only well-matched oracle tokens and is thus well-matched.

A similar analysis shows that, as long as $D$ contains only oracle token pairs, $\mathbf{tokenize}(D,s)$ is well-matched for any valid string $s$. We have thus proved that, given $S$ (with redundant oracle tokens removed), Algorithm 4 can at least find the oracle tokenizer by iteratively selecting the oracle token pair for each nesting pattern.

In conclusion, there exists a finite set of seed strings from which Algorithm 4 can find a compatible tokenizer. ∎
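
The construction of $S$ in the proof above can be summarized by the following sketch. It is only an outline under stated assumptions: every callable passed to `build_seed_set` (`pick_seed`, `candidate_tokenizers`, `find_incompatible_string`, `concretize`) is a hypothetical stand-in for a step that the proof describes abstractly, and seed strings are encoded as tuples of token names purely for illustration. None of these names come from V-Star.

```python
from typing import Callable, Hashable, Iterable, Optional, Set, Tuple

Seed = Tuple[str, ...]  # a seed string as a tuple of token names (illustrative)
Pair = Tuple[str, str]  # an oracle call-return token pair (h_a, h_b)


def build_seed_set(
    token_pairs: Iterable[Pair],
    pick_seed: Callable[[Pair], Optional[Seed]],
    candidate_tokenizers: Callable[[Set[Seed]], Iterable[Hashable]],
    find_incompatible_string: Callable[[Hashable], Optional[Seed]],
    concretize: Callable[[Seed], str],
) -> Set[str]:
    # Phase 1 (S_1): one seed per oracle pair that occurs unmatched in a
    # nesting pattern; pick_seed returns None for "redundant" pairs.
    s1: Set[Seed] = set()
    for pair in token_pairs:
        seed = pick_seed(pair)
        if seed is not None:
            s1.add(seed)

    # Phase 2 (S_2): for every tokenizer reachable from S_1 that is
    # incompatible with the language, add one witnessing string.
    s2: Set[Seed] = set(s1)
    for tokenizer in candidate_tokenizers(s1):
        witness = find_incompatible_string(tokenizer)
        if witness is not None:
            s2.add(witness)

    # Final step (S): replace token names by concrete strings.
    return {concretize(seed) for seed in s2}


# Toy instantiation with made-up stubs, only to show the data flow.
lexemes = {"LP": "(", "RP": ")"}
seeds = build_seed_set(
    token_pairs=[("LP", "RP")],
    pick_seed=lambda pair: (pair[0], "x", pair[1]),
    candidate_tokenizers=lambda s1: ["good", "bad"],
    find_incompatible_string=lambda t: ("LP", "LP", "x", "RP", "RP") if t == "bad" else None,
    concretize=lambda seed: "".join(lexemes.get(tok, tok) for tok in seed),
)
print(sorted(seeds))  # ['((x))', '(x)']
```

The sketch terminates because each loop ranges over a finite collection, which mirrors the finiteness argument for $S_{1}$ and $S_{2}$.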

References

Aichernig et al. (2024) Bernhard K. Aichernig, Martin Tappler, and Felix Wallner. 2024. Benchmarking Combinations of Learning and Testing Algorithms for Automata Learning. Form. Asp. Comput. 36, 1, Article 3 (mar 2024), 37 pages. https://doi.org/10.1145/3605360
Alur (2007) Rajeev Alur. 2007. Marrying words and trees. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Beijing, China) (PODS ’07). Association for Computing Machinery, New York, NY, USA, 233–242. https://doi.org/10.1145/1265530.1265564
Alur et al. (2005) Rajeev Alur, Viraj Kumar, P. Madhusudan, and Mahesh Viswanathan. 2005. Congruences for visibly pushdown languages. In Proceedings of the 32nd International Conference on Automata, Languages and Programming (Lisbon, Portugal) (ICALP’05). Springer-Verlag, Berlin, Heidelberg, 1102–1114. https://doi.org/10.1007/11523468_89
Alur and Madhusudan (2004) Rajeev Alur and P. Madhusudan. 2004. Visibly pushdown languages. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing (Chicago, IL, USA) (STOC ’04). Association for Computing Machinery, New York, NY, USA, 202–211. https://doi.org/10.1145/1007352.1007390
Alur and Madhusudan (2009) Rajeev Alur and P. Madhusudan. 2009. Adding nesting structure to words. J. ACM 56, 3, Article 16 (may 2009), 43 pages. https://doi.org/10.1145/1516512.1516518
Angluin (1987) Dana Angluin. 1987. Learning regular sets from queries and counterexamples. Inf. Comput. 75, 2 (nov 1987), 87–106. https://doi.org/10.1016/0890-5401(87)90052-6
Arefin et al. (2024) M. Arefin, S. Shetiya, Z. Wang, and C. Csallner. 2024. Fast Deterministic Black-box Context-free Grammar Inference. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 901–901. https://doi.ieeecomputersociety.org/
Barbot et al. (2021) Benoît Barbot, Benedikt Bollig, Alain Finkel, Serge Haddad, Igor Khmelnitsky, Martin Leucker, Daniel Neider, Rajarshi Roy, and Lina Ye. 2021. Extracting Context-Free Grammars from Recurrent Neural Networks using Tree-Automata Learning and A* Search. In Proceedings of the Fifteenth International Conference on Grammatical Inference (Proceedings of Machine Learning Research, Vol. 153), Jane Chandlee, Rémi Eyraud, Jeff Heinz, Adam Jardine, and Menno van Zaanen (Eds.). PMLR, 113–129. https://proceedings.mlr.press/v153/barbot21a.html
Bastani et al. (2017) Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing program input grammars. SIGPLAN Not. 52, 6 (jun 2017), 95–110. https://doi.org/10.1145/3140587.3062349
Bendrissou et al. (2022) Bachir Bendrissou, Rahul Gopinath, and Andreas Zeller. 2022. “Synthesizing input grammars”: a replication study. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (San Diego, CA, USA) (PLDI 2022). Association for Computing Machinery, New York, NY, USA, 260–268. https://doi.org/10.1145/3519939.3523716
Chaudhuri and Alur (2007) Swarat Chaudhuri and Rajeev Alur. 2007. Instrumenting C programs with nested word monitors. In Proceedings of the 14th International SPIN Conference on Model Checking Software (Berlin, Germany). Springer-Verlag, Berlin, Heidelberg, 279–283.
Chow (1978) T. S. Chow. 1978. Testing Software Design Modeled by Finite-State Machines. IEEE Trans. Softw. Eng. 4, 3 (may 1978), 178–187. https://doi.org/10.1109/TSE.1978.231496
Drewes and Högberg (2007) Frank Drewes and Johanna Högberg. 2007. Query Learning of Regular Tree Languages: How to Avoid Dead States. Theor. Comp. Sys. 40, 2 (feb 2007), 163–185. https://doi.org/10.1007/s00224-005-1233-3
Drewes et al. (2011) Frank Drewes, Johanna Högberg, and Andreas Maletti. 2011. MAT learners for tree series: an abstract data type and two realizations. Acta Informatica 48, 3 (May 2011), 165–189. https://doi.org/10.1007/s00236-011-0135-x
Fujiwara et al. (1991) S. Fujiwara, G. v. Bochmann, F. Khendek, M. Amalou, and A. Ghedamsi. 1991. Test selection based on finite state models. IEEE Transactions on Software Engineering 17, 6 (1991), 591–603. https://doi.org/10.1109/32.87284
Gauwin et al. (2008) Olivier Gauwin, Joachim Niehren, and Yves Roos. 2008. Streaming tree automata. Inform. Process. Lett. 109, 1 (2008), 13–17. https://doi.org/10.1016/j.ipl.2008.08.002
Harris et al. (2012) William R. Harris, Somesh Jha, and Thomas Reps. 2012. Secure programming via visibly pushdown safety games. In Proceedings of the 24th International Conference on Computer Aided Verification (Berkeley, CA) (CAV’12). Springer-Verlag, Berlin, Heidelberg, 581–598. https://doi.org/10.1007/978-3-642-31424-7_41
Heizmann et al. (2010) Matthias Heizmann, Jochen Hoenicke, and Andreas Podelski. 2010. Nested interpolants. SIGPLAN Not. 45, 1 (jan 2010), 471–482. https://doi.org/10.1145/1707801.1706353
Howar (2012) Falk Howar. 2012. Active learning of interface programs. Ph. D. Dissertation. https://doi.org/10.17877/DE290R-4817
Irfan et al. (2010) Muhammad Naeem Irfan, Catherine Oriat, and Roland Groz. 2010. Angluin style finite state machine inference with non-optimal counterexamples. In Proceedings of the First International Workshop on Model Inference In Testing (Trento, Italy) (MIIT ’10). Association for Computing Machinery, New York, NY, USA, 11–19. https://doi.org/10.1145/1868044.1868046
Isberner (2015) Malte Isberner. 2015. Foundations of active automata learning: an algorithmic perspective. https://api.semanticscholar.org/CorpusID:45690562
Jia et al. (2021) Xiaodong Jia, Ashish Kumar, and Gang Tan. 2021. A derivative-based parser generator for visibly Pushdown grammars. Proc. ACM Program. Lang. 5, OOPSLA, Article 151 (oct 2021), 24 pages. https://doi.org/10.1145/3485528
Jia et al. (2023) Xiaodong Jia, Ashish Kumar, and Gang Tan. 2023. A Derivative-based Parser Generator for Visibly Pushdown Grammars. ACM Trans. Program. Lang. Syst. 45, 2, Article 9 (may 2023), 68 pages. https://doi.org/10.1145/3591472
Kulkarni (2023) Neil Kulkarni. 2023. Arvada. https://github.com/neil-kulkarni/arvada.
Kulkarni et al. (2022) Neil Kulkarni, Caroline Lemieux, and Koushik Sen. 2022. Learning highly recursive input grammars. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (Melbourne, Australia) (ASE ’21). IEEE Press, 456–467. https://doi.org/10.1109/ASE51524.2021.9678879
Kumar et al. (2006) Viraj Kumar, P. Madhusudan, and Mahesh Viswanathan. 2006. Minimization, learning, and conformance testing of boolean programs. In Proceedings of the 17th International Conference on Concurrency Theory (Bonn, Germany) (CONCUR’06). Springer-Verlag, Berlin, Heidelberg, 203–217. https://doi.org/10.1007/11817949_14
Kumar et al. (2007) Viraj Kumar, P. Madhusudan, and Mahesh Viswanathan. 2007. Visibly pushdown automata for streaming XML. In Proceedings of the 16th International Conference on World Wide Web (Banff, Alberta, Canada) (WWW ’07). Association for Computing Machinery, New York, NY, USA, 1053–1062. https://doi.org/10.1145/1242572.1242714
Maler and Pnueli (1991) Oded Maler and Amir Pnueli. 1991. On the learnability of infinitary regular sets. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory (Santa Cruz, California, USA) (COLT ’91). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 128–138.
Michaliszyn and Otop (2022) Jakub Michaliszyn and Jan Otop. 2022. Learning Deterministic Visibly Pushdown Automata Under Accessible Stack. In 47th International Symposium on Mathematical Foundations of Computer Science (MFCS 2022) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 241), Stefan Szeider, Robert Ganian, and Alexandra Silva (Eds.). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 74:1–74:16. https://doi.org/10.4230/LIPIcs.MFCS.2022.74
Mozafari et al. (2010) Barzan Mozafari, Kai Zeng, and Carlo Zaniolo. 2010. From regular expressions to nested words: unifying languages and query execution for relational and XML sequences. Proc. VLDB Endow. 3, 1–2 (sep 2010), 150–161. https://doi.org/10.14778/1920841.1920865
Mozafari et al. (2012) Barzan Mozafari, Kai Zeng, and Carlo Zaniolo. 2012. High-performance complex event processing over XML streams. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD ’12). Association for Computing Machinery, New York, NY, USA, 253–264. https://doi.org/10.1145/2213836.2213866
Nguyen and Sudholt (2006) Dong Ha Nguyen and M. Sudholt. 2006. VPA-Based Aspects: Better Support for AOP over Protocols. In Fourth IEEE International Conference on Software Engineering and Formal Methods (SEFM’06). 167–176. https://doi.org/10.1109/SEFM.2006.39
Raffelt et al. (2005) Harald Raffelt, Bernhard Steffen, and Therese Berg. 2005. LearnLib: a library for automata learning and experimentation. In Proceedings of the 10th International Workshop on Formal Methods for Industrial Critical Systems (Lisbon, Portugal) (FMICS ’05). Association for Computing Machinery, New York, NY, USA, 62–71. https://doi.org/10.1145/1081180.1081189
Rivest and Schapire (1993) R.L. Rivest and R.E. Schapire. 1993. Inference of Finite Automata Using Homing Sequences. Information and Computation 103, 2 (1993), 299–347. https://doi.org/10.1006/inco.1993.1021
Thomo and Venkatesh (2011) A. Thomo and S. Venkatesh. 2011. Rewriting of visibly pushdown languages for XML data integration. Theoretical Computer Science 412, 39 (2011), 5285–5297. https://doi.org/10.1016/j.tcs.2011.05.047
Vasilevskii (1973) M. P. Vasilevskii. 1973. Failure diagnosis of automata. Cybernetics 9, 4 (July 1973), 653–665. https://doi.org/10.1007/BF01068590
Wu et al. (2019) Zhengkai Wu, Evan Johnson, Wei Yang, Osbert Bastani, Dawn Song, Jian Peng, and Tao Xie. 2019. REINAM: reinforcement learning for input-grammar inference. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Tallinn, Estonia) (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 488–498. https://doi.org/10.1145/3338906.3338958
