
Physics of Language Models: Part 1,

Learning Hierarchical Language Structures


Zeyuan Allen-Zhu Yuanzhi Li
Meta / FAIR Labs Mohamed bin Zayed University of AI
arXiv:2305.13673v3 [cs.CL] 2 Jun 2024

May 24, 2023


(version 3)∗

Abstract
Transformer-based language models are effective but complex, and understanding their inner
workings is a significant challenge. Previous research has primarily explored how these models
handle simple tasks like name copying or selection, and we extend this by investigating how these
models grasp complex, recursive language structures defined by context-free grammars (CFGs).
We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating
lengthy sentences (e.g., hundreds of tokens) that are locally ambiguous and require dynamic
programming to parse. Despite this complexity, we demonstrate that generative models like
GPT can accurately learn this CFG language and generate sentences based on it. We explore the
model’s internals, revealing that its hidden states precisely capture the structure of CFGs, and
its attention patterns resemble the information passing in a dynamic programming algorithm.
This paper also presents several corollaries, including showing why absolute positional embedding
is inferior to relative attention or rotary embedding; demonstrating that encoder-based models
(e.g., BERT, deBERTa) cannot learn very deeply nested CFGs as effectively as generative models
(e.g., GPT); and highlighting the necessity of adding structural and syntactic errors to the
pretraining data to make the model more robust to corrupted language prefixes.


∗V1 appeared on this date; V2 polishes writing and adds Appendix G; V3 polishes writing and changes the title.
We would like to thank Lin Xiao, Sida Wang and Hu Xu for many helpful conversations. We would like to extend
special thanks to Ian Clark, Gourab De, Anmol Mann, and Max Pfeifer from W&B, as well as Nabib Ahmed, Giri
Anantharaman, Lucca Bertoncini, Henry Estela, Liao Hu, Caleb Ho, Will Johnson, Apostolos Kokolis, and Shubho
Sengupta from Meta FAIR NextSys; without their invaluable support, the experiments in this paper would not have
been possible.
1 Introduction
Transformer-based language models, like GPT [23], are powerful but mysterious; many studies
attempt to uncover the inner workings of transformers. Perhaps the simplest observation is that
attention heads can pair closing brackets with open ones, see the concurrent work and the references
therein [36]. Others also demonstrate that transformers can store key-value knowledge pairs by
storing the values in the hidden embeddings of the keys (see [1] and the references therein).
The seminal work from Anthropic [12, 22] focuses on induction heads, which are logic operations
on the input level (such as [A][B]...[A] implies the next token should be [B]). This can be used
to interpret how language models perform sequence copying, translation, and some easy forms of
pattern matching. They “hypothesized” that induction heads may exist to “match and copy more
abstract and sophisticated linguistic features, rather than precise tokens”, yet they acknowledge
that they “don’t have a strong framework for mechanistically understanding” this.
The interpretability in the wild paper [32] explored many different types of attention heads,
including “copy head”, “name mover head”, “inhibition head”, etc. Most notably, they explained
how GPT2 predicts the next token “Mary” given prefix “When Mary and John went to the store,
John gave a drink to [...]” This requires some logical reasoning by selecting (not naively copying)
the right name. While this result is very inspiring, there exist very simple rule-based
algorithms that achieve the same.
In practice, transformers perform much more complex operations, yet there is an inherent
difficulty in interpreting those models: to interpret how a transformer performs a certain task, there
must be a well-defined algorithm to solve it, so that one can argue that the inner representations of the
transformer align with the algorithm. Almost all of the “impressive skills” demonstrated by state-
of-the-art language models are beyond what any other known algorithm can solve. Motivated by this,
we ask: Is there a setting for us to understand how language models perform hard tasks, involving
deep logic / reasoning / computation chains?
We propose to tackle this question in a controlled setting where the languages are generated syn-
thetically using context-free grammars (CFGs). CFGs, which include terminal (T) and nonterminal
(NT) symbols, a root symbol, and production rules, can hierarchically produce highly structured
expressions. A string is part of CFG language if a rule sequence can transform the root symbol into
this string, and the language model is asked to complete the given partial strings from the CFG.
We pick CFGs because there exist textbook-level, yet quite difficult, dynamic programming (DP)
algorithms to solve CFG instances.¹ Generally,
• We wish to capture long-range dependencies via CFG. The simplest example is bracket match-
ing: in ...Y(...)[[...]{...}]{...}X, the next symbol X could depend on Y, which appeared hun-
dreds of tokens before. Another example is coding, where goto jumpback can only be used if
jumpback is a valid line number that could be hundreds of lines earlier.
• We wish to capture local ambiguity. A coding grammar (like Python) can be parsed greedily
without ambiguity, and so can bracket matching: once we locally see ...()..., we know
the two parentheses must be paired together. We focus on hard CFGs that require global
planning via dynamic programming to parse.
Most popular choices of CFGs do not satisfy the two above properties. Notably, the English
CFG (e.g., derived from Penn TreeBank) has an average length of 28 tokens (too short), and is
not very locally ambiguous (e.g., RB JJ or JJ PP imply their parent must be ADJP). As we show
¹Not to mention that, in the theory community, CFGs are also used to model rich, recursive structures in languages,
including some logics, grammars, formats, expressions, patterns, etc.

[Figure 1 content: a subset of the production rules of the example CFG, e.g.
root ↦ 20 19 21     root ↦ 21 19 19
19 ↦ 17 18          19 ↦ 18 18
16 ↦ 13 15 13       16 ↦ 14 13
13 ↦ 12 11 12       13 ↦ 10 12 11
10 ↦ 9 7 9          10 ↦ 7 9 9
7 ↦ 3 2 2           7 ↦ 3 1 2
plus further rules for the remaining NT symbols, together with an example sentence (a length-354 string over the terminals 1, 2, 3) generated by this grammar.]

Figure 1: An example CFG used in our experiments. It generates long (e.g., length 354 in this example) and
ambiguous strings. Determining if a string x belongs to the CFG language x ∈ L(G) typically requires
dynamic programming, even when the CFG rules are known.

in Appendix G, such CFGs can even be learned using tiny GPT2 models with ∼ 100k parameters.
Thus, they are too easy for our interpretability purposes.
For this reason, we design our own synthetic CFG languages. We give one example
in Figure 1 and discuss a family of 7 such CFGs with varying difficulties in Section 2 (we have
15 more in the appendix).² We pre-train GPT-2 [25], denoted by GPT for short, on a language
modeling task using a large corpus of strings sampled from our constructed CFGs. We test the
model’s accuracy and diversity by feeding it prefixes from the CFG (or no prefix, just the starting
token) and observing if it can generate accurate completions.
It is perhaps evident from Figure 1 that, even if the CFG rules are given, deciding whether a string
belongs to this language may require a real person to use scratch paper and perhaps half an hour,
let alone learning such a CFG from scratch.
However, we demonstrate that GPT can learn such CFGs, and using rotary or relative attentions
is crucial, especially for complex CFGs (Results 1-3). Additionally, we examine attention patterns
and hidden states to understand how GPT achieves this. Specifically, we:
• Results 4-5. Develop a multi-head linear probing method to verify that the model’s hidden
states linearly encode NT information almost perfectly, a significant finding as pre-training
does not expose the CFG structure. (In contrast, encoder models like BERT do not.)
• Results 6-9. Introduce methods to visualize and quantify attention patterns, demonstrating
that GPT learns position-based and boundary-based attentions, contributing to understanding
how it learns CFG’s regularity, periodicity, and hierarchical structure.
• Corollary. Suggest that GPT models learn CFGs by implementing a dynamic programming-
like algorithm. The boundary-based attention allows a token to attend to its closest NT
symbols in the CFG tree, even when separated by hundreds of tokens. This resembles DP, in
which the CFG parsing on a sequence 1...i needs to be “concatenated” with another sequence
i+1...j in order to form a solution to a larger problem on 1...j. See Figures 2 and 10 for illustrations.
We also explore implicit CFGs [24], where each T symbol is a bag of tokens, and data is gener-
ated by randomly selecting tokens from these bags. Implicit CFGs capture additional structures,
such as word categories. We demonstrate that GPT models learn implicit CFGs by encoding the
T symbol information (i.e., token bags) directly into their token embedding layers (Result 10).
We further examine model robustness [19, 30] using CFGs, assessing the model’s ability to
auto-correct errors and generate valid CFGs from a corrupted prefix (e.g., randomly flipping 15%
of the symbols in the prefix). This capability is crucial as it reflects the model’s ability to process
real-world data, including those containing grammatical errors. We find that:
²A benefit of using synthetic data is that we can control the difficulty of the data, so that we can observe how transformers
learn to solve tasks at different difficulty levels, and observe their differences.

[Figure 2 content: an example string x from cfg3f together with its NT-ancestor rows s3, s4, s5, s6, its deepest NT-end row b♯, and its NT-ancestor-index rows p3, p4, p5, p6, shown next to (examples of) rules from cfg3f such as 18 ↦ 13 15, 13 ↦ 12 11 12, 15 ↦ 10 10, 10 ↦ 8 9 9, 10 ↦ 9 7 9, 11 ↦ 9 7, 12 ↦ 9 8, 12 ↦ 8 8 9, 8 ↦ 3 1 1, 8 ↦ 1 2, 8 ↦ 3 3 1, 9 ↦ 1 2 1, 9 ↦ 3 3, 9 ↦ 1 1. Annotations note that the transformer learns boundary-based attention to the most adjacent NT boundaries at all levels, and that the NT ancestor/boundary information is linearly encoded in its hidden states.]


Figure 2: An example string x from G = cfg3f. Though formally defined in Section 2.1, bold symbols in color
represent NT boundaries which mark the ending positions of the parsed CFG subtrees at various levels ℓ:
we denote by bℓ (i) = 1 if xi is at the NT boundary for level ℓ. The NT ancestor sℓ (i) represents the tree
node’s name at level ℓ for symbol xi . The NT ancestor index pℓ (i) represents that xi is on the “pℓ (i)-th”
subtree for level ℓ counting from the left.

• Result 11. GPT models, trained on grammatically correct data, exhibit low robustness.
However, introducing just a 10% perturbation to the training data significantly improves the
model’s robustness. This suggests the benefit of using lower-quality data during pre-training.
• Result 12-13. When trained with perturbed data, GPT models develop a “mode switch”
for toggling between making or not making grammar mistakes. This behavior is observable in
real-life completion models like Llama or GPT-3 (davinci003).

2 Our Synthetic Context-Free Grammars


A probabilistic context-free grammar (CFG) is a formal system defining a string distribution using
production rules. It comprises four components: terminal symbols (T), nonterminal symbols (NT),
a root symbol (root ∈ NT), and production rules (R). We represent a CFG as G = (T, NT, R),
with L(G) denoting the string distribution generated by G.

2.1 Definition and Notations


We mostly focus on L-level CFGs where each level ℓ ∈ [L] corresponds to a set of symbols NTℓ
with NTℓ ⊆ NT for ℓ < L, NTL = T, and NT1 = {root}. Symbols at different levels are disjoint:
NTi ∩ NTj = ∅ for i ≠ j. We consider rules of length 2 or 3, denoted as R = (R1, . . . , RL−1),
where each Rℓ consists of rules in the form:
r = (a ↦ b, c, d) or r = (a ↦ b, c) for a ∈ NTℓ and b, c, d ∈ NTℓ+1 .
Given a non-terminal symbol a ∈ NT and any rule r = (a ↦ ⋆), we say a ∈ r. For each a ∈ NT,
its associated set of rules is R(a) := { r | r ∈ Rℓ ∧ a ∈ r }, its degree is |R(a)|, and the CFG’s size
is (|NT1|, |NT2|, . . . , |NTL|).
Generating from CFG. To generate samples x from L(G), follow these steps:
1. Start with the root symbol NT1 .

2. For each layer ℓ < L, keep a sequence of symbols sℓ = (sℓ,1, · · · , sℓ,mℓ).

[Figure 3 content, two panels of parse trees:
(a) the real-life English CFG derived from Penn Treebank, short and simple;
(b) a family of max-depth-11 CFGs where rules have length 1 or 2 that GPT can learn, see cfg0 in Appendix G.]

Figure 3: CFG visual comparisons: left is a medium-length sample, and right is an 80%-percentile-length sample.

3. For the next layer, randomly sample a rule r ∈ R(sℓ,i) for each sℓ,i with uniform probability.³
Replace sℓ,i with b, c, d if r = (sℓ,i ↦ b, c, d), or with b, c if r = (sℓ,i ↦ b, c). Let the resulting
sequence be sℓ+1 = (sℓ+1,1, · · · , sℓ+1,mℓ+1).
4. During generation, when a rule sℓ,i ↦ sℓ+1,j, sℓ+1,j+1 is applied, define the parent parℓ+1(j) :=
parℓ+1(j + 1) := i (and similarly if the rule of sℓ,i is of length 3).
5. Define NT ancestor indices p = (p1(i), . . . , pL(i))i∈[mL] and NT ancestor symbols s = (s1(i), . . . , sL(i))i∈[mL]
as shown in Figure 2:
pL(j) := j ,   pℓ(j) := parℓ+1(pℓ+1(j))   and   sℓ(j) := sℓ,pℓ(j) .
The final string is x = sL = (sL,1 , · · · , sL,mL ) with xi = sL,i and length len(x) = mL . We
use (x, p, s) ∼ L(G) to represent x with its associated NT ancestor indices and symbols, sampled
according to the generation process. We write x ∼ L(G) when p and s are evident from the context.
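To make the generation process concrete, here is a minimal Python sketch of steps 1-5, assuming the grammar is given as a dictionary rules[a] listing right-hand sides of length 2 or 3; the function name and data layout are our own illustrative choices, not the paper's code.

```python
import random

def sample_with_ancestors(rules, root="root", L=7):
    # chains[i] stores, for the i-th symbol of the current level, the pairs
    # (ancestor symbol, ancestor index) for every level generated so far.
    seq, chains = [root], [[(root, 1)]]
    for _ in range(L - 1):                             # expand levels 1 .. L-1
        new_seq, new_chains = [], []
        for pos, sym in enumerate(seq):
            rhs = random.choice(rules[sym])            # uniform rule choice (step 3)
            for child in rhs:
                new_seq.append(child)
                new_chains.append(chains[pos] + [(child, len(new_seq))])
        seq, chains = new_seq, new_chains
    x = seq                                            # the terminal string x = s_L
    s = [[c[l][0] for c in chains] for l in range(L)]  # s[l][i]: NT ancestor symbol
    p = [[c[l][1] for c in chains] for l in range(L)]  # p[l][i]: NT ancestor index
    return x, s, p
```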

Definition 2.1. A symbol xi in a sample (x, p, s) ∼ L(G) is the NT boundary / NT end at level
ℓ ∈ [L − 1] if pℓ(i) ≠ pℓ(i + 1) or i = len(x). We denote by bℓ(i) := 1[xi is the NT boundary at level ℓ] the
NT-end boundary indicator function. The deepest NT-end of i is (see also Figure 2)
b♯(i) := min{ ℓ ∈ {2, 3, . . . , L − 1} : bℓ(i) = 1 } , or ⊥ if the set is empty.
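Under the same assumptions as the sketch above, the boundary indicators bℓ(i) and the deepest NT-end b♯(i) of Definition 2.1 follow directly from the ancestor indices p it returns:

```python
def nt_boundaries(p, L=7):
    # p[l][i]: NT ancestor index of position i+1 at level l+1 (0-indexed lists)
    n = len(p[0])
    # b[l][i] = 1 iff position i+1 is an NT boundary at level l+1
    b = [[1 if (i == n - 1 or p[l][i] != p[l][i + 1]) else 0 for i in range(n)]
         for l in range(L - 1)]
    # deepest NT-end: the smallest level in {2, ..., L-1} whose indicator is 1 (else None)
    b_sharp = [min((l + 1 for l in range(1, L - 1) if b[l][i] == 1), default=None)
               for i in range(n)]
    return b, b_sharp
```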

The cfg3 synthetic CFG family. We focus on seven synthetic CFGs of depth L = 7 detailed
in Section A.1. The hard datasets cfg3b, cfg3i, cfg3h, cfg3g, cfg3f have sizes (1, 3, 3, 3, 3, 3, 3) and
increasing difficulties cfg3b < cfg3i < cfg3h < cfg3g < cfg3f. The easy datasets cfg3e1 and cfg3e2
have sizes (1, 3, 9, 27, 81, 27, 9) and (1, 3, 9, 27, 27, 9, 4) respectively. The sequences generated by
these CFGs are up to 3^6 = 729 in length. Typically, the learning difficulty of CFGs inversely scales
with the number of NT/T symbols, assuming other factors remain constant, because having more
NT/T symbols makes the language less ambiguous and more easily parsed greedily (see Figure 4;
we discuss this more in Appendix G). We thus primarily focus on cfg3b, cfg3i, cfg3h, cfg3g, cfg3f.

2.2 Why Such CFGs


We use CFG as a proxy to study some rich, recursive structure in languages, which can cover some
logics, grammars, formats, expressions, patterns, etc. Those structures are diverse yet strict (for
example, Chapter 3.1 should only be followed by Chapter 3.1.1, Chapter 4 or Chapter 3.2, not
others). The CFGs we consider are non-trivial, with likely over 2^270 > 10^80 strings in cfg3f among
a total of over 3^300 > 10^140 possible strings of length 300 or more (see the entropy estimation in
³For simplicity, we consider the uniform case, eliminating rules with extremely low probability. Such rules complicate
the learning of the CFG and the investigation of a transformer’s inner workings (e.g., they require larger networks and
longer training time). Our results do extend to non-uniform cases when the distributions are not heavily unbalanced.

[Figure 4 content: three tables over the datasets cfg3b, cfg3i, cfg3h, cfg3g, cfg3f, cfg3e1, cfg3e2 and the models GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni: (left) generation accuracy (%) at cuts c = 0 and c = 50; (middle) entropy in bits of the generated strings compared against the ground truth; (right) the KL-divergence between the true CFG distribution and each model's output distribution.]

Figure 4: Generation accuracy (left), entropy (middle), KL-divergence (right) across multiple CFG datasets.
Observations: Less ambiguous CFGs (cfg3e1, cfg3e2, as they have fewer NT/T symbols) are easier to
learn. Transformers using relative positional embedding (GPTrel or GPTpos ) are better for learning harder
CFGs. The vanilla GPT is worse than even GPTuni , which is GPT with fixed, uniform attentions.

Figure 4). The probability of a random string belonging to this language is nearly zero, and a
random completion of a valid prefix is unlikely to satisfy the CFG. In particular, Figure 31 in the
appendix shows that cfg3f cannot be learned by transformers (much) smaller than GPT2-small. In
contrast, the English CFG (e.g., derived from Penn TreeBank) can be learned to good accuracy
using tiny GPT2 models with ∼ 100k parameters — so it is too easy for our interpretability purpose.
To obtain the cleanest interpretability result, we have selected a CFG family with a “canonical
representation” (e.g., a layered CFG). This controlled design choice allows us to demonstrate a
strong correlation between the CFG representation and the hidden states in the learned transformer.
We also create additional CFG families to examine “not-so-canonical” CFG trees, with results
deferred to Appendix G (see an example in Figure 3). We do not claim our results encompass all
CFGs; our chosen CFGs are already quite challenging for a transformer to learn and can lead to
clean hierarchical interpretability results.

3 Results 1-3: Transformer Can Learn Such CFGs


In this section, we generate a large corpus {x(i) }i∈[N ] from a synthetic CFG language L(G) in
Section 2.1, and pretrain a (generative, decoder-only) transformer model F on this corpus, treating
each terminal symbol as a separate token, using an auto-regressive task (see Appendix A.3 for
details). We then evaluate how well the model learns such L(G).
Models. We denote the GPT2 small architecture (12-layer, 12-head, 768-dimensions) as GPT [25]
and implement two of its modern variants. We denote GPT with relative positional attention [13] as
GPTrel, and GPT with rotary positional embedding [9, 29] as GPTrot. For purposes in later sections, we
introduce two weaker variants. GPTpos replaces the attention matrix with a matrix based solely on
tokens’ relative positions, while GPTuni uses a constant, uniform average of past tokens from various
window lengths as the attention matrix. Detailed explanations of these variants are in Section A.2.
We quickly summarize our findings and then elaborate on them in detail.
Result 1-3 (Figure 4). The GPT models can effectively learn our synthetic CFGs. Given any
prefix, they can generate completion strings
• that can perfectly adhere to the CFG rules most of the time, (accuracy)
• that are sufficiently diverse in the CFG language, and (diversity)
• that closely follow the probabilistic distribution of the CFG language. (probability)
Moreover, it is better to use rotary or relative attention; the original GPT (with absolute posi-
tional embedding) performs even worse than GPTuni (with uniform attention).

Result 1: Completion accuracy. We evaluate F by letting it generate completions for pre-
fixes x:c = (x1, x2, · · · , xc) from strings x freshly sampled from L(G). The generation accuracy is
measured as Pr_{x∼L(G) + randomness of F}[(x:c, F(x:c)) ∈ L(G)]. We use multinomial sampling without
beam search for generation.⁴
Figure 4 (left) shows the generation accuracies for cuts c = 0 and c = 50. The c = 0 result
tests the transformer’s ability to generate a sentence in the CFG, while c = 50 tests its ability to
complete a sentence.5 The results show that the pretrained GPT models can often generate strings
that perfectly adhere to the CFG rules for the cfg3 data family.
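A hypothetical sketch of this evaluation loop is given below; it assumes a HuggingFace-style causal LM `model` whose token ids coincide with the terminal symbols, a generator sample_from_cfg() for fresh strings, and a membership test is_valid() (e.g., wrapping the DP parser sketched in Section 4). None of these names come from the paper.

```python
import torch

@torch.no_grad()
def generation_accuracy(model, sample_from_cfg, is_valid, cut=50, trials=1000):
    correct = 0
    for _ in range(trials):
        x = sample_from_cfg()                        # a fresh string x ~ L(G)
        prefix = torch.tensor([x[:cut]])             # keep the first c symbols as token ids
        out = model.generate(prefix, do_sample=True, top_k=0, top_p=1.0,
                             max_length=1024)        # multinomial sampling, temperature 1
        completion = out[0].tolist()                 # the pair (x_{:c}, F(x_{:c}))
        correct += int(is_valid(completion))         # does it belong to L(G)?
    return correct / trials
```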
Result 2: Generation diversity. Could it be possible that the pretrained GPT models only
memorized a small subset of strings from the CFG? We evaluate this by measuring the diversity of
its generated strings. High diversity suggests a better understanding of the CFG rules.
We consider two methods to estimate diversity. One is to estimate the distribution’s entropy,
which provides a rough estimate of (the log2 of) the support size, see the middle of Figure 4. The
other is to use the birthday paradox to lower bound the support size [6]. This allows us to make precise
claims, such as: in the cfg3f dataset, there are at least 4 × 10^8 distinct sentential forms derivable
from a symbol at levels 1 to 5 or levels 2 to 6, let alone from the root to level 7. Details are
in Appendix B. Our general conclusion is that the pre-trained model does not rely on simply
memorizing a small set of patterns to achieve high completion accuracy.
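For illustration, a rough sketch of both estimates is shown below; the birthday-paradox bound as written assumes a roughly uniform distribution over hashable samples (e.g., tuples of symbols), whereas the paper's actual estimators in Appendix B are more careful.

```python
import math
from collections import Counter

def entropy_bits(samples):
    # naive plug-in entropy estimate (in bits) from the empirical sample frequencies
    counts, n = Counter(samples), len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def birthday_lower_bound(samples, delta=0.05):
    # If M samples contain no duplicate, then (for a roughly uniform distribution)
    # with probability >= 1 - delta the support size exceeds M^2 / (2 ln(1/delta)).
    m = len(samples)
    if len(set(samples)) < m:
        return None        # a collision was observed; this simple bound does not apply
    return m * m / (2 * math.log(1 / delta))
```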
Result 3: Distribution comparison. To fully learn a CFG, it is crucial to learn the distribution
of generating probabilities. One naive approach is to compare the marginal distributions p(a, i),
for the probability of symbol a ∈ NTℓ appearing at position i. We observe a strong alignment
between the generation probabilities and the ground-truth, included in Appendix B.2.
Another approach is to compute the KL-divergence between the per-symbol conditional distri-
butions. Let p∗ be the distribution over strings in the true CFG and p be that from the generative
transformer model. Let S = {x^(i)}i∈[M] be samples from the true CFG distribution. Then, the
KL-divergence can be estimated as follows:⁶
(1/|S|) Σ_{x∈S} (1/(len(x)+1)) Σ_{i∈[len(x)+1]} Σ_{t∈T∪{eos}} Pr_{p∗}[t | x1, . . . , xi−1] · log ( Pr_{p∗}[t | x1, . . . , xi−1] / Pr_{p}[t | x1, . . . , xi−1] )

In Figure 4 (right) we compare the KL-divergence between the true CFG distribution and the GPT
models’ output distributions using M = 20000 samples.
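A sketch of this estimator is shown below, assuming access to functions p_star(prefix) and p_model(prefix) that return the next-symbol distribution over T ∪ {eos} under the true CFG (computable, e.g., by an inside-style DP) and under the trained model respectively; the names are illustrative.

```python
import math

def estimated_kl(samples, p_star, p_model):
    # p_star(prefix), p_model(prefix): dicts mapping each t in T ∪ {eos} to Pr[t | prefix]
    total = 0.0
    for x in samples:
        acc = 0.0
        for i in range(len(x) + 1):                  # positions 1 .. len(x)+1
            q_star, q = p_star(x[:i]), p_model(x[:i])
            acc += sum(q_star[t] * math.log(q_star[t] / q[t])
                       for t in q_star if q_star[t] > 0)
        total += acc / (len(x) + 1)
    return total / len(samples)
```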
Connection to DP. Results 1-3 (e.g., learning the CFG's marginal distribution) are merely a small
step towards showing that the model employs a DP-like approach. Dynamic programming (e.g.,
the inside-outside algorithm [8]) can compute marginal distributions of CFGs, and such algorithms
can be implemented using nonlinear neural networks like transformers, achieving a global minimum
in the auto-regressive training objective.7 However, the mere existence of a dynamic-programming
transformer to obtain the training objective’s global minimum is not entirely satisfactory. Does
employing an AdamW stochastic optimizer for 100k iterations on the training objective yield such
an algorithm? The remainder of this paper will delve deeper to address this question.
⁴The last softmax layer converts the model outputs into a probability distribution over (next) symbols. We follow
this distribution to generate the next symbol, reflecting the unaltered distribution learned by the transformer. This
is the source of the “randomness of F” and is often referred to as using “temperature τ = 1.”
⁵Our cfg3 family is large enough to ensure a negligible chance of a freshly sampled prefix of length 50 being seen
during pretraining.
⁶A nearly identical formula was also used in [11].
⁷This has been carefully explored for the masked language modeling case in Zhao et al. [37].

4 Results 4-5: How Do Transformers Learn CFGs?
In this section, we delve into the learned representation of the transformer to understand how it
encodes CFGs. We employ various measurements to probe the representation and gain insights.
Recall the classical way to solve CFGs. Given a CFG G, the classical way to verify whether a sequence x
satisfies L(G) is to use dynamic programming (DP) [26, 28]. One possible implementation of DP
involves using the function DP(i, j, a), which determines whether or not xi+1, xi+2, . . . , xj can be
generated from symbol a following the CFG rules. From this DP representation, a DP recurrent
formula can be easily derived.⁸
In the context of this paper, any sequence x ∼ L(G) that satisfies the CFG must satisfy the
following conditions:
bℓ (i) = 1, bℓ (j) = 1, ∀k ∈ (i, j), bℓ (k) = 0 and sℓ (j) = a =⇒ DP(i, j, a) = 1 (4.1)
(recall the NT-boundary bℓ and the NT-ancestor sℓ notions from Section 2.1). Note that (4.1) is not
an “if and only if” condition because there may be a subproblem DP(i, j, a) = 1 that does not lie on
the final CFG parsing tree but is still locally parsable by some valid CFG subtree. However, (4.1)
provides a “backbone” of subproblems, where verifying their DP(i, j, a) = 1 values certifies that the
sentence x is a valid string from L(G). It is worth mentioning that there are exponentially many
implementations of the same DP algorithm9 and not all (i, j, a) tuples need to be computed in
DP(i, j, a). Only those in the “backbone” are necessary.
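For reference, a standard bottom-up CYK-style implementation of the membership test might look as follows. It assumes the grammar has been binarized so that every rule is either a ↦ b, c over nonterminals or a unary rule producing a terminal; this is textbook CYK, not the paper's code.

```python
def in_language(x, binary_rules, terminal_rules, root="root"):
    # binary_rules[a]   = [(b, c), ...]         for rules a -> b c
    # terminal_rules[a] = set of terminals t    for rules a -> t
    n = len(x)
    # dp[(i, j)] = set of symbols a with DP(i, j, a) = 1, i.e. a generates x_{i+1..j}
    dp = {(i, i + 1): {a for a, ts in terminal_rules.items() if t in ts}
          for i, t in enumerate(x)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            dp[(i, j)] = {a for a, rhss in binary_rules.items()
                          if any(b in dp[(i, k)] and c in dp[(k, j)]
                                 for k in range(i + 1, j) for (b, c) in rhss)}
    return root in dp[(0, n)]       # overall O(len^3 * |rules|) time
```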
Connecting to transformer. In this section, we investigate whether pre-trained transformer
F also implicitly encodes the NT ancestor and boundary information. If it does, this suggests
that the transformer contains sufficient information to support all the DP(i, j, a) values in the
backbone. This is a significant finding, considering that transformer F is trained solely on the auto-
regressive task without any exposure to NT information. If it does encode the NT information after
pretraining, it means that the model can both generate and certify sentences in the CFG language.

4.1 Result 4: Transformer’s Last Layer Encodes NT Ancestors/Boundaries


Let l be the last layer of the transformer (other layers are studied in Appendix C.2). Given
an input string x, we denote the hidden state of the transformer at layer l and position i as
Ei(x) ∈ R^d. We first investigate whether a linear function can predict (b1(i), . . . , bL(i))i∈[len(x)]
and (s1(i), . . . , sL(i))i∈[len(x)] using the full (Ei(x))i∈[len(x)]. If possible, this implies that the last-layer
hidden states encode the CFG's structural information up to a linear transformation.
Multi-head linear probing (full). Due to the high dimensionality of this linear function (e.g.,
len(x) = 300 and d = 768 yield 300 × 768 dimensions) and variable string lengths, we propose a
multi-head linear function for efficient learning. We consider a set of linear functions fr : R^d →
R^|NT|, where r ∈ [H] and H is the number of “heads”. To predict any sℓ(i), we apply:
Gi(x) = Σ_{r∈[H], k∈[len(x)]} w_{r,i→k} · fr(Ek(x)) ∈ R^|NT|        (4.2)
8
For example, one can compute DP(i, j, a) = 1 if and only if there exists i = i1 < i2 < · · · < ik = j such that
DP(ir , ir+1 , br ) = 1 for all r ∈ [k − 1] and a → b1 , b2 , . . . , bk is a rule of the CFG. Implementing this naively would
result in a O(len4 ) algorithm for CFGs with a maximum rule length of 3. However, it can be implemented more
efficiently with O(len3 ) time by introducing auxiliary nodes (e.g., via binarization).
9
Each inner loop of the dynamic programming can proceed in any arbitrary order, not limited to k = i..j or
2
k = j..i, and the algorithm can prune and break early. This gives a safe estimate of at least (n!)Ω(n ) possible
Ω(n)
implementations. Furthermore, there are at least 2 ways to perform binarization, meaning to break length-3 rules
to length-2 ones. This is just to detect if a given string of length n belongs to the CFG.

[Figure 5 content: a table reporting, for each dataset (cfg3b, cfg3i, cfg3h, cfg3g, cfg3f, cfg3e1, cfg3e2) and each level NT2-NT6, the accuracy (%) of predicting NT ancestors via linear probing, for GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni, deBERTa, and the baseline GPT_rand; the generative models reach (nearly) 100% accuracy, while the random baseline is far lower.]

Figure 5: After pre-training, hidden states of generative models implicitly encode NT-ancestor information. The
NTℓ column represents the accuracy of predicting sℓ, the NT ancestors at level ℓ, via linear probing (4.2).

It also encodes NT boundaries (Appendix C.1); and such information is discovered gradually and hierar-
chically across layers and training epochs (Appendix C.2 and C.3). We compare against a baseline which
is the encoding from a randomly-initialized GPT, GPTrand (serving as a neural-tangent kernel baseline).
We also compare against DeBERTa, illustrating that BERT-like models are less effective in learning NT
information at levels close to the CFG root.

where w_{r,i→k} := exp(⟨Pi,r, Pk,r⟩) / Σ_{k′∈[len(x)]} exp(⟨Pi,r, Pk′,r⟩) for trainable parameters Pi,r ∈ R^{d′}. Gi can be seen as a
“multi-head attention” over linear functions. We train Gi(x) ∈ R^|NT| using the cross-entropy loss
to predict (sℓ(i))ℓ∈[L]. Despite having multiple heads, Gi(x) is still a linear function over (Ek(x))k∈[len(x)],
as the linear weights w_{r,i→k} depend only on positions i and k, not on x. Similarly, we train G′i(x) ∈
R^L using the logistic loss to predict the binary values (bℓ(i))ℓ∈[L]. Details are in Section A.4.
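As a concrete illustration, a minimal PyTorch sketch of the probe in (4.2) is given below. The class name, head count, and the dimension d′ = 64 of the position vectors are our own assumptions; the paper's exact probing setup is described in Section A.4.

```python
import torch
import torch.nn as nn

class MultiHeadLinearProbe(nn.Module):
    def __init__(self, d=768, num_heads=16, max_len=512, num_classes=30, d_pos=64):
        super().__init__()
        # f_r: one linear map per head, from the hidden dimension d to NT-symbol logits
        self.f = nn.ModuleList([nn.Linear(d, num_classes) for _ in range(num_heads)])
        # P_{i,r}: trainable position vectors defining the weights w_{r,i->k};
        # they do not depend on the input x, so G_i stays linear in the hidden states
        self.P = nn.Parameter(0.02 * torch.randn(num_heads, max_len, d_pos))

    def forward(self, E):                  # E: (seq_len, d), frozen hidden states
        n = E.size(0)
        P = self.P[:, :n, :]                                # (H, n, d_pos)
        w = torch.softmax(P @ P.transpose(1, 2), dim=-1)    # w[r, i, k], softmax over k
        feats = torch.stack([f(E) for f in self.f])         # f_r(E_k): (H, n, classes)
        return (w @ feats).sum(dim=0)      # G_i = sum_{r,k} w[r,i,k] * f_r(E_k)

# Training minimizes cross-entropy between these outputs and the labels s_ell(i),
# updating only the probe; the transformer's hidden states stay frozen.
```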
Using such multi-head linear probing, we discover that:
Result 4 (Figure 5). Pre-training allows GPT models to almost perfectly encode the NT an-
cestor sℓ (i) and NT boundary bℓ (i) information in the last transformer layer’s hidden states
(Ek (x))k∈[len(x)] , up to a linear transformation.
In contrast, encoder models (like deBERTa) may not learn deep NT information very well.10
But do we need the full layer for linear probing? We explore this next.

4.2 Result 5: NT Ancestors are Encoded At NT Boundaries


 
Above, we used the full hidden layer, (Ei(x))i∈[len(x)], to predict (sℓ(i))ℓ∈[L] for each position i.
This is essential, since it is information-theoretically impossible to extract all of i's NT ancestors
by only reading Ei(x), or even all hidden states to its left, especially if xi is the start of a string or
a subtree in the CFG. But what about the cases that are information-theoretically possible? In particular,
how about predicting sℓ(i) at locations i with bℓ(i) = 1, i.e., at the ends of the CFG subtrees?
Multi-head linear probing (diagonal). We consider a neighborhood of position i in the hidden
states, say Ei±1(x), and use that for linear probing. In symbols, we replace w_{r,i→k} in (4.2) with
zeros for |i − k| > 1 (tridiagonal masking), or with zeros for i ≠ k (diagonal masking):
Gi(x) = Σ_{r∈[H], k∈[len(x)], |i−k|≤δ} w_{r,i→k} · fr(Ek(x)) ∈ R^|NT| ,   where δ = 0 or 1        (4.3)
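In code, this amounts to zeroing the probe weights outside a band around the diagonal, e.g. (a hypothetical helper, with w taken from a probe like the sketch above):

```python
import torch

def band_mask(w, delta):           # w: (H, n, n) probe weights w_{r,i->k} from (4.2)
    n = w.size(-1)
    idx = torch.arange(n, device=w.device)
    band = (idx[:, None] - idx[None, :]).abs() <= delta
    return w * band                # delta = 0: diagonal masking; delta = 1: tridiagonal
```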

¹⁰Among encoder-based models, deBERTa [13] is a modern variant of BERT, which is equipped with relative atten-
tions. It is expected that encoder-based models do not learn very deep NT information, because in a masked-language
modeling (MLM) task, the model only needs to figure out the missing token from its surrounding, say, 20 tokens.
This can be done by pattern matching, as opposed to a global planning process like dynamic programming.



[Figure 6 content: the example string x from Figure 2 with its NT-ancestor rows s3, s4, s5, s6 and its NT-boundary indicator rows b3, b4, b5, b6; highlighted positions mark where the hidden states linearly encode this NT ancestor/boundary information.]


Figure 6: Illustration of Result 5: GPT’s last layer hidden states at the blue positions linearly encode the NT
ancestor and boundary information in the red boxes very well. (They may not encode NT ancestors for
smaller levels because that may not be information-theoretically possible.)

[Figure 7 content: two tables, one for diagonal masking and one for tridiagonal masking, reporting for each dataset (cfg3b, cfg3i, cfg3h, cfg3g, cfg3f, cfg3e1, cfg3e2) and each level NT2-NT6 the accuracy (%) of predicting NT ancestors at NT-end positions, for GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni, deBERTa, and the baseline GPT_rand.]

Figure 7: Generative models encode NT ancestors almost exactly at NT boundaries. The NTℓ column repre-
sents the accuracy of predicting sℓ(i) at locations i with bℓ(i) = 1, via diagonal multi-head linear probing (4.3).

Observation. By comparing against a baseline, which is the encoding from a random GPT, we see that
BERT-like (encoder-only) transformers such as DeBERTa, trained on a masked language modeling (MLM)
task, do not store deep NT ancestor information at the NT boundaries.

Result 5 (Figure 6). For GPT models, the information of position i’s NT ancestor/boundary is
locally encoded around position i ± 1 when i is on the NT boundary. This is because:
• At NT boundaries (i.e., bℓ(i) = 1), diagonal or tridiagonal multi-head linear probing (4.3)
is adequate for accurately predicting the NT ancestors sℓ(i) (see Figure 7).
• Such masking is also sufficient for accurately predicting NT boundaries bℓ (i) (deferred to
Figure 19 in Appendix C.1).
In contrast, encoder models like deBERTa do not store deep NT information at the NT boundaries.

Related work. Our probing approach is akin to the seminal work by Hewitt and Manning [14],
which uses linear probing to examine the correlation between BERT’s hidden states and the parse
tree distance metric (similar to NT-distance in our language). Subsequent studies [7, 16, 18, 27,
31, 33, 37] have explored various probing techniques to suggest that BERT-like transformers can
approximate CFGs from natural languages.
Our approach differs in that we use synthetic data to demonstrate that linear probing can
almost perfectly recover NT ancestors and boundaries, even for complex CFGs that generate strings
exceeding hundreds of tokens. We focus on pre-training generative (decoder-only) language models.
Non-generative, encoder-based models like BERT [15] or its modern variant deBERTa [13]
do not learn deep (i.e., close to the CFG root) NT information very well, as shown in Results 4-5.
Our results, along with Section 5, provide evidence that generative language models like GPT-
2 employ a dynamic-programming-like approach to generate CFGs, while encoder-based models,
typically trained via MLM, struggle to learn more complex/deeper CFGs.

5 Results 6-9: How Do Transformers Learn NTs?


We now delve into the attention patterns. We demonstrate that these patterns mirror the CFG’s
syntactic structure and rules, with the transformer employing different attention heads to learn
NTs at different CFG levels.

5.1 Result 6: Position-Based Attention


We first note that the transformer’s attention weights are primarily influenced by the tokens’ relative
distance. This holds true even when trained on the CFG data with absolute positional embedding.
This implies that the transformer learns the CFG’s regularity and periodicity through positional
information, which it then uses for generation.
Formally, let Al,h,j→i(x) for j ≥ i represent the attention weight for positions j → i at layer l and
head h of the transformer, on input sequence x. For each layer l, head h, and distance p ≥ 0, we compute
the average of the partial sum Σ_{1≤i′≤i} Al,h,j→i′(x) over all data x and pairs i, j with j − i = p. We
plot this cumulative sum for l, h, p in Figure 8. We observe a strong correlation between the attention
pattern and the relative distance p = j − i. The attention pattern is also multi-scale, with some atten-
tion heads focusing on shorter distances and others on longer ones.

[Figure 8 content: per-layer heatmaps (layers 1-12, with 12 rows of attention heads per block) of the cumulative attention against the distance p = |j − i| from 10 to 290, for GPT over cfg3h data.]

Figure 8: When trained on cfg3h using absolute positional embedding, GPT shows a position-based attention
pattern. The 12 rows in each block represent attention heads. See Appendix D.1 for more experiments.
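A hypothetical sketch of the statistic behind Figure 8 is shown below, assuming attention maps have been captured into a NumPy array attn of shape (layers, heads, seq_len, seq_len) for one input; the paper additionally averages over many samples x.

```python
import numpy as np

def cumulative_attention_by_distance(attn):
    # attn[l, h, j, i] = A_{l,h,j->i} for one input
    layers, heads, n, _ = attn.shape
    partial = np.cumsum(attn, axis=-1)   # partial[..., j, i] = sum of A_{l,h,j->i'} for i' <= i
    out = np.zeros((layers, heads, n))
    for p in range(n):
        j = np.arange(p, n)              # all pairs (i, j) with j - i = p
        out[:, :, p] = partial[:, :, j, j - p].mean(axis=-1)
    return out                           # out[l, h, p]: the quantity plotted in Figure 8
```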
Motivated by this, we explore whether using position-based attention is sufficient to learn CFGs.
In Figure 4, we find that GPTpos (or even GPTuni) performs well, surpassing the vanilla GPT, but not
reaching the full potential of GPTrel. This supports the superior practical performance of relative-position
based transformer variants (such as GPTrel, GPTrot, deBERTa) over their base models (GPT or BERT).
On the other hand, this also indicates that position-based attention alone is not enough for transformers
to learn CFGs.

5.2 Result 7-9: Boundary-Based Attention


Next, we remove the position-bias from the attention matrix to examine the remaining part. We
find that the transformer also learns a strong boundary-based attention pattern, where tokens on
the NT-end boundaries typically attend to the “most adjacent” NT-end boundaries, see
Figure 2. This attention pattern enables the transformer to effectively learn the hierarchical and
recursive structure of the CFG, and generate output tokens based on the NT symbols and rules.
Formally, let Al,h,j→i (x) for j ≥ i denote the attention weight for positions j → i at layer l
and head h of the transformer, on input sequence x.
[Figure 9 content, three panels:
(a) Bl,h,j→i for i + δ at an NT-end in CFG level ℓ (NT-end attention pattern for GPTrel over cfg3h data); rows represent ℓ = 2, 3, 4, 5 and columns represent δ = −2, −1, 0, 1, 2.
(b) Bl,h,j→i for i + δ1, j + δ2 at NT-ends in CFG level ℓ = 4 (for GPTrel over cfg3h data); rows/columns represent δ1, δ2 = −1, 0, +1.
(c) B^end→end_{l,h,ℓ′→ℓ,r} for NT-ends between CFG levels ℓ′ → ℓ (for GPTrel over cfg3f data); rows represent r and columns represent ℓ′ → ℓ; “×” means empty entries.]

Figure 9: After being pretrained on our CFG data, the GPT model’s attention layers have a strong bias towards
“NT-end at level ℓ′ to the most adjacent NT-end at level ℓ”, even for different ℓ, ℓ′. For definitions see Section 5.2,
and for more experiments see Appendix D.2, D.3 and D.4. Corollary: this is evidence that the model uses a
dynamic-programming-like approach to learn such hard, synthetic CFGs (see discussions in Section 5.3).

Given a sample pool {x^(n)}n∈[N] of strings from L(G), we compute, for each layer l and head h,¹¹
Al,h,p = Average⟦ Al,h,j→i(x^(n)) | n ∈ [N], 1 ≤ i ≤ j ≤ len(x^(n)) s.t. j − i = p ⟧ ,
which represents the average attention between any token pairs of distance p over the sample pool.
To remove this position bias, we focus on Bl,h,j→i(x) := Al,h,j→i(x) − Al,h,j−i in this subsection. Our
observation can be broken down into three steps.
Result 7 (Figure 9(a)). Bl,h,j→i (x) exhibits a strong bias towards tokens i at NT ends.
This can be seen in Figure 9(a), where we present the average value of Bl,h,j→i (x) over data x and
pairs i, j where i + δ is the deepest NT-end at level ℓ (symbolically, b♯ (i + δ) = ℓ). The attention
weights are highest when δ = 0 and decrease rapidly for surrounding tokens.
Result 8 (Figure 9(b)). Bl,h,j→i (x) favors pairs i, j both at NT ends at some level ℓ.
This can be seen in Figure 9(b), where we show the average value of Bl,h,j→i (x) over data x and
pairs i, j where bℓ (i + δ1 ) = bℓ (j + δ2 ) = 1 for δ1 , δ2 ∈ {−1, 0, 1}. It is maximized when δ1 = δ2 = 0.

Result 9 (Figure 9(c)). Bl,h,j→i(x) favors “adjacent” NT-end token pairs i, j.
Above, we define “adjacency” as follows. We introduce B^end→end_{l,h,ℓ′→ℓ,r} to represent the average value
of Bl,h,j→i(x) over samples x and token pairs i, j that are at the deepest NT-ends on levels ℓ, ℓ′
respectively (symbolically, b♯(i) = ℓ ∧ b♯(j) = ℓ′), and are at a distance r based on the ancestor
indices at level ℓ (symbolically, pℓ(j) − pℓ(i) = r). We observe that B^end→end_{l,h,ℓ′→ℓ,r} decreases as r
increases, and is highest when r = 0 (or r = 1 for pairs ℓ′ → ℓ without an r = 0 entry).¹²
In conclusion, tokens corresponding to NT-ends at level ℓ′ statistically have higher attention
weights to their most adjacent NT-ends at every level ℓ, even after removing the position bias.¹³
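A hypothetical sketch of this position-bias removal, reusing a captured attention array as in the previous sketch; for simplicity it subtracts a per-sample distance average, whereas the paper averages Al,h,p over the whole sample pool.

```python
import numpy as np

def remove_position_bias(attn):
    # returns B[l, h, j, i] = A_{l,h,j->i} - Abar_{l,h,j-i}
    layers, heads, n, _ = attn.shape
    j_idx, i_idx = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    B = attn.astype(float)
    for p in range(n):
        mask = (j_idx - i_idx) == p      # all token pairs at distance p
        B[:, :, mask] -= attn[:, :, mask].mean(axis=-1, keepdims=True)
    return B
# Averaging B over token pairs (i, j) located at NT-ends (using b_sharp and p_ell from
# Section 2.1) then yields the quantities visualized in Figure 9.
```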
¹¹Throughout this paper, we use ⟦·⟧ to denote multi-sets that allow multiplicity, such as ⟦1, 2, 2, 3⟧. This allows us
to conveniently talk about its set average.
¹²For any token pair j → i with ℓ = b♯(i) ≥ b♯(j) = ℓ′, meaning i is at an NT-end closer to the root than j, it
satisfies pℓ(j) − pℓ(i) ≥ 1, so their distance r is strictly positive.
¹³Without removing the position bias, such a statement might be meaningless, as the position bias may favor “adjacent”
anything, including NT-end pairs.

[Figure 10 content: (top) the parsing view, where DP(i, j, a) denotes whether symbol a can generate x_{i+1} ... x_j; these values are stored near the corresponding positions (see Results 4-5), and after pretraining the model's attention j → i has a strong bias from any position j to its most adjacent NT-end positions i, which is how it learns to parse the CFG; (bottom) the generative view, where DP2(j, a) denotes whether symbol a can follow the sequence x_1 ... x_j, which is how the model learns to generate from the CFG. Corollary: GPT mimics dynamic programming (DP).]
Figure 10: Illustration of how GPTs mimic dynamic programming. See discussions in Section 5.3.

5.3 Connection to DP
Dynamic programming (DP) comprises two components: storage and recurrent formula. Identifying
a specific DP implementation that a transformer follows is challenging due to the “exponentially
many” ways to implement such DPs (see Footnote 9). However, we highlight common elements in
all DP implementations and their correlation with the transformer. In Section 4, we demonstrated
that transformers can encode the DP’s storage “backbone”, encompassing all necessary DP(i, j, a)
on the correct CFG parsing tree, regardless of the DP implementation.
For the recurrent formula, consider DP(k, j, a) in the backbone, derived from DP(k, i, b) ∧
DP(i, j, c) using CFG rule a ↦ b, c. Given that DP(k, i, b) is stored near position i while DP(k, j, a)
and DP(i, j, c) are stored near position j (Result 5), the model needs to perform a memory read of
position i from position j, or j → i. Note that positions i and j are adjacent NT-ends of the same
level, and we have verified that GPT models favor attending j → i when i and j are adjacent NT-
ends, serving as evidence that (decoder-only) transformers use a DP-like approach. See Figure 10
(top) for an illustration.
Further reading for experts. Transformers are not only parsing algorithms but also genera-
tive ones. Experts in CFGs (or participants in competitions like IOI/USACO/ACM-ICPC) may
immediately understand that the generative process requires implementing a second DP:
let DP2(j, a) denote whether the prefix x1, . . . , xj can be followed by a given symbol a ∈ NT ∪ T.
Suppose there is a rule b ↦ c, a, and DP(i, j, c) ∧ DP2(i, b) both hold; this implies DP2(j, a) also
holds. This is analogous to the inside-outside algorithm [8]. In this case, the model also needs to
perform a memory read of position i from position j. Here, position i is the most adjacent NT-end
to position j at a different level ; we have also verified that GPT models favor attending such j → i.
See Figure 10 (bottom).
Finally, the above demonstration shows how to correctly parse and generate; but to generate
following the same distribution as the CFG, the model needs to learn DP2′(j, a), the probability that
symbol a can follow the prefix x1, . . . , xj. The recurrent formula is similar in terms of memory read
patterns (thus the attention patterns). We ignore this subtlety for conciseness.
In sum, while identifying a specific DP implementation that a transformer learns is nearly
impossible, we have shown that the backbone of the DP — including the necessary DP storage
states and recurrent formula — is observable in the pretrained models’ hidden states and attention
patterns. This serves as strong evidence that pretrained (decoder-only) transformers largely mimic
dynamic programming, regardless of the specific DP implementation they choose.

[Figure 11 content: two 300 × 300 correlation matrices of word embeddings, one for a uniform OT distribution and one for a non-uniform OT distribution, with rows/columns grouped by the labels 000, 001, 010, 011, 100, 101, 110, 111.]

Figure 11: Language models learn implicit CFGs by using word embeddings to encode the (hidden) terminal symbol.
We present word embedding correlations for GPT pre-trained on an implicit CFG with |T| = 3 and
vocabulary size |OT| = 300. There are 300 rows/columns, each representing an observable token a ∈ OT.
The label ijk ∈ {0, 1}^3 in the figure indicates whether a is in OTt for the three choices t ∈ T.

6 Results 10-13: Extensions of CFGs


6.1 Result 10: Implicit CFGs
In an implicit CFG, terminal symbols represent bags of tokens with shared properties. For ex-
ample, a terminal symbol like noun corresponds to a distribution over a bag of nouns, while verb
corresponds to a distribution over a bag of verbs. These distributions can be non-uniform and
overlapping, allowing tokens to be shared between different terminal symbols. During pre-training,
the model learns to associate tokens with their respective syntactic or semantic categories, without
prior knowledge of their specific roles in the CFG.
Formally, we consider a set of observable tokens OT, and each terminal symbol t ∈ T in G
is associated with a subset OTt ⊆ OT and a probability distribution Dt over OTt . The sets
(OTt )t can be overlapping. To generate a string from this implicit CFG, after generating x =
(x1 , x2 , . . . , xm ) ∼ L(G), for each terminal symbol xi , we independently sample one element yi ∼
Dxi. After that, we observe the new string y = (y1, y2, · · · , ym); we call this new distribution
y ∼ LO(G).
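A minimal sketch of this observation step, assuming bags[t] holds the token bag OT_t together with its sampling weights for D_t (these names are illustrative):

```python
import random

def sample_observed(sample_from_cfg, bags):
    # bags[t] = (tokens, weights): the bag OT_t and its distribution D_t
    x = sample_from_cfg()                     # x = (x_1, ..., x_m) ~ L(G)
    y = [random.choices(bags[t][0], weights=bags[t][1])[0] for t in x]
    return y                                  # the observed string y ~ L_O(G)
```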
We pre-train language models using samples from the distribution y ∼ LO (G). During testing,
we evaluate the success probability of the model generating a string that belongs to LO (G), given
an input prefix y:c. Or, in symbols,
Pr_{y∼LO(G) + randomness of F}[ (y:c, F(y:c)) ∈ LO(G) ] ,
where F (y:c ) represents the model’s generated completion given prefix y:c . (We again use dynamic
programming to determine whether the output string is in LO (G).)
We summarize our finding below, deferring details to Appendix E.
Result 10 (Figure 11). Generative language models can learn implicit CFGs very well. In particular, after pretraining, the token embeddings from the same subset OT_t are grouped together,
indicating that the model uses the token embedding layer to encode the hidden terminal-symbol information.

[Figure 12: generation-accuracy heatmaps for cfg3b under three pre-training methods (NT-level 0.1 random perturbation, T-level 0.15 random perturbation, NT-level 0.05 deterministic permutation). Rows correspond to clean prefixes (cut0, cut50) and corrupted prefixes (corrupted cut50) with generation temperatures τ = 0.1, 0.2, 1.0; columns correspond to the pre-training data perturbation ratio γ = 1.0, 0.9, . . . , 0.1 or to clean pre-training data.]

Figure 12: Generation accuracies for models pre-trained cleanly VS pre-trained over perturbed data, on clean or
corrupted prefixes with cuts c = 0 or c = 50, using generation temperatures τ = 0.1, 0.2, 1.0.

Observation. In Rows 4/5, by comparing against the last column, we see that it is beneficial to include
low-quality data (e.g., data with grammar mistakes) during pre-training. The amount of low-quality data can be
small (a γ = 0.1 fraction) or large (every training sentence may have grammar mistakes). The transformer
also learns a "mode switch" between generating correct strings and corrupted ones; details in Section 6.2.

6.2 Results 11-13: Robustness on Corrupted CFG


One may also wish to pre-train a transformer to be robust against errors and inconsistencies in
the input. For example, if the input is a prefix with some tokens corrupted or missing, one may
hope that the transformer corrects the errors and still completes the sentence following the correct
CFG rules. Robustness is an important property, as it reflects the transformer's ability to generalize
and adapt to real-world training data, which may not always follow the CFG perfectly (such as
having grammar errors).
To test robustness, for each input prefix x_{:c} of length c that belongs to the CFG, we randomly
select a set of positions i ∈ [c] in this prefix — each with probability ρ — and replace each selected
token i.i.d. with a random symbol in T. Call the resulting prefix x̃_{:c}. Next, we feed the corrupted
prefix x̃_{:c} to the transformer F and compute its generation accuracy in the uncorrupted CFG:

    Pr_{x ∼ L(G), F} [ (x_{:c}, F(x̃_{:c})) ∈ L(G) ] .
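A minimal sketch of this corruption procedure (variable names are ours):

import random

def corrupt_prefix(prefix, T, rho=0.15, rng=random.Random(0)):
    """Independently, with probability rho, replace each prefix token by a random symbol in T."""
    return [rng.choice(T) if rng.random() < rho else t for t in prefix]

# The corrupted prefix is fed to F, and the pair (clean prefix, F(corrupted prefix))
# is checked for membership in the *uncorrupted* L(G) via dynamic programming.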
We not only consider clean pre-training, but also several versions of robust pre-training. That is,
we randomly select a γ ∈ [0, 1] fraction of the training data and perturb it before feeding it into the
pre-training process. We compare three types of data perturbations.14
• (T-level random perturbation). For each x_i, w.p. 0.15 we replace it with a random symbol in T.
• (NT-level random perturbation). Let ℓ = L − 1 and recall that s_ℓ = (s_{ℓ,1}, s_{ℓ,2}, . . . , s_{ℓ,m_{L−1}}) is the
sequence of symbols at NT-level ℓ. For each s_{ℓ,i}, w.p. 0.10 we perturb it to a random symbol
in NT_ℓ; and then generate x = s_L according to this perturbed sequence.
• (NT-level deterministic perturbation). Let ℓ = L − 1 and fix a permutation π over the symbols in
NT_ℓ. For each s_{ℓ,i}, w.p. 0.05 we perturb it to its next symbol in NT_{L−1} according to π; and
then generate x = s_L according to this perturbed sequence.
We focus on ρ = 0.15 with a wide range of data perturbation ratios γ = 0.0, 0.1, . . . , 0.9, 1.0. We present
our findings in Figure 12. The main message is:
Result 11 (Figure 12, rows 4/5). When pretrained over clean data, GPT models are not so robust
to “grammar mistakes.” It is beneficial to include corrupted or low-quality pretrain data.
Specifically, GPT models achieve only ∼ 30% accuracy when pretrained over clean data x ∼ L(G).
14 One can easily extend our experiments by considering other types of data corruption (for evaluation) and other
types of data perturbations (for training). We refrain from doing so because it is beyond the scope of this paper.

If we pretrain from perturbed data — both when γ = 1.0 so all data are perturbed, and when
γ = 0.1 so we have a small fraction of perturbed data — GPT can achieve ∼ 79%, 82% and 60%
robust accuracies respectively using the three types of data perturbations (rows 4/5 of Figure 12).
Next, we take a closer look. If we use temperature τ = 1 for generation:
Result 12 (Figure 12, rows 3/6/9). Pre-training on corrupted data teaches the model a mode switch.
• Given a correct prefix, it mostly completes with a correct string in the CFG (Row 9);
• Given a corrupted prefix, it always completes sentences with grammar mistakes (Row 6);
• When given no prefix, it generates corrupted strings with probability close to γ (Row 3).
By comparing the generation accuracies across different τ and γ, we observe:
Result 13 (Figure 12, rows 4/5/6). High robust accuracy is achieved when generating using low
temperatures τ,15 and is not sensitive to γ, the fraction of pretrain data that is perturbed.
This should not be surprising given that the language model learned a "mode switch." Using a low
temperature encourages the model, for each next token, to pick a more probable solution. This
allows it to achieve good robust accuracy even when the model is trained entirely on corrupted data
(γ = 1.0). Note this is consistent with practice: when feeding a pre-trained completion model (such
as Llama or GPT-3-davinci003) prompts containing grammar mistakes, it tends to produce text that also
contains (even new!) grammar mistakes when using a large temperature.
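For intuition, here is a standard temperature-sampling sketch (not specific to our codebase): as τ → 0 the sampler approaches greedy decoding, which is what drives the high robust accuracy above.

import numpy as np

def sample_next_token(logits, tau, rng=np.random.default_rng(0)):
    """tau -> 0: (near-)greedy argmax; tau = 1: the model's unaltered next-token distribution."""
    if tau == 0:
        return int(np.argmax(logits))
    z = np.asarray(logits, dtype=np.float64) / tau
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))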
Our experiments suggest that additional instruct fine-tuning may be necessary if one wants
the model to always stay in the "correct mode," even at high temperatures. This is beyond the
scope of this paper.

7 Related Work and Conclusion

Related Works. Transformers can encode some CFGs, especially those that correspond to natural
languages [7, 14, 16, 18, 27, 31, 33, 37]. Deletang et al. [10] studied transformers' learnability on
a few languages in the Chomsky hierarchy (which includes CFGs). However, the inner mechanisms
regarding how transformers can or cannot solve those tasks remain unclear.
There are works "better" than ours in the sense of precisely interpreting each neuron's function, but they
study simpler tasks using simpler architectures. For instance, Nanda et al. [21] examined 1- or
2-layer transformers with context length 3 for arithmetic addition. We focus on the 100M-sized
GPT2 model with context lengths exceeding 300. While we cannot precisely determine each neuron's
function, we have identified the roles of some heads and some hidden states, which correlate with
dynamic programming.
In addition to linear probing, Murty et al. [20] explored alternative methods to deduce the tree
structures learned by a transformer. They developed a score to quantify the “tree-like” nature of
a transformer, demonstrating that it becomes increasingly tree-like during training. Our Figure 21
in Appendix C.3 also confirms such findings.
Conclusion. In this paper, we studied how transformers like GPT2 learn synthetically generated,
yet challenging, CFGs, and showed that their inner workings highly correlate with the internal states of the
dynamic programming algorithms needed to parse and generate from such CFGs. This contributes
to the field by providing insights into how language models can effectively learn and generate
complex and diverse structural expressions. Additionally, we introduced tools such as multi-head
linear probing, which may pave the way for further interpretation and analysis of larger models
tackling more complex tasks.

15 Recall that when temperature τ = 0 the generation is greedy and deterministic; when τ = 1 it reflects the unaltered
distribution learned by the transformer; and when τ > 0 is small it encourages the transformer to output "more probable"
tokens.
We also present corollary results, including showing why positional embedding is inferior to
relative attention or rotary embedding; demonstrating that encoder-based models (e.g., BERT,
deBERTa) cannot learn very deeply nested CFGs as effectively as generative models (e.g., GPT);
and highlighting the practical necessity of adding structural, syntactic errors to the pretraining
data to make the model more robust to corrupted language prefixes.
Finally, Part 1 of this work series marks the initial step in exploring how language models learn
hierarchical language structures. Future directions include grade-school math and reasoning [34, 35]
(Part 2), as well as knowledge storage, extraction, and manipulation [1, 2, 4] (Part 3).

Appendix
A Experiment Setups
A.1 Dataset Details
We construct seven synthetic CFGs of depth L = 7 with varying levels of learning difficulty. It
can be inferred that the fewer the T/NT symbols, the more ambiguous and thus the more challenging
the CFG is to learn. For this reason, to push the capabilities of language models to their limits, we primarily
focus on cfg3b, cfg3i, cfg3h, cfg3g, cfg3f, which are of sizes (1, 3, 3, 3, 3, 3, 3) and present increasing
levels of difficulty. Detailed information about these CFGs is provided in Figure 13:
• In cfg3b, we construct the CFG such that the degree |R(a)| = 2 for every NT a. We also
ensure that in any generation rule, consecutive pairs of T/NT symbols are distinct.
The 25%, 50%, 75%, and 95% percentile string lengths are 251, 278, 308, 342 respectively.
• In cfg3i, we set |R(a)| = 2 for every NT a. We remove the requirement for distinctness to
make the data more challenging than cfg3b.
The 25%, 50%, 75%, and 95% percentile string lengths are 276, 307, 340, 386 respectively.
• In cfg3h, we set |R(a)| ∈ {2, 3} for every NT a to make the data more challenging than cfg3i.
The 25%, 50%, 75%, and 95% percentile string lengths are 202, 238, 270, 300 respectively.
• In cfg3g, we set |R(a)| = 3 for every NT a to make the data more challenging than cfg3h.
The 25%, 50%, 75%, and 95% percentile string lengths are 212, 258, 294, 341 respectively.
• In cfg3f, we set |R(a)| ∈ {3, 4} for every NT a to make the data more challenging than cfg3g.
The 25%, 50%, 75%, and 95% percentile string lengths are 191, 247, 302, 364 respectively.
Remark A.1. From the examples in Figure 13, it becomes evident that for grammars G of depth
7, proving that a string x belongs to L(G) is highly non-trivial, even for a human being, and even
when the CFG rules are known. The standard method of demonstrating x ∈ L(G) is through
dynamic programming. We further discuss what we mean by a CFG’s “difficulty” in Appendix G,
and provide additional experiments beyond the cfg3 data family.
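For readers who want to reproduce data of this flavor, below is a minimal sketch of constructing and sampling a random depth-7 CFG of size (1, 3, 3, 3, 3, 3, 3) with |R(a)| = 2 and rule lengths in {2, 3}. The concrete rules produced here are random placeholders, not the actual cfg3 rules listed in Figure 13 (which additionally impose constraints such as distinct consecutive symbols for cfg3b).

import random

rng = random.Random(42)
L, sizes, degree = 7, [1, 3, 3, 3, 3, 3, 3], 2
symbols = [[f"{l}_{i}" for i in range(n)] for l, n in enumerate(sizes)]   # symbols per level
rules = {a: [[rng.choice(symbols[l + 1]) for _ in range(rng.choice([2, 3]))]
             for _ in range(degree)]
         for l in range(L - 1) for a in symbols[l]}

def sample(symbol="0_0", level=0):
    """Expand a symbol by picking one of its |R(a)| = 2 rules uniformly at random."""
    if level == L - 1:                       # the last level holds the terminal symbols
        return [symbol]
    out = []
    for child in rng.choice(rules[symbol]):  # one rule = an ordered list of children
        out.extend(sample(child, level + 1))
    return out

print(len(sample()))   # typical lengths are in the hundreds, as in Figure 13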

[Figure 13 content: the full lists of production rules for cfg3b, cfg3i, cfg3h, cfg3g, and cfg3f (rules of the form a 7→ b c or a 7→ b c d over the symbol IDs 1–22), each shown together with one sample string from the grammar.]

Figure 13: The context-free grammars cfg3b, cfg3i, cfg3h, cfg3g, cfg3f that we primarily use in this paper, together
with a sample string from each of them.

Observation. Although those CFGs are only of depth 7, they are capable of generating sufficiently long
and hard instances; after all, even when the CFG rules are given, the typical way to decide if a string x
belongs to the CFG language x ∈ L(G) may require dynamic programming.

Remark A.2. cfg3f is a dataset that sits right on the boundary of difficulty at which GPT2-small
is capable of learning; see Figure 31 later, which shows that smaller GPT2 models cannot learn cfg3f
(and refer to subsequent subsections for training parameters). While it is certainly possible to
consider deeper and more complex CFGs, this would necessitate training a larger network for a
longer period. We choose not to do this as our findings are sufficiently convincing at the level of
cfg3f.
Simultaneously, to illustrate that transformers can learn CFGs with larger |NT| or |T|, we con-
struct datasets cfg3e1 and cfg3e2 respectively of sizes (1, 3, 9, 27, 81, 27, 9) and (1, 3, 9, 27, 27, 9, 4).
They are too lengthy to describe so we include them in an attached txt file in Appendix G.2.

A.2 Model Architecture Details


We define GPT as the standard GPT2-small architecture [25], which consists of 12 layers, 12 attention
heads per layer, and 768 (=12 × 64) hidden dimensions. We pre-train GPT on the aforementioned
datasets, starting from random initialization. For a baseline comparison, we also implement
DeBERTa [13], resizing it to match the dimensions of GPT2 — thus also comprising 12 layers, 12
attention heads, and 768 dimensions.

Architecture size. We have experimented with models of varying sizes and observed that their
learning capabilities scale with the complexity of the CFGs. To ensure a fair comparison and
enhance reproducibility, we primarily focus on models with 12 layers, 12 attention heads, and 768
dimensions. The transformers constructed in this manner consist of 86M parameters.
Modern GPTs with relative attention. Recent research [9, 13, 29] has demonstrated that
transformers can significantly improve performance by using attention mechanisms based on the
relative position differences of tokens, as opposed to the absolute positions used in the original
GPT2 [25] or BERT [15]. There are two main approaches to achieve this. The first is to use a
“relative positional embedding layer” on |j − i| when calculating the attention from j to i (or a
bucket embedding to save space). This approach is the most effective but tends to train slower.
The second approach is to apply a rotary positional embedding (RoPE) transformation [29] on the
hidden states; this is known to be slightly less effective than the relative approach, but it can be
trained much faster.
We have implemented both approaches. We adopted the RoPE implementation from the GPT-
NeoX-20B project (along with the default parameters), but downsized it to fit the GPT2 small
model. We refer to this architecture as GPTrot . Since we could not find a standard implementation
of GPT using relative attention, we re-implemented GPT2 using the relative attention framework
from DeBERTa [13]. (Recall, DeBERTa is a variant of BERT that effectively utilizes relative
positional embeddings.) We refer to this architecture as GPTrel .
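For completeness, here is a minimal sketch of the rotary transformation (the half-split variant used by GPT-NeoX); our experiments use the GPT-NeoX implementation itself rather than this sketch.

import numpy as np

def rotary(v, pos, base=10000.0):
    """Rotate consecutive feature pairs of v by position-dependent angles (RoPE)."""
    d = v.shape[-1]
    half = d // 2
    freqs = base ** (-2.0 * np.arange(half) / d)       # one frequency per pair
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    v1, v2 = v[..., :half], v[..., half:]
    return np.concatenate([v1 * cos - v2 * sin, v1 * sin + v2 * cos], axis=-1)

# Attention logits are computed on rotary(q, j) @ rotary(k, i), which depends on
# the positions only through the relative difference j - i.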
Weaker GPTs utilizing only position-based attention. For the purpose of analysis, we also
consider two significantly weaker variants of GPT, where the attention matrix exclusively depends
on the token positions, and not on the input sequences or hidden embeddings. In other words, the
attention pattern remains constant for all input sequences.
We implement GPTpos , a variant of GPTrel that restricts the attention matrix to be computed
solely using the (trainable) relative positional embedding. This can be perceived as a GPT variant
that maximizes the use of position-based attention. We also implement GPTuni , a 12-layer, 8-head,
1024-dimension transformer, where the attention matrix is fixed ; for each h ∈ [8], the h-th head
consistently uses a fixed, uniform attention over the previous 2h − 1 tokens. This can be perceived
as a GPT variant that employs the simplest form of position-based attention.
Remark A.3. It should not be surprising that GPTpos or GPTuni performs much worse than the other
GPT models on real-life wikibook pre-training. However, once again, we use them only for analysis
purposes in this paper, as we wish to demonstrate the maximum power of GPT when only
using position-based attention to learn CFGs, and the marginal effect of going beyond
position-based attention.

Features from random transformer. Finally we also consider a randomly-initialized GPTrel ,


and use those random features for the purpose of predicting NT ancestors and NT ends. This
serves as a baseline, and can be viewed as the power of the so-called (finite-width) neural tangent
kernel [5]. We call this GPTrand .

A.3 Pre-Training Details


For each sample x ∼ L(G) we append it to the left with a BOS token and to the right with an
EOS token. Then, following the tradition of language modeling (LM) pre-training, we concatenate
consecutive samples and randomly cut the data to form sequences of a fixed window length 512.
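A minimal sketch of this packing step (token names and the sample iterator are placeholders; the actual pipeline cuts at random offsets, which we omit here):

def pack_into_windows(sample_iter, window=512, bos="BOS", eos="EOS"):
    """Wrap each sample with BOS/EOS, concatenate, and cut into fixed-length windows."""
    buf = []
    for x in sample_iter:            # each x is one token sequence drawn from L(G)
        buf += [bos] + list(x) + [eos]
        while len(buf) >= window:
            yield buf[:window]
            buf = buf[window:]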
As a baseline comparison, we also applied DeBERTa on a masked language modeling (MLM)
task for our datasets. We use standard MLM parameters: 15% masking probability, with an 80%
chance of using a masked token, a 10% chance of using the original token, and a 10% chance of using a
random token.
random token.
We use standard initializations from the huggingface library. For GPT pre-training, we use
AdamW with β = (0.9, 0.98), weight decay 0.1, learning rate 0.0003, and batch size 96. We pre-
train the model for 100k iterations, with a linear learning rate decay.16 For DeBERTa, we use
learning rate 0.0001 (which works better for it) and 2000 steps of linear learning-rate warmup.
Throughout the experiments, for both pre-training and testing, we only use fresh samples
from the CFG datasets (thus using 4.9 billion tokens = 96 × 512 × 100k). We have also tested pre-training
with a finite training set of 100M tokens, and the conclusions of this paper stay similar.
To keep this paper clean, we choose to stick to the infinite-data regime in this version of the
paper, because it enables us to make negative statements (for instance about the vanilla GPT or
DeBERTa, or about the learnability of NT ancestors / NT boundaries) without worrying about
the sample size. Please note that, given that our CFG language is very large (e.g., a length-300 tree of
length-2/3 rules and degree 4 would have at least 4^{300/3} possibilities), there is almost no chance that
training and testing hit the same sentence.
As for the reproducibility of our results, we did not run each pre-training experiment more than
once (or plot any confidence intervals). This is because, rather than repeating our experiments
identically, it is obviously more interesting to use the resources to run them against different datasets
and different parameters. We pick the best model using the perplexity score from each
pre-training task. When evaluating the generation accuracy in Figure 4, we have generated more
than 20000 samples for each case, and present the diversity pattern accordingly in Figure 14.

A.4 Predict NT ancestor and NT boundary


Recall from Section 4.1 that we have proposed to use a multi-head linear function to probe whether
or not the hidden states of a transformer implicitly encode the NT ancestor and NT boundary
information for each token position. Since this linear function can be of dimension 512 × 768 —
when the context length is 512 and the hidden dimension is 768 — recall that in (4.2) we proposed
to use a multi-head attention to construct such a linear function for efficient learning purposes. This
significantly reduces the sample complexity and makes it much easier to find the linear function.
In our implementation, we choose H = 16 heads and hidden dimension d′ = 1024 when constructing
this position-based attention in (4.2). We have also tried other parameters, but the NT
ancestor/boundary prediction accuracies are not very sensitive to such architectural changes. We
again use AdamW with β = (0.9, 0.98), but this time with learning rate 0.003, weight decay 0.001,
batch size 60, training for 30k iterations.
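One plausible PyTorch-style instantiation of such a position-based multi-head linear probe is sketched below; the exact parameterization is the one in (4.2), and this module (shapes H = 16 and d′ = 1024 aside) is only our illustration of the idea: the mixing weights depend on positions alone, so the probe remains linear in the hidden states.

import torch
import torch.nn as nn

class PositionBasedMultiHeadProbe(nn.Module):
    """Multi-head linear probe whose position-mixing weights are input-independent."""
    def __init__(self, seq_len=512, hidden=768, n_heads=16, d_probe=1024, n_classes=3):
        super().__init__()
        self.n_heads = n_heads
        self.mix = nn.Parameter(torch.zeros(n_heads, seq_len, seq_len))  # w_{h, i -> k}
        self.value = nn.Linear(hidden, d_probe, bias=False)              # per-head value maps
        self.out = nn.Linear(d_probe, n_classes)        # e.g., 3 NT symbols per level in cfg3

    def forward(self, h):                                   # h: [batch, seq, hidden]
        B, T, _ = h.shape
        v = self.value(h).view(B, T, self.n_heads, -1)      # [B, T, H, d'/H]
        mixed = torch.einsum("hik,bkhd->bihd", self.mix[:, :T, :T], v)
        return self.out(mixed.reshape(B, T, -1))            # per-position logits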
Once again, we use fresh new samples when training such linear functions. When evaluating the
accuracies of predicting the NT ancestor / boundary information, we also use fresh new samples.
Recall that our CFG language is sufficiently large, so there is a negligible chance that the model has seen
such a string during training.

B More Experiments on Generation


Diversity can be estimated through entropy. Given a distribution p over strings and a sampled subset
S = {x^(i)}_{i∈[M]} from p, for any string x ∈ S, denote by len(x) its length, so x = (x_1, . . . , x_{len(x)}), and denote x_{len(x)+1} = eos.
16 We have slightly tuned the parameters to make pre-training go best. We noticed that for training GPTs over our
CFG data, a warmup learning rate schedule is not needed.

[Figure 14 panels: diversity patterns for cfg3f, comparing the ground truth against GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni, each with cut0 and cut50; the color scale ranges from 10^0 to 10^4.]
Figure 14: Comparing the generation diversity S^truth_{a→ℓ2} and S^F_{a→ℓ2} across different learned GPT models (c = 0 or
c = 50). Rows correspond to NT symbols a and columns correspond to ℓ2 = 2, 3, . . . , 7. Colors represent
the number of distinct elements in S^truth_{a→ℓ2}, and the white numbers represent the collision counts (if not
present, meaning there are more than 5 collisions). More experiments in Figures 15, 16, and 17.

Observation. We use M = 20000 samples. The diversity pattern from the pre-trained transformer
matches that of the ground truth. For instance, from the root one can generate Ω(M²) distinct sequences
down to level ℓ2 = 5 using the CFG rules, and from every a ∈ NT2 one can generate Ω(M²) down to level ℓ2 = 6
(not to mention the T-level ℓ2 = 7); this is already more than the number of parameters in the model.
Therefore, we conclude that the pre-trained model does not rely on simply memorizing a small set
of patterns to learn the CFGs.

The entropy in bits for p can be estimated by

    − (1/|S|) Σ_{x ∈ S} Σ_{i ∈ [len(x)+1]} log₂ Pr_p [ x_i | x_1, . . . , x_{i−1} ] .

We compare the entropy of the true CFG distribution and the transformer’s output distribution
using M = 20000 samples in Figure 4 (middle).
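A minimal sketch of this estimator, assuming access to the conditional probabilities Pr_p[x_i | x_1, . . . , x_{i−1}] (e.g., from the ground-truth CFG or from the transformer's softmax outputs):

import math

def estimated_entropy_bits(samples, cond_prob):
    """samples: list of token sequences; cond_prob(prefix, token) -> Pr_p[token | prefix]."""
    total = 0.0
    for x in samples:
        seq = list(x) + ["eos"]
        for i, tok in enumerate(seq):
            total -= math.log2(cond_prob(seq[:i], tok))
    return total / len(samples)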
Diversity can also be estimated using the birthday paradox to lower bound the support size of
a distribution [6]. Given a distribution p over strings and a sampled subset S = {x^(i)}_{i∈[M]} from
p, if every pair of samples in S is distinct, then with good probability the support of p is of size
at least Ω(M²). In Appendix B.1, we conducted an experiment with M = 20000. We performed a
birthday paradox experiment from every symbol a ∈ NT_{ℓ1} to some other level ℓ2 > ℓ1, comparing
that with the ground truth. For instance, we confirmed that for the cfg3f dataset, there are at least
Ω(M²) distinct sentential forms that can be derived from a symbol at level 1 to level 5, or from
level 2 to level 6, etc. — not to mention from the root in NT1 to the leaf at level 7. In particular,
M² is already more than the number of parameters in the model.
From both experiments, we conclude that the pre-trained model does not rely on simply
memorizing a small set of patterns to learn the CFGs.
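A small sketch of the two quantities used in this birthday-paradox test (the number of distinct samples and the number of colliding pairs):

from collections import Counter

def distinct_and_collisions(samples):
    """Return (#distinct samples, #colliding pairs) for a list of token sequences."""
    counts = Counter(tuple(s) for s in samples)
    distinct = len(counts)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    return distinct, collisions

# If collisions == 0 for M i.i.d. samples, a birthday-paradox argument says the support
# of the underlying distribution is at least on the order of M^2 with good probability.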

B.1 Generation Diversity via Birthday Paradox


Since “diversity” is influenced by the length of the input prefix, the length of the output, and the
CFG rules, we want to carefully define what we measure.
Given a sample pool x^(1), . . . , x^(M) ∈ L(G), for every symbol a ∈ NT_{ℓ1} and some later level
ℓ2 ≥ ℓ1 that is closer to the leaves, we wish to define a multi-set S_{a→ℓ2} that describes all possible
generations from a ∈ NT_{ℓ1} to NT_{ℓ2} in this sample pool. Formally,

Definition B.1. For x ∈ L(G) and ℓ ∈ [L], we use s_ℓ(i..j) to denote the sequence of NT ancestor
symbols at level ℓ ∈ [L] from position i to j with distinct ancestor indices:17

    s_ℓ(i..j) = ( s_ℓ(k) )_{k ∈ {i, i+1, . . . , j} s.t. p_ℓ(k) ≠ p_ℓ(k+1)}

Definition B.2. For a symbol a ∈ NT_{ℓ1} and some level ℓ2 ∈ {ℓ1, ℓ1 + 1, . . . , L}, define the multi-set18

    S_{a→ℓ2}(x) = ⟦ s_{ℓ2}(i..j) | ∀ i ≤ j such that p_{ℓ1}(i−1) ≠ p_{ℓ1}(i) = p_{ℓ1}(j) ≠ p_{ℓ1}(j+1) ∧ a = s_{ℓ1}(i) ⟧

and we define the multi-set union S_{a→ℓ2} = ⋃_{i∈[M]} S_{a→ℓ2}(x^(i)), which is the multiset of all sentential
forms that can be derived from NT symbol a to depth ℓ2.

(Above, when x ∼ L(G) is generated from the ground-truth CFG, then the ancestor indices and
symbols p, s are defined in Section 2.1. If x ∈ L(G) is an output from the transformer F , then we
let p, s be computed using dynamic programming, breaking ties lexicographically.)
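A minimal sketch of extracting S_{a→ℓ2}(x) from per-position ancestor symbols s_ℓ(·) and ancestor indices p_ℓ(·) (0-indexed Python arrays; the layout is our own convention, with the boundary convention of Footnote 17 handled by the loop bounds):

def S_a_to_l2(a, s_l1, p_l1, s_l2, p_l2):
    """Multiset of level-l2 ancestor sequences under each maximal span whose level-l1 ancestor is a."""
    n, out, i = len(s_l1), [], 0
    while i < n:
        j = i
        while j + 1 < n and p_l1[j + 1] == p_l1[i]:   # maximal span with a constant l1-ancestor index
            j += 1
        if s_l1[i] == a:
            seq = tuple(s_l2[k] for k in range(i, j + 1)
                        if k == j or p_l2[k] != p_l2[k + 1])   # keep one entry per l2 ancestor
            out.append(seq)
        i = j + 1
    return out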
We use S^truth_{a→ℓ2} to denote the ground truth S_{a→ℓ2} when x^(1), . . . , x^(M) are i.i.d. sampled from the
real distribution L(G), and denote by

    S^F_{a→ℓ2} = ⋃_{i ∈ [M′] s.t. (x^(i)_{:c}, F(x^(i)_{:c})) ∈ L(G)} S_{a→ℓ2}( (x^(i)_{:c}, F(x^(i)_{:c})) )

that from the transformer F. For a fair comparison, for each F and c, we pick an M′ ≥ M such
that M = |{ i ∈ [M′] : (x^(i)_{:c}, F(x^(i)_{:c})) ∈ L(G) }|, so that F is capable of generating exactly M sentences
that nearly-perfectly satisfy the CFG rules.19


Intuitively, for x's generated by the transformer model, the larger the number of distinct sequences
in S^F_{a→ℓ2} is, the more diverse the set of NTs at level ℓ2 (or Ts if ℓ2 = L) the model can
generate starting from NT a. Moreover, in the event that S^F_{a→ℓ2} has only distinct sequences (so
the collision count is 0), then we know that the generation from a → ℓ2, with good probability, should
include at least Ω(M²) possibilities using a birthday paradox argument.20
For this reason, it can be beneficial to compare the number of distinct sequences and the
collision counts between S^F_{a→ℓ2} and S^truth_{a→ℓ2}. Note that we consider all ℓ2 ≥ ℓ1 instead of only ℓ2 = L,
because we want to better capture the model's diversity at all CFG levels.21
in Figure 14 with M = 20000 samples for the cfg3f dataset.
In Figure 15 we present that for cfg3b, cfg3i, cfg3h, cfg3g, in Figure 16 for cfg3e1, and in Figure 17
for cfg3e2. We note that not only for hard, ambiguous datasets, also for those less ambiguous
(cfg3e1, cfg3e2) datasets, language models are capable of generating very diverse outputs.
17 With the understanding that p_ℓ(0) = p_ℓ(len(x) + 1) = ∞.
18 Throughout this paper, we use ⟦·⟧ to denote multi-sets that allow multiplicity, such as ⟦1, 2, 2, 3⟧. This allows us
to conveniently talk about its collision count, number of distinct elements, and set average.
19 Please note M and M′ are roughly the same, given
20 A CFG of depth L, even with constant degree and constant size, can generate 2^{2^{Ω(L)}} distinct sequences.
21 A model might generate the same NT symbol sequence s_{L−1}, and then generate different Ts randomly from each
NT. In this way, the model still generates strings x with large diversity, but S^F_{a→L−1}(x) is small. If S^F_{a→ℓ2} is large
for every ℓ2 and a, then the generation from the model is truly diverse at any level of the CFG.

[Figure 15 panels: diversity patterns for (a) cfg3b, (b) cfg3i, (c) cfg3h, (d) cfg3g, comparing the ground truth against GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni, each with cut0 and cut50.]
Figure 15: Comparing the generation diversity S^truth_{a→ℓ2} and S^F_{a→ℓ2} across different learned GPT models (and for c = 0
or c = 50). Rows correspond to NT symbols a and columns correspond to ℓ2 = 2, 3, . . . , 7. Colors represent
the number of distinct elements in S^truth_{a→ℓ2}, and the white numbers represent the collision counts (if not
present, meaning there are more than 5 collisions).
[Figure 16 panels: diversity pattern for cfg3e1, comparing the ground truth against GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni, each with cut0 and cut50.]
Figure 16: Comparing the generation diversity S^truth_{a→ℓ2} and S^F_{a→ℓ2} across different learned GPT models (and for c = 0
or c = 50). Rows correspond to NT symbols a and columns correspond to ℓ2 = 2, 3, . . . , 7. Colors represent
the number of distinct elements in S^truth_{a→ℓ2}, and the white numbers represent the collision counts (if not
present, meaning there are more than 5 collisions). This is for the cfg3e1 dataset.

[Figure 17 panels: diversity pattern for cfg3e2, comparing the ground truth against GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni, each with cut0 and cut50.]
Figure 17: Comparing the generation diversity S^truth_{a→ℓ2} and S^F_{a→ℓ2} across different learned GPT models (and for c = 0
or c = 50). Rows correspond to NT symbols a and columns correspond to ℓ2 = 2, 3, . . . , 7. Colors represent
the number of distinct elements in S^truth_{a→ℓ2}, and the white numbers represent the collision counts (if not
present, meaning there are more than 5 collisions). This is for the cfg3e2 dataset.

B.2 Marginal Distribution Comparison
In order to effectively learn a CFG, it is also important to match the distribution of generating
probabilities. While measuring this can be challenging, we have conducted at least a simple test
on the marginal distributions p(a, i), which represent the probability of symbol a ∈ NTℓ appearing
at position i (i.e., the probability that sℓ (i) = a). We observe a strong alignment between the
generated probabilities and the ground-truth distribution. See Figure 18.
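A small sketch of the empirical estimate of p(a, i) from samples of the per-position level-ℓ ancestor symbols s_ℓ(·) (array layout is our assumption):

from collections import defaultdict

def marginal_p(ancestor_sequences):
    """ancestor_sequences: list of per-position level-l ancestor symbols; returns estimates of p(a, i)."""
    counts, totals = defaultdict(int), defaultdict(int)
    for s in ancestor_sequences:
        for i, a in enumerate(s):
            counts[(a, i)] += 1
            totals[i] += 1
    return {(a, i): c / totals[i] for (a, i), c in counts.items()}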

[Figure 18 panels (a)–(j): for each of cfg3b, cfg3i, cfg3h, cfg3g, cfg3f, the left panel shows the marginal distribution over absolute positions for the ground truth and for GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni (cut0 and cut50), and the right panel shows its difference from the ground truth.]

Figure 18: Marginal distribution p(a, i) difference between a trained model and the ground-truth, for an NT/T symbol
a (column) at position i (row). Figures on the left compare the marginal distribution of the ground-truth
against those generated from 5 models × 2 cut positions (c = 0/c = 50). Figures on the right showcase
the marginal distribution difference between them and the ground-truth. It is noticeable from the figures
that GPT did not learn cfg3g and cfg3f well. This is consistent with the generation accuracies in Figure 4.

C More Experiments on NT Ancestor and NT Boundary Predictions
C.1 NT Ancestor and NT Boundary Predictions
Earlier, as confirmed in Figure 5, we established that the hidden states (of the final transformer
layer) have implicitly encoded the NT ancestor symbols sℓ (i) for each CFG level ℓ and token position
i using a linear transformation. In Figure 19(a), we also demonstrated that the same conclusion
applies to the NT-end boundary information bℓ (i). More importantly, for bℓ (i), we showed that
this information is stored locally, very close to position i (such as at i ± 1). Detailed information
can be found in Figure 19.
Furthermore, as recalled in Figure 7, we confirmed that at any NT boundary where bℓ (i) = 1,
the transformer has also locally encoded clear information about the NT ancestor symbol sℓ (i),
either exactly at i or at i ± 1. To be precise, this is a conditional statement — given that it is an
NT boundary, NT ancestors can be predicted. Therefore, in principle, one must also verify that
the prediction task for the NT boundary is successful to begin with. Such missing experiments are,
in fact, included in Figure 19(b) and Figure 19(c).

[Figure 19 panels: NT-end boundary prediction accuracies (%) on cfg3b, cfg3i, cfg3h, cfg3g, cfg3f, cfg3e1, cfg3e2 for GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni and the baseline GPT_rand, at levels NT6 . . . NT2.]
(a) Predicting NT boundaries: the column NT_ℓ for ℓ = 2, 3, 4, 5, 6 represents the accuracy of predicting b_ℓ using the multi-head linear probing function described in (4.2).
(b) Predicting NT boundaries with diagonal masking: the column NT_ℓ for ℓ = 2, 3, 4, 5, 6 represents the accuracy of predicting b_ℓ using (4.2) but setting w_{r,i→k} = 0 for i ≠ k.
(c) Predicting NT boundaries with tridiagonal masking: the column NT_ℓ for ℓ = 2, 3, 4, 5, 6 represents the accuracy of predicting b_ℓ using (4.2) but setting w_{r,i→k} = 0 for |i − k| > 1.

Figure 19: After pre-training, the NT-end boundary information — i.e., bℓ (i) for position i and NT level ℓ— is
largely stored locally near the hidden state at position i ± 1, up to a linear transformation. This can be
compared with the prediction accuracy of the NT ancestor sℓ (i) in Figure 5.

Observation. This implies that the transformer actually knows, with very good accuracy, that "position
i is already the end of an NT at level ℓ," just by reading all the text up to this position (possibly peeking
one more token to its right).
Remark 1. It may be mathematically necessary to peek more than one token to decide whether a position i is
at an NT boundary, due to the CFG's ambiguity. But in most cases, this can be decided quite early.
Remark 2. Predicting the NT boundary is a very biased binary classification task. For levels ℓ that are close
to the CFG root, most positions are not at an NT boundary for that level ℓ (see Figure 2). For this reason,
in the heatmap colors of the figures above, we have normalized the columns NT2..NT6
differently, to reflect this bias.

C.2 NT Predictions Across Transformer’s Layers
As one may imagine, the NT ancestor and boundary information for smaller CFG levels ℓ (i.e., closer
to the CFG root) is only learned at the deeper transformer layers l. In Figure 20, we present this
finding by calculating the linear encoding accuracies with respect to all 12 transformer layers
in GPT and GPTrel. We confirm that generative models discover such information hierarchically.

[Figure 20 panels: linear probing accuracies per transformer layer (lay0–lay12) on cfg3f and cfg3i for GPT, GPT_rel, and the GPT_rand baseline, at levels NT6 . . . NT2.]
(a) Predict NT ancestors, comparing against the GPT_rand baseline.
(b) Predict NT boundaries, comparing against the GPT_rand baseline.

Figure 20: Generative models discover NT ancestors and NT boundaries hierarchically.

C.3 NT Predictions Across Training Epochs
Moreover, one may conjecture that the NT ancestor and NT boundary information is learned
gradually as the number of training steps increases. We have confirmed this in Figure 21. We
emphasize that this does not imply that layer-wise training is applicable to learning deep CFGs. It is
crucial to train all the layers together, as the training process of the deeper transformer layers may
help backward-correct the features learned in the lower layers, through a process called "backward
feature correction" [3].

[Figure 21: NT ancestor and NT-end boundary prediction accuracies (%) at levels NT6 . . . NT2 for GPT and GPT_rel, measured every 5 training epochs from epoch 5 to epoch 200.]

Figure 21: Generative models discover NT ancestors and NT boundaries gradually across training epochs (here 1
epoch equals 500 training steps). CFG levels closer to the leaves are learned faster, and their accuracies
continue to increase as deeper levels are being learned, following a principle called “backward feature
correction” in deep hierarchical learning [3].

D More Experiments on Attention Patterns
D.1 Position-Based Attention Pattern
Recall from Figure 8 that the attention weights between any two positions j → i
have a strong bias depending on the relative distance p = |j − i|. Different heads or layers have different
dependencies on p. Below, in Figure 22, we present this phenomenon on more datasets
and for both GPT and GPTrel.
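A minimal sketch of how this position-based pattern can be computed from an attention tensor (the [layers, heads, seq, seq] layout is our assumption):

import numpy as np

def attention_by_distance(A):
    """Average causal attention weight at each relative distance p = j - i."""
    n_layers, n_heads, n, _ = A.shape
    out = np.zeros((n_layers, n_heads, n))
    for p in range(1, n):
        j = np.arange(p, n)
        out[:, :, p] = A[:, :, j, j - p].mean(axis=-1)
    return out          # out[l, h, p]: the per-layer/head curve plotted in Figure 22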
[Figure 22 panels (a)–(j): position-based attention patterns (attention weight as a function of the distance p = |j − i|, for each layer and head) for GPT and GPT_rel on cfg3b, cfg3i, cfg3h, cfg3g, and cfg3f.]

Figure 22: Position-based attention pattern. The 12 rows in each layer represent the 12 heads. Observations. The attention pattern is multi-scale: different heads or layers have different dependencies on p.
D.2 From Anywhere to NT-ends
Recall from Figure 9(a) that, after removing the position bias, i.e. defining B_{l,h,j→i}(x) := A_{l,h,j→i}(x) − A_{l,h,j−i}, the attention weights have a very strong bias towards tokens i that are at NT-ends. In Figure 23 we complement this experiment with more datasets.
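The following is a minimal sketch of this measurement for a single layer/head on one sample x; it assumes the position-based averages A_{l,h,p} computed as in the previous subsection and boolean NT-end annotations produced by the synthetic data generator (the function and argument names are illustrative, not the paper's code).

```python
import numpy as np

def nt_end_attention(att, pos_avg, nt_end, deltas=(-2, -1, 0, 1, 2)):
    """Average the bias-removed attention B[j, i] = A[j, i] - A_{j-i} over key
    positions i such that i + delta is an NT-end at the level of interest.

    att:     (seq, seq) numpy attention map of one layer/head on one sample
             (rows = query position j, columns = key position i, causal).
    pos_avg: dict mapping the distance p to the position-based average A_{l,h,p}.
    nt_end:  boolean array; nt_end[i] is True if token i ends an NT at the
             chosen level (assumed to be provided by the data generator).
    """
    n = att.shape[0]
    B = att.astype(float).copy()
    for j in range(n):
        for i in range(j):                       # causal: only i < j is used
            B[j, i] -= pos_avg.get(j - i, 0.0)
    out = {}
    for d in deltas:
        vals = [B[j, i] for j in range(n) for i in range(j)
                if 0 <= i + d < n and nt_end[i + d]]
        out[d] = float(np.mean(vals)) if vals else float("nan")
    return out
```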

[Figure 23 heatmaps omitted. Each panel shows, for GPTrel, the attention weights B_{l,h,j→i}(x) toward positions i + δ around NT-ends, for every layer (lay1–lay12) and head (head1–head12), at levels NT2–NT5 and offsets δ = −2, ..., +2. Panels: (a) cfg3b, (b) cfg3i, (c) cfg3h, (d) cfg3g, (e) cfg3f.]

Figure 23: Attention weights B_{l,h,j→i}(x) averaged over data x and pairs i, j such that i + δ is at the NT-end in level ℓ of the CFG. In each cell, the four rows correspond to levels ℓ = 2, 3, 4, 5, and the five columns represent δ = −2, −1, 0, +1, +2.
Observation. Attention is largest when δ = 0 and drops rapidly for the surrounding tokens of i.
D.3 From NT-ends to NT-ends
As mentioned in Section 5.2 and Figure 9(b), not only do tokens generally attend more to NT-ends, but among those attentions, NT-ends are also more likely to attend to NT-ends. We include the full experiment in Figure 24, for every level ℓ = 2, 3, 4, 5 and for all pairs j → i that are both at NT-ends of level ℓ, on the cfg3 datasets.
[Figure 24 heatmaps omitted. Each panel shows, for GPTrel, the attention pattern between positions j + δ2 and i + δ1 around NT-ends at a fixed level ℓ, for every layer (lay1–lay12) and head (h1–h12), with δ1, δ2 ∈ {−1, 0, +1}. Panels (a)–(t): the datasets cfg3b, cfg3i, cfg3h, cfg3g, cfg3f, each at levels ℓ = 2, 3, 4, 5.]

Figure 24: Attention pattern B_{l,h,j→i}(x) averaged over data x and pairs i, j such that i + δ1 and j + δ2 are at the NT-end boundaries in level ℓ of the CFG. In each block, the three rows correspond to δ1 = −1, 0, +1 and the three columns correspond to δ2 = −1, 0, +1.
Observation. Different transformer layers/heads may be in charge of attending to NT-ends at different levels ℓ. Also, it is noticeable that the attention value drops rapidly from δ1 = ±1 to δ1 = 0, but less so from δ2 = ±1 to δ2 = 0. This should not be surprising, as it may still be ambiguous to decide whether position j is at an NT-end until one reads a few more tokens (see the discussion under Figure 19).
D.4 From NT-ends to Adjacent NT-ends
In Figure 9(c) we showcased that B_{l,h,j→i}(x) has a strong bias towards token pairs i, j that are "adjacent" NT-ends. We defined what "adjacency" means in Section 5.2 and introduced a notation B^{end→end}_{l,h,ℓ′→ℓ,r} to capture B_{l,h,j→i}(x) averaged over samples x and all token pairs i, j such that they are at the deepest NT-ends on levels ℓ and ℓ′ respectively (in symbols, b♯(i) = ℓ ∧ b♯(j) = ℓ′), and at distance r based on the ancestor indices at level ℓ (in symbols, p_ℓ(j) − p_ℓ(i) = r).
Previously, we only presented Figure 9(c) for a single dataset, averaged over all transformer layers. In the full experiments, Figure 25 shows this quantity for more datasets, and Figure 26 shows it for individual layers.
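The bookkeeping behind this quantity can be sketched as follows; it assumes per-token annotations b_sharp[i] (the deepest NT-end level b♯(i), or None when i is not an NT-end) and p_idx[ℓ][i] (the index of the level-ℓ ancestor of token i), both of which the synthetic data generator can emit. This is an illustrative sketch, not the exact aggregation script.

```python
from collections import defaultdict

def end_to_end_pattern(B, b_sharp, p_idx, levels=(2, 3, 4, 5)):
    """Bucket the bias-removed attention B[j, i] by (l', l, r), where
    l = b_sharp[i] and l' = b_sharp[j] are the deepest NT-end levels of the two
    positions, and r = p_idx[l][j] - p_idx[l][i] is the distance between their
    level-l ancestor indices."""
    sums, cnts = defaultdict(float), defaultdict(int)
    n = B.shape[0]
    for j in range(n):
        lp = b_sharp[j]
        if lp not in levels:
            continue
        for i in range(j):
            l = b_sharp[i]
            if l not in levels:
                continue
            r = p_idx[l][j] - p_idx[l][i]
            sums[(lp, l, r)] += B[j, i]
            cnts[(lp, l, r)] += 1
    return {key: sums[key] / cnts[key] for key in sums}
```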

[Figure 25 heatmaps omitted. Each panel shows, for GPTrel, the NT-end to NT-end attention pattern B^{end→end}_{l,h,ℓ′→ℓ,r} averaged over layers and heads, with columns indexing the level pairs ℓ′ → ℓ (ℓ′, ℓ ∈ {2, 3, 4, 5}) and rows indexing the distance r = 0, 4, 8, 12, 16. Panels: (a) cfg3i, (b) cfg3h, (c) cfg3g, (d) cfg3f.]
Figure 25: Attention pattern B^{end→end}_{l,h,ℓ′→ℓ,r}(x) averaged over layers l, heads h and data x. The columns represent ℓ′ → ℓ and the rows represent r. "×" marks empty entries.

Remark. We present this boundary bias by looking at how closely NT boundaries at level ℓ′ attend to other NT boundaries at level ℓ. For some distances r, the "distance" we have defined may not exist; for instance, when ℓ ≥ ℓ′ one must have r > 0. Nevertheless, we see that the attention value, even after removing the position bias, still correlates strongly with the smallest possible distance r between every pair of NT levels ℓ, ℓ′. This is strong evidence that the transformer is implementing some variant of dynamic programming.
[Figure 26 heatmaps omitted. For each dataset, one sub-panel per transformer layer (lay1–lay12), each with rows r = 0, 4, 8, 12, 16, in the same format as Figure 25. Panels: (a) cfg3i, (b) cfg3h, (c) cfg3g, (d) cfg3f.]

Figure 26: Attention pattern B^{end→end}_{l,h,ℓ′→ℓ,r}(x) for each individual transformer layer l ∈ [12], averaged over heads h and data x. The rows and columns are in the same format as Figure 25.

Observation. Different transformer layers are responsible for learning the "NT-end to most adjacent NT-end" attention at different CFG levels.
E More Experiments on Implicit CFGs
We study implicit CFGs where each terminal symbol t ∈ T is associated with a bag of observable tokens OT_t. For this task, we study eight different variants of implicit CFGs, all converted from the exact same cfg3i dataset (see Section A.1). Recall cfg3i has |T| = 3 terminal symbols:
• we consider a vocabulary size |OT| = 90 or |OT| = 300;
• we let {OT_t}_{t∈T} be either disjoint or overlapping; and
• we let the distribution over OT_t be either uniform or non-uniform.
We present the generation accuracies of learning such implicit CFGs with different model architectures in Figure 27, where each cell is evaluated using 2000 generation samples. We also present the correlation matrix of the word embedding layer in Figure 11 for the GPTrel model (the correlation is similar for other models).
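For concreteness, the sketch below converts a string of terminal symbols into observable tokens along the three axes above (vocabulary size, disjoint vs. overlapping bags, uniform vs. non-uniform sampling); the bag construction and weights are illustrative assumptions, not the exact conversion used to build the eight variants.

```python
import random

def make_token_bags(terminals, vocab_size, overlap=False, seed=0):
    """Assign each terminal symbol t a bag OT_t of observable token ids.
    Disjoint bags partition the vocabulary; overlapping bags are drawn
    independently from the full vocabulary, so two terminals may share tokens."""
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    bag_size = vocab_size // len(terminals)
    if overlap:
        return {t: rng.sample(vocab, k=bag_size) for t in terminals}
    rng.shuffle(vocab)
    return {t: vocab[k * bag_size:(k + 1) * bag_size]
            for k, t in enumerate(terminals)}

def to_observable(symbol_string, bags, weights=None, seed=0):
    """Replace every terminal symbol by a token sampled from its bag, either
    uniformly or with per-bag non-uniform weights (a list aligned with the bag)."""
    rng = random.Random(seed)
    return [rng.choices(bags[t], weights=None if weights is None else weights[t])[0]
            for t in symbol_string]
```

For example, with three terminal symbols and vocab_size=90, this produces disjoint bags of 30 tokens each, matching the disjoint |OT| = 90 variant above.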

[Figure 27, shown as a table:]

                 disjoint |vocab|=90            disjoint |vocab|=300           overlap |vocab|=90             overlap |vocab|=300
                 uniform        non-uniform     uniform        non-uniform     uniform        non-uniform     uniform        non-uniform
                 cut0   cut50   cut0   cut50    cut0   cut50   cut0   cut50    cut0   cut50   cut0   cut50    cut0   cut50   cut0   cut50
GPT              98.7   99.4    99.0   99.2     100.0  100.0   100.0  98.1     72.7   70.4    75.2   75.4     100.0  100.0   100.0  100.0
GPT_rel          99.3   99.7    99.0   98.9     100.0  100.0   98.9   99.1     97.8   97.9    92.9   91.9     100.0  100.0   100.0  100.0
GPT_rot          99.2   99.5    99.0   98.4     100.0  100.0   98.6   99.0     96.4   95.9    84.9   87.8     100.0  100.0   100.0  100.0
GPT_pos          99.2   99.4    98.4   99.2     100.0  100.0   96.6   96.4     90.1   91.3    82.6   83.6     100.0  100.0   100.0  99.7
GPT_uni          99.7   99.6    98.4   99.0     100.0  100.0   89.5   92.9     80.5   77.2    64.4   65.4     100.0  100.0   99.9   100.0

Figure 27: Generation accuracies on eight implicit CFG variants from pre-trained language models.
F More Experiments on Robustness
Recall that in Figure 12 we compared clean training against training on three types of perturbed data, in terms of generation accuracy given both clean and corrupted prefixes. We now include more experiments on more datasets in Figure 28. For each entry of the figure, we generate 2000 samples to evaluate the generation accuracy.
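As a reminder of what the perturbed pre-training data looks like, the sketch below implements one illustrative token-level random perturbation in the spirit of the "T-level 0.15 random perturbation" column; the exact perturbation schemes (including the NT-level ones) are defined in the main body, so this should be read only as an approximation.

```python
import random

def t_level_random_perturbation(tokens, terminal_vocab, rho=0.15, seed=0):
    """Independently replace each token of a training sentence by a uniformly
    random terminal symbol with probability rho (an illustrative approximation
    of the T-level random perturbation used for Figure 28)."""
    rng = random.Random(seed)
    return [rng.choice(terminal_vocab) if rng.random() < rho else tok
            for tok in tokens]
```

During pre-training, only a γ fraction of the training sentences are perturbed in this way; the columns of Figure 28 vary this γ.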
[Figure 28 accuracy grids omitted. Each grid has columns indexed by the pre-training method (NT-level 0.1 random perturbation, T-level 0.15 random perturbation, NT-level 0.05 deterministic permutation) crossed with the pre-training data perturbation ratio γ = 1.0, 0.9, ..., 0.1, plus a final column for clean pre-training data. The nine rows are the evaluation settings: clean prefix with cut c = 0 at temperatures τ = 0.1, 0.2, 1.0 (Rows 1–3); corrupted prefix with cut c = 50 at τ = 0.1, 0.2, 1.0 (Rows 4–6); and clean prefix with cut c = 50 at τ = 0.1, 0.2, 1.0 (Rows 7–9). Panels: (a) cfg3b, (b) cfg3i, (c) cfg3h.]

Figure 28: Generation accuracies for models pre-trained cleanly VS pre-trained over perturbed data, on clean or
corrupted prefixes with cuts c = 0 or c = 50, using generation temperatures τ = 0.1, 0.2, 1.0.

Observation 1. In Rows 4/5, by comparing against the last column, we see it is beneficial to include low-quality data (e.g., data with grammar mistakes) during pre-training. The amount of low-quality data can be small (a γ = 0.1 fraction) or large (every training sentence may contain grammar mistakes).
Observation 2. In Rows 3/6/9 of Figure 12 we see that pre-training teaches the model a mode switch. Given a correct prefix, it stays in the correct mode and completes with correct strings (Row 9); given a corrupted prefix, it always completes the sentence with grammar mistakes (Row 6); given no prefix, it generates corrupted strings with probability γ (Row 3).
Observation 3. Comparing Rows 4/5 to Row 6 in Figure 12, we see that high robust accuracy is achieved only when generating with low temperatures τ. A low temperature encourages the model, for each next token, to pick a more probable choice. This allows it to achieve good robust accuracy even when it is trained entirely on corrupted data (γ = 1.0).
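For reference, generating "using low temperature τ" simply means rescaling the logits before sampling the next token, as in the following standard sketch (not code from the paper):

```python
import torch

def sample_next_token(logits, tau=0.1):
    """Temperature sampling: divide the logits by tau before the softmax.
    A small tau (e.g. 0.1 or 0.2) concentrates probability mass on the most
    likely next tokens, which is what yields the higher robust accuracies."""
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```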
[Figure 29 example parse trees omitted. Panels: (a) the real-life CFG derived from the Penn Treebank, short and simple; (b) the cfg3 family used in the main body of this paper, with rule lengths 2 or 3 (cfg3f shown); (c) the cfg8 family, with rule lengths 1, 2, or 3 (cfg8e shown); (d) the cfg9 family, with rule lengths 1, 2, or 3 (cfg9e shown); (e) the cfg0 family, with max-depth 11 and rule lengths 1 or 2 (cfg0e shown).]

Figure 29: CFG comparisons: for each family, the left tree is a medium-length sample and the right tree is an 80th-percentile-length sample.

G Beyond the CFG3 Data Family


The primary focus of this paper is on the cfg3 data family, introduced in Section A.1. This paper
does not delve into how GPTs parse English or other natural languages. In fact, our CFGs are
more “difficult” than, for instance, the English CFGs derived from the Penn TreeBank (PTB) [17].
By "difficult", we refer to how hard it is for a human to parse them. For example, in the PTB CFG, if one encounters RB JJ or JJ PP consecutively, their parent must be ADJP. In contrast, given a string
3322131233121131232113223123121112132113223113113223331231211121311331121321213333312322121312322211112133221311311311
3111111323123313313331133133333223121131112122111121123331233112111331333333112333313111133331211321131212113333321211
1121213223223322133221113221132323313111213223223221211133331121322221332211212133121331332212213221211213331232233312

that is in cfg3f, even with all the CFG rules provided, one would likely need a large piece of scratch
paper to perform dynamic programming by hand to determine the CFG tree used to generate it.
Generally, the difficulty of CFGs scales with the average length of the strings. For instance, the average string length in our cfg3 family is over 200, whereas in the English Penn Treebank (PTB) it is only 28. However, the difficulty of CFGs may inversely scale with the number of Non-
Terminal/Terminal (NT/T) symbols. Having an excess of NT/T symbols can simplify the parsing
of the string using a greedy approach (recall the RB JJ or JJ PP examples mentioned earlier). This
is why we minimized the number of NT/T symbols per level in our cfg3b, cfg3i, cfg3h, cfg3g, cfg3f
construction. For comparison, we also considered cfg3e1, cfg3e2, which have many NT/T symbols
per level. Figure 4 shows that such CFGs are extremely easy to learn.
To broaden the scope of this paper, we also briefly present results for some other CFGs. We
include the real-life CFG derived from the Penn Treebank, and three new families of synthetic
CFGs (cfg8, cfg9, cfg0). Examples from these are provided in Figure 29 to allow readers to quickly
compare their difficulty levels.

[Figure 30, shown as a table; each row is a GPT_rot model of size GPT-ℓ-h-d trained on the PTB CFG.]

model          gen acc (cut c = 0)   gen acc (cut c = 10)   KL divergence   output entropy   model size
GPT-1-1-16     67.7                  78.1                   0.07981         60.1             12K
GPT-4-2-16     90.6                  93.0                   0.01357         62.0             68K
GPT-2-4-16     94.8                  95.8                   0.00806         58.7             135K
GPT-4-4-16     97.2                  98.0                   0.00435         58.7             235K
GPT-6-4-16     97.6                  98.3                   0.00317         57.9             335K
GPT-2-2-32     94.4                  94.7                   0.00914         58.3             135K
GPT-4-2-32     97.0                  97.5                   0.00450         59.1             235K
GPT-6-2-32     97.8                  98.2                   0.00299         58.4             335K
GPT-2-4-32     97.9                  98.2                   0.00394         57.4             468K
GPT-4-4-32     98.7                  99.1                   0.00179         57.0             864K
GPT-6-4-32     99.1                  99.3                   0.00119         57.8             1.3M
GPT-2-2-64     97.1                  97.2                   0.00505         59.2             468K
GPT-4-2-64     98.6                  98.8                   0.00190         58.4             864K
GPT-2-4-64     98.9                  98.8                   0.00220         59.4             1.7M
GPT-4-4-64     99.5                  99.7                   0.00079         57.4             3.3M
GPT-6-4-64     99.6                  99.7                   0.00064         57.3             4.9M
GPT-4-6-64     99.7                  99.8                   0.00066         57.2             7.3M
GPT-6-6-64     99.7                  99.8                   0.00052         56.9             10.9M
GPT-6-8-64     99.8                  99.9                   0.00044         57.0             19.2M
GPT-12-12-64   99.9                  99.9                   0.00034         57.2             85.5M

(Entropy of the ground-truth PTB CFG, estimated from samples: 61.1.)

Figure 30: Real-life PTB CFG learned by GPTrot models of different sizes: (a) generation accuracies for cuts c = 0 and c = 10, (b) KL-divergence, (c) output entropy and model size.

[Figure 31, shown as a table; the models are the same GPTrot configurations as in Figure 30, but trained on cfg3f.]

model          gen acc (cut c = 0)   gen acc (cut c = 10)
GPT-1-1-16     0.0                   0.0
GPT-4-2-16     0.0                   0.0
GPT-2-4-16     0.0                   0.0
GPT-4-4-16     0.4                   2.1
GPT-6-4-16     0.0                   1.8
GPT-2-2-32     0.0                   0.0
GPT-4-2-32     0.4                   0.4
GPT-6-2-32     1.0                   1.1
GPT-2-4-32     0.1                   0.1
GPT-4-4-32     1.7                   1.7
GPT-6-4-32     8.7                   8.9
GPT-2-2-64     0.0                   0.0
GPT-4-2-64     1.0                   1.0
GPT-2-4-64     0.2                   0.3
GPT-4-4-64     5.5                   5.6
GPT-6-4-64     34.3                  34.1
GPT-4-6-64     11.3                  11.3
GPT-6-6-64     47.0                  47.1
GPT-6-8-64     56.8                  56.7
GPT-12-12-64   97.8                  97.8

Figure 31: By contrast, small GPTrot model sizes cannot learn the cfg3f data (compare to Figure 30(a)).

G.1 The Penn TreeBank CFG


We derive the English CFG from the Penn TreeBank (PTB) dataset [17]. To make our experiment
run faster, we have removed all the CFG rules that have appeared fewer than 50 times in the data.22
This results in 44 T+NT symbols and 156 CFG rules. The maximum node degree is 65 (for the
non-terminal NP) and the maximum CFG rule length is 7 (for S -> ‘‘ S , ’’ NP VP .). If one
performs binarization (to ensure all the CFG rules have a maximum length of 2), this results in
132 T+NT symbols and 288 rules.
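A minimal sketch of this extraction, using the Penn Treebank sample that ships with NLTK, is given below; the paper's preprocessing may differ in details, and NLTK contains only a fragment of the full treebank, so the resulting rule counts will not match the numbers above exactly.

```python
from collections import Counter
from nltk.corpus import treebank   # NLTK ships only a fragment of the Penn Treebank

def extract_cfg(min_count=50):
    """Count CFG productions over the treebank parses and keep the frequent ones.
    POS tags such as NN or VBZ act as the terminal symbols (cf. Remark G.1 below),
    so lexical rules like NN -> 'dog' are dropped."""
    counts = Counter()
    for tree in treebank.parsed_sents():
        for prod in tree.productions():
            if prod.is_nonlexical():          # skip POS -> word rules
                counts[prod] += 1
    return sorted((p for p, c in counts.items() if c >= min_count),
                  key=lambda p: -counts[p])
```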
Remark G.1. Following the conventions of this paper, we treat symbols such as NNS (common noun, plural) and NN (common noun, singular) as terminal symbols. If one wishes to also take into consideration the bag of words (such as the word vocabulary of plural nouns), this is what we called an implicit CFG and studied in Section 6.1. In short, adding a bag of words does not increase the learning difficulty of a CFG; the (possibly overlapping) vocabulary words will simply be encoded in the embedding layer of a transformer.
22 These are a large set of rare rules, each appearing with probability ≤ 0.2%. We evaluate whether a generated sentence belongs to the CFG, a process that requires CPU-intensive dynamic programming; to keep the computation time tractable, we remove this set of rare rules. Note that cfg3 does not contain rare rules either. Including such rules complicates the CFG learning process, necessitating a larger transformer and extended training time. It also complicates the investigation of a transformer's inner workings if these rare rules are not perfectly learned.
[Figure 32 accuracy grids omitted: generation/completion accuracies (%) for the 15 datasets cfg8a–e, cfg9a–e, cfg0a–e (rows), evaluated for GPT, GPT_rel, GPT_rot, GPT_pos and GPT_uni with cuts c = 0 and c = 20 (columns).]

Figure 32: Generation accuracies for the cfg8/9/0 data families, suggesting our results also hold for unbalanced trees with length-1 rules.
For this PTB CFG, we also consider transformers of sizes smaller than GPT2-small. Recall that GPT2-small has 12 layers, 12 heads, and 64 dimensions per head. More generally, we let GPT-ℓ-h-d denote an ℓ-layer, h-head, d-dim-per-head GPTrot (so GPT2-small can be written as GPT-12-12-64).
We pretrain transformers of different sizes on this PTB CFG. We repeat the experiments of Figure 4 (with the same pretraining parameters described in Appendix A.3); that is, we compute the generation accuracy, the completion accuracy (with cut c = 10), the output entropy, and the KL-divergence. We report the findings in Figure 30. In particular:
• Even a 135K-sized GPT2 (GPT-2-4-16) can achieve generation accuracy ∼ 95% and have a KL
divergence less than 0.01. (Note the PTB CFG has 30 terminal symbols so its KL divergence
may appear larger than that of cfg3 in Figure 4.)
• Even a 1.3M-sized GPT2 (GPT-6-4-32) can achieve generation accuracy 99% and have a KL
divergence on the order of 0.001.
• Using M = 10000 samples, we estimate the entropy of the ground-truth PTB CFG to be around 60 bits, and the output entropies of the learned transformer models are also of this magnitude.
• By contrast, transformers of these small sizes cannot learn the cfg3f data; see Figure 31.

G.2 More Synthetic CFGs


Remember that the cfg3 family appears “balanced” because all leaves are at the same depth and the
non-terminal (NT) symbols at different levels are disjoint. This characteristic aids our investigation
into the inner workings of a transformer learning such a language. We introduce three new synthetic
data families, which we refer to as cfg8/9/0 (each with five datasets, totaling 15 datasets). These
are all “unbalanced” CFGs, which support length-1 rules.23 Specifically, the cfg0 family has a depth
of 11 with rules of length 1 or 2, while the cfg8/9 family has depth 7 with rules of length 1/2/3.
In all of these families, we demonstrate in Figure 32 that GPT can learn them with a satisfactory
level of accuracy.
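To make the construction concrete, the sketch below builds a random layered CFG whose rule lengths are drawn from {1, 2, 3} and samples strings from it; the sizes, degrees and random choices here are illustrative, and the actual cfg8/9/0 rule sets are the ones shipped in cfgs.txt (mentioned below).

```python
import random

def sample_cfg(sizes, degree=2, lengths=(1, 2, 3), seed=0):
    """Build a random layered CFG: a level-l symbol expands only into level-(l+1)
    symbols, every NT `a` gets |R(a)| = degree rules, and each rule's length is
    drawn from `lengths`.  Length-1 rules are what make the parse trees unbalanced."""
    rng = random.Random(seed)
    symbols = [[(level, k) for k in range(n)] for level, n in enumerate(sizes)]
    rules = {}
    for level in range(len(sizes) - 1):
        for a in symbols[level]:
            rules[a] = [tuple(rng.choice(symbols[level + 1])
                              for _ in range(rng.choice(lengths)))
                        for _ in range(degree)]
    return symbols[0][0], rules          # root symbol and the rule dictionary

def sample_string(sym, rules, rng):
    """Sample a string by expanding non-terminals with uniformly random rules."""
    if sym not in rules:                 # bottom level: a terminal symbol
        return [sym]
    return [tok for child in rng.choice(rules[sym])
            for tok in sample_string(child, rules, rng)]

# Example: a cfg8-style size configuration (1, 3, 3, 3, 3, 3, 3) with degree 2.
root, rules = sample_cfg((1, 3, 3, 3, 3, 3, 3), degree=2)
print(sample_string(root, rules, random.Random(1)))
```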
We have included all the CFG trees used in this paper in the embedded file cfgs.txt; it can be opened using Adobe Reader. Below, we describe how we selected them.
CFG8 family. The cfg8 family consists of five CFGs, namely cfg8a/b/c/d/e. They are constructed
similarly to cfg3b/i/h/g/f, with the primary difference being that we sample rule lengths uniformly
from {1, 2, 3} instead of {2, 3}. Additionally,
• In cfg8a, we set the degree |R(a)| = 2 for every NT a; we also ensure that in any generation rule,
consecutive pairs of terminal/non-terminal symbols are distinct. The size is (1, 3, 3, 3, 3, 3, 3).
• In cfg8b, we set |R(a)| = 2 for every NT a; we remove the distinctness requirement to make
the data more challenging than cfg8a. The size is (1, 3, 3, 3, 3, 3, 3).
23 When a length-1 CFG rule is applied, we can merge the two nodes at different levels, resulting in an "unbalanced" CFG.
• In cfg8c, we set |R(a)| ∈ {2, 3} for every NT a to make the data more challenging than cfg8b.
The size is (1, 3, 3, 3, 3, 3, 3).
• In cfg8d, we set |R(a)| = 3 for every NT a. We change the size to (1, 3, 3, 3, 3, 3, 4) because
otherwise a random string would be too close (in editing distance) to this language.
• In cfg8e, we set |R(a)| ∈ {3, 4} for every NT a. We change the size to (1, 3, 3, 3, 3, 3, 4) because
otherwise a random string would be too close to this language.
A notable feature of this data family is that, due to the introduction of length-1 rules, a string
in this language L(G) may be globally ambiguous. This means that there can be multiple ways to
parse it by the same CFG, resulting in multiple solutions for its NT ancestor/boundary information
for most symbols. Therefore, it is not meaningful to perform linear probing on this dataset, as the
per-symbol NT information is mostly non-unique.24
CFG9 family. Given the ambiguity issues arising from the cfg8 data construction, our goal is
to construct an unbalanced and yet challenging CFG data family where the non-terminal (NT)
information is mostly unique, thereby enabling linear probing.
To accomplish this, we first adjust the size to (1, 4, 4, 4, 4, 4, 4), then we permit only one NT
per layer to have a rule of length 1. We construct five CFGs, denoted as cfg9a/b/c/d/e, and their
degree configurations (i.e., R(a)) are identical to those of the cfg8 family. We then employ rejection
sampling by generating a few strings from these CFGs and checking if the dynamic programming
(DP) solution is unique. If it is not, we continue to generate a new CFG until this condition is met.
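The uniqueness check can be sketched as a derivation-counting dynamic program over substrings: a candidate CFG is kept only if the count equals 1 on the sampled strings. The sketch below assumes the layered, ε-free rule format used in this appendix (so there are no recursive NT cycles) and is not the paper's exact implementation.

```python
from functools import lru_cache

def count_parses(rules, root, string):
    """Count the number of distinct derivations of `string` under the CFG.
    `rules` maps each NT to a list of right-hand-side tuples; symbols that have
    no rules are terminals.  Every symbol is assumed to derive a non-empty string."""
    s = tuple(string)

    @lru_cache(maxsize=None)
    def ways(sym, i, j):                 # derivations of s[i:j] from `sym`
        if sym not in rules:             # terminal symbol
            return 1 if j == i + 1 and s[i] == sym else 0
        return sum(split(rhs, i, j) for rhs in rules[sym])

    @lru_cache(maxsize=None)
    def split(rhs, i, j):                # derivations of s[i:j] from a whole RHS
        if not rhs:
            return 1 if i == j else 0
        return sum(ways(rhs[0], i, k) * split(rhs[1:], k, j)
                   for k in range(i + 1, j + 1))

    return ways(root, 0, len(s))
```

A string is unambiguous exactly when count_parses(...) == 1, which also pins down a unique NT ancestor/boundary assignment for every position.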
Examples from cfg9e are illustrated in Figure 29. We will conduct linear probing experiments
on this data family.
CFG0 family. Since all the CFGs above support rules of length 3, we have focused on L = 7
to prevent the string length from becoming excessively long.25 In the cfg0 family, we construct
five CFGs, denoted as cfg0a/b/c/d/e. All of them have a depth of L = 11. Their rule lengths are
randomly selected from {1, 2} (compared to {2, 3} for cfg3 or {1, 2, 3} for cfg8/9). Their degree
configurations (i.e., R(a)) are identical to those of the cfg8 family. We have chosen their sizes as
follows, noting that we have enlarged the sizes as otherwise a random string would be too close to
this language:
• We use size [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4] for cfg0a/b.
• We use size [1, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6] for cfg0c.
• We use size [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] for cfg0d/e.
Once again, the CFGs generated in this manner are globally ambiguous like the cfg8 family, so
we cannot perform linear probing on them. However, it would be interesting to demonstrate the
ability of transformers to learn such CFGs.
Additional experiments. We present the generation accuracies (or the completion accuracies for
cut c = 20) for the three new data families in Figure 32. It is evident that the cfg8/9/0 families
can be learned almost perfectly by GPT2-small, especially the relative/rotary embedding ones.
As previously mentioned, the cfg9 data family is not globally ambiguous, making it an excellent
synthetic data set for testing the encoding of the NT ancestor/boundary information, similar to
24 In contrast, the cfg3 data family is only locally ambiguous, meaning that it is difficult to determine its hidden NT information by locally examining a substring; however, when looking at the entire string as a whole, the NT information per symbol can be uniquely determined with high probability (for instance using dynamic programming).
25 Naturally, a larger transformer would be capable of solving such CFG learning tasks when the string length exceeds 1000; we have briefly tested this and found it to be true. However, conducting comprehensive experiments at this length would be prohibitively expensive, so we have not included them in this paper.
[Figure 33 accuracy grid omitted: accuracies (%) of predicting the NT ancestors at levels NT2–NT6 on the cfg9a–e datasets, probed from GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni, deBERTa, and the baseline GPT_rand.]

Figure 33: Same as Figure 5 but for the cfg9 family. After pre-training, hidden states of generative models implicitly encode the NT ancestor information. The NTℓ column represents the accuracy of predicting sℓ, the NT ancestors at level ℓ. This suggests our probing technique applies more broadly.

[Figure 34 accuracy grids omitted: accuracies (%) of predicting the NT ancestor at NT-end positions, with tridiagonal masking (top block) and diagonal masking (bottom block), at levels NT2–NT6 on the cfg9a–e datasets, for GPT, GPT_rel, GPT_rot, GPT_pos, GPT_uni, deBERTa, and the baseline GPT_rand.]

Figure 34: Same as Figure 7 but for the cfg9 data family. Generative pre-trained transformers encode NT ancestors almost exactly at NT boundaries. The NTℓ column represents the accuracy of predicting sℓ(i) at locations i with bℓ(i) = 1. This suggests our probing technique applies more broadly.

what we did in Section 4. Indeed, we replicated our probing experiments in Figure 33 and Figure 34
for the cfg9 data family. This suggests that our probing technique has broader applicability.
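For readers who want to reproduce a simplified version of these probes, the sketch below fits a plain per-token linear classifier on frozen hidden states to predict the level-ℓ NT ancestor; the paper's actual probing protocol is described in Section 4, so this should be read as a simplified stand-in rather than the exact probe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(hidden_states, nt_labels, train_frac=0.8, seed=0):
    """Probe frozen transformer hidden states for NT-ancestor information.

    hidden_states: (num_tokens, d) array of last-layer hidden states.
    nt_labels:     (num_tokens,) array of NT-ancestor ids at the probed level.
    Returns the held-out classification accuracy of a linear classifier."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(nt_labels))
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(hidden_states[train], nt_labels[train])
    return clf.score(hidden_states[test], nt_labels[test])
```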

References
[1] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. ArXiv e-prints, abs/2309.14316, September 2023. Full version available at http://arxiv.org/abs/2309.14316.
[2] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of Language Models: Part 3.2, Knowledge Manipulation. ArXiv e-prints, abs/2309.14402, September 2023. Full version available at http://arxiv.org/abs/2309.14402.
[3] Zeyuan Allen-Zhu and Yuanzhi Li. Backward feature correction: How deep learning performs deep learning. In COLT, 2023. Full version available at http://arxiv.org/abs/2001.04413.
[4] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. ArXiv e-prints, abs/2404.05405, April 2024. Full version available at http://arxiv.org/abs/2404.05405.
[5] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019. Full version available at http://arxiv.org/abs/1811.03962.
[6] Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. arXiv
preprint arXiv:1706.08224, 2017.
[7] David Arps, Younes Samih, Laura Kallmeyer, and Hassan Sajjad. Probing for constituency structure
in neural language models. arXiv preprint arXiv:2204.06201, 2022.
[8] James K Baker. Trainable grammars for speech recognition. The Journal of the Acoustical Society of
America, 65(S1):S132–S132, 1979.
[9] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He,

Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit,
Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source
autoregressive language model. In Proceedings of the ACL Workshop on Challenges & Perspectives in
Creating Large Language Models, 2022. URL https://arxiv.org/abs/2204.06745.
[10] Gregoire Deletang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt,
Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, et al. Neural networks and the chomsky hierar-
chy. In ICLR, 2023.
[11] Brian DuSell and David Chiang. Learning hierarchical structures with differentiable nondeterministic
stacks. In ICLR, 2022.
[12] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda
Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits.
Transformer Circuits Thread, 1, 2021.
[13] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with
disentangled attention. arXiv preprint arXiv:2006.03654, 2020.
[14] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word represen-
tations. In Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pages 4129–4138, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:
10.18653/v1/N19-1419. URL https://aclanthology.org/N19-1419.
[15] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidi-
rectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186,
2019.
[16] Christopher D Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. Emergent
linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National
Academy of Sciences, 117(48):30046–30054, 2020.
[17] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated
corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https:
//aclanthology.org/J93-2004.
[18] Rowan Hall Maudslay and Ryan Cotterell. Do syntactic probes probe syntax? experiments with
jabberwocky probing. arXiv preprint arXiv:2106.02559, 2021.
[19] Milad Moradi and Matthias Samwald. Evaluating the robustness of neural language models to input
perturbations. arXiv preprint arXiv:2108.12237, 2021.
[20] Shikhar Murty, Pratyusha Sharma, Jacob Andreas, and Christopher D Manning. Characterizing intrin-
sic compositionality in transformers with tree projections. In ICLR, 2023.
[21] Neel Nanda, Lawrence Chan, Tom Liberum, Jess Smith, and Jacob Steinhardt. Progress measures for
grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.
[22] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben
Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv
preprint arXiv:2209.11895, 2022.
[23] OpenAI. Gpt-4 technical report, 2023.
[24] Matt Post and Shane Bergsma. Explicit and implicit syntactic features for text classification. In
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2:
Short Papers), pages 866–872, 2013.
[25] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models
are unsupervised multitask learners. 2019.
[26] Itiroo Sakai. Syntax in universal translation. In Proceedings of the International Conference on Machine
Translation and Applied Language Analysis, 1961.
[27] Hui Shi, Sicun Gao, Yuandong Tian, Xinyun Chen, and Jishen Zhao. Learning bounded context-free-
grammar via lstm and the transformer: Difference and the explanations. In Proceedings of the AAAI

Conference on Artificial Intelligence, volume 36, pages 8267–8276, 2022.
[28] Michael Sipser. Introduction to the Theory of Computation. Cengage Learning, 2012.
[29] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with
rotary position embedding, 2021.
[30] Lifu Tu, Garima Lalwani, Spandana Gella, and He He. An empirical study on robustness to spurious
correlations using pre-trained language models. Transactions of the Association for Computational
Linguistics, 8:621–633, 2020.
[31] David Vilares, Michalina Strzyz, Anders Søgaard, and Carlos Gómez-Rodríguez. Parsing as pretraining.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9114–9121, 2020.
[32] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretabil-
ity in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593,
2022.
[33] Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. Perturbed masking: Parameter-free probing for
analyzing and interpreting bert. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 4166–4176, 2020.
[34] Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of Language Models: Part 2.1,
Grade-School Math and the Hidden Reasoning Process. arXiv preprint, 2024. to appear.
[35] Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of Language Models: Part 2.2, How
to Learn From Mistakes on Grade-School Math Problems. arXiv preprint, 2024. to appear.
[36] Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, and Talia Ringer. Can trans-
formers learn to solve problems recursively? arXiv preprint arXiv:2305.14699, 2023.
[37] Haoyu Zhao, Abhishek Panigrahi, Rong Ge, and Sanjeev Arora. Do transformers parse while predicting
the masked word? arXiv preprint arXiv:2303.08117, 2023.
