Can Large Language Models Handle Basic Legal Text
Task           Description                                       Deposition    Synthetic   U.S.
                                                                 transcripts   sections    Code
text→cite      given text, return its cite                            ✓            ✓         ✓
cite→text      given a cite, return the text at it                    ✓            ✓         ✓
cite→amended   same, but prompted with amended text                                          ✓
defined→cite   given a term, return the cite of its definition                     ✓         ✓
cite→defined   given a cite, return the term defined at it                         ✓         ✓

Table 1: Summary of the tests making up BLT. The three right columns are the types of text, and the rows are the tasks that might be performed on that text. A checkmark indicates that it makes sense to run that test on that type of text, so that test is included in BLT.
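As a quick cross-check of Table 1, the grid of text types and tasks can be written down as data. The dictionary keys and task labels below are ours, not identifiers from the BLT codebase:

```python
# Table 1 as data: which tasks run on which type of text.
# Labels are illustrative, not from the BLT code.
TASKS_BY_TEXT_TYPE = {
    "deposition transcripts": ["text->cite", "cite->text"],
    "synthetic sections": ["text->cite", "cite->text",
                           "defined->cite", "cite->defined"],
    "U.S. Code": ["text->cite", "cite->text", "cite->amended",
                  "defined->cite", "cite->defined"],
}

# Each checkmark in Table 1 is one test; Section 3.4 counts 11 in total.
num_tests = sum(len(tasks) for tasks in TASKS_BY_TEXT_TYPE.values())
```

Each text type gets between two and five tasks, matching the range stated at the start of Section 3.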
by Liu et al. (2023). They found that GPT-4's use of information in prompts followed a U-curve with respect to the information's position, with information in the middle of the prompt used much less than if it were at the start or end. They connected this to the "serial-position" effect exhibited by humans, who best remember material presented near the beginning or end.

3 The BLT Benchmark

The BLT benchmark involves three different types of legal text, each of which has between two and five different tasks run on it (summarized in Table 1).

3.1 Deposition Transcripts

In litigation in the U.S., depositions of witnesses under oath are a key factfinding tool.2 The depositions typically occur in lawyers' offices and allow lawyers to ask witnesses questions on virtually any topic. Professional court reporters transcribe the depositions into transcripts, typically with 25 numbered lines per page,3 often running over 100 pages for a single witness deposition. Attorneys must cite relevant portions of the resulting transcripts in subsequent motions, such as those asking the court to grant summary judgment to their side.4 Portions of transcripts are cited by page and line number.5

A basic text-handling task a lawyer must do in constructing a motion is finding the page and line where particular text appears. This motivates the text→cite task, where the prompt consists of part of an actual deposition transcript followed by the question, "What are the page number and line number of the line above with the text "__"?" To ensure there is only one clearly correct answer, prompts are never constructed asking about lines with fewer than four words, lines that are subsets of another line, or lines that are too similar to other lines (Levenshtein distance under four (Levenshtein et al., 1966)).

The converse is another basic text-handling task: given a citation to a transcript, find the text at the cited location. Lawyers must do this basic task in order to evaluate the opposing side's motions. Paralegals do it on their side's own motions before submitting them (ProParalegal, 2017). This motivates the cite→text task, where the prompt consists of part of a deposition transcript followed by the question "What is the exact text of just line __ of page __ above?" An example of GPT-4 failing this task appears in Figure 1.

BLT's deposition transcript tests are built from a novel corpus we constructed of 33,176 lines of actual deposition transcripts filed with courts. They are from a variety of cases, as the main criterion was that they were cleanly OCR'ed and could be fully cleaned with regular expressions. This corpus can be extended ad infinitum by others by downloading further transcripts and cleaning them, as deposition transcripts are likely not copyrightable.6

The existing page and line numbers are stripped out, and random spans of appropriate length are selected, with new page numbers and line numbers. This renumbering guards against the possibility that a model has already seen the deposition, and it allows many more prompts to be constructed from each raw deposition.

The size of the prompt is entirely scalable.

Footnotes:
2 Federal Rule of Civil Procedure 30.
3 U.S. Court Reporters Association Manual, Section 18.8.
4 Federal Rule of Civil Procedure 56(c)(1)(A).
5 See Bluebook Rule B17.1.2 (HYPC, 2020).
6 Lipman v. Massachusetts, 311 F. Supp. 593 (D. Mass. 1970); 1 Nimmer on Copyright §5.12[C] (2023 ed.). Even in the unlikely event transcripts are copyrightable, this use of them is likely fair use. 17 U.S.C. §107.
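The line-eligibility rules of Section 3.1 (at least four words, not a subset of another line, no near-duplicate within Levenshtein distance four) can be sketched in a few lines of Python. This is our illustration, not the BLT code itself:

```python
# Sketch of the text->cite line-eligibility filter, assuming a plain
# dynamic-programming Levenshtein distance; the BLT code may differ.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def eligible_lines(lines):
    """Keep only lines that make unambiguous text->cite targets."""
    keep = []
    for i, line in enumerate(lines):
        if len(line.split()) < 4:                       # fewer than four words
            continue
        others = lines[:i] + lines[i + 1:]
        if any(line in other for other in others):      # subset of another line
            continue
        if any(levenshtein(line, other) < 4 for other in others):  # too similar
            continue
        keep.append(line)
    return keep
```

A transcript line like "A. Yes." is excluded by the word-count rule alone, while two long but nearly identical question lines knock each other out via the distance check.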
Section 5217. Definition of cleight.
(a) General rule. The term "cleight" means any baitisolist or any roussiont.
(b) The term "baitisolist" means any ballinated or any caset.
(c) The term "roussiont" means any dicemercu or any accodpoileare.
What is the exact citation above where the term "roussiont" is defined? (Use standard legal formatting like
section 1001(b)(2)).
Section 5217(b)
Figure 2: Representative example of GPT-4 incorrectly answering defined→cite question with a 2-deep, 2-wide
synthetic section. The correct answer is "section 5217(c)".
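A minimal generator for a 2-deep definitional section shaped like Figure 2's can be sketched as follows. The function name and formatting details are ours, and the real BLT generator supports arbitrary width and depth:

```python
# Sketch of a 2-deep, width-w definitional section in the shape of
# Figure 2.  The caller supplies the nonce words; this is our own
# minimal reconstruction, not the BLT generator itself.
def make_section(number: int, nonces: list[str], width: int = 2) -> str:
    it = iter(nonces)
    root = next(it)
    children = [next(it) for _ in range(width)]
    lines = [f"Section {number}. Definition of {root}.",
             f'(a) General rule. The term "{root}" means '
             + " or ".join(f"any {c}" for c in children) + "."]
    for letter, child in zip("bcdefgh", children):
        leaves = [next(it) for _ in range(width)]
        lines.append(f'({letter}) The term "{child}" means '
                     + " or ".join(f"any {leaf}" for leaf in leaves) + ".")
    return "\n".join(lines)
```

Called with Figure 2's nonces ("cleight", "baitisolist", "roussiont", "ballinated", "caset", "dicemercu", "accodpoileare"), it reproduces that 2-wide, 2-deep section.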
For BLT-4k, we use a mix of 1- and 2-page subsets of depositions. For the 1-page subsets, only the line number is involved (i.e., not any page number).7 Larger versions have larger quotes; for example, with BLT-128k, we use a mix of 120- and 140-page subsets of depositions. Appendix B details the page sizes used for different sizes of BLT.

3.2 Synthetic Sections

Lawyers regularly work with hierarchical text, including statutes, contracts, regulations, treaties, court rules, and corporate charters. Hierarchical text is often organized into sections, subsections, paragraphs, subparagraphs, etc. Being able to navigate such hierarchical text is a basic legal text-handling task required of all lawyers, whether they are litigators arguing that a statute applies to their case or transactional lawyers negotiating the terms of a contract.

We generate synthetic hierarchical sections, following the approach of Blair-Stanek et al. (2023). The synthetic hierarchical sections are purely definitional, like many sections of statutes, regulations, or contracts. They use repeated application of the logical form A ⇒ B, with two parameters: depth, the maximum number of times it is applied, and width, the number of times it is applied to each B. The terms defined are nonces that are not real words but are pronounceable. These synthetic sections can be arbitrarily large, thus making tasks based on them scalable to different-sized token windows. For example, BLT-4k has synthetic sections ranging from 2-wide, 2-deep, as in Figure 2, which are very short, up to 3-wide, 4-deep, which takes up much of the 4k token window. At the highest end, BLT-128k has a variety ranging from 60-wide, 2-deep to 5-wide, 5-deep, which (because statute size is exponential with respect to depth) takes up much of the 128,000-token window. For the full list of sizes in each benchmark, see Appendix A.

A basic legal text-processing skill is finding the citation within a hierarchical text containing the precise text to which you are pointing a court or other lawyer. This motivates having the text→cite task on synthetic sections, where the prompt consists of one synthetic section followed by the question "What is the exact citation above of the text "__"? (Use standard legal formatting like section 1001(b)(2))." The code to generate synthetic sections guarantees that every line is different, ensuring there is only a single correct answer.

The converse legal skill is, given a hierarchical citation, finding the text at it. This motivates the cite→text task, where the prompt consists of one synthetic section followed by the question "What is the exact text of just section __ above?"

We ask this question only of "leaves" in the statute, meaning they have no subsections underneath them. This ensures there is only a single correct answer. For example, if you asked for the text of section 573(a)(2) and there existed two subparagraphs under it – sections 573(a)(2)(A) and 573(a)(2)(B) – it would be unclear whether the text of those two subparagraphs should be returned. In this example, section 573(a)(2) is not a leaf, and such ambiguity is avoided by considering only leaves. This leaf-only restriction is necessary for cite→text's questions to have a single correct answer; although it is not necessary in the converse task, text→cite, we adopt it there as well so that both involve comparable synthetic sections.

We also include two other basic text-handling tasks on the synthetic sections. Terms are defined in hierarchical texts and often referenced elsewhere in the same hierarchical text. For example, a section of a statute might define "applicable ship" at one specific subparagraph, but then use "applicable ship" a dozen other times in the same section. Lawyers must be able to cite a term's precise definition. With defined→cite, the prompt is one synthetic section followed by the question "What is the exact citation above where the term "__" is defined? (Use standard legal formatting like section 1001(b)(2))." Conversely, when given such a citation by another lawyer, a lawyer must be able to find the term, which motivates cite→defined, where the prompt is one synthetic section followed by the question "What is the term defined at section __?" An example of GPT-4 incorrectly answering a defined→cite problem appears in Figure 2.

3.3 U.S. Code

The U.S. Code is the official compilation of general and permanent U.S. federal statutes.8 The U.S. Code is a large corpus of hierarchical text. We apply to the U.S. Code all four tasks that we applied to synthetic sections: text→cite, cite→text, defined→cite, and cite→defined. For these four tasks on the U.S. Code, the prompt is the same as for synthetic sections.

The benefit of the U.S. Code sections over the synthetic sections is that they are real legal text. Their downside is that LLMs have doubtless been trained on them, as they are publicly available on multiple websites. By contrast, no LLM will have seen the synthetic sections in the BLT dataset before its publication. Even after the dataset's publication, arbitrarily many new synthetic sections can be generated by changing the random seed. (There are 9,000 nonces available, so, for example, there are approximately 10^197 possible unique 50-nonce sections.) Thus the synthetic sections test LLMs' ability to do basic legal text handling of hierarchical text never before seen. This simulates the challenges of lawyers working with newly drafted contracts or new legislation.

To test whether LLMs' familiarity with U.S. Code sections causes errors, we add a fifth test for U.S. Code sections: cite→amended. In all but one respect, this test is identical to cite→text, in that it gives the cite of a leaf and asks "What is the exact text of just section __ above?" The sole difference is that we make a small but semantically important change to the text in that leaf to see if the LLM returns the original text or the changed text (which is the correct answer). This tests a basic legal skill: the ability to apply a newly amended statute rather than the old version. If the leaf contains any numbers, we add or subtract one from the last appearing number;9 otherwise, we tweak the last appearing citation from, say, "(D)" to "(A)"; otherwise, we toggle the last "and" to "or" or vice versa; otherwise, we toggle the last "shall" to "may" or vice versa;10 if none of these are available, we insert "unless otherwise provided by section 101," at the start of the leaf.

For all five tasks on the U.S. Code, we do not use all possible sections. We do not use sections containing tables, which are not purely text.11 We do not use sections with quoted hierarchical text such as model contracts, which are hard for even a human lawyer to read.12

For text→cite, we do the same test as with transcripts, not using lines that are under four words long, are subsets of any line appearing elsewhere in the prompt, or have a Levenshtein distance under four from another line in the prompt. For defined→cite, we do not use terms defined in more than one place in the prompt. We never use any of the cites that Congress has sloppily added twice.13

Unlike synthetic sections, which can be generated in unlimited quantities in arbitrarily large sizes, there are a limited number of U.S. Code sections. But it is a huge corpus, with 43,916 sections that meet the criteria discussed above, 447,037 leaves, and 23,562 unique definitions. Although 94% of sections are under 2,000 GPT-4 tokens, that still leaves 2,602 sections over 2,000 tokens, including 813 sections over 4,000 GPT-4 tokens and 196 sections over 8,000 tokens. When there are insufficient numbers of large enough sections, we can generate prompts of any desired size ad infinitum by adding randomly selected other sections of approximately the same size. We randomly shuffle the order of the sections in the prompt so that the target section's position is not a cue that the model should focus on it.

Having multiple sections filling the prompt is similar to how Greg Brockman pasted nine tax-related sections of the U.S. Code into GPT-4 during the livestream introducing it, to show how it could "do taxes." This is realistic: lawyers handling real-world issues often must apply several statutes in conjunction, not just one.14

3.4 General Considerations

As shown in Table 1, there are a total of 11 tests. For each of the 11, and for each possible size (ranging from BLT-4k to BLT-128k), we generate a test split of 100 prompts and 1,000 training prompts.15

Why only 100 test prompts for each test split? Three reasons. First, there are 11 tests, thus 1,100 test prompts for BLT-4k, 1,100 test prompts for BLT-8k, etc. Second, the cost of calling GPT-4 with just 1,100 BLT-8k prompts of around 5,000 tokens per prompt is nontrivial. Third, any LLM deployed for real-world legal practice really should be at or near 100% accuracy, and as accuracy approaches 100% the t-statistic goes to zero.

We are publicly releasing all test and train files.16 We do not retain any "held out" additional test splits. That is because it is possible to generate nearly unlimited additional test data using our code, simply by changing the random seed and by adding further OCR'ed deposition transcripts.

We leave a window buffer. Some LLMs' tokenizers might encode a prompt less efficiently than GPT-4's tokenizer, so we allow an extra 20% buffer for that. All LLMs sometimes provide answers to our prompts that are far more verbose than a direct answer, so we leave another 200 tokens to buffer that. So, for example, we ensure that all BLT-4k prompts are under 4,000 × 0.8 − 200 = 3,000 GPT-4 tokens.

4 Results and Discussion

Table 2 contains the accuracy of several models on several different sizes of BLT. All tests were by API call, with temperature set to 0.0 to maximize reproducibility. Our code measures accuracy with forgiving rules, including ignoring case, whitespace, and final punctuation and stop words. That code also uses handwritten rules to classify errors, and we draw on this feature in the discussion below.

We tested two models from OpenAI: GPT-3.5-turbo and GPT-4.17 From Google, we tested PaLM 2, specifically chat-bison-001. Anthropic did not grant us access to the API for its LLM Claude for this research.18

We observe that GPT-4 far outperforms the other models on all tasks. On several tasks, GPT-3.5-turbo outperforms PaLM 2, and vice versa.

4.1 GPT-4's poor performance on transcripts

GPT-4 has its weakest performance in handling transcripts, which is surprising, since transcripts have much simpler numbering than the hierarchical text in synthetic sections or the U.S. Code. For example, on BLT-4k's transcripts, GPT-4 gets 82% on text→cite and 78% on cite→text.

4.1.1 transcript text→cite

To further investigate, we generate 1,000 new transcript text→cite prompts in the same format as BLT-4k (25 lines per page, with half being one-page and half being two-page) and run them against GPT-4. It achieves 87.5% on these 1,000.

The biggest determinant of performance is whether the transcript was a single page or two pages. GPT-4 correctly answered 91% of single-page transcript prompts, but just 84% of 2-page transcript prompts. This makes sense, since 2-page transcripts have 50 lines of text, whereas 1-page transcripts have just 25 lines of text. (An example of GPT-4 getting a wrong answer on a 2-page transcript appears in Appendix C.)

To see whether GPT-4 is confused solely by the greater number of lines or by having the text split into two pages, we generate 500 new prompts with single pages but with 50 lines per page. (In other words, it is all one page, but with line numbers starting at 1: and ending at 50:, followed by the question.) We find GPT-4 achieves 84.8% accuracy, nearly identical to the two-page transcripts, indicating the problem with them is length, not being split into two pages. On both one-page and two-page transcripts, most errors are either identifying the line after the correct one (61% of errors) or before the correct one (25% of errors). The full error breakdowns are in Appendix D.

Footnotes:
7 The 1-page subsets are followed by "What is the line number of the line above with the text "__"?" and "What is the exact text of just line __ above?"
8 Congress makes it available in XML form at https://uscode.house.gov/download/download.shtml.
9 We never move from 1 to 2 or from 2 to 1, since that would also require changing singular nouns to plural or vice versa.
10 In legal phrasing, the distinction between "may" and "shall" is extremely important (Garner, 2019).
11 For example, 26 U.S.C. §1, containing income tax tables, is not used. 5,946 sections are excluded for this reason.
12 Examples include 5 U.S.C. §9507 and 25 U.S.C. §5329.
13 For example, there are two subsection (e)'s in 42 U.S.C. §1397hh.
14 For example, Home Depot U.S.A., Inc. v. Jackson, 139 S.Ct. 1743 (2019), involved the interplay of 28 U.S.C. §§1441, 1446, and 1453.
15 Thus on BLT-4k there are 11 × 100 = 1,100 test prompts, spread across 11 JSONL files, as well as 11 × 1,000 = 11,000 training prompts across 11 JSONL files.
16 https://github.com/BlairStanek/BLT/
17 Specifically, we used GPT-3.5-turbo-1106 and GPT-4-0613, respectively.
18 We also were unable to get access to GPT-4-32k or chat-bison-32k.
                         transcript       synthetic sections              U.S. Code
          model          t→c   c→t    t→c   c→t   d→c   c→d    t→c   c→t   c→a   d→c   c→d    mean

BLT-4k    GPT-3.5-turbo   53    32     72    38    83    79     89    52    56    77    98    66.3
          GPT-4           82    78     88    97    90   100     98    93    93    98   100    92.5
          PaLM 2          82    18     61    48    95    78     22    61    63     7    95    57.3

BLT-8k    GPT-3.5-turbo   24    17     19    20    65    52     63    58    56    69    89    48.4
          GPT-4           44    26     64    49    82    83     94    74    76    88    97    70.6
          PaLM 2          36     7      1     2     2    41     15    35    35    10    66    22.7

BLT-16k   GPT-3.5-turbo    7     5     25    12    67    57     38    36    36    50    77    37.3

(Column abbreviations: t→c = text→cite; c→t = cite→text; c→a = cite→amended; d→c = defined→cite; c→d = cite→defined.)

Table 2: Accuracy in percent of several models against the test splits of several different sizes of BLT. Since each size of BLT's test split consists of 100 prompts (with answers), the numbers are both the number correct and the percent accuracy.
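The forgiving scoring behind Table 2 (ignoring case, whitespace, final punctuation, and stop words, as described in Section 4) might look roughly like the sketch below. The stop-word list and exact normalization order here are ours, not the BLT code's:

```python
# Sketch of the forgiving answer comparison; details are our assumptions.
STOP_WORDS = {"the", "a", "an", "of", "at"}

def normalize(answer: str) -> str:
    answer = answer.strip().lower()
    answer = answer.rstrip(".!?;:")                       # drop final punctuation
    words = [w for w in answer.split() if w not in STOP_WORDS]
    return " ".join(words)                                # collapses whitespace

def is_correct(model_answer: str, gold: str) -> bool:
    return normalize(model_answer) == normalize(gold)
```

Under these rules, "  SECTION 5217(c). " matches the gold answer "section 5217(c)", while "section 5217(b)" does not.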
If we graph GPT-4's accuracy on two-page transcripts with 25 lines per page with respect to the page and line being requested (grouped into buckets of 5 lines), we do not see a 'U' as suggested by Liu et al. (2023), but rather a decreasing 'w':

[graph not shown]

4.1.2 transcript cite→text

GPT-4 also performed poorly on cite→text, so we created 1,000 new prompts in the same format as BLT-4k. GPT-4 got 75.7% accuracy on these 1,000. Here we did not see a significant difference between one-page and two-page transcripts, which got 76.6% and 74.8% respectively. We also created 1,000 new prompts that were one page, with just 15 lines on the page (as opposed to the normal 25). GPT-4 got 77.5% on those shorter transcripts, suggesting a slight accuracy improvement as the number of lines decreases.

4.2 GPT-4's poor performance on synthetic text→cite

Other than transcripts, GPT-4's worst performance on BLT-4k is on synthetic sections text→cite, at 88%; GPT-4 performs even worse on it in BLT-8k, at 64%. GPT-4's errors (which are in Appendix E) show that its most common error is getting a single portion of the hierarchical citation wrong. For example, when the correct answer is "section 2553(b)(2)(B)(i)", GPT-4 returned "section 2553(b)(1)(B)(i)". This is an easy mistake for a casual observer to make, but it refers to an entirely different portion of the document, and lawyers must not make such errors in court filings.

We added the test cite→amended for the U.S. Code to determine whether LLMs would over-rely on their knowledge of the U.S. Code from their training and miss a small dummy amendment to the requested line of the statute. We added a specific error code to track when the LLM returned the unamended text rather than the amended text actually put into the prompt. With 700 cite→amended prompts across the 7 configurations tested in Table 2, the unamended text was returned only 5 times: chat-bison-001 did so twice on BLT-4k and twice on BLT-8k; GPT-3.5-turbo did so once on BLT-8k. GPT-4 never returned the unamended text.
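The cite→amended perturbation cascade from Section 3.3 (bump the last number, else tweak the last citation letter, else toggle "and"/"or", else "shall"/"may", else prepend a dummy proviso) can be sketched as follows. Only the order of the fallbacks comes from the paper; the regexes and tie-breaking choices are ours:

```python
import re

def toggle_last(text: str, a: str, b: str):
    """Flip the final occurrence of word a or b to the other word."""
    matches = list(re.finditer(rf"\b({a}|{b})\b", text))
    if not matches:
        return None
    m = matches[-1]
    repl = b if m.group() == a else a
    return text[:m.start()] + repl + text[m.end():]

def amend_leaf(text: str) -> str:
    nums = list(re.finditer(r"\d+", text))
    if nums:                                     # 1. bump the last number
        m, n = nums[-1], int(nums[-1].group())
        n = n + 1 if n != 1 else n - 1           # one way to avoid a 1<->2 swap
        return text[:m.start()] + str(n) + text[m.end():]
    cites = list(re.finditer(r"\(([A-Z])\)", text))
    if cites:                                    # 2. tweak the last citation letter
        m = cites[-1]
        letter = "A" if m.group(1) != "A" else "B"   # "(A)" fallback is our choice
        return text[:m.start()] + f"({letter})" + text[m.end():]
    for a, b in (("and", "or"), ("shall", "may")):   # 3. and 4. toggle connectives
        toggled = toggle_last(text, a, b)
        if toggled is not None:
            return toggled
    return "unless otherwise provided by section 101, " + text   # 5. fallback
```

For instance, a leaf reading "any vessel of 75 tons" becomes "any vessel of 76 tons", while a number-free leaf ending in "(D)" becomes one ending in "(A)".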
text and task              not tuned   finetuned
transcript text→cite           53         100
transcript cite→text           32          99
synthetic text→cite            72          98
synthetic cite→text            38         100
synthetic defined→cite         83         100
synthetic cite→defined         79         100
uscode text→cite               89         100
uscode cite→text               52         100
uscode cite→amended            56         100
uscode defined→cite            77         100
uscode cite→defined            98         100

Table 3: Results of fine-tuning GPT-3.5-turbo on 9,900 training samples from BLT-4k. Both numerical columns contain percent accuracy on BLT-4k's test prompts. We see improvement to near-perfect scores.

5 Fine Tuning

We fine-tune the 4,000-token version of GPT-3.5-turbo, since we have not been granted fine-tuning access to GPT-4 or the 16k-token version of GPT-3.5-turbo. Because we are limited to 4,000 tokens, we naturally do fine-tuning with BLT-4k's training set. For each of the 11 task types, BLT-4k has a training set with 1,000 prompts and answers, for a total of 11,000. Of these, we use 90% for training. We train for two epochs with the default learning rate. The results of the fine-tuning are in Table 3. We find that fine-tuning brings GPT-3.5-turbo, which is not OpenAI's most advanced model, to near the 100% performance expected of lawyers and paralegals.

We also test how this finetuned GPT-3.5-turbo performs on two related tasks: SARA, and the applies-to test with synthetic sections. For SARA, we used the 276 cases where the answer is entail/contradict, with each prompt consisting of each U.S. Code section mentioned in the case, plus the facts (i.e., the premise) and the hypothesis (i.e., the question). Without fine-tuning, GPT-3.5-turbo's accuracy was 54.3% (150/276), but with our fine-tuning it is 60.9% (168/276).

For applies-to, we used the code of Blair-Stanek et al. (2023) to generate 1,000 various-sized synthetic sections and questions. Recall that BLT's synthetic sections use nonces like "cleight" and "roussiont", as in Figure 2. To ensure that our applies-to synthetic sections look different than this, we do not use nonces, but rather use what Blair-Stanek et al. (2023) calls ids, like "b44" or "m22". Thus, the facts and question look like "Alexis is a q55. Is section 1661(c) applicable to Alexis? Let's think step by step." Examples of such prompts appear in Appendix F.

Without fine-tuning, GPT-3.5-turbo gets an accuracy of 79% on applies-to. With fine-tuning, the accuracy actually falls, to 67%. This result was unexpected.

BLT-4k's training data had prompts with chunks of text like synthetic sections followed by a single question, with the correct answer always being very short – either a line of text, a definition, or a citation. This training data taught our finetuned GPT-3.5-turbo that a chunk of text followed by a single question should have a very short answer.

The mean number of words used by GPT-3.5-turbo to respond, with its step-by-step reasoning, to one of the applies-to queries fell from 59 before fine-tuning to 19 after fine-tuning. This truncated the proper reasoning. A representative example of the finetuned and non-tuned versions answering the same prompt with very different chains of reasoning and conclusions appears in Appendix F. Future research should focus on how to improve performance on BLT without such side-effects.

6 Conclusion

We demonstrate that the best currently available LLMs perform very poorly at many basic legal text-handling tasks. The chief innovation officer at the large international law firm Baker & McKenzie observed to the New York Times of LLMs, "At its best, the technology seems like a very smart paralegal" (Lohr, 2023). It might be more accurate to say that LLMs are like very sloppy paralegals. We find poor performance from GPT-3.5-turbo and PaLM 2 on our smallest test set, BLT-4k, and poor performance from even GPT-4 on portions of BLT-4k, like finding the text on one or two pages of deposition transcript. But fine-tuning on BLT-4k's training set brings the performance of GPT-3.5-turbo up to the expected human level of performance.

While we are focused on law, the BLT tasks are low-level enough that we would expect these findings to be relevant to anyone, regardless of domain. Moreover, this poor performance of advanced LLMs as-is shows the importance of consulting domain experts in the training of future LLMs.
References

Andrew Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme. 2023. OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax? Tax Notes Federal, 180:1101–1105.

Andrew Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme. 2023. Can GPT-3 Perform Statutory Reasoning? In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, ICAIL '23, pages 22–31.

Bryan Garner, editor. 2019. Black's Law Dictionary.

Neel Guha, Daniel E. Ho, Julian Nyarko, and Christopher Ré. 2022. LegalBench: Prototyping a collaborative benchmark for legal reasoning.

Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. CUAD: An expert-annotated NLP dataset for legal contract review. arXiv preprint arXiv:2103.06268.

Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2020. A dataset for statutory reasoning in tax law entailment and question answering. In Proceedings of the Natural Legal Language Processing Workshop 2020, co-located with the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2020), Virtual Workshop, August 24, 2020, volume 2645 of CEUR Workshop Proceedings, pages 31–38. CEUR-WS.org.

HYPC, editor. 2020. The Bluebook: A Uniform System of Citation, 21st edition.

Vladimir I. Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710. Soviet Union.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts.

Steve Lohr. 2023. A.I. threatens lawyers? We've heard this before. New York Times, page B1.

Eric Martínez. 2023. Re-evaluating GPT-4's bar exam performance.

OpenAI. 2023a. GPT-4 developer livestream.

OpenAI. 2023b. GPT-4 technical report.

Drescher ProParalegal. 2017. Practice tip sheet.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Libin Zhang. 2023. Tax questions for language models. Tax Notes Federal, 179:1699.

A Synthetic Section Sizes

Larger versions of BLT have longer and more complicated prompts. Below are the sizes of synthetic sections in each size of BLT.

Version    Sizes
BLT-4k     2 wide, 2 deep; 2 wide, 3 deep; 2 wide, 4 deep; 2 wide, 5 deep; 3 wide, 2 deep; 3 wide, 3 deep; 3 wide, 4 deep; 4 wide, 2 deep; 4 wide, 3 deep
BLT-8k     2 wide, 6 deep; 3 wide, 5 deep; 4 wide, 4 deep; 7 wide, 3 deep; 20 wide, 2 deep
BLT-16k    5 wide, 4 deep; 8 wide, 3 deep; 9 wide, 3 deep; 30 wide, 2 deep
BLT-32k    3 wide, 6 deep; 4 wide, 5 deep; 6 wide, 4 deep; 12 wide, 3 deep; 11 wide, 3 deep; 44 wide, 2 deep; 40 wide, 2 deep
BLT-64k    7 wide, 4 deep; 16 wide, 3 deep; 15 wide, 3 deep; 14 wide, 3 deep; 13 wide, 3 deep; 60 wide, 2 deep; 65 wide, 2 deep
BLT-128k   4 wide, 6 deep; 5 wide, 5 deep; 8 wide, 4 deep; 9 wide, 4 deep; 20 wide, 3 deep; 80 wide, 2 deep

B Transcript Quotation Sizes

Larger versions of BLT have longer prompts. Below are the number of pages of transcript quotation used in each size BLT model. The generated
prompts are distributed uniformly among these page sizes. For example, half of BLT-4k's prompts have one-page deposition quotations and the other half have two-page deposition quotations.

Version    Transcript Pages
BLT-4k     1, 2
BLT-8k     5, 10, 15
BLT-16k    25, 40
BLT-32k    30, 60, 80
BLT-64k    100, 130
BLT-128k   120, 140

C Example of GPT-4 failing on a 2-page transcript text→cite prompt

Page 68:
1: after the accident?
2: A. He called me – that same day,
3: he called me at nighttime and asked me how
4: I was.
5: Q. He called you and asked you how
6: you were?
7: A. Yes.
8: Q. What did you tell him?
9: A. I just told him that I wasn't
10: feeling good, that I was just laying in
11: bed.
12: Q. And what did he say?
13: A. He said take the pills I gave
14: you, you'll be ready and better by next
15: week. He was just giving me something like
16: a fast way for me to feel up, to cheer up.
17: It wasn't helping.
18: Q. Going back to your primary care
19: physician. You said she wanted to send you
20: to a specialist?
21: A. Yes.
22: Q. But then you called workers'
23: comp?
24: A. Yeah, that's when the issue
25: happened and I called workers' comp.
Page 69:
1: Q. Then who did you treat with
2: next?
3: A. I treated with CityMD.
4: Q. You said CityMD?
5: A. Yes.
6: Q. Was the place that you went to
7: the first time the same place that you went
8: like after you went to your primary care
9: physician?
10: A. Excuse me again?
11: MR. GROSSMAN: You said you
12: went to CityMD. She thinks you went
13: twice, did you go twice?
14: THE WITNESS: No.
15: MR. GROSSMAN: Then you're not
16: listening to the question. You went
17: to CityMD first, then you went to
18: your primary physician. Now, she
19: wants to know where you went next.
20: THE WITNESS: No, first I went
21: to the primary.
22: MR. GROSSMAN: First primary,
23: and then CityMD and then where did
24: you go?
25: THE WITNESS: I was in CityMD

What are the page number and line number of the line above with the text "your primary physician. Now, she"?

The page number is 69 and the line number is 19.

GPT-4's answer above is wrong, since the requested text is on line 18.

D GPT-4 Errors on Transcripts

GPT-4 performs surprisingly poorly on transcripts, even the 1- and 2-page transcripts in BLT-4k. Below are qualitative and quantitative breakdowns of GPT-4's errors on the large runs used to investigate the problem. Our API-calling code automatically categorizes the errors.

D.1 GPT-4's errors on transcript text→cite from BLT-4k

Recall that we generated 1,000 new BLT-4k style transcripts with text→cite prompts. Here were the lines returned relative to the correct ones:

#    Relative Error
3    prior page
1    19 before
1    3 before
1    2 before
12   1 before (i.e. prior line)
76   1 after (i.e. next line)
1    2 after
1    3 after
1    next page
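The relative-error labels in the table above come from automatically comparing the model's answer against the correct cite. A sketch of that bucketing (our reconstruction, with labels following the appendix tables):

```python
# Sketch of relative-error bucketing for transcript text->cite answers.
# The label strings follow Appendix D; the logic is our reconstruction.
def bucket_error(pred_page: int, pred_line: int,
                 gold_page: int, gold_line: int) -> str:
    if pred_page < gold_page:
        return "prior page"
    if pred_page > gold_page:
        return "next page"
    delta = pred_line - gold_line
    if delta == 0:
        return "correct"
    if delta == -1:
        return "1 before (i.e. prior line)"
    if delta == 1:
        return "1 after (i.e. next line)"
    return f"{-delta} before" if delta < 0 else f"{delta} after"
```

Applied to the Appendix C example (GPT-4 answered page 69, line 19; the correct answer was page 69, line 18), this yields "1 after (i.e. next line)", the most common error category.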
D.2 GPT-4’s qualitative errors on transcript # Relative Error
cite→text from BLT-4k 2 prior page
4 same page, >3 lines before
Recall that we generated 1,000 new BLT-4k style 1 3 lines before
transcripts with cite→text prompts. Here were the 6 2 lines before
errors: 81 prior line
39 next line
10 2 lines after
# Type of Error 7 3 lines after
166 wrong line 6 4 lines after
48 superset 15 same page, >4 lines after
10 not found 3 next page
9 subset wrong line
7 multiple wrong lines
3 subset
E GPT-4’s errors on synthetic text→cite
11
#    Type of Error
7    wrong paragraph
5    wrong subclause
5    wrong subparagraph
5    wrong clause
2    wrong item
2    wrong subsection
3    wrong subsection, wrong paragraph
1    wrong paragraph, wrong subparagraph
2    wrong paragraph, subparagraph, & clause
1    wrong paragraph, clause, & subclause
3    not parallel

Above we again see that most errors are just getting the citation wrong in a single component.

F Finetuned worse on applies-to

In section 5 we found – unexpectedly – that GPT-3.5-turbo finetuned on BLT-4k performed worse on the applies-to task defined in Blair-Stanek et al. (2023) than the not-finetuned version. It appears that the fine-tuning resulted in much-shortened chains of reasoning, often incomplete, that failed to get to the correct answer.

The word "first" appears in the non-tuned model's response 82% of the time and 84% for the finetuned model. A good chain of reasoning has either the words "second", "next", or "then" later in its answer. Those words come after "first" 52% of the time with the non-tuned model but a very low 2% for the finetuned model. Because the finetuned model's reasoning was truncated, its accuracy fell.

Below we have a representative example, with the exact same prompt, showing how the finetuned and not-finetuned models responded.

Finetuned Model Chat:

Section 1661. Definition of t88.
(a) General rule. The term "t88" means-
(1) any q33, or
(2) any a33.
(b) The term "q33" means-
(1) any q55, or
(2) any v33.
(c) The term "a33" means-
(1) any o33, or
(2) any i66.

Alexis is a q55. Is section 1661(c) applicable to Alexis? Let's think step by step.

First, let's check if Alexis is a q55, as mentioned in section 1661(b)(1).

Therefore, the answer (Yes or No) is
Yes

Above we see that the finetuned model has a terse answer that started to give reasoning but then stopped prematurely. Below we see the non-tuned model on the same prompt, giving robust reasoning and reaching the correct result:

Non-Tuned Model Chat:

Section 1661. Definition of t88.
(a) General rule. The term "t88" means-
(1) any q33, or
(2) any a33.
(b) The term "q33" means-
(1) any q55, or
(2) any v33.
(c) The term "a33" means-
(1) any o33, or
(2) any i66.

Alexis is a q55. Is section 1661(c) applicable to Alexis? Let's think step by step.

First, we need to determine if Alexis falls under the definition of "q55" or "v33" as per section 1661(b). Since Alexis is a q55, she falls under the definition of "q55" as per section 1661(b)(1).

Next, we need to determine if section 1661(c) is applicable to Alexis. According to section 1661(c), the term "a33" means any o33 or any i66. Since Alexis is not an o33 or an i66, section 1661(c) is not applicable to her.

In conclusion, section 1661(c) is not applicable to Alexis.

Therefore, the answer (Yes or No) is
No.