BLT: Can Large Language Models Handle Basic Legal Text?

Andrew Blair-Stanek¹,²   Nils Holzenberger³   Benjamin Van Durme¹

¹Johns Hopkins University, ²University of Maryland School of Law,
³Télécom Paris - Institut Polytechnique de Paris

Abstract

We find that the best publicly available LLMs like GPT-4 and PaLM 2 currently perform poorly at basic text handling required of lawyers or paralegals, such as looking up the text at a line of a witness deposition or at a subsection of a contract. We introduce a benchmark to quantify this poor performance, which casts into doubt LLMs' current reliability as-is for legal practice. Finetuning for these tasks brings an older LLM to near-perfect performance on our test set and also raises performance on a related legal task. This stark result highlights the need for more domain expertise in LLM training.

1: straight that I couldn't do nothing.
2: Q. How did you get to your
3: girlfriend's house today?
4: A. She picks me up.
5: Q. How long have you guys been
6: together?
7: A. We've been together for four or
8: five years.
9: Q. Okay. Do you currently have
10: any doctors' appointments scheduled?
11: A. I did yesterday, but not today.
12: Q. Like do you have another one
13: this week?
14: A. I have therapy tomorrow.
15: Q. Tomorrow?

What is the exact text of just line 10 above?

A. I did yesterday, but not today.

Figure 1: Representative example of GPT-4 incorrectly answering a simple question about a page from a line-numbered witness deposition transcript. The line numbers, exactly as shown above, are in the prompt passed to GPT-4. We find GPT-4 incorrectly answers 23% of such one-page, 15-line deposition line-retrieval prompts.

1 Introduction

There has been much excitement about the potential of large language models (LLMs) to handle legal tasks. Recent papers, presentations, and benchmarks have focused on LLMs' ability to perform self-contained high-level legal tasks, such as passing the bar exam, identifying hearsay evidence, or navigating the tax code. LLMs' strong performance on these tasks has led to predictions that LLMs will imminently revolutionize law practice. Yet no previous work has, to our knowledge, been done on LLMs' ability to handle the most basic text processing done by lawyers and paralegals.

We introduce a novel benchmark for Basic Legal Text (BLT) processing to fill this gap. Running it on the most advanced publicly-available LLMs, we find they perform poorly.

The BLT tasks mimic basic tasks done by lawyers and paralegals, including looking up citations to the record, finding statutory citations containing particular text or definitions, or finding the text in a contract that another lawyer has explicitly cited. By design, a human paralegal, newly-minted lawyer, or even non-subject-matter expert could perform these tasks at or near 100%.

Our code¹ can generate nearly unlimited new test or training data through changing the random seed. We also release test and train sets for a progression of LLM window sizes: BLT-4k (which fits inside 4,000-token windows), BLT-8k (which fits inside 8,000-token windows), BLT-16k, BLT-32k, BLT-64k, and BLT-128k. We also include code that calls the LLMs' APIs, measures their accuracy, and classifies the errors using hand-written rules.

We use the training data to finetune GPT-3.5-turbo, and find this rapidly brings about near-human (i.e., 100%) performance. This indicates that – to the extent that those training LLMs have cared to seek out domain expertise – they have not focused on the actual day-to-day needs of domain practitioners (here, lawyers and paralegals).

¹ https://github.com/BlairStanek/BLT/
2 Background

2.1 Why Care About Law?

Law is a largely text-based profession, so it is often used to demonstrate advances in natural language processing in general and LLMs specifically.

For example, OpenAI's GPT-4 technical report (OpenAI, 2023b) mentioned only a single benchmark in the abstract itself: GPT-4 had passed the bar exam, with the score in the 90th percentile.

OpenAI's developer livestream by co-founder Greg Brockman introducing GPT-4 (OpenAI, 2023a) used four examples to show GPT-4's capabilities. In one, Brockman prompted GPT-4 with several sections of U.S. tax law and had it calculate the taxes of hypothetical taxpayers Alice and Bob. Brockman proclaimed that GPT-4 can "do taxes."

But subsequent authors have raised doubts about these achievements. For example, Martínez (2023) questions the statistics behind GPT-4 getting in the 90th percentile on the bar. Tax lawyers noted that GPT-4 in the livestream actually got the wrong answer for Alice and Bob's taxes (Zhang, 2023). Moreover, the example involving Alice and Bob was copied verbatim from an academic dataset, SARA (Holzenberger et al., 2020): when run on the full SARA dataset, GPT-4 performed poorly (Blair-Stanek et al., 2023).

2.2 Legal Benchmarks for LLMs

SARA stands for StAtutory Reasoning Assessment, and it is a dataset testing the ability to reason over statutory text. It consists of nine tax-related sections of the U.S. Code, plus 376 hand-crafted "cases" consisting of facts and a question that can be unambiguously answered by applying the nine sections to the facts. (SARA also has the human-ascertained ground truth for all 376 cases.) In the livestream, Brockman prompted GPT-4 with these nine sections and one of the cases. Of the 376 cases, 100 have an answer that is the tax liability in dollars, with the remaining 276 being entail/contradict questions. Because SARA is a higher-level task clearly predicated on lower-level text handling (specifically, finding text at a citation), we use it as one of the benchmarks for fine-tuning with BLT, discussed in Section 5.

Many legal benchmarks for LLMs have been created, with most incorporated into the broad LegalBench project (Guha et al., 2022). These are all much higher-level tasks than BLT, including identifying testimony to which the hearsay doctrine applies or whether contractual terms impose particular restrictions (Hendrycks et al., 2021).

LLMs have seen much of the internet during their training, including legal statutes, court cases, and publicly available contracts. To see whether LLMs can handle truly novel legal texts, Blair-Stanek et al. (2023) generated synthetic sections constructed with nonces (phonetically plausible nonsense words) or ids (like "b44" or "m22") and probed GPT-3's ability to apply these synthetic sections. Their prompts consisted of one synthetic section plus text like "Alexis is a portle. Is section 1001(b) applicable to Alexis? Let's think step by step." In this prompt, "portle" is a nonce used in the synthetic section, and 1001(b) is a specific cite within that section. This applies-to task is fairly high-level, since it requires following the hierarchy in hierarchical text, potentially to the top of the section and back down. Examples of such prompting appear in Appendix F. Along with SARA, we use applies-to as a benchmark to see how fine-tuning with BLT improves LLMs' performance on them, as discussed in Section 5.

2.3 Related Work Not Involving Law

LLMs have been trained to use "Tools" (Schick et al., 2023). For example, an LLM might detect that it needs to call a calculator tool to precisely handle a math problem posed to it in text form. In theory, tools could be written to handle the BLT tasks and integrated into LLMs. But lawyers will rarely paste in legal text in as pristine a prompt format as the BLT prompts, so such tools might not always work. Moreover, for every tool that handles basic legal text, there is an aspect of legal practice that cannot quite be done by such a tool. For example, a text-matching tool might find the citation to the record where the plaintiff says "I have therapy tomorrow." But a lawyer might need a citation to the record proving that plaintiff receives care for psychological problems, and a basic text-matching tool would not handle that. One would hope an LLM could do that without any tool.

Some of BLT's prompts are quite long, which is realistic because lawyers must often perform basic legal text handling of long texts. The effectiveness of using long contexts for LLMs was investigated by Liu et al. (2023). They found that GPT-4's use of information in prompts followed a U-curve with respect to the information's position, with information in the middle of the prompt used much less than if it were at the start or end. They connected this to the "serial-position" effect exhibited by humans, who best remember material presented near the beginning or end.
Task            Task description                                  deposition    synthetic   U.S.
                                                                  transcripts   sections    Code
text→cite       given text, return its cite                           ✓             ✓         ✓
cite→text       given a cite, return the text at it                   ✓             ✓         ✓
cite→amended    same, but prompted with amended text                                          ✓
defined→cite    given a term, return the cite of its definition                     ✓         ✓
cite→defined    given a cite, return the term defined at it                         ✓         ✓

Table 1: Summary of the tests making up BLT. The three right columns are the types of text, and the rows are the tasks that might be performed on that text. A checkmark indicates that it makes sense to run that test on that type of text and so it is in BLT.

3 The BLT Benchmark

The BLT benchmark involves three different types of legal text, each of which has between two and five different tasks run on it (summarized in Table 1).

3.1 Deposition Transcripts

In litigation in the U.S., depositions of witnesses under oath are a key factfinding tool.² The depositions typically occur in lawyers' offices and allow lawyers to ask witnesses questions on virtually any topic. Professional court reporters transcribe the depositions into transcripts, typically with 25 numbered lines per page,³ often running over 100 pages for a single witness deposition. Attorneys must cite relevant portions of the resulting transcripts in subsequent motions, such as those asking the court to grant summary judgment to their side.⁴ Portions of transcripts are cited by page and line number.⁵

A basic text-handling task a lawyer must do in constructing a motion is finding the page and line where particular text appears. This motivates the text→cite task, where the prompt consists of part of an actual deposition transcript followed by the question, "What are the page number and line number of the line above with the text "__"?" To ensure there is only one clearly correct answer, prompts are never constructed asking about lines with fewer than four words, that are subsets of another line, or that are too similar to other lines (Levenshtein distance under four (Levenshtein et al., 1966)).
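These filters are straightforward to state; a minimal sketch of how they might be implemented, assuming a standard dynamic-programming edit distance (the released filtering code may differ), is:

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def askable(idx: int, lines: list) -> bool:
    """May lines[idx] be the target of a text->cite question?"""
    line = lines[idx]
    if len(line.split()) < 4:                           # fewer than four words
        return False
    others = lines[:idx] + lines[idx + 1:]
    if any(line in other for other in others):          # subset of another line
        return False
    # too similar to another line (edit distance under four)
    return all(levenshtein(line, other) >= 4 for other in others)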
The converse is another basic text-handling task: given a citation to a transcript, find the text at the cited location. Lawyers must do this basic task in order to evaluate the opposing side's motions. Paralegals do it on their side's own motions before submitting them (ProParalegal, 2017). This motivates the cite→text task, where the prompt consists of a part of a deposition transcript followed by the question "What is the exact text of just line __ of page __ above?" An example of GPT-4 failing this task appears in Figure 1.

BLT's deposition transcript tests are built from a novel corpus we constructed of 33,176 lines of actual deposition transcripts filed with courts. They are from a variety of cases, as the main criterion was that they were cleanly OCR'ed and could be fully cleaned with regular expressions. This corpus can be extended ad infinitum by others by downloading further transcripts and cleaning them, as deposition transcripts are likely not copyrightable.⁶

The existing page and line numbers are stripped out, and random spans of appropriate length are selected, with new page numbers and line numbers. This renumbering guards against the possibility that a model has already seen the deposition and allows many more prompts to be constructed from each raw deposition.
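A minimal sketch of this renumbering, assuming the transcript lines have already been cleaned to plain text and 25 lines per page (an illustration, not the released code):

import random

LINES_PER_PAGE = 25  # the typical deposition format described above

def renumber_span(clean_lines: list, num_pages: int,
                  rng: random.Random) -> str:
    """Pick a random span of an already-cleaned transcript and assign
    fresh page and line numbers, so prior exposure to the deposition
    during pretraining is no help to the model."""
    span_len = num_pages * LINES_PER_PAGE
    start = rng.randrange(len(clean_lines) - span_len)
    span = clean_lines[start:start + span_len]
    first_page = rng.randint(1, 200)  # arbitrary fresh page number
    out = []
    for p in range(num_pages):
        out.append(f"Page {first_page + p}:")
        for i in range(LINES_PER_PAGE):
            out.append(f"{i + 1}: {span[p * LINES_PER_PAGE + i]}")
    return "\n".join(out)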
question, "What are the page number and line num- The size of the prompt is entirely scalable. For
2
Federal Rule of Civil Procedure 30. 6
Lipman v. Massachusetts, 311 F. Supp. 593 (D. Mass.
3
U.S. Court Reporters Association Manual, Section 18.8 1970); 1 Nimmer on Copyright §5.12[C] (2023 ed.). Even in
4
Federal Rule of Civil Procedure 56(c)(1)(A). the unlikely event transcripts are copyrightable, this use of
5
See Bluebook Rule B17.1.2 (HYPC, 2020). them is likely fair use. 17 U.S.C. §107.

3
Section 5217. Definition of cleight.
(a) General rule. The term "cleight" means any baitisolist or any roussiont.
(b) The term "baitisolist" means any ballinated or any caset.
(c) The term "roussiont" means any dicemercu or any accodpoileare.

What is the exact citation above where the term "roussiont" is defined? (Use standard legal formatting like section 1001(b)(2)).

Section 5217(b)

Figure 2: Representative example of GPT-4 incorrectly answering a defined→cite question with a 2-deep, 2-wide synthetic section. The correct answer is "section 5217(c)".

The size of the prompt is entirely scalable. For BLT-4k, we use a mix of 1- and 2-page subsets of depositions. For the 1-page subsets, only the line number is involved (i.e., not any page number).⁷ Larger versions have larger quotes; for example, with BLT-128k, we use a mix of 120- and 140-page subsets of depositions. Appendix B details the page sizes used for different sizes of BLT.

3.2 Synthetic Sections

Lawyers regularly work with hierarchical text, including statutes, contracts, regulations, treaties, court rules, and corporate charters. Hierarchical text is often organized into sections, subsections, paragraphs, subparagraphs, etc. Being able to navigate such hierarchical text is a basic legal text-handling task required of all lawyers, whether they are litigators arguing that a statute applies to their case, or transactional lawyers negotiating the terms of a contract.

We generate synthetic hierarchical sections, following the approach of Blair-Stanek et al. (2023). The synthetic hierarchical sections are purely definitional, like many sections of statutes, regulations, or contracts. They use repeated application of the logical form A ⇒ B, with two parameters: depth, the maximum number of times it is applied; and width, the number of times it is applied to each B. The terms defined are nonces that are not real words but are pronounceable. These synthetic sections can be arbitrarily large, thus making tasks based on them scalable to different-sized token windows. For example, BLT-4k has synthetic sections ranging from 2-wide, 2-deep, as in Figure 2, which are very short, up to 3-wide, 4-deep, which takes up much of the 4k token window. At the highest end, BLT-128k has a variety ranging from 60-wide, 2-deep to 5-wide, 5-deep, which (because statute size is exponential with respect to depth) takes up much of the 128,000-token window. For the full list of sizes in each benchmark, see Appendix A.
lowing the approach of Blair-Stanek et al. (2023). We ask this question only of "leaves" in the
The synthetic hierarchical sections are purely defi- statute, meaning they have no subsections under-
nitional, like many sections of statutes, regulations, neath them. This ensures there is only a single
or contracts. They use repeated application of the correct answer. For example, if you asked for the
logical form A ⇒ B, with two parameters: depth, text of section 573(a)(2) and if there existed two
the maximum number of times it is applied; and subparagraphs under it – sections 573(a)(2)(A) and
width, the number of times it is applied to each B. 573(a)(2)(B) – it is unclear whether or not the text
The terms defined are nonces that are not real words of these two subparagraphs (A) and (B) should
but are pronounceable. These synthetic sections be returned. In this example, section 573(a)(2) is
can be arbitrarily large, thus making tasks based not a leaf, and such ambiguity is avoided by con-
on them scalable to different-sized token windows. sidering only leaves. This leaf-only restriction is
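A minimal sketch of the leaf test, assuming a section is parsed into a nested mapping from subdivision label to children (an assumption; the BLT code's actual data structure is not described here):

SECTION_573 = {
    "(a)": {"(1)": {}, "(2)": {"(A)": {}, "(B)": {}}},
}

def leaf_cites(tree: dict, prefix: str = "section 573") -> list:
    """Return cites of subdivisions with no children; only these are
    valid targets for cite->text."""
    leaves = []
    for label, children in tree.items():
        cite = prefix + label
        if children:
            leaves.extend(leaf_cites(children, cite))
        else:
            leaves.append(cite)
    return leaves

# section 573(a)(2) is not a leaf, so it is never asked about:
assert "section 573(a)(2)" not in leaf_cites(SECTION_573)
print(leaf_cites(SECTION_573))
# ['section 573(a)(1)', 'section 573(a)(2)(A)', 'section 573(a)(2)(B)']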
We also include two other basic text-handling tasks on the synthetic sections. Terms are defined in hierarchical texts and often referenced elsewhere in the same hierarchical text. For example, a section of a statute might define "applicable ship" at one specific subparagraph, but then use "applicable ship" a dozen other times in the same section. Lawyers must be able to cite a term's precise definition. With defined→cite, the prompt is one synthetic section followed by the question "What is the exact citation above where the term "__" is defined? (Use standard legal formatting like section 1001(b)(2))." Conversely, when given such a citation by another lawyer, a lawyer must be able to find the term. This motivates cite→defined, where the prompt is one synthetic section followed by the question "What is the term defined at section __?" An example of GPT-4 incorrectly answering a defined→cite problem appears in Figure 2.

⁷ The 1-page subsets are followed by "What is the line number of the line above with the text "__"?" and "What is the exact text of just line __ above?"
3.3 U.S. Code

The U.S. Code is the official compilation of general and permanent U.S. federal statutes.⁸ The U.S. Code is a large corpus of hierarchical text. We apply to the U.S. Code all four tasks that we applied to synthetic sections: text→cite, cite→text, defined→cite, and cite→defined. For these four tasks on the U.S. Code, the prompt is the same as for synthetic sections.

The benefit of the U.S. Code sections over the synthetic sections is that they are real legal text. Their downside is that LLMs have doubtless been trained on them, as they are publicly available on multiple websites. By contrast, no LLM will have seen the synthetic sections in the BLT dataset before its publication. Even after the dataset's publication, arbitrarily many new synthetic sections can be generated by changing the random seed. (There are 9,000 nonces available, so, for example, there are approximately 10^197 possible unique 50-nonce sections.) Thus the synthetic sections test LLMs' ability to do basic legal text handling of hierarchical text never before seen. This simulates the challenges of lawyers working with newly drafted contracts or new legislation.

To test whether LLMs' familiarity with U.S. Code sections causes errors, we add a fifth test for U.S. Code sections: cite→amended. In all but one respect, this test is identical to cite→text, in that it gives the cite of a leaf and asks "What is the exact text of just section __ above?" The sole difference is that we make a small but semantically important change to the text in that leaf, to see if the LLM returns the original text or the changed text (which is the correct answer). This tests a basic legal skill: the ability to apply a newly amended statute rather than the old version. If the leaf contains any numbers, we add or subtract one from the last appearing number;⁹ otherwise, we tweak the last appearing citation from, say, "(D)" to "(A)"; otherwise, we toggle the last "and" to "or" or vice versa; otherwise, we toggle the last "shall" to "may" or vice versa;¹⁰ if none of these are available, we insert "unless otherwise provided by section 101," at the start of the leaf.
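A minimal sketch of this chain of fallback perturbations (our illustration of the rules just described; for simplicity it only adds one to the last number above 2, whereas the actual code may also subtract):

import re

def amend_leaf(text: str) -> str:
    """Apply the first available perturbation, sketching the
    cite->amended rules described above."""
    # 1. Bump the last number (skipping 1 and 2, which would force
    #    singular/plural changes elsewhere in the sentence).
    for m in reversed(list(re.finditer(r"\d+", text))):
        n = int(m.group())
        if n > 2:
            return text[:m.start()] + str(n + 1) + text[m.end():]
    # 2. Tweak the last lettered citation, e.g. "(D)" -> "(A)".
    cites = list(re.finditer(r"\(([B-Z])\)", text))
    if cites:
        m = cites[-1]
        return text[:m.start()] + "(A)" + text[m.end():]
    # 3. Toggle the last "and"/"or".
    conj = list(re.finditer(r"\b(and|or)\b", text))
    if conj:
        m = conj[-1]
        repl = "or" if m.group() == "and" else "and"
        return text[:m.start()] + repl + text[m.end():]
    # 4. Toggle the last "shall"/"may".
    modal = list(re.finditer(r"\b(shall|may)\b", text))
    if modal:
        m = modal[-1]
        repl = "may" if m.group() == "shall" else "shall"
        return text[:m.start()] + repl + text[m.end():]
    # 5. Fallback: prepend a dummy cross-reference.
    return 'unless otherwise provided by section 101, ' + text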
For all five tasks on the U.S. Code, we do not use all possible sections. We do not use sections containing tables, which are not purely text.¹¹ We do not use sections with quoted hierarchical text such as model contracts, which are hard for even a human lawyer to read.¹²

For text→cite, we do the same test as with transcripts, not using lines that are under four words long, are subsets of any line appearing elsewhere in the prompt, or that have a Levenshtein distance under four from another line in the prompt. For defined→cite, we do not use terms defined in more than one place in the prompt. We never use any of the cites that Congress has sloppily added twice.¹³

Unlike synthetic sections, which can be generated in unlimited quantities in arbitrarily large sizes, there are a limited number of U.S. Code sections. But it is a huge corpus, with 43,916 sections that meet the criteria discussed above, 447,037 leaves, and 23,562 unique definitions. Although 94% of sections are under 2,000 GPT-4 tokens, that still leaves 2,602 sections over 2,000 tokens, including 813 sections over 4,000 GPT-4 tokens and 196 sections over 8,000 tokens. When there are insufficient numbers of large enough sections, we can generate prompts of any desired size ad infinitum by adding randomly selected other sections of approximately the same size. We randomly shuffle the order of the sections in the prompt so that the target section's position is not a cue that the model should focus on it.
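A minimal sketch of this padding and shuffling, assuming a count_tokens tokenizer callback and a pool `others` of candidate filler sections of approximately the target's size (both are assumptions of this illustration):

import random

def build_packed_prompt(target: str, others: list, budget: int,
                        count_tokens) -> str:
    """Pad the prompt with filler sections until the token budget is
    reached, then shuffle so the target's position is not a cue."""
    sections = [target]
    used = count_tokens(target)
    for sec in random.sample(others, len(others)):
        cost = count_tokens(sec)
        if used + cost <= budget:
            sections.append(sec)
            used += cost
    random.shuffle(sections)
    return "\n\n".join(sections)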
Having multiple sections filling the prompt is similar to how Greg Brockman pasted nine tax-related sections of the U.S. Code into GPT-4 during the livestream introducing it, to show how it could "do taxes." This is realistic: lawyers handling real-world issues often must apply several statutes in conjunction, not just one.¹⁴

⁸ Congress makes it available in XML form at https://uscode.house.gov/download/download.shtml.
⁹ We never move from 1 to 2 or from 2 to 1, since that would also require changing singular nouns to plural or vice versa.
¹⁰ In legal phrasing, the distinction between "may" and "shall" is extremely important (Garner, 2019).
¹¹ For example, 26 U.S.C. §1, containing income tax tables, is not used. 5,946 sections are excluded for this reason.
¹² Examples include 5 U.S.C. §9507 and 25 U.S.C. §5329.
¹³ For example, there are two subsection (e)'s in 42 U.S.C. §1397hh.
¹⁴ For example, Home Depot U.S.A., Inc. v. Jackson, 139 S.Ct. 1743 (2019), involved the interplay of 28 U.S.C. §§1441, 1446, and 1453.
3.4 General Considerations

As shown in Table 1, there are a total of 11 tests. For each of the 11, and for each possible size (ranging from BLT-4k to BLT-128k), we generate a test split of 100 prompts and 1,000 training prompts.¹⁵

Why only 100 test prompts for each test split? Three reasons. First, there are 11 tests, thus 1,100 test prompts for BLT-4k, 1,100 test prompts for BLT-8k, etc. Second, the cost of calling GPT-4 with just 1,100 BLT-8k prompts with around 5,000 tokens per prompt is nontrivial. Third, any LLM deployed for real-world legal practice really should be at or near 100%, and as accuracy approaches 100% the t-statistic goes to zero.

We are publicly releasing all test and train files.¹⁶ We do not retain any "held out" additional test splits. That is because it is possible to generate nearly unlimited additional test data using our code simply by changing the random seed and by adding further OCR'ed deposition transcripts.

We leave a window buffer. Some LLMs' tokenizers might encode a prompt less efficiently than GPT-4's tokenizer, so we allow an extra 20% buffer for that. All LLMs sometimes provide answers to our prompts that are far more verbose than a direct answer, so we leave another 200 tokens to buffer that. So, for example, we ensure that all BLT-4k prompts are under 4,000 × 0.8 − 200 = 3,000 GPT-4 tokens.
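A minimal sketch of this budget check, assuming the tiktoken library (the paper does not name the tokenizer library its code uses):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def fits_budget(prompt: str, window: int = 4_000) -> bool:
    """20% headroom for less efficient tokenizers, plus 200 tokens for
    verbose answers: 4,000 * 0.8 - 200 = 3,000 for BLT-4k."""
    budget = int(window * 0.8) - 200
    return len(enc.encode(prompt)) <= budget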
single pages but with 50 lines per page. (In other
4 Results and Discussion words, it is all one page, but with line numbers
starting at 1: and ending at 50:, followed by the
Table 2 contains the accuracy of several models question). We find GPT-4 achieves 84.8% accu-
on several different sizes of BLT. All tests were by racy, nearly identical to the two-page transcripts,
API call, with temperature set to 0.0 to maximize indicating the problem with them is length, not be-
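A minimal sketch of such an evaluation call, assuming the current openai Python client (the version used for these experiments may expose a different API):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4-0613") -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,  # maximize reproducibility
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content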
Our code measures accuracy with forgiving rules, including ignoring case, whitespace, and final punctuation and stop words. That code also uses handwritten rules to classify errors, and we draw on this feature in the discussion below.
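A minimal sketch of that forgiving comparison (the stop-word list here is invented for illustration; the paper does not enumerate the stop words its scoring code ignores):

import string

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "or"}  # illustrative only

def normalize(answer: str) -> str:
    """Ignore case, whitespace, final punctuation, and stop words."""
    answer = answer.strip().rstrip(string.punctuation).lower()
    return " ".join(w for w in answer.split() if w not in STOP_WORDS)

def is_correct(model_answer: str, gold: str) -> bool:
    return normalize(model_answer) == normalize(gold)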
We tested two models from OpenAI: GPT-3.5-turbo and GPT-4.¹⁷ From Google, we tested PaLM 2, specifically chat-bison-001. Anthropic did not grant us access to the API for its LLM Claude for this research.¹⁸

We observe that GPT-4 far outperforms the other models on all tasks. On several tasks, GPT-3.5-turbo outperforms PaLM 2, and vice versa.

4.1 GPT-4's poor performance on transcripts

GPT-4 has its weakest performance in handling transcripts, which is surprising, since transcripts have much simpler numbering than the hierarchical text in synthetic sections or the U.S. Code. For example, on BLT-4k's transcripts, GPT-4 gets 82% on text→cite and 78% on cite→text.

4.1.1 transcript text→cite

To further investigate, we generate 1,000 new transcript text→cite prompts in the same format as BLT-4k (25 lines per page, with half being one-page and half being two-page) and run them against GPT-4. It achieves 87.5% on these 1,000.

The biggest determinant of performance is whether the transcript was a single page or two pages. GPT-4 correctly answered 91% of single-page transcript prompts, but just 84% of 2-page transcript prompts. This makes sense, since 2-page transcripts have 50 lines of text, whereas 1-page transcripts have just 25 lines of text. (An example of GPT-4 getting a wrong answer on a 2-page transcript appears in Appendix C.)

To see whether GPT-4 is confused solely by the greater number of lines or by having the text split into two pages, we generate 500 new prompts with single pages but with 50 lines per page. (In other words, it is all one page, but with line numbers starting at 1: and ending at 50:, followed by the question.) We find GPT-4 achieves 84.8% accuracy, nearly identical to the two-page transcripts, indicating the problem with them is length, not being split into two pages. On both one-page and two-page transcripts, most errors are either identifying the line after the correct one (61% of errors) or before the correct one (25% of errors). The full error breakdowns are in Appendix D.

¹⁵ Thus on BLT-4k there are 11 × 100 = 1,100 test prompts, spread across 11 JSONL files, as well as 11 × 1,000 = 11,000 training prompts across 11 JSONL files.
¹⁶ https://github.com/BlairStanek/BLT/
¹⁷ Specifically, we used GPT-3.5-turbo-1106 and GPT-4-0613, respectively.
¹⁸ We also were unable to get access to GPT-4-32k or chat-bison-32k.
          model            transcripts     synthetic section          U.S. Code             mean
                           t→c   c→t     t→c  c→t  d→c  c→d     t→c  c→t  c→a  d→c  c→d
BLT-4k    GPT-3.5-turbo     53    32      72   38   83   79      89   52   56   77   98    66.3
          GPT-4             82    78      88   97   90  100      98   93   93   98  100    92.5
          PaLM 2            82    18      61   48   95   78      22   61   63    7   95    57.3
BLT-8k    GPT-3.5-turbo     24    17      19   20   65   52      63   58   56   69   89    48.4
          GPT-4             44    26      64   49   82   83      94   74   76   88   97    70.6
          PaLM 2            36     7       1    2    2   41      15   35   35   10   66    22.7
BLT-16k   GPT-3.5-turbo      7     5      25   12   67   57      38   36   36   50   77    37.3

(Column abbreviations: t→c = text→cite; c→t = cite→text; d→c = defined→cite; c→d = cite→defined; c→a = cite→amended.)

Table 2: Accuracy in percent of several models against the test splits of several different sizes of BLT. Since each size of BLT's test split consists of 100 prompts (with answers), the numbers are both the number correct and the percent accuracy.

If we graph GPT-4's accuracy on two-page transcripts with 25 lines per page, with respect to the page and line being requested (grouped into buckets of 5 lines), we do not see a 'U' as suggested by Liu et al. (2023), but rather a decreasing 'w':

[Graph: GPT-4 accuracy by requested page and line, bucketed into groups of 5 lines]

4.1.2 transcript cite→text

GPT-4 also performed poorly on cite→text, so we created 1,000 new prompts in the same format as BLT-4k. GPT-4 got 75.7% accuracy on these 1,000. Here we did not see a significant difference between one-page and two-page transcripts, which got 76.6% and 74.8% respectively. We also created 1,000 new prompts that were one page, with just 15 lines on the page (as opposed to the normal 25). GPT-4 got 77.5% on those shorter transcripts, suggesting a slight accuracy improvement as the number of lines decreases.

4.2 GPT-4's poor performance on synthetic text→cite

Other than transcripts, GPT-4's worst performance on BLT-4k is on synthetic sections text→cite at 88%; GPT-4 performs even worse on it in BLT-8k, at 64%. GPT-4's errors (which are in Appendix E) show that its most common error is getting a single portion of the hierarchical citation wrong. For example, when the correct answer is "section 2553(b)(2)(B)(i)", GPT-4 returned "section 2553(b)(1)(B)(i)". This is an easy mistake for a casual observer to make, but lawyers must not make such errors in court filings, as it refers to an entirely different portion of the document.

4.3 Recognizing amended text (cite→amended)

We added the test cite→amended for the U.S. Code to determine whether LLMs would over-rely on their knowledge of the U.S. Code from their training and miss a small dummy amendment to the requested line of the statute. We added a specific error code to track when the LLM returned the unamended text rather than the amended text actually put into the prompt. With 700 cite→amended prompts across the 7 configurations tested in Table 2, the unamended text was returned only 5 times: chat-bison-001 did so twice on BLT-4k and twice on BLT-8k; GPT-3.5-turbo did so once on BLT-8k. GPT-4 never returned the unamended text.
text and task                not tuned   finetuned
transcript text→cite             53         100
transcript cite→text             32          99
synthetic text→cite              72          98
synthetic cite→text              38         100
synthetic defined→cite           83         100
synthetic cite→defined           79         100
uscode text→cite                 89         100
uscode cite→text                 52         100
uscode cite→amended              56         100
uscode defined→cite              77         100
uscode cite→defined              98         100

Table 3: Results of fine-tuning GPT-3.5-turbo on 9,900 training samples from BLT-4k. Both numerical columns contain percent accuracy on BLT-4k's test prompts. We see improvement to near-perfect scores.

5 Fine Tuning

We fine-tune the 4,000-token version of GPT-3.5-turbo, since we have not been granted fine-tuning access to GPT-4 or the 16k-token version of GPT-3.5-turbo. Because we are limited to 4,000 tokens, we naturally do fine-tuning with BLT-4k's training set. For each of the 11 task types, BLT-4k has a training set with 1,000 prompts and answers, for a total of 11,000. Of these, we use 90% for training. We train for two epochs with the default learning rate. The results of the fine-tuning are in Table 3. We find that fine-tuning brings GPT-3.5-turbo, which is not OpenAI's most advanced model, to near the 100% performance expected of lawyers and paralegals.
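The released train files are JSONL. For GPT-3.5-turbo, OpenAI's chat fine-tuning format is a list of messages per line; an illustrative cite→text training record (content adapted from Figure 2, not an actual line from the released files) might look like:

import json

# Illustrative only: one chat-format fine-tuning record, built from
# the synthetic section in Figure 2. The real BLT-4k train files may
# phrase prompts and answers differently.
example = {
    "messages": [
        {"role": "user", "content": (
            "Section 5217. Definition of cleight.\n"
            '(a) General rule. The term "cleight" means any baitisolist or any roussiont.\n'
            '(b) The term "baitisolist" means any ballinated or any caset.\n'
            '(c) The term "roussiont" means any dicemercu or any accodpoileare.\n'
            "\n"
            "What is the exact text of just section 5217(b) above?")},
        {"role": "assistant", "content": (
            'The term "baitisolist" means any ballinated or any caset.')},
    ]
}

with open("blt4k_cite_to_text_train.jsonl", "a") as f:  # hypothetical filename
    f.write(json.dumps(example) + "\n")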
We also test how this finetuned GPT-3.5-turbo performs on two related tasks: SARA, and the applies-to test with synthetic sections. For SARA, we used the 276 cases where the answer is entail/contradict, with each prompt consisting of each U.S. Code section mentioned in the case, plus the facts (i.e., the premise) and the hypothesis (i.e., the question). Without fine-tuning, GPT-3.5-turbo's accuracy was 54.3% (150/276), but with our fine-tuning it is 60.9% (168/276).

For applies-to, we used the code of Blair-Stanek et al. (2023) to generate 1,000 various-sized synthetic sections and questions. Recall that BLT's synthetic sections use nonces like "cleight" and "roussiont" as in Figure 2. To ensure that our applies-to synthetic sections look different than this, we do not use nonces, but rather use what Blair-Stanek et al. (2023) calls ids, like "b44" or "m22". Thus, the facts and question look like "Alexis is a q55. Is section 1661(c) applicable to Alexis? Let's think step by step." Examples of such prompts appear in Appendix F.

Without fine-tuning, GPT-3.5-turbo gets an accuracy of 79% on applies-to. With fine-tuning, the accuracy actually falls to 67%. This result was unexpected.

BLT-4k's training data had prompts with chunks of text like synthetic sections followed by a single question, with the correct answer always being very short – either a line of text, a definition, or a citation. This training data taught our finetuned GPT-3.5-turbo that a chunk of text followed by a single question should have a very short answer. The mean number of words used by GPT-3.5-turbo to respond, with its step-by-step reasoning, to one of the applies-to queries fell from 59 before fine-tuning to 19 after fine-tuning. This truncated the proper reasoning. A representative example of the finetuned and non-tuned versions answering the same prompt with very different chains of reasoning and conclusions appears in Appendix F. Future research should focus on how to improve performance on BLT without such side effects.

6 Conclusion

We demonstrate that the best currently available LLMs perform very poorly at many basic legal text-handling tasks. The chief innovation officer at the large international law firm Baker & McKenzie observed to the New York Times of LLMs, "At its best, the technology seems like a very smart paralegal." (Lohr, 2023). It might be more accurate to say that LLMs are like very sloppy paralegals. We find poor performance from GPT-3.5-turbo and PaLM 2 on our smallest test set, BLT-4k, and poor performance from even GPT-4 on portions of BLT-4k, like finding the text on one or two pages of deposition transcript. But fine-tuning on BLT-4k's training set brings performance of GPT-3.5-turbo up to the expected human level of performance.

While we are focused on law, the BLT tasks are low-level enough that we would expect these findings to be relevant to anyone, regardless of domain. Moreover, this poor performance of advanced LLMs as-is shows the importance of consulting domain experts in the training of future LLMs.
References

Andrew Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme. 2023. OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax? Tax Notes Federal, 180:1101–1105.

Andrew Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme. 2023. Can GPT-3 Perform Statutory Reasoning? In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, ICAIL '23, pages 22–31.

Bryan Garner, editor. 2019. Black's Law Dictionary.

Neel Guha, Daniel E. Ho, Julian Nyarko, and Christopher Ré. 2022. LegalBench: Prototyping a collaborative benchmark for legal reasoning.

Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. CUAD: An expert-annotated NLP dataset for legal contract review. arXiv preprint arXiv:2103.06268.

Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2020. A dataset for statutory reasoning in tax law entailment and question answering. In Proceedings of the Natural Legal Language Processing Workshop 2020, co-located with the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2020), volume 2645 of CEUR Workshop Proceedings, pages 31–38. CEUR-WS.org.

HYPC, editor. 2020. The Bluebook: A Uniform System of Citation, 21st edition.

Vladimir I. Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts.

Steve Lohr. 2023. A.I. threatens lawyers? We've heard this before. New York Times, page B1.

Eric Martínez. 2023. Re-evaluating GPT-4's bar exam performance.

OpenAI. 2023a. GPT-4 developer livestream.

OpenAI. 2023b. GPT-4 technical report.

Drescher ProParalegal. 2017. Practice tip sheet.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Libin Zhang. 2023. Tax questions for language models. Tax Notes Federal, 179:1699.

A Synthetic Section Sizes

Larger versions of BLT have longer and more complicated prompts. Below are the sizes of synthetic sections in each size of BLT.

Version     Sizes
BLT-4k      2 wide, 2 deep
            2 wide, 3 deep
            2 wide, 4 deep
            2 wide, 5 deep
            3 wide, 2 deep
            3 wide, 3 deep
            3 wide, 4 deep
            4 wide, 2 deep
            4 wide, 3 deep
BLT-8k      2 wide, 6 deep
            3 wide, 5 deep
            4 wide, 4 deep
            7 wide, 3 deep
            20 wide, 2 deep
BLT-16k     5 wide, 4 deep
            8 wide, 3 deep
            9 wide, 3 deep
            30 wide, 2 deep
BLT-32k     3 wide, 6 deep
            4 wide, 5 deep
            6 wide, 4 deep
            12 wide, 3 deep
            11 wide, 3 deep
            44 wide, 2 deep
            40 wide, 2 deep
BLT-64k     7 wide, 4 deep
            16 wide, 3 deep
            15 wide, 3 deep
            14 wide, 3 deep
            13 wide, 3 deep
            60 wide, 2 deep
            65 wide, 2 deep
BLT-128k    4 wide, 6 deep
            5 wide, 5 deep
            8 wide, 4 deep
            9 wide, 4 deep
            20 wide, 3 deep
            80 wide, 2 deep

B Transcript Quotation Sizes

Larger versions of BLT have longer prompts. Below are the numbers of pages of transcript quotation used in each size of BLT. The generated prompts are distributed uniformly among these page sizes. For example, half of BLT-4k's prompts have one-page deposition quotations and the other half have two-page deposition quotations.
Version     Transcript Pages
BLT-4k      1, 2
BLT-8k      5, 10, 15
BLT-16k     25, 40
BLT-32k     30, 60, 80
BLT-64k     100, 130
BLT-128k    120, 140

C Example GPT-4 failing on a 2-page transcript text→cite Prompt

Page 68:
1: after the accident?
2: A. He called me – that same day,
3: he called me at nighttime and asked me how
4: I was.
5: Q. He called you and asked you how
6: you were?
7: A. Yes.
8: Q. What did you tell him?
9: A. I just told him that I wasn't
10: feeling good, that I was just laying in
11: bed.
12: Q. And what did he say?
13: A. He said take the pills I gave
14: you, you'll be ready and better by next
15: week. He was just giving me something like
16: a fast way for me to feel up, to cheer up.
17: It wasn't helping.
18: Q. Going back to your primary care
19: physician. You said she wanted to send you
20: to a specialist?
21: A. Yes.
22: Q. But then you called workers'
23: comp?
24: A. Yeah, that's when the issue
25: happened and I called workers' comp.
Page 69:
1: Q. Then who did you treat with
2: next?
3: A. I treated with CityMD.
4: Q. You said CityMD?
5: A. Yes.
6: Q. Was the place that you went to
7: the first time the same place that you went
8: like after you went to your primary care
9: physician?
10: A. Excuse me again?
11: MR. GROSSMAN: You said you
12: went to CityMD. She thinks you went
13: twice, did you go twice?
14: THE WITNESS: No.
15: MR. GROSSMAN: Then you're not
16: listening to the question. You went
17: to CityMD first, then you went to
18: your primary physician. Now, she
19: wants to know where you went next.
20: THE WITNESS: No, first I went
21: to the primary.
22: MR. GROSSMAN: First primary,
23: and then CityMD and then where did
24: you go?
25: THE WITNESS: I was in CityMD

What are the page number and line number of the line above with the text "your primary physician. Now, she"?

The page number is 69 and the line number is 19.

GPT-4's answer above is wrong, since the requested text is on line 18.

D GPT-4 Errors on Transcripts

GPT-4 performs surprisingly poorly on transcripts, even the 1- and 2-page transcripts in BLT-4k. Below are qualitative and quantitative breakdowns of GPT-4's errors on the large runs used to investigate the problem. Our API-calling code automatically categorizes the errors.
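For transcript text→cite, a minimal sketch of that categorization (our illustration; the labels match those in the tables below):

def categorize_text_to_cite_error(answer_page: int, answer_line: int,
                                  correct_page: int, correct_line: int) -> str:
    """Label a transcript text->cite answer relative to the correct
    page and line."""
    if answer_page < correct_page:
        return "prior page"
    if answer_page > correct_page:
        return "next page"
    delta = answer_line - correct_line
    if delta == 0:
        return "correct"
    if delta == -1:
        return "1 before (i.e. prior line)"
    if delta == 1:
        return "1 after (i.e. next line)"
    return f"{abs(delta)} {'before' if delta < 0 else 'after'}"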
D.1 GPT-4's errors on transcript text→cite from BLT-4k

Recall that we generated 1,000 new BLT-4k-style transcript text→cite prompts. Here were the lines returned, relative to the correct ones:

#    Relative Error
3    prior page
1    19 before
1    3 before
1    2 before
12   1 before (i.e. prior line)
76   1 after (i.e. next line)
1    2 after
1    3 after
1    next page
D.2 GPT-4's qualitative errors on transcript cite→text from BLT-4k

Recall that we generated 1,000 new BLT-4k-style transcript cite→text prompts. Here were the errors:

#     Type of Error
166   wrong line
48    superset
10    not found
9     subset wrong line
7     multiple wrong lines
3     subset

There was a difference in the type of errors on one-page versus two-page transcripts. Almost all (84%) of the errors on two-page transcripts were returning the text from the wrong line (almost always the line before or after the correct one). But one-page transcripts had only 51% wrong-line errors, with 35% being superset errors, where GPT-4 returned not only the text of the requested line, but also part of adjacent lines. Here is an example of a superset error:

...
22: MS. WOLVERTON: Objection. Vague. Whose
23: e-mail records?
24: Q Ms. Abedin's e-mail records located. I
25: apologize.

What is the exact text of just line 25 above?

GPT-4 returned "Q Ms. Abedin's e-mail records located. I apologize." which is actually lines 24 and 25, whereas the prompt asks for "the exact text of just line 25 above". These superset errors are much more understandable from a human perspective than an entirely wrong line.

D.3 GPT-4's errors for wrong line on transcript cite→text from BLT-4k

In the table above we see that of the 1,000 prompts, 166 were "wrong line" errors. Below shows the position, relative to the correct one, from which the text was incorrectly drawn:

#    Relative Error
2    prior page
4    same page, >3 lines before
1    3 lines before
6    2 lines before
81   prior line
39   next line
10   2 lines after
7    3 lines after
6    4 lines after
15   same page, >4 lines after
3    next page

E GPT-4's errors on synthetic text→cite

GPT-4 got 12 of 100 wrong on the synthetic text→cite of BLT-4k. Those errors are in the table below. An actual example of a wrong paragraph is returning "section 2553(b)(1)(B)(i)" when the correct answer is "section 2553(b)(2)(B)(i)". The first part of a statute is the subsection (e.g., the "(b)" in section 2553(b)), the next part is a paragraph (e.g., the "(1)" in section 2553(b)(1)), then a subparagraph, a clause, a subclause, an item, and a subitem.

#   Type of Error
3   wrong subparagraph
3   wrong paragraph
2   wrong paragraph, subparagraph, & clause
1   wrong subsection, paragraph, & subclause
1   wrong subsection, paragraph, subparagraph, & clause
2   not parallel

Above, we see six errors are just one part of the citation wrong, four involve multiple parts wrong (but with the correct structure), and two were totally not parallel. An actual answer that is "not parallel" is returning "section 9680(a)(1)(B)(2)(C)(i)(I)" when the correct answer is "section 9680(b)(2)(C)(i)(I)".

GPT-4 got 36 of 100 wrong on the same task in BLT-8k, and those errors are below. An actual example of a wrong item is returning "section 3632(c)(2)(B)(ii)(II)(aa)" when the correct answer is "section 3632(c)(2)(B)(ii)(II)(bb)".
#   Type of Error
7   wrong paragraph
5   wrong subclause
5   wrong subparagraph
5   wrong clause
2   wrong item
2   wrong subsection
3   wrong subsection, wrong paragraph
1   wrong paragraph, wrong subparagraph
2   wrong paragraph, subparagraph, & clause
1   wrong paragraph, clause, & subclause
3   not parallel

Above we again see that most errors are just getting the citation wrong in a single component.

F Finetuned worse on applies-to

In Section 5 we found – unexpectedly – that GPT-3.5-turbo finetuned on BLT-4k performed worse on the applies-to task defined in Blair-Stanek et al. (2023) than the not-finetuned version. It appears that the fine-tuning resulted in much-shortened chains of reasoning, often incomplete, that failed to get to the correct answer.

The word "first" appears in the non-tuned model's response 82% of the time, and 84% of the time for the finetuned model. A good chain of reasoning has either the words "second", "next", or "then" later in its answer. Those words come after "first" 52% of the time with the non-tuned model, but a very low 2% of the time for the finetuned model. Because the finetuned model's reasoning was truncated, its accuracy fell.
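A minimal sketch of this check (our illustration; the paper's measurement code is not shown):

import re

def continues_reasoning(response: str) -> bool:
    """Does "second", "next", or "then" appear after the first
    occurrence of "first"?"""
    m = re.search(r"\bfirst\b", response, re.IGNORECASE)
    if not m:
        return False
    return bool(re.search(r"\b(second|next|then)\b",
                          response[m.end():], re.IGNORECASE))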
Below we have a representative example, with the exact same prompt, showing how the finetuned and not-finetuned models responded.

Finetuned Model Chat:

Section 1661. Definition of t88.
(a) General rule. The term "t88" means-
(1) any q33, or
(2) any a33.
(b) The term "q33" means-
(1) any q55, or
(2) any v33.
(c) The term "a33" means-
(1) any o33, or
(2) any i66.

Alexis is a q55. Is section 1661(c) applicable to Alexis? Let's think step by step.

First, let's check if Alexis is a q55, as mentioned in section 1661(b)(1).

Therefore, the answer (Yes or No) is
Yes

Above we see that the finetuned model has a terse answer that started to give reasoning but then stopped prematurely. Below we see the non-tuned model on the same prompt, giving robust reasoning and reaching the correct result:

Non-Tuned Model Chat:

Section 1661. Definition of t88.
(a) General rule. The term "t88" means-
(1) any q33, or
(2) any a33.
(b) The term "q33" means-
(1) any q55, or
(2) any v33.
(c) The term "a33" means-
(1) any o33, or
(2) any i66.

Alexis is a q55. Is section 1661(c) applicable to Alexis? Let's think step by step.

First, we need to determine if Alexis falls under the definition of "q55" or "v33" as per section 1661(b). Since Alexis is a q55, she falls under the definition of "q55" as per section 1661(b)(1).

Next, we need to determine if section 1661(c) is applicable to Alexis. According to section 1661(c), the term "a33" means any o33 or any i66. Since Alexis is not an o33 or an i66, section 1661(c) is not applicable to her.

In conclusion, section 1661(c) is not applicable to Alexis.

Therefore, the answer (Yes or No) is
No.