Long Docs
Long Docs
content from documents that are too big to fit into the context window.
One approach for solving this is to chunk the document up and process
each chunk separately, before combining into one list of answers.
Approach
Setup
import textract
import os
import openai
import tiktoken
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your
OpenAI API key if not set as env var>"))
text =
textract.process('data/fia_f1_power_unit_financial_regulations_issue_1_-
_2022-08-16.pdf', method='pdfminer').decode('utf-8')
# Example prompt -
document = '<document>'
When you extract a key piece of information, include the closest page
number.
Use the following format:\n0. Who is the author\n1. What is the amount of
the "Power Unit Cost Cap" in USD, GBP and EUR\n2. What is the value of
External Manufacturing Costs in USD\n3. What is the Capital Expenditure
Limit in USD\n\nDocument: \"\"\"<document>\"\"\"\n\n0. Who is the
author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)
When you extract a key piece of information, include the closest page
number.
1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR
1.
# Split a text into smaller chunks of size n, preferably ending at the end of
a sentence
tokens = tokenizer.encode(text)
i=0
# Find the nearest end of sentence within a range of 0.5 * n and 1.5 *
n tokens
chunk = tokenizer.decode(tokens[i:j])
if chunk.endswith(".") or chunk.endswith("\n"):
break
j -= 1
if j == i + int(0.5 * n):
j = min(i + n, len(tokens))
yield tokens[i:j]
i=j
def extract_chunk(document,template_prompt):
prompt = template_prompt.replace('<document>',document)
messages = [
response = client.chat.completions.create(
model='gpt-4',
messages=messages,
temperature=0,
max_tokens=1500,
top_p=1,
frequency_penalty=0,
presence_penalty=0
# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")
results = []
chunks = create_chunks(clean_text,1000,tokenizer)
results.append(extract_chunk(chunk,template_prompt))
#print(chunk)
print(results[-1])
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not
in x]
zipped
['1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and
EUR: USD 95,000,000 (Page 2); GBP 76,459,000 (Page 2); EUR 90,210,000
(Page 2)',
# Example prompt -
When you extract a key piece of information, include the closest page
number.
print(template_prompt)
When you extract a key piece of information, include the closest page
number.
Document: """<document>"""
1.
results = []
results.append(extract_chunk(chunk,template_prompt))
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not
in x]
zipped
'3. Which years do these financial regulations apply to: 2026 onwards
(Page 1)',
'3. Which years do these financial regulations apply to: 2023, 2024, 2025,
2026 and subsequent Full Year Reporting Periods (Page 2)',
'3. Which years do these financial regulations apply to: 2022-2025 (Page
6)',
'3. Which years do these financial regulations apply to: 2023, 2024, 2025,
2026 and subsequent Full Year Reporting Periods (Page 10)',
'3. Which years do these financial regulations apply to: 2022 (Page 14)',
'3. Which years do these financial regulations apply to: 2022 (Page 16)',
'3. Which years do these financial regulations apply to: 2022 (Page 19)',
'3. Which years do these financial regulations apply to: 2022 (Page 21)',
'3. Which years do these financial regulations apply to: 2026 onwards
(Page 26)',
'3. Which years do these financial regulations apply to: 2026 (Page 2)',
'3. Which years do these financial regulations apply to: 2022 (Page 30)',
'3. Which years do these financial regulations apply to: 2022 (Page 32)',
'3. Which years do these financial regulations apply to: 2023, 2024 and
2025 (Page 1)',
'3. Which years do these financial regulations apply to: 2022 (Page 37)',
'3. Which years do these financial regulations apply to: 2026 onwards
(Page 40)',
'3. Which years do these financial regulations apply to: 2022 (Page 1)',
'3. Which years do these financial regulations apply to: 2026 to 2030
seasons (Page 46)',
'3. Which years do these financial regulations apply to: 2022 (Page 47)',
'3. Which years do these financial regulations apply to: 2022 (Page 1)',
'3. Which years do these financial regulations apply to: 2022 (Page 1)',
'3. Which years do these financial regulations apply to: 2022 (Page 56)',
'3. Which years do these financial regulations apply to: 2022 (Page 1)',
'3. Which years do these financial regulations apply to: 2022 (Page 16)',
'3. Which years do these financial regulations apply to: 2022 (Page 16)']
Consolidation
We've been able to extract the first two answers safely, while the third
was confounded by the date that appeared on every page, though the
correct answer is in there as well.
To tune this further you can consider experimenting with:
The way you chunk your data - we have gone for 1000 tokens with
no overlap, but more intelligent chunking that breaks info into
sections, cuts by tokens or similar may get better results