Long Docs

A GPT model (the code below uses gpt-4) can help us extract key figures, dates or other important content from documents that are too big to fit into the model's context window. One approach is to split the document into chunks, process each chunk separately, and then combine the results into one list of answers.

Uploaded by

eliot hyseni


In this notebook we'll run through this approach:

- Load in a long PDF and pull the text out
- Create a prompt to be used to extract key bits of information
- Chunk up our document and process each chunk to pull any answers out
- Combine them at the end
- This simple approach will then be extended to three more difficult questions

Approach

- Setup: Take a PDF, a Formula 1 Financial Regulation document on Power Units, and extract the text from it for entity extraction. We'll use this to try to extract answers that are buried in the content.
- Simple Entity Extraction: Extract key bits of information from chunks of a document by:
    - Creating a template prompt with our questions and an example of the format it expects
    - Creating a function to take a chunk of text as input, combine it with the prompt and get a response
    - Running a script to chunk the text, extract answers and output them for parsing
- Complex Entity Extraction: Ask some more difficult questions which require tougher reasoning to work out

Setup

!pip install textract
!pip install tiktoken

import textract
import os
import openai
import tiktoken

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# Extract the raw text from the PDF using textract
text = textract.process('data/fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf', method='pdfminer').decode('utf-8')

# Collapse double spaces, then replace newlines and semicolons with spaces
clean_text = text.replace("  ", " ").replace("\n", "; ").replace(';',' ')
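To see what the chained `replace` calls do, here is a small sketch on a made-up string. The sample text is an assumption, not taken from the regulation PDF, and the sketch assumes the first call is meant to collapse double spaces.

```python
def clean(text: str) -> str:
    """Mirror the notebook's cleaning chain on an arbitrary string."""
    return text.replace("  ", " ").replace("\n", "; ").replace(";", " ")


# Hypothetical sample, not from the regulation document
sample = "Power Unit Cost Cap\nUSD 95,000,000; see Article 2"
cleaned = clean(sample)
print(repr(cleaned))
```

After this step the text contains no `\n` characters and no semicolons, so a later check for a trailing newline can only ever match on a full stop.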

Simple Entity Extraction

# Example prompt -
document = '<document>'
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author\n1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR\n2. What is the value of External Manufacturing Costs in USD\n3. What is the Capital Expenditure Limit in USD\n\nDocument: \"\"\"<document>\"\"\"\n\n0. Who is the author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)

Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. Who is the author
1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR
2. What is the value of External Manufacturing Costs in USD
3. What is the Capital Expenditure Limit in USD

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.
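Because the two extraction passes in this notebook differ only in their questions, the template could also be assembled from a list. A minimal sketch, assuming a hypothetical helper `build_template` (not part of the original notebook) that reproduces the same layout:

```python
def build_template(questions, example_answer):
    """Assemble an extraction prompt from a list of questions.

    questions[0] is answered by example_answer to show the expected format;
    the prompt ends with "1." so the reply continues from the second question.
    """
    header = (
        "Extract key pieces of information from this regulation document.\n"
        'If a particular piece of information is not present, output "Not specified".\n'
        "When you extract a key piece of information, include the closest page number.\n"
        "Use the following format:\n"
    )
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions))
    return (
        f'{header}{numbered}\n\nDocument: """<document>"""\n\n'
        f"0. {questions[0]}: {example_answer}\n1."
    )


questions = [
    "Who is the author",
    'What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR',
    "What is the value of External Manufacturing Costs in USD",
    "What is the Capital Expenditure Limit in USD",
]
template = build_template(questions, "Tom Anderson (Page 1)")
print(template)
```

Swapping in a different question list yields the second prompt without retyping the escape sequences.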

# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Yield successive n-sized chunks from text."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j
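The boundary search is easier to follow on a tiny input. The sketch below repeats the chunking logic with a toy tokenizer that maps one character to one token; `CharTokenizer` is an assumption made only so the example runs without tiktoken, and real token counts will differ.

```python
class CharTokenizer:
    """Toy stand-in for a real tokenizer: one token per character."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


def create_chunks(text, n, tokenizer):
    """Yield chunks of roughly n tokens, preferring to end on '.' or a newline."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Start from the widest window and shrink until a sentence end is found
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # No sentence end in range: fall back to a plain n-token chunk
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j


tok = CharTokenizer()
text = "Costs are capped. Breaches are penalised. Reports are due yearly."
chunks = [tok.decode(c) for c in create_chunks(text, 20, tok)]
for c in chunks:
    print(repr(c))
```

With `n=20` each chunk ends on a full stop, and joining the chunks gives back the original text, since the chunks neither overlap nor skip tokens.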

def extract_chunk(document, template_prompt):
    # Insert the chunk into the template and send it as a single user message
    prompt = template_prompt.replace('<document>', document)
    messages = [
        {"role": "system", "content": "You help extract information from documents."},
        {"role": "user", "content": prompt}
    ]
    response = client.chat.completions.create(
        model='gpt-4',
        messages=messages,
        temperature=0,
        max_tokens=1500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    # The prompt ends with "1.", so prepend it to restore the numbered list
    return "1." + response.choices[0].message.content
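The template ends with `1.`, so the reply is expected to continue from there, and the function prepends `"1."` to rebuild a complete numbered list. A sketch of that flow that runs offline, assuming a hypothetical `fake_model` in place of the API call and a shortened made-up template:

```python
def extract_chunk_with(complete, document, template_prompt):
    """Same flow as extract_chunk, but the model call is passed in as `complete`."""
    prompt = template_prompt.replace("<document>", document)
    # The prompt ends with "1.", so the reply picks up mid-list;
    # prepend "1." to restore the first item's number.
    return "1." + complete(prompt)


def fake_model(prompt):
    """Hypothetical stand-in for the chat completion call; returns canned text."""
    assert prompt.endswith("1.")
    return " Question one: Not specified\n2. Question two: 42 (Page 3)"


template = 'Answer the questions.\n\nDocument: """<document>"""\n\n0. Example: yes\n1.'
answer = extract_chunk_with(fake_model, "Some chunk of text.", template)
print(answer)
```

Passing the model call in as an argument is one way to test prompt assembly and parsing without spending API calls. A real chat model is not guaranteed to continue exactly from `1.`, so the prefix is worth checking on actual replies.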

# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

results = []

chunks = create_chunks(clean_text, 1000, tokenizer)
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

for chunk in text_chunks:
    results.append(extract_chunk(chunk, template_prompt))
    #print(chunk)
    print(results[-1])

groups = [r.split('\n') for r in results]

# zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped

['1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR: USD 95,000,000 (Page 2); GBP 76,459,000 (Page 2); EUR 90,210,000 (Page 2)',
 '2. What is the value of External Manufacturing Costs in USD: US Dollars 20,000,000 in respect of each of the Full Year Reporting Periods ending on 31 December 2023, 31 December 2024 and 31 December 2025, adjusted for Indexation (Page 10)',
 '3. What is the Capital Expenditure Limit in USD: US Dollars 30,000,000 (Page 32)']
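The combining step transposes the per-chunk answers with `zip(*groups)`, so that all answers to the same question sit together, then flattens and drops the empty ones. A sketch on made-up results (the shortened question labels and values are illustrative assumptions):

```python
from itertools import zip_longest

# Made-up per-chunk answers in the same "N. question: answer" shape
results = [
    "1. Cost cap: Not specified\n2. External costs: Not specified\n3. Capex limit: Not specified",
    "1. Cost cap: USD 95,000,000 (Page 2)\n2. External costs: Not specified\n3. Capex limit: Not specified",
    "1. Cost cap: Not specified\n2. External costs: Not specified\n3. Capex limit: USD 30,000,000 (Page 32)",
]

groups = [r.split("\n") for r in results]
# Transpose: one tuple per question, holding that question's answer from every chunk
by_question = list(zip(*groups))
answers = [x for y in by_question for x in y if "Not specified" not in x and "__" not in x]
print(answers)

# zip stops at the shortest group, so a reply with fewer lines truncates the rest
uneven = [["1. a", "2. b"], ["1. c"]]
truncated = list(zip(*uneven))
padded = list(zip_longest(*uneven))  # pads the missing line with None instead
print(len(truncated), len(padded))
```

If some replies may come back with fewer lines than expected, `itertools.zip_longest` is one option for keeping the later answers, at the cost of filtering out the `None` padding.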

Complex Entity Extraction

# Example prompt -
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author\n1. How is a Minor Overspend Breach calculated\n2. How is a Major Overspend Breach calculated\n3. Which years do these financial regulations apply to\n\nDocument: \"\"\"<document>\"\"\"\n\n0. Who is the author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)

Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. Who is the author
1. How is a Minor Overspend Breach calculated
2. How is a Major Overspend Breach calculated
3. Which years do these financial regulations apply to

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.

results = []

for chunk in text_chunks:
    results.append(extract_chunk(chunk, template_prompt))

groups = [r.split('\n') for r in results]

# zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped

['1. How is a Minor Overspend Breach calculated: A Minor Overspend Breach arises when a Power Unit Manufacturer submits its Full Year Reporting Documentation and Relevant Costs reported therein exceed the Power Unit Cost Cap by less than 5% (Page 24)',
 '2. How is a Major Overspend Breach calculated: A Material Overspend Breach arises when a Power Unit Manufacturer submits its Full Year Reporting Documentation and Relevant Costs reported therein exceed the Power Unit Cost Cap by 5% or more (Page 25)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 1)',
 '3. Which years do these financial regulations apply to: 2023, 2024, 2025, 2026 and subsequent Full Year Reporting Periods (Page 2)',
 '3. Which years do these financial regulations apply to: 2022-2025 (Page 6)',
 '3. Which years do these financial regulations apply to: 2023, 2024, 2025, 2026 and subsequent Full Year Reporting Periods (Page 10)',
 '3. Which years do these financial regulations apply to: 2022 (Page 14)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)',
 '3. Which years do these financial regulations apply to: 2022 (Page 19)',
 '3. Which years do these financial regulations apply to: 2022 (Page 21)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 26)',
 '3. Which years do these financial regulations apply to: 2026 (Page 2)',
 '3. Which years do these financial regulations apply to: 2022 (Page 30)',
 '3. Which years do these financial regulations apply to: 2022 (Page 32)',
 '3. Which years do these financial regulations apply to: 2023, 2024 and 2025 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 37)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 40)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2026 to 2030 seasons (Page 46)',
 '3. Which years do these financial regulations apply to: 2022 (Page 47)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 56)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)']
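The list above repeats question 3 many times with conflicting answers. One way to inspect such output is to parse each line and count the distinct answers per question. A sketch using a few lines copied from the output above; the regular expression is an assumption about the line shape, not something the notebook defines.

```python
import re
from collections import Counter

# A few lines copied from the output above
lines = [
    "3. Which years do these financial regulations apply to: 2026 onwards (Page 1)",
    "3. Which years do these financial regulations apply to: 2022 (Page 14)",
    "3. Which years do these financial regulations apply to: 2022 (Page 16)",
    "3. Which years do these financial regulations apply to: 2026 onwards (Page 26)",
    "3. Which years do these financial regulations apply to: 2022 (Page 19)",
]

# Assumed line shape: "<n>. <question>: <answer> (Page <p>)"
pattern = re.compile(r"^(\d+)\. (.+?): (.+?) \(Page (\d+)\)$")

counts = Counter()
for line in lines:
    match = pattern.match(line)
    if match:
        number, _question, answer, _page = match.groups()
        counts[(number, answer)] += 1

for (number, answer), count in counts.most_common():
    print(f"Q{number}: {answer!r} x{count}")
```

A tally like this is a diagnostic rather than a way to pick the answer: the most frequent value can simply be one that recurs on many pages.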

Consolidation

We've been able to extract the first two answers safely, while the third was confounded by the date that appeared on every page, though the correct answer is in there as well.

To tune this further you can consider experimenting with:

- A more descriptive or specific prompt
- If you have sufficient training data, fine-tuning a model to produce this kind of output reliably
- The way you chunk your data: we have gone for chunks of roughly 1000 tokens with no overlap, but more intelligent chunking that breaks the text into sections, cuts by tokens or similar may get better results
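As one illustration of the chunking option, consecutive chunks can share a few tokens so that a sentence cut at a boundary still appears whole in one of them. A minimal sketch, assuming a hypothetical helper `chunk_with_overlap` that works on any token list (it is not part of the notebook):

```python
def chunk_with_overlap(tokens, size, overlap):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(10))  # stand-in for tokenizer.encode(text)
windows = chunk_with_overlap(tokens, size=4, overlap=1)
for window in windows:
    print(window)
```

Overlap means some text is sent to the model twice, so the same answer may come back from two neighbouring chunks and need de-duplicating when the results are combined.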

However, with minimal tuning we have now answered 6 questions of varying difficulty using the contents of a long document, and have a reusable approach that we can apply to any long document requiring entity extraction. We look forward to seeing what you can do with this!
