Demystifying PDF Parsing 01 - Overview - by Florian June - Generative AI
Transforming unstructured documents such as PDF files and scanned images into
structured or semi-structured formats is a critical part of artificial intelligence: the
quality of this step limits what downstream AI systems can do with the documents.
This series of articles categorizes the mainstream methods of PDF parsing and
explores the principles of some representative open-source frameworks. The aim is to
show, from a developer’s perspective, how to develop your own PDF parsing tools.
Regarding open-source frameworks, the focus is not solely on how to use them. The
key is what insights or ideas we can learn from them when building our own tools.
https://generativeai.pub/demystifying-pdf-parsing-01-overview-130f9e4064c2 1/15
7/15/24, 8:55 AM Demystifying PDF Parsing 01: Overview | by Florian June | Generative AI
As the first article in the series, this article defines the task of PDF parsing,
classifies the existing methods, and briefly introduces each of them.
Figure 1: The primary task of PDF parsing. Image by author, the original PDF page is sourced from “Attention
Is All You Need” page 5.
While the task description may appear simple, experience shows that such tasks
often demand more effort than expected.
Method Classification
From my current understanding, methods for building PDF parsing tools can be
primarily divided into the following four categories:
Rule-based: PDF files are parsed according to predefined rules. This method is
fast, but it lacks flexibility.
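
To make the rule-based idea concrete, here is a minimal sketch that labels extracted text lines with a few regular expressions. The rules and the `classify_line` helper are hypothetical illustrations, not taken from any particular framework.

```python
import re

# Hypothetical rules, checked in order; the first matching rule wins.
RULES = [
    ("title", re.compile(r"^\d+(\.\d+)*\s+[A-Z]")),       # e.g. "3.2 Attention"
    ("list", re.compile(r"^(\u2022|-|\*|\d+[.)])\s+")),   # bullets, "1." or "1)"
    ("caption", re.compile(r"^(Figure|Table)\s+\d+[:.]")),
]

def classify_line(line: str) -> str:
    """Return a coarse label for one line of extracted text."""
    text = line.strip()
    for label, pattern in RULES:
        if pattern.match(text):
            return label
    return "text"  # fallback when no rule matches
```

Rules like these run quickly, but they are brittle: a body sentence such as "2024 Annual revenue grew" also matches the title rule, which is the kind of inflexibility mentioned above.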
This series of articles focuses on the first three methods, without discussing rule-
based methods.
Pipeline-Based Methods
This method views the task of parsing PDFs as a pipeline of models or algorithms, as
depicted in Figure 2.
Figure 2: Pipeline-based method. Image by author, the original PDF page is sourced from PubLayNet.
The pipeline-based method can be broadly divided into the following five steps:
1. The original PDF file may have problems such as blurriness or a skewed page
orientation, so preprocessing is needed, for example image enhancement and
orientation correction.
2. Conduct layout analysis, which involves two main components: visual structure
analysis and semantic structure analysis. Visual structure analysis aims to
identify the document’s structure and establish the boundaries of regions with
similar content. Semantic structure analysis then labels these detected regions
with specific types such as text, title, list, table, figure, and so on. The
reading order of the page is also analyzed in this step.
3. Process the different areas identified during layout analysis independently.
This includes table understanding, text recognition, and recognition of other
components such as formulas, flowcharts, and special symbols.
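
The steps listed above can be sketched as a chain of small functions. This is only an illustrative skeleton under stated assumptions: `Region`, `preprocess`, `analyze_layout`, `reading_order`, `recognize`, and `parse_page` are hypothetical names, the models are replaced by placeholders, and the reading order uses a naive two-column heuristic.

```python
from dataclasses import dataclass

@dataclass
class Region:
    label: str          # region type, e.g. "title", "text", "table"
    bbox: tuple         # (x0, y0, x1, y1), origin at the top-left corner
    content: str = ""   # placeholder content for this sketch

def preprocess(page):
    # Step 1 placeholder: a real system would enhance and deskew the image here.
    return page

def analyze_layout(page):
    # Step 2 placeholder: a real system would run a layout detector.
    # In this sketch the "page" is already a list of labeled regions.
    return list(page)

def reading_order(regions, page_width, columns=2):
    # Naive heuristic: assign each region to a column by its left edge,
    # then read columns left to right and regions top to bottom.
    col_width = page_width / columns
    def sort_key(region):
        column = min(int(region.bbox[0] // col_width), columns - 1)
        return (column, region.bbox[1])
    return sorted(regions, key=sort_key)

def recognize(region):
    # Step 3 placeholder: dispatch each region type to its own recognizer.
    recognizers = {
        "table": lambda r: f"<table>{r.content}</table>",
        "title": lambda r: f"# {r.content}",
    }
    default = lambda r: r.content
    return recognizers.get(region.label, default)(region)

def parse_page(page, page_width):
    regions = analyze_layout(preprocess(page))
    return [recognize(r) for r in reading_order(regions, page_width)]

page = [
    Region("text", (320, 100, 560, 380), "right column"),
    Region("table", (50, 400, 280, 600), "cells"),
    Region("title", (50, 40, 560, 80), "Heading"),
    Region("text", (50, 100, 280, 380), "left column"),
]
result = parse_page(page, page_width=600)
```

Keeping each stage behind its own function means a single component, for example the table recognizer, can be swapped without touching the rest of the pipeline.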
The OCR-free approach holds that methods driven by an OCR model, such as
pipeline-based methods, depend on text extracted by an external system. This
leads to higher computing resource usage and longer processing times.
Furthermore, these methods may inherit OCR inaccuracies, which can complicate
document understanding and analysis tasks.
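
As a small illustration of how an OCR inaccuracy can propagate, consider a downstream step that extracts a total from OCR text. The invoice strings, `TOTAL_PATTERN`, and `extract_total` are made up for this sketch.

```python
import re

# Hypothetical downstream rule: read an amount such as "Total: $100.50".
TOTAL_PATTERN = re.compile(r"Total:\s*\$?(\d+\.\d{2})")

def extract_total(ocr_text: str):
    """Return the total as a float, or None if the pattern is not found."""
    match = TOTAL_PATTERN.search(ocr_text)
    return float(match.group(1)) if match else None

clean_text = "Invoice 42\nTotal: $100.50"
noisy_text = "Invoice 42\nTotal: $1O0.50"  # the digit 0 was read as the letter O

clean_total = extract_total(clean_text)
noisy_total = extract_total(noisy_text)
```

With the clean text the total is recovered, while the single-character OCR error makes the same rule return `None`. The downstream step has no way to repair a mistake made upstream.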
Figure 3: OCR-free small model-based method. Image by author, the original PDF page is sourced from
“Attention Is All You Need” page 5.
Figure 4: Large Multimodal Model-Based method by creating prompts. Image by author, the original PDF
page is sourced from “Attention Is All You Need” page 10.
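
A prompt-based setup like the one in Figure 4 could be prepared roughly as follows. This is a sketch only: the payload shape, the prompt wording, and `build_parse_request` are assumptions for illustration and do not correspond to any specific model API.

```python
import base64

def build_parse_request(image_bytes: bytes, output_format: str = "Markdown") -> dict:
    """Bundle a page image and a parsing instruction into one request dict."""
    prompt = (
        "You are a document parser. Convert the page image into "
        f"{output_format}. Preserve headings, tables, and formulas, "
        "and follow the natural reading order."
    )
    return {
        "prompt": prompt,
        # Binary image data is base64-encoded so it can travel as text.
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
    }

# Stand-in bytes; a real call would read a rendered page image from disk.
request = build_parse_request(b"fake-image-bytes")
```

The returned dictionary would then be adapted to whatever request format the chosen multimodal model expects.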
Figure 5: Large Multimodal Model-Based method by fine-tuning. Image by author, the original PDF page is
sourced from “Attention Is All You Need” page 10.
LLaVAR: It collects text-rich training data and uses a higher-resolution CLIP as
the visual encoder to enhance OCR capabilities.
Conclusion
In summary, this article defined the primary task of PDF parsing, categorized the
existing methods, and provided a brief introduction to each.
Please note that this article is based on my current understanding. If I gain new
insights into this field in the future, I will update this article.
If you’re interested in PDF parsing or document intelligence, feel free to check out
my other articles.
Florian June: Document Intelligence and PDF Parsing
Finally, if there are any errors or omissions in this article, or if you have any
thoughts to share, please point them out in the comment section.
AI researcher, focusing on LLMs, RAG, Agent, Document AI, Data Structures. Find the newest article in my
newsletter: https://florianjune.substack.com/