
Demystifying PDF Parsing 01 - Overview - by Florian June - Generative AI

This article provides an overview of PDF parsing, defining its main tasks and classifying existing methods into pipeline-based, OCR-free small model-based, and large multimodal model-based approaches. It emphasizes the importance of transforming unstructured documents into structured formats for AI applications and discusses various frameworks associated with each method. The article serves as the first in a series aimed at exploring PDF parsing techniques and their development.


7/15/24, 8:55 AM Demystifying PDF Parsing 01: Overview | by Florian June | Generative AI


Demystifying PDF Parsing 01: Overview

Task Definition, Method Classification and Method Introduction to PDF Parsing

Florian June

Published in Generative AI
6 min read · May 5, 2024


Transforming unstructured documents like PDF files and scanned images into
structured or semi-structured formats is a critical part of many artificial
intelligence applications. The quality of this step largely determines what
downstream AI systems can do with a document.

In previous articles, I have discussed:

Parsing PDF files and their tables using the Unstructured framework.

Extracting formulas in PDFs with Nougat.

Parsing tables in PDFs with Nougat.

However, these articles focused primarily on using open-source tools to address
specific problems, without discussing how such tools can be developed further.

This series of articles will categorize the mainstream methods of PDF parsing and
explore the principles of some representative open-source frameworks. The goal is
to learn, from a developer’s perspective, how to build your own PDF parsing tools.

Regarding open-source frameworks, the focus is not solely on how to use them. What
matters more is whether we can draw insights or ideas from them.

https://generativeai.pub/demystifying-pdf-parsing-01-overview-130f9e4064c2 1/15

As the first article in the series, this article defines the task of PDF parsing,
classifies the existing methods, and briefly introduces each of them.

The Main Task of PDF Parsing

Figure 1 illustrates the primary task of PDF parsing:

Figure 1: The primary task of PDF parsing. Image by author, the original PDF page is sourced from “Attention
Is All You Need” page 5.

Input: PDF file or image.

Output: Structured or semi-structured files like Markdown, HTML, JSON, or other
formats defined by developers.
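
To make the input/output contract concrete, here is a minimal sketch of what a developer-defined output format might look like. The `Block` and `ParsedPage` classes, their fields, and the sample text are illustrative assumptions, not part of any framework discussed in this series.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Block:
    """One parsed region of a page (hypothetical schema)."""
    kind: str  # e.g. "title", "text", "table", "formula"
    text: str  # recognized content of the region


@dataclass
class ParsedPage:
    """Structured result for a single page (hypothetical schema)."""
    page_number: int
    blocks: List[Block] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize the page, including its nested blocks, to a JSON string.
        return json.dumps(asdict(self), ensure_ascii=False)

    def to_markdown(self) -> str:
        # Render titles as headings and every other block as a paragraph.
        lines = []
        for block in self.blocks:
            prefix = "# " if block.kind == "title" else ""
            lines.append(prefix + block.text)
        return "\n\n".join(lines)


# Placeholder content standing in for a real parsing result.
page = ParsedPage(
    page_number=1,
    blocks=[
        Block(kind="title", text="Section heading"),
        Block(kind="text", text="Body paragraph."),
    ],
)
```

The same in-memory structure can be emitted as JSON or Markdown, which is why it is often convenient to settle on an intermediate representation first and treat the output format as a final rendering step.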

While the task description may appear simple, experience shows that doing it well
demands considerable effort.

Method Classification
From my current understanding, methods for building PDF parsing tools can be
primarily divided into the following four categories:

Pipeline-based: This treats the entire process of PDF parsing as a sequence of
models or algorithms. Each step handles its own sub-task, systematically solving
the overall task.

OCR-free small model-based: This method takes an end-to-end approach to address
the entire task. It views PDF parsing as a form of sequence prediction. A small
model is trained on prepared training data to predict tokens in formats like
Markdown, JSON, or HTML.

Large multimodal model-based: This method uses the capabilities of large
multimodal models and delegates document understanding tasks to them.
Specifically, it frames the various tasks of PDF parsing as sequence prediction.
By using different prompts or by fine-tuning a large multimodal model, we can
guide it to accomplish different tasks, such as layout analysis, table
recognition, and formula recognition. The output will be in formats like
Markdown, JSON, or HTML.

Rule-based: PDF files are parsed according to predefined rules. Although this
method is fast, it lacks flexibility.
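
As a small illustration of the rule-based trade-off, the sketch below labels lines of extracted text with a single hand-written rule. The rule itself (a heading is a line that starts with a section number) and the `classify_lines` helper are assumptions made up for this example.

```python
import re
from typing import List, Tuple

# Hypothetical rule: a heading starts with a section number such as "3" or "3.2".
HEADING_RULE = re.compile(r"^\d+(\.\d+)*\s+\S")


def classify_lines(lines: List[str]) -> List[Tuple[str, str]]:
    """Label each line as 'heading' or 'text' using the single rule above."""
    labeled = []
    for line in lines:
        label = "heading" if HEADING_RULE.match(line) else "text"
        labeled.append((label, line))
    return labeled


result = classify_lines([
    "3.2 Attention",          # numbered heading: matched by the rule
    "Conclusion",             # unnumbered heading: missed by the rule
    "2017 was a busy year.",  # ordinary sentence: wrongly matched by the rule
])
```

The rule runs in a single pass with no model, which is why this style is fast. The second and third inputs show the flexibility problem: any layout the rule author did not anticipate is mislabeled.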

This series of articles focuses on the first three methods, without discussing rule-
based methods.

Pipeline-Based Methods
This method views the task of parsing PDFs as a pipeline of models or algorithms, as
depicted in Figure 2.

Figure 2: Pipeline-based method. Image by author, the original PDF page is sourced from PubLayNet.

The pipeline-based method can be broadly divided into the following five steps:

1. Preprocess the input. The original PDF file may have problems such as blur or
skewed page orientation, so steps like image enhancement and orientation
correction are needed.

2. Conduct layout analysis, which involves two main components: visual structure
analysis and semantic structure analysis. Visual structure analysis aims to
identify the document’s structure and establish the boundaries of its similar
regions. Meanwhile, semantic structure analysis involves labeling these detected
areas with specific document types like text, title, list, table, figure, and so on.
Also, analyze the reading order of the page.

3. Process the different regions identified during layout analysis independently.
This includes table understanding, text recognition, and recognition of other
components such as formulas, flowcharts, and special symbols.

4. Integrate the previous results to restore the page structure.

5. Output structured or semi-structured information, such as Markdown, JSON, or
HTML.
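
The five steps above can be sketched as a chain of functions. This is only a structural sketch under strong assumptions: the page is represented as a plain string instead of an image, every stage is a trivial stand-in for a real model, and all names (`Region`, `analyze_layout`, `parse_page`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """A detected page region (hypothetical structure)."""
    kind: str          # semantic label, e.g. "title" or "text"
    order: int         # position in the reading order
    raw: str           # stand-in for the cropped image of the region
    content: str = ""  # filled in by the recognition step


def preprocess(page: str) -> str:
    # Step 1: stand-in for image enhancement and orientation correction.
    return page.strip()


def analyze_layout(page: str) -> List[Region]:
    # Step 2: stand-in for a layout model. Each non-empty line becomes a
    # region, the first one is labeled "title", and line order is used as
    # the reading order.
    lines = [ln for ln in page.splitlines() if ln.strip()]
    return [
        Region(kind="title" if i == 0 else "text", order=i, raw=ln)
        for i, ln in enumerate(lines)
    ]


# Step 3: one recognizer per region type (real systems would plug in OCR,
# table recognition, formula recognition, and so on).
RECOGNIZERS: Dict[str, Callable[[str], str]] = {
    "title": lambda raw: raw.strip(),
    "text": lambda raw: raw.strip(),
}


def recognize(regions: List[Region]) -> List[Region]:
    for region in regions:
        region.content = RECOGNIZERS[region.kind](region.raw)
    return regions


def restore_structure(regions: List[Region]) -> List[Region]:
    # Step 4: put the recognized regions back into reading order.
    return sorted(regions, key=lambda r: r.order)


def render_markdown(regions: List[Region]) -> str:
    # Step 5: emit a semi-structured format, here Markdown.
    parts = [("# " + r.content) if r.kind == "title" else r.content for r in regions]
    return "\n\n".join(parts)


def parse_page(page: str) -> str:
    regions = analyze_layout(preprocess(page))
    return render_markdown(restore_structure(recognize(regions)))


markdown = parse_page("  Title line\n\nFirst paragraph.\n ")
```

Because each stage only depends on the output of the previous one, a stage can be swapped for a better model without touching the rest, which is the main appeal of the pipeline design.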

It is worth mentioning that PDF parsing is actually a subset of document
intelligence, also known as document AI. Beyond the content shown in Figure 2,
document intelligence also includes:

Information extraction: Entity recognition, relationship extraction.

Document retrieval: Keyword retrieval, structure-based retrieval.

Semantic analysis: Content classification, summary, document QA.

Below are some representative pipeline-based PDF parsing frameworks:

Marker: A lightweight pipeline of deep learning models that converts PDF, EPUB,
and MOBI files into Markdown.

Unstructured: A comprehensive framework that offers good customizability.

LayoutParser: A unified toolkit for deep learning-based document image analysis.

OCR-Free Small Model-Based Methods


Proponents of OCR-free solutions argue that OCR-driven methods, such as
pipeline-based methods, depend on external systems to extract text. This leads to
higher computing resource usage and longer processing times. Furthermore, these
methods may inherit OCR inaccuracies, which can complicate document understanding
and analysis tasks.

These concerns motivate OCR-free small model-based methods, as illustrated in
Figure 3.

Figure 3: OCR-free small model-based method. Image by author, the original PDF page is sourced from
“Attention Is All You Need” page 5.

From a structural perspective, OCR-free methods are relatively straightforward
compared to pipeline-based methods. The key areas of OCR-free methods that
require focus are the construction of training data and the design of the model
structure.
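
To illustrate the training-data side, the sketch below turns a structured label into a flat target string that a sequence model could be trained to predict from the page image. The tag format is loosely modeled on the tag-style target sequences used by Donut-like models, but the exact format, the function names, and the file path are assumptions made for this example.

```python
from typing import Any, Dict, Tuple


def to_target_sequence(label: Any) -> str:
    """Flatten a nested dict into a tagged string (assumed token format)."""
    if isinstance(label, dict):
        # Wrap each field in opening and closing tags, recursing into nested dicts.
        return "".join(
            f"<s_{key}>{to_target_sequence(value)}</s_{key}>"
            for key, value in label.items()
        )
    return str(label)


def build_training_pair(image_path: str, label: Dict[str, Any]) -> Tuple[str, str]:
    """Pair a page image with the string the model should learn to emit."""
    return image_path, to_target_sequence(label)


# Hypothetical file name and label, used only to show the shape of a sample.
pair = build_training_pair(
    "page_0001.png",
    {"title": "Overview", "body": {"text": "Hello"}},
)
```

For Markdown-style targets, the label would simply be the Markdown text of the page, and no flattening step is needed.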

Below are some representative OCR-free small model-based PDF parsing frameworks:

Donut: OCR-free document understanding transformer.

Nougat: Based on the Donut architecture, it is particularly effective on PDF
papers, formulas, and tables.

Dessurt: It’s based on an architecture similar to Donut, combining bidirectional
cross attention with various pre-training methods.

Large Multimodal Model-Based Methods

In the era of LLMs, it’s not surprising to consider using large multimodal models for
PDF parsing.


Figure 4: Large Multimodal Model-Based method by creating prompts. Image by author, the original PDF
page is sourced from “Attention Is All You Need” page 10.

Figure 5: Large Multimodal Model-Based method by fine-tuning. Image by author, the original PDF page is
sourced from “Attention Is All You Need” page 10.

As illustrated in Figures 4 and 5, we can either write prompts for large
multimodal models or fine-tune them so that they complete the various parsing
tasks.
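
The prompt-based variant can be sketched as one model called with a different prompt per task. Nothing here targets a real API: `model` is any callable that accepts a prompt and image bytes, `fake_model` is a stand-in so the example runs offline, and the prompt texts are assumptions.

```python
from typing import Callable, Dict

# Hypothetical task-specific prompts; real prompts would need tuning.
PROMPTS: Dict[str, str] = {
    "layout_analysis": "List every region on this page with its type as JSON.",
    "table_recognition": "Convert the table on this page to an HTML table.",
    "formula_recognition": "Transcribe every formula on this page as LaTeX.",
}


def run_task(task: str, image_bytes: bytes, model: Callable[[str, bytes], str]) -> str:
    """Send the page image to the model together with the prompt for one task."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task}")
    return model(PROMPTS[task], image_bytes)


def fake_model(prompt: str, image_bytes: bytes) -> str:
    # Stand-in for a large multimodal model; it only echoes the prompt.
    return f"[model output for: {prompt}]"


output = run_task("table_recognition", b"", fake_model)
```

Switching tasks only changes the prompt, not the model, which is what distinguishes this approach from a pipeline of task-specific models.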

Below are some representative large multimodal models:

TextMonkey: A large multimodal model that focuses on text-related tasks,
including document Q&A and scene text Q&A. It has achieved state-of-the-art
results on multiple benchmarks.

LLaVAR: It collects text-rich training data and uses a higher-resolution CLIP as
the visual encoder to enhance OCR capabilities.

GPT-4V: A high-quality closed-source large multimodal model.


Conclusion
In summary, this article defined the primary task of PDF parsing, categorized the
existing methods, and provided a brief introduction to each.

Please note that this article is based on my current understanding. If I gain new
insights into this field in the future, I will update this article.

If you’re interested in PDF parsing or document intelligence, feel free to check out
my other articles.


And the latest AI-related content can be found in my newsletter.

Finally, if there are any errors or omissions in this article, or if you have any
thoughts to share, please point them out in the comment section.

This story is published under Generative AI Publication.


