Conclusion and Future Work

The document discusses using deep learning-based image analysis of PDF documents to identify sections of scientific publications. The network reliably separated body text from other content such as headers, footers, abstracts, figure captions, and references, reaching a per-pixel classification accuracy of 94.32% on the validation set against a 79.67% baseline. Future work involves improving accuracy with more training data, identifying each section type (title, author, abstract, body text, etc.), and using the network output to extract text from PDFs.
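To make the accuracy figure concrete: the page reports a per-pixel classification accuracy and compares it with a baseline that labels every pixel "not paragraph". The following is a minimal sketch of that comparison, not the authors' evaluation code. The function names `pixel_accuracy` and `not_paragraph_baseline`, the 0/1 mask encoding, and the toy 4x5 masks are all assumptions made for illustration.

```python
from typing import List

# Assumed encoding: 1 = paragraph (body text) pixel, 0 = "not paragraph".
Mask = List[List[int]]


def pixel_accuracy(predicted: Mask, target: Mask) -> float:
    """Fraction of pixels whose predicted label equals the target label."""
    total = 0
    correct = 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            correct += int(p == t)
    return correct / total


def not_paragraph_baseline(target: Mask) -> float:
    """Accuracy obtained by labelling every pixel as "not paragraph" (0)."""
    all_zero = [[0] * len(row) for row in target]
    return pixel_accuracy(all_zero, target)


# Hypothetical toy page, not data from the paper.
target = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
predicted = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

print(f"model accuracy:    {pixel_accuracy(predicted, target):.2%}")
print(f"baseline accuracy: {not_paragraph_baseline(target):.2%}")
```

The baseline matters because most pixels on a page are not body text, so a trivial classifier already scores well; a model's accuracy is only informative relative to that floor.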

Uploaded by

ANANTH UPADHYA


ples of an input provided to the network, examples of network output, and ground truth target output. These results are impressive given such a small dataset. In particular, the network is able to reject header and footer text extremely reliably. The network rejects most abstracts, figure captions and references, confusing only some where the text formatting is extremely similar to typical paragraph text. The per pixel classification accuracy on the validation set was 94.32%, compared to a baseline of classifying each pixel as “not paragraph” which would provide 79.67% accuracy.

6. Conclusion and Future Work

In this paper we demonstrated that deep learning-based image analysis can be used to identify sections of scientific publications. Given the results from our current experiments, we feel that deep learning can be successfully used to enhance current PDF extraction methods. Based on our findings, we plan to continue collecting data in order to further improve our network's results, as we feel many of the misclassified portions of text are due to insufficient training data that does not currently characterize features such as reference sections and abstracts sufficiently.

Our current results show that a deep learning network can successfully distinguish and learn the difference between the body text and other portions of a PDF document. The next step is to extend the approach to identifying each type of text (title, author, abstract, body text, etc.) rather than simply body text versus other. Additionally, we plan to increase the accuracy of our network by adding more data and to create an extraction tool that leverages the output of the deep learning network to extract text. While we are currently evaluating accuracy based on a per pixel count of estimated versus redacted image, an improved test of accuracy would be to leverage such an extraction tool to identify the per character accuracy of this text extraction approach.

7. Bibliographical References

Beel, J., Langer, S., Genzmehr, M., and Müller, C. (2013). Docear’s PDF Inspector: Title Extraction from PDF Files. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries, pages 443–444. ACM.

Constantin, A., Pettifer, S., and Voronkov, A. (2013). Pdfx: fully-automated pdf-to-xml conversion of scientific literature. In Proceedings of the 2013 ACM symposium on Document engineering, pages 177–180. ACM.

Councill, I. G., Giles, C. L., and Kan, M.-Y. (2008). ParsCit: an Open-source CRF Reference String Parsing Package. In LREC, volume 8, pages 661–667.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML ’01 Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436.

Lipinski, M., Yao, K., Breitinger, C., Beel, J., and Gipp, B. (2013). Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press.

Lopez, P. (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. In International Conference on Theory and Practice of Digital Libraries, pages 473–474. Springer.

Mao, S., Rosenfeld, A., and Kanungo, T. (2003). Document structure analysis algorithms: a literature survey. In Tapas Kanungo, et al., editors, Document Recognition and Retrieval X. SPIE, jan.

Peng, F. and McCallum, A. (2004). Accurate information extraction from research papers using conditional random fields. In HLT-NAACL 2004: Human Language Technology Conference of the North America Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference, pages 329–336.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer.

Siegel, N., Lourie, N., Power, R., and Ammar, W. (2018). Extracting Scientific Figures with Distantly Supervised Neural Networks. In To appear in ACM/IEEE Joint Conference on Digital Libraries in 2018 (JCDL 2018). ACM/IEEE.

Singh, M., Barua, B., Palod, P., Garg, M., Satapathy, S., Bushi, S., Ayush, K., Rohith, K. S., Gamidi, T., Goyal, P., and Mukherjee, A. (2016). OCR++: A Robust Framework For Information Extraction from Scholarly Articles. International Conference on Computational Linguistics (COLING), pages 3390–3400.

Tkaczyk, D., Szostek, P., Dendek, P. J., Fedoryszak, M., and Bolikowski, L. (2014). Cermine – Automatic Extraction of Metadata and References from Scientific Literature. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 217–221. IEEE.
