Generalizability in Document Layout Analysis for Scientific Article Figure & Caption Extraction

Naiman, Jill P.

arXiv:2301.10781 (cs)

[Submitted on 25 Jan 2023]

Abstract: The lack of generalizability, in which a model trained on one dataset cannot provide accurate results for a different dataset, is a known problem in the field of document layout analysis. Thus, when a model is used to locate important page objects in scientific literature such as figures, tables, captions, and math formulas, the model often cannot be applied successfully to new domains. While several solutions have been proposed, including newer and updated deep learning models, larger hand-annotated datasets, and the generation of large synthetic datasets, so far there is no "magic bullet" for translating a model trained on a particular domain or historical time period to a new field. Here we present our ongoing work in translating our document layout analysis model from the historical astrophysical literature to the larger corpus of scientific documents within the HathiTrust U.S. Federal Documents collection. We use this example as an avenue to highlight some of the problems with generalizability in the document layout analysis community and discuss several challenges and possible solutions to address these issues. All code for this work is available on The Reading Time Machine GitHub repository.

Comments:	9 pages, 3 figures, submitted as part of AEOLIAN Workshop 5: Making More Sense With Machines: AI/ML Methods for Interrogating and Understanding Our Textual Heritage in the Humanities, Natural Sciences, and Social Sciences
Subjects:	Digital Libraries (cs.DL)
Cite as:	arXiv:2301.10781 [cs.DL]
	(or arXiv:2301.10781v1 [cs.DL] for this version)
	https://doi.org/10.48550/arXiv.2301.10781

Submission history

From: Jill Naiman
[v1] Wed, 25 Jan 2023 19:00:01 UTC (5,112 KB)
