


Unveiling Document Structures with YOLOv5 Layout Detection

A Preprint

Herman Sugiharto
Department of Informatics, Siliwangi University, Tasikmalaya, Indonesia
[email protected]

Yorisa Silviana
Department of Informatics, Siliwangi University, Tasikmalaya, Indonesia
[email protected]

Yani Siti Nurpazrin
Department of Informatics, Siliwangi University, Tasikmalaya, Indonesia
[email protected]

September 29, 2023

Abstract
The current digital environment is characterized by the widespread presence of data, particularly
unstructured data, which poses many issues in sectors including finance, healthcare, and education.
Conventional techniques for data extraction encounter difficulties in dealing with the inherent variety
and complexity of unstructured data, hence requiring the adoption of more efficient methodologies.
This research investigates the utilization of YOLOv5, a cutting-edge computer vision model, for the
purpose of rapidly identifying document layouts and extracting unstructured data.
The present study establishes a conceptual framework for delineating the notion of "objects" as they
pertain to documents, incorporating various elements such as paragraphs, tables, photos, and other
constituent parts. The main objective is to create an autonomous system that can effectively recognize
document layouts and extract unstructured data, hence improving the effectiveness of data extraction.
In the conducted examination, the YOLOv5 model exhibits notable effectiveness in the task of
document layout identification, attaining a high accuracy rate along with a precision value of 0.91, a
recall value of 0.971, an F1-score of 0.939, and an area under the receiver operating characteristic
curve (AUC-ROC) of 0.975. The remarkable performance of this system optimizes the process of
extracting textual and tabular data from document images. Its prospective applications are not limited
to document analysis but can encompass unstructured data from diverse sources, such as audio data.
This study lays the foundation for future investigations into the wider applicability of YOLOv5 in
managing various types of unstructured data, offering potential for novel applications across multiple
domains.

Keywords layout detection · unstructured data · YOLOv5

1 Introduction
In the contemporary and dynamic digital age, there has been a substantial rise in the generation and utilization of data. Unstructured data, which refers to data that does not possess a predetermined format, holds significant importance across diverse domains including banking, healthcare, and education Adnan and Akbar [2019a]. A significant portion of the data contained in documents is found in unstructured formats and exhibits variability in style and presentation, posing difficulties in the extraction of crucial information Adnan and Akbar [2019b]. When faced with these variances and complexities, conventional methods of data extraction frequently prove ineffective and inefficient Zaman et al. [2020]. To tackle this matter, technologies such as artificial intelligence and computer vision have facilitated data extraction and processing. Nevertheless, there remains room for improvement in speed, precision, and effectiveness Diwan et al. [2022].
Detecting objects is a fundamental task in computer vision with numerous applications, including layout detection. Throughout the years, the YOLO (You Only Look Once) line of models has emerged as a prominent solution for real-time object detection, renowned for its exceptional speed and accuracy Jiménez-Bravo et al. [2022]. YOLOv5, a recent edition of the YOLO family, demonstrates notable advancements in accuracy and precision compared to its predecessors. While YOLOv4 showed remarkable performance, YOLOv5 has been rigorously crafted to augment accuracy while maintaining efficient inference speed Kaur and Singh [2022], Arifando et al. [2023]. Through a combination of architectural refinements, novel data augmentation techniques, and a carefully curated training process, YOLOv5 achieves superior object detection capabilities Hussain [2023].
This study’s primary objective is to investigate and enhance the application of techniques for identifying document
layouts and extracting unstructured data using the YOLOv5 framework. This study defines "objects" as the many
components found within documents, including but not limited to paragraphs, tables, photographs, and other similar
items. The primary aim of this study is to develop and deploy a system capable of autonomously identifying document
layouts and efficiently and precisely extracting unstructured data from these documents. This study is expected to
provide a valuable contribution towards enhancing the efficacy of unstructured data extraction.

2 Related Work

Numerous studies on layout detection and applications of the YOLOv5 architecture have been conducted in the past. In a meticulously executed research project by Pfitzmann et al. [2022], the academic community was introduced to the DocLayNet dataset. This dataset marks a significant transformation in the domain of document layout research, providing an extensive collection of meticulously annotated document layouts. It consists of a total of 1,107,470 annotated objects spanning a wide range of object classes, including text, images, mathematical formulas, code snippets, page headers and footers, and intricate tabular structures. In contrast, the research undertaken by Pillai and Mangsuli [2021] followed a different path, focusing on data derived from the oil and gas industry. That study utilized advanced transformer architectures to address the challenging problem of detecting and extracting layout components embedded within intricate documents from this particular domain.
The YOLOv5 framework has been employed in a multitude of computer vision research endeavors, encompassing
several domains such as object recognition Diwan et al. [2022], Yue et al. [2022], Kitakaze et al. [2020], object tracking
Alvar and Bajic [2018], Younis A. Al-Arbo [2021], Kumari et al. [2021], and video analysis Wang et al. [2022], Gu
et al. [2022]. In the aforementioned experiments, YOLOv5 has exhibited a notable level of precision in conjunction
with its user-friendly nature.
In this exhaustive study, the research team has developed a sophisticated system that goes beyond layout detection;
it incorporates the intricate task of layout extraction guided by meticulously predefined classes. At the core of this
robust system lies YOLOv5, an advanced deep learning framework that serves as the layout detector. Its presence and
performance in the system contribute significantly to the overarching framework’s exceptional precision and efficacy.
The primary objective of this research is to revolutionize the processing of unstructured data, with particular concentration on PDF documents generated from scanned sources. Such documents pose a significant obstacle for traditional methods of extracting text from PDF files, which are typically hindered by the complexities of scanned images. The approach employed by the study team holds the potential to surpass existing constraints, providing a powerful solution to the challenging task of efficiently extracting information from these documents. As we progress further into the era of digital transformation, the advances made by this research promise substantial gains in document processing, bridging the divide between unstructured data and actionable insights.

3 Methodology

The research is a quantitative study with an experimental approach. The experimental approach is chosen because the
aim of this research is to determine the cause-and-effect relationships among existing variables such as datasets, model
architectures, and model parameters (Williams, 2007).


The novelty targeted by this proposed research lies in the utilization of YOLOv5 for detecting layouts within a
document.

Literature Review The literature survey was undertaken in order to gain a comprehensive understanding of the
concepts and theories that are relevant to the research. This includes exploring the theoretical foundations of the YOLO
architecture, examining the process of data labeling, and investigating the techniques used for layout detection. The data
was obtained from secondary sources, including online platforms, academic publications, electronic books, scholarly
papers, and other relevant materials. Furthermore, in the literature review phase, a comprehensive examination of prior
scholarly articles was conducted to assess the research that pertains to the present research subject.

Problem Definition Through an examination of prior research, several gaps or weaknesses within these studies were
uncovered, hence highlighting opportunities for prospective enhancements. After identifying gaps or weaknesses, the
researchers generated research questions to establish the aims of the next study.

Data Collection During this phase, the data underwent preparation in order to train the forthcoming layout detection model. The dataset consisted of images depicting the layouts of documents sourced from a variety of academic journals. The data was subsequently annotated using Label Studio, employing pre-established categories.

Model Training During this stage, the existing YOLOv5 architecture was trained using optimal parameters to produce
an appropriate model. The model was trained using the provided hardware and labeled data.

Model Evaluation During this phase, the trained model was subjected to several tests utilizing the pre-existing
provided data. The evaluation process additionally incorporated manual human assessment in order to augment the
validity of the evaluation data. The evaluation process involved the utilization of metrics such as accuracy, precision,
and F1 score for the purpose of calculations.

Conclusion Drawing conclusions provided an overview of the data analysis and model evaluation, encompassing the
entirety of the research.

4 Results and Discussion


4.1 Base Model

YOLO was initially proposed by Redmon et al. [2016] in 2016. This method gained recognition for its real-time
processing speed of 45 frames per second. Simultaneously, the method maintained competitive performance and even
achieved state-of-the-art results on popular datasets.
YOLOv5 is designed for fast and accurate real-time object detection. This algorithm offers several performance
enhancements compared to its previous versions Redmon and Farhadi [2016], Redmon et al. [2016], Redmon and
Farhadi [2018], including improved speed and detection capabilities. One of the key advantages of YOLOv5 is its
ability to conduct object detection swiftly on resource-constrained devices such as CPUs or mobile devices. This
enables researchers or academics to perform real-time object detection rapidly without sacrificing accuracy Jocher et al.
[2022].

Figure 1: YOLOv5 architecture Jocher et al. [2022].


The architectural design of YOLOv5, as illustrated in Figure 1, comprises three main components: the Backbone, PANet, and the Output. The Backbone, alternatively referred to as the feature extractor, is the component of the network tasked with extracting fundamental features from the input image; YOLOv5 incorporates the CSPDarknet53 architecture for this purpose. The Path Aggregation Network (PANet) is a key element of the YOLOv5 framework, designed to effectively aggregate information from multiple scales. PANet facilitates the integration of contextual information across scales, enhancing the ability to recognize objects of varying sizes. The model outputs a set of bounding boxes and corresponding class labels, representing the detected objects in the given image. According to Jin [2022], bounding boxes establish the precise coordinates and dimensions of objects within an image, while class labels identify the category to which each detected object belongs.
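As a minimal illustration of this output representation (not code from the paper), a YOLO-style prediction stores each box as a normalized center, width, and height; converting it to absolute pixel corners is a simple linear mapping. The image dimensions below are arbitrary example values.

```python
def yolo_to_corners(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height in [0, 1])
    to absolute pixel corner coordinates (x1, y1, x2, y2)."""
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return x1, y1, x2, y2

# A hypothetical detection centered on a 640x480 page image.
print(yolo_to_corners(0.5, 0.5, 0.25, 0.1, img_w=640, img_h=480))
# -> (240.0, 216.0, 400.0, 264.0)
```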

4.2 Layout Detection

The technique of Layout Detection is utilized to ascertain the configuration of elements within a document Vitagliano
et al. [2022]. In this study, the term "layout" refers to the various components that comprise the structure of a layout,
including titles, text, photos, captions, and tables, as seen in Figure 2. The data extraction process for detected
documents is determined based on the specific type of data contained inside them. The process of extracting data is
depicted in Figure 3.

Figure 2: Document Layout.

The extraction components used in this research are as follows:

Optical Character Recognition (OCR) This method is employed to transform text data present in scanned documents
into editable and searchable text Billah et al. [2015]. The OCR framework used in this research is Tesseract. Tesseract
is a framework developed by Google for optical character recognition needs, offering ease of use Smith [2007].


Table Extraction Table extraction encompasses two components: table structure recognition and OCR. Table structure recognition is used to detect the structure of tables, including rows, columns, and cells; the PubTables-1M model Smock et al. [2021] is utilized for this purpose. This model accurately analyzes tables originating from images.
The extracted data will be combined into a JSON format and sorted based on the coordinate positions of the data
components. Consequently, the obtained data will include component coordinates (x1, y1, x2, y2), component classes
(such as text, tables, etc.), and data, as depicted in Figure 3.

Figure 3: Layout Detection Flow.
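The assembly step described above can be sketched in Python. The field names (`coords`, `class`, `data`) and the top-to-bottom, left-to-right sort key are illustrative assumptions; the paper specifies only that components are combined into JSON and ordered by their coordinates.

```python
import json

def assemble(detections):
    """Sort detected components top-to-bottom, then left-to-right,
    and serialize them with their coordinates, class, and extracted data."""
    ordered = sorted(detections, key=lambda d: (d["y1"], d["x1"]))
    return json.dumps(
        [{"coords": [d["x1"], d["y1"], d["x2"], d["y2"]],
          "class": d["class"], "data": d["data"]} for d in ordered],
        ensure_ascii=False)

# Hypothetical detections from one page, in arbitrary order.
detections = [
    {"x1": 40, "y1": 300, "x2": 560, "y2": 420, "class": "table", "data": [["a", "b"]]},
    {"x1": 40, "y1": 50, "x2": 560, "y2": 90, "class": "title", "data": "Some Title"},
    {"x1": 40, "y1": 120, "x2": 560, "y2": 280, "class": "text", "data": "Body text..."},
]
print(assemble(detections))
```

After sorting, the serialized list reads in natural page order: title, then text, then table.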

4.3 Dataset

The dataset included in this study comprises 153 PDF pages that have been transformed from diverse sources, such as
books and sample journals. The data was subsequently tagged utilizing Label Studio Tkachenko et al. [2020-2022] with
the subsequent classes:

Table 1: Data Classes.


Class Description
Title Attribute referring to the book title
Text Attribute referring to the text within the book
Image Attribute indicating images on the book page
Caption Attribute for captions of images or tables
Image_caption Group box for images and captions
Table Attribute for tables in the book
Table_caption Group box for tables and captions

Each page within the used dataset has a varying number of classes due to the distinct structures of each page. The
classes for the training data are indicated as shown in Figure 4.

Figure 4: Data train class.


The training data consists of 143 layout images, while the test data comprises 10 layout images, with data classes visible in Figure 5.

Figure 5: Data test class.

4.4 Training Model

When conducting training, the parameters employed are outlined in Table 2.

Table 2: Training parameters


Parameter Value
Model variant YOLOv5 S
Epoch 500
Image Size 640
Patience 100
Cache RAM
Device GPU
Batch size 32
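The parameters in Table 2 correspond to a YOLOv5 training invocation along the following lines. This is a hedged sketch: the dataset configuration file name (`layout.yaml`) is an assumption, and the flags shown are the standard `train.py` arguments of the Ultralytics YOLOv5 repository.

```shell
# Train YOLOv5-S with the parameters from Table 2 (layout.yaml is a
# hypothetical dataset config listing the seven layout classes).
python train.py \
    --weights yolov5s.pt \
    --data layout.yaml \
    --img 640 \
    --epochs 500 \
    --patience 100 \
    --batch 32 \
    --cache ram \
    --device 0
```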

The environment utilized to execute the training is Google Colab Pro, with specifications as provided in Table 3.

Table 3: Hardware specifications

Hardware Specification
CPU 2 x Intel Xeon CPU @ 2.20GHz
GPU Tesla P100 16 GB
RAM 27 GB
Storage 129 GB available

4.5 Evaluation Metric

Evaluation metrics are tools used to measure the quality and performance of machine learning models Thambawita et al.
[2020]. Some of the metrics used include mAP50, mAP50-95, Precision, Recall, Box Loss, Class Loss, and Object
Loss.

Precision is the ratio of true positive predictions (TP) to the total number of positive predictions (TP + FP). Precision is used to measure the quality of positive predictions by the model Heyburn et al. [2018]. Precision is defined as shown in Equation (1):

P = TP / (TP + FP)    (1)


Recall is the ratio of true positive predictions (TP) to the total number of actual positives (TP + FN). Recall is used to measure the model's ability to find all positive samples Wang et al. [2022]. Recall is defined as shown in Equation (2):

R = TP / (TP + FN)    (2)
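Equations (1) and (2), together with the F1-score (the harmonic mean of precision and recall) reported in the abstract, can be computed directly; the TP/FP/FN counts below are illustrative, not the paper's.

```python
def precision(tp, fp):
    """Equation (1): quality of positive predictions."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (2): coverage of actual positives."""
    return tp / (tp + fn)

def f1(p, r):
    """F1-score: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Plugging in the paper's reported precision (0.91) and recall (0.971)
# reproduces the reported F1-score of about 0.939.
print(round(f1(0.91, 0.971), 4))
```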

mAP50 The mean of the Average Precision (AP) across all classes. A detection is deemed correct if the Intersection over Union (IoU) between the predicted bounding box and the ground truth is 0.5 or higher. This metric assesses the model's effectiveness in object detection while allowing a degree of flexibility with respect to errors in object placement and bounding box dimensions Heyburn et al. [2018].
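The IoU threshold underlying mAP50 compares predicted and ground-truth boxes as follows; this is a standard formulation, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two partially overlapping boxes: intersection 25, union 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> 0.142857..., below the 0.5 cutoff
```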

mAP50-95 This metric is frequently used in competitive settings such as the COCO (Common Objects in Context) challenge. It is the mean Average Precision (mAP) averaged across multiple Intersection over Union (IoU) thresholds, ranging from 0.5 to 0.95 in increments of 0.05 Thambawita et al. [2020].

Box Loss The metric referred to as box loss, or localization loss, evaluates the accuracy of a model's predictions of object bounding boxes. The calculation typically involves determining the disparity between the predicted bounding box coordinates and the corresponding ground-truth coordinates. Two commonly employed measures in this context are Mean Squared Error (MSE) and Intersection over Union (IoU) Wang et al. [2022].

Class Loss The metric of class loss evaluates the model’s ability to accurately forecast object classes. The calculation
typically involves determining the discrepancy between the anticipated probability of class membership as estimated by
the model and the true classes as determined by the ground truth. Cross-Entropy Loss is a frequently employed metric
in this context Wang et al. [2022].

Object Loss The metric of object loss evaluates the model's ability to accurately predict the existence of objects. In models such as YOLO, the presence or absence of an object is predicted at each cell of the visual grid. Object loss is calculated as the discrepancy between the predicted probability of object presence and the actual presence of the object, as indicated by the ground truth Heyburn et al. [2018].

4.6 Training Results

The training results yield metric values as shown in Table 4, indicating mAP50, mAP50-95, Precision, and Recall
scores. Figure 6 illustrates the metric graph for iterations 238 to 381.


Figure 6: Training Model Metric Graph

Table 4: Training Model Metric

Metric Value
mAP50 0.97
mAP50-95 0.801
Precision 0.911
Recall 0.971

These results show that model training achieved sufficiently high accuracy for predicting the provided document layouts. Training stopped at epoch 381: once the metrics showed no further improvement, the early-stopping mechanism (configured with a patience of 100 epochs, Table 2) terminated training.
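The patience-based early stopping described above can be illustrated with a minimal sketch; the exact stopping rule in the YOLOv5 training code may differ in detail, and the mAP curve below is a toy example.

```python
def early_stop_epoch(scores, patience):
    """Return the epoch at which training halts: the first epoch that is
    `patience` or more epochs past the best score so far, or the final
    epoch if the metric keeps improving."""
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return epoch
    return len(scores) - 1

# Toy mAP curve: improves until epoch 3, then plateaus; with patience=2
# training halts at epoch 5.
print(early_stop_epoch([0.1, 0.4, 0.6, 0.7, 0.7, 0.7, 0.7], patience=2))  # -> 5
```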
Box loss, as depicted in Figure 7, reaches 0.308 during training and 0.636 during validation. These results indicate that the model predicts object bounding boxes well, with low loss.


Figure 7: Box Loss Metric Results

The model training yields small class loss values of 0.245 during training and 0.383 during validation, as shown in
Figure 8. This demonstrates the model’s ability to predict classes from the given layouts.

Figure 8: Class Loss Metric Results

The Object Loss metric refers to the model’s ability to detect objects before predicting their classes and bounding
boxes. The training value is 0.863, and the validation value is 0.85, as shown in Figure 9.


Figure 9: Object Loss Metric Results

The results of the extraction process are exemplified in Figure 10, demonstrating accurate predictions with high
speed.

Figure 10: Object Detection Results

Extraction results using regulation page data are shown in Figure 11, aligning with the original data. The average extraction speed is 0.512 seconds per page.


Figure 11: Text Extraction Results

The outcomes of the detection and extraction process provide evidence that the model successfully meets the
criteria for functioning as an unstructured document detector and extractor.

5 Conclusions
The utilization of YOLOv5 in document layout identification tasks has demonstrated significant efficacy, yielding a notable accuracy rate with a precision of 0.91 and a recall of 0.971. This performance enables the model to identify and retrieve textual and tabular data from document images, accelerating the typically arduous task of extracting data from scanned documents. The capabilities of YOLOv5 can be expanded beyond document layout analysis, presenting opportunities for future study. This entails exploring various forms of unstructured data, encompassing not just documents and photographs but also audio data. This avenue offers significant opportunities for a broad spectrum of applications.

References
Kiran Adnan and Rehan Akbar. Limitations of information extraction methods and techniques for heterogeneous unstructured big data. International Journal of Engineering Business Management, 11:184797901989077, January 2019a. doi:10.1177/1847979019890771. URL https://doi.org/10.1177/1847979019890771.
Kiran Adnan and Rehan Akbar. An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data, 6(1), October 2019b. doi:10.1186/s40537-019-0254-8. URL https://doi.org/10.1186/s40537-019-0254-8.
Gohar Zaman, Hairulnizam Mahdin, Khalid Hussain, and Atta Rahman. Information extraction from semi- and unstructured data sources: A systematic literature review. ICIC Express Letters, 14:593–603, June 2020. doi:10.24507/icicel.14.06.593.
Tausif Diwan, G. Anirudh, and Jitendra V. Tembhurne. Object detection using YOLO: challenges, architectural successors, datasets and applications. Multimedia Tools and Applications, 82(6):9243–9275, August 2022. doi:10.1007/s11042-022-13644-y. URL https://doi.org/10.1007/s11042-022-13644-y.
D. M. Jiménez-Bravo, L. Lozano Murciego, A. Sales Mendes, H. Sánchez San Blás, and J. Bajo. Multi-object tracking in traffic environments: A systematic literature review. Neurocomputing, 494:43–55, July 2022.
J. Kaur and W. Singh. Tools, techniques, datasets and application areas for object detection in an image: a review. Multimedia Tools and Applications, 81(27):38297–38351, April 2022.
R. Arifando, S. Eto, and C. Wada. Improved YOLOv5-based lightweight object detection algorithm for people with visual impairment to detect buses. Applied Sciences, 13(9):5802, May 2023.
M. Hussain. YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection. Machines, 11(7):677, June 2023.
Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. DocLayNet: A large human-annotated dataset for document-layout segmentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, August 2022. doi:10.1145/3534678.3539043. URL https://doi.org/10.1145/3534678.3539043.
Prashanth Pillai and Purnaprajna Mangsuli. Document layout analysis using detection transformers. In Day 3 Wed, November 17, 2021. SPE, December 2021. doi:10.2118/207266-ms. URL https://doi.org/10.2118/207266-ms.
Xuebin Yue, Hengyi Li, Masao Shimizu, Sadao Kawamura, and Lin Meng. YOLO-GD: A deep learning-based object detection algorithm for empty-dish recycling robots. Machines, 10(5):294, April 2022. doi:10.3390/machines10050294. URL https://doi.org/10.3390/machines10050294.
Yu Kyō Kitakaze, Renjin Yoshihara, Souta Okabe, and Ryō Matsumura. Development of harmful bird recognition system using object detection YOLO. Journal of Industrial Application Engineering, 8(1):10–16, 2020. doi:10.12792/jjiiae.8.1.10. URL https://doi.org/10.12792/jjiiae.8.1.10.
Saeed Ranjbar Alvar and Ivan V. Bajic. MV-YOLO: Motion vector-aided tracking by semantic object detection. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE, August 2018. doi:10.1109/mmsp.2018.8547125. URL https://doi.org/10.1109/mmsp.2018.8547125.
Younis A. Al-Arbo and Khalil I. Alsaif. Online multi-object tracking in videos based on features detected by YOLO. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 12(6):2922–2931, April 2021. doi:10.17762/turcomat.v12i6.5801. URL https://doi.org/10.17762/turcomat.v12i6.5801.
Niharika Kumari, Verena Ruf, Sergey Mukhametov, Albrecht Schmidt, Jochen Kuhn, and Stefan Küchemann. Mobile eye-tracking data analysis using object detection via YOLO v4. Sensors, 21(22):7668, November 2021. doi:10.3390/s21227668. URL https://doi.org/10.3390/s21227668.
Chao Wang, Yunchu Zhang, Yanfei Zhou, Shaohan Sun, Hanyuan Zhang, and Yepeng Wang. Automatic detection of indoor occupancy based on improved YOLOv5 model. Neural Computing and Applications, 35(3):2575–2599, September 2022. doi:10.1007/s00521-022-07730-3. URL https://doi.org/10.1007/s00521-022-07730-3.
Yue Gu, Shucai Wang, Yu Yan, Shijie Tang, and Shida Zhao. Identification and analysis of emergency behavior of cage-reared laying ducks based on YOLOv5. Agriculture, 12(4):485, March 2022. doi:10.3390/agriculture12040485. URL https://doi.org/10.3390/agriculture12040485.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection, 2016.
Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger, 2016.
Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement, 2018.
Glenn Jocher, Ayush Chaurasia, Alex Stoken, Jirka Borovec, NanoCode012, Yonghye Kwon, Kalen Michael, TaoXie, Jiacong Fang, Imyhxy, Lorna, Zeng Yifu, Colin Wong, Abhiram V, Diego Montes, Zhiqiang Wang, Cristi Fati, Jebastin Nadar, Laughing, UnglvKitDe, Victor Sonck, Tkianai, YxNONG, Piotr Skalski, Adam Hogan, Dhruv Nair, Max Strobel, and Mrinal Jain. ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation, 2022. URL https://zenodo.org/record/7347926.
Gerardo Vitagliano, Lucas Reisener, Lan Jiang, Mazhar Hameed, and Felix Naumann. Mondrian: Spreadsheet layout detection. In Proceedings of the 2022 International Conference on Management of Data. ACM, June 2022. doi:10.1145/3514221.3520152. URL https://doi.org/10.1145/3514221.3520152.
Mustain Billah, Sajjad Waheed, and Abu Hanifa. An optical character recognition system from printed text and text image using adaptive neuro fuzzy inference system. International Journal of Computer Applications, 130(16):1–5, November 2015. doi:10.5120/ijca2015907196. URL https://doi.org/10.5120/ijca2015907196.
R. Smith. An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2. IEEE, September 2007. doi:10.1109/icdar.2007.4376991. URL https://doi.org/10.1109/icdar.2007.4376991.
Brandon Smock, Rohith Pesala, and Robin Abraham. PubTables-1M: Towards comprehensive table extraction from unstructured documents, 2021.
Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. Label Studio: Data labeling software, 2020–2022. URL https://github.com/heartexlabs/label-studio.
Vajira Thambawita, Debesh Jha, Hugo Lewi Hammer, Håvard D. Johansen, Dag Johansen, Pål Halvorsen, and Michael A. Riegler. An extensive study on cross-dataset bias and evaluation metrics interpretation for machine learning applied to gastrointestinal tract abnormality classification. ACM Transactions on Computing for Healthcare, 1(3):1–29, June 2020. doi:10.1145/3386295. URL https://doi.org/10.1145/3386295.
Rachel Heyburn, Raymond R. Bond, Michaela Black, Maurice Mulvenna, Jonathan Wallace, Deborah Rankin, and Brian Cleland. Machine learning using synthetic and real data: Similarity of evaluation metrics for different healthcare datasets and for different algorithms. In Data Science and Knowledge Engineering for Sensing Decision Support. WORLD SCIENTIFIC, July 2018. doi:10.1142/9789813273238_0160. URL https://doi.org/10.1142/9789813273238_0160.


