A Comparative Analysis of Optical Character Recognition Models For Extracting and Classifying Texts in Natural Scenes
Corresponding Author:
Puneeth Prakash
Department of Information Science and Engineering, Maharaja Institute of Technology Mysore
Affiliated to VTU Belgaum
Srirangapatna, Karnataka 571477, India
Email: [email protected]
1. INTRODUCTION
Detecting and recognizing text in natural scene images has emerged as a crucial focus in the fields
of computer vision, machine learning, and pattern recognition. Although there have been significant
advancements in these fields, accurately detecting and recognizing text in scene images and posters remains a
major challenge due to factors such as intricate backgrounds and diverse text orientations [1], [2]. Images can
generally be categorized into document images or scene images, with each presenting distinct characteristics
in text presentation [3], [4]. In natural scene images, the process of text recognition often involves steps like
text detection, segmentation, and optical character recognition (OCR)-based text recognition [5]. The
variability and complexity in these images, such as differing font styles, sizes, colors, and languages, pose
substantial challenges compared to text in documents [6]–[9].
OCR is an essential technology for recognizing text within images. It transforms various document
types, like PDF files and scanned papers, into editable text formats [10], [11]. Originating as a tool to digitize
printed text, OCR has now burgeoned into an essential component in numerous applications, ranging from
document scanning to aiding the visually impaired. Its significance is particularly pronounced in natural
scenes, where text is often embedded within complex and dynamic backgrounds. OCR enables the extraction
and digitization of textual content from various sources such as street signs, billboards, and product labels,
facilitating a myriad of applications from navigation aids for autonomous vehicles to accessible reading tools
for the visually impaired. By converting unstructured text into machine-readable data, OCR in natural scenes
not only enhances user interaction with the environment but also serves as a foundation for further processing
and analysis in fields like geospatial mapping, retail, and security.
The significance of OCR technology extends beyond mere text digitization; it plays a pivotal role in
interpreting and understanding our immediate environment. In the scenario where text appears in varying
forms and conditions, OCR is instrumental. Its applications span across diverse sectors such as autonomous
navigation, where it aids in interpreting road signs, to healthcare, where it assists in reading handwritten
notes and prescriptions [12]. Figure 1 shows the distinction between OCR documents and the identification
of text within natural settings. Figure 1(a) depicts the basic OCR image, while Figure 1(b) represents an image
in a natural scene.
However, the task of recognizing classified text in natural scenes presents unique challenges. Unlike
standard documents, text in natural environments is subject to a plethora of variables including varying
lighting conditions, diverse backgrounds, and a wide range of font styles and sizes. This variability can
significantly impede the accuracy of OCR systems. Moreover, classified text, often characterized by its
sensitive nature, demands not only high accuracy but also robustness and reliability in recognition [13].
Figure 1. The distinction between OCR documents and the identification of text within natural settings of
(a) an abstract of a formal letter and (b) captured images from real-world scenes
To address these challenges, several advanced techniques have been developed, among which the
maximum stable extremal regions (MSER) and Canny edge detection algorithms stand out for their
effectiveness in detecting and segmenting text in complex natural scenes. MSER excels in identifying
coherent regions within an image that are stable across a range of thresholds, making it particularly suited for
recognizing text areas that stand out from their backgrounds. The Canny edge detector complements this by
identifying the boundaries of text through gradient analysis, highlighting edges with high contrast to the
surrounding area. The synergy between MSER’s region-based detection and Canny edge detection’s
fine-grained boundary delineation offers a robust foundation for overcoming the intrinsic challenges of text
detection in natural scenes.
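To make this combination concrete, the sketch below outlines one possible OpenCV implementation: MSER proposes stable, character-like regions, Canny marks high-contrast stroke boundaries, and their intersection is grouped into candidate text boxes. The Canny thresholds, dilation kernels, and minimum-area filter are illustrative assumptions, not the settings used in this study.

```python
# Hedged sketch: fusing MSER region proposals with Canny edge support to obtain
# candidate text boxes in a natural-scene image. Parameter values are assumptions.
import cv2
import numpy as np

def candidate_text_boxes(image_path, min_area=50):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 1) MSER: regions that stay stable over a range of thresholds
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    mser_mask = np.zeros_like(gray)
    for pts in regions:                       # each region is an array of (x, y) pixels
        mser_mask[pts[:, 1], pts[:, 0]] = 255

    # 2) Canny: high-contrast boundaries around strokes, slightly dilated
    edges = cv2.dilate(cv2.Canny(gray, 100, 200), np.ones((3, 3), np.uint8))

    # 3) Keep only MSER pixels that are supported by nearby edges
    text_mask = cv2.bitwise_and(mser_mask, edges)

    # 4) Merge surviving pixels into word-level blobs and return bounding boxes
    text_mask = cv2.dilate(text_mask, np.ones((5, 15), np.uint8))
    contours, _ = cv2.findContours(text_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
```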
Given the complexity of natural scene environments and the critical role OCR plays in interpreting
them, this study aims to conduct a comprehensive evaluation of various OCR models. Specifically, it seeks to
assess these models based on their adaptability to different environmental conditions, accuracy in recognizing
text amidst the myriad challenges posed by natural scenes, and reliability in delivering consistent results
across diverse datasets. By systematically comparing the performance of leading OCR models, this research
endeavours to provide insights into their strengths and limitations, to guide the selection of the most appropriate models for specific applications, and to set the stage for the development of more advanced, hybrid OCR systems tailored to the nuanced requirements of text extraction and recognition in natural scenes.
The contributions of this research are as follows: i) it addresses gaps in OCR research, particularly in the application of OCR technology to natural scenes, which has not been explored as extensively as its use in controlled environments; ii) it reassesses existing OCR models, examining their performance and suitability for recognizing classified text in uncontrolled natural scenes; iii) it provides insights that can guide future advancements in OCR technology, focusing on enhancing its applicability and reliability in real-world scenarios; and iv) its findings are expected to benefit various
applications that rely on accurate recognition of classified text in natural settings, contributing to the
development of reliable OCR systems.
The rest of the paper is organized as follows: section 2 reviews related works on scene-text detection with various OCR models. Section 3 introduces the methodology. The experimental results, benchmarked against
various state-of-the-art methods, are detailed in section 4. Lastly, section 5 presents the conclusions and
discusses potential directions for future research.
2. RELATED WORKS
This section provides a critical analysis of recent developments in OCR for natural scene text
recognition, highlighting advancements, challenges, and the innovative learning techniques introduced by
state-of-the-art models. The field has evolved significantly, with various approaches aimed at improving the
efficiency in text detection and recognition. However, the diversity of real-world environments still presents
challenges that many existing models do not adequately address. Despite advancements, challenges related to
complex environments and varying text characteristics continue to drive innovation in the field.
3. METHOD
This paper introduces a novel approach for detecting and recognizing text within images captured in natural environments using the PDTNet model. PDTNet leverages a combination of text detection algorithms and deep learning techniques to enhance accuracy and reliability in recognizing text amidst complex
backgrounds. The PDTNet model’s architecture is composed of various layers, each with unique dimensions
and functionalities. The initial layer is a convolutional layer with an input activation volume of 28×28×16,
generating 2,368 output values [36]. This is succeeded by another convolutional layer, doubling the depth to
32 channels and producing 4,640 outputs. A detailed explanation of the model is given in subsequent sections, and
the overall architecture is depicted in Figure 2. Once the text is extracted from an image, the OCR technology
is used to convert image-based text into machine-readable text. To ensure the accuracy of the recognized text,
a spell-checking mechanism, specifically designed for OCR outputs, is employed.
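As a hedged illustration of this stem, the Keras sketch below builds two convolutional layers whose trainable-parameter counts match the reported figures of 2,368 and 4,640. The 7×7 and 3×3 kernel sizes and the three-channel input are assumptions chosen to reproduce those counts; the paper states only the activation volume and the output figures.

```python
# Hedged sketch of the first two PDTNet convolutional layers described above.
# Kernel sizes and input channels are assumptions; with these choices the layers
# report 2,368 and 4,640 trainable parameters (weights plus biases), matching
# the counts quoted in the text.
from tensorflow.keras import layers, models

def pdtnet_stem(input_shape=(28, 28, 3)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        # 7*7*3*16 + 16 = 2,368 parameters, 16 output channels
        layers.Conv2D(16, (7, 7), padding="same", activation="relu"),
        # 3*3*16*32 + 32 = 4,640 parameters, depth doubled to 32 channels
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    ], name="pdtnet_stem_sketch")

pdtnet_stem().summary()  # prints per-layer output shapes and parameter counts
```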
Figure 2. The proposed methodology for the OCR model for scene text recognition
Figure 3. Selected sample images from the PDT2023 dataset of (a) a typical camera-captured image on a busy
Indian road and (b) variability in complex background
Figure 5. Process of text identification through OCR for (a) area containing text: input image, (b) the result of
OCR application on the input image, and (c) the results of OCR correction
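The recognition and correction steps illustrated in Figure 5 can be sketched as follows, assuming pytesseract as the OCR engine and a simple vocabulary lookup from the Python standard library as a stand-in for the dedicated OCR spell-checker used in this work.

```python
# Hedged sketch of Figure 5: OCR on a cropped text region followed by a simple
# dictionary-based correction pass. difflib stands in for the dedicated
# OCR spell-checker; the binarization step is an illustrative assumption.
import difflib
import cv2
import pytesseract

def recognize_and_correct(region_bgr, vocabulary, cutoff=0.8):
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    raw_text = pytesseract.image_to_string(binary)   # image-based text -> string

    corrected = []
    for word in raw_text.split():
        # Snap each word onto its closest vocabulary entry, if one is close enough
        match = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return raw_text, " ".join(corrected)
```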
Precision indicates the proportion of true positive identifications out of all identified positives and is
calculated as (2).
Precision = TP / (TP + FP) (2)
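A minimal illustration of equation (2), together with the analogously defined recall reported in the results, is given below; the zero-division guard is an implementation convenience rather than part of the definition.

```python
# Minimal sketch of the evaluation metrics from true/false positive/negative counts.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0   # equation (2)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0
```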
Table 3 presents the performance metrics of PDTNet across three different datasets, highlighting its
effectiveness in text recognition. For the ICDAR2015 dataset, the model achieved an accuracy of 98.20%,
precision of 98.30%, and recall of 97.70%, which are the highest metrics across all datasets tested. This
suggests that the model is particularly effective on this dataset, outperforming its performance on
ICDAR2017 and PDT2023, where lower metrics were recorded.
Figure 6. The training and validation accuracies of VGG16 in comparison with the ground truth data
Figure 7. The training and validation accuracies of ResNet50 in comparison with the ground truth data
From Figures 6 and 7, it can be observed that the second model (ResNet50) performed better than VGG16, since it predicted bounding boxes that tightly enclosed the text, whereas VGG16 generated bounding boxes with a few misclassifications [39]. Overall, the loss decreased gradually over the training epochs until convergence around the 50th epoch. Higher accuracy was obtained through the choice of hyperparameters and variations of dropout regularization, with the results presented in Tables 4 to 6.
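The backbone comparison above can be sketched as a transfer-learning setup in Keras, with dropout exposed as the hyperparameter that was varied. The head layout, dropout rate, optimizer, and classification-style output are illustrative assumptions; the detection head that produces the bounding boxes in Figures 6 and 7 is not reproduced here.

```python
# Hedged sketch: VGG16 vs ResNet50 feature extractors with a dropout-regularized
# head. Hyperparameters are illustrative assumptions, not the paper's exact setup.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16, ResNet50

def build_classifier(backbone_name="resnet50", num_classes=2,
                     input_shape=(224, 224, 3), dropout_rate=0.5):
    backbone_cls = {"vgg16": VGG16, "resnet50": ResNet50}[backbone_name]
    backbone = backbone_cls(weights="imagenet", include_top=False,
                            input_shape=input_shape)
    backbone.trainable = False            # train only the newly added head

    model = models.Sequential([
        backbone,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(dropout_rate),     # varied as a regularization hyperparameter
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g. build_classifier("vgg16").fit(train_ds, validation_data=val_ds, epochs=50)
```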
Table 5. Text recognition accuracy across different numbers of RNN models with the same number of units
Metrics/Features EasyOCR [1] Tesseract [2] Surya [4] DOCTR [39] Proposed (PDTNet)
Accuracy (%) 92 85 80 95 96
Precision (%) 91 90 85 96 94
Recall (%) 89 87 82 94 92
Speed Fast Moderate Slow Very Fast Very Fast
Resource Use Low Moderate High Low Moderate
Ease of Integration Very Easy Easy Moderate Very Easy Easy
Adaptability Excellent Good Poor Excellent Superior
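A comparison of this kind can be approximated with a small harness that runs each engine on the same labelled samples and records word-level accuracy and speed, as in the hedged sketch below. Only EasyOCR and Tesseract are wrapped here, and the word-match accuracy is a simplification of the protocol behind the table; Surya, DocTR, and PDTNet would be wrapped in the same way.

```python
# Hedged sketch of an OCR engine comparison harness: word-level accuracy and
# per-image runtime over a small labelled sample. Simplified relative to the
# evaluation protocol used for the table above.
import time
import easyocr
import pytesseract
from PIL import Image

reader = easyocr.Reader(["en"], gpu=False)

def run_easyocr(path):
    return " ".join(text for _, text, _ in reader.readtext(path))

def run_tesseract(path):
    return pytesseract.image_to_string(Image.open(path))

def evaluate(engine, samples):
    """samples: list of (image_path, ground_truth_text) pairs."""
    correct = total = 0
    start = time.perf_counter()
    for path, truth in samples:
        predicted = engine(path).lower().split()
        truth_words = truth.lower().split()
        correct += sum(p == t for p, t in zip(predicted, truth_words))
        total += len(truth_words)
    elapsed = time.perf_counter() - start
    return {"word_accuracy": correct / total if total else 0.0,
            "seconds_per_image": elapsed / max(len(samples), 1)}
```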
recognition in natural scenes. By focusing on these areas, the study aims to ensure that PDTNet remains at
the forefront of OCR technology advancements, addressing both current challenges and future opportunities
in the field.
REFERENCES
[1] M. A. M. Salehudin et al., “Analysis of optical character recognition using EasyOCR under image degradation,” Journal of
Physics: Conference Series, vol. 2641, no. 1, Nov. 2023, doi: 10.1088/1742-6596/2641/1/012001.
[2] S. Kumar, N. K. Sharma, M. Sharma, and N. Agrawal, “Text extraction from images using Tesseract,” in Deep Learning
Techniques for Automation and Industrial Applications, John Wiley & Sons, Ltd, 2024, pp. 1–18. doi:
10.1002/9781394234271.ch1.
[3] D. Shruthi, H. K. Chethan, and V. I. Agughasi, “Effective approach for fine-tuning pre-trained models for the extraction of texts
from source codes,” in ITM Web of Conferences, 2024, vol. 65, doi: 10.1051/itmconf/20246503004.
[4] L. Mosbah, I. Moalla, T. M. Hamdani, B. Neji, T. Beyrouthy, and A. M. Alimi, “ADOCRNet: A deep learning OCR for Arabic
documents recognition,” IEEE Access, vol. 12, pp. 55620–55631, 2024, doi: 10.1109/ACCESS.2024.3379530.
[5] V. I. Agughasi and M. Srinivasiah, “Semi-supervised labelling of chest x-ray images using unsupervised clustering for ground-
truth generation,” Applied Engineering and Technology, vol. 2, no. 3, pp. 188–202, 2023, doi: 10.31763/aet.v2i3.1143.
[6] A. V. Ikechukwu and S. Murali, “i-Net: a deep CNN model for white blood cancer segmentation and classification,” International
Journal of Advanced Technology and Engineering Exploration, vol. 9, no. 95, pp. 1448–1464, 2022, doi:
10.19101/IJATEE.2021.875564.
[7] S. Bhimshetty and A. V. Ikechukwu, “Energy-efficient deep Q-network: reinforcement learning for efficient routing protocol in
wireless internet of things,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 33, no. 2, pp. 971–980,
2024, doi: 10.11591/ijeecs.v33.i2.pp971-980.
[8] A. V. Ikechukwu and S. Murali, “XAI: An explainable ai model for the diagnosis of COPD from CXR images,” in 2023 IEEE
2nd International Conference on Data, Decision and Systems (ICDDS), Mangaluru, India, 2023, pp. 1-6, doi:
10.1109/ICDDS59137.2023.10434619.
[9] A. V. Ikechukwu, S. Murali, and B. Honnaraju, “COPDNet: An explainable ResNet50 model for the diagnosis of COPD from
CXR images,” in 2023 IEEE 4th Annual Flagship India Council International Subsections Conference (INDISCON), Mysore,
India, 2023, pp. 1-7, doi: 10.1109/INDISCON58499.2023.10270604.
[10] A. V. Ikechukwu, “The superiority of fine-tuning over full-training for the efficient diagnosis of COPD from CXR images,”
Inteligencia Artificial, vol. 27, no. 74, pp. 62–79, 2024, doi: 10.4114/intartif.vol27iss74pp62-79.
[11] A. V. Ikechukwu and S. Murali, “CX-Net: an efficient ensemble semantic deep neural network for ROI identification from chest-
x-ray images for COPD diagnosis,” Machine Learning: Science and Technology, vol. 4, no. 2, 2023, doi: 10.1088/2632-
2153/acd2a5.
[12] R. Jalloul, C. H. Krishnappa, V. I. Agughasi, and R. Alkhatib, “Enhancing early breast cancer detection with infrared thermography: a comparative evaluation of deep learning and machine learning models,” Technologies, vol. 13, no. 1, Jan. 2025, doi: 10.3390/technologies13010007.
[13] Y. Tang and X. Wu, “Scene text detection using superpixel-based stroke feature transform and deep learning based region
classification,” IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2276–2288, 2018, doi: 10.1109/TMM.2018.2802644.
[14] P. Prakash, S. K. Y. Hanumanthaiah, and S. B. Mayigowda, “CRNN model for text detection and classification from natural
scenes,” IAES International Journal of Artificial Intelligence, vol. 13, no. 1, pp. 839–849, 2024, doi: 10.11591/ijai.v13.i1.pp839-
849.
[15] D. Deng, H. Liu, X. Li, and D. Cai, “PixelLink: Detecting scene text via instance segmentation,” 32nd AAAI Conference on
Artificial Intelligence, AAAI 2018, pp. 6773–6780, 2018, doi: 10.1609/aaai.v32i1.12269.
[16] X. Li, “A deep learning-based text detection and recognition approach for natural scenes,” Journal of Circuits, Systems and
Computers, vol. 32, no. 5, 2023, doi: 10.1142/S0218126623500731.
[17] I. Marthot-Santaniello, M. T. Vu, O. Serbaeva, and M. Beurton-Aimar, “Stylistic similarities in Greek Papyri based on letter
shapes: a deep learning approach,” Document Analysis and Recognition – ICDAR 2023 Workshops, pp. 307–323, 2023, doi:
10.1007/978-3-031-41498-5_22.
[18] V. I. Agughasi, S. Bhimshetty, R. Deepu, and M. V. Mala, “Advances in thermal imaging: a convolutional neural network
approach for improved breast cancer diagnosis,” International Conference on Distributed Computing and Optimization
Techniques, ICDCOT 2024, 2024, doi: 10.1109/ICDCOT61034.2024.10515323.
[19] A. Yadav, S. Singh, M. Siddique, N. Mehta, and A. Kotangale, “OCR using CRNN: a deep learning approach for text
recognition,” 2023 4th International Conference for Emerging Technology, INCET 2023, 2023, doi:
10.1109/INCET57972.2023.10170436.
[20] R. Najam and S. Faizullah, “Analysis of recent deep learning techniques for arabic handwritten-text OCR and post-OCR
correction,” Applied Sciences, vol. 13, no. 13, 2023, doi: 10.3390/app13137568.
[21] P. Chhabra, A. Shrivastava, and Z. Gupta, “Comparative analysis on text detection for scenic images using EAST and CTPN,” in
7th International Conference on Trends in Electronics and Informatics, ICOEI 2023 - Proceedings, 2023, pp. 1303–1308, doi:
10.1109/ICOEI56765.2023.10125894.
[22] A. Rahman, A. Ghosh, and C. Arora, “UTRNet: high-resolution Urdu text recognition in printed documents,” Document Analysis
and Recognition - ICDAR 2023, pp. 305–324, 2023, doi: 10.1007/978-3-031-41734-4_19.
[23] S. Kaur, S. Bawa, and R. Kumar, “Heuristic-based text segmentation of bilingual handwritten documents for Gurumukhi-Latin
scripts,” Multimedia Tools and Applications, vol. 83, no. 7, pp. 18667–18697, 2024, doi: 10.1007/s11042-023-15335-8.
[24] A. V. Ikechukwu, “Leveraging transfer learning for efficient diagnosis of COPD using CXR images and explainable AI
techniques,” Inteligencia Artificial, vol. 27, no. 74, pp. 133–151, 2024, doi: 10.4114/intartif.vol27iss74pp133-151.
[25] S. Long, X. He, and C. Yao, “Scene text detection and recognition: the deep learning era,” International Journal of Computer
Vision, vol. 129, no. 1, pp. 161–184, 2021, doi: 10.1007/s11263-020-01369-0.
[26] T. Khan, R. Sarkar, and A. F. Mollah, “Deep learning approaches to scene text detection: a comprehensive review,” Artificial
Intelligence Review, vol. 54, no. 5, pp. 3239–3298, 2021, doi: 10.1007/s10462-020-09930-6.
[27] E. Hassan and V. L. Lekshmi, “Scene text detection using attention with depthwise separable convolutions,” Applied Sciences,
vol. 12, no. 13, 2022, doi: 10.3390/app12136425.
[28] X. Liu, G. Meng, and C. Pan, “Scene text detection and recognition with advances in deep learning: a survey,” International
Journal on Document Analysis and Recognition, vol. 22, no. 2, pp. 143–162, 2019, doi: 10.1007/s10032-019-00320-5.
[29] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, vol. 9351, pp. 234–241, 2015, doi: 10.1007/978-3-319-24574-4_28.
[30] H. Law and J. Deng, “CornerNet: detecting objects as paired keypoints,” International Journal of Computer Vision, vol. 128,
no. 3, pp. 642–656, 2020, doi: 10.1007/s11263-019-01204-1.
[31] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-9, no. 1, pp. 62–66,
1979, doi: 10.1109/tsmc.1979.4310076.
[32] B. Gatos, I. Pratikakis, and S. J. Perantonis, “Adaptive degraded document image binarization,” Pattern Recognition, vol. 39,
no. 3, pp. 317–327, 2006, doi: 10.1016/j.patcog.2005.09.010.
[33] N. Phansalkar, S. More, A. Sabale, and M. Joshi, “Adaptive local thresholding for detection of nuclei in diversity stained cytology
images,” in ICCSP 2011 - 2011 International Conference on Communications and Signal Processing, 2011, pp. 218–220, doi:
10.1109/ICCSP.2011.5739305.
[34] F. Z. A. Bella, M. El Rhabi, A. Hakim, and A. Laghrib, “An innovative document image binarization approach driven by the non-
local p-Laplacian,” Eurasip Journal on Advances in Signal Processing, vol. 2022, no. 1, 2022, doi: 10.1186/s13634-022-00883-2.
[35] M. Cheriet, J. N. Said, and C. Y. Suen, “A recursive thresholding technique for image segmentation,” IEEE Transactions on
Image Processing, vol. 7, no. 6, pp. 918–921, 1998, doi: 10.1109/83.679444.
[36] Y. C. Wei and C. H. Lin, “A robust video text detection approach using SVM,” Expert Systems with Applications, vol. 39, no. 12,
pp. 10832–10840, 2012, doi: 10.1016/j.eswa.2012.03.010.
[37] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide-baseline stereo from maximally stable extremal regions,” Image and
Vision Computing, vol. 22, no. 10 SPEC. ISS., pp. 761–767, 2004, doi: 10.1016/j.imavis.2004.02.006.
[38] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan, “Text flow: A unified text detection system in natural scene images,” in
Proceedings of the IEEE International Conference on Computer Vision, ICCV 2015, pp. 4651–4659, doi:
10.1109/ICCV.2015.528.
[39] P. Batra, N. Phalnikar, D. Kurmi, J. Tembhurne, P. Sahare, and T. Diwan, “OCR-MRD: performance analysis of different optical
character recognition engines for medical report digitization,” International Journal of Information Technology, vol. 16, no. 1, pp. 447–455, Jan. 2024, doi:
10.1007/s41870-023-01610-2.
[40] K. Ma, Z. Shu, X. Bai, J. Wang, and D. Samaras, “DocUNet: document image unwarping via a stacked U-Net,” in Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 4700–4709, doi:
10.1109/CVPR.2018.00494.
[41] X. Zhou et al., “EAST: An efficient and accurate scene text detector,” Proceedings - 30th IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2017, pp. 2642–2651, 2017, doi: 10.1109/CVPR.2017.283.
[42] I. Pratikakis, K. Zagoris, G. Barlas, and B. Gatos, “ICDAR2017 competition on document image binarization (DIBCO 2017),” in
Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 2017, vol. 1, pp. 1395–1403, doi:
10.1109/ICDAR.2017.228.
[43] J. Ma et al., “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, vol. 20, no. 11,
pp. 3111–3122, 2018, doi: 10.1109/TMM.2018.2818020.