HubNet: An E2E Model for Wheel Hub Text Detection and Recognition Using Global and Local Features
Abstract
1. Introduction
- We establish a wheel hub text dataset comprising 446 images, 934 word instances, and 2947 character instances. The images come from real factory production-line scenes, with uniformly distributed character categories and diverse text orientations.
- We propose HubNet, an end-to-end text detection and recognition model that exploits both global and local features: directional cues from the global features are fused with fine-grained details from the local features, improving the model's detection and recognition accuracy.
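HubNet's actual fusion layers are not reproduced in this excerpt, so the sketch below is an illustration under stated assumptions: the function names, gating scheme, and fixed weights are hypothetical, and the three strategies simply mirror the fusion variants compared in the ablation study (element-wise add, concatenation, and a cross-weighting in which each stream gates the other).

```python
# Illustrative sketch only -- not HubNet's implementation. Three ways to
# combine a global (direction-aware) feature vector with a local (detail)
# feature vector of the same length.
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_add(global_feat, local_feat):
    # Element-wise addition: no extra channels, no new parameters.
    return [g + l for g, l in zip(global_feat, local_feat)]

def fuse_concat(global_feat, local_feat):
    # Concatenation: keeps both streams intact, doubles the channel count.
    return global_feat + local_feat

def fuse_cross(global_feat, local_feat, w_g, w_l):
    # Minimal cross-fusion stand-in: each stream is scaled by a gate
    # computed from the *other* stream, so directional cues in the global
    # features can modulate local details and vice versa. The gate weights
    # w_g and w_l stand in for learned parameters.
    gate_from_local = _sigmoid(_dot(w_l, local_feat))
    gate_from_global = _sigmoid(_dot(w_g, global_feat))
    return [gate_from_local * g + gate_from_global * l
            for g, l in zip(global_feat, local_feat)]
```

The add and concat variants correspond to the "Add" and "Concat" rows of the ablation table; the cross-weighted variant is only a schematic analogue of the proposed feature cross-fusion module.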
2. Materials and Methods
2.1. Dataset
2.2. Model
2.2.1. Detection Head
2.2.2. Feature Cross-Fusion
2.2.3. Recognition Head
2.2.4. Loss Function
3. Experiment
3.1. Experimental Environment Configuration
3.2. Analysis on Backbone and Feature Extractor
3.3. Ablation Study on Feature Cross-Fusion Module
3.4. Quantitative Comparison
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Shabestari, B.N.; Miller, J.W.V.; Wedding, V. Low-cost real-time automatic wheel classification system. Mach. Vis. Appl. Archit. Syst. Integr. 1992, 1823, 218–222.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Liao, M.; Pang, G.; Huang, J.; Hassner, T.; Bai, X. Mask TextSpotter v3: Segmentation proposal network for robust scene text spotting. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 706–722.
- Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. SwinTextSpotter: Scene text spotting via better synergy between text detection and text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4593–4603.
- Ye, M.; Zhang, J.; Zhao, S.; Liu, J.; Liu, T.; Du, B.; Tao, D. DeepSolo: Let transformer decoder with explicit points solo for text spotting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19348–19357.
- Xing, L.; Tian, Z.; Huang, W.; Scott, M.R. Convolutional character networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9126–9136.
- Wang, T.; Wu, D.J.; Coates, A.; Ng, A.Y. End-to-end text recognition with convolutional neural networks. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), Tsukuba, Japan, 11–15 November 2012; IEEE: Washington, DC, USA, 2012; pp. 3304–3308.
- Matas, J.; Chum, O.; Urban, M.; Pajdla, T. Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 2004, 22, 761–767.
- Liu, Y.; Jin, L. Deep matching prior network: Toward tighter multi-oriented text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1962–1969.
- Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5909–5918.
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560.
- Liu, Y.; Jin, L.; Fang, C. Arbitrarily shaped scene text detection with a mask tightness text detector. IEEE Trans. Image Process. 2019, 29, 2918–2930.
- Zhu, Y.; Chen, J.; Liang, L.; Kuang, Z.; Jin, L.; Zhang, W. Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3123–3131.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A fast text detector with a single deep neural network. Proc. AAAI Conf. Artif. Intell. 2017, 31, 4161–4167.
- Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1457–1464.
- Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7098–7107.
- Qiao, Z.; Zhou, Y.; Yang, D.; Zhou, Y.; Wang, W. SEED: Semantics enhanced encoder-decoder framework for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13528–13537.
- Liu, Y.; Zhang, J.; Peng, D.; Huang, M.; Wang, X.; Tang, J.; Huang, C.; Lin, D.; Shen, C.; Bai, X.; et al. SPTS v2: Single-point scene text spotting. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15665–15679.
- Zhang, X.; Su, Y.; Tripathi, S.; Tu, Z. Text spotting transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9519–9528.
- Liao, M.; Zou, Z.; Wan, Z.; Yao, C.; Bai, X. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 919–931.
- Ronen, R.; Tsiper, S.; Anschel, O.; Lavi, I.; Markovitz, A.; Manmatha, R. GLASS: Global to local attention for scene-text spotting. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 249–266.
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1156–1160.
- Ch'ng, C.K.; Chan, C.S. Total-Text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 1, pp. 935–942.
- Vatti, B.R. A generic solution to polygon clipping. Commun. ACM 1992, 35, 56–63.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
- Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946.
- Liu, Y.; Chen, H.; Shen, C.; He, T.; Jin, L.; Wang, L. ABCNet: Real-time scene text spotting with adaptive Bezier-curve network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9809–9818.
Item | Value |
---|---|
Number of images in the dataset | 446 |
Image resolution | 512 × 512 pixels |
Number of word instances in the dataset | 934 |
Number of character instances in the dataset | 2947 |
Hardware/Software | Specification |
---|---|
CPU | Intel® Core™ i7-14700KF |
GPU | NVIDIA GeForce RTX 3090 |
Operating System | Ubuntu 20.04 |
Python Environment | Conda 24.12 |
Python Version | Python 3.8.19 |
PyTorch Version | PyTorch 1.10.1 + CUDA 11.3 |
MMEngine Version | 0.10.3 |
MMDetection Version | 3.1.0 |
MMOCR Version | 1.0.1 |
Backbone | FE | Detection Precision | Detection Recall | Detection F1 | Recognition Precision | Recognition Recall | Recognition F1 |
---|---|---|---|---|---|---|---|
EfficientNet | ResNet18D | 90.7 | 89.5 | 90.1 | 80.8 | 73.4 | 76.9 |
EfficientNet | ResNet18 | 90.7 | 89.5 | 90.1 | 82.0 | 73.6 | 77.6 |
ResNet50 | ResNet18D | 88.1 | 80.9 | 84.3 | 81.3 | 72.2 | 76.5 |
ResNet50 | ResNet18 | 88.1 | 80.9 | 84.3 | 80.0 | 70.8 | 75.1 |
ResNet50D | ResNet18D | 92.4 | 90.8 | 91.6 | 85.5 | 78.7 | 82.0 |
ResNet50D | ResNet18 | 92.4 | 90.8 | 91.6 | 86.5 | 79.4 | 82.8 |
Features | Recognition Precision | Recognition Recall | Recognition F1 |
---|---|---|---|
Global only | 70.9 | 64.4 | 67.5 |
Local only | 83.4 | 75.2 | 79.0 |
Add | 83.9 | 75.1 | 79.3 |
Concat | 84.8 | 77.6 | 81.0 |
Feature cross-fusion | 86.5 | 79.4 | 82.8 |
Model | Venue | Detection Precision | Detection Recall | Detection F1 | Recognition Precision | Recognition Recall | Recognition F1 |
---|---|---|---|---|---|---|---|
Mask TextSpotter v3 | ECCV'2020 | 82.4 | 80.8 | 81.6 | 33.1 | 30.5 | 31.7 |
ABCNet | CVPR'2020 | 56.5 | 57.3 | 56.9 | 12.4 | 10.9 | 11.6 |
SwinTextSpotter | CVPR'2022 | 91.4 | 90.5 | 90.9 | 74.1 | 69.3 | 71.6 |
SPTS v2 | TPAMI'2023 | 91.2 | 91.8 | 91.5 | 79.2 | 72.7 | 75.8 |
Ours | – | 92.4 | 90.8 | 91.6 | 86.5 | 79.4 | 82.8 |
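The F1 columns in the tables above are the standard harmonic mean of precision and recall; as a quick sanity check, the "Ours" row reproduces the reported values:

```python
# F1 as the harmonic mean of precision and recall, matching the tables.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(86.5, 79.4), 1))  # 82.8 (recognition F1, "Ours")
print(round(f1_score(92.4, 90.8), 1))  # 91.6 (detection F1, "Ours")
```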
Share and Cite
Zeng, Y.; Meng, C. HubNet: An E2E Model for Wheel Hub Text Detection and Recognition Using Global and Local Features. Sensors 2024, 24, 6183. https://doi.org/10.3390/s24196183