Hierarchical Text Spotter For Joint Text Spotting and Layout Analysis
Shangbang Long, Siyang Qin, Yasuhisa Fujii, Alessandro Bissacco, Michalis Raptis
Google Research
{longshangbang,qinb,yasuhisaf,bissacco,mraptis}@google.com
1. Introduction
The extraction and comprehension of text in images play a critical role in many computer vision applications. Text spotting algorithms have progressed significantly in recent years [33, 42, 45, 49, 67], specifically within the tasks of detecting [2, 28, 36, 63] and recognizing [5, 12, 40, 41, 59] individual text instances in images. Previously, defining the geometric layout [7, 9, 24, 62] of extracted textual content occurred independently of text spotting and remained focused on document images. In this paper, we aim to further the argument [34] that consolidating these separately treated tasks is complementary and mutually enhancing. We postulate that a joint approach to text spotting and geometric layout analysis could provide useful signals for downstream tasks involving semantic parsing and reasoning over text in images, such as text-based VQA [6, 53] and document understanding [16, 23, 25].

Figure 1. Illustration of our Hierarchical Text Spotter (HTS). HTS consists of two main components: (1) a Unified-Detector-Polygon (UDP) that detects text lines with bounding polygons and clusters them into paragraph groups. In this figure (upper right), paragraph groups are illustrated by different colors. The bounding polygons are used to crop and rectify text lines into canonical forms that are easy to recognize. (2) A Line-to-Character-to-Word (L2C2W) recognizer that jointly predicts character classes and bounding boxes. Spaces are used to split lines into words. The output of HTS is a Hierarchical Text Representation (HTR) that encodes the layout of all text entities in an image. In this figure, we use indents to represent the hierarchy of text entities (middle right), and visualize the character bounding boxes (bottom right).
Existing text spotting methods [45, 49, 67] most commonly extract text at the word level, where a 'word' is defined as a sequence of characters delimited by spaces, without taking the text context into account. Recently, the Unified Detector [34], which is built upon the detection transformer [58], detects text 'lines' with instance segmentation masks and produces an affinity matrix for paragraph grouping in an end-to-end way. This method is limited to the detection task and cannot produce character- or word-level outputs.

In this paper, we propose a novel method, termed Hierarchical Text Spotter (HTS), that simultaneously localizes, recognizes, and recovers the geometric relationships of the text in an image. The framework of HTS is illustrated in Fig. 1. It is designed to extract a hierarchical text representation (HTR) of text entities in images. HTR has four levels of hierarchy¹, including character, word, text line, and paragraph, from bottom to top. The HTR representation encodes the structure of text in images. To the best of our knowledge, HTS is the first unified method for text spotting and geometric layout analysis.
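To make the four-level hierarchy concrete, the following is a minimal sketch of how an HTR output could be represented as nested data structures; the class and field names are our own illustration, not an API from the paper:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

@dataclass
class Character:
    text: str  # single character class, e.g. 'a'
    box: Box   # character bounding box

@dataclass
class Word:
    chars: List[Character] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "".join(c.text for c in self.chars)

@dataclass
class TextLine:
    words: List[Word] = field(default_factory=list)

@dataclass
class Paragraph:
    lines: List[TextLine] = field(default_factory=list)

# An image's HTR is a list of paragraphs:
# paragraphs -> text lines -> words -> characters.
HTR = List[Paragraph]
```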
The proposed HTS consists of two main components: (1) A Unified-Detector-Polygon (UDP) model that jointly predicts Bezier curve polygons [30] for text lines and an affinity matrix supporting the grouping of lines into paragraphs. Notably, we find that the conventional way of training a Bezier curve polygon prediction head, i.e., applying L1 losses on control points directly [30, 47, 56], fails to capture text shapes accurately on highly diverse datasets such as HierText [34]. Hence, we propose a novel Location and Shape Decoupling Module (LSDM), which decouples the representation learning of location and shape. UDP equipped with LSDM can accurately detect text lines of arbitrary shapes, sizes, and locations across multiple datasets from different domains.
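In the Bezier text representation of [30], each line is bounded by two degree-m Bezier curves (top and bottom boundaries), which is consistent with the 4(m+1) values predicted per line (two curves, m+1 control points each, two coordinates per point). Below is a minimal sketch of turning predicted control points into a bounding polygon usable for cropping and rectification; the sampling density and traversal order are our own choices:

```python
import numpy as np
from math import comb

def bezier_curve(ctrl_pts: np.ndarray, n_samples: int = 20) -> np.ndarray:
    """Evaluate a Bezier curve from its control points.

    ctrl_pts: (m + 1, 2) array of 2D control points (degree m).
    Returns an (n_samples, 2) array of points on the curve.
    """
    m = ctrl_pts.shape[0] - 1
    t = np.linspace(0.0, 1.0, n_samples)
    # Bernstein basis: B_{i,m}(t) = C(m, i) * t^i * (1 - t)^(m - i)
    basis = np.stack(
        [comb(m, i) * t**i * (1.0 - t) ** (m - i) for i in range(m + 1)],
        axis=1,
    )  # (n_samples, m + 1)
    return basis @ ctrl_pts  # (n_samples, 2)

def bezier_polygon(top: np.ndarray, bottom: np.ndarray,
                   n_samples: int = 20) -> np.ndarray:
    """Sample a closed bounding polygon from top/bottom boundary curves.

    top, bottom: (m + 1, 2) control points of the two boundaries.
    The bottom curve is reversed so the polygon is traversed consistently.
    """
    top_pts = bezier_curve(top, n_samples)
    bottom_pts = bezier_curve(bottom, n_samples)[::-1]
    return np.concatenate([top_pts, bottom_pts], axis=0)  # (2 * n_samples, 2)
```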
domains. (2) A Line-to-Character-to-Word (L2C2W) text formable multi-head attention [67]. More recently, single
line recognizer based on Transformer encoder-decoder [57] stage text spotters [20, 42] are proposed under a sequence-
that jointly predicts character bounding boxes and charac- to-sequence framework. These works do not perform layout
ter classes. L2C2W is trained to produce the special space analysis and are thus orthogonal to this paper.
character to delimit text lines into words. Also, unlike other Text Detection Top-down text detection methods view text
recognizers or text spotters that are based on character de- instances as objects. These methods produce detection
tection [3, 27, 31, 61], L2C2W only needs a small fraction boxes [30,56] or instance segmentation masks [34] for each
of training data to have bounding box annotations. text instance. Bottom-up methods first detect sub-parts
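Since L2C2W emits an explicit space character, line-to-word splitting reduces to segmenting the decoded character sequence at spaces and merging the corresponding character boxes. A minimal sketch follows, assuming axis-aligned (x_min, y_min, x_max, y_max) character boxes and taking the word box as the union of its character boxes; the paper's exact box parameterization may differ:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def split_line_into_words(chars: List[str],
                          boxes: List[Box]) -> List[Tuple[str, Box]]:
    """Split a decoded text line into (word, word_box) pairs at spaces."""
    words: List[Tuple[str, Box]] = []
    current_chars: List[str] = []
    current_boxes: List[Box] = []

    def flush() -> None:
        if current_chars:
            # Union of the character boxes gives the word box.
            xs0, ys0, xs1, ys1 = zip(*current_boxes)
            words.append(("".join(current_chars),
                          (min(xs0), min(ys0), max(xs1), max(ys1))))
            current_chars.clear()
            current_boxes.clear()

    for ch, box in zip(chars, boxes):
        if ch == " ":
            flush()  # the space delimits words; its box is not kept
        else:
            current_chars.append(ch)
            current_boxes.append(box)
    flush()
    return words
```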
The proposed HTS method achieves state-of-the-art text spotting results on multiple datasets across different domains, including ICDAR 2015 [19], Total-Text [10], and HierText [34]. It also surpasses the Unified Detector [34] on the geometric layout analysis benchmark of HierText, achieving a new state-of-the-art result. Importantly, these results are obtained with a single model, without fine-tuning on target datasets, ensuring that the proposed method can support generic text extraction applications. In ablation studies, we also examine our key design choices.

Our core contributions can be summarized as follows:
• A novel Hierarchical Text Spotter for the joint task of word-level text spotting and geometric layout analysis.
• A Location and Shape Decoupling Module which enables accurate polygon prediction for text lines on diverse datasets.
• L2C2W, which reformulates the role of the recognizer in text spotting algorithms by performing part of the layout analysis and text entity localization.
• State-of-the-art results on both text spotting and geometric layout analysis benchmarks without fine-tuning on each particular test dataset.

2. Related Works

Text Spotting. Two-stage text spotters consist of a text detection stage and a text recognition stage. The text detection stage produces bounding polygons or rotated bounding boxes for text instances at one granularity, usually words. Text instances are cropped from input image pixels [4], encoded backbone features [26, 45], or both [49]. The text recognition stage then decodes the text transcription. End-to-end text spotters use feature maps for the cropping process; in this case, the text recognition stage reuses those features, improving computational efficiency [29]. However, end-to-end text spotters suffer from asynchronous convergence between the detection and recognition branches [22]. Due to this challenge, our proposed HTS crops from input image pixels with bounding polygons. The aforementioned text spotter framework connects detection and recognition explicitly with detection boxes. Another branch of two-stage text spotters performs implicit feature feeding via object queries [21], as in detection transformer [8], or deformable multi-head attention [67]. More recently, single-stage text spotters [20, 42] have been proposed under a sequence-to-sequence framework. These works do not perform layout analysis and are thus orthogonal to this paper.

Text Detection. Top-down text detection methods view text instances as objects. These methods produce detection boxes [30, 56] or instance segmentation masks [34] for each text instance. Bottom-up methods first detect sub-parts of text instances and then connect these parts to construct whole-text bounding boxes [52] or masks [36]. Top-down methods tend to have simpler pipelines, while bottom-up techniques excel at detecting text of arbitrary shapes and aspect ratios. Neither top-down nor bottom-up mask prediction methods are proficient at spotting curved text, because a mask can only locate text but cannot rectify it. Additionally, the performance of such models on curved text datasets is commonly reported after fine-tuning those models on the specific data. Therefore, it is unknown whether polygon prediction methods can adapt to text of arbitrary shapes and aspect ratios on diverse datasets.

Text Recognition. An important branch of text recognizers [12, 32, 40] formulates the task as a sequence-to-sequence problem [55], where the only output target is a sequence of characters. Another branch formulates the task as character detection [27, 31], producing character classes and locations simultaneously. However, the latter requires bounding box annotations on all training data, which are rare for real-image data. Our recognition method falls into the sequence-to-sequence learning paradigm, with the additional ability to produce each character's bounding box. Importantly, our model's training requires only partially annotated data, i.e., only a fraction of the data needs to include character-level bounding box annotations.
¹ Here, we follow the definitions of these levels in [34].

Layout Analysis. Geometric layout analysis [18, 43, 60, 68] aims to detect visually and geometrically coherent text blocks as objects. Recent works formulate this task as object detection [51], semantic segmentation [24, 34], or learning on the graphical structure of OCR tokens via GCN [60]. Almost all entries in the HierText competition at ICDAR 2023 [35] adopt the segmentation formulation. Unified Detector [34] consolidates the tasks of text line detection and geometric layout analysis. However, it cannot produce word-level entities and does not provide a recognition output. Another line of layout analysis research focuses on semantic parsing of documents [16, 23, 25] to identify key-value pairs. These methods build language models [13, 46] on top of OCR results. Recently, StructuralLM [25] and LayoutLMv3 [16] show that grouping words into segments using heuristics, which is equivalent to text line formation, improves parsing results. We believe our work of jointly spotting text and analyzing its layout can benefit such methods.

[Figure: Unified-Detector-Polygon architecture. N queries (N×D) are processed by KMaX-DeepLab into encoded queries (N×C) and pixel features (H′×W′×C), from which text masks (N×H×W) are derived; per-query heads predict Bezier control points (N×4(m+1)), the line affinity matrix (N×N), and textness scores (N×1).]
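For completeness, here is one simple way the predicted N×N line affinity matrix could be turned into paragraph groups, using thresholding plus union-find over the symmetrized affinities; this is our own sketch of the grouping idea, not necessarily the clustering rule used by UDP:

```python
import numpy as np

def group_lines(affinity: np.ndarray, threshold: float = 0.5) -> list:
    """Cluster text lines into paragraphs from an (N, N) affinity matrix.

    Two lines are linked if their symmetrized affinity exceeds `threshold`;
    paragraphs are the resulting connected components (union-find).
    """
    n = affinity.shape[0]
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    sym = (affinity + affinity.T) / 2.0
    for i in range(n):
        for j in range(i + 1, n):
            if sym[i, j] > threshold:
                parent[find(i)] = find(j)  # union the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())  # each paragraph is a list of line indices
```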
                              ICDAR 2015 Incidental                                    Total-Text
Method              Word-Spotting               End-to-End                  None                   Full
                  S      W      G      N      S      W      G      N      P      R      F1     P      R      F1
MTSv3⋆ [26]       83.1   79.1   75.1   -      83.3   78.1   74.2   -      -      -      71.2   -      -      78.4
MANGO⋆ [44]       85.2   81.1   74.6   -      85.4   80.1   73.9   -      -      -      68.9   -      -      78.9
YAMTS⋆ [22]       86.8   82.4   76.7   -      85.3   79.8   74.0   -      -      -      71.1   -      -      78.4
CharNet [61]      -      -      -      -      83.10  79.15  69.14  65.73  -      -      69.2   -      -      -
TESTR♯ [67]       -      -      -      -      85.16  79.36  73.57  65.27  -      -      73.3   -      -      83.9
Qin et al.♯ [45]  -      -      -      -      85.51  81.91  -      69.94  -      -      70.7   -      -      -
TTS⋆ [21]         85.0   81.5   77.3   -      85.2   81.7   77.4   -      -      -      75.6   -      -      84.4
GLASS♯ [49]       86.8   82.5   78.8   71.69* 84.7   80.1   76.3   70.15* -      -      76.6   -      -      83.0
UNITS⋆ [20]       88.1   84.9   80.7   78.7   88.4   83.9   79.7   78.5   -      -      77.3   -      -      85.0
HTS⋆♯ (ours)      89.55  85.72  81.23  78.62  89.38  84.61  80.69  78.81  80.41  75.92  78.10  90.12  80.74  85.17
Table 1. Results on ICDAR 2015 and Total-Text. ‘S’, ‘W’, ‘G’, and ‘N’ refer to strong, weak, generic, and no lexicons. ‘Full’ for Total-Text means a lexicon of all test set words, and is equivalent to the weak setting in ICDAR 2015; ‘None’ means no lexicon. ‘-’ means scores are not reported in the respective papers. ‘*’ means scores are obtained from the open-source code and weights. ‘⋆’ marks models that are not fine-tuned on individual datasets. ‘♯’ marks models that recognize all symbol classes, including case-sensitive characters and punctuation symbols.