
Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis

Shangbang Long, Siyang Qin, Yasuhisa Fujii, Alessandro Bissacco, Michalis Raptis
Google Research
{longshangbang,qinb,yasuhisaf,bissacco,mraptis}@google.com
arXiv:2310.17674v1 [cs.CV] 25 Oct 2023

Abstract

We propose Hierarchical Text Spotter (HTS), a novel method for the joint task of word-level text spotting and geometric layout analysis. HTS can recognize text in an image and identify its 4-level hierarchical structure: characters, words, lines, and paragraphs. The proposed HTS is characterized by two novel components: (1) a Unified-Detector-Polygon (UDP) that produces Bezier Curve polygons of text lines and an affinity matrix for paragraph grouping between detected lines; (2) a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and further merges them back into words. HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks.

Figure 1. Illustration of our Hierarchical Text Spotter (HTS). HTS consists of two main components: (1) Unified-Detector-Polygon (UDP), which detects text lines with bounding polygons and clusters them into paragraph groups. In the figure (upper right), paragraph groups are illustrated by different colors. The bounding polygons are used to crop and rectify text lines into canonical forms that are easy to recognize. (2) Line-to-Character-to-Word (L2C2W) Recognizer, which jointly predicts character classes and bounding boxes. Spaces are used to split lines into words. The output of HTS is a Hierarchical Text Representation (HTR) that encodes the layout of all text entities in an image. In the figure, we use indents to represent the hierarchy of text entities (middle right), and visualize the character bounding boxes (bottom right).
[Figure 1 content: line bounding polygons and paragraph grouping (colors); crop and rectify; transcriptions "GOLDEN SEA", "RESTAURANT SEAFOOD", "& BAR"; example Hierarchical Text Representation: Paragraph 1: Line 1: Word 1 "GOLDEN" (Char 1 "G", Char 2 "O", ...), Word 2 "SEA"; Paragraph 2: Line 1: Word 1 "RESTAURANT", Word 2 "SEAFOOD"; Line 2: Word 1 "&", Word 2 "BAR"; and the corresponding character bounding boxes.]
1. Introduction

The extraction and comprehension of text in images play a critical role in many computer vision applications. Text spotting algorithms have progressed significantly in recent years [33, 42, 45, 49, 67], specifically within the task of detecting [2, 28, 36, 63] and recognizing [5, 12, 40, 41, 59] individual text instances in images. Previously, defining the geometric layout [7, 9, 24, 62] of extracted textual content occurred independently of text spotting and remained focused on document images. In this paper, we aim to further the argument [34] that consolidating these separately treated tasks is complementary and mutually enhancing. We postulate that a joint approach to text spotting and geometric layout analysis could provide useful signals for downstream tasks that require semantic parsing and reasoning of text in images, such as text-based VQA [6, 53] and document understanding [16, 23, 25].

Existing text spotting methods [45, 49, 67] most commonly extract text at the word level, where 'word' is defined as a sequence of characters delimited by space, without taking into account the text context. Recently, the Unified Detector [34], which is built upon the detection transformer [58], detects text 'lines' with instance segmentation masks and produces an affinity matrix for paragraph grouping in an end-to-end way. This method is limited to the detection task and cannot produce character or word-level outputs.

In this paper, we propose a novel method, termed Hierarchical Text Spotter (HTS), that simultaneously localizes, recognizes, and recovers the geometric relationship of the text in an image. The framework of HTS is illustrated in Fig. 1.
It is designed to extract a hierarchical text representation (HTR) of text entities in images. HTR has four levels of hierarchy¹, including character, word, text line, and paragraph, from bottom to top. The HTR representation encodes the structure of text in images. To the best of our knowledge, HTS is the first unified method for text spotting and geometric layout analysis.

The proposed HTS consists of two main components: (1) A Unified-Detector-Polygon (UDP) model that jointly predicts Bezier Curve polygons [30] for text lines and an affinity matrix supporting the grouping of lines into paragraphs. Notably, we find that the conventional way of training a Bezier Curve polygon prediction head, i.e. applying L1 losses on control points directly [30, 47, 56], fails to capture text shapes accurately on highly diverse datasets such as HierText [34]. Hence, we propose a novel Location and Shape Decoupling Module (LSDM) which decouples the representation learning of location and shape. UDP equipped with LSDM can accurately detect text lines of arbitrary shapes, sizes and locations across multiple datasets of different domains. (2) A Line-to-Character-to-Word (L2C2W) text line recognizer based on a Transformer encoder-decoder [57] that jointly predicts character bounding boxes and character classes. L2C2W is trained to produce the special space character to delimit text lines into words. Also, unlike other recognizers or text spotters that are based on character detection [3, 27, 31, 61], L2C2W only needs a small fraction of training data to have bounding box annotations.

The proposed HTS method achieves state-of-the-art text spotting results on multiple datasets across different domains, including ICDAR 2015 [19], Total-Text [10], and HierText [34]. It also surpasses Unified Detector [34] on the geometric layout analysis benchmark of HierText, achieving new state-of-the-art results. Importantly, these results are obtained with a single model, without fine-tuning on target datasets, ensuring that the proposed method can support generic text extraction applications. In ablation studies, we also examine our key design choices.

Our core contributions can be summarized as follows:
• A novel Hierarchical Text Spotter for the joint task of word-level text spotting and geometric layout analysis.
• A Location and Shape Decoupling Module which enables accurate polygon prediction of text lines on diverse datasets.
• L2C2W, which reformulates the role of the recognizer in text spotting algorithms by performing part of the layout analysis and text entity localization.
• State-of-the-art results on both text spotting and geometric layout analysis benchmarks without fine-tuning to each particular test dataset.

¹ Here, we follow the definitions of these levels in [34].

2. Related Works

Text Spotting Two-stage text spotters consist of a text detection stage and a text recognition stage. The text detection stage produces bounding polygons or rotated bounding boxes for text instances at one granularity, usually words. Text instances are cropped from input image pixels [4], encoded backbone features [26, 45], or both [49]. The text recognition stage decodes the text transcription. End-to-end text spotters use feature maps for the cropping process. In this case, the text recognition stage reuses those features, improving the computational efficiency [29]. However, end-to-end text spotters suffer from asynchronous convergence between the detection and the recognition branch [22]. Due to this challenge, our proposed HTS crops from input image pixels with bounding polygons. The aforementioned text spotter framework connects detection and recognition explicitly with detection boxes. Another branch of two-stage text spotters performs implicit feature feeding via object queries [21], as in detection transformer [8] or deformable multi-head attention [67]. More recently, single-stage text spotters [20, 42] have been proposed under a sequence-to-sequence framework. These works do not perform layout analysis and are thus orthogonal to this paper.

Text Detection Top-down text detection methods view text instances as objects. These methods produce detection boxes [30, 56] or instance segmentation masks [34] for each text instance. Bottom-up methods first detect sub-parts of text instances and then connect these parts to construct whole-text bounding boxes [52] or masks [36]. Top-down methods tend to have simpler pipelines, while bottom-up techniques excel at detecting text of arbitrary shapes and aspect ratios. Neither top-down nor bottom-up mask prediction methods are proficient for spotting curved text, because a mask can only locate text but cannot rectify it. Additionally, the performance of such models on curved text datasets is commonly reported by fine-tuning those models on the specific data. Therefore, it is unknown whether polygon prediction methods can adapt to text of arbitrary shapes and aspect ratios on diverse datasets.

Text Recognition An important branch of text recognizers [12, 32, 40] formulates the task as a sequence-to-sequence task [55], where the only output target is a sequence of characters. Another branch formulates the task as character detection [27, 31], where the model produces character classes and locations simultaneously. However, it requires bounding box annotations on all training data, which are rare for real-image data. Our recognition method falls into the sequence-to-sequence learning paradigm, with the additional ability to produce each character's bounding box. Importantly, our model's training requires only partially annotated data, i.e. only a fraction of the data needs to include character-level bounding box annotations.
Layout analysis Geometric layout analysis [18, 43, 60, 68] aims to detect visually and geometrically coherent text blocks as objects. Recent works formulate this task as object detection [51], semantic segmentation [24, 34], or learning on the graphical structure of OCR tokens via GCN [60]. Almost all entries in the HierText competition at ICDAR 2023 [35] adopt the segmentation formulation. Unified Detector [34] consolidates the task of text line detection and geometric layout analysis. However, it cannot produce word-level entities and does not provide a recognition output. Another line of layout analysis research focuses on semantic parsing of documents [16, 23, 25] to identify key-value pairs. These methods build language models [13, 46] on top of OCR results. Recently, StructuralLM [25] and LayoutLMv3 [16] show that the grouping of words into segments using heuristics, which is equivalent to text line formation, improves parsing results. We believe our work on joint text spotting and geometric layout analysis can benefit semantic parsing and layout analysis.
3. Methodology

3.1. Hierarchical Text Spotter

As illustrated in Fig. 1, our HTS method mainly comprises two stages: (1) Unified Detection Stage: we propose an end-to-end trainable model termed Unified-Detector-Polygon (UDP) that detects text lines in the form of Bezier Curve Polygons [30], and simultaneously clusters them into paragraphs. UDP contains the Location and Shape Decoupling Module (LSDM), a key component for accurate text line detection across diverse datasets. Text line images are cropped from the input image with BezierAlign [30] and then converted to grayscale image patches. (2) Line Recognition Stage: We propose an autoregressive text line recognizer based on a Transformer encoder-decoder [57] that jointly predicts character bounding boxes and character classes. We train our recognizer to identify printable characters and a special non-printable space delimiter. We use the space character to split text lines into word-level granularity. The word-level bounding boxes are formed from the predicted character-level bounding boxes. Character and word bounding boxes are estimated in the coordinate space of text line image patches. During the post-processing step, they are projected back to the input image coordinate space. Putting these together, we obtain a hierarchical text representation of character, word, line, and paragraph.

Figure 2. Illustration of our Unified-Detector-Polygon (UDP). Top: Architecture of UDP, where each color tint represents one prediction branch. N is the number of queries. m is the order of the Bezier Curves. C is the model width. D is the query dimension. Middle: Architecture of our Bezier polygon prediction head with a dual-head Location and Shape Decoupling Module. Bottom: Illustrations for the axis-aligned bounding box (AABB), local and global Bezier curve representation.
[Figure 2 content: queries (N×D) and pixel features (H'×W'×C) feed a KMaX-DeepLab backbone; the encoded queries (N×C) feed a Bezier Head (control points, N×4(m+1)), a Layout Head (affinity matrix, N×N), and a Cls Head (textness score, N×1), alongside text masks (N×H×W). The Bezier polygon prediction head decouples a Location Head (AABB FFN, N×4) from a Shape Head (Local Bezier FFN, N×4(m+1)), whose outputs are combined into Global Bezier control points (N×4(m+1)).]
3.2. Unified Detection of Text Line and Paragraph

Preliminaries Based on MaX-DeepLab [58], Unified Detector [34] detects text lines by producing instance segmentation masks from the inner product of object queries and pixel features. Further, an affinity matrix that represents the paragraph grouping is produced by computing the inner product of layout features, which are extracted by extra transformer layers applied on object queries.
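The affinity matrix itself only provides pairwise scores between detected lines; turning it into paragraph groups requires a clustering step. The sketch below is one straightforward realization under our own assumptions (a fixed threshold and connected-component grouping via union-find; the affinity and keep inputs are illustrative names), not the released implementation:

```python
import numpy as np

def group_lines_into_paragraphs(affinity, keep, threshold=0.5):
    """Cluster detected text lines into paragraphs from an N x N affinity matrix.

    affinity:  (N, N) pairwise line-affinity scores.
    keep:      (N,) boolean mask of queries that pass the textness filter.
    threshold: assumed cut-off on affinity scores (hypothetical value).
    Returns a list of paragraphs, each a list of line indices.
    """
    idx = [int(i) for i in np.flatnonzero(keep)]
    parent = {i: i for i in idx}

    def find(i):                          # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Link every pair of kept lines whose affinity clears the threshold.
    for a in idx:
        for b in idx:
            if a < b and affinity[a, b] > threshold:
                union(a, b)

    groups = {}
    for i in idx:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```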
Unified-Detector-Polygon (UDP) While Unified Detector [34] achieves state-of-the-art text detection performance, it uses only masks to localize text instances. The estimated masks cannot be directly used to rectify curved text lines. Thus, complex post-processing heuristics are required to build an effective text-spotting system. Therefore, we extend the model with an additional Bezier polygon prediction head applied on the encoded object queries, as illustrated in the top of Fig. 2. The Bezier polygon prediction head produces a polygon representation [30] based on Bezier Curves². In this representation, each text line is parametrized as two Bezier Curves of order m, one for the top and one for the bottom polyline of the text boundary. Each Bezier Curve has m + 1 control points. The model is trained to predict these 2(m + 1) control points, i.e. 4(m + 1) coordinates. During inference, the text boundaries are reconstructed from the predicted control points. In addition, we also replace MaX-DeepLab in the original Unified Detector with KMaX-DeepLab [65] as the backbone, which is faster and more accurate.

² https://en.wikipedia.org/wiki/Bezier_curve
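To make the reconstruction step concrete, the sketch below samples each predicted Bezier curve with the Bernstein basis and joins the top and bottom polylines into a closed boundary polygon. The sampling density and function names are our own choices for illustration; the released code may reconstruct boundaries differently.

```python
import numpy as np
from math import comb

def bezier_points(ctrl, n_samples=20):
    """Sample points along a Bezier curve of order m = len(ctrl) - 1.

    ctrl: (m+1, 2) array of control points. Returns (n_samples, 2) points.
    """
    ctrl = np.asarray(ctrl, dtype=np.float64)
    m = len(ctrl) - 1
    t = np.linspace(0.0, 1.0, n_samples)
    # Bernstein basis: B_{j,m}(t) = C(m, j) * t^j * (1 - t)^(m - j)
    basis = np.stack(
        [comb(m, j) * t**j * (1.0 - t)**(m - j) for j in range(m + 1)], axis=1)
    return basis @ ctrl

def line_boundary_polygon(top_ctrl, bottom_ctrl, n_samples=20):
    """Closed text-line boundary from the top and bottom Bezier curves.

    The top curve is traversed left-to-right and the bottom curve
    right-to-left, mirroring the two-curve parametrization described above.
    """
    top = bezier_points(top_ctrl, n_samples)
    bottom = bezier_points(bottom_ctrl, n_samples)
    return np.concatenate([top, bottom[::-1]], axis=0)
```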
Location and Shape Decoupling Module Previous works [47, 56] use a single feed-forward neural network (FFN) to predict the control points in image space and train the network by applying L1 loss on the control points. However, as shown in Sec. 4.4, such an approach results in sub-optimal detection accuracy for text line datasets such as HierText [34], due to its diverse locations, aspect ratios, and shapes. To mitigate this issue, we propose a novel Location and Shape Decoupling Module (LSDM). As shown in the middle of Fig. 2, it consists of two parallel FFNs, one for location prediction and the other for shape prediction. The Location Head predicts Axis-Aligned Bounding Boxes (AABB) whose coordinates are normalized in the image space. For the i-th text instance, we denote its predicted AABB as:

AABB_i = [x_{center,i}, y_{center,i}, w_i, h_i] ∈ R^4    (1)

representing its center, width, and height. The Shape Head predicts Local Bezier Curve control points whose coordinates are normalized in the space of the AABB:

bezier_{local,i} = {(x̃_{i,j}, ỹ_{i,j})}_{j=1}^{2(m+1)}    (2)

Finally, the Global (i.e. image space) Bezier curve control point coordinates are obtained by scaling and translating the Local Bezier coordinates by the AABB:

bezier_{global,i} = {(x_{i,j}, y_{i,j})}_{j=1}^{2(m+1)}    (3)
where x_{i,j} = x̃_{i,j} · w_i + x_{center,i}    (4)
      y_{i,j} = ỹ_{i,j} · h_i + y_{center,i}    (5)

The concepts of AABB, Local Bezier coordinates, and Global Bezier coordinates are further illustrated in the bottom of Fig. 2. During training, we generate appropriate ground-truth data for both heads and apply supervision on both of them. Specifically, given ground-truth Global Bezier control points, we first compute the ground-truth AABB as the minimum-area AABB enclosing the ground-truth polygons, and then use the reverse of Eq. (3)-(5) to compute ground-truth Local Bezier control points. The final training loss is the weighted sum of the Unified Detector [34] loss, the GIoU loss on AABB [48], the L1 loss on AABB, and the L1 loss on local control points:

L_det = L_{unified detector} + λ1 · L_{AABB,GIoU} + λ2 · L_{AABB,L1} + λ3 · L_{Local,L1}    (6)

where λ1, λ2, λ3 are the weights for loss balancing.
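The coordinate transforms of Eq. (3)-(5) and their reverse (used to derive the ground-truth local control points) reduce to a few lines of array arithmetic. The following is a minimal sketch with illustrative names; note that it approximates the ground-truth AABB by the box enclosing the control points, whereas the text encloses the ground-truth polygons.

```python
import numpy as np

def local_to_global(bezier_local, aabb):
    """Eq. (3)-(5): map AABB-normalized control points to image coordinates.

    bezier_local: (2(m+1), 2) local control points (x~, y~).
    aabb:         (4,) array [x_center, y_center, w, h] in image space.
    """
    xc, yc, w, h = aabb
    return np.asarray(bezier_local) * np.array([w, h]) + np.array([xc, yc])

def global_to_local(bezier_global):
    """Reverse of Eq. (3)-(5): ground-truth AABB and local control points.

    The AABB here is the minimum axis-aligned box around the control points
    (an approximation of the minimum box around the ground-truth polygon).
    """
    pts = np.asarray(bezier_global, dtype=np.float64)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    aabb = np.array([(x_min + x_max) / 2.0, (y_min + y_max) / 2.0,
                     x_max - x_min, y_max - y_min])
    local = (pts - aabb[:2]) / np.maximum(aabb[2:], 1e-6)
    return local, aabb
```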
3.3. Line-to-Character-to-Word Recognition

We propose a novel hierarchical text recognition framework, termed Line-to-Character-to-Word (L2C2W). Fig. 3 illustrates our framework. Text line images are cropped and rectified from the input image with BezierAlign [30]. We use the grayscale cropped image as input for the recognizer. The model predicts character-level outputs. To correctly group characters into words, our recognition model learns to predict both printable characters and the space character. During inference, we use the space as the delimiter to segment a text line string into words. The model also produces character-level bounding boxes. These character bounding boxes are grouped based on each word's boundaries to produce the words' bounding boxes.
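The line-to-word grouping described above can be summarized in a short sketch: characters are accumulated until a predicted space is met, and each word box is the tight axis-aligned box around its member character boxes (in line-crop coordinates). Variable names are illustrative, not taken from the implementation.

```python
def split_line_into_words(chars, boxes):
    """Group character predictions into words using the space delimiter.

    chars: list of predicted characters, including ' ' for the space class.
    boxes: list of [x_min, y_min, x_max, y_max] character boxes in the
           coordinate space of the rectified line crop.
    Returns a list of (word_string, word_box, char_boxes) tuples.
    """
    def finish(word_chars, word_boxes):
        # Word box = minimum axis-aligned box enclosing its character boxes.
        x0 = min(b[0] for b in word_boxes)
        y0 = min(b[1] for b in word_boxes)
        x1 = max(b[2] for b in word_boxes)
        y1 = max(b[3] for b in word_boxes)
        return ''.join(word_chars), [x0, y0, x1, y1], word_boxes

    words, cur_chars, cur_boxes = [], [], []
    for ch, box in zip(chars, boxes):
        if ch == ' ':                      # the space character delimits words
            if cur_chars:
                words.append(finish(cur_chars, cur_boxes))
                cur_chars, cur_boxes = [], []
        else:
            cur_chars.append(ch)
            cur_boxes.append(box)
    if cur_chars:
        words.append(finish(cur_chars, cur_boxes))
    return words
```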
Figure 3. Illustration of our Line-to-Character-to-Word (L2C2W) recognition method. Top: Text line images are cropped and rectified from the input image using BezierAlign [30]. Our L2C2W recognition model uses an autoregressive transformer encoder-decoder model [57] to predict character class and box simultaneously. Middle: Output sample. Bottom: Text line recognition results are split into words, and character bounding boxes are clustered in accordance with words to form word bounding boxes. Bounding boxes are projected back to image space.
[Figure 3 content: image and line polygon → BezierAlign → cropped rectified gray line image → CNN → positional encoding → Transformer encoder and decoder (autoregressive recognition starting from <SOS>, e.g. "M"); recognized text: "MORNING CALL COFFEE STAND"; character bounding boxes; split words: "MORNING" "CALL" "COFFEE" "STAND"; character and word bounding boxes projected with BezierAlign's bijective correspondence.]

Text Line Recognition Model Our transformer-based recognizer consists of three stages. First, a MobileNetV2 [50] convolutional backbone encodes the image pixels and reduces the height dimension to 1 using strided convolutions. Then, a sinusoidal positional encoding [57] is added, and transformer encoder layers are applied on the encoded features. Lastly, a transformer decoder produces the predicted output autoregressively [55].

Character Localization We use axis-aligned bounding boxes to represent the location of characters in cropped text lines. A vanilla transformer decoder [57] has only one prediction head, which produces a probability distribution over the next token. To predict character bounding boxes, we add a 2-layer FFN prediction head on the output feature from the decoder, in parallel to the classification head. The character location head produces a 4d vector representing the top-left and bottom-right coordinates of the character bounding box. These character coordinates are normalized by each text line's height.

Training The total loss for training is the weighted sum of the character classification loss and the character localization loss. We use cross-entropy for character classification and L1 loss for character localization. It is important to note that ground-truth annotated character bounding boxes are rare in real-image datasets but are available in most synthetic text data [14, 37, 64]. During training, we mix real-image and synthetic data and apply the character localization loss only when ground-truth labels are available. The training target for one text line can be formulated as:

L_rec = (1/T) Σ_{t=1}^{T} L_CE(y_t, ŷ_t) + λ4 · ( Σ_{t=1}^{T} α_t L_L1(box_t, b̂ox_t) ) / ( Σ_{t=1}^{T} α_t + ϵ )    (7)

where T is the number of characters, λ4 is the weight for the localization loss, α_t is an indicator for whether character t has a ground-truth bounding box, and ϵ is a small positive number to avoid a zero denominator. In practice, we do the summation and averaging at the batch level, to balance the loss between long and short text.
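For a single line, Eq. (7) can be written with plain TensorFlow ops as below. The indicator α_t becomes a float mask that zeroes out characters without ground-truth boxes (e.g. most real-image samples); tensor names are illustrative and this is not the Model Garden training code.

```python
import tensorflow as tf

def recognition_loss(class_logits, gt_classes, pred_boxes, gt_boxes,
                     box_mask, lambda4=0.05, eps=1e-6):
    """Eq. (7): character cross-entropy plus masked L1 box regression.

    class_logits: (T, num_classes)  pred_boxes: (T, 4)
    gt_classes:   (T,) int labels   gt_boxes:   (T, 4)
    box_mask:     (T,) float, 1.0 where a ground-truth character box exists.
    """
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        gt_classes, class_logits, from_logits=True)             # (T,)
    ce_loss = tf.reduce_mean(ce)                                 # 1/T * sum of CE

    l1 = tf.reduce_sum(tf.abs(pred_boxes - gt_boxes), axis=-1)   # (T,)
    box_loss = tf.reduce_sum(box_mask * l1) / (tf.reduce_sum(box_mask) + eps)

    return ce_loss + lambda4 * box_loss
```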
Post-processing We partition text lines into words using the predicted space character. We obtain word-level bounding boxes by finding the minimum-area axis-aligned bounding boxes of each word's characters. Finally, we project these word and character bounding boxes back to the image space. When we perform BezierAlign [30] in line cropping, we build a bijection from coordinates in text line crops to coordinates in the input image. We re-use this bijection to compute this projection operation (detailed in supplementary material Sect. A).
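The exact bijection is given in the supplementary material. Under the assumption that BezierAlign samples the crop on a regular grid interpolated between the top and bottom Bezier curves, a point in the rectified crop can be mapped back to the image roughly as sketched below; the helper names and the interpolation rule are our own simplification.

```python
import numpy as np
from math import comb

def bezier_point(ctrl, t):
    """Evaluate a Bezier curve with control points ctrl ((m+1, 2)) at t in [0, 1]."""
    ctrl = np.asarray(ctrl, dtype=np.float64)
    m = len(ctrl) - 1
    basis = np.array([comb(m, j) * t**j * (1.0 - t)**(m - j) for j in range(m + 1)])
    return basis @ ctrl

def crop_to_image(u, v, top_ctrl, bottom_ctrl):
    """Map normalized crop coordinates back to image coordinates.

    u in [0, 1]: position along the line (left to right).
    v in [0, 1]: position across the line height (top to bottom).
    Assumes the crop grid is a linear blend of the two boundary curves.
    """
    top = bezier_point(top_ctrl, u)
    bottom = bezier_point(bottom_ctrl, u)
    return (1.0 - v) * top + v * bottom

# Example: the top-left corner of a character box located at 30% of the
# crop width maps to crop_to_image(0.3, 0.0, top_ctrl, bottom_ctrl).
```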


4. Experiments

In this section, we evaluate the proposed method on a number of benchmarks. We first introduce the experimental settings, including the training and test datasets, the hyper-parameters of models, and the evaluation practices. We compare our method to the current state-of-the-art on end-to-end text spotting, text detection, and geometric layout analysis [34]. Finally, we conduct comprehensive ablation studies and analyze our design choices.

4.1. Experiment Setting

Unified-Detector-Polygon We base our UDP implementation on the official repository³ of Unified Detector [34]. The input resolution is 1600 × 1600. Model dimensions are N = 384, D = 256, C = 128 respectively, using the same settings as Unified Detector [34]. As for the Bezier polygon prediction head, we use 2-layer MLPs for both branches, with a hidden state dimension of 256. ReLU and LayerNorm [1] are applied in between the two layers. The AABB and Local Bezier prediction head outputs are activated by a sigmoid and a linear function respectively. We use m = 3, i.e. cubic Bezier Curves. For the loss balancing weights, we set λ1 = 1.0, λ2 = 2.5, and λ3 = 0.5. The ratio of λ1 and λ2 is set after DETR [8]. UDP is trained on 128 TPUv3 devices for 100K iterations with a batch size of 256, the AdamW [39] optimizer, a cosine learning rate [38] of 0.001, and a weight decay of 0.05. We train UDP on a combination of the training sets of HierText [34] and CTW1500 [66], which both provide line-level text annotations. During training, images are randomly rotated, cropped, padded, and resized to the input resolution. A random scheme of color distortion [11] is also applied.

³ https://github.com/tensorflow/models/tree/master/official/projects/unified_detector
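The head configuration above translates almost directly into a small Keras module. The sketch below follows the stated design (two parallel 2-layer MLPs with hidden size 256, ReLU and LayerNorm between the layers, a sigmoid output for the AABB and a linear output for the Local Bezier points), but the wiring and names are our reading of the text rather than the released code.

```python
import tensorflow as tf

def build_lsdm_head(query_dim=128, hidden=256, m=3):
    """Location and Shape Decoupling Module as two parallel 2-layer MLPs.

    Input:  (batch, query_dim) encoded object queries (one per detection).
    Output: AABB (batch, 4) in normalized image space, and local Bezier
            control points (batch, 4 * (m + 1)) in AABB-normalized space.
    """
    queries = tf.keras.Input(shape=(query_dim,))

    def mlp(x, out_dim, activation):
        x = tf.keras.layers.Dense(hidden)(x)
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.LayerNormalization()(x)
        return tf.keras.layers.Dense(out_dim, activation=activation)(x)

    aabb = mlp(queries, 4, 'sigmoid')                 # Location Head
    local_bezier = mlp(queries, 4 * (m + 1), None)    # Shape Head (linear)
    return tf.keras.Model(queries, [aabb, local_bezier], name='lsdm_head')
```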
L2C2W Recognizer We use the TensorFlow Model Garden library [15] to implement our model. Input text lines are resized to a height of 40 pixels and padded to a width of 1024 pixels to accommodate the variable aspect ratios of lines. The CNN backbone is a MobileNetV2 [50] model with 7 identical blocks, each with a filter dimension of 64 and an expansion ratio of 8. The following strided convolution has 128 filters. The transformer encoder stack consists of 8 encoder layers, with a hidden size of 256 and 4 heads for each layer. The inner layers of FFNs in the transformer encoders have a hidden size of 512. We use a single-layer transformer decoder with hidden size 256 and only 1 attention head. The character classification head is trained to recognize case-sensitive Latin characters, digits and printable punctuation symbols (see supplementary material Sect. B). We set λ4 = 0.05 for the bounding box loss. L2C2W is trained on 16 TPUv3 cores for 200K iterations with a batch size of 1024 and the same optimizer setup as UDP. The training data consists of SynthText [14], Synth90K [17], HierText [34], ICDAR 2015 [19], Total-Text [10], CTW1500 [66], and an internal dataset of 1M synthetic text lines, with a sampling ratio of [0.25, 0.20, 0.25, 0.0005, 0.001, 0.001, 0.25]. From the full-image datasets [10, 14, 19, 34, 66], we use the ground-truth text polygon to crop and rectify text. SynthText and HierText provide word and line-level annotations. The internal synthetic dataset generation process utilizes a similar method to Synth90K [17] but mainly contains text lines instead of single words⁴. SynthText and our internal synthetic dataset provide annotations for character-level bounding boxes.

⁴ See Supp. Sect. C. The dataset will be made publicly available.

Evaluation Practices Unless specified otherwise, e.g. in ablation studies, we use the same model and weights in all experiments. We do not perform fine-tuning on individual datasets. During inference, we filter the model's output with a confidence threshold of 0.5 for the detector and 0.8 for the recognizer. We determine these thresholds on the HierText validation set and apply them to all experiments.

4.2. Results on End-to-End Text Spotting

4.2.1 Comparison with State-of-the-Art Results

We evaluate the proposed HTS method on ICDAR 2015 Incidental [19] and Total-Text [10], the most popular benchmarks for straight and curved text respectively, and compare our results with the current state-of-the-art. The evaluation of ICDAR 2015 is case-insensitive and includes several heuristics with regard to punctuation symbols and text length. In the End-to-End mode, if a ground-truth text starts with or ends with punctuation, it is considered a true positive match whether or not the prediction includes those punctuation symbols. In the Word-Spotting mode, both ground-truth and predictions are normalized by: (1) removing the 's and 'S suffixes; (2) removing dash ('-') prefixes and suffixes; (3) removing other punctuation symbols; (4) only keeping normalized words that are at least 3 letters long. The Total-Text dataset does not provide an evaluation script for text spotting, and previous works evaluate their results using a script⁵ adapted from ICDAR 2015's. This script inherits the aforementioned heuristics but computes IoU based on polygons as opposed to rotated bounding boxes. Additionally, both datasets are evaluated with the help of lexicon lists of different levels of perplexity⁶.

⁵ https://github.com/MhLiao/MaskTextSpotterV3/tree/master/evaluation/totaltext/e2e
⁶ https://rrc.cvc.uab.es/?ch=4&com=tasks

To adapt to these heuristics, we transform the output of HTS by: (1) converting all characters to lower case; (2) removing all non-alphanumeric symbols; (3) removing a detection if it consists of only punctuation symbols; (4) using edit distance to pick the best match when lexicons are used.
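This output transformation is simple enough to sketch directly (steps 1-3; the lexicon matching of step 4 is omitted). The helper below is illustrative and is not the official evaluation code.

```python
import string

_ALNUM = set(string.ascii_lowercase + string.digits)

def normalize_prediction(word):
    """Lower-case a predicted word and strip non-alphanumeric symbols.

    Returns None when nothing remains, i.e. the detection consists only of
    punctuation symbols and should be dropped (steps 1-3 above).
    """
    word = word.lower()                                  # (1) lower-case
    word = ''.join(c for c in word if c in _ALNUM)       # (2) drop non-alphanumeric
    return word if word else None                        # (3) drop punctuation-only

# Example: normalize_prediction("GOLDEN!") -> "golden"; normalize_prediction("&") -> None
```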
Table 1 summarizes our evaluation results. We mark methods with different labels based on: a) the ability to recognize case-sensitive or case-insensitive characters and b) whether the underlying models are fine-tuned on the target dataset. On ICDAR 2015, our proposed HTS surpasses the recent state-of-the-art UNITS [20] considerably and beats previous methods significantly without fine-tuning. In the Word-Spotting mode, HTS has a large margin of +1.45 / +0.82 / +0.53 on S/W/G lexicons and almost matches UNITS in the non-lexicon setting. In the End-to-End mode, HTS also achieves considerable margins in all lexicon settings. Note that the strongest competitor, UNITS, uses additional training data from TextOCR [54] while we don't, demonstrating the advantage of our method.

On Total-Text, we surpass all current state-of-the-art methods in both settings. Some of these prior arts [21, 49, 67] fine-tune their models on Total-Text, which boosts the performance on this target dataset at the cost of dropping performance on others. Also note that some prior arts [21, 22, 26, 44] limit recognition to case-insensitive letters and no punctuation symbols, while ours operates in a case-sensitive mode, a more difficult but more important one. This is not reflected in the scores due to the text normalization rules in the evaluation protocol.

4.2.2 Comparison based on HierText's Eval

We also compare the proposed HTS method with others under the evaluation protocol⁷ of HierText [34]. The HierText protocol directly compares predictions against ground-truth, without normalizing letter cases or punctuation symbols, and without filtering based on text length. It does not have lexicon modes either. Compared with the ICDAR 2015 protocol, it provides a stricter and more comprehensive comparison, since letter cases, punctuation symbols, and text of different lengths are all important in real-scenario applications.

For a fairer comparison, we re-train several previous state-of-the-art methods that had open-sourced code by the time of this work, including MTSv3 [26], TESTR [67] and GLASS [49], using their open-source code. We use the same combination of HierText [34], Total-Text [10], CTW1500 [66], SynthText [14], and ICDAR 2015 [19] as training data and evaluate on HierText [34], Total-Text [10], and ICDAR 2015 [19]. We obtain results on the HierText test set using the online platform⁸ since the test set annotation is not released. We are also the first to report results on the HierText test set [34]. Results are summarized in Table 2.

⁷ https://github.com/google-research-datasets/hiertext/blob/main/eval.py
⁸ https://rrc.cvc.uab.es/?ch=18
Method | ICDAR 2015 Word-Spotting (S / W / G / N) | ICDAR 2015 End-to-End (S / W / G / N) | Total-Text None (P / R / F1) | Total-Text Full (P / R / F1)
MTSv3⋆ [26] | 83.1 / 79.1 / 75.1 / - | 83.3 / 78.1 / 74.2 / - | - / - / 71.2 | - / - / 78.4
MANGO⋆ [44] | 85.2 / 81.1 / 74.6 / - | 85.4 / 80.1 / 73.9 / - | - / - / 68.9 | - / - / 78.9
YAMTS⋆ [22] | 86.8 / 82.4 / 76.7 / - | 85.3 / 79.8 / 74.0 / - | - / - / 71.1 | - / - / 78.4
CharNet [61] | - / - / - / - | 83.10 / 79.15 / 69.14 / 65.73 | - / - / 69.2 | - / - / -
TESTR♯ [67] | - / - / - / - | 85.16 / 79.36 / 73.57 / 65.27 | - / - / 73.3 | - / - / 83.9
Qin et al.♯ [45] | - / - / - / - | 85.51 / 81.91 / - / 69.94 | - / - / 70.7 | - / - / -
TTS⋆ [21] | 85.0 / 81.5 / 77.3 / - | 85.2 / 81.7 / 77.4 / - | - / - / 75.6 | - / - / 84.4
GLASS♯ [49] | 86.8 / 82.5 / 78.8 / 71.69* | 84.7 / 80.1 / 76.3 / 70.15* | - / - / 76.6 | - / - / 83.0
UNITS⋆ [20] | 88.1 / 84.9 / 80.7 / 78.7 | 88.4 / 83.9 / 79.7 / 78.5 | - / - / 77.3 | - / - / 85.0
HTS⋆♯ (ours) | 89.55 / 85.72 / 81.23 / 78.62 | 89.38 / 84.61 / 80.69 / 78.81 | 80.41 / 75.92 / 78.10 | 90.12 / 80.74 / 85.17

Table 1. Results for ICDAR 2015 and Total-Text. 'S', 'W', 'G' and 'N' refer to strong, weak, generic and no lexicons. 'Full' for Total-Text means all test set words, and is equivalent to the weak setting in ICDAR 2015. '-' means scores are not reported by the papers. '*' means scores are obtained from the open-source code and weights. ⋆ means models are not fine-tuned on individual datasets. ♯ means models recognize all symbol classes, including case-sensitive characters and punctuation symbols.

Method | ICDAR 2015 (P / R / F1) | Total-Text (P / R / F1) | HierText test (P / R / F1)
TESTR | 65.52 / 68.08 / 66.78 | 59.40 / 68.33 / 63.55 | 65.05 / 44.89 / 53.12
MTSv3 | 63.89 / 58.88 / 61.28 | 64.13 / 62.85 / 63.48 | 66.61 / 41.29 / 50.98
GLASS | 74.11 / 63.08 / 68.15 | 68.54 / 60.12 / 64.05 | 73.84 / 57.20 / 64.47
HTS (ours) | 81.87 / 68.41 / 74.53 | 75.65 / 69.43 / 72.40 | 86.71 / 68.48 / 76.52

Table 2. Results under the evaluation protocol of HierText.


Method | Line Grouping (P / R / F1 / T / PQ) | Paragraph Grouping (P / R / F1 / T / PQ)
Unified Detector [34] | 79.64 / 80.19 / 79.91 / 77.87 / 62.23 | 76.04 / 62.45 / 68.58 / 78.17 / 53.60
HTS (ours) | 82.71 / 82.03 / 82.37 / 80.51 / 66.31 | 75.26 / 75.98 / 75.62 / 79.67 / 60.25

Table 3. Results of geometric layout analysis on the HierText test set. PQ (Panoptic Quality) equals the product of F1 and Tightness (T).

Our HTS achieves a significant advantage over these baselines by a large margin on all datasets, proving the effectiveness of our method across straight and curved text, and sparse and dense text. For ICDAR 2015 and Total-Text, the performance gap with Tab. 1 highlights the impact of text normalization and the use of lexicon lists, and shows that with such heuristics in evaluation we tend to overestimate the progress of text spotting methods' accuracy. HierText is a new dataset that is characterized by its high word density of more than 100 words per image, a variety of image domains, a diversity in text sizes and locations, and an abundance of text lines that contain plenty of punctuation symbols. The recall rate is lower than 45% for TESTR and MTSv3, and lower than 60% for GLASS, while our proposed HTS can recall more than 68% of words. That indicates that the word-centric design of most existing text spotting models is not optimal for natural images with high text density.

4.3. Results on Geometric Layout Analysis

The proposed HTS is able to estimate the text's layout structure in images, as shown in Fig. 4. We further evaluate our model on HierText on the geometric layout analysis task, and summarize the results in Table 3. HTS achieves better scores in the PQ metric on both line (+4.08) and paragraph grouping (+6.65) compared to Unified Detector. Most notably, line and paragraph predictions are formed as union masks of the underlying character boxes. This indicates that our character localization, as well as the word box estimation based on it, are accurate.

Figure 4. Qualitative results for layout analysis. We draw character bounding boxes with different colors to indicate layout at different levels. [Panels: individual characters, words, lines, paragraphs.]

4.4. Ablation Studies

To better understand the effectiveness of our design choices, we conduct ablation studies and summarize the results in Tab. 4. Different from the previous sections, here we use 1024 × 1024 as the input image resolution.
Figure 5. Qualitative comparison between the Single FFN prediction head (Top) and the proposed Location and Shape Decoupling Module (LSDM) prediction head (Bottom) for Bezier curve polygons. The original images are turned to gray for clearer views.

Ablation | Total-Text (P / R / F1) | HierText Val (P / R / F1)
Full | 72.33 / 62.95 / 67.31 | 86.54 / 67.03 / 75.55
w/o LSDM | 68.79 / 56.12 / 61.81 | 86.75 / 60.97 / 71.61
Mask-based detection | 64.52 / 54.39 / 59.02 | 84.87 / 63.52 / 72.66
Word-level detection | 71.54 / 58.61 / 64.43 | 80.71 / 51.94 / 63.21
MaX-DeepLab | 70.16 / 59.32 / 64.28 | 87.66 / 65.07 / 74.69

Table 4. Ablation results as evaluated on the text spotting task.

LSDM We replace our LSDM prediction head with a single FFN branch prediction head to produce the Global Bezier directly, and remove the 2 loss items on AABB. We use λ5 = 3 as the weight of this loss, which is larger than λ1-3, to compensate for the difference in loss scales. The model is trained on the same combination of HierText and CTW1500. As shown in Tab. 4, the removal of LSDM results in a sharp drop of text spotting performance on both Total-Text (-5.50) and the HierText validation set (-3.94). Fig. 5 further demonstrates that LSDM is important for the learning of text shapes. Without LSDM, the predicted location is only roughly correct but the shape is inaccurate. This shows that training a Bezier prediction head with L1 loss on diverse datasets can be dominated by the location learning, and thus shape prediction fails. The proposed LSDM, on the other hand, solves this issue by separating and balancing the learning of location and shape.

Mask v.s. Polygon One main difference between our UDP and Unified Detector [34] is that UDP produces polygons as output as opposed to masks. In this ablation study, we use the mask outputs as detections and find minimum-area rotated bounding boxes instead of using the predicted Bezier polygons. This results in a significant drop on Total-Text (-8.29) and a less severe drop on HierText (-2.89). Mask representation is unsuitable for curved text spotting since it is non-trivial to crop and rectify with masks. Note that HierText consists mostly of straight text and is thus less affected.

Word Based v.s. Line Based OCR We train HTS on HierText and Total-Text for the word spotting task, as opposed to line-level. The line-based model is better than the word-based model on both Total-Text (+2.88), a sparse text dataset, and HierText (+12.34), a dense text dataset (Tab. 4). The recall rate on HierText drops by (-15.09) if the model detects words instead of lines. This is consistent with Tab. 2, where word-based current arts have much lower scores on the dense HierText dataset.

Choice of Backbone We train two versions of HTS, one with MaX-DeepLab as the backbone and the other with KMaX-DeepLab. HTS with KMaX-DeepLab achieves +3.03 / +0.86 better F1 scores on Total-Text and HierText respectively, demonstrating the advantage of KMaX-DeepLab, a follow-up model of MaX-DeepLab. In addition to improved accuracy, the KMaX-DeepLab version of HTS runs at a much higher speed, with 5.6 FPS on average on HierText, while the MaX-DeepLab version runs at 1.2 FPS, when measured on an A100. Adopting KMaX-DeepLab benefits both accuracy and latency.

4.5. Limitations

Latency On an A100 GPU, our method runs at 7.8 FPS on ICDAR 2015 and Total-Text, and 5.6 FPS on HierText, which is an order of magnitude more dense, while TESTR [67] runs at 10.2 FPS on HierText. We believe a faster backbone for UDP can help. Sharing features for UDP and L2C2W, i.e. making it end-to-end trainable, will also save computation.

Line labels The training of UDP requires line-level annotations, which are available in only a few public datasets. However, it is relatively low-cost to annotate line grouping on top of existing word-level polygons. Line grouping of ground-truth words can also be accurately estimated using heuristics based on word size, location, and orientation.

Character Localization The benchmark datasets used in this work do not provide character-level labels, so we are unable to evaluate the accuracy of our character localization directly. We can only use word-level text spotting and layout analysis results as a proxy, as they highly depend on character localization quality.

5. Conclusion

In this paper, we propose the first Hierarchical Text Spotter (HTS) for the joint task of text spotting and layout analysis. HTS achieves new state-of-the-art performance on multiple word-level text spotting benchmarks as well as a geometric layout analysis task.
References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9365–9374, 2019.
[3] Youngmin Baek, Seung Shin, Jeonghun Baek, Sungrae Park, Junyeop Lee, Daehyun Nam, and Hwalsuk Lee. Character region attention for text spotting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16, pages 504–521. Springer, 2020.
[4] Christian Bartz, Haojin Yang, and Christoph Meinel. SEE: towards semi-supervised end-to-end scene text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[5] Darwin Bautista and Rowel Atienza. Scene text recognition with permuted autoregressive sequence models. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 178–196. Springer, 2022.
[6] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301, 2019.
[7] Thomas M Breuel. Two geometric algorithms for layout analysis. In International Workshop on Document Analysis Systems, pages 188–199. Springer, 2002.
[8] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[9] Roldano Cattoni, Tarcisio Coianiz, Stefano Messelodi, and Carla Maria Modena. Geometric layout analysis techniques for document image understanding: a review. ITC-irst Technical Report, 9703(09), 1998.
[10] Chee Kheng Ch'ng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 935–942. IEEE, 2017.
[11] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
[12] Cheng Da, Peng Wang, and Cong Yao. Levenshtein OCR. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 322–338. Springer, 2022.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[14] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2315–2324, 2016.
[15] Xianzhi Du, Yeqing Li, Abdullah Rashwan, Le Hou, Pengchong Jin, Fan Yang, Frederick Liu, Jaeyoun Kim, Hongkun Yu, Chen Chen, and Jing Li. TensorFlow Model Garden. https://github.com/tensorflow/models, 2020.
[16] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.
[17] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
[18] Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), volume 2, pages 1–6. IEEE, 2019.
[19] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.
[20] Taeho Kil, Seonghyeon Kim, Sukmin Seo, Yoonsik Kim, and Daehee Kim. Towards unified scene text spotting based on sequence generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15223–15232, 2023.
[21] Yair Kittenplon, Inbal Lavi, Sharon Fogel, Yarin Bar, R Manmatha, and Pietro Perona. Towards weakly-supervised text spotting using a multi-task transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4604–4613, 2022.
[22] Ilya Krylov, Sergei Nosov, and Vladislav Sovrasov. Open Images V5 text annotation and yet another mask text spotter. In Asian Conference on Machine Learning, pages 379–389. PMLR, 2021.
[23] Chen-Yu Lee, Chun-Liang Li, Timothy Dozat, Vincent Perot, Guolong Su, Nan Hua, Joshua Ainslie, Renshen Wang, Yasuhisa Fujii, and Tomas Pfister. FormNet: Structural encoding beyond sequential modeling in form document information extraction. arXiv preprint arXiv:2203.08411, 2022.
[24] Joonho Lee, Hideaki Hayashi, Wataru Ohyama, and Seiichi Uchida. Page segmentation using a convolutional neural network with trainable co-occurrence features. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1023–1028. IEEE, 2019.
[25] Chenliang Li, Bin Bi, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. StructuralLM: Structural pre-training for form understanding. arXiv preprint arXiv:2105.11210, 2021.
[26] Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, and Xiang Bai. Mask TextSpotter v3: Segmentation proposal network for robust scene text spotting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 706–722. Springer, 2020.
[27] Minghui Liao, Jian Zhang, Zhaoyi Wan, Fengming Xie, Jiajun Liang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Scene text recognition from two-dimensional perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8714–8721, 2019.
[28] Minghui Liao, Zhisheng Zou, Zhaoyi Wan, Cong Yao, and Xiang Bai. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):919–931, 2022.
[29] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. FOTS: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5676–5685, 2018.
[30] Yuliang Liu, Chunhua Shen, Lianwen Jin, Tong He, Peng Chen, Chongyu Liu, and Hao Chen. ABCNet v2: Adaptive Bezier-curve network for real-time end-to-end text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8048–8064, 2021.
[31] Shangbang Long, Yushuo Guan, Kaigui Bian, and Cong Yao. A new perspective for flexible feature gathering in scene text recognition via character anchor pooling. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2458–2462. IEEE, 2020.
[32] Shangbang Long, Yushuo Guan, Bingxuan Wang, Kaigui Bian, and Cong Yao. Rethinking irregular scene text recognition. arXiv preprint arXiv:1908.11834, 2019.
[33] Shangbang Long, Xin He, and Cong Yao. Scene text detection and recognition: The deep learning era. International Journal of Computer Vision, 129(1):161–184, 2021.
[34] Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii, and Michalis Raptis. Towards end-to-end unified scene text detection and layout analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1049–1059, 2022.
[35] Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii, and Michalis Raptis. ICDAR 2023 competition on hierarchical text detection and recognition. In Gernot A. Fink, Rajiv Jain, Koichi Kise, and Richard Zanibbi, editors, Document Analysis and Recognition - ICDAR 2023, pages 483–497, Cham, 2023. Springer Nature Switzerland.
[36] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018.
[37] Shangbang Long and Cong Yao. UnrealText: Synthesizing realistic scene text images from the unreal world. arXiv preprint arXiv:2003.10608, 2020.
[38] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[40] Byeonghu Na, Yoonsik Kim, and Sungrae Park. Multi-modal text recognition networks: Interactive enhancements between visual and semantic features. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 446–463. Springer, 2022.
[41] Oren Nuriel, Sharon Fogel, and Ron Litman. TextAdaIN: Paying attention to shortcut learning in text recognizers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 427–445. Springer, 2022.
[42] Dezhi Peng, Xinyu Wang, Yuliang Liu, Jiaxin Zhang, Mingxin Huang, Songxuan Lai, Jing Li, Shenggao Zhu, Dahua Lin, Chunhua Shen, et al. SPTS: single-point text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4272–4281, 2022.
[43] Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter Staar. DocLayNet: A large human-annotated dataset for document-layout segmentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3743–3751, 2022.
[44] Liang Qiao, Ying Chen, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, and Fei Wu. MANGO: A mask attention guided one-stage scene text spotter. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2467–2476, 2021.
[45] Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4704–4714, 2019.
[46] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[47] Zobeir Raisi, Georges Younes, and John Zelek. Arbitrary shape text detection using transformers. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 3238–3245. IEEE, 2022.
[48] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
[49] Roi Ronen, Shahar Tsiper, Oron Anschel, Inbal Lavi, Amir Markovitz, and R Manmatha. GLASS: Global to local attention for scene-text spotting. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 249–266. Springer, 2022.
[50] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[51] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. DeepDeSRT: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1162–1167. IEEE, 2017.
[52] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2550–2558, 2017.
[53] Amanpreet Singh, Vivek Natarjan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
[54] Amanpreet Singh, Guan Pang, Mandy Toh, Jing Huang, Wojciech Galuba, and Tal Hassner. TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8802–8812, June 2021.
[55] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.
[56] Jingqun Tang, Wenqing Zhang, Hongye Liu, MingKun Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022.
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[58] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5463–5474, 2021.
[59] Peng Wang, Cheng Da, and Cong Yao. Multi-granularity prediction for scene text recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 339–355. Springer, 2022.
[60] Renshen Wang, Yasuhisa Fujii, and Ashok C Popat. Post-OCR paragraph recognition by graph convolutional networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 493–502, 2022.
[61] Linjie Xing, Zhi Tian, Weilin Huang, and Matthew R Scott. Convolutional character networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9126–9136, 2019.
[62] Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5315–5324, 2017.
[63] Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, and Dacheng Tao. DPText-DETR: Towards better scene text detection with dynamic points in transformer. arXiv preprint arXiv:2207.04491, 2022.
[64] Moonbin Yim, Yoonsik Kim, Han-Cheol Cho, and Sungrae Park. SynthTIGER: synthetic text image generator towards better text recognition models. In Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part IV 16, pages 109–124. Springer, 2021.
[65] Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. k-means mask transformer. In European Conference on Computer Vision, pages 288–307. Springer, 2022.
[66] Liu Yuliang, Jin Lianwen, Zhang Shuaitao, and Zhang Sheng. Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170, 2017.
[67] Xiang Zhang, Yongwen Su, Subarna Tripathi, and Zhuowen Tu. Text spotting transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9519–9528, 2022.
[68] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE, 2019.
