0% found this document useful (0 votes)

34 views17 pages

SVTRV 2

The paper introduces SVTRv2, a novel CTC-based scene text recognition model that outperforms existing encoder-decoder models in both accuracy and inference speed. SVTRv2 incorporates a multi-size resizing strategy and a feature rearrangement module to effectively handle text irregularities and integrate linguistic context, addressing common challenges in scene text recognition. Extensive evaluations demonstrate its superiority across various benchmarks, solidifying its position as a leading method in the field.

Uploaded by

772845695

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

34 views17 pages

SVTRV 2

Uploaded by

772845695

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Yongkun Du1 , Zhineng Chen1 *, Hongtao Xie2 , Caiyan Jia3 , Yu-Gang Jiang1
1
School of Computer Science, Fudan University, China
2
School of Information Science and Technology, USTC, China
3
School of Computer Science and Technology, Beijing Jiaotong University, China
[email protected], {zhinchen, ygj}@fudan.edu.cn, [email protected], [email protected]
arXiv:2411.15858v1 [cs.CV] 24 Nov 2024

Abstract SVTR CPPD PARSeq MARRec LISTER SVTRv2

Common
100
FPS Union14M
Connectionist temporal classification (CTC)-based 80
scene text recognition (STR) methods, e.g., SVTR, are 60
widely employed in OCR applications, mainly due to their 40
simple architecture, which only contains a visual model Long 20 Curved
and a CTC-aligned linear classifier, and therefore fast in- 0
ference. However, they generally have worse accuracy than
encoder-decoder-based methods (EDTRs), particularly in
challenging scenarios. In this paper, we propose SVTRv2, Chinese Multi-Oriented
a CTC model that beats leading EDTRs in both accuracy
and inference speed. SVTRv2 introduces novel upgrades to
Occluded Artistic
handle text irregularity and utilize linguistic context, which Fixed-Scale Resizing BB_UULLC_
endows it with the capability to deal with challenging
BULC
and diverse text instances. First, a multi-size resizing Reading order
(MSR) strategy is proposed to adaptively resize the text and Severe distortion puzzling CTC Objccts
causing unreadable alignment
maintain its readability. Meanwhile, we introduce a feature Text Irregularity Linguistic Missing
rearrangement module (FRM) to ensure that visual features
accommodate the alignment requirement of CTC well, thus Figure 1. Top: comparison with previous methods [4, 8, 11, 12,
alleviating the alignment puzzle. Second, we propose a 25] best in a single scenario, where long text recognition accuracy
semantic guidance module (SGM). It integrates linguistic (Long) and FPS are normalized. Our SVTRv2 achieves the new
context into the visual model, allowing it to leverage state of the arts in every scenario except for FPS. Nevertheless,
SVTRv2 is still the fastest compared to all the EDTRs. Bottom:
language information for improved accuracy. Moreover,
challenges caused by text irregularity and linguistic missing.
SGM can be omitted at the inference stage and would
not increase the inference cost. We evaluate SVTRv2 in
both standard and recent challenging benchmarks, where decades. Unlike text from scanned documents, scene text
SVTRv2 is fairly compared with 24 mainstream STR models often exists within complex natural scenarios, posing chal-
across multiple scenarios, including different types of text lenges such as background noise, text distortions, irregular
irregularity, languages, and long text. The results indicate layouts, artistic fonts [7], etc. To tackle these challenges, a
that SVTRv2 surpasses all the EDTRs across the scenarios variety of STR methods have been developed and they can
in terms of accuracy and speed. Code is available at be roughly divided into two categories, i.e., connectionist
https://github.com/Topdu/OpenOCR. temporal classification (CTC)-based methods and encoder-
decoder-based methods (EDTRs).
Typically, CTC-based methods [11, 23, 28, 39] employ
1. Introduction a single visual model to extract image features and then ap-
As a task of extracting text from natural images, scene text ply a CTC-aligned linear classifier [16] to predict recogni-
recognition (STR) has garnered considerable interest over tion results. This straightforward architecture provides ad-
vantages such as fast inference, which makes them espe-
* Corresponding Author cially popular in OCR applications. However, these mod-

1
els struggle to handle text irregularity, i.e., text distortions, character arrangement does not align with the reading order
varying layouts, etc. As a consequence, advanced attention- of the text, causing the puzzle for CTC alignment, as shown
based decoders are introduced as alternatives to the CTC in the bottom-centre example in Fig. 1. To solve this, we in-
classifier, leading to a series of EDTRs [4, 8, 9, 12, 15, troduce a feature rearrangement module (FRM) which rear-
17, 18, 29, 32, 34, 36, 38, 40, 46–49, 51, 52, 54, 55, 57– ranges visual features with a horizontal rearrangement, and
59, 62, 63]. These attention-based decoders show appeal- then identifying and prioritizing relevant vertical features.
ing performance in integrating multi-modal cues, including FRM maps 2D visual features into a sequence aligned with
visual [12, 48, 52, 58], linguistic [15, 36, 38, 55], and po- the text’ reading order, thus effectively alleviating the align-
sitional [8, 57, 62] ones, which are largely missed in cur- ment puzzle. Consequently, CTC models integrating MSR
rent CTC models. This integration enables EDTRs to per- and FRM can recognize irregular text effectively, without
form more effectively in complex scenarios. As depicted in using rectification modules or attention-based decoders.
the top of Fig. 1, compared to SVTR [11], a leading CTC As for the latter, the mistakenly recognized example
model that is adopted by famous commercial OCR engines shown in the bottom-right of Fig. 1 clearly highlights the
[28], EDTRs achieve superior results in English and Chi- necessity of integrating linguistic information. Since CTC
nese benchmarks [6, 25], covering challenging scenarios models directly classify visual features, we have to endow
such as curved, multi-oriented, occluded, and artistic text. the visual model with linguistic context modeling capabil-
Nevertheless, EDTRs are commonly built upon complex ity, which is less discussed previously. To this end, in-
architectures, thus sacrificing the inference speed (FPS), as spired by guided training of CTC (GTC) [23, 28] and string
shown in the top of Fig. 1. In addition, besides slower in- matching-based recognition [13], we propose a semantic
ference speed, EDTRs do not handle long text well. Even guidance module (SGM), which devises a new scheme that
LISTER [8], an EDTR dedicated to long text recognition, solely leverages surrounding string context to recognize tar-
performs worse than SVTR [11]. Since fast response and get characters during training. This approach effectively
recognizing long text are both important for many applica- guides the visual model to learn to perceive the linguistic
tions, the OCR community has to face the dilemma that no context without rely on decoders. During inference, SGM
model excels in accuracy, speed and versatility. When se- can be omitted and would not increase the inference cost.
lecting either CTC-based models or EDTRs, users have to With these contributions, we develop SVTRv2, a novel
accept that the model is inferior in some aspects. CTC-based method whose recognition ability has been
The inferior accuracy of CTC models can be attributed largely enhanced, while still maintaining a simple architec-
to two primary factors. First, these models struggle with ture and fast inference. To thoroughly validate SVTRv2,
irregular text, as CTC alignment presumes that the text ap- we conducted extensive ablation and comparative experi-
pears in a near canonical left-to-right order [2, 7], which is ments on benchmarks including standard regular and irregu-
not always true, particularly in complex scenarios. Second, lar text [2], occluded scene text [48], the recent Union14M-
CTC models seldom encode linguistic information, which L benchmarks [25], long text [13], and Chinese [6] text. The
is typically accomplished by the decoder of EDTRs. While results demonstrate that SVTRv2 consistently outperforms
recent advancements deal with the two issues by employing all the compared EDTRs across the evaluated scenarios in
text rectification [32, 40, 61], developing 2D CTC [44], uti- terms of accuracy and speed, highlighting its effectiveness
lizing masked image modeling [48, 58], etc., the accuracy and broad applicability.
gap between CTC and EDTRs remains significant, indicat- In addition, recent advances [4, 25, 37] revealed the
ing that novel solutions still need to be investigated. importance of large-scale real-world datasets in improving
In this paper, we aim to build more powerful CTC mod- STR model performance. However, many STR methods
els by better handling text irregularity and integrating lin- primarily derived from synthetic data [19, 24], which fail
guistic context. For the former, we address this challenge to fully represent real-world complexities and lead to per-
by first extracting discriminative features and then better formance limitations, particularly on challenging scenarios.
aligning them. First, existing methods uniformly resize text Notably, we observe that the existing real-word datasets
images with various shapes to a fixed size before feeding [4, 25, 37] suffer from data leakage, and the results reported
into the visual model. We question the rationality of this in [25] should be updated. As a result, we introduce U14M-
resizing, which easily causes severe distortion of the text, Filter, a rigorously filtered version of the real-world training
making it unreadable to humans, as shown in the bottom- dataset Union14M-L [25]. We systematically reproduced
left of Fig. 1. To this end, we propose a multi-size resizing and retrained 24 mainstream STR methods from scratch
(MSR) strategy to adaptively resize text images according based on U14M-Filter. These methods are also thoroughly
to their aspect ratios, thus minimizing text distortion and evaluated across various STR benchmarks. Their accuracy,
ensuring the discrimination of the extracted visual features. model size, and inference time constitute a comprehensive
Second, irregular text may be rotated significantly, and the and reliable new benchmark for future reference.

2
Multi-Size Resizing H
×
W
× D0
H
×
W
× D1
H
×
W
× D2 [C] [C] CE Loss
4 4 4 4 8 4
64×64 Dot-Product
Attention Classifier

Conv 3×3 & BN

Conv×

Conv×
Mixing Blocks

Mixing Blocks

Mixing Blocks
48×96
SGM

GELU ×2

3 3 & LN

3 3 & LN
40×112 ...
[C] : A Character in the Img
32×128
FRM
... H
: Left-side String of [C]
...
... H
H : Right-side String of [C]
... H : Image Feature
R<1.5
H : Horizontal Rearranging
...
1.5≤R<2.5 V : Vertical Rearranging

Local/Global
V V V V V V ... V V : Selecting Token
2.5≤R<3.5 Mixing

MLP
: Training Forward

LN
R≥3.5 Classifier : Inference Forward

Grouped by Aspect Ratio (R = WH) Mixing Block ×Ni COFFEE CTC Loss CTC Postprocess COFFEE

Figure 2. An illustrative overview of SVTRv2. The text is first resized according to multi-size resizing (MSR), then experiences feature
extraction. During training both the semantic guidance module (SGM) and feature rearrangement module (FRM) are employed, which are
responsible for linguistic context modeling and CTC-oriented feature rearrangement, respectively. Only FRM is retained during inference.

2. Related Work tegrating linguistic cues into the visual model, thus limiting
their effectiveness in enhancing CTC models.
Irregular text recognition [1, 25, 35] has posed a sig-
nificant challenge in STR due to the diverse variation of
text instances, where CTC-based methods [11, 23, 28, 39]
3. Methods
are often less effective. To address this, some meth-
ods [11, 36, 40, 54, 59, 61, 62] incorporate rectification
Fig. 2 illustrates the overview of SVTRv2. A text image
modules [32, 40, 61] that aim to transform irregular text
is first resized by MSR to the closest aspect ratio, forming
into more regular format. Alternatively, more methods
the input X ∈ R3×H×W . X then experiences three con-
utilize attention-based decoders [29, 38, 47], which em-
secutive feature extraction stages, yielding visual features
ploy the attention mechanism to dynamically localize char- H W
F ∈ R 8 × 4 ×D2 . During training, F is fed into both SGM
acters regardless of text layout, and thus less affected.
and FRM. SGM guides SVTRv2 to model linguistic con-
However, these methods generally have tailored training
text, while FRM rearranges F into the character feature se-
hyper-parameters. For instance, the rectification modules W
quence F̃ ∈ R 4 ×D2 , which is synchronized with the text
[32, 40, 61] typically specify a fixed output image size (e.g.
reading order and aligns with the label sequence well. Dur-
32×128), which is not always a suitable choice. While
ing inference, the SGM is discarded for efficiency.
attention-based decoders [29, 38, 47] generally set the max-
imum recognition length to 25 characters, thus longer text
cannot be correctly recognized, as shown in Fig. 5. 3.1. Multi-Scale Resizing
Linguistic Context Modeling. There are several ways of
modeling linguistic context. One major branch is auto- Previous works typically resize irregular text images to a
regressive (AR)-based STR methods [14, 25, 29, 38, 40, 47, fixed size, such as 32 × 128, which, however, may cause
51, 52, 54, 57, 62, 63], which utilize previously decoded undesired text distortion and severely affect the quality of
characters to model contextual cues. However, their infer- extracted visual features. To address this issue, we propose
ence speed is slow due to the character-by-character decod- a simple yet effective multi-size resizing (MSR) strategy
ing nature. Some other methods [4, 15, 34, 55] integrate that resizes text shapes based on the aspect ratio (R= W H ).
external language models to model the linguistic context Specifically, we define four specific sizes: [64, 64], [48, 96],
and correct the recognition results. While effective, the lin- [40, 112], and [32, ⌊R⌋× 32], respectively corresponding to
guistic context is purely text-based, making it challenging aspect ratio: R<1.5 (R1 ), 1.5 ≤R<2.5 (R2 ), 2.5≤R<3.5
to adapt them to CTC models. There also are some stud- (R3 ), and R≥3.5 (R4 ). Therefore, MSR allows text in-
ies [36, 48, 58] to modeling linguistic context with visual stances adaptively resized under the principles of roughly
information only by using masked image modeling-based maintaining their aspect ratios, such that significant text dis-
pretraining [3, 21]. However, they still depend on attention- tortion caused by resizing is almost eliminated. As a result,
based decoders to unleash the linguistic information, not in- the quality of extracted visual features is guaranteed.

3
3.2. Visual Feature Extraction within Fh to learn a vertical rearrangement matrix Mvj ∈
H
Motivated by SVTR [11], the network architecture of R1× 8 , as detailed in Eq. 2. The element Mvj,i represents the
SVTRv2 comprises three stages, with stagei containing Ni probability that the vertical rearranged feature Fvj ∈ R1×D2
mixing blocks, as illustrated in Fig. 2. To extract dis- corresponds to Fhi,j . Moreover, all column features share
criminative visual features, we devise two types of mix- a single selecting token, rather than assigning a unique se-
ing blocks: local and global. Local mixing is implemented lecting token to each column feature at different locations.
through two consecutive grouped convolutions, which are This scheme allows the model to generalize to longer text
expected to capture local character features, such as edges, sequences, even when the number of column features ex-
textures, and strokes. Meanwhile, global mixing is real- ceeds the number seen during training, thereby facilitating
ized by the multi-head self-attention (MHSA) mechanism the effective recognition of long text.
[42]. This mechanism performs global contextual modeling t
on features, thereby enhancing the model’s comprehension Mvj = σ T s Fh:,j Wjk , Fvj = Mvj Fh:,j Wjv (2)
of inter-character relationships and the overall text image.
Both the number of groups in the grouped convolution and where Wjq , Wjk , Wjv ∈ RD2 ×D2 are learnable weights.
W
the number of heads in MHSA are set to D 32 . Similar to
i We denote Fv = {Fv1 , Fv2 , . . . , FvW } ∈ R 4 ×D2 as
4
SVTR, by adjusting hyper-parameters Ni and Di , we can the rearranged feature sequence F̃. Then, the predicted
derive three variants of SVTRv2 with different capacities, W
character sequence Ỹ ctc ∈ R 4 ×Nc is obtained after F̃
i.e., Tiny, Small, and Base, which are detailed in Sec. 7 of passes through the classifier Ỹ ctc = F̃W ctc , and further
Supplementary. aligned with the label sequence Y using the CTC rule,
where W ctc ∈ RD2 ×Nc is the learnable weight of the clas-
3.3. Feature Rearranging Module
sifier. Furthermore, to intuitively clarify the role of FRM,
We propose a feature rearrangement module (FRM) to we rewrite the mapping relation between F̃ and the original
tackle the CTC alignment puzzle arising from rotated text. feature F by the following procedure:
H W
It rearranges the 2D features F ∈ R( 8 × 4 )×D2 into a fea- X W4
W
ture sequence F̃ ∈ R 4 ×D2 synchronized with the reading Fhi,j = Mhi,j × Fi = Mhi,j,w Fi,w (3)
w=1
order of the text image. We regard this process as mapping H
the relevant features from Fi,j ∈ R1×D2 to F̃m ∈ R1×D2 ,
X8
Fvj = Mvj × Fhj = Mvi,j Fhi,j
i=1
where i ∈ {1, 2, . . . , H8 } and j, m ∈ {1, 2, . . . , W
4 }. This X H8 X W4
this rearrangement can be formalized by a matrix M ∈ = Mvi,j Mhi,j,w Fi,w
W H W i=1 w=1
R 4 ×( 8 × 4 ) , whereby F̃ can be derived from M × F.
Grounding in the fact that the degree of text curve and = Mj × F
H W
rotation can be decomposed into offset components in both Mj = Mvj ⊙ {Mh1,j , Mh2,j , . . . , MhH ,j } ∈ R1×( 8 × 4 )
8
horizontal and vertical directions, we propose to learn M by W
×( H W
8 × 4 )
using two distinct steps: horizontal rearrangement and ver- M = {M1 , M2 , . . . , M W } ∈ R 4
4
tical rearrangement. As described in Eq. 1, the horizontal W
×D2
W F̃ = Fv = M × F ∈ R 4 (4)
one operates on Fi ∈ R 4 ×D2 , each row in the feature map,
W W
to learn a horizontal rearrangement matrix Mhi ∈ R 4 × 4 , As seen in Eq. 4, the visual features F, which fully de-
h
where element Mi,j,m represents the probability that the scribe the text to be recognized but may be irregularly or-
horizontal rearranged feature Fhi,j corresponds to the origi- ganized, are mapped to a more canonical F̃. Different from
nal feature Fi,m . Based on the learned Mhi , the features in existing text rectification models [32, 40], which apply geo-
each row are rearranged on the horizontal direction. Subse- metric transformations at image space and often fail to cor-
quently, through a residual and MLP processing, we obtain rect severely irregular text, our FRM is carried out at feature
H W
Fh ∈ R( 8 × 4 )×D2 . space. It learns a condensed feature sequence F̃ by mapping
and rearranging the relevant cues from F to proper positions
t
Mhi = σ Fi Wiq Fi Wik (1) in F̃. By employing a structured way that the features are
′ ′ ′
first rearranged horizontally and then vertically, FRM can
Fhi = LN(Mhi Fi Wiv + Fi ), Fhi = LN(MLP(Fhi ) + Fhi ) accommodate text instances with significant irregularities
such as large orientation variations, and obtaining features
where Wiq , Wik , Wiv ∈ RD2 ×D2 are learnable weights, σ is better aligned with the CTC classification.
the Softmax function, and Fh = {Fh1 , Fh2 , . . . , FhH }.
8
In the following vertical rearrangement, we introduce a
3.4. Semantic Guidance Module
selecting token, denoted as T s ∈ R1×D2 , which simulta- CTC models directly classify visual features to obtain
H
neously attends to each column of features Fh:,j ∈ R 8 ×D2 recognition results. This scheme inherently requires that

4
+
the linguistic context must be incorporated into visual fea-
[P][P][P][P][P] [P][P][P][P]C [P][P][P]CO [P][P]CPF [P]COFF COFFE
tures, so that the CTC could benefit from. In light of this, OFFEE FFEE[P] FEE[P][P] EE[P][P][P] E[P][P][P][P] [P][P][P][P][P]
we propose a semantic guidance module (SGM) as follows. / C O F F E E

For each character ci in a text image with character labels

Y = {c1 , c2 , . . ., cL }, we define its contextual information as Figure 3. Visualization of attention maps when recognizing the
target character by string matching on both sides, where li is set to
the surrounding left string Sli = {ci−ls , . . . , ci−1 } and right
5. [P] denotes the padding symbol.
string Sri = {ci+1 , . . . , ci+ls }, where ls denotes the context
window length. The SGM’s role is to guide the visual model
to integrate context from both Sli and Sri into visual features. 3.5. Optimization Objective
We describe the process using the left string Sli , with the
same process applied symmetrically to the right string Sri . During training, The optimization objective is to minimize
Firstly, the characters in Sli are mapped to string embed- the loss L, which comprises Lctc and Lsgm , as listed below:
dings Eli ∈ Rls ×D2 . Subsequently, these embeddings are
Lctc = CT CLoss(Ỹ ctc , Y) (6)
encoded to create a hidden representation Qli ∈ R1×D2 ,
representing the context of the left-side string Sli . The at- 1 XL l r
Lsgm = (ce(Ỹ i , ci ) + ce(Ỹ i , ci ))
tention map Ali is computed by applying a dot product be- 2L i=1

tween the hidden representation Qli and the visual features L = λ1 Lctc + λ2 Lsgm
F, transformed by learned weight matrices W q and W k .
where ce represents the cross-entropy loss, λ1 and λ2 are
The detailed formulation is as follows:
weighting parameters setting to 0.1 and 1, respectively.
t l v
Qli = LN σ T l W q Eli W k Ei W + T l (5)
4. Experiments
t
Ali = σ Qli W q FW k , Fli = Ali FW v

4.1. Datasets and Implementation Details
We evaluate SVTRv2 across multiple benchmarks cover-
where T l ∈ R1×D2 represents a pre-defined token that en-
ing diverse scenarios. They are: 1) six common regular
codes the left-side string. The attention map Ali is used to
and irregular benchmarks (Com), including ICDAR 2013
weight the visual features F, producing the feature Fli ∈
(IC13) [27], Street View Text (SVT) [45], IIIT5K-Words
R1×D2 corresponding to character ci . After processing
l (IIIT5K) [33], ICDAR 2015 (IC15) [26], Street View Text-
through the classifier Ỹ i = Fli W sgm , the predicted class Perspective (SVTP) [35] and CUTE80 [1]. For IC13 and
l
probabilities Ỹ i ∈ R1×Nc for ci is obtained to calculate IC15, we use the versions with 857 and 1811 images, re-
the cross-entropy loss with the label ci , where W sgm ∈ spectively; 2) the test set of the recent Union14M-L bench-
RD2 ×Nc is learnable weights of the classifier. mark (U14M) [25], which includes seven challenging sub-
The weight of the attention map Ali records the relevance sets: Curve, Multi-Oriented (MO), Artistic, Contextless
of Qli to the visual features F, and moreover, Qli represents (Cless), Salient, Multi-Words (MW) and General; 3) oc-
the context of string Sli . So only when the visual model in- cluded scene text dataset (OST) [48], which is categorized
corporates the context from Sli into the visual features of the into two subsets based on the degree of occlusion: weak oc-
target character ci , the attention map Ali can maximize the clusion (OST w ) and heavy occlusion (OST h ); 4) long text
relevance between Qli and visual features of the character benchmark (LTB) [13], which includes 3376 samples of text
ci , thereby accurately highlighting the corresponding posi- length from 25 to 35; 5) the test set of BCTR [6], a Chinese
tion of character ci , as shown in Fig. 3. A similar process text recognition benchmark with four subsets: Scene, Web,
applies to the right-side string Sri , where the correspond- Document (Doc) and Hand-Writing (HW).
ing attention map Ari and visual feature Fri contribute to For English recognition, we train models on real-world
r
the prediction Ỹ i . By leveraging the above scheme during datasets, from which the models exhibit stronger recogni-
training, SGM effectively guides the visual model in inte- tion capability [4, 25, 37]. There are three large-scale real-
grating the linguistic context into visual features. Conse- world training sets, i.e., the Real dataset [4], REBU-Syn
quently, even when SGM is not used during inference, the [37], and the training set of Union14M-L (U14M-Train)
linguistic context can still be maintained alongside the vi- [25]. However, they all overlap with U14M (detailed in
sual features, and enhancing the accuracy of CTC models. Sec. 8 in Supplementary) across the seven subsets, leading
In contrast, previous methods, such as VisionLAN [48] and to data leakage, which makes them unsuitable for training
LPV [58], despite modeling linguistic context using visual models. To resolve this, we introduce a filtered version of
features, still rely on attention-based decoders to unleash Union14M-L training set, termed as U14M-Filter, by filter-
linguistic information, a process that is incompatible with ing these overlapping instances. This new dataset is used to
CTC models. train SVTRv2 and 24 mainstream methods we reproduced.

5
For Chinese recognition, we train models on the training R1 R2 R3 R4
Curve MO Com U14M
set of BCTR [6]. Unlike previous methods that train sepa- 2,688 788 266 32
rately for each subset, we trained the model on an integrated SVTRv2 (+MSR+FRM) 87.4 88.3 86.1 87.5 88.17 86.19 96.16 83.86
dataset and then evaluated it on the four subsets. SVTRv2 (w/o both) 70.5 81.5 82.8 84.4 82.89 65.59 95.28 77.78
We use AdamW optimizer [30] with a weight decay of vs. Fixed32×128 72.1 83.1 84.1 85.6 83.18 68.71 95.56 78.87
0.05 for training. The LR is set to 6.5 × 10−4 and batchsize MSR Padding32×W 52.1 71.3 82.3 87.4 71.06 51.57 94.70 71.82
(+FRM) Fixed64×256 76.6 81.6 81.9 80.2 85.70 67.49 95.07 79.03
is set to 1024. One cycle LR scheduler [43] with 1.5/4.5
epochs linear warm-up is used in all the 20/100 epochs, w/o FRM 85.7 86.3 86.0 85.5 87.35 83.73 95.44 82.22
vs.
+ H rearranging 87.0 87.1 86.3 85.5 88.05 85.76 95.98 82.94
where a/b means a for English and b for Chinese. For En- FRM
+ V rearranging 85.0 87.6 88.5 85.5 88.01 84.44 95.66 82.70
glish model, we take SVTRv2 without SGM as the pre- (+MSR)
+ TF1 86.4 86.3 87.5 86.1 87.51 85.50 95.60 82.49
trained and fine-tuned SVTRv2 with SGM with the same
ResNet+TF3 49.3 63.5 64.0 66.7 65.00 42.07 92.26 63.00
above settings. Word accuracy is used as the evaluation met- FocalNet-B 56.7 73.2 75.3 73.9 76.46 45.80 94.49 71.63
ric. Data augmentation like rotation, perspective distortion, - ConvNeXtV2 58.4 71.0 73.6 71.2 75.97 45.95 93.93 70.43
motion blur and gaussian noise, are randomly performed ViT-S 68.5 73.8 73.8 73.0 75.02 64.35 93.57 72.09
and the maximum text length is set to 25 during training. SVTR-B 53.3 74.8 76.4 78.4 76.22 44.49 94.58 71.17
The size of the character set Nc is set to 94 for English ResNet+TF3 53.8 67.9 65.5 65.8 69.00 46.02 93.12 66.81
and 6624 [28] for Chinese. In experiments, SVTRv2 means FocalNet-B 57.1 75.2 77.1 78.4 75.52 51.21 94.39 72.73
+FRM ConvNeXtV2 60.7 79.0 79.0 81.1 79.72 53.32 94.19 73.09
SVTRv2-B unless specified. All models are trained with ViT-S 75.1 79.4 79.0 78.4 80.42 72.17 94.44 77.07
mixed-precision on 4 RTX 4090 GPUs. SVTR-B 59.1 79.0 78.8 80.2 79.84 51.28 94.75 73.48
ResNet+TF3 68.2 71.3 75.3 72.1 75.64 60.33 93.50 71.95
4.2. Ablation Study +MSR FocalNet-B 80.5 80.6 79.2 85.0 82.26 74.82 94.92 78.94
ConvNeXtV2 76.2 79.0 82.3 80.2 81.05 73.27 94.60 77.71
Effectiveness of MSR. We group the Curve and MO text in
U14M based on the aspect ratio Ri . As shown in Tab. 1, the - / + SGM OST w OST h Avg OST ∗w OST ∗h Avg Com∗ U14M ∗
majority of irregular texts fall within R1 and R2 , where they ResNet+TF3 71.6 51.8 61.72 77.9 55.0 66.43 95.19 78.61
are particularly prone to distortion when resized to a fixed FocalNet-B 78.9 62.8 70.88 84.6 70.6 77.61 96.28 84.10
ConvNeXtV2 76.0 58.2 67.10 82.0 63.9 72.97 96.09 82.10
size (see Fixed32×128 in Fig. 4). In contrast, MSR demon-
strates significant improvements of 15.3% in R1 and 5.2%
Table 1. Ablations on MSR and FRM (top) and assessing MSR,
in R2 compared to Fixed32×128 . Meanwhile, a large fixed- FRM, and SGM across visual models (lower). * means with SGM.
size Fixed64×256 , although improving the accuracy com-
pared to the baseline, still performs worse than our MSR
Method OST w OST h Avg Com U14M
by clear margins. The results strongly confirm our hypoth-
esis that undesired resizing would hurt the recognition. Our w/o SGM 82.86 66.97 74.92 96.16 83.86
SGM 86.26 73.80 80.03 96.57 86.14
MSR effectively mitigates this issue, providing better visual Linguistic GTC [23] 83.07 68.32 75.70 96.01 84.33
features thus enhancing the recognition accuracy. context ABINet [15] 83.07 67.54 75.31 96.25 84.17
Effectiveness of FRM. We ablate the two rearrangement modeling VisionLAN [48] 83.25 68.97 76.11 96.39 84.01
PARSeq [4] 83.85 69.24 76.55 96.21 84.72
sub-modules (Horizontal (H) rearranging and Vertical (V)
MAERec [25] 83.21 69.69 76.45 96.47 84.69
rearranging). As shown in Tab. 1, compared to without
FRM (w/o FRM), they individually improve accuracy by Table 2. Comparison of the proposed SGM with other language
2.03% and 0.71% on MO, and they together result in a models in linguistic context modeling on OST.
2.46% gain. Additionally, we explore using a Transformer
block (+ TF1 ) to learn the rearrangement matrix holistically,
whose effectiveness is less obvious. The most probable rea- characters, this notable gain implies that the linguistic con-
son is that this scheme does not well distinguish between text has been successfully established. For comparison, we
vertical and horizontal orientations. In contrast, FRM per- also employ GTC [23] and four popular language decoders
forms feature rearrangement in both directions, making it [4, 15, 25, 48] to substitute for our SGM. However, there is
highly sensitive to text irregularity, and thus facilitating ac- no much difference between the gains obtained from OST
curate CTC alignment. As shown in the left five cases in and the other two datasets (Com and U14M). This sug-
Fig. 4, FRM successfully recognizes reverse instances, pro- gests that SGM offers a distinct advantage in integrating
viding strong evidence of FRM’s effectiveness. linguistic context into visual features, and significantly im-
Effectiveness of SGM. As illustrated in Tab. 2, SGM proving the recognition accuracy of CTC models. The five
achieves 0.41% and 2.28% increase on Com and U14M, cases on the right side of Fig. 4 showcase that SGM fa-
respectively, while gains a 5.11% improvement on OST. cilitates SVTRv2 to accurately decipher occluded charac-
Since OST frequently suffers from missing a portion of ters, achieving comparable results with PARSeq [4], which

6
Text Image

MSR SVTRv2: WORLD STREET Coffee

w/o SGM: WORED STREEi Goffee
SVTRv2: COMMUNITY ISSUE! NATIONAUX Position REPUBLIC LIBERTY UNLIMITED PARSeq: WORLD STREET Coffee
w/o FRM: COINMMUN_TY iSSUEL NATMIONAUX Pos_tion KEPUB_EY LIBERIY UNLINITED
Fixed32×128
SVTRv2†: COMMMUNITY L55UE! KAT_ONAUX PO__tIOR KEUA____ LIBERTT UNLI__TED
MAERec*: COMMUNITY ISSUE! NATIONALIX POSTEIGH KEPUBLI_ LIBEIRITY UNLIMITED
SVTRv2: DUMPLINGS DEPARTMENT
TPS w/o SGM: SUMPLINGS DEP_ RTMENT
SVTRv2†: COMMUNITY ISSUE! MATIONAUX Position NEPUE___ VIBBKEY UNLIMITED PARSeq: DUMPLINGS DER_RTMENT

Figure 4. Qualitative comparison of SVTRv2 with previous methods on irregular and occluded text. † means that SVTRv2 utilizes the
fixed-scale (Fixed32×128 ) or rectification module (TPS) as the resize strategy. MAERec* means that SVTRv2† integrates with the attention-
based decoder from the previous best model, i.e. MAERec [25], such a decoder is widely employed in [5, 31, 38, 51, 52, 54, 56]. Green,
red, and denotes correctly, wrongly and missed recognition, respectively.

IIIT5k SVT ICDAR2013 ICDAR2015 SVTP CUTE80 ∥ Curve Multi-Oriented Artistic Contextless Salient Multi-Words General
Method Venue Encoder Common Benchmarks Avg Union14M Benchmarks Avg LTB OST Size FPS
ASTER [40] TPAMI19 ResNet+LSTM 96.1 93.0 94.9 86.1 87.9 92.0 91.70 70.9 82.2 56.7 62.9 73.9 58.5 76.3 68.75 0.1 61.9 19.0 67.1
NRTR [38] ICDAR19 Stem+TF6 98.1 96.8 97.8 88.9 93.3 94.4 94.89 67.9 42.4 66.5 73.6 66.4 77.2 78.3 67.46 0.0 74.8 44.3 17.3
MORAN [32] PR19 ResNet+LSTM 96.7 91.7 94.6 84.6 85.7 90.3 90.61 51.2 15.5 51.3 61.2 43.2 64.1 69.3 50.82 0.1 57.9 17.4 59.5
SAR [29] AAAI19 ResNet+LSTM 98.1 93.8 96.7 86.0 87.9 95.5 93.01 70.5 51.8 63.7 73.9 64.0 79.1 75.5 68.36 0.0 60.6 57.5 15.8
DAN [47] AAAI20 ResNet+FPN 97.5 94.7 96.5 87.1 89.1 94.4 93.24 74.9 63.3 63.4 70.6 70.2 71.1 76.8 70.05 0.0 61.8 27.7 99.0
SRN [55] CVPR20 ResNet+FPN 97.2 96.3 97.5 87.9 90.9 96.9 94.45 78.1 63.2 66.3 65.3 71.4 58.3 76.5 68.43 0.0 64.6 51.7 67.1
SEED [36] CVPR20 ResNet+LSTM 96.5 93.2 94.2 87.5 88.7 93.4 92.24 69.1 80.9 56.9 63.9 73.4 61.3 76.5 68.87 0.1 62.6 24.0 65.4
AutoSTR [59] ECCV20 NAS+LSTM 96.8 92.4 95.7 86.6 88.2 93.4 92.19 72.1 81.7 56.7 64.8 75.4 64.0 75.9 70.09 0.1 61.5 6.0 82.6
RoScanner [57] ECCV20 ResNet 98.5 95.8 97.7 88.2 90.1 97.6 94.65 79.4 68.1 70.5 79.6 71.6 82.5 80.8 76.08 0.0 68.6 48.0 64.1
ABINet [15] CVPR21 ResNet+TF3 98.5 98.1 97.7 90.1 94.1 96.5 95.83 80.4 69.0 71.7 74.7 77.6 76.8 79.8 75.72 0.0 75.0 36.9 73.0
VisionLAN [48] ICCV21 ResNet+TF3 98.2 95.8 97.1 88.6 91.2 96.2 94.50 79.6 71.4 67.9 73.7 76.1 73.9 79.1 74.53 0.0 66.4 32.9 93.5
PARSeq [4] ECCV22 ViT-S 98.9 98.1 98.4 90.1 94.3 98.6 96.40 87.6 88.8 76.5 83.4 84.4 84.3 84.9 84.26 0.0 79.9 23.8 52.6
MATRN [34] ECCV22 ResNet+TF3 98.8 98.3 97.9 90.3 95.2 97.2 96.29 82.2 73.0 73.4 76.9 79.4 77.4 81.0 77.62 0.0 77.8 44.3 46.9
MGP-STR [46] ECCV22 ViT-B 97.9 97.8 97.1 89.6 95.2 96.9 95.75 85.2 83.7 72.6 75.1 79.8 71.1 83.1 78.65 0.0 78.7 148 120
CPPD [12] Preprint SVTR-B 99.0 97.8 98.2 90.4 94.0 99.0 96.40 86.2 78.7 76.5 82.9 83.5 81.9 83.5 81.91 0.0 79.6 27.0 125
LPV [58] IJCAI23 SVTR-B 98.6 97.8 98.1 89.8 93.6 97.6 95.93 86.2 78.7 75.8 80.2 82.9 81.6 82.9 81.20 0.0 77.7 30.5 82.6
MAERec [25] ICCV23 ViT-S 99.2 97.8 98.2 90.4 94.3 98.3 96.36 89.1 87.1 79.0 84.2 86.3 85.9 84.6 85.17 9.8 76.4 35.7 17.1
LISTER [8] ICCV23 FocalNet-B 98.8 97.5 98.6 90.0 94.4 96.9 96.03 78.7 68.8 73.7 81.6 74.8 82.4 83.5 77.64 36.3 77.1 51.1 44.6
CDistNet [62] IJCV24 ResNet+TF3 98.7 97.1 97.8 89.6 93.5 96.9 95.59 81.7 77.1 72.6 78.2 79.9 79.7 81.1 78.62 0.0 71.8 43.3 15.9
CAM [54] PR24 ConvNeXtV2 98.2 96.1 96.6 89.0 93.5 96.2 94.94 85.4 89.0 72.0 75.4 84.0 74.8 83.1 80.52 0.7 74.2 58.7 28.6
BUSNet [49] AAAI24 ViT-S 98.3 98.1 97.8 90.2 95.3 96.5 96.06 83.0 82.3 70.8 77.9 78.8 71.2 82.6 78.10 0.0 78.7 32.1 83.3
OTE [52] CVPR24 SVTR-B 98.6 96.6 98.0 90.1 94.0 97.2 95.74 86.0 75.8 74.6 74.7 81.0 65.3 82.3 77.09 0.0 77.8 20.3 55.2
CRNN [39] TPAMI16 ResNet+LSTM 95.8 91.8 94.6 84.9 83.1 91.0 90.21 48.1 13.0 51.2 62.3 41.4 60.4 68.2 49.24 47.2 58.0 16.2 172
C SVTR [11] IJCAI22 SVTR-B 98.0 97.1 97.3 88.6 90.7 95.8 94.58 76.2 44.5 67.8 78.7 75.2 77.9 77.8 71.17 45.1 69.6 18.1 161
T SVTRv2-T 98.6 96.6 98.0 88.4 90.5 96.5 94.78 83.6 76.0 71.2 82.4 77.2 82.3 80.7 79.05 47.8 71.4 5.1 201
C SVTRv2 - SVTRv2-S 99.0 98.3 98.5 89.5 92.9 98.6 96.13 88.3 84.6 76.5 84.3 83.3 85.4 83.5 83.70 47.6 78.0 11.3 189
SVTRv2-B 99.2 98.0 98.7 91.1 93.5 99.0 96.57 90.6 89.0 79.3 86.1 86.2 86.7 85.1 86.14 50.2 80.0 19.8 143

Table 3. All the models and SVTRv2 are trained on U14M-Filter. TFn denotes the n-layer Transformer block [42]. Size denotes the model
size (M). FPS is uniformly measured on one NVIDIA 1080Ti GPU. In addition, we discuss the results of SVTRv2 trained on synthetic
datasets [19, 24] in Supplementary.

is equipped with an advanced permuted language model. Net [53], and ConvNeXtV2 [54] exhibit significant accu-
racy improvements, either matching or even exceeding the
Adaptability to different visual models. We further ex-
accuracy of their EDTR counterparts (see Tab. 3). The re-
amine MSR, FRM, and SGM on five frequently used visual
sults highlight the versatility of the three proposed modules.
models [10, 11, 20, 50, 53]. As presented in the bottom part
of Tab. 1, these modules consistently enhance the perfor-
4.3. Comparison with State-of-the-arts
mance (ViT [10] and SVTR [11] employ absolute positional
coding and do not compatible with MSR). When both FRM To demonstrate the effectiveness of SVTRv2 in English, we
and MSR modules incorporated, ResNet+TF3 [20], Focal- compare it with 24 popular STR methods. All the mod-

7
els are tested on the newly constructed U14M and the re- R1 R2 R3 R4 Curve MO Com U14M LTB
sults are given in Tab. 3. SVTRv2-B ranks the top in SVTRv2 90.8 89.0 90.4 91.0 90.64 89.04 96.57 86.14 50.2
12 of the 15 evaluated scenarios, It almost outperforms all
SVTR [11] 86.8 82.3 77.3 75.7 82.19 86.12 94.62 78.44 0.0
the EDTRs in every scenario, showing a clear accuracy TPS
SVTRv2 89.5 85.1 78.4 83.8 84.71 88.97 94.62 79.94 0.5
advantage. Meanwhile, it still enjoys a small model size
MAE- SVTR [11] 81.3 87.6 87.6 88.3 87.88 78.74 96.32 83.23 0.0
and a significant speed advantage. Specifically, compared REC* SVTRv2 88.0 88.9 89.4 88.3 89.96 87.56 96.42 85.67 0.2
to MAERec, the best-performed existing model on U14M,
SVTRv2-B shows an accuracy improvement of 0.97% and Table 4. SVTRv2 and SVTR comparisons on irregular text and
8× faster inference speed. Compared to CPPD, which is LTB, where the rectification module (TPS) and the attention-based
known for its accuracy-speed tradeoff, SVTR-B runs faster decoder (MAERec*) are employed.
than 10%, along with a 4.23% accuracy increase on U14M.
Regarding OST, as illustrated in the right part of Fig. 4,
SVTR-B relies solely on a single visual model but achieves CRNN: "SWEET LADY IDOK DOWN FROM THY WOYDOW ON XE"
comparable accuracy to PARSeq, which employed the ad- SVTR: "SWEET LADY LOOK DOWN FRO_ THY W_NDOW OW ME,
LISTER: "SWEET LADY LOOK _________________ WINDOW ON ME?
vanced permuted language model and is the best-performed SVTRv2: "SWEET LADY LOOK DOWN FROM THY WINDOW ON ME"
w/ TPS: C
existing model on OST. In the case of long text recognition, w/ MAERec*: "mayLosMocanos.com
where a large portion of EDTRs are incapable of recogniz-
ing, SVTR-B outperforms LISTER, the best EDTR method CRNN: EDITED WITH INTRODUCTION BY ROY TORGESON
SVTR: EDITED W_TH INTRODUCTION BY ROY TORGESON
on LTB, by 13%, demonstrating the remarkable scalability LISTER: EDITED WITH INTRODUCTION B_ _O_ TORGESON
of SVTRv2. In addition, SVTRv2-T and SVTRv2-S, the SVTRv2: EDITED WITH INTRODUCTION BY ROY TORGESON
w/ TPS: CIYYS
two smaller models also show leading accuracy compared w/ MAERec*: EDITED WITH IN_________I_N __ ___ ____G_SON
with models of similar sizes, offering solutions with differ-
ent accuracy-speed tradeoff. Figure 5. Long text recognition results. TPS and MAERec* denote
Two observations are derived when looking into the re- SVTRv2 integrated with TPS and the decoder of MAERec.
sults on Curve and MO. First, SVTRv2 models significantly
surpass existing CTC models. For example, compared to Method Scene Web Doc HW Avg SceneL>25 Size
SVTR-B, SVTRv2-B gains prominent accuracy improve- ASTER [40] 61.3 51.7 96.2 37.0 61.55 - 27.2
ments of 14.4% and 44.5%, respectively. Second, as shown MORAN [32] 54.6 31.5 86.1 16.2 47.10 - 28.5
in Tab. 4, comparing with previous methods employing the SAR [29] 59.7 58.0 95.7 36.5 62.48 - 27.8
SEED [36] 44.7 28.1 91.4 21.0 46.30 - 36.1
rectification modules [11, 36, 40, 54, 59, 61, 62] and the MASTER [31] 62.8 52.1 84.4 26.9 56.55 - 62.8
attention-based decoder [5, 25, 29, 31, 38, 47, 51, 52, 54, ABINet [15] 66.6 63.2 98.2 53.1 70.28 - 53.1
56] to recognize irregular text, SVTRv2 also performs bet- TransOCR [5] 71.3 64.8 97.1 53.0 71.55 - 83.9
ter than these methods on Curve and MO. In Fig. 4, the CCR-CLIP [56] 71.3 69.2 98.3 60.3 74.78 - 62.0
DCTC [60] 73.9 68.5 99.4 51.0 73.20 - 40.8
rectification module (TPS) and the attention-based decoder CAM [54] 76.0 69.3 98.1 59.2 76.80 - 135
(MAERec*) do not recognize the extremely curved and ro- PARSeq* [4] 84.2 82.8 99.5 63.0 82.37 0.0 28.9
tated text correctly, in contrast, SMTR successes. More- CPPD* [12] 82.7 82.4 99.4 62.3 81.72 0.0 32.1
MAERec* [25] 84.4 83.0 99.5 65.6 83.13 4.1 40.8
over, as demonstrated by the results on LTB in Tab. 4 and
LISTER* [8] 79.4 79.5 99.2 58.0 79.02 13.9 55.0
Fig. 5, TPS and MAERec* both do not effectively recog-
CRNN* [39] 63.8 68.2 97.0 46.1 68.76 37.6 19.5
nize long text, while SVTRv2 circumvents this limitation.
SVTR-B* [11] 77.9 78.7 99.2 62.1 79.49 22.9 19.8
These results indicate that our proposed modules success- SVTRv2-T 77.8 78.8 99.2 62.0 79.45 47.8 6.8
fully address the challenge of handling irregular text that SVTRv2-S 81.1 81.2 99.3 65.0 81.64 50.0 14.0
existing CTC models encountered, while preserving CTC’s SVTRv2-B 83.5 83.3 99.5 67.0 83.31 52.8 22.5
proficiency in recognizing long text.
SVTRv2 models also exhibit strong performance in Chi- Table 5. Results on Chinese text dataset. * denotes that the model
nese text recognition (see Tab. 5), where SVTRv2-B is retrained using the same setting as SVTRv2 (Sec. 4.1).
achieve state of the art. The result underscores its great
adaptability to different languages. To sum, we evaluate
SVTRv2 across a wide range of scenarios. The results con- developing the MSR and FRM modules to tackle the text
sistently confirm that this CTC model beats leading EDTRs. irregular challenge, and devising the SGM module to en-
dow linguistic context to the visual model. These upgrades
5. Conclusion maintain the simple architecture of CTC models, thus they
remain quite efficient. More importantly, our thorough val-
In this paper, we have presented SVTRv2, an accurate and idation on multiple benchmarks demonstrates the effective-
efficient CTC-based STR method. SVTRv2 is featured by ness of SVTRv2. It achieves leading accuracy and inference

8
speed in various challenging scenarios covering regular, ir- modeling for scene text recognition. In CVPR, pages 7098–
regular, occluded, Chinese and long text, convincingly in- 7107, 2021. 2, 3, 6, 7, 8
dicating that SVTRv2 has beat EDTRs in scene text recog- [16] A. Graves, S. Fernández, F. Gomez, and J. Schmidhu-
nition. In addition, we retrain 24 methods from scratch on ber. Connectionist temporal classification: Labelling unseg-
U14M-Filter without data leakage, constituting a compre- mented sequence data with recurrent neural networks. In
hensive and reliable benchmark. We hope that SVTRv2 and ICML, page 369–376, 2006. 1
this benchmark will further advance the development of the [17] T. Guan, C. Gu, J. Tu, X. Yang, Q. Feng, Y. Zhao, and W.
OCR community. Shen. Self-Supervised implicit glyph attention for text recog-
nition. In CVPR, pages 15285–15294, 2023. 2, 3
References [18] Tongkun Guan, Wei Shen, Xue Yang, Qi Feng, Zekun Jiang,
and Xiaokang Yang. Self-Supervised Character-to-Character
[1] R. Anhar, S. Palaiahnakote, C. S. Chan, and C. L. Tan. A ro- distillation for text recognition. In ICCV, pages 19473–
bust arbitrary text detection system for natural scene images. 19484, 2023. 2, 3
Expert Syst. Appl., 41(18):8027–8048, 2014. 3, 5, 6 [19] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for
[2] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, text localisation in natural images. In CVPR, pages 2315–
and H. Lee. What is wrong with scene text recognition model 2324, 2016. 2, 7, 3
comparisons? dataset and model analysis. In ICCV, pages [20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
4714–4722, 2019. 2 for image recognition. In CVPR, pages 770–778, 2016. 7, 1
[3] H. Bao, L. Dong, S. Piao, and F. Wei. BEiT: BERT pre- [21] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick.
training of image transformers. In ICLR, 2022. 3 Masked autoencoders are scalable vision learners. In CVPR,
[4] D. Bautista and R.l Atienza. Scene text recognition with pages 15979–15988, 2022. 3
permuted autoregressive sequence models. In ECCV, pages
[22] A. G Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
178–196, 2022. 1, 2, 3, 5, 6, 7, 8
T. Weyand, M. Andreetto, and H. Adam. MobileNets: Effi-
[5] J. Chen, B. Li, and X. Xue. Scene Text Telescope: Text- cient convolutional neural networks for mobile vision appli-
focused scene image super-resolution. In CVPR, pages cations. CoRR, abs/1704.04861, 2017. 1
12021–12030, 2021. 7, 8
[23] W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin. GTC: Guided train-
[6] J. Chen, H. Yu, J. Ma, M. Guan, X. Xu, X. Wang, S. Qu,
ing of ctc towards efficient and accurate scene text recogni-
B. Li, and X. Xue. Benchmarking chinese text recogni-
tion. In AAAI, pages 11005–11012, 2020. 1, 2, 3, 6
tion: Datasets, baselines, and an empirical study. CoRR,
[24] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman.
abs/2112.15093, 2021. 2, 5, 6
Synthetic data and artificial neural networks for natural scene
[7] X. Chen, L. Jin, Y. Zhu, C. Luo, and T. Wang. Text recog-
text recognition. CoRR, abs/1406.2227, 2014. 2, 7, 3
nition in the wild: A survey. ACM Comput. Surv., 54(2):
42:1–42:35, 2022. 1, 2 [25] Q. Jiang, J. Wang, D. Peng, C. Liu, and L. Jin. Revisiting
[8] C. Cheng, P. Wang, C. Da, Q. Zheng, and C. Yao. LISTER: scene text recognition: A data perspective. In ICCV, pages
Neighbor decoding for length-insensitive scene text recogni- 20486–20497, 2023. 1, 2, 3, 5, 6, 7, 8
tion. In ICCV, pages 19484–19494, 2023. 1, 2, 7, 8, 3 [26] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A.
[9] C. Da, P. Wang, and C. Yao. Levenshtein OCR. In ECCV, Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chan-
pages 322–338, 2022. 2, 3 drasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. IC-
[10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. DAR 2015 competition on robust reading. In ICDAR, pages
Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, 1156–1160, 2015. 5, 7
S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth [27] D. KaratzasAU, F. ShafaitAU, S. UchidaAU, M. Iwamu-
16x16 words: Transformers for image recognition at scale. raAU, L. G. i. BigordaAU, S. R. MestreAU, J. MasAU, D. F.
In ICLR, 2021. 7 MotaAU, J. A. AlmazànAU, and L. P. de las Heras. ICDAR
[11] Y. Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y. Du, and 2013 robust reading competition. In ICDAR, pages 1484–
Y. Jiang. SVTR: Scene text recognition with a single visual 1493, 2013. 5, 6
model. In IJCAI, pages 884–890, 2022. 1, 2, 3, 4, 7, 8 [28] C. Li, W. Liu, R. Guo, X. Yin, K. Jiang, Y. Du, Y. Du, L. Zhu,
[12] Y. Du, Z. Chen, C. Jia, X. Yin, C. Li, Y. Du, and Y. Jiang. B. Lai, X. Hu, D. Yu, and Y. Ma. PP-OCRv3: More attempts
Context perception parallel decoder for scene text recogni- for the improvement of ultra lightweight ocr system. CoRR,
tion. CoRR, abs/2307.12270, 2023. 1, 2, 7, 8, 3 abs/2206.03001, 2022. 1, 2, 3, 6
[13] Y. Du, Z. Chen, C. Jia, X. Gao, and Y. Jiang. Out of [29] H. Li, P. Wang, C. Shen, and G. Zhang. Show, attend and
length text recognition with sub-string matching. CoRR, read: A simple and strong baseline for irregular text recogni-
abs/2407.12317, 2024. 2, 5 tion. In AAAI, pages 8610–8617, 2019. 2, 3, 7, 8
[14] Y. Du, Z. Chen, Y. Su, C. Jia, and Y. Jiang. Instruction- [30] I. Loshchilov and F. Hutter. Decoupled weight decay regu-
guided scene text recognition. CoRR, abs/2401.17851, 2024. larization. In ICLR, 2019. 6
3 [31] N. Lu, W. Yu, X. Qi, Y. Chen, P. Gong, R. Xiao, and X.
[15] S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang. Read Like Bai. MASTER: Multi-aspect non-local network for scene
Humans: Autonomous, bidirectional and iterative language text recognition. Pattern Recognit., 117:107980, 2021. 7, 8

9
[32] C. Luo, L. Jin, and Z. Sun. MORAN: A multi-object rec- [50] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon,
tified attention network for scene text recognition. Pattern and S. Xie. Convnext V2: co-designing and scaling convnets
Recognit., 90:109–118, 2019. 2, 3, 4, 7, 8 with masked autoencoders. In CVPR, pages 16133–16142,
[33] A. Mishra, A. Karteek, and C. V. Jawahar. Scene text recog- 2023. 7
nition using higher order language priors. In BMVC, pages [51] X. Xie, L. Fu, Z. Zhang, Z. Wang, and X. Bai. Toward Un-
1–11, 2012. 5, 6 derstanding WordArt: Corner-guided transformer for scene
[34] B. Na, Y. Kim, and S. Park. Multi-modal Text Recognition text recognition. In ECCV, pages 303–321, 2022. 2, 3, 7, 8
Networks: Interactive enhancements between visual and se- [52] J. Xu, Y. Wang, H. Xie, and Y. Zhang. Ote: Exploring accu-
mantic features. In ECCV, pages 446–463, 2022. 2, 3, 7 rate scene text recognition using one token. In CVPR, pages
[35] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recog- 28327–28336, 2024. 2, 3, 7, 8
nizing text with perspective distortion in natural scenes. In [53] Jianwei Yang, Chunyuan Li, Xiyang Dai, and Jianfeng Gao.
CVPR, pages 569–576, 2013. 3, 5, 6 Focal modulation networks. In NeurIPS, 2022. 7
[36] Z. Qiao, Y. Zhou, D. Yang, Y. Zhou, and W. Wang. SEED: [54] M. Yang, B. Yang, M. Liao, Y. Zhu, and X. Bai. Class-aware
Semantics enhanced encoder-decoder framework for scene mask-guided feature refinement for scene text recognition.
text recognition. In CVPR, pages 13525–13534, 2020. 2, 3, Pattern Recognition, 149:110244, 2024. 2, 3, 7, 8
7, 8 [55] D. Yu, X. Li, C. Zhang, T. Liu, J. Han, J. Liu, and E. Ding.
[37] M. Rang, Z. Bi, C. Liu, Y. Wang, and K. Han. An empirical Towards accurate scene text recognition with semantic rea-
study of scaling law for scene text recognition. In CVPR, soning networks. In CVPR, pages 12113–12122, 2020. 2, 3,
pages 15619–15629, 2024. 2, 5 7
[38] F. Sheng, Z. Chen, and B. Xu. NRTR: A no-recurrence [56] H. Yu, X. Wang, B. Li, and X. Xue. Chinese text recogni-
sequence-to-sequence model for scene text recognition. In tion with a pre-trained CLIP-Like model through image-ids
ICDAR, pages 781–786, 2019. 2, 3, 7, 8 aligning. In ICCV, pages 11909–11918, 2023. 7, 8
[39] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural [57] X. Yue, Z. Kuang, C. Lin, H. Sun, and W. Zhang. Ro-
network for image-based sequence recognition and its appli- bustScanner: Dynamically enhancing positional clues for ro-
cation to scene text recognition. IEEE Trans. Pattern Anal. bust text recognition. In ECCV, pages 135–151, 2020. 2, 3,
Mach. Intell., 39(11):2298–2304, 2017. 1, 3, 7, 8 7
[40] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai. [58] B. Zhang, H. Xie, Y. Wang, J. Xu, and Y. Zhang. Linguistic
ASTER: An attentional scene text recognizer with flexible More: Taking a further step toward efficient and accurate
rectification. IEEE Trans. Pattern Anal. Mach. Intell., 41(9): scene text recognition. In IJCAI, pages 1704–1712, 2023. 2,
2035–2048, 2019. 2, 3, 4, 7, 8 3, 5, 7
[59] H. Zhang, Q. Yao, M. Yang, Y. Xu, and X. Bai. AutoSTR:
[41] K. Simonyan and A. Zisserman. Very deep convolu-
Efficient backbone search for scene text recognition. In
tional networks for large-scale image recognition. CoRR,
ECCV, pages 751–767. Springer, 2020. 2, 3, 7, 8
abs/1409.1556, 2014. 1
[60] Z. Zhang, N. Lu, M. Liao, Y. Huang, C. Li, M. Wang, and
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
W. Peng. Self-distillation regularized connectionist temporal
A. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you
classification loss for text recognition: A simple yet effective
need. In NIPS, pages 5998–6008, 2017. 4, 7
approach. In AAAI, pages 7441–7449, 2024. 8, 3
[43] I. Loshchilov and F. Hutter. SGDR: stochastic gradient de-
[61] Tianlun Zheng, Zhineng Chen, Jinfeng Bai, Hongtao Xie,
scent with warm restarts. In ICLR, 2017. 6
and Yu-Gang Jiang. TPS++: Attention-enhanced thin-plate
[44] Z. Wan, F. Xie, Y. Liu, X. Bai, and C. Yao. 2d-ctc for scene spline for scene text recognition. In IJCAI, pages 1777–1785,
text recognition. CoRR, abs/1907.09705, 2019. 2 2023. 2, 3, 8
[45] K. Wang, B. Babenko, and S. Belongie. End-to-end scene [62] T. Zheng, Z. Chen, S. Fang, H. Xie, and Y. Jiang. CDistNet:
text recognition. In ICCV, pages 1457–1464, 2011. 5, 6 Perceiving multi-domain character distance for robust text
[46] P. Wang, C. Da, and C. Yao. Multi-Granularity Prediction recognition. Int. J. Comput. Vis., 132(2):300–318, 2024. 2,
for scene text recognition. In ECCV, pages 339–355, 2022. 3, 7, 8
2, 7, 3 [63] B. Zhou, Y. Qu, Z. Wang, Z. Li, B. Zhang, and H. Xie. Focus
[47] T. Wang, Y. Zhu, L. Jin, C. Luo, X. Chen, Y. Wu, Q. Wang, on the whole character: Discriminative character modeling
and M. Cai. Decoupled attention network for text recogni- for scene text recognition. CoRR, 2407.05562, 2024. 2, 3
tion. In AAAI, pages 12216–12224, 2020. 3, 7, 8
[48] Y. Wang, H. Xie, S. Fang, J. Wang, S. Zhu, and Y. Zhang.
From Two to One: A new scene text recognizer with visual
language modeling network. In ICCV, pages 14194–14203,
2021. 2, 3, 5, 6, 7
[49] J. Wei, H. Zhan, Y. Lu, X. Tu, B. Yin, C. Liu, and U. Pal.
Image as a language: Revisiting scene text recognition via
balanced, unified and synchronized vision-language reason-
ing network. In AAAI, pages 5885–5893, 2024. 2, 7, 3

10
SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
Supplementary Material
6. More detail of Ablation Study 6.1. Progressive Ablation Experiments
To comprehensively evaluate the contributions of the inno-
SVTRv2 builds upon the foundation of SVTR by introduc-
vations in SVTRv2, a series of progressive ablation exper-
ing several innovative strategies aimed at addressing chal-
iments are conducted. Tab 6 outlines the results, with the
lenges in recognizing irregular text and modeling linguistic
following observations:
context. The key advancements and their impact are sys-
1. Baseline (ID0): The original SVTR serves as the base-
tematically detailed below:
line for comparison.
Removal of the Rectification Module and Introduc- 2. Rectification Module Removal (ID1) reveals that
tion of MSR and FRM. In the original SVTR, a rectifica- while the rectification module (e.g., TPS) improves irreg-
tion module is employed to recognize irregular texts. How- ular text recognition accuracy, it hinders the model’s ability
ever, this approach negatively impacts the recognition of to recognize long texts. This confirms its limitations in bal-
long texts. To overcome this limitation, SVTRv2 removes ancing these tasks.
the rectification module entirely. To effectively handle ir- 3. Improvement in Feature Resolution (ID2): Doubling
regular text without compromising the CTC model’s ability H
the height resolution ( 16 → H8 ) significantly boosts perfor-
to generalize to long text, MSR and FRM are introduced. mance across challenging datasets, particularly for irregular
Improvement in Feature Resolution. SVTR extracts text.
H
visual representations of size 16 ×W 4 × D2 from input im-
4. Replacement of Local Attention with Conv2 (ID3):
ages of size H × W × 3. While this approach is effective Replacing the sliding window-based local attention with
for regular text, it struggles with retaining the distinct char- two consecutive group convolutions (Conv2 ) yields im-
acteristics of irregular text. SVTRv2 doubles the height res- provements in artistic text, with a 3.0% increase in accu-
H
olution ( 16 → H8 ) of visual features, producing features of racy. This result highlights the efficacy of convolution-
size H8 × W 4 × D2 , thereby improving its capacity to recog-
based approaches in capturing character-level nuances, such
nize irregular text. as strokes and textures, thereby improving its ability to rec-
Refinement of Local Mixing Mechanisms. SVTR em- ognize artistic and irregular text styles.
ploys a hierarchical vision transformer structure, leverag- 5. Incorporation of MSR and FRM (ID4 and ID5): These
ing two mixing strategies: Local Mixing is implemented components collectively enhance accuracy on irregular text
through a sliding window-based local attention mechanism, benchmarks (e.g., Curve), surpassing the rectification-based
and Global Mixing employs the standard global multi-head SVTR (ID0) by 6.0%, without compromising the CTC
self-attention mechanism. SVTRv2 retains the hierarchical model’s ability to generalize to long text.
vision transformer structure and the global multi-head self- 6. Integration of SGM (ID6): Adding SGM yields sig-
attention mechanism for Global Mixing. For Local Mix- nificant gains on multiple datasets, improving accuracy on
ing, SVTRv2 introduces a pivotal change. Specifically, the OST by 5.11% and Union14M-Benchmark by 2.28%.
sliding window-based local attention is replaced with two It can be summarized as that, by integrating Conv2 ,
consecutive group convolutions (Conv2 ). It is important to MSR, FRM, and SGM, SVTRv2 significantly improves per-
highlight that unlike previous CNNs [20, 22, 41], there is no formance in recognizing irregular text and modeling lin-
normalization or activation layer between the two convolu- guistic context over SVTR, while maintaining robust long-
tions. text recognition capabilities and preserving the efficiency of
CTC-based inference.
Semantic Guidance Module (SGM). The original
SVTR model relies solely on CTC framework for both 7. SVTRv2 Variants
training and inference. However, CTC is inherently limited
in its ability to model linguistic context. SVTRv2 addresses There are several hyper-parameters in SVTRv2, including
this by introducing a Semantic Guidance Module (SGM) the depth of channel (Di ) and the number of heads at each
during training. SGM facilitates the visual encoder in cap- stage, the number of mixing blocks (Ni ) and their permuta-
turing linguistic information, enriching the feature repre- tion. By varying them, SVTRv2 architectures with different
sentation. Importantly, SGM is discarded during inference, capacities could be obtained and we construct three typical
ensuring that the efficiency of CTC-based decoding remains ones, i.e., SVTRv2-T (Tiny), SVTRv2-S (Small), SVTRv2-
unaffected while still benefiting from its contributions dur- B (Base). Their detail configurations are shown in Tab. 7.
ing the training phase. [L]m [G]n denotes that the first m mixing blocks in

1
IIIT5k SVT ICDAR2013 ICDAR2015 SVTP CUTE80 ∥ Curve Multi-Oriented Artistic Contextless Salient Multi-Words General
ID Method Common Benchmarks Avg Union14M Benchmarks Avg LTB OST Size FPS
0 SVTR (w/ TPS) 98.1 96.1 96.4 89.2 92.1 95.8 94.62 82.2 86.1 69.7 75.1 81.6 73.8 80.7 78.44 0.0 71.2 19.95 141
1 0 + w/o TPS 98.0 97.1 97.3 88.6 90.7 95.8 94.58 76.2 44.5 67.8 78.7 75.2 77.9 77.8 71.17 45.1 67.8 18.10 161
H
2 1 + 16 → H8
98.9 97.4 97.9 89.7 91.8 96.9 95.41 82.2 64.3 70.2 80.0 80.9 80.6 80.5 76.95 44.8 69.5 18.10 145
3 2 + Conv2 98.7 97.1 97.1 89.6 91.6 97.6 95.28 82.9 65.6 73.2 80.0 80.5 81.6 80.8 77.78 47.4 71.1 17.77 159
4 3 + MSR 98.7 98.0 97.4 89.4 91.6 97.6 95.44 87.4 83.7 75.4 80.9 81.9 83.5 82.8 82.22 50.9 72.5 17.77 159
5 4 + FRM 98.8 98.1 98.4 89.8 92.9 99.0 96.16 88.2 86.2 77.5 83.2 83.9 84.6 83.5 83.86 50.7 74.9 19.76 143
6 5 + SGM 99.2 98.0 98.7 91.1 93.5 99.0 96.57 90.6 89.0 79.3 86.1 86.2 86.7 85.1 86.14 50.2 80.0 19.76 143

Table 6. Ablation study of the proposed strategies on each benchmark subset, along with variations in the model parameters and speeds.

Models [D0 , D1 , D2 ] [N1 , N2 , N3 ] Heads Permutation Algorithm 1: Inference Time

SVTRv2-T [64,128,256] [3,6,3] [2,4,8] [L]6 [G]6 Input : A set of images I with size |I| = 3000,
SVTRv2-S [96,192,384] [3,6,3] [3,6,12] [L]6 [G]6
SVTRv2-B [128,256,384] [6,6,6] [4,8,12] [L]8 [G]10
batch size B = 1, N text lengths
Output: Overall inference time of the model
Table 7. Architecture specifications of SVTRv2 variants. Initialize two lists: total time list and
count list of size N , initialized to 0;
for each image Ij in I where j ∈ {1, 2, . . . , 3000}
REBU-Syn 3.45M do
ST+MJ 0.25M Determine the text length li for image Ij ;
Real Dataset Union14M Perform inference on Ij with text length li ;
14.5M Union14M-L Benchmark Record inference time tij ;
3.2M total time list[li ] += tij ;
Others
count list[li ] += 1;
60M 0.4M
6.5K Initialize avg time list;
for each text length li where i ∈ {1, 2, . . . , N } do
if count list[i] > 0 then
Figure 6. The relationship between the three real-world training avg time list[i] =
sets.
total time list[i] /
count list[i];

Curve MO Artistic Cless Salient MW General Compute the final average inference time:
2,426 1,369 900 779 1,585 829 400,000
Real [4] 1,276 440 432 326 431 193 254,174 N
1 X
REBU-Syn [37] 1,285 443 462 363 442 289 260,575 inference time = avg time list[i]
U14M-Train [25] 9 3 30 37 11 96 6,401 N i=1

Table 8. Overlapping analysis between the test set of Union14M-L return inference time;
(U14M) and three real-world training sets.

8. More detail of real-world datasets

SVTRv2 utilize local mixing, while the last n mixing blocks
employ global mixing. Specifically, in SVTRv2-T and For English recognition, we train models on real-world
SVTR-S, all blocks in the first stage and the first three datasets, from which the models exhibit stronger recogni-
blocks in the second stage use local mixing. The last three tion capability [4, 25, 37]. There are three large-scale real-
blocks in the second stage, as well as all blocks in the third world training sets, i.e., the Real dataset [4], REBU-Syn
stage, are global mixing. In the case of SVTRv2-B, all [37], and the training set of Union14M-L (U14M-Train)
blocks in the first stage and the first two blocks in the second [25]. However, as shown in Fig. 6 and Tab. 8, the for-
stage use local mixing, whereas the last four blocks in the mer two significantly overlap with Union14M-Benchmarks,
second stage and all blocks in the third stage adopt global thus not suitable for model training. Surprisingly, U14M-
mixing. Train is also overlapped with Union14M-Benchmarks, in

2
IIIT5k SVT ICDAR2013 ICDAR2015 SVTP CUTE80 ∥ Curve Multi-Oriented Artistic Contextless Salient Multi-Words General
Method Venue Encoder Common Benchmarks Avg Union14M Benchmarks Avg Size
ASTER [40] TPAMI2019 ResNet+LSTM 93.3 90.0 90.8 74.7 80.2 80.9 84.98 34.0 10.2 27.7 33.0 48.2 27.6 39.8 31.50 27.2
NRTR [38] ICDAR2019 Stem+TF6 90.1 91.5 95.8 79.4 86.6 80.9 87.38 31.7 4.40 36.6 37.3 30.6 54.9 48.0 34.79 31.7
MORAN [32] PR2019 ResNet+LSTM 91.0 83.9 91.3 68.4 73.3 75.7 80.60 8.90 0.70 29.4 20.7 17.9 23.8 35.2 19.51 17.4
SAR [29] AAAI2019 ResNet+LSTM 91.5 84.5 91.0 69.2 76.4 83.5 82.68 44.3 7.70 42.6 44.2 44.0 51.2 50.5 40.64 57.7
DAN [47] AAAI2020 ResNet+FPN 93.4 87.5 92.1 71.6 78.0 81.3 83.98 26.7 1.50 35.0 40.3 36.5 42.2 42.1 32.04 27.7
SRN [55] CVPR2020 ResNet+FPN 94.8 91.5 95.5 82.7 85.1 87.8 89.57 63.4 25.3 34.1 28.7 56.5 26.7 46.3 40.14 54.7
SEED* [36] CVPR2020 ResNet+LSTM 93.8 89.6 92.8 80.0 81.4 83.6 86.87 40.4 15.5 32.1 32.5 54.8 35.6 39.0 35.70 24.0
AutoSTR* [59] ECCV2020 NAS+LSTM 94.7 90.9 94.2 81.8 81.7 - - 47.7 17.9 30.8 36.2 64.2 38.7 41.3 39.54 6.00
RoScanner [57] ECCV2020 ResNet 95.3 88.1 94.8 77.1 79.5 90.3 87.52 43.6 7.90 41.2 42.6 44.9 46.9 39.5 38.09 48.0
ABINet [15] CVPR2021 ResNet+TF3 96.2 93.5 97.4 86.0 89.3 89.2 91.93 59.5 12.7 43.3 38.3 62.0 50.8 55.6 46.03 36.7
VisionLAN [48] ICCV2021 ResNet+TF3 95.8 91.7 95.7 83.7 86.0 88.5 90.23 57.7 14.2 47.8 48.0 64.0 47.9 52.1 47.39 32.8
PARSeq* [4] ECCV2022 ViT-S 97.0 93.6 97.0 86.5 88.9 92.2 92.53 63.9 16.7 52.5 54.3 68.2 55.9 56.9 52.62 23.8
MATRN [34] ECCV2022 ResNet+TF3 96.6 95.0 97.9 86.6 90.6 93.5 93.37 63.1 13.4 43.8 41.9 66.4 53.2 57.0 48.40 44.2
MGP-STR* [46] ECCV2022 ViT-B 96.4 94.7 97.3 87.2 91.0 90.3 92.82 55.2 14.0 52.8 48.5 65.2 48.8 59.1 49.09 148
LevOCR* [9] ECCV2022 ResNet+TF3 96.6 94.4 96.7 86.5 88.8 90.6 92.27 52.8 10.7 44.8 51.9 61.3 54.0 58.1 47.66 109
CornerTF* [51] ECCV2022 CornerEncoder 95.9 94.6 97.8 86.5 91.5 92.0 93.05 62.9 18.6 56.1 58.5 68.6 59.7 61.0 55.07 86.0
CPPD [12] Preprint SVTR-B 97.6 95.5 98.2 87.9 90.9 92.7 93.80 65.5 18.6 56.0 61.9 71.0 57.5 65.8 56.63 26.8
SIGA* [17] CVPR2023 ViT-B 96.6 95.1 97.8 86.6 90.5 93.1 93.28 59.9 22.3 49.0 50.8 66.4 58.4 56.2 51.85 113
CCD* [18] ICCV2023 ViT-B 97.2 94.4 97.0 87.6 91.8 93.3 93.55 66.6 24.2 63.9 64.8 74.8 62.4 64.0 60.10 52.0
LISTER* [8] ICCV2023 FocalNet-B 96.9 93.8 97.9 87.5 89.6 90.6 92.72 56.5 17.2 52.8 63.5 63.2 59.6 65.4 54.05 49.9
LPV-B* [58] IJCAI2023 SVTR-B 97.3 94.6 97.6 87.5 90.9 94.8 93.78 68.3 21.0 59.6 65.1 76.2 63.6 62.0 59.40 35.1
CDistNet* [62] IJCV2024 ResNet+TF3 96.4 93.5 97.4 86.0 88.7 93.4 92.57 69.3 24.4 49.8 55.6 72.8 64.3 58.5 56.38 65.5
CAM* [54] PR2024 ConvNeXtV2-B 97.4 96.1 97.2 87.8 90.6 92.4 93.58 63.1 19.4 55.4 58.5 72.7 51.4 57.4 53.99 135
BUSNet [49] AAAI2024 ViT-S 96.2 95.5 98.3 87.2 91.8 91.3 93.38 - - - - - - - - 56.8
DCTC [60] AAAI2024 SVTR-L 96.9 93.7 97.4 87.3 88.5 92.3 92.68 - - - - - - - - 40.8
OTE [52] CVPR2024 SVTR-B 96.4 95.5 97.4 87.2 89.6 92.4 93.08 - - - - - - - - 25.2
CRNN [39] TPAMI2016 ResNet+LSTM 82.9 81.6 91.1 69.4 70.0 65.5 76.75 7.50 0.90 20.7 25.6 13.9 25.6 32.0 18.03 8.30
SVTR* [11] IJCAI2022 SVTR-B 96.0 91.5 97.1 85.2 89.9 91.7 91.90 69.8 37.7 47.9 61.4 66.8 44.8 61.0 55.63 24.6
SVTRv2 - SVTRv2-B 97.7 94.0 97.3 88.1 91.2 95.8 94.02 74.6 25.2 57.6 69.7 77.9 68.0 66.9 62.83 19.8

Table 9. Results of SVTRv2 and existing models when trained on synthetic datasets (ST + MJ) [19, 24]. * represents that the results on
Union14M-Benchmarks are evaluated using the model they released.

nearly 6.5k text instances across the seven subsets. It means 10. Results when trained on synthetic datasets
the models trained based on U14M-Train suffer from data
leakage when tested on Union14M-Benchmarks,, thus the Previous research typically follows a typical evaluation pro-
results reported by [25] should be updated. To this end, tocol, where models are trained on synthetic datasets and
we create a filtered version of Union14M-L training set, validated using six widely recognized real-world bench-
termed as U14M-Filter, by filtering these overlapping in- marks. Building on this protocol, we trained the SVTRv2
stances. This new dataset is used to train SVTRv2 and other model on synthetic datasets. In addition to evaluating
24 methods we reproduced. SVTRv2 on the six common benchmarks, we also assess its
performance on challenging benchmarks, i.e. Union14M-
Benchmark, offering a comprehensive understanding of the
model’s generalization capabilities. For methods that have
not reported performance on challenging benchmarks, we
9. More detail of Inference Time conducted additional evaluations using their publicly avail-
able models and present these results for comparative anal-
In term of the inference time, we do not utilize any accel- ysis. As illustrated in Tab. 9, models trained on synthetic
eration framework and instead employ PyTorch’s dynamic datasets exhibit notably reduced performance compared to
graph mode on one NVIDIA 1080Ti GPU. We first mea- those trained on large-scale real-world datasets (see Tab. 3).
sure the inference time for 3,000 images with a batch size This performance drop is particularly pronounced on chal-
of 1, calculating the average inference time for each text lenging benchmarks. These findings emphasize the critical
length. We then computed the arithmetic mean of the av- importance of real datasets in improving recognition accu-
erage time across all text lengths to determine the overall racy for challenging text scenarios.
inference time of the model. Algorithm 1 details the pro- Despite the challenges associated with synthetic
cess of measuring inference time. datasets, SVTRv2 exhibits superior performance across

3
Blurred Artistic Incomplete Other Total Labelerr bols. As shown in Tab. 10, after excluding samples where
IIIT5k 0 16 1 4 21 4 the errors were due to labeling inconsistency, the remain-
SVT 4 4 4 0 12 0 ing errors could be attributed to blurred (30.81%), artistic
ICDAR 2013 2 2 4 2 10 2 (24.24%), and incomplete text (31.82%), respectively. This
ICDAR 2015 48 19 42 13 122 35
SVTP 7 6 12 7 32 4
classification allows us to conclude that SVTRv2’s recog-
CUTE80 0 1 0 0 1 1 nition performance, particularly in complex scenarios in-
Total 61 48 63 26 198 46
volving blurred, artistic, or incomplete text, requires further
30.81% 24.24% 31.82% 13.13% 100% enhancement.

Table 10. Distribution of bad cases for SVTRv2 on Common 12. Standardized Model Training Settings
benchmarks.
The optimal hyperparameters for training different models
often vary and are not universally fixed. However, criti-
both average accuracy metrics, surpassing the previously cal factors such as training epochs, data augmentation tech-
best-performing method by 0.22% and 2.73%, respectively. niques, input size, data type, and evaluation protocols have a
On irregular text benchmarks, such as Curve and Multi- substantial influence on model accuracy. To ensure fair and
Oriented, SVTR achieves strong results, largely due to its unbiased performance comparisons across models, these
integrated correction module, which is particularly adept factors must be strictly standardized. Accordingly, we adopt
at handling irregular text patterns, even when trained on a uniform training and evaluation setting, as shown in Table
synthetic datasets. Notably, SVTRv2 achieves a substantial 11, to maintain consistency across all settings while simul-
4.8% improvement over SVTR on the Curve benchmark, taneously enabling each model to achieve its best possible
further demonstrating its enhanced capacity to address accuracy.
irregular text. Overall, these results demonstrate that, even
when trained solely on synthetic datasets, SVTRv2 exhibits
strong generalization capabilities, effectively handling
complex and challenging text recognition scenarios.

11. Qualitative Analysis of Recognition Results

The SVTRv2 model achieved an average accuracy of 96%
on a standard dataset. To investigate the underlying causes
of the remaining 4% of recognition errors, we conducted a
detailed analysis of misclassified samples across six stan-
dard datasets, as illustrated in Fig. 7 and Fig. 8. While
previous research has typically categorized common bench-
marks into regular and irregular text. However, these er-
ror samples indicates that the majority of incorrectly recog-
nized text is not irregular. This suggests that, under the cur-
rent training paradigm using large-scale real-world datasets,
a more rigorous manual screening process for common
benchmarks is warranted.
Consequently, we identified five primary causes of
recognition errors in these samples: (1) blurred, (2) artistic,
(3) incomplete text, (4) image text labeling inconsistency,
and (5) others. Specifically, the blurring text category en-
compasses issues such as low resolution, motion blur, or
extreme lighting conditions. The artistic text category in-
cludes unconventional print fonts, predominantly found in
business signage, as well as a proportion of handwritten
text. Incomplete text refers to characters obscured by other
objects or missing due to improper cropping, where missing
information must be inferred from context. Image text la-
beling inconsistency means that there is an error in the given
text label or there are some characters with phonetic sym-

4
Setting Detail
Training Set For training, when the text length of a text image exceeds 25, samples
with text length ≤ 25 are randomly selected from the training set to
ensure models are only exposed to short texts (length ≤ 25).
Test Sets For all test sets except the long-text test set (LTB), text images with text
length > 25 are filtered. Text length is calculated by removing spaces
and non-94-character-set special characters.
Input Size Unless a method explicitly requires a dynamic size, models use a fixed
input size of 32 × 128. If a model performs incorrectly with 32 × 128
during training, the original size is used. The test input size matches the
training size.
Data Augmentation All models use the data augmentation strategy employed by PARSeq.
Training Epochs Unless pre-training is required, all models are trained for 20 epochs.
Optimizer AdamW is the default optimizer. If training fails to converge with
AdamW, Adam or other optimizers are used.
Batch Size Maximum batch size for all models is 1024. If single-GPU training is
not feasible, 2 GPUs (512 per GPU) or 4 GPUs (256 per GPU) are used.
If 4-GPU training runs out of memory, the batch size is halved, and the
learning rate is adjusted accordingly.
Learning Rate Default learning rate for batch size 1024 is 0.00065. The learning rate
is adjusted multiple times to achieve the best results.
Learning Rate Scheduler A linear warm-up for 1.5 epochs is followed by a OneCycle scheduler.
Weight Decay Default weight decay is 0.05. NormLayer and Bias parameters have a
weight decay of 0.
Data Type All models are trained with mixed precision.
EMA or Similar Tricks No EMA or similar tricks are used for any model.
Evaluation Protocols Word accuracy is evaluated after filtering special characters and con-
verting all text to lowercase.

Table 11. A uniform training and evaluation setting to maintain consistency across all settings while simultaneously enabling each model
to achieve its best possible accuracy.

5
Figure 7. The bad cases of SVTRv2 in IIIT5k [33], SVT [45], ICDAR 2013 [27], SVTP [35] and CUTE80 [1]. labels and predicted results,
and predicted scores are denoted as Textlabel | Textpred | Scorepred . Yellow, red, blue, and green boxes indicate blurred, artistic fonts,
incomplete text, and label-inconsistent samples, respectively. Other samples have no boxes.

6
Figure 8. The bad cases of SVTRv2 in ICDAR 2015 [26]. labels and predicted results, and predicted scores are denoted as Textlabel |
Textpred | Scorepred . Yellow, red, blue, and green boxes indicate blurred, artistic fonts, incomplete text, and label-inconsistent samples,
respectively. Other samples have no boxes.

Traditional Vs Modern Grammar
60% (5)
Traditional Vs Modern Grammar
57 pages
Semantic Roles PDF
100% (2)
Semantic Roles PDF
20 pages
Romanian Lang
100% (1)
Romanian Lang
61 pages
TPGSR (Batch 10)
No ratings yet
TPGSR (Batch 10)
69 pages
End-To-End Text Recognition With Convolutional Neural Networks
No ratings yet
End-To-End Text Recognition With Convolutional Neural Networks
60 pages
Unconstrained Text Recognition With Convolutional Neural Networks
No ratings yet
Unconstrained Text Recognition With Convolutional Neural Networks
13 pages
The Evolved Transformer
No ratings yet
The Evolved Transformer
14 pages
Text Classification Improved BT Integrating Bidirectional LSTM With Two-Dimensional Max Pooling
No ratings yet
Text Classification Improved BT Integrating Bidirectional LSTM With Two-Dimensional Max Pooling
11 pages
RNN Approaches To Text Normalization - A Challenge
No ratings yet
RNN Approaches To Text Normalization - A Challenge
17 pages
A Novel Ensemble Deep Network Framework For Scene Text Recognition
No ratings yet
A Novel Ensemble Deep Network Framework For Scene Text Recognition
11 pages
Discriminative Probing and Tuning For Text-To-Image Generation
No ratings yet
Discriminative Probing and Tuning For Text-To-Image Generation
22 pages
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation
No ratings yet
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation
11 pages
Biten Is An Image Worth Five Sentences A New Look Into WACV 2022 Paper
No ratings yet
Biten Is An Image Worth Five Sentences A New Look Into WACV 2022 Paper
10 pages
Electronics 12 03087
No ratings yet
Electronics 12 03087
10 pages
2021 Tacl-1 8
No ratings yet
2021 Tacl-1 8
19 pages
Deep Speech 3 1707.07413
No ratings yet
Deep Speech 3 1707.07413
8 pages
A Convolutional Recurrent Neural Network For The Handwritten Text Recognition of Historical Greek Manuscripts
No ratings yet
A Convolutional Recurrent Neural Network For The Handwritten Text Recognition of Historical Greek Manuscripts
14 pages
A Transformer-Based Framework For Scene Text Recognition
No ratings yet
A Transformer-Based Framework For Scene Text Recognition
16 pages
2022 Text Recognition in The Wild
No ratings yet
2022 Text Recognition in The Wild
35 pages
Lexicon-Free Conversational Speech Recognition With Neural Networks
No ratings yet
Lexicon-Free Conversational Speech Recognition With Neural Networks
10 pages
Robust Scene Text Recognition With Automatic Rectification
No ratings yet
Robust Scene Text Recognition With Automatic Rectification
9 pages
Enhancing Text Spotting With A Language Model and Visual Context Information
No ratings yet
Enhancing Text Spotting With A Language Model and Visual Context Information
10 pages
SVTRv2 Plus
No ratings yet
SVTRv2 Plus
16 pages
Kittenplon Towards Weakly-Supervised Text Spotting Using A Multi-Task Transformer CVPR 2022 Paper
No ratings yet
Kittenplon Towards Weakly-Supervised Text Spotting Using A Multi-Task Transformer CVPR 2022 Paper
10 pages
Graph Convolutional Networks For Named Entity Recognition: Gcns NER
No ratings yet
Graph Convolutional Networks For Named Entity Recognition: Gcns NER
9 pages
DeepNorm Deep Learning Approach
No ratings yet
DeepNorm Deep Learning Approach
7 pages
Deep LSTM Networks For Online Chinese Handwriting Recognition
No ratings yet
Deep LSTM Networks For Online Chinese Handwriting Recognition
6 pages
Scene Text Recognition Based On Improved CRNN
No ratings yet
Scene Text Recognition Based On Improved CRNN
14 pages
Plagiarism Checker X Originality Report: Similarity Found: 26%
No ratings yet
Plagiarism Checker X Originality Report: Similarity Found: 26%
29 pages
Jaderberg 16
No ratings yet
Jaderberg 16
20 pages
CPPD
No ratings yet
CPPD
12 pages
Dtrocr: Decoder-Only Transformer For Optical Character Recognition
No ratings yet
Dtrocr: Decoder-Only Transformer For Optical Character Recognition
11 pages
NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition
No ratings yet
NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition
6 pages
Fujitake DTrOCR Decoder-Only Transformer For Optical Character Recognition WACV 2024 Paper
No ratings yet
Fujitake DTrOCR Decoder-Only Transformer For Optical Character Recognition WACV 2024 Paper
11 pages
Reading Scene Text in Deep Convolutional Sequences: Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, and Xiaoou Tang
No ratings yet
Reading Scene Text in Deep Convolutional Sequences: Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, and Xiaoou Tang
8 pages
Tang Few Could Be Better Than All Feature Sampling and Grouping CVPR 2022 Paper
No ratings yet
Tang Few Could Be Better Than All Feature Sampling and Grouping CVPR 2022 Paper
10 pages
High-Performance OCR For Printed English and Fraktur Using LSTM Networks
No ratings yet
High-Performance OCR For Printed English and Fraktur Using LSTM Networks
5 pages
Show, Attend and Read: A Simple and Strong Baseline For Irregular Text Recognition
No ratings yet
Show, Attend and Read: A Simple and Strong Baseline For Irregular Text Recognition
9 pages
Full Page Handwriting Recognition Via Image To Sequence Extraction
No ratings yet
Full Page Handwriting Recognition Via Image To Sequence Extraction
16 pages
APznzaYD23xZzgrNn UY T9fGgJbB0 Kfhgt21x0vaHH4qfIvCmiqGVPY37T19O
No ratings yet
APznzaYD23xZzgrNn UY T9fGgJbB0 Kfhgt21x0vaHH4qfIvCmiqGVPY37T19O
10 pages
Recurrent Convolutional Neural Networks For Text Classification
No ratings yet
Recurrent Convolutional Neural Networks For Text Classification
7 pages
2019 - Joeseytre - TextTubes-for-Detecting-Curved-Text-in-the-Wild
No ratings yet
2019 - Joeseytre - TextTubes-for-Detecting-Curved-Text-in-the-Wild
10 pages
What Value Do Explicit High Level Concepts Have in Vision To Language Problems?
No ratings yet
What Value Do Explicit High Level Concepts Have in Vision To Language Problems?
10 pages
1 s2.0 S0031320318304370 Main
No ratings yet
1 s2.0 S0031320318304370 Main
10 pages
Paper Summary Advancements and Challenges in Handwritten Text Recognition A Comprehensive Survey
No ratings yet
Paper Summary Advancements and Challenges in Handwritten Text Recognition A Comprehensive Survey
7 pages
Convolutional Character Networks
No ratings yet
Convolutional Character Networks
11 pages
Badal and Its Types
100% (2)
Badal and Its Types
8 pages
Accepted Manuscript: Speech Communication
No ratings yet
Accepted Manuscript: Speech Communication
16 pages
Nikitha 2020
No ratings yet
Nikitha 2020
5 pages
Research Paper
No ratings yet
Research Paper
6 pages
Sensory Integration
100% (1)
Sensory Integration
5 pages
1808 08946v2 PDF
No ratings yet
1808 08946v2 PDF
10 pages
Artificial Intelligence and Machine Learning Approaches To Text Recognition: A Research Overview
No ratings yet
Artificial Intelligence and Machine Learning Approaches To Text Recognition: A Research Overview
5 pages
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
No ratings yet
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
10 pages
Evaluating Sequence-to-Sequence Models For Handwritten Text Recognition
No ratings yet
Evaluating Sequence-to-Sequence Models For Handwritten Text Recognition
8 pages
Build A Handwritten Text Recognition System Using TensorFlow - by Harald Scheidl - Towards Data Science
No ratings yet
Build A Handwritten Text Recognition System Using TensorFlow - by Harald Scheidl - Towards Data Science
11 pages
Handwriting Recognition Using Deep Learning: Image Processing
No ratings yet
Handwriting Recognition Using Deep Learning: Image Processing
14 pages
Rec Pa Mi
No ratings yet
Rec Pa Mi
14 pages
Improving Irregular Text Recognition With Adaptive Feature Compression
No ratings yet
Improving Irregular Text Recognition With Adaptive Feature Compression
5 pages
SVTR: Scene Text Recognition With A Single Visual Model
No ratings yet
SVTR: Scene Text Recognition With A Single Visual Model
7 pages
32285-Article Text-36353-1-2-20250410
No ratings yet
32285-Article Text-36353-1-2-20250410
9 pages
Arad Ill As Jaramillo 2018
No ratings yet
Arad Ill As Jaramillo 2018
6 pages
Gargiulo Quantum Psychoanalysis
No ratings yet
Gargiulo Quantum Psychoanalysis
7 pages
Lesson Plan
No ratings yet
Lesson Plan
4 pages
Simon G. Powell - The Psilocybin Solution 1of13
No ratings yet
Simon G. Powell - The Psilocybin Solution 1of13
3 pages
Expository Writing Checklist
No ratings yet
Expository Writing Checklist
3 pages
Assignment 2
No ratings yet
Assignment 2
10 pages
Preparing For The Job Interview PDF
No ratings yet
Preparing For The Job Interview PDF
4 pages
An Nahw
No ratings yet
An Nahw
31 pages
RPT Peralihan
No ratings yet
RPT Peralihan
8 pages
Explication (To Make Clear The: Meaning of - To Explain)
No ratings yet
Explication (To Make Clear The: Meaning of - To Explain)
9 pages
Understanding Stress
No ratings yet
Understanding Stress
4 pages
Document 4
No ratings yet
Document 4
3 pages
612le 13 SC Identifypurpose SC
No ratings yet
612le 13 SC Identifypurpose SC
3 pages
Selection: Good Selection Decisions Make A Difference
No ratings yet
Selection: Good Selection Decisions Make A Difference
26 pages
Edpe Ass 1 ..trm2
No ratings yet
Edpe Ass 1 ..trm2
5 pages
Final of Deepa A K Report
No ratings yet
Final of Deepa A K Report
48 pages
16 PF Practical
No ratings yet
16 PF Practical
12 pages
Cureus 0016 00000054193
No ratings yet
Cureus 0016 00000054193
8 pages
DWO Unit 12 Quiz - Knowledge Management and Learning Organization
No ratings yet
DWO Unit 12 Quiz - Knowledge Management and Learning Organization
3 pages
Think Questions
No ratings yet
Think Questions
2 pages
Serquina, Jessaine - 2-BSEA - Exercise 21 PDF
No ratings yet
Serquina, Jessaine - 2-BSEA - Exercise 21 PDF
10 pages
How To Improve Academic Presentation
No ratings yet
How To Improve Academic Presentation
3 pages
Chicaiza Verónica M0x Autonomouswork1
No ratings yet
Chicaiza Verónica M0x Autonomouswork1
5 pages
How To Teach Speaking
No ratings yet
How To Teach Speaking
25 pages
UWB 10802 Japanese Language (I) SEMESTER 1, SESSION 2016/17: Grammatical Notes
No ratings yet
UWB 10802 Japanese Language (I) SEMESTER 1, SESSION 2016/17: Grammatical Notes
8 pages
Subject: Artificial Intelligence Term: Mid Question No. 1 (10 0.5)
No ratings yet
Subject: Artificial Intelligence Term: Mid Question No. 1 (10 0.5)
3 pages
BERT Foundations and Applications: Definitive Reference for Developers and Engineers
From Everand
BERT Foundations and Applications: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Principles of Multiple Spanning Tree Protocol: Definitive Reference for Developers and Engineers
From Everand
Principles of Multiple Spanning Tree Protocol: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Spanning Tree Protocol Essentials: Definitive Reference for Developers and Engineers
From Everand
Spanning Tree Protocol Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet

SVTRV 2

Uploaded by

SVTRV 2

Uploaded by

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Abstract SVTR CPPD PARSeq MARRec LISTER SVTRv2

Conv 3×3 & BN

For each character ci in a text image with character labels

MSR SVTRv2: WORLD STREET Coffee

Models [D0 , D1 , D2 ] [N1 , N2 , N3 ] Heads Permutation Algorithm 1: Inference Time

8. More detail of real-world datasets

11. Qualitative Analysis of Recognition Results

You might also like