


Text Prior Guided Scene Text Image Super-resolution

Jianqi Ma, Shi Guo, Lei Zhang


Dept. of Computing, The Hong Kong Polytechnic University
{csjma, cssguo, cslzhang}@comp.polyu.edu.hk
arXiv:2106.15368v2 [cs.CV] 30 Jun 2021

Abstract
Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-
resolution (LR) scene text images, and consequently boost the performance of text recognition. However,
most existing STISR methods regard text images as natural scene images, ignoring the categorical
information of the text. In this paper, we make an attempt to embed the categorical text prior into
STISR model training. Specifically, we adopt the character probability sequence as the text prior, which
can be obtained conveniently from a text recognition model. The text prior provides categorical guidance
to recover high-resolution (HR) text images. On the other hand, the reconstructed HR image can refine
the text prior in return. Finally, we present a multi-stage text prior guided super-resolution (TPGSR)
framework for STISR. Our experiments on the benchmark TextZoom dataset show that TPGSR can not
only effectively improve the visual quality of scene text images, but also significantly improve the text
recognition accuracy over existing STISR methods. Our model trained on TextZoom also demonstrates
certain generalization capability to the LR images in other datasets.

1. Introduction
Scene text image recognition aims to recognize the text characters from the input image, which is
an important computer vision task that involves text information processing. It has been widely used
in text retrieval [25], sign recognition [17], license plate recognition [35] and other scene-text-based
image understanding tasks [6, 34]. However, due to various issues such as low sensor resolution,
blurring, poor illumination, etc., the quality of captured scene text images may not be good enough,
which brings many difficulties to scene text recognition in practice. In particular, scene text recognition
from low-resolution (LR) images remains a challenging problem.
In recent years, single image super-resolution (SISR) techniques [15, 27, 29, 30, 31, 54, 53] have
achieved significant progress owing to the rapid development of deep neural networks [28, 43, 20].
Inspired by the success of SISR, researchers have started to investigate scene text image super-resolution
(STISR) to improve the quality of LR text images and hence improve the text recognition accuracy. Tran
et al. [44] adapted LapSRN [29] to STISR and significantly improved the text content details. To obtain
more realistic STISR results, Bı́lková et al. [5] and Wang et al. [47] employed GAN based networks with
CTC loss [18] and text perceptual losses. In these methods, the LR images are synthesized (e.g., by bicubic down-sampling) from high-resolution (HR) images for SR model learning, while the image degradation process in real-world LR images can be much more complex. Recently, Wang et al. [46] collected a real-world STISR dataset, namely TextZoom, where LR-HR image pairs captured by a zooming lens are provided. Wang et al. also proposed a TSRN model for STISR, achieving state-of-the-art performance [46].

*This work is supported by the Hong Kong RGC RIF grant (R5001-18).

Figure 1: Comparison of super-resolution results generated by TSRN [46] with and without (w/o) our text prior (TP). The right column shows the text recognition results: (a) BICUBIC: "NOX"; (b) w/o TP: "PUNONS"; (c) fixed TP: "PARIJHG"; (d) refined TP: "PARKING"; (e) HR: "PARKING".
The existing STISR methods [44, 47, 5], however, mostly treat scene text images as natural scene
images to perform super-resolution, ignoring the important categorical information brought by the text
content in the image. As shown in Fig. 1(b), the result by TSRN [46] is much better than the simple
bicubic model (Fig. 1(a)), but it is still hard to tell the characters therein. Based on the observation that
semantic information can help to recover the shape and texture of objects [48], in this paper we propose a
new STISR method, namely text prior guided super-resolution (TPGSR), by introducing the categorical
text prior information into the model learning process. Unlike the face segmentation prior used in [10]
and the semantic segmentation used in [48], the text character segmentation is hard to obtain since there
are few datasets containing annotations of fine character segmentation masks. We instead employ a text
recognition model (e.g., CRNN [41]) to extract the probability sequence of the text as the categorical
prior of the given LR scene text image, and embed it into the super-resolution network learning process to
guide the reconstruction of the HR image. As can be seen in Fig. 1(c), the text prior information can indeed greatly improve the STISR results, making the text characters much more readable. On the other hand, the
reconstructed HR text image can be used to refine the text prior, and consequently a multi-stage TPGSR
framework can be built for effective STISR. Fig. 1(d) shows the super-resolved text image by using the
refined text prior, where the text can be clearly and correctly recognized. The major contributions of our
work are as follows:
• We introduce the text recognition categorical probability sequence as the prior into the STISR
task, and validate its effectiveness to improve the visual quality of text content and recognition
accuracy of scene text images.
• We propose to refine the text recognition categorical prior without extra supervision from text labels, using only the real HR image: the prior is refined recurrently by the estimated HR image and by fine-tuning the text prior generator with our proposed TP loss. With such refinement, the
text prior and the super-resolved text image can be jointly enhanced under our TPGSR framework.
• By improving the image quality of LR text images, the proposed TPGSR improves the text recog-
nition performance on TextZoom for different text recognition models by a large margin and
demonstrates good generalization performance to other recognition datasets.

2. Related Works
Single Image Super Resolution (SISR). Aiming at estimating a high-resolution (HR) image from
its low-resolution (LR) counterpart, SISR is a highly ill-posed problem. In the past, handcrafted image priors were commonly used to regularize the SISR model and improve the image quality. In recent years, the training of deep neural networks (DNNs) has dominated the research on SISR. The pioneering work SRCNN [14] learns a three-layer convolutional neural network (CNN) for the SISR task. Later
on, many deeper CNN models have been proposed to improve the SISR performance, e.g., deep resid-
ual block [31], Laplacian pyramid structure [29], densely connected network [54] and channel attention
mechanism [53]. The PSNR and SSIM [49] losses are widely used in those works to train the SISR
model. In order to produce perceptually-realistic SISR results, SRGAN [30] employs a generative ad-
versarial network (GAN) to synthesize image details. SFT-GAN [48] and FSRNet [10] utilize the GAN loss and semantic segmentation to generate visually pleasing HR images.
Scene Text Image Super Resolution (STISR). Different from the general purpose SISR that works
on natural scene images, STISR focuses on text images, aiming to improve the readability of texts by
improving their visual quality. Intuitively, those methods for SISR can be directly adopted for STISR.
In [16], Dong et al. extended SRCNN [14] to text images, and obtained the best performance in the ICDAR
2015 competition [38]. PlugNet [36] employs a light-weight pluggable super-resolution unit to deal
with LR images in feature domain. TextSR [47] utilizes the text recognition loss and text perceptual loss
to generate the desired HR images for text recognition. To improve the performance of STISR on real-
world scene text images, Wang et al. [46] built a real-world STISR image dataset, namely TextZoom,
where the LR and HR text image pairs were cropped from real-world SISR datasets [52, 7]. They also
proposed a TSRN [46] method to use the central alignment module and sequential residual block to
exploit the semantic information in internal features. SCGAN [51] employs a multi-class GAN loss
as supervision to equip the model with the ability to generate more distinguishable face and text images. Further, by progressively adopting the high-frequency information derived from the text image, Quan et al. [40] proposed a multi-stage model for recovering blurry text images. Different from the above methods, we propose to employ the text recognition prior to guide the STISR model in recovering better-quality text images.
Scene Text Recognition. In the early stage of deep learning based scene text recognition, researchers
tended to solve the problem in a bottom-up manner [24, 21], i.e., recognizing individual characters and then grouping them into words. Some other approaches recognize the text in a top-down fashion [23], i.e., regarding the text
image as a whole and performing a word-level classification. Taking text recognition as an image-to-
sequence problem, CRNN [41] employs a CNN to extract image features and uses recurrent neural networks to model their semantic information. Trained with the CTC loss [18], the predicted sequence can be more accurately aligned with the target sequence [32]. Recently, attention-based methods have thrived owing to their improvements on text recognition benchmarks and their robustness to various shapes of text images [11, 12]. In our method, we adopt CRNN as the text prior generator to generate categorical text priors for STISR model training. It is shown that such text priors can significantly improve the perceptual quality of super-resolved text images and consequently boost the text recognition performance.

Figure 2: Visualization of the text prior (TP) of some text images (e.g., "obey", "tourist", "reproductive"). The top, middle and bottom rows present the text labels, the input text images and their TPs, respectively.

3. Methodology
In this section, we will first explain what the text prior (TP) is, and then introduce the text prior guided
super-resolution (TPGSR) network in detail, followed by the design of the loss function.
3.1. Text Prior
In this paper, the TP is defined as the deep categorical representation of a scene text image generated
by some text recognition models. The TP is then used as guidance information to encourage our TPGSR
network to produce high quality scene text images, which are favorable to both visual perception and
scene text recognition.
Specifically, we choose the classic text recognition model CRNN [41] to be the TP Generator. CRNN
uses several convolution layers to extract the text features and five max pooling layers to down-sample
the features into a feature sequence. The TP is then defined as the categorical probability prediction by
CRNN, which is a sequence of |A|-dimensional probability vectors where |A| denotes the number of
characters learned by CRNN. Fig. 2 visualizes the TP of some scene text images, where the horizontal
axis represents the sequence in left-to-right order and the vertical axis represents the categories in reverse
alphabet order (e.g., 'Z' to 'A'). In the visualization, the lighter a spot is, the higher the probability of that category. By using the TP as guidance, our TPGSR model can recover visually more pleasing HR images with higher text recognition accuracy, as illustrated in Fig. 1.
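To make the definition concrete, the following is a minimal sketch (not the authors' released code) of how such a TP can be extracted in PyTorch. It assumes a CRNN-style recognizer `crnn` that outputs CTC-style logits of shape (L, B, |A|); the input size of 32×100 and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch: extracting the text prior (TP) from a CRNN-style recognizer.
# `crnn` is assumed to map an image tensor to logits of shape (L, B, |A|),
# i.e., sequence length L, batch size B, alphabet size |A| (37: 36 chars + blank).

def extract_text_prior(crnn, lr_image):
    """Return the TP as a (B, L, |A|) sequence of categorical probabilities."""
    # Resize the LR input to the recognizer's expected input size (assumed 32x100),
    # analogous to the bicubic resizing described in Section 3.2.
    x = F.interpolate(lr_image, size=(32, 100), mode='bicubic', align_corners=False)
    logits = crnn(x)                    # (L, B, |A|) per-position class scores
    probs = logits.softmax(dim=-1)      # per-position categorical probabilities
    return probs.permute(1, 0, 2)       # (B, L, |A|)
```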
Figure 3: Our proposed TPGSR framework, which consists of a Text Prior Generation Branch and
a Super-resolution (SR) Branch. Accordingly, TP loss and SR loss are employed to train the whole
network.

3.2. The Architecture of TPGSR


The main architecture of our TPGSR, which introduces the TP into the STISR process, is illustrated in Fig. 3. Our TPGSR network has two branches: a TP generation branch and a super-resolution (SR) branch.
First, the TP branch intakes the LR image to generate the TP feature. Then, the SR branch intakes the
LR image and the TP feature to estimate the HR image. In the following, we introduce the two branches
in detail.
TP generation branch. This branch uses the input LR image to generate the TP feature and passes the
TP feature to the SR branch as guidance for more accurate STISR results. The key component of this
branch is the TP Module, which consists of a TP Generator and a TP Transformer. As mentioned in
Section 3.1, the TP generated by TP Generator is a probability sequence, whose size may not match the
image feature map in the SR branch. To solve this problem, we employ a TP Transformer to transform
the TP sequence into a feature map.
Specifically, the input LR image is first resized by bicubic interpolation to match the input size of the TP Generator, and then passed to the TP Generator to generate a TP matrix whose width is L. Each position of the
TP is a vector of size |A|, which is the number of categories of alphabet A adopted in CRNN. To align
the size of TP feature with the size of image feature, we pass the TP feature to the TP Transformer. The
TP Transformer consists of 4 Deconv blocks, each of which consists of a deconvolution layer, a BN layer
and a ReLU layer. For an input TP matrix with width L and height |A|, the output of TP Transformer
will be a feature map with recovered spatial dimension and channel 32 after three deconvolution layers
with stride (2, 2) and one deconvolution layer with stride (2, 1). The kernel size of all deconvolution layers is 3 × 3.
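Below is a hedged PyTorch sketch of one plausible reading of the TP Transformer described above: the TP sequence is treated as an |A|-channel map of size 1×L and passed through four deconvolution blocks (three with stride (2, 2) and one with stride (2, 1), each followed by BN and ReLU) to produce a 32-channel feature map. The intermediate channel width of 64 and the exact tensor layout are our assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class TPTransformer(nn.Module):
    """Sketch of the TP Transformer: four deconvolution blocks (Deconv + BN + ReLU)
    that turn the (L x |A|) probability sequence into a 32-channel feature map.
    Treating the TP as an |A|-channel map of spatial size 1 x L is our assumption."""

    def __init__(self, num_classes=37, out_channels=32):
        super().__init__()

        def deconv(cin, cout, stride):
            out_pad = tuple(s - 1 for s in stride)  # keeps output size = input * stride
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=3, stride=stride,
                                   padding=1, output_padding=out_pad),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        # Three blocks with stride (2, 2) and one with stride (2, 1), as in the paper;
        # the intermediate width of 64 channels is an assumption.
        self.blocks = nn.Sequential(
            deconv(num_classes, 64, (2, 2)),
            deconv(64, 64, (2, 2)),
            deconv(64, 64, (2, 2)),
            deconv(64, out_channels, (2, 1)),
        )

    def forward(self, tp):                     # tp: (B, L, |A|) probability sequence
        x = tp.permute(0, 2, 1).unsqueeze(2)   # -> (B, |A|, 1, L)
        return self.blocks(x)                  # -> (B, 32, 16, 8L)
```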
SR branch. The SR branch aims to reproduce an HR text image from the input LR image and the TP guidance feature. It mainly contains an SR Module. Please note that most of the SR blocks in existing SISR and STISR methods [27, 29, 30, 31, 54, 44, 47, 46] can be adopted as our SR Module, coupled with our TP guidance features. Considering that these SR blocks, such as the residual block in SRResNet [30] and the SRB in TSRN [46], only take the image features as input, we need to modify them in order to embed the TP features. We call the modified SR blocks TP-Guided SR Blocks.
The difference between previous SR Blocks and our TP-Guided SR Block is illustrated in Fig. 4.

Figure 4: Comparison of our TP-Guided SR Block and a common SR Block. In each block, the channel numbers of the image features and TP features are C and 32, respectively.

To
embed the TP features into the SR Block, we concatenate them to the image features along the channel
dimension. Before the concatenation, we align the spatial size of TP features to that of the image features
by bicubic interpolation. Suppose that the channel number of image features is C, then the concatenated
features of C + 32 channels will go through a projection layer to reduce the channel number back to C.
We simply use a 1 × 1 kernel convolution to perform this projection. The output of projection layer is
fused with the input image feature by addition. With several such TP-Guided SR Blocks, the SR branch
will output the estimated HR image, as in those previous super-resolution models.
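The following sketch illustrates this fusion scheme under the description above: the 32-channel TP feature is bicubically resized, concatenated with the C-channel image feature, projected back to C channels by a 1×1 convolution, and fused with the input image feature by addition before the block layers. `base_block` stands for any existing SR block (e.g., an SRB from TSRN); its exact placement after the fusion is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TPGuidedSRBlock(nn.Module):
    """Sketch of a TP-Guided SR Block wrapping an existing SR block (`base_block`)."""

    def __init__(self, base_block, img_channels, tp_channels=32):
        super().__init__()
        self.base_block = base_block
        # 1x1 projection that maps (C + 32) channels back to C channels.
        self.project = nn.Conv2d(img_channels + tp_channels, img_channels, kernel_size=1)

    def forward(self, img_feat, tp_feat):
        # Align the TP feature's spatial size with the image feature (bicubic resize).
        tp_feat = F.interpolate(tp_feat, size=img_feat.shape[-2:],
                                mode='bicubic', align_corners=False)
        fused = self.project(torch.cat([img_feat, tp_feat], dim=1))  # back to C channels
        fused = fused + img_feat       # fuse with the input image feature by addition
        return self.base_block(fused)  # assumed: the original SR block layers follow
```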
3.3. Multi-stage Refinement
With the TPGSR framework described in Section 3.2, we can super-resolve an LR image to a better
quality HR image with the help of TP features extracted from the LR input. One intuitive question
is, if we extract the TP features from the super-resolved HR image, can we use those better quality
TP features to further improve the super-resolution results? Actually, multi-stage refinement has been
widely adopted in many computer vision tasks such as object detection [8] and instance segmentation [9]
to improve the prediction quality progressively. Therefore, we extend our one-stage TPGSR model to a
multi-stage learning framework by passing the estimated HR text image of one stage to the TP Generator of the next stage. The multi-stage TPGSR framework is illustrated in Fig. 5. In the first stage, the TP Module accepts the bicubically interpolated LR image as input, while in the following stages, the TP Module accepts the HR image output by the SR Module of the previous stage as input for refinement. As we
will show in the ablation study in Section 4.3, both the quality of estimated HR text image and the text
recognition accuracy can be progressively improved by this multi-stage refinement.
Figure 5: Illustration of multi-stage TPGSR. The super-resolution output of one stage will be the text
image input of next stage.
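A minimal sketch of this multi-stage loop is given below. It assumes a ×2 upsampling factor as in TextZoom, per-stage TP modules (`tp_modules`), a shared SR module that takes the LR image and the TP feature, and the gradient-cutting strategy mentioned in Section 4.1; the module names and signatures are illustrative.

```python
import torch.nn.functional as F

def tpgsr_forward(lr_image, tp_modules, sr_module, num_stages=3):
    """Sketch of the multi-stage TPGSR loop: stage 1 extracts the TP from the
    bicubically upsampled LR input, and every later stage re-extracts the TP from
    the previous stage's SR output. The SR module is shared across stages, while
    each stage has its own TP module, matching the sharing strategy in Sec. 4.1."""
    # Stage-1 TP input: bicubic upsampling of the LR image (x2 assumed, as in TextZoom).
    text_input = F.interpolate(lr_image, scale_factor=2, mode='bicubic',
                               align_corners=False)
    sr_outputs = []
    for stage in range(num_stages):
        tp_feature = tp_modules[stage](text_input)   # TP Generator + TP Transformer
        sr_image = sr_module(lr_image, tp_feature)   # TP-guided super-resolution
        sr_outputs.append(sr_image)
        text_input = sr_image.detach()               # cut gradients across stages
    return sr_outputs                                # per-stage SR estimates for the loss
```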

3.4. Training Loss


There are two types of loss functions in our TPGSR framework: one for the SR branch and one for the TP generation branch. For the SR branch, the loss is similar to that in many previous SISR methods [29, 31, 54]. Denoting by $\hat{I}_H$ the estimated HR image from the LR input and by $I_H$ the ground-truth HR image, the loss for the SR branch, denoted by $L_S$, is defined as the L1 norm distance between $\hat{I}_H$ and $I_H$, i.e., $L_S = \|\hat{I}_H - I_H\|_1$.
Different from the many SISR works [29, 31, 54] as well as the STISR works [47, 46], in TPGSR we
have loss functions specifically designed for the TP generation branch, which is crucial to improve the
text image quality and text recognition. The TP sequence generated by the TP Generator has significant
impact on the final SR results. If the TP is correct (i.e., identical to the TP of the ground-truth image), it will have a positive impact on the estimated HR image, as shown in Fig. 1. Otherwise, if the categorical information in the extracted TP is incorrect, the SR result can be much damaged (please refer to Section 4.5 for such failure cases). Therefore, we use the TP extracted from the ground-truth
HR image to supervise the learning of our TPGSR network. Denoting by $t_L$ and $t_H$ the TP extracted from the LR image $I_L$ and the ground-truth HR image $I_H$, respectively, we use the L1 norm distance $\|t_H - t_L\|_1$ and the KL divergence $D_{KL}(t_L \| t_H)$ to measure the similarity between $t_L$ and $t_H$. With the text priors $t_L, t_H \in \mathbb{R}^{L \times |A|}$ of the pair of LR and HR images, $D_{KL}(t_L \| t_H)$ is calculated as follows:

$$D_{KL}(t_L \| t_H) = \sum_{i=1}^{L} \sum_{j=1}^{|A|} t_H^{ij} \ln \frac{t_H^{ij} + \epsilon}{t_L^{ij} + \epsilon}, \qquad (1)$$

where $t_L^{ij}$ and $t_H^{ij}$ denote the element at the $i$-th position and $j$-th dimension of $t_L$ and $t_H$, respectively, and $\epsilon$ is a small positive number to avoid numeric errors in the division and logarithm. Together with $L_S$, the overall loss function for a single-stage TPGSR is as follows:

$$L = L_S + \alpha \|t_H - t_L\|_1 + \beta D_{KL}(t_L \| t_H), \qquad (2)$$

where $\alpha$ and $\beta$ are the balancing parameters.


For the multi-stage TPGSR learning, the loss of each stage, denoted by $L_i$, can be similarly defined as in Eq. 2. Suppose there are $N$ stages in total; the overall loss is defined as follows:

$$L_{mt} = \sum_{i=1}^{N} \lambda_i L_i, \qquad (3)$$

where $\lambda_i$ balances the loss of each stage and $\sum_{i=1}^{N} \lambda_i = 1$.
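A hedged sketch of these loss terms is given below; the reductions (mean over batch and sequence positions) and the exact tensor shapes are our assumptions.

```python
import torch
import torch.nn.functional as F

def tp_loss(tp_lr, tp_hr, eps=1e-6):
    """Sketch of the TP terms in Eqs. 1-2: L1 distance plus the epsilon-smoothed KL
    term between the TP of the LR/SR input and the TP of the ground-truth HR image.
    tp_lr, tp_hr: (B, L, |A|) probability sequences."""
    l1 = (tp_hr - tp_lr).abs().mean()
    kl = (tp_hr * torch.log((tp_hr + eps) / (tp_lr + eps))).sum(dim=-1).mean()
    return l1, kl

def single_stage_loss(sr_image, hr_image, tp_lr, tp_hr, alpha=1.0, beta=1.0):
    """Overall single-stage loss of Eq. 2: L1 SR loss plus the weighted TP terms."""
    sr_l1 = F.l1_loss(sr_image, hr_image)
    tp_l1, tp_kl = tp_loss(tp_lr, tp_hr)
    return sr_l1 + alpha * tp_l1 + beta * tp_kl

def multi_stage_loss(stage_losses, lambdas=(0.25, 0.25, 0.5)):
    """Eq. 3: weighted sum of per-stage losses, with weights summing to 1."""
    return sum(w * l for w, l in zip(lambdas, stage_losses))
```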

4. Experiments
4.1. Implementation Details
We implement our TPGSR method in PyTorch. Adam is selected as our optimizer with momentum 0.9. The batch size is set to 48 and the model is trained for 500 epochs on one NVIDIA RTX 2080Ti
GPU. The TP Generator is selected as CRNN [41] pre-trained on SynthText [19] and MJSynth [22]. In
Eq. 2, the weights $\alpha$ and $\beta$ are both simply set to 1, while the $\epsilon$ in Eq. 1 is set to $10^{-6}$. The alphabet set $A$ mainly includes the case-insensitive alphanumeric characters (0 to 9 and 'a' to 'z'). Together with a blank label, $|A|$ (i.e., the size of $A$) is 37. To deal with out-of-category cases, we assign the blank label to all out-of-category characters, and the reconstruction of these characters mainly depends on the SR Module.
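A small sketch of this label-encoding rule is shown below; the blank index of 0 and the exact mapping are assumptions made for illustration.

```python
# Sketch of mapping characters to the 37-category alphabet used by the TP Generator:
# digits and lower-case letters, plus a blank label. Any out-of-category character
# (punctuation, other scripts, etc.) is mapped to the blank label.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
BLANK = 0                                        # assumed index of the blank label
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(ALPHABET)}

def encode_label(text):
    """Return the category index sequence for a text label (case-insensitive)."""
    return [CHAR_TO_ID.get(c, BLANK) for c in text.lower()]
```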
For multi-stage TPGSR training, we adopt a well-trained single-stage model to initialize all stages and cut the gradients across stages to speed up convergence. The TP Generators are not shared across stages, while the SR Module is shared. As in previous multi-stage learning methods [50], a higher weight is assigned to the loss of the last stage, and smaller weights are assigned to the other stages. In particular, we use a 3-stage TPGSR, and the parameters $\lambda_i$ in Eq. 3 are set as $\lambda_1 = 1/4$, $\lambda_2 = 1/4$ and $\lambda_3 = 1/2$.
4.2. Datasets and Experiment Settings
Datasets. The TextZoom [46], ICDAR2015 [26] and SVT [45] datasets are used to validate the effective-
ness of our proposed TPGSR method. TextZoom consists of 21,740 LR-HR text image pairs collected by camera lens zooming in real-world scenarios. The training set has 17,367 pairs, while the test set is divided into three subsets based on the camera focal length, namely easy (1,619 samples), medium (1,411 samples) and hard (1,343 samples). The dataset also provides the text label for each pair.
ICDAR2015 is a well-known scene text recognition dataset, which contains 2,077 cropped text images from street view photos for testing. SVT is also a scene text recognition dataset, which contains 647 testing text images. Each image comes with a 50-word lexicon.
(a) Tuning the TP Generator:
Approach      Tuned   ACC
TSRN [46]     -       41.4%
TPGSR-TSRN    ×       44.5%
TPGSR-TSRN    ✓       49.8%
HR            -       72.3%

(b) Ablation on different stage settings:
N   E       M       H       ACC
1   61.0%   49.9%   36.7%   49.8%
2   62.2%   51.3%   37.4%   50.9%
3   63.1%   52.0%   38.6%   51.8%
4   63.7%   53.3%   39.4%   52.6%
5   64.3%   54.2%   39.2%   53.1%

(c) Ablation on different sharing strategies:
stage (N)   TPG   SR   ACC
1           ✓     ×    49.8%
2           ×     ✓    50.9%
3           ×     ✓    51.8%
2           ×     ×    50.2%
3           ×     ×    51.5%
3           ✓     ✓    49.2%
3           ✓     ×    49.2%

(d) Ablation on the impact of SR:
        ACC               ACCT
N       ILR      ISR      ILR      ISR
1       26.8%    49.8%    45.3%    50.5%
2       26.8%    50.9%    43.6%    51.6%
3       26.8%    51.8%    43.1%    52.9%

Table 1: Ablation studies on TPGSR. 'Tuned' means whether the TP Generator is fine-tuned or not. ACC means the average recognition accuracy, while ACCT means the average recognition accuracy with the tuned TP Generator. N means the stage number. 'E', 'M' and 'H' in (b) denote the accuracies on the 'Easy', 'Medium' and 'Hard' splits of TextZoom. '✓' in (c) means that the component shares weights across all stages, while '×' means that the component is trained separately per stage. 'TPG' and 'SR' in (c) mean the TP Generator and the SR Module in TPGSR. ILR and ISR in (d) refer to using the LR images and the generated SR images, respectively, as input to the recognizer.

Experiment settings. Since there are real-world LR-HR image pairs in the TextZoom dataset, we
first use it to train and evaluate the proposed TPGSR model. We then apply the trained model to IC-
DAR2015/SVT to test its generalization performance to other datasets. Considering the fact that most
of the images in ICDAR2015 and SVT have good resolution and quality, while the TextZoom train-
ing data focus on LR images, we perform the generalization test only on the low-quality images in ICDAR2015/SVT, i.e., those whose height is less than 16 pixels or whose recognition score is less than 0.9.
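This selection rule can be summarized by the small sketch below; how the recognition confidence score is obtained (e.g., which recognizer produces it) is not specified here and is left as an assumption.

```python
def is_low_quality(image_height, recognition_score):
    """Sketch of the selection rule in Sec. 4.2 for the ICDAR2015/SVT generalization
    test: an image is kept if its height is below 16 pixels or the recognizer's
    confidence score is below 0.9 (both thresholds come from the paper)."""
    return image_height < 16 or recognition_score < 0.9
```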
4.3. Ablation Studies
To better understand the proposed TPGSR model, in this section we conduct a series of ablation experiments on the selection of parameters in the loss function, the number of stages, and whether the TP Generator should be fine-tuned during training. We also perform experiments to validate the effectiveness
of SR Module in our TPGSR framework. We adopt TSRN [46] as the SR Module in the experiments,
and name our model as TPGSR-TSRN. All ablation experiments are performed on TextZoom and the
recognition accuracies are evaluated with CRNN [41].
Impact of tuning the TP Generator. The loss terms for the TP branch aim to fine-tune the TP Gen-
erator. To prove the significance of TP Generator tuning, we conduct experiments by fixing and tuning
the TP Generator in a one-stage TPGSR model. The text recognition accuracies are shown in Table 1a.
By fixing the TP Generator, we can enhance the SR image recognition accuracy by 3.1% compared to the TSRN baseline [46]. By tuning the TP Generator during the training process, the recognition accuracy can be further improved from 44.5% to 49.8%, a gain of 5.3%. This clearly demonstrates the benefit of tuning the TP Generator for the SR text recognition task.

Approach              ASTER [42, 1]                      MORAN [33, 3]                      CRNN [41, 2]
                      easy    medium  hard    average    easy    medium  hard    average    easy    medium  hard    average
BICUBIC               64.7%   42.4%   31.2%   47.2%      60.6%   37.9%   30.8%   44.1%      36.4%   21.1%   21.1%   26.8%
SRCNN [13, 46]        69.4%   43.4%   33.0%   49.5%      63.2%   39.0%   30.2%   45.3%      38.7%   21.6%   20.9%   27.7%
TPGSR-SRCNN           72.9%   50.7%   34.7%   53.8%      67.7%   49.7%   32.8%   50.9%      47.0%   30.6%   24.7%   34.7%
SRResNet [30, 46]     69.6%   47.6%   34.3%   51.3%      60.7%   42.9%   32.6%   46.3%      39.7%   27.6%   22.7%   30.6%
TPGSR-SRResNet        76.0%   58.8%   40.1%   59.1%      72.3%   54.9%   38.4%   56.0%      54.6%   41.2%   32.3%   43.3%
RDN [54, 46]          70.0%   47.0%   34.0%   51.5%      61.7%   42.0%   31.6%   46.1%      41.6%   24.4%   23.5%   30.5%
TPGSR-RDN             72.6%   54.2%   37.2%   55.5%      67.8%   51.7%   36.0%   52.6%      53.0%   38.0%   27.7%   40.2%
TSRN [46]             75.1%   56.3%   40.1%   58.3%      70.1%   53.3%   37.9%   54.8%      52.5%   38.2%   31.4%   41.4%
TPGSR-TSRN            78.9%   62.7%   44.5%   62.8%      74.9%   60.5%   44.1%   60.5%      63.1%   52.0%   38.6%   51.8%
HR                    94.2%   87.7%   76.2%   86.4%      91.2%   85.3%   74.2%   83.9%      76.4%   75.1%   64.6%   72.2%

Table 2: SR text image recognition performance of competing STISR models on TextZoom. The recognition accuracies are evaluated with the officially released models of ASTER [42], MORAN [33] and CRNN [41].
Impact of multiple stages in TPGSR. In addition to refining the TP Generator, recurrently inputting
the estimated HR image into the TPGSR can also enhance the quality of TP since the SR Module can
improve the estimated HR text image in each recurrence. To find out how much improvement the multi-stage refinement can bring, we set the stage number N = 1, 2, . . . , 5 and report the text recognition accuracy in Table 1b. We can see that the recognition accuracy increases with N; however, the margin of improvement decreases with N. When N = 5, the accuracy on the 'Hard' split begins to fall. Considering
the balance between the model size and the performance gain, we set N to 3 in our following experiments.
Parameter sharing strategy. To determine the best sharing strategies, we conduct experiments to test
on both the TP Module and the SR Module. As shown in Table 1c, we find that under different settings
of stage number, the setting of non-shared TP Module shows significant performance improvement.
However, when we use non-shared SR Module, little performance improvement in SR image recognition
is achieved. Thus we use the settings of shared SR Module and non-shared TP Module in our multi-stage
model.
The effectiveness of SR in TPGSR. Since one of the goals of STISR is to improve the text recognition
performance by HR image recovery, it is necessary to check if the estimated SR images truly help the
final text recognition task. To this end, we evaluate the TPGSR models with both fixed and tuned TP
Generator by using LR and SR images as inputs. For the multi-stage version, we test all the TP Generators and pick the best LR and SR results among them. Note that a model with a tuned TP Generator and the LR image as input is similar to directly fine-tuning the text recognition model on the LR images. The results are shown in Table 1d. It can be seen that by tuning the TP Generator on the LR images, the text recognition accuracy can be increased. However, the recognition accuracy can be improved more by using the SR text images. For example, at stage one, the recognition accuracy on LR images with the tuned TP Generator is 45.3%, while the accuracy on SR images even without fine-tuning the TP Generator is 49.8%. If the tuned TP Generator is used to generate the SR text image, the text recognition perfor-
mance can be further improved compared to the fixed TP Generator. The experiments and comparisons
demonstrate the effectiveness of our SR Module in improving the final SR text recognition.
Dataset                 ICDAR2015 (563 images)                  SVT (104 images)
Approach                SEED [39]   ASTER [42]   CRNN [41]      SEED [39]   ASTER [42]   CRNN [41]
Origin                  54.0%       50.8%        21.5%          60.2%       50.8%        19.2%
TSRN [46]               52.6%       48.3%        24.5%          54.3%       48.3%        23.1%
TPGSR-TSRN              56.1%       52.0%        27.1%          61.1%       52.0%        29.8%

Table 3: Text recognition accuracy on the low-quality images in ICDAR2015/SVT datasets by the TSRN
and TPGSR-TSRN models trained on the TextZoom dataset.

4.4. Comparison with State-of-the-Arts


As described in Section 3.2 and illustrated in Fig. 4, the SR block of most existing representative
SISR and STISR models can be adopted in the SR Module of our TPGSR framework, resulting in a
new TPGSR model. To verify the superiority of our TPGSR framework, we select several popular SISR
models, including SRCNN [13], SRResNet [30], RDN [54], and the latest state-of-the-art STISR model
TSRN [46], and embed their SR blocks into our TPGSR framework. The corresponding STISR models
are called TPGSR-SRCNN, TPGSR-SRResNet, TPGSR-RDN, and TPGSR-TSRN, respectively. The
TextZoom [46], ICDAR2015 [26] and SVT [45] datasets are used to compare these models as well as
their prototypes. For a fair comparison, all the models are trained on the TextZoom dataset with the same settings.
Results on TextZoom. The experimental results on TextZoom are shown in Table 2. Here we present
the text recognition accuracies on STISR results by using the official ASTER [42], MORAN [33] and
CRNN [41] text recognition models. In Fig. 6, we visualize the SR images by the competing models
with the ground-truth text labels. From Table 2 and Fig. 6, we can have the following findings.
First, from Table 2 we see that our TPGSR framework significantly improves the text recognition
accuracy of all original SISR/STISR methods under all settings. This clearly validates the effectiveness
of TP in guiding text image enhancement for recognition. Second, from Fig. 6 we can see that with
TPGSR, all SR models show clear improvement in text image recovery, with more readable character strokes and hence correct text recognition. This also explains why our TPGSR can significantly improve the text recognition accuracy, as shown in Table 2.
Generalization to other datasets. As mentioned in Section 4.2, to verify the generalization perfor-
mance of our model trained on TextZoom to other datasets, we apply it to the low-quality images (height ≤ 16 or recognition score ≤ 0.9) in ICDAR2015 and SVT. Overall, 563 low-quality images were selected from the 2,077 testing images in ICDAR2015, and 104 images were selected from the 647 testing images in SVT. The STISR and text image recognition experiments are then performed on these 667 low-quality images. Since TSRN [46] is specially designed for text image SR and performs much better than other SISR models, we only employ TSRN and TPGSR-TSRN in this experiment. The ASTER and CRNN text recognizers, as well as the stronger baseline SEED [39], are used.
The results are shown in Table 3, from which we can draw the following findings. First, compared with the text recognition results on the original images without SR, TSRN improves the performance when CRNN is used as the text recognizer, but decreases the performance when ASTER or SEED is used as the recognizer. This implies that TSRN does not have stable cross-dataset generalization capability. Second, TPGSR-TSRN consistently improves the performance over the original images with all three recognizers. This demonstrates that it has good generalization performance in cross-dataset tests. Third, TPGSR-TSRN consistently outperforms TSRN under all settings.
Figure 6: Visual comparison of competing STISR models on TextZoom (sample words: "regular", "mosque", "you", "tourist", "memorable", "vehicles"; rows from top to bottom: BICUBIC, SRCNN, TPGSR-SRCNN, SRResNet, TPGSR-SRResNet, RDN, TPGSR-RDN, TSRN, TPGSR-TSRN, HR). The word on the bottom-right corner of each image is the text recognition result, with correct characters or words in green and wrong ones in red.
Approach              Accuracy of ASTER [42]          PSNR (dB)                 SSIM
                      easy    medium  hard            easy    medium  hard      easy     medium   hard
BICUBIC               64.7%   42.4%   31.2%           22.35   18.98   19.39     0.7884   0.6254   0.6592
SRCNN [13, 46]        69.4%   43.4%   33.0%           23.48   19.06   19.34     0.8379   0.6323   0.6791
TPGSR-SRCNN           72.9%   50.7%   34.7%           22.82   19.01   19.35     0.8232   0.6372   0.6798
SRResNet [30, 46]     69.6%   47.6%   34.3%           23.48   19.06   19.34     0.8681   0.6406   0.6911
TPGSR-SRResNet        76.0%   58.8%   40.1%           22.47   19.09   19.59     0.8648   0.6359   0.7006
RDN [54, 46]          70.0%   47.0%   34.0%           22.27   18.95   19.70     0.8249   0.6427   0.7113
TPGSR-RDN             72.6%   54.2%   37.2%           23.36   18.90   19.77     0.8512   0.6524   0.7155
TSRN [46]             75.1%   56.3%   40.1%           25.07   18.86   19.71     0.8897   0.6676   0.7302
TPGSR-TSRN            78.9%   62.7%   44.5%           24.35   18.73   19.93     0.8860   0.6763   0.7487

Table 4: Recognition accuracy, PSNR (dB) and SSIM results of the competing STISR models on TextZoom. All experiments are conducted under the same settings.



4.5. Discussions
Cost vs. performance. To further examine the value of our TPGSR, we compare the computational cost of our single-stage TPGSR with that of TSRN [46]. The results in Table 5 show that simply increasing the number of SRB blocks is not an effective way to gain performance (a 1.3% accuracy drop). However, under our TPGSR design, the performance improves by 8.4% over TSRN with 5 SRBs. It is fair to conclude that generating the text prior under our TPGSR framework is well worth the additional cost it introduces.

Approach              TPG Backbone   FLOPs   ACC
TSRN [46] w/ 5 SRBs   -              0.91G   41.4%
TSRN [46] w/ 7 SRBs   -              1.16G   40.1%
TPGSR-TSRN (N = 1)    VGG [41]       1.72G   49.8%

Table 5: Cost vs. performance. ACC means the average recognition accuracy. N refers to the stage number of TPGSR.

Approach       PSNR    SSIM     Recognition
BICUBIC        23.53   0.8010   bning
TPGSR-TSRN     22.12   0.8185   bring
HR             -       -        -
BICUBIC        24.15   0.7814   purypround
TPGSR-TSRN     23.85   0.8075   playground
HR             -       -        -

Table 6: PSNR, SSIM and recognition results of estimated HR images and real HR images (the image crops shown in the original 'Visualization' column are omitted here).
The PSNR/SSIM indices for STISR. In terms of STISR, better results of some objective metrics, e.g.,
PSNR and SSIM, do not always guarantee more accurate scene text estimation, and vice versa. A similar conclusion is drawn in [46], i.e., PSNR and SSIM are not stable metrics for STISR. For better interpretation of this point, we adopt some popular SR models, including SRCNN [13], SRResNet [30], RDN [54] and TSRN [46], and compare the results of these models before and after they are integrated into our TPGSR framework for joint optimization. The results are reported in Table 4. We can see that the accuracy of all models is boosted after they are embedded into our TPGSR framework. In terms of the objective metrics, our best SR model under TPGSR consistently outperforms the competing methods for samples at the "hard" difficulty level, since the TP strengthens the power of the SR models in the challenging cases. In contrast, for samples at the "easy" and "medium" difficulty levels, joint optimization using our TPGSR does not always improve the PSNR/SSIM results. This is because SR models trained without TP may suffer from over-fitting in the easier cases. We could alleviate this issue by introducing the TP loss between the LR images and the predicted HR images into our objective function as a regularization term for joint optimization. In this way, as shown in Table 6, we can obtain better perceptual quality and more accurate scene text image recognition results, which we believe are more valuable in real-world applications than a subtle rise in metrics such as PSNR and SSIM.
Figure 7: Examples of out-of-category text image SR in different languages: (a) Korean, (b) Chinese, (c) Bangla. From top to bottom: the LR image and the super-resolved HR images by TSRN [46] and our TPGSR-TSRN.

Figure 8: Examples of failure cases. (a) False TP guidance (TP: "sopervisar", SR: "sopervisar", GT: "supervisor"). (b) Multi-oriented text (TP: "university", SR: "university", GT: "university"). (c) Extremely long text (TP: "collubmines", SR: "rewlitanieaat", GT: "www.hairbywilson.com").

Out-of-category analysis. As mentioned in Section 4.1, in our implementation we assign the blank label to out-of-category characters. For such characters, the STISR results will mainly depend on the
SR Module in our TPGSR network. To test the SR performance of our TPGSR model on images with
out-of-category characters, we applied it to some text images in Korean, Chinese and Bangla picked
from the ICDAR-MLT [37] dataset. The results are shown in Fig. 7. We see that the reconstructed HR
text images by our model show clearer appearance and contour than their LR counterparts. Compared
with TSRN, TPGSR-TSRN demonstrates slightly better perceptual quality. The reason may be that the categorical text prior serves as a regularization term in training the SR Module to avoid over-fitting. Hence, the SR Module can still produce well-recovered scene text images with null guidance at the inference stage.
Failure case. Though TPGSR can improve the visual quality of SR text images and boost the perfor-
mance of text recognition, it still has some limitations, as shown in Fig. 8. First, the TP Generators tuned with LR samples are robust in most cases, but in some cases they may produce a false prior for the input LR image; the output SR image may then show incorrect character strokes, causing false text recognition. Fig. 8(a) illustrates such a failure case. Second, if the text instance is rotated to a certain degree, the benefit brought by TPGSR will be weakened. Fig. 8(b) shows such an example. Though the recognition result is correct, the improvement in image quality is not significant. Third, for cases with an extremely long text instance, as shown in Fig. 8(c), the TP Generator may fail with significantly compressed outputs. In such cases, the final output of TPGSR will suffer from text distortion with wrong
text recognition.
Figure 9: Application to Chinese text recovery. From top to bottom: Bicubic, TSRN, Ours (TPGSR-TSRN) and HR.

To address the above issues, in the future we could consider adopting more powerful TP Generators to provide more robust guidance, and designing new TP guidance strategies for recovering multi-oriented and curved scene text. In addition, the failures on long text can be alleviated by increasing the input width of the TP Generator.
Recovering logographic scripts (e.g., Chinese). In this work, we focused on the real-world SR of English text images, for which there is a well-prepared benchmark dataset, TextZoom. It is interesting to know whether our proposed method can be adopted for logographic scripts such as Chinese. Here we perform some preliminary experiments to validate the feasibility. We train a multilingual recognition model using CRNN on the ICPR2018-MTWI Chinese and English dataset [4] as our TP Generator. The overall alphabet contains 3,965 characters, including the English characters and the frequently-used Chinese characters. Since there is no real-world benchmark dataset with LR-HR image pairs of Chinese characters, we synthesize LR-HR text image pairs by blurring and down-sampling the MTWI text images. We inherit the splits of MTWI as our training (59,886 samples) and testing (4,838 samples) sets. The model training and testing
are conducted following the settings described in Section 4.1.
The SR text recognition results are 27.7% (Bicubic), 41.1% (TSRN [46]), 42.7% (TPGSR-TSRN)
and 56.1% (HR). From the results, we can observe that our TPGSR framework still achieves a 1.6% accuracy gain over TSRN. Visualizations of some Chinese samples can be seen in Fig. 9. One can see that our TPGSR can greatly improve the visual quality of the SR results. Compared to TSRN [46], our TPGSR can better recover the text strokes in the samples. This preliminary experiment verifies that our TPGSR framework can be extended to logographic scripts. More investigations and real-world dataset construction will
be made in our future work.

5. Conclusion
In this paper, we presented a novel scene text image super-resolution framework, namely TPGSR, by
introducing a text prior (TP) to guide the text image super-resolution (SR) process. Considering the fact that text images carry distinct categorical information compared with natural scene images, we integrated the TP features and image features to more effectively reconstruct the text characters. The enhanced text image can produce a better TP in return, and therefore a multi-stage TPGSR was employed to progressively improve the SR recovery of text images. Experiments on the TextZoom benchmark and other datasets showed that TPGSR can clearly improve the visual quality and readability of low-resolution text images, especially in hard cases, and consequently significantly improve the text recognition
performance on them.

References
[1] aster.pytorch. https://github.com/ayumiymk/aster.pytorch. 10
[2] crnn.pytorch. https://github.com/meijieru/crnn.pytorch. 10
[3] MORAN v2. https://github.com/Canjie-Luo/MORAN_v2. 10
[4] ICPR 2018 contest on robust reading for multi-type web images (MTWI). https://tianchi.aliyun.com/getStart/introduction.htm?spm=5176.100066.0.0.50c233aftaCagb&raceId=231686. Accessed: 2019-3-22. 15
[5] Zuzana Bı́lková and Michal Hradiš. Perceptual license plate super-resolution with CTC loss. J. Electron.
Imaging, 2020(6):52–1, 2020. 1, 2
[6] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and
Dimosthenis Karatzas. Scene text visual question answering. In Int. Conf. Comput. Vis., pages 4291–4301,
2019. 1
[7] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-
resolution: A new benchmark and a new model. In Int. Conf. Comput. Vis., pages 3086–3095, 2019. 3
[8] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE
Conf. Comput. Vis. Pattern Recog., pages 6154–6162, 2018. 6
[9] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu,
Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In IEEE Conf. Comput.
Vis. Pattern Recog., pages 4974–4983, 2019. 6
[10] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. FSRNet: End-to-end learning face super-
resolution with facial priors. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2492–2501, 2018. 2, 3
[11] Zhanzhan Cheng, Fan Bai, Yunlu Xu, Gang Zheng, Shiliang Pu, and Shuigeng Zhou. Focusing attention:
Towards accurate text recognition in natural images. In Int. Conf. Comput. Vis., pages 5076–5084, 2017. 4
[12] Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. AON: Towards arbitrarily-
oriented text recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5571–5579, 2018. 4
[13] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for
image super-resolution. In Eur. Conf. Comput. Vis., pages 184–199. Springer, 2014. 10, 11, 12, 13
[14] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolu-
tional networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):295–307, 2015. 3
[15] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural
network. In Eur. Conf. Comput. Vis., pages 391–407. Springer, 2016. 1
[16] Chao Dong, Ximei Zhu, Yubin Deng, Chen Change Loy, and Yu Qiao. Boosting optical character recogni-
tion: A super-resolution approach. arXiv preprint arXiv:1506.02211, 2015. 3
[17] Chiung-Yao Fang, Chiou-Shann Fuh, PS Yen, Shen Cherng, and Sei-Wang Chen. An automatic road sign
recognition system based on a computational model of human recognition processing. 96(2):237–268, 2004.
1
[18] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal clas-
sification: labelling unsegmented sequence data with recurrent neural networks. pages 369–376, 2006. 1,
4
[19] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural im-
ages. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2315–2324, 2016. 8
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016. 1
[21] Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, and Xiaoou Tang. Reading scene text in deep convolu-
tional sequences. arXiv preprint arXiv:1506.04395, 2015. 3
[22] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial
neural networks for natural scene text recognition. In Adv. Neural Inform. Process. Syst. Worksh., 2014. 8
[23] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with
convolutional neural networks. Int. J. Comput. Vis., 116(1):1–20, 2016. 3
[24] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep features for text spotting. In Eur. Conf.
Comput. Vis., pages 512–528. Springer, 2014. 3
[25] Sezer Karaoglu, Ran Tao, Theo Gevers, and Arnold WM Smeulders. Words matter: Scene text for image
classification and retrieval. IEEE Trans. Multimedia, 19(5):1063–1076, 2016. 1
[26] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov,
Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar
2015 competition on robust reading. In Int. Conf. Doc. Anal. Recog., pages 1156–1160. IEEE, 2015. 8, 11
[27] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convo-
lutional networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1646–1654, 2016. 1, 5
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional
neural networks. Commun. ACM, 60(6):84–90, 2017. 1
[29] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep Laplacian pyramid networks
for fast and accurate super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., pages 624–632, 2017. 1,
3, 5, 7
[30] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta,
Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-
resolution using a generative adversarial network. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4681–
4690, 2017. 1, 3, 5, 10, 11, 12, 13
[31] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual
networks for single image super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages
136–144, 2017. 1, 3, 5, 7
[32] Wei Liu, Chaofeng Chen, Kwan-Yee K Wong, Zhizhong Su, and Junyu Han. STAR-Net: A spatial attention
residue network for scene text recognition. In Brit. Mach. Vis. Conf., volume 2, page 7, 2016. 4
[33] Canjie Luo, Lianwen Jin, and Zenghui Sun. MORAN: A multi-object rectified attention network for scene
text recognition. Pattern Recognition, 90:109–118, 2019. 10, 11
[34] Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. DocVQA: A dataset for VQA on
document images. arXiv preprint arXiv:2007.00398, 2020. 1
[35] Sergio Montazzolli Silva and Claudio Rosito Jung. License plate detection and recognition in unconstrained
scenarios. In Eur. Conf. Comput. Vis., pages 580–596, 2018. 1
[36] Yongqiang Mou, Lei Tan, Hui Yang, Jingying Chen, Leyuan Liu, Rui Yan, and Yaohong Huang. PlugNet:
Degradation aware scene text recognition supervised by a pluggable super-resolution unit. In Eur. Conf.
Comput. Vis., 2020. 3
[37] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada
Pal, Christophe Rigaud, Joseph Chazalon, et al. Icdar2017 robust reading challenge on multi-lingual scene
text detection and script identification-rrc-mlt. In Int. Conf. Doc. Anal. Recog., volume 1, pages 1454–1459.
IEEE, 2017. 14
[38] Clément Peyrard, Moez Baccouche, Franck Mamalet, and Christophe Garcia. ICDAR2015 competition on
text image super-resolution. In Int. Conf. Doc. Anal. Recog., pages 1201–1205. IEEE, 2015. 3
[39] Zhi Qiao, Yu Zhou, Dongbao Yang, Yucan Zhou, and Weiping Wang. SEED: Semantics enhanced encoder-
decoder framework for scene text recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13528–
13537, 2020. 11
[40] Yuhui Quan, Jieting Yang, Yixin Chen, Yong Xu, and Hui Ji. Collaborative deep learning for super-resolving
blurry text images. IEEE Transactions on Computational Imaging, 6:778–790, 2020. 3
[41] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based se-
quence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell.,
39(11):2298–2304, 2016. 2, 3, 4, 8, 9, 10, 11, 13
[42] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. ASTER:
An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell.,
41(9):2035–2048, 2018. 10, 11, 12
[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. In IEEE Conf. Comput. Vis. Pattern Recog., 2014. 1
[44] Hanh TM Tran and Tien Ho-Phuoc. Deep Laplacian pyramid network for text images super-resolution. In
Int. Conf. Comput. and Commun. Technol., pages 1–6. IEEE, 2019. 1, 2, 5
[45] Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In Int. Conf. Comput.
Vis., pages 1457–1464. IEEE, 2011. 8, 11
[46] Wenjia Wang, Enze Xie, Xuebo Liu, Wenhai Wang, Ding Liang, Chunhua Shen, and Xiang Bai. Scene text
image super-resolution in the wild. arXiv preprint arXiv:2005.03341, 2020. 2, 3, 5, 7, 8, 9, 10, 11, 12, 13,
14, 15
[47] Wenjia Wang, Enze Xie, Peize Sun, Wenhai Wang, Lixun Tian, Chunhua Shen, and Ping Luo. TextSR:
Content-aware text super-resolution guided by recognition. arXiv preprint arXiv:1909.07113, 2019. 1, 2, 3,
5, 7
[48] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-
resolution by deep spatial feature transform. In IEEE Conf. Comput. Vis. Pattern Recog., pages 606–615,
2018. 2, 3
[49] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From
error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004. 3
[50] Qi Xie, Minghao Zhou, Qian Zhao, Deyu Meng, Wangmeng Zuo, and Zongben Xu. Multispectral and
hyperspectral image fusion by ms/hs fusion net. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1585–
1594, 2019. 8
[51] Xiangyu Xu, Deqing Sun, Jinshan Pan, Yujin Zhang, Hanspeter Pfister, and Ming-Hsuan Yang. Learning to
super-resolve blurry face and text images. In Proceedings of the IEEE international conference on computer
vision, pages 251–260, 2017. 3
[52] Xuaner Zhang, Qifeng Chen, Ren Ng, and Vladlen Koltun. Zoom to learn, learn to zoom. In IEEE Conf.
Comput. Vis. Pattern Recog., pages 3762–3770, 2019. 3
[53] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using
very deep residual channel attention networks. In Eur. Conf. Comput. Vis., pages 286–301, 2018. 1, 3
[54] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-
resolution. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2472–2481, 2018. 1, 3, 5, 7, 10, 11, 12,
13
