Text Prior Guided Scene Text Image Super-Resolution
Abstract
Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-
resolution (LR) scene text images, and consequently boost the performance of text recognition. However,
most existing STISR methods regard text images as natural scene images, ignoring the categorical
information of text. In this paper, we make an inspiring attempt to embed categorical text prior into
STISR model training. Specifically, we adopt the character probability sequence as the text prior, which
can be obtained conveniently from a text recognition model. The text prior provides categorical guidance
to recover high-resolution (HR) text images. On the other hand, the reconstructed HR image can refine
the text prior in return. Finally, we present a multi-stage text prior guided super-resolution (TPGSR)
framework for STISR. Our experiments on the benchmark TextZoom dataset show that TPGSR can not
only effectively improve the visual quality of scene text images, but also significantly improve the text
recognition accuracy over existing STISR methods. Our model trained on TextZoom also demonstrates
certain generalization capability to the LR images in other datasets.
1. Introduction
Scene text image recognition aims to recognize the text characters from the input image, which is
an important computer vision task that involves text information processing. It has been widely used
in text retrieval [25], sign recognition [17], license plate recognition [35] and other scene-text-based
image understanding tasks [6, 34]. However, due to the various issues such as low sensor resolution,
blurring, poor illumination, etc., the quality of captured scene text images may not be good enough,
which brings many difficulties to scene text recognition in practice. In particular, scene text recognition
from low-resolution (LR) images remains a challenging problem.
In recent years, single image super-resolution (SISR) techniques [15, 27, 29, 30, 31, 54, 53] have
achieved significant progress owing to the rapid development of deep neural networks [28, 43, 20].
Inspired by the success of SISR, researchers have started to investigate scene text image super-resolution
(STISR) to improve the quality of LR text images and hence improve the text recognition accuracy. Tran
et al. [44] adapted LapSRN [29] to STISR and significantly improved the text content details. To obtain
more realistic STISR results, Bı́lková et al. [5] and Wang et al. [47] employed GAN based networks with
CTC loss [18] and text perceptual losses.
* This work is supported by the Hong Kong RGC RIF grant (R5001-18).
[Figure 1 panels (a)-(e): bicubic, TSRN without TP, TSRN with TP, TSRN with refined TP, and HR; the bicubic result is recognized as "NOX" and the HR image as "PARKING".]
Figure 1: Comparison of super-resolution results generated from TSRN [46] with and without (w/o) our
text prior (TP). The right column shows the text recognition results.
In these methods, the LR images are synthesized (e.g., by bicubic down-sampling) from high-resolution
(HR) images for SR model learning, while the image degradation
process in real-world LR images can be much more complex. Recently, Wang et al. [46] collected a
real-world STISR dataset, namely TextZoom, where LR-HR image pairs captured by zooming lens are
provided. Wang et al. also proposed a TSRN model for STISR, achieving state-of-the-art performance
[46].
The existing STISR methods [44, 47, 5], however, mostly treat scene text images as natural scene
images to perform super-resolution, ignoring the important categorical information brought by the text
content in the image. As shown in Fig. 1(b), the result by TSRN [46] is much better than the simple
bicubic model (Fig. 1(a)), but it is still hard to tell the characters therein. Based on the observation that
semantic information can help to recover the shape and texture of objects [48], in this paper we propose a
new STISR method, namely text prior guided super-resolution (TPGSR), by introducing the categorical
text prior information into the model learning process. Unlike the face segmentation prior used in [10]
and the semantic segmentation used in [48], the text character segmentation is hard to obtain since there
are few datasets containing annotations of fine character segmentation masks. We instead employ a text
recognition model (e.g., CRNN [41]) to extract the probability sequence of the text as the categorical
prior of the given LR scene text image, and embed it into the super-resolution network learning process to
guide the reconstruction of the HR image. As can be seen in Fig. 1(c), the text prior information can indeed
substantially improve the STISR results, making the text characters much more readable. On the other hand, the
reconstructed HR text image can be used to refine the text prior, and consequently a multi-stage TPGSR
framework can be built for effective STISR. Fig. 1(d) shows the super-resolved text image by using the
refined text prior, where the text can be clearly and correctly recognized. The major contributions of our
work are as follows:
• We introduce the text recognition categorical probability sequence as the prior into the STISR
task, and validate its effectiveness to improve the visual quality of text content and recognition
accuracy of scene text images.
• We propose to refine the text recognition categorical prior without extra supervision from text
labels, requiring only the real HR image: the prior is refined recurrently by the estimated HR image
and by fine-tuning the text prior generator with our proposed TP loss. With such refinement, the
text prior and the super-resolved text image can be jointly enhanced under our TPGSR framework.
• By improving the image quality of LR text images, the proposed TPGSR improves the text recog-
nition performance on TextZoom for different text recognition models by a large margin and
demonstrates good generalization performance to other recognition datasets.
2. Related Works
Single Image Super Resolution (SISR). Aiming at estimating a high-resolution (HR) image from
its low-resolution (LR) counterpart, SISR is a highly ill-posed problem. In the past, handcrafted image
priors were commonly used to regularize the SISR model and improve the image quality. In recent
years, the training of deep neural networks (DNNs) has dominated the research of SISR. The pioneer
work SRCNN [14] learns a three-layer convolutional neural network (CNN) for the SISR task. Later
on, many deeper CNN models have been proposed to improve the SISR performance, e.g., deep resid-
ual block [31], Laplacian pyramid structure [29], densely connected network [54] and channel attention
mechanism [53]. The PSNR and SSIM [49] losses are widely used in those works to train the SISR
model. In order to produce perceptually-realistic SISR results, SRGAN [30] employs a generative ad-
versarial network (GAN) to synthesize image details. SFT-GAN [48] and FSRNet [10] utilize the GAN
loss and semantic segmentation priors to generate visually pleasing HR images.
Scene Text Image Super Resolution (STISR). Different from the general purpose SISR that works
on natural scene images, STISR focuses on text images, aiming to improve the readability of texts by
improving their visual quality. Intuitively, those methods for SISR can be directly adopted for STISR.
In [16], Dong et al. extended SRCNN [14] to text images and obtained the best performance in the ICDAR
2015 competition [38]. PlugNet [36] employs a light-weight pluggable super-resolution unit to deal
with LR images in feature domain. TextSR [47] utilizes the text recognition loss and text perceptual loss
to generate the desired HR images for text recognition. To improve the performance of STISR on real-
world scene text images, Wang et al. [46] built a real-world STISR image dataset, namely TextZoom,
where the LR and HR text image pairs were cropped from real-world SISR datasets [52, 7]. They also
proposed a TSRN [46] method to use the central alignment module and sequential residual block to
exploit the semantic information in internal features. SCGAN [51] employs a multi-class GAN loss
as supervision to equip the model with the ability to generate more distinguishable face and text images.
By progressively exploiting the high-frequency information derived from the text image, Quan et
al. [40] proposed a multi-stage model for recovering blurry text images. Different from the above
methods, we propose to employ the text recognition prior to guide the STISR model in recovering better
quality text images.
Scene Text Recognition. In the early stage of deep learning based scene text recognition, researchers
tended to solve the problem in a bottom-up manner [24, 21], i.e., recognizing individual characters and
assembling them into words. Some other approaches recognize the text in a top-down fashion [23], i.e., regarding the text
image as a whole and performing a word-level classification. Taking text recognition as an image-to-
sequence problem, CRNN [41] employs a CNN to extract image features and uses recurrent neural
networks to model the semantic information of the image features. Trained with the CTC [18] loss, the
predicted sequence can be more accurately aligned with the target sequence [32]. Recently, attention-based
methods have thrived owing to their improved accuracy on text recognition benchmarks and their
robustness to text images of various shapes [11, 12]. In our method, we adopt CRNN as the text prior
generator to generate categorical text priors for STISR model training. It is shown that such text priors
can significantly improve the perceptual quality of super-resolved text images and consequently boost
the text recognition performance.
Figure 2: Visualization of the text prior (TP) of some text images (labels: "obey", "tourist", "reproductive").
The top, middle and bottom rows present the text labels, input text images and their TPs, respectively.
3. Methodology
In this section, we will first explain what the text prior (TP) is, and then introduce the text prior guided
super-resolution (TPGSR) network in detail, followed by the design of loss function.
3.1. Text Prior
In this paper, the TP is defined as the deep categorical representation of a scene text image generated
by some text recognition models. The TP is then used as guidance information to encourage our TPGSR
network to produce high quality scene text images, which are favorable to both visual perception and
scene text recognition.
Specifically, we choose the classic text recognition model CRNN [41] to be the TP Generator. CRNN
uses several convolution layers to extract the text features and five max pooling layers to down-sample
the features into a feature sequence. The TP is then defined as the categorical probability prediction by
CRNN, which is a sequence of |A|-dimensional probability vectors where |A| denotes the number of
characters learned by CRNN. Fig. 2 visualizes the TP of some scene text images, where the horizontal
axis represents the sequence in left-to-right order and the vertical axis represents the categories in reverse
alphabet order (e.g., 'Z' to 'A'). In the visualization, the lighter a spot is, the higher the probability
of the corresponding category. By using the TP as guidance, our TPGSR model can recover visually more
pleasing HR images with higher text recognition accuracy, as illustrated in Fig. 1.
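To make the TP extraction concrete, the following minimal PyTorch sketch illustrates the idea under our own simplifying assumptions. TinyRecognizer is only a stand-in for the pre-trained CRNN (what matters is that it outputs a length-T sequence of |A|-dimensional logits), and text_prior simply bicubic-upsamples the LR input and applies a softmax to obtain the probability sequence used as the TP; it is an illustration, not the actual implementation.

# A minimal sketch of TP extraction: the TP is the per-position categorical
# probability sequence produced by a CRNN-style recognizer. "TinyRecognizer"
# is a stand-in for the real pre-trained CRNN; only its (T, |A|) output matters.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 37  # 10 digits + 26 letters + 1 blank label

class TinyRecognizer(nn.Module):
    """Stand-in for CRNN: conv features -> width-wise sequence -> class logits."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # down-sample H and W
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 26)),          # collapse height, keep 26 steps
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                           # x: (B, 3, H, W)
        f = self.backbone(x)                        # (B, 128, 1, T)
        f = f.squeeze(2).permute(0, 2, 1)           # (B, T, 128)
        return self.classifier(f)                   # (B, T, |A|) logits

def text_prior(recognizer, lr_image, out_size=(16, 64)):
    """Bicubic-upsample the LR image, then return the softmax probability
    sequence (B, T, |A|) used as the categorical text prior."""
    up = F.interpolate(lr_image, size=out_size, mode="bicubic", align_corners=False)
    with torch.no_grad():
        logits = recognizer(up)
    return logits.softmax(dim=-1)

if __name__ == "__main__":
    tp = text_prior(TinyRecognizer(), torch.rand(2, 3, 8, 32))
    print(tp.shape)  # torch.Size([2, 26, 37])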
[Figure 3 diagram: the bicubic-upsampled LR input is fed to the TP Generation Branch (TP Generator and TP Transformer, supervised by the TP loss) to produce TP features, while the SR Branch stacks TP-Guided SR Blocks in the SR Module, supervised by the SR loss against the HR ground truth, to produce the SR output.]
Figure 3: Our proposed TPGSR framework, which consists of a Text Prior Generation Branch and
a Super-resolution (SR) Branch. Accordingly, TP loss and SR loss are employed to train the whole
network.
[Figure 4 diagram: the TP-Guided SR Block concatenates the TP feature (32 channels) with the image feature (C channels), passes the result through a projection block back to C channels, adds it to the image feature, and then applies the block layers; the common SR Block applies the block layers to the image feature only.]
Figure 4: Comparison of our TP-Guided SR Block and a common SR Block. In each block, the channel
numbers of the image features and TP features are C and 32, respectively.
3.2. TPGSR Framework
To embed the TP features into the SR Block, we concatenate them to the image features along the channel
dimension. Before the concatenation, we align the spatial size of TP features to that of the image features
by bicubic interpolation. Suppose that the channel number of image features is C, then the concatenated
features of C + 32 channels will go through a projection layer to reduce the channel number back to C.
We simply use a 1 × 1 kernel convolution to perform this projection. The output of projection layer is
fused with the input image feature by addition. With several such TP-Guided SR Blocks, the SR branch
will output the estimated HR image, as in those previous super-resolution models.
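The following PyTorch sketch summarizes our reading of the TP-Guided SR Block described above. The bicubic alignment, channel concatenation, 1 × 1 projection back to C channels and additive fusion follow the text; the two-convolution residual body is an assumed placeholder for the "common SR block" layers and is not the exact block used in TSRN or in our model.

# A minimal sketch of a TP-Guided SR Block (our illustration, not released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TPGuidedSRBlock(nn.Module):
    def __init__(self, c_img=64, c_tp=32):
        super().__init__()
        self.project = nn.Conv2d(c_img + c_tp, c_img, kernel_size=1)   # (C+32) -> C
        self.body = nn.Sequential(                                      # assumed SR block layers
            nn.Conv2d(c_img, c_img, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_img, c_img, 3, padding=1),
        )

    def forward(self, feat, tp_feat):
        # Align TP features to the spatial size of the image features.
        tp_feat = F.interpolate(tp_feat, size=feat.shape[-2:], mode="bicubic",
                                align_corners=False)
        fused = self.project(torch.cat([feat, tp_feat], dim=1)) + feat  # project and add
        return self.body(fused) + fused                                 # residual SR layers

if __name__ == "__main__":
    block = TPGuidedSRBlock()
    img_feat = torch.rand(2, 64, 16, 64)   # (B, C, H, W) image features
    tp_feat = torch.rand(2, 32, 1, 26)     # TP features reshaped to a 2-D map
    print(block(img_feat, tp_feat).shape)  # torch.Size([2, 64, 16, 64])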
3.3. Multi-stage Refinement
With the TPGSR framework described in Section 3.2, we can super-resolve an LR image to a better
quality HR image with the help of TP features extracted from the LR input. One intuitive question
is, if we extract the TP features from the super-resolved HR image, can we use those better quality
TP features to further improve the super-resolution results? Actually, multi-stage refinement has been
widely adopted in many computer vision tasks such as object detection [8] and instance segmentation [9]
to improve the prediction quality progressively. Therefore, we extend our one-stage TPGSR model to a
multi-stage learning framework by passing the estimated HR text image from one stage to the TP Generator
of the next stage. The multi-stage TPGSR framework is illustrated in Fig. 5. In the first stage, the TP Module
accepts the bicubically interpolated LR image as input, while in the following stages, the TP Module
accepts the HR image output by the SR Module of the previous stage as input for refinement. As we
will show in the ablation study in Section 4.3, both the quality of estimated HR text image and the text
recognition accuracy can be progressively improved by this multi-stage refinement.
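A simple sketch of this multi-stage loop is given below, assuming one (non-shared) TP module per stage and a shared SR module, as in our implementation (Section 4.1); tp_modules and sr_module are placeholders for the modules described above, and the gradient is cut between stages by detaching the SR output.

# A minimal sketch (assumed structure, not the released code) of multi-stage TPGSR:
# stage 1 computes the TP from the bicubic-upsampled LR image; each later stage
# recomputes the TP from the previous stage's SR output.
import torch.nn.functional as F

def multistage_tpgsr(lr_image, tp_modules, sr_module, num_stages=3, scale=2):
    """tp_modules: one (non-shared) TP module per stage; sr_module is shared."""
    outputs = []
    up = F.interpolate(lr_image, scale_factor=scale, mode="bicubic",
                       align_corners=False)           # stage-1 TP input
    for stage in range(num_stages):
        tp_feat = tp_modules[stage](up)               # TP features for this stage
        sr = sr_module(lr_image, tp_feat)             # TP-guided super-resolution
        outputs.append(sr)
        up = sr.detach()                              # cut gradients across stages
    return outputs                                    # SR output of every stage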
[Figure 5 diagram: in stage 1, the bicubic-upsampled LR input is fed to the TP Module, whose TP feature guides the SR Module to produce the SR output; in stages 2 to N, the SR output of the previous stage is fed recurrently to the TP Module.]
Figure 5: Illustration of multi-stage TPGSR. The super-resolution output of one stage serves as the text
image input to the next stage.
where $t_L^{ij}$ and $t_H^{ij}$ denote the element at the $i$-th position and $j$-th dimension of $t_L$ and $t_H$, respectively, and $\epsilon$ is a small positive number to avoid numerical errors in the division and logarithm. Together with the SR loss $\mathcal{L}_S$, the overall loss for a single-stage TPGSR is

$\mathcal{L} = \alpha \mathcal{L}_S + \beta \mathcal{L}_{TP}$,   (2)

where $\alpha$ and $\beta$ weight the SR loss and TP loss terms. For the multi-stage model, the overall loss is

$\mathcal{L}_{mt} = \sum_{i=1}^{N} \lambda_i \mathcal{L}_i$,   (3)

where $\mathcal{L}_i$ denotes the single-stage loss of the $i$-th stage, $\lambda_i$ balances the loss of each stage, and $\sum_{i=1}^{N} \lambda_i = 1$.
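The following sketch shows one possible implementation of these losses. The KL-divergence-style form of the TP loss is our assumption based on the division and logarithm mentioned for Eq. 1, and the MSE form of the SR loss is likewise an assumed placeholder; alpha, beta, eps and the stage weights follow the settings in Section 4.1.

# A minimal, assumed sketch of the loss computation (not the exact paper losses).
import torch
import torch.nn.functional as F

EPS = 1e-6

def tp_loss(t_sr, t_hr):
    """KL-style divergence between two probability sequences of shape (B, T, |A|)."""
    return (t_hr * torch.log((t_hr + EPS) / (t_sr + EPS))).sum(dim=-1).mean()

def single_stage_loss(sr, hr, t_sr, t_hr, alpha=1.0, beta=1.0):
    return alpha * F.mse_loss(sr, hr) + beta * tp_loss(t_sr, t_hr)   # Eq. 2

def multistage_loss(stage_losses, lambdas=(0.25, 0.25, 0.5)):
    """Eq. 3: weighted sum of per-stage losses, with weights summing to 1."""
    return sum(l * w for l, w in zip(stage_losses, lambdas))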
4. Experiments
4.1. Implementation Details
We implement our TPGSR method in Pytorch. Adam is selected as our optimizer with momentum
0.9. The batch size is set to 48 and the model is trained for 500 epochs with one NVIDIA RTX 2080Ti
GPU. The TP Generator is a CRNN [41] pre-trained on SynthText [19] and MJSynth [22]. In
Eq. 2, the weights α and β are both simply set to 1, while the $\epsilon$ in Eq. 1 is set to $10^{-6}$. The alphabet set
A mainly includes case-insensitive alphanumeric characters (0 to 9 and 'a' to 'z'). Together with a blank
label, A has 37 categories in total (i.e., |A| = 37). To deal with out-of-category cases, we assign the blank
label to all out-of-category characters, so the reconstruction of these characters mainly
depends on the SR Module.
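For illustration, the label encoding implied by this alphabet can be sketched as follows (our own helper, not part of the released code): 36 case-insensitive alphanumeric characters plus a blank label, with any out-of-category character mapped to the blank label.

# A small sketch of the label encoding implied above.
import string

ALPHABET = string.digits + string.ascii_lowercase           # '0'-'9', 'a'-'z'
BLANK = 0
CHAR_TO_INDEX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 1..36; 0 is blank

def encode(text):
    """Map a text label to class indices; out-of-category characters become blank."""
    return [CHAR_TO_INDEX.get(c, BLANK) for c in text.lower()]

print(len(ALPHABET) + 1)      # 37 categories in total
print(encode("Parking-24"))   # the hyphen falls back to the blank label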
For multi-stage TPGSR training, we adopt a well-trained single-stage model to initialize all stages and
cut the gradients across stages to speed up convergence. The TP Generators are not shared across stages,
while the SR Module is shared. As in previous multi-stage learning methods [50], a higher weight is
assigned to the loss of the last stage, and smaller weights to the other stages. In particular, we use a
3-stage TPGSR, with the parameters λi in Eq. 3 set as λ1 = 1/4, λ2 = 1/4 and λ3 = 1/2.
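The training setup above can be summarized by the following sketch. We assume that "momentum 0.9" corresponds to Adam's first moment coefficient; the learning rate and the model, train_set and compute_losses placeholders are illustrative and not specified in this section.

# A sketch of the training configuration described in Section 4.1 (assumptions noted).
import torch
from torch.utils.data import DataLoader

def train(model, train_set, compute_losses, epochs=500, batch_size=48, lr=1e-3):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    stage_weights = (0.25, 0.25, 0.5)                 # lambda_1..lambda_3 of Eq. 3
    for _ in range(epochs):
        for lr_img, hr_img in loader:
            stage_losses = compute_losses(model, lr_img, hr_img)  # one loss per stage
            loss = sum(w * l for w, l in zip(stage_weights, stage_losses))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()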
4.2. Datasets and Experiment Settings
Datasets. The TextZoom [46], ICDAR2015 [26] and SVT [45] datasets are used to validate the effective-
ness of our proposed TPGSR method. TextZoom consists of 21,740 LR-HR text image pairs collected
by lens zooming of the camera in real-world scenarios. The training set has 17,367 pairs, while the test
set is divided into three subsets based on the camera focal length, namely easy (1,619 samples), medium
(1,411 samples) and hard (1,343 samples). The dataset also provides the text label for each pair.
ICDAR2015 is a well-known scene text recognition dataset, which contains 2,077 cropped text im-
ages from street view photos for testing. SVT is also a scene text recognition dataset, which contains
647 testing text images. Each image comes with a 50-word lexicon.
Approach     Tuned   ACC
TSRN [46]    -       41.4%
TPGSR-TSRN   ×       44.5%
TPGSR-TSRN   ✓       49.8%
HR           -       72.3%
(a) Tuning the TP Generator.
N   Easy    Medium  Hard    ACC
1   61.0%   49.9%   36.7%   49.8%
2   62.2%   51.3%   37.4%   50.9%
3   63.1%   52.0%   38.6%   51.8%
4   63.7%   53.3%   39.4%   52.6%
5   64.3%   54.2%   39.2%   53.1%
(b) Ablation on different stage settings.
Experiment settings. Since there are real-world LR-HR image pairs in the TextZoom dataset, we
first use it to train and evaluate the proposed TPGSR model. We then apply the trained model to IC-
DAR2015/SVT to test its generalization performance to other datasets. Considering the fact that most
of the images in ICDAR2015 and SVT have good resolution and quality, while the TextZoom training
data focus on LR images, we perform the generalization test only on the low-quality images in
ICDAR2015/SVT, i.e., those whose height is less than 16 pixels or whose recognition score is less than 0.9.
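The selection rule for this generalization test can be expressed by a simple filter such as the sketch below, where recognition_score stands for the recognizer's confidence on the original image; the exact implementation may differ.

# A small sketch of the low-quality image filter described above.
def select_low_quality(images, recognition_score, min_height=16, min_score=0.9):
    """images: iterable of arrays with shape (H, W, C)."""
    return [img for img in images
            if img.shape[0] < min_height or recognition_score(img) < min_score]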
4.3. Ablation Studies
To better understand the proposed TPGSR model, in this section we conduct a series of ablation exper-
iments on the selection of parameters in loss function, the selection of number of stages and whether the
TP Generator should be fine-tuned in training. We also perform experiments to validate the effectiveness
of SR Module in our TPGSR framework. We adopt TSRN [46] as the SR Module in the experiments,
and name our model as TPGSR-TSRN. All ablation experiments are performed on TextZoom and the
recognition accuracies are evaluated with CRNN [41].
Impact of tuning the TP Generator. The loss terms for the TP branch aim to fine-tune the TP Gen-
erator. To prove the significance of TP Generator tuning, we conduct experiments by fixing and tuning
the TP Generator in a one-stage TPGSR model. The text recognition accuracies are shown in Table 1a.
Even with a fixed TP Generator, the SR image recognition accuracy is improved by 3.1% over the TSRN baseline [46].
Approach ASTER [42, 1] MORAN [33, 3] CRNN [41, 2]
easy medium hard average easy medium hard average easy medium hard average
BICUBIC 64.7% 42.4% 31.2% 47.2% 60.6% 37.9% 30.8% 44.1% 36.4% 21.1% 21.1% 26.8%
SRCNN [13, 46] 69.4% 43.4% 33.0% 49.5% 63.2% 39.0% 30.2% 45.3% 38.7% 21.6% 20.9% 27.7%
TPGSR-SRCNN 72.9% 50.7% 34.7% 53.8% 67.7% 49.7% 32.8% 50.9% 47.0% 30.6% 24.7% 34.7%
SRResNet [30, 46] 69.6% 47.6% 34.3% 51.3% 60.7% 42.9% 32.6% 46.3% 39.7% 27.6% 22.7% 30.6%
TPGSR-SRResNet 76.0% 58.8% 40.1% 59.1% 72.3% 54.9% 38.4% 56.0% 54.6% 41.2% 32.3% 43.3%
RDN [54, 46] 70.0% 47.0% 34.0% 51.5% 61.7% 42.0% 31.6% 46.1% 41.6% 24.4% 23.5% 30.5%
TPGSR-RDN 72.6% 54.2% 37.2% 55.5% 67.8% 51.7% 36.0% 52.6% 53.0% 38.0% 27.7% 40.2%
TSRN [46] 75.1% 56.3% 40.1% 58.3% 70.1% 53.3% 37.9% 54.8% 52.5% 38.2% 31.4% 41.4%
TPGSR-TSRN 78.9% 62.7% 44.5% 62.8% 74.9% 60.5% 44.1% 60.5% 63.1% 52.0% 38.6% 51.8%
HR 94.2% 87.7% 76.2% 86.4% 91.2% 85.3% 74.2% 83.9% 76.4% 75.1% 64.6% 72.2%
Table 2: SR text image recognition performance of competing STISR models on TextZoom. The recog-
nition accuracies are evaluated by the officially released models of ASTER [42], MORAN [33] and
CRNN [41].
By tuning the TP Generator during the training process, the recognition accuracy is
further improved from 44.5% to 49.8%, a gain of 5.3%. This clearly demonstrates the benefit of tuning
the TP Generator for the SR text recognition task.
Impact of multiple stages in TPGSR. In addition to refining the TP Generator, recurrently inputting
the estimated HR image into the TPGSR can also enhance the quality of TP since the SR Module can
improve the estimated HR text image in each recurrence. To find out how well the multi-stage refinement
can reach, we set the stage number N = 1, 2, . . . , 5 and report the text recognition accuracy in Table 1b.
We can see that the recognition accuracy increases with N; however, the margin of improvement
decreases as N grows. When N = 5, the accuracy on the 'Hard' split begins to fall. Considering
the balance between model size and performance gain, we set N to 3 in the following experiments.
Parameter sharing strategy. To determine the best sharing strategy, we conduct experiments on sharing
the TP Module and the SR Module across stages. As shown in Table 1c, we find that, under different
stage-number settings, using non-shared TP Modules brings significant performance improvement.
However, when we use non-shared SR Module, little performance improvement in SR image recognition
is achieved. Thus we use the settings of shared SR Module and non-shared TP Module in our multi-stage
model.
The effectiveness of SR in TPGSR. Since one of the goals of STISR is to improve the text recognition
performance by HR image recovery, it is necessary to check if the estimated SR images truly help the
final text recognition task. To this end, we evaluate the TPGSR models with both fixed and tuned TP
Generator by using LR and SR images as inputs. For the multi-stage version, we test all the TP Generators
and pick the best LR and SR results among them. Note that a model with tuned TP Generator
and LR images as input is similar to directly fine-tuning the text recognition model on the LR images.
The results are shown in Table 1d. It can be seen that by tuning the TP Generator on the LR images, the
text recognition accuracy can be increased. However, the recognition accuracy can be improved more by
using the SR text image. For example, at stage one, the recognition accuracy on LR images with the tuned
TP Generator is 45.3%, while the accuracy on SR images, even without fine-tuning the TP Generator,
reaches 49.8%. If the tuned TP Generator is used to generate the SR text image, the text recognition perfor-
mance can be further improved compared to the fixed TP Generator. The experiments and comparisons
demonstrate the effectiveness of our SR Module in improving the final SR text recognition.
Dataset ICDAR2015 SVT
No. of images 563 104
Approach SEED [39] ASTER [42] CRNN [41] SEED [39] ASTER [42] CRNN [41]
Origin 54.0% 50.8% 21.5% 60.2% 50.8% 19.2%
TSRN [46] 52.6% 48.3% 24.5% 54.3% 48.3% 23.1%
TPGSR-TSRN 56.1% 52.0% 27.1% 61.1% 52.0% 29.8%
Table 3: Text recognition accuracy on the low-quality images in ICDAR2015/SVT datasets by the TSRN
and TPGSR-TSRN models trained on the TextZoom dataset.
[Figure 6 image panels: SR outputs of SRCNN, TPGSR-SRCNN, SRResNet, TPGSR-SRResNet, RDN, TPGSR-RDN, TSRN and TPGSR-TSRN, together with the HR reference, on six sample words; the TPGSR-TSRN and HR rows are recognized as "regular mosque you tourist memorable vehicles", while the other rows contain recognition errors (e.g., "mosqua", "jourist", "memereble").]
Figure 6: Visual comparison of competing STISR models on TextZoom. The word on the bottom-right
corner of each image is the text recognition result, with correct characters or words in green and wrong
in red.
Approach Accuracy of ASTER [42] PSNR SSIM
easy medium hard easy medium hard easy medium hard
BICUBIC 64.7% 42.4% 31.2% 22.35 18.98 19.39 0.7884 0.6254 0.6592
SRCNN [13, 46] 69.4% 43.4% 33.0% 23.48 19.06 19.34 0.8379 0.6323 0.6791
TPGSR-SRCNN 72.9% 50.7% 34.7% 22.82 19.01 19.35 0.8232 0.6372 0.6798
SRResNet [30, 46] 69.6% 47.6% 34.3% 23.48 19.06 19.34 0.8681 0.6406 0.6911
TPGSR-SRResNet 76.0% 58.8% 40.1% 22.47 19.09 19.59 0.8648 0.6359 0.7006
RDN [54, 46] 70.0% 47.0% 34.0% 22.27 18.95 19.70 0.8249 0.6427 0.7113
TPGSR-RDN 72.6% 54.2% 37.2% 23.36 18.90 19.77 0.8512 0.6524 0.7155
TSRN [46] 75.1% 56.3% 40.1% 25.07 18.86 19.71 0.8897 0.6676 0.7302
TPGSR-TSRN 78.9% 62.7% 44.5% 24.35 18.73 19.93 0.8860 0.6763 0.7487
Table 4: Recognition accuracy, PSNR (dB) and SSIM results of the competing STISR models on
TextZoom. All experiments are conducted under the same settings.
Table 5: Cost vs. performance. ACC means the average recognition accuracy. N refers to the stage
number of the TPGSR.
Table 6: PSNR, SSIM and Recognition results of estimated HR images and real HR images.
accuracy drop). However, under our designed TPGSR network, the performance shows an improvement
of 8.4% over TSRN with 5 SRBs. It is therefore reasonable to conclude that generating the text prior under
our TPGSR framework is worth the additional cost it introduces.
The PSNR/SSIM indices for STISR. In terms of STISR, better results of some objective metrics, e.g.,
PSNR and SSIM, do not always guarantee more accurate scene text estimation, and vice versa. A similar
conclusion is drawn in [46], i.e., PSNR and SSIM are not stable metrics for STISR. For better
interpretation of this point, we adopt some popular SR models, including SRCNN [13], SRResNet [30],
RDN [54] and TSRN [46], and compare the results of these models before and after they are integrated
into our TPGSR framework for joint optimization. The results are presented in Table 4. We can see
that the accuracy of all models is boosted after they are embedded into our TPGSR framework. In terms of the
objective metrics, our best SR model under TPGSR consistently outperforms the competing methods for
samples at the “hard” difficulty level, since TP strengthens the power of the SR models in the challenging
cases. In contrast, for samples at both “easy” and “medium” difficulty levels, joint optimization using
our TPGSR does not always improve PSNR/SSIM results. This is because SR models trained without
TP may suffer from over-fitting in easier cases. We could alleviate this issue by introducing a TP loss
between the LR images and the predicted HR images into our objective function as a regularization term
for joint optimization. In this way, as shown in Table 6, we can obtain better perceptual quality and
more accurate scene text recognition results, which we believe are more valuable in real-world
applications than a subtle rise in metrics such as PSNR and SSIM.
(a) Korean (b) Chinese (c) Bangla
Figure 7: Examples of out-of-category text image SR in different languages. From top to bottom: the
LR image and super-resolved HR images by TSRN [46] and our TPGSR-TSRN.
[Figure 8 examples (TP / SR recognition / ground truth): "sopervisar" / "sopervisar" / "supervisor"; "university" / "university" / "university"; "collubmines" / "rewlitanieaat" / "www.hairbywilson.com".]
Out-of-category analysis. As mentioned in Section 4.1, in our implementation, we assign the out-of-
category characters with blank label. For such characters, the STISR results will mainly depend on the
SR Module in our TPGSR network. To test the SR performance of our TPGSR model on images with
out-of-category characters, we applied it to some text images in Korean, Chinese and Bangla picked
from the ICDAR-MLT [37] dataset. The results are shown in Fig. 7. We see that the reconstructed HR
text images by our model show clearer appearance and contour than their LR counterparts. Compared
with TSRN, TPGSR-TSRN demonstrates slightly better perceptual quality. The reason may be that the
categorical text prior serves as a regularization term when training the SR Module, alleviating over-fitting. Hence,
the SR Module can still produce well-recovered scene text images with null guidance at the inference stage.
Failure case. Though TPGSR can improve the visual quality of SR text images and boost the perfor-
mance of text recognition, it still has some limitations, as shown in Fig. 8. First, the TP Generator
tuned with LR samples is robust in most cases, but it may occasionally produce a false prior for the
input LR image, so that the output SR image shows incorrect character strokes and causes false text
recognition. Fig. 8(a) illustrates such a failure case. Second, if the text instance is rotated, the benefit
brought by TPGSR is weakened. Fig. 8(b) shows such an example. Though the recognition result is
correct, the improvement in image quality is not significant. Third, for cases with extremely long text
instances, as shown in Fig. 8(c), the TP Generator may fail and produce significantly compressed
outputs. In such cases, the final output of TPGSR suffers from text distortion and wrong text recognition.
To address the above issues, in the future we could consider adopting more powerful TP Generators
to provide more robust guidance, and designing new TP guidance strategies for recovering multi-oriented
and curved scene text. In addition, the failures on long text can be alleviated by increasing the input width
of the TP Generator.
Recovering logographic scripts (e.g., Chinese). In this work, we focused on the real-world SR of English
text images, for which a well-prepared benchmark dataset, TextZoom, is available. It is interesting to know
whether our proposed method can be adopted for logographic scripts such as Chinese. Here we perform some
preliminary experiments to validate the feasibility. We train a multilingual recognition model using
CRNN on the ICPR2018-MTWI Chinese and English dataset [4] as our TP Generator. The overall
alphabet contains 3,965 characters, covering English characters and frequently used Chinese characters. Since there
is no real-world benchmark dataset with LR-HR image pairs of Chinese characters, we synthesize LR-HR
text image pairs by blurring and down-sampling the MTWI text images. We inherit the splits of
MTWI as our training (59,886 samples) and testing (4,838 samples) sets. The model training and testing
are conducted following the settings described in Section 4.1.
The SR text recognition results are 27.7% (Bicubic), 41.1% (TSRN [46]), 42.7% (TPGSR-TSRN)
and 56.1% (HR). From the results, we can observe that our TPGSR framework still achieves a 1.6%
accuracy gain over TSRN. Visualizations of some Chinese samples are shown in Fig. 9. One can see
that our TPGSR substantially improves the visual quality of the SR results. Compared to TSRN [46], our TPGSR
better recovers the text strokes of the samples. This preliminary experiment verifies that our TPGSR
framework can be extended to logographic scripts. More investigations and real-world dataset construction will
be made in our future work.
5. Conclusion
In this paper, we presented a novel scene text image super-resolution framework, namely TPGSR, by
introducing text prior (TP) to guide the text image super-resolution (SR) process. Considering the fact
that text images have distinct text categorical information compared with those natural scene images,
we integrated the TP features and image features to more effectively reconstruct the text characters. The
enhanced text image can produce better TP in return, and therefore multi-stage TPGSR was employed to
progressively improve the SR recovery of text images. Experiments on the TextZoom benchmark and other
datasets showed that TPGSR can clearly improve the visual quality and readability of low-resolution
text images, especially for hard cases, and consequently significantly improve the text recognition
performance on them.
References
[1] aster.pytorch. https://github.com/ayumiymk/aster.pytorch. 10
[2] crnn.pytorch. https://github.com/meijieru/crnn.pytorch. 10
[3] MORAN v2. https://github.com/Canjie-Luo/MORAN_v2. 10
[4] ICPR 2018 contest on robust reading for multi-type web images (MTWI). https://tianchi.aliyun.com/getStart/introduction.htm?spm=5176.100066.0.0.50c233aftaCagb&raceId=231686. Accessed: 2019-3-22. 15
[5] Zuzana Bı́lková and Michal Hradiš. Perceptual license plate super-resolution with CTC loss. J. Electron.
Imaging, 2020(6):52–1, 2020. 1, 2
[6] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and
Dimosthenis Karatzas. Scene text visual question answering. In Int. Conf. Comput. Vis., pages 4291–4301,
2019. 1
[7] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-
resolution: A new benchmark and a new model. In Int. Conf. Comput. Vis., pages 3086–3095, 2019. 3
[8] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE
Conf. Comput. Vis. Pattern Recog., pages 6154–6162, 2018. 6
[9] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu,
Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In IEEE Conf. Comput.
Vis. Pattern Recog., pages 4974–4983, 2019. 6
[10] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. FSRNet: End-to-end learning face super-
resolution with facial priors. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2492–2501, 2018. 2, 3
[11] Zhanzhan Cheng, Fan Bai, Yunlu Xu, Gang Zheng, Shiliang Pu, and Shuigeng Zhou. Focusing attention:
Towards accurate text recognition in natural images. In Int. Conf. Comput. Vis., pages 5076–5084, 2017. 4
[12] Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. AON: Towards arbitrarily-
oriented text recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5571–5579, 2018. 4
[13] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for
image super-resolution. In Eur. Conf. Comput. Vis., pages 184–199. Springer, 2014. 10, 11, 12, 13
[14] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolu-
tional networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):295–307, 2015. 3
[15] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural
network. In Eur. Conf. Comput. Vis., pages 391–407. Springer, 2016. 1
[16] Chao Dong, Ximei Zhu, Yubin Deng, Chen Change Loy, and Yu Qiao. Boosting optical character recogni-
tion: A super-resolution approach. arXiv preprint arXiv:1506.02211, 2015. 3
[17] Chiung-Yao Fang, Chiou-Shann Fuh, PS Yen, Shen Cherng, and Sei-Wang Chen. An automatic road sign
recognition system based on a computational model of human recognition processing. 96(2):237–268, 2004.
1
[18] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal clas-
sification: labelling unsegmented sequence data with recurrent neural networks. pages 369–376, 2006. 1,
4
[19] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural im-
ages. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2315–2324, 2016. 8
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016. 1
[21] Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, and Xiaoou Tang. Reading scene text in deep convolu-
tional sequences. arXiv preprint arXiv:1506.04395, 2015. 3
[22] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial
neural networks for natural scene text recognition. In Adv. Neural Inform. Process. Syst. Worksh., 2014. 8
[23] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with
convolutional neural networks. Int. J. Comput. Vis., 116(1):1–20, 2016. 3
[24] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep features for text spotting. In Eur. Conf.
Comput. Vis., pages 512–528. Springer, 2014. 3
[25] Sezer Karaoglu, Ran Tao, Theo Gevers, and Arnold WM Smeulders. Words matter: Scene text for image
classification and retrieval. IEEE Trans. Multimedia, 19(5):1063–1076, 2016. 1
[26] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov,
Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar
2015 competition on robust reading. In Int. Conf. Doc. Anal. Recog., pages 1156–1160. IEEE, 2015. 8, 11
[27] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convo-
lutional networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1646–1654, 2016. 1, 5
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional
neural networks. Commun. ACM, 60(6):84–90, 2017. 1
[29] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep Laplacian pyramid networks
for fast and accurate super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., pages 624–632, 2017. 1,
3, 5, 7
[30] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta,
Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-
resolution using a generative adversarial network. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4681–
4690, 2017. 1, 3, 5, 10, 11, 12, 13
[31] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual
networks for single image super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages
136–144, 2017. 1, 3, 5, 7
[32] Wei Liu, Chaofeng Chen, Kwan-Yee K Wong, Zhizhong Su, and Junyu Han. STAR-Net: A spatial attention
residue network for scene text recognition. In Brit. Mach. Vis. Conf., volume 2, page 7, 2016. 4
[33] Canjie Luo, Lianwen Jin, and Zenghui Sun. MORAN: A multi-object rectified attention network for scene
text recognition. Pattern Recognition, 90:109–118, 2019. 10, 11
[34] Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. DocVQA: A dataset for VQA on
document images. arXiv preprint arXiv:2007.00398, 2020. 1
[35] Sergio Montazzolli Silva and Claudio Rosito Jung. License plate detection and recognition in unconstrained
scenarios. In Eur. Conf. Comput. Vis., pages 580–596, 2018. 1
[36] Yongqiang Mou, Lei Tan, Hui Yang, Jingying Chen, Leyuan Liu, Rui Yan, and Yaohong Huang. PlugNet:
Degradation aware scene text recognition supervised by a pluggable super-resolution unit. In Eur. Conf.
Comput. Vis., 2020. 3
[37] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada
Pal, Christophe Rigaud, Joseph Chazalon, et al. Icdar2017 robust reading challenge on multi-lingual scene
text detection and script identification-rrc-mlt. In Int. Conf. Doc. Anal. Recog., volume 1, pages 1454–1459.
IEEE, 2017. 14
[38] Clément Peyrard, Moez Baccouche, Franck Mamalet, and Christophe Garcia. ICDAR2015 competition on
text image super-resolution. In Int. Conf. Doc. Anal. Recog., pages 1201–1205. IEEE, 2015. 3
[39] Zhi Qiao, Yu Zhou, Dongbao Yang, Yucan Zhou, and Weiping Wang. SEED: Semantics enhanced encoder-
decoder framework for scene text recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13528–
13537, 2020. 11
[40] Yuhui Quan, Jieting Yang, Yixin Chen, Yong Xu, and Hui Ji. Collaborative deep learning for super-resolving
blurry text images. IEEE Transactions on Computational Imaging, 6:778–790, 2020. 3
[41] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based se-
quence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell.,
39(11):2298–2304, 2016. 2, 3, 4, 8, 9, 10, 11, 13
[42] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. ASTER:
An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell.,
41(9):2035–2048, 2018. 10, 11, 12
[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. In IEEE Conf. Comput. Vis. Pattern Recog., 2014. 1
[44] Hanh TM Tran and Tien Ho-Phuoc. Deep Laplacian pyramid network for text images super-resolution. In
Int. Conf. Comput. and Commun. Technol., pages 1–6. IEEE, 2019. 1, 2, 5
[45] Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In Int. Conf. Comput.
Vis., pages 1457–1464. IEEE, 2011. 8, 11
[46] Wenjia Wang, Enze Xie, Xuebo Liu, Wenhai Wang, Ding Liang, Chunhua Shen, and Xiang Bai. Scene text
image super-resolution in the wild. arXiv preprint arXiv:2005.03341, 2020. 2, 3, 5, 7, 8, 9, 10, 11, 12, 13,
14, 15
[47] Wenjia Wang, Enze Xie, Peize Sun, Wenhai Wang, Lixun Tian, Chunhua Shen, and Ping Luo. TextSR:
Content-aware text super-resolution guided by recognition. arXiv preprint arXiv:1909.07113, 2019. 1, 2, 3,
5, 7
[48] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-
resolution by deep spatial feature transform. In IEEE Conf. Comput. Vis. Pattern Recog., pages 606–615,
2018. 2, 3
[49] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From
error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004. 3
[50] Qi Xie, Minghao Zhou, Qian Zhao, Deyu Meng, Wangmeng Zuo, and Zongben Xu. Multispectral and
hyperspectral image fusion by ms/hs fusion net. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1585–
1594, 2019. 8
[51] Xiangyu Xu, Deqing Sun, Jinshan Pan, Yujin Zhang, Hanspeter Pfister, and Ming-Hsuan Yang. Learning to
super-resolve blurry face and text images. In Proceedings of the IEEE international conference on computer
vision, pages 251–260, 2017. 3
[52] Xuaner Zhang, Qifeng Chen, Ren Ng, and Vladlen Koltun. Zoom to learn, learn to zoom. In IEEE Conf.
Comput. Vis. Pattern Recog., pages 3762–3770, 2019. 3
[53] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using
very deep residual channel attention networks. In Eur. Conf. Comput. Vis., pages 286–301, 2018. 1, 3
[54] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-
resolution. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2472–2481, 2018. 1, 3, 5, 7, 10, 11, 12,
13