
Convolutional Character Networks

Linjie Xing1,2, Zhi Tian3, Weilin Huang∗1,2, and Matthew R. Scott1,2

1 Malong Technologies, Shenzhen, China
2 Shenzhen Malong Artificial Intelligence Research Center, Shenzhen, China
3 University of Adelaide, Australia

Abstract

Recent progress has been made on developing a unified framework for joint text detection and recognition in natural images, but existing joint models were mostly built on two-stage frameworks involving RoI pooling, which can degrade the performance on recognition tasks. In this work, we propose convolutional character networks ("CharNet"), a one-stage model that can process the two tasks simultaneously in one pass. CharNet directly outputs bounding boxes of words and characters, with corresponding character labels. We use the character as the basic element, which allows us to overcome the main difficulty of existing approaches that attempt to optimize text detection jointly with an RNN-based recognition branch. In addition, we develop an iterative character detection approach able to transfer the character detection capability learned from synthetic data to real-world images. These technical improvements result in a simple, compact, yet powerful one-stage model that works reliably on multi-orientation and curved text. We evaluate CharNet on three standard benchmarks, where it consistently outperforms the state-of-the-art approaches [26, 25] by a large margin, e.g., with improvements of 65.33%→71.08% (with a generic lexicon) on ICDAR 2015 and 54.0%→69.23% on Total-Text for end-to-end text recognition. Code is available at: https://github.com/MalongTech/research-charnet.

Figure 1: The proposed CharNet can directly output bounding boxes of words and characters, with corresponding character labels, in one pass.

∗ Corresponding author: [email protected].

1. Introduction

Text reading in natural images has long been considered as two separate tasks, text detection and recognition, which are implemented sequentially. The two tasks have been advanced individually by the success of deep neural networks. Text detection aims to predict a bounding box for each text instance (e.g., typically a word) in natural images, and current leading approaches are mainly extended from object detection or segmentation frameworks, such as [26, 41, 25]. Built on text detection, the goal of text recognition is to recognize a sequence of character labels from a cropped image patch containing a text instance. Generally, it can be cast as a sequence labeling problem, for which various recurrent models with CNN-extracted features have been developed, achieving state-of-the-art performance [34, 4, 32, 10].

However, this two-step pipeline suffers from a number of limitations. First, learning the two tasks independently leads to sub-optimal solutions, making it difficult to fully exploit the nature of text. For example, text detection and recognition can work collaboratively by providing strong context and complementary information to each other, which is critical to improving performance, as substantiated by recent work [12, 25]. Second, it often requires multiple sequential steps, resulting in a relatively complicated system in which the performance of text recognition relies heavily on the text detection results.

Recent effort has been devoted to developing a unified framework that implements text detection and recognition simultaneously [12, 25, 26]. For example, in [12] and [25], text detection models were extended to joint detection and recognition by adding a new RNN-based branch for recognition, leading to state-of-the-art performance on end-to-end (E2E) text recognition. These approaches achieve joint detection and recognition with a single model, but they belong to the family of two-stage frameworks and thus have the following limitations. Firstly, the recognition branch often relies on an RNN-based sequential model, which is difficult to optimize jointly with the detection task and requires a significantly larger amount of training samples. The performance therefore depends heavily on a well-designed but complicated training scheme (e.g., [12] and [21]). This is the central issue that impedes the development of a unified framework. Secondly, current two-stage frameworks commonly involve RoI cropping and pooling, making it difficult to crop an accurate text region for feature pooling, so a large amount of background information may be included. This inevitably leads to significant performance degradation on the recognition task, particularly for multi-orientation or curved text.

To overcome the limitations of RoI cropping and pooling in two-stage frameworks, He et al. [12] proposed a text-alignment layer to precisely compute the convolutional features for a text instance of arbitrary orientation, which boosted the performance. In [25], multiple affine transformations were applied to the convolutional features to enhance text information in the RoI regions. However, these methods fail to work on curved text. In addition, many high-performance models consider words (for English) as detection units, but word-level detection often requires casting text recognition as a sequence labelling problem, where an RNN model with additional modules, such as CTC [6, 11, 33] or an attention mechanism [34, 4, 1, 12], is applied. Unlike in English, words are not clearly distinguishable in some languages such as Chinese, where text instances can be defined and separated more clearly by characters. Therefore, characters are more clearly-defined elements that generalize better over various languages. Importantly, character recognition is straightforward, and can be implemented with a simple CNN model rather than an RNN-based sequential model.

Contributions. In this work, we present Convolutional Character Networks (referred to as CharNet) for joint text detection and recognition, using the character as the basic unit. Moreover, for the first time, we provide a one-stage CNN model for the joint tasks, with significant performance improvements over the state-of-the-art results achieved by more complex two-stage frameworks, such as [12], [26] and [25]. The proposed CharNet implements direct character detection and recognition, jointly with text instance (e.g., word) detection. This allows it to avoid RNN-based word recognition, resulting in a simple, compact, yet powerful model that directly outputs bounding boxes of words and characters, as well as the corresponding character labels, as shown in Fig. 1. Our main contributions are summarized as follows.

Firstly, we propose a one-stage CharNet for joint text detection and recognition, where a new branch for direct character detection and recognition is introduced; it can be integrated seamlessly into existing text detection frameworks. We use the character as the basic unit, which allows us to overcome the main limitations of current two-stage frameworks that combine RoI pooling with RNN-based recognition.

Secondly, we develop an iterative character detection method which allows CharNet to transfer the character detection capability learned from synthetic data to real-world images. This makes it possible to train CharNet on real-world images without additional char-level bounding box annotations.

Thirdly, CharNet consistently outperforms recent two-stage approaches such as [12, 26, 25, 36] by a large margin, with improvements of 65.33%→71.08% (generic lexicon) on ICDAR 2015 and 54.0%→69.23% (E2E) on Total-Text. In particular, it achieves comparable results, e.g., 67.24% on ICDAR 2015, even with the lexicon completely removed.

2. Related Work

Traditional approaches often regard text detection and recognition as two separate tasks that are processed sequentially [15, 16, 37, 41, 10, 33]. Recent progress has been made on developing a unified framework for joint text detection and recognition [12, 25, 26]. We briefly review related studies on text detection, text recognition, and the joint modeling of the two tasks.

Text detection. Recent approaches for text detection were mainly built on general object detectors with various text-specific modifications. For instance, building on Region Proposal Networks [30], Tian et al. [37] proposed a Connectionist Text Proposal Network (CTPN) to explore the sequential nature of text and detect a text instance as a sequence of fine-scale text proposals. Similarly, Shi et al. [31] developed a segment-linking method which also localizes a text instance as a sequence, with the capability to detect multi-oriented text. In [41], EAST was introduced, exploring an IoU loss [39] to detect multi-oriented text instances (e.g., words), with impressive results achieved. Recently, a single-shot text detector (SSTD) [9] was proposed by extending the SSD object detector [23] to text detection. SSTD encodes text regional attention into convolutional features to enhance text information.

Text recognition. Inspired by speech recognition, recent work on text recognition commonly casts it as a sequence-to-sequence recognition problem, where recurrent neural networks (RNNs) are employed.

Figure 2: Overview of the proposed CharNet, which contains two branches working in parallel: a character branch for direct character detection and recognition, and a detection branch for text instance detection.

For example, He et al. [10] exploited convolutional neural networks (CNNs) to encode a raw input image into a sequence of deep features; an RNN is then applied to the sequential features for decoding, yielding confidence maps, and connectionist temporal classification (CTC) [6] is applied to generate the final results. Shi et al. [32] improved this CNN+RNN+CTC framework by making it end-to-end trainable, with significant performance gains. Recently, the framework was further improved by introducing various attention mechanisms, which are able to encode more character information explicitly or implicitly [34, 4, 1, 12].

End-to-end (E2E) text recognition. Recent work has attempted to integrate text detection and recognition into a unified framework for E2E text recognition. Li et al. [21] drew inspiration from Faster R-CNN [30] and employed RoI pooling to obtain text features from a detection framework for further recognition. In [12], He et al. proposed an E2E framework that introduces a new text-alignment layer with a character attention mechanism, leading to significant performance improvements by jointly training the two tasks. A similar framework was developed by Liu et al. in [25]. Both works achieved strong performance on E2E text recognition, but they were built on two-stage models implementing RoI cropping and pooling operations, which may reduce the performance, particularly on the recognition task for multi-orientation or curved text.

Our work is related to character-based approaches for text detection or recognition. Hu et al. proposed WordSup, which is able to detect text instances at the character level [14], while Liu et al. [24] developed a character-aware neural network for distorted scene text recognition. However, they did not provide a full solution for E2E text recognition. The most closely related work is Mask TextSpotter [26], a two-stage character-based framework for E2E recognition built on the recent Mask R-CNN. However, our CharNet has a number of clear distinctions: (1) CharNet is the first one-stage model for E2E text recognition, which is different from the two-stage Mask TextSpotter, where RoI cropping and pooling operations are required; (2) CharNet has a character branch that directly outputs accurate char-level bounding boxes. This enables it to automatically identify characters, allowing it to work in a weakly-supervised manner with the proposed iterative character detection; (3) this results in the distinct capability of training CharNet without additional char-level bounding boxes on real-world images, while Mask TextSpotter requires full char-level annotations, which are often highly expensive; (4) CharNet achieves consistent and significant performance improvements over Mask TextSpotter, as shown in Tables 4 and 5.

3. Convolutional Character Networks

In this section, we describe the proposed CharNet in detail. An iterative character detection method is then introduced for automatically identifying characters with bounding boxes in real-world images by leveraging synthetic data. In this work, we use "text instance" as a higher-level concept for text, which can be a word or a text-line, with multi-orientation or curved shape.

3.1. Overview

As discussed, existing approaches for E2E text recognition are commonly limited by the use of RoI cropping and pooling together with an RNN-based sequential model for word recognition. The proposed CharNet is a one-stage convolutional architecture consisting of two branches: (1) a character branch designed for direct character detection and recognition, and (2) a text detection branch predicting a bounding box for each text instance in an image. The two branches are implemented in parallel and together form a one-stage model for joint text detection and recognition, as shown in Fig. 2. The character branch can be integrated seamlessly into a one-stage text detection framework, resulting in an end-to-end trainable model. Training the model requires both instance-level and char-level bounding boxes with character labels as supervision. At inference, CharNet directly outputs both instance-level and char-level bounding boxes, with the corresponding character labels, in one pass.

Many existing text databases do not include char-level annotations, which are highly expensive to obtain. We develop an iterative learning approach for automatic character detection, which allows us to learn a character detector from synthetic data, where full char-level annotations can be generated unlimitedly. The learned character detection capability is then transferred and adapted gradually to real-world images. This gives the model the ability to automatically identify characters in real-world images, providing a weakly-supervised learning mechanism for CharNet.

Backbone networks. We employ ResNet-50 [8] and Hourglass [20] networks as backbones for our CharNet framework. For ResNet-50, we follow [41] and use the convolutional feature maps with a 4× down-sampling ratio as the final convolutional maps on which text detection and recognition are implemented. This results in high-resolution feature maps that enable CharNet to identify extremely small-scale text instances. For the Hourglass networks, we stack two hourglass modules, as shown in Fig. 2, and the final feature maps are up-sampled to 1/4 resolution of the input image. In this work, we use two variants of Hourglass networks, Hourglass-88 and Hourglass-57. Hourglass-88 is modified from Hourglass-104 in [20] by removing two down-sampling stages and halving the number of layers in the last stage of each hourglass module. Hourglass-57 is constructed by further halving the number of layers in each stage of the hourglass modules. Note that, for both variants, we do not employ the intermediate supervision used in CornerNet [20].
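To make the parallel-branch layout concrete, the following is a minimal PyTorch-style sketch of the one-stage forward pass. It is an illustration only, not the released implementation; the module names (backbone, char_branch, det_branch) are placeholders for the components described in Sections 3.1-3.3.

```python
import torch
import torch.nn as nn

class CharNetSketch(nn.Module):
    """Minimal sketch of the one-stage layout: one shared backbone, two parallel branches."""

    def __init__(self, backbone: nn.Module, char_branch: nn.Module, det_branch: nn.Module):
        super().__init__()
        self.backbone = backbone        # ResNet-50 or an Hourglass variant (Sec. 3.1)
        self.char_branch = char_branch  # character detection + recognition (Sec. 3.2)
        self.det_branch = det_branch    # text instance detection (Sec. 3.3)

    def forward(self, images: torch.Tensor):
        # Shared feature maps at 1/4 of the input resolution.
        feats = self.backbone(images)
        # The two branches run in parallel on the same features, in a single pass.
        char_out = self.char_branch(feats)  # char-level boxes and 68-way labels
        inst_out = self.det_branch(feats)   # word / text-line bounding boxes
        return char_out, inst_out
```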
3.2. Character Branch

Existing RNN-based recognition methods are commonly built on word-level optimization with a sequential model, which has a significantly larger search space than direct character classification. This inevitably makes the models more complicated and more difficult to train, requiring significantly longer training with a large amount of training samples. Recent work, such as [34, 4, 12], has shown that the performance of RNN-based methods can be improved considerably by introducing a char-level attention mechanism that encodes strong character information implicitly or explicitly. This enables the models to identify characters more accurately, and essentially adds constraints that reduce the search space, leading to a performance boost. It suggests that precise identification of characters is of great importance even to RNN-based text recognition, which inspired the current work to simplify the task into direct character recognition with an automatic character localization mechanism, resulting in a simple yet powerful one-stage fully convolutional model for E2E text recognition.

To this end, we introduce a new character branch that performs direct character detection and recognition. The character branch uses the character as the basic unit for detection and recognition, and outputs char-level bounding boxes as well as the corresponding character labels. Specifically, the character branch is a stack of convolutional layers applied densely over the final feature maps of the backbone. Its input feature maps have 1/4 the spatial resolution of the input image. The branch contains three sub-branches, for text instance segmentation, character detection and character recognition, respectively. The text instance segmentation sub-branch and the character detection sub-branch have three convolutional layers, with filter sizes of 3×3, 3×3 and 1×1, respectively. The character recognition sub-branch has four convolutional layers, with one more 3×3 convolutional layer.

The text instance segmentation sub-branch uses a binary mask as supervision and outputs 2-channel feature maps indicating the text or non-text probability at each spatial location. The character detection sub-branch outputs 5-channel feature maps, estimating a character bounding box at each spatial location. Following EAST [41], each character bounding box is parameterized by five parameters, indicating the distances from the current location to the top, bottom, left and right sides of the bounding box, as well as the orientation of the bounding box. In the character recognition sub-branch, character labels are predicted densely over the input feature maps, generating 68-channel probability maps. Each channel is a probability map for a specific character class among 68 classes, including 26 English characters, 10 digits and 32 special symbols. All output feature maps from the three sub-branches have the same spatial resolution, which is exactly that of the input feature maps (1/4 of the input image). Finally, the char-level bounding boxes are generated by keeping the bounding boxes with a confidence value over 0.95. Each generated bounding box has a corresponding character label, which is computed at the corresponding spatial location from the 68-channel classification maps by taking the maximum of the computed softmax scores.
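A rough PyTorch rendering of the three sub-branches and the read-out just described is sketched below. It only follows the stated filter sizes and output channels; the hidden width (128 channels), the use of ReLU, and which score serves as the 0.95 confidence are our assumptions, not details given in the text.

```python
import torch
import torch.nn as nn

def head(in_ch: int, mid_ch: int, out_ch: int, extra_3x3: bool = False) -> nn.Sequential:
    """Stack of 3x3, (optional extra 3x3,) 3x3 and final 1x1 convolutions."""
    layers = [nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
    if extra_3x3:
        layers += [nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
               nn.Conv2d(mid_ch, out_ch, 1)]
    return nn.Sequential(*layers)

class CharacterBranch(nn.Module):
    """Three dense sub-branches over the 1/4-resolution backbone features."""

    def __init__(self, in_ch: int = 256, mid_ch: int = 128, num_classes: int = 68):
        super().__init__()
        self.seg = head(in_ch, mid_ch, 2)                  # text / non-text probability
        self.det = head(in_ch, mid_ch, 5)                  # 4 side distances + orientation
        self.rec = head(in_ch, mid_ch, num_classes, True)  # recognition: one extra 3x3 layer

    def forward(self, feats: torch.Tensor):
        return self.seg(feats), self.det(feats), self.rec(feats)

def read_out_characters(seg_logits, det_maps, rec_logits, conf_thresh: float = 0.95):
    """Keep locations whose confidence exceeds 0.95 and attach the argmax character label."""
    # Assumption: the text probability from the segmentation sub-branch is used as confidence.
    text_prob = seg_logits.softmax(dim=1)[:, 1]        # (N, H, W)
    keep = text_prob > conf_thresh
    labels = rec_logits.softmax(dim=1).argmax(dim=1)   # (N, H, W) dense labels over 68 classes
    return keep, det_maps, labels  # geometry decoding into boxes is sketched in Sec. 3.3
```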
Training the character branch requires char-level bounding boxes with the corresponding character labels. Compared to word-level annotations, acquiring char-level labels with bounding boxes is much more expensive and would significantly increase labeling cost. To avoid this additional cost, we develop an iterative character detection mechanism, described in Section 3.4.

3.3. Text Detection Branch

The text detection branch is designed to identify text instances at a higher-level concept, such as words or text-lines. It provides strong context information, which is used to group the detected characters into text instances; directly grouping characters from character information alone (e.g., character locations or geometric features) is heuristic and complicated when multiple text instances are located close together within a region, particularly for text instances with multiple orientations or a curved shape. Our text detection branch can take different forms depending on the type of text instances, and existing instance-level text detectors can be adapted with minimal modification. We take text detectors for multi-orientation words and for curved text-lines as examples.

Multi-Orientation Text. We simply adapt the EAST detector [41] as our text detection branch, which contains two sub-branches for text instance segmentation and instance-level bounding box regression using an IoU loss. The predicted bounding boxes are parameterized by five parameters: four scalars for a bounding box plus an orientation angle. We compute dense predictions at each spatial location of the feature maps using two 3×3 convolutional layers followed by a 1×1 convolutional layer. Finally, the text detection branch outputs 2-channel feature maps indicating the text or non-text probability, and 5-channel detection maps for bounding boxes with orientation angles. We keep the bounding boxes with a confidence value over 0.95.
the bounding boxes having a confident value over 0.95. nese data can work reasonably on other languages, which
Curved Text. For curved text, we modify Textfield in inspired us to explore the generalization ability of a charac-
[38] by using a direction field, which encodes the direction ter detector to bridge the gap between the two domains.
information that points away from text boundary. The di- Our intuition is to gradually improve the generalization
rection field is used to separate adjacent text instances, and capability of model which is initially trained from synthetic
can be predicted by a new branch in parallel with text detec- images where full char-level annotations are provided, and
tion branch and character branch. This branch is composed the key is to transform the capability of character detec-
of two 3 × 3 convolutional layers, followed by another 1 × 1 tion learned from the synthetic data to real-world images.
convolutional layer. We develop an iterative process by gradually identifying the
Generation of Final Results. The predicted instance- “correct” char-level bounding boxes from real-world im-
level bounding boxes are applied to group the generated ages by the model itself. We make use of a simple rule
characters into text instances. We make use of a simple that identifies a group of char-level bounding boxes as “cor-
rule, by assigning a character to a text instance if the char- rect” if the number of character bounding boxes in a text
acter bounding box have an overlap (e.g., with > 0 IoU) instance is exactly equal to the number of character labels
with an instance-level bounding box. The final outputs of in the provided instance-level transcript. Note that instance-
our CharNet are bounding boxes of both text instances and level transcripts (e.g., words) are often provided in existing
characters, with the corresponding character labels. datasets for E2E text recognition. The proposed iterative
character detection are described as follows.
3.4. Iterative Character Detection
– (i) We first train an initial model on synthetic data,
Training our model requires both char-level and word- where both char-level and instance-level annotations
level bounding boxes as well as the corresponding charac- are available to our CharNet. Then we apply the
ter labels. However, char-level bounding boxes are expen- trained model to the training images from a real-world
sive to obtain and are not available in many existing bench- dataset, where char-level bounding boxes are predicted
mark datasets such as ICDAR 2015 [18] and Total-Text [5]. by the learned model.
We develop an iterative character detection method that en-
ables our model to have capability for identifying charac- – (ii) We explore the aforementioned rule to collect the

– (i) We first train an initial model on synthetic data, where both char-level and instance-level annotations are available to our CharNet. We then apply the trained model to the training images of a real-world dataset, where char-level bounding boxes are predicted by the learned model.

– (ii) We use the aforementioned rule to collect the "correct" char-level bounding boxes detected in the real-world images, which are used to further train the model with the corresponding transcripts provided. Note that we do not use the predicted character labels, which are not fully correct and would reduce the performance in our experiments.

– (iii) This process is implemented iteratively to gradually enhance the model's capability for character detection, which in turn continuously improves the quality of the identified characters, with an increasing number of "correct" char-level bounding boxes generated, as shown in Fig. 3 and Table 2.

Step   # Words   Ratio (%)   E2E    # Epochs
0      6033      64.95       39.3   5
1      8262      88.94       62.9   100
2      8494      91.44       65.0   400
3      8606      92.65       66.1   800

Table 2: 4-step iterative character detection with CharNet. "# Words" is the number of words identified as "correct" at each step of iterative learning. "Ratio" denotes the ratio of the "correct" words to all words in the training images of Total-Text. "# Epochs" indicates the number of training epochs for each iterative step. At Step 0, CharNet is trained on synthetic data for 5 epochs, while Steps 1-3 are implemented on real-world images. "E2E" means "End-to-End Recognition with F-measure".

Figure 3: Character bounding boxes generated at the 4 iterative steps, from left to right. Red boxes indicate the "correct" ones identified by our rule, while blue boxes indicate invalid ones, which are not collected for training in the next step.
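The selection rule used in step (ii) can be written compactly. The sketch below only mirrors the counting rule stated above; it assumes the group_characters helper from the Section 3.3 sketch and a hypothetical list of (word_box, transcript) annotations.

```python
def collect_correct_words(pred_char_boxes, word_annotations, group_fn):
    """Keep a word's predicted character boxes only if their count matches its transcript length.

    word_annotations: list of (word_box, transcript) pairs from the real-world dataset.
    group_fn: assigns predicted character boxes to word boxes (e.g. group_characters).
    """
    grouped = group_fn(pred_char_boxes, [None] * len(pred_char_boxes),
                       [box for box, _ in word_annotations])
    training_samples = []
    for (word_box, transcript), chars in zip(word_annotations, grouped):
        if len(chars) == len(transcript):            # the "correct" rule
            char_boxes = [box for box, _ in chars]
            # Predicted labels are discarded; the provided transcript supplies the labels.
            training_samples.append((word_box, char_boxes, transcript))
    return training_samples
```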
4. Experiments, Results and Comparisons

Our CharNet is evaluated on three standard benchmarks: ICDAR 2015 [18], Total-Text [5], and ICDAR MLT 2017 [28]. ICDAR 2015 includes 1,500 images collected using Google Glass. The training set has 1,000 images, and the remaining 500 images are used for evaluation. This dataset is challenging due to the presence of multi-oriented and very small-scale text instances. Total-Text consists of 1,555 images with a variety of text types, including horizontal, multi-oriented, and curved text instances. The training and testing splits have 1,255 images and 300 images, respectively. ICDAR MLT 2017 is a large-scale multi-lingual text dataset, which contains 7,200 training images, 1,800 validation images, and 9,000 testing images, covering 9 languages in total.

4.1. Implementation Details

Similar to recent work [12, 25], our CharNet is trained on both synthetic and real-world data. The proposed iterative character detection is implemented with 4 iterative steps. In the first step, CharNet is trained on the synthetic data Synth800k [7] for 5 epochs, where both char-level and word-level annotations are available. We use a mini-batch of 32 images, with 4 images per GPU. On the synthetic data, we set a base learning rate of 0.0002, which is reduced according to lr_base × (1 − iter/max_iter)^power with power = 0.9, following [3]. The remaining three iterative steps are implemented on real-world data, by training CharNet for 100, 400 and 800 epochs, respectively, on the training set of the given benchmark, e.g., ICDAR 2015 [18] or Total-Text [5]. On the real-world data, we set a base learning rate of 0.002, and use the char-level bounding boxes generated by the model trained in the previous step. We use similar data augmentation to [25], together with OHEM [35].
4.2. On Iterative Character Detection

Iterative character detection is an important component of CharNet, since it allows us to train the model on real-world images using only text instance-level annotations. Accurate identification of characters is thus critical to the performance of CharNet. We evaluate iterative character detection with various backbone networks on ICDAR 2015. Results are reported in Table 1. As can be seen, CharNet has low performance on both text detection and E2E recognition when the model trained on synthetic data is applied directly to the test images of ICDAR 2015, due to the large domain gap between the two data sets. The performance is improved considerably by training CharNet on real-world data with iterative character detection, which demonstrates its effectiveness.

We further investigate the capability of our model to identify the "correct" characters in real-world images. Experiments were conducted on Total-Text. In this experiment, the "correct" characters are grouped into words, and we count the number of correctly-detected words at each iterative step. As shown in Table 2, at step 0, when CharNet is trained only on synthetic data, only 64.95% of the words are identified as "correct" in the real-world training images. Interestingly, this number increases immediately from 64.95% to 88.94% at step 1, when the proposed iterative character detection is applied. This also leads to a significant performance improvement, from 39.3% to 62.9% on E2E text recognition. The iterative training continues until the number of identified words does not increase further. Finally, our method collects 92.65% of the correct words from the real-world images after 4 iterative steps in total. We argue that this amount of char-level annotation, learned automatically by the model, is sufficient to train our CharNet, as evidenced by the state-of-the-art performance reported next.

4.3. Results on Text Detection

We evaluate the performance of CharNet on the text detection task. For a fair comparison, we use the same ResNet-50 backbone as FOTS [25]. As shown in Table 3, our CharNet achieves performance comparable to FOTS when both methods are trained without a recognition branch. By jointly optimizing the model with text recognition, CharNet improves its detection performance by 4.13%, from an F-measure of 85.57% to 89.70%, which is a larger gain than the 2.68% achieved by FOTS. This suggests that our one-stage model allows text detection and recognition to work more effectively and collaboratively, giving CharNet a higher capability for identifying extremely challenging text instances with stronger robustness, which also reduces false detections, as shown in Fig. 4. In addition, CharNet improves the F-measure from 87.00% to 89.70% over [12], which uses a PVAnet [19] backbone with multi-scale inference.

Method          Rec.   R      P      F      Gain
He et al. [12]         83.00  84.00  83.00  -
He et al. [12]  X      86.00  87.00  87.00  +4.00
FOTS [25]              82.04  88.84  85.31  -
FOTS [25]       X      85.17  91.00  87.99  +2.68
CharNet                81.37  90.23  85.57  -
CharNet         X      88.30  91.15  89.70  +4.13

Table 3: Detection performance on ICDAR 2015. ResNet-50 was used as the backbone by both FOTS and CharNet, while PVAnet [19] was applied in [12]. "Rec." denotes "Recognition". "Gain" is the performance gain obtained by joint optimization with text recognition. "R", "P", "F" indicate "Recall", "Precision", "F-measure".

Figure 4: CharNet improves both recall and precision on text detection by jointly learning with character recognition.

Moreover, our one-stage CharNet achieves new state-of-the-art text detection performance on all three benchmarks, improving over recent strong baselines (e.g., He et al. [12], FOTS [25] and TextField [38]) by a large margin. For example, in the single-scale case, the improvements in F-measure are 87.99%→90.97% on ICDAR 2015 (Table 4), 80.3%→85.6% on Total-Text for curved text (Table 5), and 67.25%→75.77% on ICDAR 2017 MLT (Table 6). Notice that CharNet is designed with characters as the basic unit. This natural property allows it to be easily adapted to curved text, where FOTS struggles to work reliably. TextField was designed specifically for curved text but only reaches an F-measure of 82.4% on ICDAR 2015. Several examples of detecting challenging text instances are presented in Fig. 5.

4.4. Results on End-to-End Text Recognition

For the E2E text recognition task, we compare our CharNet with recent state-of-the-art methods on ICDAR 2015 [18] and Total-Text [5].

ICDAR 2015. As shown in Table 4, using the same ResNet-50 backbone, our CharNet obtains results comparable to Mask TextSpotter [26]. However, Mask TextSpotter gains significant performance improvements by using additional char-level manual annotations on real-world images, together with a weighted edit distance applied to a lexicon, e.g., 76.1%→79.3% (S), 67.1%→73.0% (W) and 56.7%→62.4% (G) on E2E recognition. Furthermore, CharNet also outperforms FOTS by 1.38% with the generic lexicon. Unlike FOTS, which uses a heavy recognition branch with 6.31M parameters, our one-stage model only employs a light-weight CNN-based character branch with 1.19M parameters. Importantly, our model can work reliably without a lexicon, reaching 60.72%, which is comparable to the result of FOTS with a generic lexicon (60.80%, Table 4). These lexicon-free results demonstrate the strong capability of our CharNet, making it better suited to real-world applications where a lexicon is not always available.

We further employ Hourglass-57 [20] as the backbone, which has a similar number of model parameters to FOTS (34.96M vs. 34.98M). As shown in Table 4, our CharNet outperforms FOTS by 6.12% with the generic lexicon. With the more powerful Hourglass-88, we set a new state-of-the-art single-scale performance on the benchmark, and improve on both Mask TextSpotter and FOTS considerably in all terms. Finally, with multi-scale inference, CharNet surpasses the previous best results [25] by a large margin, e.g., from 65.33% to 71.08% with the generic lexicon.

Detection
Method                  Params    R      P      F
Single Scale
WordSup [14]            -         77.03  79.33  78.16
EAST [41]               -         78.33  83.27  80.72
R2CNN [17]              -         79.68  85.62  82.54
Mask TextSpotter [26]*  -         81.00  91.60  86.00
FOTS R-50 [25]          34.98 M   85.17  91.00  87.99
CharNet R-50            26.48 M   88.30  91.15  89.70
CharNet H-57            34.96 M   88.88  90.45  89.66
CharNet H-88            89.21 M   89.99  91.98  90.97
Multi-Scale
He et al. MS [12]       -         86.00  87.00  87.00
FOTS R-50 MS [25]       34.98 M   87.92  91.85  89.84
CharNet R-50 MS         26.48 M   90.90  89.44  90.16
CharNet H-57 MS         34.96 M   91.43  88.74  90.06
CharNet H-88 MS         89.21 M   90.47  92.65  91.55

End-to-End Recognition
Method                      S      W      G      N
Single Scale
Neumann et al. [29]         35.00  20.00  16.00  -
Deep text spotter [2]       54.00  51.00  47.00  -
TextProp.+DictNet [13, 40]  53.30  49.61  47.18  -
Mask TextSpotter [26]*      79.30  73.00  62.40  -
FOTS R-50 [25]              81.09  75.90  60.80  -
CharNet R-50                80.14  74.45  62.18  60.72
CharNet H-57                81.43  77.62  66.92  62.79
CharNet H-88                83.10  79.15  69.14  65.73
Multi-Scale
He et al. MS [12]           82.00  77.00  63.00  -
FOTS R-50 MS [25]           83.55  79.11  65.33  -
CharNet R-50 MS             82.46  78.86  67.64  62.71
CharNet H-57 MS             84.07  80.10  69.21  65.26
CharNet H-88 MS             85.05  81.25  71.08  67.24

Table 4: Results on ICDAR 2015. "R-*" and "H-*" denote "ResNet-*" and "Hourglass-*". "MS" means multi-scale inference. "R", "P", "F" are "Recall", "Precision", "F-measure". "S", "W", "G" and "N" mean F-measure using "Strong", "Weak", "Generic" and "None" lexicon.

Total-Text. We conduct experiments on Total-Text to show the capability of our CharNet on curved text. We employ the protocol described in [5] to evaluate text detection performance, and follow the evaluation protocol presented in [26] for E2E recognition. No lexicon is used in E2E recognition. As shown in Table 5, CharNet outperforms current state-of-the-art methods by 5.9% F-measure on text detection and by 15.2% on E2E recognition. Compared to the character-based Mask TextSpotter [26], our CharNet obtains even larger performance improvements on curved text.

Method                 R     P     F     E2E
Textboxes [22]         45.5  62.1  52.5  36.3
Mask TextSpotter [26]  55.0  69.0  61.3  52.9
TextNet [36]           59.5  68.2  63.5  54.0
TextField [38]         79.9  81.2  80.6  -
CharNet H-57           81.0  88.6  84.6  63.6
CharNet H-88           81.7  89.9  85.6  66.6
CharNet H-57 MS        85.0  87.3  86.1  66.2
CharNet H-88 MS        85.0  88.0  86.5  69.2

Table 5: Results on Total-Text. "H-*" denotes "Hourglass-*". "MS" indicates multi-scale inference. "R", "P", "F" are "Recall", "Precision", "F-measure". "E2E" is "End-to-End Recognition using F-measure".

Method              R      P      F
SARI FDU RRPN [27]  55.50  71.17  62.37
SCUT DLVClab        54.54  80.28  64.96
FOTS [25]           57.51  80.95  67.25
FOTS MS [25]        62.30  81.86  70.75
CharNet R-50        70.10  77.07  73.42
CharNet H-88        70.97  81.27  75.77

Table 6: Text detection on ICDAR 2017 MLT. "R-*" and "H-*" denote "ResNet-*" and "Hourglass-*". "R", "P" and "F" represent "Recall", "Precision" and "F-measure". "MS" indicates multi-scale inference.

Figure 5: Full results by CharNet.

5. Conclusions

We have presented a one-stage CharNet for E2E text recognition. We introduce a new branch for direct character recognition, which can be integrated seamlessly into text detection frameworks. This results in the first one-stage fully convolutional model that implements the two tasks jointly, setting it apart from existing RNN-integrated two-stage frameworks. We demonstrate that with CharNet the two tasks can be trained more effectively and collaboratively, leading to significant performance improvements. Furthermore, we develop an iterative character detection method able to transfer the character detection capability learned from synthetic data to real-world images. Additionally, CharNet is compact, with fewer parameters, and works reliably on curved text. Extensive experiments on ICDAR 2015, ICDAR 2017 MLT and Total-Text show that CharNet consistently outperforms existing approaches by a large margin.

References

[1] Fan Bai, Zhanzhan Cheng, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Edit probability for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1508–1516, 2018.
[2] Michal Busta, Lukas Neumann, and Jiri Matas. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2204–2212, 2017.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834–848, 2018.
[4] Zhanzhan Cheng, Fan Bai, Yunlu Xu, Gang Zheng, Shiliang Pu, and Shuigeng Zhou. Focusing attention: Towards accurate text recognition in natural images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5076–5084, 2017.
[5] Chee Kheng Ch'ng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 935–942. IEEE, 2017.
[6] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
[7] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2315–2324, 2016.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[9] Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, and Xiaolin Li. Single shot text detector with regional attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), volume 6, 2017.
[10] Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, and Xiaoou Tang. Reading scene text in deep convolutional sequences. In Thirtieth AAAI Conference on Artificial Intelligence (AAAI), volume 16, pages 3501–3508, 2016.
[11] Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, and Xiaoou Tang. Reading scene text in deep convolutional sequences. In Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016.
[12] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5020–5029, 2018.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. WordSup: Exploiting word annotations for character based text detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[15] Weilin Huang, Zhe Lin, Jianchao Yang, and Jue Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
[16] Weilin Huang, Yu Qiao, and Xiaoou Tang. Robust scene text detection with convolution neural network induced MSER tree. In European Conference on Computer Vision (ECCV), 2014.
[17] Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017.
[18] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.
[19] Kye-Hyeon Kim, Sanghoon Hong, Byungseok Roh, Yeongjae Cheon, and Minje Park. PVANet: Lightweight deep neural networks for real-time object detection. arXiv preprint arXiv:1611.08588, 2017.
[20] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In European Conference on Computer Vision (ECCV), pages 734–750, 2018.
[21] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5238–5246, 2017.
[22] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. TextBoxes: A fast text detector with a single deep neural network. In Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017.
[23] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016.
[24] Wei Liu, Chaofeng Chen, and Kwan-Yee K. Wong. Char-Net: A character-aware neural network for distorted scene text recognition. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
[25] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. FOTS: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5676–5685, 2018.
[26] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. 2018.
[27] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 2018.
[28] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification (RRC-MLT). In 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1454–1459. IEEE, 2017.
[29] Lukáš Neumann and Jiří Matas. Real-time lexicon-free scene text localization and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(9):1872–1885, 2016.
[30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
[31] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. arXiv preprint arXiv:1703.06520, 2017.
[32] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(11):2298–2304, 2017.
[33] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(11):2298–2304, 2017.
[34] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
[35] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–769, 2016.
[36] Yipeng Sun, Chengquan Zhang, Zuming Huang, Jiaming Liu, Junyu Han, and Errui Ding. TextNet: Irregular text reading from images with an end-to-end trainable network. arXiv preprint arXiv:1812.09900, 2018.
[37] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision (ECCV), pages 56–72. Springer, 2016.
[38] Yongchao Xu, Yukang Wang, Wei Zhou, Yongpan Wang, Zhibo Yang, and Xiang Bai. TextField: Learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing (TIP), 2019.
[39] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. UnitBox: An advanced object detection network. In Proceedings of the 2016 ACM on Multimedia Conference (ACMM), pages 516–520. ACM, 2016.
[40] Zheng Zhang, Wei Shen, Cong Yao, and Xiang Bai. Symmetry-based text line detection in natural scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2558–2567, 2015.
[41] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2642–2651, 2017.
