
Dictionary-guided Scene Text Recognition

Nguyen Nguyen1, Thu Nguyen1,2,4, Vinh Tran6, Minh-Triet Tran3,4, Thanh Duc Ngo2,4, Thien Huu Nguyen1,5, Minh Hoai1,6

1 VinAI Research, Hanoi, Vietnam; 2 University of Information Technology, VNU-HCM, Vietnam; 3 University of Science, VNU-HCM, Vietnam; 4 Vietnam National University, Ho Chi Minh City, Vietnam; 5 University of Oregon, Eugene, OR, USA; 6 Stony Brook University, Stony Brook, NY, USA

{v.nguyennm, v.thunm15, v.thiennh4, v.hoainm}@vinai.io
[email protected], [email protected], [email protected]

Abstract

Language prior plays an important role in the way humans detect and recognize text in the wild. Current scene text recognition methods do use lexicons to improve recognition performance, but their naive approach of casting the output into a dictionary word based purely on the edit distance has many limitations. In this paper, we present a novel approach to incorporate a dictionary in both the training and inference stages of a scene text recognition system. We use the dictionary to generate a list of possible outcomes and find the one that is most compatible with the visual appearance of the text. The proposed method leads to a robust scene text recognition model, which is better at handling ambiguous cases encountered in the wild, and improves the overall performance of state-of-the-art scene text spotting frameworks. Our work suggests that incorporating language prior is a potential approach to advance scene text detection and recognition methods. Besides, we contribute VinText, a challenging scene text dataset for Vietnamese, where some characters are equivocal in the visual form due to accent symbols. This dataset will serve as a challenging benchmark for measuring the applicability and robustness of scene text detection and recognition algorithms. Code and dataset are available at https://github.com/VinAIResearch/dict-guided.

1. Introduction

Scene text detection and recognition is an important research problem with a wide range of applications, from mapping and localization to robot navigation and accessibility enhancement for the visually impaired. However, many text instances in the wild are inherently ambiguous due to artistic styles, weather degradation, or adverse illumination conditions. In many cases, the ambiguity cannot be resolved without reasoning about the language of the text.

In fact, one popular approach to improve the performance of a scene text recognition system is to use a dictionary and cast the predicted output as a word from the dictionary. The normal pipeline for processing an input image consists of: (1) detect text instances, (2) for each detected text instance, generate the most probable sequence of characters based on the local appearance of the text instance without a language model, and (3) find the word in the dictionary that has the smallest edit distance (also called Levenshtein distance [14]) to the generated sequence of characters and use this word as the final recognition output.

However, the above approach has three major problems. First, many text instances are foreign or made-up words that are not in the dictionary, so forcing the output to be a dictionary word will yield wrong outcomes in many cases. Second, there is no feedback loop in the above feed-forward processing pipeline; the language prior is not used in the second step for scoring and generating the most probable sequence of characters. Third, edit distance by itself is indeterminate and ineffective in many cases. It is unclear what to output when multiple dictionary words have the same edit distance to the intermediate output character sequence. Moreover, many languages have special symbols that have different roles than the main characters of the alphabet, so the uniform treatment of the symbols and characters in edit distance is inappropriate.

In this paper, we address the problems of the current scene text recognition pipeline by introducing a novel approach to incorporate a dictionary into the pipeline. Instead of forcing the predicted output to be a dictionary word, we use the dictionary to generate a list of candidates, which will subsequently be fed back into a scoring module to find the output that is most compatible with the appearance feature. One additional benefit of our approach is that we can incorporate the dictionary into the end-to-end training procedure,
training the recognition module with hard examples.

Empirically, we evaluate our method on several benchmark datasets including TotalText [3], ICDAR2013 [10], and ICDAR2015 [11], and find that our approach of using a dictionary yields benefits in both the training and inference stages. We also demonstrate the benefits of our approach for recognizing non-English text. In particular, we show that our approach works well for Vietnamese, an Austroasiatic language based on the Latin alphabet with additional accent symbols (acute, grave, hook above, dot below, and tilde) and derivative characters (ô, ê, â, ă, ơ, ư). Being the native language of 90 million people in Vietnam and 4.5 million Vietnamese immigrants around the world, Vietnamese texts appear in many scenes, so detecting and recognizing Vietnamese scene text is an important problem on its own. Vietnamese script is also similar to other scripts such as Portuguese, so an effective transfer learning technique for Vietnamese might be applicable to other languages as well. To this end, a contribution of our paper is the introduction of an annotated dataset for Vietnamese scene text, and our experiments on this dataset are a valuable demonstration of the benefits of the proposed language incorporation approach.

In summary, the contributions of our paper are twofold. First, we propose a novel approach for incorporating a language model into scene text recognition. Second, we introduce a dataset for Vietnamese scene text with 2,000 fully annotated images and 56K text instances.

2. Related Work

The ultimate task of our work is scene text spotting [4, 15, 17, 19, 24, 29, 31], which requires both detecting and recognizing detected text instances. However, the main technical focus of our work is on the recognition stage. Currently, there are two main approaches in the recognition stage. The first approach is based on character segmentation and recognition [2, 7, 9, 20, 31]; it requires segmenting a text region into individual characters for recognition. One weakness of this approach is that the characters are independently recognized, failing to incorporate a language model in the processing pipeline. The second approach is based on recurrent neural networks [26] with attention [6, 17, 18, 30] or CTC loss [5, 28, 34]. This approach decodes a text instance sequentially from the first to the last character; the most recently recognized character will be fed back to a recurrent neural network for predicting the next character in the text sequence. In theory, with sequential decoding, this approach can implicitly learn and incorporate a language model, similar to probabilistic language models in the natural language domain [12, 25, 27]. However, this approach cannot fully learn a language model due to the limited number of words appearing in the training images. Furthermore, because of the implicitness of the language model, there is no guarantee that the model will not output a nonsensical sequence of characters.

A dictionary is an explicit language model, and the benefits of a dictionary for scene text recognition are well established. In most previous works, a dictionary was used to ensure that the output sequence of characters is a legitimate word from the dictionary, and it improved the accuracy immensely. Furthermore, if one could correctly reduce the size of the dictionary (e.g., only considering words appearing in the dataset), the accuracy would increase further. All of these are evidence for the importance of the dictionary, and it does matter how the dictionary is used [32]. However, the current utilization of dictionaries based on the smallest edit distance [14] is too elementary. In this paper, we propose a novel method to incorporate a dictionary in both the training and testing phases, harnessing the full power of the dictionary.

Compared to the number of datasets for other visual recognition tasks such as image classification and object detection, there are few datasets for scene text spotting. Most datasets, including ICDAR2015 [11], TotalText [3], and CTW1500 [33], are for English only. Only the ICDAR2017 dataset [21] is multi-lingual with nine languages, which was recently expanded with an additional language to become ICDAR2019 [22]. However, this dataset also does not have Vietnamese. Our newly collected Vietnamese scene text dataset will contribute to the effort of developing robust multi-lingual scene text spotting methods.

3. Language-Aware Scene Text Recognition

To resolve the inherent ambiguity of scene text in the wild, we propose to incorporate a dictionary into the recognition pipeline. From the initial recognition output, we use the dictionary to generate a list of additional candidates, which will subsequently be evaluated by a scoring module to identify the output that is most compatible with the appearance feature. We also use the dictionary during the training stage to train the recognition module to recognize the correct text instance from a list of hard examples. In this section, we will describe the recognition pipeline and how the candidates are generated in detail. We will also describe the architecture of our network and the loss functions for training this network.

3.1. Recognition pipeline

Our scene text spotting system consists of two stages: detection and recognition. Given an input image, the detection stage will detect text instances in the image, which will then be passed to the recognition stage. The main focus of our paper is to improve the recognition stage, regardless of the detection algorithm. Specifically, in this paper, we propose to use the state-of-the-art detection modules of ABCNet [19] and MaskTextSpotterV3 [16], but other detection algorithms can also be used.
[Figure 1 diagrams: (a) the normal scene text recognition pipeline; (b) the proposed scene text recognition pipeline.]

Figure 1: Traditional recognition pipeline (a) and proposed pipeline (b). In the traditional pipeline, the output is forced to be in the dictionary. The dictionary is only used at the inference time, during a post-processing step. In the proposed approach, the dictionary is used for both inference and training. The dictionary is used to generate a list of candidates, and the candidates are evaluated by a compatibility scoring module. The final output does not have to be a word in the dictionary.

For brevity, we will describe our method together with the ABCNet framework in this section, but we will demonstrate the empirical benefits of our method with both ABCNet and MaskTextSpotterV3 in the experiment section.

Fig. 1b depicts the processing pipeline of the recognition stage of our method. Given a detected text instance x (delineated by two Bezier curves [19]), a fixed-size feature map v will be calculated (using the Bezier alignment module [19]). From v, we obtain an initial recognition output ŷ. We will then compile a list of candidate words y_1, . . . , y_k, which are dictionary words with the smallest edit distances to ŷ. We will then calculate the compatibility scores between each candidate word y_i and the feature map v, and output the word with the highest compatibility score.

During training, we also calculate the compatibility score between the appearance feature map v and the ground truth word y, which is used to calculate the appearance loss. We also minimize a contrastive loss, which is defined based on the compatibility scores between the feature map v and the list of candidate words y_1, . . . , y_k.

3.2. Candidate generation

We use a dictionary to generate a list of candidate words in both the inference and training phases. During inference, given the initial recognition output ŷ, the list of candidates is the k dictionary words with the smallest edit distance (Levenshtein distance [14]) to ŷ. For example, if ŷ = visan and k = 10, then the list of candidates will be: visas, vise, vised, vises, visi, vising, vision, visit, visor, vista. During training, we use both the ground truth word y and the initial recognition output ŷ to generate the list of candidate words, creating a list with a total of k words.
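To make the candidate-generation step concrete, the sketch below implements the exact-search variant in Python. It assumes the dictionary is available as a plain list of words; the helper names levenshtein and top_k_candidates are ours and not part of the released code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def top_k_candidates(y_hat: str, dictionary: list, k: int = 10) -> list:
    """Return the k dictionary words with the smallest edit distance to y_hat."""
    return sorted(dictionary, key=lambda w: (levenshtein(y_hat, w), w))[:k]


# Toy dictionary containing the ten candidates from the paper's example;
# with a full dictionary, the initial output "visan" yields this same list.
words = ["visas", "vise", "vised", "vises", "visi",
         "vising", "vision", "visit", "visor", "vista"]
print(top_k_candidates("visan", words, k=10))
```

Exact search scans the whole dictionary for every query; Section 3.4 discusses an approximate alternative based on dict-trie that trades a small amount of accuracy for speed.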
3.3. Training losses

To train our recognition network, we minimize an objective function that is the weighted combination of two losses. The first loss is defined based on the negative log likelihood of the ground truth. The second loss is defined for the list of candidate words to maximize the likelihood of the ones that are close to the ground truth while minimizing the likelihood of the candidates that are further away from the ground truth.

The negative log likelihood for a feature map v and a word y is calculated as follows. First, using a recurrent neural network with attention [1], we obtain a probability matrix P of size s×m, where m is the maximum length of a word and s is the size of the alphabet, including special symbols and characters (m = 25, s = 97 for English). Let y_j be the index of the j-th character of the word y; y_j ∈ {1, . . . , s}. The negative log likelihood for y is defined as:

l(y, v) = − Σ_{j=1}^{len(y)} log(P[y_j, j]),    (1)

where len(y) is the length of word y, and P[y_j, j] denotes the entry at row y_j and column j of matrix P.
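As a sanity check of Eq. (1), the following minimal sketch computes the negative log likelihood of a word from a probability matrix P, assumed here to be a NumPy array of shape s×m whose columns are probability distributions over the alphabet; the function name nll is ours.

```python
import numpy as np

def nll(P: np.ndarray, char_indices: list) -> float:
    """Eq. (1): l(y, v) = -sum_j log P[y_j, j], with y_j the row index of the j-th character."""
    return -sum(np.log(P[cj, j]) for j, cj in enumerate(char_indices))

# Toy example: alphabet of size s = 4, maximum word length m = 3,
# and a word whose characters have row indices [2, 0, 1].
P = np.array([[0.1, 0.7, 0.2],
              [0.1, 0.1, 0.6],
              [0.7, 0.1, 0.1],
              [0.1, 0.1, 0.1]])
print(nll(P, [2, 0, 1]))  # -(log 0.7 + log 0.7 + log 0.6)
```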
The second loss is defined based on the negative log likelihood of the candidate words and their edit distances to the ground truth word. We first convert the list of negative log likelihood values into a probability distribution:

L_i = exp(−l(y_i, v)) / Σ_{j=1}^{k} exp(−l(y_j, v)).    (2)

We also convert the list of edit distances into a probability distribution:

D_i = exp(−d(y_i, y)/T) / Σ_{j=1}^{k} exp(−d(y_j, y)/T),    (3)

where d(y_i, y) is the edit distance between candidate y_i and the ground truth word y. Finally, we compute the KL-divergence between the two probability distributions D and L:

KL(D||L) ∝ − Σ_{i=1}^{k} D_i log(L_i).    (4)

In Eq. (3), T is a tunable temperature parameter. T is a positive value that should be neither too large nor too small. When T is too small, the target probability distribution D has low entropy, and none of the candidate words, except the ground truth, would matter. When T is too large, the target probability distribution D has high entropy, and there is no contrast between good and bad candidates. In our experiments, T is set to 0.3.

The loss in Eq. (4) is formulated to maximize the likelihood of the candidates that are close to the ground truth, while minimizing the likelihood of the faraway candidates. We call this loss the contrastive loss because its goal is to contrast the candidates that are closer to the ground truth with the ones that are further away.

The total training loss of our recognition network is:

L(x, y) = l(y, v) + λ KL(D||L),    (5)

where λ is a hyperparameter that balances these two loss terms. We simply set λ = 1 in our experiments.
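To make the objective concrete, here is a minimal PyTorch-style sketch of Eqs. (2)-(5). It assumes the per-candidate negative log likelihoods l(y_i, v) and the edit distances d(y_i, y) have already been computed; the tensor names and the function contrastive_loss are our own illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(nll_candidates, edit_distances, T=0.3):
    """Eqs. (2)-(4): KL(D || L) up to a constant, computed as -sum_i D_i * log L_i."""
    log_L = F.log_softmax(-nll_candidates, dim=0)   # Eq. (2), in log space
    D = F.softmax(-edit_distances / T, dim=0)       # Eq. (3)
    return -(D * log_L).sum()                       # Eq. (4)

# Toy example with k = 4 candidates.
nll_candidates = torch.tensor([2.3, 0.8, 1.5, 3.0])  # l(y_i, v) for each candidate
edit_distances = torch.tensor([2.0, 0.0, 1.0, 3.0])  # d(y_i, y) to the ground truth
nll_ground_truth = torch.tensor(0.8)                 # l(y, v) from Eq. (1)

# Eq. (5) with lambda = 1.
total = nll_ground_truth + 1.0 * contrastive_loss(nll_candidates, edit_distances)
print(total)
```

Because the ground truth is included among the candidates (its edit distance is zero), a small temperature T concentrates the target distribution D on it, matching the behaviour of T described above.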
3.4. Network architecture and details

In the detection stage, we use the Bezier detection and alignment module from ABCNet [19]. The output of the detection stage is the input to the recognition stage, and it is a 3D feature tensor of size n×32×256, with n being the number of detected text instances. Each text instance is represented by a feature map of size 32×256, and we use a sequential decoding network with attention to output a probability matrix P of size s×25, where s is the size of the extended alphabet, including letters, numbers, and special characters. In our experiments, s = 97 for English and s = 106 for Vietnamese. Each column i of P is a probability distribution of the i-th character in the text instance. During inference, we use this matrix to produce the initial recognition output, which is the sequence of characters where each character is the one with the highest probability in each respective column of P.
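The greedy decoding step described above can be written in a few lines. The sketch below assumes P is a NumPy array of shape s×25 and that alphabet is a list of length s mapping row indices to characters, including an end-of-word/padding symbol; these names are our own illustration, not the ABCNet code.

```python
import numpy as np

def greedy_decode(P: np.ndarray, alphabet: list, eos: str = "<eos>") -> str:
    """Take the most probable character in each column of P, stopping at the end symbol."""
    chars = []
    for j in range(P.shape[1]):
        c = alphabet[int(np.argmax(P[:, j]))]
        if c == eos:
            break
        chars.append(c)
    return "".join(chars)

# Example with a 4-symbol alphabet and a 3-column matrix that decodes to "to".
alphabet = ["t", "o", "p", "<eos>"]
P = np.array([[0.90, 0.10, 0.10],
              [0.05, 0.70, 0.20],
              [0.03, 0.10, 0.10],
              [0.02, 0.10, 0.60]])
print(greedy_decode(P, alphabet))  # "to"
```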
Given a word, either the ground truth word or the initially recognized one, we need to find the list of candidate words that have the smallest edit distances to the given word. This can be done based on exact search or approximate nearest neighbor retrieval. The former approach requires exhaustively computing the edit distance between the given word and all dictionary words. It generates a better list of candidates and leads to higher accuracy, but it also takes a longer time. The latter approach is more efficient, but it only returns approximate nearest neighbors. We experiment with both approaches in this paper. For the second approach, we use the dict-trie library to retrieve all words whose Levenshtein distance to the query word is smaller than three. If the number of candidate words is smaller than ten, we fill the missing candidates with ###. We notice that the query time will increase significantly if we use a larger distance threshold for dict-trie. Approximate search can reduce the query time, but it also decreases the final accuracy slightly.

In our experiments, we used the Adam optimizer [13] for training. The parameter λ in Eq. (5) was set to 1.0, and the temperature parameter T of Eq. (3) was set to 0.3.

4. VinText: a dataset for Vietnamese scene text

In this section, we will describe our dataset for Vietnamese scene text, named VinText. This dataset contains 2,000 fully annotated images with 56,084 text instances. Each text instance is delineated by a quadrilateral bounding box and associated with the ground truth sequence of characters. We randomly split the dataset into three subsets for training (1,200 images), validation (300 images), and testing (500 images). This is the largest dataset for Vietnamese scene text.

Although this dataset is specific to Vietnamese, we believe it will greatly contribute to the advancement of research in scene text detection and recognition in general. First, this dataset contains images from a developing country, and it complements the existing datasets of images taken in developed countries. Second, images from our dataset are very challenging, containing busy and chaotic scenes with many shop signs, billboards, and propaganda panels. As such, this dataset will serve as a challenging benchmark for measuring the applicability and robustness of scene text detection and recognition algorithms.

In the rest of this section, we will describe how the images were collected to ensure the dataset covers a diverse set of scene text and backgrounds. We will also describe the annotation and quality control process.

Figure 2: Some representative images from the VinText dataset. This is a challenging dataset, containing busy and chaotic scenes with scene text instances of various types, appearances, sizes, and orientations. Each text instance is annotated with a quadrilateral bounding box and a word-level transcription. This dataset will be a good benchmark for measuring the applicability and robustness of scene text spotting algorithms.

Figure 3: Scene categories of the VinText dataset, and the proportion for each category (pie chart). This is a diverse dataset with many categories and sub-categories.

4.1. Image collection

The images from our dataset were either downloaded from the Internet or captured by data collection workers. Our objective was to compile a collection of images that represent the diverse set of scene texts that are encountered in everyday life in Vietnam. To ensure the diversity of our dataset, we first created a list of scene categories and sub-categories. The categories at the first level are: shop signs, notice boards, bulletins, banners, flyers, street walls, vehicles, and miscellaneous items. These categories were divided into subcategories, and many subcategories were further divided into sub-subcategories. For example, the first-level category "Miscellaneous Items" contains many subcategories, including book covers, product labels, and clothes. Images from these categories and sub-categories were abundant on the Internet, but there were also many more irrelevant and unsuitable images. As a result, for some categories, it was easier to capture the images ourselves rather than wasting time filtering out irrelevant search results. Thus, for a part of our dataset, we hired 20 data collection workers to capture images of the scene texts that they encountered while shopping or walking on the streets, using their own phone or hand-held cameras. To ensure no duplicates, we used p-hashing to find and remove duplicates, and we also visually inspected every image of our dataset. Our final dataset contains 764 images from the Internet and 1,236 images captured by the data collection workers.
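The paper states that p-hashing was used for deduplication but not how it was implemented; the sketch below shows one plausible way to do it with the third-party imagehash and Pillow packages (an assumption on our part), flagging image pairs whose perceptual hashes are within a small Hamming distance.

```python
# A minimal sketch, assuming the `imagehash` and `Pillow` packages; the paper
# only states that p-hashing was used, not which implementation.
from pathlib import Path
from PIL import Image
import imagehash

def near_duplicates(image_dir: str, max_hamming: int = 4):
    """Return pairs of image paths whose perceptual hashes differ by <= max_hamming bits."""
    paths = sorted(Path(image_dir).glob("*.jpg"))  # assumes JPEG files
    hashes = [(p, imagehash.phash(Image.open(p))) for p in paths]
    pairs = []
    for i in range(len(hashes)):
        for j in range(i + 1, len(hashes)):
            # Subtracting two ImageHash objects gives their Hamming distance.
            if hashes[i][1] - hashes[j][1] <= max_hamming:
                pairs.append((hashes[i][0], hashes[j][0]))
    return pairs
```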
4.2. Image annotation

We divided the 2,000 images into 10 batches, each with 200 images. Each image was annotated by two annotation workers independently. If the correlation between their annotations was smaller than 98%, we would ask them to cross-check each other and resolve the differences. As a final step, we manually chose the annotation from one annotator and visually inspected the annotation to ensure it satisfied our quality requirements.

5. Experiments

In this section, we report the performance of our method on several datasets: TotalText, ICDAR2013, ICDAR2015, and the newly collected VinText dataset. We measure the performance of the entire detection-recognition system. An annotated text instance is considered correctly recognized only if it is detected correctly and the predicted word is the same as the annotated word. A detected output of the system is considered correct only when it corresponds to a correctly detected and recognized text instance. Following previous works in scene text recognition, we use the H-mean score as the performance metric, which is the harmonic mean of the precision and recall values, also known as the F1 score.

The proposed method for using the dictionary during training and inference is general, but our specific implementation in this paper is based on either the ABCNet framework [19] or MaskTextSpotterV3 [16], so we will refer to our method as either ABCNet+D or MaskTextSpotterV3+D for brevity.

5.1. Experiments on TotalText

TotalText [3] is a comprehensive scene text dataset. It consists of text instances in various orientations, including horizontal, vertical, and curved. The dataset contains 1,255 training images and 300 test images with 11,459 annotated text instances, covering in-the-wild scenes. All images were annotated with polygons and word-level transcriptions.

Table 1 compares the performance of the proposed approach ABCNet+D with several recently proposed methods. We consider two scenarios, depending on whether a dictionary is used at inference time or not. This dictionary is the combination of the Oxford VGG dictionary [8] (> 90K words) with all the words in the test set, and this setting corresponds truthfully to the Full Lexicon setting reported in the existing scene text recognition literature. ABCNetpub indicates the published result in the ABCNet paper [19]. ABCNet is the released model on github (https://github.com/aim-uofa/AdelaiDet), which is also the base of the proposed ABCNet+D method. As can be seen, ABCNet+D yields a significant improvement over its direct baseline ABCNet. When the dictionary is not used at the inference stage, the key difference between ABCNet+D and ABCNet is the language-based contrastive loss, so the 1.2 performance gap between the two methods indicates the benefits of this contrastive loss. When the dictionary is also used at inference time, the performance gap becomes wider, and this demonstrates the benefits of the novel way of using the dictionary at the inference time. These benefits can also be observed when comparing ABCNet+D and Mask TextSpotter v3: ABCNet+D is ranked lower than Mask TextSpotter v3 when the dictionary is not used, but their order is swapped when the dictionary is used.

Table 1: Scene text recognition results on TotalText. The values in the table are the H-mean scores. We consider both scenarios where a large dictionary (> 90K words) is used or not used during inference. ABCNet+D yields a significant improvement over its direct baseline ABCNet. ABCNet+D is not as good as MaskTextSpotterV3 when the dictionary is not used, but it is better when the dictionary is used. This proves the effectiveness of our proposed approach for incorporating the dictionary in the inference phase.

Method | Dictionary: No | Dictionary: Yes
TextDragon [4] | 48.8 | 74.8
Boundary TextSpotter [29] | 65.0 | 76.1
CharNet [31] | 63.6 | -
Mask TextSpotter v2 [15] | 65.3 | 77.4
Mask TextSpotter v3 [16] | 71.2 | 78.4
ABCNetpub (reported in [19]) | 64.2 | 75.7
ABCNet [19] (github checkpoint) | 67.1 | 76.0
ABCNet+D (proposed) | 68.3 | 78.5

Table 2 shows how the performance of ABCNet+D varies as the number of candidate words considered during the inference stage increases. As can be seen, the recognition accuracy increases as more candidate words are examined. This demonstrates the usefulness of correlating the candidates back to the visual features for deciding the final output. Using the edit distance alone is not adequate, since the correct word might not be the one with the closest edit distance to the intermediate output. In fact, as can be seen from Table 2, the correct word might not even be among the top 70 candidates; increasing from 70 to 300 candidates still provides some accuracy gain. Furthermore, being able to select the correct word from a large set of 300 candidates means the visual-language compatibility scoring model works reasonably well. The performance saturates after 300 candidates have been considered.

Table 2: Recognition accuracy of ABCNet+D on TotalText as the number of candidates varies. The accuracy increases as more candidates are evaluated during inference, but saturates after 300 candidates have been considered. The second column, where the number of candidates is one, corresponds to the naive way of using the dictionary word with the smallest edit distance.

# candidates | 1 | 10 | 20 | 30 | 70 | 300 | 600
H-mean | 76.1 | 77.9 | 78.1 | 78.2 | 78.4 | 78.5 | 78.5

As discussed above, there is strong empirical evidence for the benefits of using the contrastive loss during training. Even without using the dictionary during inference, the model trained with the contrastive loss makes fewer recognition mistakes than the model trained without it. Fig. 4 displays some qualitative results of the two models.

5.2. Experiments on ICDAR13 and ICDAR15

ICDAR13 [10] is a scene text dataset that is focused around the text content of interest. This dataset contains 462 images (229 for training and 233 for testing), together with rectangular bounding boxes and word-level transcriptions. ICDAR2015 is an incidental scene text dataset, consisting of 1,000 training and 500 test images, respectively.
The images were taken by Google Glasses, and most of them are at low resolution with blurry and small text. ICDAR2015 focuses on English, and it comes with quadrilateral bounding box annotations and word-level transcriptions.

The results of ABCNet on these datasets are not reported in the ABCNet paper [19], and there is no released model for ICDAR2015, so we train an ABCNet model on these datasets ourselves. The results of ABCNet and ABCNet+D are reported in Table 3. On ICDAR2015, ABCNet performs relatively poorly, possibly due to the low quality of the Google Glasses images with many small and blurry text instances. In this case, we find that the use of the dictionary boosts the performance of the model immensely.

Table 3: Comparing ABCNet with ABCNet+D on the ICDAR13 and ICDAR15 datasets. We consider two scenarios when a large dictionary (about 90K words) is used or not used during testing. On ICDAR15, ABCNet performs relatively poorly, possibly due to the low quality of Google Glasses images with many small and blurry text instances. ABCNet+D outperforms its direct baseline ABCNet by a wide margin.

Dataset | Method | Dictionary: No | Dictionary: Yes
ICDAR13 | ABCNet [19] | 83.5 | 85.6
ICDAR13 | ABCNet+D (proposed) | 84.6 | 87.5
ICDAR15 | ABCNet [19] | 36.5 | 47.6
ICDAR15 | ABCNet+D (proposed) | 57.9 | 67.2

Table 4 compares the performance of MaskTextSpotterV3 with MaskTextSpotterV3+D on ICDAR15, for different ways of using different types of dictionary during testing. As can be seen, MaskTextSpotterV3+D outperforms MaskTextSpotterV3 in all settings.

Table 4: H-mean scores on ICDAR15, comparing MaskTextSpotterV3 with MaskTextSpotterV3+D, the proposed method trained with a general dictionary of ∼90K words. In testing, one can consider different types of dictionary (Strong/Weak/General), which correspond to the standard evaluation protocols for ICDAR15: Strongly/Weakly/Generic Contextualised.

Method | Strong | Weak | General
MaskTextSpotterV3 | 83.3 | 78.1 | 74.2
MaskTextSpotterV3+D (proposed) | 85.2 | 81.9 | 75.9

Figure 4: Several cases where ABCNet makes mistakes but ABCNet+D does not. These are intermediate outputs, when a dictionary has not been used for post-processing. (a) ABCNet: CHIBESE, ABCNet+D: CHEESE; (b) ABCNet: FARN, ABCNet+D: FARM; (c) ABCNet: FAEAD, ABCNet+D: HEAD; (d) ABCNet: HOOME, ABCNet+D: HOME; (e) ABCNet: KITGHEN, ABCNet+D: KITCHEN; (f) ABCNet: TOSTWORLD, ABCNet+D: LOSTWORLD; (g) ABCNet: LOUUIE, ABCNet+D: LOUIE; (h) ABCNet: PLAMET, ABCNet+D: PLANET.

5.3. Experiments on VinText

The Vietnamese script is based on the Latin alphabet like English, but it additionally has seven derivative characters (đ, ô, ê, â, ă, ơ, ư) and five accent symbols (acute, grave, hook above, dot below, and tilde). A derivative character can also be combined with an accent symbol; for example, ế is a popular Vietnamese word, combining the letter e with both the circumflex and the acute symbols. It is unclear how to handle these extra symbols, and we consider here two approaches.
The first approach is to create a new alphabet symbol for each valid combination of letter and accent symbols. For example, ế would be a character of the alphabet by itself, and so would ề, ệ, ễ, ê, é, è, ẹ, ẽ, and e. The second approach is to break a derivative character into two parts: the English character and either the hat ˆ, the breve ˘, or the horn, plus one of the accent symbols. Thus, the word ế would be the sequence of three symbols: (e, ˆ, ´). The first approach requires extending the English alphabet of 97 characters to an alphabet with a total of 258 characters, while the second approach requires only nine additional symbols, leading to a total of 97 + 9 = 106 characters. Furthermore, one advantage of the second approach is that we can utilize annotated training data more effectively. For example, an annotated instance of ế also gives us one annotated instance of e, while an annotated instance of e would also be an annotated instance for a part of ế. However, the disadvantages of the second approach are: (1) not all symbol combinations are valid, and (2) multiple sequential combinations lead to the same character, so we lose the uniqueness of the sequential order of characters in a word. Given these advantages and disadvantages, it is unclear which of the above two approaches would work best in practice, so we benchmark both of them in this experiment.
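One way to realize the second approach is to rely on Unicode canonical decomposition, which already splits a precomposed character such as ế into a base letter plus combining marks. The sketch below is only an illustration of this idea under that assumption; it is not necessarily how the paper's 106-symbol alphabet was built, and đ would need special handling since it has no canonical decomposition.

```python
import unicodedata

def decompose(word: str) -> list:
    """Split precomposed Vietnamese characters into base letters plus combining marks (NFD)."""
    return list(unicodedata.normalize("NFD", word))

def recompose(symbols: list) -> str:
    """Map a sequence of base letters and combining marks back to precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", "".join(symbols))

# 'ế' decomposes into the base letter 'e', the combining circumflex (U+0302),
# and the combining acute (U+0301), mirroring the (e, hat, acute) triple above.
print(decompose("ế"))
print(recompose(["e", "\u0302", "\u0301"]))  # 'ế'
# Note: đ (d with stroke) has no canonical decomposition and needs its own symbol.
```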
We train ABCNet and ABCNet+D on VinText, starting from an English pre-trained ABCNet model. To handle the extra characters of the extended alphabet, we create additional character recognition heads and replicate the weights of some existing recognition heads to the new ones based on the visual similarity of the characters. For example, we initialize the recognition heads of ế and ê with the recognition head of the character e.

Table 5 shows the recognition accuracy of several methods on the VinText dataset. As can be seen, the second approach of extending the English alphabet works better than the first one. Comparing ABCNet and ABCNet+D, we see clear benefits of using the proposed approach to incorporate the dictionary in both the training and testing stages.

Table 5: Recognition accuracy (H-mean) on the VinText dataset. We consider both scenarios when a large dictionary is used or not used during testing. This dictionary is the combination of RDRSegmenter [23] and the words in the test set. We also consider two different approaches of extending the English alphabet on two backbones: ABCNet and MaskTextSpotterV3.

Method | Dictionary: No | Dictionary: Yes
Alphabet: English + derivative characters
Mask TextSpotter v3 [16] | 53.4 | 57.4
ABCNet [19] | 50.6 | 55.1
Alphabet: English + accent symbols
ABCNet | 54.2 | 58.5
ABCNet+D | 57.4 | 63.6
MaskTextSpotterV3+D | 68.5 | 70.3

Fig. 5 shows some representative detection and recognition results on VinText and also on TotalText and ICDAR15. As can be seen, VinText contains more text instances and is more challenging than the other three datasets. ABCNet+D works well even on these challenging images.

Figure 5: Detection and recognition results by ABCNet+D on TotalText, ICDAR13, ICDAR15, and VinText.

6. Conclusions

We have proposed a novel language-aware approach to tackle the visual ambiguity in scene text recognition. Our approach can harness the power of a dictionary in both the training and inference stages. It can resolve ambiguity in many conditions by considering multiple suggestions from a dictionary given an intermediate recognition output. In addition, we propose VinText, a new dataset for Vietnamese scene text recognition, which brings new challenges in discriminating a character from multiple similar ones. Experimental results on TotalText, ICDAR13, ICDAR15, and the newly collected VinText dataset demonstrate the benefits of our dictionary incorporation approach.
References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
[2] A. Bissacco, M. Cummins, Yuval Netzer, and H. Neven. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the International Conference on Computer Vision, 2013.
[3] Chee Kheng Chng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017.
[4] Wei Feng, Wenhao He, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. TextDragon: An end-to-end framework for arbitrary shaped text spotting. In Proceedings of the International Conference on Computer Vision, 2019.
[5] Yunze Gao, Y. Chen, J. Wang, M. Tang, and H. Lu. Reading scene text with fully convolutional sequence modeling. Neurocomputing, 2019.
[6] Wen-Yang Hu, Xiaocong Cai, J. Hou, Shuai Yi, and Zhiping Lin. GTC: Guided training of CTC towards efficient and accurate scene text recognition. ArXiv, 2020.
[7] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In NIPS Workshop on Deep Learning, 2014.
[8] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
[9] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep features for text spotting. In Proceedings of the European Conference on Computer Vision, 2014.
[10] Dimosthenis Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. I. Bigorda, Sergi Robles Mestre, J. M. Romeu, D. F. Mota, Jon Almazán, and L. D. L. Heras. ICDAR 2013 robust reading competition. In 12th International Conference on Document Analysis and Recognition, pages 1484-1493, 2013.
[11] Dimosthenis Karatzas, L. G. I. Bigorda, A. Nicolaou, S. Ghosh, Andrew D. Bagdanov, M. Iwamura, Jiri Matas, Lukas Neumann, V. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and Ernest Valveny. ICDAR2015 competition on robust reading. In International Conference on Document Analysis and Recognition (ICDAR), 2015.
[12] Yoon Kim, Yacine Jernite, D. Sontag, and Alexander M. Rush. Character-aware neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
[14] V. I. Levenshtein. Binary codes capable of correcting insertions and reversals. 1966.
[15] Minghui Liao, Pengyuan Lyu, Minghang He, Cong Yao, Wenhao Wu, and Xiang Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[16] Minghui Liao, Guan Pang, J. Huang, Tal Hassner, and X. Bai. Mask TextSpotter v3: Segmentation proposal network for robust scene text spotting. ArXiv, 2020.
[17] Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, and Xiang Bai. Mask TextSpotter v3: Segmentation proposal network for robust scene text spotting. In Proceedings of the European Conference on Computer Vision, 2020.
[18] Ron Litman, Oron Anschel, Shahar Tsiper, R. Litman, Shai Mazor, and R. Manmatha. SCATTER: Selective context attentional scene text recognizer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[19] Y. Liu, Hao Chen, Chunhua Shen, Tong He, Lian-Wen Jin, and L. Wang. ABCNet: Real-time scene text spotting with adaptive Bezier-curve network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[20] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and X. Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. ArXiv, 2018.
[21] N. Nayef, F. Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, U. Pal, Christophe Rigaud, J. Chazalon, Wafa Khlif, M. Luqman, J. Burie, C. Liu, and J. Ogier. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification (RRC-MLT). In International Conference on Document Analysis and Recognition (ICDAR), 2017.
[22] N. Nayef, Yash Patel, M. Busta, P. Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, U. Pal, J. Burie, Cheng-Lin Liu, and J. Ogier. ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition (RRC-MLT-2019). In International Conference on Document Analysis and Recognition (ICDAR), 2019.
[23] Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. A fast and accurate Vietnamese word segmenter. In Proceedings of the International Conference on Language Resources and Evaluation, 2018.
[24] Liang Qiao, Sanli Tang, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, and Fei Wu. Text Perceptron: Towards end-to-end arbitrary-shaped text spotting. arXiv preprint arXiv:2002.06820, 2020.
[25] A. Radford, Jeffrey Wu, R. Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[26] D. Rumelhart, Geoffrey E. Hinton, and R. J. Williams. Learning internal representations by error propagation. 1986.
[27] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014.
[28] Zhaoyi Wan, Fengming Xie, Y. Liu, X. Bai, and Cong Yao. 2D-CTC for scene text recognition. ArXiv, 2019.
[29] Hao Wang, Pu Lu, Hui Zhang, Mingkun Yang, X. Bai, Yongchao Xu, Mengchao He, Yongpan Wang, and W. Liu. All you need is boundary: Toward arbitrary-shaped text spotting. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[30] T. Wang, Yuanzhi Zhu, Lianwen Jin, Canjie Luo, Xiaoxue Chen, Y. Wu, Qianying Wang, and Mingxiang Cai. Decoupled attention network for text recognition. ArXiv, 2020.
[31] Linjie Xing, Zhi Tian, Weilin Huang, and Matthew R. Scott. Convolutional character networks. In Proceedings of the International Conference on Computer Vision, 2019.
[32] X. Yang, Kaihua Tang, Hanwang Zhang, and J. Cai. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[33] Liu Yuliang, Jin Lianwen, Zhang Shuaitao, and Zhang Sheng. Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170, 2017.
[34] Ling-Qun Zuo, Hong-Mei Sun, Qi-Chao Mao, Rong Qi, and R. Jia. Natural scene text recognition based on encoder-decoder framework. IEEE Access, 2019.
