[Figure 1b (schematic): appearance features v are extracted from a detected text instance x; candidate words (e.g., y1 = visas, y2 = vision) are scored for compatibility l(yi, v) and edit distance d(yi, y); these scores are used during training to calculate the losses and during testing to output the most compatible candidate, e.g., y* = visas.]
section, but we will demonstrate the empirical benefits of our method with both ABCNet and MaskTextSpotterV3 in the experiment section.

Fig. 1b depicts the processing pipeline of the recognition stage of our method. Given a detected text instance x (delineated by two Bezier curves [19]), a fixed-size feature map v is computed using the Bezier alignment module [19]. From v, we obtain an initial recognition output ŷ. We then compile a list of candidate words y1, . . . , yk, which are the dictionary words with the smallest edit distances to ŷ. We then calculate the compatibility score between each candidate word yi and the feature map v, and output the word with the highest compatibility score.

During training, we also calculate the compatibility score between the appearance feature map v and the ground truth word y; this score is used to compute the appearance loss. In addition, we minimize a contrastive loss, defined on the compatibility scores between the feature map v and the list of candidate words y1, . . . , yk.

3.2. Candidate generation

We use a dictionary to generate a list of candidate words in both the inference and training phases. During inference, given the initial recognition output ŷ, the list of candidates consists of the k dictionary words with the smallest edit distance (Levenshtein distance [14]) to ŷ. For example, if ŷ = visan and k = 10, the list of candidates is: visas, vise, vised, vises, visi, vising, vision, visit, visor, vista.

During training, we use both the ground truth word y and the initial recognition output ŷ to generate the list of candidate words, creating a list with a total of k words.
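To make the candidate-generation step concrete, the Python sketch below implements exact top-k retrieval with Levenshtein distance. It is our illustration, not the authors' released code; the word list is a toy stand-in for a real dictionary.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def candidates(query: str, dictionary: list[str], k: int = 10) -> list[str]:
    """Return the k dictionary words with smallest edit distance to the query."""
    return sorted(dictionary, key=lambda w: levenshtein(query, w))[:k]

# Example mirroring the paper: the initial recognition output is "visan".
words = ["visas", "vise", "vised", "vises", "visi", "vising",
         "vision", "visit", "visor", "vista", "apple"]
print(candidates("visan", words, k=10))
```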
3.3. Training losses

To train our recognition network, we minimize an objective function that is a weighted combination of two losses. The first loss is based on the negative log likelihood of the ground truth word. The second loss is defined on the list of candidate words: it maximizes the likelihood of the candidates that are close to the ground truth while minimizing the likelihood of the candidates that are further away from it.

The negative log likelihood for a feature map v and a word y is calculated as follows. First, using a recurrent neural network with attention [1], we obtain a probability matrix P of size s×m, where m is the maximum length of a word and s is the size of the alphabet, including special symbols and characters (m = 25, s = 97 for English). Let y_j be the index of the j-th character of the word y, with y_j ∈ {1, . . . , s}. The negative log likelihood for y is defined as:

    l(y, v) = − Σ_{j=1}^{len(y)} log P[y_j, j],    (1)

where len(y) is the length of word y, and P[y_j, j] denotes the entry at row y_j and column j of matrix P.
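As a worked check of Eq. (1), the snippet below sums −log P[y_j, j] over the characters of a word. The probability matrix and the character indices are toy assumptions, and indices are 0-based here rather than the 1-based convention in the text.

```python
import numpy as np

def neg_log_likelihood(P: np.ndarray, char_indices: list[int]) -> float:
    """Eq. (1): sum of -log P[y_j, j] over the characters of the word.
    P has shape (s, m): s alphabet symbols, m = maximum word length."""
    return -sum(np.log(P[cj, j]) for j, cj in enumerate(char_indices))

# Toy example: alphabet of s=5 symbols, words up to m=4 characters.
rng = np.random.default_rng(0)
P = rng.random((5, 4))
P /= P.sum(axis=0, keepdims=True)        # each column is a distribution
print(neg_log_likelihood(P, [2, 0, 3]))  # a hypothetical 3-character word
```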
The second loss is based on the negative log likelihoods of the candidate words and their edit distances to the ground truth word. We first convert the list of negative log likelihood values into a probability distribution:

    L_i = exp(−l(y_i, v)) / Σ_{j=1}^{k} exp(−l(y_j, v)).    (2)

We likewise convert the list of edit distances to a probability distribution:

    D_i = exp(−d(y_i, y)/T) / Σ_{j=1}^{k} exp(−d(y_j, y)/T).    (3)

Finally, we compute the KL divergence between the two probability distributions D and L:

    KL(D||L) ∝ − Σ_{i=1}^{k} D_i log L_i.    (4)

In Eq. (3), T is a tunable temperature parameter. T is a positive value that should be neither too large nor too small. When T is too small, the target probability distribution D has low entropy, and none of the candidate words except the ground truth would matter. When T is too large, the target distribution D has high entropy, and there is no contrast between good and bad candidates. In our experiments, T is set to 0.3.

The loss in Eq. (4) is formulated to maximize the likelihood of the candidates that are close to the ground truth, while minimizing the likelihood of the faraway candidates. We call this loss the contrastive loss because its goal is to contrast the candidates that are closer to the ground truth with those further away.

The total training loss of our recognition network is:

    L(x, y) = l(y, v) + λ · KL(D||L),    (5)

where λ is a hyper-parameter that balances the two loss terms. We simply set λ = 1 in our experiments.
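Eqs. (2)-(5) amount to a softmax over negative log likelihoods, a temperature-scaled softmax over negative edit distances, and a KL-style cross term. The NumPy sketch below is a direct, unoptimized transcription under these definitions (in practice one would subtract the maximum before exponentiating for numerical stability):

```python
import numpy as np

def contrastive_loss(nll: np.ndarray, edit_dist: np.ndarray, T: float = 0.3) -> float:
    """KL-based contrastive loss of Eqs. (2)-(4).
    nll[i] = l(y_i, v); edit_dist[i] = d(y_i, y) for candidate i."""
    L = np.exp(-nll) / np.exp(-nll).sum()                      # Eq. (2)
    D = np.exp(-edit_dist / T) / np.exp(-edit_dist / T).sum()  # Eq. (3)
    return float(-(D * np.log(L)).sum())                       # Eq. (4), up to an additive constant

def total_loss(nll_gt: float, nll: np.ndarray, edit_dist: np.ndarray,
               lam: float = 1.0) -> float:
    """Eq. (5): ground-truth negative log likelihood plus weighted contrastive term."""
    return nll_gt + lam * contrastive_loss(nll, edit_dist)
```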
3.4. Network architecture and details

In the detection stage, we use the Bezier detection and alignment module from ABCNet [19]. The output of the detection stage, which is the input to the recognition stage, is a 3D feature tensor of size n×32×256, with n being the number of detected text instances. Each text instance is represented by a feature map of size 32×256, and we use a sequential decoding network with attention to output a probability matrix P of size s×25, where s is the size of the extended alphabet, including letters, numbers, and special characters. In our experiments, s = 97 for English and s = 106 for Vietnamese. Each column i of P is a probability distribution over the i-th character of the text instance. During inference, we use this matrix to produce the initial recognition output: the sequence of characters in which each character is the one with the highest probability in the respective column of P.
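The initial recognition output described above is a per-column argmax over P. A minimal sketch follows; the alphabet and end-of-word marker are our assumptions, standing in for the paper's 97-symbol English alphabet.

```python
import numpy as np

# Toy alphabet; the last symbol acts as the end-of-word marker.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + ["<eos>"]

def greedy_decode(P: np.ndarray) -> str:
    """Take the most probable symbol in each column of P (shape s x m),
    stopping at the end-of-word symbol -- the initial recognition output."""
    chars = []
    for j in range(P.shape[1]):
        sym = ALPHABET[int(P[:, j].argmax())]
        if sym == "<eos>":
            break
        chars.append(sym)
    return "".join(chars)
```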
Given a word, either the ground truth word or the initially recognized one, we need to find the list of candidate words that have the smallest edit distances to it. This can be done with exact search or approximate nearest neighbor retrieval. The former requires exhaustively computing the edit distance between the given word and every dictionary word; it generates a better list of candidates and leads to higher accuracy, but it also takes longer. The latter is more efficient, but it returns only approximate nearest neighbors. We experiment with both approaches in this paper. For the second approach, we use the dict-trie library to retrieve all words whose Levenshtein distance to the query word is smaller than three. If the number of candidate words is smaller than ten, we fill the missing candidates with ###. We notice that the query time increases significantly if we use a larger distance threshold for dict-trie. Approximate search reduces the query time, but it also decreases the final accuracy slightly.
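For the trie-based variant, here is a sketch assuming the Trie.all_levenshtein interface from the dict-trie documentation; this call name should be verified against the installed version of the library. The ### padding follows the text above.

```python
from dict_trie import Trie  # pip install dict-trie

# Build a trie over the dictionary once; query it per word.
trie = Trie(["visas", "vise", "vised", "vises", "vision", "visit"])

def approx_candidates(query: str, k: int = 10) -> list[str]:
    """Retrieve words within Levenshtein distance < 3 of the query,
    padding with the placeholder '###' when fewer than k are found."""
    found = list(trie.all_levenshtein(query, 2))  # distance <= 2, i.e. < 3
    found += ["###"] * max(0, k - len(found))
    return found[:k]

print(approx_candidates("visan"))
```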
In our experiments, we used the Adam optimizer [13] for training. The parameter λ in Eq. (5) was set to 1.0, and the temperature parameter T of Eq. (3) was set to 0.3.

4. VinText: a dataset for Vietnamese scene text

In this section, we will describe our dataset for Vietnamese scene text, named VinText. This dataset contains 2,000 fully annotated images with 56,084 text instances. Each text instance is delineated by a quadrilateral bounding box and associated with the ground truth sequence of characters. We randomly split the dataset into three subsets for training (1,200 images), validation (300 images), and testing (500 images). This is the largest dataset for Vietnamese scene text.

Although this dataset is specific to Vietnamese, we believe it will greatly contribute to the advancement of research in scene text detection and recognition in general. First, this dataset contains images from a developing country, and it complements the existing datasets of images taken in developed countries. Second, images from our dataset are very challenging, containing busy and chaotic scenes with many shop signs, billboards, and propaganda panels. As such, this dataset will serve as a challenging benchmark for measuring the applicability and robustness of scene text detection and recognition algorithms.

In the rest of this section, we will describe how the images were collected to ensure the dataset covers a diverse set of scene text and backgrounds. We will also describe the annotation and quality control process.

4.1. Image collection

The images in our dataset were either downloaded from the Internet or captured by data collection workers. Our objective was to compile a collection of images that represent the diverse set of scene texts that are encountered in Vietnam. To ensure the diversity
of our dataset, we first created a list of scene categories and sub-categories. The list of categories at the first level is: shop signs, notice boards, bulletins, banners, flyers, street walls, vehicles, and miscellaneous items. These categories were divided into subcategories, and many subcategories were further divided into sub-subcategories. For example, the first-level category "Miscellaneous Items" contains many subcategories, including book covers, product labels, and clothes. Images from these categories and sub-categories were abundant on the Internet, but there were also many

Figure 2: Some representative images from the VinText dataset. This is a challenging dataset, containing busy and chaotic scenes with scene text instances of various types, appearances, sizes, and orientations. Each text instance is annotated with a quadrilateral bounding box and a word-level transcription. This dataset will be a good benchmark for measuring the applicability and robustness of scene text spotting algorithms.

[Pie chart: distribution of scene categories in VinText; the largest categories include notice boards (22.8%), banners (9.0%), and bulletin boards (7.8%).]
low resolution with blurry and small text. ICDAR2015 focuses on English, and it comes with quadrilateral bounding box annotations and word-level transcriptions.

The results of ABCNet on these datasets are not reported in the ABCNet paper [19], and there is no released model for ICDAR2015, so we train an ABCNet model on these datasets ourselves. The results of ABCNet and ABCNet+D are reported in Table 3. On ICDAR2015, ABCNet performs relatively poorly, possibly due to the low quality of the Google Glass images, with many small, blurry text instances. In this case, we find that the use of the dictionary boosts the performance of the model immensely.

Table 4 compares the performance of MaskTextSpotterV3 with MaskTextSpotterV3+D on ICDAR15, for different ways of using different types of dictionary during testing. As can be seen, MaskTextSpotterV3+D outperforms MaskTextSpotterV3 in all settings.

                                  Dictionary type
Method                           Strong   Weak   General
MaskTextSpotterV3                 83.3    78.1    74.2
MaskTextSpotterV3+D (proposed)    85.2    81.9    75.9

Table 4: H-mean scores on ICDAR15, comparing MaskTextSpotterV3 with MaskTextSpotterV3+D, the proposed method trained with a general dictionary of ~90K words. In testing, one can consider different types of dictionary (Strong/Weak/General), which correspond to the standard evaluation protocols for ICDAR15: Strongly/Weakly/Generic Contextualised.

Figure 4: Several cases where ABCNet makes mistakes but ABCNet+D does not. These are intermediate outputs, when a dictionary has not been used for post-processing. (e) ABCNet: KITGHEN, ABCNet+D: KITCHEN; (f) ABCNet: TOSTWORLD, ABCNet+D: LOSTWORLD; (g) ABCNet: LOUUIE, ABCNet+D: LOUIE; (h) ABCNet: PLAMET, ABCNet+D: PLANET.

5.3. Experiments on VinText

The Vietnamese script is based on the Latin alphabet, like English, but it additionally has seven derivative characters (đ, ô, ê, â, ă, ơ, ư) and five accent symbols: acute (´), grave (`), hook above, dot below, and tilde (˜). A derivative character can also be combined with an accent symbol; for example, ế is a popular Vietnamese word, combining the letter e with both the circumflex and the acute symbols. It is unclear how to handle these extra symbols, and we consider two approaches here. The first approach is to create a new alphabet symbol for each valid combination of letter and accent symbols. For example, ế would be a character of the alphabet by itself, and so would ề, ệ, ễ, ể, é, è, ẹ, ẽ, and ẻ. The second approach is to break a derivative character into its parts: the English character and either the hat ˆ, the breve ˘, or the horn, plus one of the accent symbols when present. Thus, the word ế would be the sequence of three symbols: (e, ˆ, ´). The first approach requires extending the English alphabet of 97 characters to an alphabet with a total of 258 characters, while the second approach requires only nine additional symbols, leading to a total of 97 + 9 = 106 symbols.
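The second approach corresponds closely to Unicode canonical decomposition. The sketch below uses Python's standard unicodedata module to split ế into (e, circumflex, acute); it is our illustration of the idea, not the paper's exact symbol set (đ, for instance, does not decompose and must be kept as its own symbol).

```python
import unicodedata

def decompose(word: str) -> list[str]:
    """Split precomposed Vietnamese characters into base letters plus
    combining diacritics (approach 2 in the text)."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", word)]

# ế -> base letter e, circumflex, then acute accent: three symbols.
print(decompose("ế"))
# ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT', 'COMBINING ACUTE ACCENT']
```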
Figure 5: Detection and recognition results by ABCNet+D on TotalText, ICDAR13, ICDAR15, and VinText.