Unsupervised OCR Model Evaluation Using GAN (Sinha, 2019)
Abstract—Optical Character Recognition (OCR) has achieved state-of-the-art performance with the use of Deep Learning for character recognition. Deep Learning techniques need a large amount of data along with ground truth, and a small portion of the available data also has to be set aside for validation. Preparing ground truth for historical documents is expensive, and hence the availability of data is of utmost concern. Jenckel et al. [7] came up with the idea of using all the available data for training the OCR model and, for validation, generating the input image back from the Softmax layer of the OCR model using a decoder setup; the generated image can then be compared with the original input image to validate the OCR model. In this paper, we explore the possibility of using Generative Adversarial Networks (GANs) [6] to generate the image directly from the text produced by the OCR model instead of from the Softmax layer, which is not accessible for all Deep Learning based OCR models. Using the text directly to regenerate the input image gives us the advantage that this pipeline can be used with any OCR model, even one whose Softmax layer is not accessible. In the results section, we show the current state of using GANs for unsupervised OCR model evaluation.

Keywords—OCR, Historical Document Image Processing, GAN

I. INTRODUCTION

There has been significant progress in the field of Optical Character Recognition (OCR) in recent years. Traditionally, OCR applications were pipelined with hand-built and highly tuned modules. For example, one module might find lines of text, the next module would find words and segment letters, and another module might apply different techniques to recognize each character.

The inclusion of Deep Learning further improved the accuracy of OCR. Deep Learning models have surpassed traditional Computer Vision techniques for OCR by a huge margin in terms of accuracy, but training such an OCR model needs a large amount of data. Also, to validate the trained model, a small set of data is needed which has not been used in training. The task becomes more daunting because the training is mostly Supervised Learning, which requires labeled data. Usually, this labelling is carried out by humans and is often expensive.

To overcome this problem, Jenckel et al. [7] proposed an approach in which all the available data is used only for training, and for validation the input image is generated back from the OCR output text. The idea is that if the OCR model has any errors, those errors will be propagated to the output text and to the fake image generated from that text. Hence, when we compare the input image with the fake image, we can evaluate the performance of the OCR model.

On top of Jenckel et al.'s approach, we have used Generative Adversarial Networks (GANs) [6], first introduced by Goodfellow et al. in 2014, to generate the fake image. The GAN architecture combines two networks, a Generator and a Discriminator, which compete against each other in what is known as a min-max game: the Generator generates data and the Discriminator tries to predict whether the data is fake or real. There has been a variety of research in the field of GANs, such as [9], [11], [8].

In Section II, we discuss the related work. In Section III, the evaluation of trained OCR models is discussed. In Section IV, we describe the data sets used in the experiments. Section V presents the GAN-based approach to generate the image, and Section VI presents the results of the experiments.

II. RELATED WORKS

Jenckel et al. [7] proposed a novel method for the evaluation of Deep Learning based OCR models without transcription. The approach uses a secondary Long Short-Term Memory (LSTM) as an encoder-decoder setup. The goal is to generate the image containing the text line from the output of the OCR model. The trained OCR model can then be evaluated by comparing the generated text line image with the original input image. The paper exploits the idea that if there are errors in the OCR model, they will be propagated to the generated image as well. The proposed method uses an LSTM variation of auto-encoders in which the encoder and decoder are trained independently. The encoder is the OCR model, and the output of its Softmax layer is used as input to the decoder setup.

The advantage of using the Softmax layer as input to the decoder is that it contains information such as the alignment between the characters and the corresponding characters in the input image. It also contains the confidence score of each character, which can be useful in regenerating that character with some confidence. With all these advantages comes a disadvantage as well: not every OCR model's Softmax layer can be accessed. OCR engines like Tesseract [4] and ABBYY [1] do not provide access to their Softmax layer. This problem can be solved if we can generate the text line image from the output text itself rather than from the Softmax layer.
In this paper, we use Generative Adversarial Networks (GANs) [6] to generate the text line image instead of the encoder-decoder setup of Jenckel et al. Also, we use the output text of the OCR model to generate the image instead of the model's Softmax layer. In order to evaluate the model, the generated text line image is compared with the original input image. This idea is similar to the one proposed by Alex Graves [2], who generated handwritten characters and sentences from text.
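The overall evaluation loop can be summarized with the following minimal sketch. The callables run_ocr, generate_line_image and image_distance are hypothetical placeholders for an arbitrary OCR engine, the trained generator and an image comparison metric; they are not part of any particular library.

```python
import numpy as np

def evaluate_ocr_unsupervised(line_images, run_ocr, generate_line_image, image_distance):
    """Score an OCR model without ground-truth transcriptions.

    Every input line image is OCRed, a synthetic image is regenerated from the
    recognized text with the GAN generator, and the distance between the
    regenerated image and the original is measured. A larger average distance
    suggests more OCR errors, since errors in the text propagate into the
    regenerated (fake) image.
    """
    distances = []
    for image in line_images:
        text = run_ocr(image)                   # OCR output text (no ground truth needed)
        fake_image = generate_line_image(text)  # GAN generator: text -> text line image
        distances.append(image_distance(image, fake_image))
    return float(np.mean(distances))
```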
In this paper, we have tried various versions of GANs. We have also tried combining two GAN formulations, Conditional GAN (CGAN) [9] and Least Squares GAN (LSGAN) [8]. In all the experiments, the text is taken as input and the corresponding text line image is used as ground truth.
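For reference, the adversarial objectives behind these formulations are recalled below, following [6], [9] and [8]; the notation (G for the Generator, D for the Discriminator, x for a real text line image, y for the encoded text condition and z for noise) is introduced here and is not taken verbatim from the paper.

```latex
% Original GAN min-max game (Goodfellow et al. [6])
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)]
             + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

% Conditional GAN [9]: both networks are conditioned on the encoded text y
\min_G \max_D \; \mathbb{E}_{x}[\log D(x \mid y)]
             + \mathbb{E}_{z}[\log(1 - D(G(z \mid y)))]

% Least Squares GAN [8]: the sigmoid cross-entropy loss is replaced by squared errors
\min_D \; \tfrac{1}{2}\,\mathbb{E}_{x}[(D(x)-1)^2] + \tfrac{1}{2}\,\mathbb{E}_{z}[D(G(z))^2]
\qquad
\min_G \; \tfrac{1}{2}\,\mathbb{E}_{z}[(D(G(z))-1)^2]
```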
Fig. 3. The network architecture. The text is encoded so that a separator is included after every character, and then it is passed to the Generator, which takes noise as another input and generates a fake image. The encoded text is also passed to the Discriminator, which takes the fake image from the Generator or the real image from the ground truth as another input and tries to predict whether its input image is real or fake. It predicts 1 for a real image and 0 for a fake image.

Fig. 5. Discriminator with the new encoded text. The encoded text is passed to a Dense layer and then concatenated with the fake or real image, which is another input to the Discriminator. The concatenated tensor is passed to a BLSTM network and then to a final Dense layer with sigmoid as activation function. It predicts 1 for a real image and 0 for a fake image.

The characters in the text line were separated from each other using a separator character. The characters were separated in order to make the repetition of characters in a word look more obvious. The text lines were padded with zeros on either side before being fed to the Generator part of the model. This padding was done to keep the width of all the text lines in the data set the same. Each character was encoded with its corresponding index from the vocabulary of characters created over the whole data set.
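A minimal sketch of this encoding step is given below, assuming a reserved padding index of 0 and a placeholder separator character, since the actual separator symbol is not reproduced in the text.

```python
import numpy as np

SEP = "|"  # placeholder separator; the actual separator character used in the paper is not shown here

def encode_text_line(text, vocab, max_width):
    """Insert a separator after every character, map characters to vocabulary
    indices, and zero-pad the sequence on both sides to a fixed length."""
    separated = SEP.join(list(text))            # e.g. "foo" -> "f|o|o"
    indices = [vocab[ch] for ch in separated]   # index 0 is reserved for padding
    pad_total = max_width - len(indices)
    left, right = pad_total // 2, pad_total - pad_total // 2
    return np.array([0] * left + indices + [0] * right, dtype=np.int32)

# Toy usage: the vocabulary is built over the whole data set, with index 0 kept for padding.
vocab = {ch: i + 1 for i, ch in enumerate(sorted(set("hello world" + SEP)))}
encoded = encode_text_line("hello", vocab, max_width=32)
```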
In the Generator, a Dense layer takes the encoded text as input. The output of the Dense layer was concatenated with the noise and then fed to a BLSTM (Bi-directional LSTM) layer. On top of the BLSTM layer, a Dense layer was used to generate the fake image with the same width and height as the corresponding ground truth. The abstract architecture of the Generator is shown in Figure 4.
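The following Keras-style sketch shows one possible reading of this Generator; the layer sizes, the way the noise is injected and the reshaping of the final Dense output into an image are our assumptions, as the paper does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(text_len, img_height, img_width, noise_dim=100):
    """Sketch of the Generator: Dense on the encoded text, concatenation with
    noise, a BLSTM, and a final Dense layer that emits one image column per step."""
    text_in = layers.Input(shape=(text_len,), name="encoded_text")  # character indices
    noise_in = layers.Input(shape=(noise_dim,), name="noise")

    t = layers.Dense(256, activation="relu")(text_in)  # Dense layer on the encoded text
    x = layers.Concatenate()([t, noise_in])            # concatenate with the noise vector

    # Repeat the joint vector once per image column so the BLSTM produces a sequence.
    x = layers.RepeatVector(img_width)(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # One Dense unit per pixel of an image column; tanh keeps values in [-1, 1].
    x = layers.Dense(img_height, activation="tanh")(x)           # shape: (img_width, img_height)
    fake_image = layers.Permute((2, 1), name="fake_image")(x)    # shape: (img_height, img_width)
    return tf.keras.Model([text_in, noise_in], fake_image, name="generator")

# generator = build_generator(text_len=64, img_height=48, img_width=1200)
```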
In the Discriminator, the input text was first passed to a Dense layer so that the output is equal to the width of the image to be generated, and it was then concatenated with the image coming either from the Generator or from the ground truth. The concatenated output was then provided to a BLSTM layer. A final Dense layer was placed on top of the BLSTM layer with a sigmoid activation function to predict whether the input image is real or fake.

B. Experiments with UWIII data set

The normal setup for this experiment is the same as shown in Figure 3. The difference here is that the characters in the text line are no longer separated by the separator character.

In the Generator, we used a transposed Convolution layer which upsamples the encoded text. The output of the transposed Convolution layer was concatenated with noise and then fed into a BLSTM layer followed by a final Dense layer with tanh as activation function. The architecture of the Generator is shown in Figure 6.

In the Discriminator, as shown in Figure 7, the encoded text was passed to a Dense layer with the number of neurons equal to the width of the input image, which was given to the Discriminator as another input. The output of the Dense layer and the input image were concatenated vertically and then passed to a series of Convolution layers which extract features that can be helpful in distinguishing real images from fake images. On top of the Convolution layers, two Dense layers were stacked. The final Dense layer had only one output neuron with a sigmoid activation function to predict whether the input image is real or fake.
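A hedged sketch of this convolutional, text-conditioned Discriminator is given below; the number of Convolution layers, filter counts and kernel sizes are illustrative choices, not values reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(text_len, img_height, img_width):
    """Sketch of the Discriminator: the encoded text is projected to the image
    width, stacked vertically onto the image, and classified as real or fake."""
    text_in = layers.Input(shape=(text_len,), name="encoded_text")
    image_in = layers.Input(shape=(img_height, img_width, 1), name="real_or_fake_image")

    # Dense layer with as many neurons as the image width, reshaped into one
    # extra image row so it can be concatenated vertically with the image.
    t = layers.Dense(img_width)(text_in)
    t = layers.Reshape((1, img_width, 1))(t)
    x = layers.Concatenate(axis=1)([image_in, t])  # vertical concatenation

    # A small stack of Convolution layers extracts features that help to
    # distinguish real text line images from fake ones.
    for filters in (32, 64):
        x = layers.Conv2D(filters, kernel_size=3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)

    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # 1 = real, 0 = fake
    return tf.keras.Model([text_in, image_in], out, name="discriminator")
```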
For this experiment, we passed the padded images of the UWIII data set, with a width of 1200 pixels and a height of 48 pixels, to the pix2pixHD architecture [13] as input images. The ground truth image was the synthetic image generated using the corresponding text of the input image, which also had a width of 1200 pixels and a height of 48 pixels. An input image is shown in Figure 8 and the corresponding ground truth is shown in Figure 9.
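The ground truth rendering step can be approximated as follows. The paper uses the Pango library [5] for this; the sketch below uses PIL purely as an illustrative stand-in, and the font file name is an assumption.

```python
from PIL import Image, ImageDraw, ImageFont

def render_line(text, width=1200, height=48, font_path="DejaVuSans.ttf", font_size=32):
    """Render a text line onto a fixed-size 1200 x 48 canvas, black text on white.
    Illustrative stand-in for the Pango-based rendering used in the paper;
    the font path is an assumption."""
    image = Image.new("L", (width, height), color=255)  # white background, grayscale
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, font_size)
    draw.text((2, (height - font_size) // 2), text, fill=0, font=font)
    return image

ground_truth = render_line("A synthetic ground-truth text line")
```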
Fig. 6. Generator: In this architecture, the encoded text is passed to a transposed Convolution layer which translates the encoded text into a tensor whose width is equal to that of the image to be generated. This output is then concatenated with the noise and passed to BLSTM layers. On top of the BLSTM layers, a Dense layer is placed with the number of neurons equal to the number of pixels in the fake image to be generated. The tanh function is used as activation to keep the values of the generated image between -1 and 1.

Fig. 8. A random sample from the UWIII data set, though not in its original form: the foreground and background colors have been inverted, with no specific reason behind this. The size of the image is 1200 x 48 pixels.

Fig. 9. Ground truth for the input image shown in Figure 8. This image has been synthetically generated using the Pango library [5]. The library takes the text, the font and the font size to generate the image. The size of the image is the same as that of the input image, 1200 x 48 pixels.
Fig. 11. Fake image generated in the 150th epoch after using Sigmoid Cross Entropy and Least Squares Error losses. The top line is from the ground truth and the bottom line is the image generated by the Generator.

Fig. 16. Synthetic image generated by the pix2pixHD network. The initial result is good: we can see a few characters being formed across the width of the real image (Figure 15).
When the above pipeline is used, the generated image will have characters in a single font, without the corresponding characters being bold or italic. Such characters cannot be compared.

In the future, in order to preserve the character-level information, we can use a classifier which provides meta information (bold or italic) about every character in a sentence. While converting the OCRed text into an image, this meta information can be used to generate bold characters for the characters which were bold in the original image and italic characters for the characters which were italic in the original image. With this method, we can generate an image from text which will be close to the input image and can help in evaluating the OCR model.
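One straightforward way to quantify the comparison between the regenerated image and the original input, and thereby the quality of the OCR model, is a pixelwise error such as the mean squared error sketched below. This is an illustrative choice of metric, assuming both images are aligned and equally sized; it is not a measure prescribed by the paper, but it could serve as the image_distance used in the earlier pipeline sketch.

```python
import numpy as np

def image_distance(original, generated):
    """Mean squared pixel error between the original input line image and the
    image regenerated from the OCR output text. Both images are expected to be
    numpy arrays of the same shape, scaled to [0, 1]. Higher values suggest
    more OCR errors, since recognition mistakes change the regenerated image."""
    original = original.astype(np.float64)
    generated = generated.astype(np.float64)
    return float(np.mean((original - generated) ** 2))
```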
REFERENCES

[1] ABBYY FineReader (2018). ABBYY FineReader. Wikipedia.
[2] Graves, A. (2013). Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.
[3] Breuel, T. M. (2008). The OCRopus open source OCR system. Proc. SPIE, vol. 6815.
[4] Chandel, V. S. (2018). Deep Learning based Text Recognition (OCR) using Tesseract and OpenCV. Learn OpenCV.
[5] GNOME (2018). Pango. https://www.pango.org/.
[6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680.
[7] Jenckel, M., Bukhari, S. S., and Dengel, A. (2018). Transcription free LSTM OCR model evaluation. In Proceedings of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR 2018), August 5-8, Niagara Falls, New York, United States. IAPR, IEEE.
[8] Mao, X., et al. (2017). Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision.
[9] Mirza, M., and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[10] Phillips, I. (1996). User's reference manual for the UW English/Technical Document Image Database III. UW-III English/Technical Document Image Database Manual.
[11] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[12] University of Würzburg (2018). "Narragonien digital". http://kallimachos.de/kallimachos/index.php/Narragonien.
[13] Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.