Object Detection Based Handwriting Localization
SAP Innovation Center Network (ICN) Nanjing, China
1 Introduction
utmost help to refine strategies and make decisions. However, the personally
identifiable information (PII) must be anonymized beforehand, as it is not worth
the risk of any privacy exposure.
Ostensibly, tabular texts make up the overwhelming majority of business document processing. In fact, documents containing handwritten notes or signatures, such as invoices, also play an important role. The goal of this work is to localize handwritten regions in full-page invoice images in order to enhance anonymization. The detected handwriting shall be anonymized afterwards, although the detailed implementation of this step is beyond the scope of this work. As the expected result of the whole pipeline, handwriting should be excluded from the anonymized invoices, where it is assumed that all handwritten regions contain PII. One trivial way to realize this would be to replace the handwritten boxes with redacted signatures or notes.
Tesseract [24], the de facto paradigm of optical character recognition (OCR) engines, is nowadays widely used in industry to extract textual information from images. OCR engines are competent at dealing with optimal data, which, in the context of this work, refers to document images where all items of interest are regularly printed texts. In contrast, real-world documents usually contain not only regularly printed texts, but also irregular patterns, such as handwritten notes, signatures, logos etc., which might also be desired.
In this work, we adopt object detection approaches with deep learning networks to localize handwritten regions in the document data of SAP’s Data Anonymization Challenge1. The feasibility and effectiveness of such algorithms have been empirically shown in scenarios where the objects (handwriting) and the backgrounds (printed texts) are extremely similar. Besides, the improvement from Faster R-CNN [23] to Cascade R-CNN [1] can be effortlessly reproduced. In addition, a new baseline for handwriting localization as a subtask of SAP’s Data Anonymization Challenge1 has been released. Last but not least, the proposed deep learning approach with Cascade R-CNN [1] has demonstrated impressive generalizability. The trained model, based on the English-dominant dataset, works well on fictitious unseen invoices, even for those in Chinese as toy examples. Empirically, it is believed that the deep learning model has learned the irregularity of the images.
Since the detailed types of handwritten regions, such as signatures or notes, are not discriminated during the experiments, we use the terms detection and localization interchangeably in this work. The one-class detection merely consists of the localization regression task without classification. Despite the simplicity of the task description, it is still challenging to distinguish handwritten notes from printed texts, as they are similar regarding the contextual information. Furthermore, the detected bounding boxes, which should contain PII, are expected to be more accurate compared to general object detection tasks, i.e., the primary evaluation score AP^FP (average precision with penalty of false positives, see Section 4.4) is thresholded with an IoU of 80%.
1 https://www.herox.com/SAPAI/
2 Related Work
In signature verification competitions [15,19], the input images are usually cropped handwritten regions. Likewise, some handwritten text recognition datasets provide the option of images labeled with divided lines [16]. This work is expected to bridge the gap between this research and industrial applications through handwriting localization. Besides, text detection in natural scene images is close to our work. One significant difference between these two tasks lies in the target objects: all texts should be detected in the scene text detection task (e.g. [18]), while only the handwritten texts in this work. The other difference is the background: the background in scene text detection is the natural view, whereas in this work it consists of the printed texts and tables on the blank document. Also based on Faster R-CNN [23], Zhong et al. [26] use LocNet [6] to improve the accuracy in scene text detection, whereas we use Cascade R-CNN [1], the cascade version of Faster R-CNN.
There are two main categories of methods to localize the handwritten re-
gions in the documents. The OCR based approaches recognize and then exclude
printed texts. As a result, the unrecognizable parts are believed to be the hand-
writing. In contrast, the object detection based approaches regard this as a
localization task, where the handwriting is the target and all other items (such
as printed texts, logos, tables, etc.) are considered as the background. Thanks to
the datasets and detection challenges on common objects (e.g. [5,13]), a considerable number of novel object detection algorithms have been proposed in recent years, e.g. Faster R-CNN [23], YOLO [21], SSD [14], RetinaNet [12], Cascade R-CNN [1], etc.
Three different approaches submitted to the Data Anonymization Challenge1
are also briefly introduced in the following sections, including an OCR based
approach and two deep learning based approaches (one with YOLOv3 [22], one
with Google’s paid cloud service).
OCR based approaches. In this section, an example proposal from the
challenge1 is demonstrated. First, the images are sequentially preprocessed, in-
cluding removal of the horizontal and vertical lines, median filtering (to remove salt-and-pepper noise), thresholding and morphological filtering (e.g. dilation and erosion). The handwritten parts are then discriminated from the printed ones with respect to manually chosen features like the heights and widths of the text boxes, the text contents and the confidence scores recognized by OCR. In the experiments, this approach yields results on a par with those of the deep learning approaches. However, the robustness and the generalizability of the deep learning approaches are believed to be advantageous.
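To make the described chain concrete, a minimal OpenCV sketch could look as follows; the kernel sizes, the thresholding method and the exact order of operations are illustrative assumptions, not the parameters of that submission.

```python
import cv2

def preprocess(gray):
    """Illustrative preprocessing chain: denoise, binarize, remove ruling lines, consolidate strokes."""
    # Median filtering against salt-and-pepper noise.
    img = cv2.medianBlur(gray, 3)
    # Otsu thresholding; invert so that ink becomes the (white) foreground.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Detect long horizontal and vertical lines (table rulings) via morphological opening.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel) | \
            cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    binary = cv2.bitwise_and(binary, cv2.bitwise_not(lines))
    # Dilation/erosion (closing) to merge characters into word or line blobs.
    blob_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, blob_kernel)
```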
Object detection based approaches with deep learning. Since object
detection is an intensively researched area in the field of computer vision, it
is natural to directly apply the deep learning algorithms to the handwriting
localization task. With the deep learning engine ImageAI [17], the networks
like YOLOv3 [22] can be trained in an end-to-end manner. Moreover, some
deep learning services like Google’s Cloud AutoML Vision API take it further,
managing the training process even without specifically assigning an algorithm.
Fig. 1: Network Architectures of Faster R-CNN [23] and Cascade R-CNN [1].
Figures are adapted from [1].
3 Method
3.1 Faster R-CNN
Faster R-CNN [23] consists of two modules: the Region Proposal Network (RPN)
that proposes rectangular regions containing the desired objects, and the Fast
R-CNN detector [7] that predicts the classes and the locations.
The processing pipeline is demonstrated based on Fig. 1a. The input images
(I) are first fed into a convolutional neural network (conv), where the shared
features are extracted for both RPN and Fast R-CNN detector. Given the shared
convolutional feature map of a size w × h × d and the number of the anchors k
for each location in the feature map, the RPN head (H0) transforms it into two
proposal features of w × h × 2k (C0) and w × h × 4k (B0) with one e.g. 3 × 3
convolutional layer followed by two sibling 1 × 1 convolutional layers.
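As a rough PyTorch sketch of this head (the anchor count k = 9 and the layer names are illustrative assumptions, not the exact MMDetection implementation):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of H0: one 3x3 conv followed by two sibling 1x1 convs producing C0 and B0."""
    def __init__(self, in_channels=256, k=9):                      # k anchors per location
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, kernel_size=1)    # C0: w x h x 2k objectness scores
        self.reg = nn.Conv2d(in_channels, 4 * k, kernel_size=1)    # B0: w x h x 4k coordinate offsets

    def forward(self, feat):                                       # feat: (N, d, h, w) shared feature map
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)
```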
Now, w × h × k proposals have been generated, each in the form of 6 repre-
sentative values: 2 objectness scores and 4 coordinate offsets. The higher-scored
proposals from B0 are selected as the inputs of the Fast R-CNN detector, together
with the shared convolutional feature map. The transformation of the proposals’ coordinates between the original images and the feature maps is handled via, e.g., RoIPool [7] (pool) or RoIAlign [9] (align). The pooled or aligned region
of interest (RoI) feature map of some fixed size is flattened then projected onto
a feature vector via the RoI head (H1). Finally, two vectors of classes (C1) and
locations (B1) are obtained by fully connected layers upon the feature vector.
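The detector head can be sketched analogously; the 7 × 7 RoI size, the 1024-dimensional feature vector and the stride of 16 are common defaults assumed here for illustration, not values taken from our configuration.

```python
import torch.nn as nn
from torchvision.ops import roi_align

class FastRCNNHead(nn.Module):
    """Sketch of H1: RoIAlign, flatten, project to a feature vector, then predict C1 and B1."""
    def __init__(self, in_channels=256, num_classes=2):            # handwriting + background
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * 7 * 7, 1024),
            nn.ReLU(),
        )
        self.cls = nn.Linear(1024, num_classes)                    # C1: class scores
        self.reg = nn.Linear(1024, 4 * num_classes)                # B1: box regression targets

    def forward(self, feat, proposals):
        # proposals: list of (num_boxes, 4) tensors in image coordinates; spatial_scale
        # maps them onto a stride-16 feature map (align).
        rois = roi_align(feat, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
        x = self.fc(rois)
        return self.cls(x), self.reg(x)
```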
There are two places where multi-task loss functions are calculated: RPN
(C0 and B0) and Fast R-CNN detector (C1 and B1). First, log loss is used for
both classification tasks (specifically, sigmoid activation function plus binary
cross entropy loss for C0 and softmax activation function plus cross entropy loss
for C1). Second, the smooth L1 loss [7] is used for both bounding box regression tasks (B0 and B1), which is defined as:
\[
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2, & \text{if } |x| < 1, \\
|x| - 0.5, & \text{otherwise.}
\end{cases}
\]
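For reference, the piecewise definition translates directly into a few lines of PyTorch (equivalent, up to the reduction, to torch.nn.SmoothL1Loss with beta = 1):

```python
import torch

def smooth_l1(x: torch.Tensor) -> torch.Tensor:
    """Element-wise smooth L1: 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise."""
    abs_x = x.abs()
    return torch.where(abs_x < 1, 0.5 * x ** 2, abs_x - 0.5)
```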
4 Experiments
4.1 Dataset
The dataset used in this work consists of scanned full-page low-quality invoices from the tobacco industry in the 1990s (http://legacy.library.ucsf.edu/), which were once used in a document classification challenge [8]. Based on the invoice (or invoice-like) images from the same dataset, the labels and the bounding boxes of names and handwritten notes have been manually annotated for the Data Anonymization Challenge1.
4.2 Preprocessing
All experiments were run on a single RTX 2080 Ti GPU. The implementations of the deep learning networks are adopted from MMDetection [3], the open-source object detection toolbox from OpenMMLab.
The default optimizer is Stochastic Gradient Descent (SGD [11]) with a learn-
ing rate of 0.001, a momentum of 0.9, and a weight decay of 0.0001. Since the
training set of 600 images is relatively small, the default number of epochs is
set to a relatively large value (200) to make full use of the computational capacity when the training runs overnight. During the experiments, it is observed that 200 epochs are appropriate (Fig. 4). The train/val/test sets are defined as in Section 4.1. The model weights of the epoch with the best result on the val set are chosen to make predictions on the test set. The different preprocessing steps are introduced in Section 4.2, and they are compared in detail with Faster R-CNN [23] and Cascade R-CNN [1]. Next, two additional deep learning networks, RetinaNet [12] and YOLOv3 [22], have been tested with the preprocessing step that yields the best result with Cascade R-CNN. Detailed experimental results can be found in Section 4.6.
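For orientation, these settings roughly correspond to an MMDetection-style (2.x) config snippet like the one below; the _base_ file names are placeholders, not the exact configs used in this work.

```python
# Hypothetical excerpt of an MMDetection config mirroring the settings above.
_base_ = ['./cascade_rcnn_r50_fpn.py', './invoice_dataset.py']     # placeholder base configs

optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
runner = dict(type='EpochBasedRunner', max_epochs=200)
evaluation = dict(interval=1, metric='bbox')   # evaluate on val after every epoch
checkpoint_config = dict(interval=1)           # keep per-epoch weights to pick the best val epoch
```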
GIoU. The global IoU of two lists of bounding boxes P = {p_1, p_2, ...} and G = {g_1, g_2, ...} is defined as below:
\[
\mathrm{GIoU}(P, G) =
\frac{\mathrm{Area}\{\,(p_1 \cup p_2 \cup \dots) \cap (g_1 \cup g_2 \cup \dots)\,\}}
     {\mathrm{Area}\{\,(p_1 \cup p_2 \cup \dots) \cup (g_1 \cup g_2 \cup \dots)\,\}}
\qquad (2)
\]
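Eq. (2) can be computed, for instance, by merging each list of boxes into a single region with shapely; this sketch assumes boxes given as (x1, y1, x2, y2) tuples.

```python
from shapely.geometry import box
from shapely.ops import unary_union

def global_iou(pred_boxes, gt_boxes):
    """Global IoU (Eq. 2) between two lists of (x1, y1, x2, y2) boxes."""
    p = unary_union([box(*b) for b in pred_boxes])   # union of all predicted boxes
    g = unary_union([box(*b) for b in gt_boxes])     # union of all ground-truth boxes
    union_area = p.union(g).area
    return p.intersection(g).area / union_area if union_area > 0 else 0.0
```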
AP^FP. Average precision with penalty of false positives is the original evaluation score for handwriting detection used in the Data Anonymization Challenge1, which is defined as follows:
\[
AP^{FP} =
\begin{cases}
\dfrac{|M^G|}{|G|} \cdot 0.75^{\,|P| - |M^P|}, & \text{if } |G| \neq 0; \\[6pt]
0.75^{\,|P| - |M^P|}, & \text{otherwise,}
\end{cases}
\qquad (3)
\]
where
\[
M^G = \{\, g \in G \mid \exists\, p_i \in P: \mathrm{IoU}(p_i, g) > T \,\}.
\qquad (4)
\]
Analogously,
\[
M^P = \{\, p \in P \mid \exists\, g_i \in G: \mathrm{IoU}(p, g_i) > T \,\}.
\qquad (5)
\]
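A small per-image reference implementation of Eqs. (3)-(5) might look as follows, assuming axis-aligned (x1, y1, x2, y2) boxes and the default threshold T = 0.8.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def ap_fp(preds, gts, t=0.8):
    """AP^FP for one image: matched-GT recall discounted by 0.75 per unmatched prediction."""
    m_g = sum(any(iou(p, g) > t for p in preds) for g in gts)   # |M^G|
    m_p = sum(any(iou(p, g) > t for g in gts) for p in preds)   # |M^P|
    penalty = 0.75 ** (len(preds) - m_p)
    return (m_g / len(gts)) * penalty if gts else penalty
```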
It differs from the popular evaluation score AP (average precision) for object detection in COCO [13], where false positives (FP) are not particularly penalized.
Fig. 2: (a) GIoU = 76.4; AP^FP_80 = 0; AP^FP_50 = 100. (b) GIoU = 70.7; AP^FP_80 = 0; AP^FP_50 = 0. (c) GIoU = 23.1; AP^FP_80 = 0; AP^FP_50 = 0.
4.5 Postprocessing
The deep learning classifier outputs a confidence score for each corresponding
class. This confidence score can be thresholded as a hyperparameter to control
the false positive rate in the postprocessing step. It has been observed during
the experiments that the best results (w.r.t. AP^FP_80) are achieved if this confidence is thresholded at 0.8, which is thus chosen as the default.
In addition, to reduce overlapping bounding boxes, some postprocessing steps are applied at the end, following the simple criterion that one large box is preferable to multiple small ones. First, all the predicted bounding boxes of each image are sorted in ascending order by their areas. Second, starting from the smallest one, each box is checked against the larger ones: if its intersection area with a larger box divided by its own area exceeds a threshold (chosen as 0.9), the smaller box is omitted. This postprocessing is applied by default; we observed a minimal improvement of around 0.3% in terms of AP^FP_80.
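The merging criterion can be sketched as a small stand-alone function; the box format is assumed to be (x1, y1, x2, y2) and the 0.9 threshold follows the text.

```python
def merge_boxes(boxes, thr=0.9):
    """Drop a box if a larger box covers more than `thr` of its area (one large box over many small ones)."""
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    boxes = sorted(boxes, key=area)                  # ascending by area
    kept = []
    for i, small in enumerate(boxes):
        covered = False
        for large in boxes[i + 1:]:                  # only boxes at least as large
            ix1, iy1 = max(small[0], large[0]), max(small[1], large[1])
            ix2, iy2 = min(small[2], large[2]), min(small[3], large[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            if area(small) > 0 and inter / area(small) > thr:
                covered = True
                break
        if not covered:
            kept.append(small)
    return kept
```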
4.6 Results
In this section, the experimental results are presented regarding the different
preprocessing steps (Table 1), the influences of the mechanism of Bad-Quality
(Table 1, 3), the comparison of various deep learning networks (Table 2) and the
results released on the leaderboard of Data Anonymization Challenge1 (Table 3).
The improved performance from Faster R-CNN to Cascade R-CNN is specifically demonstrated (Fig. 3, 4). Furthermore, examples of predicted handwritten
regions from the val set are visualized in Fig. 5, together with the ground-truth
bounding boxes.
Fig. 3: mAP50 and mAP75 (val) of Cascade and Faster R-CNN. The cascade version surpasses Faster R-CNN by a larger margin under the more strict criterion.
Fig. 4: Overall loss (bottom; left axis) and mAP50 (top; right axis) during the training. The dashed vertical line indicates the boundary of overfitting after around 170 epochs.
Table 1: Results with different preprocessing inputs for Faster R-CNN and Cascade R-CNN.

Network         Input       AP^FP_80   AP^FP_80*   AP^FP_80+   AP^FP_50   GIoU
Faster R-CNN    o           34.2       45.4        59.2        65.5       64.6
Faster R-CNN    o-          35.1       45.7        58.2        64.9       65.1
Faster R-CNN    pre         31.3       43.1        55.9        54.0       56.3
Faster R-CNN    pre-        29.6       42.4        54.0        54.6       55.3
Faster R-CNN    o/pre       34.6       43.9        56.3        64.1       64.4
Faster R-CNN    o-/pre-     35.1       44.5        53.3        66.7       65.4
Faster R-CNN    o/o-/pre    37.2       45.6        59.6        63.3       64.9
Faster R-CNN    o/o-/pre-   35.1       45.0        57.4        62.4       64.3
Cascade R-CNN   o           37.7       45.7        56.6        65.6       66.4
Cascade R-CNN   o-          34.7       42.9        49.2        65.1       66.5
Cascade R-CNN   pre         31.9       41.6        47.5        55.4       56.7
Cascade R-CNN   pre-        32.2       42.0        47.6        57.0       58.7
Cascade R-CNN   o/pre       36.3       44.3        56.0        64.4       66.4
Cascade R-CNN   o-/pre-     35.0       44.7        56.6        64.1       64.3
Cascade R-CNN   o/o-/pre    37.2       46.9        60.0        64.4       66.3
Cascade R-CNN   o/o-/pre-   38.3       46.0        56.5        65.0       66.8
Table 2: Comparison of various deep learning networks and their inference speeds.

Network         Inference   AP^FP_80   AP^FP_80*   AP^FP_80+   AP^FP_50   GIoU
YOLOv3          42 fps      36.6       43.2        47.4        61.0       62.2
RetinaNet       11 fps      27.9       40.8        44.4        51.7       54.9
Faster R-CNN    11 fps      37.1       45.3        57.2        62.1       66.6
Cascade R-CNN   10 fps      41.8       47.5        57.5        66.9       68.2
Table 3: Comparison with the leaderboard. OCR: Tesseract with manual engineering. Service: Google’s Cloud API. ? and † denote YOLOv3 results from the leaderboard and ours, respectively. BQ: whether Bad-Quality is used.

Method      AP^FP_80   BQ
YOLOv3 ?    26.3       ✗
OCR         37.5       ✗
Service     42.5       ✗
YOLOv3 †    36.6       ✗
YOLOv3 †    43.2       ✓
Cascade     41.8       ✗
Cascade     47.5       ✓

Fig. 5: Visual results of cropped handwritten regions from the val set (Cascade R-CNN with "o/o-/pre-"). Green: ground-truth box; Red: predicted box.
machines, the evaluation scores AP^FP_80* and AP^FP_80+ might bring in pseudo-good results. Therefore, the number of images marked as Bad-Quality is loosely limited to at most 50% of all images. Conclusively, the mechanism of Bad-Quality is believed to be a flexible trick to deal with the hard cases.
scores [13] for brevity) on the val set. In addition, as depicted in Fig. 4, the training starts to overfit after around 170 epochs, from where the loss values and mAP50 start decreasing. Thus, the 200 epochs chosen for all the experiments are appropriate.
4.7 Generalizability
English is by far the dominant language in the dataset, although other languages such as Dutch or German are also included. However, the deep learning network is not expected to recognize the discrepancies between different languages; it naturally treats the languages using Latin alphabets indiscriminately. In this section, it is tested whether the trained model works on redacted real-world images in foreign languages.
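Such a test requires only a few lines with MMDetection's inference API; the config and checkpoint paths below are placeholders, and the 0.8 score threshold is the default from Section 4.5.

```python
from mmdet.apis import init_detector, inference_detector

# Placeholder paths: the actual config/checkpoint names are not part of this text.
model = init_detector('configs/cascade_rcnn_invoice.py', 'work_dirs/latest.pth', device='cuda:0')
result = inference_detector(model, 'unseen_invoice.png')   # per-class list of (N, 5) arrays: x1, y1, x2, y2, score
handwriting = [b for b in result[0] if b[4] > 0.8]         # keep boxes above the 0.8 confidence threshold
```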
Fig. 6 illustrates two toy examples of fictitious and unseen invoices to eval-
uate the generalizability of the trained model. The model used to localize the
handwritten regions is Cascade R-CNN. It is noteworthy that the language in
the left image of Fig. 6 is Chinese, which can be considered a language foreign to the dataset. Analogous to the German invoice shown on the right, the handwritten regions of both images are accurately detected as desired. The generalizability of the R-CNN family has also been observed by [26] in text detection in natural scene images.
It is believed to be beneficial in industry if a model can be trained once and applied to various cases. Additionally, this raises the common question of what the deep learning network has actually learned. In this case, it is supposed that the irregularity might be learned to discriminate the printed and handwritten texts.
Fig. 6: Test of generalizability on toy examples with fictitious and unseen in-
voices. The handwritten regions are accurately localized from the images in
Chinese and German.
5 Discussion
5.1 Conclusion
In this work, we present an object detection based approach to localize the hand-
written regions, which is effective, fast and applicable to the unseen languages.
First, the influences of the preprocessing steps are investigated. It has been em-
pirically found that the fused concatenation of original and preprocessed images
as the inputs can achieve the best performance. Second, different deep learning
networks are compared. It is noticeable that the improvement from Faster R-
CNN to Cascade R-CNN can be reproduced and the high quality characteristic
of the cascade version suits the problem of handwriting localization well. The results of our approaches can serve as a baseline of deep learning approaches to the handwriting localization problem. At last, the generalizability of the deep learning approach is impressive. The learned model is capable of successfully detecting the handwritten regions in real-world unseen images, even for those in the unseen language of Chinese. We believe it is of great interest both for
future research and for the industrial applications.
5.2 Outlook
As showcased in Fig. 5, some printed cursive texts are also detected as hand-
writing. It remains challenging to distinguish such nuanced discrepancies. Fur-
thermore, apart from the object detection approaches, other methods in the field of computer vision could also be adopted to distinguish the handwritten texts from the printed ones. One example is anomaly detection, where the printed
texts can be considered as the normal instances, since they are more regularly
shaped. Thanks to algorithms like the variational autoencoder (VAE) [10], it is also promising to accomplish such tasks in a semi-supervised or even unsupervised manner. Another benefit of using algorithms like the VAE is that the learned intermediate representations can also be exploited to synthesize artificial signatures, further enhancing the anonymization without eliminating the existence of such entities.
References
1. Cai, Z., Vasconcelos, N.: Cascade r-cnn: high quality object detection and instance
segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence
(2019)
2. Canny, J.: A computational approach to edge detection. IEEE Transactions on
pattern analysis and machine intelligence (6), 679–698 (1986)
3. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z.,
Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R.,
Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection:
Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155
(2019)
4. Duda, R.O., Hart, P.E.: Use of the hough transformation to detect lines and curves
in pictures. Communications of the ACM 15(1), 11–15 (1972)
5. Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisser-
man, A.: The pascal visual object classes challenge: A retrospective. International
Journal of Computer Vision 111(1), 98–136 (Jan 2015)
6. Gidaris, S., Komodakis, N.: Locnet: Improving localization accuracy for object
detection. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. pp. 789–798 (2016)
7. Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE international conference on
computer vision. pp. 1440–1448 (2015)
8. Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for
document image classification and retrieval. In: 2015 13th International Conference
on Document Analysis and Recognition (ICDAR). pp. 991–995. IEEE (2015)
9. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the
IEEE international conference on computer vision. pp. 2961–2969 (2017)
10. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114 (2013)
11. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
computation 1(4), 541–551 (1989)
12. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object
detection. In: Proceedings of the IEEE international conference on computer vision.
pp. 2980–2988 (2017)
13. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference
on computer vision. pp. 740–755. Springer (2014)
14. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.:
Ssd: Single shot multibox detector. In: European conference on computer vision.
pp. 21–37. Springer (2016)
15. Malik, M.I., Liwicki, M., Alewijnse, L., Ohyama, W., Blumenstein, M., Found,
B.: Icdar 2013 competitions on signature verification and writer identification for
on-and offline skilled forgeries (sigwicomp 2013). In: 2013 12th International Con-
ference on Document Analysis and Recognition. pp. 1477–1483. IEEE (2013)
16. Marti, U.V., Bunke, H.: The iam-database: an english sentence database for of-
fline handwriting recognition. International Journal on Document Analysis and
Recognition 5(1), 39–46 (2002)
17. Olafenwa, M., Olafenwa, J.: ImageAI, an open source python library built to empower developers to build applications and systems with self-contained computer vision capabilities (Mar 2018–), https://github.com/OlafenwaMoses/ImageAI
18. Nayef, N., Patel, Y., Busta, M., Chowdhury, P.N., Karatzas, D., Khlif, W., Matas,
J., Pal, U., Burie, J.C., Liu, C.l., et al.: Icdar2019 robust reading challenge on
multi-lingual scene text detection and recognition—rrc-mlt-2019. In: 2019 Inter-
national Conference on Document Analysis and Recognition (ICDAR). pp. 1582–
1587. IEEE (2019)
19. Ortega-Garcia, J., Fierrez-Aguilar, J., Simon, D., Gonzalez, J., Faundez-Zanuy,
M., Espinosa, V., Satue, A., Hernaez, I., Igarza, J.J., Vivaracho, C., et al.: Mcyt
baseline corpus: a bimodal biometric database. IEE Proceedings-Vision, Image and
Signal Processing 150(6), 395–401 (2003)
20. Redmon, J.: Darknet: Open source neural networks in c. http://pjreddie.com/
darknet/ (2013–2016)
21. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. pp. 779–788 (2016)
22. Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint
arXiv:1804.02767 (2018)
23. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de-
tection with region proposal networks. IEEE transactions on pattern analysis and
machine intelligence 39(6), 1137–1149 (2016)
24. Smith, R.: An overview of the tesseract ocr engine. In: Ninth international con-
ference on document analysis and recognition (ICDAR 2007). vol. 2, pp. 629–633.
IEEE (2007)
25. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations
for deep neural networks. arXiv preprint arXiv:1611.05431 (2016)
26. Zhong, Z., Sun, L., Huo, Q.: Improved localization accuracy by locnet for faster
r-cnn based text detection. In: 2017 14th IAPR International Conference on Doc-
ument Analysis and Recognition (ICDAR). vol. 1, pp. 923–928. IEEE (2017)