Research Article
Blur2Sharp: A GAN-Based Model for Document Image
Deblurring
Hala Neji 1,2,3,*, Mohamed Ben Halima 2, Tarek M. Hamdani 2, Javier Nogueras-Iso 3, Adel M. Alimi 2,4

1 National Engineering School of Gabes (ENIG), University of Gabes, Gabes, Tunisia
2 REGIM Lab (Research Groups in Intelligent Machines), University of Sfax, National Engineering School of Sfax (ENIS), Sfax, Tunisia
3 Aragon Institute of Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain
4 Department of Electrical and Electronic Engineering Science, Faculty of Engineering and the Built Environment, University of Johannesburg, Johannesburg, South Africa
1. INTRODUCTION

Thanks to mobile technology we are able to capture documents in a simple way and at any moment. Text documents such as bank documents, advertisements, courier receipts, hand-written notes, digitized receipts, public information signboards, and information panels captured by portable cameras are very common in our daily lives. Portable cameras offer great convenience for acquiring and memorizing information, as they provide a new alternative for document acquisition in less constrained imaging environments than personal scanners. However, due to the variations in the imaging conditions as well as the target document type, there are many factors that can degrade the images, such as image blurring, shading distortions, geometrical warping, or noise pollution. Frequently, the motion blur caused by camera shake and the out-of-focus blur can affect the quality of the images obtained by mobile devices. Although blur may not seem very relevant at first sight, it may cause problems in subsequent processing tasks such as text segmentation or optical character recognition (OCR). Thus, traditional scanner-based OCR systems cannot be directly applied to camera-captured images, and a new level of processing needs to be addressed.

The objective of this work is to study a new model for removing different types of blur from real blurry document images and generating the corresponding sharp images. Figure 1 illustrates the difference between source images with blur (right) and sharp images (left).

Various deblurring techniques have been proposed so far based on blur kernel estimation: blind deconvolution methods and nonblind deconvolution methods. In nonblind deconvolution methods, we have some knowledge about the blur kernel. In contrast, in blind deconvolution methods no information about the blur kernel is known. Blind deblurring estimates a latent sharp image and a blur kernel, namely the point spread function (PSF), from a blurred image. This problem has been deeply studied (see Section 2). However,

Figure 1 Examples of real source images with blur (right) and sharp images (left).

* Corresponding author. Email: [email protected]
2 H. Neji et al. / International Journal of Computational Intelligence Systems, in press
generic methods have not been effective with real-world blurred images.

The proposal of this work is to eliminate blur and restore blurred images that are captured by low-cost camera devices under suboptimal conditions. For this purpose, we propose to extend a cycle-consistent generative adversarial network (CycleGAN) for translating a blurred input text image into a sharp one. Recent methods based on generative adversarial networks (GANs) for tasks such as image-to-image translation [1] depend on the availability of training examples where the same data is available in both domains. However, CycleGAN [2] is able to learn such a translation without one-to-one mapping between training data in the source and target domains. The challenge of this work is to propose a new architecture based on CycleGAN, which we call “Blur2Sharp CycleGAN,” for the task of text document deblurring. Blur2Sharp CycleGAN adjusts the parameters of CycleGAN that best fit the purpose of document deblurring.

The rest of this work is organized as follows. Section 2 reviews the state-of-the-art in blind deconvolution methods. Section 3 describes our suggested system. The evaluation and experimental results are provided in Section 4. Finally, Section 5 provides some concluding remarks.

2. STATE-OF-THE-ART

Given its wide range of applications, document image deblurring has attracted considerable attention in recent years, and various approaches have been proposed, mainly in the field of blind deconvolution methods.

In blind deconvolution methods, the first task is to estimate the blur kernel. After this, the second part of the method is to restore the final latent image by means of a nonblind deblurring algorithm. Under the assumption that the blur is uniform and spatially invariant, the blurry image can be modeled as Y = X ∗ K + 𝜀, where X is a latent sharp image, K is an unknown blur kernel, ∗ denotes the convolution operator, 𝜀 is the additive noise, and Y is the blurred observation.

Chen et al. [3] suggested an effective document image deblurring algorithm based on a Bayesian method that analyzes the local structure of a document image and uses an intensity probability density function as a prior for deblurring. However, it is not generally applicable to blurred text images because it depends on text segmentation. Cho et al. [4] proposed another Bayesian method that takes into account more specific properties of text documents. Pan et al. [5] proposed another approach for text deblurring that takes advantage of L0-regularized intensity and gradient priors. Nayef et al. [6] suggested a method for document image deblurring that uses sparse representations improved by nonlocal means. Zhang et al. [7] used a gradient histogram preservation strategy for document image deblurring.

More recently, Ljubenovic et al. [8] proposed the use of a dictionary-based prior for class-adapted blind deblurring of document images. Using a large set of motion-blurred images with the associated ground-truth blur kernels, Pan et al. [9] proposed a method to learn data fitting functions. Last, Jiang et al. [10] proposed a method based on the two-tone prior for text image deblurring.

Convolutional neural networks (CNNs) have also been applied to various image enhancement tasks [11,12]. Focusing on text image deblurring, Hradiš et al. [13] proposed an end-to-end method to generate sharp images from blurred ones using a CNN. Their network, consisting of 15 convolutional layers, is trained on 3 classes (blurred image, sharp image and blur kernel), but there are some disadvantages: on the one hand, pairwise images are required to train the networks; on the other hand, the result of deblurring is not appropriate for natural images which have a color background. Pan et al. [14] presented a generic approach with two main ideas: modify the prior to assume that the dark channel of natural images is sparse instead of zero, and impose the sparsity for kernel estimation. They used a CNN to predict the blur kernel.

In the past few years, generative adversarial networks (GANs) have been used in different image-related applications [15] such as generating synthetic data, style transfer, super-resolution, denoising, deblurring or text-to-image generation. For instance, Lee et al. [16] used a GAN for progressive semantic face deblurring to reduce the motion blur on input face images. Guo et al. [17] proposed a GAN-based method for video deblurring with the objective of improving visual object tracking. More related to document deblurring, Xu et al. [18] proposed a method for jointly deblurring and super-resolving face and text images that are typically degraded by out-of-focus blur. Nimisha et al. [19] also used a GAN for an end-to-end deblurring network: using an adversarial loss, the network learns a strong prior on the clean image domain and maps the blurred image to its clean equivalent. Also inspired by GAN architectures, Lu et al. [20] presented an unsupervised method for domain-specific single-image deblurring that tries to disentangle the content features and the blur features in blurred images.

In summary, the previous approaches cannot be generalized to deal with different scenarios in text document deblurring. Therefore, it is a challenging task to count on a deep learning method for building a general image prior that is able to handle different scenarios.

3. PROPOSED METHOD

Image deblurring aims to restore a clean image from a blurred one. As mentioned in the introduction, this work explores the feasibility of applying CycleGAN [2] with a new architecture for the challenging task of cleaning blurred text images. We want to demonstrate that the CycleGAN approach is suitable for the simpler task of text image deblurring, where the results are good. Our idea is that if we can treat blur and sharpness as a kind of image style, successful image deblurring may be achieved with an unpaired image dataset based on CycleGAN.

The main advantage of CycleGAN is that it does not require a pairwise image dataset. To solve the issue of not having a ground-truth document for every input, CycleGAN proposes a reverse mapping function Gy. If a generator function Gx translates a document from domain X to domain Y just changing the appearance, it should be possible to map the image back to domain X with another generator function Gy and reconstruct the initial image: X → Gx(X) → Gy(Gx(X)) ≈ X. The functions can also be applied in reverse,
Y → Gy(Y) → Gx(Gy(Y)) ≈ Y, and no ground-truth documents are needed for deblurring. The model also contains the associated adversarial discriminators Dy and Dx. Dy encourages Gx to translate X into outputs indistinguishable from domain Y, and vice versa for Dx and Gy. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again, we should arrive at where we started. Figure 2 shows a schema of this model, which allows us to train our network with unpaired images to translate from blur to sharp or inversely.

Figure 3 shows the training process of our proposed Blur2Sharp CycleGAN. It consists of two generators and two discriminators. The main goal of the generators is to learn the mapping between two image domains. In addition, the two adversarial discriminators aim to distinguish between images and translated images, in the same way as they aim to discriminate between the sharp images and the generated images. The following subsections explain the details of the generative and discriminative networks.

3.1. Generative Network

As shown in Figure 4, both generators consist of three parts: an encoder, a transformer and a decoder. This figure summarizes the architecture of these three parts, which has shown good performance for image deblurring.

Encoder: The input of the network is [256, 256, 3]. There are 3 layers that extract the features from an image, one stride each time. In the first layer each filter is 7 × 7 and the stride is 1 × 1. The activation function is ReLU. The shape of the output is [256, 256, 64] (by padding). Then, the output of the first layer is passed to the following layers. The hyperparameters of the second layer are: 128 filters, where each filter is 3 × 3 with stride 2; the activation function is ReLU; and the output is [128, 128, 128]. The hyperparameters of the third layer are: 256 filters, where each filter is 3 × 3 with stride 2; the activation function is ReLU; and the output is [64, 64, 256].

Transformer: The transformer aims to transform the feature vectors of an image from one domain to another. Its input is [64, 64, 256]. We started building our model from scratch with 9 layers of ResNet blocks. However, as we did not obtain good results, we changed the model structure to improve them. After testing different possibilities, we chose the hyperparameters and structures that best suited the deblurring task: we use 12 layers of ResNet blocks. All filters have size 3 × 3, and the stride is 1. Therefore, the output of our model transformer is [64, 64, 256].

Decoder: The decoder works in the opposite way to the encoder. We reconstruct the image from the feature vectors by applying three deconvolution layers which use the reversed parameters of the encoding step. Therefore, the output of our model decoder is [256, 256, 3].

3.2. Discriminative Network

The main goal of the discriminator is to extract features. It contains 5 convolutional layers. The hyperparameters of the layers are the following: the numbers of filters are [64, 128, 256, 512, 1]; the strides are [2, 2, 2, 2, 1]; and the activation function is LeakyReLU (slope = 0.2) for all the layers except for the last one, which follows the Least Square GAN formulation. All the filters have size 4 × 4. The principal role of the discriminator is to decide whether these features belong to one particular category or not by means of the last layer, which is a convolutional layer with output 1 × 1.

3.3. Loss Function

There are two components in the CycleGAN objective function: an adversarial loss and a cycle consistency loss. Both are essential to obtaining good results. We train the models using the Least Square GAN loss. To achieve our goal, the loss function must satisfy that discriminator X returns values as close as possible to 1 for real images of domain X, and values close to 0 for fake images. Thus, discriminator X would minimize (DiscriminatorX(x) − 1)². In the same way, it should be able to distinguish original and generated images: the discriminator must return 0 for images produced by the generator, so discriminator X would minimize (DiscriminatorX(GeneratorY→X(y)))². On the other hand, generator X should be able to deceive discriminator Y about the authenticity of its generated images. This can be done if the score given by discriminator Y to the generated images is as close to 1 as possible. Therefore, generator X would like to minimize (DiscriminatorY(GeneratorX→Y(x)) − 1)². Finally, the most important loss, the cycle loss, captures that we are able to get the image back using the other generator; thus the difference between the original image and the cyclic image, |Gy(Gx(x)) − x| + |Gx(Gy(y)) − y|, should be as small as possible.
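As a concrete illustration, the least-squares adversarial terms and the cycle consistency term described above can be sketched as plain NumPy functions. This is a sketch only: the function names are ours, and the arrays stand in for discriminator and generator outputs of the actual networks.

```python
import numpy as np

def disc_loss(d_real, d_fake):
    """Discriminator objective: push scores for real images toward 1
    and scores for generated (fake) images toward 0 (least squares)."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def gen_adv_loss(d_fake):
    """Generator objective: deceive the opposing discriminator, i.e.
    push its score for generated images toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

def cycle_loss(x, x_cyc, y, y_cyc):
    """Cycle consistency: Gy(Gx(x)) should reconstruct x and
    Gx(Gy(y)) should reconstruct y (L1 distance)."""
    return np.mean(np.abs(x_cyc - x)) + np.mean(np.abs(y_cyc - y))
```

When the discriminator scores real images as 1 and fakes as 0, and when the cyclic reconstructions match the originals, all three terms vanish, which is the optimum the training pushes toward.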
4. EXPERIMENTS
The following subsections describe the experiments that have been performed using the proposed “Blur2Sharp CycleGAN” method: the corpus of documents for training and testing together with the implementation details (see Section 4.1), and the results that have been obtained (see Section 4.2).

Figure 2 Cycle-consistent generative adversarial network (CycleGAN) approach.
Figure 4 Architecture of the generator network: encoder, transformer and decoder.
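To make the layer dimensions stated in Section 3.1 easy to check, the following sketch walks the input shape through the encoder, transformer and decoder. The helper names are ours, and “same” padding is assumed throughout (as the “[256, 256, 64] (by padding)” remark suggests); this is shape bookkeeping only, not a trainable network.

```python
def conv_out(size, stride, padding="same"):
    # With "same" padding, a convolution divides the spatial size
    # by the stride (all layers here use "same" padding).
    return size // stride

def generator_shapes(h=256, w=256, c=3):
    """Trace the (height, width, channels) shapes of the generator."""
    shapes = [(h, w, c)]
    # Encoder: 7x7/stride 1 with 64 filters, then two 3x3/stride 2 convs.
    for filters, stride in [(64, 1), (128, 2), (256, 2)]:
        h, w = conv_out(h, stride), conv_out(w, stride)
        shapes.append((h, w, filters))
    # Transformer: 12 ResNet blocks, 3x3 filters, stride 1 (shape preserved).
    shapes.append((h, w, 256))
    # Decoder: three deconvolutions reversing the encoder parameters.
    for filters, stride in [(128, 2), (64, 2), (3, 1)]:
        h, w = h * stride, w * stride
        shapes.append((h, w, filters))
    return shapes
```

Running `generator_shapes()` reproduces the dimensions given in the text: the encoder maps [256, 256, 3] to [64, 64, 256], the transformer preserves that shape, and the decoder returns to [256, 256, 3].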
4.1. Datasets and Implementation

With respect to the unpaired dataset used for the experiments, we use the dataset proposed by Hradiš et al. [13]. It consists of images with both defocus blur (using a uniform anti-aliased disc) and motion blur (generated by a random walk). We randomly cropped 2000 blurred image patches of size 256 × 256 and 2001 clean image patches of size 256 × 256 from the dataset for training.

For the testing phase, depicted in Figure 5, we used the test dataset proposed by Hradiš et al. [13], which includes 100 pairs of images (blurred images and their corresponding sharp versions) to evaluate the quality of image restoration.

We have implemented our Blur2Sharp CycleGAN for document deblurring in Python with the help of Keras, running the code on top of the Google TensorFlow framework. In addition, it must be noted that all the computations were run on an Ubuntu server with NVIDIA Quadro P6000 GPUs.

4.2. Results

Figure 6 shows two examples of the application of our “Blur2Sharp CycleGAN” method on the test dataset. Although the visual result of sharp images seems satisfactory, it is difficult to distinguish visually a real image from a synthetic one when the differences between them are small. Therefore, we have quantitatively compared our method with others using two metrics: the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR).

The computation of SSIM is shown in Eq. (1), which includes the following symbols: I is the structural similarity index; (x, y) are the coordinates of a nearby N × N window; σx², σy² are the variances of the intensities in the x, y directions; σxy is the covariance; and μx, μy are the average intensities in the x, y directions.

    I(x, y) = [(2μxμy + c1)(2σxy + c2)] / [(μx² + μy² + c1)(σx² + σy² + c2)]    (1)

Eq. (2) shows the formula for computing PSNR, where MSE stands for the Mean Square Error, Iori is the original image, and Ideblur is the deblurred image. MSE is computed according to Eq. (3), where m × n is the size of the image.

    PSNR(Iori, Ideblur) = 10 log10(255² / MSE)    (2)
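The evaluation metrics above can be implemented directly. The sketch below uses NumPy, evaluates SSIM on a single window pair, and assumes the commonly used stabilizing constants c1 = (0.01 · 255)² and c2 = (0.03 · 255)², which the text does not specify.

```python
import numpy as np

def mse(i_ori, i_deblur):
    """Mean squared error over an m x n image (Eq. (3))."""
    diff = np.asarray(i_ori, dtype=np.float64) - np.asarray(i_deblur, dtype=np.float64)
    return np.mean(diff ** 2)

def psnr(i_ori, i_deblur):
    """Peak signal-to-noise ratio in dB for 8-bit images (Eq. (2))."""
    return 10.0 * np.log10(255.0 ** 2 / mse(i_ori, i_deblur))

def ssim_window(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM (Eq. (1)) evaluated on one pair of N x N windows.

    c1 and c2 default to the usual SSIM constants (an assumption;
    the text does not state the values used)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()            # sigma_x^2, sigma_y^2
    cov_xy = np.mean((x - mu_x) * (y - mu_y))  # sigma_xy
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

For identical windows SSIM evaluates to 1, and PSNR grows without bound as the deblurred image approaches the original; in practice a full-image SSIM averages the window values over the image.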
    MSE = (1 / (mn)) ∑_{i=0}^{m−1} ∑_{j=0}^{n−1} [Iori(i, j) − Ideblur(i, j)]²    (3)

Based on the two metrics defined above, we have compared the results obtained with our proposed Blur2Sharp CycleGAN architecture and the state-of-the-art methods using the test dataset previously described in Section 4.1: the 100 images extracted from the dataset proposed by Hradiš et al. [13]. The results are shown in Table 1. Our PSNR is equal to 32.52 dB and, according to SSIM, we achieve 0.7689 on average using CycleGAN. Therefore, we can conclude that our “Blur2Sharp CycleGAN” has achieved a comparable SSIM and PSNR with the advantage of not having to use pairwise images for the training phase.

In addition, the results are good in terms of visual comparison with respect to the state-of-the-art. For instance, Figure 7 compares the results obtained with different methods on an input image that belongs to the text image deblurring dataset proposed by Hradiš et al. [13]. It can be observed that our Blur2Sharp CycleGAN method generates a sharp image with much clearer characters.

Figure 6 Blur2Sharp cycle-consistent generative adversarial network (CycleGAN). From left to right: blurred input photo, result of Blur2Sharp CycleGAN.

Finally, we have also compared the quality of the text obtained after applying an OCR algorithm to the output images generated by the different deblurring methods. After applying the Tesseract software [21] to obtain the OCR of the sharp images in the test dataset (used as reference text) and of the images returned by the methods, we computed the average cosine similarity between the character frequency vectors derived from each pair of corresponding OCR output files. We considered only the detection of letters and digits within the range of printable ASCII characters. In this comparison we also included the OCR obtained directly from the blurred images, to verify that deblurring is beneficial for OCR. As shown in Table 2, our method performs better than the methods of Pan et al. [5,14] and slightly worse than the method of Hradiš et al. [13]. It is logical that the method of Hradiš et al. performs better because it is a nonblind deconvolution method. It must also be noted that the obtained OCR is not perfect, even for the sharp images in the testing dataset, because we are using cropped images with incomplete words and lines.

5. CONCLUSIONS

We presented a novel model that uses cycle-consistent adversarial networks for document deblurring. Our proposed “Blur2Sharp CycleGAN” architecture adjusts CycleGAN to the task of text image deblurring. We can both deblur images and blur sharp images because CycleGAN has the property of cycle-consistency. It is worth noting that, since we use unpaired images as the training dataset, we do not need the ground-truth sharp images. Based on our prior knowledge, successful image deblurring can be achieved with an unpaired image dataset using CycleGAN.

In addition, it must be noted that our model for image document deblurring obtains comparable results with respect to the current state-of-the-art. Moreover, the obtained PSNR values are at the same level as the best methods found in the research literature. This demonstrates that the idea of treating blur and sharpness as a kind
Figure 7 Example of the outputs obtained after applying different deblurring methods.
of image style actually works. In addition, a by-product of generating fake blurred images using CycleGAN is also provided. Last, it must be noted that our model significantly improved the speed of the deblurring process thanks to GPU acceleration.

As future work we want to test the feasibility of “Blur2Sharp CycleGAN” as part of a more general task of historical document enhancement. Historical documents (and their digitized versions) suffer from several types of degradation that affect their readability and the application of OCR or text analysis processes. Apart from the noise originated by the bad condition of the original documents (e.g., stains or damaged paper), during the scanning (digitization) of pages from a bound book, the curving of the part of the page close to the “spine” introduces not only gutter shadows, but also out-of-focus blur in the shaded area. Our aim is to verify whether “Blur2Sharp CycleGAN” can help to remove this kind of blur in this domain context.

REFERENCES

[1] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 1125–1134.
[2] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 2223–2232.
[3] X. Chen, X. He, J. Yang, Q. Wu, An effective document image deblurring algorithm, in Conference on Computer Vision and Pattern Recognition (CVPR 2011), IEEE, Colorado Springs, CO, USA, 2011, pp. 369–376.
[4] H. Cho, J. Wang, S. Lee, Text image deblurring using text-specific properties, in European Conference on Computer Vision, Springer, Florence, Italy, 2012, pp. 524–537.
[5] J. Pan, Z. Hu, Z. Su, M.-H. Yang, Deblurring text images via L0-regularized intensity and gradient prior, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 2901–2908.
[6] N. Nayef, P. Gomez-Kramer, J.-M. Ogier, Deblurring of document images based on sparse representations enhanced by nonlocal means, in 2014 22nd International Conference on Pattern Recognition, IEEE, Stockholm, Sweden, 2014, pp. 4441–4446.
[7] M. Zhang, C. Desrosiers, C. Zhang, M. Cheriet, Effective document image deblurring via gradient histogram preservation, in 2015 IEEE International Conference on Image Processing (ICIP), IEEE, Quebec City, Canada, 2015, pp. 779–783.
[8] M. Ljubenovic, L. Zhuang, M.A. Figueiredo, Class-adapted blind deblurring of document images, in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), IEEE, Kyoto, Japan, 2017, vol. 1, pp. 721–726.
[9] J. Pan, J. Dong, Y.-W. Tai, Z. Su, M.-H. Yang, Learning discriminative data fitting functions for blind image deblurring, in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 1068–1076.
[10] X. Jiang, H. Yao, S. Zhao, Text image deblurring via two-tone prior, Neurocomputing 242 (2017), 1–14.
[11] Y. Liu, S. Liu, C. Li, D. Yang, Compressed sensing image reconstruction based on convolutional neural network, Int. J. Comput. Intell. Syst. 12 (2019), 873–880.
[12] W. Wei, J. Yongbin, L. Yanhong, L. Ji, W. Xin, Z. Tong, An advanced Deep Residual Dense Network (DRDN) approach for image super-resolution, Int. J. Comput. Intell. Syst. 12 (2019), 1592–1601.
[13] M. Hradiš, J. Kotera, P. Zemčík, F. Šroubek, Convolutional neural networks for direct text deblurring, in Proceedings of the British Machine Vision Conference (BMVC), 2015, vol. 10, p. 2.
[14] J. Pan, D. Sun, H. Pfister, M.-H. Yang, Blind image deblurring using dark channel prior, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 1628–1636.
[15] B. Fu, L. Wang, R. Wang, S. Fu, F. Liu, X. Liu, Underwater image restoration and enhancement via residual two-fold attention networks, Int. J. Comput. Intell. Syst. 14 (2021), 84–95.
[16] T.B. Lee, S.H. Jung, Y.S. Heo, Progressive semantic face deblurring, IEEE Access 8 (2020), 223548–223561.
[17] Q. Guo, W. Feng, R. Gao, Y. Liu, S. Wang, Exploring the effects of blur and deblurring to visual object tracking, IEEE Trans. Image Process. 30 (2021), 1812–1824.
[18] X. Xu, D. Sun, J. Pan, Y. Zhang, H. Pfister, M.-H. Yang, Learning to super-resolve blurry face and text images, in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 251–260.
[19] M.T. Nimisha, K. Sunil, A. Rajagopalan, Unsupervised class-specific deblurring, in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 353–369.
[20] B. Lu, J.-C. Chen, R. Chellappa, Unsupervised domain-specific deblurring via disentangled representations, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 10225–10234.
[21] Google Open Source, Tesseract OCR, 2021. https://opensource.google/projects/tesseract