
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 13, No. 3, 2022

An End-to-End Method to Extract Information from Vietnamese ID Card Images

Khanh Nguyen-Trong
Department of Software Engineering
Posts and Telecommunications Institute of Technology
Hanoi, Vietnam

Abstract—Information extraction from ID cards plays an important role in many daily activities, such as legal, banking, insurance, or health services. However, in many developing countries, such as Vietnam, it is mostly carried out manually, which is time-consuming, tedious, and may be prone to errors. Therefore, in this paper, we propose an end-to-end method to extract information from Vietnamese ID card images. The proposed method contains three steps with four neural networks and two image processing techniques: U-Net, VGG16, Contour detection, and Hough transformation to pre-process input card images; CRAFT and the Rebia neural network for Optical Character Recognition; and Levenshtein distance and regular expressions to post-process the extracted information. In addition, a dataset including 3,256 Vietnamese ID cards, 400k manually annotated texts, and more than 500k synthetic texts was built for verifying our methods. The results of an empirical experiment conducted on our self-collected dataset indicate that the proposed method achieves a high accuracy of 94%, 99.5%, and 98.3% for card segmentation, classification, and text recognition.

Keywords—Optical character recognition; U-Net network; VGG16 network; CRAFT network; Rebia network

I. INTRODUCTION

The ID card is the most widely used identity document of Vietnamese citizens. It provides crucial information used for many business processes, such as ID number, name, address, and date of birth. However, extracting information from such cards is usually carried out manually, which is time-consuming, tedious, and can be prone to errors. In this context, methods that automatically analyze and extract information, such as Optical Character Recognition (OCR), are frequently used.

However, there have been several challenges in reading information from cards captured in natural scenes, including difficulties in scene text recognition, a lack of training data, and the complexity of the Vietnamese language. According to Zhu et al. [1], the diversity of scene text, the complexity of the background, and the interference factors are the main difficulties for scene text detection and recognition. The first difficulty is caused by diversities in fonts, colors, scales, and text orientations. For example, a Vietnamese ID card can contain three or four different fonts and colors. Moreover, there are four types of cards, as illustrated in Fig. 1, which have several formats for the same field. Especially for the 9-digit ID card, it can contain handwritten text or old fonts created by typewriters. The complexity of the background makes it difficult to clearly distinguish texts from backgrounds. For instance, the Vietnamese ID card background is usually incorrectly detected as text. As shown in Fig. 2a, the bounding box, i.e., the detected text region, also contains the pattern background. The interference factors, such as noise, blur, distortion, low resolution, non-uniform illumination, and partial occlusion, make detection and recognition harder, as illustrated in Fig. 2b.

Due to the sensitivity of the information, researchers have faced many difficulties in collecting Vietnamese ID cards for model training. This lack of data can easily lead to bad results in extracting information. Moreover, the Vietnamese alphabet is a Latin-based alphabet, but with many additional characters, including five accent symbols (á, à, ả, ã, ạ) and derivative characters, such as ê, ă, ư [2]. Therefore, we cannot apply available OCR methods and pre-trained models, usually built for English, to this language.

There have already been a few studies on similar cards, for example, Egyptian ID [3], Indonesian ID [4], or even Vietnamese ID cards [5], [6]. Most of them are deep learning-based due to the outstanding performance of deep learning in OCR [7]–[9]. However, the proposed methods focus either on different languages, like English, or on sub-tasks, such as text recognition [5], [6]. They cannot be used for the Vietnamese language or to directly deal with raw images captured in natural scenes.

Therefore, this paper presents an end-to-end method for information retrieval from Vietnamese ID card images. The method consists of three consecutive steps (Pre-processing, Text detection and recognition, and Post-processing) with four neural networks, two image processing techniques, and two basic Natural Language Processing techniques to deal with raw card images captured in natural scenes. We also created four datasets to deal with the lack of data. In summary, the major contributions of this work are as follows:

• We proposed an end-to-end deep learning-based method to extract information from the Vietnamese ID card, which is based on state-of-the-art methods in related fields.

• We introduced four neural networks (U-Net, VGG16, CRAFT, and Rebia) to analyze the card and extract its content. To assure the correlation among models, we used the output of the previous step to train the models for the subsequent one.

• We applied (i) Contour detection and Hough transformation at the pre-processing step to align and crop the card; and (ii) Levenshtein distance and regular expressions at the post-processing step to correct the extracted information.

Fig. 1. Front Side of Vietnamese ID Cards: (a) 9-digit ID Card; (b) 12-digit ID Card; (c) New 12-digit ID Card; (d) Chip-based ID Card.

• We built four datasets for Vietnamese OCR, including two manual (ID Cards and manually annotated text) and two synthetic (synthetic image and text) datasets. The datasets contain more than 400k manually labeled texts from 3,256 Vietnamese ID Cards. Furthermore, we evaluated our proposed method on these datasets and highlighted the experimental results thus obtained.

• We proposed a microservice architecture to deploy our method on a real system, which allows balancing the information flow between each step of the end-to-end method.

The remainder of this paper is structured as follows. Section 2 discusses relevant previous studies. Section 3 provides the details regarding our proposed networks. The experimental evaluation is presented in Section 4, and finally, some concluding remarks and a brief discussion are provided in Section 5.

Fig. 2. Complexity of Background and Interference Factors: (a) Pattern background detected as text; (b) Occluded text.

II. RELATED WORKS

In general, information retrieval from ID card images relates to the OCR of semi-structured documents, such as receipts [10], bank cards [11], business cards [12], [13], invoices [14], [15], and so on. It typically contains several common steps, including pre-processing, text detection, text recognition, and layout analysis [5], [10], [16]–[18].

Clearly, in the context of Scene Text Recognition (STR), where document images can be affected by many factors, pre-processing is necessary to improve the quality and normalize the input data. This step can include a series of sub-tasks, such as document detection, segmentation, alignment, and basic image processing techniques. For example, with a raw image captured in natural scenes, we must check the existence of documents, and then extract their position. The latter usually produces information about the top-left and bottom-right corners of the box that covers the object. To have a stable result in further steps, the document can be aligned vertically, so that the text direction runs from left to right.

The output of pre-processing is passed to the next step, where the text detection is essentially performed [5], [16]. Then, at the layout analysis step, these ROIs are extracted and classified into corresponding fields, for example, ID number, name, date of birth, and address. At the final step, where the main principle of what is usually called optical character recognition happens, we predict the potential string of the extracted text areas.

Regarding the order of these steps, pre-processing is usually performed first, while the order of the remaining steps can vary. For example, text detection and recognition can be performed together, as in integrated methodologies, or step by step, as in stepwise methodologies [19]. The layout analysis is typically done after recognition, determining the corresponding label of each ROI, and so on.

Recently, with the development of deep learning, many methods have been proposed for these steps with promising accuracy. For the pre-processing step, they usually apply both traditional image processing techniques [20] and deep neural networks, such as the Canny edge detection algorithm [10], Otsu's method [5], U-Net [21], and VGG [22]. Applying deep learning to OCR also achieved higher performance and lower processing time than traditional machine learning. Regarding text detection, for example, Baek et al. [23] presented a character-based method, Character Region Awareness for Text Detection (CRAFT), that effectively detects text areas by exploring each character region and the affinity between them. Liao et al. [24] proposed another method that can detect characters in real time. Similarly, there are also many models in the literature for text recognition, such as the CHAR model [25], CTPN [26], TRBA [27], Shi et al. [28], and so on. They can be categorized into character-level and word-level approaches.

The first approach locates the position of each character, recognizes each one with a classifier, and then groups the characters into the final text. Meanwhile, the second approach, which outperforms the first one, considers the text line as a whole and focuses on mapping the entire text into a target string sequence [29].

These methods are applicable to many language types, such as English, Korean, and Chinese. However, it is impossible to apply them directly to Vietnamese, which contains additional accents and diacritical marks; additional training is needed to learn its specific features. Moreover, they usually focus on sub-steps or a specific problem, e.g., only pre-processing [5] or text detection [23]. We cannot simply juxtapose these steps to obtain an end-to-end method that extracts information from the Vietnamese ID card. To achieve high and stable performance, the steps need to correlate with each other: the output of the previous step should be used as the input to train the models for the subsequent step.

III. PROPOSED METHOD

Due to the variety of captured images and types of ID cards, the proposed method contains three steps with four neural networks, as shown in Fig. 3. It consists of a set of deep learning and traditional machine learning techniques in Computer Vision and Natural Language Processing, as follows:

• We performed several pre-processing steps, including segmentation, alignment, and identification, to determine and normalize the card. First, we detected and segmented cards from input images. Next, they were vertically aligned and cropped, with text running from left to right. Then, we determined their type (9-digit, 12-digit, or the new 12-digit ID Card). Two deep learning methods and two image processing algorithms were applied in this step: the U-Net model for detection and segmentation, the VGG16 model for classification, and Contour Detection and Hough Transformation for alignment.

• We applied the word-level approach to detect and recognize Vietnamese optical texts on the cards, including the CRAFT method [23] for text detection, and the Attn method with ResNet and BiLSTM for text recognition.

• Lastly, to correct text errors and identify text fields (Named Entity Recognition), such as names and dates of birth, we performed two main tasks: layout analysis and text correction. The Levenshtein distance, regular expressions, and two pre-defined dictionaries were applied.

The input of the first step is images containing Vietnamese ID cards, while the output is the ID cards cropped and aligned from the background. The type of card is another important output of this step. Then, the cropped ID is used as input for the next step (text detection and text recognition), which produces two lists: a list of predicted texts and a list of bounding boxes. The last step takes these lists, together with the type of card from the first step, and analyzes them to produce a list of texts with their fields. A more detailed description of these steps is presented in the following subsections.
A. Preprocessing: Segmentation, Normalization, and Identification

Owing to the unconstrained nature of captured images, three preprocessing steps were applied to make them usable by the next steps: segmentation, normalization, and identification. These tasks allowed us to separate the cards from background images, vertically align them, and identify their type.

Thanks to its efficiency in image segmentation, we trained a U-Net model to segment Vietnamese ID cards. The network has the same architecture as the work of Ronneberger et al. [30]. However, instead of using 512 x 512 input images, we downscaled them to 256 x 256 pixels. The output was a 128 x 128 image that is then used as a mask to segment the card.
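As an illustration, the following is a minimal Keras sketch of such a mask-producing network with the input and output sizes reported above (256 x 256 input, 128 x 128 mask). The number of encoder/decoder stages, the filter widths, and the function name are simplifying assumptions, not the exact configuration of [30].

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_card_segmenter(input_size=256):
    """Tiny U-Net-style encoder/decoder: 256x256 RGB in, 128x128 card mask out."""
    inputs = tf.keras.Input(shape=(input_size, input_size, 3))

    # Encoder (the original U-Net uses more stages and wider filters).
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)   # 256x256
    p1 = layers.MaxPooling2D()(c1)                                         # 128x128
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)       # 128x128
    p2 = layers.MaxPooling2D()(c2)                                         # 64x64

    # Bottleneck.
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)       # 64x64

    # Decoder: a single up-sampling stage with a skip connection, so the
    # predicted mask is half the input resolution (128x128), as described above.
    u = layers.UpSampling2D()(b)                                           # 128x128
    u = layers.Concatenate()([u, c2])
    d = layers.Conv2D(64, 3, padding="same", activation="relu")(u)
    mask = layers.Conv2D(1, 1, activation="sigmoid", name="card_mask")(d)  # 128x128x1

    return tf.keras.Model(inputs, mask)

segmenter = build_card_segmenter()
segmenter.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
```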
The model allows us to determine a binary mask of the card. We then combined two basic image-processing techniques to align and crop the cards: Contour Detection and Hough Transformation. The first algorithm was applied to the binary image to detect the boundaries of the card. Then, we transformed these lines into Hough coordinates to find the two pairs of intersecting parallel lines that bound the card. This allows us to determine the four corner coordinates of the card. From these corners, we applied a perspective transformation to align and crop the card into 600x400 pixel images.
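The following OpenCV sketch illustrates this alignment-and-crop idea under a simplifying assumption: instead of the Hough-based line fitting described above, it approximates the largest contour of the binary mask with a quadrilateral before applying the perspective transformation to a 600x400 image. Function names and thresholds are illustrative.

```python
import cv2
import numpy as np

def crop_card(image, mask, out_w=600, out_h=400):
    """image: original photo; mask: 8-bit binary card mask at the same resolution."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    card = max(contours, key=cv2.contourArea)                 # largest blob = the card

    # Approximate the card boundary with four points (its corners).
    peri = cv2.arcLength(card, True)
    quad = cv2.approxPolyDP(card, 0.02 * peri, True).reshape(-1, 2).astype(np.float32)
    if len(quad) != 4:
        raise ValueError("could not reduce the contour to 4 corners")

    # Order the corners: top-left, top-right, bottom-right, bottom-left.
    s, d = quad.sum(axis=1), np.diff(quad, axis=1).ravel()
    src = np.array([quad[np.argmin(s)], quad[np.argmin(d)],
                    quad[np.argmax(s)], quad[np.argmax(d)]], dtype=np.float32)

    # Perspective transformation to an upright 600x400 card, as described above.
    dst = np.array([[0, 0], [out_w - 1, 0],
                    [out_w - 1, out_h - 1], [0, out_h - 1]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, M, (out_w, out_h))
```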
Lastly, we identified the type of the cropped cards by fine-tuning the VGG16 network trained on ImageNet [31]. This network consists of sixteen layers, thirteen convolutional and three fully connected, organized into blocks of convolutional, max-pooling, and fully connected layers. We fine-tuned the model with our dataset to classify eight classes: the front, the back, and the reversed front and back of the 9-digit and 12-digit cards. The feature extractor of VGG16 was therefore kept as in the original network. To adapt to our dataset, we updated the classifier section with a new fully-connected layer that outputs these 8 classes only.
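A possible Keras sketch of this card-type classifier is given below: the ImageNet-pretrained VGG16 convolutional base is frozen and the classifier head is replaced by an 8-class softmax. The input resolution and the width of the intermediate dense layer are assumptions for illustration.

```python
import tensorflow as tf

# VGG16 feature extractor kept as in the original network, head removed.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

# New classifier section adjusted to the 8 card classes described above.
x = tf.keras.layers.Flatten()(base.output)
x = tf.keras.layers.Dense(256, activation="relu")(x)
outputs = tf.keras.layers.Dense(8, activation="softmax")(x)

classifier = tf.keras.Model(base.input, outputs)
classifier.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                   loss="categorical_crossentropy", metrics=["accuracy"])
```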
B. Text Detection and Recognition

Vietnamese is a Latin-based language that has several additional accents and diacritical marks [32]. The language has more than 250 characters. Its alphabet has twenty-nine letters, of which twenty-two are Latin letters (the letters 'f', 'j', 'w', and 'z' are eliminated). The remaining ones are created by combining these letters with diacritics located just above or below the letters, with or without a small gap between them.

Therefore, instead of training a new model, which would require many manually labeled datasets, we adapted models trained on English datasets to detect text. Thanks to its performance in dealing with low-quality data, we applied the CRAFT text detector to localize the text in the cropped cards. The model effectively detects text areas by exploring each character and the affinity between characters [23].

CRAFT has three detection levels: (i) individual character; (ii) individual word; and (iii) connected words or sentences. The processing time of the first level is long, while the third level is unstable. Thus, we applied only the second level to detect individual words on the cards.

Fig. 3. Information Retrieval from Vietnamese ID Card Images.

Fig. 4. Rebia Neural Network for Text Recognition.

The detector, however, only covers the alphabetic word body without the tone marks, such as the grave accent, hook above, tilde, acute accent, and dot below. Therefore, we adapted the detector to cover the tone marks by enlarging the bounding boxes of the detected areas.

The input of this step is the 600x400 pixel images that were segmented and aligned previously. The output is a list of bounding boxes containing the cropped words.
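A minimal sketch of such an enlargement is shown below; the padding ratio and the small horizontal margin are assumptions, since the exact values are not reported here.

```python
def enlarge_box(box, pad_ratio=0.25, img_w=600, img_h=400):
    """box = (x_min, y_min, x_max, y_max) of one detected word on the aligned card."""
    x0, y0, x1, y1 = box
    pad = int((y1 - y0) * pad_ratio)       # grow mainly in the vertical direction,
    y0 = max(0, y0 - pad)                  # where the tone marks sit above/below letters
    y1 = min(img_h, y1 + pad)
    x0 = max(0, x0 - 2)                    # small horizontal safety margin
    x1 = min(img_w, x1 + 2)
    return x0, y0, x1, y1

def crop_words(card_image, boxes):
    """card_image: the 600x400 aligned card (NumPy array); boxes: detector output."""
    return [card_image[y0:y1, x0:x1] for (x0, y0, x1, y1) in map(enlarge_box, boxes)]
```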
For text recognition, we propose a deep neural network, namely Rebia, to recognize the optical words on Vietnamese ID cards. Unlike most existing methods in the field, we did not apply a rectification step before extracting the features. Since these texts have a similar orientation and shape, rectification, which is used to normalize different types of texts (i.e., curved and tilted texts), is not necessary. The proposed network contains three blocks, as shown in Fig. 4 and detailed in Table I. First, a feature map that focuses on the word level is extracted by a ResNet neural network. The network contains five layers. We converted all text images to gray-scale of size 100 x 32 pixels to normalize the input data.

Next, we reshaped the extracted features into a sequence feature used for prediction. Thanks to their capability of capturing contextual information within a sequence [33], we used two Bidirectional Long Short-Term Memory (BiLSTM) layers at this step. The two layers have the same number of hidden units, namely 256.

Lastly, an attention-based decoder was employed to predict from the sequence feature. It contains an LSTM layer with 256 hidden units.

We applied the ReLU activation in all three blocks. After each convolution, batch normalization was used to standardize its outputs. The objective function was the negative log-likelihood of the probability of the label sequence.
C. Post-processing: Layout Analysis and Text Update

After recognition, we categorized the texts into the corresponding fields, such as the name, ID, address, and date of birth. Due to the lack of a training dataset, we combined natural language processing, regular expression techniques, and dedicated layout analysis algorithms to determine the field types.

The front of a Vietnamese ID card is typically organized from left to right, top to bottom, and line by line, while the back is more complicated. For the back, however, we are interested only in the issued date and place, which have the same structure. Therefore, we first sorted the bounding boxes of the detected words from left to right and line by line.

Next, we evaluated and updated the recognized text by the Levenshtein distance with the help of several domain-specific dictionaries. We combined these texts with the results from the above step to identify the different fields. Furthermore, for fixed-format fields, such as the date, number, and ID, we also applied regular expressions and the following rules (a combined sketch is given after the list):

• For the ID number: By experiment, we found that one of the most common issues for this field was the overlap of the caption and the content (ID number). It made the recognized ID numbers contain several additional characters at the beginning. Therefore, for a text line that was determined to be the ID number, we took only a fixed number of digits, starting from the end of the text (9 for the old card, 12 for the new card).

• For the date of birth, expiry, and issued date: Each card type has its own format to represent the date, so we applied a rule-based mechanism and regular expressions to check the text.

• For the name and address: First, we built three dictionaries containing common Vietnamese family names, middle names, and addresses. Then we applied the Levenshtein algorithm and regular expressions to correct the text, if necessary.
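The sketch below illustrates these post-processing rules: reading-order sorting of the word boxes, the trailing-digit rule for the ID number, a date regular expression, and a dictionary-based correction. It uses difflib as a stand-in for the Levenshtein matching, and the dictionaries, tolerances, and cutoffs are illustrative only.

```python
import re
import difflib

def sort_reading_order(items, line_tol=15):
    """items: list of (box, text), box = (x_min, y_min, x_max, y_max).
    Group word boxes into lines by vertical position, then sort left to right."""
    items = sorted(items, key=lambda it: it[0][1])                 # by y_min
    lines, current = [], [items[0]]
    for it in items[1:]:
        if abs(it[0][1] - current[-1][0][1]) <= line_tol:
            current.append(it)
        else:
            lines.append(sorted(current, key=lambda it: it[0][0])) # by x_min
            current = [it]
    lines.append(sorted(current, key=lambda it: it[0][0]))
    return lines

def extract_id_number(text, card_type="9-digit"):
    """Keep only the trailing digits: the caption sometimes overlaps the number field."""
    digits = re.sub(r"\D", "", text)
    return digits[-9:] if card_type == "9-digit" else digits[-12:]

def extract_date(text):
    """Normalize a recognized date such as '12-3-1990' to DD/MM/YYYY."""
    m = re.search(r"(\d{1,2})[/\-.](\d{1,2})[/\-.](\d{4})", text)
    if not m:
        return None
    day, month, year = (int(g) for g in m.groups())
    return f"{day:02d}/{month:02d}/{year}"

def correct_with_dictionary(word, dictionary, cutoff=0.8):
    """Snap a word to the closest dictionary entry (difflib stands in for edit distance)."""
    match = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return match[0] if match else word

family_names = ["Nguyễn", "Trần", "Lê", "Phạm", "Hoàng"]   # small illustrative dictionary
print(extract_id_number("Số: 0123 456 789"))                # -> '123456789'
print(extract_date("Ngày sinh: 12-3-1990"))                 # -> '12/03/1990'
print(correct_with_dictionary("Nguyen", family_names, 0.6)) # -> 'Nguyễn'
```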
TABLE I. DETAILED ARCHITECTURE OF REBIA FOR TEXT RECOGNITION

Feature Extraction - ResNet
  Layer    | Configuration (kernel, stride, padding, channel)                                        | Output
  conv1_x  | Conv1: 3x3, 1x1, 1x1, 32; Conv2: 3x3, 1x1, 1x1, 64; MaxPool: 2x2, 2x2, 0x0              | 100x32
  conv2_x  | 01 BasicBlock: [3x3, 128; 3x3, 128]; Conv3: 3x3, 1x1, 1x1, 128; MaxPool: 2x2, 2x2, 0x0  | 50x16
  conv3_x  | 02 BasicBlock: [3x3, 256; 3x3, 256]; Conv4: 3x3, 1x1, 1x1, 256; MaxPool: 2x2, 2x2, 0x0  | 25x18
  conv4_x  | 05 BasicBlock: [3x3, 512; 3x3, 512]; Conv5: 3x3, 1x1, 1x1, 512                          | 26x4
  conv5_x  | 03 BasicBlock: [3x3, 512; 3x3, 512]; Conv6: 2x2, 1x2, 1x0, 512; Conv7: 2x2, 1x1, 0x0, 512 | 26x1

Sequence Modelling - BiLSTM
  BiLSTM   | Hidden units: 256 | 256
  BiLSTM   | Hidden units: 256 | 256

Prediction - Attn decoder
  LSTM     | Hidden units: 256 | 256
Fig. 5. Multi-layer Architecture for Load Balancing of Model Execution.

D. Multi-layer Architecture for Load Balancing of Model Execution

In this study, we propose a multi-layer system architecture to deploy the end-to-end method, which supports load balancing of model execution. We considered each model as an independent and parallel process, which can serve different ID cards at the same time, as shown in Fig. 6. This eliminates the bottleneck at some models due to high memory consumption, especially text detection and recognition.

Therefore, we decomposed the proposed end-to-end method into five sub-tasks and processed them separately. Each one consists of a model or algorithm with its own input and output. Owing to model orchestrators and API gateways, the proposed architecture can coordinate the input and output of these sub-tasks, as presented in Fig. 5.

Thus, each information extraction request was split into five sub-requests, as shown in Fig. 6. In the left part of the figure (single request invocation), all models are blocked until the last sub-task (post-processing) finishes, while in the right part (multi-request support), models become free as soon as they finish their jobs. (A request n is split into five sub-requests (Req n.x); x denotes the sub-task that corresponds to a step of the proposed method.)

We applied the FIFO (First In, First Out) technique to handle the multiple sub-requests (a minimal sketch follows the layer list below). The complete architecture of our system is presented in Fig. 5; it is organized into the following layers and is based on virtualization techniques to maximize infrastructure usage:

• The orchestration layer contains the model orchestrators that are responsible for coordinating the different steps.

• The gateway layer includes the API gateways that interact with the corresponding micro-services at each step.

• The micro-services layer is composed of the micro-services that expose APIs for interacting with the models.

• The infrastructure layer applies virtualization techniques to share the physical hardware (GPUs, CPUs, and virtual GPUs and CPUs) among the micro-services.
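The following sketch illustrates the FIFO idea with plain Python queues and threads: each step runs as an independent worker that pulls sub-requests from its own queue and pushes results to the next one, so a new card can enter segmentation while an earlier card is still being recognized. The handlers are placeholders; in the deployed system each stage is a separate micro-service behind an API gateway.

```python
import queue
import threading

def start_stage(name, handler, in_q, out_q):
    """Run one step of the pipeline as a daemon worker with a FIFO input queue."""
    def run():
        while True:
            req_id, payload = in_q.get()          # oldest sub-request first (FIFO)
            result = handler(payload)             # placeholder for the real model call
            print(f"{name}: finished request {req_id}")
            if out_q is not None:
                out_q.put((req_id, result))
            in_q.task_done()
    threading.Thread(target=run, daemon=True).start()

# One queue per sub-task of the end-to-end method.
q_seg, q_cls, q_det, q_rec, q_post = (queue.Queue() for _ in range(5))

start_stage("segmentation",     lambda x: x, q_seg,  q_cls)
start_stage("classification",   lambda x: x, q_cls,  q_det)
start_stage("text-detection",   lambda x: x, q_det,  q_rec)
start_stage("text-recognition", lambda x: x, q_rec,  q_post)
start_stage("post-processing",  lambda x: x, q_post, None)

# Three concurrent card requests enter the first stage without waiting for each other.
for i in range(3):
    q_seg.put((i, f"card-image-{i}"))
for q in (q_seg, q_cls, q_det, q_rec, q_post):
    q.join()                                      # wait until every stage has drained
```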

Fig. 6. Single (a) and Multi-Request (b) End-to-End Services: (a) single-request end-to-end invocation; (b) multi-request end-to-end invocation.

Fig. 7. Dedicated Annotation Tool.

IV. EXPERIMENTS

A. Dataset

We used four datasets in this study, including two manual (ID Cards and manually annotated text) and two synthetic (synthetic image and text) datasets, for training U-Net, VGG16, and Rebia. The following steps were performed to prepare our datasets:

• ID cards: we collected 3,256 Vietnamese ID cards from volunteers and public images on the internet. The dataset contains 1,530 samples of the 9-digit card, 935 of the 12-digit card, and 783 of the new 12-digit card (since the chip-based ID card had only just been released in 2021, in this study we focused on the three other cards). We then used the labelImg¹ tool to annotate this dataset.

• Synthetic images: we extracted ID cards from the above dataset and randomly placed them on background images containing different objects (i.e., papers, business cards, license cards, and so on); a simplified compositing example is sketched after this list. We thus generated a total of 60k synthetic images.

• Manually annotated text: We first applied the CRAFT model on the extracted ID cards to detect individual words. Next, Tesseract OCR was applied to predict the text. Then, we developed a dedicated tool to correct the texts manually, as shown in Fig. 7. This process is illustrated in Fig. 8, phase 3. Lastly, we obtained a total of 400k manually annotated texts.

• Synthetic text: This dataset consists of more than 500k synthetic texts generated from popular Vietnamese names, addresses, numbers, etc. We used a tool, namely the Synthetic Data Generator², to generate this dataset.

¹ https://github.com/tzutalin/labelImg
² https://github.com/Belval/TextRecognitionDataGenerator
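The following PIL sketch illustrates the kind of compositing described in the "Synthetic images" item: a cropped card is randomly scaled, rotated, and pasted onto a background photo, with a light blur as degradation. The paths, output size, and parameter ranges are assumptions.

```python
import random
from PIL import Image, ImageFilter

def compose_sample(card_path, background_path, out_size=(1280, 960)):
    """Paste a cropped ID card onto a background photo with a random pose."""
    background = Image.open(background_path).convert("RGB").resize(out_size)
    card = Image.open(card_path).convert("RGBA")

    # Random in-plane scale and rotation.
    scale = random.uniform(0.4, 0.8)
    card = card.resize((int(card.width * scale), int(card.height * scale)))
    card = card.rotate(random.uniform(-30, 30), expand=True)

    # Paste at a random position, using the card's alpha channel as the mask.
    x = random.randint(0, max(0, out_size[0] - card.width))
    y = random.randint(0, max(0, out_size[1] - card.height))
    background.paste(card, (x, y), card)

    # Mild degradation so samples resemble hand-held captures.
    if random.random() < 0.5:
        background = background.filter(ImageFilter.GaussianBlur(radius=1))
    return background

# Example (hypothetical paths):
# compose_sample("cards/id_0001.png", "backgrounds/desk.jpg").save("synth/sample_0001.jpg")
```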
For the manually annotated and synthetic text, we also applied data augmentation methods to balance and increase the samples, including rotation, blur, noise, and so on. Finally, we obtained a total of 8M samples. The first two datasets were used to train U-Net and VGG16, while the last two datasets were applied to train Rebia.

B. Model Training Setup

The proposed method is a continuous process, in which the output of the previous step is the input of the subsequent step. The models used should therefore correlate with each other, so we relied on the previous model to train the next one. As detailed in Fig. 8, the model training contains three phases:

1) Phase 1 - Segmentation model training: we trained and validated the U-Net model on the ID cards and synthetic image datasets.

2) Phase 2 - Classification model training: we used the trained U-Net model to extract ID cards from the first dataset and vertically align them. To preserve consistency between the segmentation and identification steps, we combined these cards with those manually extracted from the same dataset to train the VGG16 model.

3) Phase 3 - Recognition model training: similarly, we used the trained VGG16 and CRAFT models to detect and crop text areas. Based on these areas, we created the manually annotated text (as presented in the previous subsection). This dataset was then combined with the synthetic text to train the Rebia model.

All models were implemented using the TensorFlow Framework 2.3.0 and Python 3.6.9, on an NVIDIA Tesla K80 GPU with 12 GB of memory and an Intel(R) 2.3 GHz Xeon(R) micro-processor.


Fig. 8. Model Training Setup.

We used the following parameters and techniques to train the models (a configuration sketch is given after the list):

• To prevent bias, we divided all datasets into two subsets, for training (80% of samples, chosen randomly) and testing (20%).

• For the loss function, we used binary cross-entropy, categorical cross-entropy, and an attention loss function to train the U-Net, VGG16, and Rebia models, respectively.

• For the optimization, we applied an Adam optimizer with β1 = 0.9, β2 = 0.999, and ε = 10⁻⁷, an initial learning rate of 10⁻⁴, and a self-adjusting learning rate technique to train U-Net and VGG16; Rebia used an Adadelta optimizer with β1 = 0.9, rho = 0.95, eps = 10⁻⁸, and lr = 1.

• To minimize the cost function, we applied mini-batches with a size of 192 for Rebia and 128 for U-Net and VGG16.

• The early stopping technique was employed to increase the training speed and reduce over-fitting. It makes the three models stop learning once they have reached their maximum accuracy.

• Data shuffling was used, such that the models could learn in random order and provide more objective results. Before selecting the batches, we conducted a shuffling process to balance the dataset, in which fifty percent of the samples were randomly chosen from each one.
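A configuration sketch corresponding to the list above is shown below, with the stated optimizer hyper-parameters written out in TensorFlow/Keras; the model variables, the early-stopping patience, and the substitution of a generic sequence cross-entropy for the attention loss are assumptions for illustration.

```python
import tensorflow as tf

# U-Net: binary cross-entropy, Adam with the stated betas and a 1e-4 starting rate.
unet_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9,
                                    beta_2=0.999, epsilon=1e-7)
# segmenter.compile(optimizer=unet_opt, loss="binary_crossentropy")

# VGG16 classifier: categorical cross-entropy with the same optimizer settings.
vgg_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9,
                                   beta_2=0.999, epsilon=1e-7)
# classifier.compile(optimizer=vgg_opt, loss="categorical_crossentropy",
#                    metrics=["accuracy"])

# Rebia: Adadelta with rho = 0.95, eps = 1e-8, lr = 1 (the attention loss of the
# paper is replaced here by a generic sequence cross-entropy for illustration).
rebia_opt = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95, epsilon=1e-8)
# recognizer.compile(optimizer=rebia_opt, loss="sparse_categorical_crossentropy")

# Early stopping, as described above, halts training once accuracy stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
```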
After successfully training and validating the models, we deployed the end-to-end method on the same hardware configuration. To support the proposed multi-layer architecture, we implemented four Docker containers: instance segmentation, text detection, and text recognition were hosted by three separate containers, while layout analysis and post-processing were hosted together in a single one.

V. RESULTS AND DISCUSSION

Owing to the early stopping technique, the U-Net and VGG16 training runs were stopped after 48 and 15 epochs, as shown in Fig. 9a and 9b. The two figures also show that the gap between the training loss and the test loss is tiny, which means that the models operated accurately, without overfitting. The accuracy of U-Net and VGG16 is 94% and 99.5% on the test set, respectively, as shown in Table II. For the Rebia model, the training converged at 60 epochs. The model achieves a high accuracy of 98.3% on the validation set, as shown in Fig. 11.

TABLE II. ACCURACY OF THE PROPOSED MODELS ON THE TESTING SET

  Model                         | Accuracy
  U-Net (Segmentation model)    | 94.0%
  VGG16 (Classification model)  | 99.5%
  Rebia (Recognition model)     | 98.3%

Thanks to transfer learning, the U-Net and VGG16 models quickly reached a high accuracy after several initial steps/iterations; they improved by only 2 – 5% of accuracy during the remaining time. This can be explained by the fact that the pre-trained models were trained on a large dataset (the ISBI dataset for U-Net, ImageNet for VGG16).

We used U-Net to segment the cards, as shown in Fig. 10. The figure presents an example of using the U-Net model to detect the binary mask of a 12-digit card. We then cropped the card from the background image and vertically aligned it, as presented in Fig. 12.

Fig. 9. Progress of Loss and Accuracy for Training and Testing U-Net (a) and VGG16 (b).

Fig. 10. Segmentation Step (a, b: the Back of a 12-Digit Card and Its Binary Image; c, d: the Front of a 12-Digit Card and Its Binary Image).

Experimental results have shown that the proposed methods outperform similar works in extracting information from the Vietnamese ID card. At the pre-processing step, our method (a combination of U-Net and traditional machine learning) provides more stable results than the works presented in [5], [34]. These methods usually work well in controlled environments, e.g., with enough light and clear cards [5], or when all four corners of the card are visible [34]. For unstable environments, which are very common in many practical applications, they fail in pre-processing the input images, e.g., in ID card detection and classification. For this task, thanks to the U-Net model and a rich dataset, our method can respond to unstable situations with a high accuracy of 94.0%.

The type of card is important information, but to our knowledge, no existing work has presented a way to identify it. Based on the VGG16 network, we can classify the different types of ID cards with a high accuracy of 99.5%. Therefore, our method is capable of dealing with input images that contain many ID cards. Furthermore, this information was very useful at the post-processing step. Regarding text recognition, we obtained a higher accuracy than recent works, such as Hoai et al. [6] (98.3% compared with 89.7%) and Viet et al. [35] (98.3% compared with 91%).

Besides, owing to the proposed multi-layer architecture, the end-to-end method took an average of 2.5 seconds to extract information from a raw image, as illustrated in Fig. 13. The figure also shows a typical example of an input image that contains similar cards, such as student, business, or license cards. With the help of the trained models and algorithms, our method can extract the important information accurately.

Fig. 11. Progress of Loss and Accuracy for Training and Testing the Rebia Network.

Fig. 12. An Old ID Card before (a) and after (b) the Pre-processing Step.

VI. CONCLUSION

In this paper, we have presented an end-to-end method to extract information from the Vietnamese ID card. The proposed method contains three consecutive steps: pre-processing, text detection and recognition, and post-processing. Four neural networks were proposed and trained, which allows us to extract information efficiently: U-Net for segmentation, VGG16 for classification, CRAFT for text detection, and Rebia for text recognition. We also applied a natural language processing technique (the edit distance) and two image processing algorithms (Contour detection and Hough transformation) for layout analysis, text correction, and card alignment.

In addition, a dataset including 3,256 Vietnamese ID cards, 400k manually annotated texts, and more than 500k synthetic texts was built to validate the proposed methods. We conducted an empirical experiment on our self-collected dataset to demonstrate the effectiveness of the proposed method, which achieved a high accuracy of 94%, 99.5%, and 98.3% for segmentation, classification, and text recognition. These results indicate the promise of the proposed method for information extraction from similar semi-structured documents. In the future, we will focus on the chip-based ID card and improve the performance of our models, especially the text detector.

REFERENCES

[1] Y. Zhu, C. Yao, and X. Bai, "Scene text detection and recognition: Recent advances and future trends," Front. Comput. Sci., vol. 10, pp. 19–36, Feb. 2016.
[2] N. Nguyen, T. Nguyen, V. Tran, T. Tran, T. Ngo, T. Nguyen, and M. Hoai, "Dictionary-guided scene text recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[3] A. Nosseir and O. Adel, "Automatic extraction of arabic number from egyptian id cards," in Proceedings of the 7th International Conference on Software and Information Engineering, ICSIE '18, (New York, NY, USA), pp. 56–61, Association for Computing Machinery, 2018.
[4] F. M. Rusli, K. A. Adhiguna, and H. Irawan, "Indonesian ID card extractor using optical character recognition and natural language post-processing," CoRR, vol. abs/2101.05214, 2021.
[5] T. N. T. Thanh and K. N. Trong, "A method for segmentation of vietnamese identification card text fields," International Journal of Advanced Computer Science and Applications, vol. 10, no. 10, 2019.
[6] D. Hoai, H.-T. Duong, and V. Truong Hoang, "Text recognition for vietnamese identity card based on deep features network," International Journal on Document Analysis and Recognition (IJDAR), vol. 24, 2021.
[7] R. Soni, B. Kumar, and S. Chand, "Text detection and localization in natural scene images based on text awareness score," Applied Intelligence, vol. 49, pp. 1376–1405, 2018.
[8] A. Hazra, P. Choudhary, S. Inunganbi, and M. Adhikari, "Bangla-meitei mayek scripts handwritten character recognition using convolutional neural network," Applied Intelligence, vol. 51, pp. 2291–2311, 2021.
[9] X. Ma, K. He, D. Zhang, and D. Li, "Pieed: Position information enhanced encoder-decoder framework for scene text recognition," Applied Intelligence, vol. 51, pp. 6698–6707, 2021.
[10] X. Wang, X. Zhang, S. Lei, and H. Deng, "A method of text detection and recognition from receipt images based on CRAFT and CRNN," Journal of Physics: Conference Series, vol. 1518, p. 012053, Apr. 2020.


Fig. 13. Extracted Information (the Left Image) from a Raw Image Captured with Several Similar Cards (the Right Image). The Card and Texts were
Successfully Segmented, Cropped, Aligned and Detected, as in the Middle Image.

[11] Y. Gao, C. Xu, Z. Shi, and H. Zhang, "Bank card number recognition system based on deep learning," (New York, NY, USA), Association for Computing Machinery, 2019.
[12] M.-C. Ko and Z.-H. Lin, "Cardbot: A chatbot for business card management," in Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, IUI '18 Companion, (New York, NY, USA), Association for Computing Machinery, 2018.
[13] S. Srivastava, S. Sahay, D. Mehrotra, and V. Deep, "Automation of business cards," in Advances in Interdisciplinary Engineering (M. Kumar, R. K. Pandey, and V. Kumar, eds.), (Singapore), pp. 371–380, Springer Singapore, 2019.
[14] H. T. Ha, M. Medved, Z. Neverilová, and A. Horak, "Recognition of ocr invoice metadata block types," in TSD, 2018.
[15] P. Kumar and S. Revathy, "An automated invoice handling method using ocr," in Data Intelligence and Cognitive Informatics (I. Jeena Jacob, S. Kolandapalayam Shanmugam, S. Piramuthu, and P. Falkowski-Gilski, eds.), (Singapore), pp. 243–254, Springer Singapore, 2021.
[16] M. Ryan and N. Hanafiah, "An Examination of Character Recognition on ID card using Template Matching Approach," Procedia Computer Science, vol. 59, no. Iccsci, pp. 520–529, 2015.
[17] R. Valiente, M. T. Sadaike, J. C. Gutiérrez, D. F. Soriano, and G. Bressan, "A process for text recognition of generic identification documents over cloud computing," International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), no. April 2017, p. 4, 2016.
[18] A. Alnefaie, D. Gupta, M. H. Bhuyan, I. Razzak, P. Gupta, and M. Prasad, "End-to-end analysis for text detection and recognition in natural scene images," in 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, 2020.
[19] Q. Ye and D. Doermann, "Text detection and recognition in imagery: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1480–1500, 2015.
[20] K. Karthick, K. B. Ravindrakumar, R. Manikandan, and R. Cristin, "Consumer service number recognition using template matching algorithm for improvements in OCR based energy consumption billing," ICIC Express Letters, Part B: Applications, vol. 10, no. 10, pp. 895–901, 2019.
[21] Z. Zeng, W. Xie, Y. Zhang, and Y. Lu, "Ric-unet: An improved neural network based on unet for nuclei segmentation in histology images," IEEE Access, vol. 7, pp. 21420–21428, 2019.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015.
[23] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, "Character region awareness for text detection," CoRR, vol. abs/1904.01941, 2019.
[24] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai, "Real-time scene text detection with differentiable binarization," CoRR, vol. abs/1911.08947, 2019.
[25] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Synthetic data and artificial neural networks for natural scene text recognition," 2014.
[26] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, "Detecting text in natural image with connectionist text proposal network," 2016.
[27] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, "What is wrong with scene text recognition model comparisons? dataset and model analysis," CoRR, vol. abs/1904.01906, 2019.
[28] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, "Aster: An attentional scene text recognizer with flexible rectification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2035–2048, 2019.
[29] X. Chen, L. Jin, Y. Zhu, C. Luo, and T. Wang, "Text recognition in the wild: A survey," 2020.
[30] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds.), (Cham), pp. 234–241, Springer International Publishing, 2015.
[31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
[32] T.-H. Pham, X.-K. Pham, and P. Le-Hong, "On the use of machine translation-based approaches for vietnamese diacritic restoration," in 2017 International Conference on Asian Language Processing (IALP), pp. 272–275, 2017.
[33] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," 2015.
[34] H. D. Liem, N. Minh, N. B. Trung, H. T. Duc, P. H. Hiep, D. V. Dung, and D. H. Vu, "Fvi: An end-to-end vietnamese identification card detection and recognition in images," 2018 5th NAFOSTED Conference on Information and Computer Science (NICS), pp. 338–340, 2018.
[35] H. T. Viet, Q. Hieu Dang, and T. A. Vu, "A robust end-to-end information extraction system for vietnamese identity cards," in 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), pp. 483–488, 2019.
