Build A Handwritten Text Recognition System Using TensorFlow - by Harald Scheidl - Towards Data Science
Offline Handwritten Text Recognition (HTR) systems transcribe text contained in scanned images into digital text; an example is shown in Fig. 1. We will build a Neural Network (NN) which is trained on word-images from the IAM dataset. As the input layer (and therefore also all the other layers) can be kept small for word-images, NN training is feasible on the CPU (of course, a GPU would be better). This implementation is the bare minimum that is needed for HTR using TF.
Fig. 1: Image of a word (taken from IAM) and its transcription into digital text.
Get code and data
1. You need Python 3, TensorFlow 1.3, numpy and OpenCV installed
2. Get the implementation from GitHub: either take the code version this article is based on, or take the newest code version if you
can accept some inconsistencies between article and code
3. Further instructions (how to get the IAM dataset, command line parameters, …) can be found in the README
Model Overview
We use a NN for our task. It consists of convolutional NN (CNN) layers, recurrent NN (RNN) layers and a final Connectionist Temporal
Classification (CTC) layer. Fig. 2 shows an overview of our HTR system.
Fig. 2: Overview of the NN operations (green) and the data flow through the NN (pink).
We can also view the NN in a more formal way as a function (see Eq. 1) which maps an image (or matrix) M of size W×H to a character sequence (c1, c2, …) with a length between 0 and L. As you can see, the text is recognized at character level; therefore, words or texts not contained in the training data can be recognized too (as long as the individual characters are correctly classified).
Eq. 1: The NN written as a mathematical function which maps an image M of size W×H to a character sequence: NN: M ↦ (c1, c2, …, cn), 0 ≤ n ≤ L.
Operations
CNN: the input image is fed into the CNN layers. These layers are trained to extract relevant features from the image. Each layer consists of three operations. First, the convolution operation, which applies a filter kernel of size 5×5 in the first two layers and 3×3 in the last three layers to the input. Then, the non-linear ReLU function is applied. Finally, a pooling layer summarizes image regions and outputs a downsized version of the input. While the image height is downsized by 2 in each layer (and the width by 2 in the first two layers), feature maps (channels) are added, so that the output feature map (or sequence) has a size of 32×256; a layer-by-layer shape walkthrough is shown below.
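To make the downsizing concrete, the following sketch traces the feature-map shape through the five layers. The pooling sizes and channel counts are assumptions chosen to be consistent with the description above (2×2 pooling in the first two layers, 1×2 in the last three), not values quoted from the article:

```python
# Trace (width, height, channels) through the five CNN layers,
# starting from the 128×32 gray-value input image.
shape = (128, 32, 1)
pools = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2)]      # assumed pooling sizes
channels = [32, 64, 128, 128, 256]                    # assumed feature-map counts
for (pw, ph), ch in zip(pools, channels):
    shape = (shape[0] // pw, shape[1] // ph, ch)
    print(shape)
# prints: (64, 16, 32), (32, 8, 64), (32, 4, 128), (32, 2, 128), (32, 1, 256)
# i.e. a sequence of 32 time-steps with 256 features each
```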
RNN: the feature sequence contains 256 features per time-step; the RNN propagates relevant information through this sequence. The popular Long Short-Term Memory (LSTM) implementation of RNNs is used, as it is able to propagate information over longer distances and provides more robust training characteristics than a vanilla RNN. The RNN output sequence is mapped to a matrix of size 32×80. The IAM dataset consists of 79 different characters, and one additional character is needed for the CTC operation (the CTC blank label); therefore, there are 80 entries for each of the 32 time-steps.
CTC: while training the NN, the CTC operation is given the RNN output matrix and the ground truth text, and it computes the loss value. During inference, the CTC operation is only given the matrix, and it decodes it into the final text. Both the ground truth text and the recognized text can be at most 32 characters long.
Data
Input: it is a gray-value image of size 128×32. Usually, the images from the dataset do not have exactly this size, therefore we resize them (without distortion) until they either have a width of 128 or a height of 32. Then, we copy the image into a (white) target image of size 128×32. This process is shown in Fig. 3. Finally, we normalize the gray-values of the image, which simplifies the task for the NN. Data augmentation can easily be integrated by copying the image to random positions instead of aligning it to the left, or by randomly resizing the image.
Fig. 3: Left: an image from the dataset with an arbitrary size. It is scaled to fit the target image of size 128×32; the empty part of the target image is filled with white color.
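A minimal sketch of this preprocessing, assuming a 2D NumPy array `img` as input; the function name and defaults are illustrative and not taken verbatim from SamplePreprocessor.py:

```python
import cv2
import numpy as np

# Scale the image to fit into 128×32 without distortion, paste it
# left-aligned into a white target image, then normalize gray-values.
def preprocess(img, target_w=128, target_h=32):
    h, w = img.shape
    f = min(target_w / w, target_h / h)        # scale factor, keeps aspect ratio
    new_w, new_h = max(1, int(w * f)), max(1, int(h * f))
    img = cv2.resize(img, (new_w, new_h))      # cv2 expects (width, height)
    target = np.ones((target_h, target_w), dtype=np.uint8) * 255
    target[0:new_h, 0:new_w] = img             # left-aligned copy
    # normalize to zero mean and unit variance
    target = target.astype(np.float32)
    std = target.std()
    return (target - target.mean()) / std if std > 0 else target - target.mean()
```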
CNN output: Fig. 4 shows the output of the CNN layers, which is a sequence of length 32. Each entry contains 256 features. Of course, these features are further processed by the RNN layers; however, some features already show a high correlation with certain high-level properties of the input image: there are features which have a high correlation with characters (e.g. “e”), or with duplicate characters (e.g. “tt”), or with character properties such as loops (as contained in handwritten “l”s or “e”s).
Fig. 4: Top: 256 features per time-step are computed by the CNN layers. Middle: input image. Bottom: plot of the 32nd feature, which has a high correlation with the occurrence of the character “e” in the image.
RNN output: Fig. 5 shows a visualization of the RNN output matrix for an image containing the text “little”. The matrix shown in the top-most graph contains the scores for the characters, including the CTC blank label as its last (80th) entry. The other matrix entries, from top to bottom, correspond to the following characters: ` !"#&'()*+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz`. It can be seen that most of the time, the characters are predicted exactly at the position they appear in the image (e.g. compare the position of the “i” in the image and in the graph). Only the last character “e” is not aligned. But this is OK, as the CTC operation is segmentation-free and does not care about absolute positions. From the bottom-most graph showing the scores for the characters “l”, “i”, “t”, “e” and the CTC blank label, the text can easily be decoded: we take the most probable character from each time-step (this forms the so-called best path), then we throw away repeated characters and finally all blanks: “l---ii--t-t--l-…-e” → “l---i--t-t--l-…-e” → “little”.
Fig. 5: Top: output matrix of the RNN layers. Middle: input image. Bottom: Probabilities for the characters “l”, “i”, “t”, “e”
and the CTC blank label.
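Best-path decoding is simple enough to sketch in a few lines. The function below is a toy illustration, not the decoder from the published code; the names and the assumption that the blank is the last (80th) entry are illustrative:

```python
import numpy as np

# Decode a score matrix of shape (time-steps, 80) by best path:
# take the argmax per time-step, collapse repeats, then drop blanks.
def best_path_decode(mat, char_list, blank_idx=79):
    best = np.argmax(mat, axis=1)          # most probable class per time-step
    text, prev = [], None
    for idx in best:
        if idx != prev and idx != blank_idx:
            text.append(char_list[idx])
        prev = idx
    return ''.join(text)
```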
Implementation using TF
The implementation consists of 4 modules:
1. SamplePreprocessor.py: prepares the images from the IAM dataset for the NN
2. DataLoader.py: reads samples, puts them into batches and provides an iterator-interface to go through the data
3. Model.py: creates the model as described above, loads and saves models, manages the TF sessions and provides an interface for training and inference
4. main.py: puts all the previously mentioned modules together
We only look at Model.py, as the other source files are concerned with basic file IO (DataLoader.py) and image processing
(SamplePreprocessor.py).
CNN
For each CNN layer, create a kernel of size k×k to be used in the convolution operation.
Then, feed the result of the convolution into the ReLU operation and then again to the pooling layer with size px×py and step-size sx×sy.
These steps are repeated for all layers in a for-loop.
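A sketch of this loop in TF 1.x, assuming an input placeholder `input_imgs` of shape (batch, 128, 32); the variable names and the exact kernel, channel and pooling lists are assumptions consistent with the description above, not necessarily the published code verbatim:

```python
import tensorflow as tf  # TensorFlow 1.x

kernel_vals = [5, 5, 3, 3, 3]                         # 5×5 first two layers, 3×3 last three
feature_vals = [1, 32, 64, 128, 128, 256]             # channels per layer (assumed)
pool_vals = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2)]  # pooling sizes (assumed)

pool = tf.expand_dims(input_imgs, axis=3)             # (batch, 128, 32, 1)
for i in range(len(kernel_vals)):
    kernel = tf.Variable(tf.truncated_normal(
        [kernel_vals[i], kernel_vals[i], feature_vals[i], feature_vals[i + 1]],
        stddev=0.1))
    conv = tf.nn.conv2d(pool, kernel, strides=(1, 1, 1, 1), padding='SAME')
    relu = tf.nn.relu(conv)
    pool = tf.nn.max_pool(relu,
                          ksize=(1, pool_vals[i][0], pool_vals[i][1], 1),
                          strides=(1, pool_vals[i][0], pool_vals[i][1], 1),
                          padding='VALID')
cnn_out = tf.squeeze(pool, axis=2)                    # (batch, 32, 256)
```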
RNN
Create and stack two RNN layers with 256 units each.
Then, create a bidirectional RNN from it, such that the input sequence is traversed from front to back and the other way round. As a
result, we get two output sequences fw and bw of size 32×256, which we later concatenate along the feature-axis to form a sequence
of size 32×512. Finally, it is mapped to the output sequence (or matrix) of size 32×80 which is fed into the CTC layer.
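A corresponding TF 1.x sketch for the RNN part. It assumes `cnn_out` of shape (batch, 32, 256) from the CNN sketch above, and uses a 1×1 convolution for the final projection onto the 80 character classes (one of several ways to implement this mapping):

```python
num_hidden = 256
cells = [tf.contrib.rnn.LSTMCell(num_units=num_hidden, state_is_tuple=True)
         for _ in range(2)]
stacked = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)

# bidirectional: fw and bw each have shape (batch, 32, 256)
(fw, bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw=stacked, cell_bw=stacked, inputs=cnn_out, dtype=cnn_out.dtype)

# concatenate along the feature axis and expand to 4D: (batch, 32, 1, 512)
concat = tf.expand_dims(tf.concat([fw, bw], 2), 2)

# project onto 80 classes (79 characters + CTC blank) with a 1×1 convolution
num_classes = 80
kernel = tf.Variable(tf.truncated_normal([1, 1, 2 * num_hidden, num_classes],
                                         stddev=0.1))
rnn_out = tf.squeeze(
    tf.nn.conv2d(concat, kernel, strides=(1, 1, 1, 1), padding='SAME'),
    axis=[2])                                         # (batch, 32, 80)
```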
CTC
For loss calculation, we feed both the ground truth text and the matrix to the operation. The ground truth text is encoded as a sparse
tensor. The length of the input sequences must be passed to both CTC operations.
We now have all the input data to create the loss operation and the decoding operation.
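A sketch of these two operations; `rnn_out` comes from the previous sketch, and the placeholder names are illustrative. Note that tf.nn.ctc_loss expects a time-major input and the ground truth as a SparseTensor:

```python
# time-major input for CTC: (32, batch, 80)
ctc_in = tf.transpose(rnn_out, [1, 0, 2])

# ground truth text as a sparse tensor (indices, values, dense shape)
gt_texts = tf.SparseTensor(tf.placeholder(tf.int64, shape=[None, 2]),
                           tf.placeholder(tf.int32, [None]),
                           tf.placeholder(tf.int64, [2]))

# length of each input sequence (32 for every batch element)
seq_len = tf.placeholder(tf.int32, [None])

loss = tf.reduce_mean(tf.nn.ctc_loss(labels=gt_texts, inputs=ctc_in,
                                     sequence_length=seq_len))
decoded, _ = tf.nn.ctc_greedy_decoder(inputs=ctc_in, sequence_length=seq_len)
```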
Training
The mean of the loss values of the batch elements is used to train the NN: it is fed into an optimizer such as RMSProp.
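The training operation itself is a one-liner on top of the loss from the sketch above; the learning rate is an illustrative value, not a tuned one:

```python
# minimize the batch-mean CTC loss with RMSProp
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001).minimize(loss)

# per training step (inside a tf.Session with initialized variables):
# _, loss_val = sess.run([optimizer, loss],
#                        feed_dict={input_imgs: batch_imgs,
#                                   gt_texts: batch_sparse_labels,
#                                   seq_len: [32] * batch_size})
```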
If you want to improve the recognition accuracy, you can follow one of these hints:
Data augmentation: increase dataset-size by applying further (random) transformations to the input images
Increase input size (if input of NN is large enough, complete text-lines can be used)
Decoder: use token passing or word beam search decoding (see CTCWordBeamSearch) to constrain the output to dictionary
words
Text correction: if the recognized word is not contained in a dictionary, search for the most similar one
Conclusion
We discussed a NN which is able to recognize text in images. The NN consists of 5 CNN and 2 RNN layers and outputs a character-
probability matrix. This matrix is either used for CTC loss calculation or for CTC decoding. An implementation using TF is provided
and some important parts of the code were presented. Finally, hints to improve the recognition accuracy were given.
FAQ
There were some questions regarding the presented model. Further reading: IAM dataset, FAQ, Introduction to CTC.