Speeding Up Document Image Classification
BarcelonaTech
June, 2020
Abstract
This work presents a solution based on lightweight Convolutional Neural Networks
(CNNs) for the Document Image Classification task, an essential problem in the
digitalization processes of institutions. We show on the RVL-CDIP dataset that
state-of-the-art results can be achieved with a set of lighter models such as the
EfficientNets, and we present their transfer learning capabilities on a smaller
in-domain dataset, Tobacco3482. Moreover, we present an ensemble pipeline that
boosts the image-only results by combining the image model predictions with the
ones generated by a BERT model on the text extracted by OCR. We also show that
the batch size can be increased without hurting accuracy, so that training can be
sped up by parallelizing across multiple GPUs, decreasing the computational time
needed.
Acknowledgements
I would like to give special thanks to my advisor Jordi Torres for under-
standing what my interests were and for allowing me to focus on them, for the
great advice and for giving me the freedom to learn that much. I would also
like to acknowledge the help of Juan Luis Domínguez, one of the greatest problem
solvers I have ever met, for always giving me sensational input and for fixing
countless issues I have had.
I would also like to express my gratitude to everyone in the Barcelona Su-
percomputing Center who has contributed to this work, from my colleagues in
C6-E201 to the Support team.
Lastly, thanks to the collaboration with CaixaBank for creating the oppor-
tunity that has led to this work.
Contents
1 Introduction
  1.1 Document Image Classification
    1.1.1 Datasets
    1.1.2 Technologies
2 Theoretical Framework
  2.1 Deep Learning
  2.2 Deep Feedforward Networks
  2.3 Convolutional Neural Networks
  2.4 Recurrent Neural Networks
    2.4.1 LSTM Networks
  2.5 Computer Vision
    2.5.1 CNNs for Image Classification
    2.5.2 Transfer Learning in Computer Vision
  2.6 Natural Language Processing
    2.6.1 Word Embeddings
      2.6.1.1 Neural Model
      2.6.1.2 Continuous Bag-of-Words (CBOW)
      2.6.1.3 Continuous Skip-Gram Model
    2.6.2 Sequence to sequence learning
      2.6.2.1 CoVe
    2.6.3 Language Models
      2.6.3.1 ELMo
      2.6.3.2 ULM-Fit
    2.6.4 Transformer-based models
    2.6.5 BERT
  2.7 Related Work in Document Image Classification
3 Results
  3.1 Evaluation
  3.2 Early Results
  3.3 Proposed Approach
    3.3.1 Image model
    3.3.2 Pre-training on BigTobacco
    3.3.3 Fine-tuning on SmallTobacco
    3.3.4 Text model
    3.3.5 Image and Text ensemble
  3.4 Results on BigTobacco
  3.5 Results on SmallTobacco
4 Parallel and Distributed Deep Learning
  4.1 Data Loading
  4.2 Data Parallelism
    4.2.1 PyTorch's DataParallel library
    4.2.2 PyTorch's DistributedDataParallel library
    4.2.3 Parallel platforms results
5 Conclusion
  5.1 Acquired knowledge
  5.2 Future work
4 Parallel and Distributed Deep Learning 46
4.1 Data Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 PyTorch’s DataParallel library . . . . . . . . . . . . . . . . . . 47
4.2.2 PyTorch’s DistributedDataParallel library . . . . . . . . . . . 48
4.2.3 Parallel platforms results . . . . . . . . . . . . . . . . . . . . . 50
5 Conclusion 51
5.1 Acquired knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
List of Figures
1 One document example of each of the classes in BigTobacco
2 SmallTobacco classes distribution
3 Fully-connected feedforward network
4 Applying two 3 × 3 × 3 filters over a 6 × 6 × 3 input space. Source [35]
5 Max-pooling operation. Source [19]
6 Zero-padding over a 2D input. Source [3]
7 RNN. Source [51]
8 LSTM Network. Source [51]
9 Steps of an LSTM. Source [51]
10 VGG-16 architecture. Source [30]
11 Residual connection. Source [31]
12 ImageNet sample. Source [22]
13 Feedforward Neural Network Language Model
14 Continuous Bag-of-Words (CBOW) model
15 Continuous Skip-Gram model
16 Attention mechanism from [44]
17 CoVe architecture. Source [66]
18 The Transformer architecture. Source [64]
19 Multi-head attention layer. Source [64]
20 The Transformer encoder. Source [5]
21 The Transformer decoder in action. Source [5]
22 BERT architecture. Source [12]
23 CNN operation on the sequence embedding matrix [67]
24 Mobile inverted bottleneck block
25 EfficientNet B0 architecture
26 Slanted Triangular learning rate
27 Pipeline of the different stages of the pre-training of EfficientNet over multiple GPUs
28 Pipeline of the proposed multimodal approach
29 Speedup of the training process when parallelizing
30 Accuracy obtained in SmallTobacco by models pre-trained on BigTobacco (left) and without BigTobacco pre-training (right)
31 Data loading process. Source [49]
32 Data Parallelism. Source [26]
33 PyTorch's DataParallel. Source [49]
34 Ring All-Reduce method. Source [26]
35 PyTorch's DistributedDataParallel. Source [49]
36 PyTorch DDP vs DP speedup
1 Introduction
This thesis emerges from a real problem that appears in the document digitization
process, which has become common practice in a wide variety of industries that
deal with vast amounts of archives. This is the case of the financial institution Caix-
aBank, a collaborator of the Barcelona Supercomputing Center - Centro Nacional de
Supercomputación. The issue can be categorized as a document classification problem,
a task that must be faced when trying to automate document processes. Although
the initial target of this work was to provide a solution to the aforementioned
institution, we present the results using publicly available data. In Figure 1 we show
one document for each of the classes of the dataset used. From left to right: Letter,
Form, Email, Handwritten, Advertisement, Scientific report, Scientific publication,
Specification, File folder, News article, Budget, Invoice, Presentation, Questionnaire,
Resume and Memo.
using a pretrained network on ImageNet [22], and later models have become increas-
ingly heavier [1, 63, 21, 2], with the speed and computational-resource drawbacks this
entails.
Table 1: Parameters of the CNNs architectures used in BigTobacco.
Model #Params
AlexNet 60.97M
VGG-16 138.36M
ResNet-50 25.56M
Inception-V3 23.83M
EfficientNet-B2 9.2M
EfficientNet-B0 5.3M
• Training process speed up: we demonstrate the ability of these models to main-
tain their performance while saving a large amount of time by parallelizing
over several GPUs. We also show the performance differences between the two
PyTorch implementations to solve this task.
1.1.1 Datasets
As mentioned earlier, in this work we make use of two publicly available datasets
containing images of scanned documents from US tobacco companies, drawn from
the Legacy Tobacco Industry Documents archive created by the University of
California San Francisco (UCSF). We find these datasets a good representation of
what enterprises and institutions may be faced with, based on the quality and type of
classes. Furthermore, they have been the go-to datasets in this research field since 2014,
which allows us to compare results.
RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) is
a 400,000-document sample (BigTobacco from now onwards) presented in [29] for
document classification tasks. This dataset contains the first page of each document,
labeled into 16 different classes with an equal number of elements per class. A smaller
sample containing 3482 images was proposed in [36] as Tobacco3482 (SmallTobacco
henceforth). This dataset is formed by documents belonging to 10 classes, not
uniformly distributed.
[Figure 2: SmallTobacco class distribution over the 10 classes: ADVE, Email, Form, Letter, Memo, News, Note, Report, Resume, Scientific.]
In order to work with data formats that can both be fed correctly into the
classification system and be stored in our cluster's memory, we need to
transform the datasets. In this case, we build HDF5 files to work with PyTorch
and TFRecords for the TensorFlow case. Both HDF5 and TFRecord formats
serialize the dataset into a single file, which makes it much easier to deal
with.
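As an illustration of the kind of serialization involved, the following is a minimal sketch (not the exact script released with this work) that packs a class-per-folder tree of document images into a single HDF5 file with h5py; the folder layout, dataset names and image size are illustrative assumptions.

```python
import os
import h5py
import numpy as np
from PIL import Image

def build_hdf5(image_root, classes, out_path, size=(384, 384)):
    """Serialize a class-per-folder image tree into one HDF5 file."""
    images, labels = [], []
    for label, cls in enumerate(classes):
        cls_dir = os.path.join(image_root, cls)
        for name in sorted(os.listdir(cls_dir)):
            img = Image.open(os.path.join(cls_dir, name)).convert("L")
            images.append(np.asarray(img.resize(size), dtype=np.uint8))
            labels.append(label)
    with h5py.File(out_path, "w") as f:
        f.create_dataset("images", data=np.stack(images), compression="gzip")
        f.create_dataset("labels", data=np.asarray(labels, dtype=np.int64))

# Example (hypothetical paths):
# build_hdf5("Tobacco3482", ["ADVE", "Email", "Form"], "smalltobacco.h5")
```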
To deal with text data in the SmallTobacco case, we extract the characters from
the document images by applying Tesseract OCR [58] to them.
We provide scripts to run the OCR and to save the datasets in HDF5 and TFRecord formats
in the accompanying repository https://javiferran.github.io/document-classification/.
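A minimal sketch of the OCR step, assuming the pytesseract wrapper around the Tesseract engine (the released scripts may call Tesseract differently):

```python
from PIL import Image
import pytesseract  # Python wrapper around the Tesseract OCR engine

def extract_text(image_path):
    """Run Tesseract on a scanned document image and return the raw text."""
    return pytesseract.image_to_string(Image.open(image_path))

# text = extract_text("Tobacco3482/Email/some_document.jpg")  # hypothetical path
```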
1.1.2 Technologies
Several frameworks have been used in this work. The Deep Learning library used
to develop all the different experiments has been PyTorch (https://pytorch.org/),
although we also provide some of the code for its usage in TensorFlow
(https://www.tensorflow.org/). PyTorch, developed by Facebook, and TensorFlow,
by Google, are the two main Deep Learning frameworks nowadays. PyTorch is the
favorite among researchers, while TensorFlow is more widely used in production by
companies. We selected PyTorch because of its compatibility with some of the NLP
frameworks such as FLAIR (https://github.com/flairNLP/flair, from Zalando Research)
and Transformers (https://huggingface.co/transformers/, from Hugging Face), although
the latter was adapted to TensorFlow during the development of this work. Since
PyTorch is more used in research and this project deals with some state-of-the-art
results, the fact that some of the code behind recent results is only available in
PyTorch, especially in the NLP field, has also made us choose this option.
As mentioned earlier, the two main NLP libraries used in this project are FLAIR
and Transformers. The first includes a large number of NLP models, from which one
can obtain word embeddings or even task-ready outputs in a user-friendly manner.
The second focuses on the Transformer architecture, providing pre-trained models
in different domains and languages.
2 Theoretical Framework
2.1 Deep Learning
The methods proposed in this work are based on supervised Deep Learning, where
each document is associated with a class (label) so that the algorithms are trained
by minimizing the error between the predictions and the ground truth. Deep Learning
is a branch of machine learning that deals with deep neural networks, where each
layer is trained to extract higher-level representations of the previous ones. These
models are trained by iteratively solving an unconstrained optimization problem. In
each iteration, a random batch of training data is fed into the model to compute
the loss function value. Then, the gradient of the loss function with respect to the
weights of the network is computed (backpropagation) and the weights are updated
in the negative direction of the gradient. These networks are trained until they
converge to a minimum of the loss function.
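In PyTorch terms, the loop just described reduces to a few lines. This is a minimal sketch assuming a generic `model`, `train_loader` and cross-entropy loss, rather than the exact setup used later in the experiments:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, train_loader, optimizer, device="cuda"):
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in train_loader:           # random mini-batches
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)   # loss on the batch
        loss.backward()                           # backpropagation
        optimizer.step()                          # gradient-descent update
```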
Figure 3 depicts a fully-connected feedforward network, where each neuron is
connected to every neuron of the previous layer. The computation performed by a
neuron $j$ of layer $l$ is:

$$n_j^l = \sum_{i=0}^{\mathrm{Dim}(l-1)} w_{ij}^l\, o_i^{l-1}$$

where $o^{l-1}$ is the previous layer output and $w_{ij}^l$ represents the weight connecting
the previous layer output $o_i^{l-1}$ to neuron $j$. The activation function applies
a non-linear transformation to $n_j^l$, which could be a differentiable function mapping
values to the $[0,1]$ range such as the sigmoid, giving $o_j^l = g(n_j^l)$. A bias term $b$
can be added to the calculation.
One way to represent the previous calculations for a complete layer transformation
is by means of a weight matrix $W^l$, whose columns store the weights of the individual
neurons, so the output of a hidden layer $l$ can be computed as $o^l = g(o^{l-1} W^l + b)$.
The notation follows [3].
Usually, each convolution layer is associated with an activation layer, where an ac-
tivation function is applied elementwise to the whole output volume, without changing
its dimension. To reduce the number of parameters of the network and its computa-
tional costs, a pooling layer is typically located between convolution operations. The
pooling layer takes a region $P_q \times P_q$ in each of the $d_q$ activation maps and performs
an arithmetic operation, as shown in Figure 5. The most widely used pooling layer is
max-pooling, which returns the maximum value of the aforementioned region.
Some important hyperparameters to take into account when building CNNs are
the stride, the zero-padding, the number of filters and their size $F_q$. The stride defines
the number of elements of the input matrix that are skipped at each step of the sliding
window; the higher the stride, the smaller the output volume. Zero-padding expands the
input volume by including zero elements around it, as shown in Figure 6.
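The effect of the filter size, stride, padding and pooling on the output volume can be checked directly in PyTorch; a short sketch with arbitrary sizes, mirroring the 6 × 6 × 3 input and two 3 × 3 × 3 filters of Figure 4:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 6, 6)                      # batch of one 6x6 RGB input

conv = nn.Conv2d(in_channels=3, out_channels=2,  # two 3x3x3 filters
                 kernel_size=3, stride=1, padding=0)
pool = nn.MaxPool2d(kernel_size=2)               # 2x2 max-pooling

feat = torch.relu(conv(x))                       # activation applied elementwise
print(feat.shape)                                # torch.Size([1, 2, 4, 4])
print(pool(feat).shape)                          # torch.Size([1, 2, 2, 2])
```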
So the output $h_{t+1}$ depends on both the input $x_{t+1}$ and the previous hidden state $h_t$.
Figure 7 shows the folded (left) and the unfolded (right) representation of an RNN.
The problem with vanilla RNNs is the difficulty they have dealing with long-term
dependencies. For a long sequence such as a long sentence, the hidden state at the
last time step will carry little information about the beginning of the sentence,
which in some cases can hurt the performance of the model. To solve this issue, a
new variant named the Long Short-Term Memory (LSTM) network was proposed by
Hochreiter and Schmidhuber in [33] and has been widely adopted in NLP.
Each LSTM cell is composed of four layers (Figure 9): the yellow squares represent
weight matrices followed by an activation function, either a sigmoid $\sigma$ or a tanh, and
the pink circles represent pointwise operations. Starting from the left, the forget gate
(1) rules how much of the previous cell state is kept: $f_t$ values close to 0 mean that
the information coming from the previous state is discarded, and values close to 1
mean that almost all of it is let through.
Next, in the input gate (2), the concatenation of $h_{t-1}$ and $x_t$ is used to compute
$i_t$, which decides which of the candidate values $\tilde{C}_t$ (3), calculated through the tanh
layer, are added to the new cell state $C_t$. The cell state $C_t$ (4) is then updated by
multiplying the previous cell state by $f_t$ and adding $i_t * \tilde{C}_t$. Finally, the hidden state
$h_t$ (6) is obtained by multiplying the tanh transformation of $C_t$ by the output of the
sigmoid gate $o_t$ (5). The mathematical formulation of the LSTM is the following:
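These are the standard LSTM equations of [33], written out here to match the gate numbering used above; the weight-matrix notation follows the figure source [51]:

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{(1) forget gate}\\
i_t &= \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(2) input gate}\\
\tilde{C}_t &= \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) && \text{(3) candidate values}\\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t && \text{(4) cell state update}\\
o_t &= \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(5) output gate}\\
h_t &= o_t * \tanh(C_t) && \text{(6) hidden state}
\end{aligned}
$$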
As we will see in Section 2.6, LSTMs have been used to obtain context-based word
representations, which has driven improvements in Natural Language Processing
tasks.
the error rate (AlexNet [40]), marking the beginning of the explosion of deep neural
networks. From then onwards, deeper networks have become the norm.
The use of deeper neural networks led to training difficulties such as the degrada-
tion problem, where accuracy saturates and then degrades quickly when more layers
are added to the network. To help mitigate this issue, the ResNet family [31] was
introduced in 2015 and was considered a big leap forward. The ResNet family of models
makes use of shortcut connections (Figure 11) as a solution to the training difficulties
imposed by deep networks.
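A minimal sketch of the shortcut-connection idea (not the exact ResNet bottleneck block): the block only has to learn the residual F(x), and the identity path lets gradients flow through deep stacks.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the block learns the residual F(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # shortcut (identity) connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```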
Figure 11: Residual connection. Source [31]
Figure 12: ImageNet sample. Source [22]
2.6 Natural Language Processing
The features learned from the OCR output are achieved by means of Natural Lan-
guage Processing techniques. NLP is the field that deals with the understanding of
human language by computers, which captures underlying meanings and relationships
between words.
The way machines deal with words is by means of real-valued vector represen-
tations. Word2Vec [48] showed that a vector could represent semantic and syntactic
relationships between words. CoVe [45] introduced the concept of context-based em-
beddings, where the same word can have a different vector representation depending
on the surrounding text. ELMo [52] followed CoVe but with a different training ap-
proach, by predicting the next word in a text sequence (Language Modeling), which
made it possible to train on a large available text corpus. Depending on the task
(such as text classification, named entity recognition...) the output of the model can
be treated in different ways. Moreover, custom layers can be added to the features
extracted by these NLP models. For instance, ULM-Fit [34] introduced a language
model and a fine-tuning strategy to effectively adapt the model to various downstream
tasks, which pushed transfer learning forward in the NLP field. Lately, the Transformer
architecture [64] has dominated the scene, with the bidirectional Transformer encoder
(BERT) [23] recently establishing state-of-the-art results on several downstream
tasks.
[Figure 13: Feedforward Neural Network Language Model — one-hot input vectors, shared projection matrix C, tanh hidden layer H and output matrix U followed by a softmax over the vocabulary V.]
[Figure 14: Continuous Bag-of-Words (CBOW) model — the one-hot vectors of the surrounding context words are projected through the shared matrix C, combined into a single averaged representation, and multiplied by U to produce a softmax distribution over the vocabulary.]
[Figure 15: Continuous Skip-Gram model — the one-hot vector of the center word is projected through C and multiplied by U to produce a softmax distribution over the vocabulary for each surrounding position.]
A common loss function used to measure the similarity between two probability
distributions, the predicted probabilities $\hat{y}$ and the ground truth $y$, is the cross-entropy
function, which in the discrete setting is $H(\hat{y}, y) = -\sum_{v \in V} p(y_v) \log p(\hat{y}_v)$. In our
case the probability $y$ is zero for every word in $V$ except for the center word $w_t$:

$$
\begin{aligned}
L_\theta &= -\sum_{i=1}^{|V|} P(y_i) \log P(\hat{y}_i) \\
         &= -\log P(\hat{y}_t) \\
         &= -\log P(w_t \mid C) \\
         &= -\log P(u_t \mid \bar{C}) \\
         &= -\log \frac{\exp(u_t^T \bar{C})}{\sum_{j=1}^{|V|} \exp(u_j^T \bar{C})} \\
         &= -u_t^T \bar{C} + \log \sum_{j=1}^{|V|} \exp(u_j^T \bar{C})
\end{aligned}
$$
The weights of the hidden layer represent the embeddings of each of the words in the
vocabulary. Once trained, the $C$ matrix can be used as a lookup table for other NLP
tasks.
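A minimal sketch of this objective in PyTorch, assuming a toy vocabulary and a CBOW-style setup in which the context embedding $\bar{C}$ is the mean of the context word embeddings; all dimensions and indices are arbitrary:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 1000, 50
C = nn.Embedding(vocab_size, emb_dim)           # input embedding matrix (lookup table)
U = nn.Linear(emb_dim, vocab_size, bias=False)  # output word vectors u_j

context = torch.tensor([[3, 17, 42, 8]])        # context word indices (one window)
center = torch.tensor([5])                      # center word w_t

c_bar = C(context).mean(dim=1)                  # averaged context embedding, shape (1, emb_dim)
scores = U(c_bar)                               # u_j^T c_bar for every word in V
loss = nn.functional.cross_entropy(scores, center)  # -u_t^T c_bar + log sum_j exp(u_j^T c_bar)
loss.backward()
# After training, C.weight can be used as the word-embedding lookup table.
```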
2.6.2 Sequence to sequence learning
Ilya Sutskever et al. [60] proposed a model that learned to map sequences to sequences
by means of RNNs. This paper started a new way of training models by making use
of the translation task. Machine translation is a subfield of NLP where models try to
translate from a source language to a target language. Initially, statistical machine
translation models were used, but neural machine translation (NMT) has lately become
the predominant type of architecture. A neural machine translation system is
a neural network that directly models the conditional probability $p(y \mid x)$ of translating
a source sentence, $x_1, \dots, x_n$, into a target sequence, $y_1, \dots, y_m$. Basic NMT models
are made of an encoder and a decoder. The encoder calculates a representation $s$ of
the input sequence and the decoder generates a sequence based on the encoder
representation and the previous target words. Typically, the encoder and decoder
are RNNs.
The decoder defines the probability over the target sentence $y$ as:

$$p(y) = \prod_{j=1}^{m} p(y_j \mid y_{<j}, s)$$

The conditional probability of each target word can be calculated from an RNN
decoder as $p(y_j \mid y_{<j}, s) = \mathrm{softmax}(g(\tilde{h}_j))$, where $\tilde{h}_j$ is the attentional vector described below.
$$\alpha_t(s) = \mathrm{align}(h_t, h_s) = \frac{\exp(\mathrm{score}(h_t, h_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, h_{s'}))}$$

where the score function can be, for instance, the dot product between $h_t$ and $h_s$.
Finally, the attentional vector $\tilde{h}_t$ is passed through a linear transformation $g$ together
with a softmax layer to produce the distribution over the target vocabulary.
2.6.2.1 CoVe
Machine translation was used by McCann et al. to give context information to word
vector representations [45]. These word embeddings are learned by means of a two-
layer bidirectional LSTM, which the authors call MT-LSTM. The MT-LSTM is used
as the encoder whose learned knowledge is later transferred to other downstream tasks.
Firstly, GloVe vectors of the source language words $w^x = [w_1^x, \dots, w_n^x]$ are used as
input to the MT-LSTM, which computes a sequence of hidden states, one for each input
word:

$$h = \mathrm{MT\text{-}LSTM}(\mathrm{GloVe}(w^x))$$

These hidden states are the newly learned word vectors $\mathrm{CoVe}(w^x)$.
To be more precise, the MT-LSTM produces $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, with
$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1})$ and $\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t-1})$,
where $x_t$ represents the $\mathrm{GloVe}(w^x)$ vector.
During the training phase, an attentional decoder is placed on top of the encoder,
producing a conditional distribution over the target language words
$p(w_t^z \mid H, w_1^z, \dots, w_{t-1}^z)$, where $H$ represents the stack of hidden states of every
input word. This decoder is a two-layer unidirectional LSTM that gives a hidden state
$h_t^{dec}$ for each time step, calculated as

$$h_t^{dec} = \mathrm{LSTM}([z_{t-1}; \tilde{h}_{t-1}],\, h_{t-1}^{dec})$$
2.6.3.1 ELMo
Embeddings from Language Models (ELMo) [52] continued the concept of context-
based embeddings, where the same word can have different vector representations
depending on the surrounding text. The ELMo model is trained by predicting a word in
a text sequence, which makes it possible to train it on large available text corpora in a
self-supervised way. In this case, it uses the words before and after the target word. Word
representations are obtained by means of a linear combination of the internal states
of the network, with a task-dependent parameter that changes the result of this linear
combination depending on the task to be performed (text classification, named entity recognition...).
ELMo uses a two-layer bidirectional LSTM language model (biLM); each biLSTM layer $j$ outputs
hidden states $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for the $k$-th token, where $j = 0$ is the token embedding
layer and $j = L$ the top layer. The ELMo representation of the $k$-th token for a specific
task is computed as follows:

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}$$

where $s_j^{task}$ are softmax-normalized weights and $\gamma^{task}$ allows scaling the entire
ELMo vector.
2.6.3.2 ULM-Fit
ULM-Fit (Universal Language Model Fine-tuning) [34] is a fine-tuning procedure to
achieve transfer learning in NLP tasks. Specifically, Howard et al. propose discrimina-
tive fine-tuning, slanted triangular learning rates, and gradual unfreezing techniques.
In their experiments they use the AWD-LSTM model [47], a regular LSTM which, to-
gether with the proposed pre-training methods, is sufficient to beat the previous state-
of-the-art on several text classification datasets. We use slanted triangular learning
rates in the training process of the image model, discussed in Section 3.3.2,
and discriminative fine-tuning when training the text model in Section 3.3.4.
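A minimal sketch of the slanted triangular schedule as defined in [34]; the hyperparameter values shown are the paper's defaults and are assumptions with respect to the exact configuration used in this work:

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate (Howard & Ruder, 2018):
    a short linear warm-up followed by a long linear decay."""
    cut = int(T * cut_frac)                      # iteration where the peak is reached
    if t < cut:
        p = t / cut                              # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decreasing phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

# e.g. plugged into a PyTorch scheduler (base lr in the optimizer set to lr_max):
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda t: slanted_triangular_lr(t, T=10000) / 0.01)
```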
Figure 18: The Transformer architecture. Source [64]
The Transformer model is trained in the machine translation task, where inputs
are sequences of words of the source language and the outputs are the translation into
a target language.
The encoder receives as input either a word embedding plus a positional embedding
(which gives information about the order of the tokens in the sequence), in the case of
the first layer, or the output of the previous encoder. In either case, this input is
first passed through a self-attention layer, which performs the following operation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The Query (Q), Key (K) and Value (V) matrices are created by multiplying
the embedding matrix (the stack of token embeddings) or the previous encoder output
matrix (the stack of previous encoder output vectors) by three matrices learned during
training, $W^Q$, $W^K$ and $W^V$. Q, K and V can be seen as matrices
where each row represents the $q$, $k$ and $v$ vector of each token.
Figure 19: Multi-head attention layer. Source [64]
For a better understanding, we can simplify this by calculating the attention of the
$i$-th token, $\mathrm{Attention}_i$, using its vector $q_i$. The self-attention calculation in this case
is:

$$\mathrm{Attention}_i(q_i, K, V) = \mathrm{softmax}\!\left(\frac{q_i K^T}{\sqrt{d_k}}\right)V$$
A score between the input token and every other token of the sequence is calculated.
This score represents how much focus to put on the other parts of the sequence when
encoding a specific token ($i$ in this case). A dot product is calculated between
the $q_i$ vector and the key vector of every other token in the sequence (the intuition
is that this is a similarity operation: we look for the keys most similar to the query
and assign them a high value), which is equivalent to multiplying $q_i$ by the matrix $K^T$,
giving as many scores as elements in the sequence. These scalars are then
divided by the square root of the dimension of the key vectors, $d_k$ (to obtain more
stable gradients), and finally passed through a softmax operation. Once we have a
normalized score for every pair of tokens, we multiply each score by the value vector of
the corresponding token (each row in matrix $V$) and sum up those vectors, generating
one single vector $z_i$. So for each query we get one vector $z_i$.
Turning back to the matrix notation, the previous process can be done for the
entire input sequence by using the matrix $Q$ (each row is a token's $q$ vector). The
benefit of these matrix operations is that they can be parallelized. In this case we
end up with a matrix $Z$, with each row representing the self-attention vector of
each token.
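A minimal sketch of the scaled dot-product attention described above (single head, no batching, arbitrary sizes):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # one score per token pair
    weights = torch.softmax(scores, dim=-1)            # normalized over the sequence
    return weights @ V                                 # one vector z_i per query

seq_len, d_k = 5, 64
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # torch.Size([5, 64])
```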
Figure 20: The Transformer encoder. Source [5]
This process is done in each of the self-attention heads; in the original Transformer
the model has 8 heads, so in total 8 matrices are output. Each head has its own
$W^Q$, $W^K$ and $W^V$ weight matrices, which are randomly initialized, so each
self-attention head learns a different representation subspace.
The output matrices $Z$ of all self-attention heads are concatenated into a single
matrix to which a linear transformation is applied,
$\mathrm{MultiHead} = \mathrm{Concat}(Z_{head_1}, \dots, Z_{head_h})W^O$, as shown in Figure 19. This linear pro-
jection is added to the initial embeddings entering the encoder and then a LayerNorm
operation is performed.
On top of each encoder and decoder layer the Transformer has a feed-forward
network (FFN), which consists of two linear transformations with a ReLU activation
after the first one.
The decoder takes the encoder output as its $K$ and $V$ matrices.
Its first self-attention layer, also composed of 8 heads, gets its $K$, $V$ and $Q$ inputs from
the output sequence, but only from previously seen positions; the posterior ones are
masked.
Figure 21: The Transformer decoder in action. Source [5]
The decoder outputs a vector of floats that is turned into a probability distribution
over the target language by applying a linear transformation followed by a softmax
operation. The cross-entropy loss function is used to compare the output probability
distribution for each word with the ground-truth probability distribution.
2.6.5 BERT
The BERT model [23] was presented in 2018 and marked a major breakthrough in NLP.
This model is based on the encoder layers of the Transformer, so understanding
how BERT works first required diving into the Transformer model.
BERT gets rid of the decoder part of the Transformer, which leaves the
architecture shown in Figure 22 (note that this is a top-down view). Two model sizes
were presented: BERT BASE with 12 layers, 12 attention heads and hidden vectors
$\in \mathbb{R}^{768}$, and BERT LARGE with 24 layers, 16 attention heads and hidden vectors
$\in \mathbb{R}^{1024}$.
Figure 22: BERT architecture. Source [12]
BERT introduces a language model that is able to learn from both sides of the
word it tries to predict, instead of a left-to-right model or a separate right-to-left and
left-to-right training such as in [52]. BERT's objective is to provide a model that can be
fine-tuned on downstream tasks with minor changes to the architecture, starting
from the pre-trained parameters.
BERT is pre-trained using two tasks: Masked LM and Next Sentence Prediction
(NSP). To solve the problem of the attention mechanism seeing, in a multilayered
model, the token it is trying to predict, BERT uses masks: 15% of the tokens are
selected at random, and each selected token is replaced with the [MASK] token with
probability 0.8. The NSP task consists of predicting whether a pair of sentences are
consecutive: sentences A and B are chosen such that 50% of the time B follows A and
50% of the time it does not. In this case a [SEP] token is added between the two
tokenized sentences, and a segment embedding describing whether a token belongs to the
first or the second sentence is summed to the input tokens.
BookCorpus [68] and English Wikipedia are used to pre-train the model.
have tested their models on the large-scale sample BigTobacco. Others have tried the
smaller version named SmallTobacco, which can be seen as a more realistic scale
of annotated data that users might be able to find. Lastly, transfer learning from
in-domain datasets has been tested by using BigTobacco to pre-train the models that
are finally fine-tuned on SmallTobacco. Table 2 summarizes the results of previous works
in the different categories over time.
The first results in the Deep Learning era have been mainly based on CNNs using
transfer learning techniques. Multiple networks were trained on specific sections of
the documents [29] to learn region-based high-dimensional features, later compressed
via Principal Component Analysis (PCA). The use of multiple Deep Learning models
was also exploited by Arindam Das et al. by using an ensemble as a meta-classifier
[21]. A stack of VGG-16 networks using 5 different classifiers has been proposed
[57], one of them trained on the full document and the others specifically on the
header, footer, left body and right body. The Multi-Layer Perceptron (MLP) was the
ensemble that performed the best. A committee of models, but with an SVM as the
ensemble, was also proposed [53].
This was further investigated by adding content-based information, applying a 2D CNN with
ranking textual features (ACC2) to the OCR-extracted text.
To the best of our knowledge, there is no study about the use of multiple GPUs in the
training process for the task of Document Image Classification. However, parallelizing
a computer vision task has been shown to work properly using ResNet-50, a widely
used network that usually gives good results despite its low-complexity architecture.
Several training procedures have been demonstrated to work effectively with
this model [4, 27]: a learning rate proportional to the batch size, learning rate warmup,
batch normalization, and an SGD-to-RMSProp optimizer transition
are some of the techniques exposed in these works. A study of distributed training
methods using the ResNet-50 architecture on an HPC cluster is shown in [14, 15]. To
learn more about the algorithms used in this field, see [10].
3 Results
In this section we compare the performance of the different EfficientNets on Small-
Tobacco and BigTobacco, as shown in Table 2, and demonstrate the benefits of
multi-GPU training. Experiments have been carried out on the Power-CTE GPU cluster
of the Barcelona Supercomputing Center - Centro Nacional de Super-
computación, each node composed of: 2 IBM Power9 8335-GTGH at 2.40 GHz (20
cores and 4 threads/core), 512 GB of main memory distributed in 16 dimms × 32 GB
at 2666 MHz, and 4 NVIDIA V100 (Volta) GPUs with 16 GB HBM2.
The operating system is RedHat Linux 7.4. The models and their training are
implemented with PyTorch version 1.0 running on CUDA 10.1 and using cuDNN
7.6.4.
The only modification done to the images is a resize to 384 × 384, as explained
in Section 3.3.1, and, in order to avoid overfitting, a shear transformation of an angle
θ ∈ [−5°, 5°] [63] which is randomly applied in the training phase. No other modifica-
tions are used in our experiments. Source code is at https://github.com/javiferran/
document-classification.
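In torchvision terms, the preprocessing just described can be expressed as follows; this is a sketch of the two operations (resize and random shear), not necessarily the exact transform pipeline of the released code:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((384, 384)),                      # fixed input size
    transforms.RandomAffine(degrees=0, shear=(-5, 5)),  # random shear in [-5, 5] degrees
    transforms.ToTensor(),
])
```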
3.1 Evaluation
In order to compare with previous results on the SmallTobacco dataset, we divide the
dataset following the procedure in [36]. Documents are split into training, test and
validation sets, containing 800, 2482 and 200 samples each. 10 different splits of
the dataset are created by randomly sampling from the 3482 documents so that 100
samples per class are guaranteed between the train and validation sets. In Figure 30
we give the accuracy on SmallTobacco as the median over the 10 dataset splits, to
compare with previous results. Accuracy on BigTobacco is reported on the test set.
The BigTobacco dataset used in Section 3.5 is slightly modified: documents overlapping
with SmallTobacco are removed. The top-performing models' accuracies are reported
in Table 2.
Class Type Paper
BertEmbeddings Embeddings from pretrained BERT [23]
BytePairEmbeddings Subword-level word embeddings Heinzerling and Strube (2018)
CharacterEmbeddings Task-trained character-level embeddings Lample et al. (2016)
ELMoEmbeddings Contextualized word-level embeddings [52]
FastTextEmbeddings Word embeddings with subword features Bojanowski et al. (2017)
FlairEmbeddings Contextualized character-level embeddings Akbik et al. (2018)
PooledFlairEmbeddings Pooled variant of FlairEmbeddings Akbik et al. (2019)
OneHotEmbeddings Standard one-hot embeddings of text or tags -
OpenAIGPTEmbeddings Embeddings from pretrained OpenAIGPT Radford et al. (2018)
RoBERTaEmbeddings Embeddings from RoBERTa Liu et al. (2019)
TransformerXLEmbeddings Embeddings from pretrained transformer-XL Dai et al. (2019)
WordEmbeddings Classic word embeddings
XLNetEmbeddings Embeddings from pretrained XLNet Yang et al. (2019)
XLMEmbeddings Embeddings from pretrained XLM Lample and Conneau (2019)
of the words into a larger vector. With these options, the possibilities are almost end-
less.
Given the embeddings of a sequence (a document in our case), one can concatenate
them into a single vector and connect it to an MLP, or stack the embeddings as a
matrix and perform more complex operations, such as convolutions over
the matrix as in Figure 23. This method was successfully demonstrated in [39] and works
by applying a 2D CNN operation with different filter sizes. Each filter slides in
one direction only, through the rows of the embedding matrix. The width of
the filter is determined by the word embedding vector length and the height is left
as a hyperparameter. A max-pooling operation is applied to the feature maps obtained
from the convolution. Then, these values are concatenated into a single
vector and passed to a fully-connected layer with a softmax function.
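A minimal sketch of this Kim (2014)-style text CNN; the embedding dimension, number of filters and filter heights are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """2D CNN over a stack of word embeddings, as in Kim (2014)."""
    def __init__(self, emb_dim=300, n_filters=100, filter_heights=(3, 4, 5), n_classes=10):
        super().__init__()
        # each filter spans the full embedding width and `h` consecutive words
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_filters, kernel_size=(h, emb_dim)) for h in filter_heights)
        self.fc = nn.Linear(n_filters * len(filter_heights), n_classes)

    def forward(self, x):                      # x: (batch, seq_len, emb_dim)
        x = x.unsqueeze(1)                     # add a channel dimension
        feats = [torch.relu(conv(x)).squeeze(3) for conv in self.convs]
        pooled = [f.max(dim=2).values for f in feats]    # max-pool over the sequence
        return self.fc(torch.cat(pooled, dim=1))         # logits; softmax in the loss

logits = TextCNN()(torch.randn(2, 50, 300))
print(logits.shape)  # torch.Size([2, 10])
```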
Figure 23: CNN operation on the sequence embedding matrix [67]
In this work we carry out multiple experiments with this embedding + 2D CNN
configuration; some of the results obtained are shown in Table 3. Moreover,
FLAIR allows obtaining a single embedding per document (document embedding),
computed from a chosen word embedding type by applying a pooling operation (max,
min or mean) over all the word embeddings of the document. Note that in the table we
show the document embedding results using FLAIR forward embeddings.
Document embedding (mean) 69.60%
After testing several of the FLAIR embeddings, BERT outperforms every other em-
bedding. To further investigate how BERT embeddings can help improve the per-
formance on this task, we continue with Hugging Face's Transformers library and
fine-tune the model directly. The details of the implementation are explained in Section
3.3.4.
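A minimal sketch of fine-tuning BERT for sequence classification with the Transformers library; it assumes a recent version of the library and the bert-base-uncased checkpoint, and the actual text model of Section 3.3.4 may differ in details such as sequence length, learning rate schedule and batching:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=10)

texts = ["text extracted by OCR from one document", "another OCR output"]  # toy examples
labels = torch.tensor([0, 3])

batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
outputs = model(**batch, labels=labels)
loss = outputs[0]                              # classification loss (first returned element)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss.backward()
optimizer.step()
```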
Figure 25: EfficientNet B0 architecture
[Figure 26: Slanted triangular learning rate schedule (learning rate vs. number of iterations).]
Figure 27: Pipeline of the different stages of the pre-training of EfficientNet over
multiple GPUs.
3.3.5 Image and Text ensemble
In order to get the final, enhanced prediction from the combination of the text and
image models, we use a simple ensemble as in [6]: a weighted combination of the class
probabilities predicted by each model.
In this work $w_1 = w_2 = 0.5$ are found to be optimal. These parameters could also be found by a
grid search subject to $\sum_{i=1}^{N} w_i = 1$, with $N$ the number of models. This procedure has shown
to be an effective solution when both models have similar accuracy, and it allows us
to avoid another training phase [7]. In Figure 28 this whole process is depicted.
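A minimal sketch of this combination step, assuming both models output per-class logits for the same batch of documents:

```python
import torch

def ensemble_predict(image_logits, text_logits, w_image=0.5, w_text=0.5):
    """Weighted average of the class probabilities of both models."""
    p_image = torch.softmax(image_logits, dim=1)
    p_text = torch.softmax(text_logits, dim=1)
    return (w_image * p_image + w_text * p_text).argmax(dim=1)

# predicted_classes = ensemble_predict(efficientnet_logits, bert_logits)
```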
[Figure 28: Pipeline of the proposed multimodal approach — a BERT model (pre-trained on Wikipedia and BookCorpus) processes the OCR text while an EfficientNet (pre-trained on ImageNet) processes the document image; their predictions are combined into the final output.]
Results on BigTobacco (accuracy in %)
Author Image
Harley et al. (2015)[29] 89.8
Csurka et al. (2016)[20] 90.7
Afzal et al. (2017)[2] 90.97
Tensmeyer et al. (2018)[63] 90.8
Das et al. (2018)[21] 92.21
Proposed work (2020) 92.31
Table 5: Time (hours) needed to train the EfficientNet models.
Model \ GPUs 1 2 3 4
B0 13.33 6.81 4.58 3.4
B1 19.44 9.81 6.81 4.94
B2 20.55 10.22 6.92 5.16
B3 25.64 12.94 8.78 6.55
B4 34.28 17.36 11.69 8.75
We show in Table 5 the time it takes to train the different networks when using 1,
2, 3 or 4 GPUs in a single node. In order to take advantage of the multiple GPUs we
use data parallelism, which consists of placing a copy of the model in each of them.
Since all GPUs share parameters, this is equivalent to having a single GPU with a larger
batch size.
[Figure 29: Speedup of the training process of EfficientNets B0–B4 when parallelizing over 1–4 GPUs, compared against the linear speedup.]
The time needed to complete the entire training process with the B0 variant is
≈61.14% lower than with B4 (on 4 GPUs). The time reduction obtained by using multiple
GPUs is clearly shown in Figure 29. For instance, EfficientNet-B0 benefits from a
≈75.4% time reduction after parallelizing over 4 GPUs.
Table 6: Previous results on SmallTobacco (accuracy in %).
SmallTobacco
BigTobacco Pre-training No Pre-training
Author Image Image Image + Text
Kumar et al. (2014)[36] 43.8
Kang et al. (2014)[38] 65.37
Afzal et al. (2015)[1] 77.6
Harley et al. (2015)[29] 79.9
Noce et al. (2016)[50] 79.8
Afzal et al. (2017)[2] 91.13
Das et al. (2018)[21] 84.5 87.8
Audebert et al. (2019)[7] 93.2
Proposed work (2020) 94.04 85.99 89.47
[Figure 30: Accuracy (%) obtained on SmallTobacco by EfficientNets B0–B4 with BigTobacco pre-training (left) and without BigTobacco pre-training (right), compared with the previous state of the art (SOTA).]
Every ensemble model achieves better accuracy than previous results, and there
is almost no difference between the results of the different EfficientNet variants.
4 Parallel and Distributed Deep Learning
Single-GPU training requires a huge amount of time, especially when dealing with
heavy architectures. For this reason, experimenting with several workers is crucial to
minimize the time spent on these tasks. In order to minimize the time
spent training these models, and taking advantage of the computational resources
of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación, we
exploit the available libraries to accomplish this goal. We focus on the PyTorch framework
and its own APIs for parallel training on several GPUs by means of data
parallelism: PyTorch's DataParallel and DistributedDataParallel.
4.2.1 PyTorch’s DataParallel library
The simplest way to perform data parallelism is to set one GPU as the master. The
master is in charge of gathering the outputs computed by each of the devices holding
a replica of the network, calculating the gradients and averaging them, and then
sending them back to each GPU, as shown in Figure 32.
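In code, this scheme only requires wrapping the model; a minimal sketch (the `nn.Linear` stand-in represents any model, e.g. an EfficientNet):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 16)                 # any nn.Module, e.g. an EfficientNet
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)         # GPU 0 acts as the master by default
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
# The training loop is unchanged: DataParallel scatters each batch across
# the GPUs and gathers the outputs back on the master device.
```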
Figure 33: PyTorch’s Dataparallel. Source [49]
We can see that data is read from disk into host memory, then passed
to the master GPU and finally scattered to every other GPU, which is an inefficient
procedure. Moreover, before each forward pass the master must send a new copy of
the model with the newly updated parameters so that all replicas stay synchronized. Finally,
there is an uneven usage of the GPUs, since the master GPU is the one computing the
loss function and the average of the gradients, thus computing more than the rest.
To help mitigate these inefficiencies, solutions based on more optimal com-
munication between GPUs have been proposed. A big step forward has been the intro-
duction of an HPC technique called Ring All-Reduce.
channel between GPUs.
The algorithm can be divided into two main steps: the scatter-reduce and the all-
gather. The scatter-reduce phase consists of dividing the data on each GPU into chunks;
each chunk is then sent to the next GPU and a reduce operation, such as an average, is
performed. In the all-gather phase every reduced result is communicated until every GPU
holds the same information. This is very useful for neural network training: during the
backward pass, gradients are calculated starting from the last layer all the way to the
first layer, and as gradients are being computed on each GPU, the already computed ones
can be scatter-reduced to the rest of the GPUs while the backward pass is still being
executed.
PyTorch's DistributedDataParallel functionality follows the all-reduce procedure,
as shown in Figure 35.
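A minimal skeleton of this setup, one process per GPU; it assumes a launcher such as torch.distributed.launch or torchrun that sets the rank-related environment variables (including LOCAL_RANK), and is a sketch rather than the exact training script used in this work:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup_ddp(model, dataset, batch_size):
    """One process per GPU; ranks and world size are provided by the launcher."""
    dist.init_process_group(backend="nccl")      # gradients synchronized via all-reduce
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.to(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)        # each GPU sees a different shard
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader
```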
Figure 35: PyTorch’s DistributedDataparallel. Source [49]
[Figure 36: Speedup of PyTorch DistributedDataParallel (DDP) versus DataParallel (DP) over 1–4 GPUs, compared against the linear speedup.]
5 Conclusion
In this work we have presented the use of EfficientNets for the Document Image
Classification task and their scaling capabilities across several GPUs. By means of
two versions of the Legacy Tobacco Industry Documents, a huge and a small dataset,
we demonstrated a training process that obtains high accuracy on both of them. We
have compared the different versions of the EfficientNets and raised the state-of-the-
art classification accuracy to 92.31% on BigTobacco and 94.04% when fine-tuning on
SmallTobacco. B0 can be considered the best choice under limited
computational resources. We have also presented an ensemble method that adds the
content extracted by OCR: a reduced version of the BERT model is trained and both
models' predictions are combined to achieve a new state-of-the-art accuracy of 89.47%.
Moreover, we have demonstrated that it is possible to distribute the training
process across several GPUs by means of data parallelism and to achieve a linear
speedup.
With this work we also provide researchers with a benchmark for the Document Image
Classification task, which can serve as a reference point to effortlessly test parallel
systems in both PyTorch and TensorFlow.
Lastly, we release as open source on GitHub the code needed to run the experi-
ments shown in this work, together with a webpage that gives a brief explanation of the
whole work and a usage guide: https://javiferran.github.io/document-classification/. As an
outcome of this thesis, a research paper has been presented at ICCS 2020 [25].
advanced preprocessing techniques over the extracted text. Another area to focus on
is the use of the latest language models developed during the realization of this thesis,
such as GPT-3 [13], or language models pre-trained on related domain
texts [28]. On the scalability side, the use of distributed libraries such as Horovod
[55] for multi-node training, with which we have already run some experiments at the
Barcelona Supercomputing Center, can be useful to further reduce training times and
speed up prototyping.
References
[1] Afzal, M.Z., Capobianco, S., Malik, M.I., Marinai, S., Breuel, T.M., Dengel, A.,
Liwicki, M.: Deepdocclassifier: Document classification with deep convolutional
neural network. In: ICDAR. p. 1273–1278 (2015)
[2] Afzal, M.Z., Kölsch, A., Ahmed, S., Liwicki, M.: Cutting the error by half: Investi-
gation of very deep cnn and advanced training strategies for document image
classification. In: ICDAR (2017)
[3] Aggarwal, C.C.: Neural Networks and Deep Learning: A Textbook. Springer
(2018)
[4] Akiba, T., Suzuki, S., Fukuda, K.: Extremely large minibatch sgd: Training
resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325 (2017)
[6] Asim, M.N., Khan, M.U.G., Malik, M.I., Razzaque, K., Dengel, A., Ahmed, S.:
Two stream deep network for document image classification. In: ICDAR (2019)
[7] Audebert, N., Herold, C., Slimani, K., Vidal, C.: Multimodal deep networks for
text and image-based document classification. In: APIA (2019)
[8] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learn-
ing to align and translate. In: ICLR Workshop Papers (2015)
[9] Baldi, S., Marinai, S., Soda, G.: Using tree-grammars for training set expansion
in page classification. In: ICDAR (2003)
[10] Ben-Nun, T., Hoefler, T.: Demystifying parallel and distributed deep learning: An
in-depth concurrency analysis. In: ACM Computing Surveys. vol. 12 (2019)
[11] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
subword information. In: Trans. Assoc. Comput. Linguist (TACL) (2017)
[13] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., an et al., A.A.: Language models are few-shot
learners. arXiv preprint arXiv:2005.14165 (2020)
[14] Campos, V., Sastre, F., Yagues, M., Torres, J., i Nieto, X.G.: Scaling a convo-
lutional neural network for classification of adjective noun pairs with tensorflow
on gpu clusters. In: CCGRID. pp. 677–682 (2017)
[15] Campos, V., Sastre, F., Yagues, M., Torres, M.B.J., i Nieto, X.G.: Distributed
training strategies for a computer vision deep learning training algorithm on a
distributed gpu cluster. In: ICCS. pp. 315–324 (2017)
[16] Chen, S., He, Y., Sun, J., Naoi, S.: Structured document classification by match-
ing local salient features. In: ICPR. pp. 1558–1561 (2012)
[17] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In:
CVPR (2017)
[20] Csurka, G., Larlus, D., Gordo, A., Almazan, J.: What is the right way to
represent document images? arXiv preprint arXiv:1603.01076 (2016)
[21] Das, A., Roy, S., Bhattacharya, U., Parui, S.K.: Document image classification
with intra-domain transfer learning and stacked generalization of deep convolu-
tional neural networks. In: ICDAR (2018)
[22] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: Imagenet: a large-scale
hierarchical image database. In: CVPR. pp. 248–255 (06 2009)
[23] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep
bidirectional transformers for language understanding. In: NAACL (2019)
[24] Diligenti, M., Frasconi, P., Gori, M.: Hidden tree markov models for docu-
ment image classification. In: Transactions on Pattern Analysis and Machine
Intelligence (TPAMI) (2003)
[25] Ferrando, J., Domı́nguez, J.L., Torres, J., Garcı́a, R., Garcı́a, D., Garrido, D.,
Cortada, J., Valero, M.: Improving accuracy and speeding up document image
classification through parallel systems. In: Krzhizhanovskaya, V.V., Závodszky,
G., Lees, M.H., Dongarra, J.J., Sloot, P.M.A., Brissos, S., Teixeira, J. (eds.)
Computational Science – ICCS 2020. pp. 387–400. Springer International Pub-
lishing, Cham (2020), https://doi.org/10.1007/978-3-030-50417-5 29
[27] Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A.,
Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet
in 1 hour. CoRR, vol. abs/1706.02677 (2017)
[28] Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey,
D., Smith, N.A.: Don’t stop pretraining: Adapt language models to domains
and tasks. In: Proceedings of ACL (2020)
[29] Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets
for document image classification and retrieval. In: Proc. ICDAR 2015. IEEE.
p. 991–995 (2015)
[30] ul Hassan, M.: Vgg16 – convolutional network for classification and detection.
https://neurohive.io (2018), https://neurohive.io/en/popular-networks/vgg16/
[31] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition
(2015)
[32] He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., Li, M.: Bag of
tricks for image classification with convolutional neural networks. arXiv preprint
arXiv:1812.01187 (2018)
[33] Hochreiter, S., Schmidhuber, J.: Long short-term memory. In: Neural computa-
tion, 9(8):1735–1780 (1997)
[34] Howard, J., Ruder, S.: Universal language model fine-tuning for text classifica-
tion. In: Association for Computational Linguistics. vol. 1, p. 328–339 (2018)
[36] Jayant, K., Peng, Y., David, D.: Structural similarity for document image clas-
sification and retrieval. In: Pattern Recognition Letters. vol. 43, pp. 119–126
(2014)
[37] Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text
classification. arXiv preprint arXiv:1607.01759 (2016)
[38] Kang, L., Kumar, J., Ye, P., Li, Y., Doermann, D.: Convolutional neural net-
works for document image classification. In: ICPR. p. 3168–3172 (2014)
[39] Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP
(2014)
[40] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep
convolutional neural networks. In: Advances in neural information processing
systems (2012)
[42] Kumar, J., Ye, P., Doermann, D.S.: Learning document structure for retrieval
and classification. In: ICPR. pp. 653–656 (2012)
[43] Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard., J.: Build-
ing a test collection for complex document information processing. In: SIGIR.
pp. 665–666 (2006)
[44] Luong, M.T., Pham, H., D.Manning, C.: Effective approaches to attention-based
neural machine translation. In: ICLR Workshop Papers (2015)
[45] McCann, B., Bradbury, J., Xiong, C., Socher, R.: Learned in translation: Con-
textualized word vectors. In: Advances in Neural Information Processing Sys-
tems. pp. 6297–6308 (2017)
[47] Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing lstm language
models. In: ICLR (2018)
[48] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word rep-
resentations in vector space. In: ICLR Workshop Papers (2013)
[49] Mohan, A.: Distributed data parallel training using pytorch on aws.
http://www.telesens.co (2019), http://www.telesens.co/2019/04/04/
distributed-data-parallel-training-using-pytorch-on-aws
[50] Noce, L., Gallo, I., Zamberletti, A., Calefati, A.: Embedded textual content
for document image classification with convolutional neural networks. In: Pro-
ceedings of the 2016 ACM Symposium on Document Engineering (DocEng ’16)
(2016)
[52] Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettle-
moyer, L.: Deep contextualized word representations. In: Proc. of NAACL (2018)
[53] Roy, S., Das, A., Bhattacharya, U.: Generalized stacking of layerwise-trained
deep convolutional neural networks for document image classification. In: 23rd
International Conference on Pattern Recognition (ICPR). p. 1273–1278 (2016)
[54] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mo-
bilenetv2: Inverted residuals and linear bottlenecks. In: CVPR. pp. 4510–4520
(2018)
[55] Sergeev, A., Balso, M.D.: Horovod: fast and easy distributed deep learning in
TensorFlow. arXiv preprint arXiv:1802.05799 (2018)
[56] Shin, C., Doermann, D.S.: Document image retrieval based on layout structural
similarity. In: IPCV. pp. 606–612 (2006)
[57] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556 (2014)
[58] Smith, R.: An overview of the tesseract ocr engine. In: International Conference
on Document Analysis and Recognition (ICDAR) (2007)
[59] Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune bert for text classification?
arXiv preprint arXiv:1905.05583 (2019)
[60] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neu-
ral networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D.,
Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems
27, pp. 3104–3112. Curran Associates, Inc. (2014), http://papers.nips.cc/paper/
5346-sequence-to-sequence-learning-with-neural-networks.pdf
[61] Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le,
Q.V.: Mnasnet: Platform-aware neural architecture search for mobile. In: CVPR
(2019)
[62] Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional
neural networks. In: International Conference on Machine Learning (2019)
[63] Tensmeyer, C., Martinez, T.: Analysis of convolutional neural networks for doc-
ument image classification. In: ICDAR (2017)
[64] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Advances in Neural
Information Processing Systems 30. p. 6000–6010 (2017)
[65] Wang, R., Su, H., Wang, C., Ji, K., Ding, J.: To tune or not to tune? how about
the best of both worlds? arXiv preprint arXiv:1907.05338 (2019)
[66] Weng, L.: Generalized language models. lilianweng.github.io/lil-log (2019), http:
//lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html
[67] Zhang, Y., Wallace, B.C.: A sensitivity analysis of (and practitioners’ guide
to) convolutional neural networks for sentence classification. arXiv preprint
arXiv:1510.03820 (2015)
[68] Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A.,
Fidler, S.: Aligning books and movies: Towards story-like visual explanations by
watching movies and reading books. In: ICCV. pp. 19–27 (2015)