A Comprehensive Introduction To Convolutional Neural Networks: A Case Study For Character Recognition
Abstract—Convolutional neural networks have attracted great attention in the field of complex tasks, mainly in image recognition. They were specifically designed to handle images as inputs, as they act on local receptive fields performing a convolution process. However, understanding the working principle of convolutional neural networks may not be an easy task, especially for beginners in the area of computational intelligence. So, the aim of this work is to present convolutional neural networks in a didactic and intuitive way. A case study involving alphabet character recognition is presented in order to illustrate the feasibility of the approach.

Index Terms—Convolutional neural networks, character recognition.

I. Introduction

ARTIFICIAL neural networks have been applied in several areas of knowledge, such as pattern recognition, character recognition and time series forecasting, among others [1]. Since they first appeared, they have been improved, adapting to several applications, one of which is image classification, for which the most widely used model is the convolutional neural network (CNN).

The CNN [2] has a structure specially built to receive images as inputs. It can preserve the correlations among neighboring pixels because it acts on local receptive fields, performing convolution operations. This way, we can reduce the sensitivity to image translation, rotation and distortion [3]. Other types of neural networks cannot capture this kind of relationship, because they treat images as a one-dimensional array.

Nevertheless, convolutional neural networks are more complex than other neural network architectures, which makes it harder for beginners in the area of computational intelligence to understand how they work. Even though there are several papers on convolutional neural networks in the literature, just a few of them aim at a comprehensive introduction. O'Shea and Nash [4] give a brief introduction to CNNs, discussing recent papers and techniques for their development. Wu [5] discusses CNNs mathematically in a clearer way. In Brazil, as far as the authors know, there are few introductory articles on the subject. Araújo et al. [6] describe the main concepts of CNNs, besides presenting the model in a practical way.

The problem of character recognition is well known in the literature. LeCun et al. [7] used CNNs on a dataset of digits known as MNIST. For the same database, Mohebi and Bagirov [8] used a modified self-organizing map (SOM). Bai et al. [9] used a variation of the CNN to recognize characters from different languages. Besides character recognition, CNNs can be used in several applications, such as automatic license plate recognition. Other examples of applications include document digitalization, receipt images, check processing, medical services form processing and others [10].

The goal of this work is to present convolutional neural networks in a didactic and intuitive way. For that, we will give a brief introduction to the artificial neuron and the conventional neural network (called multi-layer perceptron) in section 2, because it is used as the final stage of a convolutional network. Section 3 discusses in detail the elements of a convolutional neural network, focusing mainly on the convolution and pooling operations. In order to facilitate the understanding, we present a case study on alphabetic character recognition in section 4. Finally, we present the conclusions in section 5.

II. Artificial Neural Networks

AN artificial neural network is a parallel and distributed information processing system [11]. It presents characteristics similar to the biological neural network, such as parallel processing and the ability to learn [12].

Neural networks are made up of several simple processors called artificial neurons, each one of them producing a series of real-valued activations [13]. The artificial neuron is the basic processing unit in an artificial neural network. Several models have been proposed, but the most widely used one was created by McCulloch and Pitts [14].

An artificial neuron basically consists of input signals, weights, an activation function and an output signal, as shown in Figure 1. Each input $x_i$ is multiplied by a weight $w_i$ and the resulting values are summed. A bias value $b$ is added and thus an activation signal $u$ is generated (calculated by
the formula $u = \sum_i x_i \cdot w_i + b$). This signal is then passed through an activation function $f(u)$, resulting in the output $y$.

The most common activation functions are the logistic sigmoid, the hyperbolic tangent and the rectified linear unit (ReLU). These functions are calculated by equations 1, 2 and 3, respectively.

$$f(u) = \frac{1}{1 + e^{-u}} \qquad (1)$$

$$f(u) = \frac{1 - e^{-u}}{1 + e^{-u}} \qquad (2)$$

$$f(u) = \max(0, u) \qquad (3)$$
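To make equations 1–3 concrete, the short sketch below implements the three activation functions with NumPy (the function names are ours; the paper itself presents no code):

```python
import numpy as np

def logistic(u):
    # Equation (1): logistic sigmoid, squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

def tanh_like(u):
    # Equation (2) as printed: (1 - e^-u) / (1 + e^-u), which equals tanh(u/2)
    return (1.0 - np.exp(-u)) / (1.0 + np.exp(-u))

def relu(u):
    # Equation (3): rectified linear unit, zeroes out negative activations
    return np.maximum(0.0, u)
```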
(specially from images), and the neurons are not totally
It is easier to understand the process through an exam- connected to the next layer. The final layers are totally
ple. Given a neuron with a logistic sigmoidal activation connected, as shown in Figure 2 and are responsible for
function, inputs [x1 = 0, x2 = 0.5, x3 = 1], weights interpreting the information extracted by the initial layers
[w1 = 0.1, w2 = 0.2, w3 = 0.7] and bias [b = −0.8], and offering a response. Convolutional neural networks are
one generates the output 0.5, according to the following based on the work of Hubel and Wiesel [17], who studied
calculations: the visual field of cats, identifying that visual cells respond
X to different types of stimuli.
u= xi · wi + b
In 1980, Fukushima introduced a neural network called
i
Neocognitron. This network is self organized using un-
u = x1 · w1 + x2 · w2 + x3 · w3 + b supervised learning (which does not need a tutor) and
u = 0 · 0.1 + 0.5 · 0.2 + 1 · 0.7 − 0.8 acquires the ability to recognize patterns of stimuli based
u=0 on geometric similarity [18].
y = f (u) This network structure is made up of multiple layers
(4) with hierarchy levels, being made of layers of simple,
1
y= complex and hyper-complex cells. These cells respond to
1 + e−u
1 specific types of stimuli from receptive fields. At each level,
y= the complexity of the features extracted from the stimuli
1 + e−0
1 increases, in a similar way to the visual nervous system.
y= In 1986, there was a huge breakthrough in artificial
1+1
neural networks, with the work of Rumelhart et al. [19],
y = 0.5
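The same calculation in code (a minimal sketch reproducing the worked example above):

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0])    # inputs
w = np.array([0.1, 0.2, 0.7])    # weights
b = -0.8                         # bias

u = np.dot(x, w) + b             # activation signal: sum of x_i * w_i, plus b
y = 1.0 / (1.0 + np.exp(-u))     # logistic sigmoid, equation (1)
print(u, y)                      # 0.0 0.5, as in equation (4)
```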
Several connected neurons form an artificial neural network, where the output $y$ of one neuron is the input of other neurons in the next layer. It is common for them to be grouped into layers whose connections are unidirectional, with the input going "forward" (feedforward). The structure of a neural network is illustrated in Figure 2.

The first layer is called the input layer, which only receives data and transmits it to the following layer. The intermediate layers are called hidden layers, being responsible for information processing. Finally, there is the output layer, whose goal is to present the network's response (output) to the problem at hand, which can be either a class identification or a continuous value.

III. Convolutional Neural Networks

A. Introduction

Convolutional neural networks are a kind of neural network inspired by the animal visual cortex [16]. The initial layers are specialized in extracting features from the data (especially from images), and their neurons are not totally connected to the next layer. The final layers are totally connected, as shown in Figure 2, and are responsible for interpreting the information extracted by the initial layers and offering a response. Convolutional neural networks are based on the work of Hubel and Wiesel [17], who studied the visual field of cats and identified that visual cells respond to different types of stimuli.

In 1980, Fukushima introduced a neural network called the Neocognitron. This network is self-organized using unsupervised learning (which does not need a tutor) and acquires the ability to recognize patterns of stimuli based on geometric similarity [18]. Its structure is made up of multiple layers with hierarchy levels, composed of layers of simple, complex and hyper-complex cells. These cells respond to specific types of stimuli from receptive fields. At each level, the complexity of the features extracted from the stimuli increases, in a similar way to the visual nervous system.

In 1986, there was a huge breakthrough in artificial neural networks with the work of Rumelhart et al. [19], who developed a supervised learning algorithm known as backpropagation, which provides networks with the ability to solve nonlinear problems.

In 1989, LeCun [2] used hierarchically structured neural networks with translation-invariant detectors, called "Multilayer Constrained Networks". In those networks, the connections were made locally but with shared weights, and only the final layers were fully connected. The training was carried out in a supervised way using a variation of the backpropagation learning algorithm.
C. Convolutional layer

In the convolutional layer, the units are organized into "feature maps", in which each unit is connected to a part of the previous layer by a set of weights called "filters" [23]. The processing units (or neurons) use a technique called "shared weights", which consists in several connections being defined by the same parameter (weight) [2]. This technique decreases the number of parameters of the network. In addition, this type of connection organization simulates the convolution operation, which seeks to extract image features such as lines and contours.

In Figure 3 we illustrate the technique of weight sharing. The goal of this example is to show how the connections between the inputs and the neurons in the convolutional layer are made (the outputs of those neurons form the feature maps). Each neuron, in grey, connects to two inputs through a weight filter $[w_1, w_2]$. In a sequential way, we can imagine that at the first moment the two weights are connected to the first two inputs and generate the first unit of the feature map. At the second moment, the weights slide down and connect to the second and third inputs, forming the second unit of the feature map. Further on in this paper, we will show a case of a 2D input and a 2D filter.

The filters can also be organized in two or three dimensions. At each moment, they are connected to a specific local receptive field of the input.

Fig. 4. Example of convolution

In Figure 4 we have an input image of size 3x3 with values between 0 (zero) and 1 (one), where 0 corresponds to the lighter pixels and 1 to the darker ones. The filter used has size 2x2, with weights [1, 1, 0, 0], and acts on the
local receptive fields (squares with red borders) of the image, of the same size (2x2). Each filter weight is multiplied by the value of the corresponding pixel in the receptive field. These values are summed and added to the bias term, resulting in a value for the corresponding unit of the feature map. In this example, the bias is not shown in Figure 4 and, hence, its value was considered equal to zero. The stride used is one, which means that at each moment the local receptive field is displaced one pixel to the right. When we get to the end of a line (as in the second moment of the image), we shift one pixel down and go back to the beginning of the line.

As shown in Figure 4, the application of a 2x2 filter to a 3x3 image results in a feature map of size 2x2. The size of the feature map is calculated according to the following equation:

$$T_m = \frac{M - k}{s} + 1 \qquad (5)$$

where $T_m$ is the size of the feature map, $M$ is the size of the image, $k$ is the size of the filter kernel and $s$ is the stride size. The variables $T_m$, $M$ and $k$ represent the size in one of the dimensions (horizontal or vertical), but it is more frequent to use square images and filters, which have equal widths and heights. In that case, the variables represent the sizes in both dimensions.

Fig. 5. Illustration of an example of zero-padding

In order to achieve a specific size for a feature map, sometimes it is necessary to change the size of the input image. This is done by adding a border called padding or zero-padding, which consists in inserting zeros around the image, as illustrated in Figure 5. This way, the resulting size of the feature maps includes the width $p$ of the border, which is calculated according to the following formula:

$$T_m = \frac{M + 2p - k}{s} + 1 \qquad (6)$$
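As a quick illustration of zero-padding (a sketch; the paper itself does not prescribe any implementation), NumPy's `np.pad` inserts the border of zeros directly:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])
x_pad = np.pad(x, pad_width=1)   # border of zeros of width p = 1
print(x_pad)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
# With M = 2, k = 2, p = 1 and s = 1, equation (6) gives T_m = 3.
```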
After generating the values of the feature map, it is necessary to make them go through an activation function. This is done so that the network becomes able to solve nonlinear problems (in this case, the activation function must also be nonlinear). The activation function most used with convolutional neural networks is the ReLU.

Algorithm 1 summarizes the process that is performed by the convolutional layer.

Algorithm 1 Convolution
1: Initialize the values of the weights w with small values
2: For i = 1 to (M − k)/s + 1 Do
3:   For j = 1 to (N − k)/s + 1 Do
4:     u_{i,j} = Σ_{c=1}^{k} Σ_{d=1}^{k} x_{(i−1)·s+c, (j−1)·s+d} · w_{c,d} + b
5:     y_{i,j} = f(u_{i,j})
6:   End-for
7: End-for

In the pseudocode, $x_{i,j}$ is the pixel value at position $(i, j)$ of image $x$, $M$ is the image height and $N$ is the image width. The variable $u_{i,j}$ stores the value of position $(i, j)$ of the feature map and $y_{i,j}$ is the output after going through the activation function. The variable $w_{c,d}$ indicates the filter weight at position $(c, d)$.
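Below is a minimal NumPy sketch of Algorithm 1 for a single filter (the pixel values are illustrative, since the exact values of Figure 4 are only shown graphically):

```python
import numpy as np

def convolve(x, w, b=0.0, s=1):
    # Algorithm 1: slide a k x k filter over an M x N image with stride s,
    # then apply the ReLU activation of equation (3).
    M, N = x.shape
    k = w.shape[0]
    Tm, Tn = (M - k) // s + 1, (N - k) // s + 1     # equation (5)
    y = np.zeros((Tm, Tn))
    for i in range(Tm):
        for j in range(Tn):
            field = x[i*s:i*s + k, j*s:j*s + k]     # local receptive field
            y[i, j] = max(0.0, np.sum(field * w) + b)
    return y

x = np.array([[0.0, 1.0, 0.0],      # illustrative 3x3 image, values in [0, 1]
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
w = np.array([[1.0, 1.0],           # the 2x2 filter with weights [1, 1, 0, 0]
              [0.0, 0.0]])
print(convolve(x, w))               # a 2x2 feature map, as predicted by (5)
```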
D. Pooling layer

The goal of the pooling layer is to reach a certain level of invariance towards rotation, reducing the resolution of the feature maps [24]. It acts similarly to the convolutional layer, but the stride has the same size as the filter (in the example of the previous layer, the stride was equal to 1). This makes the receptive fields completely disjoint and, for k = 2, reduces the feature maps by 75%. The most common types of pooling are max pooling and average pooling. In max pooling, we select the highest value of the receptive field, and in average pooling we calculate the average of its values.

In Figure 6, the pooling layer receives as input an image in shades of green and performs the max pooling operation on it. As a result, the output is an image with half of the original height and width.

The pooling operation performed after the convolutional layer is described in Algorithm 2 (considering s = k).

Algorithm 2 Pooling
1: For i = 1 up to (M − k)/s + 1 Do
2:   For j = 1 up to (N − k)/s + 1 Do
3:     If max pooling Then
4:       y_{i,j} = max(x_{(i−1)·k+1, (j−1)·k+1}, ..., x_{i·k, j·k})
5:     End-If
6:     If average pooling Then
7:       y_{i,j} = (Σ_{c=1}^{k} Σ_{d=1}^{k} x_{(i−1)·s+c, (j−1)·s+d}) / k²
8:     End-If
9:   End-For
10: End-For
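A minimal sketch of the max pooling branch of Algorithm 2 (the 4x4 input below is our own illustrative example):

```python
import numpy as np

def max_pool(x, k=2):
    # Algorithm 2 with stride s = k: each k x k receptive field collapses
    # to its largest value, so a 4x4 input becomes a 2x2 output.
    M, N = x.shape
    y = np.zeros((M // k, N // k))
    for i in range(M // k):
        for j in range(N // k):
            y[i, j] = x[i*k:(i+1)*k, j*k:(j+1)*k].max()
    return y

x = np.array([[0.1, 0.4, 0.8, 0.2],
              [0.5, 0.3, 0.6, 0.9],
              [0.7, 0.2, 0.1, 0.0],
              [0.6, 0.8, 0.3, 0.4]])
print(max_pool(x))   # [[0.5 0.9]
                     #  [0.8 0.4]]
```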
E. Fully connected layer

The input passes through the convolutional and pooling layers, which identify the features, from the simplest to the most complex, until it reaches the fully connected layers. Usually, the previous layer is a pooling one, in which the feature maps have more than one dimension. Hence, they are reshaped into a single dimension (a vector), so that they can be connected to the final part of the network. The fully connected layers work as a multilayer feedforward neural network and are responsible for the interpretation of the features extracted by the initial layers.
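In NumPy terms, this flattening step is just a reshape (a sketch with assumed dimensions):

```python
import numpy as np

maps = np.random.rand(32, 15, 15)   # e.g., 32 pooled feature maps of 15x15
vec = maps.reshape(-1)              # flattened into one vector for the
print(vec.shape)                    # fully connected layers: (7200,)
```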
In the output layer, the softmax function is commonly used, normalizing the outputs so that they can be interpreted as probabilities:

$$y_j = \frac{e^{y(j)}}{\sum_{i=1}^{n} e^{y(i)}} \qquad (7)$$

For outputs $y_1 = 1$, $y_2 = 1.5$ and $y_3 = 2$, the calculation proceeds as follows:

$$y_j = \frac{e^{y_j}}{e^{y_1} + e^{y_2} + e^{y_3}} = \frac{e^{y_j}}{e^{1} + e^{1.5} + e^{2}} = \frac{e^{y_j}}{2.72 + 4.48 + 7.39} = \frac{e^{y_j}}{14.59} \qquad (8)$$

$$y_1 = \frac{e^{y_1}}{14.59} = \frac{e^{1}}{14.59} = \frac{2.72}{14.59} = 0.186$$

$$y_2 = \frac{e^{y_2}}{14.59} = \frac{e^{1.5}}{14.59} = \frac{4.48}{14.59} = 0.307$$

$$y_3 = \frac{e^{y_3}}{14.59} = \frac{e^{2}}{14.59} = \frac{7.39}{14.59} = 0.507$$
So, we complete the construction of the convolutional neural network, which is illustrated in Figure 7.
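The softmax of equation (7) in code, reproducing the values above:

```python
import numpy as np

def softmax(y):
    # Equation (7): exponentiate and normalize, so the outputs sum to 1
    e = np.exp(y)
    return e / e.sum()

print(softmax(np.array([1.0, 1.5, 2.0])))
# [0.186 0.307 0.507], matching the worked example (8)
```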
of input samples and $n$ is the number of neurons in the hidden layer. In this case, $x_i$ is the attribute vector and $w_i$ is a weight vector.

The second step consists in calculating $\hat{\beta}$, the weight matrix between the hidden layer and the output layer. It is calculated as the solution of the least squared error (LSE) minimization problem for the linear system $H\hat{\beta} = T$, whose solution is $\hat{\beta} = H^{\dagger}T$, where $H^{\dagger}$ is the generalized Moore-Penrose inverse of the matrix $H$ and $T$ is the desired output vector for the input samples [32], [33].

Consider the problem of the logic function AND, in which there are two types of input: 1 (presence of an electrical current) and 0 (absence of electrical current). The logic gate receives two inputs and returns: 0&0 = 0, 0&1 = 0, 1&0 = 0 and 1&1 = 1. Hence, $X$ and $T$ are defined as follows:

$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \quad T = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \qquad (14)$$

Consider also that the ELM has two neurons in the hidden layer with a logistic sigmoid activation function and that the weights $W$ and the bias $B$ were generated randomly as follows:

$$W = \begin{bmatrix} 0.2 & 0.5 \\ 0.6 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} -0.5 & 0.5 \end{bmatrix} \qquad (15)$$
So, $\hat{\beta}$ is calculated as follows:

$$\hat{\beta} = H^{\dagger}T = \left(f(XW + B)\right)^{\dagger}T$$

$$\hat{\beta} = \begin{bmatrix} f(x_{11}w_{11} + x_{12}w_{21} + b_1) & f(x_{11}w_{12} + x_{12}w_{22} + b_2) \\ f(x_{21}w_{11} + x_{22}w_{21} + b_1) & f(x_{21}w_{12} + x_{22}w_{22} + b_2) \\ f(x_{31}w_{11} + x_{32}w_{21} + b_1) & f(x_{31}w_{12} + x_{32}w_{22} + b_2) \\ f(x_{41}w_{11} + x_{42}w_{21} + b_1) & f(x_{41}w_{12} + x_{42}w_{22} + b_2) \end{bmatrix}^{\dagger} T$$

$$\hat{\beta} = \begin{bmatrix} -6.4520 & 6.9861 & -17.0639 & 12.2380 \\ 4.3023 & -4.0280 & 10.9916 & -7.2892 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 12.2380 \\ -7.2892 \end{bmatrix} \qquad (16)$$
The calculation of the generalized inverse is a little more complex, but most programming languages have libraries to perform this calculation. Having the matrix $\hat{\beta}$, the output $y$ of the ELM for a test sample $Z$ is given as follows:

$$y = f(ZW + B)\,\hat{\beta} \qquad (17)$$
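The whole calculation fits in a few lines of NumPy; `np.linalg.pinv` is one such library routine for the Moore-Penrose inverse (a sketch reproducing equations (14)–(17)):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # equation (14)
T = np.array([[0], [0], [0], [1]], dtype=float)

W = np.array([[0.2, 0.5],                                    # equation (15)
              [0.6, 1.0]])
B = np.array([-0.5, 0.5])

H = sigmoid(X @ W + B)           # hidden-layer output matrix
beta = np.linalg.pinv(H) @ T     # Moore-Penrose solution of H beta = T
print(beta.ravel())              # approx. [12.238, -7.289], equation (16)

y = sigmoid(X @ W + B) @ beta    # equation (17), using the training set as Z
print(y.ravel())
```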
V. Case study

Even though MNIST is a standard digit recognition database widely used in the literature [30], in this work we will consider the problem of recognizing alphabetic characters from several sources to illustrate the application of CNNs. It was chosen due to the fact that the latter problem is very similar to the former, but the network needs to classify among 26 letters instead of only 10 digits, making the task a little more complex.

A. Database

The database 'Alphabet' was developed by students at the Labcin research lab (Nature Inspired Engineering and Computing Lab) from the graduate program in Computer Science. This database contains a total of 11,960 gray-scale images of alphabetic characters, both lowercase and capital letters, divided into 26 classes (each corresponding to a letter). Figure 9 shows some examples from this database.
Fig. 10. Architecture of the convolutional neural network used in this case study
padding of 4 pixels to keep the original image size. Hence, we generate 32 feature maps of 30x30 pixels (in blue). In each one of these feature maps we apply the max pooling operation of size 2x2 pixels, reducing both dimensions of each one of the maps by half. As a result, we have a new image of 15x15x32 pixels (in orange), i.e., an image with 32 channels.

TABLE I. Confusion matrix of the database of alphabetic characters (rows: true classes a–z; columns: predicted classes a–z). [Table values not reproduced.]
The ELM obtained an average classification accuracy of 77.03% with a standard deviation of 0.75%. This network took on average 51 seconds to be trained, with a standard deviation of 5.88 seconds. Figures 11 and 12 show the box-plots of the classification accuracy and of the computational training time, respectively, for both the CNN and the ELM.

of the feedforward type have difficulties with the training process when layers are added.

Nevertheless, the time to train a CNN is high. For the case study, it took 30 times longer than the ELM. Even with weight reduction techniques, the number of connections is still high. Training in an iterative way is a computational burden. Some CNN models require adequate hardware to train (usually GPUs of the latest generation).