A Comprehensive Introduction To Convolutional Neural Networks: A Case Study For Character Recognition


Revista de Sistemas de Informação da FSMA

n. 22 (2018) pp. 49-59 http://www.fsma.edu.br/si/sistemas.html

A Comprehensive Introduction to Convolutional Neural Networks: A Case Study for Character Recognition
Elivelto Ebermam, Federal University of Espírito Santo, Vitória - ES
Renato A. Krohling, Federal University of Espírito Santo, Vitória - ES

Abstract—Convolutional neural networks have attracted great attention for complex tasks, mainly image recognition. They were specifically designed to handle images as inputs, as they act on local receptive fields by performing a convolution process. However, understanding the working principle of convolutional neural networks may not be an easy task, especially for beginners in the area of computational intelligence. So, the aim of this work is to present convolutional neural networks in a didactic and intuitive way. A case study involving alphabet character recognition is presented in order to illustrate the feasibility of the approach.

Index Terms—Convolutional neural networks, character recognition.

I. Introduction

Artificial neural networks have been applied in several areas of knowledge, such as pattern recognition, character recognition and time series forecasting, among others [1]. Since they first appeared, they have been improved and adapted to several applications, one of which is image classification, for which the most widely used model is the convolutional neural network (CNN).

The CNN [2] has a structure specially built to receive images as inputs. It can preserve the correlations among neighboring pixels because it acts on local receptive fields, performing convolution operations. This way, we can reduce the sensitivity to image translation, rotation and distortion [3]. Other types of neural networks cannot capture this kind of relationship, because they treat images as a one-dimensional array.

Nevertheless, convolutional neural networks are more complex than other neural network architectures, which makes it more difficult for beginners in the area of computational intelligence to understand the way they work. Even though there are several papers on convolutional neural networks in the literature, just a few of them aim at a comprehensive introduction. O'Shea and Nash [4] give a brief introduction to CNNs, discussing recent papers and techniques for their development. Wu [5] discusses CNNs mathematically in a clearer way. In Brazil, as far as the authors know, there are few introductory articles on the subject. Araújo et al. [6] describe the main concepts of CNNs, besides presenting the model in a practical way.

The problem of character recognition is well known in the literature. LeCun et al. [7] used CNNs for a dataset of digits known as MNIST. For the same dataset, Mohebi and Bagirov [8] used a modified self-organizing map (SOM). Bai et al. [9] used a variation of CNN to recognize characters from different languages. Besides character recognition, CNNs can be used in several applications, such as automatic license plate recognition. Other examples of applications include document digitalization, receipt images, check processing, medical services form processing and others [10].

The goal of this work is to present convolutional neural networks in a didactic and intuitive way. For that, we make a brief introduction to the artificial neuron and the conventional neural network (called multi-layer perceptron) in Section 2, because it is used as the final stage of a convolutional network. Section 3 discusses in detail the elements of a convolutional neural network, focusing mainly on the convolution and pooling operations. Section 4 describes the extreme learning machine (ELM), which is used for comparison. In order to facilitate the understanding, we present a case study on alphabetic character recognition in Section 5. Finally, we present the conclusions in Section 6.

II. Artificial Neural Networks

An artificial neural network is a parallel and distributed information processing system [11]. It presents characteristics similar to those of the biological neural network, such as parallel processing and the ability to learn [12].

Neural networks are made up of several simple processors called artificial neurons, each one of them producing a series of real-valued activations [13]. The artificial neuron is the basic processing unit in an artificial neural network. Several models were proposed, but the most widely used one was created by McCulloch and Pitts [14].

An artificial neuron basically consists of input signals, weights, an activation function and an output signal, as shown in Figure 1. Each input $x_i$ is multiplied by a weight $w_i$ and the resulting values are summed. A bias value $b$ is added and thus an activation signal $u$ is generated (calculated by the formula $u = \sum_i x_i \cdot w_i + b$). This signal is then passed through an activation function $f(u)$, resulting in the output $y$.
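The computation performed by a single artificial neuron can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`logistic`, `relu`, `neuron_output`) and the input values are assumptions made for the example and are not taken from the paper, and the logistic sigmoid is used as one possible choice for the activation function $f(u)$.

```python
import math


def logistic(u):
    """Logistic sigmoid activation: 1 / (1 + e^(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def relu(u):
    """Rectified linear unit activation: max(0, u)."""
    return max(0.0, u)


def neuron_output(inputs, weights, bias, activation=logistic):
    """Compute u = sum_i x_i * w_i + b and return y = f(u)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    u = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(u)


# Hypothetical values chosen so that the activation signal u is exactly 0.
x = [1.0, 2.0]
w = [0.5, -0.25]
print(neuron_output(x, w, bias=0.0))                   # logistic(0)
print(neuron_output(x, w, bias=1.0, activation=relu))  # relu(1)
```

Passing the activation as a parameter keeps the weighted sum separate from the nonlinearity, which mirrors the two stages described in the text.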

Fig. 1. Artificial neuron [15]

The most common activation functions are the logistic sigmoid, the hyperbolic tangent and the rectified linear unit (ReLU). These functions are given by Equations 1, 2 and 3, respectively.

$$ f(u) = \frac{1}{1 + e^{-u}} \tag{1} $$

$$ f(u) = \frac{1 - e^{-2u}}{1 + e^{-2u}} \tag{2} $$

$$ f(u) = \max(0, u) \tag{3} $$

It is easier to understand the process through an example. Given a neuron with a logistic sigmoid activation function, inputs $[x_1 = 0, x_2 = 0.5, x_3 = 1]$, weights $[w_1 = 0.1, w_2 = 0.2, w_3 = 0.7]$ and bias $[b = -0.8]$, one generates the output 0.5, according to the following calculations:

$$
\begin{aligned}
u &= \sum_i x_i \cdot w_i + b \\
u &= x_1 \cdot w_1 + x_2 \cdot w_2 + x_3 \cdot w_3 + b \\
u &= 0 \cdot 0.1 + 0.5 \cdot 0.2 + 1 \cdot 0.7 - 0.8 \\
u &= 0 \\
y &= f(u) \\
y &= \frac{1}{1 + e^{-u}} \\
y &= \frac{1}{1 + e^{-0}} \\
y &= \frac{1}{1 + 1} \\
y &= 0.5
\end{aligned}
\tag{4}
$$

Several connected neurons form an artificial neural network, in which the output $y$ of a neuron is the input of other neurons in the next level. Neurons are commonly grouped into layers and their connections are unidirectional, with the input going "forward" (feedforward). The structure of a neural network is illustrated in Figure 2.

Fig. 2. Structure of a multi-layer artificial neural network

The first layer is called the input layer, which only receives data and transmits them to the following layer. The intermediate layers are called hidden layers and are responsible for information processing. Finally, the goal of the output layer is to present the network's response (output) to the problem at hand, which can be either a class identification or a continuous value.

III. Convolutional Neural Networks

A. Introduction

Convolutional neural networks are a kind of neural network inspired by the animal visual cortex [16]. The initial layers are specialized in extracting features from data (especially from images), and their neurons are not fully connected to the next layer. The final layers are fully connected, as shown in Figure 2, and are responsible for interpreting the information extracted by the initial layers and offering a response. Convolutional neural networks are based on the work of Hubel and Wiesel [17], who studied the visual field of cats, identifying that visual cells respond to different types of stimuli.

In 1980, Fukushima introduced a neural network called Neocognitron. This network is self-organized using unsupervised learning (which does not need a tutor) and acquires the ability to recognize patterns of stimuli based on geometric similarity [18].

This network structure is made up of multiple layers with hierarchy levels, consisting of layers of simple, complex and hyper-complex cells. These cells respond to specific types of stimuli from receptive fields. At each level, the complexity of the features extracted from the stimuli increases, in a similar way to the visual nervous system.

In 1986, there was a huge breakthrough in artificial neural networks with the work of Rumelhart et al. [19], who developed a supervised learning algorithm known as backpropagation, which gives the networks the ability to solve nonlinear problems.

In 1989, LeCun [2] used hierarchically structured neural networks with translation-invariant feature detectors, called "Multilayer Constrained Networks". In those networks, the connections were made locally but using shared weights, and only the final layers were fully connected. The training was carried out in a supervised way using a variation of the backpropagation learning algorithm.


In 1994, LeCun et al. [3] referred to the previously mentioned networks as "Multilayer Convolutional Neural Network" or simply "Convolutional Neural Network". In 1994, the term "Convolutional Neural Networks" appeared in the title of a paper by LeCun et al. [20] and in 1995, LeCun and Bengio [21] published a paper dedicated to convolutional neural networks and their applications. In 1998, LeCun et al. [7] described a model of convolutional neural network with a large number of layers (seven in total) called LeNet-5. In 2012, Krizhevsky et al. [22] developed a CNN model called AlexNet, made of 8 layers. They used the model to perform classification on a large database of millions of images called ImageNet. In order to train the network, they used graphics processing units (GPUs).

Since then, with the increase in computational power, the number of layers of convolutional neural networks has increased to dozens or even hundreds.

Fig. 3. Neuron weight sharing in the convolutional layer

B. Architecture of a convolutional neural network

Convolutional neural networks are feedforward neural networks designed to minimize sensitivity to translation, rotation and distortion of the input image [3]. They are built from local connections, shared weights, pooling and the use of several layers [23].

The network is organized into layers with different purposes. The convolutional layer is responsible for extracting image features, the pooling layer is responsible for performing a sub-sampling of the image and the fully connected layers are responsible for the interpretation of the extracted features.

C. Convolutional layer

In the convolutional layer, the units are organized into "feature maps", in which each unit is connected to a part of the previous layer by a set of weights called "filters" [23]. The processing units (or neurons) use a technique called "shared weights", which consists of several connections being defined by the same parameter (weight) [2]. This technique decreases the number of parameters of the network. In addition, this type of connection organization simulates the convolution operation, which seeks to extract some image features, such as lines and contours.

In Figure 3 we illustrate the technique of weight sharing. The goal of this example is to show how the connections between the inputs and the neurons in the convolutional layer are made (the outputs of those neurons form the feature maps). Each neuron, in grey, connects to two inputs through a weight filter $[w_1, w_2]$. Sequentially, we can imagine that at the first moment the two weights are connected to the first two inputs and generate the first unit of the feature map. At the second moment, the weights slide down and connect to the second and third inputs, forming the second unit of the feature map. Further on in this paper, we will show a case of a 2D input and a 2D filter.

The filters can also be organized in two or three dimensions. At each moment, they are connected to a specific part of a matrix called the local receptive field. In this case, filters can be understood as small squares or cuboids that slide within a matrix. They have a specific size, such as 5x5 or 5x5x3 (for inputs in three dimensions), and a movement step called "stride".

Fig. 4. Example of convolution

In Figure 4 we have an input image of size 3x3 with values between 0 (zero) and 1 (one), where 0 corresponds to the lighter pixels and 1 to the darker ones. The filter used has size 2x2, with weights [1, 1, 0, 0] and acts on the
local receptive fields (squares with red borders) of the image, which have the same size (2x2). Each filter weight is multiplied by the value of the corresponding pixel in the receptive field. These values are summed and added to the bias term, resulting in a value for the corresponding unit of the feature map. In this example, the bias is not shown in Figure 4 and, hence, its value was considered equal to zero. The size of the stride used is one, which means that at each moment the local receptive field is shifted one pixel to the right. When we get to the end of a line (as in the second moment of the image), we shift one pixel down and go back to the beginning of the line.

As shown in Figure 4, the application of a filter (2x2) to an image (3x3) results in a feature map of size 2x2. The size of the feature map is calculated according to the following equation:

$$ T_m = \frac{M - k}{s} + 1 \tag{5} $$

where $T_m$ is the size of the feature map, $M$ is the size of the image, $k$ is the size of the filter kernel and $s$ is the stride size. The variables $T_m$, $M$ and $k$ represent the size in one of the dimensions (horizontal or vertical), but it is more frequent to use square images and filters, which have equal widths and heights. In that case, the variables represent the sizes in both dimensions.

Fig. 5. Illustration of an example of zero-padding

In order to achieve a specific size for a feature map, sometimes it is necessary to change the size of the input image. This is done by adding a border called padding or zero-padding, which consists in inserting zeros around the image, as illustrated by Figure 5. This way, the resulting size of the feature maps includes the width $p$ of the border and is calculated according to the following formula:

$$ T_m = \frac{M + 2p - k}{s} + 1 \tag{6} $$

After generating the values of the feature map, it is necessary to make them go through an activation function. This is done so that the network becomes able to solve nonlinear problems (in this case, the activation function must also be nonlinear). The activation function most used with convolutional neural networks is the ReLU.

Algorithm 1 summarizes the process that is performed by the convolutional layer.

Algorithm 1 Convolution
1: Initialize the values of the weights $w$ with small values
2: For $i = 1$ to $(M - k)/s + 1$ Do
3:   For $j = 1$ to $(N - k)/s + 1$ Do
4:     $u_{i,j} = \sum_{c=1}^{k} \sum_{d=1}^{k} x_{(i-1) \cdot s + c,\; (j-1) \cdot s + d} \cdot w_{c,d} + b$
5:     $y_{i,j} = f(u_{i,j})$
6:   End-For
7: End-For

In the pseudocode, $x_{i,j}$ is the pixel value at position $(i, j)$ of image $x$, $M$ is the image height and $N$ is the image width. The variable $u_{i,j}$ stores the value of position $(i, j)$ of the feature map and $y_{i,j}$ is the output after going through the activation function. The variable $w_{c,d}$ indicates the filter weight at position $(c, d)$.

D. Pooling layer

The goal of the pooling layer is to reach a certain level of invariance towards rotation, reducing the resolution of the feature maps [24]. It acts similarly to the convolutional layer, but the stride has the same size as the filter (in the case of the previous layer, the stride was equal to 1). This makes the receptive fields non-overlapping and, for a 2x2 filter, reduces the size of the feature maps by 75%. The most common types of pooling are max pooling and average pooling. In max pooling, we select the highest value of the receptive field and in average pooling, we calculate the average of the values.

In Figure 6, the pooling layer receives as input an image in shades of green and performs the max pooling operation on it. As a result, the output is an image with half of the original height and width.

The pooling operation performed after the convolutional layer is described in Algorithm 2 (considering $s = k$).

Algorithm 2 Pooling
1: For $i = 1$ to $(M - k)/s + 1$ Do
2:   For $j = 1$ to $(N - k)/s + 1$ Do
3:     If max pooling Then
4:       $y_{i,j} = \max(x_{(i-1) \cdot k + 1,\; (j-1) \cdot k + 1}, \ldots, x_{i \cdot k,\; j \cdot k})$
5:     End-If
6:     If average pooling Then
7:       $y_{i,j} = \left( \sum_{c=1}^{k} \sum_{d=1}^{k} x_{(i-1) \cdot s + c,\; (j-1) \cdot s + d} \right) / k^2$
8:     End-If
9:   End-For
10: End-For

E. Fully connected layer

The input passes through the convolutional and pooling layers, which identify the features, from the simplest to the most complex, until it enters the fully connected layers. Usually, the previous layer is a pooling one, in which the feature maps have more than one dimension. Hence, they are reshaped into a single dimension (as a vector), so that they can be connected to the final part of the network. The fully connected layers work as a multilayer feedforward neural network and are responsible for the interpretation of the features extracted by the initial layers.
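The convolution, pooling and vectorization steps described in this section can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the 5x5 input image is made up for illustration (it is not the image of Figure 4), the filter weights [1, 1, 0, 0] are assumed to be laid out row by row as [[1, 1], [0, 0]], and indices are 0-based, so the receptive field of output unit (i, j) starts at pixel (i·s, j·s).

```python
def feature_map_size(m, k, s=1, p=0):
    """Size of one feature map dimension: (M + 2p - k) / s + 1."""
    return (m + 2 * p - k) // s + 1


def conv2d(image, kernel, bias=0.0, stride=1, activation=lambda u: max(0.0, u)):
    """Slide a k x k filter over a 2D image (no padding) and apply ReLU by default."""
    k = len(kernel)
    out_h = feature_map_size(len(image), k, stride)
    out_w = feature_map_size(len(image[0]), k, stride)
    fmap = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            u = bias
            for c in range(k):
                for d in range(k):
                    u += image[i * stride + c][j * stride + d] * kernel[c][d]
            row.append(activation(u))
        fmap.append(row)
    return fmap


def max_pool(fmap, k=2):
    """Max pooling with a k x k window and stride k (non-overlapping fields)."""
    out_h, out_w = len(fmap) // k, len(fmap[0]) // k
    return [
        [max(fmap[i * k + c][j * k + d] for c in range(k) for d in range(k))
         for j in range(out_w)]
        for i in range(out_h)
    ]


def flatten(fmap):
    """Reshape a 2D feature map into a vector for the fully connected layers."""
    return [value for row in fmap for value in row]


# Hypothetical binary image; the filter sums each pixel with its right neighbor.
image = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
]
kernel = [[1, 1], [0, 0]]

fmap = conv2d(image, kernel)   # 4x4 feature map, since (5 - 2) / 1 + 1 = 4
pooled = max_pool(fmap, k=2)   # 2x2 map after 2x2 max pooling
print(fmap)
print(pooled)
print(flatten(pooled))
```

The nested loops follow the pseudocode directly and favor readability over speed; practical implementations rely on optimized library routines instead.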


Fig. 6. Illustration of a max pooling operation

The pseudocode for the fully connected layer processing is shown in Algorithm 3.

Algorithm 3 Fully connected layer
1: If previous layer is either a pooling or convolutional one Then
2:   resize(x)
3: End-If
4: For each neuron $j$ of the fully connected layer Do
5:   $u_j = \sum_{i=1}^{n} x_i \cdot w_{i,j} + b_j$   (sum over the $n$ inputs $i$ from the previous layer)
6:   $y_j = f(u_j)$
7: End-For

F. Softmax

The final layer is responsible for presenting the results. The number of neurons is defined by the number of classes of the problem. Each neuron presents its output as a probability in the range [0, 1] and the response is the neuron with the highest output. This is performed with the softmax function, described by the following equation:

$$ y_j = \frac{e^{y(j)}}{\sum_{i=1}^{n} e^{y(i)}} \tag{7} $$

where the output $y_j$ of neuron $j$ is the exponential of its value $y(j)$ before the softmax, divided by the sum of the exponentials of the values of all $n$ neurons in this layer.

For instance, consider that the neural network outputs before the application of the softmax function are $[y(1) = 1, y(2) = 1.5, y(3) = 2]$. Hence, the outputs from the network will be calculated according to the following equations:

$$
\begin{aligned}
y_j &= \frac{e^{y(j)}}{\sum_{i=1}^{n} e^{y(i)}} \\
y_j &= \frac{e^{y(j)}}{e^{y(1)} + e^{y(2)} + e^{y(3)}} \\
y_j &= \frac{e^{y(j)}}{e^{1} + e^{1.5} + e^{2}} \\
y_j &= \frac{e^{y(j)}}{2.72 + 4.48 + 7.39} \\
y_j &= \frac{e^{y(j)}}{14.59} \\
y_1 &= \frac{e^{y(1)}}{14.59} = \frac{e^{1}}{14.59} = \frac{2.72}{14.59} = 0.186 \\
y_2 &= \frac{e^{y(2)}}{14.59} = \frac{e^{1.5}}{14.59} = \frac{4.48}{14.59} = 0.307 \\
y_3 &= \frac{e^{y(3)}}{14.59} = \frac{e^{2}}{14.59} = \frac{7.39}{14.59} = 0.507
\end{aligned}
\tag{8}
$$

So, we complete the construction of the convolutional neural network, which is illustrated in Figure 7.

Fig. 7. Convolutional neural network

G. Learning Algorithm

Usually, learning is performed using the error backpropagation method, which uses gradient descent to update the weights [24]. In order to calculate the error, it is necessary to define a specific cost function, such as the Euclidean distance between the network response and the expected value. Nevertheless, for classification with more than two classes, the cross-entropy function is used, described by the following equation:

$$ C = -\sum_{i=1} y_i \log(\hat{y}_i) \tag{9} $$

where $C$ is the cost, $y_i$ is the desired output and $\hat{y}_i$ is the neuron output.
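The softmax function of Equation 7 and the cross-entropy cost of Equation 9 can be sketched as follows. The function names are hypothetical, and two assumptions are made for the example: the maximum value is subtracted before exponentiation (a common numerical-stability step that does not change the result), and the cost takes the one-hot desired output as the weighting term and the softmax output inside the logarithm.

```python
import math


def softmax(values):
    """Turn raw output values into probabilities that sum to 1."""
    shift = max(values)  # stability trick; cancels out in the ratio
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(desired, predicted):
    """C = -sum_i desired_i * log(predicted_i), skipping zero targets."""
    return -sum(t * math.log(p) for t, p in zip(desired, predicted) if t > 0)


# Raw outputs from the worked softmax example in the text.
probs = softmax([1.0, 1.5, 2.0])
print([round(p, 4) for p in probs])

# Assumed one-hot target: the third class is the correct one.
print(cross_entropy([0, 0, 1], probs))
```

Computed at full precision, the third probability is about 0.5065; the value 0.507 in the worked example comes from rounding the intermediate exponentials to two decimals.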
The training process consists in minimizing the cost function by changing the weights. The update is performed according to the following equations:

$$ \Delta w_i^t = \eta \frac{\partial C}{\partial w_i^t} \tag{10} $$

$$ w_i^{t+1} = w_i^t - \Delta w_i^t \tag{11} $$

where $w_i^t$ is the weight $i$ of the network at the current time ($t$), $\eta$ is the learning rate and $\Delta w_i^t$ is the variation of weight $i$ at the current time.

In order to smooth the gradient descent, we can reuse part of its previous movement (which, by Equation 11, is $-\Delta w_i^{t-1}$) by incorporating the momentum term, according to the following equation:

$$ w_i^{t+1} = w_i^t - \Delta w_i^t - \alpha \Delta w_i^{t-1} \tag{12} $$

where $\alpha$ is the momentum coefficient. Applied repeatedly, the momentum term behaves as an exponentially decaying average of the previous weight variations.

In addition, gradient descent can be performed stochastically (stochastic gradient descent, SGD). This method estimates the gradients based on single examples selected randomly [24]. In practice, the gradients are calculated based on a set of samples called a mini-batch [24]. Using batches requires fewer weight updates than the incremental mode (one example at a time) and therefore reduces training time (which can be high if there are many images).

At each iteration we apply the weight updating process, which is repeated until we have gone through the whole set of samples. The process of adjusting the weights for all training samples is called an epoch.

Some other algorithms based on gradient descent are RMSProp, Adagrad, Adadelta and Adam, among many others.

H. Dropout

Sometimes, the network may learn the images in its training set too well and, at the same time, be unable to perform well on images unseen during training. In order to avoid this overfitting, we use the dropout technique. It consists in randomly removing some units from the neural network [25], as can be seen in Figure 8. We define a probability $p$ with which each unit propagates its signal through the network.

Fig. 8. Dropout. On the left, we see a network without dropout, with all its neurons active. On the right, one neuron in the second layer is randomly chosen to be "turned off", characterizing the use of dropout.

I. Development Platforms

There are several tools that implement mechanisms to help build convolutional neural networks, such as convolution, pooling, activation functions and learning algorithms.

TensorFlow [26] is a library for large-scale machine learning developed by researchers from Google. It is normally used with the Python programming language through an API. It also supports C++ and Java.

Keras [27] is a high-level API for neural network construction that works on top of other frameworks such as TensorFlow, CNTK and Theano.

Caffe [28] is an open source framework developed by the Berkeley Vision and Learning Center (BVLC). Its code was written in C++, using CUDA for GPU computation, and it also provides libraries for Python/NumPy and MATLAB.

Torch [29] is a scientific computation framework which supports several machine learning algorithms. It is used through the Lua scripting language (via LuaJIT).

These tools make the process of building a neural network simpler and faster. Besides, by using them, the user can improve the performance of the neural network training using graphics processing units (GPUs) with no additional code.

In this paper, we used the TensorFlow framework, because it supports the Python programming language and is quite intuitive. In addition, it is very flexible, i.e., it allows the users to create and modify several structures within the neural network.

IV. Extreme Learning Machine

Extreme Learning Machine (ELM) is a learning algorithm for artificial neural networks with a single hidden layer [32]. The weights between the input and the hidden layers are initialized randomly. The network training occurs in the weight matrix between the hidden and the output layer, which is calculated analytically. Since it does not require an iterative process, the network training is generally performed very quickly.

The first step of the ELM algorithm consists in defining the matrix $H$, which is the matrix of the hidden layer neuron outputs for each of the samples of the input set, according to the following formula:

$$
H = \begin{bmatrix}
f(x_1 w_1 + b_1) & \cdots & f(x_1 w_n + b_n) \\
\vdots & \ddots & \vdots \\
f(x_m w_1 + b_1) & \cdots & f(x_m w_n + b_n)
\end{bmatrix}
\tag{13}
$$

where the rows represent the samples and the columns represent the neurons in the hidden layer, $m$ is the number
of input samples and $n$ is the number of neurons in the hidden layer. In this case, $x_i$ is the attribute vector and $w_i$ is a weight vector.

The second step consists in calculating $\hat{\beta}$, the weight matrix between the hidden layer and the output layer. It is calculated as the solution of the least squared error (LSE) minimization problem for the linear system $H \hat{\beta} = T$, where $\hat{\beta} = H^{\dagger} T$, $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$ and $T$ is the desired output vector for the input samples [32], [33].

Consider the problem of the logic function AND, in which there are two types of input: 1 (presence of an electrical current) and 0 (no electrical current). The logic gate receives two inputs and returns: 0&0 = 0, 0&1 = 0, 1&0 = 0 and 1&1 = 1. Hence, $X$ and $T$ are defined as follows:

$$
X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \quad
T = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\tag{14}
$$

Consider also that the ELM has two neurons in the hidden layer with a logistic sigmoid activation function and that the weights $W$ and the bias $B$ were generated randomly as follows:

$$
W = \begin{bmatrix} 0.2 & 0.5 \\ 0.6 & 1 \end{bmatrix}, \quad
B = \begin{bmatrix} -0.5 & 0.5 \end{bmatrix}
\tag{15}
$$

So, $\hat{\beta}$ is calculated as follows:

$$
\begin{aligned}
\hat{\beta} &= H^{\dagger} T \\
\hat{\beta} &= (f(XW + B))^{\dagger} T \\
\hat{\beta} &= \begin{bmatrix}
f(x_{11} \cdot w_{11} + x_{12} \cdot w_{21} + b_1) & f(x_{11} \cdot w_{12} + x_{12} \cdot w_{22} + b_2) \\
f(x_{21} \cdot w_{11} + x_{22} \cdot w_{21} + b_1) & f(x_{21} \cdot w_{12} + x_{22} \cdot w_{22} + b_2) \\
f(x_{31} \cdot w_{11} + x_{32} \cdot w_{21} + b_1) & f(x_{31} \cdot w_{12} + x_{32} \cdot w_{22} + b_2) \\
f(x_{41} \cdot w_{11} + x_{42} \cdot w_{21} + b_1) & f(x_{41} \cdot w_{12} + x_{42} \cdot w_{22} + b_2)
\end{bmatrix}^{\dagger} T \\
\hat{\beta} &= \begin{bmatrix}
f(0 \cdot 0.2 + 0 \cdot 0.6 + (-0.5)) & f(0 \cdot 0.5 + 0 \cdot 1 + 0.5) \\
f(0 \cdot 0.2 + 1 \cdot 0.6 + (-0.5)) & f(0 \cdot 0.5 + 1 \cdot 1 + 0.5) \\
f(1 \cdot 0.2 + 0 \cdot 0.6 + (-0.5)) & f(1 \cdot 0.5 + 0 \cdot 1 + 0.5) \\
f(1 \cdot 0.2 + 1 \cdot 0.6 + (-0.5)) & f(1 \cdot 0.5 + 1 \cdot 1 + 0.5)
\end{bmatrix}^{\dagger} T \\
\hat{\beta} &= \begin{bmatrix}
f(-0.5) & f(0.5) \\
f(0.1) & f(1.5) \\
f(-0.3) & f(1) \\
f(0.3) & f(2)
\end{bmatrix}^{\dagger} T \\
\hat{\beta} &= \begin{bmatrix}
\frac{1}{1 + e^{-(-0.5)}} & \frac{1}{1 + e^{-0.5}} \\
\frac{1}{1 + e^{-0.1}} & \frac{1}{1 + e^{-1.5}} \\
\frac{1}{1 + e^{-(-0.3)}} & \frac{1}{1 + e^{-1}} \\
\frac{1}{1 + e^{-0.3}} & \frac{1}{1 + e^{-2}}
\end{bmatrix}^{\dagger} T \\
\hat{\beta} &= \begin{bmatrix}
0.38 & 0.62 \\
0.53 & 0.82 \\
0.43 & 0.73 \\
0.57 & 0.88
\end{bmatrix}^{\dagger} T \\
\hat{\beta} &= \begin{bmatrix}
-6.4520 & 6.9861 & -17.0639 & 12.2380 \\
4.3023 & -4.0280 & 10.9916 & -7.2892
\end{bmatrix} T \\
\hat{\beta} &= \begin{bmatrix}
-6.4520 & 6.9861 & -17.0639 & 12.2380 \\
4.3023 & -4.0280 & 10.9916 & -7.2892
\end{bmatrix}
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \\
\hat{\beta} &= \begin{bmatrix} 12.2380 \\ -7.2892 \end{bmatrix}
\end{aligned}
\tag{16}
$$

The calculation of the generalized inverse is a little more complex, but most programming languages have libraries to perform this calculation. Having the matrix $\hat{\beta}$, the output $y$ of the ELM for a test sample $Z$ is given as follows:

$$ y = f(ZW + B) \, \hat{\beta} \tag{17} $$

V. Case study

Even though MNIST is a standard digit recognition database widely used in the literature [30], in this work we consider the problem of recognizing alphabetic characters from several sources to illustrate the application of CNNs. It was chosen because the latter problem is very similar to the former, but the network needs to classify among 26 letters instead of only 10 digits, making the task a little more complex.

A. Database

The database 'Alphabet' was developed by students at the Labcin research lab (Nature Inspired Engineering and Computing Lab) from the graduate program in Computer Science. This database contains a total of 11,960 images of alphabetic characters in gray scale, both lowercase and capital letters, divided into 26 classes (each corresponding to a letter). Figure 9 shows some examples from this database.

Fig. 9. Samples from the Alphabet database.

The database was divided into three parts: 70% for training, 15% for validation and 15% for testing. The value of each pixel is in the interval [0,255], but was normalized to the [0,1] range.

B. Setting of the convolutional neural network

The convolutional neural network used (the source code is available from the author upon request via email, [email protected]) is presented in Figure 10. It receives as input an image of size 30x30 pixels, followed by a convolutional layer of 32 filters (in green). Each filter has 5x5 weights, with a stride of 1 pixel and
padding of 4 pixels to keep the original image size. Hence, we generate 32 feature maps of 30x30 pixels (in blue).

In each one of these feature maps we apply the max pooling operation of size 2x2 pixels, reducing both dimensions of each map by half. As a result, we have a new image of 15x15x32 pixels (in orange), i.e., an image of 32 channels.

On this image, a new convolution operation is performed with 64 filters of dimension 5x5x32, with the same values of stride and padding as the first one. As a result, we have 64 feature maps of size 15x15 pixels, and each one of them goes through the max pooling operation. After that, we have an image of size 8x8x64 pixels, which is vectorized to a single dimension of 4096 values.

Each value is the input to a first fully connected layer with 512 neurons (in grey). We use dropout with 50% probability of signal propagation. The outputs of the neurons in the first layer are connected to another layer of 512 neurons, and once again we use the same dropout.

Finally, there is a layer with 26 neurons (in pink), followed by a softmax function, which calculates the probability of each class.

For the training, the images were shuffled and we used batches of 52 images, with 2 images of each class in them. The activation function used was the Rectified Linear Unit (ReLU). The learning method was stochastic gradient descent (SGD) with a momentum of 0.9 and a learning rate of 0.01.

Fig. 10. Architecture of the convolutional neural network used in this case study

C. Simulation results

We performed 30 tests on a computer with an AMD Phenom X2 processor and 4 gigabytes of RAM. The code was developed in Python with the use of the TensorFlow API. The network achieved an average accuracy of 91.08% of correct classifications, with a standard deviation of 1.44%. The best result achieved was 93.1%. The network training took on average 1761.52 seconds, with a standard deviation of 118.09 seconds.

Table I shows the confusion matrix for the letters. The rows represent the correct letters and the columns, the network responses. A confusion matrix, or error matrix, is the most efficient way to represent the accuracy of each category, together with errors [31].

TABLE I
Confusion matrix of the database of alphabetic characters (nonzero entries of each row, listed in column order from a to z)

a: 59 2 2 1 1 1 2 1
b: 66 1 1 1
c: 1 65 1 2
d: 63 2 1 1 1 1
e: 1 64 1 1 1 1
f: 1 61 2 1 2 1 1
g: 1 2 61 2 1 1 1
h: 2 63 1 1 1 1
i: 1 53 3 9 1 1 1
j: 1 2 65 1
k: 2 1 61 1 1 1 2
l: 1 1 20 1 45 1
m: 1 1 65 2
n: 1 3 60 2 1 2
o: 1 2 64 1 1
p: 1 68
q: 2 2 64 1
r: 1 1 1 63 1 1 1
s: 1 2 1 1 64
t: 1 1 66 1
u: 2 1 1 63 2
v: 1 4 60 1 3
w: 1 1 67
x: 1 1 1 63 3
y: 1 1 1 1 2 63
z: 1 2 1 65

For most letters, the network achieved a good classification. The letters that were easiest for the network were p (98.5%), w (97.1%), b (95.65%) and t (95.65%). On the other hand, the letters that posed the greatest difficulty were l (65.21%), i (76.81%), a (85.5%), n (86.95%) and v (86.95%). The main confusions made by the network were mistaking l for i (28.98%), i for l (13.04%), v for u (5.79%), i for j (4.34%), n for h (4.34%) and v for y (4.34%).

Because we presented both lowercase and uppercase letters, the possibility of having similar letters is higher, as in the case of an "I" (uppercase i) and an "l" (lowercase L). Sometimes, a letter from a specific font is very hard to recognize and very easy to confuse in isolation, and people usually distinguish such letters based on their context.

D. Comparison with another method

The results obtained from the CNN were compared with another neural network that has been widely used, the extreme learning machine (ELM) [32], [33].

The experiments performed for the ELM were done exactly like those for the CNN. The ELM used has 900 input units (one for each pixel in the image), 3000 neurons in the hidden layer and 26 neurons in the output layer. The activation function used was the sigmoid function.
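The ELM training step can be sketched for the small AND problem of Section IV, using the same $X$, $T$, $W$ and $B$ as in that example. This is an illustrative sketch with hypothetical function names, not the code used in the experiments. It assumes exactly two hidden neurons and a matrix $H$ with full column rank, in which case the Moore-Penrose inverse reduces to $(H^T H)^{-1} H^T$ and only a 2x2 matrix has to be inverted; a network with 3000 hidden neurons would instead need a general pseudoinverse routine from a numerical library.

```python
import math


def logistic(u):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-u))


def hidden_matrix(X, W, B):
    """H[i][j] = f(sum_k X[i][k] * W[k][j] + B[j]) for every sample i and neuron j."""
    return [
        [logistic(sum(x[k] * W[k][j] for k in range(len(x))) + B[j])
         for j in range(len(B))]
        for x in X
    ]


def solve_beta_two_neurons(H, T):
    """Least-squares solution beta = (H^T H)^-1 H^T T for two hidden neurons."""
    a = sum(h[0] * h[0] for h in H)
    b = sum(h[0] * h[1] for h in H)
    c = sum(h[1] * h[1] for h in H)
    p = sum(h[0] * t for h, t in zip(H, T))
    q = sum(h[1] * t for h, t in zip(H, T))
    det = a * c - b * b
    if abs(det) < 1e-12:
        raise ValueError("H^T H is singular; a general pseudoinverse is needed")
    return [(c * p - b * q) / det, (a * q - b * p) / det]


def elm_predict(Z, W, B, beta):
    """Output y = f(ZW + B) * beta for each test sample in Z."""
    return [sum(h * bj for h, bj in zip(row, beta)) for row in hidden_matrix(Z, W, B)]


# Values from the AND example (Equations 14 and 15).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
T = [0, 0, 0, 1]
W = [[0.2, 0.5], [0.6, 1.0]]
B = [-0.5, 0.5]

beta = solve_beta_two_neurons(hidden_matrix(X, W, B), T)
print(beta)                        # approximately [12.238, -7.289]
print(elm_predict(X, W, B, beta))  # least-squares fit of T, not an exact match
```

With only two hidden neurons the fit is approximate: the largest output is produced for the input (1, 1), but the other outputs are not exactly zero.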


The ELM obtained an average classification accuracy of 77.03%, with a standard deviation of 0.75%. This network took on average 51 seconds to be trained, with a standard deviation of 5.88 seconds. Figures 11 and 12 show the box-plots for the classification accuracy and the computational training time, respectively, for both CNN and ELM.

Fig. 11. Classification accuracy for test data in the alphabetic characters database

of the type feedforward have difficulties with the training process by adding layers.

Nevertheless, the time to train a CNN is high. For the case study, it took over 30 times longer than the ELM. Even with weight reduction techniques, the number of connections is still high. Training in an iterative way is a computational burden. Some CNN models require adequate hardware to train (usually latest-generation GPUs).

VI. Final Considerations

In this paper we presented a comprehensive introduction to convolutional neural networks in an intuitive and self-explanatory way. The network mechanisms were explained in detail, with examples that illustrate the processes. In addition, we presented a case study of the application of a CNN model for the classification of manuscript alphabetic characters and compared it with another neural network, the ELM.

The results for both techniques were described using their average and standard deviation, both for classification accuracy and computational training time. The tests were performed 30 times. The classification accuracy of the CNN was a little lower than in cases in which it was applied to the MNIST database, for instance. It should be taken into consideration that the database used in this study presents some very difficult cases, such as the uppercase "I" and the lowercase "l", which are quite similar. Still, in terms of accuracy, the CNN was clearly superior to the ELM.

We believe that, by modifying the network structure and other parameters, the results might be even better. However, the goal of this work was to show the CNN working principle in a simple and direct way, and for that we applied a standard model (LeNet). So, we expect that beginners in the area of computational intelligence might be able to use this material to better understand convolutional neural networks.
As future goals, we can extend the application of the
CNN for words and license plates recognition. Another
Fig. 12. Computational time to train the neural network expansion is the application to color images and the
recognition of animals, plants and objects.
E. Evaluation of the performance of the convolutional
neural network Acknowledgements
The convolutional neural network, acting on the local E. Ebermam would like to thank CNPq for his M.Sc.
receptive fields of the image, can preserve important in- scholarship. R.A. Krohling would like to thank CNPq and
formation, such as the correlation between neighbouring FAPES (Research Support Foundation for the State of
pixels in the image. This is an advantage that it has Espirito Santo) for the financial support, given by grants
over other neural networks, such as ELM, which need to No. 309161/2015-0 and No. 039/2016, respectively. The
change the image to one dimension, losing the original authors would like to thank the students Gabriel Giori-
image structure. This way, as seen in the case of alpha- satto, André Georghton Cardoso Pacheco, Carlos Alexan-
betic characters classification, a CNN achieves a higher dre Siqueira da Silva and Igor de Oliveira Nunes for the
classification accuracy. creation and for making available the ”Alphabet”database.
The CNN structure is also effective for multiple layers. The authors would also like to thank the reviewers and
The shared weights and pooling techniques reduce the the editor in chief of this journal for their suggestions for
number of weights in the network. Other neural networks improving the quality of the manuscript.


References

[1] Z. Bingul, H. M. Ertunc, & C. Oysu. Applying neural network to inverse kinematic problem for 6R robot manipulator with offset wrist. In Adaptive and Natural Computing Algorithms, pp. 112-115. Springer, Vienna. 2005.
[2] Y. LeCun, et al., Generalization and network design strategies, Connectionism in perspective, 1989, pp. 143–155.
[3] Y. Lecun, Y. Bengio, D. Henderson, A. Weisbuch, H. Weissman, L. Jackel. On-line handwriting recognition with neural networks: Spatial representation versus temporal representation, Ecole Nationale Superieure des Telecommunications, 1993.
[4] K. O'Shea, R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458. 2015.
[5] J. Wu, Introduction to convolutional neural networks, National Key Lab for Novel Software Technology, Nanjing University, China. 2017.
[6] F. H. Araújo, A. C. Carneiro, R. R. Silva, Redes neurais convolucionais com tensorflow: Teoria e prática. III Escola Regional de Informática do Piauí. Livro Anais - Artigos e Minicursos, v. 1, n. 1, 2017, pp. 382-406.
[7] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in Proceedings of the IEEE, vol. 86, n. 11, pp. 2278–2324, nov. 1998.
[8] E. Mohebi & A. Bagirov. A convolutional recursive modified Self Organizing Map for handwritten digits recognition. Neural Networks, vol. 60, 2014, pp. 104-118.
[9] J. Bai, Z. Chen, B. Feng, B. Xu, Image character recognition using deep convolutional neural network learned from different languages, in: IEEE International Conference on Image Processing (ICIP), 2014, pp. 2560–2564.
[10] K. A. Hamad, M. Kaya, A detailed analysis of optical character recognition technology, International Journal of Applied Mathematics, Electronics and Computers, vol. 4, Special Issue-1, 2016, pp. 244-249.
[11] R. Hecht-Nielsen, et al., Theory of the backpropagation neural network, Neural Networks 1 (Supplement-1), 1988, pp. 445–448.
[12] T. Fukuda, T. Shibata, M. Tokita, T. Mitsuoka. Neural network application for robotic motion control-adaptation and learning, In IEEE International Joint Conference on Neural Networks, vol. 2, 1990, pp. 447-451.
[13] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61, 2015, pp. 85–117.
[14] W. S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics 5 (4), 1943, pp. 115–133.
[15] S. Haykin, Redes Neurais: Princípios e Práticas, Ed. Bookman, 2001.
[16] A. Solazzo, E. D. Sozzo, I. De Rose, M. D. Silvestri, G. C. Durelli and M. D. Santambrogio. Hardware design automation of convolutional neural networks. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2016, pp. 224-229.
[17] D. H. Hubel, T. N. Wiesel, Receptive fields of single neurones in the cat's striate cortex, The Journal of Physiology 148 (3), 1959, pp. 574–591.
[18] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, vol. 36, 1980, pp. 193-202.
[19] D. E. Rumelhart, G. E. Hinton & R. J. Williams, Learning internal representations by backpropagating errors. In Nature, vol. 323, 1986, pp. 533–536.
[20] Y. LeCun, Y. Bengio, Word-level training of a handwritten word recognizer based on convolutional neural networks, in: Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 2, IEEE, 1994, pp. 88–92.
[21] Y. Lecun, Y. Bengio, Convolutional networks for images, speech, and time-series, MIT Press, 1995.
[22] A. Krizhevsky, I. Sutskever, G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 2012, pp. 1097-1105.
[23] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553), 2015, pp. 436-444.
[24] J. Gu, et al. Recent advances in convolutional neural networks. arXiv preprint arXiv:1512.07108, 2015.
[25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, vol. 15, n. 1, 2014, pp. 1929–1958.
[26] M. Abadi, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[27] François Chollet. Comma.ai, 2016. URL http://keras.io/
[28] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.
[29] R. Collobert, S. Bengio, and J. Mariethoz. Torch: a modular machine learning software library. Technical Report IDIAPRR 02-46, IDIAP, 2002.
[30] Y. Tang. Deep learning using support vector machines. CoRR, abs/1306.0239, v. 2, 2013.
[31] R. G. Congalton, A review of assessing the accuracy of classifications of remotely sensed data. Remote Sensing of Environment, vol. 37, 1991, pp. 35-46.
[32] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks. In IEEE International Joint Conference on Neural Networks, vol. 2, 2004, pp. 985–990.
[33] G. B. Huang, Q. Y. Zhu, and C. K. Siew. Extreme learning machine: theory and applications. Neurocomputing, vol. 70, 2006, pp. 489–501.
