Parallelization of A Neural Network Algorithm For Use in Handwriting Recognition
Abstract—This paper examines the problem of parallelizing neural network training using the back-propagation neural network, a breakthrough example in the field of deep learning. The challenge of our solution is to restructure the algorithm in such a way that it can be executed in parallel, rather than sequentially. In this paper, we test the validity of the research hypothesis that training speed can be increased by parallelizing the back-propagation algorithm while keeping the same accuracy. For this purpose, we developed a use-case of a handwriting recognition algorithm and conducted several experiments to test the performance, both in execution speed and accuracy. At the end, we examine just how much it benefits to write a parallel program for a neural network, with regard to the time it takes to train the neural network and the accuracy of its predictions. Our handwriting problem is one of classification, and in order to implement any sort of solution, we must have data. The MNIST dataset of handwritten digits provides the necessary data to solve the problem.

Index Terms—message passing interface, neural network, handwriting recognition, multilayer perceptron, parallel processing, distributed processing

I. INTRODUCTION

We live in a world where everyone has a smartphone at their disposal and can write anything that comes to mind on it. Even so, that world is still, and will continue to be, filled with handwritten notes, texts and descriptions. The prime example is schools, where students write their notes in paper notebooks rather than on their smartphones. Some students are naturally good at organizing their handwritten notes, but some prefer them to be in a digital format so that they can store them in a cloud, manipulate them, etc. Instead of having to rewrite all of their notes, with an app for handwriting recognition one could just scan or take a picture of a page of a notebook and feed it to a piece of software that would process the image and give the written text as output. Said output could then be processed and stored as per the user's desire.

Using machine learning (ML), researchers have actively been developing new and refining old ways of recognising handwritten text. One of the biggest breakthroughs in the field of ML, if not the biggest, is the multilayer perceptron algorithm, more commonly called a neural network (NN).

Of course, computer scientists would like to speed up and optimize the process, if even just a little bit. The idea for optimizing the training process of the NN is to run it in parallel on multiple CPU cores, or even (if the hardware is present) on multiple workstations. Naturally, we also have to check the accuracy of the neural network trained in parallel, and how it compares to the accuracy of the sequentially trained NN.

The hypothesis we make is loosely based on the No-Free-Lunch Theorem, in the sense that there cannot be a one-size-fits-all solution to all problems. More concretely, the assumption is that there will have to be a trade-off between execution speed and prediction accuracy. This hypothesis is validated by measuring the time the algorithm takes to finish training the NN sequentially, in contrast to the parallel execution. After the different training sessions, the prediction accuracy is checked on the test set. Our expectation is that it will be slightly lower for the parallel implementation, due to the fact that the algorithm will not converge in the conventional sense as it would if it ran sequentially.

There have been multiple approaches to parallelizing a NN. Pethick et al. [1] explain four approaches in order of level of granularity, from training session parallelism to weight parallelism. Training session parallelism involves no communication between the processes, and therefore has zero overhead. Weight parallelism is the finest-grained solution: with it, the input from each synapse is calculated in parallel for each node, and the net input is summed via some suitable communication scheme. Our solution is based on the exemplar parallelism approach, which incurs little communication overhead and is very well suited to our experimental environment.

In Section II, we start by examining some related scientific papers regarding the problems of NN, handwriting recognition, and parallelization, and how they relate to this paper. A basic overview of our solution is then presented in Section III, first by going over the standard NN structure, followed by a bird's eye view of the parallel architecture. The implementation is discussed in Section IV, which provides an explanation of a standard sequential implementation of the NN classification problem, a parallel implementation, and details on the specific environment used to run the tests. In Section V we compare the results generated by the different implementations, and in Section VI we assess the impact and worth of our solution.
II. RELATED WORK

The invention of neural networks has been one of the biggest breakthroughs in AI and ML. Hecht-Nielsen [2] presents a survey of the basic theory of the back-propagation NN architecture, covering architectural design, performance measurement, function approximation capability, and learning. One of the most useful applications of neural networks is in pattern recognition. Pao [3] has elaborated the nature of the pattern-recognition task and adaptive pattern recognition (and its applications) as one of the most useful applications of AI.

One application in which pattern recognition is right at home is the recognition of handwritten text. Graves and Schmidhuber [4] combine two recent innovations in NN, multidimensional recurrent NN and connectionist temporal classification [5]. They introduce a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet-specific pre-processing, and can therefore be used unchanged for any language.

AI requires huge computational power for processing. Seeing as microprocessor manufacturers have struggled to increase the raw computational power of CPUs, the relevance of Moore's Law has slowly been fading. This in turn has increased the need for engineers to think of new ways to increase the amount of computation that is possible in a finite period of time. This is where parallel computing comes in, with [6] and [7] introducing new paradigms for parallel algorithm design, performance analysis and program construction. Akl [8] surveys existing parallel algorithms, with emphasis on design methods and complexity results. Parallelizable optimization techniques are applied to the problem of pattern recognition and learning in feed-forward neural networks by Kramer [9].

III. SOLUTION OVERVIEW

A. Neural network structure

Our solution is designed using the standard multilayer perceptron NN, also known as a backpropagation NN, shown in Fig. 1. The NN consists of an input layer of nodes (perceptrons), an output layer of nodes, and at least one (usually more) intermediate layer. To keep things simple, we will train NNs with only one hidden layer. The network is trained on a data set consisting of (x_i, y_i) pairs, where x_i is a feature vector used as an input and y_i is the true output, or ground truth, for that feature vector.

Fig. 1. Backpropagation NN structure

A node in the ith layer of the NN is connected to all nodes in the (i + 1)th layer. The connectors between a node in one layer and the nodes in the next layer have assigned weights w_ij. We denote by x_ij the ith input to node j. A node in one layer takes all of the products x_ij * w_ij that come as inputs from the previous layer, computes their sum z_j, and uses it to compute the output (which serves as the input to the next layer).
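As a concrete illustration of the per-node computation described above, the sketch below shows a forward pass through a one-hidden-layer MLP in NumPy. The array names, the logistic activation, and the random weights are illustrative assumptions of ours, not part of the implementation discussed in Section IV.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation applied element-wise to the weighted sums z_j.
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, W_hidden, b_hidden, W_out, b_out):
    """Compute the output of a one-hidden-layer MLP for one input vector x.

    Each hidden node j forms the sum z_j = sum_i x_ij * w_ij plus its bias,
    then applies the activation; the output layer repeats the same step.
    """
    z_hidden = x @ W_hidden + b_hidden   # weighted sums for the hidden layer
    a_hidden = sigmoid(z_hidden)         # hidden layer outputs
    z_out = a_hidden @ W_out + b_out     # weighted sums for the output layer
    return sigmoid(z_out)                # network output, one value per class

# Illustrative shapes with random weights: 784 inputs (28x28 pixels),
# 50 hidden nodes, 10 output classes. Not the trained model from Section IV.
rng = np.random.default_rng(0)
x = rng.random(784)
W_hidden, b_hidden = rng.standard_normal((784, 50)), np.zeros(50)
W_out, b_out = rng.standard_normal((50, 10)), np.zeros(10)
print(forward_pass(x, W_hidden, b_hidden, W_out, b_out).shape)  # (10,)
```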
The NN learns by comparing the computed output with the true output, and adjusting the weights of the nodes accordingly so the computed output is the same as the desired one (if the computed output is already the true output, no adjustment is done). The changes to the weights propagate back through the network (hence the name) up until they reach the input layer. The algorithm stops after a full epoch without change in weights (sometimes we stop the algorithm prematurely to combat the problem of overfitting to the training set).

Sometimes the outputs of the problem are not uniformly distributed among the population of the input/output pairs. For this reason we need a BIAS node with its own weight assigned to it, at each non-output layer, that has no inputs going into it. If the population is biased towards a certain output class, the BIAS allows us to skew the evaluation function so that we can better predict the true output.
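The weight adjustment at the core of back-propagation can be sketched for the output layer as a single gradient-descent step. The snippet below assumes a squared-error loss and logistic outputs; both the loss choice and the function name are our own illustrative assumptions, since the paper does not fix a particular update rule.

```python
import numpy as np

def update_output_weights(a_hidden, output, target, W_out, learning_rate=0.1):
    """One gradient-descent step for the hidden-to-output weights.

    Assumes a squared-error loss and logistic outputs, so the error signal for
    output node j is (output_j - target_j) * output_j * (1 - output_j), and the
    weight w_ij from hidden node i to output node j moves against the gradient.
    """
    delta = (output - target) * output * (1.0 - output)  # error signal per output node
    W_out -= learning_rate * np.outer(a_hidden, delta)   # w_ij <- w_ij - eta * x_ij * delta_j
    return W_out

# Toy shapes only: 50 hidden activations, 10 output nodes, one-hot target for class 3.
rng = np.random.default_rng(1)
a_h, out, tgt = rng.random(50), rng.random(10), np.eye(10)[3]
W = update_output_weights(a_h, out, tgt, rng.standard_normal((50, 10)))
```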
B. Parallelization

There are multiple strategies to implement a parallel solution for the backpropagation NN. Some of the main strategies are discussed by Pethick, Liddle, Huang and Werstein [1]. We will go with the "preferred technique" according to Rogers and Skillicorn [10], which is exemplar parallelism, also known as training example parallelism. In essence, the training set is split into disjoint subsets and each running process (a thread, microprocessor or separate machine) works on only one subset. The processes need to start with identical initial states, meaning the weights associated with each node need to be the same; usually this means that every weight is initially set to 1. At the end of the training, the changes are combined and applied together to the neural network by averaging them out. The diagram for this parallel solution approach is given in Fig. 2.

Two advantages come to mind when we think about this approach to parallelizing a NN. The first is the small overhead incurred by the communication between processes: the processes only communicate at the beginning, to distribute the data, and at the end of the training, to average out the weights, so relatively few messages are generated. The second advantage is the speed-up during the training phase. Since the data set is split into n smaller subsets s_i, where i = 1, 2, ..., n, the time it takes to go through an epoch is the maximum of the times t_i it takes to iterate through a subset s_i, or t_epoch = max(t_i). A disadvantage of this approach is that it does not provide a performance increase at the layer level, meaning that no two nodes in the same process, working on the same subset of data, work in parallel.

Fig. 2. The parallelization method. The dataset is split equally among the processes, each process trains its own neural network, and at the end all the processes average out their weight vectors.
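The averaging step at the heart of this scheme is simple enough to sketch directly. The helper below is a minimal illustration, assuming each process returns its weight matrices as a list of NumPy arrays of identical shapes; the function name and data layout are ours, not taken from the implementation in Section IV.

```python
import numpy as np

def average_weights(per_process_weights):
    """Combine the weight matrices trained by each process.

    per_process_weights has one entry per process; each entry is itself a list
    of weight matrices (one per layer). The combined network uses the
    element-wise mean of the corresponding matrices.
    """
    n_layers = len(per_process_weights[0])
    return [np.mean([w[layer] for w in per_process_weights], axis=0)
            for layer in range(n_layers)]

# Toy example with two processes and a single 2x2 weight matrix each.
w_a = [np.array([[1.0, 2.0], [3.0, 4.0]])]
w_b = [np.array([[3.0, 2.0], [1.0, 0.0]])]
print(average_weights([w_a, w_b]))  # [array([[2., 2.], [2., 2.]])]
```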
IV. EXPERIMENTAL METHODS

For the purposes of this research, we will conduct two experiments that correspond to two implementations of the handwriting recognition back-propagation NN. The first is a classical sequential solution and the other is a parallel implementation. For both experiments, we will use Python with its scikit-learn library, which includes the tool needed for training NNs, MLPClassifier.

For the sake of simplicity, the dataset we will be using to train the NN is the MNIST dataset [11] of handwritten digits, which includes 60,000 labeled 28x28 images of handwritten digits in its training set and 10,000 images in its testing set. Only the first 6,000 data points will be used for training in our experiments. The implementation can easily be extended to use the EMNIST dataset [12], which also includes images of handwritten letters.

A. Sequential implementation

The sequential implementation is executed conventionally, starting by reading the training and test data from CSV files in the format (y, x_1, ..., x_784), where y is the digit that corresponds to the image represented by the pixels x_i, and each pixel value x_i ranges from 0 to 255, representing the amount of black in the pixel, 0 being all white and 255 being all black.

The dataset is split into a training set and a test set. The training subset will be used to train a variety of classifiers, each having a different number of neurons in its hidden layer, and each being trained for a different number of epochs. The number of neurons in the hidden layer and the number of epochs will both vary in the interval from 50 to 100. The test set will be used to determine which of the classifiers has the best accuracy to training time ratio.

The goal is to find a NN model that is easy and fast to train but makes the most accurate predictions possible. Note that sometimes the most accurate classifier is not the best choice, because it may take longer to train than it is worth. For this reason, the classifier used in the parallel implementation will be the one with the best accuracy to training time ratio.
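To make the sequential setup concrete, the sketch below shows one way the CSV loading and the sweep over hidden layer sizes and epoch counts could look with scikit-learn's MLPClassifier. The file names, the step size of the sweep, and the use of pandas are illustrative assumptions rather than a description of the exact scripts used in our experiments.

```python
import time
import pandas as pd
from sklearn.neural_network import MLPClassifier

# Hypothetical file names; each CSV row is (y, x_1, ..., x_784).
train = pd.read_csv("mnist_train.csv", header=None).values[:6000]
test = pd.read_csv("mnist_test.csv", header=None).values
X_train, y_train = train[:, 1:] / 255.0, train[:, 0]  # scale pixels to [0, 1]
X_test, y_test = test[:, 1:] / 255.0, test[:, 0]

results = []
for hidden in range(50, 101, 10):       # hidden layer sizes 50..100 (assumed step of 10)
    for epochs in range(50, 101, 10):   # epoch counts 50..100 (assumed step of 10)
        # For the stochastic solvers, max_iter acts as the number of epochs.
        clf = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=epochs)
        start = time.time()
        clf.fit(X_train, y_train)
        elapsed = time.time() - start
        accuracy = clf.score(X_test, y_test)
        results.append((hidden, epochs, elapsed, accuracy))
        print(f"hidden={hidden} epochs={epochs} "
              f"time={elapsed:.1f}s accuracy={accuracy:.3f}")
```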
B. Parallel implementation

In the development of the parallel implementation, the goal is to take the training set and distribute it evenly among the processes in the pool. The master process, usually the one with rank = 0, is responsible for reading the data, determining the number of training examples per process, and scattering the data among the other processes. After each process has received its share of the training data, it can begin training independently of the others. Each process first initializes its own classifier and then begins fitting its NN to its respective subset of the training data.

When the processes are finished, they send their weight matrices to the master process, whose job is to compute an average for each respective element of the different weight matrices and produce an all-encompassing weight matrix for the neural network. The master process then uses the test data to predict the class of each test example, comparing it with the real class and thus determining the accuracy of the neural network.

The main tool that will help us write the parallel algorithm is the Python library mpi4py, which offers the means necessary to write a parallel program using the Message Passing Interface.
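A minimal sketch of this scatter, train and average flow with mpi4py is given below. The CSV file names, the fixed hidden layer size and epoch count, and the detail of writing the averaged coefficients back into one MLPClassifier for scoring are our own assumptions; this illustrates the approach rather than reproducing our experiment script verbatim.

```python
import numpy as np
import pandas as pd
from mpi4py import MPI
from sklearn.neural_network import MLPClassifier

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

chunks = None
if rank == 0:
    # Master: read the data (hypothetical file names) and split the
    # training set into `size` equal chunks of (features, labels).
    train = pd.read_csv("mnist_train.csv", header=None).values[:6000]
    test = pd.read_csv("mnist_test.csv", header=None).values
    X_train, y_train = train[:, 1:] / 255.0, train[:, 0]
    X_test, y_test = test[:, 1:] / 255.0, test[:, 0]
    chunks = list(zip(np.array_split(X_train, size), np.array_split(y_train, size)))

# Each process receives one chunk and trains its own classifier independently.
X_part, y_part = comm.scatter(chunks, root=0)
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=50)
clf.fit(X_part, y_part)

# Gather the trained weight matrices and bias vectors on the master process.
all_coefs = comm.gather(clf.coefs_, root=0)
all_intercepts = comm.gather(clf.intercepts_, root=0)

if rank == 0:
    # Element-wise average of the corresponding matrices from every process,
    # written back into one classifier that is then scored on the test set.
    clf.coefs_ = [np.mean([c[i] for c in all_coefs], axis=0)
                  for i in range(len(clf.coefs_))]
    clf.intercepts_ = [np.mean([b[i] for b in all_intercepts], axis=0)
                       for i in range(len(clf.intercepts_))]
    print("parallel accuracy:", clf.score(X_test, y_test))
```

Such a script would be launched with something like mpiexec -n 4 python parallel_train.py (the script name is hypothetical), with the process count chosen so that the 6,000 training examples split evenly.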
C. Experimental environment

The solution is implemented in such a way that it can be executed on any multi-processor configuration, so long as the data can be split evenly among the processes the algorithm is started with. For example, our training data consists of 6,000 training examples, meaning that the program should be started with a number of processes n such that 6,000 is divisible by n. For our testing purposes, we will use a system with an i7-9750H six-core CPU and 16 GB of RAM. The algorithm will be run three times, with n = 2, n = 4 and n = 6 processes respectively.

Out of the family of classifiers described in the sequential implementation, we will select the optimal classifier to use for the parallel implementation. That classifier should provide a good balance between the accuracy of the predictions and the time it takes to train.

To assess the value of our exercise in parallelization, we introduce an impact_factor parameter, defined by (1).

impact_factor = (prediction_accuracy * s_1) / (training_time * s_2)        (1)

The parameters s_1 and s_2 represent significance factors. For the purpose of this paper, we will assume equal importance between prediction accuracy and training time by setting s_1 = s_2 = 1. This is not always the case in practice: the trade-off between time and accuracy varies from situation to situation, and the significance parameters would have to be set accordingly. The impact_factor parameter defined this way provides a numerical representation of the benefit of parallelizing the neural network algorithm.
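Computing the impact factor from measured quantities is a one-liner; the helper below simply restates (1) in code, with the default significance factors s_1 = s_2 = 1 used in this paper. The example numbers are assumed for illustration only.

```python
def impact_factor(prediction_accuracy, training_time, s1=1.0, s2=1.0):
    # Direct restatement of (1): accuracy weighted by s1 over time weighted by s2.
    return (prediction_accuracy * s1) / (training_time * s2)

# Example with assumed measurements: accuracy 0.95 reached in 120 seconds of training.
print(impact_factor(0.95, 120.0))  # ~0.0079
```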
V. RESULTS

A. Sequential implementation

Running the script that contains the sequential algorithm, we observe the time it takes to train all of the individual classifiers. Fig. 3 presents a mesh of all the classifiers and shows how the training time reacts almost linearly to different hidden layer sizes and different epoch numbers. Note that the results are not exactly linear due to the nature of how operating systems schedule tasks: each time the code is executed, the measured time will be different. We conducted five test runs and calculated an average.

Fig. 3. A 3D mesh presenting the training time with respect to the hidden layer size and number of epochs

When it comes to the accuracy of predictions on the test set, the same type of mesh is given in Fig. 4, where the accuracy reacts very differently to epoch numbers and hidden layer sizes than the training time does. As we add more neurons to a classifier's hidden layer and increase the number of epochs, we can surmise that the time it takes to train it isn't worth the negligible increase in accuracy we get. The numerical representation of these results is given in Table I.

Fig. 4. A 3D mesh presenting the accuracy with respect to the hidden layer size and number of epochs

The defined impact factor parameter shows that the best

TABLE I
IMPACT FACTORS FOR DIFFERENT MLP CLASSIFIERS