Parallelization of a Neural Network Algorithm for Handwriting Recognition: Can we Increase the Speed, Keeping the Same Accuracy?

D. Todorov∗, V. Zdraveski∗, M. Kostoska∗ and M. Gusev∗

∗Ss. Cyril and Methodius University in Skopje, Faculty of Computer Science and Engineering, Skopje, North Macedonia
E-mail: [email protected], {vladimir.zdraveski, magdalena.kostoska, marjan.gushev}@finki.ukim.mk

Abstract—This paper examines the problem of parallelizing neural network training using the back-propagation neural network, as a breakthrough example in the field of deep learning. The challenge of our solution is to twist the algorithm in such a way that it can be executed in parallel rather than sequentially. In this paper, we test the validity of the research hypothesis that the speed can be increased by parallelizing the back-propagation algorithm while keeping the same accuracy. For this purpose, we developed a use-case of a handwriting recognition algorithm and conducted several experiments to test the performance, both in execution speed and in accuracy. At the end, we examine just how much it benefits to write a parallel program for a neural network, with regard to the time it takes to train the neural network and the accuracy of the predictions. Our handwriting problem is one of classification, and in order to implement any solution, we must have data. The MNIST dataset of handwritten digits provides the necessary data to solve the problem.

Index Terms—message passing interface, neural network, handwriting recognition, multilayer perceptron, parallel processing, distributed processing

I. INTRODUCTION

We live in a world where everyone has a smartphone at their disposal and can write anything that comes to mind on it. Even so, that world is still, and will continue to be, filled with handwritten notes, texts and descriptions. The prime example is schools, where students write their notes in paper notebooks rather than on their smartphones. Some students are naturally good at organizing their handwritten notes, but some prefer them in a digital format, in which they can be stored in a cloud, manipulated, and so on. Instead of having to rewrite all of their notes, with an app for handwriting recognition one could simply scan or take a picture of a page of a notebook and feed it to a piece of software able to process the image and give the written text as output. Said output could then be processed and stored as per the user's desire.

Using machine learning (ML), researchers have actively been developing new ways, and refining old ways, of recognising handwritten text. One of the biggest breakthroughs in the field of ML, if not the biggest, is the multilayer perceptron algorithm, more commonly called a neural network (NN).

Of course, computer scientists would like to speed up and optimize the process, even if just a little. The idea for optimizing the training process of the NN is to run it in parallel on multiple CPU cores, or even (if the hardware is present) on multiple workstations. Naturally, we also have to check the accuracy of the neural network trained in parallel, and how it compares to the accuracy of the sequentially trained NN.

The hypothesis we make is loosely based on the No-Free-Lunch Theorem, in the sense that there cannot be a one-size-fits-all solution to all problems. More concretely, the assumption is that there will have to be a trade-off between execution speed and prediction accuracy. This hypothesis is validated by measuring the time the algorithm takes to finish training the NN sequentially, in contrast to the parallel execution. After the different training sessions, the prediction accuracy is checked on the test set. Our expectation is that it will be slightly lower for the parallel implementation, because the algorithm will not converge in the conventional sense, as it would if it ran sequentially.

There have been multiple approaches to parallelizing a NN. Pethick et al. [1] explain four approaches in order of level of granularity, from training session parallelism to weight parallelism. Training session parallelism involves no communication between the processes, providing zero overhead. Weight parallelism is the finest-grained solution: with it, the input from each synapse is calculated in parallel for each node, and the net input is summed via some suitable communication scheme. Our solution is based on the exemplar parallelism approach, which incurs little communication overhead and is very well suited to our experimental environment.

In Section II, we start by examining some related scientific papers regarding the problems of NNs, handwriting recognition and parallelization, and how they relate to this paper. Then, a basic overview of our solution is presented in Section III, first going over existing knowledge of standard NN structure, followed by a bird's-eye view of the parallel architecture. The implementation is discussed in Section IV, which explains a standard sequential implementation of the NN classification problem and a parallel implementation, and gives details on the specific environment used to run the tests and produce our results. In Section V we compare the results generated by the different implementations, and in Section VI we assess the impact and worth of our solution.
II. RELATED WORK

The invention of neural networks has been one of the biggest breakthroughs in AI and ML. Hecht-Nielsen [2] presents a survey of the basic theory of the back-propagation NN architecture, covering architectural design, performance measurement, function approximation capability, and learning. One of the most useful applications of neural networks is in pattern recognition. Pao [3] has elaborated the nature of the pattern-recognition task and adaptive pattern recognition (and its applications) as one of the most useful applications of AI.

One application in which pattern recognition is right at home is the recognition of handwritten text. Graves and Schmidhuber [4] combine two recent innovations in NNs, multidimensional recurrent NNs and connectionist temporal classification [5]. They introduce a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet-specific pre-processing and can therefore be used unchanged for any language.

AI requires huge computational power for processing. Seeing as microprocessor manufacturers have struggled to increase the raw computational power of CPUs, the relevancy of Moore's Law has slowly been fading. This in turn has increased the need for engineers to think of new ways to increase the amount of computation that is possible in a finite period of time. This is where parallel computing comes in, with [6] and [7] introducing new paradigms for parallel algorithm design, performance analysis and program construction. Akl [8] surveys existing parallel algorithms, with emphasis on design methods and complexity results. Parallelizable optimization techniques are applied to the problem of pattern recognition and learning in feed-forward neural networks by Kramer [9].

III. SOLUTION OVERVIEW

A. Neural network structure

Our solution is designed using the standard multilayer perceptron NN, also known as a backpropagation NN, shown in Fig. 1. The NN consists of an input layer of nodes (perceptrons), an output layer of nodes, and at least one (usually more than one) intermediate layer. To keep things simple, we will train NNs with only one hidden layer. The network is trained on a data set consisting of (xi, yi) pairs, where xi is a feature vector used as an input and yi is the true output, or ground truth, for that feature vector.

Fig. 1. Backpropagation NN structure

A node in the ith layer of the NN is connected to all nodes in the (i + 1)th layer. The connectors that link a node in a layer to the nodes in the next layer have assigned weights wij. We denote by xij the ith input to node j. A node in one layer takes all of the products xij * wij that come as inputs from the previous layer, computes their sum zj, and uses it to compute its output (which serves as the input to the next layer).

The NN learns by comparing the computed output with the true output and adjusting the weights of the nodes accordingly, so that the computed output matches the desired one (if the computed output is already the true output, no adjustment is done). The changes to the weights propagate back through the network (hence the name) until they reach the input layer. The algorithm stops after a full epoch without a change in weights (sometimes we stop the algorithm prematurely to combat the problem of overfitting to the training set). Sometimes the outputs of the problem are not uniformly distributed among the population of input/output pairs. For this reason we need a BIAS node at each non-output layer, with its own weight assigned to it and no inputs going into it. If the population is biased towards a certain output class, the BIAS allows us to skew the evaluation function so that we can better predict the true output.
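To make the notation concrete, the following minimal sketch shows how one hidden layer turns the inputs xij and weights wij into the sums zj and then into outputs. It is an illustration we add here, not the authors' code; the array names and the sigmoid activation are assumptions.

    import numpy as np

    def forward_pass(x, W_hidden, b_hidden, W_out, b_out):
        # z_j for each hidden node: the weighted sum of its inputs plus the BIAS term
        z_hidden = x @ W_hidden + b_hidden
        a_hidden = 1.0 / (1.0 + np.exp(-z_hidden))      # assumed sigmoid activation
        # the hidden activations become the inputs to the output layer
        z_out = a_hidden @ W_out + b_out
        return 1.0 / (1.0 + np.exp(-z_out))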
B. Parallelization

There are multiple strategies for implementing a parallel solution for the backpropagation NN. Some of the main strategies are discussed by Pethick, Liddle, Huang and Werstein [1]. We will go with the "preferred technique" according to Rogers and Skillicorn [10], which is exemplar parallelism, also known as training example parallelism. In essence, the training set is split into disjoint subsets and each running process (a thread, microprocessor or separate machine) works on only one subset. The processes need to start with initial states identical to each other, meaning the weights associated with each node need to be the same; usually this means that every weight is set to 1 at the beginning. At the end of the training, the changes are combined and applied together to the neural network by averaging them out. The diagram for this parallel solution approach is given in Fig. 2.

Fig. 2. The parallelization method. The dataset is split equally among the processes, each process trains its own neural network, and at the end all the processes average out their weight vectors.

Two advantages come to mind when we think about this approach to parallelizing a NN. The first is the small overhead caused by communication between processes: the processes communicate only at the beginning, to distribute the data, and at the end of the training, to average out the weights, so relatively few messages are generated. The second advantage is the speed-up during the training phase. Since the data set is split into n smaller subsets si, where i = 1, 2, ..., n, the time it takes to go through an epoch is the maximum of the times ti it takes to iterate through a subset si, or tepoch = max(ti). A disadvantage of this approach is that it does not provide a performance increase at the layer level, meaning that no two nodes in the same process, working on the same subset of data, work in parallel.
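Written out explicitly (our formalization of the averaging step, which the paper describes only in words), if process k finishes its training with weights wij(k), the combined network uses

    w_avg_ij = (wij(1) + wij(2) + ... + wij(n)) / n,

while the per-epoch time of the whole system remains tepoch = max(ti), as noted above.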
IV. EXPERIMENTAL METHODS

For the purposes of this research, we will conduct two experiments that correspond to two implementations of the handwriting recognition back-propagation NN. The first is a classical sequential solution and the other is a parallel implementation. For both experiments, we will use Python with its scikit-learn library, which includes the tool needed for training NNs, MLPClassifier.

For the sake of simplicity, the dataset we will use to train the NN is the MNIST dataset [11] of handwritten digits, which includes 60,000 labeled 28x28 images of handwritten digits in its training set and 10,000 images in its testing set. Only the first 6,000 data points will be used for training in our experiments. The implementation can easily be extended to use the EMNIST dataset [12], which also includes images of handwritten letters.

A. Sequential implementation

The sequential implementation is executed conventionally, starting by reading the training and test data from csv files in the format (y, x1, ..., x784), where y is the digit that corresponds to the image represented by the pixels denoted by xi, and each xi holds a value from 0 to 255 representing the amount of black in the pixel, 0 being all white and 255 being all black.

The dataset is split into a training and a test set. The training subset is used to train a variety of classifiers, each having a different number of neurons in its hidden layer and each being run for a different number of epochs. The number of neurons in the hidden layer and the number of epochs vary in the interval from 50 to 100. The test set is used to determine which of the classifiers has the most optimal accuracy-to-training-time ratio.

The goal is to find a NN model that is easy and fast to train but makes the most accurate predictions possible. Note that sometimes the most accurate classifier is not the best choice, because it may take longer to train than it is worth. For this reason, the classifier actually used in the parallel implementation is the one with the most optimal accuracy-to-training-time ratio.
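A minimal sketch of this family of training runs is shown below. It is our reconstruction rather than the authors' script: the csv file names, the absence of a header row, the grid step of 25, and the use of max_iter as the epoch budget are all assumptions.

    import time
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # assumed layout: one row per image, the label first, then the 784 pixel values
    train = np.loadtxt("mnist_train.csv", delimiter=",")[:6000]
    test = np.loadtxt("mnist_test.csv", delimiter=",")
    X_train, y_train = train[:, 1:] / 255.0, train[:, 0]
    X_test, y_test = test[:, 1:] / 255.0, test[:, 0]

    for hidden in range(50, 101, 25):         # hidden layer sizes from 50 to 100
        for epochs in range(50, 101, 25):     # epoch budgets from 50 to 100
            clf = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=epochs)
            start = time.time()
            clf.fit(X_train, y_train)
            elapsed = time.time() - start
            accuracy = clf.score(X_test, y_test) * 100
            # accuracy-to-training-time ratio used to pick the classifier
            print(hidden, epochs, round(elapsed, 4), round(accuracy, 4), round(accuracy / elapsed, 4))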
B. Parallel implementation

In the development of the parallel implementation, the goal is to take the training set and distribute it evenly among the processes in the pool. The master process, usually the one with rank = 0, is responsible for reading the data, determining the number of training examples per process, and scattering the data among the other processes. After each process has its share of the training data, it can begin training independently of the others. Each process first initializes its own classifier and then begins fitting its NN to its respective subset of the training data.

When the processes are finished, they send their weight matrices to the master process, whose job is to compute an average for each respective element of the different weight matrices and produce an all-encompassing weight matrix for the neural network. The master process then uses the test data to predict the class of each test example, comparing it with the real class and thus determining the accuracy of the neural network.

The main tool that helps us write the parallel algorithm is the Python library mpi4py, which offers the means necessary to write a parallel program using the Message Passing Interface.
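The sketch below shows the shape of such a program. It is a simplified reconstruction, not the authors' code: scattering a list of array chunks, training a local MLPClassifier in every process, and averaging the coefs_ and intercepts_ attributes on the master are our assumptions about one reasonable way to implement the scheme described above.

    from mpi4py import MPI
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        data = np.loadtxt("mnist_train.csv", delimiter=",")[:6000]   # assumed file
        chunks = np.array_split(data, size)   # one chunk per process
    else:
        chunks = None
    chunk = comm.scatter(chunks, root=0)      # each process receives one subset

    X, y = chunk[:, 1:] / 255.0, chunk[:, 0]
    clf = MLPClassifier(hidden_layer_sizes=(75,), max_iter=75)
    clf.fit(X, y)                             # independent local training

    # gather the weight matrices on the master and average them element-wise
    all_coefs = comm.gather(clf.coefs_, root=0)
    all_intercepts = comm.gather(clf.intercepts_, root=0)
    if rank == 0:
        clf.coefs_ = [np.mean([c[i] for c in all_coefs], axis=0)
                      for i in range(len(clf.coefs_))]
        clf.intercepts_ = [np.mean([b[i] for b in all_intercepts], axis=0)
                           for i in range(len(clf.intercepts_))]
        # clf can now be evaluated on the test set with clf.score(X_test, y_test)

Such a script would be launched with an MPI launcher, e.g. mpiexec -n 4 python train_parallel.py (the file name is hypothetical), with the process count chosen so the training set splits evenly.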
C. Experimental environment

The solution is implemented in such a way that it can be executed on any multi-processor configuration, so long as the data can be split evenly among the processes the algorithm is started with. For example, our training data consists of 6,000 training examples, meaning that the program should be started with a number of processes n such that 6,000 is divisible by n. For our testing purposes, we will use a system with an i7-9750H six-core CPU and 16 GB of RAM. The algorithm will be run three times, with n = 2, n = 4 and n = 6 processes respectively.

Out of the family of classifiers described in the sequential implementation, we will select the optimal classifier to use for the parallel implementation. That classifier should provide a good balance between the accuracy of the predictions and the time it takes to train.

To assess the value of our exercise in parallelization, we introduce an impact_factor parameter, defined by (1).

    impact_factor = (prediction_accuracy * s1) / (training_time * s2)    (1)

The parameters s1 and s2 represent significance factors. For the purpose of this paper, we assume equal importance between prediction accuracy and training time by setting s1 = s2 = 1. This is not always the case in practice: usually, the trade-off between time and accuracy varies from situation to situation, and the significance parameters would have to be set accordingly. The impact_factor parameter defined in this way provides a numerical representation of the benefit of parallelizing the neural network algorithm.
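As a quick illustration of (1) with s1 = s2 = 1 (the numbers are taken from Table II later in the paper):

    def impact_factor(accuracy, training_time, s1=1.0, s2=1.0):
        # (prediction accuracy * s1) / (training time * s2), as in (1)
        return (accuracy * s1) / (training_time * s2)

    print(impact_factor(88.23, 5.39))   # ~16.37, the sequential entry in Table II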
V. RESULTS

A. Sequential implementation

Running the script that contains the sequential algorithm, we observe the time it takes to train each of the individual classifiers. Fig. 3 presents a mesh of all the classifiers and shows how the training time reacts almost linearly to different hidden layer sizes and different epoch numbers. Note that the results are not exactly linear, due to the way operating systems schedule tasks: each time the code is executed, the measured time will be different. We conducted five test runs and calculated an average.

Fig. 3. A 3D mesh presenting the training time with respect to the hidden layer size and number of epochs

When it comes to the accuracy of predicting the test set, the same type of mesh is given in Fig. 4, where the accuracy reacts very differently to epoch numbers and hidden layer sizes than the training time does. As we add more neurons to a classifier's hidden layer and increase the number of epochs, we can surmise that the extra training time is not worth the negligible increase in accuracy we get. The numerical representation of these results is given in Table I.

Fig. 4. A 3D mesh presenting the accuracy with respect to the hidden layer size and number of epochs

TABLE I
IMPACT FACTORS FOR DIFFERENT MLP CLASSIFIERS

Hidden layer  Epochs  Time (s)  Accuracy (%)  Impact factor
 50    50    7.2142   85.8985   11.9068
 50    65    9.2273   87.2587    9.4564
 50    75   10.5781   87.6787    8.2886
 50    85   11.0793   88.1088    7.9525
 50   100   11.3240   88.1088    7.7806
 65    50    7.5223   85.6985   11.3925
 65    65    8.3173   88.3188   10.6186
 65    75    8.2403   88.3188   10.7178
 65    85    8.2412   88.3188   10.7167
 65   100    8.1574   88.3188   10.8267
 75    50    5.4276   88.2288   16.2552
 75    65    5.7881   88.2288   15.2429
 75    75    5.3966   88.2288   16.3487
 75    85    5.5618   88.2288   15.8633
 75   100    5.5589   88.2288   15.8714
 85    50    6.4337   89.6289   13.9311
 85    65    6.5464   89.6289   13.6911
 85    75    6.2227   89.6289   14.4034
 85    85    6.3530   89.6289   14.1080
 85   100    6.2750   89.6289   14.2834
100    50    8.2521   89.4989   10.8454
100    65    8.0113   89.4989   11.1715
100    75    7.8782   89.4989   11.3603
100    85    8.0304   89.4989   11.1449
100   100    8.0716   89.4989   11.0880

The defined impact factor parameter shows that the best trade-off between accuracy and training time is achieved by a classifier with 75 neurons in its hidden layer, trained for 75 epochs. Such a classifier gives a reasonably good accuracy of 88.23 percent while taking about 5.39 seconds to train, and it is therefore used in our parallel implementation, to test if and how much of an increase in speed we can gain whilst keeping the accuracy in check.

B. Parallel implementation

With the parallel algorithm, interesting details come to light about the speed with which the neural network is trained. We use a classifier identical to the optimal classifier from the sequential implementation (recall that the optimal classifier had 75 neurons in the hidden layer and a maximum of 75 epochs). The change in training time is presented visually in Fig. 5. Our parallel program using 2 cores managed to train the neural network in less than half the time of the sequential algorithm. Using 4 cores, the increase in speed from 2 to 4 is not as significant as that from 1 to 2, and the increase from 4 to 6 even less so. Increasing the number of processes further causes a negligible increase in speed. The reason for this is Amdahl's Law [13] (Fig. 6), which concludes that the speedup is limited by the sequential part of the algorithm and not by the number of processors. With all of that taken into account, these results, presented in Table II, are still an improvement over the original sequential implementation.

Fig. 5. Training time for the sequential and parallel algorithms

Fig. 6. Speedup versus different numbers of processes
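For reference, Amdahl's Law (which the paper cites but does not restate) bounds the speedup on n processors as S(n) = 1 / ((1 - p) + p/n), where p is the fraction of the work that can be parallelized; as n grows, S(n) approaches 1 / (1 - p), which is why going from 4 to 6 processes buys so little.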
The same cannot be said, however, for the accuracy of the predictions. Although not bad by any means, Fig. 7 shows a slight drop-off from the sequential algorithm's 88% to the parallel algorithm's 85%, 82% and 80% accuracy, using 2, 4 and 6 cores respectively. This result is in line with the hypothesis we presented in the introduction.

Fig. 7. Prediction accuracy for the sequential and parallel algorithms

The impact_factor parameter determines the usefulness of our experiment. Table II also presents the impact factor for each of the executions with 1, 2, 4 and 6 cores. The sequential algorithm's impact_factor = 16.37, while for the parallel algorithm running on 2 cores impact_factor = 31.89, for 4 cores impact_factor = 37.36 and for 6 cores impact_factor = 38.93.
TABLE II
IMPACT FACTORS FOR DIFFERENT NUMBERS OF PROCESSES - NO COMMUNICATION OVERHEAD

No. of processes  Training time (s)  Accuracy (%)  Impact factor
1    5.39   88.23   16.369
2    2.67   85.15   31.891
4    2.21   82.56   37.357
6    2.08   80.97   38.928
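The training times in Table II also let us read off the measured speedup and parallel efficiency directly (a small illustrative computation we add here, not part of the original experiments):

    seq_time = 5.39
    for n, par_time in [(2, 2.67), (4, 2.21), (6, 2.08)]:
        speedup = seq_time / par_time     # about 2.02, 2.44 and 2.59
        efficiency = speedup / n          # about 1.01, 0.61 and 0.43
        print(n, round(speedup, 2), round(efficiency, 2))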
Even though the accuracy of our predictions went down with parallel training, the higher impact of the parallel implementation tells us that the experiment was indeed worthwhile. We still do not get a clear enough picture of the impact of our solution, however, until we take into consideration the communication overhead that occurs during the parallel execution. After the master process reads the data, it needs to distribute it evenly among the other processes. Scattering the data takes around 11 seconds, and when we inject that into the calculation (the training times in Table III are those of Table II plus the scattering time), Table III and Fig. 8 give much greater insight. The 2-core parallel impact_factor = 6.23, which is a significant drop-off from the previously calculated 31.89. A minor increase is observed for the 4-core implementation (impact_factor = 6.25), while for the 6-core implementation the impact factor actually decreases after taking the scattering time into account (impact_factor = 6.19). This leads to the conclusion that after a certain point, parallelizing a neural network stops being useful.

TABLE III
IMPACT FACTORS FOR DIFFERENT NUMBERS OF PROCESSES - WITH COMMUNICATION OVERHEAD

No. of processes  Training time (s)  Accuracy (%)  Impact factor
1     5.39   88.23   16.369
2    13.67   85.15    6.229
4    13.21   82.56    6.250
6    13.08   80.97    6.190

Fig. 8. The change of the Impact Factor for different numbers of processes
VI. CONCLUSION

As with everything in life, every benefit comes with a price. In our case we wanted to provide an increase in speed for the training process of a neural network intended for predicting handwritten digits. Although we achieved that speed-up, it came at the price of prediction accuracy. In the end, these results show that the hypothesis we presented in the introduction can be considered satisfied.

The algorithm for training neural networks is sequential in its nature, so the end results make sense taking everything into consideration. The experiment provided us with an interesting insight into the world of neural networks and parallel computing and how, sometimes, you cannot have your cake and eat it too. In order to accomplish something, some sacrifices need to be made, whether those be complexity, resources, or accuracy.

This experiment proves useful as a jumping-off point for further research. Priorities in said research would include refining the computation of the final weight matrix of the trained NN, rather than just averaging the weight matrices of the separate NNs trained in the individual processes. We would also like to explore finer-grained solutions where the processes communicate more often (during or after each epoch, rather than just at the end of the training to compute the weight matrix).

Lastly, something that would likely provide an interesting read is the exploration of different ML algorithms. Perhaps Naive Bayes classifiers, decision trees, linear and quadratic discriminant analysis (LDA and QDA), etc. could prove themselves better suited for parallel execution than NNs.
REFERENCES

[1] M. Pethick, M. Liddle, P. Werstein, and Z. Huang, "Parallelization of a backpropagation neural network on a cluster computer," in International Conference on Parallel and Distributed Computing and Systems (PDCS 2003), 2003.
[2] R. Hecht-Nielsen, "Theory of the backpropagation neural network," in Neural Networks for Perception. Elsevier, 1992, pp. 65-93.
[3] Y. Pao, Adaptive Pattern Recognition and Neural Networks. Reading, MA: Addison-Wesley, 1989.
[4] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," Advances in Neural Information Processing Systems, vol. 21, pp. 545-552, 2008.
[5] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369-376.
[6] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing. Redwood City, CA: Benjamin/Cummings, 1994, vol. 110.
[7] I. Foster, Designing and Building Parallel Programs. Addison-Wesley, 2003.
[8] S. G. Akl, The Design and Analysis of Parallel Algorithms. Old Tappan, NJ: Prentice Hall, 1989.
[9] A. H. Kramer and A. Sangiovanni-Vincentelli, "Efficient parallel learning algorithms for neural networks," in Advances in Neural Information Processing Systems, 1989, pp. 40-48.
[10] R. Rogers and D. Skillicorn, "Strategies for parallelizing supervised and unsupervised learning in artificial neural networks using the BSP cost model," Queen's University, Kingston, Ontario, Tech. Rep., 1997.
[11] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[12] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, "EMNIST: Extending MNIST to handwritten letters," in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 2921-2926.
[13] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, 1967, pp. 483-485.
