

Implementation and Analysis of Different Digit Recognition Methods on Reduced
MNIST Dataset

Reza Shisheie*, Benjamin A. Galun, Jongman Kim

Department of Computer Science


Cleveland State University
[email protected]

Abstract

This paper focuses on the problem of handwritten digit recognition on large and reduced datasets. The MNIST dataset was used for experiments and quick evaluation of the methods. To see the effectiveness of the methods on reduced datasets and one-shot learning, smaller modified versions of MNIST were created. A Perceptron was implemented as the reference method. To improve results and decrease the error rate, a modified k-nearest neighbor (k-NN) was employed and its results are shown. To reduce time complexity and further improve performance, different variants of the SVM (Support Vector Machine) were used. Finally, a multi-layer CNN (Convolutional Neural Network) using a modified boosted LeNet4 method was used.

1. Introduction

Handwritten digit recognition is still one of the hot topics in artificial intelligence due to its wide area of applications, from postal mail sorting to check processing. With the rise of high-power computing machines, more and more methods were introduced in the past two decades, and their error rate, performance, and run time were studied. The two main parameters playing the most significant role in performance are feature extraction and the classification method.

Trier et al. (O.D. Trier, 1995) presented an overview of off-line feature recognition of Latin characters. Methods such as template matching, deformable templates, unitary image transforms, graph description, geometric moment invariants, Zernike moments, spline curve approximation, and Fourier descriptors were studied. They concluded that Zernike moments provided the most promising result for their application. One of the early feature extraction methods was presented by (Y. Yamashita, 1983) for feature recognition of Kanji characters. They first extracted directional line segments and partitioned the character frame area. Then, a distribution of strokes was generated as a vector and compared with average vectors in a dictionary. This stroke method, which is a statistical method, was found to be one of the most useful methods as far as feature extraction is concerned.

Neural Networks (NN) have been used very widely due to their ease of use and implementation. NNs are different from other statistical classifiers since they are online learning systems, which are non-parametric and model free. Some NNs do not even have a cost function (L. Holmstrom, 1997). The computational aspect, however, is the most important aspect and needs extra attention: to avoid heavy run time, computation should be kept local and simple. In contrast with NNs, many other statistical classifiers follow the Bayesian classification principle, which uses an explicit function with prior probabilities. (L. Holmstrom, 1997) did two studies on NNs and concluded that local linear regression (LLR), although computationally heavy, was the best classifier, followed by the Averaged Learning Subspace Method (ALSM). The amount of computation power required for LLR is so high, in comparison with ALSM, that it makes the method practically impossible. To increase the performance of NNs, (D.E. Rumelhart, 1986) proposed a back-propagation technique. In this method, the net weights are adjusted repeatedly in order to minimize the difference between the output vector and the desired output. As a result, hidden units capture important features of the task domain which could not be captured by previous methods.

In recent years a technique called the Support Vector Machine (SVM) came into the focus of attention. SVM did not become a focus of research until 1992, when Vapnik and his colleagues at Bell Labs implemented and tested the method for the first time (Boser, et al., 1992). The SVM technique can be used for a wide variety of classification functions, including perceptrons, polynomials, and Radial Basis Functions. The idea is to use a linear combination of support patterns, which are a subset of the training patterns. (Burges, 1998) presented a tutorial on SVM for separable and non-separable datasets using non-trivial examples. Traditional methods such as NNs usually minimize the empirical training error; for SVM, however, the goal is to minimize an upper bound of the generalization error. This goal can be achieved by maximizing the margin which separates the hyperplane and the data. (Byun, 2002) introduced many applications of SVM, including but not limited to handwritten digit recognition.

Figure 1. Architecture of LeNet-1 (LeCun, 1998)

(LeCun, 1998) introduced LeNet-5, which is a 7-layer neural network. An improved version of LeNet was introduced as boosted LeNet4. Boosted LeNet4 is a deep learning algorithm that correctly identified the MNIST dataset with a 0.7% error rate. The LeNet4 architecture is an improved version of the LeNet-1 system.

LeNet4 includes more feature maps and an additional hidden layer that is fully connected to the last feature layer and the output layer, compared to LeNet-1. It requires "260,000 multiply/add steps and has about 17,000 free parameters". LeNet4 without boosting achieves a 1.1% error on the testing set. Once this was established, the system was boosted. Boosting refers to using a group of neural nets to improve accuracy. Boosting LeNet4 involves using three learning machines. The first one trains the normal way; the second one is trained on an even mix of inputs that the first classified correctly and incorrectly; the third machine is trained on inputs where the first two disagreed. Since LeNet4 has a 1.1% error, at most only 2.2% of the images would be sent to the second machine, and fewer to the third. To allow more inputs to reach the second and third machines, images were heavily distorted by shifting, scaling, rotating, and skewing so that more images could be processed by them. After running the original and distorted images through the three machines, a 0.7% error rate was observed. The total training time for this method was over one month.
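The data-routing rules of this boosting scheme can be sketched as below. The model interface (`predict` for labels, `scores` for per-class scores), the 50/50 sampling, and the score-summing combination are assumptions made purely for illustration; they do not reproduce the original boosted LeNet4 implementation.

```python
import numpy as np

# `model.predict(x)` -> class labels, `model.scores(x)` -> per-class scores;
# this interface is hypothetical and only illustrates the routing rules.

def second_machine_set(model1, x, y, rng):
    """Even mix of inputs the first machine classified correctly/incorrectly."""
    pred = model1.predict(x)
    right, wrong = np.flatnonzero(pred == y), np.flatnonzero(pred != y)
    n = min(len(right), len(wrong))
    mix = np.concatenate([rng.choice(right, n, replace=False),
                          rng.choice(wrong, n, replace=False)])
    return x[mix], y[mix]

def third_machine_set(model1, model2, x, y):
    """Inputs on which the first two machines disagree."""
    disagree = np.flatnonzero(model1.predict(x) != model2.predict(x))
    return x[disagree], y[disagree]

def boosted_predict(models, x):
    """Combine the machines; summing per-class scores is one common choice."""
    return np.argmax(sum(m.scores(x) for m in models), axis=1)
```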
(Ciregan, 2012) introduced a multi-column deep neural network method for image classification. The method described by that paper has a 0.2% error rate on the MNIST dataset. The neural net architecture is described by 1x28x28-20C4-MP2-40C5-MP3-150N-10N. The input images are the 28x28-pixel MNIST digits; since MNIST has only one color channel, an image is represented as a 28x28x1 array. The next step is a convolution layer consisting of 20 maps of 4x4 filters, represented by 20C4. After that there is a max pooling layer with a 2x2 filter and a stride of 2, which shrinks the image and is represented by MP2. The next part is another convolution layer of 40 5x5 filters, 40C5, followed again by a max pooling layer with 3x3 filters and a stride of 3, MP3. The last two layers are fully connected hidden layers with 150 and 10 neurons respectively, 150N and 10N. The final 10 neurons are the outputs, one for each digit.

Once the architecture was established, the original 28x28 MNIST images were modified from their original form and width-normalized to 10, 12, 14, 16, 18 and 20 pixels (together with the undistorted originals, seven variants). Once the modified datasets were obtained, 35 different deep neural nets with the architecture above were created. Five neural nets were assigned to each of the different image distortions, and the results were averaged at the end. With this method an error rate of 0.2% was achieved.

In this paper, the MNIST dataset and the procedure for obtaining the reduced sets are explained first. Then, the different methods implemented in this paper are presented. In section 3 all results are analyzed and discussed. Finally, in section 4 a summary is provided.

2. Methods

In this paper several methods are implemented on the MNIST dataset. These methods include a Perceptron, k-NN, three versions of SVM, and a CNN. A modified MNIST dataset was proposed and used for verification and testing.
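The Perceptron serves as the reference method but is not detailed further in the paper. Below is a minimal sketch, assuming a standard single-layer, one-vs-rest perceptron over the raw 784-dimensional pixel vectors; the function names and parameters are illustrative assumptions, not the authors' code.

```python
import numpy as np

def train_perceptron(X, y, n_classes=10, epochs=1, lr=1.0):
    """Single-layer, one-vs-rest perceptron over flattened 28x28 images.

    X: (n_samples, 784) float array, y: (n_samples,) integer labels 0-9.
    A single pass (epochs=1) matches the 'goes through data once' behaviour
    discussed in Section 3; more passes are optional.
    """
    W = np.zeros((n_classes, X.shape[1]))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = np.argmax(W @ x + b)
            if pred != label:                # update weights only on mistakes
                W[label] += lr * x; b[label] += lr
                W[pred] -= lr * x; b[pred] -= lr
    return W, b

def predict_perceptron(W, b, X):
    return np.argmax(X @ W.T + b, axis=1)
```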

2.1. Reduced MNIST Dataset

Figure 2. Normalized sample from the MNIST dataset (LeCun, 1998)

The Modified National Institute of Standards and Technology (MNIST) dataset is comprised of 60,000 28x28-pixel images of digits. The set is split equally between the ten digits. It contains 50,000 images for training and 10,000 for testing, which equates to 5,000 training images of each digit and 1,000 images of each digit for testing.

To test the accuracy of the learning algorithms with one-shot learning and reduced datasets, additional datasets were created. The n-MNIST datasets were created, where n indicates the number of examples of each digit included in the set: 1-MNIST, 5-MNIST, 10-MNIST, 20-MNIST, 50-MNIST, and 100-MNIST. The n pictures of each digit were taken from the first n instances of that digit in the MNIST dataset. This allowed the same datasets to be fed to each learning algorithm. For testing each algorithm's proficiency at classification, the full 10,000 test images were used.
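A minimal sketch of how such n-MNIST subsets can be built, assuming the training images and labels are already loaded as NumPy arrays; the function name and loading details are illustrative, not taken from the paper.

```python
import numpy as np

def make_n_mnist(train_images, train_labels, n):
    """Keep the first n occurrences of each digit 0-9, in dataset order.

    train_images: (n_train, 28, 28) array, train_labels: matching 1-D labels.
    Returns the reduced images/labels forming the n-MNIST training set.
    """
    keep = []
    counts = {d: 0 for d in range(10)}
    for idx, label in enumerate(train_labels):
        if counts[int(label)] < n:
            keep.append(idx)
            counts[int(label)] += 1
        if all(c >= n for c in counts.values()):
            break                      # every digit already has n examples
    keep = np.array(keep)
    return train_images[keep], train_labels[keep]

# e.g. the 10-MNIST training set (100 images in total):
# x10, y10 = make_n_mnist(x_train, y_train, 10)
```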
2.2. One-Shot Learning

Most image classification methods require a lot of input data to get meaningful results. One-shot learning attempts to mimic human learning with regard to classification. Humans can classify images with very few examples and are able to perform well with only one example. One-shot learning methods take one or a few images of each class and attempt to classify the testing set based on that little information. To test our learning algorithms on one-shot learning, reduced-size datasets were created.

2.3. K-Nearest Neighbors (k-NN)

The k-NN learning algorithm is based on classification by the Euclidean distance between testing images and training images. Each testing image is compared to every training image and the labels of the k closest images are taken, with the most frequent one used as the predicted label. The k value indicates how many of the closest images are looked at to generate a result. When k is 1, the label of the closest training image is used for the testing image; k is usually chosen to be an odd number to avoid ties. The k-NN method requires no training time; however, each testing image must be compared to every training image, which results in a large amount of time to identify images when the dataset is large. For this paper's implementation a k value of 1 was used, as increasing the k value had a negligible effect on performance.
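A minimal sketch of the 1-NN rule described above, assuming the images are flattened into float vectors; the batching is only an illustrative memory-saving choice and is not something specified in the paper.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=1, batch=500):
    """Classify each test image by the label(s) of its k nearest training
    images under Euclidean distance (k=1 as used in this paper).

    train_x, test_x: float arrays of shape (n, 784); train_y: 1-D labels.
    """
    train_sq = (train_x ** 2).sum(axis=1)           # precompute ||x_train||^2
    preds = np.empty(len(test_x), dtype=train_y.dtype)
    for start in range(0, len(test_x), batch):
        chunk = test_x[start:start + batch]
        # squared distances via ||a-b||^2 = ||a||^2 - 2 a.b + ||b||^2
        d2 = ((chunk ** 2).sum(axis=1)[:, None]
              - 2.0 * chunk @ train_x.T
              + train_sq[None, :])
        nearest = np.argsort(d2, axis=1)[:, :k]      # indices of k closest
        for i, idx in enumerate(nearest):
            labels, counts = np.unique(train_y[idx], return_counts=True)
            preds[start + i] = labels[np.argmax(counts)]   # majority label
    return preds
```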
2.4. Support Vector Machine (SVM)

Empirical Risk Minimization (ERM) is the goal of many classical learning approaches, in which the aim is to minimize the error on the training dataset. One of the prominent examples of those methods is neural networks. Structural Risk Minimization (SRM) methods, however, are based on statistical learning theory. A good example of SRM is SVM: rather than only the training error, SVM aims to minimize an upper bound on the generalization error.

Figure 3. SVM with the margin and decision boundary

The fundamental idea behind SVM is constructing a decision plane which separates samples on the positive side of the margin from samples on the negative side. This idea is shown in Figure 3 (Byun, 2002).

In the case of separable classes, the goal is to find an optimal hyperplane which separates the closest support vectors on the two sides while the width of the margin is maximized. The problem should be solved for all samples $\mathbf{x}_i$ with labels $y_i \in \{+1,-1\}$ such that

$\mathbf{w}\cdot\mathbf{x}_i + b \ge +1$  for all samples on the positive side ($y_i = +1$)

$\mathbf{w}\cdot\mathbf{x}_i + b \le -1$  for all samples on the negative side ($y_i = -1$)

Since the margin width is $2/\|\mathbf{w}\|$, maximizing it is an optimization problem in which the quadratic of the weights must be minimized:

$\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2$   (1)

For some real-life problems the data might not be completely separable. However, there still exists a hyperplane which can separate the classes with minimum misclassification. The objective function for such cases is modified to the following:

$\min_{\mathbf{w},b,\xi}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$   (2)

where ξ is a positive slack variable and C is a user-defined variable that controls the trade-off between the margin and the misclassification.

Taking the Lagrangian of (2), finding the Lagrange multipliers $\alpha_i$, substituting them back into the Lagrange function, and solving yields a quadratic programming (QP) problem for linearly separable data:

$\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j$   (3)

which should be solved with respect to the constraints:
$0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0$   (4)

For many real-life problems the data is so inseparable, and accuracy is so important, that a linear SVM does not provide a satisfactory result. The solution is to map the data from the current non-linear space to a higher-dimensional space in which the data is linearly separable. To solve the problem an initial mapping Φ is introduced, which maps the data from the non-linear space to the linear space. Since the training algorithm depends only on dot products of the data, a symmetric kernel function k can be defined as:

$k(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j)$   (5)

The mapping is defined implicitly by the choice of kernel, for instance a sigmoid or polynomial function. A list of common kernel functions is given in Table 1.

Table 1. Summary of kernels (Haykin, 1999)

Several algorithms have been suggested as solutions for this optimization problem (Schölkopf, 1999) (Smola, 2004). These two solutions, as well as many other classical solutions, are good for small amounts of data. Most of these methods fail for various reasons: they require a large amount of memory, since the kernel matrix must be computed and stored, which results in out-of-memory crashes. They also need large and very expensive matrix operations, which are not possible on many computers, and coding these algorithms can be tedious and complex. (Osuna, 1997) proved a theorem which partially overcomes the complexity of the problem. The theorem suggests breaking the large QP problem down into a number of smaller QP problems; in each step the cost of the main objective function is reduced while all constraints remain satisfied.

For this project the Sci-kit (scikit-learn) toolkit was used, with three different methods: LinearSVC, SVC, and NuSVC. LinearSVC tries to classify data with straight lines, whereas SVC and NuSVC categorize data based on non-linear regions. A visualization of the result for sample data is shown in Figure 4.

As shown in Figure 4, using SVC with a linear kernel gives almost the same results as using LinearSVC. To reach the best result for SVC and NuSVC, their parameters must be tuned. For the kernel function any of the linear, polynomial, Radial Basis Function (RBF) or sigmoid functions can be used. The best result was generated using RBF, as it is one of the most popular and most accurate choices of kernel function.

Figure 4. Visualization of different Sci-kit SVM methods (Sci-kit, 2018)

The hyperparameter γ is used to configure the sensitivity to differences in feature vectors, which in turn depends on various things such as the input space dimensionality and feature normalization. If γ is set too large, overfitting may occur. In the limit case γ→∞, the kernel matrix becomes the unit matrix, which leads to a perfect fit of the training data but an entirely useless model. The best results were obtained once the kernel coefficient was set to 1/(number of features).

SVM, in its initial form, was designed as a binary problem in which two classes are compared with each other (Boser, et al., 1992). This solution does not work for multi-class problems. (Vapnik, 1996) suggested a multi-comparison solution to resolve this issue, in which one class is compared with the other classes and whichever class has the largest margin is picked as the output. For such a problem n hyperplanes must be defined and the solution is the result of n QP problems. In this method, which is called 'one against the rest' (OVR), each sub-problem puts one class against all other classes.

(Knerr, 1990) suggested a second approach in which pairs of classes are compared for all n classes: 'one against one' (OVO). This results in n(n-1)/2 classifiers generated from the training set of n classes (Pal, 2008).

SVC and NuSVC implement both OVO and OVR approaches for multi-class classification; however, as a rule of thumb OVR usually gives better results.

The only difference between SVC and NuSVC is the newly proposed variable ν (nu), which modifies the simple primal problem for SVC,

$\min_{\mathbf{w},b,\xi}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$   (6)

to a modified version in which the role of the penalty parameter C is taken over by ν, which is easier to tune:

$\min_{\mathbf{w},b,\xi,\rho}\ \tfrac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \tfrac{1}{n}\sum_i \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge \rho - \xi_i,\ \xi_i \ge 0,\ \rho \ge 0$   (7)
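A minimal sketch of the three scikit-learn classifiers discussed above, with the RBF kernel and the kernel coefficient set to 1/(number of features) as reported to give the best results. All other settings are scikit-learn defaults, not values confirmed by the paper.

```python
from sklearn.svm import LinearSVC, SVC, NuSVC

def build_svm_classifiers(n_features=784):
    gamma = 1.0 / n_features       # kernel coefficient = 1/(number of features)
    return {
        "LinearSVC": LinearSVC(),                       # linear decision boundaries
        "SVC":       SVC(kernel="rbf", gamma=gamma),    # C-SVM with RBF kernel
        "NuSVC":     NuSVC(kernel="rbf", gamma=gamma),  # nu-SVM with RBF kernel
    }

# x_train: (n, 784) scaled pixel vectors, y_train: digit labels
# for name, clf in build_svm_classifiers().items():
#     clf.fit(x_train, y_train)
#     print(name, clf.score(x_test, y_test))
```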
2.5. Convolutional Neural Networks (CNN)

For this project a neural network was created to test its ability to classify MNIST digits with a reduced dataset. The framework was built on TensorFlow and written in Python, and the network was trained on the machine's CPU. Before testing the performance on reduced datasets and one-shot learning, a reliable neural network needed to be created that could classify digits from the full dataset with at least 95% accuracy. The initial architecture was taken from (Ciregan, 2012). Recall that, as stated earlier, that architecture is represented by 1x28x28-20C4-MP2-40C5-MP3-150N-10N with 35 parallel neural networks. Since the reported error rate was 0.23%, an attempt was made to replicate the method.

With the limitations of the available hardware, 35 parallel neural networks, let alone any parallel networks, were too complex to implement. To begin with, only a single network (one column) was replicated on the local machine; however, with only one network running the algorithm, the results were not successful. To increase the performance, more neurons were added to the first hidden layer; it was boosted to 784 neurons. This number was chosen because it is equal to the number of pixels in each image (28x28). This allowed for better classification, but not within the threshold. The article also mentions testing a network with 5x5 filters. This was adopted for the first convolution layer, along with boosting the number of filters from 20 and 40 to 32 and 64 respectively. This left the final architecture as 1x28x28-32C5-MP2-64C5-MP2-784N-10N. It describes a neural net with a 28x28x1 input, followed by a convolutional layer of 32 maps of 5x5 filters. To avoid image shrinkage the image was padded with zeros so the output keeps the same size after each filter. After the first convolutional layer comes a max pooling layer with a 2x2 filter over non-overlapping regions. Next is another convolutional layer with 64 maps of 5x5 filters, again zero-padded to retain the image size, and then another max pooling layer identical to the first one. Finally, a fully connected layer with 784 hidden units was used, fully connected to an output layer with 10 neurons, one for each digit. This neural network was able to achieve 97.33% accuracy.
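A minimal sketch of the final architecture described above (1x28x28-32C5-MP2-64C5-MP2-784N-10N), written with the Keras API of TensorFlow. The activations, optimizer, loss, and training settings are assumptions for illustration and are not specified in the paper.

```python
import tensorflow as tf

def build_reduced_mnist_cnn():
    """28x28x1 input -> 32 maps of 5x5 (zero-padded) -> 2x2 max pool
       -> 64 maps of 5x5 (zero-padded) -> 2x2 max pool
       -> fully connected 784 units -> 10-way softmax output."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(pool_size=2),       # non-overlapping 2x2
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(784, activation="relu"),   # 784 hidden units
        tf.keras.layers.Dense(10, activation="softmax"), # one output per digit
    ])

model = build_reduced_mnist_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train[..., None] / 255.0, y_train, epochs=5,
#           validation_data=(x_test[..., None] / 255.0, y_test))
```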

3. Result and Analysis

The comparison of the Perceptron and k-NN for data with and without image distortion is shown below:

Figure 5. Perceptron and k-NN without image distortion

For both tests, k-NN generated better and more robust results in comparison with the Perceptron. The reason lies primarily in the fundamental difference between the two methods. k-NN compares each test sample with every training sample, and it does this for all test samples, which means it has a time complexity of O(n²), while the Perceptron only goes through the data once and adjusts its weights once, unless it is run for multiple epochs. Thus, k-NN covers all possibilities better than the Perceptron.

Figure 6. Perceptron and k-NN with image distortion

In the second set of tests, the results from LinearSVC, SVC, and NuSVC are shown below:
Figure 7. Correct rate of LinearSVC, SVC, and NuSVC

As shown in Figure 7, all three methods converge at about the same rate. The convergence rate of NuSVC was the fastest at the beginning, but as the number of samples increases, SVC starts performing slightly better. There might be a slight variation between the performance of these three methods, but the main difference is not in their convergence.

The main difference between these methods is in their runtime. As shown in Figure 8, the run time of LinearSVC is 6 seconds, SVC 40 seconds, and NuSVC 135 seconds, and as the number of training samples increases, the run time also increases exponentially. The reason lies in the complexity of SVC and NuSVC in comparison with LinearSVC. By comparing the results from these three methods it can be concluded that the MNIST dataset is not particularly non-linear or inseparable, since the linear and nonlinear methods have the same convergence rate; thus, using the nonlinear SVC is not recommended for MNIST.

Figure 8. Run time of LinearSVC, SVC, and NuSVC

As for the CNN, as mentioned earlier, the accuracy observed on the full MNIST dataset with this neural network was 97.33%. On the reduced 10-MNIST dataset the accuracy was 76.78%, and 63.52% accuracy was obtained with one-shot learning.

4. Summary

In this paper several methods, including a Perceptron, k-NN, SVM, and a CNN, were examined. k-NN gives good results for small datasets, but as the dataset gets larger it takes more and more time, since its complexity is O(n²). k-NN gives much better results in comparison with a single-layer Perceptron. SVM provided promising results in a shorter amount of time using all three methods; however, LinearSVC is suggested for the MNIST dataset instead of the nonlinear SVMs, as the MNIST dataset was not found to be nonlinear or inseparable. The CNN has the highest runtime, but it outperforms all other methods with 97.33% correct predictions.

5. Future Work

To improve this work a few methods were proposed which could improve performance and results. To improve the neural network capabilities, a machine able to train on a GPU would be preferred over a CPU; most papers reviewed for this work trained their neural networks on GPUs at increased performance. In addition to using a GPU, many papers suggested using multiple machines able to learn off of each other.

Additionally, many different image distortion techniques could be used to increase the information in the reduced datasets. These image distortion techniques would allow more information to be gained from a few images.

Once a reliable way to identify MNIST digits on a reduced dataset is established, other datasets would be experimented with to see how the algorithms hold up. These datasets include, but are not limited to, the USPS handwritten dataset, Semeion, and Gisette.

6. References

Boser, B. E., Guyon, I. M. & Vapnik, V. N., 1992. A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, p. 144.
Burges, C., 1998. A tutorial on support vector machines for pattern recognition. Knowledge Discovery and Data Mining, 2(2), pp. 1-43.
Byun, H. & Lee, S.-W., 2002. Applications of support vector machines for pattern recognition: a survey. In: Pattern Recognition with Support Vector Machines. Springer, Berlin, Heidelberg, pp. 213-236.
Ciregan, D., Meier, U. & Schmidhuber, J., 2012. Multi-column deep neural networks for image classification. IEEE CVPR.
Rumelhart, D. E., Hinton, G. E. & Williams, R. J., 1986. Learning representations by back-propagating errors. Nature, 323(9), pp. 533-536.

Kimura, F., Takashina, K., Tsuruoka, S. & Miyake, Y., 1987. Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1), pp. 149-153.
Haykin, S., 1999. Neural Networks. Prentice Hall.
Knerr, S., Personnaz, L. & Dreyfus, G., 1990. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Neurocomputing. Springer, Berlin, pp. 41-50.
Holmström, L. et al., 1997. Neural and statistical classifiers—taxonomy and two case studies. IEEE Transactions on Neural Networks, 8(1), pp. 5-17.
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp. 2278-2324.
Trier, O. D., Jain, A. K. & Taxt, T., 1995. Feature extraction methods for character recognition—a survey. Pattern Recognition, pp. 53-60.
Osuna, E., Freund, R. & Girosi, F., 1997. Training support vector machines: an application to face detection. IEEE Conference on Computer Vision and Pattern Recognition.
Pal, M., 2008. Multiclass approaches for support vector machine based land cover classification. arXiv preprint arXiv:0802.2411.
Schölkopf, B., Burges, C. J. C. & Smola, A. J. (eds.), 1999. Advances in Kernel Methods: Support Vector Learning. MIT Press.
Sci-kit, 2018. Support Vector Machines, scikit-learn documentation. [Online] Available at: http://scikit-learn.org/stable/modules/svm.html [Accessed 1 May 2018].
Smola, A. J. & Schölkopf, B., 2004. A tutorial on support vector regression. Statistics and Computing, 14, pp. 199-222.
Vapnik, V., 1996. The Nature of Statistical Learning Theory. Springer.
Yamashita, Y. et al., 1983. Classification of handprinted Kanji characters by the structured segment matching method. Pattern Recognition Letters, 1, pp. 475-479.
