Implementation and Analysis of Different Digit Recognition Methods On Reduced MNIST Dataset
Figure 1. Architecture of LeNet-1 (LeCun, 1998)
(LeCun, 1998) introduced LeNet-5, a seven-layer neural network. An improved version of LeNet was introduced as boosted LeNet-4, a deep learning algorithm that correctly identified the MNIST dataset with a 0.7% error rate. The LeNet-4 architecture is an improved version of the LeNet-1 system.

Compared to LeNet-1, LeNet-4 includes more feature maps and an additional hidden layer that is fully connected to the last feature layer and the output layer. It requires "260,000 multiply/add steps and has about 17,000 free parameters". LeNet-4 without boosting achieves a 1.1% error rate on the testing set. Once this was established, the system was boosted. Boosting refers to using a group of neural nets to improve accuracy. Boosting LeNet-4 involves three learning machines: the first one is trained the normal way; the second one is trained on an even mix of inputs that the first classified correctly and incorrectly; and the third machine is trained on inputs where the first two disagreed. Since LeNet-4 has a 1.1% error rate, at most 2.2% of images would be sent to the second machine, and fewer to the third. To allow more inputs to reach the second and third machines, images were heavily distorted by shifting, scaling, rotating, and skewing. After running the original and distorted images through the three machines, a 0.7% error rate was observed. The total training time for this method was over one month.

(Ciregan, 2012) introduced a multi-column deep neural network method for image classification with a 0.2% error rate on the MNIST dataset. The neural net architecture is described as 1x28x28-20C4-MP2-40C5-MP3-150N-10N. The input images are the 28x28-pixel MNIST digits; since MNIST has only one color channel, each image is represented as a 28x28x1 vector. The next step is a convolution layer consisting of 20 filters of size 4x4, represented by 20C4. After that there is a max pooling layer with a 2x2 filter and a stride of 2, which shrinks the image and is represented by MP2. The next part is another convolution layer of 40 5x5 filters, 40C5, followed again by a max pooling layer with 3x3 filters and a stride of 3, MP3. The last two layers are fully connected hidden layers with 150 and 10 neurons respectively, 150N and 10N. The final 10 neurons are the outputs, one for each digit.
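For illustration, a single column of this architecture can be sketched in a few lines of Python (our sketch, not the authors' original implementation; the use of tf.keras, the tanh activations, 'valid' padding, and SGD training are assumptions, while the layer sizes follow the architecture string above):

import tensorflow as tf

# One column of the 1x28x28-20C4-MP2-40C5-MP3-150N-10N network (a sketch;
# activations, padding, and optimizer are our assumptions).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(20, (4, 4), activation='tanh',
                           input_shape=(28, 28, 1)),       # 20C4 on a 1x28x28 input
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),       # MP2
    tf.keras.layers.Conv2D(40, (5, 5), activation='tanh'), # 40C5
    tf.keras.layers.MaxPooling2D((3, 3), strides=3),       # MP3
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(150, activation='tanh'),         # 150N
    tf.keras.layers.Dense(10, activation='softmax'),       # 10N, one per digit
])
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])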
Once the architecture was established, the original 28x28 MNIST images were modified from their original form and normalized to 10, 12, 14, 16, 18, and 20 pixels. Once the modified datasets were obtained, 35 different deep neural nets with the architecture above were created. Five neural nets were assigned to each of the image distortions, and the results were averaged at the end. With this method an error rate of 0.2% was achieved.

In this paper, the MNIST dataset and the procedure for obtaining the reduced sets are explained first. Then, the different methods implemented in this paper are presented. In section 3 all results are analyzed and discussed. Finally, in section 4 a summary is provided.

2. Methods

In this paper several methods are implemented on the MNIST dataset. These methods include a Perceptron, k-NN, three versions of SVM, and a CNN. A modified MNIST dataset was proposed and used for verification and testing.

2.1. Reduced MNIST Dataset

Figure 2. Normalized sample from MNIST dataset (LeCun, 1998)
The Modified National Institute of Standards and Technology (MNIST) dataset is comprised of 60,000 28x28-pixel images of digits. The set is equally divided among the ten digits.

The set contains 50,000 images for training and 10,000 for testing. This equates to 5,000 training images of each digit and 1,000 images of each digit for testing.

To test the learning algorithms' accuracy with one-shot learning and reduced datasets, additional datasets were created. The n-MNIST datasets were created, where n indicates the number of images of each digit included in the set: 1-MNIST, 5-MNIST, 10-MNIST, 20-MNIST, 50-MNIST, and 100-MNIST. The n pictures of each digit were taken from the first n instances of that digit in the MNIST dataset. This allowed the same datasets to be used as input for every learning algorithm. To test each algorithm's classification proficiency, the full 10,000 testing images were used.
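As an illustration, the n-MNIST subsets described above can be constructed as follows (a minimal sketch, assuming MNIST is loaded through tf.keras.datasets; the helper make_n_mnist is ours, not from the paper):

import numpy as np
import tensorflow as tf

# Load the standard MNIST split (an assumption about the loading path).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

def make_n_mnist(n, images=x_train, labels=y_train):
    # Collect the positions of the first n occurrences of each digit.
    idx = []
    for digit in range(10):
        idx.extend(np.where(labels == digit)[0][:n])
    idx = np.sort(idx)  # keep the original dataset order
    return images[idx], labels[idx]

# Example: 10-MNIST contains 100 images, 10 of each digit.
x10, y10 = make_n_mnist(10)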
2.2. One-Shot Learning

Most image classification methods require a lot of input data to produce meaningful results. One-shot learning attempts to mimic human learning with regard to classification: humans can classify images with very few examples and are able to perform well with only one example. One-shot learning methods take one or a few images of each class and attempt to classify the testing set based on this small amount of information. To test our learning algorithms on one-shot learning, the reduced-size datasets were created.
2.3. K-Nearest Neighbors (k-NN)

The k-NN learning algorithm classifies based on the Euclidean distance between testing images and training images. Each testing image is compared to every training image, the labels of the k closest images are taken, and the most frequent label is used as the predicted label. The k value indicates how many of the closest images are considered to generate a result. When k is 1, the label of the single closest training image is used for the testing image. k is usually chosen to be an odd number to avoid ties. The k-NN method requires no training time; however, each testing image must be compared to every training image, which takes a large amount of time when the dataset is large. For this paper's implementation a k value of 1 was used, as increasing k had a negligible effect on performance.
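The paper does not list its k-NN code; a minimal stand-in using scikit-learn's KNeighborsClassifier with k = 1 and Euclidean distance, reusing the x10, y10 subset and the test arrays from the sketch in section 2.1, could look like this:

from sklearn.neighbors import KNeighborsClassifier

# 1-NN on flattened 784-dimensional image vectors (our sketch, standing in
# for the paper's own implementation).
knn = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
knn.fit(x10.reshape(len(x10), -1), y10)  # "training" only stores the images
accuracy = knn.score(x_test.reshape(len(x_test), -1), y_test)

Note that fit() here is essentially free, while score() pays the full cost of comparing every test image against every stored training image, which is the runtime behavior described above.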
2.4. Support Vector Machine (SVM)

Empirical Risk Minimization (ERM) is the goal of many classical learning approaches, in which the aim is minimization of the error on the training dataset; one of the prominent examples of such methods is neural networks. Structural Risk Minimization (SRM) methods, however, are based on statistical learning theory. A good example of SRM is SVM. The goal of SVM is minimizing the sum of the training error rate.

Figure 3. SVM with the margin and decision boundary

The fundamental idea behind SVM is constructing a decision plane which separates the samples on the positive side of the margin from the samples on the negative side. This idea is shown in Figure 3 (Byun, 2002).

In the case of separable classes, the goal is to find an optimal hyperplane which separates the closest support vectors on the two sides while the width of the margin is maximized. The problem should be solved for all samples $(x_i, y_i)$ such that:

$w \cdot x_i + b \ge +1$ for all samples on the positive side,

$w \cdot x_i + b \le -1$ for all samples on the negative side.

This is an optimization problem in which the quadratic norm of the weights must be minimized:

$\min_{w,b} \tfrac{1}{2}\|w\|^2$  (1)

For some real-life problems the data might not be completely separable. However, there still exists a hyperplane which can separate the classes with minimum misclassification. The objective function for such cases is modified to the following:

$\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$  (2)

where $\xi_i$ and $C$ are the positive slack variables and a user-defined variable that controls the tradeoff between the margin and the misclassification.

Taking the Lagrangian of (2), finding the Lagrange multipliers, substituting them back into the Lagrangian function, and solving yields a quadratic programming (QP) problem for linearly separable data:

$\max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$  (3)

which should be solved with respect to the constraints:
$\sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C$  (4)
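In practice, the three SVM variants compared later in Figure 7 carry the names of scikit-learn's SVM classes (Sci-kit, 2018), so a minimal sketch of that experiment could look like the following (ours; the default hyperparameters are an assumption, not the paper's settings, and the arrays again come from the section 2.1 sketch):

from sklearn.svm import LinearSVC, NuSVC, SVC

# Compare the three SVM variants on flattened, rescaled digit vectors
# (our sketch; default hyperparameters are an assumption).
for clf in (LinearSVC(), SVC(), NuSVC()):
    clf.fit(x10.reshape(len(x10), -1) / 255.0, y10)
    print(type(clf).__name__,
          clf.score(x_test.reshape(len(x_test), -1) / 255.0, y_test))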
2.5. Convolutional Neural Networks (CNN)

For this project a neural network was created to test its ability to classify MNIST digits with a reduced dataset. The framework was created on TensorFlow and written in Python. The network was trained using the machine's CPU. Before testing the performance on reduced and one-shot learning, a reliable neural network needed to be created that could classify digits based on the full dataset with at least 95% accuracy. The initial architecture was taken from (Ciregan, 2012). Recall, as stated earlier, that the architecture was represented by 1x28x28-20C4-MP2-40C5-MP3-150N-10N with 35 parallel neural networks. Since its error rate was 0.23%, an attempt was made to replicate the method.

With the limitations of the hardware available, 35 parallel neural networks, let alone any parallel networks, were too complex to implement. To begin, only one of the parallel networks was replicated. This architecture was replicated on the local machine; however, with only one network running the algorithm the results were not successful. To attempt to increase the performance, more neurons were added to the first hidden layer, boosting it to 784 neurons. This number was chosen because it is equal to the number of pixels in each photo (28x28). This allowed for increased classification accuracy, but not within the threshold. The article also mentioned testing a neural network with 5x5 filters; this was adopted for the first convolution layer, along with boosting the number of filters from 20 …
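A sketch of this single-column variant is given below (ours, assuming tf.keras; the ReLU activations and the Adam optimizer are assumptions, and NUM_FILTERS is a placeholder because the boosted filter count is cut off in the source text):

import tensorflow as tf

NUM_FILTERS = 20  # placeholder: the boosted value is not recoverable here

# Single-column variant described in the text: 5x5 first convolution and a
# 784-neuron fully connected layer, one neuron per input pixel (our sketch).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(NUM_FILTERS, (5, 5), activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(784, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])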
Figure 5. Perceptron and k-NN without image distortion

For both tests, k-NN generated better and more robust results than the Perceptron. The reason is primarily the fundamental difference between the two methods: k-NN compares each node with all other nodes, then moves on to the next node and does this for all nodes, which means it has a time complexity of O(n²), while the Perceptron only goes through the data once and adjusts its weights once, unless it is run multiple times. Thus, k-NN covers all possibilities better than the Perceptron.

[Figure: Perceptron and k-NN, correct rate versus number of samples]
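For completeness, a single-layer Perceptron baseline of the kind compared above can be sketched with scikit-learn (ours; the paper's own Perceptron implementation is not shown). It makes the contrast concrete: the Perceptron adjusts a weight vector over a bounded number of passes, while 1-NN defers all work to prediction time.

from sklearn.linear_model import Perceptron

# Single-layer Perceptron baseline (our sketch), reusing the arrays
# from the section 2.1 sketch.
p = Perceptron(max_iter=10)
p.fit(x10.reshape(len(x10), -1) / 255.0, y10)
print(p.score(x_test.reshape(len(x_test), -1) / 255.0, y_test))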
… be 97.33%. On the reduced dataset 10-MNIST, a 76.78% accuracy was achieved, and a 63.52% accuracy on one-shot learning.

Figure 7. Correct rate of Linear SVM, SVC, NuSVC

4. Summary

In this paper several methods including Perceptron, k-NN, SVM, and CNN were examined. k-NN gives good results for small datasets, but as the dataset gets larger and larger it takes more and more time, as its complexity is O(n²). k-NN gives much better results in comparison with the single-layer Perceptron. SVM provided promising results in a shorter amount of time using all three methods; however, Linear SVM is suggested for the MNIST dataset instead of nonlinear SVMs, as the MNIST dataset was not found to be nonlinear or inseparable. CNN has the highest runtime, but it outperforms all other methods with 97.33% correct prediction.
References

Kimura, F., Takashina, K., Tsuruoka, S. and Miyake, Y., 1987. Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1), pp. 149-153.

Haykin, S., 1999. Neural Networks. Prentice Hall.

Knerr, S., Personnaz, L. and Dreyfus, G., 1990. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Neurocomputing. Springer, Berlin, pp. 41-50.

Holmström, L. et al., 1997. Neural and statistical classifiers—taxonomy and two case studies. IEEE Transactions on Neural Networks, 8(1), pp. 5-17.

LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp. 2278-2324.

Trier, O.D., Jain, A.K. and Taxt, T., 1995. Feature extraction methods for character recognition—a survey. Pattern Recognition, pp. 53-60.

Osuna, E., Freund, R. and Girosi, F., 1997. Training support vector machines: an application to face detection. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Pal, M., 2008. Multiclass approaches for support vector machine based land cover classification. arXiv preprint arXiv:0802.2411.

Schölkopf, B., Burges, C.J.C. and Smola, A.J., eds., 1999. Advances in Kernel Methods: Support Vector Learning. MIT Press.

Sci-kit, 2018. Support Vector Machines. [Online] Available at: http://scikit-learn.org/stable/modules/svm.html [Accessed 1 May 2018].

Smola, A.J. and Schölkopf, B., 2004. A tutorial on support vector regression. Statistics and Computing, 14(3), pp. 199-222.

Vapnik, V., 1996. The Nature of Statistical Learning Theory. Springer.

Yamashita, Y. et al., 1983. Classification of handprinted Kanji characters by the structured segment matching method. Pattern Recognition Letters, 1, pp. 475-479.