
National Institute of Technology Jamshedpur

Department of Engineering and Computational Mechanics

PROJECT REPORT
ON

MNIST DIGIT CLASSIFICATION

Submitted To
Dr. Vijay Kumar Dalla

Submitted By

Premdeep (2022UGCM0016)
National Institute of Technology, Jamshedpur

CERTIFICATE
This is to certify that Mr. Premdeep (Reg. No: 2022UGCM0016), a Vth-semester
student of the Bachelor of Engineering and Computational Mechanics programme of the
National Institute of Technology, Jamshedpur, has carried out the project work entitled
“MNIST Digit Classification” under the guidance of our senior faculty during the
period of July 2024 to November 2024. He successfully completed the project and was
methodical and hardworking throughout this period.

Premdeep
(2022UGCM0016)

This is to certify that the above statement made by the candidate is true and
correct to the best of my knowledge and belief.

Prof. Sanjay                                        Dr. Vijay Kumar Dalla
Head of the Department                              (Supervisor)
Department of Mechanical Engineering
National Institute of Technology, Jamshedpur
ACKNOWLEDGEMENT

I would like to express my sincere gratitude to all those who contributed to the
completion of this research work titled “MNIST Digit Classification”. I am
deeply thankful to my supervisor, Dr. Vijay Kumar Dalla, whose guidance,
support, and valuable insights were instrumental throughout this research
project. His expertise and encouragement have been invaluable.
I extend my appreciation to the participants of this study, whose cooperation
and willingness to participate were crucial in gathering meaningful data and
drawing insightful conclusions.
Lastly, I acknowledge the contributions of all the authors, researchers, and
institutions whose work and publications have enriched the literature and
served as a foundation for this study.
ABSTRACT

Accurate handwriting recognition has long been a challenging computer vision problem,
because static feature analysis of text images is often inadequate to account for the
high variance in handwriting styles across people and the poor image quality of
handwritten text. Recently, with the introduction of machine learning, especially
convolutional neural networks (CNNs), the recognition accuracy for various
handwriting patterns has steadily improved. In this paper, a deep CNN model is
developed to further improve the recognition rate on the MNIST handwritten digit
dataset with a fast-converging training process. The proposed model uses a
multi-layer deep architecture, including three convolution and activation layers for
feature extraction and two fully connected layers (i.e., dense layers) for classification.
The model’s hyperparameters, such as the batch size, kernel sizes, batch
normalization, activation function, and learning rate, are optimized to enhance the
recognition performance. The average classification accuracy of the proposed
methodology reaches 99.82% on the training dataset and 99.40% on the
testing dataset, making it a nearly error-free system for MNIST recognition.
INDEX
i. Abstract
1. Introduction
2. The Proposed Approach for MNIST Handwritten Digit Classification
2.1 Deep Neural Network Design
2.2 Model Hyperparameter Optimization Strategy
2.2.1 Batch Size Analysis and Selection Method
2.2.2 Learning Rate Decay Technique
2.2.3 Techniques to Prevent Overfitting
2.2.4 Model Validation and Evaluation
3. Experiments
3.1 MNIST Handwritten Digit Dataset
3.2 Model Hyperparameter Optimization
3.3 Learning Rate Attenuation and Classification Analysis
3.4 Model Optimization and Performance Evaluation
4. Conclusions
5. References
LIST OF FIGURES

1: The proposed DL-based data processing diagram
2: The architecture of the designed CNN model
3: Training and validation accuracies
4: Some of the predicted digits with their recognition rates
5: Visualization of the training and validation dataset
6: Some of the validation digits with incorrect predictions
7: (a) Numbers with high recognition rates; (b) Numbers with low recognition rates
1 Introduction
The Modified National Institute of Standards and Technology (MNIST)
handwritten digit database, one of the most important subjects of research in pattern
recognition, has excellent research and practical value. Generally speaking,
handwriting classification techniques can be divided into statistical
feature-related methods and structural feature-related approaches [1–7]. The former
usually draws on features such as the beginning and end points, intersections,
contours, and unevenness, whereas the latter relies mostly on the density and feature
area of handwritten strokes and finds it easier to mitigate the impact of irregular
writing. However, traditional image processing algorithms focusing on static feature
analysis at the pixel level meet tremendous difficulties in MNIST handwriting
recognition for two primary reasons: 1) the glyph information for Arabic
numerals is scarce and may share common features across
different numbers; 2) the handwriting varies dramatically from person to person, and
some features may only occur under very specific but rare scenarios. These
challenges mandate the use of dynamic information, such as the coordinates of the
stroke trajectory points [8–13]. In recent years, there has been a growing interest in
applying prevailing machine learning (ML) techniques to handwriting classification,
with outstanding performance and generalization capabilities. By combining the
aforementioned handwritten features with traditional classification methods, such as
k-Nearest Neighbor (k-NN) with histogram of oriented gradients (HOG) features
[5,11], statistical classification models [14–18], support vector machines (SVM) [19–
21] and clustering [22,23], the classification and recognition accuracy has been
substantially improved.
In comparison to these approaches that rely on manual feature extraction,
convolutional neural network (CNN)-based deep learning (DL) architectures, which
are also known as deep neural networks (DNNs), can automatically extract the
implicit correlations within and amongst data to find useful patterns [5,24–26]. DNNs
can process both shallow and deep features of the data in a thorough manner to
produce more abstract, higher-level features with stronger semantic information
[27–32]. Additionally, they are easier to design, modularize, and modify for different
applications. By replacing the fully connected layer section of CNNs with Gated Recurrent
Units (GRUs), Nguyen et al. merged CNNs and GRUs and achieved a recognition
accuracy of up to 99.21% [20]. Typically, in order to achieve a higher recognition
rate, a more complex algorithm is required, which demands more computational time
and higher space usage. However, an increasing number of hidden layers in the
DNN may degrade the network’s capacity for generalization. As a result, using
CNNs to accomplish error-free MNIST recognition remains difficult, and no
methodology has yet reached 100% classification performance across distinct
character traits.
This paper is organized as follows. The key CNN design method is introduced in
Section 2, focusing on the design of the feature extraction, the classification method,
hyperparameter optimization, the overfitting prevention strategy, and the model
verification strategy. The model structure design and optimization analysis,
particularly the experimental results of model performance validation, are discussed
in Section 3. Finally, Section 4 concludes the paper.

2 The Proposed Approach for MNIST Handwritten Digit Classification


The core steps of a CNN-based recognition model primarily include feature extraction,
classification output, and backpropagation to update the parameters of the network. The
overall algorithm design process is presented in Fig. 1, where training, validation
and testing share a similar process, except that they use different datasets and the
trained parameters are fixed during the validation and testing process. The validation
samples are utilized for cross-validation at the end of each training epoch/iteration.
In order to maximize CNN classification accuracy and feature extraction for
MNIST recognition, a multi-layer deep architecture is constructed upon Keras
and/or TensorFlow, which includes three convolution and activation layers for feature
extraction and two fully connected layers (i.e., dense layers) for classification (Fig.
2). The optimization strategy for the hyperparameters (e.g., batch size, kernel sizes,
batch normalization, activation function, dropout rate, etc.) is illustrated afterwards
to get the best performance from the model.

2.1 Deep Neural Network Design


CNNs rely on multiple convolutional layers and non-linear layers (e.g.,
activation layers) for feature extraction, and key features can be retained through
feature reduction techniques (e.g., max pooling). The detailed steps can be specified
as follows: 1) convolution layers can be employed multiple times throughout the
DNN; each convolution layer can be individually implemented by the
Keras/TensorFlow built-in two-dimensional (2D) convolution function, with an
appropriate kernel size, stride, and input and output channels; “padding” is set to
‘SAME’ so that the output of the convolution function remains the same
size as the input; 2) each neuron/kernel in each CNN layer is initialized with a weight
matrix and a bias scalar; to avoid the vanishing gradient problem and break the
symmetry of neurons, a truncated normal distribution with a standard deviation of
0.1 is employed to initialize the weight matrix; in addition, to avoid the
“death” of a neuron node (i.e., an output that is always 0), a small positive number
(e.g., 0.1) is used to initialize the bias term; 3) batch normalization may be
exploited to regulate the convolved results within a certain range, to facilitate the
following Rectified Linear Unit (ReLU) activation function in extracting the
features; 4) max pooling and/or other data reduction techniques are optionally
applied to reduce the data dimension and computational cost, as well as to retain the
key features.
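For illustration, the following is a minimal sketch of how steps 1)–3) above could be expressed with the Keras functional API; the helper name conv_block and its exact arguments are illustrative and not taken from the report.

```python
import tensorflow as tf

# Sketch of one convolution block following steps 1)-3): weights drawn from a
# truncated normal distribution (std 0.1), bias initialized to a small positive
# constant (0.1), 'SAME' padding to preserve the spatial size, then batch
# normalization and a ReLU activation.
def conv_block(x, filters, kernel_size, strides):
    x = tf.keras.layers.Conv2D(
        filters, kernel_size, strides=strides, padding="same",
        kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.1),
        bias_initializer=tf.keras.initializers.Constant(0.1))(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)
```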

Figure 1: The proposed DL-based data processing diagram

The output of the last convolutional layer contains the extracted features, which
are flattened into a one-dimensional tensor and then forwarded to two fully connected
(i.e., dense) layers. A Softmax layer is connected after the last fully connected layer
to predict the likelihood of each category for each sample. The class with the highest
probability is chosen as the sample’s predicted category.
In this study, three convolution layers are exploited for feature extraction (Fig.
2). The first layer is equipped with a 3 × 3 kernel and a stride of (1, 1), while the
other two use a 6 × 6 kernel and a stride of (2, 2) for two main purposes: 1)
speeding up the convergence rate with a larger kernel size in the deeper layers, and 2)
using a larger stride as an alternative to a max pooling layer for data dimension reduction.
After the three convolution layers, each of which is accompanied by a batch
normalization and an activation layer, a total of 1568 features are extracted
and fed to two dense layers for classification. Finally, the output layer has 10 nodes
for the 10 digit categories.
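As a concrete illustration, the architecture of Fig. 2 could be sketched in Keras roughly as follows. The filter counts (12, 24, 32) and the 200 hidden nodes are taken from Section 3.2; the initializers of the previous sketch are omitted for brevity, so this is an outline rather than the exact model used in the report.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Outline of the three-convolution / two-dense architecture shown in Fig. 2.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(12, 3, strides=(1, 1), padding="same"),  # -> 28x28x12
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.Conv2D(24, 6, strides=(2, 2), padding="same"),  # -> 14x14x24
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.Conv2D(32, 6, strides=(2, 2), padding="same"),  # -> 7x7x32
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.Flatten(),                       # 7*7*32 = 1568 extracted features
    layers.Dense(200, activation="relu"),   # first dense layer
    layers.Dropout(0.4),                    # dropout rate from Section 3.2 iii)
    layers.Dense(10, activation="softmax")  # 10 digit categories
])
model.summary()
```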

2.2 Model Hyperparameter Optimization Strategy


Once a CNN backbone architecture is established, the detailed hyperparameters
(e.g., batch size, learning rate, etc.) need to be tuned to find the configuration of
this CNN model that best fits the training and validation datasets, with the aim of
achieving the best performance (e.g., accuracy, loss) on other general data of the
same kind.
2.2.1 Batch Size Analysis and Selection Method
The training batch size frequently has a significant impact on the training
outcome. Too small a batch size results in long training times and may even generate
severe gradient oscillations, which make the model converge slowly or not at all.
In contrast, too large a value may cause the model to converge to a local
rather than a global optimum, making it challenging to reach the ultimate
ideal solution through training. Thus, by appropriately adjusting the batch size, it is
possible to effectively increase memory utilization, increase the parallel efficiency
of high-dimensional matrix multiplication, and reduce training time and the likelihood
of gradient oscillation, pushing the model convergence towards the optimal result.

Empirically, the global batch size can be set as Batch_size = 128 * Number of
Accelerators (e.g., parallel network replicas); the tf.data input pipeline, combined
with a distribution strategy, will then split the global batch among all replicas.
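A minimal sketch of this rule of thumb, assuming a tf.distribute strategy is used for the parallel replicas; the report does not show this code, so the details are illustrative.

```python
import tensorflow as tf

# Global batch size = 128 * number of accelerators (replicas); distributing the
# dataset through the strategy splits each global batch across the replicas.
strategy = tf.distribute.MirroredStrategy()
global_batch_size = 128 * strategy.num_replicas_in_sync

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(60_000).batch(global_batch_size)
dist_dataset = strategy.experimental_distribute_dataset(dataset)
```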
Figure 2: The architecture of the designed CNN model
2.2.2 Learning Rate Decay Technique
When mini-batch gradient descent is used for training, the gradient estimates are
noisy; with a fixed learning rate, the error declines toward a minimum, but the
optimization may eventually oscillate around, rather than truly converge to, the
optimal point. Thus, the learning rate should be gradually decreased throughout the
training process to minimize this problem. This work uses learning rate attenuation
to strike a compromise between rapid convergence and stability. A relatively large
initial training step size can be utilized to speed up convergence at the early stage,
and it is then gradually decreased so that the optimization approaches the global
optimum, or swings back and forth in the vicinity of the optimum, improving the
training accuracy at convergence. This prevents gradient descent from falling into
poor local minima or even diverging.
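A sketch of such a schedule in Keras, using the initial rate (0.001) and decay factor (0.99) reported in Section 3.3; how often the rate is decayed (decay_steps) is an assumption, since the report does not state it.

```python
import tensorflow as tf

# Exponential learning-rate decay: start relatively large, then shrink the rate
# as training proceeds so the optimizer settles near the optimum.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,   # value from Section 3.3
    decay_steps=100,              # assumption: decay every 100 steps
    decay_rate=0.99)              # value from Section 3.3
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```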
2.2.3 Techniques to Prevent Overfitting
Generally speaking, deep learning networks are susceptible to overfitting:
the model’s performance on both the training and validation datasets
improves over the early training iterations, but after a certain number of iterations it
drops on the validation dataset while continuing to rise on the training dataset. This
indicates that the model fits the training samples too closely, limiting its
generalizability. In addition, hard encoding of sample labels (e.g., one-hot encoding)
may unnecessarily increase the difference between the full-probability and zero-
probability categories, causing the prediction model to rely excessively on the
predicted category, thereby increasing the likelihood of overfitting.
Common approaches to ameliorate this issue include 1) reasonable data fitting
and 2) using one or more “Dropout” layers to nullify a random portion of the trained
parameters in the CNN during each training epoch. This work analyzes the incorrect
predictions on the dataset to avoid overfitting and achieve the highest classification
accuracy. Whenever the model’s performance on the validation set begins to
deteriorate during the training process, training is interrupted and the
trained parameters are preserved to avoid the overfitting induced by excessive
training. By doing so, a model with the minimum validation loss and potentially
better test outcomes can be obtained. In this study, if the validation loss stops
dropping and starts to plateau or increase for two consecutive training epochs, the
training process is stopped, and the epoch at which the validation loss is lowest, or
the validation accuracy is highest, is recorded to ease the reloading of the model
parameters for deployment.
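This rule maps naturally onto standard Keras callbacks; the following sketch assumes that mapping (the report does not show the callback code) and uses an illustrative checkpoint file name.

```python
import tensorflow as tf

# Stop training when the validation loss has not improved for two consecutive
# epochs, keep the best weights, and save the best model to disk so it can be
# reloaded for deployment.
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=2, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint(
        "best_mnist_cnn.keras", monitor="val_loss", save_best_only=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=callbacks)
```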
2.2.4 Model Validation and Evaluation
In order to evaluate the built model, the cross-entropy of Eq. (1) is employed as
the cost function to quantify the loss of the model outcomes. This index is
minimized over the training process to improve the recognition ability of the model.

H_{y'}(y) = -\sum_{i \in \mathcal{D}} y'_i \log(y_i)   (1)

where y and y' are the model's predicted classifications and the real classifications
(i.e., ground truths or labels) of the samples, respectively, and i ∈ D is the sample
index over the dataset D. The accuracy rate and loss are calculated separately
for each category, and these two indicators describe the efficacy of the
proposed model more comprehensively.
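A small numerical sketch of Eq. (1) for a single sample; the probability values below are made up for illustration.

```python
import numpy as np

# Cross-entropy between a one-hot label y' and a softmax output y (Eq. 1).
y_true = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)    # digit "3"
y_pred = np.array([0.01, 0.01, 0.02, 0.90, 0.01, 0.01,
                   0.01, 0.01, 0.01, 0.01])                        # model output
loss = -np.sum(y_true * np.log(y_pred))
print(loss)   # -log(0.90) ~= 0.105
```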

3. Experiments
3.1 MNIST Handwritten Digit Dataset
The MNIST database is one of the most classical image datasets in the field of
machine learning and is broadly utilized for benchmarking in image
classification. The MNIST database contains 60,000 training samples and 10,000
testing samples (Table 1), each consisting of a 28*28-pixel grayscale image of a
handwritten Arabic digit. The digit in each image is size-normalized and centered.
The 60,000 training samples are further divided into a 55,000-sample training set
and a 5,000-sample validation set.

Table 1: MNIST dataset

Dataset object         Sample amount   Role
Data_sets.train        55,000          Training dataset
Data_sets.validation   5,000           Validation dataset
Data_sets.test         10,000          Testing dataset
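The report does not show the data-loading code; the sketch below uses the Keras built-in MNIST loader and holds out the last 5,000 training samples for validation, which is an assumption about how the 55,000/5,000 split was made.

```python
import tensorflow as tf

# 60,000 training + 10,000 testing samples, each a 28x28 grayscale image.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Split the training set into 55,000 training and 5,000 validation samples.
x_val, y_val = x_train[55000:], y_train[55000:]
x_train, y_train = x_train[:55000], y_train[:55000]
print(x_train.shape, x_val.shape, x_test.shape)
# (55000, 28, 28) (5000, 28, 28) (10000, 28, 28)
```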
3.2 Model Hyperparameter Optimization
i) Placeholder and parameter setting.
A placeholder is created for each input image and its label: X represents the one-
dimensional (1*784) vector associated with a 28*28 image, and Y represents the
corresponding label (i.e., the ground truth of the classification). Subsequently, the one-
dimensional image vector is transformed into a two-dimensional matrix, i.e., the
image data vector of 1*784 is converted back into the original 28*28 structure. Since
MNIST is a grayscale image dataset, the number of color channels per image is 1 (3 for
RGB images). To allow training and testing with different numbers of images, the
batch dimension of the conversion is set to −1, indicating an indefinite number, for
automatic matching of the number of images.
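A sketch of this step using the TensorFlow 1.x-style placeholder API that the description implies; the variable names X and Y follow the text, while everything else is illustrative.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# X: flattened 1*784 image vector; Y: one-hot label (ground truth).
X = tf.compat.v1.placeholder(tf.float32, [None, 784])
Y = tf.compat.v1.placeholder(tf.float32, [None, 10])

# Convert the 1*784 vector back to a 28*28 matrix with a single grayscale
# channel; -1 leaves the number of images indefinite for automatic matching.
X_image = tf.reshape(X, [-1, 28, 28, 1])
```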
ii) The design of the convolutional and activation layers.
In the first layer, the size of the convolution kernel is set to 3 × 3, and the
weight and bias terms are initialized. The number of output channels is 12, so 12
different features are extracted. Then, the inner product of the convolution kernel and
the input is computed, and the bias term is added to the convolution result. A batch
normalization layer is employed to regulate the convolution results, followed by a
ReLU activation function/layer for non-linear processing and feature extraction. The
second layer is also a combination of convolution, batch normalization and activation
functions, but with a kernel size of 6*6 and a stride of (2, 2) in the convolution layer,
which reduces the image tensor size by half, i.e., to 14*14. In addition, the number of
features is increased to 24. The third layer is similar to the second in kernel size and
stride, but is extended to 32 features in total. In the end, the output of the third layer
is flattened to a 1*(7*7*32) = 1*1568 tensor and fed to the following fully connected
layers for classification.
iii) Fully connected layers and dropout layer for classification and overfitting
reduction.
The first fully connected layer has 200 hidden nodes, and a ReLU activation
function is applied afterwards to give the biased inputs nonlinear
characteristics. Dropout is employed during training to randomly discard 40%
of the trained neurons (i.e., weights and biases) to reduce overfitting. The output of
the second dense layer is connected to a Softmax classifier to obtain the probability
of each classification category, and the class with the highest probability is
selected as the predicted class of the corresponding sample. When the neural network
model is validated, all nodes are retained to obtain the best predictive classification
performance.
3.3 Learning Rate Attenuation and Classification Analysis
As mentioned above, a variable learning rate should be considered to achieve
the optimum training outcome; in this paper, exponential decay is used as the
learning rate decay method. The initial learning rate is set at 0.001 and the learning
rate decay factor is set at 0.99. As a result, the recognition accuracy of the model
is improved by 2.9% on the test set, while the loss is drastically reduced.
Improvements in accuracy for the various numbers/classes are also observed. This
shows that learning rate decay can effectively improve the recognition accuracy on
the MNIST handwritten dataset.

3.4 Model Optimization and Performance Evaluation


By defining the cross-entropy loss function and a small initial learning rate (i.e.,
1*10−3), the Adam optimizer is used to automatically minimize the loss function
during the training process. In this process, the loss is backpropagated to adjust the
network parameters to better fit the training sample data. Here, the batch size is set
to 1000, which means 1000 training samples are sent to the model at a time for
training with stochastic gradient descent. A proper batch size can reduce
computational overhead while still generalizing the overall characteristics of the
dataset. The dropout rate is set to 0.4. One can see that the training loss and accuracy
converge within 10 iterations (Table 2), and each iteration uses the 5,000 validation
samples for cross-validation (Table 1 and Fig. 3). Some of the predicted digits with
their recognition rates are shown in Fig. 4. Over the training process, the model’s
classification accuracy improves and the loss decreases, yet the best validation
performance is achieved at Epoch 8 (Fig. 3). Visualization of the training and
validation dataset is provided in Fig. 5, and some of the validation digits with
incorrect predictions are provided in Fig. 6.
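Put together, the training configuration of this subsection could be sketched as follows, assuming the model, data splits, learning-rate schedule and callbacks from the earlier sketches; this is an outline, not the report's exact training script.

```python
import tensorflow as tf

# Adam optimizer with the decaying learning rate (initial 1e-3), cross-entropy
# loss, batch size 1000, 10 epochs, validation on the 5,000 held-out samples.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss="sparse_categorical_crossentropy",   # labels kept as integers here
    metrics=["accuracy"])

history = model.fit(
    x_train[..., None], y_train,              # add the grayscale channel axis
    batch_size=1000,
    epochs=10,
    validation_data=(x_val[..., None], y_val),
    callbacks=callbacks)
```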

Table 2: Model training accuracy, loss and learning rate

Epoch   Training accuracy   Training loss   Learning rate
1       0.9598              0.1311          1.00e-02
2       0.9874              0.0400          5.01e-03
3       0.9927              0.0238          2.52e-03
4       0.9950              0.0153          1.20e-03
5       0.9966              0.0115          6.25e-04
6       0.9975              0.0090          3.12e-04
7       0.9980              0.0078          1.56e-04
8       0.9982              0.0072          7.81e-05
9       0.9981              0.0068          3.90e-05
10      0.9939              0.0070          1.95e-05
The testing dataset is finally examined to verify the entire training and validation
process, and the model achieves an accuracy of 99.40% and a loss of 0.0171. The
performance of the model on the test set resembles the training results. On the other
hand, the accuracy rates vary across digits. For example, 97% of the number “6”
samples are correctly classified, while “1” reaches 100%. Fig. 7 displays digits that
are challenging to recognize correctly. Additionally, the convolution kernels provide
the feature set of the input pictures, and they can be used to visualize the
characteristics of the input images. However, there is currently no efficient analysis
method for thoroughly evaluating and modelling the significance of each neuron in
the convolutional layers, because they contain many high-dimensional elements that
are difficult to comprehend intuitively. Nevertheless, assessing the features extracted
by each convolution kernel, by analyzing the model with a larger sample and
displaying the output of each layer, is still beneficial. The classification accuracy of
the proposed structure is compared with other state-of-the-art models (Table 3) and
demonstrates a significant improvement.
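A sketch of the test-set evaluation and the per-digit accuracy breakdown discussed above, assuming the model and test split from the earlier sketches:

```python
import numpy as np

# Overall test-set performance.
test_loss, test_acc = model.evaluate(x_test[..., None], y_test, verbose=0)
print(f"test accuracy: {test_acc:.4f}, test loss: {test_loss:.4f}")

# Per-digit accuracy: some digits (e.g., "1") are recognized more reliably
# than others (e.g., "6").
predictions = np.argmax(model.predict(x_test[..., None], verbose=0), axis=1)
for digit in range(10):
    mask = (y_test == digit)
    print(f"digit {digit}: {np.mean(predictions[mask] == digit):.2%} correct")
```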

Figure 3: Training and validation accuracies


Figure 4: Some of the predicted digits with their recognition rates

Figure 5: Visualization of the training and validation dataset

Figure 6: Some of the validation digits with incorrect predictions


Figure 7: (a) Numbers with high recognition rates; (b) Numbers with low
recognition rates

Table 3: Comparison with the benchmark approaches

Methods            Train. acc.   Train. loss   Val. acc.   Val. loss   Test. acc.   Test. loss
GoogLeNet [33]     0.98          0.0064        0.97        0.0029      0.98         0.0014
InceptionV3 [34]   0.95          0.0020        0.95        0.0042      0.96         0.0013
Xception [35]      0.96          0.0035        0.96        0.0141      0.96         0.0019
VGG16 [36]         0.96          0.0065        0.97        0.0391      0.97         0.0024
Basic CNNs         0.98          0.0237        0.97        0.0578      0.98         0.0901
Proposed           0.99          0.0260        0.99        0.0174      0.99         0.0136

4 Conclusions
This study elaborates the details of developing a light-weight DNN for
handwritten digit recognition/classification using the MNIST dataset. The CNN-
based backbone architecture and the methodology of numerous hyperparameter
optimizations, including batch size, learning rate, etc., are explored in detail to
enhance the training and testing outcomes. The developed neural network model is
evaluated upon the Keras/TensorFlow framework, and the overall accuracy of the
developed model reaches 99.4% on the MNIST dataset, comparable with the results
from other state-of-the-art models.

Acknowledgement: We would like to thank the editors and anonymous reviewers for
their thoughtful comments and efforts towards improving our manuscript.

Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to
report regarding the present study.
5. References
1. MNIST dataset (Kaggle):
https://www.kaggle.com/datasets/sachinshrivastv/simple-mnist
2. MNIST Handwritten Digit Classification Based on Convolutional Neural Network
with Hyperparameter Optimization:
https://www.researchgate.net/publication/369265604_MNIST_Handwritten_Digit_Classification_Based_on_Convolutional_Neural_Network_with_Hyperparameter_Optimization
