Image Classification Using CNN
1. INTRODUCTION
The use of software for digital image processing has soared due to the drastic rise in the volume of image data, the wider availability of cost-effective image databases, and the need for human-level object classification accuracy. Enormous progress has been made in image recognition and classification using deep convolutional neural networks and machine learning in the past few years. The human neural system consists of a vast interconnected network of neurons that communicate and exchange inputs with each other to process the information around us. Convolutional Neural Networks (CNNs), analogous to the human neural system, contain neurons with learnable weights and biases, as shown in figure 1. Such a network is trained on a variety of datasets to extract, analyze and classify visual patterns from image pixels.
For example, Content-Based Image Retrieval neural networks can extract visual features of image data such as patterns, edges, colors and shapes, and classify these features to determine visually similar objects or images.
Convolving the image layer by layer is the principle behind image recognition using this kind of neural network. CNNs broadly incorporate convolutional layers, pooling layers, hidden layers and fully connected layers.
The convolutional layer embodies a set of independent filters, and every filter is convolved with the input image separately. We choose a filter and slide it across the whole image, taking the dot product of the filter and each image patch it covers. These filters become the classifying parameters that are learned by the CNN. The output of one convolutional layer, called a feature map, is passed on to further convolutional layers for deeper convolution. The weighted sum of input values is passed to the activation function, which determines the output of a given neuron for a given set of inputs [1].
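As a minimal sketch of this sliding-filter operation, consider the following NumPy example (a hypothetical single-channel image, no zero padding, stride 1; not the paper’s actual code):

import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and take the dot product at each
    position (valid convolution: no zero padding)."""
    n, f = image.shape[0], kernel.shape[0]
    out = (n - f) // stride + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            fmap[i, j] = np.sum(patch * kernel)  # dot product of filter and patch
    return fmap

image = np.random.rand(6, 6)            # toy 6*6 single-channel image
kernel = np.random.rand(3, 3)           # one 3*3 learnable filter
print(convolve2d(image, kernel).shape)  # (4, 4) feature map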
Pooling layers work on each feature map independently; they reduce the overall computation and help avoid overfitting by decreasing the spatial size of the representation. This layer removes redundancy and smoothens computation. The fully connected layers are responsible for classifying and mapping the learned features to the sample classes. Optimization of the network model and increased accuracy can be achieved with the help of loss functions. These functions measure the degree of variability between the predicted and observed values; smaller loss values indicate better models [2]. The input, which is linearly transformed by the neuron’s weights and biases and non-linearly transformed by the activation function, is then passed to the hidden layers for further processing to obtain the output, giving rise to feed-forward propagation.
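The forward pass of a single layer can be summarized in a few lines; the following NumPy sketch (with an assumed ReLU activation and toy shapes) illustrates the linear transform followed by the non-linearity:

import numpy as np

def dense_forward(x, W, b, activation=lambda z: np.maximum(z, 0)):
    """Linear transform by weights and biases, then a non-linear activation."""
    z = W @ x + b          # weighted sum of inputs plus bias
    return activation(z)   # e.g. ReLU, as used later in this paper

x = np.random.rand(4)            # toy input vector
W = np.random.rand(3, 4)         # weights: 3 neurons, 4 inputs each
b = np.zeros(3)                  # biases
print(dense_forward(x, W, b))    # activations passed to the next layer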
2. Theoretical background
Machine learning frameworks have made the demanding task of implementing machine learning models much simpler. These frameworks help in acquiring datasets and provide pretrained models that can be further refined. One such framework is Google’s TensorFlow. Released in 2015, it is an open-source machine learning library that allows dataflow programming across various platforms. It can be used from multiple languages such as C#, C++, Java and R. TensorFlow’s principle lies in deploying tensors to power the learning. This framework is capable of training deep neural networks (in our case, a convolutional neural network) to perform numerous tasks that solve real-world problems, such as image recognition and classification, word embeddings, speech recognition, sentiment analysis, natural language processing and so on [3].
TensorFlow enables users to describe the movement of data through a progression of processing nodes. Distinct mathematical computations are represented by distinct nodes in a graph, and nodes are connected to each other with edges, forming a network. The edges carry multidimensional data arrays, also known as tensors. The underlying mathematical computations are written in C++, while the nodes and tensors are exposed as Python objects. TensorFlow thus allows its users to design neural networks line by line in Python, conveniently composing high-level abstractions.
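A minimal sketch of this graph style, assuming the TensorFlow 1.x API that this description corresponds to (in TensorFlow 2 the same graph would run eagerly or via tf.compat.v1):

import tensorflow as tf  # assumes TensorFlow 1.x

# Nodes represent computations; the edges between them carry tensors.
a = tf.constant(2.0, name="a")
b = tf.constant(3.0, name="b")
c = tf.add(a, b, name="sum")   # a node consuming two tensors, producing one

with tf.Session() as sess:     # the graph is only executed inside a session
    print(sess.run(c))         # 5.0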
Linear Regression: In linear regression, we estimate the values of a target variable on the basis of predictor variables; the relationship between them is established by fitting a best-fit line, called the regression line.
Regression line equation: Y = m*X + c
where ‘Y’ is the dependent variable, ‘X’ the independent (predictor) variable, ‘m’ the slope of the line, and ‘c’ the intercept.
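As a small illustration, the slope and intercept can be fitted by least squares; the toy data below is purely hypothetical:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])                 # predictor variable
Y = 2.0 * X + 1.0 + np.random.normal(0, 0.1, X.shape)   # noisy dependent variable

m, c = np.polyfit(X, Y, deg=1)   # least-squares fit of Y = m*X + c
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")       # close to 2 and 1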
Logistic Regression: We use logistic regression to predict discrete outcomes based on the independent variables. The outputs are probabilities, and thus the range of the output lies between 0 and 1.
The logistic function is given by: h(x) = 1/(1 + e^(-x))
ConvNet:
We implemented the ConvNet with a convolution layer followed by max pooling to down-sample its output. Various activation functions were studied and the ReLU activation function was chosen.
ReLU activation function: y = max(x, 0)
In a convolution layer, if the input image size is ‘n*n’, the filter size is ‘f’, the stride is ‘s’, and the input is zero-padded with ‘p’, then:
Output size: ((n - f + 2p)/s) + 1
In max pooling, if the input is of size ‘w1*h1*d1’, the filter is ‘f*f’ and the stride is ‘s’, then:
w2 = ((w1 - f)/s) + 1
h2 = ((h1 - f)/s) + 1
d2 = d1
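These formulas can be checked with a short helper. The example values below assume the 64*64 inputs, 3*3 filters and 2*2 pooling used later in this paper, with padding p = 1 so that convolution preserves the spatial size (as the 64*64*32 activation map in Section 6 implies):

def conv_output_size(n, f, s=1, p=0):
    """Spatial size after convolution: ((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1

def pool_output_size(w, f, s):
    """Spatial size after max pooling: ((w - f) / s) + 1 (depth is unchanged)."""
    return (w - f) // s + 1

print(conv_output_size(64, 3, s=1, p=1))   # 64: padded 3*3 conv preserves size
print(pool_output_size(64, 2, 2))          # 32: 2*2 pooling with stride 2 halves it
print(pool_output_size(pool_output_size(32, 2, 2), 2, 2))  # 8: after the two remaining poolings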
4. Proposed System
We propose a Convolutional Neural Network for image recognition. Our dataset contains 200 classes (a subset of the ImageNet dataset). Our model has 13 layers: 5 convolution layers, 3 of which are each followed by a max pooling layer, 1 dropout layer, 1 flattening layer, 2 fully connected layers and one softmax layer. The dataset has been divided into 3 parts, a training, a validation and a test set, to avoid overfitting. During training, the convolutional layers are fed input images one after the other from all the classes. After convolution, some neurons are dropped at a dropout rate of 0.8 and the output is flattened before being finally passed to the fully connected layers.
The number of outputs of the second fully connected layer matches the number of classes; these outputs represent the probability of an image belonging to each class. The model is saved during training and is then used on our input image dataset to predict whether a given image belongs to one of the classes the model was trained on. Since labels are supplied along with the training images, the accuracy achieved during training will be greater than the validation accuracy. However, it is important to report the training accuracy in every iteration so as to track how the accuracy on the training dataset improves. Each iteration, or epoch, ends with saving the model for the next iteration and reporting the accuracy.
The input image must be read and pre-processed in a manner identical to the training images in order to obtain predictions. The saved model is restored, and the values of the weights and biases learned during training are used to predict the probability of the input image belonging to each class.
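A sketch of this restore-and-predict step, assuming a TensorFlow 1.x checkpoint; the tensor names ‘x’ and ‘y_pred’ and the file paths are hypothetical and depend on how the graph was saved:

import tensorflow as tf
import numpy as np

with tf.Session() as sess:
    saver = tf.train.import_meta_graph('model.meta')       # rebuild the saved graph
    saver.restore(sess, tf.train.latest_checkpoint('.'))   # load learned weights and biases

    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name('x:0')            # input placeholder (hypothetical name)
    y_pred = graph.get_tensor_by_name('y_pred:0')  # per-class probabilities (hypothetical name)

    image = np.zeros((1, 64, 64, 3))  # preprocessed exactly like the training images
    probs = sess.run(y_pred, feed_dict={x: image})
    print(probs.argmax())             # index of the most probable class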
5. Requirements
5.1 Software and Hardware
Linux and Docker are used to run containers so that TensorFlow can be used without disturbing our computer environment. Python is used to implement the neural network and train the model. A laptop with a reasonably powerful CPU is used to run the model.
5.2 Functional Requirements
Our system needs a dataset of images to train the model; the user is then required to supply an arbitrary image, and the system predicts the class the image belongs to. The image is preprocessed in the same manner as the images used in training. If this succeeds, the system should provide the correct prediction to the user; otherwise it should prompt an error.
5.3 Non-Functional Requirements
Our system requires high maintainability and should be secure. It should not raise errors while reading images, i.e., the dataset should be reliable.
5.4 User Requirements
The user wants the model to predict the correct label for the image with the minimal error possible in an acceptable time.
6. Architecture
We built a Convolutional Neural Network for image recognition, using only 200 classes. The model identifies and separates images of the different classes. Instead of initializing our parameters with zeros, we initialize them from a random normal distribution with mean 0 and a very small standard deviation of 0.05. Our model is made up of 13 layers and has the following architecture:
Convolution Layer 1: It takes images as input and applies convolution with 3*3 Filter Size
and a total of 32 Filters.
Max Pooling Layer 1: It takes the output of Convolution Layer 1 as input and applies Max Pooling with a kernel of size 2*2 and stride 2*2, which reduces the image height and width to half.
Convolution Layer 2: It takes Output of Max Pooling Layer 1 as input and applies convolution
with 3*3 Filter Size and a total of 32 Filters.
Convolution Layer 3: It takes Output of Convolution Layer 2 as input and applies convolution
with 3*3 Filter Size and a total of 32 Filters.
Max Pooling Layer 2: It takes the output of Convolution Layer 3 as input and applies Max Pooling with a kernel of size 2*2 and stride 2*2, which reduces the image height and width to half.
Convolution Layer 4: It takes the output of Max Pooling Layer 2 as input and applies convolution with 3*3 Filter Size and a total of 64 Filters.
Convolution Layer 5: It takes Output of Convolution Layer 4 as input and applies convolution
with 3*3 Filter Size and a total of 64 Filters.
Max Pooling Layer 3: It takes the output of Convolution Layer 5 as input and applies Max Pooling with a kernel of size 2*2 and stride 2*2, which reduces the image height and width to half.
Dropout Layer: It takes the output of Max Pooling Layer 3 as input and randomly drops
neurons with dropout rate = 0.8.
Flattening Layer: This layer reshapes the multidimensional tensor output from the dropout layer into a one-dimensional tensor.
Fully Connected Layer 1: It takes the output of the flattening layer and returns its output after applying ReLU to add non-linearity.
Fully Connected Layer 2: This layer receives the output from the first fully connected layer and returns its output without applying ReLU, as this output holds the score for each class.
Softmax Layer:
The output from the flattening layer is fed into the first fully connected layer with 128 neurons, which applies the operation Y = m*X + c. The ReLU function is applied to the output of the first fully connected layer so as to add non-linearity to the model. The output of the activation function is then treated as input to the second fully connected layer, which also has 128 neurons. The output of this layer is not passed through the ReLU function, as it is the probability score of each class. We use a softmax classifier to convert the scores of each class into a probability distribution and then use cross entropy as our loss function.
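The following sketch expresses this 13-layer architecture in the TensorFlow 1.x layers API. Layer counts, filter sizes, the 0.05-stddev initializer and the 0.8 dropout rate come from the text; the input shape and the use of ‘same’ padding are assumptions, and the final layer is given one output per class so that softmax yields the per-class probabilities described in Section 4:

import tensorflow as tf  # assumes TensorFlow 1.x (tf.layers API)

def build_model(x, num_classes=200, training=True):
    """Sketch of the 13-layer architecture described above (not the exact code)."""
    init = tf.random_normal_initializer(mean=0.0, stddev=0.05)  # as described in this section

    def conv(t, filters):  # 3*3 convolution with ReLU
        return tf.layers.conv2d(t, filters, 3, padding='same',
                                activation=tf.nn.relu, kernel_initializer=init)

    def pool(t):           # 2*2 max pooling, stride 2: halves height and width
        return tf.layers.max_pooling2d(t, 2, 2)

    t = pool(conv(x, 32))                 # Convolution Layer 1 + Max Pooling Layer 1
    t = pool(conv(conv(t, 32), 32))       # Convolution Layers 2, 3 + Max Pooling Layer 2
    t = pool(conv(conv(t, 64), 64))       # Convolution Layers 4, 5 + Max Pooling Layer 3
    t = tf.layers.dropout(t, rate=0.8, training=training)  # dropout rate 0.8, as stated
    t = tf.layers.flatten(t)                               # multidimensional -> 1-D tensor
    t = tf.layers.dense(t, 128, activation=tf.nn.relu,
                        kernel_initializer=init)           # Fully Connected Layer 1
    logits = tf.layers.dense(t, num_classes,
                             kernel_initializer=init)      # Fully Connected Layer 2: class scores
    return logits

x = tf.placeholder(tf.float32, [None, 64, 64, 3])   # assumed input shape
y_true = tf.placeholder(tf.float32, [None, 200])    # one-hot labels
logits = build_model(x)
probs = tf.nn.softmax(logits)   # Softmax Layer: scores -> probability distribution
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_true, logits=logits))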
For calculating the gradients and optimizing the weights, we deploy AdamOptimizer. We try to minimize the cost with a learning rate of 0.0001 (1e-4). The training consists of 100 epochs with mini-batches of 32 images, using a dataset class that provides the next batch of images for training. After each epoch, we calculate the training accuracy, the validation accuracy and the validation loss (the value of the cost function).
Implementation
100,000 images from 200 classes (500 per class) are taken as input to train the network and are fed to the first convolution layer. Each convolution stage consists of convolution followed by max pooling.
The first convolution layer consists of 32 filters of size 3*3 and gives an activation map of 64*64*32, which is fed into max pooling with stride 2*2, decreasing the height and width of the image by half. We then apply the ReLU activation function to introduce non-linearity into the model. The output of the previous layer is fed to the second convolution layer, whose output is fed into the third convolution layer, and the previous process is repeated. The output of these layers is fed into the fourth and fifth convolution layers, where the number of filters is increased to 64, giving an activation map of 8*8*64. The output is then passed to the dropout layer with a dropout rate of 0.8. Dropout is applied to ‘drop out’, or temporarily remove, randomly selected neurons during training. When these randomly selected neurons are dropped out, weight updates are not applied to them on the backward pass and they temporarily do not contribute to the activation of downstream neurons on the forward pass.
The output of the dropout layer is fed into the flattening layer to convert the multi-rank tensor into a rank-1 tensor. This is fed into the first fully connected layer with 128 neurons, the ReLU function is applied to its output, and the result is fed to the second fully connected layer, which passes the data into the softmax classifier, which finally gives the class of the image provided to the network. We run the model with a learning rate of 1e-4, use AdamOptimizer to minimize the cost function, and update the values of the weights and biases through backpropagation.
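A sketch of the training loop described above, continuing the architecture sketch from Section 6; the dataset object with a next_batch method, num_batches, and the validation arrays are assumed stand-ins for the paper’s dataset class and validation split:

# Continues the architecture sketch above; `dataset`, `num_batches`,
# `val_x` and `val_y` are assumed stand-ins, not the paper's actual names.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(100):                              # 100 epochs
        for _ in range(num_batches):
            batch_x, batch_y = dataset.next_batch(32)     # mini-batches of 32 images
            sess.run(optimizer, feed_dict={x: batch_x, y_true: batch_y})
        # report accuracies and validation loss, then checkpoint the model
        val_loss = sess.run(loss, feed_dict={x: val_x, y_true: val_y})
        saver.save(sess, './model')                       # saved after every epoch
        print("epoch %d: validation loss = %.4f" % (epoch, val_loss))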
7. CONCLUSION
This model is able to reach 99% training accuracy, which is a good achievement. The model reaches 96% validation accuracy, and the validation loss reaches 0.12. Initially our model overfitted to a great extent, which was reduced by introducing a dropout layer before the flattening layer. The validation loss could be decreased further by introducing randomization in the dataset. Even though the number of classes was high and the number of images per class was small, our model achieved good accuracy in the end with only 13 layers in the architecture. We can save the model and use it for prediction of images belonging to any of these 200 classes, obtaining the probability score for the image belonging to each class. Our model saw the dataset for the first time and was trained from scratch, so it required a greater number of epochs to achieve good accuracy, which it did.
8. References
[1] Yang, J., & Li, J. (2017). Application of deep convolution neural network. 2017 14th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP). doi: 10.1109/iccwamtip.2017.8301485
[2] Amrutkar, B., & Singh, L. (2016). Efficient content based image retrieval using combination of dominant-color, shape and texture features and K-means clustering. International Journal of Engineering and Computer Science. doi: 10.18535/ijecs/v5i1.10
[3] doi: 10.1016/j.cels.2016.01.009
[4] Ma, J., Wen, Y., & Yang, L. (2018). Lagrangian supervised and semisupervised extreme learning machine. Applied Intelligence. doi: 10.1007/s10489-018-12734
[5] Multimedia Data Classification Using CNN. (2018). International Journal of Recent Trends in Engineering and Research, 4(4), 557-561. doi: 10.23883/ijrter.2018.4274.relqz
[6] Hirano, A., Tsukada, A., Kasahara, K., Ikeda, T., & Aoi, K. (2018). Research on construction and verification of a color representation system of a cataract to assist design. The Japanese Journal of Ergonomics, 54, 2B1-1-2B11.