
Comparative Analysis of Modifications on a Two-Stage Deep Learning Approach to the Classification of Fine-Art Paintings

1st Akbay Yerkebulan
Department of Computational and Data Science
Astana IT University
Astana, Kazakhstan
[email protected]

2nd Almat Kairbek
Department of Computational and Data Science
Astana IT University
Astana, Kazakhstan
[email protected]

Abstract—This paper introduces comparisons of new modifications to a two-stage deep learning approach that improve the style classification accuracy over the existing one. The modification was made in the first stage: instead of dividing the image into five patches, it is divided into nine patches, so as not to lose the data in the central part of each side. A comparative analysis was also made in order to identify how increasing the size of the WikiArt dataset affects the overall accuracy. The method was tested on VGG-16 and ResNet-50.

I. INTRODUCTION

Image classification has two types of labelling. The first is object recognition, i.e., recognising the objects within an image, and the second is semantic categorization, known as labelling based on the mood and emotions shown in the image [1]. While the first type is relatively well explored, the second one still needs more research to reach good results.

Popular paintings are the work of human beings. Every artist brings their own vision to a painting and creates an artwork in a particular style. Sometimes, identifying the style of an art piece is difficult even for experts in fine arts. In the visual arts, style is defined as a set of distinctive elements that can be associated with a specific artistic movement, school or time period [2]. So, style is an important semantic criterion for image classification in fine arts. The ability of a machine to classify styles implies that the machine has learned an internal representation that encodes discriminative features through its visual analysis of the paintings [3], which tells us how far a machine could go if the task were fully achieved.

One of the latest methods found to work best for style recognition is the Convolutional Neural Network (ConvNet, CNN). The CNN is a basic tool for classifying and recognizing objects and faces in photos, as well as for speech recognition. There are many variants of the CNN, such as the Deep Convolutional Neural Network (DCNN), Region-CNN (R-CNN), Fully Convolutional Neural Network (FCNN), Mask R-CNN and others. By using a special operation, the convolution itself, the convolutional neural network simultaneously reduces the amount of information stored in memory, so it can better handle images of higher resolution, and isolates salient image features such as edges, contours or faces. In the next processing layer, these edges and faces can be used to recognize repeatable texture fragments that can be further composed into larger image fragments.

Essentially, each layer of the neural network applies its own transformation. If in the first layers the network operates with notions such as "edges" and "faces", further on the notions of "texture" and "parts of objects" are used. As a result of this elaboration, we are able to correctly classify the picture or identify, in the final step, the object we are looking for in the image.

A classic representative of the CNN is AlexNet [4]. It was the first study in this field to show that networks can smoothly arrange works of art using only style labels, without any further information. It was a big breakthrough in this field: not only did it represent a new way of solving this problem, but the results of identifying art styles also improved. Still, it was not enough to fully understand how to properly recognize style in fine arts.

Later, many other works improved on the results of AlexNet. CNN models became more complex and bigger, with many different layers. VGG16 [5] was created in order to reduce the number of parameters in the convolutional layers. ResNet-50 [6] concentrated on finding a simpler mapping by introducing two types of shortcut connections. While others went in the direction of increasing the number of layers, Inception [7] worked on making the layers wider. All of these efforts improved the models themselves. But there is also another way of improving the overall results of style classification: working with the image itself.
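To make the convolution-isolates-edges point above concrete, here is a toy example (ours, not from the paper) of a 3x3 edge filter applied with SciPy; the response is large only where pixel intensity changes:

```python
import numpy as np
from scipy.signal import convolve2d

# Toy image: dark left half, bright right half -> one vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Sobel-style 3x3 kernel that responds to vertical intensity changes.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

edges = convolve2d(image, kernel, mode="valid")
print(edges)  # non-zero only in the columns straddling the edge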
Most of the standard CNN architectures work with fixed, small input sizes, which means that high-resolution art pieces must go through a downsampling process, which inevitably leads to some loss of information. A recent study shows that this problem can be addressed by using a sub-region approach [1], in which the initial image is divided into smaller regions, and those patches have the standard size required by the CNN. The same study proposes two stages of art style recognition. In the first stage, the image is divided into patches and a deep neural network produces parameters for each patch. In the second stage, the parameter vectors are concatenated into one and a shallow neural network labels the image with the final result. This paper's work is based on this two-stage deep learning approach.
The remainder of this paper is organized into five sections: theoretical background of the experiment in Section II; the proposed method in Section III; implementation of the method and the experiment in Section IV; the results in Section V; and the conclusion in Section VI.

II. THEORETICAL BACKGROUND
This section explores the theoretical background of the key architectures and technologies utilized in the proposed method, including the convolutional neural network models VGG16 and ResNet50. It also explains the term transfer learning and its usage in the experiment.

A. VGG16

VGG16 is a classification algorithm and a type of CNN constructed by Karen Simonyan and Andrew Zisserman (University of Oxford) [5] in 2014. It can classify images of 1000 categories with 92.7% top-5 accuracy, so that, nowadays, it is one of the best-known algorithms for the image classification problem. VGG16 has thirteen convolutional layers, five max pooling layers, and three dense layers, which sum up to twenty-one layers, but only sixteen of them are weight layers, which explains the name VGG16. The input to the ConvNet is a fixed-size 224 × 224 RGB image, as shown in Fig. 1. The architecture of the algorithm focuses not on having a large number of hyper-parameters but on having convolution layers with 3x3 filters and stride 1, always using the same padding, and max pooling layers with 2x2 filters and stride 2. The creators of this model pushed the depth to 16 weight layers, giving approximately 134 million trainable parameters.

Fig. 1. The architecture of VGG16

B. ResNet50

Unlike traditional sequential architectures such as AlexNet and VGG16, ResNet relies on a "building blocks" architecture, which is a collection of micro-architectures building a bigger one. The architecture was introduced in 2015, in the paper "Deep Residual Learning for Image Recognition" written by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, and is based on residual learning [6]. It is a type of learning where the connections between layers are a little different: the output of one earlier convolutional layer connects to the input of another convolutional layer several layers later. The functional block diagram of the architecture is shown in Fig. 2. ResNet50 is 50 layers deep: 48 convolutional layers along with 1 max pooling and 1 average pooling layer. The input size of an image is the same as for VGG16: 224-by-224.

Fig. 2. The architecture of ResNet50

C. Transfer Learning

Transfer learning is the reuse of a pre-trained model as a starting point for a new problem. Nowadays, it is an important part of deep learning. It transfers knowledge from the source domain to the target domain, as shown in Fig. 3, by relaxing the assumption that the training data and the test data must be i.i.d. [8]. The pre-trained model's weights and learned features are transferred to the new model, but the final classification layer of the model is replaced to adapt to the new task. Transfer learning is effective when the pre-trained model has learned generic features that are relevant to the new task, especially when the new task has limited training data. This improves efficiency and alleviates the problem of training on insufficient data.

D. Fine-tuning

Fine-tuning is a process where a pre-trained model is further trained on a new dataset. It is a continuation of the transfer learning process. It allows changing not only the final layers of the pre-trained model, but also getting access to and unfreezing some of the model's earlier layers. Fine-tuning works well when the new task is similar to the pre-trained model's original task, because the model then adapts easily and fine-tunes on the new data effectively.
Fig. 3. Learning process of transfer learning
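As an illustration of the freeze-then-replace-head recipe from Sections C and D, the following is a minimal Keras sketch. The VGG16 backbone and the six-style softmax head come from this paper's setup; the pooling layer, optimizer and learning rates are our assumptions, not the authors' code:

```python
import tensorflow as tf

# Transfer learning: load an ImageNet-pretrained backbone and freeze it.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

# Replace the final classification layer with a new 6-style softmax head.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)

# Fine-tuning: unfreeze the last convolutional block and retrain slowly.
for layer in base.layers[-4:]:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```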

E. Shallow Neural Network

A shallow neural network consists of only one or a small number of hidden layers positioned between the input and output layers [9]. The input layer receives the data, the hidden layer(s) perform computations on it, and the output layer generates the final output.
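To make the three roles concrete, here is a toy NumPy forward pass through a one-hidden-layer network; all sizes and weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))                      # input layer: 4 samples, 10 features
W1, b1 = rng.normal(size=(10, 16)), np.zeros(16)  # hidden layer parameters
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)    # output layer parameters

h = np.maximum(0.0, x @ W1 + b1)                  # hidden layer computation (ReLU)
logits = h @ W2 + b2                              # output layer
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
print(probs.shape)  # (4, 3): one class distribution per sample
```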
Fig. 4. The proposed two-stage classification framework using nine image
patches; i is the analyzed image index.
III. PROPOSED METHOD

A. General Explanation

1) Step 1 - Patch Extraction: As mentioned before, the proposed method is a modification of an already existing one. The two stages are presented as shown in Fig. 4. In the first stage, specifically in Step 1, the image is divided into nine same-sized patches. Before that, the image is scaled according to the input standard of the CNN model. Four patches divide the image into equal squares: upper right, upper left, lower right, lower left. Another four are taken by overlapping 50% of two neighbouring patches: center-right, center-left, center-up, center-down. The last patch overlaps 25% of each of the first four patches.

2) Step 2 - Classifier 1: In Step 2, all 9 patches generated in Step 1 go through the CNN model. The model is trained on the dataset, so that it does not keep its pre-trained values. The size of the last fully-connected layer is determined by the number of artistic styles, and this softmax layer determines the probability of each style, i.e., the probability that the image is painted in that style. After training, the parameters of each patch are taken into Step 3. These parameters form the output vector C_ij:

C_ij = [p_i,j,1, p_i,j,2, ..., p_i,j,L]   (1)

where i is the index of the analyzed input image (i = 1,...,M), j is the patch number (j = 1,...,N), k is the style index (k = 1,...,L), and p_i,j,k is the probability of style k for patch j of image i.

3) Step 3 - Probability Vector Assembling: The probability vectors C_ij generated in Step 2 that belong to the same image (i = 1,...,M) are concatenated into a single vector of probabilities I_i, as given in (2):

I_i = [p_i,1,1, p_i,2,1, ..., p_i,N,1, ..., p_i,1,L, p_i,2,L, ..., p_i,N,L]   (2)

In other words, each image's patch output vectors are concatenated into one image vector, as in (2).

4) Step 4 - Shallow Neural Network: The final concatenated vectors are then used as features for the second stage, specifically in Step 4. As features, they run through the second classifier: the shallow neural network is trained on them again, but this time on the probability vectors, not the images. After training first on the first and then on the second classifier, the final artistic style labels are obtained, and the results show the overall accuracy and loss.
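Step 1's geometry (four quadrants, four 50%-overlapped edge-centered patches, and one central patch overlapping 25% of each quadrant) can be sketched in a few lines of NumPy. This is our own reading of the layout, not the authors' code; the 448-pixel working size is an assumption chosen so that each half-size patch is 224 x 224:

```python
import numpy as np

def nine_patches(img: np.ndarray) -> list:
    """Split an HxWxC image into the nine equal-sized patches of Step 1."""
    h, w = img.shape[0] // 2, img.shape[1] // 2        # patch = half the image
    y = {"top": 0, "mid": img.shape[0] // 4, "bot": img.shape[0] - h}
    x = {"lft": 0, "mid": img.shape[1] // 4, "rgt": img.shape[1] - w}
    coords = [
        (y["top"], x["lft"]), (y["top"], x["rgt"]),    # upper left / upper right
        (y["bot"], x["lft"]), (y["bot"], x["rgt"]),    # lower left / lower right
        (y["top"], x["mid"]), (y["bot"], x["mid"]),    # center-up / center-down
        (y["mid"], x["lft"]), (y["mid"], x["rgt"]),    # center-left / center-right
        (y["mid"], x["mid"]),                          # central 25%-overlap patch
    ]
    return [img[r:r + h, c:c + w] for (r, c) in coords]

patches = nine_patches(np.zeros((448, 448, 3)))        # nine 224x224x3 patches
assert len(patches) == 9 and patches[0].shape == (224, 224, 3)
```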
B. Programming

Fig. 6 illustrates the flowchart of the whole two-stage classification method. First, we prepare the dataset and, through preprocessing, divide it into train, validation and test sets. Then we load the model and its already trained parameters. By utilizing transfer learning, we fine-tune the model on our train and validation sets. After training on them, we save the model. The next step is to get the vector parameters of the trained features: we run the model's predictions on the train set and save those values in a new list. After that we group the values by image patches, concatenate them and store them in one list (e.g., one image is taken as a set of ten patches - the nine patches plus the initial picture - and each patch yields six style probabilities; after concatenation, the new list consists of one list per image, each containing sixty probability values). Starting from the point where we predict on the train set, we simultaneously execute identical steps on the test set. Subsequent to this, we generate a new list of labels, one for each vector in the concatenated train and test lists, where each label corresponds to a particular class represented as an integer from zero to five (according to our six styles). Next, we create a shallow neural network (SNN) and train it on the concatenated train values. The SNN has three layers, as shown in Fig. 5: a fully connected layer, where every neuron is connected to every neuron in the previous layer, with an input size of 60 features; another fully connected layer that has 128 neurons and uses ReLU as the activation function, taking the shape of the previous layer's output as its input; and a third dense layer that has 6 neurons representing the output classes of the model, with softmax used as the activation function.

Fig. 5. Shallow Neural Network
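The SNN of Fig. 5 maps directly onto a few lines of Keras. A minimal sketch with the stated sizes (60 input features, 128 ReLU units, 6 softmax outputs); the optimizer and loss are our assumptions:

```python
import tensorflow as tf

# Stage-2 shallow neural network: 60 concatenated probabilities -> 6 styles.
# The 60-feature input plays the role of the first of the three layers.
snn = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(60,)),
    tf.keras.layers.Dense(6, activation="softmax"),
])
snn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
# snn.fit(train_vectors, train_labels, epochs=50, validation_split=0.1)
```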

The last step is evaluating the model on the concatenated test values against the list of our labels. As a result, we get the overall accuracy and loss of the model.
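With the hypothetical `snn`, `test_vectors` and `test_labels` from the sketch above, this final step is a single call:

```python
# Evaluate the stage-2 classifier on the concatenated test vectors.
loss, accuracy = snn.evaluate(test_vectors, test_labels, verbose=0)
print(f"overall accuracy: {accuracy:.2%}, loss: {loss:.4f}")
```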
IV. IMPLEMENTATION AND EXPERIMENT
A. Dataset
The dataset, "WikiArt all artpieces", was taken from Kaggle [10]. The dataset contained 176,436 pictures in 193 different styles, and the number of pictures per style varied from 1 to 16,083. Therefore, the number of pictures and styles was decreased in the pre-processing stage. Six popular styles were taken, as in Fig. 6: Abstract Expressionism, Cubism, Minimalism, Realism, Surrealism and Symbolism. Some of the artistic styles have similar patterns in their paintings, such as geometrical figures and shapes and color intensity, as in Cubism, Minimalism, Symbolism and Surrealism. But there is also a distinct style, Realism, whose paintings mostly depict people.

Fig. 6. The flowchart of the proposed method

The total size of the dataset used to train the CNN model is 2,130 pictures (dataset 1) and 4,260 pictures (dataset 2). Firstly, two differently sized datasets were taken to see the differences and changes in model training. Secondly, the reason for taking these amounts of pictures is that when each picture is divided into 9 patches, the total size of the dataset increases to 21,300 and 42,600 images, including the initial picture itself. The number of pictures in each style is balanced. The datasets were divided into train, validation, and test sets in the proportion 70/10/20. In dataset 1, the train set has 15,000, the validation set 2,100, and the test set 4,200 image patches. Dataset 2 consists of a train set of 30,000, a validation set of 4,200, and a test set of 8,400. Each set contains six folders, and each style has 355 and 710 images, respectively. There is also a third dataset (dataset 3), which contains 25,560 image patches. This dataset was made by dividing each picture into five patches instead of nine, so in total one picture is taken as a set of six patches. Like the other datasets, this one was also divided into train (18,000), validation (2,520) and test (5,040) sets. Each style folder has 426 image patches.

B. Setup

For training the model, two different CNNs were taken: VGG16 [5] and ResNet50 [6] (Table I). The advantages of the architectures:
• VGG16 uses very small receptive fields and fully-connected layers, with which it showed high scores in the ImageNet Challenge.
• The ResNet architecture does not need to fire all neurons in every epoch. This greatly reduces the training time and improves accuracy. Once a feature is learnt, it does not try to learn it again but rather focuses on learning newer features - a smart approach that greatly improved model training performance.

TABLE I
COMPARISON OF CNN MODELS

Model    | No of layers | Architecture    | Input Size | Parameters (Millions)
VGG-16   | 16           | linear          | 224x224    | 14.8
ResNet50 | 50           | residual blocks | 224x224    | 24.1

In the experiment, 80% of the data was utilized to train the CNN models, while the remaining 20% was used to assess the system's performance. The performance results were measured by accuracy and loss.

The proposed two-stage method described in Section III was used in the experiment. The probability vectors obtained during the first-stage classification of the individual patches of a given input image were assembled into feature vectors, as in (2), and passed to the second-stage classifier to determine the final label for the analyzed image. At the end, the results showed the overall accuracy.
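The 70/10/20 split described in the Dataset subsection might be produced with a sketch like the following. The folder layout, file naming and seed are hypothetical; note that splitting is done per picture so that all ten patches of one picture land in the same set:

```python
from pathlib import Path
from sklearn.model_selection import train_test_split

# Hypothetical layout: wikiart/<style>/<picture-id>_<patch>.jpg
for style_dir in Path("wikiart").iterdir():
    ids = sorted({f.stem.rsplit("_", 1)[0] for f in style_dir.glob("*.jpg")})
    train_ids, rest = train_test_split(ids, test_size=0.30, random_state=42)
    val_ids, test_ids = train_test_split(rest, test_size=2 / 3, random_state=42)
    # 70% train; the remaining 30% is split 1:2 into 10% validation, 20% test.
    print(style_dir.name, len(train_ids), len(val_ids), len(test_ids))
```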
C. Experiment 1: VGG16

In experiment 1, the VGG16 architecture was used for training. Firstly, the experiment was conducted on dataset 1. The dataset was loaded and the first classifier, VGG16, was initialized. The total number of parameters was 14,865,222, but only 150,534 of them were trained; the reason is that only the last softmax layer was used for training and the previous layers were frozen. After that, we fine-tuned the model on dataset 1 and then retrieved 15,000 train and 4,200 test parameter vectors. By concatenating them, 1,500 train and 420 test probability vectors were obtained. As in stage 4 in Fig. 4, the train vectors were used to train the shallow neural network. Lastly, the trained model was evaluated on the concatenated test set and the results were taken. The same steps were used on dataset 2, so that we received 30,000 train and 8,400 test parameter vectors from classifier 1. After concatenating each group of 10 image-patch parameter vectors, 3,000 train and 840 test probability vectors were obtained. Then, by training them on classifier 2 (the SNN) and evaluating, we received the second dataset's accuracy and loss results.
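The retrieve-and-concatenate step used in this and the following experiments can be sketched as below, assuming the hypothetical stage-1 `model` from earlier and patch arrays ordered so that every consecutive group of ten rows belongs to one picture:

```python
import numpy as np

# patches_x: array of shape (n_pictures * 10, 224, 224, 3), grouped by picture.
probs = model.predict(patches_x)        # stage-1 output: (n_pictures * 10, 6)
vectors = probs.reshape(-1, 10 * 6)     # one 60-dim vector per picture, as in (2)
# `vectors` (and the matching labels) now feed the stage-2 SNN.
```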
D. Experiment 2: ResNet50

Experiment 2 was conducted in the same way as experiment 1, but instead of VGG16, the ResNet architecture was used for training. As in the previous experiment, it was first conducted on dataset 1.

E. Experiment 3: Dataset 3

The third experiment was conducted on both models, VGG16 and ResNet50. This experiment focused on comparing the use of ten patches per image against the six patches of the original work. Dataset 3 was trained on both models and then tested on the SNN, so that the results of six and ten image patches can be compared.

V. RESULTS

By implementing the two-stage method and conducting the experiments, the following results were obtained (Table II).

A. Experiment 1 - VGG16

After fine-tuning the model, the first classifier's training accuracy was around 77% with 24% loss on dataset 1, and 79.4% accuracy with 27% loss on dataset 2. The results are almost the same. After training on the second classifier, the accuracy reached around 97% on both datasets.
After testing the SNN model, however, the final accuracies differed between the datasets: dataset 1 showed 58.09%, while dataset 2 reached 63.57%, as shown in Table II. As expected, increasing the number of images can lead to better results. Dataset 2 is twice as big as dataset 1, so accuracy growth was expected, but the change of about 5% is still not as much as it may seem.

TABLE II
COMPARISON OF ACCURACY (%)

            | VGG-16 | ResNet50
2130 images | 58.09  | 37.85
4260 images | 63.57  | 49.52

B. Experiment 2 - ResNet50

The training accuracy of the first classifier, the ResNet50 model, was 37.5% on dataset 1; in comparison with experiment 1, it is half as much. Similar results were obtained on dataset 2: 39.1%. On the SNN model the results increased, but not as much as in experiment 1, showing around 59.9% on training for both datasets. The final testing accuracy, however, decreased dramatically: 37.85% on dataset 1 and 49.52% on dataset 2, as shown in Table II. These are low results compared to experiment 1. We assume the reason is that this ResNet50 setup has a lower number of trainable parameters and that it does not relearn already learned parameters.

TABLE III
COMPARISON OF LOSS

            | VGG-16 | ResNet50
2130 images | 2.91   | 1.41
4260 images | 1.98   | 1.75

C. Experiment 3 - Dataset 3

In experiment 3, as mentioned previously, we took dataset 3 and used the proposed method on it. The results in comparison with dataset 2 are shown in Table IV. Dataset 3 showed 61.21% accuracy on the VGG16 classifier and 47.73% on ResNet50, while dataset 2 showed 63.57% and 49.52%, respectively. These accuracy results show the improvement over the originally proposed method: the growth in accuracy on the VGG16 model is 2.36%, while on ResNet50 it is 1.79%. Even if the gains are small, they are still a notable improvement, because in this kind of semantic art style recognition it is hard to improve the results dramatically.

TABLE IV
COMPARISON OF ACCURACY (%)

           | VGG-16 | ResNet50
6 patches  | 61.21  | 47.73
10 patches | 63.57  | 49.52

VI. CONCLUSION

By increasing the number of image patches from five to nine (six to ten per picture counting the initial image) as input for automatic fine-art style classification, the two-stage deep learning approach was improved. The proposed approach applied two independently trained stages of classification. While the first stage applied a deep CNN trained directly on image data, the second stage used a shallow neural network trained on the class probability vectors generated by the first-stage classifier. The modification of the proposed method increased the accuracy on VGG16 from 61.21% to 63.57% (by 2.36%) and on ResNet50 from 47.73% to 49.52% (by 1.79%) on the dataset containing 4,260 images of six art styles.

We encountered some difficulties during the project. In the first case, we imported our dataset with a lot of pictures of different styles and distributed them first by style and only then into train (70%), validation (10%) and test (20%) sets, which caused the model not to work because the hierarchy was wrong. We then changed this hierarchy: we first divided the data into train (70%), validation (10%) and test (20%) sets and then distributed it by style, so that a picture is referenced first by the name of the set folder and then by the name of the style. In the second case, it was difficult to feed the output of the first classifier into the input of the second classifier. Further research may improve this work by conducting experiments on an enlarged dataset: first by increasing the number of styles, and secondly by increasing the number of images in it.

REFERENCES

[1] C. Sandoval, E. Pirogova and M. Lech, "Two-Stage Deep Learning Approach to the Classification of Fine-Art Paintings," IEEE Access, vol. 7, pp. 41770-41781, 2019.
[2] L. Fichner-Rathus, Understanding Art, 9th ed. Belmont, CA, USA: Wadsworth, 2010, p. 560.
[3] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012.
[4] Y. Bar, N. Levy and L. Wolf, "Classification of Artistic Styles Using Binarized Features Derived from a Deep Neural Network," in Computer Vision - ECCV 2014 Workshops, Lecture Notes in Computer Science, vol. 8925, Springer, Jan. 2015.
[5] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, pp. 1-14, Sep. 2014. [Online].
[6] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 2818-2826, doi: 10.1109/CVPR.2016.308.
[8] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang and C. Liu, "A Survey on Deep Transfer Learning," in Artificial Neural Networks and Machine Learning - ICANN 2018, Lecture Notes in Computer Science, vol. 11141, Springer, Cham, 2018.
[9] C. C. Aggarwal, "Machine Learning with Shallow Neural Networks," in Neural Networks and Deep Learning, Springer, Cham, 2018.
[10] S. Lopes, "WikiArt all artpieces," Kaggle, 2022.
