Fine Tuning
Abstract—This paper compares new modifications to a two-stage deep learning approach for fine-art style classification against the existing approach and shows that they improve accuracy. The modification was made in the first stage: instead of dividing the image into five patches, it was divided into nine patches so that the data in the central part of each side is not lost. A comparative analysis was also carried out to determine how increasing the size of the WikiArt dataset affects overall accuracy. The method was tested on VGG-16 and ResNet-50.
I. INTRODUCTION
Image classification involves two types of labelling. The first is object recognition, that is, recognising the objects within an image; the second is semantic categorization, that is, labelling based on the mood and emotions shown in the image [1]. While the first type is relatively well explored, the second still needs more research to achieve good results.
Paintings are the work of human beings. Every artist brings their own vision to a painting and creates an artwork in a particular style. Sometimes identifying the style of an art piece is difficult even for fine-art experts. In the visual arts, style is defined as a set of distinctive elements that can be associated with a specific artistic movement, school or time period [2]. Style is therefore an important sentiment criterion for image classification in fine arts. The ability of a machine to classify styles implies that it has learned an internal representation that encodes discriminative features through its visual analysis of paintings [3], which indicates how far machines could go if the task were fully solved.
One of the most recent and most effective methods for style recognition is the convolutional neural network (ConvNet, CNN). The CNN is a basic tool for classifying and recognizing objects and faces in photos, and for speech recognition. There are many variants of the CNN, such as the Deep Convolutional Neural Network (DCNN), Region-CNN (R-CNN), Fully Convolutional Neural Networks (FCNN), Mask R-CNN and others. By using a special operation, the convolution itself, the CNN simultaneously reduces the amount of information stored in memory, so it can better handle higher-resolution images, and isolates reference image features such as edges, contours or faces. In the next layer of processing, these edges and faces can be used to recognize repeatable texture fragments that can be further combined into image fragments.
Essentially, each layer of the neural network applies its own transformation. The first layers operate with notions such as "edges" and "faces", while later layers use notions such as "texture" and "parts of objects". As a result of this elaboration, we are able to correctly classify the picture or, in the final step, identify the object we are looking for in the image.
A classic representative of the CNN is AlexNet [4]. It was the first study in this field to show that a network can arrange works of art using only the style labels, without any further information. It was a big breakthrough: not only did it represent a new way of solving this problem, it also improved the results of art style identification. Still, it was not enough to fully understand how to properly recognize style in fine arts.
Later, many other works improved on the results of AlexNet. CNN models became larger and more complex, with many different layers. VGG16 [5] was created in order to reduce the number of parameters in the convolutional layers. ResNet-50 [6] concentrated on finding a simpler mapping by introducing two types of shortcut connections. While others moved towards increasing the number of layers, Inception [7] worked on making the layers wider. These efforts all focused on improving the models themselves. But there is also another way of improving the overall results of style classification: working with the image itself.
Most standard CNN architectures work with small, fixed-size images, which means that high-resolution art pieces must go through a downsampling process. This inevitably leads to some loss of information. A recent study shows that this problem can be addressed with a sub-region approach [1], in which the initial image is divided into smaller regions, and these patches have the standard size required by the CNN. The study also proposes two stages of art style recognition. In the first stage, the image is divided into patches and a deep neural network produces a class probability vector for each patch. In the second stage, these vectors are concatenated into one, and a shallow neural network assigns the final label to the image. This paper builds on this two-stage deep learning approach.
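The sub-region idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used in the experiments: the function name `nine_patch_boxes` is hypothetical, and the 3 x 3 placement (four corners, four side midpoints and the centre) is inferred from the statement that nine patches keep the central part of each side. The 224-pixel patch size matches the CNN input size.

```python
def nine_patch_boxes(height, width, patch=224):
    """Return nine (top, left, bottom, right) boxes laid out on a 3x3 grid.

    Assumed layout: four corners, four side midpoints, and the centre.
    Patches overlap when the image is smaller than 3 * patch on a side.
    """
    if height < patch or width < patch:
        raise ValueError("image must be at least one patch in each dimension")
    # Start offsets along each axis: flush to one edge, centred, flush to the other.
    tops = [0, (height - patch) // 2, height - patch]
    lefts = [0, (width - patch) // 2, width - patch]
    return [(t, l, t + patch, l + patch) for t in tops for l in lefts]


boxes = nine_patch_boxes(600, 800)
for box in boxes:
    print(box)
```

Each box can then be used to slice an image array, for example `img[top:bottom, left:right]`, so that every patch reaches the first-stage CNN at full resolution instead of being downsampled.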
The remainder of this paper is organized into five sections: Section II gives the theoretical background of the experiment; Section III presents the proposed method; Section IV covers the implementation of the method and the experiment; Section V describes the results; and Section VI concludes.
II. THEORETICAL BACKGROUND
This section explores the theoretical background of the key architectures and technologies used in the proposed method, namely the convolutional neural network models VGG16 and ResNet50. It also explains transfer learning and its use in the experiment.
A. VGG16
VGG16 is a CNN for image classification constructed by Karen Simonyan and Andrew Zisserman (University of Oxford) [5] in 2015. It achieves 92.7% top-5 accuracy on ImageNet, which has 1000 categories, and it remains one of the best-known algorithms for image classification. VGG16 has thirteen convolutional layers, five max-pooling layers, and three dense layers, which sum to twenty-one layers, but only sixteen of them have weights, which explains the name VGG16. The input to the ConvNet is a fixed-size 224 × 224 RGB image, as shown in Fig. 1. Rather than relying on a large number of hyper-parameters, the architecture consistently uses convolution layers with 3 × 3 filters, stride 1 and "same" padding, and max-pooling layers with 2 × 2 filters and stride 2. The creators of this model pushed the depth to 16 weight layers, giving approximately 134 million trainable parameters.
Fig. 1. The architecture of VGG16
B. ResNet50
Unlike traditional sequential architectures such as AlexNet and VGG16, ResNet relies on a "building block" architecture, a collection of micro-architectures that together form a larger one. The architecture was introduced in 2015 in the paper "Deep Residual Learning for Image Recognition" by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, and is based on residual learning [6]. In residual learning, the connections between layers are slightly different: the output of an earlier convolutional layer is connected to the input of a later convolutional layer several layers ahead. The functional block diagram of the architecture is shown in Fig. 2. ResNet50 is 50 layers deep: 48 convolution layers along with 1 max-pool and 1 average-pool layer. The input image size is the same as for VGG16, 224 × 224.
Fig. 2. The architecture of ResNet50
C. Transfer Learning
Transfer learning is the reuse of a pre-trained model as a starting point for a new problem. It is now an important part of deep learning. It tries to transfer knowledge from the source domain to the target domain, as shown in Fig. 3, by relaxing the assumption that the training data and the test data must be independent and identically distributed (i.i.d.) [8]. The pre-trained model's weights and learned features are transferred to the new model, but the final classification layer is replaced to adapt it to the new task. Transfer learning is effective when the pre-trained model has learned generic features that are relevant to the new task, especially when the new task has limited training data. This improves efficiency and mitigates the problem of training on insufficient data.
D. Fine-tuning
Fine-tuning is a process in which a pre-trained model is trained further on a new dataset. It is a continuation of the transfer learning process. It allows changing not only the final layers of the pre-trained model but also unfreezing some of the model's earlier layers. Fine-tuning works well when the new task is similar to the pre-trained model's original task, because the model then adapts easily to the new data.
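The difference between transfer learning and fine-tuning comes down to which layers are allowed to update. The framework-free sketch below illustrates one possible freezing schedule on a VGG16-like layer list with the layer counts given in Section II-A. The `Layer` class, the layer names, and the choice to unfreeze only the last convolutional block are assumptions made for illustration; the paper does not state which layers were unfrozen. In a framework such as Keras the same idea is usually expressed by setting each layer's `trainable` attribute.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kind: str              # "conv", "pool", or "dense"
    trainable: bool = True


def vgg16_like_layers():
    """Layer list with the counts from the text: 13 conv, 5 pool, 3 dense."""
    convs_per_block = [2, 2, 3, 3, 3]
    layers = []
    for b, n in enumerate(convs_per_block, start=1):
        layers += [Layer(f"block{b}_conv{i}", "conv") for i in range(1, n + 1)]
        layers.append(Layer(f"block{b}_pool", "pool"))
    layers += [Layer("fc1", "dense"), Layer("fc2", "dense"), Layer("classifier", "dense")]
    return layers


def freeze_for_transfer(layers):
    """Transfer learning: train only the replaced classification layer."""
    for layer in layers:
        layer.trainable = layer.name == "classifier"


def unfreeze_for_fine_tuning(layers, from_block=5):
    """Fine-tuning: also unfreeze the dense layers and conv blocks >= from_block."""
    for layer in layers:
        if layer.kind == "dense":
            layer.trainable = True
        elif layer.kind == "conv":
            block = int(layer.name[len("block")])   # single-digit block index
            layer.trainable = block >= from_block


def trainable_names(layers):
    # Pooling layers have no weights, so they are never reported as trainable.
    return [l.name for l in layers if l.trainable and l.kind != "pool"]


layers = vgg16_like_layers()
freeze_for_transfer(layers)
print(trainable_names(layers))       # only the new classification layer
unfreeze_for_fine_tuning(layers)
print(trainable_names(layers))       # last conv block plus the dense layers
```

Unfreezing from the last block backwards keeps the generic early features (edges, textures) fixed while letting the more task-specific later features adapt to art styles.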
Fig. 3. Learning process of transfer learning
E. Shallow Neural Network
A shallow neural network consists of only one or a small
number of hidden layers positioned between the input and
output layers [9]. The input layer receives the data, the hidden
layer(s) perform computations on it, and the output layer
generates the final output.
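As a rough illustration of how such a network can serve as a second-stage classifier, the sketch below concatenates the class probability vectors of nine patches and passes them through a single hidden layer. It shows only the forward pass with untrained random weights. The hidden size of 32, the ReLU activation, the Gaussian initialisation and all function names are assumptions, not details taken from the experiments; the six classes correspond to the six art styles in the dataset.

```python
import math
import random


def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def shallow_forward(patch_probs, params):
    """Concatenate per-patch class probabilities and apply one hidden layer."""
    x = [p for probs in patch_probs for p in probs]     # 9 patches x 6 classes -> 54 values
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in dense(x, w1, b1)]    # ReLU (assumed activation)
    return softmax(dense(hidden, w2, b2))


NUM_PATCHES, NUM_CLASSES, HIDDEN = 9, 6, 32             # HIDDEN is an arbitrary choice
rng = random.Random(0)
params = (init_layer(rng, NUM_PATCHES * NUM_CLASSES, HIDDEN),
          init_layer(rng, HIDDEN, NUM_CLASSES))

# Stand-in first-stage outputs: uniform probabilities for each of the nine patches.
patch_probs = [[1.0 / NUM_CLASSES] * NUM_CLASSES for _ in range(NUM_PATCHES)]
image_probs = shallow_forward(patch_probs, params)
print(len(image_probs), round(sum(image_probs), 6))
```

In practice the weights would be trained on the probability vectors produced by the first-stage CNN, and the predicted style is the index of the largest entry in `image_probs`.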
Fig. 4. The proposed two-stage classification framework using nine image
patches; i is the analyzed image index.
III. PROPOSED METHOD
In the experiment, 80% of the data was used to train the CNN models, while the remaining 20% was used to assess the system's performance. Performance was measured by accuracy and loss.
The proposed two-stage method was used in the experiment described in Section III. The probability vectors obtained during the first-stage classification of individual patches of ...
A. Experiment 1 - VGG16
After fine-tuning the model, the first classifier's training accuracy was around 77% with a loss of 24% on dataset 1, and 79.4% with a loss of 27% on dataset 2. The results are almost the same. After training the second classifier, accuracy reached around 97% on both datasets. When the SNN model was tested, however, the final accuracy differed between the datasets: dataset 1 reached 58.09%, while dataset 2 reached 63.57%, as shown in Table II. As expected, increasing the number of images can lead to better results. Dataset 2 is twice as large as dataset 1, so a growth in accuracy was expected, but a change of about 5 percentage points is still modest.
B. Experiment 2 - ResNet50
The training accuracy of the first classifier, the ResNet50 model, was 37.5% on dataset 1, about half that of Experiment 1. Dataset 2 gave a similar result: 39.1%. With the SNN model the results increased, but not as much as in Experiment 1: training accuracy was around 59.9% for both datasets. The final test accuracy was considerably lower: 37.85% on dataset 1 and 49.52% on dataset 2, as shown in Table II. These are low results compared to Experiment 1. We assume the reason is that ResNet50 has fewer parameters and that it does not relearn parameters it has already learned.
C. Experiment 3 - Dataset 3
In Experiment 3, as mentioned previously, we applied the proposed method to dataset 3. The results in comparison with dataset 2 are shown in Table IV. Dataset 3 reached 61.21% accuracy with the VGG16 classifier and 47.73% with ResNet50, while dataset 2 reached 63.57% and 49.52% respectively. These results show the improvement over the originally proposed method. The growth in accuracy is 2.36 percentage points on the VGG16 model and 1.79 percentage points on ResNet50. Even though the gains are small, they are still meaningful, because in this kind of semantic art style recognition it is hard to improve results dramatically.

TABLE II
COMPARISON OF ACCURACY (%)
Dataset        VGG-16    ResNet50
2130 images    58.09     37.85
4260 images    63.57     49.52

TABLE III
COMPARISON OF LOSS
Dataset        VGG-16    ResNet50
2130 images    2.91      1.41
4260 images    1.98      1.75

TABLE IV
COMPARISON OF ACCURACY (%)
Patches        VGG-16    ResNet50
6 patches      61.21     47.73
10 patches     63.57     49.52

VI. CONCLUSION
By increasing the number of image patches from six to nine as the input for automatic fine-art style classification, the two-stage deep learning approach was improved. The proposed approach applied two independently trained stages of classification. While the first stage applied a deep CNN trained directly on image data, the second stage used a shallow neural network trained on the class probability vectors generated by the first-stage classifier. The modification increased accuracy on VGG16 from 61.21% to 63.57% (by 2.36 percentage points) and on ResNet50 from 47.73% to 49.52% (by 1.79 percentage points) on a dataset containing 4260 images of six art styles.
We encountered some difficulties during the project. First, we imported a dataset with many pictures of different styles and initially organized it first by style and then into train (70%), test (10%) and validation (20%) subsets, which caused the model not to work because the hierarchy was wrong. We then changed the hierarchy: we first divided the data into train (70%), test (10%) and validation (20%) and then distributed it by style. Thus each picture was identified by the name of the split folder and then the name of the style. Second, it was difficult to feed the output of the first classifier into the input of the second classifier. Further research may improve on this paper by conducting experiments on an enlarged dataset: first, by increasing the number of styles; second, by increasing the number of images in the dataset.
REFERENCES
[1] C. Sandoval, E. Pirogova, and M. Lech, "Two-stage deep learning approach to the classification of fine-art paintings," IEEE Access, vol. 7, pp. 41770–41781, 2019.
[2] L. Fichner-Rathus, Understanding Art, 9th ed. Belmont, CA, USA: Wadsworth, 2010, p. 560.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," n.d.
[4] Y. Bar, N. Levy, and L. Wolf, "Classification of artistic styles using binarized features derived from a deep neural network," vol. 8925, Jan. 2015.
[5] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, pp. 1–14, Sep. 2014. [Online].
[6] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778. doi: 10.1109/CVPR.2016.9
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 2818–2826. doi: 10.1109/CVPR.2016.308.
[8] C. C. Aggarwal, "Machine learning with shallow neural networks," in Neural Networks and Deep Learning. Cham, Switzerland: Springer, 2018.
[9] S. Lopes, WikiArt all artpieces, 2022.
[10] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, "A survey on deep transfer learning," in Artificial Neural Networks and Machine Learning – ICANN 2018 (Lecture Notes in Computer Science, vol. 11141), V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis, Eds. Cham, Switzerland: Springer, 2018.