Research Paper
Siamese Network
1Dushyant Singh, 2Shivam Kumar, 3Yogesh Walecha, 4Astitva, 5Tausif Diwan
1,2,3,4,5Department of Computer Science & Engineering, Indian Institute of Information Technology, Nagpur, India
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
Classification methods for medical images based on Machine Learning and Deep Learning exist, but they only work well when a large amount of labeled data is available, and such datasets are generally not available for medical images. Trained personnel can learn to classify a new disease after seeing only a few relevant images, but a deep learning model trained on so few examples ends up overfitting. That is where few-shot (K-shot) learning is useful: it can learn to classify a new disease class from just a few labeled examples.
For few-shot prediction we use the COVID-19 Radiography dataset, which contains 3 classes: COVID-19, Normal, and Pneumonia. We use a Siamese Network, in which we form pairs of images to train on and label them as similar or not similar. Most of the current literature on Siamese networks in the medical domain selects these pairs randomly; we propose a simple algorithm to find hard pairs, i.e., pairs that are from the same class but have a large Euclidean distance between their feature vectors, and pairs that belong to different classes but have a small Euclidean distance. If there are N images, then binary cross-entropy loss admits N^2 possible pairs, and triplet loss admits N^3 triplets, so random selection alone might miss these hard pairs; we need to select them explicitly.
Training on these hard pairs makes the model more robust. Using a simple Siamese Network as our base model, we found a 2-3% increase in accuracy with hard-pair sampling over the base model. We also compared different CNN architectures, namely VGG-16, ResNet, DenseNet, and MobileNet, and found that VGG-16 and ResNet performed best but had higher training times than MobileNet and DenseNet. Overall, VGG-16 had the best accuracy and MobileNet the lowest training time.
Keywords: Few-shot learning, COVID-19, deep learning, Siamese Network, Transfer Learning, Hard-pair mining
1. INTRODUCTION
This research was undertaken to develop a model for predicting diseases from chest X-rays that can quickly adapt to new classes with just a few training examples. The goal was to find an efficient and accurate method for detecting chest diseases, in particular those with a small number of training images.
Few-Shot Learning trains a learner on several related tasks during a meta-training phase so that it can generalize well to unseen (but related) tasks from just a few examples during the meta-testing phase. N-way K-shot means classifying a new task with N classes and K examples per class.
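As an illustration, episodes for meta-training can be sampled as in the following minimal sketch, which assumes the dataset is a mapping from class label to a list of images:

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query=1):
    """Sample one N-way K-shot task: a support set with K labeled examples
    for each of N classes, plus held-out query examples to classify."""
    classes = random.sample(list(dataset), n_way)        # pick N classes
    support, query = {}, {}
    for c in classes:
        examples = random.sample(dataset[c], k_shot + n_query)
        support[c] = examples[:k_shot]                   # K labeled shots
        query[c] = examples[k_shot:]                     # to be classified
    return support, query
```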
There are four main reasons why few-shot learning is an ideal choice for the medical field:
• Sources of medical images are limited, and the images are not readily available in the public domain.
• Manual labelling of data is time-consuming, not always practical, and requires medical experts.
• Some diseases are rare and simply do not have enough data.
• Few-shot models are more robust when predicting a new disease: they can quickly adapt with just a few training examples.
We use a Siamese Network combined with Transfer Learning. This suits few-shot learning because during the training phase we are not interested in learning class labels; instead, we teach the model to decide whether two images are similar. So, if we have a rare disease X for which we only have a few training examples, and there are other similar disease classes for which we do have a large amount of data, we can use those classes as base classes. Given a query example of the rare disease X, which the model may never have seen before, it can still compare the query with the classes in our support set and select the class with the smallest Euclidean distance to the query image.
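For concreteness, a minimal sketch of this nearest-distance prediction step follows; the embed function and the support-set layout here are illustrative assumptions, not our exact pipeline:

```python
import numpy as np

def predict_class(embed, query_img, support_set):
    """Assign the query to the support class with the closest embedding.

    embed:       function mapping an image to a 1-D feature vector
    support_set: dict mapping class label -> list of labeled example images
    """
    q = embed(query_img)
    best_label, best_dist = None, float("inf")
    for label, examples in support_set.items():
        for img in examples:
            dist = np.linalg.norm(q - embed(img))  # Euclidean distance
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label
```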
We begin by providing background on the problem and prior work in this area. We then present our work and the results of our experiments, where we compare against a base Siamese network model. Finally, the paper concludes with a summary of the implications of our findings and recommendations for future research.
Siamese Networks have become popular in recent years, especially for tasks related to image similarity or text similarity. The
network architecture consists of two identical neural networks that take in two inputs, producing embeddings that are compared
using a distance metric. However, one issue with Siamese Networks is the identification of "hard pairs," or pairs that are difficult
to distinguish.
The identification of hard pairs is crucial for improving the performance of Siamese Networks. Hard pairs represent cases where the network is struggling to learn and may be a source of error; by identifying these pairs, we can retrain the network to improve its performance. The main challenge is determining which pairs are truly hard: simply selecting pairs with large distances may not be effective, as some pairs may have large distances due to noise in the data rather than true differences in similarity.
Several approaches have been proposed for identifying hard pairs in Siamese Networks. These include distance-based sampling
and margin-based sampling. However, the effectiveness of these techniques has not been thoroughly evaluated.
In this research report, we investigate various CNN architectures for Siamese Networks and compare their performance and computation time. We propose a method for identifying hard pairs based on Euclidean distances and evaluate its effectiveness.
We use the COVID-19 Radiography dataset, which contains 3 classes: COVID-19, Normal, and Pneumonia. We use the NIH chest X-ray dataset, which consists of 8 classes, as an auxiliary dataset to add more variety of X-ray classes.
We see only a very small increase in the 3-way setting, as the base model accuracy was already quite high, but the improvement over the base model grows with N: around a 2% increase in accuracy for 10-way and 3% for 20-way. We also compared some popular CNN architectures in terms of accuracy and computational efficiency.
MobileNet and DenseNet are more efficient in terms of time taken per epoch, making them suitable for real-time or resource-constrained applications. On the other hand, VGG-16 and ResNet achieved better accuracy, comparable to each other, but were relatively slower than MobileNet and DenseNet.
Major contributions:
• This is the first work, to our knowledge, that uses a Siamese Network with hard-pair sampling in the X-ray domain.
• Our research provides insights into the effectiveness of various CNN models for few-shot learning via Siamese Networks.
• We propose an effective method for identifying hard pairs. Our findings can help improve the performance of Siamese Networks on similarity tasks.
2. RELATED WORK
There are four main categories of few-shot learning: transfer-learning-based, meta-learning-based, data-augmentation-based, and multimodal-based methods. Transfer-learning-based methods transfer the knowledge learned on a source domain and fine-tune it to the required target task. Meta-learning-based methods employ past prior knowledge to guide the learning of new tasks. Data augmentation is used when the amount of data is small: we augment the data by rotation, cropping, etc., to generate new examples. Multimodal-based methods use auxiliary information such as text, audio, or video to make up for scarce data.
3. PROPOSED WORK
3.1 Dataset and Pre-processing
For pre-processing, we resized all images to 100 x 100 x 3 and applied histogram equalization, which normalizes the contrast of the image and makes darker areas clearer. We use no images of the COVID-19 class during training, as it is our novel class, and we use 30 images each from the remaining classes of the Radiography dataset. To make up for the small amount of data, we use the NIH dataset as an auxiliary source of base classes. Although it has a significant number of images per class, we only took around 30-50 from each class.
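A short sketch of this pre-processing step using OpenCV; the interpolation and scaling choices here are assumptions on our part:

```python
import cv2

def preprocess(path):
    """Resize to 100 x 100 and equalize contrast, as described above.

    Histogram equalization is applied on the grayscale image; the result is
    replicated to 3 channels to match the 100 x 100 x 3 input shape.
    """
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (100, 100))
    img = cv2.equalizeHist(img)                   # normalize contrast
    img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)   # replicate to 3 channels
    return img / 255.0                            # scale to [0, 1]
```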
Fig. 5 Pneumonia class
3.2 Method
We use a Siamese Network combined with Transfer Learning. We chose a metric-based approach over the other meta-learning alternatives, such as optimization-based and model-based methods, because it works very well with transfer learning: if two images are similar, their feature vectors should also be similar, so a good feature extractor directly improves Siamese network performance, and we can obtain a good feature extractor through transfer learning or pretraining on a similar task.
So, by combining transfer learning with the Siamese Network, we can increase performance and save a lot of time, using ImageNet or NIH weights as the initializer.
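As an illustration, a shared subnetwork could be initialized this way with torchvision (assuming a recent torchvision; the 128-dimensional embedding size is an illustrative choice, not our exact configuration):

```python
import torch.nn as nn
import torchvision.models as models

def build_subnetwork(embedding_dim=128):
    """Shared Siamese subnetwork: a VGG-16 backbone initialized with
    ImageNet weights, with the classifier head replaced by an embedding layer."""
    net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    # VGG-16's adaptive pooling yields a 512 x 7 x 7 map (= 25088 features)
    net.classifier = nn.Linear(25088, embedding_dim)
    return net
```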
The COVID-19 Radiography dataset consists of 3 classes: COVID-19, Normal, and Pneumonia.
An important choice we had to make was the combination of base and novel classes. The main goal is to transfer knowledge from the base classes to the novel classes: the feature extractor is trained on the base classes, and few-shot predictions are then made on the novel classes.
For each training pair (xi, xj) with pair label t (0 if the images are from the same class, 1 otherwise), training proceeds as follows:
a. Pass xi and xj through the subnetworks to obtain the output feature vectors yi = f(xi) and yj = f(xj), where f is the shared subnetwork function.
b. Calculate the distance between the feature vectors as d = ||yi − yj||, where ||.|| denotes the L2-norm.
c. Calculate the loss using the contrastive loss function L = (1 − t) · d^2 + t · max(0, m − d)^2, where m is a margin hyperparameter and max(0, m − d) is the hinge term.
d. Compute the gradient of the loss with respect to the feature vectors using backpropagation. Since d = ||yi − yj||, the distance gradients are ∂d/∂yi = (yi − yj)/d and ∂d/∂yj = −(yi − yj)/d, so by the chain rule:
∇yi L = (∂L/∂d) · (yi − yj)/d, and
∇yj L = −(∂L/∂d) · (yi − yj)/d.
e. Compute the gradient ∇W of the loss with respect to the shared weights W of the subnetworks by applying the chain rule through the subnetwork function f, accumulating the contributions from both branches, and update the weights by gradient descent:
W = W − η · ∇W,
where η is the learning rate.
Repeat steps a-e for a fixed number of epochs or until the loss stops improving on a validation set.
Once training is complete, the Siamese Network can be used to predict the similarity between new pairs of inputs by passing
them through the subnetworks and computing the distance between their feature vectors using the similarity function.
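The following PyTorch sketch condenses steps a-e; the optimizer and the shared subnetwork f are placeholders, and autograd carries out the gradient computations of steps d and e:

```python
import torch

def contrastive_loss(d, t, m=1.0):
    """L = (1 - t) * d^2 + t * max(0, m - d)^2, with t = 0 for similar pairs."""
    return ((1 - t) * d.pow(2) + t * torch.clamp(m - d, min=0).pow(2)).mean()

def train_step(f, optimizer, xi, xj, t):
    """One update on a batch of pairs (xi, xj) with pair labels t."""
    yi, yj = f(xi), f(xj)                  # step a: shared subnetwork
    d = torch.norm(yi - yj, dim=1)         # step b: Euclidean distance
    loss = contrastive_loss(d, t)          # step c: contrastive loss
    optimizer.zero_grad()
    loss.backward()                        # step d: backpropagation
    optimizer.step()                       # step e: W <- W - eta * grad
    return loss.item()
```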
3.4 Hard Pairs
Hard pairs are the pairs which are from the same class and have large Euclidian distance between their feature vectors and the
pairs belonging to different classes but have similar Euclidean distance.
There has been previous work on finding hard pairs or triplets for Siamese Networks, such as Ref. [19], but we did not find any such work for Siamese Networks in the medical domain, where the literature mostly relies on generating pairs randomly from the dataset. Another motivation for our method was that the papers we studied on finding hard pairs relied on preparing these pairs beforehand. We therefore came up with a much simpler and less time-consuming approach to finding hard pairs.
In some cases we can rely on selecting random pairs alone, especially when the classes differ greatly from each other; for example, with two classes like elephant and cat, the animals are completely different, so their feature vectors will also differ greatly and there is no need to explicitly mine hard pairs to train on. But with X-rays there is a good chance that X-rays of two different diseases look similar, or that there are large variations within the same class. If we have 2 classes with 100 images each, there are 10^4 cross-class image pairs, and by just randomly selecting pairs there is a good chance we will not capture the hard ones.
Training on hard pairs makes our model more robust and forces it to learn the true distinguishing features.
3.5 Methodology
We do not have to choose hard pairs in every epoch. For most epochs we still pick pairs randomly, but after every 10th epoch we use the model trained so far to find hard pairs equal to half our batch size; the other half is randomly selected. To select hard pairs from the same class, we randomly choose a class label, iterate over all pairs from that class, and use the current model to find which of them have the largest distance between their feature vectors. Note that we do not always select the pair with the largest distance, as it might just be an exception or noise; always selecting the maximum or minimum distance pairs might overfit the model to outliers. Instead, we define a threshold and randomly select among the pairs that cross it.
For finding hard pairs from different classes, it is not practical to generate all possible pairs from the dataset, as that would require too much time and memory. Instead, we select 2 random class labels, form all possible pairs across those two classes, and use the model trained up to the current epoch to find the pairs whose feature vectors are separated by less than our threshold distance.
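A simplified sketch of this mining procedure; the threshold values and the images_by_class layout are illustrative assumptions:

```python
import random
import numpy as np

def mine_hard_pairs(embed, images_by_class, n_pairs, pos_thresh, neg_thresh):
    """Return hard positive and hard negative index pairs.

    Hard positives: same-class pairs with embedding distance above pos_thresh.
    Hard negatives: cross-class pairs with embedding distance below neg_thresh.
    Candidates past the threshold are sampled randomly to avoid overfitting
    to outliers.
    """
    # Hard positives from one randomly chosen class.
    cls = random.choice(list(images_by_class))
    embs = [embed(x) for x in images_by_class[cls]]
    pos = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))
           if np.linalg.norm(embs[i] - embs[j]) > pos_thresh]

    # Hard negatives from two randomly chosen distinct classes.
    c1, c2 = random.sample(list(images_by_class), 2)
    e1 = [embed(x) for x in images_by_class[c1]]
    e2 = [embed(x) for x in images_by_class[c2]]
    neg = [(i, j) for i in range(len(e1)) for j in range(len(e2))
           if np.linalg.norm(e1[i] - e2[j]) < neg_thresh]

    # Sample randomly among pairs that crossed the threshold.
    return (random.sample(pos, min(n_pairs, len(pos))),
            random.sample(neg, min(n_pairs, len(neg))))
```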
Fig.9 Overview of Model
We are not using the classical N-way K-shot setup with N classes and K examples each. As we have only 3 query classes, a 20-way few-shot trial means that if the query example is a COVID-19 image, we generate 20 pairs such that only 1 pair is (COVID-19, COVID-19), and in the remaining 19 pairs the query is paired with some other class, such as Normal or Pneumonia.
We do this to make the task harder: there are now 20 pairs with only 1 possible correct answer for the model to choose from.
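One way to score a single trial of this modified N-way task, assuming dist is the trained Siamese distance function, is sketched below:

```python
import numpy as np

def n_way_trial(dist, query, positive, negatives):
    """One N-way trial: the model is correct if the pair containing the
    true-class example has the smallest distance among all N pairs.

    dist:      trained Siamese distance function on image pairs
    positive:  an image from the same class as the query
    negatives: N - 1 images drawn from the other classes
    """
    candidates = [positive] + list(negatives)
    distances = [dist(query, c) for c in candidates]
    return int(np.argmin(distances) == 0)  # index 0 holds the true pair
```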
3.6 Results
Fig. 10 Loss and accuracy plots for the training and validation sets
Table: N value | Siamese + Transfer (Base) | Our Model
The table above compares the base model with our model; both use the VGG-16 architecture.
The increase is very small in the 3-way setting, as the base model accuracy was already quite high, but the improvement over the base model grows with N: around a 2% increase in accuracy for 10-way and a 3% improvement for 20-way.
We also tried our model with some popular CNN architectures and compared the results below.
This table provides a comparison of different CNN architectures, namely VGG-16, ResNet, MobileNet, and DenseNet, in terms of time taken per epoch on NVIDIA T4 Tensor Core GPUs and their accuracy on three different classification tasks: 3-way, 10-way, and 20-way.
The results show that VGG-16, ResNet, MobileNet, and DenseNet all achieved competitive accuracy on all three classification tasks. The scores of VGG-16 and ResNet were slightly better than those of MobileNet and DenseNet. However, MobileNet and DenseNet are relatively faster in terms of time taken per epoch: MobileNet takes only 3 minutes per epoch, while VGG-16 and ResNet each take 11 minutes. DenseNet takes 5 minutes per epoch, faster than VGG-16 and ResNet but slower than MobileNet.
Overall, MobileNet and DenseNet are more efficient in terms of time taken per epoch, making them suitable for real-time or resource-constrained applications, while VGG-16 and ResNet achieved better accuracy than MobileNet and DenseNet but are relatively slower.
4. LIMITATIONS AND FUTURE WORK
• 2D images cannot truly reflect the 3D structural information of the human body.
• Few-shot models currently perform poorly under domain shift, so adapting to domain shift remains a challenge.
• Multimodal technologies are needed in the medical field, such as pairing X-rays with a list of symptoms or a diagnosis report from a doctor; this would help increase the accuracy and reliability of the model.
• Certain classes of diseases are simply hard to predict from a few examples, no matter how sophisticated the model is.
5. REFERENCES
[1]. Csurka, G.; Dance, C.R.; Fan, L.; Willamowski, J.; Bray, C. Visual Categorization with Bags of Keypoints. In Proceedings of the Workshop on Statistical Learning in Computer Vision, ECCV, Prague, Czech Republic, 11–14 May 2004.
[2]. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef]
[3]. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; pp. 886–893.
[4]. Ahonen, T.; Hadid, A.; Pietikainen, M. Face description with local binary patterns: Application to face recognition. IEEE
Trans.Pattern Anal. Mach. Intell. 2006, 28, 2037–2041. [CrossRef] [PubMed]
[5]. Yang, C. Plant leaf recognition by integrating shape and texture features. Pattern Recognit. 2021, 112, 107809. [CrossRef]
[6]. Al-Saffar, A.A.M.; Tao, H.; Talab, M.A. Review of deep convolution neural network in image classification. In
Proceedings of the 2017 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications
(ICRAMET), Jakarta, Indonesia, 23–24 October 2017; pp. 26–31. [CrossRef]
[7]. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In
Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009;
pp. 248–255.
[8]. Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[9]. Andrychowicz, M.; Denil, M.; Gómez, S.; et al. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems (NIPS), 2016.
[10]. Dhillon, G.S.; Chaudhari, P.; Ravichandran, A.; Soatto, S. A baseline for few-shot image classification. In ICLR, 2020.
[11]. A baseline for few-shot image classification. In ICLR, 2020.
[12]. Nakamura, A.; Harada, T. Revisiting fine-tuning for few-shot learning. 2019.
[13]. Mishra, N.; Rohaninejad, M.; Chen, X.; Abbeel, P. A simple neural attentive meta-learner. 2017.
[14]. Papp, D.; Szűcs, G. Balanced active learning method for image classification. Acta Cybernetica 2017, 23(2), 645–658.
[15]. Papp, D.; Szűcs, G. Double probability model for open set problem at image classification. Informatica 2018, 29(2), 353–369.
[16]. Ramachandra, B.; Jones, M.J.; Vatsavai, R. Learning a distance function with a Siamese network to localize anomalies in videos. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020; pp. 2587–2596.
[17]. Seeland, M.; Mäder, P. Multi-view classification with convolutional neural networks. PLoS ONE 2021, 16(1), e0245230. doi: 10.1371/journal.pone.0245230.
[18]. Shyam, P.; Gupta, S.; Dukkipati, A. Attentive recurrent comparators. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017; pp. 3173–3181. doi: 10.5555/3305890.3306009.
[19]. Melekhov, I.; Kannala, J.; Rahtu, E. Siamese network features for image matching. In 2016 23rd International Conference on Pattern Recognition (ICPR), 2016; pp. 378–383. doi: 10.1109/ICPR.2016.7899663.
[20]. Wu, X.; Sun, Y.; Liu, L.; Liu, Z. Hard negative sample mining in Siamese networks for object tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017; pp. 5634–5642.
[21]. Wang, W.; Wu, Q.; Zhang, X.; Li, W. Hard negative mining for Siamese networks with adversarial attacks. IEEE Transactions on Neural Networks and Learning Systems 2020, 31(4), 1074–1084.
[22]. Hassanpour, S.; Baydoun, M. Siamese network-based medical image retrieval system. Computer Methods and Programs in Biomedicine 2020, 191, 105422.
[23]. Wang, L.; Li, Y.; Huang, Y. Siamese neural network-based classification for medical images. In Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC), 2018; pp. 648–653.