Multiclass Report
Bachelor of Engineering
In
Computer Science and Engineering
Submitted By
2022 - 2023
CERTIFICATE
MEDICAL IMAGE DATA is a bonafide work carried out by the student team Mr.
Computer Science and Engineering during the year 2022 – 2023. The project report has
been approved as it satisfies the academic requirement with respect to the project work
External Viva:
Name of the Examiners Signature with date
1.
2.
ABSTRACT
One of the most important factors affecting human health is Breast Cancer. The process
of diagnosing this disease involves the use of pathological breast cancer images. Medical
image classification of these pathological breast cancer images plays an important role in
clinical treatment and computer-aided diagnosing tasks. Deep learning techniques provide
an effective way to construct an end-to-end model that can compute final classification
labels with the raw pixels of medical images. However, imbalanced class distribution in
the medical image data, which leads to misclassification, is a great challenge in this field.
In this project, to handle the problem of class imbalance, we apply several resampling
techniques to balance the medical image data. Multiple state-of-the-art deep learning
models are used for multi-class classification of the image data. These models are trained
on both imbalanced and balanced image data to compare the results. To improve prediction
accuracy, ensemble learning is used: an ensemble that combines the trained deep learning
models can make better predictions and achieve better performance than any single
contributing model.
We also take this opportunity to thank Mrs. Nirmala Patil, our guide, for providing us
with an academic environment that nurtured our practical skills and contributed to our
project's success.
We sincerely thank Mr. Mahesh Patil, Mini Project Coordinator, for their support,
inspiration, and wholehearted cooperation during the course of completion.
Akshath Raj
Vaiybhav Balachenna
Atharv Kadole
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Preamble
1.2 Motivation
1.3 Objectives of the project
1.4 Literature Survey
1.5 Problem Definition
2. PROPOSED SYSTEM
2.1 Description of Proposed System
2.2 Description of Target Users
2.3 Advantages of Proposed System
2.4 Scope
3. SOFTWARE REQUIREMENT SPECIFICATION
3.1 Overview of SRS
3.2 Requirement Specifications
3.2.1 Functional Requirements
3.2.2 Nonfunctional Requirements
3.2.2.1 Performance Requirements
3.2.2.2 Usability
3.2.3 Use Case Diagram
3.2.4 Use Case Description
3.3 Software and Hardware requirement specifications
4. SYSTEM DESIGN
4.1 Architecture of the system
4.2 Data Set Description
5. IMPLEMENTATION
5.1 Proposed Methodology
6. TESTING
6.1 Acceptance Testing
6.2 Unit Testing
7. RESULTS AND DISCUSSIONS
7.1 Results
7.2 Discussions
8. CONCLUSION AND FUTURE SCOPE
8.1 Conclusion
8.2 Future scope
9. REFERENCES
10. APPENDIX
A Gantt Chart
C Description of Tools & Technology used
D Blue Print
Classification of Imbalanced Medical Image Data
------------------------------------------------------------------------------------------------------------
1. INTRODUCTION
1.1 Preamble
Breast cancer is one of the most common diseases among women. It arises from the
abnormal growth of a mass of cells in the breast region. This breast tissue forms a
tumor, which is classified as benign or malignant. A malignant tumor is cancerous,
and a benign tumor is non-cancerous. The disease is diagnosed by biopsy.
Researchers have analyzed various automated diagnosis approaches to detect breast
cancer. The maturity of the stroma in breast cancer is classified from histological
images. Thermography and mammography images are also used for diagnosis.
Thermographic images are captured by cameras that record infrared radiation and
its intensity level. Of the two, mammography provides the more accurate result.
However, in many medical and clinical cases, it can be hard to collect a balanced
dataset for training since some diseases have a low prevalence. This leads to the data
imbalance problem, namely, the number of samples in different classes is not
balanced. Imbalanced data can negatively affect the performance of models
significantly. Many models that perform well on balanced datasets cannot achieve
good performance on their imbalanced counterparts. To address class imbalance,
resampling techniques such as oversampling and undersampling can be used.
Breast cancer images are classified using various machine learning and deep
learning techniques. Computer-aided diagnosis is an important research field in
medical image classification, where the goal of most tasks is to differentiate
between the classes of benign and malignant tumors and to predict the correct class
of a breast cancer image. With the development of deep learning, medical image
classification has achieved remarkable progress. Usually, training deep learning
models needs plenty of labeled samples from each class. Various state-of-the-art
deep learning models such as CNN, VGG-19, and ResNet-50 are used for multi-class
classification of medical image data.
Ensemble learning is a machine learning technique that combines several base
models in order to produce one optimal predictive model. Ensemble learning
strategies are beneficial in deep learning based medical image classification because
combining diverse models lets their strengths in focusing on different features
complement one another while balancing out the weaknesses of any individual
model. The final prediction is obtained by combining the results from the base
models. Averaging, weighted averaging, and voting are some of the ways the results
are combined to obtain a final prediction.
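The two simplest combination rules named above, averaging and voting, can be sketched as follows. This is a minimal illustration using NumPy only; the function names and the toy probability arrays are hypothetical and are not taken from the project code. Each model is assumed to output an array of class probabilities of shape (n_samples, n_classes).

```python
import numpy as np

def average_ensemble(prob_list):
    """Soft voting: average the class probabilities of all models, then take the argmax."""
    mean_probs = np.mean(np.stack(prob_list), axis=0)  # (n_samples, n_classes)
    return mean_probs.argmax(axis=1)

def majority_vote(prob_list):
    """Hard voting: each model votes for its own argmax class.
    Ties are resolved in favour of the lowest class index."""
    votes = np.stack([p.argmax(axis=1) for p in prob_list])  # (n_models, n_samples)
    n_classes = prob_list[0].shape[1]
    # Count the votes per class for every sample: result is (n_classes, n_samples).
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)

# Toy outputs of three hypothetical models for two samples and three classes.
p1 = np.array([[0.6, 0.3, 0.1], [0.1, 0.5, 0.4]])
p2 = np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]])
p3 = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])

soft = average_ensemble([p1, p2, p3])
hard = majority_vote([p1, p2, p3])
```

In this toy case the two rules disagree on the first sample: averaging picks class 1 because one model is very confident, while voting picks class 0 because two of the three models prefer it.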
------------------------------------------------------------------------------------------------------------
School of Computer Science and Engineering 1
1.2 Motivation
• Cancer is a major health issue and, given its recent sharp increase, is expected to
be the number one cause of death in the coming decades.
• Class imbalance is a general problem in real-world medical data, and in cancer
data in particular.
• Incorrect classification of noncancerous cells may lead to serious health
consequences.
• As such, analysis of healthcare and treatment data is crucial for doctors to
predict the class of cancer at an early stage and to make the right clinical
decisions.
1.3 Objectives
• To rebalance the imbalanced data using resampling methods.
• To classify each image into its corresponding class using deep learning models.
• To evaluate the accuracy of the deep learning models.
• To compare the classification results on imbalanced and balanced data.
• To ensemble the trained models to improve accuracy.
• To predict the class of unseen random pathological images from the test set.
1.4 Literature Survey
[2] Medical image classification plays an essential role in clinical treatment and
teaching tasks. With traditional methods, much time and effort must be spent on
extracting and selecting classification features. The deep neural network is an
emerging machine learning method that has proven its potential for different
classification tasks. Notably, the convolutional neural network dominates with the
best results on varying image classification tasks. However, medical image datasets
are hard to collect because labeling them requires a lot of professional expertise.
Therefore, this paper investigates how to apply a convolutional neural network
(CNN) based algorithm to a chest X-ray dataset to classify pneumonia. Three
techniques are evaluated through experiments. These are linear support vector
machine classifiers with local rotation and orientation-free features, transfer learning
on two convolutional neural network models, VGG16 (Visual Geometry Group) and
InceptionV3, and a capsule network trained from scratch. Data augmentation is
a data preprocessing method applied to all three methods. The results of the
experiments show that data augmentation is generally an effective way for all three
algorithms to improve performance. Also, transfer learning is a more useful
classification method on a small dataset than a support vector machine with
Oriented FAST and Rotated BRIEF (ORB) features or a capsule network. In transfer
learning, retraining specific features on a new target dataset is essential to improve
performance. The second important factor is a network complexity that matches the
scale of the dataset.
2. PROPOSED SYSTEM
2.4 Scope
At present, the proposed system is trained only on histopathological images at 40X
magnification. In future, it could be trained on images of any magnification
factor.
3.2.2.2 Usability
• The system should make it easy for the user to input image data and
obtain the predicted class.
3.2.3 Use Case Diagram
Scenarios:
• The dataset can be balanced using oversampling techniques.
• The dataset can be balanced using undersampling techniques.
• The dataset can be balanced by combining both oversampling and
undersampling.
Exceptions:
• The dataset is already balanced.
• The dataset cannot be balanced because it is very large and has high
dimensionality.
Frequency of Use: Whenever dataset is imported
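Whether an imported dataset falls under the "already balanced" exception can be checked from its class counts. The sketch below is illustrative only: the helper names and the threshold `max_ratio=1.5` are assumptions and are not specified in this report.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def needs_resampling(labels, max_ratio=1.5):
    """Hypothetical rule: resample when the largest class exceeds the
    smallest one by more than max_ratio (an assumed threshold)."""
    return imbalance_ratio(labels) > max_ratio

# Toy labels: 10 samples of one class and 2 of another.
toy_labels = ["benign"] * 10 + ["malignant"] * 2
toy_ratio = imbalance_ratio(toy_labels)
```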
4. SYSTEM DESIGN
• VGG-19
VGG-19 is a convolutional neural network that is 19 layers deep. A version of the
network pretrained on more than a million images from the ImageNet database is
available; it can classify images into 1000 object categories, such as keyboard,
mouse, pencil, and many animals. We replace the output layer of the model to suit
our task: after loading the base model, we flatten its output and feed it to a
Dense output layer with 8 units, one for each breast cancer class. The softmax
activation function is used because this is a multi-class classification problem.
Figure 3 shows the VGG-19 architecture.
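The modification described above (pretrained base, Flatten, Dense output with softmax) might look like the following Keras sketch. It assumes TensorFlow 2.x is installed. Several details are assumptions rather than facts from this report: the function name, the 64 x 64 x 3 input size (borrowed from the ResNet-50 description), freezing the base model, and the optimizer and loss.

```python
import tensorflow as tf

def build_vgg19_classifier(num_classes=8, input_shape=(64, 64, 3), weights="imagenet"):
    """Sketch of a VGG-19 transfer-learning classifier (hypothetical helper)."""
    # Load VGG-19 without its original 1000-class output layers.
    base = tf.keras.applications.VGG19(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # assumption: keep the pretrained features fixed

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = tf.keras.layers.Flatten()(x)
    # One output unit per class; softmax yields a probability per class.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    # Assumed training setup; categorical_crossentropy expects one-hot labels.
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Calling `build_vgg19_classifier()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random initialization, which is enough to inspect the layer shapes.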
• ResNet-50
ResNet-50 is a convolutional neural network that is 50 layers deep. ResNet, short
for Residual Network, is a classic neural network used as a backbone for many
computer vision tasks. We load a version of the network pretrained on more than a
million images from the ImageNet database, which can classify images into 1000
object categories. To suit our task, we add an input layer that takes images of
dimension 64 x 64 and an output layer of 8 classes with the softmax activation
function, since this is a multi-class classification problem. Figure 4 shows the
ResNet-50 architecture.
4.2 Data Set Description
Dataset Source:
• The name of the dataset is BreakHis_v1.
• It has 8 classes, namely Adenosis, Fibroadenoma, Phyllodes Tumor, Tubular
Adenoma, Ductal Carcinoma, Lobular Carcinoma, Mucinous Carcinoma, and
Papillary Carcinoma.
• Source of Dataset: Breast Cancer Histopathological Database (BreakHis) -
Laboratório Visão Robótica e Imagem (ufpr.br)
• This database has been built in collaboration with the P&D Laboratory –
Pathological Anatomy and Cytopathology, Parana, Brazil.
Dataset Analysis:
• It is composed of 9,109 microscopic images of breast tumor tissue.
• The images were collected from 82 patients using different magnifying factors
(40X, 100X, 200X, and 400X).
• Due to hardware limitations, we use only the images at 40X magnification for
classification, which number 1,995.
Dataset Pre-Processing:
• The dataset is highly imbalanced among the classes.
• For balancing, oversampling using SMOTE and undersampling using Tomek links
are combined.
• SMOTE (Synthetic Minority Oversampling TEchnique) is used for oversampling; it
generates synthetic samples of the minority class.
• Undersampling is done by removing Tomek links, which are pairs of samples from
opposite classes that are each other's nearest neighbors.
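The two resampling ideas listed above can be illustrated with a simplified NumPy sketch. This is not the project's code and not a library implementation: the function names are hypothetical, images are assumed to have been flattened into feature vectors beforehand, and this Tomek-link variant removes both members of each link.

```python
import numpy as np

def smote_oversample(X, y, minority_label, n_new, k=5, seed=0):
    """SMOTE idea: create n_new synthetic samples by interpolating between a
    minority sample and one of its k nearest neighbours of the same class."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(X_min) - 1)
    # Pairwise distances inside the minority class (a sample is not its own neighbour).
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # random position along the connecting segment
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority_label)])

def remove_tomek_links(X, y):
    """Drop every sample that is part of a Tomek link: a pair of samples from
    different classes that are each other's nearest neighbour."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    mutual = nearest[nearest] == np.arange(len(X))
    in_link = mutual & (y[nearest] != y)
    return X[~in_link], y[~in_link]
```

A typical combined use would first call `smote_oversample` once per minority class and then pass the result through `remove_tomek_links` to clean the class boundaries. Both functions build full pairwise distance matrices, so they are only practical for small datasets.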
5. IMPLEMENTATION
6. TESTING
7.1 Results
Dataset Visualization:
Figure 6 shows that the image dataset is highly imbalanced.
Resampling:
The dataset is split into 80% training data and 20% testing data. Figure 7 shows
the distribution of classes before resampling, and Figure 8 shows the distribution
of classes after resampling.
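An 80/20 split of this kind could be produced as in the sketch below. The report does not say whether the split was stratified; stratifying by class is an assumption made here so that rare classes appear in both parts, and the function name is hypothetical.

```python
import numpy as np

def stratified_split(y, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class contributes roughly
    test_fraction of its samples to the test set."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))  # shuffle within the class
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)
```

The returned index arrays can be used to slice both the image array and the label array, for example `X[train_idx]` and `y[train_idx]`.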
CNN:
Figure 9 compares the training and validation accuracy on imbalanced and
balanced data, plotted against epochs. Figure 10 compares the confusion
matrices obtained from imbalanced and balanced data.
The accuracy of the CNN is 90.48% on imbalanced data and 93.55% on balanced data.
Figure 9: Accuracy plot of CNN on imbalanced (left) and balanced (right) data
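For reference, the accuracy and confusion matrix reported in this section can be computed from true and predicted labels as sketched below. This is a NumPy illustration with hypothetical function names and integer class labels from 0 to n_classes - 1. Per-class recall is included as an extra because overall accuracy alone can hide poor results on rare classes.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def accuracy(cm):
    """Fraction of samples on the diagonal (correct predictions)."""
    return np.trace(cm) / cm.sum()

def per_class_recall(cm):
    """Correct predictions of each class divided by that class's true count
    (0.0 for a class with no true samples)."""
    row_totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros(len(cm)), where=row_totals > 0)
```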
VGG-19:
Figure 11 compares the training and validation accuracy on imbalanced and
balanced data, plotted against epochs. Figure 12 compares the confusion
matrices obtained from imbalanced and balanced data.
The accuracy of VGG-19 is 45.68% on imbalanced data and 51.38% on balanced
data.
Figure 11: Accuracy plot of VGG-19 on imbalanced (left) and balanced (right) data
VGG-19 Classification Report (Balanced):
ResNet-50:
Figure 13 compares the training and validation accuracy on imbalanced and
balanced data, plotted against epochs. Figure 14 compares the confusion
matrices obtained from imbalanced and balanced data.
The accuracy of ResNet-50 is 25.11% on imbalanced data and 31.08% on balanced
data.
Figure 13: Accuracy plot of ResNet-50 on imbalanced (left) and balanced (right)
data
Ensemble Model:
The weighted ensemble model gives the same accuracy as the CNN model because the
CNN has been assigned the highest weight. Therefore, the final accuracy of the
model is 93.55%.
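A weighted-average ensemble, and a check of how often it simply follows its highest-weighted member, might be sketched as follows. The weights and probability arrays are made-up illustrative values; the weights actually used in this project are not stated in the report.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Weighted average of class probabilities, followed by argmax."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so that the weights sum to 1
    combined = np.tensordot(w, np.stack(prob_list), axes=1)  # (n_samples, n_classes)
    return combined.argmax(axis=1)

def agreement_rate(pred_a, pred_b):
    """Fraction of samples on which two prediction vectors coincide."""
    return float(np.mean(np.asarray(pred_a) == np.asarray(pred_b)))

# Toy outputs of three hypothetical models; the first gets a dominant weight.
cnn = np.array([[0.6, 0.3, 0.1], [0.1, 0.5, 0.4]])
vgg = np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]])
resnet = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])

ensemble_pred = weighted_ensemble([cnn, vgg, resnet], weights=[0.8, 0.1, 0.1])
rate = agreement_rate(ensemble_pred, cnn.argmax(axis=1))
```

In this toy case the ensemble agrees with the top-weighted model on every sample, which is one way an ensemble can end up with exactly the same accuracy as its strongest member. A dominant weight does not guarantee this in general, so measuring the agreement rate on the test set is a useful sanity check.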
7.2 Discussions
We compared each deep learning model on both imbalanced and balanced data and
found that accuracy improved for all three models after balancing the
dataset.
8.1 Conclusion
We thus conclude that the CNN gives the best accuracy, 93.55%, compared to
VGG-19 and ResNet-50. The ensemble model, which combines the predictions of all
three deep learning models, gives the same accuracy because the CNN has been
assigned the highest weight in the weighted average ensemble.
9. REFERENCES
[2] Samir S. Yadav and Shivajirao M. Jadhav. Deep convolutional neural network based
medical image classification for disease diagnosis. Journal of Big Data, volume 6,
Article number 113 (2019).
[4] ZhiFei Lai and HuiFang Deng. Medical Image Classification Based on Deep Features
Extracted by Deep Model and Statistic Feature Fusion with Multilayer Perceptron.
[6] Saima Anwar Lashari and Rosziati Ibrahim. A Framework for Medical Images
Classification Using Soft Set.
10. APPENDIX
A Gantt Chart
TensorFlow:
TensorFlow is a free and open-source software library for machine learning and
artificial intelligence. It can be used across a range of tasks but has a particular focus
on training and inference of deep neural networks.
Keras:
Keras is a free, open-source, high-level deep learning API developed by Google
for implementing neural networks. It is written in Python and is designed to make
implementing neural networks easy. It also supports multiple backends for neural
network computation.