Breast Cancer Using Image Processing
Process Description
Breast cancer is the most common form of cancer in women, and invasive ductal carcinoma (IDC) is
the most common form of breast cancer. Accurately identifying and categorizing breast cancer
subtypes is an important clinical task, and automated methods can be used to save time and reduce
error.
The goal of this script is to identify IDC when it is present in otherwise unlabelled histopathology
images. The dataset consists of approximately five thousand 50x50 pixel RGB digital images of
H&E-stained breast histopathology samples, each labelled as either IDC or non-IDC. The images are
stored as NumPy arrays and are small patches extracted from digital images of breast tissue samples.
The breast tissue contains many cells, but only some of them are cancerous. Patches that are
labelled "1" contain cells that are characteristic of invasive ductal carcinoma.
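The classifiers listed in this report expect one feature vector per sample, so each 50x50 RGB patch has to be flattened before training. The sketch below shows one way this could be done. It uses randomly generated pixel values as a stand-in for the real patches, and the variable names, the patch count of 20, and the use of "0" for non-IDC are assumptions made for illustration only.

```python
import numpy as np

# Synthetic stand-in for the real dataset: random pixels, NOT histopathology data.
rng = np.random.default_rng(0)
n_patches = 20  # hypothetical small count; the real dataset has about five thousand

# Each patch is a 50x50 RGB image, so the full array has shape (n, 50, 50, 3).
X = rng.integers(0, 256, size=(n_patches, 50, 50, 3), dtype=np.uint8)
y = rng.integers(0, 2, size=n_patches)  # 1 = IDC; 0 = non-IDC (assumed encoding)

# Scale pixel values to [0, 1] and flatten each patch into a single row.
# 50 * 50 * 3 = 7500 features per patch.
X_flat = X.reshape(n_patches, -1).astype(np.float32) / 255.0

print(X_flat.shape)  # (20, 7500)
```

Flattening discards the spatial layout of the pixels, which is acceptable for the classical classifiers used here but is one reason neural networks are often considered for image data.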
The methodology applies the following classification techniques: Logistic Regression, Random Forest
Classifier, K-Nearest Neighbour, Support Vector Machine (SVC), Linear SVC, Gaussian Naive Bayes
(Gaussian NB), and Decision Tree Classifier.
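As a rough illustration of how these classifiers could be tried side by side, the sketch below uses scikit-learn, which provides all seven of them. The choice of scikit-learn, the synthetic data from make_classification, the 5-fold cross-validation, and every hyperparameter shown are assumptions for this example and are not taken from the original script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the flattened image patches.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hyperparameters are illustrative defaults, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machine": SVC(),
    "Linear SVC": LinearSVC(max_iter=10000),
    "Gaussian NB": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Cross-validation is used here instead of a single train/test split so that each model is scored on every sample once, which gives a steadier comparison on small datasets.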
Neural Networks
Mammogram
Mammography is one of the most effective methods used in hospitals and clinics for the early
detection of breast cancer. It has been shown to reduce mortality by as much as 30%. The main
objective of screening mammography is to detect a cancerous tumour early and remove it before
metastases become established. The early signs of breast cancer are masses and microcalcifications,
but abnormal and normal breast tissue are often difficult to differentiate because of their subtle
appearance and ambiguous margins. Only about 3% of the required information is revealed during a
mammogram, where part of the suspicious region is covered by vessels and normal tissue. This can
make it difficult for radiologists to identify a cancerous tumour. Computer-aided diagnosis (CAD)
has therefore been developed to overcome the limitations of mammography and to help radiologists
read mammograms more accurately. Artificial neural network (ANN) models are the most commonly used
models in CAD for mammography interpretation and biopsy decision making.
Classification Techniques:
Logistic Regression
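Logistic Regression uses an equation similar to Linear Regression, but its outcome is a categorical
variable, whereas other regression models output a continuous value. The linear combination of the
inputs is passed through the logistic (sigmoid) function to give a class probability.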
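The relationship to Linear Regression can be made concrete with a small sketch: the same weighted sum of inputs is computed, then squashed by the sigmoid function into a probability and thresholded to obtain a class. The weights, bias, sample values, and the 0.5 threshold below are made-up illustrative numbers, not fitted parameters.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of class 1: sigmoid of the linear combination w.x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def predict(x, w, b, threshold=0.5):
    """Turn the probability into a categorical outcome (0 or 1)."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical weights and bias, chosen by hand for illustration.
w = [0.8, -0.4]
b = -0.1
sample = [1.0, 0.5]

# Linear part: 0.8*1.0 - 0.4*0.5 - 0.1 = 0.5, and sigmoid(0.5) is about 0.622.
print(round(predict_proba(sample, w, b), 3))
print(predict(sample, w, b))
```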
K-Nearest Neighbour
K-Nearest Neighbour is a supervised machine learning algorithm, as the data given to it is labelled.
It is a nonparametric method: rather than learning a fixed set of model parameters, it classifies a
test data point from the labels of its nearest training data points.
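A minimal sketch of the nearest-neighbour idea follows, written in plain Python. It assumes Euclidean distance and a simple majority vote among the k closest training points. The function name knn_predict and the toy 2-D points are hypothetical and stand in for flattened image patches.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k closest points and return the most frequent one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D data: one cluster near the origin (class 0), one near (5, 5) (class 1).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```

No training step appears in the sketch because the method simply stores the labelled data and defers all work to prediction time.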
Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm that performs well on
pattern recognition problems and is used to learn classification and regression rules from data.
SVM tends to work well when the number of features is high. When the number of instances is also
high, training a kernel SVM becomes slow, and a linear variant such as Linear SVC is usually the
more practical choice.
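For the linear case, the classification rule an SVM learns can be sketched in a few lines: a sample is assigned to a class according to which side of the hyperplane w.x + b = 0 it falls on, and training penalises samples that lie inside the margin using the hinge loss. The weights and samples below are hand-picked for illustration and are not the result of any training; labels follow the usual +1 / -1 convention.

```python
def decision_value(x, w, b):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_predict(x, w, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def hinge_loss(x, y, w, b):
    """Zero when the sample is on the correct side with margin at least 1."""
    return max(0.0, 1.0 - y * decision_value(x, w, b))


# Hypothetical hyperplane: x0 - x1 = 0.
w = [1.0, -1.0]
b = 0.0

far_point = [3.0, 1.0]   # score 2.0: correct side, outside the margin
near_point = [1.0, 0.5]  # score 0.5: correct side, but inside the margin

print(svm_predict(far_point, w, b), hinge_loss(far_point, 1, w, b))
print(svm_predict(near_point, w, b), hinge_loss(near_point, 1, w, b))
```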
Random Forest
Random forest, as its name implies, consists of a large number of individual decision trees that
operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and
the class with the most votes becomes the model’s prediction.
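The voting step described above can be sketched on its own. The example below assumes the individual trees have already produced their class predictions for one sample. The list of votes is made up, and majority_vote is a hypothetical helper name used only here.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single patch (1 = IDC).
votes = [1, 0, 1, 1, 0]

print(majority_vote(votes))  # 1, since three of the five trees voted for class 1
```

Library implementations may combine the trees differently. scikit-learn's RandomForestClassifier, for example, averages the trees' predicted class probabilities instead of counting hard votes.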