Early Lung Cancer Detection using Nucleus Segmentation based Features

Kesav Kancherla, Srinivas Mukkamala

Institute for Complex Additive Systems and Analysis (ICASA)
Computational Analysis and Network Enterprise Solutions (CAaNES)
New Mexico Institute of Mining and Technology
Socorro, New Mexico 87801, U.S.A.
[email protected] [email protected]

2013 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)
978-1-4673-5875-0/13/$31.00 ©2013 IEEE

Abstract— In this study we propose an early lung cancer detection methodology using nucleus based features. First, sputum samples from patients are labeled with Tetrakis Carboxy Phenyl Porphine (TCPP) and fluorescent images of these samples are taken. TCPP is a porphyrin that assists in labeling lung cancer cells owing to the increased number of low density lipoproteins coating the surface of cancer cells. We study the performance of well known machine learning techniques in the context of lung cancer detection on the Biomoda dataset. We obtained an accuracy of 81% using 71 features related to shape, intensity and color in our previous work. By adding the nucleus segmented features we improved the accuracy to 87%. Nucleus segmentation is performed using the seeded region growing segmentation method. Our results demonstrate the potential of nucleus segmented features for detecting lung cancer.

Keywords- Lung Cancer detection; Bioinformatics; Machine Learning; Seeded Region Growing segmentation.

I. INTRODUCTION

Lung cancer is the leading cancer killer among both men and women. Based on statistics from the American Cancer Society, there are an estimated 220,000 new cases and 160,000 deaths per year, and the 5-year survival rate for all stages is 15% [1]. The factors that influence the 5-year survival rate include the stage of cancer, the type of cancer, and other factors such as symptoms and general health. Early detection of lung cancer is the leading factor in improving the survival rate. However, the symptoms of lung cancer do not appear until the cancer spreads to other areas, leaving only a 24% chance of detecting lung cancer in its early stages [3]. An accurate method for early detection of lung cancer is therefore needed to increase the survival rate.

Various methods such as Computed Tomography (CT) scans, chest radiography, sputum analysis and microarray data analysis are used for lung cancer detection [5]. Mass screening by CT scan of the chest is a promising method for lung cancer detection. However, this method is not recommended because of its cost, and its long-term safety is not established due to the risk of exposure to radiation [7]. The use of microarray data for cancer detection is investigated in [9]. However, the use of microarray data for lung cancer detection is a costly approach. In this paper we investigate the use of Tetrakis Carboxy Phenyl Porphine (TCPP) as an alternative approach for early detection of lung cancer.

Machine learning for cancer detection is investigated in [8]. Machine learning techniques such as Artificial Neural Networks (ANN) and Decision Trees (DT) have been used for cancer detection for nearly 20 years [10, 11, 12]. The potential of machine learning methods for detecting cancer cells or tumors via X-rays and Computed Tomography (CT) is shown in [13, 14]. Machine learning methods used for tumor classification or cancer detection using microarray or gene expression data include Fisher Linear Discriminant analysis [15], K-Nearest Neighbor (KNN) [16], Support Vector Machines (SVM) [17], boosting, Self-Organizing Maps (SOM) [18], hierarchical clustering [19], and graph theoretic approaches [20].

In this study, sputum samples are first collected from patients using the triple morning cough method. The sputum samples are then stained with TCPP using the Biomoda CyPath® assay and slides are prepared from them. The slides are observed under a fluorescent microscope and images of the cells are acquired. We perform segmentation on these images to obtain individual cells. For our initial study we extracted 71 features related to the shape, color and texture of each cell. We obtained an accuracy of 81% using this initial set of features [2].

The nucleus plays an important role in the determination of cancer cells. The size and fluorescence of the nucleus are used to determine whether a cell is cancerous or not. To capture the properties of the nucleus, we perform nucleus segmentation using seeded region growing methods. We extract features such as size and intensity from each cell. By adding nucleus based features to our initial set, we improved the performance of our system.

This paper is organized as follows. Section II describes the sample collection process. In Section III we provide an overview of the image processing steps and the set of features used initially in our experiments. In Section IV we describe the seeded region growing method used for nucleus segmentation and nucleus based feature extraction. In Section V we describe the experiments performed and the results obtained. Finally, in Section VI we conclude and explain our future work.

II. SAMPLE COLLECTION

Biomoda’s internal study [21] included 28 samples from a variety of sources. Biomoda performed this in-house validation study using sputum samples from 15 lung cancer patients and 13 normal patients. Cohort 1 consisted of 15 patients who had recently been diagnosed with lung cancer and had not undergone surgery or received adjuvant therapy for lung cancer. Cohort 2 included 13 subjects who were heavy smokers but did not have a history or diagnosis of lung cancer. “Heavy smoking” was defined as 20 pack-years or greater (i.e. 1 pack/day for 20 years or 2 packs/day for 10 years).

This study was initiated with an approved protocol and a copy of the informed consent document that was reviewed and approved by a duly-constituted Institutional Review Board (IRB). Subjects aged 18 and above were included in the study. Patients with a history of angina after minimal exertion, severe obstructive lung diseases (Forced Expiratory Volume in 1 Second (FEV1) < 20% of predicted), uncontrolled asthma (defined as a hospitalization or emergency room visit within the last year, > 2 nocturnal Morning dip index (MDI) uses per month, or daily wheezing), and those on supplemental oxygen or with resting Saturation of Peripheral Oxygen (SpO2) < 90% were excluded from the study. The rationale for these exclusion criteria was to avoid any circumstances that could have aggravated their medical condition, considering that some exertion was required for sputum collection (without which the subjects would not have been able to produce an adequate quantity of sputum).

A. Sample collection and processing

Obtaining the deep lung sample is a very important step in the successful accomplishment of the assay. The sputum sample was collected over three days following a “triple morning cough procedure”. The sputum sample collection required two collection cups: one sterile cup and a Cytolyt collection cup that contained the fixative. The subjects were given the materials and instructions for sputum collection at the doctor’s office. They were instructed to note any adverse event that occurred up to 15 minutes after sputum collection, and to report the same to the doctor. The patient was advised to blow hard into the lung flute approximately 40 times, with a 5 second break between blows. This induces coughing; the sputum is collected into a cup and then transferred to the Cytolyt collection cup containing ~30 ml of fixative. After completing 3 days of sputum collection, the subjects returned the collection cups containing the samples to the doctor’s office, from where they were sent to the laboratory.

At the Biomoda laboratory, the samples were processed onto a microscope slide, which contained a monolayer of the sputum cells. After preparing the labeling reagents containing TCPP (Biomoda CyPath® Early Detection Lung Cancer Assay), the slide was immersed in the labeling solution, rinsed, air-dried and cover-slipped. The completed slide was viewed under an ultraviolet microscope utilizing a FITC filter, and was observed for the presence of fluorescing red cells and other cellular metrics.

B. Slide scoring and analysis

Slide scoring and analysis was carried out using the “CyPath Slide Scoring Procedure” (conducted with UV light with a FITC filter) as described below.
• The slide was placed on the microscope stage so that the edge of the microscope’s 20x objective remained at the edge of the cellular area.
• Each slide was scanned in a methodical pattern from one end of the cellular area to the other, slightly overlapping the area that had already been scanned.
• The results were interpreted based on the characteristics of cancer cells, normal cells and necrotic cells.

III. IMAGE PROCESSING AND INITIAL FEATURE SET

For each image we apply image processing techniques to obtain individual cells and extract features that assist in lung cancer detection. One of the discriminators used for differentiating between lung cancer and normal cells is that cancer cells glow bright red when TCPP is added. Sputum samples from patients diagnosed with lung cancer and from normal patients are used for performing these experiments.

Image preprocessing is a multi-stage process involving steps such as image segmentation, image transformation and image restoration. In order to obtain individual cells we perform segmentation on each image, using a simple threshold based segmentation. The value of the threshold is chosen empirically based on experiments. The initial feature set can be divided into three categories based on the properties they capture: intensity/color based features, shape based features and texture based features.

A. Initial Feature set

For our initial study [2] we extracted 71 features related to shape, color and wavelet based features. Intensity based features include average intensity, minimum intensity, variance, mode, maximum intensity, skewness, kurtosis, and the number of pixels with maximum intensity and minimum intensity. Shape related features are the size of the cell, aspect ratio and circularity of the cell. For texture based features we use the wavelet transform. The wavelet transform is a powerful signal processing tool for analyzing signals. The Discrete Wavelet Transform (DWT) provides high time resolution and low frequency resolution for high frequencies, and high frequency resolution and low time resolution for low frequencies. The wavelet transform has excellent energy compaction and de-correlation properties, which can be used to effectively generate compact representations that exploit the structure of data. Wavelets can capture both texture and shape information efficiently. In our experiments we applied a level 3 wavelet decomposition using the Daubechies wavelet ‘db4’. After applying the wavelet transform we get one set of approximation coefficients and three sets of detail coefficients. We used the mean, variance, maximum and minimum values of each of these coefficient sets as features.

IV. NUCLEUS SEGMENTATION BASED FEATURES

Region growing methods [4, 6] are a class of region-based segmentation methods in which a group of pixels or sub-regions are grouped into larger regions based on certain criteria. Initially a group of pixels satisfying certain criteria are chosen as seed points. Neighboring pixels are then added to this initial set if they satisfy certain properties, for example that the difference between intensities is below a certain threshold. The neighboring pixels are defined by the concept of connectivity, such as 4-connectivity or 8-connectivity. We used 8-connectivity in our experiments. This process continues until there are no further neighboring pixels that can be added to the set. The steps involved in our method are:

• Enhance the image.
• Find the initial seed point.
• Find the criteria, such as threshold values, for adding pixels.
• Include neighboring pixels if they satisfy these criteria.
• Repeat step 4 on the pixels that are added and stop when there are no further neighboring pixels satisfying the criteria.

A. Enhance the image

Pre-processing of the data involves the following steps:
• Apply median filter: We applied a 3x3 median filter to each cell. The median filter is less sensitive to extreme pixel values and does not reduce the sharpness of the image.
• Apply Histogram Equalization: After applying the median filter we apply histogram equalization to enhance the contrast of the image.
• Apply top hat transformation: We apply a top hat transformation to further enhance the contrast of the image. This step highlights the bright pixels of the cell, which are our region of interest. The top-hat transform is defined as the difference between the original image and its opening (which is the collection of foreground pixels of an image that fit a particular structuring element). The top-hat transform highlights bright spots that are smaller than the specified structuring element. Since the nucleus is roughly circular in shape, we choose a circle as our structuring element.

B. Initial Seed Points selection

Initial seed points are a set of pixels which are contained in the region of interest. In this case the initial seed point must lie on the nucleus of the cell. In our analysis we automatically select the threshold value and initial seed points. Due to TCPP staining, the nucleus is more fluorescent than the cytoplasm. We use this property for selection of the initial seed point. As the nucleus is more fluorescent, the initial seed points are the pixels with the maximum fluorescence value. There might be cases where nucleus pixels do not have the maximum fluorescence value, or where several pixels share the maximum value. To avoid these problems we create a sliding window of size 16x16 pixels with a step size of 1 pixel horizontally or vertically. We find the sum of all pixel values in this sliding window. As the nucleus is said to have more fluorescence, we find the window with the maximum sum of intensities. We take the pixel with maximum intensity in this window as our initial seed.

C. Finding Threshold Criteria

We find criteria and threshold values automatically for each cell in the next step. In our experiments the criteria depend not just on the current seed but on the entire region of seed points. For this purpose we choose a threshold th1 for the difference between the average intensity of the pixels in the set of seeds and a neighboring pixel. In addition, we choose another threshold th2 for the difference between the intensity of the current seed and its neighboring pixels. The threshold th1 is chosen as follows: we take a window of size 5x5 pixels centered at the initial seed and find the average of the pixels in this window. As the initial seed will be at the center of the nucleus, it is assumed that this window and its neighboring pixels will be included in the region. So the threshold is the maximum of the differences between this average and the pixels surrounding the 5x5 pixel block. For threshold th2 we find the gradient in the 5x5 block centered at the initial seed and take its average as the threshold. The choice of block size depends on the nucleus size. If the nucleus is very small, we reduce the block size and then determine the threshold values.

D. Perform Seeded Region Growing Segmentation

Once we have the initial seed points and threshold values we start seeded region growing segmentation. After the initial seed point selection we find neighboring pixels which satisfy both conditions mentioned in the previous section. We add these pixels to our current set of seed points and perform the same process on all the pixels in the seed point set.

Figure 1. Example of Nucleus Segmentation using Seeded Region Growing

Consider a current set of seed points with intensities c1, c2, c3, …, cn, where cn1, cn2, …, cn are the newly added seeds and cn1 is the current seed. A neighboring pixel cm1 is added if it satisfies the following conditions:

• The absolute difference between the average of the intensities in the region and the neighboring pixel does not exceed th1:
Absolute(average(c1, c2, c3, …, cn) - cm1) ≤ th1
• The absolute difference between the intensity of the current pixel and the neighboring pixel does not exceed th2:
Absolute(cn1 - cm1) ≤ th2
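The seed selection, threshold estimation and growing rule described in subsections B to D can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names `find_seed`, `thresholds` and `grow` are hypothetical, the image is assumed to be a list of rows of grayscale intensities, and the "gradient" used for th2 is assumed here to be the mean absolute difference between horizontally and vertically adjacent pixels, which the text does not specify.

```python
from collections import deque


def find_seed(img, win=16):
    """Brightest pixel inside the sliding window with the largest intensity sum."""
    h, w = len(img), len(img[0])
    wh, ww = min(win, h), min(win, w)  # assumption: shrink the window for small cells
    best_sum, best_rc = None, (0, 0)
    for r in range(h - wh + 1):        # step size of 1 pixel vertically
        for c in range(w - ww + 1):    # and horizontally
            s = sum(sum(img[rr][c:c + ww]) for rr in range(r, r + wh))
            if best_sum is None or s > best_sum:
                best_sum, best_rc = s, (r, c)
    r0, c0 = best_rc
    return max(((r, c) for r in range(r0, r0 + wh) for c in range(c0, c0 + ww)),
               key=lambda rc: img[rc[0]][rc[1]])


def thresholds(img, seed, block=5):
    """Estimate (th1, th2) from a block centered at the seed."""
    h, w = len(img), len(img[0])
    half = block // 2
    r0, r1 = max(seed[0] - half, 0), min(seed[0] + half, h - 1)
    c0, c1 = max(seed[1] - half, 0), min(seed[1] + half, w - 1)
    inside = [img[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
    avg = sum(inside) / len(inside)
    # th1: largest deviation from the block average among the pixels bordering the block
    ring = [img[r][c]
            for r in range(max(r0 - 1, 0), min(r1 + 1, h - 1) + 1)
            for c in range(max(c0 - 1, 0), min(c1 + 1, w - 1) + 1)
            if not (r0 <= r <= r1 and c0 <= c <= c1)]
    th1 = max((abs(avg - v) for v in ring), default=0.0)
    # th2: average gradient inside the block (assumed definition, see lead-in)
    diffs = [abs(img[r][c + 1] - img[r][c]) for r in range(r0, r1 + 1) for c in range(c0, c1)]
    diffs += [abs(img[r + 1][c] - img[r][c]) for r in range(r0, r1) for c in range(c0, c1 + 1)]
    th2 = sum(diffs) / len(diffs) if diffs else 0.0
    return th1, th2


def grow(img, seed, th1, th2):
    """Grow a region from the seed using 8-connectivity and both threshold tests."""
    h, w = len(img), len(img[0])
    region, total = {seed}, img[seed[0]][seed[1]]
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()          # current seed
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (nr, nc) in region or not (0 <= nr < h and 0 <= nc < w):
                    continue
                v = img[nr][nc]
                # condition 1: close to the region average; condition 2: close to current seed
                if abs(total / len(region) - v) <= th1 and abs(img[r][c] - v) <= th2:
                    region.add((nr, nc))
                    total += v
                    queue.append((nr, nc))
    return region
```

A typical call sequence would be `seed = find_seed(cell)`, then `th1, th2 = thresholds(cell, seed)`, then `nucleus = grow(cell, seed, th1, th2)`. The region average is updated as pixels are added, which is one reading of "average of intensities in region"; a fixed average computed once from the initial block would be an equally plausible alternative.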

In addition to these conditions, further criteria such as the maximum and minimum size of the nucleus can also be added. However, prior information about the size of the nucleus is needed in order to add these constraints.

After obtaining the nucleus for each individual cell, we extract the average Red, Blue and Green components of the nucleus, the size of the nucleus and other wavelet based features in this experiment. We also remove insignificant features from our previous work, such as the harmonic mean, for this new feature set.

V. EXPERIMENTS

After performing the preprocessing steps mentioned in the previous sections, we extract features from each cell. The final dataset consists of 119 data points, of which 60 are from cancer samples and 59 are from normal samples. Of these data points, 66 percent are used for training and 34 percent are used for testing. The initial feature set consists of 71 features (intensity based, shape based and texture based) and the modified feature set consists of 79 features, including nucleus size, nucleus perimeter, the ratio between nucleus size and cytoplasm, the mean, variance, skewness and kurtosis of the intensity values of the nucleus, and shape parameters.

We built different machine learning models using this dataset. Results obtained using the initial feature set and the modified feature set are given in Table I. For the initial features we obtained a best accuracy of 81% using Random Forest. For the modified feature set we obtained a best accuracy of 87% using bagging on Random Forest. For most of the machine learning techniques used, we obtained superior performance with the modified feature set (adding nucleus segmented features). We also show the performance of our method using Receiver Operating Characteristic (ROC) curves.

TABLE I. ACCURACY OBTAINED USING DIFFERENT MACHINE LEARNING (ML) TECHNIQUES

ML Technique                        Accuracy (%), initial      Accuracy (%), modified
                                    feature set (71 features)  feature set (79 features)
SVM                                 75.9                       79.43
RBF Network                         76                         80.48
Naïve Bayesian                      74                         73.8
Multinomial Logistic Model          64                         70.74
Sequential Minimal Optimization     70                         75.00
Ada-boost RBF Network               74                         83.1
Bagging RBF Network                 74                         87.8
Multilayer Perceptron               71.3                       70.8
Random Forest                       81                         80.48
Linear Logistic Model               74                         78.04
Ada-boost Linear Logistic Model     74                         75.6
Bagging Linear Logistic Model       78                         85.36

The Receiver Operating Characteristic (ROC) [6] curves are generated for SVMs by considering the rate at which true positives accumulate versus the rate at which false positives accumulate, with these corresponding to the vertical axis and the horizontal axis, respectively, in Figure 2. The point (0, 1) is the perfect classifier, since it classifies all positive and negative cases correctly. Thus an ideal system will start by identifying all the positive examples, so the curve will rise to (0, 1) immediately, having a zero rate of false positives, and then continue along to (1, 1).

Figure 2. ROC Curve Obtained using 71 Features

Figure 3. ROC Curve Obtained using 79 Features

Figure 2 gives the ROC curve obtained using the previous feature set and Figure 3 gives the ROC curve obtained using the current modified feature set. Detection rates and false alarms are evaluated for the lung cancer dataset described in Section II, and the obtained results are used to form the ROC curves. In each of these ROC plots, the x-axis is the false alarm rate, calculated as the percentage of normal cells considered as tumor; the y-axis is the classification rate, calculated as the percentage of tumors correctly detected. A data point in the upper left corner corresponds to optimal high performance, i.e., a high classification rate with a low false alarm rate. We can see from the ROC curves that the Area Under Curve (AUC) for the modified feature set is higher than for the previous features.
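To make the ROC construction above concrete, the following sketch builds the curve from classifier scores and computes the AUC with the trapezoidal rule. It is an illustration under stated assumptions rather than the procedure used in the paper: the names `roc_points` and `auc` are hypothetical, labels are assumed to be 1 for cancer and 0 for normal, a higher score is assumed to mean "more likely cancer", and both classes must be present in the test set.

```python
def roc_points(labels, scores):
    """Return ROC points as (false alarm rate, detection rate) pairs.

    The decision threshold is swept from the highest score to the lowest;
    samples with tied scores are processed together.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        s = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == s:
            if ranked[i][1] == 1:
                tp += 1   # a cancer sample scored above the threshold
            else:
                fp += 1   # a normal sample raised a false alarm
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every cancer sample above every normal sample passes through the point (0, 1) and has an AUC of 1, while a random ranking gives an AUC near 0.5, which is why a larger AUC for the 79-feature set indicates better separation.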

VI. CONCLUSION AND FUTURE WORK

In this paper we show the potential of machine learning, nucleus segmented features and the Biomoda CyPath® staining procedure in lung cancer detection. We obtained an accuracy of 88% in our experiments, an improvement over the 81% obtained with the initial feature set. Besides its use as a potential screening tool for lung cancer, this method can be used to monitor treatment effectiveness, to detect the recurrence of lung cancer, and also to identify patients who may need an invasive diagnostic procedure. In future studies, we also plan to include larger study populations to establish statistical significance. In this work we show the importance of nucleus based features in detecting cancer cells. We would like to add additional criteria such as nucleus size to the seeded region growing method for better accuracy. As more inputs are added, feature selection will have to follow a more stringent scrutiny.

REFERENCES

[1] American Cancer Society (ACS), “Report on Lung Cancer,” 2010.
[2] K. Kancherla, S. Mukkamala, et al., “Non Intrusive and Extremely Early detection of Lung Cancer using TCPP,” 13th World Conference on Lung Cancer, San Francisco, USA, 2009.
[3] A. Jemal, R. Siegel, E. Ward, T. Murray, J. Xu, and M. J. Thun, “Cancer statistics, 2007,” CA Cancer J Clin, vol. 57, no. 1, pp. 43-66, 2007.
[4] M. Mancas, B. Gosselin, and B. Macq, “Segmentation using a region-growing thresholding,” Proceedings of the SPIE 5672, pp. 388–398, 2005.
[5] M. M. Oken, P. M. Marcus, P. Hu, T. M. Beck, W. Hocking, P. A. Kvale, J. Cordes, T. L. Riley, S. D. Winslow, S. Peace, D. L. Levin, P. C. Prorok, and J. K. Gohagan, “Baseline chest radiograph for lung cancer detection in the randomized prostate, lung, colorectal and ovarian cancer screening trial,” Journal of the National Cancer Institute, vol. 97, no. 24, pp. 1832-1839, December 2005.
[6] J. P. Egan, Signal detection theory and ROC analysis. New York: Academic Press, 1975.
[7] A. Berrington de Gonzalez and S. Darby, “Risk of cancer from diagnostic X-rays: estimates for the UK and 14 other countries,” Lancet, 363:345–51, 2004.
[8] J. A. Cruz and D. S. Wishart, “Applications of machine learning in cancer prediction and prognosis,” Cancer Informatics, vol. 2, pp. 59-77, 2007.
[9] S. Shah and A. Kusiak, “Cancer gene search with data-mining and genetic algorithms,” Computers in Biology and Medicine, vol. 37, issue 2, pp. 251-261, 2007.
[10] R.J. Simes, “Treatment selection for cancer patients: application of statistical decision theory to the treatment of advanced ovarian cancer,” J Chronic Dis, 38:171–86, 1985.
[11] P.S. Maclin, J. Dempsey, and J. Brooks, “Using neural networks to diagnose cancer,” J Med Syst, 15:11–9, 1991.
[12] D.V. Cicchetti, “Neural networks and diagnosis in the clinical laboratory: state of the art,” Clin Chem, 38:9–10, 1992.
[13] E.F. Petricoin and L.A. Liotta, “SELDI-TOF-based serum proteomic pattern diagnostics for early detection of cancer,” Curr Opin Biotechnol, 15:24–30, 2004.
[14] L. Bocchi, G. Coppini, J. Nori, and G. Valli, “Detection of single and clustered microcalcifications in mammograms using fractals models and neural networks,” Med Eng Phys, 26:303–12, 2004.
[15] S. Dudoit, J. Fridlyand, and T. P. Speed, “Comparison of discrimination methods for the classification of tumors using gene expression data,” Journal of the American Statistical Association, vol. 97, no. 457, pp. 77-87, 2002.
[16] L. Li, C.R. Weinberg, T.A. Darden, and L.G. Pedersen, “Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method,” Bioinformatics, vol. 17, no. 12, pp. 1131–1142, 2001.
[17] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.
[18] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. GaasenBeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Blomfield, and E.S. Lander, “Molecular classification of cancer: class discovery and class prediction by gene-expression monitoring,” Science, vol. 286, pp. 531–537, 1999.
[19] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Bostein, “Cluster analysis and display of genome-wide expression patterns,” Proceedings of the National Academy of Science USA, vol. 14, pp. 863–868, 1998.
[20] E. Hartuv, A. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrach, and R. Shamir, “An algorithm for clustering cDNA fingerprints,” Genomics, vol. 66, pp. 249–256, 2000.
[21] Biomoda Inc. http://www.biomoda.com/
