Large Scale Deep Learning For Computer Aided Detection of Mammographic Lesions
Kooi, T.; Litjens, G.J.; Ginneken, B. van; Gubern-Merida, A.; Sanchez, C.I.; Mann, R.M.; Heeten, A. den; Karssemeijer, N.
Medical Image Analysis 35 (2017), pp. 303–312
DOI: https://doi.org/10.1016/j.media.2016.07.007
Article history: Received 11 February 2016; Revised 12 July 2016; Accepted 20 July 2016; Available online 2 August 2016.

Keywords: Computer aided detection; Mammography; Deep learning; Machine learning; Breast cancer; Convolutional neural networks.

Abstract

Recent advances in machine learning yielded new techniques to train deep neural networks, which resulted in highly successful applications in many pattern recognition tasks such as object detection and speech recognition. In this paper we provide a head-to-head comparison between a state-of-the-art mammography CAD system, relying on a manually designed feature set, and a Convolutional Neural Network (CNN), aiming for a system that can ultimately read mammograms independently. Both systems are trained on a large data set of around 45,000 images, and results show the CNN outperforms the traditional CAD system at low sensitivity and performs comparably at high sensitivity. We subsequently investigate to what extent features such as location and patient information and commonly used manual features can still complement the network, and see improvements at high specificity over the CNN, especially with location and context features, which contain information not available to the CNN. Additionally, a reader study was performed, where the network was compared to certified screening radiologists on a patch level, and we found no significant difference between the network and the readers.

© 2016 Elsevier B.V. All rights reserved.
1. Introduction

Nearly 40 million mammographic exams are performed in the US alone on a yearly basis, arising predominantly from screening programs implemented to detect breast cancer at an early stage, which has been shown to increase chances of survival (Tabar et al., 2003; Broeders et al., 2012). Similar programs have been implemented in many western countries. All this data has to be inspected for signs of cancer by one or more experienced readers, which is a time-consuming, costly and, most importantly, error-prone endeavor. Striving for optimal health care, Computer Aided Detection and Diagnosis (CAD) systems (Giger et al., 2001; Doi, 2007; 2005; van Ginneken et al., 2011) are being developed and are currently widely employed as a second reader (Rao et al., 2010; Malich et al., 2006), with numbers from the US going up to 70% of all screening studies in hospital facilities and 85% in private institutions (Rao et al., 2010). Computers do not suffer from drops in concentration, are consistent when presented with the same input data and can potentially be trained with an incredible amount of training samples, vastly more than any radiologist will experience in his lifetime.

Until recently, the effectiveness of CAD systems and many other pattern recognition applications depended on meticulously handcrafted features, topped off with a learning algorithm to map them to a decision variable. Radiologists are often consulted in the process of feature design and, in the case of mammography, features such as the contrast of the lesion, spiculation patterns and the sharpness of the border are used. These feature transformations provide a platform to instill task-specific, a-priori knowledge, but cause a large bias towards how we humans think the task is performed. Since the inception of Artificial Intelligence (AI) as a scientific discipline, research has seen a shift from rule-based, problem-specific solutions to increasingly generic, problem-agnostic methods based on learning, of which deep learning (Bengio, 2009; Bengio et al., 2013; Schmidhuber, 2015; LeCun et al., 2015) is the most recent manifestation. Directly distilling information from training samples, rather than from the domain expert, deep learning allows us to optimally exploit the ever increasing amounts of data and reduce human bias. For many pattern recognition tasks, this has proven to be successful to such an extent that systems are now reaching human or even superhuman performance (Cireşan et al., 2012; Mnih et al., 2015; He et al., 2015).
The term deep typically refers to the layered non-linearities in the learning systems, which enables the model to represent a function with far fewer parameters and facilitates more efficient learning (Bengio et al., 2007; Bengio, 2009). These models are not new and work has been done since the late seventies (Fukushima, 1980; LeCun et al., 1998). In 2006, however, two papers (Hinton et al., 2006; Bengio et al., 2007) showing deep networks can be trained in a greedy, layer-wise fashion sparked new interest in the topic. Restricted Boltzmann Machines (RBMs), probabilistic generative models, and autoencoders (AEs), one-layer neural networks, were shown to be expedient pattern recognizers when stacked to form Deep Belief Networks (DBNs) (Hinton et al., 2006; Bengio et al., 2007) and Stacked Autoencoders, respectively. Currently, fully supervised Convolutional Neural Networks (CNNs) dominate the leader boards (Krizhevsky et al., 2012; Zeiler and Fergus, 2014; Simonyan and Zisserman, 2014; Ioffe and Szegedy, 2015; He et al., 2015). Their performance increase with respect to the previous decades can largely be attributed to more efficient training methods, advances in hardware such as the employment of many-core computing (Cireşan et al., 2011) and, most importantly, sheer amounts of annotated training data (Russakovsky et al., 2014).

To the best of our knowledge, Sahiner et al. (1996) were the first to attempt a CNN setup for mammography. Instead of raw images, texture maps were fed to a simple network with two hidden layers, producing two and three feature images respectively. The method gave acceptable, but not spectacular results. Many things have changed since this publication, however, not only with regard to statistical learning, but also in the context of acquisition techniques. Screen Film Mammography (SFM) has made way for Digital Mammography (DM), enabling higher quality, raw images in which pixel values have a well-defined physical meaning, and easier spread of large amounts of training data. Given the advances in learning and data, we feel a revisit of CNNs for mammography is more than worthy of exploration.

Work on CAD for mammography (Elter and Horsch, 2009; Nishikawa, 2007; Astley and Gilbert, 2004) has been done since the early nineties but unfortunately, progress has mostly stagnated in the past decade. Methods are being developed on small data sets (Mudigonda et al., 2000; Zheng et al., 2010) which are not always shared, and algorithms are difficult to compare (Elter and Horsch, 2009). Breast cancer has two main manifestations in mammography: firstly the presence of malignant soft tissue or masses and secondly the presence of microcalcifications (Cheng and Huang, 2003), and separate systems are being developed for each. Microcalcifications are often small and can easily be missed by oversight. Some studies suggest CAD for microcalcifications is highly effective in reducing oversight (Malich et al., 2006) with acceptable numbers of false positives. However, the merit of CAD for masses is less clear, with research suggesting human errors do not stem from oversight but rather from misinterpretation (Malich et al., 2006). Some studies show no increase in sensitivity or specificity with CAD (Taylor et al., 2005) for masses, or even a decreased specificity without an improvement in detection rate or characterization of invasive cancers (Fenton et al., 2011; Lehman et al., 2015). We therefore feel motivated to improve upon the state-of-the-art.

In previous work in our group (Hupse et al., 2013) we showed that a sophisticated CAD system taking into account not only local information, but also context, symmetry and the relation between the two views of the same breast can operate at the performance of a resident radiologist, and of a certified radiologist at high specificity. In a different study (Karssemeijer et al., 2004) it was shown that when combining the judgment of up to twelve radiologists, reading performance improved, providing a lower bound on the maximum amount of information in the medium and suggesting ample room for improvement of the current system.

In this paper, we provide a head-to-head comparison between a CNN and a CAD system relying on an exhaustive set of manually designed features, and show the CNN outperforms a state-of-the-art mammography CAD system, trained on a large dataset of around 45,000 images. We will focus on the detection of solid, malignant lesions including architectural distortions, treating benign abnormalities such as cysts or fibroadenomae as false positives. The goal of this paper is not to give an optimally concise set of features, but to use a complete set where all descriptors commonly applied in mammography are represented, providing a fair comparison with the deep learning method. As mentioned by Szegedy et al. (2014), success in the past two years in the context of object recognition can in part be attributed to judiciously combining CNNs with classical computational vision techniques. In this spirit, we employ a candidate detector to obtain a set of suspicious locations, which are subjected to further scrutiny, either by the classical system or the CNN. We subsequently investigate to what extent the CNN is still complementary to traditional descriptors by combining the learned representation with features such as location, contrast and patient information, part of which is not explicitly represented in the patch fed to the network. Lastly, a reader study is performed, where we compare the scores of the CNN to experienced radiologists on a patch level.

The rest of this paper is organized as follows. In the next section, we will give details regarding the candidate detection system, shared by both methods. In Section 3, the CNN will be introduced, followed by a description of the reference system in Section 4. In Section 5, we will describe the experiments performed and present results, followed by a discussion in Section 6 and a conclusion in Section 7.

2. Candidate detection

Before gathering evidence, every pixel is a possible center of a lesion. This approach yields few positives and an overwhelming amount of predominantly obvious negatives. The actual difficult examples could be assumed to be outliers and generalized away, hindering training. Sliding window methods, previously popular in image analysis, are recently losing ground in favor of candidate detection (Hosang et al., 2015) such as selective search (Uijlings et al., 2013) to reduce the search space (Girshick et al., 2014; Szegedy et al., 2014). We therefore follow a two-stage classification procedure where in the first stage, candidates are detected and subjected to further scrutiny in a second stage, similar to the pipeline described in Hupse et al. (2013). Rather than class-agnostic and potentially less accurate candidate detection methods, we use an algorithm designed for mammographic lesions (Karssemeijer and te Brake, 1996). It operates by extracting five features based on first and second order Gaussian kernels: two designed to spot the center of a focal mass, two looking for spiculation patterns characteristic of malignant lesions, and a final feature indicating the size of the optimal response in scale-space.

To generate the pixel-based training set, we extracted positive samples from a disk of constant size inside each annotated malignant lesion in the training set, to sample the same amount from every lesion size and prevent bias for larger areas. To obtain normal pixels for training, we randomly sampled 1 in 300 pixels from normal tissue in normal images, resulting in approximately 130 negative samples per normal image. The resulting samples were used to train a random forest (RF) classifier (Breiman, 2001). RFs can be parallelized easily and are therefore fast to train, are less susceptible to overfitting and are easily adjustable for class imbalance, making them suitable for this task.
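As an illustration of this first-stage training setup, the sketch below pairs the described sampling scheme with a scikit-learn random forest. The container names, disk radius and forest size are assumptions of this sketch, not values from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_pixel_rf(feature_maps, lesion_masks, tissue_masks, rng, radius=10):
    """Train the first-stage pixel classifier sketched in Section 2.

    feature_maps maps image ids to (H, W, 5) arrays of Gaussian-derivative
    features; positives come from a fixed-size disk inside each annotated
    lesion so every lesion contributes equally, negatives from 1 in 300
    normal-tissue pixels of normal images.
    """
    X, y = [], []
    for img_id, fmap in feature_maps.items():
        if img_id in lesion_masks:
            rr, cc = np.nonzero(lesion_masks[img_id])
            cr, ccen = rr.mean(), cc.mean()
            keep = (rr - cr) ** 2 + (cc - ccen) ** 2 <= radius ** 2
            X.append(fmap[rr[keep], cc[keep]])
            y.append(np.ones(keep.sum()))
        else:
            rr, cc = np.nonzero(tissue_masks[img_id])
            sel = rng.random(rr.size) < 1.0 / 300.0
            X.append(fmap[rr[sel], cc[sel]])
            y.append(np.zeros(sel.sum()))
    # Random forests parallelize well and handle class imbalance gracefully.
    clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=-1)
    return clf.fit(np.vstack(X), np.concatenate(y))
```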
To obtain lesion candidates, the RF is applied to all pixel locations in each image, both in the train and test set, generating a likelihood image where each pixel indicates the estimated suspiciousness. Non-maximum suppression was performed on this image and all optima in the likelihood image are treated as candidates and fed as input to both the reference feature system and the CNN. For the reference system, the local optima in the likelihood image are used as seed points for a segmentation algorithm. For the CNN, a patch centered around the location is extracted. An overview of the first stage pipeline is provided in Fig. 1. Fig. 2 illustrates the generated candidates for both systems.

Fig. 2. Two systems are compared. A candidate detector (see Fig. 1) generates a set of candidate locations. A traditional CAD system (left) uses these locations as seed points for a segmentation algorithm. The segmentations are used to compute region based features. The second system based on a CNN (right) uses the same locations as the center of a region of interest.
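The non-maximum suppression step described above can be expressed compactly with a maximum filter. A minimal sketch; the neighborhood size and likelihood threshold are illustrative, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_candidates(likelihood, window=51, threshold=0.1):
    """Treat local optima of the likelihood image as lesion candidates."""
    # A pixel survives if it equals the maximum of its own neighborhood.
    peaks = likelihood == maximum_filter(likelihood, size=window)
    rows, cols = np.nonzero(peaks & (likelihood > threshold))
    scores = likelihood[rows, cols]
    order = np.argsort(scores)[::-1]  # most suspicious first
    return list(zip(rows[order], cols[order], scores[order]))
```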
3. Deep convolutional neural network

In part inspired by human visual processing faculties, CNNs learn hierarchies of filter kernels, in each layer creating a more abstract representation of the data. The term deep generally refers to the nesting of non-linear functions (Bengio, 2009). Multi-Layered Perceptrons (MLPs) have been shown to be universal function approximators, under some very mild assumptions, and therefore there is no theoretical limit that prevents them from learning the same mapping as a deep architecture would. Training, however, has been shown, mostly empirically, to be far more efficient in a deep setting, and the same function can be represented with fewer parameters. Deep CNNs are currently the most proficient for vision and, in spite of the simple mathematics, have been shown to be extremely powerful.

Contemporary architectures roughly comprise convolutional, pooling and fully connected layers. Every convolution results in a feature map, which is downsampled in the pooling layer. The most common form of pooling is max-pooling, in which the maximum of a neighborhood in the feature map is taken. Pooling induces some translation invariance and downscales the image to reduce the number of weights with each layer. It also reduces location precision, however, rendering it less suitable for segmentation tasks. The exact merit of fully connected layers is still an open question.

Each layer applies a non-linear activation function to its output; a common choice is the rectified linear unit (ReLU) (Nair and Hinton, 2010):

f(a) = \max(0, a)    (2)

The parameters \Theta of the network, producing posterior h(X; \Theta) for an input X, are learned by maximum likelihood estimation on the training data \mathcal{D}:

\arg\max_{\Theta} \mathcal{L}(\Theta, \mathcal{D}) = \arg\max_{\Theta} \prod_{n=1}^{N} h(X_n \mid \Theta)    (3)

which for a two-class problem amounts to minimizing the negative log-likelihood, i.e., the cross-entropy:

-\ln P(\mathcal{D} \mid \Theta) = -\sum_{n=1}^{N} \left[ y_n \ln h(X_n; \Theta) + (1 - y_n) \ln(1 - h(X_n; \Theta)) \right]    (4)

where y_n indicates the class label. This can be optimized using gradient descent. For large datasets that do not fit in memory and data with many redundant samples, minibatch Stochastic Gradient Descent (SGD) is typically used. Rather than computing the gradient on the entire set, it is computed in small batches. Standard backpropagation is subsequently used to adjust weights in all layers.
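To make Eqs. (3) and (4) concrete, the sketch below minimizes the cross-entropy with minibatch SGD for a single sigmoid output unit h(X) = sigmoid(w·X + b); in the CNN, the same gradient is propagated through all layers by backpropagation. All hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def minibatch_sgd(X, y, lr=0.01, batch_size=256, epochs=10, seed=0):
    """Minimize the cross-entropy of Eq. (4) with minibatch SGD."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    n = X.shape[0]
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            h = sigmoid(X[idx] @ w + b)  # h(X; theta) on the minibatch
            g = h - y[idx]               # gradient of Eq. (4) w.r.t. the pre-activation
            w -= lr * X[idx].T @ g / len(idx)
            b -= lr * g.mean()
    return w, b
```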
Although powerful, contemporary architectures are not fully invariant to geometric transformations, such as rotation and scale. Data augmentation is typically performed to account for this.

3.1. Data augmentation

Data augmentation is a technique often used in the context of deep learning and refers to the process of generating new samples from data we already have, hoping to ameliorate data scarcity and prevent overfitting. In object recognition tasks in natural images, usually only simple horizontal flipping is performed, but for tasks such as optical character recognition it has been shown that elastic deformations can greatly improve performance (Simard et al., 2003). The main sources of variation in mammography at a lesion level are rotation, scale, translation and the amount of occluding tissue.

We augmented all positive examples with scale and translation transformations. Full scale or translation invariance is neither desired nor required, since the candidate detector is expected to find a patch centered around the actual focal point of the lesion. The problem is not completely scale-invariant either: large lesions in a later stage of growth are not simply scaled-up versions of recently emerged abnormalities. The key is therefore to perform the right amount of translation and scaling in order to generate realistic lesion candidates. To this end, we translate each patch in the training set containing an annotated malignant lesion 16 times, by adding values sampled uniformly from the interval [−25, 25] (0.5 cm) to the lesion center, and scale it 16 times, by adding values from the interval [−30, 30] (0.6 cm) to the top left and bottom right of the bounding box. After this, all patches, including the normals, were rotated using simple flipping actions, which can be computed on the fly, to generate three more samples. This results in (1 + 16 + 16) × 4 = 132 patches per positive lesion and 4 per negative. Examples of the range of scaling and translation augmentation are given in Fig. 3.

Fig. 3. Examples of scaling and translation of the patches. The top left image is the original patch; the second and third images in the top row are examples of the smallest and largest scaling employed. The bottom row indicates the extrema in the range of translation used.
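A sketch of this augmentation scheme; it only produces the geometric parameters of each augmented patch, and the cropping and flipping of pixel data are left out. Function and variable names are hypothetical.

```python
import numpy as np

def augment_lesion(center, bbox, rng, n_shift=16, n_scale=16):
    """Generate the 33 patch geometries for one annotated malignant lesion.

    16 translations add uniform values from [-25, 25] px (0.5 cm) to the
    lesion center; 16 scalings add values from [-30, 30] px (0.6 cm) to the
    top-left and bottom-right bounding-box corners.
    """
    (r, c), (r0, c0, r1, c1) = center, bbox
    specs = [(center, bbox)]  # the original patch
    for _ in range(n_shift):
        dr, dc = rng.uniform(-25, 25, size=2)
        specs.append(((r + dr, c + dc), bbox))
    for _ in range(n_scale):
        d0, d1 = rng.uniform(-30, 30, size=2)
        specs.append((center, (r0 + d0, c0 + d0, r1 + d1, c1 + d1)))
    # Each patch is subsequently flipped three ways (left-right, up-down,
    # both), giving (1 + 16 + 16) * 4 = 132 samples per lesion.
    return specs
```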
4. Reference system

Fig. 4. A lesion (a), its segmentation (b), areas used for computing contrast features (c) and areas used for computing margin contrast (d).

boundary and add the output of the candidate detector at the found optimum. This gives us a set of nine outputs we call candidate detector features.
(Haralick et al., 1973) at two different scales (entropy, contrast, correlation, energy and homogeneity) were added to give a set of 21 texture descriptors.

4.4. Geometrical features

Regularity of the border of a lesion is often used to classify lesions. Again, expedient computation relies heavily on proper segmentations. Nevertheless, we have incorporated five simple topology descriptors as proposed by Peura and Iivarinen (1997) in the system. These are eccentricity, convexity, compactness, circular variance and elliptic variance. In order to capture more of the 3D shape, we extended these descriptors to also work in 3 dimensions. The lesion was smoothed with a Gaussian kernel first, and we computed 3D eccentricity (the ratio between the largest and smallest eigenvalue of the point cloud), 3D compactness (the ratio of the surface area to the volume), spherical deviance (the average deviation of each point from a sphere) and elliptical deviance (the average deviation of each point to an ellipse fitted to the point cloud). Since convex hull algorithms in 3D suffer from relatively high computational complexity, convexity was not extended. On top of this, we added a feature measuring reflective symmetry: the region is divided into radial and angular bins, and the average difference in pixel intensity between opposing bins is summed and normalized by the size of the region. Lastly, the area of the segmented region is added, giving us a set of 10 geometric features.
a set of 10 geometric features. occur at larger scales and therefore this does not cause any loss of
information.
4.5. Location features An overview of the data is provided in Table 1. With the term
case, we refer to all screening images recorded from a single pa-
Lesions are more likely to occur in certain parts of the breast tient. Each case consists of several exams taken at typically a two
than others and other structures such as lymph nodes are more year interval and each exam typically comprises four views, two
common in the pectoralis than in other parts of the breast. To cap- of each breast, although these numbers vary: some patients skip
ture this, we use a simple coordinate system. The area of the breast a screening and for some exams only one view of each breast is
and pectoral muscle are segmented using thresholding and a poly- recorded. For training and testing, we selected all regions found
nomial fit. We subsequently estimate the nipple location by taking by the candidate detector. The train, validation and test set were
the largest distance to the chest wall and a central landmark in all split on a patient level to prevent any bias. The train and vali-
the chest wall is taken as the row location of the center of gravity. dation set comprise 44,090 mammographic views, from which we
From this, we extract: (1) the distance to the nipple (2) the same, used 39,872 for training and 4218 for validation. The test set con-
but normalized for the size of the breast, (3) the distance to the sisted of 18,182 images of 2064 patients with 271 malignant an-
chest wall and (4) the fraction of the lesion that lies in the pec- notated lesions. A total of 30 views from 20 exams in the test set
toral muscle. contained an interval cancer that was visible in the mammogram
or were taken prior to a screen detected cancer, with the abnor-
4.6. Context features

To add more information about the surroundings of the lesion, we added three context features as described by Hupse and Karssemeijer (2009). The features again make use of the candidate detector and assume the posterior of pixels in the rest of the breast conveys some information about the nature of the lesion in question. The first feature averages the output around the lesion, the second in a band at a fixed distance from the nipple, and a third takes the whole segmented breast into account. On top of this, we added the posterior of the candidate detector, normalized by the sum of the top three and top five lesions in the breast, to give us five context features in total.
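A sketch of the five context features; the three region masks follow Hupse and Karssemeijer (2009) and are assumed precomputed here, as are the sorted detector posteriors of all candidates in the breast.

```python
import numpy as np

def context_features(likelihood, posterior, around_mask, band_mask, breast_mask, top_scores):
    """likelihood: candidate detector output image; posterior: detector output
    at the candidate itself; top_scores: candidate posteriors in this breast,
    sorted highest first."""
    return (
        float(likelihood[around_mask].mean()),  # average output around the lesion
        float(likelihood[band_mask].mean()),    # band at a fixed distance from the nipple
        float(likelihood[breast_mask].mean()),  # whole segmented breast
        posterior / top_scores[:3].sum(),       # normalized by the top three lesions
        posterior / top_scores[:5].sum(),       # normalized by the top five lesions
    )
```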
4.7. Patient features

Lastly, we added the age of the patient, which is an important risk factor. From the age, we also estimate the screening round by subtracting 50 (the age at which screening starts in The Netherlands) and dividing by 2 (the step size of the screening). This gives us two features.
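The two patient features amount to a single arithmetic rule; a minimal sketch:

```python
def patient_features(age):
    """Age and the estimated screening round: screening in The Netherlands
    starts at age 50 and repeats every two years."""
    return age, (age - 50) / 2
```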
Note that the last three sets of features provide information outside of the patch fed to the CNN. Even if the network is able to exploit all information in the training set, these could still supply complementary information regarding the nature of the lesion.

5. Experiments

5.1. Data

The mammograms used were collected from a large scale screening program in The Netherlands (bevolkingsonderzoek midden-west) and recorded using a Hologic Selenia digital mammography system. All tumours are biopsy-proven malignancies and annotated by an experienced reader. Before presentation to a radiologist, the manufacturer applies some processing to optimize the image for viewing by a human. To prevent information loss and bias, we used the raw images instead and only applied a log transform, which results in pixel values being linearly related to the attenuation coefficient. Images were scaled from 70 micron to 200 micron for faster processing. Structures important for detecting lesions occur at larger scales, and therefore this does not cause any loss of information.

An overview of the data is provided in Table 1. With the term case, we refer to all screening images recorded from a single patient. Each case consists of several exams taken at typically a two year interval, and each exam typically comprises four views, two of each breast, although these numbers vary: some patients skip a screening and for some exams only one view of each breast is recorded. For training and testing, we selected all regions found by the candidate detector. The train, validation and test set were split on a patient level to prevent any bias. The train and validation set comprise 44,090 mammographic views, from which we used 39,872 for training and 4218 for validation. The test set consisted of 18,182 images of 2064 patients with 271 malignant annotated lesions. A total of 30 views from 20 exams in the test set contained an interval cancer that was visible in the mammogram or were taken prior to a screen detected cancer, with the abnormality already visible.

Table 1
Overview of the data. Pos refers to the number of malignant lesions and neg to the number of normals.

         Cases          Exams           Images          Candidates
         Pos    Neg     Pos    Neg      Pos    Neg      Pos    Neg
Train    296    6433    358    11,780   634    39,872   634    213,450
Valid.   35     710     42     1247     85     4218     85     19,460
Test     124    2064    124    5317     271    18,182   271    180,777

Before patch extraction in the CNN system, we segmented all lesions in the training set in order to get the largest possible lesion, and chose the patch size with an extra margin, resulting in patches of size 250 × 250 pixels (5 × 5 cm). The pixel values in the patches were scaled using simple min-max scaling, with values calculated over the whole training set. We experimented with scaling the patches locally, but this seemed to perform slightly, though not significantly, worse on the validation set. All interpolation processes were done with bilinear interpolation. Since some candidates occur at the border of the imaged breast, we pad the image with zeros. Negative examples were only taken from normal images. Annotated benign samples such as cysts and fibroadenomae were removed from the training set. However, not all benign lesions in our data are annotated and therefore some may have ended up in the train or validation set as negatives. After augmentation, the train set consisted of 334,752 positive patches and 853,800 negatives. When combining the train and validation set, this amounts to 379,632 positive and 931,640 negative patches.
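The preprocessing described in this section can be summarized as follows; the epsilon guarding the log and the use of scipy.ndimage.zoom are assumptions of this sketch, and lo/hi stand for the min-max constants computed over the whole training set.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(raw, lo, hi, eps=1.0, in_um=70, out_um=200):
    """Log transform, resampling from 70 to 200 micron and min-max scaling."""
    img = np.log(raw.astype(np.float64) + eps)  # linear in the attenuation coefficient
    img = zoom(img, in_um / out_um, order=1)    # order=1: bilinear interpolation
    return (img - lo) / (hi - lo)
```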
5.2. Training and classification details

For the second stage classification, we have experimented with several classifiers (SVMs with several different kernels, Gradient
Table 2
Overview of results of individual feature sets, along with the 95% confidence interval (CI) obtained using 5000 bootstraps.

Table 3
Overview of results of the CNN combined with individual feature sets.

Table 4
AUC values obtained when training the network on subsets of malignant lesions in the training set, keeping the same amount of normals.

Data augmentation    60%      All
With                 0.842    0.929
Without              0.685    0.875

Table 5
Partial area under the FROC of different systems. P-values refer to the comparison between the CNN with additional features and the CNN without the specific feature group. In this case, the reference system is the full system, including context, location and patient information.

Fig. 8. Lesion based FROC of the three systems. Please note that this concerns the full reference system, where context, location and patient features are incorporated.

Fig. 12. Comparison between the CNN and three experienced readers on a patch level.
6. Discussion

networks can become beneficial. The problem could be less complex than classifying natural images, since it concerns a two-class classification in the current setting and we are dealing with gray scale images, contrary to the thousands of classes and RGB data in ImageNet (Russakovsky et al., 2014). Therefore, more shallow and lower capacity networks than the one found optimal for natural images could suffice for this particular problem.

In our work, we made extensive use of data augmentation in the form of simple geometric transformations. We have also experimented with full rotation, but this creates lesions not expected during testing, due to the zero padding. This could be prevented using test time augmentation, but when used in a sliding window fashion this is not convenient. The ROC curves in Fig. 7 show a clear increase in performance for the full data set. The results in Table 4 show the current data augmentation scheme improves performance for large amounts of data but not for small amounts of data. We suspect that in the latter setting the network overfits and more regularization is needed. These results may be different when fully optimizing the architecture and augmentation procedure for each setting individually. More research is needed to draw clear conclusions. However effective, data augmentation is a rather computationally costly procedure. A more elegant approach would be to add the invariance properties in the network architecture, which is currently being investigated in several papers (Gens and Domingos, 2014; Jaderberg et al., 2015). On top of the geometric transforms, occluding tissue is an important source of variance, which is more challenging to explicitly code in the network architecture. In future work, we plan to explore simulation methods for this.

In this work, we have employed a previously developed candidate detector. This has two main advantages: (1) it is fast and accurate and (2) the comparison with the traditional CAD system is straightforward and fair, since exactly the same candidate locations are trained with and evaluated on. The main disadvantage is that the sensitivity is not a hundred percent, which causes lesions to be missed, although the case-based performance is close to optimal. In future work, we plan to explore other methods, such as the strategy put forth by Cireşan et al. (2013), to train the system end-to-end. This will make training and classification less cumbersome and has the potential to increase the sensitivity of the system.

In this work we have compared the CNN to a state-of-the-art CAD system (Hupse et al., 2013), which was combined with several other features commonly used in the mammography CAD literature. A random forest was subsequently used, which performs feature selection during its training stage. We think the feature set we used is sufficiently exhaustive to include most features commonly used in the literature and therefore think similar conclusions hold for other state-of-the-art CAD systems. To the best of our knowledge, the Digital Database of Screening Mammography (DDSM) is the only publicly available data set, which comprises digitized screen film mammograms. Since almost all screening centers have migrated to digital mammography, we have elected not to run our system on this data set, because we think the clinical relevance is arguable. On top of this, since this entails a transfer learning problem, the system may require retraining to adapt to the older modality.

The reader study illustrates the network is not far from the radiologists' performance, but still substantially below the mean of the readers, suggesting a large performance increase is still possible. We suspect that some other augmentation methods as discussed above could push the network a bit further, but expect more training data, when it becomes available, will be the most important factor. Also, we feel still employing some handcrafted features that specifically target weaknesses of the CNN may be a good strategy and may be more pragmatic and effective than adding thousands of extra samples to the training set.

7. Conclusion

In this paper we have shown that a deep learning model in the form of a Convolutional Neural Network (CNN) trained on a large data set of mammographic lesions outperforms a state-of-the-art system in Computer Aided Detection (CAD) and therefore has great potential to advance the field of research. A major advantage is that the CNN learns from data and does not rely on domain experts, making development easier and faster. We have shown that location information and context can easily be added to the network, and that several manually designed features can give some small improvements, mostly in the form of 'common sense': obviously false negatives will no longer be considered as such. On top of this, we have compared the CNN to a group of three experienced readers on a patch level, two of whom were certified radiologists, and have shown that the human readers and CNN have similar performance.

Acknowledgements

This research was funded by grant KUN 2012-557 of the Dutch Cancer Society and supported by the Foundation of Population Screening Mid West.

References

Astley, S.M., Gilbert, F.J., 2004. Computer-aided detection in mammography. Clinic. Radiol. 59, 390–399.
Bengio, Y., 2009. Learning deep architectures for AI. Found. Trends Mach. Learn. 2, 1–127.
Bengio, Y., 2012. Practical recommendations for gradient-based training of deep architectures. In: Neural Networks: Tricks of the Trade. Springer, pp. 437–478.
Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828.
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-wise training of deep networks. In: Advances in Neural Information Processing Systems, pp. 153–160.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y., 2010. Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), p. 3.
Bornefalk, H., Hermansson, A.B., 2005. On the comparison of FROC curves in mammography CAD systems. Med. Phys. 32, 412–417.
te Brake, G.M., Karssemeijer, N., 2001. Segmentation of suspicious densities in digital mammograms. Med. Phys. 28, 259–266.
te Brake, G.M., Karssemeijer, N., Hendriks, J.H., 2000. An automatic method to discriminate malignant masses from normal tissue in digital mammograms. Phys. Med. Biol. 45, 2843–2857.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Broeders, M., Moss, S., Nyström, L., Njor, S., Jonsson, H., Paap, E., Massat, N., Duffy, S., Lynge, E., Paci, E., 2012. The impact of mammographic screening on breast cancer mortality in Europe: a review of observational studies. J. Med. Screening 19, 14–25.
Cheng, S.C., Huang, Y.M., 2003. A novel approach to diagnose diabetes based on the fractal characteristics of retinal images. IEEE Trans. Inf. Technol. Biomed. 7, 163–170.
Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J., 2013. Mitosis detection in breast cancer histology images with deep neural networks. In: Medical Image Computing and Computer-Assisted Intervention, pp. 411–418.
Cireşan, D.C., Meier, U., Masci, J., Maria Gambardella, L., Schmidhuber, J., 2011. Flexible, high performance convolutional neural networks for image classification. In: International Joint Conference on Artificial Intelligence, p. 1237.
Cireşan, D.C., Meier, U., Masci, J., Schmidhuber, J., 2012. Multi-column deep neural network for traffic sign classification. Neural Netw. 32, 333–338.
Dauphin, Y.N., de Vries, H., Chung, J., Bengio, Y., 2015. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv:1502.04390.
Doi, K., 2005. Current status and future potential of computer-aided diagnosis in medical imaging. British J. Radiol. 78 Spec No 1, S3–S19.
Doi, K., 2007. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput. Med. Imag. Graph. 31, 198–211.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. Annals Stat. 7, 1–26.
Elter, M., Horsch, A., 2009. CADx of mammographic masses and clustered microcalcifications: a review. Med. Phys. 36, 2052–2068.
Fenton, J.J., Abraham, L., Taplin, S.H., Geller, B.M., Carney, P.A., D'Orsi, C., Elmore, J.G., Barlow, W.E., 2011. Effectiveness of computer-aided detection in community mammography practice. J. Nation. Cancer Inst. 103, 1152–1161.
Fukushima, K., 1980. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202.
Gens, R., Domingos, P.M., 2014. Deep symmetry networks. In: Advances in Neural Information Processing Systems, pp. 2537–2545.
Giger, M.L., Karssemeijer, N., Armato, S.G., 2001. Computer-aided diagnosis in medical imaging. IEEE Trans. Med. Imag. 20, 1205–1208.
van Ginneken, B., Schaefer-Prokop, C., Prokop, M., 2011. Computer-aided diagnosis: how to move from the laboratory to the clinic. Radiology 261, 719–732.
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Computer Vision and Pattern Recognition. IEEE, pp. 580–587.
Haralick, R.M., Shanmugam, K., Dinstein, I., 1973. Textural features for image classification. IEEE Trans. Syst. Man Cybern. 3, 610–621.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Comput. Vis. Pattern Recognit. 1026–1034. arXiv:1502.01852.
Hinton, G.E., Osindero, S., Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554.
Hosang, J., Benenson, R., Dollár, P., Schiele, B., 2015. What makes for effective detection proposals? arXiv:1502.05082.
Hupse, R., Karssemeijer, N., 2009. Use of normal tissue context in computer-aided detection of masses in mammograms. IEEE Trans. Med. Imag. 28, 2033–2041.
Hupse, R., Samulski, M., Lobbes, M., den Heeten, A., Imhof-Tas, M.W., Beijerinck, D., Pijnappel, R., Boetes, C., Karssemeijer, N., 2013. Standalone computer-aided detection compared to radiologists' performance for the detection of mammographic masses. Eur. Radiol. 23, 93–100.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K., 2015. Spatial transformer networks. arXiv:1506.02025.
Karssemeijer, N., te Brake, G., 1996. Detection of stellate distortions in mammograms. IEEE Trans. Med. Imag. 15, 611–619.
Karssemeijer, N., Otten, J.D., Roelofs, A.A.J., van Woudenberg, S., Hendriks, J.H.C.L., 2004. Effect of independent multiple reading of mammograms on detection performance. In: Medical Imaging, pp. 82–89.
Kooi, T., Karssemeijer, N., 2014. Invariant features for discriminating cysts from solid lesions in mammography. In: Breast Imaging. Springer, pp. 573–580.
Krizhevsky, A., Sutskever, I., Hinton, G., 2012. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25, pp. 1097–1105.
Kupinski, M.A., Giger, M.L., 1998. Automated seeded lesion segmentation on digital mammograms. IEEE Trans. Med. Imag. 17, 510–517.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324.
Lehman, C.D., Wellman, R.D., Buist, D.S.M., Kerlikowske, K., Tosteson, A.N.A., Miglioretti, D.L., B.C.S.C., 2015. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern. Med. 175, 1828–1837.
Maddison, C.J., Huang, A., Sutskever, I., Silver, D., 2014. Move evaluation in Go using deep convolutional neural networks. arXiv:1412.6564.
Malich, A., Fischer, D.R., Böttcher, J., 2006. CAD for mammography: the technique, results, current role and further developments. Eur. Radiol. 16, 1449–1460.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., 2015. Human-level control through deep reinforcement learning. Nature 518, 529–533.
Mudigonda, N.R., Rangayyan, R.M., Desautels, J.E., 2000. Gradient and texture analysis for the classification of mammographic masses. IEEE Trans. Med. Imag. 19, 1032–1043.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning, pp. 807–814.
Nishikawa, R.M., 2007. Current status and future directions of computer-aided diagnosis in mammography. Comput. Med. Imag. Graph. 31, 224–235.
Peura, M., Iivarinen, J., 1997. Efficiency of simple shape descriptors. In: Proceedings of the Third International Workshop on Visual Form, p. 451.
Rangayyan, R.M., El-Faramawy, N.M., Desautels, J.E.L., Alim, O.A., 1997. Measures of acutance and shape for classification of breast tumors. IEEE Trans. Med. Imag. 16, 799–810.
Rao, V.M., Levin, D.C., Parker, L., Cavanaugh, B., Frangos, A.J., Sunshine, J.H., 2010. How widely is computer-aided detection used in screening and diagnostic mammography? J. Am. College Radiol. 7, 802–805.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., 2014. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 1–42.
Sahiner, B., Chan, H.P., Petrick, N., Wei, D., Helvie, M.A., Adler, D.D., Goodsitt, M.M., 1996. Classification of mass and normal breast tissue: a convolution neural network classifier with spatial domain and texture images. IEEE Trans. Med. Imag. 15, 598–610.
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. 61, 85–117.
Simard, P.Y., Steinkraus, D., Platt, J.C., 2003. Best practices for convolutional neural networks applied to visual document analysis. In: Document Analysis and Recognition, pp. 958–963.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958.
Sutskever, I., Martens, J., Dahl, G., Hinton, G., 2013. On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2014. Going deeper with convolutions. arXiv:1409.4842.
Tabar, L., Yen, M.F., Vitak, B., Chen, H.H.T., Smith, R.A., Duffy, S.W., 2003. Mammography service screening and mortality in breast cancer patients: 20-year follow-up before and after introduction of screening. Lancet 361, 1405–1410.
Taylor, P., Champness, J., Given-Wilson, R., Johnston, K., Potts, H., 2005. Impact of computer-aided detection prompts on the sensitivity and specificity of screening mammography. Health Technol. Assess. 9, iii, 1–58.
Timp, S., Karssemeijer, N., 2004. A new 2D segmentation method based on dynamic programming applied to computer aided detection in mammography. Med. Phys. 31, 958–971.
Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W., 2013. Selective search for object recognition. Int. J. Comput. Vis. 104, 154–171.
Zeiler, M.D., Fergus, R., 2014. Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833.
Zheng, B., Wang, X., Lederman, D., Tan, J., Gur, D., 2010. Computer-aided detection: the effect of training databases on detection of subtle breast masses. Acad. Radiol. 17, 1401–1408.