0% found this document useful (0 votes)

17 views12 pages

Large Scale Deep Learning For Computer Aided Detection of Mammographic Lesions

The paper compares a traditional computer-aided detection (CAD) system with a Convolutional Neural Network (CNN) for detecting mammographic lesions, using a dataset of approximately 45,000 images. Results indicate that the CNN outperforms the traditional CAD system at low sensitivity levels and performs comparably at high sensitivity, while also exploring the potential of combining CNN features with traditional descriptors for improved specificity. A reader study shows no significant difference in performance between the CNN and certified radiologists at the patch level.

Uploaded by

cindycinderella009

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views12 pages

Large Scale Deep Learning For Computer Aided Detection of Mammographic Lesions

Uploaded by

cindycinderella009

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Large scale deep learning for computer aided detection of mammographic lesions

Kooi, T.; Litjens, G.J.; Ginneken, B. van; Gubern Merida, A.; Sanchez, C.I.; Mann, R.M.; Heeten, A.
den; Karssemeijer, N.
2017, Article / Letter to editor (Medical Image Analysis, 35, (2017), pp. 303-312)
Doi link to publisher: https://doi.org/10.1016/j.media.2016.07.007

Version of the following full text: Publisher’s version

Published under the terms of article 25fa of the Dutch copyright act. Please follow this link for the
Terms of Use: https://repository.ubn.ru.nl/page/termsofuse
Downloaded from: https://hdl.handle.net/2066/173029
Download date: 2025-02-08

Note:
To cite this publication please use the final published version (if applicable).
Article 25fa pilot End User Agreement

This publication is distributed under the terms of Article 25fa of the Dutch Copyright Act (Auteurswet)
with explicit consent by the author. Dutch law entitles the maker of a short scientific work funded either
wholly or partially by Dutch public funds to make that work publicly available for no consideration
following a reasonable period of time after the work was first published, provided that clear reference is
made to the source of the first publication of the work.

This publication is distributed under The Association of Universities in the Netherlands (VSNU) ‘Article
25fa implementation’ pilot project. In this pilot research outputs of researchers employed by Dutch
Universities that comply with the legal requirements of Article 25fa of the Dutch Copyright Act are
distributed online and free of cost or other barriers in institutional repositories. Research outputs are
distributed six months after their first online publication in the original published version and with
proper attribution to the source of the original publication.

You are permitted to download and use the publication for personal purposes. All rights remain with the
author(s) and/or copyrights owner(s) of this work. Any use of the publication other than authorised
under this licence or copyright law is prohibited.

If you believe that digital publication of certain material infringes any of your rights or (privacy)
interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library
will make the material inaccessible and/or remove it from the website. Please contact the Library
through email: [email protected], or send a letter to:

University Library
Radboud University
Copyright Information Point
PO Box 9100
6500 HA Nijmegen

You will be contacted as soon as possible.

Medical Image Analysis 35 (2017) 303–312

Contents lists available at ScienceDirect

Medical Image Analysis

journal homepage: www.elsevier.com/locate/media

Large scale deep learning for computer aided detection of

mammographic lesions
Thijs Kooi a,∗, Geert Litjens a, Bram van Ginneken a, Albert Gubern-Mérida a,
Clara I. Sánchez a, Ritse Mann a, Ard den Heeten b, Nico Karssemeijer a
a
Diagnostic Image Analysis Group, Department of Radiology, Radboud University Medical Center, Nijmegen, The Netherlands
b
Department of Radiology, University Medical Centre Amsterdam, Amsterdam, The Netherlands

a r t i c l e i n f o a b s t r a c t

Article history: Recent advances in machine learning yielded new techniques to train deep neural networks, which re-
Received 11 February 2016 sulted in highly successful applications in many pattern recognition tasks such as object detection and
Revised 12 July 2016
speech recognition. In this paper we provide a head-to-head comparison between a state-of-the art in
Accepted 20 July 2016
mammography CAD system, relying on a manually designed feature set and a Convolutional Neural Net-
Available online 2 August 2016
work (CNN), aiming for a system that can ultimately read mammograms independently. Both systems are
Keywords: trained on a large data set of around 45,0 0 0 images and results show the CNN outperforms the traditional
Computer aided detection CAD system at low sensitivity and performs comparable at high sensitivity. We subsequently investigate
Mammography to what extent features such as location and patient information and commonly used manual features
Deep learning can still complement the network and see improvements at high specificity over the CNN especially with
Machine learning location and context features, which contain information not available to the CNN. Additionally, a reader
Breast cancer
study was performed, where the network was compared to certified screening radiologists on a patch
Convolutional neural networks
level and we found no significant difference between the network and the readers.
© 2016 Elsevier B.V. All rights reserved.

1. Introduction training samples, vastly more than any radiologist will experience
in his lifetime.
Nearly 40 million mammographic exams are performed in the Until recently, the effectiveness of CAD systems and many other
US alone on a yearly basis, arising predominantly from screening pattern recognition applications depended on meticulously hand-
programs implemented to detect breast cancer at an early stage, crafted features, topped off with a learning algorithm to map it to
which has been shown to increase chances of survival (Tabar et al., a decision variable. Radiologists are often consulted in the process
2003; Broeders et al., 2012). Similar programs have been imple- of feature design and features such as the contrast of the lesion,
mented in many western countries. All this data has to be in- spiculation patterns and the sharpness of the border are used, in
spected for signs of cancer by one or more experienced readers the case of mammography. These feature transformations provide
which is a time consuming, costly and most importantly error a platform to instill task-specific, a-priori knowledge, but cause a
prone endeavor. Striving for optimal health care, Computer Aided large bias towards how we humans think the task is performed.
Detection and Diagnosis (CAD) (Giger et al., 20 01; Doi, 20 07; 20 05; Since the inception of Artificial Intelligence (AI) as a scientific dis-
van Ginneken et al., 2011) systems are being developed and are cipline, research has seen a shift from rule-based, problem spe-
currently widely employed as a second reader (Rao et al., 2010; cific solutions to increasingly generic, problem agnostic methods
Malich et al., 2006), with numbers from the US going up to 70% of based on learning, of which deep learning (Bengio, 2009; Bengio
all screening studies in hospital facilities and 85% in private insti- et al., 2013; Schmidhuber, 2015; LeCun et al., 2015) is its most
tutions (Rao et al., 2010). Computers do not suffer from drops in recent manifestation. Directly distilling information from training
concentration, are consistent when presented with the same input samples, rather than the domain expert, deep learning allows us to
data and can potentially be trained with an incredible amount of optimally exploit the ever increasing amounts of data and reduce
human bias. For many pattern recognition tasks, this has proven to
be successful to such an extent that systems are now reaching hu-
man or even superhuman performance (Cireşan et al., 2012; Mnih
∗
Corresponding author. et al., 2015; He et al., 2015).
E-mail address: [email protected], [email protected] (T. Kooi).

http://dx.doi.org/10.1016/j.media.2016.07.007
1361-8415/© 2016 Elsevier B.V. All rights reserved.
304 T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312

The term deep typically refers to the layered non-linearities In this paper, we provide a head-to-head comparison between
in the learning systems, which enables the model to represent a CNN and a CAD system relying on an exhaustive set of manu-
a function with far less parameters and facilitates more efficient ally designed features and show the CNN outperforms a state-of-
learning (Bengio et al., 2007; Bengio, 2009). These models are not the-art mammography CAD system, trained on a large dataset of
new and work has been done since the late seventies (Fukushima, around 45,0 0 0 images. We will focus on the detection of solid,
1980; Lecun et al., 1998). In 2006, however, two papers (Hinton malignant lesions including architectural distortions, treating be-
et al., 2006; Bengio et al., 2007) showing deep networks can be nign abnormalities such as cysts or fibroadenomae as false posi-
trained in a greedy, layer-wise fashion sparked new interest in the tives. The goal of this paper is not to give an optimally concise
topic. Restricted Boltzmann Machines (RBM’s), probabilistic gener- set of features, but to use a complete set where all descriptors
ative models, and autoencoders (AE), one layer neural networks, commonly applied in mammography are represented and provide
were shown to be expedient pattern recognizers when stacked a fair comparison with the deep learning method. As mentioned
to form Deep Belief Networks (DBN) (Hinton et al., 2006; Ben- by Szegedy et al. (2014), success in the past two years in the con-
gio et al., 2007) and Stacked Autoencoders, respectively. Currently, text of object recognition can in part be attributed to judiciously
fully supervised, Convolutional Neural Networks (CNN) dominate combining CNNs with classical computational vision techniques. In
the leader boards (Krizhevsky et al., 2012; Zeiler and Fergus, this spirit, we employ a candidate detector to obtain a set of sus-
2014; Simonyan and Zisserman, 2014; Ioffe and Szegedy, 2015; He picious locations, which are subjected to further scrutiny, either by
et al., 2015). Their performance increase with respect to the pre- the classical system or the CNN. We subsequently investigate to
vious decades can largely be attributed to more efficient training what extent the CNN is still complementary to traditional descrip-
methods, advances in hardware such as the employment of many tors by combining the learned representation with features such
core computing (Cireşan et al., 2011) and most importantly, sheer as location, contrast and patient information, part of which are not
amounts of annotated training data (Russakovsky et al., 2014). explicitly represented in the patch fed to the network. Lastly, a
To the best of our knowledge, Sahiner et al. (1996) were the reader study is performed, where we compare the scores of the
first to attempt a CNN setup for mammography. Instead of raw im- CNN to experienced radiologists on a patch level.
ages, texture maps were fed to a simple network with two hidden The rest of this paper is organized as follows. In the next sec-
layers, producing two and three feature images respectively. The tion, we will give details regarding the candidate detection system,
method gave acceptable, but not spectacular results. Many things shared by both methods. In Section 3, the CNN will be introduced
have changed since this publication, however, not only with regard followed by a description of the reference system in Section 4. In
to statistical learning, but also in the context of acquisition tech- Section 5, we will describe the experiments performed and present
niques. Screen Film Mammography (SFM) has made way for Dig- results, followed by a discussion in Section 6 and conclusion in
ital Mammography (DM), enabling higher quality, raw images in Section 7.
which pixel values have a well-defined physical meaning and eas-
ier spread of large amounts of training data. Given the advances in 2. Candidate detection
learning and data, we feel a revisit of CNNs for mammography is
more than worthy of exploration. Before gathering evidence, every pixel is a possible center of
Work on CAD for mammography (Elter and Horsch, 2009; a lesion. This approach yields few positives and an overwhelming
Nishikawa, 2007; Astley and Gilbert, 2004) has been done since amount of predominantly obvious negatives. The actual difficult
the early nineties but unfortunately, progress has mostly stag- examples could be assumed to be outliers and generalized away,
nated in the past decade. Methods are being developed on small hindering training. Sliding window methods, previously popular in
data sets (Mudigonda et al., 20 0 0; Zheng et al., 2010) which image analysis are recently losing ground in favor of candidate de-
are not always shared and algorithms are difficult to compare tection (Hosang et al., 2015) such as selective search (Uijlings et al.,
(Elter and Horsch, 2009). Breast cancer has two main manifesta- 2013) to reduce the search space (Girshick et al., 2014; Szegedy
tions in mammography, firstly the presence of malignant soft tis- et al., 2014). We therefore follow a two-stage classification pro-
sue or masses and secondly the presence of microcalcifications cedure where in the first stage, candidates are detected and sub-
(Cheng and Huang, 2003) and separate systems are being devel- jected to further scrutiny in a second stage, similar to the pipeline
oped for each. Microcalcifications are often small and can easily described in Hupse et al. (2013). Rather than class agnostic and
be missed by oversight. Some studies suggest CAD for microcal- potentially less accurate candidate detection methods, we use an
cifications is highly effective in reducing oversight (Malich et al., algorithm designed for mammographic lesions (Karssemeijer and
2006) with acceptable numbers of false positives. However, the te Brake, 1996). It operates by extracting five features based on first
merit of CAD for masses is less clear, with research suggesting hu- and second order Gaussian kernels, two designed to spot the cen-
man errors do not stem from oversight but rather misinterpreta- ter of a focal mass and two looking for spiculation patterns, char-
tion (Malich et al., 2006). Some studies show no increase in sen- acteristic of malignant lesions. A final feature indicates the size of
sitivity or specificity with CAD (Taylor et al., 2005) for masses or optimal response in scale-space.
even a decreased specificity without an improvement in detection To generate the pixel based training set, we extracted positive
rate or characterization of invasive cancers (Fenton et al., 2011; samples from a disk of constant size inside each annotated malig-
Lehman et al., 2015). We therefore feel motivated to improve upon nant lesion in the training set, to sample the same amount from
the state-of-the art. every lesion size and prevent bias for larger areas. To obtain nor-
In previous work in our group (Hupse et al., 2013) we showed mal pixels for training, we randomly sampled 1 in 300 pixels from
that a sophisticated CAD system taking into account not only local normal tissue in normal images, resulting in approximately 130
information, but also context, symmetry and the relation between negative samples per normal image. The resulting samples were
the two views of the same breast can operate at the performance used to train a random forest (Breiman, 2001) (RF) classifier. RFs
of a resident radiologist and of a certified radiologist at high speci- can be parallelized easily and are therefore fast to train, are less
ficity. In a different study (Karssemeijer et al., 2004) it was shown susceptible to overfitting and easily adjustable for class-imbalance
that when combining the judgment of up to twelve radiologists, and therefore suitable for this task.
reading performance improved, providing a lower bound on the To obtain lesion candidates, the RF is applied to all pixel loca-
maximum amount of information in the medium and suggesting tions in each image, both in the train and test set, generating a
ample room for improvement of the current system. likelihood image, where each pixel indicates the estimated suspi-
T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312 305

research question, but many studies report an increase in perfor-

mance with these in the architectures.
If we let YkL denote the kth feature map of layer L, generated by
convolution with kernel Wk , it is computed according to:

YkL = f (WkL ∗ YL−1 + bkL ) (1)

with ∗ the convolution operator and f(·) a non-linear activation

Fig. 1. Illustration of the candidate detection pipeline. A candidate detector is function and bkL a bias term. Traditional MLPs use sigmoidal func-
trained using ﬁve pixel features and applied to all pixesl in all images, generating a tions to facilitate learning of non-linearly separable problems.
likelihood image. Local optima in the likelihood image are used as seed points for
both the reference system and the CNN. (See Fig. 2).
However, Rectiﬁed Linear Units (ReLU):

f (a ) = max(0, a ) (2)

of activation a, have been shown to be easier to train, since the

activation is not squashed by asymptote in the logistic functions
(Nair and Hinton, 2010). The parameters are typically ﬁt to the
data using maximum likelihood estimation:

N
arg max L(, D ) = arg max h(X|) (3)

n=1

where h(X|) produces the posterior probability of sample X. Tak-

ing the logarithm and negating it to put it into a minimization
framework for convenience, will yield the cross-entropy loss:

N
− ln[P (D|)] = − yh(X; ) + (1 − y )(1 − h(X; ))] (4)
n=1

Fig. 2. Two systems are compared. A candidate detector (see Fig. 1) generates a set where y indicates the class label. This can be optimized using gra-
of candidate locations. A traditional CAD system (left) uses these locations as seed dient descent. For large datasets that do not fit in memory and
points for a segmentation algorithm. The segmentations are used to compute region data with many redundant samples, minibatch Stochastic Gradient
based features. The second system based on a CNN (right) uses the same locations Descent (SGD) is typically used. Rather than computing the gra-
as the center of a region of interest.
dient on the entire set, it is computed in small batches. Standard
back propagation is subsequently used to adjust weights in all lay-
ciousness. Non-maximum suppression was performed on this im- ers.
age and all optima in the likelihood image are treated as candi- Although powerful, contemporary architectures are not fully in-
dates and fed as input to both the reference feature system and variant to geometric transformations, such as rotation and scale.
the CNN. For the reference system, the local optima in the likeli- Data augmentation is typically performed to account for this.
hood image are used as seed points for a segmentation algorithm.
For the CNN, a patch centered around the location is extracted. An 3.1. Data augmentation
overview of the first stage pipeline is provided in Fig. 1. Fig. 2 il-
lustrates the generated candidates for both systems. Data augmentation is a technique often used in the context of
deep learning and refers to the process of generating new samples
3. Deep convolutional neural network from data we already have, hoping to ameliorate data scarcity and
prevent overfitting. In object recognition tasks in natural images,
In part inspired by human visual processing faculties, CNNs simple horizontal flipping is usually only performed, but for tasks
learn hierarchies of filter kernels, in each layer creating a more ab- such as optical character recognition it has been shown that elas-
stract representation of the data. The term deep generally refers tic deformations can greatly improve performance (Simard et al.,
to the nesting of non-linear functions (Bengio, 2009). Multi Lay- 2003). The main sources of variation in mammography at a lesion
ered Perceptrons (MLPs) have been shown to be universal function level are rotation, scale, translation and the amount of occluding
approximators, under some very mild assumptions, and therefore, tissue.
there is no theoretical limit that prevents them from learning the We augmented all positive examples with scale and transla-
same mapping as a deep architecture would. Training, however, tion transformations. Full scale or translation invariance is not de-
has been shown, mostly empirically, to be far more efficient in a sired nor required since the candidate detector is expected to find
deep setting and the same function can be represented with fewer a patch centered around the actual focal point of the lesion. The
parameters. Deep CNN’s are currently the most proficient for vi- problem is not completely scale-invariant either: large lesions in
sion and in spite of the simple mathematics, have been shown to a later stage of growth are not simply scaled-up versions of re-
be extremely powerful. cently emerged abnormalities. The key is therefore to perform the
Contemporary architectures roughly comprise convolutional, right amount of translation and scaling in order to generate re-
pooling and fully connected layers. Every convolution results in alistic lesion candidates. To this end, we translate each patch in
a feature map, which is downsampled in the pooling layer. The the training set containing an annotated malignant lesion 16 times
most common form of pooling is max-pooling, in which the max- by adding values sampled uniformly from the interval [−25, 25]
imum of a neighborhood in the feature map is taken. Pooling in- (0.5 cm) to the lesion center and scale it 16 times by adding val-
duces some translation invariance and downscales the image to re- ues from the interval [−30, 30] (0.6 cm) to the top left and bot-
duce the amount of weights with each layer. It also reduces loca- tom right of the bounding box. After this, all patches including the
tion precision, however, rendering it less suitable for segmentation normals were rotated using simple flipping actions, which can be
tasks. The exact merit of fully connected layers is still an open computed on the fly to generate three more samples. This results
306 T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312

Fig. 3. Examples of scaling and translation of the patches. The top left image is the
original patch, the second and third image of the top row examples of the smallest
and largest scaling employed. The bottom row indicates the extrema in the range
of translation used.
Fig. 4. A lesion (a), its segmentation (b), areas used for computing contrast features
(c) and areas used for computing margin contrast (d).
in (1 + 16 + 16 )4 = 132 patches per positive lesions and 4 per neg-
ative. Examples of the range of scaling and translation augmenta-
tion are given in Fig. 3. boundary and add the output of the candidate detector at the
found optimum. This gives us a set of nine outputs we call can-
4. Reference system didate detector features.

The large majority of CAD systems rely on some form of seg-

4.2. Contrast features
mentation of the candidates on which region based features are
computed. To this end, we employ the mass segmentation method
When talking to a radiologist, a feature that is often mentioned
proposed by Timp and Karssemeijer (2004), which was shown
is how well a lesion is separated from the background. Contrast
to be superior to other methods (region growing (te Brake and
features are designed to capture this. To compute these, we apply
Karssemeijer, 2001) and active contour segmentation (Kupinski and
a distance transform to the segmented region and compare the in-
Giger, 1998)) on their particular feature set. The image is trans-
side of the segmentation with a border around it. The distance d
formed to a polar domain around the center of the candidate and
to the border of the segmentation is determined according to:
dynamic programming is used to find an optimal contour, subject √
to the constraint that the path must start and end in the same d = ρ Aπ (5)
column in order to generate a closed contour in the Cartesian do-
main. A cost function incorporating a deviation from the expected with A the area of the segmented lesion. An illustration is provided
Grey level, edge strength and size terms is used to find an optimal in Fig. 4. An important nuisance in this setting is the tissue sur-
segmentation. One of the problems with this method and many rounding the lesion. In previous work, we have derived two model
knowledge driven segmentation methods for that matter, is that it based features, designed to be invariant to this factor (Kooi and
is conditioned on a false prior: the size constraint is based on data Karssemeijer, 2014), which were also normalized for size of the le-
from malignant lesions. When segmenting a candidate, we there- sion. The sharpness of the border of the lesion is also often men-
fore implicitly assume that this is a malignant region, inadvertently tioned by clinicians. To capture this, we add two features: the acu-
driving the segmentation into a biased result. Many of the manual tance (Rangayyan et al., 1997) and margin contrast, the different
features described below rely on a precise segmentation but in the between the inside and outside of the segmentation, using a small
end, it is an intermediate problem. For a stand-alone application, margin. Illustrations of contrast features are provided in Fig. 4.
we are interested to provide the patient with an accurate diagno- Other contrast features described in te Brake et al. (20 0 0) were
sis, not the exact delineation. A huge advantage of CNNs is that no added to give a set of 12 features.
segmentation is required and patches are fed without any interme-
diate processing. 4.3. Texture features
After segmentation, we extract a set of 74 features. These can
broadly be categorized into pixel level features, used by the can- The presence of holes in the candidate lesion often decrease
didate detector, contrast features, capturing the relation between their suspiciousness, since tumours are solid, with possibly the ex-
the attenuation coefficients inside and outside the region, texture ception of lobular carcinoma. To detect this, we added the two iso-
features describing relations between pixels within the segmented density features proposed by te Brake et al. (20 0 0). Linear struc-
region, geometry features summarizing shape and border informa- tures within a lesion can indicate an unfortunate projection rather
tion location features, indicating where the lesion is with respect to then cancer, for which we used four linear texture features as de-
some landmarks in the breast, context features, capturing informa- scribed by the same authors (te Brake et al., 20 0 0). On top of this
tion about the rest of the breast and other candidates and patient we added two features based on the second order gradient im-
features, conveying some of the subjects background information. age of the segmented lesion. The image was convolved with sec-
ond order Gaussian derivative filters and the optimal location in
4.1. Candidate detector features scale space was selected for each pixel. We subsequently took the
first and second moment of the segmented lesion of the maximum
As a first set of descriptors, we re-use the five features em- magnitude, which is expected to be high for lesions with much line
ployed by the candidate detector, which has been shown to be structure. Secondly, we computed gradient coocurence, by count-
beneficial in previous work in our group. On top of this, we com- ing the number of times adjacent pixels have the same orienta-
pute the mean of the four texture features within the segmented tion. Ten less biophysical features in the form of Haralick features
T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312 307

(Haralick et al., 1973) at two different scales (entropy, contrast, cor- Table 1
Overview of the data. Pos refers to the amount of malignant lesions and neg to
relation, energy and homogeneity) were added to give a set of 21
the amount of normals.
texture descriptors.
Cases Exams Images Candidates
4.4. Geometrical features Pos Neg Pos Neg Pos Neg Pos Neg

Regularity of the border of a lesion is often used to classify le- Train 296 6433 358 11,780 634 39,872 634 213,450
sions. Again, expedient computation relies heavily on proper seg- Valid. 35 710 42 1247 85 4218 85 19,460
Test 124 2064 124 5317 271 18,182 271 180,777
mentations. Nevertheless, we have incorporated five simple topol-
ogy descriptors as proposed by Peura and Iivarinen (1997) in the
system. These are eccentricity, convexity, compactness, circular vari- 5. Experiments
ance and elliptic variance. In order to capture more of the 3D shape,
we extended these descriptors to also work with 3 dimensions. 5.1. Data
The lesion was smoothed with a Gaussian kernel first and 3D ec-
centricity: the ratio between the largest and smallest eigenvalue of The mammograms used were collected from a large scale
the point cloud, 3D compactness: the ratio of the surface area to screening program in The Netherlands (bevolkingsonderzoek
the volume, spherical deviance, the average deviation of each point midden-west) and recorded using a Hologic Selenia digital mam-
from a sphere and elliptical deviance: the average deviation of each mography system. All tumours are biopsy proven malignancies
point to an ellipse fitted to the point cloud were computed. Since and annotated by an experienced reader. Before presentation
convex hull algorithms in 3D suffer from relatively high computa- to a radiologist, the manufacturer applies some processing to
tional complexity, this was not extended. Op top of this, we added optimize it for viewing by a human. To prevent information loss
a feature measuring reflective symmetry. The region is divided into and bias, we used the raw images instead and only applied a log
radial and angular bins and average difference pixel intensity be- transform which results in pixel values being linearly related to
tween opposing bins is summed and normalized by the size of the the attenuation coefficient. Images were scaled from 70 micron to
region. Lastly the area of the segmented region is added, giving us 200 for faster processing. Structure important for detecting lesions
a set of 10 geometric features. occur at larger scales and therefore this does not cause any loss of
information.
4.5. Location features An overview of the data is provided in Table 1. With the term
case, we refer to all screening images recorded from a single pa-
Lesions are more likely to occur in certain parts of the breast tient. Each case consists of several exams taken at typically a two
than others and other structures such as lymph nodes are more year interval and each exam typically comprises four views, two
common in the pectoralis than in other parts of the breast. To cap- of each breast, although these numbers vary: some patients skip
ture this, we use a simple coordinate system. The area of the breast a screening and for some exams only one view of each breast is
and pectoral muscle are segmented using thresholding and a poly- recorded. For training and testing, we selected all regions found
nomial fit. We subsequently estimate the nipple location by taking by the candidate detector. The train, validation and test set were
the largest distance to the chest wall and a central landmark in all split on a patient level to prevent any bias. The train and vali-
the chest wall is taken as the row location of the center of gravity. dation set comprise 44,090 mammographic views, from which we
From this, we extract: (1) the distance to the nipple (2) the same, used 39,872 for training and 4218 for validation. The test set con-
but normalized for the size of the breast, (3) the distance to the sisted of 18,182 images of 2064 patients with 271 malignant an-
chest wall and (4) the fraction of the lesion that lies in the pec- notated lesions. A total of 30 views from 20 exams in the test set
toral muscle. contained an interval cancer that was visible in the mammogram
or were taken prior to a screen detected cancer, with the abnor-
4.6. Context features
mality already visible.
Before patch extraction in the CNN system, we segmented all
To add more information about the surroundings of the le-
lesions in the training set in order to get the largest possible le-
sion, we added three context features as described by Hupse and
sion and choose the patch size with an extra margin resulting in
Karssemeijer (2009). The features again make use of the candidate
patches of size 250 × 250 (5 × 5 cm). The pixel values in the
detector and assume the posterior of pixels in the rest of the breast
patches were scaled using simple min-max scaling, with values
convey some information about the nature of the lesion in ques-
calculated over the whole training set. We experimented with scal-
tion. The first feature averages the output around the lesion, the
ing the patches locally, but this seemed to perform slightly though
second in a band at a fixed distance from the nipple and a third
not significantly worse on the validation set. All interpolation pro-
takes the whole segmented breast into account. On top of this, we
cesses were done with bilinear interpolation. Since some candi-
added the posterior of the candidate detector, normalized by the
dates occur at the border of the imaged breast, we pad the im-
sum of the top three and top five lesions in the breast, to give us
age with zeros. Negative examples were only taken from normal
five context features in total.
images. Annotated benign samples such as cysts and fibroadeno-
4.7. Patient features mae were removed from the training set. However, not all be-
nign lesions in our data are annotated and therefore some may
Lastly, we added the age of the patient, which is an important have ended in the train or validation set as negatives. After aug-
risk factor. From the age, we also estimate the screening round by mentation, the train set consisted of 334, 752 positive patches and
subtracting 50 (the age at which screening starts in The Nether- 853, 800 negatives. When combining the train and validation set,
lands) and dividing by 2 (the step size of the screening). This gives this amounts to 379, 632 positive and 931, 640 negative patches.
us two features.
Note that the last three sets of features provide information 5.2. Training and classification details
outside of the patch fed to the CNN. Even if the network is able
to exploit all information in the training set, these could still sup- For the second stage classification, we have experimented with
ply complementary information regarding the nature of the lesion. several classifiers (SVMs with several different kernels, Gradient
308 T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312

Table 2
Overview of results of individual feature sets
along the 95% conﬁdence interval (CI) obtained
using 50 0 0 bootstraps.

Feature group AUC CI

Candidate detector 0.858 [0.827, 0.887]

Contrast 0.787 [0.752, 0.817]
Fig. 5. Illustration of the network architecture, The numbers indicate the amount Texture 0.718 [0.681, 0.753]
of kernels used. We employ a scaled down version of the VGG model. To see the Geometry 0.753 [0.721, 0.784]
extent to which conventional features can still help, the network is trained fully Location 0.686 [0.651, 0.719]
supervised and the learned features are subsequently extracted from the final layer Context 0.816 [0.781, 0.850]
and concatenated with the manual features and retrained using a second classifier. Patient 0.651 [0.612, 0.688]
Equal information 0.892 [0.864, 0.918]
All 0.906 [0.881, 0.929]
Boosted Trees, MLPs) on a validation set, but found in nearly all
circumstances the random forest performed similar or better than
others. To counteract the effect of class imbalance, trees in the RF
were grown using the balance corrected Gini criterion for split-
ting and in all situations we used 20 0 0 estimators and the square
root heuristic for the maximum number of features. The maxi-
mum depth was cross-validated using 8 folds. We employed class
weights inversely proportional to the distribution in the particu-
lar bootstrap sample. The posterior probability output by the RF
was calculated as a mean of the estimated classes. The systems are
trained using at most the ten most suspicious lesions per image
found by the candidate detector, during testing no such threshold
is applied to obtain highest sensitivity.
We implemented the network in Theano (Bergstra et al., 2010)
and pointers provided by Bengio (2012) were followed and very
helpful. We used OxfordNet-like architectures (Simonyan and Zis-
serman, 2014) with 6 convolutional layers of {16, 32, 64, 128, 128}
with 3 × 3 kernels and 2 × 2 max-pooling on all but the fourth
convolutional layer. A stride of 1 was used in all convolutions. Two
Fig. 6. Comparison of the CNN with the reference system using equal information,
fully connected layers of 300 each were added. An illustration of
i.e., only information represented in the patch used by the CNN, excluding context,
the network is provided in Fig. 5. location and patient information.
We employed Stochastic Gradient Descent (SGD) with RMSProp
(Dauphin et al., 2015), an adaption of R-Prop for SGD with Nesterov
momentum (Sutskever et al., 2013). Drop-out (Srivastava et al.,
2014) was used on the fully connected layers with p = 0.5. We
used the MSRA (He et al., 2015) weight filler, a learning rate of
5 × 10−5 with a weight decay of 5 × 10−5 . To battle the strong
class imbalance, positive samples were presented multiple times
during an epoch, keeping a 50/50 positive/negative ratio in each
minibatch. Alternatively, the loss function could be weighted, but
we found this to perform worse, we suspect this is because re-
balancing maintains a certain diversity in the minibatch. All hy-
perparameters were optimized on a validation set and the CNN
was subsequently retrained on the full training set using the found
parameters. All test patches were also augmented using the same
augmentation scheme. On the validation set, this gave a small im-
provement. The best validation AUC was 0.90.

5.3. ROC analysis

Fig. 7. Comparison of the CNN without any augmentation, with augmentation and
To first get an understanding of how well each feature set per- with added manual features.
forms individually, we trained different RFs for each feature set and
applied them separately to the test set. In all cases, the training
procedure as described above was used. AUC values along with a paring the CNN with data augmentation to the network without
95% confidence interval, acquired using bootstrapping (Efron, 1979; data augmentation and with data augmentation and added man-
Bornefalk and Hermansson, 2005) with 5000 bootstrap samples are ual features. Again bootstrapping was used to obtain significance. It
shown in Table 2. is clear that the proposed data augmentation methods contributes
The CNN was compared to the reference system with equal greatly to the performance, which was also found to be significant
amount of information (i.e., excluding location, context and patient (p 0.05).
information) to get a fair performance comparison. Fig. 6 shows a To combine the CNN with other descriptors, we extracted the
plot of the mean curves along with the 95% confidence interval features from the last fully connected layer and appended the
obtained after bootstrapping. Results were not found to be signif- other set (see Fig. 5). For each augmented patch, the additional
icantly different p = 0.2 on the full ROC. Fig. 7 shows a plot com- features were simply duplicated. Table 3 shows results of the CNN
T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312 309

Table 3
Overview of results of the CNN combined with individual
feature sets.

Feature group added to CNN AUC CI

CNN Only 0.929 [0.897, 0.938]

Candidate detector 0.938 [0.919, 0.955]
Contrast 0.931 [0.91, 0.949]
Texture 0.933 [0.912, 0.950]
Geometry 0.928 [0.907, 0.946]
Location 0.933 [0.913, 0.950]
Context 0.934 [0.914, 0.952]
Patient 0.929 [0.908, 0.947]
All 0.941 [0.922, 0.958]

Table 4
AUC values obtained when training the
network on subsets of malignant lesions
in the training set, keeping the same
amount of normals.
Fig. 8. Lesion based FROC of the three systems. Please note that this concerns the
Data Augmentation 60% All
full reference system, where context, location and patient features are incorporated.
With 0.842 0.929
Without 0.685 0.875

combined with different feature sets, again with conﬁdence inter-

val acquired by bootstrapping with 50 0 0 samples.
To investigate the degree to which a large data set is really
needed, we trained several networks on subsets, removing 40% of
the malignant lesions. Results are provided in Table 4. Since the
differences are rather large, we did not perform signiﬁcance test-
ing. For all settings, we optimized the learning rate but kept all
other hyperparameters equal to the ones found to be optimal for
the full training set.

5.4. FROC analysis

In practice, a CAD system should ideally be operating at a refer-

ral rate similar to that of a radiologists. To get a better understand-
ing of the system’s performance around this operating point, we
compute the Partial Area Under the Curve (PAUC) on a log scale: Fig. 9. Case based FROC of the three systems. In areas of high specificity, the CNN
and the addition of manual features is particularly useful. Please note that this con-

1 1
s( f ) cerns the full reference system, where context, location and patient features are
PAUC = df (6) incorporated.
ln[1] − ln[0.01] 0.01 f
and generate Free Receiver Operator Characteristic (FROC) curves,
to illustrate the numbers of false positives per image. information as the CNN. The group of readers consisted of one ex-
Plots of the FROCs of the full reference system (last line in perienced reader (non-radiologist) and two experienced certified
Table 2), the CNN only and the CNN plus manual features are radiologists. To get an idea of the performance that can at least be
shown in Figs. 8 and 9. To further investigate which features are obtained on this set, the mean of the three readers was also com-
helpful at high specificity, we compute PAUC for each feature set puted by simply averaging the scores that each of the three readers
individually. Results are shown in Table 5. We see a significant dif- assigned to each patch.
ference comparing the CNN with additional features to the refer- Patches were extracted from the mammogram processed by the
ence system P = 0.015 on a lesion level and P = 0.0 0 02 on a case manufacturer for optimal viewing and were shown at a normal
level. computer screen at a resolution of 200 micron. Microcalcifications
are difficult to see in this setting, but all structures relevant for soft
5.5. Human performance tissue lesions are intact and readers did not report difficulties. The
readers were provided with a slider and instructed to score the
In previous work in our group, performance of the CAD sys- patch between zero and one hundred based on their assessment
tem was compared to the performance of a radiologists at an exam of the suspiciousness of the patch.
level, a collection of four images, which contains more information As a test set, we used all masses that were used in
than only a patch, such as context in the mammogram, symmetri- Hupse et al. (2013) and selected an equal amount of negatives, that
cal difference between two breast, the relation between the CC and were considered the most difficult by the candidate detector, re-
MLO views. To get a better understanding of how close the CNN sulting in 398 patches. This gives a representative set of difficult
is to human performance on a patch level and how much more samples and allows for larger differences between readers and the
room there is for improvement in this sub part of the pipeline, we CNN, but is biased towards a set difficult for the reference system,
performed a study where we measured the performance of experi- which was therefore left out of the comparison (obtained AUC was
enced readers on a patch level, providing the reader with the same 0.64 on this set). Fig. 12 shows the ROC curves resulting from the
310 T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312

Table 5
Partial Area under the FROC of different systems. P-values are refer-
ring to the comparison between the CNN with additional features
and the CNN without the speciﬁc feature group. In this case, the
reference system is the full system, including context, location and
patient information.

Lesion Case P, lesion P, case

CNN 0.550 0.684 1 1
Reference system 0.547 0.594 0.451 0.013

CNN + candidate det. 0.590 0.701 < 0.0 0 01 0.026

CNN + contrast 0.571 0.704 0.011 0.0758
CNN + texture 0.574 0.705 0.0062 0.067
CNN + topology 0.561 0.700 0.0286 0.132
CNN + location 0.576 0.707 0.0038 0.0516
CNN + context 0.578 0.700 0.0028 0.121
CNN + patient 0.576 0.704 0.0034 0.0784
CNN + all features 0.594 0.711 < 0.001 0.04

Fig. 12. Comparison between the CNN and three experienced readers on a patch
level.

to add these to the training set and perform three-class classiﬁ-

cation or train a separate network to discriminate these lesions
properly.
The majority of ‘misclassified’ positives are lesions ill-
represented in the training data, either very subtle or extremely
large. When using CAD as a second reader, these will not influence
Fig. 10. Top misclassified negatives by the CNN. The second sample in the first row
the referral decision much, as they are clearly visible to a human,
is simply the nipple and the third sample in the second row displays fat necrosis. but when using the computer as an independent reader, these is-
Both are obviously normal patches and are filtered out using additional feature sets. sues need to be solved. In preliminary experiments, we have seen
that many of these misclassifications can be prevented by con-
sidering the contralateral breast and plan to work on this in the
future.
From the results in Tables 3 and 2 we can see that individually,
apart from the candidate detector, contrast and context are useful
features. Although age and screening round are some of the most
important risk factors, we do not see clear improvements when
added as features, which is slightly disappointing. To get training
data, we took negative patches only from normal images, but not
only from normal exams, to get as many data points as possible.
A possible explanation for the disappointing performance may be
that the relation between age and cancer is more difficult to learn
in the setting, since it is a relation that exist on an exam level.
Fig. 11. Top misclassified positives by the CNN, most samples are very large lesion To add features, we have used a second classification stage. This
unlikely to be found in the screening population and therefore under represented has the advantage it is easy to evaluate which features add infor-
in the training set. mation, without retraining a network and re-optimizing the pa-
rameters, which can take several weeks to do properly. On top of
this, the learned feature representation of the CNN is the same in
all situations, rendering comparison more reliable. A major disad-
reader study. Again, to test significance we used bootstrapping and vantage, however, is that the training procedure is rather compli-
two sided testing to get a significance score. We found no signifi- cated. Other more elegant methods such as coding features as a
cant difference between the CNN and any of the readers: CNN vs second channel, as done by Maddison et al. (2014) or adding the
reader 1: p = 0.1008, CNN vs reader 2: p = 0.6136, CNN vs reader features in one of the fully connected layers of the network during
3: p = 0.64, but found a significant difference between the CNN training could be better strategies and we plan to explore this in
and the mean of the human readers ( p = 0.001). future work.
We have made use of a more shallow and scaled down ver-
6. Discussion sion of the networks proposed by Simonyan and Zisserman (2014),
who obtain best performance on ImageNet with a 19 layer ar-
To get more insight into the performance of the network, ex- chitecture with four times the amount of kernels in each layer.
amples of the top misclassified positives and negatives are shown In initial experiments, we have worked with Alexnet-like archi-
in Fig. 11 and 10 respectively. A large part of the patches deter- tectures, which performed worse on our problem, obtaining and
mined as suspicious by the network are benign abnormalities such AUC of around 0.85 on the validation set. We have also experi-
as cysts and fibroadenomae or normal structures such as lymph mented with deeper networks and increasing the amount of ker-
nodes or fat necrosis. Cysts and lymph nodes can look relatively nels, but found no significant improvement on the validation set
similar to masses. These strong false positives occur due to the ab- (0.896 vs 0.897 of the network with larger capacity and 0.90 of
sence of benign lesions in our training set. In the future we plan 9 layer network). We suspect that with more data, larger capacity
T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312 311

networks can become beneficial. The problem could be less com- 7. Conclusion
plex than classifying natural images since it concerns a two-class
classification in the current setting and we are dealing with gray In this paper we have shown that a deep learning model in
scale images, contrary to the thousands of classes and RGB data in the form of a Convolutional Neural Network (CNN) trained on a
ImageNet (Russakovsky et al., 2014). Therefore, more shallow and large data set of mammographic lesions outperforms a state-of-
lower capacity networks than the one found optimal for natural the art system in Computer Aided Detection (CAD) and therefore
images could suffice for this particular problem. has great potential to advance the field of research. A major ad-
In our work, we made extensive use of data augmentation in vantage is that the CNN learns from data and does not rely on
the form of simply geometric transformations. We have also ex- domain experts, making development easier and faster. We have
perimented with full rotation, but this creates lesions not expected shown that the addition of location information and context can
during testing, due to the zero padding. This could be prevented easily be added to the network and that several manually designed
using test time augmentation, but when used in a sliding win- features can give some small improvements, mostly in the form of
dow fashion this is not convenient. The ROC curves in Fig. 7 show ‘common sense’: obviously false negatives will no longer be con-
a clear increase in performance for the full data set. The results sidered as such. On top of this, we have compared the CNN to a
in Table 4 show the current data augmentation scheme improves group of three experienced readers on a patch level, two of which
performance for large amounts of data but not for small amounts were certified radiologist and have show that the human readers
of data. We suspect in the latter setting, the network overfits and CNN have similar performance.
and more regularization is needed. These results may be differ-
ent when fully optimizing the architecture and augmentation pro- Acknowledgements
cedure for each setting individually. More research is needed to
draw clear conclusions. However effective, data augmentation is a This research was funded by grant KUN 2012-557 of the Dutch
rather computationally costly procedure. A more elegant approach Cancer Society and supported by the Foundation of Population
would be to add the invariance properties in the network architec- Screening Mid West.
ture, which is currently being investigated in several papers (Gens
and Domingos, 2014; Jaderberg et al., 2015). On top of the geomet-
References
ric transforms, occluding tissue is an important source of variance,
which is more challenging to explicitly code in the network archi- Astley, S.M., Gilbert, F.J., 2004. Computer-aided detection in mammography. Clinic.
tecture. In future work, we plan to explore simulation methods for Radiol. 59, 390–399.
this. Bengio, Y., 2009. Learning deep architectures for ai. Found.Trends Mach. Learn. 2,
1–127.
In this work, we have employed a previously developed can- Bengio, Y., 2012. Practical recommendations for gradient-based training of deep ar-
didate detector. This has two main advantages: (1) it is fast and chitectures. In: Neural Networks: Tricks of the Trade. Springer, pp. 437–478.
accurate (2) the comparison with the traditional CAD system is Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: A review and
new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828.
straightforward and fair, since exactly the same candidate locations Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-wise train-
are trained with and evaluated on. The main disadvantage is that ing of deep networks. In: Advances in Neural Information Processing Systems,
the sensitivity is not hundred percent, which causes lesions to be pp. 153–160.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J.,
missed, although the case-based performance is close to optimal. Warde-Farley, D., Bengio, Y., 2010. Theano: a CPU and GPU math expression
In future work, we plan to explore other methods, such as the compiler. In: Proceedings of the Python for Scientific Computing Conference
strategy put forth by Cireşan et al. (2013), to train the system end- (SciPy), p. 3.
Bornefalk, H., Hermansson, A.B., 2005. On the comparison of froc curves in mam-
to-end. This will make training and classification less cumbersome
mography CAD systems. Med. Phys. 32, 412–417.
and has the potential to increase the sensitivity of the system. te Brake, G.M., Karssemeijer, N., 2001. Segmentation of suspicious densities in digital
In this work we have compared the CNN to a state-of-the art mammograms. Med. Phys. 28, 259–266.
te Brake, G.M., Karssemeijer, N., Hendriks, J.H., 20 0 0. An automatic method to dis-
CAD system (Hupse et al., 2013), which was combined with sev-
criminate malignant masses from normal tissue in digital mammograms. Phys.
eral other features commonly used in the mammography CAD lit- Med. Biol. 45, 2843–2857.
erature. A random forest was subsequently used, that performs fea- Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
ture selection during its training stage. We think the feature set we Broeders, M., Moss, S., Nyström, L., Njor, S., Jonsson, H., Paap, E., Massat, N., Duffy, S.,
Lynge, E., Paci, E., 2012. The impact of mammographic screening on breast can-
used is sufficiently exhaustive to include most features commonly cer mortality in europe: a review of observational studies. J. Med. Screening 19,
used in literature and therefore think similar conclusions hold for 1425.
other state-of-the art CAD systems. To the best of our knowl- Cheng, S.C., Huang, Y.M., 2003. A novel approach to diagnose diabetes based on
the fractal characteristics of retinal images. IEEE Trans. Inf. Technol. Biomed. 7,
edge, the Digital Database of Screening Mammography (DDSM) is 163–170.
the only publicly available data set, which comprises of digitized Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J., 2013. Mitosis detection
screen film mammograms. Since almost all screening centers have in breast cancer histology images with deep neural networks. In: Medical Image
Computing and Computer-Assisted Intervention, pp. 411–418.
migrated to digital mammography, we have elected not to run our Cireşan, D.C., Meier, U., Masci, J., Maria Gambardella, L., Schmidhuber, J., 2011. Flex-
system on this data set, because we think the clinical relevance ible, high performance convolutional neural networks for image classification.
is arguable. On top of this, since this entails a transfer learning In: International Joint Conference on Artificial Intelligence, p. 1237.
Cireşan, D.C., Meier, U., Masci, J., Schmidhuber, J., 2012. Multi-column deep neural
problem, the system may require retraining to adapt to the older
network for traffic sign classification. Neural Netw. 32, 333–338.
modality. Dauphin, Y. N., de Vries, H., Chung, J., Bengio, Y., 2015. Rmsprop and equilibrated
The reader study illustrates the network is not far from the ra- adaptive learning rates for non-convex optimization. arXiv:150204390.
Doi, K., 2005. Current status and future potential of computer-aided diagnosis in
diologists performance, but still substantially below the mean of
medical imaging. British J. Radiol. 78 Spec No 1, S3–S19.
the readers, suggesting a large performance increase is still pos- Doi, K., 2007. Computer-aided diagnosis in medical imaging: historical review, cur-
sible. We suspect that some other augmentation methods as dis- rent status and future potential. Comput. Med. Imag. Graph. 31, 198–211. PMID:
cussed above could push the network a bit further, but expect 17349778.
Efron, B., 1979. Bootstrap methods: Another look at the jackknife. Annals Stat. 7,
more training data, when it becomes available will be the most im- 1–26.
portant factor. Also, we feel still employing some handcrafted fea- Elter, M., Horsch, A., 2009. Cadx of mammographic masses and clustered microcal-
tures that specifically target weaknesses of the CNN may be a good cifications: a review. Med. Phys. 36, 2052–2068.
Fenton, J.J., Abraham, L., Taplin, S.H., Geller, B.M., Carney, P.A., D’Orsi, C., Elmore, J.G.,
strategy and may be more pragmatic and effective than adding Barlow, W.E., 2011. Effectiveness of computer-aided detection in community
thousands of extra samples to the training set. mammography practice. J. Nation. Cancer Inst. 103, 1152–1161.
312 T. Kooi et al. / Medical Image Analysis 35 (2017) 303–312

Fukushima, K., 1980. Neocognitron: a self-organizing neural network model for a Mudigonda, N.R., Rangayyan, R.M., Desautels, J.E., 20 0 0. Gradient and texture anal-
mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. ysis for the classification of mammographic masses. IEEE Trans. Med. Imag. 19,
36, 193–202. 1032–1043.
Gens, R., Domingos, P. M., 2014. Deep symmetry networks. 2537–2545. Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted boltzmann ma-
Giger, M.L., Karssemeijer, N., Armato, S.G., 2001. Computer-aided diagnosis in medi- chines. In: International Conference on Machine Learning, pp. 807–814.
cal imaging. IEEE Trans. Med. Imag. 20, 1205–1208. Nishikawa, R.M., 2007. Current status and future directions of computer-aided diag-
van Ginneken, B., Schaefer-Prokop, C., Prokop, M., 2011. Computer-aided diagnosis: nosis in mammography. Comput. Med. Imag. Graph. 31, 224–235.
how to move from the laboratory to the clinic. Radiology 261, 719–732. Peura, M., Iivarinen, J., 1997. Efficiency of simple shape descriptors. In: Proceedings
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for ac- of the third international workshop on visual form, p. 451.
curate object detection and semantic segmentation. In: Computer Vision and Rangayyan, R.M., El-Faramawy, N.M., Desautels, J.E.L., Alim, O.A., 1997. Measures of
Pattern Recognition. IEEE, pp. 580–587. acutance and shape for classification of breast tumors. IEEE Trans. Med. Imag.
Haralick, R.M., Shanmugam, K., Dinstein, I., 1973. Textural features for image classi- 16, 799–810.
fication. IEEE Trans. Syst. Man Cybern. 3, 610–621. Rao, V.M., Levin, D.C., Parker, L., Cavanaugh, B., Frangos, A.J., Sunshine, J.H., 2010.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpass- How widely is computer-aided detection used in screening and diagnostic
ing human-level performance on imagenet classification. Comput. Vis. Pattern mammography? J. Am. College Radiol. 7, 802–805.
Recognit. 1026–1034. 1502.01852v1. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpa-
Hinton, G.E., Osindero, S., Teh, Y.W., 2006. A fast learning algorithm for deep belief thy, A., Khosla, A., Bernstein, M., 2014. Imagenet large scale visual recognition
nets. Neural Comput. 18, 1527–1554. challenge. Int. J. Comput. Vis. 115, 1–42.
Hosang, J., Benenson, R., Dollár, P., Schiele, B., 2015. What makes for effective detec- Sahiner, B., Chan, H.P., Petrick, N., Wei, D., Helvie, M.A., Adler, D.D., Goodsitt, M.M.,
tion proposals? arXiv:150205082. 1996. Classification of mass and normal breast tissue: a convolution neural net-
Hupse, R., Karssemeijer, N., 2009. Use of normal tissue context in computer-aided work classifier with spatial domain and texture images. IEEE Trans. Med. Imag.
detection of masses in mammograms. IEEE Trans. Med. Imag. 28, 2033–2041. 15, 598–610.
Hupse, R., Samulski, M., Lobbes, M., den Heeten, A., Imhof-Tas, M.W., Beijerinck, D., Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw.
Pijnappel, R., Boetes, C., Karssemeijer, N., 2013. Standalone computer-aided de- 61, 85–117.
tection compared to radiologists’ performance for the detection of mammo- Simard, P.Y., Steinkraus, D., Platt, J.C., 2003. Best practices for convolutional neu-
graphic masses. Eur. Radiol. 23, 93–100. ral networks applied to visual document analysis. In: Document Analysis and
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training Recognition, pp. 958–963.
by reducing internal covariate shift. arXiv:150203167. Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K., 2015. Spatial trans- image recognition. arXiv:14091556.
former networks. arXiv preprint arXiv:150602025. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014.
Karssemeijer, N., te Brake, G., 1996. Detection of stellate distortions in mammo- Dropout: A simple way to prevent neural networks from overfitting. J. Mach.
grams. IEEE Trans. Med. Imag. 15, 611–619. Learn. Res. 15, 1929–1958.
Karssemeijer, N., Otten, J.D., Roelofs, A.A.J., van Woudenberg, S., Hendriks, J.H.C.L., Sutskever, I., Martens, J., Dahl, G., Hinton, G., 2013. On the importance of initializa-
2004. Effect of independent multiple reading of mammograms on detection per- tion and momentum in deep learning. In: International Conference on Machine
formance. In: Medical Imaging, pp. 82–89. Learning, pp. 1139–1147.
Kooi, T., Karssemeijer, N., 2014. Invariant features for discriminating cysts from solid Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
lesions in mammography. In: Breast Imaging. Springer, pp. 573–580. V., Rabinovich, A., 2014. Going deeper with convolutions. arXiv:14094842v1.
Krizhevsky, A., Sutskever, I., Hinton, G., 2012. Imagenet classification with deep con- 1409.4842v1.
volutional neural networks. In: Advances in Neural Information Processing Sys- Tabar, L., Yen, M.F., Vitak, B., Chen, H.H.T., Smith, R.A., Duffy, S.W., 2003. Mammogra-
tems 25, pp. 1097–1105. phy service screening and mortality in breast cancer patients: 20-year follow-up
Kupinski, M.A., Giger, M.L., 1998. Automated seeded lesion segmentation on digital before and after introduction of screening. Lancet 361, 1405–1410.
mammograms. IEEE Trans. Med. Imag> 17, 510–517. Taylor, P., Champness, J., Given-Wilson, R., Johnston, K., Potts, H., 2005. Impact of
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. computer-aided detection prompts on the sensitivity and specificity of screen-
Lecun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to ing mammography. Health Technol. Assess. 9, iii,1–iii,58.
document recognition. Proc. IEEE 86, 2278–2324. Timp, S., Karssemeijer, N., 2004. A new 2D segmentation method based on dy-
Lehman, C.D., Wellman, R.D., Buist, D.S.M., Kerlikowske, K., Tosteson, A.N.A., namic programming applied to computer aided detection in mammography.
Miglioretti, D.L.,B.C.S.C., 2015. Diagnostic accuracy of digital screening mam- Med. Phys. 31, 958–971.
mography with and without computer-aided detection.. JAMA Intern Med 175, Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W., 2013. Selective search
1828–1837. for object recognition. Int. J. Comput. Vis. 104, 154–171.
Maddison, C. J., Huang, A., Sutskever, I., Silver, D., 2014. Move evaluation in go using Zeiler, M.D., Fergus, R., 2014. Visualizing and understanding convolutional networks.
deep convolutional neural networks. arXiv:14126564. In: European Conference on Computer Vision, pp. 818–833.
Malich, A., Fischer, D.R., Böttcher, J., 2006. CAD for mammography: the technique, Zheng, B., Wang, X., Lederman, D., Tan, J., Gur, D., 2010. Computer-aided detection;
results, current role and further developments. Eur. Radiol. 16, 1449–1460. the effect of training databases on detection of subtle breast masses. Acad. Ra-
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., diol. 17, 1401–1408.
Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., 2015. Human-level con-
trol through deep reinforcement learning. Nature 518, 529–533.