Hover-Net: Simultaneous Segmentation and Classification of Nuclei in Multi-Tissue Histology Images
Abstract—Nuclear segmentation and classification within Haematoxylin & Eosin stained histology images is a fundamental prerequisite in the digital pathology work-flow. The development of automated methods for nuclear segmentation and classification enables the quantitative analysis of tens of thousands of nuclei within a whole-slide pathology image, opening up possibilities of further analysis of large-scale nuclear morphometry. However, automated nuclear segmentation and classification is faced with a major challenge in that there are several different types of nuclei, some of them exhibiting large intra-class variability, such as the nuclei of tumour cells. Additionally, some of the nuclei are often clustered together. To address these challenges, we present a novel convolutional neural network for simultaneous nuclear segmentation and classification that leverages the instance-rich information encoded within the vertical and horizontal distances of nuclear pixels to their centres of mass. These distances are then utilised to separate clustered nuclei, resulting in an accurate segmentation, particularly in areas with overlapping instances. Then, for each segmented instance, the network predicts the type of nucleus via a dedicated up-sampling branch. We demonstrate state-of-the-art performance compared to other methods on multiple independent multi-tissue histology image datasets. As part of this work, we introduce a new dataset of Haematoxylin & Eosin stained colorectal adenocarcinoma image tiles, containing 24,319 exhaustively annotated nuclei with associated class labels.

Index Terms—Nuclear segmentation, nuclear classification, computational pathology, deep learning.

I. INTRODUCTION

Current manual assessment of Haematoxylin and Eosin (H&E) stained histology slides suffers from low throughput and is naturally prone to intra- and inter-observer variability [1]. To overcome the difficulty in visual assessment of tissue slides, there is a growing interest in digital pathology (DP), where digitised whole-slide images (WSIs) are acquired from glass histology slides using a scanning device. This permits efficient processing, analysis and management of the tissue specimens [2]. Each WSI contains tens of thousands of nuclei of various types, which can be further analysed in a systematic manner and used for predicting clinical outcome. Here, the type of nucleus refers to the cell type in which it is located. For example, nuclear features can be used to predict survival [3] and also for diagnosing the grade and type of disease [4]. Also, efficient and accurate detection and segmentation of nuclei can facilitate good quality tissue segmentation [5], [6], which can in turn not only facilitate the quantification of WSIs but may also serve as an important step in understanding how each tissue component contributes to disease. In order to use nuclear features for downstream analysis within computational pathology, nuclear segmentation must be carried out as an initial step. However, this remains a challenge because nuclei display a high level of heterogeneity and there is significant inter- and intra-instance variability in the shape, size and chromatin pattern between and within different cell types, disease types or even from one region to another within a single tissue sample. Tumour nuclei, in particular, tend to be present in clusters, which gives rise to many overlapping instances, providing a further challenge for automated segmentation, due to the difficulty of separating neighbouring instances.

As well as extracting each individual nucleus, determining the type of each nucleus can increase the diagnostic potential of current DP pipelines. For example, accurately classifying each nucleus as tumour or lymphocyte enables downstream analysis of tumour infiltrating lymphocytes (TILs), which have been shown to be predictive of cancer recurrence [7]. Yet, similar to nuclear segmentation, classifying the type of each nucleus is difficult, due to the high variance of nuclear appearance within each WSI. Typically, nuclei are classified using two disjoint models: one for detecting each nucleus and then another for performing nuclear classification [8], [9]. However, it would be preferable to utilise a single unified model for nuclear instance segmentation and classification.

In this paper, we present a deep learning approach for simultaneous segmentation and classification of nuclear instances in histology images. The network is based on the prediction of horizontal and vertical distances (and hence the name HoVer-Net) of nuclear pixels to their centres of mass, which are subsequently leveraged to separate clustered nuclei.

∗ First authors contributed equally. + Last authors contributed equally.
S. Graham, N. Rajpoot and S. E. A. Raza are with the Department of Computer Science, University of Warwick, UK. S. Graham is also with the Mathematics for Real-World Systems Centre for Doctoral Training, University of Warwick, UK. S. E. A. Raza is also with the Centre for Evolution and Cancer & Division of Molecular Pathology, The Institute of Cancer Research, London, UK. Q. D. Vu and J. T. Kwak are with the Department of Computer Science and Engineering, Sejong University, South Korea. A. Azam and Y-W. Tsang are with the Department of Pathology at University Hospitals Coventry and Warwickshire, Coventry, UK.
Model code: https://github.com/vqdang/hover_net
For each segmented instance, the nuclear type is subsequently determined via a dedicated up-sampling branch. To the best of our knowledge, this is the first approach that achieves instance segmentation and classification within the same network. We present comparative results on six independent multi-tissue histology image datasets and demonstrate state-of-the-art performance compared to other recently proposed methods. The main contributions of this work are listed as follows:
• A novel network, targeted at simultaneous segmentation and classification of nuclei, where horizontal and vertical distance map predictions separate clustered nuclei.
• We show that the proposed HoVer-Net achieves state-of-the-art performance on multiple H&E histology image datasets, as compared to over a dozen recently published methods.
• An interpretable and reliable evaluation framework that effectively quantifies nuclear segmentation performance and overcomes the limitations of existing performance measures.
• A new dataset of 24,319 exhaustively annotated nuclei within 41 colorectal adenocarcinoma image tiles.

II. RELATED WORK

A. Nuclear Instance Segmentation

Within the current literature, energy-based methods, in particular the watershed algorithm, have been widely utilised to segment nuclear instances. For example, [10] used thresholding to obtain the markers and the energy landscape as input for watershed to extract the nuclear instances. Nonetheless, thresholding relies on a consistent difference in intensity between the nuclei and background, which does not hold for more complex images and hence often produces unreliable results. Various approaches have tried to provide an improved marker for marker-controlled watershed. [11] used active contours to obtain the markers. [12] used a series of morphological operations to generate the energy landscape. However, these methods rely on the predefined geometry of the nuclei to generate the markers, which determines the overall accuracy of each method. Notably, [13] avoided the trouble of refining the markers for watershed by designing a method that relies solely on the energy landscape. They combined an active contour approach with nuclear shape modelling via a level-set method to obtain the nuclear instances. Despite its widespread usage, obtaining sufficiently strong markers for watershed is a non-trivial task. Some methods have departed from the energy-based approach by utilising the geometry of the nuclei. For instance, [14], [15] and [16] computed the concavity of nuclear clusters, while [17] used ellipse-fitting to separate the clusters. However, this assumes a predefined shape, which does not encompass the natural diversity of the nuclei. In addition, these methods tend to be sensitive to the choice of manually selected parameters.

Recently, deep learning methods have received a surge of interest due to their superior performance in many computer vision tasks [18], [19], [20]. These approaches are capable of automatically extracting a representative set of features that strongly correlate with the task at hand. As a result, they are preferable to hand-crafted approaches that rely on a selection of pre-defined features. Inspired by the Fully Convolutional Network (FCN) [21], U-Net [22] has been successfully applied to numerous segmentation tasks in medical image analysis. The network has an encoder-decoder design with skip connections to incorporate low-level information and uses a weighted loss function to assist separation of instances. However, it often struggles to split neighbouring instances and is highly sensitive to pre-defined parameters in the weighted loss function. A more recently proposed method, Micro-Net [23], extends U-Net by utilising an enhanced network architecture with a weighted loss. The network processes the input at multiple resolutions and, as a result, gains robustness against nuclei of varying size. In [24], the authors developed a network that is robust to stain variations in H&E images by introducing a weighted loss function that is sensitive to the Haematoxylin intensity within the image.

Other methods exploit information about the nuclear contour (or boundary) within the network, such as DCAN [25], which utilised a dual architecture that outputs the nuclear cluster and the nuclear contour as two separate prediction maps. Instance segmentation is then achieved by subtracting the contour from the nuclear cluster prediction. [26] proposed a network to predict the inner nuclear instance, the nuclear contour and the background. The network utilised a customised weighted loss function based on the relative position of pixels within the image to improve and stabilise the inner nuclei and contour prediction. Some other methods have also utilised the nuclear contour to achieve instance segmentation. For example, [27] employed a deep learning technique for labelling the nuclei and the contours, followed by a region growing approach to extract the final instances. [28] used the contour predictions as input into a further network for segmentation refinement. [29] proposed CIA-Net, which utilises a multi-level information aggregation module between two task-specific decoders, where each decoder segments either the nuclei or the contours. A Deep Residual Aggregation Network (DRAN) was proposed by [30] that uses a multi-scale strategy, incorporating both the nuclei and nuclear contours to accurately segment nuclei.

There have been various other methods to achieve instance separation. Instead of considering the contour, [31] proposed a deep learning approach to detect superior markers for watershed by regressing the nuclear distance map. Therefore, the network avoids making a prediction for areas with indistinct contours.

In line with these developments, the field of instance segmentation within natural images is also rapidly progressing and has had a significant influence on nuclear instance segmentation methods. A notable example is Mask-RCNN [32], where instance segmentation is achieved by first predicting candidate regions likely to contain an object and then performing deep learning based segmentation within those proposed regions.

The CoNSeP dataset for nuclear segmentation is available at https://warwick.ac.uk/fac/sci/dcs/research/tia/data/.
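The marker-based separation strategy discussed above can be made concrete with a short sketch. This is an illustrative re-implementation using SciPy and scikit-image, not the code of any cited work: an exact distance transform stands in for a regressed nuclear distance map (in the spirit of [31]), markers are obtained by thresholding it, and a marker-controlled watershed splits the touching instances. The function name and the threshold fraction are our own choices.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def split_instances(dist_map, fg_mask, marker_frac=0.8):
    """Marker-controlled watershed: markers are connected components of
    confidently interior pixels; the inverted distance map is then
    flooded from the markers, restricted to the foreground mask."""
    markers, _ = ndi.label(dist_map > marker_frac * dist_map.max())
    return watershed(-dist_map, markers=markers, mask=fg_mask)

# Toy example: two touching circles form one connected foreground blob.
yy, xx = np.mgrid[0:64, 0:64]
fg = ((yy - 32) ** 2 + (xx - 22) ** 2 < 144) | \
     ((yy - 32) ** 2 + (xx - 42) ** 2 < 144)
# Exact distance transform stands in for a network's regression output.
dist = ndi.distance_transform_edt(fg)
labels = split_instances(dist, fg)
```

Thresholding connected components alone would merge the two nuclei into one instance; the interior markers recover the split, which is exactly the weakness of plain thresholding noted above.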
Fig. 1: Overview of the proposed approach for simultaneous nuclear instance segmentation and classification. When no
classification labels are available, the network produces the instance segmentation as shown in (a). The different colours
of the nuclear boundaries represent different types of nuclei in (b).
Fig. 2: Overview of the proposed architecture. (a) (Pre-activated) residual unit, (b) dense unit. m indicates the number of
feature maps within each residual unit. The yellow square within the input denotes the considered region at the output. When
the classification labels are not available, only the up-sampling branches in the dashed box are considered.
branch. The NP branch predicts whether or not a pixel belongs to the nuclei or background, whereas the HoVer branch predicts the horizontal and vertical distances of nuclear pixels to their centres of mass. Then, the NC branch predicts the type of nucleus for each pixel. In particular, the NP and HoVer branches jointly achieve nuclear instance segmentation by first separating nuclear pixels from the background (NP branch) and then separating touching nuclei (HoVer branch). The NC branch determines the type of each nucleus by aggregating the pixel-level nuclear type predictions within each instance.

All three up-sampling branches utilise the same architectural design, which consists of a series of up-sampling operations and densely connected units [39] (or dense units). By stacking multiple and relatively cheap dense units, we build a large receptive field with minimal parameters, compared to using a single convolution with a larger kernel size, and we ensure efficient gradient propagation. We use skip connections [22] to incorporate features from the encoder, but utilise summation as opposed to concatenation. The consideration of low-level information is particularly important in segmentation tasks, where we aim to precisely delineate the object boundaries. We use dense units after the first and second up-sampling operations, where the number of units is 4 and 8 respectively. Valid convolution is performed throughout the two up-sampling branches to prevent poor predictions at the boundary. This results in the size of the output being smaller than the size of the input. As opposed to using a dedicated network for each task, a shared encoder makes it possible to train the nuclear instance segmentation and classification model end-to-end and therefore reduce the total training time. Furthermore, a shared encoder can also take advantage of the shared information across multiple tasks and thus help to improve the model performance on all tasks.

Finally, if we do not have the classification labels of the nuclei, only the NP and HoVer up-sampling branches are considered. Otherwise, we consider all three up-sampling branches and perform simultaneous nuclear instance segmentation and classification.

We display an overview of the network architecture in Fig. 2, where the spatial dimension of the input is 270×270 and the output dimension of each branch is 80×80. The dashed box within Fig. 2 highlights the branches for nuclear instance segmentation. Additionally, we also show a residual unit and a dense unit within Fig. 2a and Fig. 2b. We denote m as the number of feature maps within each convolution of a given residual unit. At each down-sampling level, from left to right, m = 256, 512, 1024 and 2048 respectively. We keep a fixed number of feature maps within each dense unit throughout the two branches, as shown in Fig. 2c.

1) Loss Function: The proposed network design has 4 different sets of weights: w0, w1, w2 and w3, which refer to the weights of the Preact-ResNet50 encoder, the HoVer branch decoder, the NP branch decoder and the NC branch decoder. These 4 sets of weights are optimised jointly using the loss L defined as:

$$\mathcal{L} = \underbrace{\lambda_a \mathcal{L}_a + \lambda_b \mathcal{L}_b}_{\text{HoVer branch}} + \underbrace{\lambda_c \mathcal{L}_c + \lambda_d \mathcal{L}_d}_{\text{NP branch}} + \underbrace{\lambda_e \mathcal{L}_e + \lambda_f \mathcal{L}_f}_{\text{NC branch}} \qquad (1)$$

where La and Lb represent the regression loss with respect to the output of the HoVer branch, Lc and Ld represent the loss with respect to the output at the NP branch and, finally, Le
5
Fig. 3: Cropped image regions showing horizontal and vertical map predictions, with corresponding ground truth. Arrows
highlight the strong instance information encoded within these maps, where there is a significant difference in the pixel values.
and Lf represent the loss with respect to the output at the NC branch. We choose to use two different loss functions at the output of each branch for an overall superior performance. λa, ..., λf are scalars that give weight to each associated loss function. Specifically, we set λb to 2 and the other scalars to 1, based on empirical selection.

Given the input image I, at each pixel i we define pi(I, w0, w1) as the regression output of the HoVer branch, whereas qi(I, w0, w2) and ri(I, w0, w3) denote the pixel-based softmax predictions of the NP and NC branches respectively. We also define Γi(I), Ψi(I) and Φi(I) as their corresponding ground truth (GT). Ψi(I) is the GT of the nuclear binary map, where background pixels have the value of 0 and nuclear pixels have the value 1. On the other hand, Φi(I) is the nuclear type GT, where background pixels have the value 0 and any integer value larger than 0 indicates the type of nucleus. Meanwhile, Γi(I) denotes the GT of the horizontal and vertical distances of nuclear pixels to their corresponding centres of mass. For Γi(I), we assign values between -1 and 1 to nuclear pixels in both the horizontal and vertical directions. We assign the value of the background and the line crossing the centre of mass within each nucleus to be 0. For clarity, we denote the horizontal and vertical components of the GT HoVer map as horizontal map Γi,x and vertical map Γi,y respectively. Visual examples of the horizontal and vertical maps can be seen in Fig. 3.

At the output of the HoVer branch, we compute a multiple-term regression loss. We denote La as the mean squared error between the predicted horizontal and vertical distances and the GT. We also propose a novel loss function Lb that calculates the mean squared error between the horizontal and vertical gradients of the horizontal and vertical maps respectively and the corresponding gradients of the GT. We formally define La and Lb as:

$$\mathcal{L}_a = \frac{1}{n} \sum_{i=1}^{n} \left( p_i(I; w_0, w_1) - \Gamma_i(I) \right)^2 \qquad (2)$$

$$\mathcal{L}_b = \frac{1}{m} \sum_{i \in M} \left( \nabla_x(p_{i,x}(I; w_0, w_1)) - \nabla_x(\Gamma_{i,x}(I)) \right)^2 + \frac{1}{m} \sum_{i \in M} \left( \nabla_y(p_{i,y}(I; w_0, w_1)) - \nabla_y(\Gamma_{i,y}(I)) \right)^2 \qquad (3)$$

Within equation (3), ∇x and ∇y denote the gradient in the horizontal x and vertical y directions respectively, m denotes the total number of nuclear pixels within the image and M denotes the set containing all nuclear pixels.
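The construction of the GT HoVer map and the losses in equations (2) and (3) can be sketched in NumPy as follows. This is an illustrative re-implementation, not the authors' code (their implementation is in the linked repository); the function names and the per-sign normalisation details are our own reading of the description above.

```python
import numpy as np

def hover_maps(inst_map):
    """GT HoVer map: horizontal/vertical distances of nuclear pixels to
    their instance's centre of mass, scaled per instance to [-1, 1];
    background (and the line through each centre) stays at 0."""
    h_map = np.zeros(inst_map.shape, dtype=np.float32)
    v_map = np.zeros(inst_map.shape, dtype=np.float32)
    for inst_id in np.unique(inst_map)[1:]:      # skip background label 0
        rows, cols = np.nonzero(inst_map == inst_id)
        dx = cols - cols.mean()                  # signed horizontal offset
        dy = rows - rows.mean()                  # signed vertical offset
        for d in (dx, dy):                       # scale each sign separately
            if (d < 0).any():
                d[d < 0] /= -d.min()
            if (d > 0).any():
                d[d > 0] /= d.max()
        h_map[rows, cols] = dx
        v_map[rows, cols] = dy
    return h_map, v_map

def loss_a(pred, gt):
    """Eq. (2): mean squared error over all pixels of one map."""
    return np.mean((pred - gt) ** 2)

def loss_b(pred, gt, nuclear_mask, axis):
    """One term of eq. (3): MSE between map gradients, restricted to
    nuclear pixels (axis=1 for the horizontal map, axis=0 for vertical)."""
    grad_p = np.gradient(pred, axis=axis)
    grad_g = np.gradient(gt, axis=axis)
    return np.mean((grad_p[nuclear_mask] - grad_g[nuclear_mask]) ** 2)

# A single square nucleus: the offsets span the full [-1, 1] range.
inst = np.zeros((6, 6), dtype=int)
inst[1:5, 1:5] = 1
h, v = hover_maps(inst)
```

The gradient term in Lb rewards sharp sign flips of the maps at instance boundaries, which is what makes the maps usable for separating touching nuclei at inference time.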
an in-depth analysis, thus facilitating a comprehensive understanding of the approach, which can help drive forward model development. For nuclear instance segmentation, the problem can be divided into the following three sub-tasks:
• Separate the nuclei from the background
• Detect individual nuclear instances
• Segment each detected instance

In the current literature, two evaluation metrics have been mainly adopted to quantitatively measure the performance of nuclear instance segmentation: 1) Ensemble Dice (DICE2) [30], and 2) Aggregated Jaccard Index (AJI) [27]. Given the ground truth X and prediction Y, DICE2 computes and aggregates DICE per nucleus, where the Dice coefficient (DICE) is defined as 2×(X∩Y)/(|X|+|Y|), and AJI computes the ratio of an aggregated intersection cardinality and an aggregated union cardinality between X and Y.

These two evaluation metrics only provide an overall score for the instance segmentation quality and therefore provide no further insight into the sub-tasks at hand. In addition, these two metrics have a limitation, which we illustrate in Fig. 4. From the figure, although prediction A only differs from prediction B by a few pixels, the DICE2 and AJI scores for B are inferior. These scores are shown in Table I. This problem arises due to over-penalisation of the overlapping regions. By overlaying the GT segment contours (red dashed line) upon the two predictions, we observe that, although the cyan-coloured instance within prediction A overlaps mostly with the cyan-coloured GT instance, it also slightly overlaps with the blue-coloured GT instance. As a result, according to the DICE2 algorithm, the predicted cyan instance will be penalised by pixels not only coming from the dominant overlapping cyan-coloured GT instance, but also from the blue-coloured GT instance. The AJI also suffers from the same phenomenon. However, because AJI only uses the prediction and GT instance pair with the highest intersection over union, over-penalisation is less likely compared to DICE2. Over-penalisation is likely to occur when the model completely fails to detect the neighbouring instance, such as in Fig. 4. Nonetheless, when evaluating methods across different datasets, specifically on samples containing many hard-to-recognise nuclei such as fibroblasts or nuclei with poor staining, the number of failed detections may increase and therefore may have a negative impact on the AJI measurement. Due to the limitations of DICE2 and AJI, it is clear that there is a need for an improved, reliable quantitative measurement.

Panoptic Quality: We propose to use another metric for accurate quantification and interpretability to assess the performance of nuclear instance segmentation. Originally proposed by [40], panoptic quality (PQ) for nuclear instance segmentation is defined as:

$$PQ = \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{Detection Quality (DQ)}} \times \underbrace{\frac{\sum_{(x,y) \in TP} \mathrm{IoU}(x,y)}{|TP|}}_{\text{Segmentation Quality (SQ)}} \qquad (7)$$

where x denotes a GT segment, y denotes a prediction segment and IoU denotes intersection over union. Each (x, y) pair is mathematically proven to be unique [40] over the entire set of prediction and GT segments if their IoU(x, y) > 0.5. The unique matching splits all available segments into matched pairs (TP), unmatched GT segments (FN) and unmatched prediction segments (FP). From this, PQ can be intuitively analysed as follows: the detection quality (DQ) is the F1 score that is widely used to evaluate instance detection, while segmentation quality (SQ) can be interpreted as how close each correctly detected instance is to its matched GT. DQ and SQ, in a way, also provide a direct insight into the second and third sub-tasks defined above. We believe that PQ should set the standard for measuring the performance of nuclear instance segmentation methods.

Overall, to fully characterise and understand the performance of each method, we use the following three metrics: 1) DICE, to measure the separation of all nuclei from the background; 2) Panoptic Quality, as a unified score for comparison; and 3) AJI, for direct comparison with previous publications (evaluation code: https://github.com/vqdang/hover_net/src/metrics). Panoptic Quality is further broken down into DQ and SQ components for interpretability. Note, SQ is calculated only within true positive segments and should therefore be observed together with DQ. Throughout this study, these metrics are calculated for each image and the average over all images is reported as the final value for each dataset.

B. Nuclear Classification Evaluation

Classification of the type of each nucleus is performed within the nuclear instances extracted from the instance segmentation or detection tasks. Therefore, the overall measurement for nuclear type classification should also encompass these two tasks. For all nuclear instances of a particular type t from both the ground truth and the prediction, the detection task d splits the GT and predicted instances into the following subsets: correctly detected instances (TPd), misdetected GT instances (FNd) and overdetected predicted instances (FPd). Subsequently, the classification task c further breaks TPd into correctly classified instances of type t (TPc), correctly classified instances of types other than type t (TNc), incorrectly classified instances of type t (FPc) and incorrectly classified instances of types other than type t (FNc). We then define the Fc score of each type t for combined nuclear type classification and detection as follows:

$$F_c^t = \frac{2(TP_c + TN_c)}{2(TP_c + TN_c) + \alpha_0 FP_c + \alpha_1 FN_c + \alpha_2 FP_d + \alpha_3 FN_d} \qquad (8)$$

where we use α0 = α1 = 2 and α2 = α3 = 1 to give more emphasis to nuclear type classification. Moreover, using the same weighting, if we further extend t to encompass all types of nuclei T (t ∈ T), the classification within TPd is then divided into a correctly classified set Ac and an incorrectly classified set Bc. We can therefore disassemble Fc into:

$$F_c^T = \frac{2A_c}{2(A_c + B_c) + FP_d + FN_d} = \frac{2(A_c + B_c)}{2(A_c + B_c) + FP_d + FN_d} \times \frac{A_c}{A_c + B_c} = F_d \times \text{Classification Accuracy within Correctly Detected Instances} \qquad (9)$$
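To make the metric of equation (7) concrete, a minimal sketch follows, assuming the unique IoU > 0.5 matching between GT and predicted segments has already been computed. This is an illustrative helper, not the official evaluation code in the linked repository.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Eq. (7): PQ = DQ x SQ. `matched_ious` holds IoU(x, y) for each
    uniquely matched GT/prediction pair (IoU > 0.5); num_fp / num_fn
    count unmatched prediction / GT segments respectively."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    dq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # detection quality (F1)
    sq = sum(matched_ious) / tp                   # mean IoU of matches
    return dq, sq, dq * sq

# Two matched pairs, one spurious prediction, one missed GT nucleus.
dq, sq, pq = panoptic_quality([0.9, 0.7], num_fp=1, num_fn=1)
```

Reporting DQ and SQ alongside PQ preserves the per-sub-task interpretability argued for above: a low DQ flags detection failures while a low SQ flags sloppy boundaries, even when the combined PQ values of two methods coincide.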
where Fd is simply the standard detection quality, like DQ, while the other term is the accuracy of nuclear type classification within correctly detected instances. In the case where the GT is not exhaustively annotated for nuclear type classification, as in CRCHisto, an amount equal to the number of unlabelled GT instances in each set is subtracted from Bc and FNc.

Finally, while IoU is utilised as the criterion in DQ for selecting the TP for detection in instance segmentation, detection methods cannot calculate the IoU. Therefore, to facilitate comparison of both instance segmentation and detection methods for the nuclear type classification tasks, for Fc, we utilise the notion of distance to determine whether nuclei have been detected. To be precise, we define the region within a predefined radius from the annotated centre of the nucleus as the ground truth and, if a prediction lies within this area, then it is considered to be a true positive. Here, we are consistent with [35] and use a radius of 6 pixels at 20× or 12 pixels at 40×.

V. EXPERIMENTAL RESULTS

A. Datasets

As part of this work, we introduce a new dataset that we term the colorectal nuclear segmentation and phenotypes (CoNSeP) dataset, consisting of 41 H&E stained image tiles, each of size 1,000×1,000 pixels at 40× objective magnification. Images were extracted from 16 colorectal adenocarcinoma (CRA) WSIs, each belonging to an individual patient, and scanned with an Omnyx VL120 scanner within the department of pathology at University Hospitals Coventry and Warwickshire, UK. We chose to focus on a single cancer type, so that we are able to display the true variation of tissue within colorectal adenocarcinoma WSIs, as opposed to other datasets that instead focus on using a small number of visual fields from various cancer types. Within this dataset, stroma, glandular, muscular, collagen, fat and tumour regions can be observed. Besides incorporating different tissue components, the 41 images were also chosen such that different nuclei types were present, including: normal epithelial; tumour epithelial; inflammatory; necrotic; muscle and fibroblast. Here, by type we are referring to the type of cell from which the nucleus originates. Within the dataset, there are many significantly overlapping nuclei with indistinct boundaries and there exist various artifacts, such as ink. As a result of the diversity of the dataset, it is likely that a model trained on CoNSeP will perform well for unseen CRA cases. For each image tile, every nucleus was annotated by one of two expert pathologists (A.A, Y-W.T). After full annotation, each annotated sample was reviewed by both of the pathologists, thereby refining their own and each other's annotations. By the end of the annotation process, each pathologist had fully checked every sample and consensus had been reached. Annotating the data in this way ensured that minimal nuclei were missed in the annotation process. However, we cannot avoid inevitable pixel-level differences between the annotation and the true nuclear boundary in challenging cases. In addition to delineating the nuclear boundaries, every nucleus was labelled as either: normal epithelial, malignant/dysplastic epithelial, fibroblast, muscle, inflammatory, endothelial or miscellaneous. Within the miscellaneous category, necrotic, mitotic and cells that could not be categorised were grouped. For our experiments, we grouped the normal and malignant/dysplastic epithelial nuclei into a single class and we grouped the fibroblast, muscle and endothelial nuclei into a class named spindle-shaped nuclei.

Overall, six independent datasets are utilised for this study. A full summary of each of them is provided in Table II. Five of these datasets are used to evaluate the instance segmentation performance, which we refer to as: CoNSeP; Kumar [27]; CPM-15; CPM-17 [30] and TNBC [31]. Example images from each of the five datasets can be seen in Fig. 7. Meanwhile, we utilise CoNSeP and a further dataset, named CRCHisto, to quantify the performance of the nuclear classification model. The CRCHisto dataset consists of the same nuclei types that are present in CoNSeP. It is also worth noting that the CRCHisto dataset is not exhaustively annotated for nuclear class labels.

The CoNSeP dataset is available at https://warwick.ac.uk/fac/sci/dcs/research/tia/data/.

B. Implementation and Training Details

We implemented our framework with the open source software library TensorFlow version 1.8.0 [41] on a workstation equipped with two NVIDIA GeForce 1080 Ti GPUs. During training, data augmentation including flip, rotation, Gaussian blur and median blur was applied to all methods. All networks received an input patch with a size ranging from 252×252 to 270×270. This size difference is due to the use of valid convolutions in some architectures, such as HoVer-Net and U-Net. Regarding HoVer-Net, we initialised the model with pre-trained weights on the ImageNet dataset [37], trained only the decoders for the first 50 epochs, and then fine-tuned all layers for another 50 epochs. We train stage one for around 120 minutes and stage two for around 260 minutes. Therefore, the overall training time is around 380 minutes. Stage two takes longer to train because unfreezing the encoder utilises more memory and therefore a smaller batch size needs to be used. Specifically, we used a batch size of 8 and 4 on each GPU for stages one and two respectively. We used Adam optimisation with an initial learning rate of 10⁻⁴, which was then reduced to 10⁻⁵ after 25 epochs. This strategy was repeated for fine-tuning. On the whole, training of the network is stable, where the usage of fully independent decoders helps the network to converge each time. The network was trained with an RGB input, normalised between 0 and 1.

C. Comparative Analysis of Segmentation Methods

Experimental Setting: We evaluated our approach by employing a full independent comparison across the three largest known exhaustively labelled nuclear segmentation datasets: Kumar, CoNSeP and CPM-17, and utilised the metrics as described in Section IV-A. For this experiment, because we do not have the classification labels for all datasets, we perform instance segmentation without classification. This enables us to
TABLE II: Summary of the datasets used in our experiments. UHCW denotes University Hospitals Coventry and Warwickshire and TCGA denotes The Cancer Genome Atlas. Seg denotes segmentation masks and Class denotes classification labels.

                         CoNSeP      Kumar       CPM-15               CPM-17               TNBC             CRCHisto
Total Number of Nuclei   24,319      21,623      2,905                7,570                4,056            29,756
Labelled Nuclei          24,319      0           0                    0                    0                22,444
Number of Images         41          30          15                   32                   50               100
Origin                   UHCW        TCGA        TCGA                 TCGA                 Curie Institute  UHCW
Magnification            40×         40×         40× & 20×            40× & 20×            40×              20×
Size of Images           1000×1000   1000×1000   400×400 to 1000×600  500×500 to 600×600   512×512          500×500
Seg/Class                Both        Seg         Seg                  Seg                  Seg              Class
Number of Cancer Types   1           8           2                    4                    1                1
fully leverage all data and allows us to rigorously evaluate the segmentation capability of our model. In the same way as [27], we split the Kumar dataset into two different sub-datasets: (i) Kumar-Train, a training set with 16 image tiles (4 breast, 4 liver, 4 kidney and 4 prostate) and (ii) Kumar-Test, a test set with 14 image tiles (2 breast, 2 liver, 2 kidney, 2 prostate, 2 bladder, 2 colon, 2 stomach). Note, we utilise the exact same image split used by other recent approaches [27], [31], [29], but we do not separate the test set into two subsets. We do this to ensure that the test set is large enough, ensuring a reliable evaluation. For CoNSeP, we devise a suitable train and test set that contains 26 and 14 images respectively. The images within the test set were selected to ensure that the true diversity of nuclei types within colorectal tissue is represented. For CPM-17, we utilise the same split that had been employed for the challenge, with 32 images in both the training and test datasets.

We compared our proposed model to recent segmentation approaches used in computer vision [21], [44], [32], medical imaging [22] and also to methods specifically tuned for the task of nuclear segmentation [25], [23], [31], [29], [30]. We also compared the performance of our model to two open source software applications: Cell Profiler [42] and QuPath [43]. Cell Profiler is a software application for cell-based analysis, with several suggested pipelines for computational pathology. The pipeline that we adopted applies a threshold to the greyscale image and then uses a series of post processing operations. QuPath is an open source software application for digital pathology and whole slide image analysis. To achieve nuclear segmentation, we used the default parameters within the application. FCN,
Fig. 6: Sample cropped regions extracted from the CoNSeP dataset, where the colour of each nuclear boundary denotes the category.
SegNet, U-Net, DCAN, Mask-RCNN and DIST have been implemented by the authors of the paper (S.G, Q.D.V). For Mask-RCNN, we slightly modified the original implementation by using smaller anchor boxes. The default configuration is fine-tuned for natural images and therefore this modification was necessary to perform successful nuclear segmentation. DIST was implemented with the assistance of the first author of the corresponding approach in order to ensure reliability during evaluation. This also enabled us to utilise DIST for further comparison in our experiments. For Micro-Net, we used the same implementation that was described by [23] and was implemented by the first author of the corresponding paper (S.E.A.R). For CNN3 and CIA-Net, we report the results on the Kumar dataset that are given in their respective original papers. The authors of CIA-Net and DRAN provided their segmentation output, which meant that we were able to obtain all metrics on the datasets that the models were applied to. Therefore, we report results of CIA-Net on the Kumar dataset and results of DRAN on the CPM-17 dataset. Note, for all self-implemented approaches we are consistent with our pre-processing strategy. However, DRAN, CNN3 and CIA-Net results are taken directly from their respective papers and therefore we cannot guarantee the same pre-processing steps. CNN3 and CIA-Net also use stain normalisation, whereas other methods described in this paper do not.

Comparative Results: Table III and the box plots in Fig. 8a and 8b show detailed results of this experiment. Within the box plots, we choose not to show AJI, due to its limitations as discussed in Section IV-A. A large variation in performance between methods within each dataset is observed. This variation is particularly evident in the Kumar and CoNSeP datasets, where there exists a large number of overlapping nuclei. Both Cell Profiler [42] and QuPath [43] achieve sub-optimal performance for all datasets. In particular, both software applications consistently achieve a low DICE score, suggesting that their inability to distinguish nuclear pixels from the background is a major limiting factor. FCN-based approaches improve the capability of models to detect nuclear pixels, yet often fail due to their inability to separate clustered instances. For example, despite a higher DICE score than Cell Profiler and QuPath, networks built only for semantic segmentation, like FCN8 and SegNet, suffer from low PQ values. Therefore, methods that incorporate strong instance-aware techniques are favourable. Within CPM-17, there are fewer overlapping nuclei, which explains why methods that are not instance-aware are still able to achieve a satisfactory performance. We observe that the weighted cross entropy loss used in both U-Net and Micro-Net can help to separate joined nuclei, but its success also depends on the capacity of the network. This is reflected by the increased performance of Micro-Net over U-Net.

DCAN is able to better distinguish between separate instances than FCN8, which uses a very similar encoder based on the VGG16 network. Therefore, incorporating additional information at the output of the network can improve the segmentation performance. This is also exemplified by the fairly strong performances of CNN3, DIST, DRAN and CIA-Net. In a different way, Mask-RCNN is able to successfully separate clustered nuclei by utilising a region proposal based approach. However, Mask-RCNN is less effective than other
TABLE III: Comparative experiments on the Kumar [27], CoNSeP and CPM-17 [30] datasets. WS denotes watershed-based
post processing.
Kumar CoNSeP CPM-17
Methods DICE AJI DQ SQ PQ DICE AJI DQ SQ PQ DICE AJI DQ SQ PQ
Cell Profiler [42] 0.623 0.366 0.423 0.704 0.300 0.434 0.202 0.249 0.705 0.179 0.570 0.338 0.368 0.702 0.261
QuPath [43] 0.698 0.432 0.511 0.679 0.351 0.588 0.249 0.216 0.641 0.151 0.693 0.398 0.320 0.717 0.230
FCN8 [21] 0.797 0.281 0.434 0.714 0.312 0.756 0.123 0.239 0.682 0.163 0.840 0.397 0.575 0.750 0.435
FCN8 + WS [21] 0.797 0.429 0.590 0.719 0.425 0.758 0.226 0.320 0.676 0.217 0.840 0.397 0.575 0.750 0.435
SegNet [44] 0.811 0.377 0.545 0.742 0.407 0.796 0.194 0.371 0.727 0.270 0.857 0.491 0.679 0.778 0.531
SegNet + WS [44] 0.811 0.508 0.677 0.744 0.506 0.793 0.330 0.464 0.721 0.335 0.856 0.594 0.779 0.784 0.614
U-Net [22] 0.758 0.556 0.691 0.690 0.478 0.724 0.482 0.488 0.671 0.328 0.813 0.643 0.778 0.734 0.578
Mask-RCNN [32] 0.760 0.546 0.704 0.720 0.509 0.740 0.474 0.619 0.740 0.460 0.850 0.684 0.848 0.792 0.674
DCAN [25] 0.792 0.525 0.677 0.725 0.492 0.733 0.289 0.383 0.667 0.256 0.828 0.561 0.732 0.740 0.545
Micro-Net [23] 0.797 0.560 0.692 0.747 0.519 0.794 0.527 0.600 0.745 0.449 0.857 0.668 0.836 0.788 0.661
DIST [31] 0.789 0.559 0.601 0.732 0.443 0.804 0.502 0.544 0.728 0.398 0.826 0.616 0.663 0.754 0.504
CNN3 [27] 0.762 0.508 - - - - - - - - - - - - -
CIA-Net [29] 0.818 0.620 0.754 0.762 0.577 - - - - - - - - - -
DRAN [30] - - - - - - - - - - 0.862 0.683 0.811 0.804 0.657
HoVer-Net 0.826 0.618 0.770 0.773 0.597 0.853 0.571 0.702 0.778 0.547 0.869 0.705 0.854 0.814 0.697
methods at detecting nuclear pixels, which is reflected by a lower DICE score.

Due to the reasoning given in Section IV, we place a larger emphasis on PQ to determine the success of different models. In particular, we consistently obtain an improved performance over DIST, which justifies the use of our proposed horizontal and vertical maps as a regression target. We also report a better performance than the winners of the Computational Precision Medicine and MoNuSeg challenges [30], [29], which utilised the CPM-17 and Kumar datasets respectively. Therefore, HoVer-Net achieves state-of-the-art performance for nuclear instance segmentation compared to all competing methods on multiple datasets that consist of a variety of different tissue types. Our approach also outperforms methods that were fine-tuned for the task of nuclear segmentation.

D. Generalisation Study

Experimental Setting: The goal of any automated method is to perform well on unseen data with high accuracy. Therefore, we conducted a large-scale study to assess how all methods generalise to new H&E stained images. To analyse the generalisation capability, we assessed the ability to segment nuclei from: i) new organs (variation in nuclei shapes) and ii) different centres (variation in staining).

The five instance segmentation datasets used within our experiments can be grouped into three groups according to their origin: TCGA (Kumar, CPM-15, CPM-17), TNBC and CoNSeP. We used Kumar as the training and validation set, due to its size and diversity, whilst the combined CPM (CPM-15 and CPM-17), TNBC and CoNSeP datasets were used as three independent test sets. We split the test sets in this way in accordance with their origin. Note, for this experiment we use both the training and test sets of CPM-17 and CoNSeP to form the independent test sets. Kumar was split into three subsets, as explained in Section V-A, and Kumar-Train was used to train all models, i.e. trained with samples originating from the following organs: breast, prostate, kidney and liver. Despite all samples being extracted from TCGA, CPM samples come from the brain, head & neck and lung regions. Therefore, testing with CPM reflects the ability of the model to generalise to new organs, as mentioned above by the first generalisation criterion. TNBC contains samples from an already seen organ (breast), but the data is extracted from an independent source with different specimen preservation and staining practice. Therefore, this reflects the second generalisation criterion. CoNSeP contains samples taken from colorectal tissue, which is not represented in Kumar-Train, and is also extracted from a source independent of TCGA. Therefore, this reflects both the first and second generalisation criteria. Also, as mentioned in Section V-A, CoNSeP contains challenging samples, where there exist various artifacts and variation in the quality of slide preparation. Therefore, the performance on this dataset also reflects the ability of a model to generalise to difficult samples.

Comparative Results: The results are reported in Table IV, where we only display the results of methods that employ an instance-based technique. We observe that our proposed model is able to successfully generalise to unseen data in all three cases. However, some methods prove to perform poorly on unseen data, where in particular, U-Net and DIST perform worse than other competing methods on all three datasets. Both SegNet with watershed and Mask-RCNN achieve a competitive performance across all three generalisation tests. However, similar to the results reported in Table III, Mask-RCNN is not able to distinguish nuclear pixels from the background as well as other competing methods, which has an adverse effect on the overall segmentation performance shown by PQ. On the other hand, SegNet proves to successfully detect nuclear pixels, reporting a greater DICE score than HoVer-Net on both the TNBC and CoNSeP datasets. However, the overall segmentation result for HoVer-Net is superior because it is better able to separate nuclear instances by incorporating
Fig. 7: Example visual results on the CPM-17, Kumar and CoNSeP datasets. For each dataset, we display the 4 models that achieve the highest PQ score from left to right. The different colours of the nuclear boundaries denote separate instances.

Fig. 8: Box plots highlighting the performance of competing methods on the (a) Kumar and (b) CoNSeP datasets.
the horizontal and vertical maps at the output of the network.

E. Comparative Analysis of Classification Methods

Experimental Setting: We converted the top four performing nuclear instance segmentation algorithms, based on their panoptic quality on the CoNSeP dataset, such that they were able to perform simultaneous instance segmentation and classification. As mentioned in Section V-A, the nuclear categories that we use in our experiments are: miscellaneous, inflammatory, epithelial and spindle-shaped. Specifically, we compared HoVer-Net with Micro-Net, Mask-RCNN and DIST. For Micro-Net, we used an output depth of 5 rather than 2, where each channel gave the probability of a pixel being either background, miscellaneous, inflammatory, epithelial or spindle-shaped. For Mask-RCNN, there is a devoted classification branch that predicts the class of each instance, and it is therefore well suited to a multi-class setting. DIST performs regression at the output of the network and therefore
TABLE IV: Comparative results, highlighting the generalisation capability of different models. All models are initially trained
on Kumar and then the Combined CPM [30], TNBC [31] and CoNSeP datasets are processed.
Combined CPM TNBC CoNSeP
Methods DICE AJI DQ SQ PQ DICE AJI DQ SQ PQ DICE AJI DQ SQ PQ
FCN8 + WS [21] 0.762 0.531 0.669 0.722 0.487 0.726 0.506 0.662 0.723 0.480 0.609 0.247 0.345 0.688 0.240
SegNet + WS [44] 0.791 0.583 0.738 0.755 0.561 0.758 0.559 0.734 0.750 0.554 0.681 0.315 0.449 0.733 0.332
U-Net [22] 0.720 0.541 0.652 0.672 0.446 0.681 0.514 0.635 0.676 0.442 0.585 0.363 0.442 0.670 0.297
Mask-RCNN [32] 0.764 0.575 0.760 0.719 0.549 0.705 0.529 0.726 0.742 0.543 0.606 0.348 0.492 0.720 0.357
DCAN [25] 0.770 0.582 0.716 0.730 0.528 0.725 0.537 0.683 0.720 0.495 0.609 0.306 0.403 0.685 0.278
Micro-Net [23] 0.792 0.615 0.716 0.751 0.542 0.701 0.531 0.656 0.753 0.497 0.644 0.394 0.489 0.722 0.356
DIST [31] 0.775 0.563 0.593 0.720 0.432 0.719 0.523 0.549 0.714 0.404 0.621 0.369 0.379 0.701 0.268
HoVer-Net 0.801 0.626 0.774 0.778 0.606 0.749 0.590 0.743 0.759 0.578 0.664 0.404 0.529 0.764 0.408
converting the model such that it is able to classify nuclei into multiple categories is non-trivial. Instead, we add an extra 1×1 convolution at the output of the network that performs nuclear classification. As well as comparing to the aforementioned methods, we compared our approach to a spatially constrained CNN (SC-CNN), which achieves detection and classification. Note, because SC-CNN does not produce a segmentation mask, we do not report the PQ for this method.

Comparative Results: We trained our models on the training set of the CoNSeP dataset and then evaluated the model on both the test set of CoNSeP and the entire CRCHisto dataset. Table V displays the results of the multi-class models on the CoNSeP and CRCHisto datasets respectively, where the given metrics are described in Section IV-B. For CoNSeP, along with the classification metrics, we provide PQ as an indication of the quality of instance segmentation. However, in CRCHisto, only the nuclear centroids are given and therefore we exclude PQ from the CRCHisto evaluation because it cannot be calculated without the instance segmentation masks. We observe that HoVer-Net achieves good quality simultaneous instance segmentation and classification compared to competing methods. It must be noted that we should expect a lower F1 score for the miscellaneous class because there are significantly fewer nuclei represented. Also, there is a high diversity of nuclei types that have been grouped within this class, belonging to: mitotic, necrotic and uncategorisable cells. Despite this, HoVer-Net is able to achieve a satisfactory performance on this class, where other methods fail. Furthermore, compared to other methods, our approach achieves the best F1 score for the epithelial, inflammatory and spindle-shaped classes. Therefore, due to HoVer-Net obtaining a strong performance for both nuclear segmentation and classification, we suggest that our model may be used for sophisticated subsequent cell-level downstream analysis in computational pathology.

VI. DISCUSSION AND CONCLUSIONS

Analysis of nuclei in large-scale histopathology images is an important step towards automated downstream analysis for diagnosis and prognosis of cancer. Nuclear features have often been used to assess the degree of malignancy [45]. However, visual analysis of nuclei is a very time consuming task because there are often tens of thousands of nuclei within a given whole-slide image (WSI). Performing simultaneous nuclear instance segmentation and classification enables subsequent exploration of the role that nuclear features play in predicting clinical outcome. For example, [4] utilised nuclear features from histology TMA cores to predict survival in early-stage estrogen receptor-positive breast cancer. Restricting the analysis to some specific nuclear types only may be advantageous for accurate analysis in computational pathology.

In this paper, we have proposed HoVer-Net for simultaneous segmentation and classification of nuclei within multi-tissue histology images that not only detects nuclei with high accuracy, but also effectively separates clustered nuclei. Our approach has three up-sampling branches: 1) the nuclear pixel branch that separates nuclear pixels from the background; 2) the HoVer branch that regresses the horizontal and vertical distances of nuclear pixels to their centres of mass and 3) the nuclear classification branch that determines the type of each nucleus. We have shown that the proposed approach achieves state-of-the-art instance segmentation performance compared to a large number of recently published deep learning models across multiple datasets, including tissues that have been prepared and stained under different conditions. This makes the proposed approach likely to translate well to a practical setting due to its strong generalisation capacity, and it can therefore be effectively used as a prerequisite step before nuclear-based feature extraction. We have shown that utilising the horizontal and vertical distances of nuclear pixels to their centres of mass provides powerful instance-rich information, leading to state-of-the-art performance in histological nuclear segmentation. When classification labels are available, we show that our model is able to successfully segment and classify nuclei with high accuracy.

Region proposal (RP) methods, such as Mask-RCNN, show great potential in dealing with overlapping instances because there is no notion of separating instances; instead, nuclei are segmented independently. However, a major limitation of RP methods is the difficulty in merging instance predictions between neighbouring tiles during processing. For example, if a sub-segment of a nucleus at the boundary is assigned a label, one must ensure that the remainder of the nucleus in the neighbouring tile is also assigned the same label. To overcome this difficulty, for Mask-RCNN, we utilised an overlapping tile mechanism such that we only considered non-boundary nuclei.
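The overlapping tile mechanism described above can be sketched as follows. This is a minimal illustration under our own assumptions (axis-aligned bounding boxes in tile coordinates and a hypothetical margin parameter), not the paper's actual implementation; with a tile stride of tile_size − 2·margin, every nucleus falls fully inside the central region of some tile.

```python
def central_instances(instances, tile_size, margin):
    """Keep only instances whose bounding boxes lie fully inside the
    central (non-boundary) region of a tile, so that no cross-tile
    label merging is required.

    instances: dict mapping instance id -> (x0, y0, x1, y1) bounding box
    in tile coordinates (a hypothetical representation for illustration).
    """
    lo, hi = margin, tile_size - margin
    return {i: (x0, y0, x1, y1)
            for i, (x0, y0, x1, y1) in instances.items()
            if x0 >= lo and y0 >= lo and x1 <= hi and y1 <= hi}
```

Tiles would then be processed with an overlap of at least 2·margin, so that a nucleus discarded at one tile's boundary reappears as a central instance of a neighbouring tile.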
TABLE V: Comparative results for nuclear classification on the CoNSeP and CRCHisto datasets. Fd denotes the F1 score for nuclear detection, whereas Fec, Fic, Fsc and Fmc denote the F1 classification score for the epithelial, inflammatory, spindle-shaped and miscellaneous classes respectively.

                        CoNSeP                                     CRCHisto
Methods         PQ     Fd     Fec    Fic    Fsc    Fmc     Fd     Fec    Fic    Fsc    Fmc
SC-CNN [35]     -      0.608  0.306  0.193  0.175  0.000   0.664  0.246  0.111  0.126  0.000
DIST [31]       0.372  0.712  0.617  0.534  0.505  0.000   0.616  0.464  0.514  0.275  0.000
Micro-Net [23]  0.430  0.743  0.615  0.592  0.532  0.117   0.638  0.422  0.518  0.249  0.059
Mask-RCNN [32]  0.450  0.692  0.595  0.590  0.520  0.098   0.639  0.503  0.537  0.294  0.077
HoVer-Net       0.516  0.748  0.635  0.631  0.566  0.426   0.688  0.486  0.573  0.302  0.178
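For reference, the PQ reported in Tables III and V factorises as PQ = DQ × SQ. A minimal sketch of the standard panoptic quality computation (our own helper, assuming prediction/ground-truth pairs have already been matched at IoU > 0.5):

```python
def panoptic_quality(tp_ious, num_pred, num_gt):
    """Detection quality (DQ), segmentation quality (SQ) and panoptic
    quality (PQ) from the IoU values of matched true-positive pairs.

    tp_ious: list of IoU values, one per matched (prediction, ground
    truth) pair; unmatched predictions count as false positives and
    unmatched ground truths as false negatives.
    """
    tp = len(tp_ious)
    fp = num_pred - tp
    fn = num_gt - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    dq = tp / denom if denom else 0.0          # detection quality
    sq = sum(tp_ious) / tp if tp else 0.0      # mean IoU of matched pairs
    return dq, sq, dq * sq
```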
Regarding the processing time, the average time to process a 1,000×1,000 image tile over 10 runs using Mask-RCNN for segmentation and classification was 106.98 seconds. Meanwhile, HoVer-Net only took an average of 11.04 seconds to complete the same operation: approximately 9.7× faster. On the other hand, the average processing time for DIST and Micro-Net was 0.600 and 0.832 seconds respectively. Mask-RCNN inherently stores a single instance per channel, which leads to very large arrays in memory when there are many nuclei in a single image patch; this also contributes to the much longer processing time seen above. Overall, FCN methods seem to translate better to WSI processing than Mask-RCNN or RPN methods in general. It must be stressed that the timing is not exact and is dependent on hardware specifications and software implementation. With optimised code and sophisticated hardware, we expect these timings to be considerably different. Additionally, the inference time is also dependent on the size of the output. In particular, with a smaller output size, a smaller stride is also required during processing. For instance, if we used padded convolution in the up-sampling branches of HoVer-Net, then we observe a 5.6× speed up and the average processing time is 1.97 seconds per 1,000×1,000 image tile. For fair comparison, all models were processed on a single GPU with 12GB RAM and we fixed the batch size to one. Future work will explore the trade-off between the efficiency of HoVer-Net and its potential to accurately perform instance segmentation and classification.

A major bottleneck for the development of successful nuclear segmentation algorithms is the limitation of data, particularly with additional associated class labels. In this work, we introduce the colorectal adenocarcinoma nuclear segmentation and phenotypes (CoNSeP) dataset, containing over 24K labelled nuclei from challenging samples to reflect the true difficulty of segmenting nuclei in whole-slide images. Due to the abundance of nuclei with an associated nuclear category, CoNSeP aims to help accelerate the development of further simultaneous nuclear instance segmentation and classification models to further increase the sophistication of cell-level analysis within computational pathology.

We analysed the common measurements used to assess the true performance of nuclear segmentation models and discussed their limitations. Due to the fact that these measurements did not always reflect the instance segmentation performance, we proposed a set of reliable and informative statistical measures. We encourage researchers to utilise the proposed measures to not only maximise the interpretability of their results, but also to perform a fair comparison with other methods.

Finally, methods have surfaced recently that explore the relationship of various nuclear types within histology images [6], [5], yet these methods are limited to spatial analysis because the segmentation masks are not available. Utilising our model for nuclear segmentation and classification enables the exploration of the spatial relationship between various nuclear types combined with nuclear morphological features and therefore may provide additional diagnostic and prognostic value. Currently, our model is trained on a single tissue type, yet due to the strong performance of our instance segmentation model across multiple tissues, we are confident that our model will perform well if we were to incorporate additional tissue types. We observe a low F1 classification score for the miscellaneous category in the classification model because there are significantly fewer samples within this category and there exists high intra-class variability. Future work will involve obtaining more samples within this category, including necrotic and mitotic nuclei, to improve the class balance of the data.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016R1C1B2012433) and by the Ministry of Science and ICT (MSIT) (No. 2018K1A3A1A74065728). This work was also supported in part by the UK Medical Research Council (No. MR/P015476/1). NR is part of the PathLAKE digital pathology consortium, which is funded from the Data to Early Diagnosis and Precision Medicine strand of the government's Industrial Strategy Challenge Fund, managed and delivered by UK Research and Innovation (UKRI). We also acknowledge the financial support from the Engineering and Physical Sciences Research Council and Medical Research Council, provided as part of the Mathematics for Real-World Systems CDT. We thank Peter Naylor for his assistance in the implementation of the DIST network.

APPENDIX A. ABLATION STUDIES

To gain a full understanding of the contribution of our method, we investigated several of its components. Specifically, we performed the following ablation experiments: (i)
contribution of the proposed loss strategy; (ii) Sobel-based binary mask, the positive channels are combined together after
post processing technique compared to other strategies and nuclear type prediction. Utilising three branches decouples the
(iii) contribution of the dedicated classification branch. Here, tasks of nuclear classification and nuclear detection, where a
we utilised the Kumar and CoNSeP datasets for (i) and (ii) separate branch is devoted to each task. For this ablation study,
due to the large number of nuclei present, whereas for (iii) we train on the CoNSeP training set and then process both the
we use CoNSeP and CRCHisto because we do not have the CoNSeP test set and the entire CRCHisto dataset.
classification labels for Kumar. We report results in Table A3, where we observe that
Loss Terms: We conducted an experiment to understand the utilising a separate branch devoted to the task of nuclear
contribution of our proposed loss strategy. First, we used mean classification leads to an improved overall performance of
squared error (MSE) of the horizontal and vertical distances simultaneous nuclear instance segmentation and classification
La as the loss function of the HoVer branch and binary in both the CoNSeP and CRCHisto datasets. We can see that
cross entropy (BCE) loss Lc as the loss function for the NP if the classification takes place at the output of NP branch,
branch. We refer to this combination as the standard strategy then the network’s ability to determine the nuclear type is
because MSE and BCE are the two most commonly used compromised. This is because the task of nuclear classification
loss functions for regression and binary classification tasks is challenging and therefore the network benefits from the
respectively. Next, we introduced the MSE of the horizontal introduction of a branch dedicated to the task of classification.
and vertical gradients Lb to the HoVer branch and the dice
loss Ld to the NP branch. The intuition behind our novel R EFERENCES
Lb is that it enforces the correct structure of the horizontal [1] J. G. Elmore, G. M. Longton, P. A. Carney, B. M. Geller, T. Onega,
and vertical map predictions and therefore helps to correctly A. N. Tosteson, H. D. Nelson, M. S. Pepe, K. H. Allison, S. J. Schnitt
et al., “Diagnostic concordance among pathologists interpreting breast
separate neighbouring instances. The dice loss was introduced biopsy specimens,” Jama, vol. 313, no. 11, pp. 1122–1132, 2015.
because it can help the network to better distinguish between [2] A. Madabhushi and G. Lee, “Image analysis and machine learning
background and nuclear pixels and is particularly useful when in digital pathology: Challenges and opportunities,” Medical Image
Analysis, vol. 33, pp. 170 – 175, 2016, 20th anniversary of
there is a class-imbalance. We present the results in Table A1, the Medical Image Analysis journal (MedIA). [Online]. Available:
where we observe an increase in all performance measures for http://www.sciencedirect.com/science/article/pii/S1361841516301141
our proposed multi-term loss strategy. Therefore, the additional [3] N. Alsubaie, K. Sirinukunwattana, S. E. A. Raza, D. Snead, and
N. Rajpoot, “A bottom-up approach for tumour differentiation in whole
loss terms boost the network’s ability to differentiate between slide images of lung adenocarcinoma,” in Medical Imaging 2018: Digital
nuclear and background pixels (DICE) and separate individual Pathology, vol. 10581. International Society for Optics and Photonics,
nuclei (DQ and PQ). In particular, there is a significant boost in 2018, p. 105810E.
[4] C. Lu, D. Romo-Bucheli, X. Wang, A. Janowczyk, S. Ganesan,
the SQ for both Kumar and CoNSeP, which suggests that our H. Gilmore, D. Rimm, and A. Madabhushi, “Nuclear shape and orien-
proposed loss function Lb is necessary to precisely determine tation features from h&e images predict survival in early-stage estrogen
where nuclei should be split. receptor-positive breast cancers,” Laboratory Investigation, vol. 98,
no. 11, p. 1438, 2018.
Post Processing: Usually, markers obtained from applying [5] K. Sirinukunwattana, D. Snead, D. Epstein, Z. Aftab, I. Mujeeb, Y. W.
a threshold to an energy landscape (such as the distance map) Tsang, I. Cree, and N. Rajpoot, “Novel digital signatures of tissue phe-
is enough to provide a competitive input for watershed, as notypes for predicting distant metastasis in colorectal cancer,” Scientific
reports, vol. 8, no. 1, p. 13692, 2018.
seen by DIST in Table III. Although HoVer-Net is not directly [6] S. Javed, M. M. Fraz, D. Epstein, D. Snead, and N. M. Rajpoot, “Cellular
built upon an energy landscape, we devised a Sobel-based community detection for tissue phenotyping in histology images,” in
method to derive both the energy landscape and the markers. Computational Pathology and Ophthalmic Medical Image Analysis.
Springer, 2018, pp. 120–129.
To compare with other methods, we implemented two further [7] G. Corredor, X. Wang, Y. Zhou, C. Lu, P. Fu, K. Syrigos, D. L. Rimm,
techniques for obtaining the energy landscape and the markers. M. Yang, E. Romero, K. A. Schalper et al., “Spatial architecture and
We then exhaustively compared all energy landscape and arrangement of tumor-infiltrating lymphocytes for predicting likelihood
of recurrence in early-stage non–small cell lung cancer,” Clinical Cancer
marker combinations to assess which post processing strategy Research, vol. 25, no. 5, pp. 1526–1534, 2019.
is the best. We start by linking HoVer to the distance map [8] H. Sharma, N. Zerbe, D. Heim, S. Wienert, H.-M. Behrens, O. Hellwich,
by calculating the square sum χ2 + ϕ2 , which can be seen and P. Hufnagl, “A multi-resolution approach for combining visual infor-
mation using nuclei segmentation and classification in histopathological
as the distance from a pixel to its nearest nuclear centroid. images.” in VISAPP (3), 2015, pp. 37–46.
In other words, this is a pseudo distance map. Additionally, [9] P. Wang, X. Hu, Y. Li, Q. Liu, and X. Zhu, “Automatic cell nuclei
χ and ϕ values can be interpreted as Cartesian coordinates segmentation and classification of breast cancer histopathology images,”
Signal Processing, vol. 122, pp. 1–13, 2016.
with each nuclear centroid as the origin. By thresholding the [10] X. Yang, H. Li, and X. Zhou, “Nuclei segmentation using marker-
values between a certain range, we can obtain the markers. controlled watershed, tracking using mean-shift, and kalman filter in
The results of all combinations are shown in Table A2. Note, time-lapse microscopy,” IEEE Transactions on Circuits and Systems I:
Regular Papers, vol. 53, no. 11, pp. 2405–2414, 2006.
our gradient-based post processing technique is specifically [11] J. Cheng, J. C. Rajapakse et al., “Segmentation of clustered nuclei with
designed for the HoVer branch output. shape markers and marking function,” IEEE Transactions on Biomedical
Classification Branch: In order to assess the importance of a devoted branch for concurrent nuclear segmentation and classification, we compared the proposed three-branch setup of HoVer-Net to a two-branch setup. Here, the two-branch setup extends the NP branch to a multi-class setting by predicting each nuclear type at the output. Then, to obtain the
TABLE A1: Ablation study highlighting the contribution of the proposed loss strategy.

                            Kumar                              CoNSeP
Strategy        DICE   AJI    DQ     SQ     PQ       DICE   AJI    DQ     SQ     PQ
Standard Loss   0.823  0.608  0.750  0.771  0.581    0.846  0.557  0.685  0.774  0.532
Proposed Loss   0.826  0.618  0.770  0.773  0.597    0.853  0.571  0.702  0.778  0.547
TABLE A2: Ablation study of post-processing techniques: Sobel-based versus thresholding to obtain the markers, and Sobel-based versus naive conversion to obtain the energy landscape.

                              Kumar                              CoNSeP
Energy    Markers    DICE   AJI    DQ     SQ     PQ      DICE   AJI    DQ     SQ     PQ
χ² + ϕ²   Threshold  0.825  0.597  0.705  0.764  0.541   0.850  0.543  0.602  0.761  0.459
χ² + ϕ²   Sobel      0.826  0.613  0.766  0.768  0.591   0.853  0.561  0.694  0.770  0.535
Sobel     Threshold  0.825  0.614  0.715  0.772  0.554   0.850  0.566  0.617  0.775  0.479
Sobel     Sobel      0.826  0.618  0.770  0.773  0.597   0.853  0.571  0.702  0.778  0.547
TABLE A3: Ablation study showing the contribution of the classification branch in HoVer-Net on the CoNSeP dataset. Fd denotes the F1 score for nuclear detection, whereas Fc^e, Fc^i, Fc^s and Fc^m denote the F1 classification scores for the epithelial, inflammatory, spindle-shaped and miscellaneous classes respectively.

                              CoNSeP                                  CRCHisto
Branches           PQ     Fd     Fc^e   Fc^i   Fc^s   Fc^m     Fd     Fc^e   Fc^i   Fc^s   Fc^m
NP & HoVer         0.499  0.736  0.636  0.545  0.528  0.333    0.666  0.458  0.523  0.271  0.132
NP & HoVer & NC    0.516  0.748  0.635  0.631  0.566  0.426    0.688  0.486  0.573  0.302  0.178
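The F1 scores in Table A3 reduce to the standard formula over matched counts: a detected nucleus matched to a ground-truth nucleus counts as a true positive, an unmatched detection as a false positive, and an unmatched ground-truth nucleus as a false negative; the per-class scores apply the same formula restricted to one nuclear type. A minimal sketch (the matching step itself is assumed done upstream):

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```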