
This article has been accepted for publication in IEEE Transactions on Medical Imaging but has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TMI.2021.3059699.

A Cervical Histopathology Dataset for Computer Aided Diagnosis of Precancerous Lesions

Zhu Meng, Zhicheng Zhao, Bingyang Li, Fei Su, Limei Guo
This work is supported by grants from the National Natural Science Foundation of China (NSFC, U1931202, 62076033), the Beijing Municipal Science and Technology Commission (Z201100007520001, Z131100004013036) and the BUPT Excellent Ph.D. Students Foundation (CX2019217). (Zhu Meng and Zhicheng Zhao contributed equally to this work.) (Corresponding author: Zhicheng Zhao.) Z. Meng and B. Li are with the Beijing University of Posts and Telecommunications, Beijing, China (e-mail: [email protected]; [email protected]). Z. Zhao and F. Su are with the Beijing University of Posts and Telecommunications, and also with the Beijing Key Laboratory of Network System and Network Culture, Beijing, China (e-mail: [email protected]; [email protected]). L. Guo is with the Department of Pathology, School of Basic Medical Sciences, Third Hospital, Peking University Health Science Center, Beijing, China (e-mail: [email protected]).

Abstract— Cervical cancer, one of the most frequently diagnosed cancers worldwide, is curable when detected early. Histopathology images play an important role in the precision medicine of cervical lesions. However, few computer aided algorithms have been explored on cervical histopathology images due to the lack of public datasets. In this paper, we release a new cervical histopathology image dataset for automated precancerous diagnosis. Specifically, 100 slides from 71 patients are annotated by three independent pathologists. To show the difficulty of the task, benchmarks are obtained through both fully and weakly supervised learning. Extensive experiments based on typical classification and semantic segmentation networks are carried out to provide strong baselines. In particular, a strategy of assembling classification, segmentation, and pseudo-labeling is proposed to further improve the performance. The Dice coefficient reaches 0.7833, indicating the feasibility of computer aided diagnosis and the effectiveness of our weakly supervised ensemble algorithm. The dataset and evaluation code are publicly available. To the best of our knowledge, this is the first public cervical histopathology dataset for automated precancerous segmentation. We believe that this work will attract researchers to explore novel algorithms for automated cervical diagnosis, thereby assisting doctors and patients clinically.

Index Terms— Cervical histopathology, classification, dataset, segmentation, weakly supervised learning.

I. INTRODUCTION

Cervical cancer is one of the leading causes of cancer death in women aged 20 to 39 years, with 10 premature deaths per week [1]. Cervical lesions are observed to form a continuous disease spectrum from mild dysplasia to cervical cancer [2]. Fortunately, cervical precancerous lesions can be identified and treated clinically to reduce the risk of developing invasive cancer. Routinely, once a cervical lesion is detected through Pap smear, human papillomavirus (HPV), and colposcopy examinations, histopathological screening, considered the gold standard, is adopted to finalize subsequent treatments. Thus, the accuracy and efficiency of cervical histopathological screening are of vital importance. However, the giga-pixel scale of whole slide images (WSIs) places high demands on the professionalism and concentration of pathologists. Therefore, computer aided diagnosis (CAD) is in urgent demand.

Deep learning has shown great potential in CAD since the emergence of histopathology datasets such as CAMELYON16 [3] and BACH [4] for breast cancer, DigestPath [5] for colon cancer, and PAIP [6] for liver cancer. However, a recent study of CAD for pan-cancer shows that the correlation between algorithm predictions and pathologist-estimated tumor purity is the lowest for cervical cancer among 42 tissue types [7], which highlights how the cervix differs from other tissues and the necessity of studying it specifically. Nevertheless, to the best of our knowledge, there is no specially designed public histopathology dataset for CAD of cervical precancerous lesions. The scarcity of public data further hinders the development of related algorithms. Therefore, we release a new public dataset called MTCHI to help researchers without a medical background develop and compare automated algorithms. The MTCHI dataset contains 100 cervical WSIs at 10× magnification. Specifically, 20 WSIs containing 101 regions of interest (RoIs) are provided with pixel-level annotations, and the remaining 80 have image-level annotations. Considering diagnostic subjectivity and experience, the data are annotated into four categories (i.e., normal, CIN 1, CIN 2, and CIN 3) by three independent pathologists according to the severity of the cervical lesions, as described in [8].

Automated precancerous diagnosis of cervical histopathology may encounter multiple challenges. First, the acquisition location and incision direction of the biopsied tissues determine the appearance of the cervical basement membrane in the histopathology images, leading to uncertain spatial morphology and placing high demands on the ability of algorithms to handle diverse data. Second, cervical carcinogenesis develops gradually from mild lesion to cancer, and lesion grading is subjective without precise quantification criteria, which introduces considerable annotation noise. In addition, compared with tissues such as breast and colon, cervical tissues usually form strips with small areas, resulting in fitting difficulty for deep models.
Furthermore, pixel-level annotated data are scarce, which hinders the generalization performance of fully supervised algorithms. In this paper, extensive experiments are conducted based on the analysis of image patches cropped from the WSIs, and strong baselines are provided for future algorithm comparison on the MTCHI dataset. Specifically, both fully and weakly supervised learning are considered, and the ensemble of classification, segmentation, and pseudo-labeling achieves the best accuracy.

The remainder of this paper is organized as follows: Section II introduces related datasets and deep learning algorithms. Section III describes our dataset construction and evaluation metrics. Section IV discusses the evaluated methods, including fully supervised and weakly supervised approaches. Section V presents the experiments and discussion. Section VI provides the conclusions.

II. RELATED WORK

A. Previous Datasets

1) Previous Cervical Datasets: CAD on cervical Pap smear images has attracted much attention because of public datasets. For example, the Herlev dataset [9] focuses on the segmentation and classification of the nucleus and cytoplasm of a single cell in Pap smear images. In addition, the ISBI 2014 [10] and ISBI 2015 [11] datasets are designed to extract the boundaries of individual cytoplasm and nuclei from real and synthetic overlapping cervical cytology images. These datasets concentrate on cytological features, including the segmentation of nucleus and cytoplasm, the overlap and separation between cells, and the lesion grade of each cell. In contrast, cervical histopathology images contain rich information about tissue structure, concerning both histology and cytology. Therefore, a public cervical histopathology image dataset is urgently needed.

2) Previous Histopathology Datasets: CAMELYON16 is a large histopathology dataset for the detection of cancer metastasis in sentinel lymph nodes of breast cancer patients. It contains 400 giga-pixel WSIs and has attracted numerous researchers to the CAD of histopathology images. PatchCamelyon [12] extracts small patches (96px × 96px) from CAMELYON16 to pose a simple and direct binary metastasis classification task. BreaKHis [13] contains 7,909 breast cancer histopathology images (700px × 460px) acquired from 82 patients for benign and malignant tumor classification. BreastPathQ [14] scores cancer cellularity for tumor burden assessment in 2,579 breast pathology patches (512px × 512px) at 20× magnification. BACH attracts many algorithms by promoting a detailed microscopy breast image classification (normal, benign, in situ carcinoma, and invasive carcinoma) with patches and WSIs at 20× magnification. DigestPath evaluates algorithms through signet ring cell detection on 155 patients and 872 colonoscopy tissue screening slices (3,000px × 3,000px on average) of gastric mucosa and intestine. PAIP contains 100 liver WSIs at multiple magnifications to detect and segment areas of carcinogenic cells and calculate the tumor burden. In the aforementioned datasets, cells of different grades in the breast, intestine, and liver possess distinct morphological differences. The exploration of pan-cancer has shown that algorithms that perform well on other cancers encounter obstacles in cervical lesions [7]. Therefore, a public cervical histopathology image dataset is required to specifically explore the CAD of cervical precancerous lesions.

B. Related Methods

1) CAD for Cervical Histopathology Images: The CAD of cervical precancerous histopathology images strongly depends on the extraction of structural features, which is extremely challenging. Thus, some algorithms attempted to start with simple samples. For example, the algorithms in [15] [16] [17] ingeniously selected and divided simple samples into upper, middle, and lower layers parallel to the basement membrane. Each layer was further cut into several small patches to extract features such as color, texture, cell distribution, and deep semantic information. These features were finally fused to determine the lesion grade of the whole tissue. However, accurately cutting a sample into three layers is complicated in practical applications. Wang et al. [18] achieved basement membrane segmentation via a generative adversarial network [19], but the basement membrane is occasionally invisible because of incomplete tissue. All of the abovementioned experiments were conducted on private datasets.

2) Fully Supervised Learning for Histopathology Images: Convolutional neural networks (CNNs) have achieved remarkable results in natural image processing; accordingly, transfer learning for the automatic processing of histopathology images has become increasingly popular. The WSIs are often first cropped into small patches with a sliding window. The patches are then classified or segmented by CNNs, and the results are stitched into a diagnostic map. For example, Fu et al. [7] performed a pan-cancer computational histopathology analysis with Inception-v4 [20]. The lightweight ShuffleNet [21] was used in [22] to identify microsatellite instability and mismatch-repair deficiency in colorectal tumors. ResNet-34 [23], VGG-16 [24], and Inception-v4 were adopted in [25] to detect invasive breast cancer. Wang et al. [26] utilized GoogLeNet [27] with hard example guided training to locate tumors in breast and colon images. HookNet [28], a semantic segmentation network derived from U-Net [29], was designed to aggregate patch features at multiple resolutions for breast and lung cancer. In particular, many outstanding algorithms have been proposed since the establishment of the CAMELYON16 dataset. Liu et al. [30] achieved high accuracy with Inception-v3 [31]. Lin et al. [32] proposed ScanNet, based on a modified VGG-16 in which the last three fully connected layers are replaced with fully convolutional layers to avoid the boundary effect of network predictions. Takahama et al. [33] extracted patch features with a classification model (GoogLeNet) and then fed them into a segmentation model to obtain probability heatmaps. Guo et al. [34] proposed a similar method, but only applied the segmentation stage to the tumor regions detected by the classification model. Khened et al.
[35] assembled U-Net and DeepLab v3+ [36] together to improve diagnosis performance.

3) Weakly Supervised Learning for Histopathology Images: Recently, pseudo-labeling [37] has been recognized as an effective way to utilize unlabeled histopathology data. Routinely, a model is first trained on the limited labeled data, and the unlabeled patches are then predicted by the model. If the confidence probability of a patch exceeds a threshold, the patch is added to the training set, and the predicted category is regarded as its positive pseudo-label. Tokunaga et al. [38] implemented weakly supervised learning by generating both positive and negative pseudo-labels according to the proportions of adenocarcinoma subtypes. Li et al. [39] generated self-loop uncertainty as a pseudo-label to augment the training set and boost organ segmentation accuracy. Cheng et al. [40] assembled the predictions of similar patches as the pseudo-label of a given patch to counteract its noisy label. Chaw et al. [41] first trained a teacher model with labeled data and generated pseudo-labels for the unlabeled data to train a student model, then fine-tuned the student model on the original labeled dataset. The teacher-student loop was applied iteratively to reduce the annotation burden of colorectal tissue samples.

III. THE MTCHI DATASET

A. Dataset Construction

The data in the MTCHI dataset are provided by Singularity.AI Technology. 400 cervical histopathology slides were collected in China and scanned with a pixel size of 0.226µm at 40× magnification in 2018. Then, 100 characteristic slides from 71 patients who underwent cervical biopsy were carefully selected for research purposes only. The MTCHI dataset contains only digital images without any patient-related personal information. All the experiments were performed under the approval of the ethical guidance of the hospitals.

The selected data are all hematoxylin and eosin stained slides containing normal or precancerous cervical tissues. Considering that pathologists usually diagnose cervical lesions at 10× magnification or lower resolution, the scanned images are down-sampled by a factor of 4 and stored in 3-channel RGB (Red-Green-Blue) PNG (Portable Network Graphics) format for exploration. The data are divided into two subgroups. One subgroup contains 20 slides for fully supervised learning; the other contains 80 slides for weakly supervised learning. Specifically, considering the complexity of tissue structure and cervical lesions, 101 RoIs without background from the 20 slides are pixel-level annotated, which is suitable for exploring fully supervised classification and segmentation algorithms. These 20 slides are subdivided into a training set and a test set in the ratio of 15:5. To be fair, the patients of the training set and the test set are independent of each other. The additional 80 slides are obtained from 53 different patients and are annotated with image-level multi-labels. Image-level annotations are easier to obtain than pixel-level annotations; thus, such coarse-grained information is valuable for clinical applications to improve the performance of the algorithms. Because the resolution of histopathology slides exceeds the load capacity of the graphics processing unit (GPU), slides are often cropped into small patches for processing. However, image-level annotations only account for partial regions of a slide; thus, the cropped patches cannot be matched to the labels accurately and contain a lot of noise, which requires weakly supervised algorithms. The data and annotations of the MTCHI dataset are publicly available at https://mcprl.com/html/dataset/MTCHI.html.

Fig. 1. Characteristic image patches of normal cervical tissue, and lesions of CIN 1, CIN 2, and CIN 3 in the MTCHI dataset.

Fig. 2. Data with pixel-level annotations are obtained by cropping the RoIs from the slides. The pixels outside the regions are regarded as background. (a) and (b) are examples of data acquisition. The data are annotated pixel by pixel according to the description in (c).

B. Reference Annotation Protocol

Cervical tissues in the MTCHI dataset are graded into normal or cervical intraepithelial neoplasia (CIN) as described in [8]. According to the dysplastic degree of cervical precancerous lesions, CIN is further divided into three grades, i.e., CIN 1, CIN 2, and CIN 3. Fig. 1 shows representative features of normal tissue (a), mild dysplasia (CIN 1) (b), moderate dysplasia (CIN 2) (c), and severe dysplasia (CIN 3) (d). Due to the subjectivity of CIN diagnosis, the data in the MTCHI dataset are annotated by three independent experienced pathologists. Annotator A has 5 years of diagnostic experience, while Annotator B and Annotator C are experts with more than 10 years of diagnostic experience.

1) Pixel-level Annotation: The 101 key regions cropped from the 20 slides are annotated pixel by pixel. Since pixel-level annotation plays an important role in the performance of both fully and weakly supervised algorithms, these data are directly annotated by Annotator B and checked by Annotator C. As shown in Fig. 2, the RoIs are first cropped from the slides, and each pixel is then annotated with a label i (i ∈ {0, 1, 2, 3, 4}), where 0, 1, 2, 3, and 4 represent background, normal, CIN 1, CIN 2, and CIN 3, respectively. The masks containing the pixel-level annotations are stored with the same length and width as the images, i.e., as grayscale masks in PNG format.
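For clarity, the annotation convention can be exercised with a short script. The following is a minimal sketch, not part of the released evaluation code; the file names are hypothetical, and only the 0-4 label convention and the grayscale-PNG mask format come from the dataset description:

```python
import numpy as np
from PIL import Image

# Label convention from the MTCHI annotation protocol:
# 0 = background, 1 = normal, 2 = CIN 1, 3 = CIN 2, 4 = CIN 3
CLASS_NAMES = ["background", "normal", "CIN 1", "CIN 2", "CIN 3"]

def load_pair(image_path: str, mask_path: str):
    """Load an RGB patch and its grayscale PNG mask of identical size."""
    image = np.array(Image.open(image_path).convert("RGB"))  # (H, W, 3)
    mask = np.array(Image.open(mask_path))                   # (H, W), values in {0..4}
    assert image.shape[:2] == mask.shape, "mask must match image size"
    return image, mask

# Hypothetical file names, for illustration only.
image, mask = load_pair("roi_001.png", "roi_001_mask.png")
for label, name in enumerate(CLASS_NAMES):
    print(name, float((mask == label).mean()))  # per-class pixel proportion
```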


2) Image-level Annotation: The 80 slides from 53 patients with image-level annotations are independently annotated by Annotator A and Annotator C. Each slide is provided with multiple labels. For example, a slide is annotated as "CIN 1 and CIN 2" if both lesions can be found in the slide with obvious corresponding regions. Therefore, a slide has at least one label and up to four labels (i.e., normal, CIN 1, CIN 2, and CIN 3). However, cervical tissue is diverse in morphology, and the diagnoses are largely experiential. Annotator A and Annotator C work independently; the annotation differences between the two annotators are contrasted in Fig. 3. Only 51.25% of the slides are identified consistently by the two annotators, indicating the difficulty of cervical precancerous diagnosis. Of the two diagnostic results, the information from the more senior Annotator C is recommended. The multi-labels indicate that a slide contains the relevant regions but do not specify the corresponding locations. When high-resolution slides are cut into small patches for processing, the label of each patch is therefore unknown. Consequently, the 80 slides are valuable material for unsupervised and weakly supervised algorithms.

Fig. 3. Contrast of image-level annotations from Annotator A and Annotator C.

C. Evaluation Metrics

The performance of the algorithms is evaluated against the pixel-level ground truth. Four evaluation metrics are applied to compare the automated diagnostic results with the ground truth and fairly measure the performance of the models. First, the Dice coefficient, a commonly used evaluation metric in medical image segmentation tasks, is used to measure the degree of coincidence between the prediction and the truth. The Dice coefficient is defined as

$$\mathrm{Dice} = \frac{1}{4}\sum_{i=1}^{4}\frac{2\,|P_i \cap T_i|}{|P_i| + |T_i|}, \qquad (1)$$

where $P_i$ denotes the regions predicted to be category $i$ ($i = 1, 2, 3, 4$ denotes normal, CIN 1, CIN 2, and CIN 3, respectively) and $T_i$ denotes the ground truth. Second, mean Intersection over Union (mIoU), a commonly used evaluation metric in natural image segmentation, is also applied in this task. The definition of mIoU is slightly different from the Dice coefficient:

$$\mathrm{mIoU} = \frac{1}{4}\sum_{i=1}^{4}\frac{|P_i \cap T_i|}{|P_i \cup T_i|}. \qquad (2)$$

Third, average precision (AP), which is used to calculate the average accuracy of pixel classification, can be expressed as

$$\mathrm{AP} = \frac{1}{N}\sum_{j=1}^{N} x_j, \qquad x_j = \begin{cases} 1 & (y_j = t_j) \\ 0 & (y_j \neq t_j) \end{cases}, \qquad (3)$$

where $N$ is the number of pixels, $y_j$ denotes the predicted category, and $t_j$ denotes the ground truth. Finally, different pathologists may annotate difficult samples as adjacent categories. Thus, window precision (WP), which is proposed to ignore the deviation of adjacent false predictions, is defined as

$$\mathrm{WP} = \frac{1}{N}\sum_{j=1}^{N} x_j, \qquad x_j = \begin{cases} 1 & (\lVert y_j - t_j\rVert \le 1) \\ 0 & (\lVert y_j - t_j\rVert > 1) \end{cases}. \qquad (4)$$

Note that forcing the category of an entire region by counting the categories of most of its pixels is not recommended, because many complex regions have been removed in advance from the test images. Although this tricky approach yields better results, it is not an optimal means for future exploration. In addition, the background is ignored in the evaluation.
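A minimal NumPy sketch of these four metrics follows; it is our own illustrative implementation of Eqs. (1)-(4), not the official evaluation code, and it assumes integer label maps with 0 as the ignored background:

```python
import numpy as np

def mtchi_metrics(pred: np.ndarray, truth: np.ndarray):
    """Dice, mIoU, AP, and WP over classes 1..4; background (0) is ignored."""
    dice, iou = [], []
    for i in range(1, 5):
        p, t = pred == i, truth == i
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        dice.append(2.0 * inter / max(p.sum() + t.sum(), 1))  # Eq. (1), per class
        iou.append(inter / max(union, 1))                     # Eq. (2), per class
    fg = truth > 0                                   # pixels included in the evaluation
    y, t = pred[fg].astype(int), truth[fg].astype(int)
    ap = float((y == t).mean()) if y.size else 0.0              # Eq. (3)
    wp = float((np.abs(y - t) <= 1).mean()) if y.size else 0.0  # Eq. (4)
    return float(np.mean(dice)), float(np.mean(iou)), ap, wp

# Toy 2x3 example (labels: 0 background, 1 normal, 2-4 CIN 1-3).
truth = np.array([[1, 2, 2], [3, 4, 0]])
pred = np.array([[1, 2, 3], [3, 3, 0]])
print(mtchi_metrics(pred, truth))
```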
IV. EVALUATED METHODS

Considering that the high resolution of WSIs exceeds the load capacity of GPUs, experiments on the MTCHI dataset are carried out by image patch analysis. The slides from the pixel-annotated training set are first cropped into small patches (400px × 400px) with an overlapping stride of 100px. Patches with foreground proportions of less than 20% are then discarded before training. The remaining 7,724 patches are used for training the fully supervised classification and segmentation models, as well as for extracting pseudo-labeled patches from the image-annotated slides for weakly supervised learning. The networks used for classification and segmentation are described in Section IV-A and Section IV-B. The strategy of assembling classification and segmentation is introduced in Section IV-C. The weakly supervised learning strategy for MTCHI is discussed in Section IV-D. Post-processing is presented in Section IV-E.
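The cropping rule can be sketched as follows. This is a simplified illustration: the 400px window, 100px stride, and 20% threshold come from the text above, while testing foreground via the annotation mask is our assumption:

```python
import numpy as np

PATCH, STRIDE, MIN_FG = 400, 100, 0.2

def crop_patches(image: np.ndarray, mask: np.ndarray):
    """Slide a 400x400 window with stride 100; keep patches with >= 20% foreground."""
    patches = []
    h, w = mask.shape
    for y in range(0, h - PATCH + 1, STRIDE):
        for x in range(0, w - PATCH + 1, STRIDE):
            m = mask[y:y + PATCH, x:x + PATCH]
            if (m > 0).mean() >= MIN_FG:           # label 0 is background
                patches.append((image[y:y + PATCH, x:x + PATCH], m))
    return patches
```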


A. Fully Supervised Classification

Popular classification networks are adopted in the fully supervised experiments, including MobileNet-v2 [42], VGG, GoogLeNet, Inception-v3, DenseNet [43], and ResNet. All classification networks are initialized with parameters pre-trained on ImageNet [44]. For a pixel-annotated patch, the classification truth is the category with the largest area except background, i.e., one of normal, CIN 1, CIN 2, and CIN 3. The learning rate is initialized to 0.001 and decreased with the cosine annealing strategy. The results of the 30th epoch are stored for comparison when cross-entropy loss (CE-loss) is used to constrain model fitting, while those of the 50th epoch are stored when adaptive elastic loss (AE-loss) [45] is used. The outputs of a classification network are the confidence probabilities of the four categories, and the category with the largest confidence probability is regarded as the prediction result. Each patch corresponds to a single diagnostic result. Therefore, it is necessary to fill the whole patch with this result to obtain a diagnosis heatmap of the same size as the input patch. The structures of the classification networks are described below.

1) MobileNet-v2: MobileNet-v2 contains very few network parameters yet accomplishes approximately the same function as traditional convolutions, thereby accelerating the feedforward process. Specifically, depthwise separable convolutions are embedded instead of regular convolutions. An expansion layer is placed before the depthwise convolution to expand the feature channels, and a projection layer is placed after it to reduce the dimensions. The expansion, depthwise, and projection layers form a bottleneck residual block. Multiple blocks are stacked to extract patch features.

2) VGG: Convolutional layers with the same kernel size of 3 × 3 are stacked in the VGG network. The number of feature-map channels is increased step by step through the convolutional layers. The feature-map size is down-sampled five times through five max pooling layers with a stride of 2. Three fully connected layers are placed at the end of the VGG network to obtain the classification results. In addition to the 5 max pooling layers and the 3 fully connected layers, VGG-16 contains 13 convolutional layers and VGG-19 contains 16 convolutional layers. Due to the large number of parameters, VGG has a good ability to extract features but is time-consuming.

3) GoogLeNet and Inception-v3: GoogLeNet, also known as Inception-v1, contains fewer parameters than VGG. It is composed of multiple Inception modules. An Inception module aggregates convolutional layers (1 × 1, 3 × 3, 5 × 5) and pooling operations (3 × 3) to deal with different scales. Dimension reductions and projections are judiciously applied before the convolutions with large kernel sizes. In Inception-v3, the Inception modules are improved through factorization into small convolutions.

4) DenseNet: DenseNet-121 and DenseNet-169 are stacked from dense blocks. In a dense block, each layer takes all preceding feature-maps as input. Each layer can access the gradient directly from the loss and the original input image, enabling implicit deep supervision.

5) ResNet: Different from VGG-19, ResNet replaces the max pooling layers, except the first one, with convolutions of stride 2. For ResNet-18 and ResNet-34, two 3 × 3 convolutional layers constitute a residual block. The input of the residual block is added to the output of the convolutional layers to avoid gradient vanishing during training. In the deeper networks (e.g., ResNet-50 and ResNet-101), the residual block is composed of three convolutional layers with kernel sizes of 1 × 1, 3 × 3, and 1 × 1. Multiple residual blocks are stacked to extract features.
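As a concrete reference, the training configuration described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' released code: the largest-area label rule, the 0.001 cosine-annealed learning rate, and the 30 epochs for CE-loss come from the text, while the optimizer, momentum, and the `train_loader` handle are standard assumptions:

```python
import torch
import torch.nn as nn
import torchvision

def majority_label(mask: torch.Tensor) -> int:
    """Patch-level truth: the non-background class with the largest area."""
    counts = torch.bincount(mask.flatten().long(), minlength=5)[1:]  # skip background 0
    return int(counts.argmax())  # 0..3 <-> normal, CIN 1, CIN 2, CIN 3

model = torchvision.models.resnet101(pretrained=True)  # ImageNet initialization
model.fc = nn.Linear(model.fc.in_features, 4)          # four lesion categories
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
criterion = nn.CrossEntropyLoss()                      # the CE-loss baseline

# `train_loader` is an assumed loader yielding (patch, mask) batches
# from the 7,724 training patches.
for epoch in range(30):                                # 30 epochs for CE-loss
    for patch, mask in train_loader:
        target = torch.tensor([majority_label(m) for m in mask])
        loss = criterion(model(patch), target)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```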
B. Fully Supervised Segmentation

When there are multiple categories in one patch, it is inaccurate to take the category with the most pixels as the patch label, so pixel-wise classification (i.e., semantic segmentation) is required. The semantic segmentation networks used for the experiments in this paper include FCN [46], SegNet [47], DeepLab v3+, U-Net and its variants such as ENS-UNet [48], Res-UNet [49], UNET3+ [50], and HookNet. The learning rate is initially set to 0.02 and decreased by a factor of 0.5 after every 10 epochs. The networks are all trained with CE-loss. The output of a semantic segmentation network has the same length and width as the input image. Semantic segmentation can be regarded as pixel-level classification, and thus each pixel has a corresponding prediction result. The segmentation network structures are introduced below.

1) FCN: The fully connected layers of VGG are replaced by convolutional layers, constituting a fully convolutional network (FCN). During feature extraction by VGG, the feature-map size is reduced from x to 1/32x. In FCN32s, the feature-map of size 1/32x is returned to the original size through a deconvolution layer to obtain segmentation results. FCN16s first up-samples the feature-map of size 1/32x, then sums it with the feature-map of size 1/16x, and finally recovers the input size with a deconvolution layer.

2) SegNet: SegNet employs an encoder-decoder structure to implement semantic segmentation. Here, VGG-16 serves as the encoder to extract features. The decoder consists of convolutional layers and up-sampling operations. During max pooling in the encoder, the pooling locations are stored as indexes for the up-sampling in the decoder.

3) U-Net: U-Net was originally designed for cell segmentation, and its output size is smaller than its input size. To obtain a diagnostic heatmap with the same size as the input patch, all convolutional layers in U-Net are modified with zero padding. The encoder extracts high-dimensional semantic features through max pooling and convolutional layers, while the decoder gradually restores the feature size through convolutional and deconvolutional layers. Four sets of features at different scales in the encoder and decoder are cascaded by skip connections to improve the positional accuracy of the segmentation results.

4) U-Net Variants (ENS-UNet, Res-UNet, UNET3+): ENS-UNet inserts a noise suppression block into every skip connection path of U-Net. Res-UNet replaces all convolutional layers and skip connections in U-Net with residual blocks. In the decoder of UNET3+, features are densely concatenated with all shallow features from the encoder.

5) HookNet: A U-Net-like structure without skip connections is treated as one branch of HookNet. The structures of the two branches in HookNet are exactly the same, but the input patches have different fields of view, as shown in the sketch after this subsection. Specifically, a 400px × 400px patch is resized to 284px × 284px as input to the first, context branch. The same patch is center-cropped to 284px × 284px as input to the target branch. The second layer of the decoder in the context branch is center-cropped and cascaded to the first layer of the decoder in the target branch. Since the channel number in HookNet-x is elevated gradually, the output channel number x of the first convolutional layer determines the parameter quantity. Experiments based on HookNet-16 and HookNet-64 are conducted in this paper.
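The two fields of view described for HookNet can be prepared as below; this is a small sketch of the stated resize/center-crop rule using torchvision, and everything beyond the 400px→284px sizes is an assumption:

```python
import torch
from torchvision.transforms import functional as TF

def hooknet_inputs(patch: torch.Tensor):
    """patch: (3, 400, 400) tensor -> (context, target), both (3, 284, 284).

    The context branch sees the whole 400px field of view shrunk to 284px;
    the target branch sees only the central 284px region at full resolution.
    """
    context = TF.resize(patch, [284, 284])      # wide field of view, lower resolution
    target = TF.center_crop(patch, [284, 284])  # narrow field of view, full resolution
    return context, target

context, target = hooknet_inputs(torch.rand(3, 400, 400))
print(context.shape, target.shape)  # torch.Size([3, 284, 284]) twice
```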


6) DeepLab v3+: DeepLab v3+ adopts ResNet as the encoder to extract features. The high-dimensional features from the end of the encoder are fed into an Atrous Spatial Pyramid Pooling (ASPP) module. The ASPP module consists of a global pooling, a 1 × 1 convolution, and three 3 × 3 atrous convolutions with rates of 6, 12, and 18, thereby aggregating features from multiple fields of view. The outputs of the ASPP are first up-sampled through bilinear interpolation and then cascaded with the features from the first layer of the encoder. The resulting feature-map, which combines location and semantic information, is up-sampled by a factor of four to obtain the final semantic segmentation results.

C. Fully Supervised Ensemble

On the one hand, a block effect exists in the classification results because all pixels in a patch are predicted as the same class. On the other hand, pepper noise in the segmentation outputs inevitably undermines the consistency of the results. Therefore, we combine classification and segmentation to make them complementary. To keep the number of parameters as small as possible, classification and segmentation are not implemented as two separate models. Instead, they share the same encoder to extract features. For convenience, the segmentation network DeepLab v3+ is combined with classification assisted blocks to realize the ensemble of classification and segmentation. As shown in Fig. 4, an assisted block can be connected after the classification backbone with pretrained parameters, that is, at point A, B, C, or D after ResNet. At points A and D, the assisted block is composed of a global average pooling and a fully connected layer for calculating the classification loss. At points B and C, the assisted block is composed of three 3 × 3 convolutional layers, a global average pooling, and a fully connected layer. The parameters of the assisted block are constrained by the CE-loss, while the preceding shared parameters are jointly optimized by the classification and segmentation loss functions, thereby helping to improve the fitting performance of the segmentation model. To make further use of the information learned by the assisted block, during inference its output is normalized to [0,1] by the sigmoid function and then used to adjust the segmentation output distribution by multiplication. The final output thus balances the advantages of classification and segmentation.

Fig. 4. The ensemble architecture of classification and segmentation is constructed on the basis of DeepLab v3+. The assisted block can be assigned to point A, B, C, or D. The parameters of the assisted block are trained with a classification loss according to patch-level labels. In the inference process, the outputs of the segmentation branch and the assisted block branch are combined by multiplication.
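The inference-time coupling can be illustrated with the following sketch. It is our own simplified rendition of the point-A variant, not the authors' implementation; `deeplab` here stands for any DeepLab v3+ that returns both per-pixel logits and its shared ResNet features:

```python
import torch
import torch.nn as nn

class EnsembleA(nn.Module):
    """DeepLab v3+ with a classification assisted block at point A.

    The assisted block (global average pooling + fully connected layer)
    produces patch-level class scores; at inference, its sigmoid output
    re-weights the per-class segmentation maps to suppress pepper noise.
    """
    def __init__(self, deeplab: nn.Module, backbone_channels: int = 2048,
                 num_classes: int = 4):
        super().__init__()
        self.deeplab = deeplab               # assumed to return (seg_logits, backbone_feat)
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Linear(backbone_channels, num_classes)

    def forward(self, x):
        seg_logits, feat = self.deeplab(x)                    # (B, 4, H, W), (B, C, h, w)
        cls_logits = self.fc(self.pool(feat).flatten(1))      # (B, 4)
        if self.training:
            return seg_logits, cls_logits                     # losses on both branches
        weight = torch.sigmoid(cls_logits)[:, :, None, None]  # (B, 4, 1, 1)
        return seg_logits * weight                            # channel-wise re-weighting
```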
D. Weakly Supervised Learning

Pixel-level annotations are laborious, while image-level annotations are relatively abundant. Therefore, algorithms based on image-level annotations are essential for continuously improving model performance. However, cervical histopathology images are large, and the image-level labels only specify the categories, not their locations, so the labels of the cropped patches are unknown. A pseudo-labeling strategy is therefore adopted in our experiments to take advantage of the weak data. Specifically, the new training set with pseudo-labels is constructed by the process described in Algorithm 1.

Algorithm 1: Construction of the weakly supervised training set with pseudo-labels.
Input: Images with pixel-level annotations Ipixel; images with image-level annotations Iimg and their labels Timg.
Output: Training set S for weakly supervised learning.
Parameters: Classification network Net; maximum limit Np on the number of pseudo-labeled patches; proportion r% of pseudo-labeled patches involved in training.
Step 1: Construct a pure training set S0: crop Ipixel into overlapping patches and discard patches with less than 20% foreground.
Step 2: Train Net with S0 to obtain a feature extraction model M.
Step 3: Calculate the number of times ci that class i (i ∈ [1, 4]) appears in Timg.
Step 4: Select appropriate pseudo-labeled patches from every image Ix ∈ Iimg with multi-labels Tx ∈ Timg:
    for Ix in Iimg do
        1. Crop overlapping patches from Ix;
        2. Predict the confidence probabilities of each patch;
        for i in Tx do
            1. Rank all patches by their confidence probability for category i;
            2. Add the first Np/(4 × ci) patches with the highest probability to the candidate pseudo-label set Sp.
        end
    end
Step 5: Select Np × r% patches:
    foreach category in [normal, CIN 1, CIN 2, CIN 3] do
        1. Rerank the corresponding confidence probabilities of the patches in Sp.
        2. Retain the first Np × r%/4 patches with the highest probability and remove the rest.
    end
Step 6: S = S0 + Sp.
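Steps 3-5 of Algorithm 1 can be sketched compactly as follows; the helper callables `predict_proba` and `crop_patches_fn` and the multi-label encoding are assumptions for illustration:

```python
from collections import defaultdict

def build_pseudo_set(slides, label_sets, predict_proba, crop_patches_fn,
                     n_p=4000, r=0.5):
    """Steps 3-5 of Algorithm 1 (illustrative; helper callables are assumed).

    slides: image-level annotated slides;
    label_sets: per-slide label sets, e.g. {2, 3} for "CIN 1 and CIN 2";
    predict_proba(patch) -> dict {class: confidence} from the trained model M;
    crop_patches_fn(slide) -> list of overlapping patches.
    """
    # Step 3: count how often each class appears among the image-level labels.
    c = defaultdict(int)
    for labels in label_sets:
        for i in labels:
            c[i] += 1
    # Step 4: per slide and per annotated class, keep the most confident patches.
    candidates = []                                  # tuples (confidence, class, patch)
    for slide, labels in zip(slides, label_sets):
        scored = [(predict_proba(p), p) for p in crop_patches_fn(slide)]
        for i in labels:
            ranked = sorted(scored, key=lambda s: s[0][i], reverse=True)
            quota = max(n_p // (4 * c[i]), 1)
            candidates += [(prob[i], i, p) for prob, p in ranked[:quota]]
    # Step 5: rerank per category and keep n_p * r / 4 patches for each.
    kept = []
    for i in range(1, 5):
        per_class = sorted((x for x in candidates if x[1] == i),
                           key=lambda x: x[0], reverse=True)
        kept += per_class[:int(n_p * r / 4)]
    return [(patch, i) for _, i, patch in kept]      # patches with pseudo-labels
```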


For weakly supervised classification, the pseudo-label of a patch is the category with the largest confidence probability from the feature extractor M. For weakly supervised segmentation, the ground truth is a mask filled with the pseudo-label. Assigning the same pseudo-label to an entire patch is coarse, however, and introduces false pixel-level labels. Thus, for the ensemble of classification and segmentation, pseudo-label learning is only applied to the classification branch to balance the information and the noise. Considering that the assisted blocks at points B and C need more parameters, only the block at point A is used for the weakly supervised ensemble. First, the mixed training set S is used to train the ResNet backbone and the assisted block for 30 epochs. Second, the weights of the classification branch are fixed, and the remaining segmentation layers are trained with the original training set S0 for 10 epochs at a learning rate of 0.001. In the inference process, the output of the assisted block is normalized by the sigmoid function and multiplied with the segmentation output. Note that the backbone contains atrous convolutions to aggregate information from multiple receptive fields; that is, the classification branch in the ensemble is not exactly the same as ResNet. Thus, it is retrained instead of directly loading the weights of the weakly supervised ResNet.
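The two-stage schedule reads, in outline, as follows. This is a sketch reusing the `EnsembleA` module above; the loaders, the `backbone` attribute, and the loss objects are assumptions:

```python
import torch

# Stage 1: train the shared backbone and the assisted block on the mixed
# set S = S0 + Sp with the classification (pseudo-label) loss for 30 epochs.
cls_criterion = torch.nn.CrossEntropyLoss()
opt1 = torch.optim.SGD(list(model.deeplab.backbone.parameters()) +  # `backbone` assumed
                       list(model.fc.parameters()), lr=0.001, momentum=0.9)
for epoch in range(30):
    for patch, pseudo_label in mixed_loader:          # assumed loader over S
        _, cls_logits = model(patch)                  # EnsembleA in training mode
        loss = cls_criterion(cls_logits, pseudo_label)
        opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: freeze the classification branch (shared backbone + assisted block),
# then train the remaining segmentation layers on the pixel-annotated S0.
seg_criterion = torch.nn.CrossEntropyLoss(ignore_index=-1)
for p in list(model.fc.parameters()) + list(model.deeplab.backbone.parameters()):
    p.requires_grad = False
opt2 = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                       lr=0.001, momentum=0.9)
for epoch in range(10):
    for patch, mask in pixel_loader:                  # assumed loader over S0
        seg_logits, _ = model(patch)
        # Shift mask labels 1..4 -> 0..3; background 0 -> -1 is ignored.
        loss = seg_criterion(seg_logits, mask.long() - 1)
        opt2.zero_grad(); loss.backward(); opt2.step()
```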

E. Post-processing

The diagnosis results of the patches are stitched into an entire diagnostic map. A simple method is to stitch non-overlapping patches (0% overlapping). However, stitching introduces noise due to insufficient information at the patch edges. Therefore, the patches are cropped with overlapping. Specifically, the network output for a pixel, $\vec{P} = \langle p_1, p_2, p_3, p_4 \rangle$, denotes the confidence probabilities of <normal, CIN 1, CIN 2, CIN 3>. When $N$ overlapping patches cover a pixel, the fused confidence probability is $\bar{P} = \frac{1}{N}\sum_{k=1}^{N}\vec{P}_k$. The final diagnosis for the pixel is the category with the highest confidence probability in $\bar{P}$. The larger the overlap area is, the more context information is gathered at each pixel, but at the cost of more computing resources.
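A sketch of this fusion follows; it is our illustration of the averaging rule above, and `predict` stands for any patch-level model returning per-class probability maps:

```python
import numpy as np

def stitch(image, predict, patch=400, stride=100, num_classes=4):
    """Average overlapping per-class probability maps, then take the argmax."""
    h, w = image.shape[:2]
    prob = np.zeros((num_classes, h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            p = predict(image[y:y + patch, x:x + patch])  # (num_classes, patch, patch)
            prob[:, y:y + patch, x:x + patch] += p
            count[y:y + patch, x:x + patch] += 1
    prob /= np.maximum(count, 1)                          # mean over the N overlaps
    return prob.argmax(axis=0) + 1                        # 1..4 = normal..CIN 3
```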
TABLE I: Results of fully supervised classification. Overlapping is better than non-overlapping for most networks. MobileNet-v2 works the fastest, ResNet-101 works the best for non-overlapping, and VGG-19 works the best for overlapping. Each metric group lists Dice / mIoU / AP / WP.

Loss | Network | Para | FLOPs | Non-overlapping | 50% Overlapping | 75% Overlapping
CE-loss | MobileNet-v2 | 2.23M | 0.94G | 0.6456 0.4955 0.6508 0.9432 | 0.6953 0.5527 0.7100 0.9372 | 0.7183 0.5771 0.7307 0.9514
CE-loss | VGG-16 | 134.28M | 45.32G | 0.6323 0.4902 0.6854 0.9530 | 0.6448 0.5160 0.7042 0.9599 | 0.6629 0.5372 0.7237 0.9593
CE-loss | VGG-19 | 139.59M | 57.56G | 0.6916 0.5587 0.7249 0.9636 | 0.6716 0.5473 0.7183 0.9739 | 0.6807 0.5571 0.7236 0.9793
CE-loss | GoogLeNet | 5.60M | 4.43G | 0.5683 0.4265 0.6325 0.9412 | 0.5841 0.4636 0.6847 0.9411 | 0.5924 0.4750 0.6934 0.9545
CE-loss | Inception-v3 | 21.79M | 9.47G | 0.6419 0.4810 0.6643 0.9318 | 0.7015 0.5499 0.7281 0.9285 | 0.7106 0.5640 0.7414 0.9409
CE-loss | DenseNet-121 | 6.96M | 8.47G | 0.5778 0.4539 0.6512 0.9513 | 0.5913 0.4771 0.6852 0.9483 | 0.6041 0.4931 0.6989 0.9579
CE-loss | DenseNet-169 | 12.49M | 10.04G | 0.6362 0.5153 0.7147 0.9368 | 0.6518 0.5381 0.7406 0.9445 | 0.6714 0.5597 0.7587 0.9493
CE-loss | ResNet-34 | 21.29M | 10.80G | 0.6044 0.4622 0.6517 0.9023 | 0.6476 0.5186 0.7011 0.9200 | 0.6639 0.5398 0.7226 0.9271
CE-loss | ResNet-50 | 23.52M | 12.10G | 0.5937 0.4581 0.6593 0.9216 | 0.6309 0.5071 0.7108 0.9330 | 0.6515 0.5301 0.7317 0.9384
CE-loss | ResNet-101 | 42.51M | 23.05G | 0.7218 0.5850 0.7472 0.9417 | 0.6883 0.5475 0.7177 0.9337 | 0.7054 0.5658 0.7320 0.9391
AE-loss | MobileNet-v2 | 2.23M | 0.94G | 0.6591 0.5103 0.6663 0.9386 | 0.7099 0.5708 0.7244 0.9499 | 0.7172 0.5817 0.7321 0.9563
AE-loss | VGG-16 | 134.28M | 45.32G | 0.6367 0.5100 0.7083 0.9457 | 0.6572 0.5380 0.7235 0.9411 | 0.6752 0.5621 0.7477 0.9456
AE-loss | VGG-19 | 139.59M | 57.56G | 0.7055 0.5701 0.7321 0.9609 | 0.7239 0.5887 0.7519 0.9760 | 0.7476 0.6187 0.7765 0.9789
AE-loss | GoogLeNet | 5.60M | 4.43G | 0.5874 0.4477 0.6420 0.9319 | 0.6175 0.4987 0.7051 0.9388 | 0.6268 0.5116 0.7102 0.9487
AE-loss | Inception-v3 | 21.79M | 9.47G | 0.6520 0.4919 0.6698 0.9318 | 0.7199 0.5714 0.7381 0.9424 | 0.7321 0.5870 0.7494 0.9513
AE-loss | DenseNet-121 | 6.96M | 8.47G | 0.5959 0.4787 0.6869 0.9428 | 0.6061 0.5079 0.7116 0.9433 | 0.6124 0.5119 0.7167 0.9547
AE-loss | DenseNet-169 | 12.49M | 10.04G | 0.6015 0.4857 0.6875 0.9082 | 0.6257 0.5192 0.7313 0.9000 | 0.6297 0.5297 0.7420 0.9010
AE-loss | ResNet-34 | 21.29M | 10.80G | 0.6519 0.5165 0.7082 0.9170 | 0.6311 0.5099 0.7078 0.9285 | 0.6773 0.5608 0.7502 0.9349
AE-loss | ResNet-50 | 23.52M | 12.10G | 0.6147 0.4848 0.6936 0.9201 | 0.6627 0.5391 0.7436 0.9293 | 0.6656 0.5442 0.7499 0.9303
AE-loss | ResNet-101 | 42.51M | 23.05G | 0.7003 0.5498 0.7154 0.9438 | 0.6981 0.5574 0.7219 0.9348 | 0.7228 0.5838 0.7446 0.9470

V. RESULTS AND DISCUSSION

A. Fully Supervised Learning

1) Fully Supervised Classification: The classification networks described in Section IV-A are all adopted for the experiments on the MTCHI dataset. As shown in Table I, although the networks play to their advantages in different aspects, there are obvious differences in speed and accuracy when applied to cervical precancerous CAD. MobileNet-v2 runs fastest because its floating point operations (FLOPs) amount to only 0.94G. Although Inception-v3 and ResNet-101 are very deep, they are still faster than VGG-19 because of their special connection modes, namely the Inception block and the residual block. The training performance of AE-loss is better than that of CE-loss in most experiments, because AE-loss better fits the relationship between the categories of cervical precancerous lesions. Overlapping post-processing generally achieves good results because it makes up for the lack of foreground information in a single patch. However, there are also special cases in which overlapping post-processing decreases accuracy. For example, in Fig. 5 (b), the wrong prediction of one patch misleads the diagnosis of adjacent patches. In Fig. 5 (c), although the whole region is annotated as CIN 1 by the pathologists, some regions are actually normal tissue, which is clearly displayed by 75% overlapping. Although the overlapping post-processing in (c) is closer to the real situation, it inevitably causes a loss of accuracy when compared with the ground truth.


Fig. 5. Three examples from ResNet-101 trained with CE-loss. I, T, P0 and P75 represent the input image, the ground truth, the non-overlapping result and the 75% overlapping result, respectively. The white, green, blue and red parts represent normal tissue and lesions of CIN 1, CIN 2 and CIN 3, respectively. (a) The 75% overlapping makes up for the lack of edge information in non-overlapping. (b) The wrong prediction results mislead the judgement of adjacent patches in overlapping post-processing. (c) The upper left corner of the sample is closer to normal tissue than CIN 1, and the result of overlapping is closer to the real situation than the annotation.

2) Fully Supervised Segmentation: The results of the fully supervised segmentation networks are fairly compared without overlapping in Table II. Compared with U-Net, the U-Net variants (ENS-UNet, Res-UNet, and UNET3+) increase the computational burden while improving the accuracy. Compared with HookNet-16, HookNet-64 contains more parameters and improves the learning ability of the model. Although DeepLab v3+ is a deep network with more than 100 layers, some parameters of ResNet-101 pre-trained on ImageNet can be reused in its encoder. Thus, it fits quickly and achieves good segmentation results in the cervical segmentation task. Since the up-sampling operation based on bilinear interpolation is weight-free, the FLOPs of DeepLab v3+ are similar to those of other, shallower networks. Therefore, DeepLab v3+ is superior in both speed and accuracy.

TABLE II: Results of fully supervised segmentation and ensemble with assisted block. Ensemble-A, B, and C represent the results with the assisted block at point A, B, and C.

Overlap | Network | Para | FLOPs | Dice | mIoU | AP | WP
0% | HookNet-16 | 3.07M | 2.76G | 0.3804 | 0.2609 | 0.3700 | 0.7140
0% | FCN32s | 18.64M | 45.25G | 0.3882 | 0.2749 | 0.4231 | 0.8841
0% | U-Net | 31.03M | 54.20G | 0.4015 | 0.2820 | 0.4739 | 0.8523
0% | ENS-UNet | 34.18M | 125.74G | 0.4195 | 0.3039 | 0.4811 | 0.8784
0% | FCN16s | 18.64M | 45.25G | 0.4308 | 0.3214 | 0.4508 | 0.8716
0% | HookNet-64 | 48.97M | 43.37G | 0.4652 | 0.3218 | 0.4382 | 0.7649
0% | Res-UNet | 65.45M | 744.56G | 0.5059 | 0.3770 | 0.5690 | 0.9203
0% | SegNet | 28.44M | 109.94G | 0.5260 | 0.4016 | 0.6118 | 0.9140
0% | UNET3+ | 26.98M | 447.05G | 0.5600 | 0.4122 | 0.5721 | 0.9148
0% | DeepLab v3+ | 59.34M | 49.97G | 0.6500 | 0.5180 | 0.7035 | 0.9167
0% | Ensemble-A | 59.35M | 49.97G | 0.7060 | 0.5626 | 0.7362 | 0.9561
0% | Ensemble-B | 84.13M | 66.28G | 0.7261 | 0.5829 | 0.7492 | 0.9423
0% | Ensemble-C | 84.13M | 66.28G | 0.7321 | 0.5910 | 0.7477 | 0.9471
50% | Ensemble-A | 59.35M | 49.97G | 0.7450 | 0.6109 | 0.7807 | 0.9601
50% | Ensemble-B | 84.13M | 66.28G | 0.7671 | 0.6362 | 0.7954 | 0.9543
50% | Ensemble-C | 84.13M | 66.28G | 0.7621 | 0.6301 | 0.7832 | 0.9547
75% | Ensemble-A | 59.35M | 49.97G | 0.7559 | 0.6255 | 0.7930 | 0.9626
75% | Ensemble-B | 84.13M | 66.28G | 0.7699 | 0.6404 | 0.8004 | 0.9549
75% | Ensemble-C | 84.13M | 66.28G | 0.7700 | 0.6403 | 0.7930 | 0.9569

Fig. 6. Box plots of the segmentation results, where each set of experiments was repeated five times. The S in the abscissa represents segmentation networks without assisted blocks. The left results are the segmentation outputs of models trained with assisted block A, B, C, or D, but without the inference multiplication. The right results are the segmentation results with assisted inference.

3) Fully Supervised Ensemble: The classification loss of the assisted block and the segmentation loss jointly and equally constrain the parameter updates of the ensemble. Since the skeleton network parameters are shared, the assisted block helps the fitting of the shared parameters during the training process. Due to the large connected areas of cervical lesions, a patch cut from the central area of an RoI belongs to a single category. Thus, in the inference process, the outputs of the assisted block can be assigned as weights to the different channels of the segmentation feature-maps to suppress pepper noise. Fig. 6 shows the box plots of the segmentation results; each set of experiments was repeated five times. The left box plot reflects the segmentation Dice coefficients with assisted blocks in the training process. The right one reflects the Dice coefficients with different assisted blocks in the inference process. S indicates the comparison results without assisted blocks. A, B, C, and D indicate the connection positions of the assisted blocks in Fig. 4. According to the Dice coefficients, the assisted block can
significantly improve the segmentation performance, and the assisted inference can further enhance the overall effect. Due to the sharp decrease in the number of feature-map channels at point D, the model cannot fit the classification labels well there, which leads to a weaker improvement than with the assisted blocks at the other three points. The repeated experiments show that although the assisted blocks at B and C sometimes bring large gains, their performance is unstable because of the large number of parameters. Table II compares the results of the ensemble of classification and segmentation, where Ensemble-A, B, and C represent the assisted block at point A, B, and C, respectively. The Dice coefficient of Ensemble-A is 0.706, while that of DeepLab v3+ is 0.65, indicating that the ensemble algorithms hold great advantages over single segmentation networks. The accuracies of Ensemble-A, B, and C are relatively similar, but Ensemble-A is faster and more stable.

B. Weakly Supervised Learning

Since the pseudo-labels in weakly supervised learning are not absolutely accurate, the pseudo dataset Sp should not be too large. In our experiments, the pseudo-label number Np of Algorithm 1 is set to 4000. Considering that the feature extractor M is trained on pixel-annotated data, MobileNet-v2, which is the fastest, and ResNet-101, which is the most accurate in non-overlapping fully supervised classification, are selected as M.

1) Weakly Supervised Classification: According to Section V-A.1, MobileNet-v2 is superior in speed, while VGG-19, Inception-v3, and ResNet-101 are superior in accuracy. Thus, these four networks are adopted for the experiments on weakly supervised classification. Table III shows the results with different amounts of pseudo training data. The results with ResNet-101 as the feature extractor M are significantly better than those with MobileNet-v2, indicating that the accuracy of the pseudo-labels is crucial for weakly supervised learning. The test results are very poor when only the pseudo dataset Sp is used without the pixel-annotated training set S0, suggesting that the difference between the test set and the training set is too large and that many of the extracted pseudo-labels are wrong. Thus, it is necessary to control the number of pseudo-labels. In addition, different networks differ in their ability to resist noise. MobileNet-v2 is greatly affected by the noisy data, and its performance becomes worse after the introduction of pseudo-labels. VGG-19 can fit the data thanks to its many parameters, but it is also prone to fluctuation because of noise; thus, when the pseudo data are relatively pure, its Dice coefficient reaches relatively high accuracy. The Dice coefficients of ResNet-101 and Inception-v3 reach approximately 0.77 when 50% and 100% of the pseudo-label data are introduced, respectively. Therefore, the key to improving the effect of weakly supervised learning is increasing the purity of the pseudo-label data.

TABLE III: Weakly supervised classification results with 75% overlapping. Acc represents the accuracy compared with the patch-level ground truth. The results using ResNet-101 as M for extracting pseudo-labeled patches are better than those using MobileNet-v2. When ResNet-101 is adopted as M, Inception-v3 shows the best effect when all pseudo data are introduced, ResNet-101 when 50% are introduced, and VGG-19 when 25% are introduced. Each metric group lists Dice / mIoU / AP / WP / Acc.

Training Set | Network | M: MobileNet-v2 | M: ResNet-101
Sp | VGG-19 | 0.4195 0.3054 0.4150 0.8707 0.3790 | 0.3939 0.2810 0.3847 0.8222 0.3644
Sp | Inception-v3 | 0.6437 0.5126 0.6531 0.9417 0.5221 | 0.4931 0.3868 0.4921 0.8854 0.3935
Sp | ResNet-101 | 0.4115 0.3086 0.4106 0.8159 0.3319 | 0.5101 0.4192 0.5370 0.9302 0.4339
Sp | MobileNet-v2 | 0.2941 0.1997 0.2823 0.6900 0.3208 | 0.3809 0.2817 0.3707 0.7790 0.3366
Sp + S0 | VGG-19 | 0.6904 0.5722 0.7394 0.9760 0.5610 | 0.6821 0.5589 0.7396 0.9424 0.5673
Sp + S0 | Inception-v3 | 0.7200 0.5896 0.7769 0.9542 0.5670 | 0.7785 0.6503 0.8022 0.9646 0.5850
Sp + S0 | ResNet-101 | 0.7349 0.5976 0.7596 0.9493 0.5875 | 0.7118 0.5822 0.7633 0.9492 0.5898
Sp + S0 | MobileNet-v2 | 0.6790 0.5594 0.7360 0.9418 0.5414 | 0.6264 0.4996 0.6732 0.9075 0.5174
50%Sp + S0 | VGG-19 | 0.6309 0.4985 0.6856 0.9819 0.5496 | 0.6755 0.5474 0.7230 0.9791 0.5727
50%Sp + S0 | Inception-v3 | 0.7442 0.6065 0.7717 0.9734 0.5673 | 0.7588 0.6196 0.7751 0.9701 0.5841
50%Sp + S0 | ResNet-101 | 0.7288 0.5863 0.7479 0.9535 0.5936 | 0.7730 0.6406 0.7908 0.9625 0.6169
50%Sp + S0 | MobileNet-v2 | 0.6414 0.5087 0.6926 0.9440 0.5446 | 0.6749 0.5401 0.7194 0.9423 0.5683
25%Sp + S0 | VGG-19 | 0.7477 0.6165 0.7670 0.9819 0.5847 | 0.7567 0.6290 0.7803 0.9746 0.5831
25%Sp + S0 | Inception-v3 | 0.7237 0.5794 0.7403 0.9647 0.5610 | 0.7570 0.6175 0.7709 0.9600 0.5819
25%Sp + S0 | ResNet-101 | 0.7364 0.5939 0.7513 0.9512 0.5917 | 0.7586 0.6241 0.7795 0.9587 0.6037
25%Sp + S0 | MobileNet-v2 | 0.6355 0.4955 0.6599 0.9436 0.5363 | 0.6642 0.5308 0.6992 0.9402 0.5544

TABLE IV: Results of weakly supervised segmentation with different pseudo extractors M. All experiments are conducted based on DeepLab v3+ with 75% overlapping.

M | Training Set | Dice | mIoU | AP | WP
MobileNet-v2 | Sp | 0.1913 | 0.1410 | 0.3510 | 0.8874
MobileNet-v2 | Sp + S0 | 0.6516 | 0.5094 | 0.6937 | 0.9322
MobileNet-v2 | 50%Sp + S0 | 0.6665 | 0.5263 | 0.6993 | 0.9193
MobileNet-v2 | 25%Sp + S0 | 0.6379 | 0.5055 | 0.6941 | 0.9052
ResNet-101 | Sp | 0.1929 | 0.1425 | 0.3549 | 0.8876
ResNet-101 | Sp + S0 | 0.6344 | 0.4962 | 0.6867 | 0.9276
ResNet-101 | 50%Sp + S0 | 0.6740 | 0.5315 | 0.7033 | 0.9243
ResNet-101 | 25%Sp + S0 | 0.6410 | 0.5057 | 0.6910 | 0.9109

2) Weakly Supervised Segmentation: DeepLab v3+ is adopted in the experiments on weakly supervised segmentation
2) Weakly Supervised Segmentation: DeepLab v3+ is adopted in the experiments on weakly supervised segmentation because it yields good performance in fully supervised segmentation (Section V-A.2). Since a pseudo-label only indicates that the corresponding category exists somewhere in the patch, it is too coarse to be filled in directly as the segmentation ground truth. As shown in Table IV, the Dice coefficient reaches 0.674 (75% overlapping) when 50% of Sp is used for training, which is only a slight improvement over the fully supervised segmentation result of 0.65 (non-overlapping). Therefore, pseudo-labeling is unsuitable for weakly supervised segmentation because it introduces too much noise.

TABLE IV
RESULTS OF WEAKLY SUPERVISED SEGMENTATION WITH DIFFERENT FEATURE EXTRACTORS M AND TRAINING SETS

M             Training Set   Dice     mIoU     AP       WP
MobileNet-v2  Sp             0.1913   0.1410   0.3510   0.8874
MobileNet-v2  Sp + S0        0.6516   0.5094   0.6937   0.9322
MobileNet-v2  50%Sp + S0     0.6665   0.5263   0.6993   0.9193
MobileNet-v2  25%Sp + S0     0.6379   0.5055   0.6941   0.9052
ResNet-101    Sp             0.1929   0.1425   0.3549   0.8876
ResNet-101    Sp + S0        0.6344   0.4962   0.6867   0.9276
ResNet-101    50%Sp + S0     0.6740   0.5315   0.7033   0.9243
ResNet-101    25%Sp + S0     0.6410   0.5057   0.6910   0.9109
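For reference, the sketch below computes the Dice coefficient and mIoU, the two region metrics reported in Table IV, from integer-coded prediction and ground-truth masks. It implements the standard formulas; the treatment of classes absent from both masks is an assumed but common convention.

import numpy as np

def dice_and_miou(pred, gt, num_classes):
    # `pred` and `gt` are integer label maps of identical shape.
    # Classes absent from both maps are skipped before averaging.
    dices, ious = [], []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        inter = np.logical_and(p, g).sum()
        p_sum, g_sum = p.sum(), g.sum()
        if p_sum + g_sum == 0:
            continue
        dices.append(2.0 * inter / (p_sum + g_sum))
        ious.append(inter / (p_sum + g_sum - inter))
    return float(np.mean(dices)), float(np.mean(ious))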
3) Weakly Supervised Ensemble: As mentioned above, pseudo-labeling is valuable for classification rather than for segmentation. Thus, the pseudo dataset Sp is only used for training the classification branch of the ensemble, while the segmentation branch is still trained with the pixel-level annotated S0. Considering that the training of classification and segmentation is separate, Ensemble-B and C are difficult to converge due to their larger number of parameters and the lack of a segmentation loss; thus, only Ensemble-A is adopted for weakly supervised ensemble learning. The Dice coefficient of Ensemble-A reaches 0.7833 in Table V, which is significantly higher than the 0.7559 in Table II. Therefore, image-level annotated data are valuable for cervical precancerous CAD.

TABLE V
RESULTS OF WEAKLY SUPERVISED ENSEMBLE WITH ASSISTED BLOCK AT POINT A. THE RESULTS OBVIOUSLY SURPASS THOSE OF FULLY SUPERVISED ENSEMBLE-A AND EVEN ENSEMBLE-C.

          Non-overlapping                     75% Overlapping
Loss      Dice     mIoU     AP      WP        Dice     mIoU     AP      WP
CE-loss   0.7487   0.6087   0.7674  0.9567    0.7783   0.6472   0.7991  0.9637
AE-loss   0.7529   0.6146   0.7707  0.9571    0.7833   0.6531   0.8017  0.9640
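The mixed supervision used here can be summarized by the sketch below: the segmentation branch is optimized on pixel-annotated batches from S0, and the assisted classification block is optimized on image-level pseudo-labeled batches from Sp, in separate passes. It assumes a model that returns (segmentation logits, classification logits) as in the earlier sketch; the optimizer settings, the multi-label classification loss, and the exact separation of the two phases are illustrative assumptions.

import torch
import torch.nn as nn

def train_weak_ensemble(model, s0_loader, sp_loader, epochs=1, device="cuda"):
    # Sketch of separate training of the two branches. `s0_loader`
    # yields (image, pixel mask) pairs from S0; `sp_loader` yields
    # (image, multi-hot patch label) pairs from the pseudo dataset Sp.
    model.to(device)
    seg_criterion = nn.CrossEntropyLoss()
    cls_criterion = nn.BCEWithLogitsLoss()  # assumes multi-label patches
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    for _ in range(epochs):
        # Phase 1: segmentation branch, pixel-level supervision from S0.
        for imgs, masks in s0_loader:
            seg_logits, _ = model(imgs.to(device))
            opt.zero_grad()
            seg_criterion(seg_logits, masks.to(device)).backward()
            opt.step()
        # Phase 2: classification branch, image-level pseudo-labels from
        # Sp (in a stricter reading of "separate training", the shared
        # backbone could be frozen during this phase).
        for imgs, labels in sp_loader:
            _, cls_logits = model(imgs.to(device))
            opt.zero_grad()
            cls_criterion(cls_logits, labels.to(device).float()).backward()
            opt.step()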
C. Discussion

The classification networks transferred from natural image processing tasks can quickly fit the cervical histopathology data. Classification networks are prone to misjudgement due to insufficient foreground information, which is offset by overlapping post-processing, i.e., by aggregating the diagnosis information of adjacent patches. The segmentation networks accurately locate the boundaries of the lesion regions, but their performance is degraded by holes inside the prediction heatmaps. By attaching an assisted block to the segmentation network, the advantages of both classification and segmentation are combined and good performance is achieved.
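The overlapping post-processing mentioned above can be implemented by sliding a window whose stride is smaller than the patch size and averaging the per-class scores of all patches that cover each pixel. The sketch below shows this aggregation for a single image; the 75% overlap (stride of a quarter patch) matches the setting reported in the tables, while mean aggregation itself is an assumed but common choice.

import numpy as np

def aggregate_overlapping(scores_fn, image, patch=512, overlap=0.75):
    # `scores_fn(patch_img) -> (C,)` is any patch scorer, e.g. a trained
    # classifier. Each pixel's final score is the mean over all patches
    # covering it, which smooths isolated misjudgements on patches with
    # little foreground.
    h, w = image.shape[:2]
    stride = max(1, int(patch * (1.0 - overlap)))  # 75% overlap -> patch/4
    acc = cnt = None
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            s = np.asarray(scores_fn(image[y:y + patch, x:x + patch]))
            if acc is None:
                acc = np.zeros((h, w, s.shape[0]))
                cnt = np.zeros((h, w, 1))
            acc[y:y + patch, x:x + patch] += s
            cnt[y:y + patch, x:x + patch] += 1.0
    heatmap = acc / np.maximum(cnt, 1.0)
    return heatmap.argmax(axis=2)  # per-pixel class map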
In fact, pixel-level annotation is time-consuming, while image-level annotation is relatively abundant. Even with image-level annotation, however, the label of a cropped patch is unknown due to the large size of the cervical histopathology image. Therefore, weakly supervised learning is a latent solution. Pseudo-labeling is adopted in this paper for weakly supervised learning. Experiments show that the accuracy of the classification model used for extracting pseudo-label data is critical to the purity of the training set, which strongly affects the final performance. The anti-noise performance of networks is different, so the scale of the pseudo-label dataset should be balanced according to the learning ability of the networks. Considering the limitations of pseudo-label learning for segmentation, the performance of the ensemble algorithm can be improved by effectively using the pseudo-label dataset on the classification branch.

The public dataset provided in this paper aims to attract researchers to pay attention to the automated diagnosis assistance of the cervix. In addition to the benchmarks described in this paper, different directions can be explored to meet the requirements of pathologists:

• The original WSI size is far beyond the GPU memory limit, and the cropped patch has a limited field of view. Therefore, it is expected to explore methods based on multi-scale image fusion that gather rich structural information and simulate the pathologists' process of determining the lesion location and grade at different resolutions.
• Due to the complexity of cervical tissue structure, pathologists are subjective in the diagnosis of lesions. In addition, absolute accuracy cannot be achieved when labeling lesion regions with polygonal lines on low-resolution images. Therefore, weakly supervised learning that only uses annotations as references is encouraged, to resist labeling errors introduced by various factors.
• It is much easier to obtain unlabeled data. Hence, unsupervised learning algorithms are strongly recommended.

VI. CONCLUSIONS

In this paper, a new cervical histopathology image dataset called MTCHI is introduced, and a precancerous task is designed to evaluate the performance of automated diagnosis. Four evaluation metrics (Dice coefficient, mIoU, AP, and WP) are provided particularly for this task. Both fully and weakly supervised algorithms are discussed. Extensive experiments based on classification and segmentation networks are carried out to demonstrate the feasibility of CAD on cervical precancerous lesions. The high accuracy of the ensemble of fully and weakly supervised strategies demonstrates the potential of unlabeled data in improving the performance. The dataset is publicly available for researchers to reproduce and explore novel algorithms, and we hope it will ultimately be helpful for diagnosis.