
Eformer: Edge Enhancement based Transformer for Medical Image Denoising

Achleshwar Luthra*  Harsh Sulakhe*  Tanish Mittal*  Abhishek Iyer  Santosh Yadav

Birla Institute of Technology and Science, Pilani

{f20180401, f20180186, f20190658, f20181105, santosh.yadav}@pilani.bits-pilani.ac.in

*equal contribution

Abstract

In this work, we present Eformer (Edge enhancement based transformer), a novel architecture that builds an encoder-decoder network using transformer blocks for medical image denoising. Non-overlapping window-based self-attention is used in the transformer block to reduce computational requirements. This work further incorporates learnable Sobel-Feldman operators to enhance edges in the image and proposes an effective way to concatenate them in the intermediate layers of our architecture. The experimental analysis compares deterministic learning and residual learning for the task of medical image denoising. To demonstrate the effectiveness of our approach, our model is evaluated on the AAPM-Mayo Clinic Low-Dose CT Grand Challenge Dataset, where it achieves state-of-the-art performance: 43.487 PSNR, 0.0067 RMSE, and 0.9861 SSIM. We believe that our work will encourage more research in transformer-based architectures for medical image denoising using residual learning.

1. Introduction

Modern methods for diagnosing medical conditions have been developing rapidly, and a tool of utmost importance is the Computerized Tomography (CT) scan. It is often used to help diagnose complex bone fractures, tumors, heart disease, emphysema, and more. It works in a manner similar to an X-ray scan: a rotating source shoots narrow X-ray beams through a section of the body, and a highly sensitive detector placed opposite the source picks up the transmitted X-rays; a reconstruction algorithm then creates 2D slices of the body part from one full rotation, and the process is repeated until the required number of slices is produced. As helpful as this procedure is in diagnosis, it is also a cause for concern, as the patient is exposed to ionizing radiation for varying durations. CT scans have been mainly responsible for the increase in radiation that humans receive from medical procedures, and they have even made medical procedures the second-largest source of radiation exposure after background radiation. Reducing the X-ray dose in CT scans is possible, but it leads to problems such as increased noise; reduced contrast at edges, corners, and sharp features; and over-smoothing of images. We propose a method that preserves detail and reduces the noise generated by low-dose scans, so that they may become a viable alternative to high-dose scans.

Medical image denoising has garnered considerable attention from the computer vision research community, with extensive work [19, 27, 4, 10, 14, 2] in this domain in the recent past. Although these methods have shown excellent results, they implicitly associate denoising with operations on a global scale rather than leveraging local visual information. We argue that we can benefit from the patch embedding operations that form the basis of a vision transformer [8]. Recently, Vision Transformers (ViT) have shown great success in many computer vision tasks, including image restoration [25], but they have not yet been exploited on medical image datasets.

To the best of our knowledge, this is the first work that utilizes transformers for medical image denoising. The major contributions of this paper are as follows:

• We introduce a novel architecture, Eformer, for edge enhancement based medical image denoising using transformers. We incorporate learnable Sobel filters for edge enhancement, which improves the performance of our overall architecture. We outperform existing state-of-the-art methods and show how transformers can be useful for medical image denoising.

• We conduct extensive experiments on training our network under the residual learning paradigm. To prove the effectiveness of residual learning in image denoising tasks, we also show results for a deterministic approach in which our model directly predicts denoised images. In medical image denoising, residual learning clearly outperforms the traditional approach, where directly predicting the denoised image amounts to fitting a near-identity mapping.
Figure 1. Detailed description of our method (input low-dose image → residual noise → normal-dose image). All the steps involved are explained in Section 3.6. LC2(D/U) stands for LeWin Transformer block, Concatenation block, Convolution block, and Downsampling/Upsampling block.

This paper is structured as follows: in Section 2 we discuss previous work on image denoising and the use of transformers in related tasks; in Section 3 we explain our approach in detail; in Section 4 we compare our results with existing methods; and we close with conclusions and future directions in Section 5.

2. Related Work

Low-dose CT (LDCT) image denoising is an active research area within medical image denoising due to its valuable clinical usability. Given the limited amount of data and the consequently low accuracy of conventional approaches [16], data-efficient deep learning approaches have huge potential in this domain. The pioneering work of Chen et al. [6] showed that a simple Convolutional Neural Network (CNN) can be used to suppress the noise in LDCT images. The models proposed in [11, 5, 23] show that an encoder-decoder network is effective for medical image denoising: REDCNN [5] adds shortcut connections to a residual encoder-decoder network, and CPCE [23] uses conveying-path connections. Among fully convolutional networks, [10] uses dilated convolutions with different dilation rates, whereas [15] uses simple convolution layers with residual learning to denoise medical images. GAN-based models such as [27, 14] use WGAN [1] with the Wasserstein distance and a perceptual loss for image denoising.

Recently, transformer-based architectures have also achieved huge success in the computer vision domain, pioneered by the Vision Transformer (ViT) [8], which successfully applied transformers to image classification. Since then, many transformer models have been proposed that show strong results on low-level vision tasks, including image super-resolution [26], denoising [25], deraining [3], and colorization [18]. Our work is inspired by one such denoising transformer, Uformer [25], which employs non-overlapping window-based self-attention and depth-wise convolution in the feed-forward network to efficiently capture local context. We integrate the edge enhancement module of [19] and a Uformer-like architecture in an efficient, novel manner that helps us achieve state-of-the-art results.

Method          MSE  MSP  Adv.  VGG-P
REDCNN [5]      ✓    -    -     -
WGAN [1]        -    -    ✓     ✓
CPCE [23]       -    -    ✓     ✓
EDCNN [19]      ✓    ✓    -     -
Eformer (ours)  ✓    ✓    -     -

Table 1. Comparison between losses used by different methods; MSE - mean squared error, MSP - multi-scale perceptual, Adv. - adversarial, and VGG-P - VGG network based perceptual loss.

3. Our Approach

In this section, we provide a detailed description of the components involved in our implementation.

3.1. Sobel-Feldman Operator

Inspired by [19], we use the Sobel-Feldman operator [24], also called the Sobel filter, for our edge enhancement block. The Sobel filter is widely used in edge detection algorithms because it emphasizes edges. Originally the operator had two variants, vertical and horizontal, but we also include diagonal versions similar to [19] (see Supplementary Material). Sample results on edge-enhanced CT images are shown in Figure 2. The resulting image feature maps containing edge information are efficiently concatenated with the input projection and other parts of the network (refer to Figure 1).

Figure 2. Example of results obtained after convolving images with the Sobel filter: input (left) and edge-enhanced images (right).
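As a concrete illustration of this block, the following is a minimal PyTorch sketch of a learnable Sobel filter bank; the class and variable names are ours, and the exact parameterization used in our implementation (and in EDCNN [19]) may differ. Four fixed ±1/±2 Sobel patterns share a single learnable scale α, matching the pattern shown in Figure 4 of the supplementary material.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableSobel(nn.Module):
    """Hypothetical sketch of a learnable Sobel edge-enhancement block.

    Vertical, horizontal, and two diagonal 3x3 Sobel patterns share one
    learnable scale alpha (cf. Figure 4 of the supplementary material).
    """

    def __init__(self, alpha_init: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        v = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]])
        h = v.t()
        d1 = torch.tensor([[0., 1., 2.], [-1., 0., 1.], [-2., -1., 0.]])
        d2 = torch.tensor([[-2., -1., 0.], [-1., 0., 1.], [0., 1., 2.]])
        # Buffer, not parameter: the patterns stay fixed, only alpha is learned.
        self.register_buffer("patterns", torch.stack([v, h, d1, d2]).unsqueeze(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) grayscale CT slice -> (B, 4, H, W) edge maps.
        edges = F.conv2d(x, self.alpha * self.patterns, padding=1)
        edges = F.gelu(edges)  # Section 3.6 applies a GeLU after the Sobel filter
        # Concatenate edge features with the input, as in Figure 1.
        return torch.cat([x, edges], dim=1)
```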
3.2. Transformer based Encoder-Decoder

Denoising autoencoders [5, 23, 11], fully convolutional networks [19, 15, 10], and GANs [27, 14] have succeeded at medical image denoising in the past, but transformers have not yet been explored for this task, despite their success in other computer vision tasks. Our novel network, Eformer, is one step in that direction; we take inspiration from Uformer [25] for this work. At every encoder and decoder stage, convolutional feature maps are passed through a locally-enhanced window (LeWin) transformer block, which integrates a non-overlapping window-based Multi-head Self-Attention (W-MSA) and a Locally-enhanced Feed-Forward Network (LeFF) (see Supplementary Material):

\begin{aligned} & \mathbf{X}_m^{\prime} = \text{W-MSA}(\text{LN}(\mathbf{X}_{m-1})) + \mathbf{X}_{m-1}, \\ & \mathbf{X}_m = \text{LeFF}(\text{LN}(\mathbf{X}_m^{\prime})) + \mathbf{X}_m^{\prime} \end{aligned} \quad (1)

where LN denotes layer normalization. As shown in Figure 1, the transformer block is applied before the LC2D block in each encoding stage and after the LC2U block in each decoding stage, and it also serves as the bottleneck layer.
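To make Eq. (1) concrete, the following is a minimal sketch of its pre-norm residual wiring; the W-MSA and LeFF submodules are detailed in the supplementary material and are taken here as given constructor arguments (our naming, not the authors' code).

```python
import torch
import torch.nn as nn

class LeWinBlock(nn.Module):
    """Sketch of Eq. (1): pre-norm residual composition of W-MSA and LeFF."""

    def __init__(self, dim: int, w_msa: nn.Module, leff: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.w_msa = w_msa  # window-based multi-head self-attention module
        self.leff = leff    # locally-enhanced feed-forward module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H*W, C) token sequence.
        x = self.w_msa(self.norm1(x)) + x  # X'_m = W-MSA(LN(X_{m-1})) + X_{m-1}
        x = self.leff(self.norm2(x)) + x   # X_m  = LeFF(LN(X'_m)) + X'_m
        return x
```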
3.3. Downsampling & Upsampling

Pooling layers are the most common way of downsampling the input image signal in a convolutional network. They work well in image classification tasks, as they capture the essential structural details, but at the cost of losing finer details, which we cannot afford in our task. Hence, we use strided convolutions in our downsampling layer; specifically, a kernel size of 3 × 3 with a stride of 2 and padding of 1.

Upsampling can be thought of as unpooling, the reverse of pooling, using simple techniques such as nearest-neighbor interpolation. In our network, we instead use transpose convolutions [9]: a transpose convolution reconstructs the spatial dimensions and learns its own parameters just like a regular convolutional layer. The issue with transpose convolutions is that they can cause checkerboard artifacts, which are undesirable for image denoising. [21] states that, to avoid uneven overlap, the kernel size should be divisible by the stride; hence, in our upsampling layer, we use a kernel size of 4 × 4 and a stride of 2.
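The two resampling layers of Section 3.3 can be sketched directly in PyTorch; the channel counts below are illustrative, and the padding of the transpose convolution is our assumption, chosen so that the layer exactly doubles the spatial size.

```python
import torch.nn as nn

def downsample(in_ch: int, out_ch: int) -> nn.Module:
    # Strided 3x3 convolution (stride 2, padding 1): halves H and W while
    # learning what to keep, instead of pooling away fine detail.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)

def upsample(in_ch: int, out_ch: int) -> nn.Module:
    # Transpose convolution with kernel 4 and stride 2: the kernel size is
    # divisible by the stride, avoiding the uneven overlap that causes
    # checkerboard artifacts [21]. padding=1 (assumed) gives exact 2x scaling.
    return nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
```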
3.4. Residual Learning

The goal of residual learning is to implicitly remove the latent clean image in the hidden layers. We input a noisy image x = y + v to our network, where x is the noisy image (in our case the low-dose image), y is the ground truth, and v is the residual noise. Rather than directly outputting the denoised image ŷ, the proposed Eformer predicts the residual image v̂, i.e., the difference between the noisy image and the ground truth. According to [12], when the original mapping is close to an identity mapping, the residual mapping is much easier to optimize. Discriminative denoising models aim to learn a mapping function F(x) = ŷ, whereas we adopt the residual formulation and train our network to learn a residual mapping R(x) = v̂, from which we obtain ŷ = x − R(x) = x − v̂.
3.5. Optimization

As part of the optimization process, we employ multiple loss functions to achieve the best possible results. We first use the Mean Squared Error (MSE), which measures the pixel-wise distance between the output and the ground-truth image:

L_{mse} = \frac{1}{N}\sum_{i=1}^{N}\Big\|(x_i - R(x_i)) - y_i\Big\|^2 \quad (2)

However, MSE tends to create unwanted artifacts such as over-smoothing and image blur. To overcome this, we additionally employ a ResNet [12] based Multi-scale Perceptual (MSP) loss [19], described by

L_{msp} = \frac{1}{NC}\sum_{i=1}^{N}\sum_{s=1}^{C}\Big\|\phi_s(x_i - R(x_i), \hat{\theta}) - \phi_s(y_i, \hat{\theta})\Big\|^2 \quad (3)

A ResNet-50 backbone is used as the feature extractor φ. Specifically, the pooling layers of a ResNet-50 pretrained on the ImageNet dataset [7] are removed, the convolutional blocks are retained, and the weights θ̂ are frozen. To calculate the perceptual loss, the denoised output x_i − R(x_i), where R(x_i) = v̂_i (as described in Section 3.4), and the ground truth y_i are passed to the extractor, and feature maps are taken from four stages of the backbone, as done in [19]. This perceptual loss, in combination with MSE, accounts for per-pixel similarity as well as overall structural information. Our final objective is

L_{final} = \lambda_{mse} L_{mse} + \lambda_{msp} L_{msp} \quad (4)

where λ_{mse} and λ_{msp} are pre-defined constants.
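A minimal sketch of this objective is given below, under our reading of Eqs. (2)-(4): a frozen ImageNet-pretrained ResNet-50 with its pooling layers dropped serves as φ, and features are compared after each of the four convolutional stages, as in [19]. Replicating the 1-channel CT slice to 3 channels and the λ weights are our assumptions; the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class MSPLoss(nn.Module):
    """Sketch of the multi-scale perceptual loss of Eq. (3) (our reading)."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)  # newer torchvision: weights=...
        # Keep the convolutional stem and the four stages; pooling layers dropped.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)
        self.stages = nn.ModuleList(
            [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        for p in self.parameters():
            p.requires_grad = False  # theta-hat is frozen

    def forward(self, denoised: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Replicate the 1-channel slice to the 3 channels ResNet expects (assumed).
        x, y = denoised.repeat(1, 3, 1, 1), target.repeat(1, 3, 1, 1)
        x, y = self.stem(x), self.stem(y)
        loss = x.new_zeros(())
        for stage in self.stages:  # phi_s, s = 1..4
            x, y = stage(x), stage(y)
            loss = loss + F.mse_loss(x, y)
        return loss / len(self.stages)

def eformer_loss(model, noisy, clean, msp, lam_mse=1.0, lam_msp=0.1):
    """Eq. (4); the lambda weights here are illustrative, not from the paper."""
    denoised = noisy - model(noisy)  # y_hat = x - R(x), Section 3.4
    return lam_mse * F.mse_loss(denoised, clean) + lam_msp * msp(denoised, clean)
```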
3.6. Overall Network Architecture

Composing the aforementioned individual modules, our pipeline can be described as follows. An input image I is first passed through a Sobel filter to produce S(I), followed by a GeLU activation [13]. In each encoding stage, we pass the input through a LeWin transformer block, followed by a concatenation with S(I) and subsequent convolution operations, similar to [19], to produce an encoded feature map. The feature map, along with S(I), is then downsampled using the procedure described in Section 3.3. Post encoding, at the bottleneck, we pass the encoded feature map through another LeWin transformer block, after which it is decoded by the same number of stages as it was encoded. In each decoder stage, after deconvolution, the earlier downsampled S(I) is concatenated with the upsampled feature maps, which are then passed through a convolutional block. The decoder can thus be viewed as a mirror of the encoder, with a shared S(I). The final feature map produced after decoding is passed through an 'output projection' block to produce the desired residual; this output projection is a convolutional layer that simply projects the C-channel feature map to a 1-channel grayscale image. In our experiments, we set the depth of the LeWin blocks, the number of attention heads, and the number of encoder-decoder stages each to 2. A concise representation of the architecture can be seen in Figure 1; it resembles the letter 'E', hence the name Eformer.
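For orientation, the following is a structural sketch of this wiring, not the authors' implementation: the LeWin blocks are stubbed with nn.Identity, the Sobel block is a plain convolution stand-in, only one resolution level is shown (the paper uses two encoder-decoder stages), and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EformerSketch(nn.Module):
    """Structural sketch of Figure 1 / Section 3.6 (wiring only)."""

    def __init__(self, c: int = 32, e: int = 4):
        super().__init__()
        self.lewin = nn.Identity()                  # stand-in for LeWin blocks
        self.sobel = nn.Conv2d(1, e, 3, padding=1)  # stand-in for Section 3.1
        self.in_proj = nn.Conv2d(1, c, 3, padding=1)
        self.fuse_enc = nn.Conv2d(c + e, c, 3, padding=1)        # LC2D fusion
        self.down = nn.Conv2d(c, c, 3, stride=2, padding=1)      # Section 3.3
        self.down_e = nn.Conv2d(e, e, 3, stride=2, padding=1)    # downsample S(I)
        self.fuse_mid = nn.Conv2d(c + e, c, 3, padding=1)
        self.up = nn.ConvTranspose2d(c, c, 4, stride=2, padding=1)
        self.fuse_dec = nn.Conv2d(c + e, c, 3, padding=1)        # LC2U fusion
        self.out_proj = nn.Conv2d(c, 1, 3, padding=1)            # C -> grayscale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = F.gelu(self.sobel(x))                   # S(I) + GeLU
        f = self.fuse_enc(torch.cat([self.lewin(self.in_proj(x)), s], dim=1))
        f, s2 = self.down(f), self.down_e(s)        # features and S(I) downsampled
        f = self.fuse_mid(torch.cat([self.lewin(f), s2], dim=1))
        f = self.lewin(f)                           # bottleneck LeWin block
        f = self.fuse_dec(torch.cat([self.up(f), s], dim=1))
        f = self.lewin(f)
        return self.out_proj(f)                     # residual v_hat; y_hat = x - v_hat
```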
Method PSNR ↑ SSIM ↑ RMSE ↓
REDCNN 42.3891 0.9856 0.0076
WGAN 38.6043 0.9647 0.0108
CPCE 40.8209 0.9740 0.0093
EDCNN 42.0835 0.9866 0.0079
Eformer 42.2371 0.9852 0.0077
Eformer-residual 43.487 0.9861 0.0067

Table 2. Comparison with previous methods evaluated on the AAPM Dataset [20].

Figure 3. Sample results on the AAPM Dataset [20]: input images (left) and our results (right). More results are provided in the supplementary material.

4. Results and Discussions

This section highlights the results obtained by measuring three different metrics that judge noise reduction and the quality of the reconstructed low-dose CT images: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE). PSNR targets noise reduction and measures the quality of the reconstruction. SSIM is a perceptual metric that focuses on the visible structures in an image and measures visual quality. RMSE tracks the absolute pixel-to-pixel error between the two images. We compare our results, examples of which are shown in Figure 3, with architectures that share similarities with our model in the sense that they are based on convolutional architectures. As seen in Table 1, CPCE [23], WGAN [1], and EDCNN [19], like ours, use a combination of commonly used losses to train their models, while REDCNN [5] only uses MSE. Table 2 shows that our proposed models, Eformer and Eformer-residual, outperform the state-of-the-art methods in both the PSNR and RMSE metrics, indicating efficient denoising; our comparable performance in SSIM also suggests that the visual quality of the image is high and that important details are not lost in the reconstruction.
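For reference, the three metrics can be computed as follows on slices scaled to [0, 1] (a sketch using standard definitions and scikit-image's SSIM; the paper does not spell out its exact computation).

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate(pred: np.ndarray, target: np.ndarray, data_range: float = 1.0):
    """Return (PSNR, SSIM, RMSE) for a pair of [0, 1]-scaled 2D slices."""
    rmse = float(np.sqrt(np.mean((pred - target) ** 2)))
    psnr = 20.0 * np.log10(data_range / rmse)  # PSNR in dB
    ssim = structural_similarity(pred, target, data_range=data_range)
    return psnr, ssim, rmse
```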
5. Conclusion

To conclude, this paper presents a residual learning based image denoising model evaluated in the medical domain. We leverage transformers and an edge enhancement module to produce high-quality denoised images, and we achieve state-of-the-art performance using a combination of a multi-scale perceptual loss and the traditional MSE loss. We believe our work will encourage the use of transformers in medical image denoising. In the future, we plan to explore the capabilities of our model on a multitude of related tasks.

6. Acknowledgements

We want to thank the members of the Computer Vision Research Society (CVRS, https://sites.google.com/view/thecvrs) for their helpful suggestions and feedback.

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN, 2017.
[2] Nicholas Bien, Pranav Rajpurkar, Robyn L. Ball, Jeremy Irvin, Allison Park, Erik Jones, Michael Bereket, Bhavik N. Patel, Kristen W. Yeom, Katie Shpanskaya, Safwan Halabi, Evan Zucker, Gary Fanton, Derek F. Amanatullah, Christopher F. Beaulieu, Geoffrey M. Riley, Russell J. Stewart, Francis G. Blankenberg, David B. Larson, Ricky H. Jones, Curtis P. Langlotz, Andrew Y. Ng, and Matthew P. Lungren. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLOS Medicine, 15(11):1–19, 11 2018.
[3] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer, 2021.
[4] Hu Chen, Yi Zhang, Mannudeep K. Kalra, Feng Lin, Yang Chen, Peixi Liao, Jiliu Zhou, and Ge Wang. Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Transactions on Medical Imaging, 36(12):2524–2535, Dec 2017.
[5] Hu Chen, Yi Zhang, Mannudeep K. Kalra, Feng Lin, Yang Chen, Peixi Liao, Jiliu Zhou, and Ge Wang. Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Transactions on Medical Imaging, 36(12):2524–2535, 2017.
[6] Hu Chen, Yi Zhang, Weihua Zhang, Peixi Liao, Ke Li, Jiliu Zhou, and Ge Wang. Low-dose CT via convolutional neural network. Biomed. Opt. Express, 8(2):679–694, Feb 2017.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
[9] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning, 2018.
[10] M. Gholizadeh-Ansari, J. Alirezaie, and P. Babyn. Deep learning for low-dose CT denoising using perceptual loss and edge detection layer. Journal of Digital Imaging, 33(2):504–515, 04 2020.
[11] Lovedeep Gondara. Medical image denoising using convolutional denoising autoencoders. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 241–246, Dec 2016.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[13] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs), 2020.
[14] Zhanli Hu, Changhui Jiang, Fengyi Sun, Qiyang Zhang, Yongshuai Ge, Yongfeng Yang, Xin Liu, Hairong Zheng, and Dong Liang. Artifact correction in low-dose dental CT imaging using Wasserstein generative adversarial networks. Medical Physics, 46(4):1686–1696, 2019.
[15] Worku Jifara, Feng Jiang, Seungmin Rho, Maowei Cheng, and Shaohui Liu. Medical image denoising using convolutional neural network: a residual learning approach. The Journal of Supercomputing, 75(2):704–718, Feb 2019.
[16] P. Kaur, Gurvinder Singh, and Parminder Kaur. A review of denoising medical images using machine learning approaches. Current Medical Imaging Reviews, 14:675–685, 2018.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), San Diego, 2015.
[18] Manoj Kumar, Dirk Weissenborn, and Nal Kalchbrenner. Colorization transformer, 2021.
[19] Tengfei Liang, Yi Jin, Yidong Li, and Tao Wang. EDCNN: Edge enhancement-based densely connected network with compound loss for low-dose CT denoising. In 2020 15th IEEE International Conference on Signal Processing (ICSP), Dec 2020.
[20] Cynthia H. McCollough, Adam C. Bartley, Rickey E. Carter, Baiyu Chen, Tammy A. Drees, Phillip Edwards, David R. Holmes III, Alice E. Huang, Farhana Khan, Shuai Leng, et al. Low-dose CT for the detection and classification of metastatic liver lesions: results of the 2016 Low Dose CT Grand Challenge. Medical Physics, 44(10):e339–e352, 2017.
[21] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[22] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch, 2017.
[23] Hongming Shan, Yi Zhang, Qingsong Yang, Uwe Kruger, Mannudeep K. Kalra, Ling Sun, Wenxiang Cong, and Ge Wang. 3-D convolutional encoder-decoder network for low-dose CT via transfer learning from a 2-D trained network. IEEE Transactions on Medical Imaging, 37(6):1522–1534, Jun 2018.
[24] Irwin Sobel. An isotropic 3x3 image gradient operator. Presentation at Stanford A.I. Project 1968, 02 2014.
[25] Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. Uformer: A general U-shaped transformer for image restoration, 2021.
[26] Fuzhi Yang, Huan Yang, Jianlong Fu, Hongtao Lu, and Baining Guo. Learning texture transformer network for image super-resolution, 2020.
[27] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, Y. Zhang, L. Sun, and G. Wang. Low-dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss. IEEE Transactions on Medical Imaging, 37(6):1348–1357, 06 2018.
Supplementary Material for Eformer: Edge Enhancement based Transformer for Medical Image Denoising

Achleshwar Luthra*  Harsh Sulakhe*  Tanish Mittal*  Abhishek Iyer  Santosh Yadav

Birla Institute of Technology and Science, Pilani

{f20180401, f20180186, f20190658, f20181105, santosh.yadav}@pilani.bits-pilani.ac.in

*equal contribution

7. Dataset Details

For our research work, we have utilized the AAPM-Mayo Clinic Low-Dose CT Grand Challenge Dataset [20], provided by The Cancer Imaging Archive (TCIA). The dataset contains 3 types of CT scans (abdomen, chest, and head) collected from a total of 140 patients: 48, 49, and 42 patients respectively. The data from each patient comprises low-dose CT scans paired with corresponding normal-dose CT scans. The low-dose CT scans are synthetic scans generated by inserting Poisson noise into the projection data; the noise was inserted to reach a noise level of 25% of the full dose. Each CT scan is given in the DICOM (Digital Imaging and Communications in Medicine) file format, a standard that establishes rules for the exchange of medical images and associated information between different vendors, computers, and hospitals; the format meets health information exchange (HIE) standards and HL7 standards for the transmission of health-related data. A DICOM file consists of a header and image pixel intensity data. The header stores information regarding patient demographics, study parameters, etc. in separate 'tags', and the image pixel intensity data contains the pixel data of the CT scan, which in our case is of size 512 × 512. In our model, for training, we extract the image pixel data from a DICOM file into a NumPy array using the pydicom library (https://pydicom.github.io/), and the pixel data is then scaled from 0 to 1 to avoid heterogeneous spanning of pixel values across different CT scans.

Figure 4. Four different sets of Sobel filters in our implementation.
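A minimal sketch of this preprocessing step is shown below; the min-max scaling is our assumption, as the paper only states that values are scaled from 0 to 1.

```python
import numpy as np
import pydicom

def load_ct_slice(path: str) -> np.ndarray:
    """Read one DICOM slice and scale its pixel data to [0, 1] (Section 7)."""
    ds = pydicom.dcmread(path)                # header ('tags') + pixel data
    img = ds.pixel_array.astype(np.float32)   # 512 x 512 intensity array
    # Min-max scaling to [0, 1] (assumed); epsilon guards constant slices.
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    return img
```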
8. Parameter Details and Network Training

The structure and architecture of the model have been described in Section 3.6 and Figure 1 of the main text. We use the PyTorch framework [22] to run our experiments. The convolutional layers are initialized using the default scheme, except for the Sobel convolutional block, where we constrain the filter parameters to follow the pattern shown in Figure 4, with α a learnable parameter. All our experiments were run on a 16GB NVIDIA TESLA P100 GPU. The model was trained with the Adam [17] optimizer, using a learning rate of 0.00002 and default parameters, on inputs of size 128 × 128 pixels obtained by resizing each image from its original size of 512 × 512 pixels. The results obtained are shown in Figure 6.

Figure 6. Results.
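A sketch of one training epoch under this setup is given below; the data loader and model are assumed to exist, and the MSP term of Eq. (4) is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """Section 8 setup sketch: 128 x 128 low-/normal-dose pairs, residual
    target as in Section 3.4 (MSE term of Eq. (2) only)."""
    model.train()
    for low_dose, full_dose in loader:           # (B, 1, 128, 128) pairs
        low_dose = low_dose.to(device)
        full_dose = full_dose.to(device)
        denoised = low_dose - model(low_dose)    # y_hat = x - R(x)
        loss = F.mse_loss(denoised, full_dose)   # Eq. (2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Adam with the stated learning rate of 0.00002 and default parameters [17]:
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```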
Figure 5. LeWin Transformer Block (Layer Normalisation → W-MSA → residual add, then Layer Normalisation → LeFF → residual add).

9. LeWin Transformer

To make our submission self-contained, we provide the architecture details of the LeWin transformer block [25] here. The LeWin transformer block (Figure 5) contains two core designs, described below. First, non-overlapping Window-based Multi-head Self-Attention (W-MSA), which works on low-resolution feature maps and is sufficient to learn long-range dependencies. Second, a Locally-enhanced Feed-Forward Network (LeFF), which integrates a convolution operator with a traditional feed-forward network and is vital for learning local context. In LeFF, the image patches are first passed through a linear projection layer, followed by 3×3 depth-wise convolutional layers; the patch features are then flattened and finally passed to another linear layer to match the dimension of the input channels. The structure of the LeWin transformer block is represented pictorially in Figure 5. The corresponding equations are:

\mathbf{X}_m^{\prime} = \text{W-MSA}(\text{LN}(\mathbf{X}_{m-1})) + \mathbf{X}_{m-1} \quad (5)

\mathbf{X}_m = \text{LeFF}(\text{LN}(\mathbf{X}_m^{\prime})) + \mathbf{X}_m^{\prime} \quad (6)

Here X′_m and X_m are the outputs of the W-MSA module and the LeFF module respectively, and LN represents layer normalization. In the W-MSA module, the given 2D feature map X ∈ R^{C×H×W} is split into N non-overlapping windows with window size M × M. Self-attention is then performed on the flattened features of each window X^i ∈ R^{M²×C}. Suppose the number of heads is j and the head dimension is d_j = C/j. The consequent computations are:

\mathbf{X} = \{\mathbf{X}^1, \mathbf{X}^2, \dots, \mathbf{X}^N\}, \quad N = HW/M^2 \quad (7)

\mathbf{Y}_j^i = \text{Attention}(\mathbf{X}^i\mathbf{W}_j^Q, \mathbf{X}^i\mathbf{W}_j^K, \mathbf{X}^i\mathbf{W}_j^V), \quad i = 1 \dots N \quad (8)

\mathbf{\hat{X}}_j = \{\mathbf{Y}_j^1, \mathbf{Y}_j^2, \dots, \mathbf{Y}_j^N\} \quad (9)

X̂_j denotes the output of the j-th head over all N windows. The outputs of all heads are then concatenated and linearly projected to obtain the final result. We formulate the attention calculation in the same manner as [25].
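The window partition of Eq. (7) and the per-window attention of Eq. (8) can be sketched as follows; a stock nn.MultiheadAttention stands in for the learned projections W_j^Q, W_j^K, W_j^V, and H, W are assumed divisible by M.

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    """Eq. (7): split a (B, C, H, W) map into N = HW / M^2 non-overlapping
    M x M windows, each flattened to M^2 tokens of dimension C."""
    b, c, h, w = x.shape
    x = x.view(b, c, h // m, m, w // m, m)
    x = x.permute(0, 2, 4, 3, 5, 1)                  # (B, H/M, W/M, M, M, C)
    return x.reshape(b * (h // m) * (w // m), m * m, c)

# Eq. (8): self-attention within each window; 2 heads as in our experiments.
x = torch.randn(1, 32, 128, 128)                     # (B, C, H, W) feature map
windows = window_partition(x, m=8)                   # (N, M^2, C) = (256, 64, 32)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=2, batch_first=True)
out, _ = attn(windows, windows, windows)             # one Y^i per window
```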
