Eformer: Edge Enhancement based Transformer for Medical Image Denoising
Achleshwar Luthra*, Harsh Sulakhe*, Tanish Mittal*, Abhishek Iyer, Santosh Yadav
[Figure 1: architecture diagram — LeWin Transformer blocks, LC2D and LC2U blocks (n=1, k=1), edge enhanced features, concatenation, convolution, and downsampling/upsampling paths]
Figure 1. Detailed description of our method. All the steps involved are explained in Section 3.6. LC2(D/U) stands for LeWin Transformer, Concatenation block, Convolution block, and Downsampling/Upsampling.
prove the effectiveness of residual learning in image denoising tasks, we also show results using a deterministic approach where our model directly predicts denoised images. In medical image denoising, residual learning clearly outperforms traditional learning approaches, where directly predicting denoised images becomes similar to formulating an identity mapping.

This paper follows this structure: in Section 2 we discuss previous work on image denoising and the use of transformers in related tasks. In Section 3, we explain our approach in detail. In Section 4, we compare our results with existing methods, followed by conclusions and future directions in Section 5.

2. Related Work

Low-dose CT (LDCT) image denoising is an active research area in medical image denoising due to its valuable clinical usability. Due to the limited amount of data and the consequent low accuracy of conventional approaches [16], data-efficient deep learning approaches have huge potential in this domain. The pioneering work of Chen et al. [6] showed that a simple Convolutional Neural Network (CNN) can be used to suppress the noise of LDCT images. The models proposed in [11, 5, 23] show that an encoder-decoder network is effective in medical image denoising. REDCNN [5] combines shortcut connections with a residual encoder-decoder network, and CPCE [23] uses conveying-path connections. Fully convolutional networks such as [10] use dilated convolutions with different dilation rates, whereas [15] uses simple convolution layers with residual learning for denoising medical images. GAN-based models such as [27, 14] use WGAN [1] with the Wasserstein distance and a perceptual loss for image denoising.

Recently, transformer-based architectures have also achieved huge success in the computer vision domain, pioneered by ViT (Vision Transformer) [8], which successfully applied transformers to image classification. Since then, many transformer-based models have shown strong results on low-level vision tasks including image super-resolution [26], denoising [25], deraining [3], and colorization [18]. Our work is inspired by one such denoising transformer, Uformer [25], which employs non-overlapping window-based self-attention and depth-wise convolution in the feed-forward network to efficiently capture local context. We integrate the edge enhancement module [19] with a Uformer-like architecture in an efficient, novel manner that helps us achieve state-of-the-art results.

3. Our Approach

In this section, we provide a detailed description of the components involved in our implementation.

3.1. Sobel-Feldman Operator

Inspired by [19], we use the Sobel–Feldman operator [24], also called the Sobel filter, in our edge enhancement block. The Sobel filter is widely used in edge detection algorithms as it emphasizes edges. Originally the operator had two variants, vertical and horizontal, but we also include diagonal versions similar to [19] (see Supplemental Material). Sample results of an edge-enhanced CT image are shown in Figure 2. The set of image feature maps con-
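To illustrate the edge-extraction step, the sketch below applies horizontal, vertical, and diagonal Sobel kernels to a grayscale image. The diagonal kernel values are one common choice and are not necessarily the exact ones from the paper's supplemental material; the padding mode is likewise an assumption.

```python
import numpy as np

# Horizontal/vertical Sobel kernels plus diagonal variants.
# The diagonal kernel values are one common choice, not necessarily
# the exact ones used in the paper.
SOBEL_KERNELS = {
    "horizontal": np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float),
    "vertical":   np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),
    "diag_main":  np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], float),
    "diag_anti":  np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], float),
}

def _filter2d(img, kernel):
    """3x3 cross-correlation with edge-replicated padding ('same' output)."""
    pad = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            out[r, c] = np.sum(pad[r:r + 3, c:c + 3] * kernel)
    return out

def edge_maps(img):
    """Return one edge-response map per Sobel kernel for a 2-D image."""
    return {name: _filter2d(img, k) for name, k in SOBEL_KERNELS.items()}
```

A vertical step edge, for example, produces a strong response under the vertical kernel and none under the horizontal one.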
Loss terms used by each method (MSE: mean squared error; MSP: multi-scale perceptual; Adv.: adversarial; VGG-P: VGG perceptual):

Method           MSE   MSP   Adv.   VGG-P
REDCNN [5]        ✓     -     -      -
WGAN [1]          -     -     ✓      ✓
CPCE [23]         -     -     ✓      ✓
EDCNN [19]        ✓     ✓     -      -
Eformer (ours)    ✓     ✓     -      -
Pooling layers are the most common way of downsampling the input image signal in a convolutional network. They work well in image classification tasks as they help capture the essential structural details, but at the cost of losing finer details, which we cannot afford in our task. Hence we choose strided convolutions in our downsampling layer. More specifically, we use a kernel size of 3 × 3 with a stride of 2 and padding of 1.

Upsampling can be thought of as unpooling, or the reverse of pooling using simple techniques such as nearest neighbor. In our network, we use transpose convolutions [9]. A transpose convolution reconstructs the spatial dimensions and learns its own parameters just like regular convolutional layers. The issue with transpose convolutions is that they

3.5. Optimization

As a part of the optimization process, we employ multiple loss functions to achieve the best possible results. We initially use Mean Squared Error (MSE), which calculates the pixelwise distance between the output and the ground truth image, defined as follows:

L_{mse} = \frac {1}{N}\sum _{i=1}^N\Big \|(x_i - R(x_i)) - y_i\Big \|^2 (2)

However, it tends to create unwanted artifacts such as over-smoothness and image blur. To overcome this, we employ both a ResNet [12] based Multi-scale Perceptual (MSP)
Method PSNR ↑ SSIM ↑ RMSE ↓
REDCNN 42.3891 0.9856 0.0076
WGAN 38.6043 0.9647 0.0108
CPCE 40.8209 0.9740 0.0093
EDCNN 42.0835 0.9866 0.0079
Eformer 42.2371 0.9852 0.0077
Eformer-residual 43.487 0.9861 0.0067
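For reference, the PSNR and RMSE columns above can be computed as below (SSIM requires a windowed implementation and is omitted); the `data_range` default assumes images normalized to [0, 1].

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error between two images."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the maximum
    possible pixel value (assumed 1.0 for normalized images)."""
    return float(20.0 * np.log10(data_range / rmse(pred, target)))
```

A uniform pixel error of 0.01 on a [0, 1] image yields an RMSE of 0.01 and a PSNR of 40 dB.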
7. Dataset Details

[Figure: LeWin Transformer block — Layer Normalisation → W-MSA → Layer Normalisation → LeFF, with residual connections]
\mathbf {X} = \{\mathbf {X}^1, \mathbf {X}^2, \dots , \mathbf {X}^N\}, \quad N = HW/M^2 (7)

\mathbf {Y}^i_j = \text {Attention}(\mathbf {X}^i\mathbf {W}_j^Q, \mathbf {X}^i\mathbf {W}_j^K, \mathbf {X}^i\mathbf {W}_j^V), \quad i = 1 \dots N (8)

\mathbf {\hat {X}}_j = \{\mathbf {Y}^1_j, \mathbf {Y}^2_j, \dots , \mathbf {Y}^N_j\} (9)
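The window partition of Eq. (7) can be sketched as follows; the (H, W, C) channel-last layout and the assumption that H and W are divisible by M are illustrative choices.

```python
import numpy as np

def window_partition(x, M):
    """Split a (H, W, C) feature map into N = HW / M^2 non-overlapping
    M x M windows, as in Eq. (7). Assumes H and W are divisible by M."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    # Reorder to (window_row, window_col, M, M, C), then flatten windows.
    windows = x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)
    return windows  # shape (N, M*M, C): one row of tokens per window
```

Self-attention (Eq. (8)) is then applied independently within each of the N windows, and Eq. (9) concatenates the per-window outputs.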