https://doi.org/10.1007/s00521-022-07412-0
ORIGINAL ARTICLE
Yong Zhang1
Received: 13 January 2022 / Accepted: 9 May 2022 / Published online: 24 June 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022
Abstract
Detection of wrist and finger fractures has long been a weak point of related studies, because the targets in X-rays are small, such as hairline fractures. In this paper, a dataset consisting of 4346 anteroposterior, lateral and oblique hand X-rays is built from many orthopedic cases. Specifically, it contains a large number of hairline fractures. An automatic preprocessing based on a generative adversarial network (GAN) and a detection network, called WrisNet, are designed to improve the detection performance for wrist and finger fractures. In the preprocessing, an attention mechanism-based GAN is proposed to approximate manual windowing enhancement. A multiscale attention-module-based generator of the GAN is proposed to increase the continuity between pixels. The generator and the discriminator can achieve 93% structural similarity (SSIM) with manual windowing enhancement without manual parameter adjustment. The designed WrisNet is composed of two components: a feature extraction module and a detection module. A group convolution and a lightweight but efficient triplet attention mechanism are carefully embedded into the feature extraction module, resulting in richer representations of hairline fractures. To obtain more accurate localization information under this condition, the soft non-maximum suppression algorithm is employed as the post-processing method of the detection module. Experimental results show that the designed method achieves an average precision (AP) improvement of 7% or more over other mainstream frameworks. The automatic preprocessing and the detection network greatly reduce the degree of manual intervention, so the method can easily be deployed in a real clinical environment.
Keywords Attention mechanism · Generative adversarial network · Soft non-maximum suppression · Hairline fractures
fractures in X-rays are difficult to detect with state-of-the-art methods in that small object detection [6, 7] is a challenging task for deep learning-based approaches. To solve the two problems mentioned above, an automatic preprocessing based on GAN [8] and a detection network, called WrisNet, are designed in this paper. The main contributions can be summarized as follows.

(1) A dataset, consisting of 4346 anteroposterior, lateral and oblique hand X-rays, is built from many orthopedic cases. It should be pointed out that hairline fractures account for more than 50 percent of the total targets in the dataset, far more than in published datasets.

(2) An attention mechanism-based GAN is proposed as the preprocessing to expand the gray scale range. The goal of the proposed GAN is to obtain an approximation of manual windowing enhancement. We design a novel generator, consisting of a multiscale attention-module-based network, to process the input image. The GAN can achieve 93% SSIM with manual windowing enhancement without manual parameter adjustment and greatly reduces the degree of manual intervention.

(3) In order to deal with the hairline fractures in the dataset, a novel network, called WrisNet, is proposed to improve the detection performance. A feature extraction module and a detection module form WrisNet. In the feature extraction module, ResNeXt with triplet attention (TA) is designed to extract the features, while in the detection module, the soft non-maximum suppression (Soft-NMS) algorithm is used as the post-processing mechanism to reduce the omission of hairline fractures. The results show that the AP achieves an improvement of 7% or more over state-of-the-art frameworks.

This paper is organized as follows. In Sect. 2, medical image preprocessing methods and deep learning-based fracture detection methods are reviewed. The proposed preprocessing and WrisNet are detailed in Sect. 3. In Sect. 4, several experimental results are illustrated to validate the improved detection performance. Finally, the conclusion is given in Sect. 5.

2 Previous work

2.1 GAN in medical image processing

GANs have great application potential in the field of medical image processing. The main tasks they can solve can be divided into image generation and image translation. In the aspect of image generation, the structure information existing in the train dataset is used to generate new medical images. GANs are often used to increase the size of train datasets to improve the accuracy of classification tasks. A new generation method called generative adversarial U-Net was developed by Chen et al. [9], which can realize the generation of various medical images to alleviate the overfitting phenomenon in training. A method that uses cycle-consistent adversarial networks to generate COVID-19 samples was suggested by Morís et al. [10] to improve the accuracy of classification. The applicability of generating images by GAN in oncology was demonstrated by Han et al. [11]. The image translation of medical images mainly includes super-resolution reconstruction, image denoising and so on. The conditional generative adversarial network (CGAN) [12] was used as a denoising algorithm in [13] for low-dose chest images, and the proposed method was proved to be superior to the traditional method. A new super-resolution generative adversarial method was proposed by Zhu et al. [14], which combines CGAN and the super-resolution generative adversarial network (SRGAN) to generate super-resolution images. By extracting useful information from different channels and paying more attention to meaningful pixels, a new convolutional neural network was proposed by Gu et al. [15] for super-resolution in medical imaging. Jiang et al. [16] proposed an improved loss function obtained by combining four loss functions, and this loss function achieved good results in the field of super-resolution CT image reconstruction. In this paper, a GAN is first used as medical image preprocessing to expand the gray scale range. Meanwhile, a multiscale attention-module-based generator is proposed to process the image. The result of the GAN achieves 93% SSIM with manual windowing enhancement without manual parameter adjustment.

2.2 Fracture detection by deep learning-based methods

Considering the accuracy of fracture classification and fracture location, Guan et al. [17] proposed an improved object detection algorithm for the detection of arm fractures and obtained a model with a high AP. Qi et al. [18] trained an object detection model to locate femoral fractures by using a framework based on Faster-RCNN [19] and achieved a good result. In [20], a dilated convolutional feature pyramid network was designed, which was applied to thigh fracture detection. In [21], a deep learning method was employed to process CT images of the spine as well as to locate spinal fractures. In [22], the top layer of the original model was retrained by using the Inception v2 network [23] for leg bone fracture detection. Nonetheless, the above methods could not be applied to the proposed dataset due to
poor hairline fracture detection performance. To better solve the problem of detecting small targets, a feature extraction module, called ResNeXt-TA, is proposed to make fracture features more prominent. In addition, Soft-NMS is designed as the specialized post-processing of the detection module to reduce the omission of hairline fractures.

3 Methodology

An automatic preprocessing based on GAN and WrisNet are proposed for the X-ray diagnosis of wrist and finger fractures, and they are detailed in Sects. 3.1 and 3.2, respectively. The original image is input into the GAN for gray stretching. The output is then fed into WrisNet to detect fractures in the X-rays.

3.1 GAN-based preprocessing

The X-ray gray values are compressed into a small range, which is not conducive to the identification of crack features. A very efficient way of gray stretching is manual windowing enhancement, but the window level and window width of each image need to be set manually. In this paper, a GAN is first proposed to expand the gray scale. Inspired by pix2pix [24], a multiscale attention-module-based generator and a discriminator are designed to form the GAN. The structure of the generator is shown in Fig. 1. The architecture is modeled as an encoding process and a decoding process, which correspond to 8 down-samplings and 8 up-samplings, respectively. A CBAM module [25] is embedded at each scale. The 16 CBAM modules and the encoding-decoding architecture form the generator. The discriminator of pix2pix is directly transplanted to the proposed GAN. Compared with pix2pix, the designed generator can greatly increase the continuity between pixels of the generated image, and the comparison results can be seen in Sect. 4. The gray scale of the output can be controlled within a reasonable range, which helps the following WrisNet to detect hairline fractures better.
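The following is a minimal PyTorch sketch of one encoder scale of such a CBAM-augmented, pix2pix-style generator. The channel sizes, activation choices and the exact placement of CBAM inside each scale are illustrative assumptions; the paper only specifies that a CBAM module is embedded at each of the 8 down-sampling and 8 up-sampling scales.

```python
# Sketch of one encoder scale of a CBAM-augmented pix2pix-style generator.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Channel attention: shared MLP over global avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial attention: 7x7 conv over concatenated channel-wise avg/max maps
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        ca = torch.sigmoid(
            self.mlp(x.mean((2, 3), keepdim=True)) +
            self.mlp(x.amax((2, 3), keepdim=True))
        )
        x = x * ca
        sa = torch.sigmoid(self.spatial(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa

class Down(nn.Module):
    """One of the 8 encoder scales: a stride-2 conv halves the resolution,
    then CBAM re-weights the feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
            CBAM(out_ch),
        )

    def forward(self, x):
        return self.body(x)

# usage: feat = Down(64, 128)(torch.randn(1, 64, 256, 256))
```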
3.2 WrisNet-based fracture detection

3.2.1 Feature extraction module

The proposed feature extraction module is inspired by Faster-RCNN and is mainly composed of ResNeXt-TA and FPN [26].

(1) ResNeXt-TA

ResNeXt-TA is the proposed backbone, composed of C1, C2, C3, C4 and C5. A convolution layer, a batch normalization layer [27], a ReLU activation function [28] and a maxpool layer form C1.

C2, C3, C4 and C5 are designed with different numbers (3, 4, 23, and 3) of blocks. The structure of each block is inspired by the ResNet block [29], and one block is illustrated in Fig. 2. Each block is formed by a residual connection and a ReLU layer. The residual connection contains the following components in order:

(a) a convolution layer,
(b) a batch normalization layer,
(c) a ReLU layer,
(d) a group convolution [30],
(e) a batch normalization layer,
(f) a ReLU layer,
(g) a convolution layer,
(h) a batch normalization layer,
(i) a TA module,
(j) a shortcut connection.

The group convolution: the input tensor is first divided into 64 groups along the channel dimension, and the groups are then convolved with 64 different convolution layers, respectively. Finally, the results of the convolutions are concatenated along the channel dimension as the output of the group convolution. When the depth and width of the network are increased to a certain extent, increasing the number of groups can effectively improve the performance of the feature extraction module.
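As a concrete illustration, the 64-group convolution described above is equivalent to a single grouped convolution layer; the sketch below, with assumed channel counts, shows the split-convolve-concatenate behavior in PyTorch.

```python
# Group convolution sketch: the input is split into 64 channel groups, each
# convolved by its own kernels, and the results are concatenated along the
# channel axis. In PyTorch this is exactly nn.Conv2d(..., groups=64).
import torch
import torch.nn as nn

in_channels, out_channels, groups = 256, 256, 64   # illustrative channel counts
group_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                       padding=1, groups=groups, bias=False)

x = torch.randn(1, in_channels, 64, 64)   # e.g. a mid-level feature map
y = group_conv(x)                         # each group sees 256 / 64 = 4 channels
print(y.shape)                            # torch.Size([1, 256, 64, 64])
```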
TA module: the TA module is adopted from [31], and its detailed structure is shown in Fig. 3; it is composed of three different sub-branches. The TA module can be expressed as Eq. (1):

$$M(F) = \mathrm{AVG}\bigl(M_{0,1,2}(F) + M_{1,0,2}(F) + M_{1,2,0}(F)\bigr) \qquad (1)$$
Fig. 4 Size distribution of ground truth boxes in the train dataset
Fig. 5 Size distribution of ground truth boxes in the test dataset
… × C × W and H × W × C, respectively. f^{7×7}(·) refers to the convolution operation with a 7 × 7 kernel size, and σ(·) is the sigmoid operation. Z-pool(·) in the above formula can be expressed as Eq. (5):

$$\mathrm{Z\text{-}pool}(F) = [\mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F)] \qquad (5)$$

where MaxPool(·) and AvgPool(·) refer to the global maximum pooling and the global average pooling operations, respectively.

The lightweight TA module is located after the third BN layer of each block without adding too many parameters.
Although it contains only a few parameters, it can still effectively help each block to understand which information in the X-ray should be emphasized. In addition, spatial attention is combined with channel attention [32] so that the module can learn the interdependencies between different dimensions and generate more meaningful representations of wrist and finger fractures.
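A minimal sketch of such a triplet attention module, following Eqs. (1) and (5), is given below. It is an assumed PyTorch re-implementation of the module adopted from [31], not the authors' exact code: each branch permutes the tensor, applies Z-pool, a 7 × 7 convolution and a sigmoid gate, and the three branch outputs are averaged.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Z-pool (Eq. 5) followed by a 7x7 conv, BN and sigmoid gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x):
        # Z-pool: concatenate max- and average-pooled maps over the leading dim
        z = torch.cat([x.amax(1, keepdim=True), x.mean(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(z))

class TripletAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()   # branch capturing (C, W) interaction
        self.ch = AttentionGate()   # branch capturing (C, H) interaction
        self.hw = AttentionGate()   # plain spatial branch over (H, W)

    def forward(self, x):                              # x: (N, C, H, W)
        # Branch 1: rotate so H takes the channel position -> C interacts with W
        x1 = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Branch 2: rotate so W takes the channel position -> C interacts with H
        x2 = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Branch 3: identity rotation, spatial attention over (H, W)
        x3 = self.hw(x)
        return (x1 + x2 + x3) / 3.0                    # AVG in Eq. (1)

# usage: out = TripletAttention()(torch.randn(1, 256, 64, 64))
```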
(2) Multi-scale feature extraction for small targets

According to the analysis of the statistical data (as shown in Figs. 4 and 5), we find that the size distribution of the ground truth boxes is scattered and that there are a large number of hairline fractures. FPN is used in the feature extraction module to prevent the features of small fractures from being lost during feature extraction. As shown in Fig. 2, …

… each detection box has its own confidence. Soft-NMS reduces the confidence of the possibly redundant detection boxes instead of removing them directly. First, the confidences in set S are sorted from high to low. The detected box b_m with the highest confidence is added to the set M, which is merged into D, and b_m is removed from B. Then, the remaining boxes in B are checked one by one, and their confidence scores are reduced by the function f(IoU(M, b_i)), as shown in Eq. (6). The process loops until all the boxes in B have been put into D. Finally, the boxes in D with confidence lower than the threshold are considered repeated fracture localizations. Soft-NMS can greatly improve the detection effect in the above-mentioned special case through this score-reduction mechanism.
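A minimal sketch of this Soft-NMS post-processing is shown below. A Gaussian decay is used as one common choice of the score-reduction function f; the paper's exact decay function in Eq. (6) and its thresholds are not reproduced here, so the values below are illustrative assumptions.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """boxes: (N, 4) float array, scores: (N,) float array."""
    boxes, scores = boxes.copy(), scores.copy()
    keep_boxes, keep_scores = [], []
    while len(boxes) > 0:
        m = np.argmax(scores)                 # box with the highest confidence
        keep_boxes.append(boxes[m]); keep_scores.append(scores[m])
        boxes = np.delete(boxes, m, axis=0)   # remove b_m from B
        scores = np.delete(scores, m)
        if len(boxes) == 0:
            break
        # decay remaining confidences by f(IoU(M, b_i)) instead of removing them
        scores = scores * np.exp(-(iou(keep_boxes[-1], boxes) ** 2) / sigma)
    keep_boxes, keep_scores = np.array(keep_boxes), np.array(keep_scores)
    keep = keep_scores > score_thresh         # drop boxes below the threshold
    return keep_boxes[keep], keep_scores[keep]
```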
Fig. 7 The data augmentation process includes random flips, brightness transformations, affine transformations, and image sharpening, designed to enhance the X-rays of the train dataset. The input X-rays are randomly subjected to the above four transformations, and the relevant parameters are randomly selected within a certain range. (a) is the original X-ray, while the data-augmented results are shown in (b–j)
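A sketch of the four augmentations named in Fig. 7 is given below, using torchvision as one possible implementation. The probabilities and parameter ranges are illustrative assumptions, and in a detection setting the geometric transforms would also have to be applied to the ground truth boxes.

```python
# Random flip, brightness jitter, affine transform and sharpening, each drawn
# with assumed probabilities and ranges; boxes are not handled in this sketch.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3),
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),
])

# usage: augmented = augment(pil_image)   # PIL image of a hand X-ray
```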
Table 3 Distribution of SSIM values in the test dataset using pix2pix

SSIM < 90%   212
SSIM < 80%   45
SSIM < 50%   3
The GAN model is trained on an NVIDIA GeForce RTX 3090 GPU. The training settings are as follows. The Adam gradient descent algorithm [35] is adopted. The batch size is set to 1 and a total of 200 epochs are trained. The initial learning rate is set to 0.0002, and a linear learning rate decay strategy is adopted from the 100th epoch.
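The stated schedule can be expressed, for example, with a PyTorch LambdaLR scheduler as sketched below. Only the quoted hyperparameters (batch size 1, 200 epochs, initial learning rate 0.0002, linear decay from the 100th epoch) come from the text; the optimizer betas are an assumption.

```python
import torch

def make_optimizer_and_scheduler(model, total_epochs=200, decay_start=100):
    # Adam with the stated initial learning rate; betas follow common GAN practice
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

    def linear_decay(epoch):
        if epoch < decay_start:
            return 1.0                     # constant lr for the first 100 epochs
        # linearly decay the lr multiplier to zero over the remaining epochs
        return 1.0 - (epoch - decay_start) / float(total_epochs - decay_start)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)
    return optimizer, scheduler
```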
4.4 Results and analyses

4.4.1 GAN-based preprocessing

The proposed GAN is compared with the Unet-based [38] pix2pix in two different ways, which are described as follows:
Fig. 9 Comparison of the generated images. (a) is the original X-ray image, (b) is the ground truth, and (c, d) are generated by pix2pix and the proposed GAN, respectively
(1) SSIM value

The distribution of SSIM values in the test dataset using pix2pix is shown in Table 3. The SSIM values below 90% can be greatly improved by using the proposed GAN. The SSIM comparison between pix2pix and the proposed GAN is shown in Table 4, where the SSIM of a single image can be increased from 77.75 to 95.53%, which proves that the proposed GAN can obtain the approximation of manual windowing enhancement. The comparison of the generated images is shown in Fig. 9. The result shows that the image generated by the proposed GAN is more similar to the ground truth than that of pix2pix (see circles in Fig. 9), which proves that the attention-module-based generator can greatly improve the correlations between pixels.
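The reported SSIM values could be computed, for example, as sketched below, using skimage's structural_similarity between a GAN-enhanced X-ray and its manually windowed counterpart; the file names are placeholders.

```python
import numpy as np
from skimage.metrics import structural_similarity
from PIL import Image

# placeholder file names for one generated image and its manual reference
generated = np.array(Image.open("generated_xray.png").convert("L"), dtype=np.float64)
manual = np.array(Image.open("manual_windowing.png").convert("L"), dtype=np.float64)

ssim_value = structural_similarity(generated, manual, data_range=255.0)
print(f"SSIM: {ssim_value:.2%}")   # the paper reports about 93% on average
```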
(2) AP value

As shown in Table 5, the proposed generator ensures the consistency of the detection results between the generated images and the manual images, compared with Unet. The detection effect on the generated images is even better than that on the manual test dataset, owing to the elimination of the influence of subjective factors in the preprocessing.
Fig. 10 Our model has a good effect on the detection of the phalanx, hand scaphoid and distal radius. The first to third results from the upper left include the detection of phalangeal fractures. The others include scaphoid and distal radius fractures
Fig. 11 Under the influence of plaster and steel nails, the model still has a considerable detection effect. The first to third results from the upper left include detection with steel nails. The others include detection with plaster
Table 6 Comparison of different frameworks

Algorithm                  Backbone     AP (%)
Faster R-CNN               ResNet50     47.4
Faster R-CNN               ResNeXt101   49.2
Cascade R-CNN [39]         ResNet50     48.2
Cascade R-CNN              ResNet101    48.4
Cascade R-CNN + DCN [40]   ResNet101    48.3
WrisNet                    ResNeXt-TA   54.7
WrisNet (best effect)      ResNeXt-TA   56.6

4.4.2 Comparison of detection effect

3476 X-rays are used to train WrisNet, and some detection results on the test dataset are shown in Figs. 10 and 11. The green boxes in the figures are the ground truth boxes marked by the doctors, and the blue boxes are detected by WrisNet. As shown in Fig. 10, WrisNet has excellent results in the fracture detection of the phalanx, scaphoid, and distal radius, which is reflected in the large overlap area between the detection boxes and the corresponding ground truth boxes. At the same time, as shown in Fig. 11, the model also performs well in complex environments such as X-rays with nails or plaster. These results demonstrate that its effectiveness is very close to the diagnosis of radiologists.

The detection effects of representative object detection frameworks are compared with WrisNet, and the results are shown in Table 6, where the significant improvement of our method is marked in bold. All the frameworks use 3476 X-rays as the train dataset and 870 X-rays as the test dataset. The same image preprocessing method and the first data augmentation strategy are used in this part. Furthermore, the pretrained weights on ImageNet are used in all frameworks to initialize the backbone network, and the hyperparameters are adjusted to achieve the best effect, to ensure the validity of the comparative experiment. AP is used as the evaluation criterion of the detection results, which is the most reliable and commonly used evaluation criterion in the current object detection field. The AP of each framework is obtained when the IoU threshold is 0.5. As shown in Table 6, our network achieves 54.7% AP, which is an improvement of at least 5.5% in AP over the other frameworks. With the second data augmentation strategy and Soft-NMS, the AP of WrisNet can reach 56.6%.
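As an illustration of this evaluation protocol, AP at an IoU threshold of 0.5 can be obtained with pycocotools as sketched below, assuming the doctor-marked boxes and the WrisNet detections have been exported to COCO-format JSON files; the file names are placeholders, and the authors' actual evaluation tooling is not specified.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/test_fractures.json")     # doctor-marked boxes
coco_dt = coco_gt.loadRes("wrisnet_detections.json")  # model outputs

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()            # stats[1] is AP at IoU = 0.50
print(f"AP@0.5 = {evaluator.stats[1]:.3f}")
```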
4.4.3 Ablation experiment

A simple ablation experiment is performed, and the results are shown in Table 7, where the significant improvement of our method is marked in bold. The impact of the proposed data preprocessing, the proposed backbone network, the data augmentation, and the proposed post-processing are tested gradually. In the ablation experiments, the results demonstrate that the proposed WrisNet obtains an obvious AP improvement of up to 8.6%. As shown in Table 8, the improvement is mainly due to the enhancement of small target detection, and WrisNet obtains an obvious AP improvement on small targets of up to 9.4%.
Table 7 Ablation experiment results (✓ indicates an enabled component; AP in %)

48.0
✓  49.2
✓ ✓  53.7
✓ ✓  53.3
✓ ✓ ✓  54.0
✓ ✓ ✓ ✓  56.6
Table 8 Comparison of AP for different target sizes

Algorithm                   AP (%)   AP (%) (small targets)   AP (%) (medium targets)
Faster R-CNN (ResNeXt101)   48.0     27.8                     67.6
WrisNet                     56.6     37.2                     73.4
22. Abbas W, Adnan SM, Javid MA, Majeed F, Ahsan T, Hassan SS (2020) Lower leg bone fracture detection and classification using faster RCNN for X-rays images. In: 2020 IEEE 23rd International Multitopic Conference (INMIC). IEEE, pp 1–6. https://doi.org/10.1109/INMIC50486.2020.9318052
23. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2818–2826. https://doi.org/10.1109/cvpr.2016.308
24. Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 1125–1134. https://doi.org/10.1109/cvpr.2017.632
25. Woo S, Park J, Lee JY, Kweon IS (2018) CBAM: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19. https://doi.org/10.1007/978-3-030-01234-2_1
26. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2117–2125. https://doi.org/10.1109/cvpr.2017.106
27. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning (ICML). PMLR, pp 448–456
28. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: International conference on machine learning (ICML), pp 807–814
29. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/10.1109/cvpr.2016.90
30. Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 1492–1500. https://doi.org/10.1109/cvpr.2017.634
31. Misra D, Nalamada T, Arasanipalai AU, Hou Q (2021) Rotate to attend: convolutional triplet attention module. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3139–3148. https://doi.org/10.1109/WACV48630.2021.00318
32. Niu Z, Zhong G, Yu H (2021) A review on the attention mechanism of deep learning. Neurocomputing 452:48–62. https://doi.org/10.1016/j.neucom.2021.03.091
33. Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-NMS—improving object detection with one line of code. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 5561–5569. https://doi.org/10.1109/iccv.2017.593
34. Lin TY, Maire M, Belongie S, Hays J, Perona P et al (2014) Microsoft COCO: common objects in context. In: European conference on computer vision. Springer, Cham, pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
35. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
36. Chlap P, Min H, Vandenberg N, Dowling J, Holloway L, Haworth A (2021) A review of medical image data augmentation techniques for deep learning applications. J Med Imaging Radiat Oncol 65:545–563. https://doi.org/10.1111/1754-9485.13261
37. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 248–255. https://doi.org/10.1109/CVPR.2009.5206848
38. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, Cham, pp 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
39. Cai Z, Vasconcelos N (2018) Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6154–6162. https://doi.org/10.1109/cvpr.2018.00644
40. Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 764–773. https://doi.org/10.1109/ICCV.2017.89

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.