4.1. Ablation Experiment
Several ablation experiments were designed to assess the effectiveness of the data enhancement and the weighted loss function proposed in this work. To test the effect of data enhancement, all data were extracted with the lung mask to form the original dataset, and the binary cross-entropy loss function was applied. One experiment used the original data for training and testing; in the other, the data were processed with data enhancement to form the enhanced dataset. All other experimental parameters remained the same. The quantitative results of the comparison are revealed in Table 2. Based on ACC and PRE, both experiments obtained high values; the PRE obtained from the original data is even 15% higher than that from the enhanced dataset. However, the REC obtained from the original data is only 26.86%. It can be judged that the texture characteristics of COVID-19 infection are suppressed by the interference of surrounding tissue in the original data, so the network output usually contains fewer positive cases in the infection area. Although most of these positive cases are correct, they cannot effectively describe the overall infection profile. After data enhancement, the infection features are strengthened, so more infection areas are judged as positive cases. This is also the reason why the DSC obtained from the original data is only 37.5%, much lower than that of the enhanced data.
Since the DSC is the coincidence rate between the segmentation result X and the manual ground truth Y, defined as DSC = 2|X ∩ Y| / (|X| + |Y|), it is a key indicator of the performance of a segmentation method. Based on it, all data voxels were used as samples, and McNemar's test [44] was applied to verify the results of data augmentation (Table 3). Var.1 indicates whether augmentation is used, and Var.2 indicates whether the predicted voxels overlap the ground truth. Since p = sig. is much less than 0.01, the difference between the two methods is highly significant; therefore, data augmentation improves the performance. Details of the Var.1 and Var.2 crosstabulation are shown in Table S3 of the supporting information.
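As an illustration of how such a voxel-level test can be assembled, the following Python sketch builds a 2 × 2 crosstabulation from paired per-voxel correctness and applies the chi-square form of McNemar's test. The array names and the synthetic data are placeholders for the actual flattened prediction and ground-truth masks; the exact tabulation used for Table 3 may differ in detail.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
# Placeholder stand-ins for the flattened binary masks over all voxels:
gt        = rng.integers(0, 2, size=100_000)   # manual ground truth
pred_orig = rng.integers(0, 2, size=100_000)   # model trained on original data
pred_aug  = rng.integers(0, 2, size=100_000)   # model trained on enhanced data

correct_orig = pred_orig == gt                 # per-voxel agreement, original
correct_aug  = pred_aug == gt                  # per-voxel agreement, enhanced

# 2x2 crosstabulation of paired correctness (the Var.1/Var.2-style table):
table = np.array([
    [np.sum( correct_orig &  correct_aug), np.sum( correct_orig & ~correct_aug)],
    [np.sum(~correct_orig &  correct_aug), np.sum(~correct_orig & ~correct_aug)],
])

# With voxel counts this large, the chi-square approximation is appropriate.
result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.2f}, p = {result.pvalue:.3e}")
```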
To verify the quantitative conclusion, Figure 6 shows two randomly selected visualizations obtained from the enhanced data and the original data. Figure 6a shows the output map of the network trained on the original data, with the relevant features highlighted by blue boxes; Figure 6b is the output map obtained from the enhanced data. For convenience of observation, Figure 7 enlarges the blue boxes in Figure 6. From the visualization results, the number of positive cases in the infection area from the original data is smaller than that from the enhanced data, and the probability of many areas is lower. This is because the infection features are not obvious, which leads to uncertain judgments by the network; such areas may then be wrongly judged as background in the subsequent binary segmentation process, reducing accuracy. However, the results from data enhancement also have shortcomings. For example, parts of the invasive lesions are judged to be highly similar vascular structures (Figure 7b, second column), which may reduce the detected infection area. Therefore, more scientific data enhancement methods remain an important direction for our next exploration.
Another ablation experiment was designed to verify the role of the loss function in this work. In the experimental group and the control group, the lung region extraction, data enhancement, and network parameters were kept consistent, and the modulation factors α and γ were discussed. Table 4, Table 5 and Table 6 list the quantitative results of the focal loss function, the weighted loss function, and a comprehensive comparison, respectively. In the focal loss function, the costs of positive and negative samples are balanced by the α parameter: the cost of positive samples is scaled to α times the original, and the cost of negative samples is scaled to (1 − α) times the original. When α > 0.5, the positive samples receive relatively more cost, so the network devotes more optimization to them. Therefore, Table 4 shows that the best results are usually obtained at α = 0.8. The γ parameter is mainly used to balance the costs of complex and simple samples; here, the settings γ = 1 and γ = 2 are discussed. By quantitative assessment, the performance at γ = 2 is better and more stable.
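For reference, this behavior corresponds to the standard focal loss FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where α_t = α for positive samples and 1 − α for negative samples. The following is a minimal PyTorch sketch consistent with that description; the function name and tensor shapes are illustrative, and the exact implementation in this work may differ in detail.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, target, alpha=0.8, gamma=2.0):
    # target is a float tensor of 0/1 voxel labels with the same shape as logits.
    # Per-voxel binary cross-entropy, kept unreduced so it can be reweighted.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    # p_t: predicted probability of the true class for each voxel.
    p_t = p * target + (1 - p) * (1 - target)
    # alpha_t: alpha for positive voxels, (1 - alpha) for negative voxels.
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    # (1 - p_t) ** gamma down-weights easy (well-classified) voxels.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example usage with a random 3D volume (batch, channel, D, H, W):
logits = torch.randn(1, 1, 8, 32, 32)
target = (torch.rand(1, 1, 8, 32, 32) > 0.9).float()  # sparse positives
loss = binary_focal_loss(logits, target, alpha=0.8, gamma=2.0)
```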
Table 5 mainly discusses the parameters of the weighted loss function. Different from the focal loss function, the weighted loss increases the cost of positive samples while reducing the cost of negative samples, giving the positive samples a larger share of the cost while keeping the total cost steady; this makes the network converge stably while paying more attention to the target area. If the parameters are appropriate, the cost of negative samples does not decrease too much, which prevents the network from converging too fast and oscillating near the minimum point. From the experimental results, α = 0.8 is a better balance. At the same time, owing to the influence of the γ parameter on the cost, the weighted loss function gives a relatively reliable result at γ = 1.
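To make this weighting concrete, the sketch below assumes, for illustration only, that positive voxels are weighted by 2α and negative voxels by 2(1 − α), so the summed weight stays constant at 2 while the positive-to-negative ratio α/(1 − α) grows with α. These specific factors are an assumption adopted for the example, not a verified reproduction of the implementation.

```python
import torch
import torch.nn.functional as F

def weighted_loss(logits, target, alpha=0.8, gamma=1.0):
    # Assumed weighting: positive voxels scaled by 2*alpha (> 1 when
    # alpha > 0.5), negative voxels by 2*(1 - alpha) (< 1), so the summed
    # weight stays at 2 while positives receive a larger share of the cost.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)   # prob. of the true class
    w_t = 2 * alpha * target + 2 * (1 - alpha) * (1 - target)
    # The gamma modulation of complex vs. simple voxels is kept as in focal loss.
    return (w_t * (1 - p_t) ** gamma * bce).mean()
```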
Figure 8 and Figure 9 show the loss convergence of the three loss functions over the first 100 steps. For convenience of observation, the ordinate adopts a logarithmic scale with base 2. Figure 8 highlights the relationship between the focal loss function and the binary cross-entropy loss function. It can be clearly observed that the cost of the focal loss function drops fast and quickly reaches a relatively stable level; as the weight of the positive samples increases and the weight of the negative samples decreases, this downward trend becomes even steeper. When facing data with a small proportion of positive samples, the weights of the positive and negative samples and the overall cost therefore need to be rebalanced. Figure 9 shows the relationship between the weighted loss function and the binary cross-entropy loss function, where the weighted loss function increases the weight ratio of the positive samples to the negative samples as α increases. Similarly, because the number of positive samples is small, the overall cost gradually decreases and eventually faces the same contradiction as the focal loss function, so the α value needs to be adjusted appropriately. In addition, when α is small (e.g., α = 0.2), the overall loss increases and causes more training cost. After the discussion in this work, appropriate results are obtained at α = 0.8 and γ = 1.
To comprehensively compare the binary cross-entropy loss function, the focal loss function, and the weighted loss function, Table 6 compares the results with representative parameters against the binary cross-entropy loss function. The comparison shows that, on the basis of the focal loss function, performance can be effectively improved by changing the costs of positive, negative, complex, and simple samples, especially on data where positive and negative samples are severely imbalanced. The segmentation of vessels, tumors, nodules, and other small tissues faces the same common problem as this work. However, the focal loss function reduces the cost of all samples, causing the cost to decrease quickly to a small value, which may make it difficult for the network to continue converging effectively in the subsequent training process. Therefore, the weighted loss function redesigns the cost calculation for positive, negative, complex, and simple samples, thereby further improving the result on the basis of the focal loss function.
Figure 10 shows the output maps of four randomly selected cases. From Figure 10a to Figure 10h, they are the original CT images, the manual ground truth, and the results of the binary cross-entropy loss function, focal loss (0.8, 1), focal loss (0.8, 2), weighted loss (0.8, 1), weighted loss (0.8, 2), and weighted loss (0.2, 1). In the visualization results, the positive sample area from the binary cross-entropy loss function is dark. It can be concluded that equal cost weights make it difficult to effectively increase the output values of the target area on data with positive and negative sample imbalance, because the cost change produced by the positive samples is much smaller than the cost produced by the negative samples. The other visualization results increase the ratio of positive sample cost to a certain extent through weight adjustment, which makes their positive sample areas brighter than those from the binary cross-entropy loss function; this is advantageous for the subsequent binary segmentation. The main difference between them is the degree to which the cost of negative samples is suppressed. For example, because the rapid decrease in cost reduces subsequent training efficiency, the background from the focal loss function (Figure 10d,g) is relatively gray. In Figure 10g, the background area is also gray; this is because the number of negative samples is large, and increasing the γ value makes the cost of the negative samples grow faster than that of the positive samples, causing them to become unbalanced again. Therefore, it is very important to choose appropriate parameters.
4.2. Performance Assessment
This work evaluated the accuracy of the proposed method by comparison with different works. The results of this work were obtained from the 3D U-Net network model guided by lung region extraction, data enhancement, and the weighted loss function, and two datasets were used for testing. As comparison groups, four common 3D networks, V-Net, U-Net, Dense-Net, and DenseVoxel-Net, were constructed. To improve their performance, all data in the comparison groups were processed with lung region extraction and data enhancement. The quantitative comparisons on the private dataset and the public dataset are presented in Table 7 and Table 9, respectively. It can be seen that the results on the private dataset are better than those on the public dataset. This is because the labeling specifications of the two datasets differ. As the area surrounding a COVID-19 infection is usually diffuse and blurred, a clear boundary cannot be effectively distinguished. The labeling of the private dataset is therefore mainly based on conservative outlines, with the purpose of accurately identifying and locating the COVID-19 infection. In contrast, the labeling of the public dataset aims at complete coverage of the COVID-19 infection: the labeled area needs to cover the whole infection range, so it usually crosses the actual infection boundary. This explains why the PRE on the public dataset is significantly higher than that on the private dataset, while the REC is smaller; it proves that the positive cases in the outputs are mainly concentrated in the target area.
Moreover, we conducted statistical tests on the results in Table 7. We carried out the Pearson chi-square, likelihood-ratio, and linear-by-linear association tests [44] based on the DSC. The chi-square tests for the different models are shown in Table 8. The p-values of all statistical tests are far less than 0.01, so the differences are statistically significant. Details of the Var.1 and Var.2 crosstabulation are shown in Table S4 of the supporting information.
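As a sketch of how these three tests can be reproduced from a 2 × 2 crosstabulation, the following Python code computes the Pearson chi-square, likelihood-ratio, and linear-by-linear association statistics with SciPy. The counts in the table are placeholders, not the values behind Table 8.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency, pearsonr

# Illustrative 2x2 crosstab (placeholder counts):
# rows = Var.1 (model A vs. model B), cols = Var.2 (voxel overlaps GT or not)
table = np.array([[9500, 500],
                  [9100, 900]])

# Pearson chi-square test.
chi2_stat, p_pearson, dof, _ = chi2_contingency(table, correction=False)

# Likelihood-ratio (G) test.
g_stat, p_lr, _, _ = chi2_contingency(table, correction=False,
                                      lambda_="log-likelihood")

# Linear-by-linear association: (n - 1) * r^2 with 1 degree of freedom,
# where r is the Pearson correlation of the expanded row/column codes.
rows = np.repeat([0, 1], table.sum(axis=1))
cols = np.concatenate([np.repeat([0, 1], row) for row in table])
r, _ = pearsonr(rows, cols)
n = table.sum()
lbl_stat = (n - 1) * r ** 2
p_lbl = chi2.sf(lbl_stat, df=1)

print(p_pearson, p_lr, p_lbl)
```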
In addition to the indicators mentioned above, calibration is also an important factor in evaluating network models [45,46,47]. Different from the machine learning classification problem, the image segmentation task is pixel-level processing. Inspired by [48], we define the expected calibration error (ECE) as follows:

ECE = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{PRE}(B_m) - \mathrm{conf}(B_m) \right|

where PRE(B_m) represents the average PRE value in a bin B_m, conf(B_m) represents the average of all predicted probability values under this bin, B_m is the m-th interval obtained by dividing the probability range, and N is the total number of voxels. This study divides the probability range into five bins (M = 5) with a gap of 0.2. Different from [48], because the positive and negative samples are unbalanced, almost any classifier can obtain a high ACC value; therefore, this study replaces ACC with the PRE of the positive class as the observation value. The resulting ECE values are reported in Table 7. The models achieve low ECE, indicating that the neural networks have good calibration: the output of the proposed model matches its confidence.
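A minimal NumPy sketch of this ECE variant is given below. It treats the voxels falling in each probability bin as positive predictions at that confidence and compares the bin's mean confidence with the observed positive fraction (the PRE-style observation described above); the function name and binning details are illustrative.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=5):
    """ECE with the positive-class PRE as the per-bin observation.
    probs:  1-D array of predicted positive-class probabilities (per voxel).
    labels: 1-D binary array of ground-truth voxel labels."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(probs, edges[1:-1])   # five bins of width 0.2
    n = len(probs)
    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_idx == m
        if not in_bin.any():
            continue
        # Observed fraction of truly positive voxels among those assigned
        # this confidence range (the PRE-style observation value).
        pre = labels[in_bin].mean()
        conf = probs[in_bin].mean()             # average confidence in the bin
        ece += (in_bin.sum() / n) * abs(pre - conf)
    return ece

# Example: ece = expected_calibration_error(probs.ravel(), gt.ravel())
```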
By comprehensive analysis, the proposed method has certain advantages on both datasets, which verifies that it is stable and generalizable. Since this work performs a global 3D reconstruction, it retains more complete global digital information than 2D segmentation and local patch segmentation. Among the existing mainstream 3D network architectures, 3D U-Net is a lightweight and cost-effective network; combined with the data enhancement and loss function proposed in this work, it has certain clinical potential.