Radiol 2021210063
Purpose: To develop a deep learning model to identify active pulmonary tuberculosis on chest radiographs.
Materials and Methods: Chest radiographs were retrospectively gathered from a multicenter consecutive cohort with pulmonary
tuberculosis who were successfully treated between 2011 and 2017, along with normal radiographs to enrich a negative class. The
pretreatment and posttreatment radiographs were labeled as positive and negative classes, respectively. A neural network was trained
with those radiographs to calculate the probability of active versus healed tuberculosis. A single-center consecutive cohort (test set
1; 89 patients, 148 radiographs) and data from one multicenter randomized controlled trial (test set 2; 366 patients, 3774
radiographs) were used to test the model. The area under the receiver operating characteristic curve (AUC) was used to evaluate the
performance of the model and of the four expert readers.
Results: In total, 6654 pre- and posttreatment radiographs from 3327 patients (mean age ± standard deviation, 55 years ± 19;
1884 men) with pulmonary tuberculosis and 3182 normal radiographs from as many patients (mean age, 53 years ± 14; 1629
men) were gathered. For test set 1, the model showed a higher AUC (0.83; 95% CI: 0.73, 0.89) than one pulmonologist (0.69;
95% CI: 0.61, 0.76; P < .001) and performed similarly to the other readers (AUC, 0.79–0.80; P = .14–.23). For 200 randomly
selected radiographs from test set 2, the model had a higher AUC (0.84) than the pulmonologists (0.71 and 0.74; P < .001 and .01,
respectively) and performed similarly to the radiologists (0.79 and 0.80; P = .08 and .06, respectively). The model output increased
by 0.30 on average with a higher degree of smear positivity (95% CI: 0.20, 0.39; P < .001) and decreased during treatment
(baseline, 3 months, and 6 months: 0.85, 0.51, and 0.26, respectively).
Conclusion: A deep learning model performed similarly to radiologists for accurately determining the activity of pulmonary
tuberculosis on chest radiographs; it also was able to follow posttreatment changes.
© RSNA, 2021
Table 1: Demographic Characteristics of Patients Treated for Pulmonary Tuberculosis in the Training and Validation Data Sets
Table 2: Number of Chest Radiographs in the Training, Validation, and Test Data Sets
Table 3: Areas under the Receiver Operating Characteristic Curve for the Deep Learning Model and Human Readers
Statistical Analysis
The area under the curve at receiver operating characteristic analyses was used to evaluate the performance of the deep neural
network and human readers in test sets 1 and 2. The empirical method suggested by DeLong et al (13) was used to compare
receiver operating characteristic curves with each other. The degree of agreement across human readers was assessed with the Cohen
κ coefficient. We evaluated the relationship between the neural network score and the readers’ grading and across the readers by
using Spearman correlation coefficients.
For test set 2 analyses, we assessed the correlation between the network-driven logit-transformed disease activity score and
the degree of smear positivity on sputum smear test (Appendix E1 [online]) results by using a linear mixed model with
patients as a random effect to account for multiple measurements per patient. The association of the activity score with the degree
of smear positivity was evaluated for an ordinal logistic regression model with use of generalized estimating equation analysis
to account for multiple measurements per patient. The linear mixed model was also used to evaluate whether the activity
score decreased with treatment.
P < .05 was considered indicative of statistically significant difference. Statistical analyses were performed by using SAS
version 9.4 (SAS Institute), SPSS version 23 (IBM), and MedCalc version 19.2.6 (MedCalc Software).

Figure 2: Graph shows receiver operating characteristic curves of the neural network model and human readers in test set 2.

Results

Characteristics of the Study Sample
A total of 6654 pre- and posttreatment chest radiographs from 3327 patients with pulmonary tuberculosis (mean age ± standard
deviation, 55 years ± 19; 1884 men) were included in this study (Fig 1, Table 1), along with 3182 normal chest radiographs in
3182 patients from hospital A (mean age, 53 years ± 14; 1629 men) for training and validation.

Performance of the Deep Neural Network
For the test set 1, which consisted of 148 chest radiographs, the network had an AUC of 0.83 (95% CI: 0.76, 0.89; P < .001)
(Table 3; Fig E2 [online]). The two pulmonologists had AUCs of 0.69 (95% CI: 0.61, 0.76; P < .001) and 0.79 (95% CI: 0.71,
0.85; P < .001), and the two radiologists had AUCs of 0.80 (95% CI: 0.73, 0.86; P < .001) and 0.79 (95% CI: 0.71, 0.85;
P < .001). The neural network showed higher performance than reader 1 (P < .001), but we did not find evidence of a difference
in its performance from that of the other human readers (P = .16, .23, and .14). The model’s and readers’ AUCs for
smear-positive cases were 0.89 and 0.83–0.93, respectively, and those for smear-negative cases were 0.82 and 0.64–0.76. The Cohen
κ values between pairs of pulmonologists and radiologists were 0.28 and 0.33, respectively, on a five-point scale. The κ values
were 0.53 for the two pulmonologists and 0.54 for the two radiologists when the results were dichotomized as active or healed
tuberculosis (Table E5 [online]). The Spearman correlation coefficients between the network and each reader were between 0.59
and 0.73, and those across the readers were between 0.55 and 0.80 (Table E5 [online]).
For test set 2, which comprised 645 chest radiographs when considering pre- and posttreatment radiographs only, the
neural network’s AUC was 0.82 (95% CI: 0.79, 0.85; P < .001). For 200 randomly selected chest radiographs from test set 2, the
AUC was 0.84 (95% CI: 0.78, 0.89; P < .001) (Table 3, Fig 2). For the same data set, the pulmonologists had AUCs of 0.71
(95% CI: 0.64, 0.77; P < .001) and 0.74 (95% CI: 0.67, 0.80; P < .001), and the radiologists had AUCs of 0.79 (95% CI:
0.72, 0.84; P < .001) and 0.80 (95% CI: 0.73, 0.85; P < .001). The neural network showed higher performance than
the pulmonologists (P < .001 and P = .001, respectively), but we did not find evidence of a difference in performance
between the model and the thoracic radiologists (P = .08 and P = .06, respectively) (Table 3, Fig 2). The Cohen κ values
between pairs of pulmonologists and radiologists were 0.20 and 0.33, respectively, on a five-point scale. The κ values were 0.33
for the two pulmonologists and 0.58 for the two radiologists when the results were dichotomized as active or healed
tuberculosis (Table E5 [online]). Spearman correlation coefficients between the network and each reader were between 0.54 and
0.76, and those across the readers were between 0.64 and 0.79 (Table E5 [online]).
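The AUC and Cohen κ values reported here rest on two simple quantities. As a minimal sketch (not the authors' SAS, SPSS, or MedCalc workflow), the Python below computes an empirical AUC as the probability that a positive radiograph receives a higher score than a negative one, and an unweighted Cohen κ before and after a five-point rating is collapsed into two classes. The function names, the toy scores and ratings, the use of an unweighted κ, and the rule that ratings of 4 or 5 count as "active" are all illustrative assumptions; the confidence intervals and DeLong comparisons are not reproduced.

```python
from collections import Counter


def empirical_auc(scores, labels):
    """Empirical AUC: P(positive score > negative score) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen kappa for two raters scoring the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the two raters' marginal category frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def dichotomize(ratings, threshold=4):
    """Collapse five-point ratings to 1 (active) or 0 (healed); threshold is assumed."""
    return [1 if r >= threshold else 0 for r in ratings]


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_scores = [0.96, 0.80, 0.55, 0.30, 0.60, 0.20, 0.10, 0.05]
reader_1 = [5, 4, 4, 2, 3, 2, 1, 1]
reader_2 = [4, 5, 3, 2, 2, 2, 1, 2]

model_auc = empirical_auc(model_scores, labels)
kappa_five_point = cohen_kappa(reader_1, reader_2)
kappa_binary = cohen_kappa(dichotomize(reader_1), dichotomize(reader_2))

print(f"AUC: {model_auc:.3f}")
print(f"kappa (five-point): {kappa_five_point:.3f}")
print(f"kappa (dichotomized): {kappa_binary:.3f}")
```

In this toy example the two readers often differ by one grade, so exact five-point agreement is low, while agreement on the collapsed active-versus-healed call is higher. That is the same direction of change as the κ values reported for the five-point and dichotomized readings.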
Disease activity scores of 0.15 and 0.82 determined active pulmonary tuberculosis at sensitivity and specificity of 95%,
respectively, in test set 2. The cutoff at 95% sensitivity provided a specificity of 48.5% (95% CI: 37.1, 60.2) in test set 1
and 26.0% (95% CI: 21.3, 31.3) in test set 2. The cutoff at 95% specificity produced a sensitivity of 40.0% (95% CI: 30.0,
51.0) in test set 1 and 49.6% (95% CI: 44.3, 54.8) in test set 2. For radiographs that the network misclassified in test set
2, the readers’ AUCs were below 0.3 (95% sensitivity setting) and 0.5 (95% specificity setting), indicating that on the
radiographs misclassified by the model, it was also difficult for human experts to determine the disease activity (Table E6
[online]). Model-derived disease activity scores according to the Likert-based human rater assessment are summarized in
Table E7 (online).

Figure 3: Box and whisker plot shows positive correlation between disease activity score and microscopic grade of
Mycobacterium sputum smear test (negative, no bacilli detected on smear; ± [ie, equivocal positive], one to two bacilli found
on 300 microscopic fields; 1+, one to nine bacilli found per 100 fields; 2+, one to nine bacilli found per 10 fields; 3+, one
to nine bacilli per field; 4+, more than nine bacilli found per field). The Spearman correlation coefficient of the two
variables was 0.34 (P < .001). Linear regression showed a sputum grade coefficient of 0.083 and a y-intercept of 0.53
(P < .001). Center lines represent median values; upper and lower borders of the boxes indicate 25th and 75th percentile
values; upper and lower ends of vertical dotted lines show the 25th and 75th percentiles ± 1.5 interquartile range; dots
represent outlier values.

Relationship between the Disease Activity Score and Sputum Smear Test
The median disease activity scores were 0.36 (interquartile range [IQR], 0.15–0.74), 0.72 (IQR, 0.44–0.83), 0.82 (IQR,
0.37–0.94), 0.70 (IQR, 0.55–0.90), 0.85 (IQR, 0.66–0.98), and 0.89 (IQR, 0.81–0.99) for the negative (no bacilli detected on
smear), equivocal positive (one to two bacilli found on 300 microscopic fields), 1+ (one to nine bacilli found per 100
fields), 2+ (one to nine bacilli found per 10 fields), 3+ (one to nine bacilli per field), and 4+ (more than nine bacilli
found per field) degrees of sputum smear positivity, respectively (Fig 3; Table E8 [online]). The linear mixed model showed
that the log odds of the disease activity score increased by 0.30 on average (95% CI: 0.20, 0.39; P < .001) when the degree
of sputum smear positivity became higher. The generalized estimating equation analysis of the ordinal logistic regression
model showed that the odds of having a higher degree of the sputum smear test increased by 1.36 (95% CI: 1.22, 1.50;
P < .001) per 0.1 increase in the disease activity score. The analysis also showed that the odds of having a lower degree of
sputum smear positivity increased by 1.94 (95% CI: 1.63, 2.30; P < .001) per 1-month elapse of the antituberculosis treatment.

Table 4: Temporal Changes in Model Output during Antituberculosis Treatment
Month of Treatment   No. of Radiographs   Mean Model Output*   Median Model Output†
0                    783                  0.72 ± 0.28          0.85 (0.52–0.96)
1                    1083                 0.69 ± 0.30          0.83 (0.43–0.95)
2                    525                  0.63 ± 0.32          0.72 (0.31–0.93)
3                    384                  0.52 ± 0.31          0.51 (0.21–0.85)
4                    390                  0.42 ± 0.28          0.36 (0.17–0.66)
5                    333                  0.37 ± 0.26          0.29 (0.16–0.55)
6                    294                  0.33 ± 0.23          0.26 (0.15–0.46)
7                    155                  0.34 ± 0.25          0.25 (0.13–0.49)
8                    21                   0.27 ± 0.21          0.18 (0.12–0.39)
9                    4                    0.15 ± 0.11          0.11 (0.09–0.22)
Note: Model output was a value between 0 and 1 for the probability of active tuberculosis.
* Data are means ± standard deviations.
† Data in parentheses are interquartile ranges.

Disease Activity Scores during Antituberculosis Treatment
In test set 2, the median disease activity scores were 0.85 (IQR, 0.52–0.96), 0.51 (IQR, 0.21–0.85), 0.26 (IQR, 0.15–0.46) and
0.11 (IQR, 0.09–0.22) at 0, 3, 6, and 9 months, respectively, since the start of treatment (Table 4). The linear mixed model
showed that the disease activity score gradually decreased by 0.37 on average (95% CI: 0.35, 0.39; P < .001) on the logit scale
of disease activity score per 1 month of treatment duration (Figs 4–6).
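To make the logit-scale result concrete: a decrease of 0.37 per month in log odds corresponds to multiplying the odds of the score by exp(-0.37) each month. The sketch below is illustrative arithmetic only. It starts from the baseline median of 0.85 and applies the reported slope; it is not the fitted linear mixed model, whose intercept and patient-level random effects are not given in the text. The helper names `logit`, `inv_logit`, and `project_score` are assumptions introduced for this example.

```python
import math


def logit(p):
    """Log odds of a probability-like score strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("score must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a log-odds value back to the 0-1 score scale."""
    return 1.0 / (1.0 + math.exp(-x))


def project_score(baseline, slope_per_month, months):
    """Shift the baseline score along the logit scale by slope * months."""
    return inv_logit(logit(baseline) + slope_per_month * months)


SLOPE_PER_MONTH = -0.37   # reported average monthly change on the logit scale
BASELINE_SCORE = 0.85     # median score at month 0, used here as a starting point

# Multiplicative change in the odds for each month of treatment.
monthly_odds_ratio = math.exp(SLOPE_PER_MONTH)

# Projected scores at the months highlighted in the text (assumed, not fitted).
projected = {
    month: project_score(BASELINE_SCORE, SLOPE_PER_MONTH, month)
    for month in (0, 3, 6, 9)
}

print(f"odds multiplier per month: {monthly_odds_ratio:.3f}")
for month, score in projected.items():
    print(f"month {month}: projected score {score:.2f}")
```

The projected values need not match the observed medians in Table 4, because a single median and an average slope ignore between-patient variation. The point of the sketch is only that a constant step on the logit scale produces a nonlinear change on the 0 to 1 score scale.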
Discussion
Several deep learning approaches have been reported to differentiate active pulmonary tuberculosis from normal lungs
on chest radiographs (6,14,15), but this study differs from prior research by introducing radiographs that showed
postinflammatory sequelae, including fibrotic changes, granulomas, and volume loss of the lung. We collected only one image pair
per patient (ie, one radiograph showing active tuberculosis and one showing healed tuberculosis) in thousands of consecutive
multicenter patients to encompass the full spectrum of initial radiologic manifestations and sequelae in the data set while
avoiding similar or redundant images. Notably, the training data set comprised a 7-year consecutive cohort from six hospitals;
thus, the data set included the full spectrum of typical and atypical radiographic findings of active and healed pulmonary
tuberculosis that accompanied intrathoracic extrapulmonary tuberculosis (eg, tuberculous pleurisy) and underlying lung disease
(eg, interstitial lung abnormalities).

Figure 4: Box and whisker plot shows a temporal change in disease activity score (on log-odds scale) during antituberculosis
treatment. Center lines represent median values; upper and lower borders of the boxes indicate 25th and 75th percentile
values; upper and lower ends of vertical dotted lines show the 25th and 75th percentiles ± 1.5 interquartile range;
dots represent outlier values.

Figure 5: Representative images in a 44-year-old woman with pulmonary tuberculosis in test set 1 (A) before treatment, (B) after treatment, and (C) with magnified
view of the lung lesion. (A) The model was able to accurately determine tuberculosis activity, producing a high disease activity score of 0.96. Pulmonologists interpreted
the radiograph as definitely active tuberculosis, and radiologists interpreted it as probably active tuberculosis. (B) After 6 months of antituberculosis medication, the disease
activity score decreased to a low value of 0.09. Pulmonologists interpreted the radiograph as probably healed tuberculosis and definitely healed tuberculosis, and
radiologists interpreted it as probably healed tuberculosis and equivocal activity. Human experts mostly accurately determined tuberculosis activity, but there were some
discrepancies on posttreatment chest radiographs. (C) The model detected clustered tiny nodular opacities with fuzzy margins and suggested a high score of active tuberculosis. On
the other hand, the model did not respond to tiny nodular opacities with clear margins.

Figure 6: Representative consecutive images in a 53-year-old man during treatment for pulmonary tuberculosis at (A) 7 days (disease activity score, 0.94), (B) 49
days (disease activity score, 0.68), (C) 140 days (disease activity score, 0.38), and (D) 175 days (disease activity score, 0.18) since the start of antituberculosis treatment.
Disease activity scores calculated by the model decreased gradually during antituberculosis treatment. Heat maps (right panels) also showed a gradual decrease in
intensity and territories of active lung lesions.

This deep learning network may be advantageous in countries with a high burden of tuberculosis where spontaneously healed
or previously treated patients with tuberculosis are prevalent. Those countries usually suffer from low incomes and a shortage
of expert imaging professionals (1). It is important to triage patients with suspected tuberculosis at chest radiography into
those who should be tested bacteriologically or with the Xpert MTB/RIF assay (4), and any radiographic abnormalities are
generally regarded as a positive result, given that tuberculosis can sometimes manifest atypically (4). If a deep learning network
can accurately differentiate active tuberculosis from healed tuberculosis on radiographs, then this will provide more efficient
triage for testing in limited-resource settings by decreasing unnecessary testing in individuals less likely to have active
tuberculosis (Table E9 [online]).
Another potential application of the network is to monitor and determine the treatment response of intractable mycobacterial
diseases. Specifically, the disease activity score was correlated with the grade of the sputum smear and decreased with the
duration of treatment. Those findings are in accordance with previously reported results for the visual severity of pulmonary
tuberculosis on radiographs (16). Multidrug- or extensively drug-resistant pulmonary tuberculosis typically requires
long-term treatment. Current proxy markers for treatment success are time to sputum culture conversion and culture conversion
status at 2 months or 6 months (17,18). However, these markers have some limitations in either sensitivity or specificity, and
confirming negative conversion takes 2 and 8 weeks in liquid and solid culture media, respectively. Given the finding that
lower network-derived disease activity scores were associated with a lower bacilli burden, monitoring the activity score on
chest radiographs during treatment may supplement current prognostic markers. The same approach can be attempted to
monitor the treatment response of nontuberculous mycobacterial lung diseases that share similar radiographic findings with
tuberculosis (19), for which objective quantitative tools for monitoring during treatment are scarce.
This study had limitations. First, the network was validated with a limited quantity of retrospectively collected data. Also,
the data for test set 2 were obtained from three of the four hospitals where the training data were collected. Furthermore, the
addition of normal radiographs to negative cases may inflate the performance of the network. Second, the network could not
separate patients with healed tuberculosis from healthy individuals. Radiographic abnormalities in active tuberculosis can
completely resolve in up to 60% of patients with treated tuberculosis (20), and posttreatment fibrotic sequelae may not develop if
lesions are healed before forming necrosis (21). Differentiating patients with healed tuberculosis from healthy individuals on
radiographs would be more difficult than differentiating active tuberculosis from healed tuberculosis, and further study is
warranted for this task. Third, despite the decrease in the disease
activity score during treatment, a considerable overlap existed. Physicians would be needed to assess treatment response or
tuberculosis activity in a comprehensive evaluation of symptomatic improvement, score change, and sputum or culture conversion.
Fourth, our cohort did not include patients with human immunodeficiency virus and tuberculosis (22), so the performance of
the network is unknown in those patients. Fifth, a high disease activity score with the network may result from parenchymal
abnormalities of other pathogens in patients with healed tuberculosis, such as nontuberculous mycobacteria or aspergillosis
(7,12). Sixth, we did not apply lung segmentation, which might improve the network’s performance. Seventh, human readers and
our network tried to differentiate active tuberculosis from healed tuberculosis on a single radiograph. Human readers will
differentiate them more straightforwardly if previous radiographs are available for comparison, and our network did not consider
this. Eighth, expert readers did not have any instructions for a harmonized interpretation of radiographs, and this might have
resulted in limited agreements across readers.
In conclusion, the deep neural network was able to determine the activity of tuberculosis on chest radiographs, reflecting
bacilli burden and changes after treatment. The network may help radiologically triage patients with active tuberculosis by
excluding healed tuberculosis in high-burden countries and may assist in monitoring the activity of mycobacterial diseases that
require long-term treatment.

Author contributions: Guarantors of integrity of entire study, S.L., Y.J.L., S.H.Y.; study concepts/study design or data
acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual
content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the
work are appropriately resolved, all authors; literature research, S.L., N.K., Y.J.L., J.Y.L., J.S.K., D.J., S.H.Y.; clinical
studies, S.L., J.J.Y., N.K., Y.J.L., J.K.L., J.Y.L., J.S.K., D.J., J.M.G., S.H.Y.; experimental studies, N.K.; statistical
analysis, S.L., J.J.Y., N.K., J.S.K., M.J.J., S.H.Y.; and manuscript editing, S.L., J.J.Y., N.K., Y.J.L., J.S.K., Y.A.K., J.M.G.,
S.H.Y.

Disclosures of Conflicts of Interest: S.L. Activities related to the present article: will receive stock options from Medical
IP for the potential commercialization of the model presented in the article. Activities not related to the present article:
disclosed no relevant relationships. Other relationships: disclosed no relevant relationships. J.J.Y. disclosed no relevant
relationships. N.K. disclosed no relevant relationships. Y.J.L. disclosed no relevant relationships. J.K.L. disclosed no relevant
relationships. J.Y.L. disclosed no relevant relationships. J.S.K. disclosed no relevant relationships. Y.A.K. disclosed no
relevant relationships. D.J. disclosed no relevant relationships. M.J.J. disclosed no relevant relationships. J.M.G. Activities
related to the present article: is member of Radiology editorial board. Activities not related to the present article: received or
will receive research grants from Infinitt Healthcare, Dongkook Lifescience, and LG Electronics. Other relationships: disclosed
no relevant relationships. S.H.Y. Activities related to the present article: will receive stock options from Medical IP for the
potential commercialization of the model presented in the article. Activities not related to the present article: is the chief
medical officer of Medical IP and receives stock options as compensation. Other relationships: disclosed no relevant relationships.

References
1. World Health Organization. Global tuberculosis report 2020. Geneva, Switzerland: World Health Organization, 2020.
2. Friedrich SO, Rachow A, Saathoff E, et al. Assessment of the sensitivity and specificity of Xpert MTB/RIF assay as an early
sputum biomarker of response to tuberculosis treatment. Lancet Respir Med 2013;1(6):462–470.
3. Clifford V, He Y, Zufferey C, Connell T, Curtis N. Interferon gamma release assays for monitoring the response to treatment
for tuberculosis: a systematic review. Tuberculosis (Edinb) 2015;95(6):639–650.
4. World Health Organization. Chest radiography in tuberculosis detection: summary of current World Health Organization
recommendations and guidance on programmatic approaches. Geneva, Switzerland: World Health Organization, 2016.
5. Balabanova Y, Coker R, Fedorin I, et al. Variability in interpretation of chest radiographs among Russian clinicians and
implications for screening programmes: observational study. BMJ 2005;331(7513):379–382.
6. Hwang EJ, Park S, Jin KN, et al. Development and validation of a deep learning-based automatic detection algorithm for
active pulmonary tuberculosis on chest radiographs. Clin Infect Dis 2019;69(5):739–747.
7. Kim HY, Song KS, Goo JM, Lee JS, Lee KS, Lim TH. Thoracic sequelae and complications of tuberculosis. RadioGraphics
2001;21(4):839–858; discussion 859–860.
8. Ali MG, Muhammad ZS, Shahzad T, Yaseen A, Irfan M. Post tuberculosis sequelae in patients treated for tuberculosis: an
observational study at a tertiary care center of a high TB burden country. Eur Respir J 2018;52:PA2745.
9. Aldridge RW, Zenner D, White PJ, et al. Tuberculosis in migrants moving from high-incidence to low-incidence countries: a
population-based cohort study of 519 955 migrants screened before entry to England, Wales, and Northern Ireland. Lancet
2016;388(10059):2510–2518.
10. Tan M, Le Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: Kamalika C, Ruslan S, eds.
Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research: PMLR
Proceedings of Machine Learning Research, 2019; 6105–6114.
11. Lee JK, Lee JY, Kim DK, et al. Substitution of ethambutol with linezolid during the intensive phase of treatment of
pulmonary tuberculosis: a prospective, multicentre, randomised, open-label, phase 2 trial. Lancet Infect Dis 2019;19(1):46–55.
12. Nachiappan AC, Rahbar K, Shi X, et al. Pulmonary tuberculosis: role of radiology in diagnosis and management.
RadioGraphics 2017;37(1):52–72.
13. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating
characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837–845.
14. Khan FA, Majidulla A, Tavaziva G, et al. Chest x-ray analysis with deep learning-based software as a triage test for
pulmonary tuberculosis: a prospective study of diagnostic accuracy for culture-confirmed disease. Lancet Digit Health
2020;2(11):e573–e581.
15. Pasa F, Golkov V, Pfeiffer F, Cremers D, Pfeiffer D. Efficient deep network architectures for fast chest x-ray
tuberculosis screening and visualization. Sci Rep 2019;9(1):6268.
16. Ralph AP, Ardian M, Wiguna A, et al. A simple, valid, numerical score for grading chest x-ray severity in adult
smear-positive pulmonary tuberculosis. Thorax 2010;65(10):863–869.
17. Kurbatova EV, Cegielski JP, Lienhardt C, et al. Sputum culture conversion as a prognostic marker for end-of-treatment
outcome in patients with multidrug-resistant tuberculosis: a secondary analysis of data from two observational cohort studies.
Lancet Respir Med 2015;3(3):201–209.
18. Rajpurkar P, Irvin J, Bagul A, et al. MURA Dataset: Towards Radiologist-Level Abnormality Detection in Musculoskeletal
Radiographs. arXiv:1712.06957 2017. https://arxiv.org/abs/1712.06957. Published 2017. Accessed July 19, 2020.
19. Koh WJ, Kwon OJ, Lee KS. Nontuberculous mycobacterial pulmonary diseases in immunocompetent patients. Korean J Radiol
2002;3(3):145–157.
20. Menon B, Nima G, Dogra V, Jha S. Evaluation of the radiological sequelae after treatment completion in new cases of
pulmonary, pleural, and mediastinal tuberculosis. Lung India 2015;32(3):241–245.
21. Long R, Maycher B, Dhar A, Manfreda J, Hershfield E, Anthonisen N. Pulmonary tuberculosis treated with directly observed
therapy: serial changes in lung structure and function. Chest 1998;113(4):933–943.
22. Lee CH, Hwang JY, Oh DK, et al. The burden and characteristics of tuberculosis/human immunodeficiency virus (TB/HIV) in
South Korea: a study from a population database and a survey. BMC Infect Dis 2010;10(1):66.