PediCXR Final Manuscript
1 Smart Health Center, VinBigData JSC, Hanoi, Vietnam
2 Phu Tho Department of Health, Phu Tho, Vietnam
3 College of Engineering & Computer Science, VinUniversity, Hanoi, Vietnam
4 VinUni-Illinois Smart Health Center, Hanoi, Vietnam
5 Training and Direction of Healthcare Activities Center, Phu Tho General Hospital, Phu Tho, Vietnam
* These authors contributed equally: Hieu H. Pham and Ngoc H. Nguyen
* Corresponding author: Hieu H. Pham ([email protected])
ABSTRACT
Computer-aided diagnosis systems for adult chest radiography (CXR) have recently achieved great success thanks to the availability of large-scale, annotated datasets and the advent of high-performance supervised learning algorithms. However, the development of diagnostic models for detecting and diagnosing pediatric diseases in CXR scans has been hindered by the lack of high-quality physician-annotated datasets. To overcome this challenge, we introduce and release PediCXR, a new pediatric CXR dataset of 9,125 studies retrospectively collected from a major pediatric hospital in Vietnam between 2020 and 2021. Each scan was manually annotated by a pediatric radiologist with more than ten years of experience. The dataset was labeled for the presence of 36 critical findings and 15 diseases. In particular, each abnormal finding was localized via a rectangular bounding box on the image. To the best of our knowledge, this is the first and largest pediatric CXR dataset containing lesion-level annotations and image-level labels for the detection of multiple findings and diseases. For algorithm development, the dataset was divided into a training set of 7,728 studies and a test set of 1,397 studies. To encourage new advances in pediatric CXR interpretation using data-driven approaches, we provide a detailed description of the PediCXR data sample and make the dataset publicly available at https://physionet.org/content/pedicxr/1.0.0/.
Table 1. An overview of existing public datasets for CXR interpretation in pediatric patients.
Dataset Release year # findings # samples Image-level labels Local labels
Kermany et al.19 2018 2 5,856 Available Not available
Chen et al.20 2020 5 2,668 Available Not available
PediCXR (ours) 2021 52 9,125 Available Available
Methods
Data collection
Data collection was conducted at the Phu Tho Obstetric & Pediatric Hospital (PTOPH) between 2020 and 2021. The ethical clearance of this study was approved by the Institutional Review Board (IRB) of PTOPH. The need for obtaining informed patient consent was waived because this retrospective study did not impact clinical care or workflow at the hospital, and all patient-identifiable information in the data has been removed. We retrospectively collected more than 10,000 CXRs in DICOM format from a local picture archiving and communication system (PACS) at PTOPH. The imaging dataset was then transferred to and analyzed at the Smart Health Center, VinBigData JSC.
Overview of approach
The building of the PediCXR dataset is illustrated in Figure 1. In particular, the collection and normalization of the dataset were divided into four main steps: (1) data collection, (2) data de-identification, (3) data filtering, and (4) data labeling. We describe each step in detail below.
Figure 1. Construction of the PediCXR dataset. First, raw pediatric scans in DICOM format were collected retrospectively from the hospital's PACS at PTOPH. These images were de-identified to protect patient privacy. Then, invalid files (including adult CXR images, images of other modalities or other body parts, low-quality images, or images with incorrect orientation) were manually filtered out. After that, a web-based DICOM labeling tool called VinDr Lab was developed to remotely annotate the DICOM data. Finally, the annotated dataset was divided into a training set (N = 7,728) and a test set (N = 1,397) for algorithm development.
Data de-identification
In this study, we followed the HIPAA Privacy Rule22 to protect individually identifiable health information in the DICOM images. To this end, we removed or replaced with random values all personally identifiable information associated with the images via a two-stage de-identification process. In the first stage, a Python script was used to remove all DICOM tags containing protected health information (PHI)23, such as the patient's name, date of birth, patient ID, and acquisition date and time. For the purpose of loading and processing DICOM files, we retained only the limited set of DICOM attributes that are necessary, as indicated in Table 2 (Supplementary materials). In the second stage, we manually removed all textual information appearing in the image data, i.e., pixel annotations that could include the patient's identifiable information.
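The first-stage tag removal can be sketched as follows. This is a minimal illustration over a plain mapping of (group, element) tag tuples, not the authors' actual script (which is released at the vindr-cxr repository), and the PHI tag subset shown is a hypothetical example:

```python
# Minimal sketch of stage-one PHI tag removal. The tag subset below is
# illustrative only; the actual de-identification retains solely the
# attributes listed in Table 2 and drops everything else.
PHI_TAGS = {
    (0x0010, 0x0010),  # Patient's Name
    (0x0010, 0x0020),  # Patient ID
    (0x0010, 0x0030),  # Patient's Birth Date
    (0x0008, 0x0022),  # Acquisition Date
    (0x0008, 0x0032),  # Acquisition Time
}

def strip_phi(dataset):
    """Delete PHI elements from a dict-like mapping of DICOM tags to values."""
    for tag in list(dataset):
        if tag in PHI_TAGS:
            del dataset[tag]
    return dataset
```

With pydicom, the same idea applies to a Dataset loaded via pydicom.dcmread, deleting elements with del ds[tag] and writing the result back with ds.save_as.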
Data filtering
The collected raw data included a significant number of outliers, including CXRs of adult patients, body parts other than the chest (abdomen, spine, and others), low-quality images, and lateral CXRs. To filter the large number of CXR scans, we trained a lightweight convolutional neural network (CNN)24 to remove all outliers automatically. Next, a manual verification was performed to ensure all outliers had been fully removed.
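The automatic step can be sketched as below; predict_label stands in for the trained CNN classifier (the DICOM Imaging Router cited above), and the category names are hypothetical:

```python
def filter_outliers(paths, predict_label, keep=("pediatric frontal CXR",)):
    """Split scans into kept CXRs and outliers based on a classifier's
    predicted category; the outlier list then goes to manual verification."""
    kept, outliers = [], []
    for path in paths:
        (kept if predict_label(path) in keep else outliers).append(path)
    return kept, outliers
```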
Data labeling
The PediCXR dataset was labeled for a total of 36 findings and 15 diagnoses. These labels were divided into two categories: local labels (#1–#36) and global labels (#37–#52). The local labels are marked with bounding boxes that localize the findings, while the global labels reflect the diagnostic impression of the radiologist. This list of labels was suggested by a committee of our most experienced pediatric radiologists. To select these labels, the committee took into account two key factors: first, the findings and diseases must be prevalent; second, they must be differentiable on pediatric chest X-ray scans. Figure 2 illustrates several samples with both local and global labels annotated by our radiologists.
Figure 2. Several examples of pediatric CXR images with radiologists' annotations. Local labels marked by radiologists are plotted on the original images for visualization purposes. These annotations show abnormal findings in the scans. The global labels, which classify images into diseases, are in bold and listed at the bottom of each example.
To facilitate the labeling process, we designed and built a web-based framework called VinDr Lab25 that allows a team of experienced radiologists to annotate the data remotely. Specifically, this web-based labeling tool was developed to store, manage, and remotely annotate DICOM data. The radiologists were instructed to locate the abnormal findings in the DICOM viewer and draw the corresponding bounding boxes. All the annotators were well trained to ensure that the annotations were consistent. In addition, all the radiologists participating in the labeling process were certified in diagnostic radiology and held healthcare professional certificates. In total, three pediatric radiologists with at least 15 years of experience were involved in the annotation process. Each sample in the training set was assigned to one radiologist for annotation. Additionally, all of the participating radiologists were blinded to relevant clinical information. A set of 9,125 pediatric CXRs were randomly
Table 2. The list of DICOM tags that were retained for loading and processing raw images. All other tags were removed to protect patient privacy. Details about these tags can be found in the DICOM Standard Browser at https://dicom.innolitics.com/ciods.

(0028, 0030) Pixel Spacing: Physical distance in the patient between the center of each pixel, specified by a numeric pair: adjacent row spacing (delimiter) adjacent column spacing, in mm.
(0028, 0034) Pixel Aspect Ratio: Ratio of the vertical size and horizontal size of the pixels in the image, specified by a pair of integer values where the first value is the vertical pixel size and the second value is the horizontal pixel size.
(0028, 0100) Bits Allocated: Number of bits allocated for each pixel sample. Each sample shall have the same number of bits allocated.
(0028, 0101) Bits Stored: Number of bits stored for each pixel sample. Each sample shall have the same number of bits stored.
(0028, 0102) High Bit: Most significant bit for pixel sample data. Each sample shall have the same high bit.
(0028, 0103) Pixel Representation: Data representation of the pixel samples. Each sample shall have the same pixel representation.
(0028, 0106) Smallest Image Pixel Value: The minimum actual pixel value encountered in this image.
(0028, 0107) Largest Image Pixel Value: The maximum actual pixel value encountered in this image.
(0028, 1052) Rescale Intercept: The value b in the relationship between stored values (SV) and the output units specified in Rescale Type (0028, 1054). Each output unit is equal to m*SV + b.
(0028, 1053) Rescale Slope: Value of m in the equation specified by Rescale Intercept (0028, 1052).
(7FE0, 0010) Pixel Data: A data stream of the pixel samples that comprise the image.
(0028, 0004) Photometric Interpretation: Specifies the intended interpretation of the pixel data.
(0028, 2110) Lossy Image Compression: Specifies whether an image has undergone lossy compression (at a point in its lifetime).
(0028, 2114) Lossy Image Compression Method: A label for the lossy compression method(s) that have been applied to this image.
(0028, 2112) Lossy Image Compression Ratio: Describes the approximate lossy compression ratio(s) that have been applied to this image.
(0028, 0002) Samples per Pixel: Number of samples (planes) in this image.
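As a concrete illustration of the Rescale Slope (0028, 1053) and Rescale Intercept (0028, 1052) relationship described in Table 2, the mapping from stored pixel values to output units is linear; a minimal sketch (the defaults m = 1, b = 0 reflect the usual DICOM behavior when the attributes are absent):

```python
def apply_rescale(stored_values, slope=1.0, intercept=0.0):
    """Map stored pixel values (SV) to output units via m*SV + b, where
    m is Rescale Slope (0028, 1053) and b is Rescale Intercept (0028, 1052)."""
    return [slope * sv + intercept for sv in stored_values]
```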
annotated from the filtered data, of which 7,728 scans serve as the training set and the remaining 1,397 studies form the test set. Note that the 9,125 studies correspond to 9,125 distinct patients, and each study has a single CXR scan.
Figure 3. Distribution of abnormal findings on the training set of PediCXR. Rare findings (fewer than 10 examples) are not included.
Figure 4. Distribution of pathologies on the training set of PediCXR. Rare diseases (fewer than 10 examples) are not included.
Once the labeling was completed, the annotations of all pediatric CXRs were exported in JavaScript Object Notation (JSON) format. We developed a Python script to parse the JSON files and organize the annotations in the form of a single comma-separated values (CSV) file. Each CSV file contains labels, bounding box coordinates, and the corresponding image identifiers (IDs). The data characteristics, including patient demographics and the prevalence of each finding or disease, are summarized in Table 3. The distributions of abnormal findings and pathologies in the training set are shown in Figure 3 and Figure 4, respectively.
Data Records
The PediCXR dataset will be made available for public download on PhysioNet. We offer complete imaging data as well as ground truth labels for both the training and test datasets. The pediatric scans were split into two folders, one for training and one for testing, named "train" and "test", respectively. Since each study has only one image instance and each patient has at most one study, the value of the SOP Instance UID provided by the DICOM tag (0008, 0018) was encoded into a unique, anonymous identifier for each image. To this end, we used the Python hashlib module (see Code Availability) to encode the SOP Instance UIDs into image IDs. The radiologists' local annotations of the training set are provided in a CSV file called annotations_train.csv. Each row of the CSV file represents a bounding box annotation with the following attributes: image ID (image_id), radiologist ID (rad_id), label name (class_name), bounding box coordinates (x_min, y_min, x_max, y_max), and label class ID (class_id). The coordinates of the box's upper-left corner are (x_min, y_min), and the coordinates of the box's lower-right corner are (x_max, y_max). Meanwhile, the image-level labels of the training set are stored in a different CSV file called image_labels_train.csv, with the following fields: image ID (image_id), radiologist ID (rad_ID), and labels (labels) for both the findings and diagnoses. Each image ID is associated with a vector of multiple labels corresponding to different pathologies, with positive pathologies encoded as "1" and negative pathologies encoded as "0". Similarly, the test set's bounding-box annotations and image-level labels are saved in the files annotations_test.csv and image_labels_test.csv, respectively.
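The ID encoding and the annotation schema described above can be illustrated with a short sketch. Note two assumptions: the excerpt does not name the exact hashlib algorithm, so SHA-256 is used here as a stand-in, and the sample CSV row in the test is fabricated for illustration:

```python
import csv
import hashlib
import io

def encode_image_id(sop_instance_uid):
    """Hash a SOP Instance UID (0008, 0018) into an anonymous image ID.
    SHA-256 is an assumption; the paper only states that hashlib was used."""
    return hashlib.sha256(sop_instance_uid.encode("utf-8")).hexdigest()

def parse_annotations(csv_text):
    """Parse annotations_train.csv-style rows into bounding-box records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "image_id": row["image_id"],
            "rad_id": row["rad_id"],
            "class_name": row["class_name"],
            # Upper-left (x_min, y_min) and lower-right (x_max, y_max) corners
            "box": tuple(float(row[k]) for k in ("x_min", "y_min", "x_max", "y_max")),
            "class_id": int(row["class_id"]),
        })
    return records
```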
The primary uses for which the PediCXR dataset was conceptualized include:
• Developing and validating a predictive model for the classification of common thoracic diseases in pediatric patients.
• Developing and validating a predictive model for the localization of multiple abnormal findings on pediatric chest X-ray scans.
Finally, the released dataset has limitations that still need to be addressed in the future, including:
• The dataset does not contain clinical information associated with the DICOM images, which is essential for the interpretation of CXRs in pediatric patients.
• The number of examples for rare diseases (e.g., Congenital pulmonary airway malformation (CPAM), Congenital emphysema, Diaphragmatic hernia, Mediastinal tumor, Pleuro-pneumonia, Situs inversus, Lung tumor) or findings (Emphysema, Edema, Calcification, Chest wall mass, Bronchiectasis, Pleural thickening, Clavicle fracture, Pleuropulmonary mass, Paravertebral mass, etc.) is limited. Hence, training supervised learning algorithms, which require large-scale annotated datasets, on the PediCXR dataset to diagnose these rare diseases and findings is not reliable.
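Users can reproduce the rarity threshold applied in Figures 3 and 4 (fewer than 10 examples) when deciding which classes to train on; a minimal sketch over per-study label lists, with hypothetical label names in the test:

```python
from collections import Counter

def rare_labels(label_lists, min_count=10):
    """Count how often each label occurs across studies and return the
    labels that fall below min_count, mirroring the threshold used to
    exclude rare findings and diseases from Figures 3 and 4."""
    counts = Counter(label for labels in label_lists for label in labels)
    return {label: n for label, n in counts.items() if n < min_count}
```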
To download and use the PediCXR dataset, users are required to accept the PhysioNet Credentialed Health Data License 1.5.0. By accepting this license, users agree that they will not share access to the dataset with anyone else. For any publication that explores this resource, the authors must cite this original paper and release their code and models.
Code Availability
This study used the following open-source tools to load and process DICOM scans: Python 3.7.0 (https://www.python.org/); Pydicom 1.2.0 (https://pydicom.github.io/); OpenCV-Python 4.2.0.34 (https://pypi.org/project/opencv-python/); and Python hashlib (https://docs.python.org/3/library/hashlib.html). The code for data de-identification is publicly available at https://github.com/vinbigdata-medical/vindr-cxr. The code to train the CNN classifier for the out-of-distribution filtering task is publicly available at https://github.com/vinbigdata-medical/DICOM-Imaging-Router. VinDr Lab is open-source software and can be found at https://vindr.ai/vindr-lab.
References
1. GBD 2015 LRI Collaborators. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory tract infections in 195 countries: a systematic analysis for the Global Burden of Disease Study 2015. The Lancet Infect. Dis. 17, 1133–1161 (2017).
2. Wardlaw, T. M., Johansson, E. W., Hodge, M., World Health Organization & United Nations Children's Fund (UNICEF). Pneumonia: the forgotten killer of children (2006).
3. Hart, A. & Lee, E. Y. Pediatric chest disorders: Practical imaging approach to diagnosis. Dis. Chest, Breast, Hear. Vessel. 2019–2022, 107–125 (2019).
4. Chest radiograph (pediatric). https://radiopaedia.org/articles/chest-radiograph-paediatric. Accessed: 2021-09-24.
5. Du Toit, G., Swingler, G. & Iloni, K. Observer variation in detecting lymphadenopathy on chest radiography. Int. J. Tuberc. Lung Dis. 6, 814–817 (2002).
6. Wang, X. et al. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2097–2106, https://doi.org/10.1109/CVPR.2017.369 (2017).
7. Bustos, A., Pertusa, A., Salinas, J.-M. & de la Iglesia-Vayá, M. PadChest: A large chest X-ray image dataset with multi-label annotated reports. arXiv preprint arXiv:1901.07441 (2019).
8. Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019).
9. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317, https://doi.org/10.1038/s41597-019-0322-0 (2019).
10. Nguyen, H. Q. et al. VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations. arXiv preprint arXiv:2012.15029 (2020).
11. Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Medicine Surg. 4, 475–477, https://dx.doi.org/10.3978%2Fj.issn.2223-4292.2014.11.20 (2014).
12. Tabik, S. et al. COVIDGR dataset and COVID-SDNet methodology for predicting COVID-19 based on chest X-ray images. IEEE J. Biomed. Health Informatics 24, 3595–3605 (2020).
13. Rajpurkar, P. et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017).
14. Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine 15, e1002686, https://doi.org/10.1371/journal.pmed.1002686 (2018).
15. Majkowska, A. et al. Chest radiograph interpretation with deep learning models: Assessment with radiologist-adjudicated reference standards and population-adjusted evaluation. Radiology 294, 421–431, https://doi.org/10.1148/radiol.2019191293 (2020).
16. Rajpurkar, P. et al. CheXpedition: Investigating generalization challenges for translation of chest X-ray algorithms to the clinical setting. arXiv preprint arXiv:2002.11379 (2020).
17. Tang, Y.-X. et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. npj Digit. Medicine 3, 1–8, https://doi.org/10.1038/s41746-020-0273-z (2020).
18. Pham, H. H., Le, T. T., Tran, D. Q., Ngo, D. T. & Nguyen, H. Q. Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels. arXiv preprint arXiv:1911.06475 (2020).
19. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131.e9, https://doi.org/10.1016/j.cell.2018.02.010 (2018).
20. Chen, K.-C. et al. Diagnosis of common pulmonary diseases in children by X-ray images and deep learning. Sci. Reports 10, 1–9 (2020).
21. Gordon, L., Grantcharov, T. & Rudzicz, F. Explainable artificial intelligence for safe intraoperative decision support. JAMA Surgery 154, 1064–1065 (2019).
22. US Department of Health and Human Services. Summary of the HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (2003).
23. Isola, S. & Al Khalili, Y. Protected Health Information (PHI). https://www.ncbi.nlm.nih.gov/books/NBK553131/ (2019).
24. Pham, H. H., Do, D. V. & Nguyen, H. Q. DICOM Imaging Router: An open deep learning framework for classification of body parts from DICOM X-ray scans. arXiv preprint arXiv:2108.06490 (2021).
25. Nguyen, N. T. et al. VinDr Lab: A data platform for medical AI. https://github.com/vinbigdata-medical/vindr-lab (2021).
26. Tran, T. T. et al. Learning to automatically diagnose multiple diseases in pediatric chest radiographs using deep convolutional neural networks. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2021) (2021).
Acknowledgements
The collection of this dataset was funded by the Smart Health Center, VinBigData JSC. The authors would like to acknowledge the Phu Tho Obstetric & Pediatric Hospital for agreeing to make the PediCXR dataset publicly available. We are especially thankful to Anh T. Nguyen, Huong T.T. Nguyen, and Ngan T.T. Nguyen for their help in the data collection and labeling process.