
Published in Scientific Data, April 2023. DOI: 10.1038/s41597-023-02102-5


PediCXR: An open, large-scale chest radiograph dataset for interpretation of common thoracic diseases in children

Hieu H. Pham1,3,4,*, Ngoc H. Nguyen2, Thanh T. Tran1, Tuan N.M. Nguyen5, and Ha Q. Nguyen1

1 Smart Health Center, VinBigData JSC, Hanoi, Vietnam
2 Phu Tho Department of Health, Phu Tho, Vietnam
3 College of Engineering & Computer Science, VinUniversity, Hanoi, Vietnam
4 VinUni-Illinois Smart Health Center, Hanoi, Vietnam
5 Training and Direction of Healthcare Activities Center, Phu Tho General Hospital, Phu Tho, Vietnam
* These authors contributed equally: Hieu H. Pham and Ngoc H. Nguyen
* Corresponding author: Hieu H. Pham ([email protected])

ABSTRACT

Computer-aided diagnosis systems in adult chest radiography (CXR) have recently achieved great success thanks to the availability of large-scale, annotated datasets and the advent of high-performance supervised learning algorithms. However, the development of diagnostic models for detecting and diagnosing pediatric diseases in CXR scans has been hindered by the lack of high-quality physician-annotated datasets. To overcome this challenge, we introduce and release PediCXR, a new pediatric CXR dataset of 9,125 studies retrospectively collected from a major pediatric hospital in Vietnam between 2020 and 2021. Each scan was manually annotated by a pediatric radiologist with more than ten years of experience. The dataset was labeled for the presence of 36 critical findings and 15 diseases. In particular, each abnormal finding was identified via a rectangular bounding box on the image. To the best of our knowledge, this is the first and largest pediatric CXR dataset containing lesion-level annotations and image-level labels for the detection of multiple findings and diseases. For algorithm development, the dataset was divided into a training set of 7,728 studies and a test set of 1,397 studies. To encourage new advances in pediatric CXR interpretation using data-driven approaches, we provide a detailed description of the PediCXR data sample and make the dataset publicly available at https://physionet.org/content/pedicxr/1.0.0/.

Background & Summary

Common thoracic diseases cause several hundred thousand deaths every year among children under five years old1,2. The chest radiograph, or CXR, is the first-line and most commonly performed imaging examination in the assessment of pediatric patients3. CXR scans of pediatric patients are interpreted for a range of indications and critical findings, in particular common thoracic diseases in children such as pneumonia, bronchitis, and cardiovascular diseases (CVDs). Depending on the patient's age, the difficulty of the examination varies, often requiring a specialist in pediatric diagnostic imaging with in-depth knowledge of the radiological signs of different lung conditions4. Additionally, inter-observer and intra-observer agreement in pediatric CXR interpretation has been reported to be low5. This leaves room for the development of data-driven approaches and computational tools to assist pediatricians in the diagnosis of common thoracic diseases and to reduce their workload.

Computer-aided diagnosis (CAD) systems for the identification of lung abnormalities in adult CXRs have recently achieved great success thanks to the availability of large labeled datasets6-10. Many large-scale CXR datasets of adult patients, such as Montgomery County chest X-ray (MC)11, Shenzhen chest X-ray11, ChestX-ray86, COVIDGR12, ChestX-ray146, PadChest7, CheXpert8, MIMIC-CXR9, and VinDr-CXR10, have been established and released in recent years. These datasets have boosted the exploration of new machine learning-based approaches for CXR interpretation in adults8,13-18. Unfortunately, the creation of pediatric CXR datasets remains largely unexplored, and the number of benchmark pediatric CXR datasets is limited. This is the main obstacle to developing new machine learning-based CAD systems for pediatric CXR and transferring them into clinical practice.

In an effort to provide a large-scale pediatric CXR dataset with high-quality annotations for the research community, we have built the PediCXR dataset in DICOM format. The dataset consists of 9,125 posteroanterior (PA) view CXR scans of patients younger than 10 years that were retrospectively collected from a major pediatric hospital in Vietnam from 2020 to 2021. In particular, all CXR scans come with both the localization of critical findings and the classification of common thoracic diseases. These images were annotated by a group of three radiologists with at least 10 years of experience for the presence of 36 critical findings (local labels) and 15 diagnoses (global labels). Here, the local labels are annotated with rectangular bounding boxes that localize the findings, while the global labels reflect the diagnostic impression of the radiologist at the image level. For algorithm development, we randomly divided the dataset into two parts: a training set of 7,728 scans (84.7%) and a test set of 1,397 scans (15.3%). To the best of our knowledge, the released PediCXR is currently the largest public pediatric CXR dataset with radiologist-generated annotations in both the training and test sets. Table 1 below shows an overview of existing public datasets for CXR interpretation in pediatric patients, compared with PediCXR. Compared to previous works, the PediCXR dataset has two main advantages. First, it is labeled for multiple findings and diseases, whereas most existing pediatric CXR datasets have focused on a single disease such as pneumonia19 or pneumothorax20. Second, it provides bounding box annotations at the lesion level, which is useful for developing explainable artificial intelligence models21 for CXR interpretation in children. We believe the introduction of PediCXR provides a suitable imaging source for investigating the ability of supervised machine learning models to identify common lung diseases in pediatric patients.

Table 1. An overview of existing public datasets for CXR interpretation in pediatric patients.
Dataset Release year # findings # samples Image-level labels Local labels
Kermany et al.19 2018 2 5,856 Available Not available
Chen et al.20 2020 5 2,668 Available Not available
PediCXR (ours) 2021 52 9,125 Available Available

Methods

Data collection

Data collection was conducted at the Phu Tho Obstetric & Pediatric Hospital (PTOPH) between 2020 and 2021. The ethical clearance of this study was approved by the Institutional Review Board (IRB) of PTOPH. The need for obtaining informed patient consent was waived because this retrospective study did not impact clinical care or workflow at the hospital, and all patient-identifiable information in the data has been removed. We retrospectively collected more than 10,000 CXRs in DICOM format from the local picture archiving and communication system (PACS) at PTOPH. The imaging dataset was then transferred to and analyzed at the Smart Health Center, VinBigData JSC.

Overview of approach

The building of the PediCXR dataset is illustrated in Figure 1. In particular, the collection and normalization of the dataset were divided into four main steps: (1) data collection, (2) data de-identification, (3) data filtering, and (4) data labeling. We describe each step in detail below.

Figure 1. Construction of the PediCXR dataset. First, raw pediatric scans in DICOM format were collected retrospectively from the hospital's PACS at PTOPH. These images were de-identified to protect patients' privacy. Then, invalid files (including adult CXR images, images of other modalities or other body parts, low-quality images, and images with incorrect orientation) were manually filtered out. After that, a web-based DICOM labeling tool called VinDr Lab was developed to annotate the DICOM data remotely. Finally, the annotated dataset was divided into a training set (N = 7,728) and a test set (N = 1,397) for algorithm development.

Data de-identification

In this study, we follow the HIPAA Privacy Rule22 to protect individually identifiable health information in the DICOM images. To this end, all personally identifiable information associated with the images was removed or replaced with random values via a two-stage de-identification process. In the first stage, a Python script was used to remove all DICOM tags containing protected health information (PHI)23, such as the patient's name, date of birth, patient ID, and acquisition time and date. For the purpose of loading and processing DICOM files, we retained only a limited set of necessary DICOM attributes, as listed in Table 2 (Supplementary materials). In the second stage, we manually removed all textual information appearing in the image data, i.e., burned-in pixel annotations that could include patient-identifiable information.
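The released de-identification code is available in the repository listed under Code Availability; the snippet below is only a minimal sketch of the first stage using Pydicom, with the whitelist abbreviated to a few of the attributes in Table 2.

import pydicom

# Abbreviated whitelist of retained attributes (see Table 2 for the full list).
RETAINED_TAGS = {
    (0x0010, 0x0040),  # Patient's Sex
    (0x0010, 0x1010),  # Patient's Age
    (0x0028, 0x0010),  # Rows
    (0x0028, 0x0011),  # Columns
    (0x0028, 0x1052),  # Rescale Intercept
    (0x0028, 0x1053),  # Rescale Slope
    (0x7FE0, 0x0010),  # Pixel Data
}

def strip_phi_tags(src_path: str, dst_path: str) -> None:
    """Stage 1: drop every DICOM tag that is not on the whitelist and save a copy."""
    ds = pydicom.dcmread(src_path)
    for tag in [elem.tag for elem in ds]:  # snapshot the tags before deleting
        if (tag.group, tag.element) not in RETAINED_TAGS:
            del ds[tag]
    ds.save_as(dst_path)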

Data filtering

The collected raw data included a significant number of outliers, including CXRs of adult patients, images of body parts other than the chest (abdomen, spine, and others), low-quality images, and lateral CXRs. To filter the large number of CXR scans, we trained a lightweight convolutional neural network (CNN)24 to remove all outliers automatically, as sketched below. A manual verification was then performed to ensure that all outliers had been fully removed.
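The released classifier is part of the DICOM-Imaging-Router repository (see Code Availability); the following is only an illustrative sketch of such a binary "valid pediatric frontal CXR vs. outlier" filter, assuming PyTorch and torchvision, which are not named in the paper, and an arbitrary architecture and threshold.

import torch
import torch.nn as nn
from torchvision import models

def build_filter_model() -> nn.Module:
    """Lightweight CNN with two output classes: valid frontal pediatric CXR vs. outlier."""
    model = models.mobilenet_v2(weights=None)  # architecture choice is illustrative only
    model.classifier[1] = nn.Linear(model.last_channel, 2)
    return model

@torch.no_grad()
def is_outlier(model: nn.Module, image: torch.Tensor, threshold: float = 0.5) -> bool:
    """image: preprocessed (1, 3, H, W) float tensor; the 0.5 threshold is an assumption."""
    model.eval()
    prob_outlier = torch.softmax(model(image), dim=1)[0, 1].item()
    return prob_outlier >= threshold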

Data labeling

The PediCXR dataset was labeled for a total of 36 findings and 15 diagnoses. These labels were divided into two categories: local labels (#1-#36) and global labels (#37-#52). The local labels are marked with bounding boxes that localize the findings, while the global labels reflect the diagnostic impression of the radiologist. This list of labels was suggested by a committee of the most experienced pediatric radiologists. In selecting these labels, the committee took two key factors into account: first, the findings and diseases should be prevalent; second, they should be differentiable on pediatric chest X-ray scans. Figure 2 illustrates several samples with both local and global labels annotated by our radiologists.

Figure 2. Several examples of pediatric CXR images with radiologists' annotations. Local labels marked by radiologists are plotted on the original images for visualization purposes. These annotations show abnormal findings in the scans. The global labels, which classify images into diseases, are shown in bold and listed at the bottom of each example.

To facilitate the labeling process, we designed and built a web-based framework called VinDr Lab25 that allows a team of experienced radiologists to annotate the data remotely. Specifically, it is a web-based labeling tool developed to store, manage, and remotely annotate DICOM data. The radiologists were instructed to locate abnormal findings in the DICOM viewer and draw the corresponding bounding boxes. All annotators were well trained to ensure that the annotations were consistent. In addition, all radiologists participating in the labeling process were certified in diagnostic radiology and held healthcare professional certificates. In total, three pediatric radiologists with at least 15 years of experience were involved in the annotation process. Each sample in the training set was assigned to one radiologist for annotation. Additionally, all participating radiologists were blinded to relevant clinical information.
Table 2. The list of DICOM tags that were retained for loading and processing raw images. All other tags were removed to protect patient privacy. Details about these tags can be found in the DICOM Standard Browser at https://dicom.innolitics.com/ciods.
DICOM Tag Attribute Name Description

(0010, 0040) Patient’s Sex Sex of the named patient.

(0010, 1010) Patient’s Age Age of the patient.

(0010, 1020) Patient’s Size Length or size of the patient, in meters.

(0010, 1030) Patient’s Weight Weight of the patient, in kilograms.

(0028, 0010) Rows Number of rows in the image.

(0028, 0011) Columns Number of columns in the image.

(0028, 0030) Pixel Spacing Physical distance in the patient between the center of each pixel, specified by a numeric pair - adjacent row spacing (delimiter) adjacent column spacing in mm.

(0028, 0034) Pixel Aspect Ratio Ratio of the vertical size and horizontal size of the pixels in the
image specified by a pair of integer values where the first value is
the vertical pixel size, and the second value is the horizontal pixel
size.

(0028, 0100) Bits Allocated Number of bits allocated for each pixel sample. Each sample
shall have the same number of bits allocated.

(0028, 0101) Bits Stored Number of bits stored for each pixel sample. Each sample shall
have the same number of bits stored.

(0028, 0102) High Bit Most significant bit for pixel sample data. Each sample shall have
the same high bit.

(0028, 0103) Pixel Representation Data representation of the pixel samples. Each sample shall have
the same pixel representation.

(0028, 0106) Smallest Image Pixel Value The minimum actual pixel value encountered in this image.

(0028, 0107) Largest Image Pixel Value The maximum actual pixel value encountered in this image.

(0028, 1050) Window Center Window center for display.

(0028, 1051) Window Width Window width for display.

(0028, 1052) Rescale Intercept The value b in relationship between stored values (SV) and the
output units specified in Rescale Type (0028,1054). Each output
unit is equal to m*SV + b.

(0028, 1053) Rescale Slope Value of m in the equation specified by Rescale Intercept
(0028,1052).

(7FE0, 0010) Pixel Data A data stream of the pixel samples that comprise the image.

(0028, 0004) Photometric Interpretation Specifies the intended interpretation of the pixel data.

(0028, 2110) Lossy Image Compression Specifies whether an image has undergone lossy compression
(at a point in its lifetime).

(0028, 2114) Lossy Image Compression Method A label for the lossy compression method(s) that have been applied to this image.

(0028, 2112) Lossy Image Compression Ratio Describes the approximate lossy compression ratio(s) that have been applied to this image.

(0028, 0002) Samples per Pixel Number of samples (planes) in this image.

(0028, 0008) Number of Frames Number of frames in a multi-frame image.
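The attributes retained in Table 2 are sufficient to decode and display the images. As an illustration only, the sketch below converts a DICOM file to an 8-bit array with Pydicom, applying the linear rescale relationship (output = Rescale Slope * SV + Rescale Intercept) and, when present, the display window given by Window Center/Width; the fallback values and the MONOCHROME1 inversion are assumptions rather than part of the released pipeline.

import numpy as np
import pydicom
from pydicom.multival import MultiValue

def dicom_to_uint8(path: str) -> np.ndarray:
    """Decode a CXR DICOM into an 8-bit array using the attributes listed in Table 2."""
    ds = pydicom.dcmread(path)
    pixels = ds.pixel_array.astype(np.float32)
    # Linear rescale: output = Rescale Slope * stored value + Rescale Intercept.
    pixels = pixels * float(getattr(ds, "RescaleSlope", 1.0)) \
             + float(getattr(ds, "RescaleIntercept", 0.0))

    center, width = getattr(ds, "WindowCenter", None), getattr(ds, "WindowWidth", None)
    if center is not None and width is not None:
        # These attributes may be multi-valued; keep the first entry.
        center = float(center[0]) if isinstance(center, MultiValue) else float(center)
        width = float(width[0]) if isinstance(width, MultiValue) else float(width)
        low, high = center - width / 2.0, center + width / 2.0
    else:
        low, high = float(pixels.min()), float(pixels.max())

    pixels = np.clip((pixels - low) / max(high - low, 1e-6), 0.0, 1.0)
    image = (pixels * 255.0).astype(np.uint8)
    if getattr(ds, "PhotometricInterpretation", "MONOCHROME2") == "MONOCHROME1":
        image = 255 - image  # invert so that air appears dark
    return image

The resulting array can then be written to a standard image format, e.g., with OpenCV's cv2.imwrite.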

A set of 9,125 pediatric CXRs was randomly annotated from the filtered data, of which 7,728 scans serve as the training set and the remaining 1,397 studies form the test set. Note that the 9,125 studies correspond to 9,125 patients, and each study has a single CXR scan.

Table 3. Dataset characteristics of PediCXR.

Characteristics                                          Training set      Test set

Collection statistics
Years                                                    2020 to 2021      2020 to 2021
Number of scans                                          7,728             1,397
Number of human annotators per scan                      1                 1
Image size (pixel × pixel, median)                       1,643 × 1,349     1,638 × 1,343
Age (years, median)*                                     1.71              1.69
Male (%)*                                                57.63             59.14
Female (%)*                                              42.37             40.86
Data size (GB)                                           30.9              5.7

Local labels
1. Boot-shaped heart (%)                                 35 (0.45%)        6 (0.43%)
2. Peribronchovascular interstitial opacity or PIO (%)   1,358 (17.57%)    248 (17.75%)
3. Reticulonodular opacity (%)                           509 (6.59%)       90 (6.44%)
4. Bronchial thickening (%)                              562 (7.27%)       116 (8.30%)
5. Enlarged PA (%)                                       61 (0.79%)        11 (0.79%)
6. Cardiomegaly (%)                                      161 (2.08%)       29 (2.08%)
7. Other opacity (%)                                     148 (1.92%)       27 (1.93%)
8. Intrathoracic digestive structure (%)                 2 (0.03%)         0 (0.00%)
9. Diffuse alveolar opacity (%)                          119 (1.54%)       21 (1.50%)
10. Other lesion (%)                                     65 (0.84%)        11 (0.79%)
11. Consolidation (%)                                    176 (2.28%)       35 (2.51%)
12. Mediastinal shift (%)                                5 (0.06%)         0 (0.00%)
13. Anterior mediastinal mass (%)                        5 (0.06%)         1 (0.07%)
14. Other nodule/mass (%)                                10 (0.13%)        2 (0.14%)
15. Dextrocardia (%)                                     16 (0.21%)        3 (0.21%)
16. Aortic enlargement (%)                               2 (0.03%)         0 (0.00%)
17. Pleural effusion (%)                                 14 (0.18%)        3 (0.21%)
18. Stomach on the right side (%)                        5 (0.06%)         1 (0.07%)
19. Atelectasis (%)                                      23 (0.30%)        3 (0.21%)
20. Calcification (%)                                    1 (0.01%)         0 (0.00%)
21. Interstitial lung disease - ILD (%)                  14 (0.18%)        2 (0.14%)
22. Lung hyperinflation (%)                              108 (1.40%)       21 (1.50%)
23. Egg on string sign (%)                               12 (0.16%)        2 (0.14%)
24. Pulmonary fibrosis (%)                               1 (0.01%)         0 (0.00%)
25. Infiltration (%)                                     11 (0.14%)        2 (0.14%)
26. Lung cavity (%)                                      5 (0.06%)         1 (0.07%)
27. Pneumothorax (%)                                     4 (0.05%)         0 (0.00%)
28. Edema (%)                                            1 (0.01%)         0 (0.00%)
29. Pleural thickening (%)                               2 (0.03%)         0 (0.00%)
30. Clavicle fracture (%)                                5 (0.06%)         1 (0.07%)
31. Chest wall mass (%)                                  3 (0.04%)         0 (0.00%)
32. Lung cyst (%)                                        8 (0.10%)         2 (0.14%)
33. Emphysema (%)                                        1 (0.01%)         0 (0.00%)
34. Bronchiectasis (%)                                   3 (0.04%)         0 (0.00%)
35. Expanded edges of the anterior ribs (%)              2 (0.03%)         0 (0.00%)
36. Paravertebral mass (%)                               2 (0.03%)         0 (0.00%)

Global labels
37. No finding (%)                                       5,143 (66.55%)    907 (64.92%)
38. Bronchitis (%)                                       842 (10.90%)      174 (12.46%)
40. Broncho-pneumonia (%)                                545 (7.05%)       84 (6.01%)
41. Other diseases (%)                                   412 (5.33%)       77 (5.51%)
42. Bronchiolitis (%)                                    497 (6.43%)       90 (6.44%)
43. Situs inversus (%)                                   11 (0.14%)        2 (0.14%)
44. Pneumonia (%)                                        392 (5.07%)       89 (6.37%)
45. Pleuro-pneumonia (%)                                 6 (0.08%)         0 (0.00%)
46. Diaphragmatic hernia (%)                             3 (0.04%)         0 (0.00%)
47. Tuberculosis (%)                                     14 (0.18%)        1 (0.07%)
48. Congenital emphysema (%)                             2 (0.03%)         0 (0.00%)
49. CPAM (%)                                             5 (0.06%)         1 (0.07%)
50. Hyaline membrane disease (%)                         19 (0.25%)        3 (0.21%)
51. Mediastinal tumor (%)                                8 (0.10%)         1 (0.07%)
52. Lung tumor (%)                                       5 (0.06%)         0 (0.00%)

Figure 3. Distribution of abnormal findings in the training set of PediCXR. Rare findings (fewer than 10 examples) are not included.

Figure 4. Distribution of pathologies in the training set of PediCXR. Rare diseases (fewer than 10 examples) are not included.

Once the labeling was completed, the annotations of all pediatric CXRs were exported in JavaScript Object Notation (JSON) format. We developed a Python script to parse the JSON files and organize the annotations into a single comma-separated values (CSV) file. Each CSV file contains labels, bounding box coordinates, and the corresponding image identifiers (IDs). The data characteristics, including patient demographics and the prevalence of each finding or disease, are summarized in Table 3. The distributions of abnormal findings and pathologies in the training set are shown in Figure 3 and Figure 4, respectively.

Data Records

The PediCXR dataset is made available for public download on PhysioNet (https://physionet.org/content/pedicxr/1.0.0/). We provide the complete imaging data as well as the ground-truth labels for both the training and test sets. The pediatric scans were split into two folders, one for training and one for testing, named "train" and "test", respectively. Since each study has only one image instance and each patient has at most one study, the value of the SOP Instance UID provided by the DICOM tag (0008, 0018) was encoded into a unique, anonymous identifier for each image. To this end, we used the Python hashlib module (see Code Availability) to encode the SOP Instance UIDs into image IDs. The radiologists' local annotations of the training set are provided in a CSV file called annotations_train.csv. Each row of the CSV file represents a bounding box annotation with the following attributes: image ID (image_id), radiologist ID (rad_id), label name (class_name), bounding box coordinates (x_min, y_min, x_max, y_max), and label class ID (class_id). The coordinates of the box's upper-left corner are (x_min, y_min), and the coordinates of the box's lower-right corner are (x_max, y_max). Meanwhile, the image-level labels of the training set are stored in a separate CSV file called image_labels_train.csv, with the following fields: image ID (image_id), radiologist ID (rad_ID), and labels (labels) for both the findings and diagnoses. Each image ID is associated with a vector of labels corresponding to different pathologies, with positive pathologies encoded as "1" and negative pathologies encoded as "0". Similarly, the test set's bounding-box annotations and image-level labels are saved in the files annotations_test.csv and image_labels_test.csv, respectively.
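As a usage sketch, the snippet below mirrors this scheme: it hashes a SOP Instance UID into an image ID with hashlib (the specific hash function is an assumption, since the paper only names the module) and loads the training annotations with pandas, which is not listed under Code Availability.

import hashlib
import pandas as pd

def image_id_from_sop_uid(sop_instance_uid: str) -> str:
    # The paper states that hashlib encodes SOP Instance UIDs into image IDs;
    # SHA-256 is used here only as an illustrative choice of hash function.
    return hashlib.sha256(sop_instance_uid.encode("utf-8")).hexdigest()

# Load the bounding-box annotations of the training set described above.
boxes = pd.read_csv("annotations_train.csv")
# Columns: image_id, rad_id, class_name, x_min, y_min, x_max, y_max, class_id.
print(boxes.groupby("class_name").size().sort_values(ascending=False).head(10))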

Technical Validation

The data de-identification process was carefully controlled. Specifically, all DICOM meta-data were parsed and manually reviewed to ensure that all individually identifiable health information (PHI)23 of the pediatric patients had been removed to meet the U.S. HIPAA22 regulations. In addition, the pixel values of all pediatric CXR scans were carefully examined by human readers. During this review process, all scans were manually reviewed case by case by a team of 10 human readers. A small number of images containing private textual information that had not been removed by our algorithm were excluded from the dataset. The manual review process also helped identify and discard out-of-distribution samples, such as CXRs of adult patients, body parts other than the chest, low-quality images, and lateral CXRs, that our machine learning classifier was not able to detect. A set of rules underlying our web-based annotation tool was developed to control the quality of the labeling process. These rules prevent human annotators from making mechanical mistakes such as forgetting to choose global labels or marking lesions on the image while choosing "No finding" as the global label; a simple post-hoc check of this kind is sketched below.
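For example, a minimal post-hoc version of the "No finding" rule could be re-run on the released CSV files as follows; the column layout assumed for image_labels_train.csv (one binary column per label) is an assumption, not a documented part of the release.

import pandas as pd

def images_violating_no_finding(ann_csv: str, img_csv: str) -> set:
    """Return image IDs labeled "No finding" that nevertheless carry bounding boxes."""
    boxes = pd.read_csv(ann_csv)   # e.g., annotations_train.csv
    labels = pd.read_csv(img_csv)  # e.g., image_labels_train.csv
    boxed_ids = set(boxes["image_id"])
    no_finding_ids = set(labels.loc[labels["No finding"] == 1, "image_id"])
    return boxed_ids & no_finding_ids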

Usage Notes

The PediCXR dataset was established for the purpose of developing and evaluating machine learning algorithms for detecting and localizing anomalies in pediatric CXR images. The dataset has previously been used in a study on the diagnosis of multiple diseases in pediatric patients26 and showed promising results. Specifically, the authors26 introduced a deep learning network to detect common pulmonary pathologies on CXRs of pediatric patients. On a test set of 777 studies from the PediCXR dataset, the network yielded an area under the receiver operating characteristic curve (AUC) of 0.709 (95% CI, 0.690-0.729). The sensitivity, specificity, and F1-score at the cutoff value were 0.722 (0.694-0.750), 0.579 (0.563-0.595), and 0.389 (0.373-0.405), respectively. However, the authors acknowledged that this performance remains low compared to that of medical experts. This work revealed the major challenge of learning disease features in pediatric CXR images using representation learning techniques, opening broad directions for future research.
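For readers who wish to reproduce this style of evaluation on the PediCXR test split, a minimal sketch using scikit-learn (not mentioned in the paper) is given below; the 0.5 cutoff is an assumption, not the operating point used in the cited study.

import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, cutoff: float = 0.5) -> dict:
    """Per-pathology AUC, sensitivity, specificity, and F1-score at a fixed cutoff."""
    y_pred = (y_prob >= cutoff).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return {
        "auc": roc_auc_score(y_true, y_prob),
        "sensitivity": tp / max(tp + fn, 1),
        "specificity": tn / max(tn + fp, 1),
        "f1": f1_score(y_true, y_pred),
    }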
The primary uses for which the PediCXR dataset was conceptualized include:

• Developing and validating predictive models for the classification of common thoracic diseases in pediatric patients.

• Developing and validating predictive models for the localization of multiple abnormal findings on pediatric chest X-ray scans.

Finally, the released dataset has limitations that still need to be addressed in the future, including:

• The dataset does not contain clinical information associated with the DICOM images, which is essential for the interpretation of CXRs in pediatric patients.

• The number of examples of rare diseases (e.g., congenital pulmonary airway malformation (CPAM), congenital emphysema, diaphragmatic hernia, mediastinal tumor, pleuro-pneumonia, situs inversus, lung tumor) or findings (emphysema, edema, calcification, chest wall mass, bronchiectasis, pleural thickening, clavicle fracture, pleuropulmonary mass, paravertebral mass, etc.) is limited. Hence, training supervised learning algorithms, which require large-scale annotated data, on the PediCXR dataset to diagnose these rare diseases and findings is not reliable.

To download and use PediCXR, users are required to accept the PhysioNet Credentialed Health Data License 1.5.0. By accepting this license, users agree that they will not share access to the dataset with anyone else. For any publication that explores this resource, the authors must cite this original paper and release their code and models.

Code Availability

This study used the following open-source packages to load and process DICOM scans: Python 3.7.0 (https://www.python.org/); Pydicom 1.2.0 (https://pydicom.github.io/); OpenCV-Python 4.2.0.34 (https://pypi.org/project/opencv-python/); and the Python hashlib module (https://docs.python.org/3/library/hashlib.html). The code for data de-identification is publicly available at https://github.com/vinbigdata-medical/vindr-cxr. The code to train the CNN classifier for the out-of-distribution filtering task is publicly available at https://github.com/vinbigdata-medical/DICOM-Imaging-Router. VinDr Lab is open-source software and can be found at https://vindr.ai/vindr-lab.

References

1. GBD 2015 LRI Collaborators. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory tract infections in 195 countries: a systematic analysis for the Global Burden of Disease Study 2015. The Lancet Infect. Dis. 17, 1133–1161 (2017).

2. Wardlaw, T. M., Johansson, E. W., Hodge, M., World Health Organization & United Nations Children's Fund (UNICEF). Pneumonia: the forgotten killer of children (2006).

3. Hart, A. & Lee, E. Y. Pediatric chest disorders: Practical imaging approach to diagnosis. Diseases of the Chest, Breast, Heart and Vessels 2019–2022, 107–125 (2019).

4. Chest radiograph (pediatric). https://radiopaedia.org/articles/chest-radiograph-paediatric. Accessed: 2021-09-24.

5. Du Toit, G., Swingler, G. & Iloni, K. Observer variation in detecting lymphadenopathy on chest radiography. Int. J. Tuberc. Lung Dis. 6, 814–817 (2002).

6. Wang, X. et al. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2097–2106, https://doi.org/10.1109/CVPR.2017.369 (2017).

7. Bustos, A., Pertusa, A., Salinas, J.-M. & de la Iglesia-Vayá, M. PadChest: A large chest X-ray image dataset with multi-label annotated reports. arXiv preprint arXiv:1901.07441 (2019).

8. Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019).

9. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317, https://doi.org/10.1038/s41597-019-0322-0 (2019).

10. Nguyen, H. Q. et al. VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations. arXiv preprint arXiv:2012.15029 (2020).

11. Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Medicine Surg. 4, 475–477, https://dx.doi.org/10.3978/j.issn.2223-4292.2014.11.20 (2014).

12. Tabik, S. et al. COVIDGR dataset and COVID-SDNet methodology for predicting COVID-19 based on chest X-ray images. IEEE J. Biomed. Health Informatics 24, 3595–3605 (2020).

13. Rajpurkar, P. et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017).

14. Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine 15, e1002686, https://doi.org/10.1371/journal.pmed.1002686 (2018).

15. Majkowska, A. et al. Chest radiograph interpretation with deep learning models: Assessment with radiologist-adjudicated reference standards and population-adjusted evaluation. Radiology 294, 421–431, https://doi.org/10.1148/radiol.2019191293 (2020).

16. Rajpurkar, P. et al. CheXpedition: Investigating generalization challenges for translation of chest X-ray algorithms to the clinical setting. arXiv preprint arXiv:2002.11379 (2020).

17. Tang, Y.-X. et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. npj Digit. Medicine 3, 1–8, https://doi.org/10.1038/s41746-020-0273-z (2020).

18. Pham, H. H., Le, T. T., Tran, D. Q., Ngo, D. T. & Nguyen, H. Q. Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels. arXiv preprint arXiv:1911.06475 (2020).

19. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131.e9, https://doi.org/10.1016/j.cell.2018.02.010 (2018).

20. Chen, K.-C. et al. Diagnosis of common pulmonary diseases in children by X-ray images and deep learning. Sci. Reports 10, 1–9 (2020).

21. Gordon, L., Grantcharov, T. & Rudzicz, F. Explainable artificial intelligence for safe intraoperative decision support. JAMA Surgery 154, 1064–1065 (2019).

22. US Department of Health and Human Services. Summary of the HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (2003).

23. Isola, S. & Al Khalili, Y. Protected Health Information (PHI). https://www.ncbi.nlm.nih.gov/books/NBK553131/ (2019).

24. Pham, H. H., Do, D. V. & Nguyen, H. Q. DICOM Imaging Router: An open deep learning framework for classification of body parts from DICOM X-ray scans. arXiv preprint arXiv:2108.06490 (2021).

25. Nguyen, N. T. et al. VinDr Lab: A data platform for medical AI. https://github.com/vinbigdata-medical/vindr-lab (2021).

26. Tran, T. T. et al. Learning to automatically diagnose multiple diseases in pediatric chest radiographs using deep convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (2021).

Acknowledgements

The collection of this dataset was funded by the Smart Health Center, VinBigData JSC. The authors would like to thank the Phu Tho Obstetric & Pediatric Hospital for agreeing to make the PediCXR dataset publicly available. We are especially thankful to Anh T. Nguyen, Huong T.T. Nguyen, and Ngan T.T. Nguyen for their help in the data collection and labeling process.

Author contributions

H.Q.N. and H.H.P. designed the study; T.T.T. performed the data de-identification; H.Q.N. and H.H.P. wrote the paper; all authors reviewed the manuscript.

Competing interests

This work was funded by Vingroup JSC. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


