PediCXR Final Manuscript
1 Smart Health Center, VinBigData JSC, Hanoi, Vietnam
2 Phu Tho Department of Health, Phu Tho, Vietnam
3 College of Engineering & Computer Science, VinUniversity, Hanoi, Vietnam
4 VinUni-Illinois Smart Health Center, Hanoi, Vietnam
5 Training and Direction of Healthcare Activities Center, Phu Tho General Hospital, Phu Tho, Vietnam
* These authors contributed equally: Hieu H. Pham and Ngoc H. Nguyen
* Corresponding author: Hieu H. Pham ([email protected])
ABSTRACT
Computer-aided diagnosis systems for adult chest radiography (CXR) have recently achieved great success thanks to the availability of large-scale, annotated datasets and the advent of high-performance supervised learning algorithms. However, the development of diagnostic models for detecting and diagnosing pediatric diseases in CXR scans has been hindered by the lack of high-quality physician-annotated datasets. To overcome this challenge, we introduce and release PediCXR, a new pediatric CXR dataset of 9,125 studies retrospectively collected from a major pediatric hospital in Vietnam between 2020 and 2021. Each scan was manually annotated by a pediatric radiologist with more than ten years of experience. The dataset was labeled for the presence of 36 critical findings and 15 diseases. In particular, each abnormal finding was localized via a rectangular bounding box on the image. To the best of our knowledge, this is the first and largest pediatric CXR dataset containing lesion-level annotations and image-level labels for the detection of multiple findings and diseases. For algorithm development, the dataset was divided into a training set of 7,728 studies and a test set of 1,397 studies. To encourage new advances in pediatric CXR interpretation using data-driven approaches, we provide a detailed description of the PediCXR data sample and make the dataset publicly available at https://physionet.org/content/pedicxr/1.0.0/.
Table 1. An overview of existing public datasets for CXR interpretation in pediatric patients.
Dataset Release year # findings # samples Image-level labels Local labels
Kermany et al.19 2018 2 5,856 Available Not available
Chen et al.20 2020 5 2,668 Available Not available
PediCXR (ours) 2021 52 9,125 Available Available
Methods
Data collection
Data collection was conducted at the Phu Tho Obstetric & Pediatric Hospital (PTOPH) between 2020 and 2021. The ethical clearance of this study was approved by the Institutional Review Board (IRB) of PTOPH. The need for obtaining informed patient consent was waived because this retrospective study did not impact clinical care or workflow at the hospital, and all patient-identifiable information in the data has been removed. We retrospectively collected more than 10,000 CXRs in DICOM format from a local picture archiving and communication system (PACS) at PTOPH. The imaging dataset was then transferred to and analyzed at the Smart Health Center, VinBigData JSC.
Overview of approach
The building of the PediCXR dataset is illustrated in Figure 1. In particular, the collection and normalization of the dataset were divided into four main steps: (1) data collection, (2) data de-identification, (3) data filtering, and (4) data labeling. We describe each step in detail below.
Figure 1. Construction of the PediCXR dataset. First, raw pediatric scans in DICOM format were collected retrospectively from the hospital's PACS at PTOPH. These images were de-identified to protect patient privacy. Then, invalid files (including adult CXR images, images of other modalities or other body parts, low-quality images, or images with incorrect orientation) were manually filtered out. After that, a web-based DICOM labeling tool called VinDr Lab was developed to remotely annotate the DICOM data. Finally, the annotated dataset was divided into a training set (N = 7,728) and a test set (N = 1,397) for algorithm development.
Data de-identification
In this study, we followed the HIPAA Privacy Rule22 to protect individually identifiable health information in the DICOM images. To this end, we removed or replaced with random values all personally identifiable information associated with the images via a two-stage de-identification process. In the first stage, a Python script was used to remove all DICOM tags containing protected health information (PHI)23, such as the patient's name, date of birth, patient ID, and acquisition date and time. For the purpose of loading and processing DICOM files, we retained only the limited set of DICOM attributes that are necessary, as indicated in Table 2 (Supplementary materials). In the second stage, we manually removed all textual information appearing in the image data, i.e., pixel annotations that could include the patient's identifiable information.
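The first-stage tag removal can be sketched as follows. This is a minimal illustration over a plain mapping of (group, element) tag tuples, not the authors' actual script (which is released at the vindr-cxr repository), and the PHI tag subset shown is a hypothetical example:

```python
# Minimal sketch of stage-one PHI tag removal. The tag subset below is
# illustrative only; the actual de-identification retains solely the
# attributes listed in Table 2 and drops everything else.
PHI_TAGS = {
    (0x0010, 0x0010),  # Patient's Name
    (0x0010, 0x0020),  # Patient ID
    (0x0010, 0x0030),  # Patient's Birth Date
    (0x0008, 0x0022),  # Acquisition Date
    (0x0008, 0x0032),  # Acquisition Time
}

def strip_phi(dataset):
    """Delete PHI elements from a dict-like mapping of DICOM tags to values."""
    for tag in list(dataset):
        if tag in PHI_TAGS:
            del dataset[tag]
    return dataset
```

With pydicom, the same idea applies to a Dataset loaded via pydicom.dcmread, deleting elements with del ds[tag] and writing the result back with ds.save_as.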
Data filtering
The collected raw data included a significant number of outliers, including CXRs of adult patients, body parts other than the chest (abdomen, spine, and others), low-quality images, and lateral CXRs. To filter the large number of CXR scans, we trained a lightweight convolutional neural network (CNN)24 to remove all outliers automatically. Next, a manual verification was performed to ensure all outliers had been fully removed.
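The automatic step can be sketched as below; predict_label stands in for the trained CNN classifier (the DICOM Imaging Router cited above), and the category names are hypothetical:

```python
def filter_outliers(paths, predict_label, keep=("pediatric frontal CXR",)):
    """Split scans into kept CXRs and outliers based on a classifier's
    predicted category; the outlier list then goes to manual verification."""
    kept, outliers = [], []
    for path in paths:
        (kept if predict_label(path) in keep else outliers).append(path)
    return kept, outliers
```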
Data labeling
The PediCXR dataset was labeled for a total of 36 findings and 15 diagnoses. These labels were divided into two categories: local labels (#1–#36) and global labels (#37–#52). The local labels are marked with bounding boxes that localize the findings, while the global labels reflect the diagnostic impression of the radiologist. This list of labels was suggested by a committee of our most experienced pediatric radiologists. To select these labels, the committee took into account two key factors: first, the findings and diseases must be prevalent; second, they must be differentiable on pediatric chest X-ray scans. Figure 2 illustrates several samples with both local and global labels annotated by our radiologists.
Figure 2. Several examples of pediatric CXR images with radiologists' annotations. Local labels marked by radiologists are plotted on the original images for visualization purposes. These annotations show abnormal findings in the scans. The global labels, which classify images into diseases, are in bold and listed at the bottom of each example.
To facilitate the labeling process, we designed and built a web-based framework called VinDr Lab25 that allows a team of experienced radiologists to annotate the data remotely. Specifically, this web-based labeling tool was developed to store, manage, and remotely annotate DICOM data. The radiologists were instructed to locate the abnormal findings in the DICOM viewer and draw the corresponding bounding boxes. All the annotators were well trained to ensure that the annotations were consistent. In addition, all the radiologists participating in the labeling process were certified in diagnostic radiology and held healthcare professional certificates. In total, three pediatric radiologists with at least 15 years of experience were involved in the annotation process. Each sample in the training set was assigned to one radiologist for annotation. Additionally, all of the participating radiologists were blinded to relevant clinical information. A set of 9,125 pediatric CXRs were randomly
Table 2. The list of DICOM tags that were retained for loading and processing raw images. All other tags were removed to protect patient privacy. Details about these tags can be found in the DICOM Standard Browser at https://dicom.innolitics.com/ciods.

(0028, 0030) Pixel Spacing: Physical distance in the patient between the center of each pixel, specified by a numeric pair: adjacent row spacing (delimiter) adjacent column spacing, in mm.
(0028, 0034) Pixel Aspect Ratio: Ratio of the vertical size and horizontal size of the pixels in the image, specified by a pair of integer values where the first value is the vertical pixel size and the second value is the horizontal pixel size.
(0028, 0100) Bits Allocated: Number of bits allocated for each pixel sample. Each sample shall have the same number of bits allocated.
(0028, 0101) Bits Stored: Number of bits stored for each pixel sample. Each sample shall have the same number of bits stored.
(0028, 0102) High Bit: Most significant bit for pixel sample data. Each sample shall have the same high bit.
(0028, 0103) Pixel Representation: Data representation of the pixel samples. Each sample shall have the same pixel representation.
(0028, 0106) Smallest Image Pixel Value: The minimum actual pixel value encountered in this image.
(0028, 0107) Largest Image Pixel Value: The maximum actual pixel value encountered in this image.
(0028, 1052) Rescale Intercept: The value b in the relationship between stored values (SV) and the output units specified in Rescale Type (0028, 1054). Each output unit is equal to m*SV + b.
(0028, 1053) Rescale Slope: Value of m in the equation specified by Rescale Intercept (0028, 1052).
(7FE0, 0010) Pixel Data: A data stream of the pixel samples that comprise the image.
(0028, 0004) Photometric Interpretation: Specifies the intended interpretation of the pixel data.
(0028, 2110) Lossy Image Compression: Specifies whether an image has undergone lossy compression (at a point in its lifetime).
(0028, 2114) Lossy Image Compression Method: A label for the lossy compression method(s) that have been applied to this image.
(0028, 2112) Lossy Image Compression Ratio: Describes the approximate lossy compression ratio(s) that have been applied to this image.
(0028, 0002) Samples per Pixel: Number of samples (planes) in this image.
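As a concrete illustration of the Rescale Slope (0028, 1053) and Rescale Intercept (0028, 1052) relationship described in Table 2, the mapping from stored pixel values to output units is linear; a minimal sketch (the defaults m = 1, b = 0 reflect the usual DICOM behavior when the attributes are absent):

```python
def apply_rescale(stored_values, slope=1.0, intercept=0.0):
    """Map stored pixel values (SV) to output units via m*SV + b, where
    m is Rescale Slope (0028, 1053) and b is Rescale Intercept (0028, 1052)."""
    return [slope * sv + intercept for sv in stored_values]
```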
annotated from the filtered data, of which 7,728 scans serve as the training set and the remaining 1,397 studies form the test set. Note that the 9,125 studies correspond to 9,125 distinct patients, and each study has a single CXR scan.
Figure 3. Distribution of abnormal findings on the training set of PediCXR. Rare findings (fewer than 10 examples) are not included.
Figure 4. Distribution of pathologies on the training set of PediCXR. Rare diseases (fewer than 10 examples) are not included.
Once the labeling was completed, the annotations of all pediatric CXRs were exported in JavaScript Object Notation (JSON) format. We developed a Python script to parse the JSON files and organize the annotations in the form of a single comma-separated values (CSV) file. Each CSV file contains labels, bounding box coordinates, and the corresponding image identifiers (IDs). The data characteristics, including patient demographics and the prevalence of each finding or disease, are summarized in Table 3. The distributions of abnormal findings and pathologies in the training set are shown in Figure 3 and Figure 4, respectively.
Data Records
The PediCXR dataset will be made available for public download on PhysioNet. We offer complete imaging data as well as ground truth labels for both the training and test datasets. The pediatric scans were split into two folders, one for training and one for testing, named "train" and "test", respectively. Since each study has only one image instance and each patient has at most one study, the value of the SOP Instance UID provided by the DICOM tag (0008, 0018) was encoded into a unique, anonymous identifier for each image. To this end, we used the Python hashlib module (see Code Availability) to encode the SOP Instance UIDs into image IDs. The radiologists' local annotations of the training set are provided in a CSV file called annotations_train.csv. Each row of the CSV file represents a bounding box annotation with the following attributes: image ID (image_id), radiologist ID (rad_id), label name (class_name), bounding box coordinates (x_min, y_min, x_max, y_max), and label class ID (class_id). The coordinates of the box's upper-left corner are (x_min, y_min), and the coordinates of the box's lower-right corner are (x_max, y_max). Meanwhile, the image-level labels of the training set are stored in a different CSV file called image_labels_train.csv, with the following fields: image ID (image_id), radiologist ID (rad_ID), and labels (labels) for both the findings and diagnoses. Each image ID is associated with a vector of multiple labels corresponding to different pathologies, with positive pathologies encoded as "1" and negative pathologies encoded as "0". Similarly, the test set's bounding-box annotations and image-level labels are saved in the files annotations_test.csv and image_labels_test.csv, respectively.
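The ID encoding and the annotation schema described above can be illustrated with a short sketch. Note two assumptions: the excerpt does not name the exact hashlib algorithm, so SHA-256 is used here as a stand-in, and the sample CSV row in the test is fabricated for illustration:

```python
import csv
import hashlib
import io

def encode_image_id(sop_instance_uid):
    """Hash a SOP Instance UID (0008, 0018) into an anonymous image ID.
    SHA-256 is an assumption; the paper only states that hashlib was used."""
    return hashlib.sha256(sop_instance_uid.encode("utf-8")).hexdigest()

def parse_annotations(csv_text):
    """Parse annotations_train.csv-style rows into bounding-box records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "image_id": row["image_id"],
            "rad_id": row["rad_id"],
            "class_name": row["class_name"],
            # Upper-left (x_min, y_min) and lower-right (x_max, y_max) corners
            "box": tuple(float(row[k]) for k in ("x_min", "y_min", "x_max", "y_max")),
            "class_id": int(row["class_id"]),
        })
    return records
```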
The primary uses for which the PediCXR dataset was conceptualized include:
• Developing and validating a predictive model for the classification of common thoracic diseases in pediatric patients.
• Developing and validating a predictive model for the localization of multiple abnormal findings on pediatric chest X-ray scans.
Finally, the released dataset has limitations that still need to be addressed in the future, including:
• The dataset does not contain clinical information associated with the DICOM images, which is essential for the interpretation of CXRs in pediatric patients.
• The number of examples for rare diseases (e.g., Congenital pulmonary airway malformation (CPAM), Congenital emphysema, Diaphragmatic hernia, Mediastinal tumor, Pleuro-pneumonia, Situs inversus, Lung tumor) or findings (Emphysema, Edema, Calcification, Chest wall mass, Bronchiectasis, Pleural thickening, Clavicle fracture, Pleuropulmonary mass, Paravertebral mass, etc.) is limited. Hence, training supervised learning algorithms, which require large-scale annotated datasets, on the PediCXR dataset to diagnose these rare diseases and findings is not reliable.
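Users can reproduce the rarity threshold applied in Figures 3 and 4 (fewer than 10 examples) when deciding which classes to train on; a minimal sketch over per-study label lists, with hypothetical label names in the test:

```python
from collections import Counter

def rare_labels(label_lists, min_count=10):
    """Count how often each label occurs across studies and return the
    labels that fall below min_count, mirroring the threshold used to
    exclude rare findings and diseases from Figures 3 and 4."""
    counts = Counter(label for labels in label_lists for label in labels)
    return {label: n for label, n in counts.items() if n < min_count}
```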
To download and use the PediCXR dataset, users are required to accept the PhysioNet Credentialed Health Data License 1.5.0. By accepting this license, users agree that they will not share access to the dataset with anyone else. For any publication that explores this resource, the authors must cite this original paper and release their code and models.
Code Availability
This study used the following open-source tools to load and process DICOM scans: Python 3.7.0 (https://www.python.org/); Pydicom 1.2.0 (https://pydicom.github.io/); OpenCV-Python 4.2.0.34 (https://pypi.org/project/opencv-python/); and Python hashlib (https://docs.python.org/3/library/hashlib.html). The code for data de-identification is publicly available at https://github.com/vinbigdata-medical/vindr-cxr. The code to train the CNN classifier for the out-of-distribution filtering task is publicly available at https://github.com/vinbigdata-medical/DICOM-Imaging-Router. VinDr Lab is open-source software and can be found at https://vindr.ai/vindr-lab.
References
1. GBD 2015 LRI Collaborators. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory tract infections in 195 countries: a systematic analysis for the Global Burden of Disease Study 2015. The Lancet Infect. Dis. 17, 1133–1161 (2017).
2. Wardlaw, T. M., Johansson, E. W., Hodge, M., World Health Organization & United Nations Children's Fund (UNICEF). Pneumonia: the forgotten killer of children (2006).
3. Hart, A. & Lee, E. Y. Pediatric chest disorders: Practical imaging approach to diagnosis. Dis. Chest, Breast, Hear. Vessel. 2019–2022, 107–125 (2019).
4. Chest radiograph (pediatric). https://radiopaedia.org/articles/chest-radiograph-paediatric. Accessed: 2021-09-24.
5. Du Toit, G., Swingler, G. & Iloni, K. Observer variation in detecting lymphadenopathy on chest radiography. Int. J. Tuberc. Lung Dis. 6, 814–817 (2002).
6. Wang, X. et al. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2097–2106, https://doi.org/10.1109/CVPR.2017.369 (2017).
7. Bustos, A., Pertusa, A., Salinas, J.-M. & de la Iglesia-Vayá, M. PadChest: A large chest X-ray image dataset with multi-label annotated reports. arXiv preprint arXiv:1901.07441 (2019).
8. Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019).
9. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317, https://doi.org/10.1038/s41597-019-0322-0 (2019).
10. Nguyen, H. Q. et al. VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations. arXiv preprint arXiv:2012.15029 (2020).
11. Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Medicine Surg. 4, 475–477, https://dx.doi.org/10.3978%2Fj.issn.2223-4292.2014.11.20 (2014).
12. Tabik, S. et al. COVIDGR dataset and COVID-SDNet methodology for predicting COVID-19 based on chest X-ray images. IEEE J. Biomed. Health Informatics 24, 3595–3605 (2020).
13. Rajpurkar, P. et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017).
14. Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine 15, e1002686, https://doi.org/10.1371/journal.pmed.1002686 (2018).
15. Majkowska, A. et al. Chest radiograph interpretation with deep learning models: Assessment with radiologist-adjudicated reference standards and population-adjusted evaluation. Radiology 294, 421–431, https://doi.org/10.1148/radiol.2019191293 (2020).
16. Rajpurkar, P. et al. CheXpedition: Investigating generalization challenges for translation of chest X-ray algorithms to the clinical setting. arXiv preprint arXiv:2002.11379 (2020).
17. Tang, Y.-X. et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. npj Digit. Medicine 3, 1–8, https://doi.org/10.1038/s41746-020-0273-z (2020).
18. Pham, H. H., Le, T. T., Tran, D. Q., Ngo, D. T. & Nguyen, H. Q. Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels. arXiv preprint arXiv:1911.06475 (2020).
19. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131.e9, https://doi.org/10.1016/j.cell.2018.02.010 (2018).
20. Chen, K.-C. et al. Diagnosis of common pulmonary diseases in children by X-ray images and deep learning. Sci. Reports 10, 1–9 (2020).
21. Gordon, L., Grantcharov, T. & Rudzicz, F. Explainable artificial intelligence for safe intraoperative decision support. JAMA Surgery 154, 1064–1065 (2019).
22. US Department of Health and Human Services. Summary of the HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (2003).
23. Isola, S. & Al Khalili, Y. Protected Health Information (PHI). https://www.ncbi.nlm.nih.gov/books/NBK553131/ (2019).
24. Pham, H. H., Do, D. V. & Nguyen, H. Q. DICOM Imaging Router: An open deep learning framework for classification of body parts from DICOM X-ray scans. arXiv preprint arXiv:2108.06490 (2021).
25. Nguyen, N. T. et al. VinDr Lab: A data platform for medical AI. https://github.com/vinbigdata-medical/vindr-lab (2021).
26. Tran, T. T. et al. Learning to automatically diagnose multiple diseases in pediatric chest radiographs using deep convolutional neural networks. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2021) (2021).
Acknowledgements
The collection of this dataset was funded by the Smart Health Center, VinBigData JSC. The authors would like to acknowledge the Phu Tho Obstetric & Pediatric Hospital for agreeing to make the PediCXR dataset publicly available. We are especially thankful to Anh T. Nguyen, Huong T.T. Nguyen, and Ngan T.T. Nguyen for their help in the data collection and labeling process.