The GDC provides access to high-quality datasets from NCI-supported programs and recommends biospecimen best practices to organizations generating datasets from non-NCI-supported programs. Data quality is maintained through GDC processes and tools for validating submitted data and for data harmonization, which includes aligning data to a common reference genome and generating higher-level data.
High-Quality Tissue Samples
The GDC obtains datasets from NCI programs that ensure high quality by accepting only tissues that have extensive information on their sources and have undergone stringent quality control during material processing. The GDC also obtains datasets from other NCI-supported programs that adhere to high quality standards for handling biospecimens and associated analytes, and it recommends that non-NCI-supported programs submitting data to the GDC follow Biospecimen Best Practices. Organizations wishing to submit data to the GDC should visit the NCI Biorepositories and Biospecimen Research Branch (BBRB) site for recommended tissue collection strategies for generating high-quality datasets.
The GDC maintains standardized biospecimen information on the tissues and samples used in each study. Information on the tissue sample, portion, analyte, and/or aliquot is extracted from submitted data, maintained in the GDC data model, and made accessible via the GDC Data Portal.
Submitted Data Validation
Data validation is performed on data submitted to the GDC through the GDC Data Submission Processes and Tools. Submitted data is not distributed by the GDC unless it passes the following GDC data validation checks.
- Verification of MD5 Checksum - Comparison of the provided MD5 Checksum with the MD5 Checksum of the file submitted to the GDC.
- Data Type & Format Validation - Validation of biospecimen, clinical, and genomic data against the standard GDC Data Types and File Formats and GDC Data Dictionary.
- Validation of Data References - Cross-check of existing Universally Unique Identifiers (UUIDs) or barcodes across biospecimen, clinical, and genomic data.
- Data Integrity Checks - Validation of clinical, biospecimen, and genomic data using 100+ GDC data integrity checks. Examples include:
Data Type | Example Data Integrity Check |
---|---|
Clinical | |
Biospecimen | |
Genomic | |
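The checksum verification and data reference checks listed above can be illustrated with a minimal Python sketch. This is not the GDC's implementation: the function names, the `records` structure, and its `"id"` and `"references"` keys are hypothetical and chosen only for illustration. Only the standard-library `hashlib` module is assumed.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large genomic files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, provided_md5):
    """Compare a submitter-provided MD5 with the digest computed from the file."""
    # Normalize whitespace and case before comparing hex strings.
    return md5_of_file(path) == provided_md5.strip().lower()


def unresolved_references(records, known_ids):
    """Return (record_id, referenced_id) pairs whose target is not known.

    `records` is a hypothetical list of dicts with "id" and "references"
    keys; it is an illustrative structure, not the GDC submission format.
    """
    known = set(known_ids)
    return [
        (record["id"], ref)
        for record in records
        for ref in record.get("references", [])
        if ref not in known
    ]
```

Under these assumptions, a submitter could call `checksum_matches(path, provided_md5)` on a local file before upload, and `unresolved_references(records, known_ids)` would list any identifiers that point to entities absent from the submission.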
GDC Data Harmonization
The GDC uses submitted genomic sequence data to create derived data products such as somatic DNA mutations, gene expression, and copy number variations. Validation of genomic data is performed using GDC Data Harmonization software and algorithms. Bioinformatics workflows are developed with ongoing input from recognized experts in the cancer genomics community. Workflows are implemented using techniques that make them reproducible, interoperable across multiple platforms, and shareable with any interested member of the community. GDC workflows are described in detail on the GDC Documentation Site and are available in the GDC GitHub Repository. GDC workflows perform quality control checks, and the GDC adds summary metrics to the aligned reads so that users can query them. For a complete list of the summary metrics and the tools used to generate them, please visit the Data Dictionary Viewer.
Workflow | Tools | Quality Control Checks | Quality Control Metrics |
---|---|---|---|
DNA Alignment & Somatic Variant Calling | BWA, Picard Tools, GATK, MuSE, MuTect2, VarScan2, Pindel, CaVEMan, Strelka2, SvABA | | |
RNA Alignment, Expression, and Gene Fusion Analysis | | | |
miRNA Alignment & Expression Analysis | BWA, BCGSC miRNA Profiling | | |
Copy Number Variation Analysis | ASCAT, ABSOLUTE, GATK CNV, DNAcopy | | |
Methylation Array Analysis | SeSAMe | | Included in SeSAMe |