NCI Cancer Research Data Commons: Cloud-Based Analytic Resources

CANCER RESEARCH | REVIEW
David Pot1, Zelia Worman2, Alexander Baumann3, Shirish Pathak4, Rowan Beck2, Erin Beck5,
Katherine Thayer3, Tanja M. Davidsen5, Erika Kim5, Brandi Davis-Dusenbery2, John Otridge4,
Todd Pihl4, The CRDC Program, Jill S. Barnholtz-Sloan5,6, and Anthony R. Kerlavage5
ABSTRACT
The NCI’s Cloud Resources (CR) are the analytical components of the Cancer Research Data Commons (CRDC) ecosystem. This review describes how the three CRs (Broad Institute FireCloud, Institute for Systems Biology Cancer Gateway in the Cloud, and Seven Bridges Cancer Genomics Cloud) provide access and availability to large, cloud-hosted, multimodal cancer datasets, as well as offer tools and workspaces for performing data analysis where the data resides, without download or storage. In addition, users can upload their own data and tools into their workspaces, allowing researchers to create custom analysis workflows and integrate CRDC-hosted data with their own.
See related articles by Brady et al., p. 1384, Wang et al., p. 1388, and Kim et al., p. 1404
1General Dynamics Information Technology, Falls Church, Virginia. 2Velsera (Seven Bridges), Charlestown, Massachusetts. 3Broad Institute, Cambridge, Massachusetts. 4Frederick National Laboratory for Cancer Research, Frederick, Maryland. 5Center for Biomedical Informatics and Information Technology, NCI, Rockville, Maryland. 6Trans Divisional Research Program, Division of Cancer Epidemiology and Genetics, NCI, Rockville, Maryland.
D. Pot, Z. Worman, and A. Baumann contributed equally to this article.
Corresponding Author: Erin Beck, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD 20850. E-mail: [email protected]
Cancer Res 2024;84:1396–403
doi: 10.1158/0008-5472.CAN-23-2657
This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
©2024 The Authors; Published by the American Association for Cancer Research

Introduction
Collaboration and agreement on shared standards and formats are required across the medical and scientific community to collect, organize, and analyze the large amounts of valuable, diverse clinical and molecular data created on a daily basis. The NCI’s Cancer Research Data Commons (CRDC) is a cloud-based data science infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data. CRDC focuses on providing high-quality curated cancer data that adheres to Findable, Accessible, Interoperable, and Reusable (FAIR) principles. Use of FAIR principles enables different parts of the CRDC ecosystem to combine detailed clinical, molecular (e.g., -omic), and imaging data obtained through various technologies, so that researchers can explore and analyze multimodal cancer datasets and share results and insights with the greater scientific community (1).
Here, we describe the analytic components of the CRDC, the NCI Cloud Resources (CR). Three separate CRs, the Broad Institute FireCloud, Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC), and Seven Bridges Cancer Genomics Cloud (SB-CGC), each provide common features to access and analyze cloud-based CRDC data, as well as user-provided data, in workspaces utilizing both common and user-provided tools and pipelines. Each houses cloud-scale analysis tools that researchers have leveraged to interrogate large datasets to make new discoveries. However, each CR also has unique features for use by different types of cancer researchers (Fig. 1).
This Review highlights each of the three NCI CRs (with details provided in the Supplementary Data), how they compare and complement each other, available datasets, tools serving differing researcher types, their biological successes as well as teaching successes, and proposed future directions to continue serving cancer research efforts across national and international communities.

Data availability
NCI has long invested in making large, consistently collected datasets available, such as The Cancer Genome Atlas (TCGA). The CRDC extends these efforts by enabling researchers to perform multimodal analysis across many data types using the Cloud Resources. CRDC’s Genomic Data Commons (GDC; ref. 2), Proteomic Data Commons (PDC; ref. 3), Imaging Data Commons (IDC; ref. 4), Integrated Canine Data Commons (ICDC), and Cancer Data Service (CDS) all currently connect to the various CRs described in Table 1 (5). Through the three CRs, 9.4 PB of cancer data is currently available for analysis.
Searching through the individual data commons portals, researchers can select and combine data of interest from various datasets for coanalysis. Although combining datasets remains challenging due to the current lack of harmonization, the data commons and CRs provide ways to coanalyze and harmonize depending on the researcher’s needs. These data commons include several data modalities, including genomics, proteomics, imaging, and epigenomics, among others, that, using the CRs, can be leveraged for multiomics cancer research. For analysis within SB-CGC and FireCloud, a user creates a study manifest with metadata and file location information to be uploaded for analysis. ISB-CGC ingests tabular data (Supplementary Table S1) into Google’s BigQuery for interactive and scalable analysis and also allows researchers to analyze their data in a private workspace.
The data from CRDC fall into two categories: Open Access and Controlled Access (see Table 1). Open Access data includes aggregated information such as gene expression levels, as well as information like disease type, stage, and tissue type. Controlled Access data includes information that could lead to identification of an individual and requires authorization, in most cases from the NIH Database of Genotypes and Phenotypes (dbGaP). Data from multiple commons can be combined and coanalyzed within the CRs. In all cases, the underlying data files are protected through authorization provided
AACRJournals.org | 1396
Figure 1. The NCI Cloud Resources. Each CR provides unique features to collectively support users across varying levels of technical expertise and access to diverse sets of NCI data. FireCloud and SB-CGC offer extensive repositories of prebuilt tools, tutorials, and workflows in CWL and WDL that provide more assistance to beginners to the cloud, while ISB-CGC is designed for the more advanced user to easily combine new data with tabulated derived data to gain new insights. Users can bring their own data to “Secure Workspaces” and combine it with NCI cloud-hosted “Data” using the analysis “Cloud-Based Tools” readily available at each CR.
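The study manifest mentioned under Data availability, which pairs metadata with file locations for upload to SB-CGC or FireCloud, can be illustrated with a minimal Python sketch. The column names (case_id, data_type, file_url), the bucket paths, and the helper build_manifest are hypothetical and chosen only for illustration; each CR defines its own manifest format in its documentation.

```python
import csv
import io

# Hypothetical manifest columns; real CR manifests define their own schema.
MANIFEST_COLUMNS = ["case_id", "data_type", "file_url"]


def build_manifest(records):
    """Return manifest text (CSV) with one row per file, rejecting incomplete rows."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANIFEST_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Every file needs both its metadata and its cloud location.
        missing = [col for col in MANIFEST_COLUMNS if not record.get(col)]
        if missing:
            raise ValueError(f"record {record!r} is missing {missing}")
        writer.writerow({col: record[col] for col in MANIFEST_COLUMNS})
    return buffer.getvalue()


# Illustrative records: identifiers and bucket paths are made up.
example_records = [
    {"case_id": "CASE-0001", "data_type": "RNA-seq", "file_url": "gs://example-bucket/case1.bam"},
    {"case_id": "CASE-0002", "data_type": "WGS", "file_url": "s3://example-bucket/case2.cram"},
]

manifest_text = build_manifest(example_records)
print(manifest_text)
```

Validating each row before writing it reflects the point that a manifest is only useful when every file carries both its descriptive metadata and a resolvable location.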
by the CRDC Data Commons Framework (DCF; ref. 5). Below, we highlight some of the data types currently available via the CRDC for analysis with the NCI Cloud Resources.

Genomics, Transcriptomics, and Other Molecular Data
Some examples of molecular alterations, which often underlie cancer development, include mutations, copy-number or structural variants, changes in gene expression and posttranscriptional modifications, and changes in DNA methylation. Within the CRDC, researchers can access this molecular data through the GDC and the Cancer Data Service (CDS), which enable the search and discovery of genomic, transcriptomic, and epigenomic sequencing modalities. In particular, the GDC contains some of the largest and most comprehensive cancer genomic datasets, including TCGA and The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program. GDC’s data release v39.0 included
AACRJournals.org Cancer Res; 84(9) May 1, 2024 1397

Pot et al.
Table 1. Data availability: summary representation of data available to account holders in the Cloud Resources.
Availability columns: Broad FireCloud, ISB-CGC, SB-CGC
Reference genomes and files, e.g., GTEx, 1000 Genomes @ @ @
Derived data, e.g., gene expression matrixes @ @
Connection to non-cancer data, e.g., AnVIL @ @ @
GDC (a, b); hosted on AWS and GCP
TCGA (The Cancer Genome Atlas) @ @ @
TARGET (Therapeutically Applicable Research to Generate Effective Treatments) @ @ @
CCLE (Cancer Cell Line Encyclopedia) @ @ @
PDC (a, b); hosted on AWS
CPTAC (Clinical Proteomic Tumor Analysis Consortium) @ @
APOLLO (Applied Proteomics Organizational Learning and Outcomes) @ @
ICPC (International Cancer Proteogenomic Consortium) @ @
CBTN (Children’s Brain Tumor Network) @ @
ICDC (a); hosted on AWS
CMPC (The Comparative Molecular Characterization Program) @
COP (Comparative Oncology Program) @
PCCR (The Purdue University Center for Cancer Research) @
CDS (a, b); hosted on AWS
PPTC (Pediatric Preclinical Testing Consortium) @
HTAN (Human Tumor Atlas Network) @ @
CCDI (Childhood Cancer Data Initiative) @
IDC; hosted on GCP
TCGA (The Cancer Genome Atlas) @
Note: The cloud(s) hosting each data node is also provided. Refer to Supplementary Table S3 for a complete list of acronyms and definitions. Of note, the datasets represent the most commonly requested and used data by cancer researchers.
(a) More data is available than the ones highlighted in this table. Please refer to the individual websites for a full list of datasets available.
(b) Data portals include both controlled and open-access data. To access controlled data, researchers must obtain the appropriate dbGaP permissions. CRDC provides a list of key datasets on their website.
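The distinction between Open Access and Controlled Access data noted for Table 1 amounts to a simple authorization rule: open files are readable by any account holder, while controlled files require that the user hold an approval (in most cases from dbGaP) for the corresponding study. The Python sketch below is illustrative only; the DataFile type, the study identifiers, and the can_access function are hypothetical and do not represent the actual interface of the Data Commons Framework.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class DataFile:
    name: str
    controlled: bool                 # True for Controlled Access, False for Open Access
    study_id: Optional[str] = None   # hypothetical study identifier for controlled files


def can_access(data_file: DataFile, approved_studies: Set[str]) -> bool:
    """Open files are always readable; controlled files need a matching approval."""
    if not data_file.controlled:
        return True
    return data_file.study_id is not None and data_file.study_id in approved_studies


# Made-up examples: an open expression matrix and a controlled alignment file.
open_file = DataFile("expression_matrix.tsv", controlled=False)
controlled_file = DataFile("tumor_alignment.bam", controlled=True, study_id="study-A")

open_without_approval = can_access(open_file, set())
controlled_without_approval = can_access(controlled_file, set())
controlled_with_approval = can_access(controlled_file, {"study-A"})
print(open_without_approval, controlled_without_approval, controlled_with_approval)
```

In practice the approval set would come from the authorization service rather than being passed in by hand; the sketch only shows the decision that is made once those approvals are known.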
44,541 cases spanning 79 projects, and 69 primary tissue sites. The GDC provides harmonized and standardized molecular, biospecimen, and clinical data. The physical location of the GDC data is replicated on both the Amazon Web Services (AWS; used by SB-CGC) and Google Cloud Platform (GCP; used by FireCloud and ISB-CGC) for CR access. Tens of thousands of GDC raw data files and hundreds of higher level files are available in all three CRs for further analysis. In addition, genomic data from programs including Human Tumor Atlas Network (HTAN) and Childhood Cancer Data Initiative (CCDI) are available on the CDS. CDS data is stored in the AWS cloud, can be searched on the CDS Portal, and is available for analysis on the SB-CGC.

Proteomics Data
The NCI PDC serves as one of the most comprehensive proteomic data repositories currently available. The PDC provides highly curated and standardized biospecimen, clinical, and proteomic data. Reflecting the broad range of proteomic analysis, the PDC houses data representing diverse analytical fractions including global proteome, phosphoproteome, glycoproteome, acetylome, lipidome, and ubiquitylome derived from multiple experimental technologies. The PDC is currently hosting 134 studies, encompassing data from 19+ cancer types and more than 3,000 cases. Both raw and processed PDC data are openly accessible and available through all three CRs for further analysis. The PDC’s cloud-based infrastructure and application programming interface (API) facilitate interoperability.

Imaging Data
Imaging data within the CRDC represents a wide range of applications from clinical and preclinical imaging, radiological images such as CT, MRI, PET, digital pathology, and multispectral microscopy. Raw imaging data is processed, annotated, and modeled to support cross comparison and study. The IDC includes imaging data from several projects such as the TCGA, HTAN, and CCDI, with plans to add more in the future. The December 2023 data release from IDC included 142 collections representing more than 511,000 image series from 65,066 cases in a standardized Digital Imaging and Communications in Medicine format (DICOM). IDC data can be accessed directly on IDC’s portal and, for TCGA images, via ISB-CGC. CDS also hosts raw imaging data files from HTAN that are in non-DICOM format. All imaging data available have been deidentified of any patient information.

Multispecies data
The fourth data commons linked to CRs is the ICDC. The canine’s accelerated aging process and breed-specific cancer predisposition provide an interesting backdrop in which to study human disease. As of August 2023, the ICDC provides access to canine data consisting of genomic and transcriptomic data, as well as clinical and biospecimen metadata from nearly 700 cancer cases representing more than 80 different breeds. Studies include the PRE-medical Cancer Immunotherapy Network Canine Trials (PRECINCT) and the Comparative Oncology Program. All ICDC data is open access and can be accessed via SB-CGC.

Supporting multiple data modalities and analyses
The types of data generated in the course of biomedical research are diverse and wide ranging. To accommodate situations where data does not fit in the above data commons, and to support researchers’ compliance with data sharing policies, the NCI developed the CDS. This solution provides a flexible and responsive approach for researchers to quickly and securely share data, without the need to meet the requirements from the data commons. The CDS includes primarily molecular characterization, genomic profiling, and imaging data. As of August 2023, numerous datasets from the CCDI (https://www.cancer.gov/research/areas/childhood/childhood-cancer-data-initiative) as well

as HTAN (https://humantumoratlas.org/) are available through https://dataservice.datacommons.cancer.gov/, and are updated frequently.

Specialized Datasets
ISB-CGC hosts two specialized databases: The Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer (https://mitelmandatabase.isb-cgc.org/) and the TP53 Database (https://tp53.isb-cgc.org/). In addition, ISB-CGC maintains another separately located database, caNanoLab (https://cananolab.cancer.gov/). The Mitelman Database is the largest catalog of acquired chromosome aberrations available today, presently comprising >70,000 cases across multiple cancer types (6). The TP53 Database is a comprehensive database on variations in the tumor protein p53 gene (TP53), one of the most frequently mutated genes in human cancer (7). caNanoLab is a data sharing portal designed to facilitate information sharing across the international biomedical nanotechnology research community to expedite and validate the use of nanotechnology in biomedicine (8).

Interoperating with datasets from other NIH data commons
Researchers benefit from the breadth of cancer datasets described above but can also gain access, within the CRDC, to many other high-impact datasets across NIH. Other NIH Institutes and Centers (IC) have made similar investments in global standards and IC-specific, cloud-based data commons over the past decade (e.g., NHGRI, NHLBI, NCBI, NIH Common Fund). The NIH Cloud Platform Interoperability (NCPI) program was established to drive key standards and policy discussions across NIH to ensure researchers can analyze cloud-based datasets from each of the participating NCPI data commons without the need to download or move the data. Today this means that, within FireCloud and SB-CGC, authorized researchers with the appropriate dbGaP credentials are able to connect to other NIH data ecosystems [e.g., NHGRI’s AnVIL, NIH Common Fund’s Gabriella Miller Kids First, NHLBI’s BioData Catalyst, and NCBI’s Sequence Read Archive (SRA)] and seamlessly analyze the many datasets within these other NIH data commons alongside CRDC data, as well as their own. CRDC spans multiple cloud service providers (AWS, GCP), which means this external data can be accessed within an analysis workspace specific to that cloud service provider without incurring additional storage or access costs. In addition to allowing access to other NIH data commons’ datasets, both the CRDC and NCPI have invested in interoperability and standards. Specifically, CRDC and NCPI have actively participated in standards including Global Alliance for Genomics and Health (https://www.ga4gh.org/), NIH Researcher Auth Service (https://datascience.nih.gov/researcher-auth-service-initiative), and Fast Healthcare Interoperability Resources (https://fhir.org/), adopting those standards into production interfaces over time, and allowing for more seamless integration of data across NIH data ecosystems.

Cloud analysis workspaces and tools
The NCI Cloud Resources provide secure analytic capabilities for open and controlled access datasets within the CRDC. Here we outline shared and unique features related to workspaces, tools, analysis capabilities and performance, credits and billing for the CRs.

Workspaces
All three CRs provide user-controlled analytic sandbox environments that allow researchers to store and manage their data, tools, and pipelines, and run secure computations on all manner of data including open access, controlled access, and private data. For FireCloud (Supplementary Fig. S1) and SB-CGC (Supplementary Fig. S2) workspaces, users can invite collaborators to view (read-only permissions) or participate in their analysis (write/execute permissions). Collaborators must also be given appropriate access by workspace owners to enter a workspace containing controlled data, along with being authorized by dbGaP for any controlled data access. Analysts can choose from existing analysis tools and pipelines, as described below, or bring their own analytic tools and queries to their workspace, and create their own pipelines. All three CRs have extensive documentation on creating novel tools, including in writing [ISB-CGC doc (https://isb-cgc.appspot.com/programmatic_access/); FireCloud doc (https://support.terra.bio/hc/en-us/sections/7182576252315-Advanced-workflow-documentation); SB-CGC doc (https://docs.cancergenomicscloud.org/page/bring-your-own-tools-to-the-cancer-genomics-cloud)] and videos [Building an App (https://www.youtube.com/watch?v=x1YS0u1jtPg) and Editing a Workflow (https://www.youtube.com/watch?v=689JGWpjyH4)]. For ISB-CGC, the analytic sandbox access environment is controlled by the researchers through GCP native tools (Supplementary Fig. S3). Researchers acquire copies of NCI dbGaP controlled data through ISB-CGC and can add their own data, software tools, and collaborators to their own GCP project. For all three CRs, with the exception of free cloud credits, users are charged for their data storage and computation (see below), but CRDC-hosted data that resides outside of the CR workspaces (e.g., CRDC or NCPI data) is free to access.

Tools
Depending on the needs and computational skill set of the user, analysis can be carried out using publicly available analytic tools and/or bespoke analysis. In addition to the analytic tools themselves, utility tools and cloud-native application support are provided that enable users both to take advantage of command-line and GUI-based tools for management of data and resources, and to expand analytic capabilities beyond those provided by the resources through the use of tools such as highly scalable cloud-native machine learning. These apps and tools are regularly updated and evolve based on user feedback. Different versions of curated tools are available on the cloud platforms, and researchers are able to select the most up-to-date version or go back to a previous one as needed. The cloud compute costs for these analytic tools vary widely as they range from smaller scale data visualization to complex and highly parallelized data processing for calling variants from raw sequencing data. Each CR works closely with a researcher to provide cost information to develop a budget for their analyses. Users of the CRs can also upload their own tools to their CR workspaces. A detailed breakdown of analysis tool capabilities is shown in Table 2.
Secondary analysis capabilities, often referred to as pipelines or workflows, are provided in all three CRs through workflow languages such as Common Workflow Language (CWL), Nextflow, and Workflow Description Language (WDL). Each of these workflow systems has different benefits and drawbacks and is adopted by different research communities. Popular publicly available pipelines include analytical support for variant calling (e.g., whole genome DNA-seq), RNA sequencing (RNA-seq), machine learning, imaging, genome-wide association studies (GWAS), long-read data (copy-number variations/structural variants), and proteomics. Both platforms provide example analysis packages that can be used as tutorials to show users how to use such tools, and documentation about considerations such as cost. In addition to these curated public pipelines in FireCloud and SB-CGC, within all three CRs users are able to write their own

Table 2. Tool availability: summary representation of tools available to account holders in the Cloud Resources.
Tool category / Tools / Broad FireCloud / ISB-CGC / SB-CGC
Workflows
CWL Workflow support @ @ @
WDL Workflow support @ @ @
Nextflow Workflow support Coming soon @ @
Publicly available workflows from Dockstore @ @ @
Analysis types
Existing workflows and tools used by community
Broad FireCloud: Variant calling (long and short reads), GWAS, RNAseq, ML, Epigenomics, Fusion Detection
ISB-CGC: Variant calling (short reads), RNAseq, ML, CNV, Epigenomics, correlations using BigQuery derived datasets
SB-CGC: Variant calling (long and short reads), GWAS, Bulk RNAseq, Single-Cell RNAseq, ML, Epigenomics, Multiomics, Proteomics, Fusion Detection, Imaging Analysis
Tutorials
Example tool analysis projects @ @ @
Interactive applications
Jupyter @ @ @
RStudio @ @ @
RShiny Apps Coming soon @ @
Galaxy @ @
SAS Coming soon
Command line sessions Coming soon @ @
Interactive querying (BigQuery, etc) @ @
User-driven content
User written workflow support @ @ @
User created interactive apps Coming soon @ @
User defined project resources @ @
Analytic workspaces
APIs for scripting @ @ @
Bring your own data @ @ @
Access controlled data @ @ @
Cloud native tool support
Billing Cloud-specific Cloud-specific Integrated
Command line tools, e.g., gsutil @ @ via Python / R
Make use of Cloud-specific tools such as TensorFlow, BigQuery, etc. @ @ @
STRIDES support @ @ Coming soon
Note: Tools are broken down by category and status of tool availability within each CR.
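The interactive querying and correlation analyses listed in Table 2 for BigQuery-derived datasets follow a common pattern: aggregate in SQL, then finish the statistic in a scripting language. The sketch below uses Python's built-in sqlite3 module as a local stand-in for BigQuery so that it runs without cloud credentials; the table name expression, its columns, the gene labels, and the values are invented for illustration and do not correspond to any ISB-CGC table.

```python
import math
import sqlite3

# In-memory database standing in for a cloud data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (case_id TEXT, gene TEXT, value REAL)")

# Made-up expression values for two genes across four cases.
rows = [
    ("C1", "GENE_A", 1.0), ("C2", "GENE_A", 2.0), ("C3", "GENE_A", 3.0), ("C4", "GENE_A", 4.0),
    ("C1", "GENE_B", 2.0), ("C2", "GENE_B", 4.1), ("C3", "GENE_B", 5.9), ("C4", "GENE_B", 8.0),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)

# Pair the two genes per case and compute the sums needed for Pearson's r in SQL.
query = """
SELECT COUNT(*), SUM(a.value), SUM(b.value),
       SUM(a.value * b.value), SUM(a.value * a.value), SUM(b.value * b.value)
FROM expression AS a
JOIN expression AS b ON a.case_id = b.case_id
WHERE a.gene = ? AND b.gene = ?
"""
n, sx, sy, sxy, sxx, syy = conn.execute(query, ("GENE_A", "GENE_B")).fetchone()

# Finish the Pearson correlation coefficient from the aggregated sums.
pearson_r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(f"Pearson r between GENE_A and GENE_B: {pearson_r:.4f}")
```

Pushing the sums into the SQL engine is what allows this pattern to scale: only six numbers per gene pair leave the database, regardless of how many cases are stored.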
pipelines, or bring in additional pipelines through the Dockstore tool repository (https://dockstore.org/). These pipelines make use of the elastic scalability of the cloud to support resources well beyond what researcher computers or often institutional High Performance Computing clusters are capable of providing, thus reducing cost and democratizing the use of data by users who are working independently or at smaller institutions.
Tertiary analysis capabilities, often referred to as interactive analysis, are provided in FireCloud, ISB-CGC, and SB-CGC through both GUI and command-line tools that support rapid iterations by researchers to explore secondary data and derived scientific results. Many of the commonly used tools within the bioinformatics community are provided, including BigQuery, Galaxy, Jupyter notebooks, RStudio/RShiny, and SAS. Like pipelines, these tools provide the ability both to make use of publicly available analytic methods and to write customized analyses using languages such as Python, R, and SQL, including the enormously scalable analytic capabilities provided by Google’s BigQuery. Community-driven tools and libraries such as Bioconductor, NumPy, and pandas are often preinstalled to simplify the development of use-case-specific analyses. As with pipelines, these tools can make use of elastic compute within the cloud to scale up analyses and provide cost savings to the researcher.

Cloud-native tool support
This enables researchers to make use of functionality that is specific to a given cloud that goes beyond those provided by the CRs. This includes tools to manage data movement such as gsutil, Docker image storage and retrieval, and cloud-specific GUI interfaces for billing and resource monitoring. In addition, users are able to go beyond the out-of-the-box capabilities provided by these resources through tools such as cloud databases, cloud-native machine learning, and automation. These tools are managed by the cloud providers, have active communities and documentation, and continue to expand over time, and many researchers prefer to use them directly, even if not natively provided by the CRs.

Performance, credits, and billing
To help researchers estimate their cloud-based computational costs, each CR provides sample cost information. Some common pipelines within the respective platforms, as well as their time to complete and associated costs, include:
* ISB-CGC - performing six billion statistical correlations using BigQuery for $2 in 3 hours
* FireCloud - whole genome variant alignment and calling pipeline using 65 GB of data for $5 in 20 hours
* SB-CGC - bulk RNA-seq Transcription Profiling with differential expression analysis for $2 in 2 hours
In addition, to encourage cost-free experimentation on the CRs and to lower the barrier to cloud adoption, each CR provides access to free credits for new users. After the credits are used the researcher may

continue to utilize the CRs through a billing platform. Additional details on performance, free credits, and billing for each CR can be found in the supplementary material. Each CR has staff members available to answer any questions and work with researchers to address their individual needs.

Success Stories
Since the inception of NCI’s Cloud Resources, thousands of scientists worldwide have used the data, algorithms, and tools in the cloud to gain insights into the mechanisms of cancer, develop and make available new, more powerful algorithms to speed cancer research, and to monitor and assess clinical research. Hundreds of publications have cited the use of CRDC and the CRs (https://datacommons.cancer.gov/publications/selected-publications), and the cancer research community continues to partner with the CRs to further enable their research on the cloud. Cumulatively, the three CRs each have very significant computational and community usage, which is detailed in the Supplementary Data of this review, all speaking to the success of the CRs in providing a needed cloud-based cancer analysis platform.
Through the CRs, CRDC researchers can utilize the cloud’s ‘computation as needed’ power, as well as the CRDC’s colocated NCI datasets to securely analyze their own data in their own workspaces, using available computational pipelines. Below are just a few examples of success stories where researchers have used their own data or NCI datasets and the CRs’ computational power to discover new biological insights:
SB-CGC
* Identification of DNA damage response correlates of LINE-1 expression in breast, ovarian, endometrial, and colon cancers using multi-omic data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The researchers then validated the potential for LINE-1 overexpression to trigger RAD50 phosphorylation in the lab (9)
* Elucidation of tandem repeat expansions in 2,622 cancer genomes spanning 29 cancer types. Furthermore, in preliminary experiments treating cells that harbor a certain recurrent repeat expansion with a GAAA-targeting molecule led to a dose-dependent decrease in cell proliferation (10)
* Identification of a type of decay machinery responsible for removing AGO-associated miRNAs. These AGO-associated miRNAs are involved in regulating gene expression in TCGA cancer patients with synonymous or missense mutations on AGO2 (11)
Broad FireCloud
* Identification of a radiation-related genomic profile of papillary thyroid carcinoma (12)
* Elucidation of distinct patterns of rare coding pathogenic variants in Ewing sarcoma (13)
* Development of a machine learning framework to estimate tumor mutational burden from RNA-seq in a tumor without a matched normal sample (14)
ISB-CGC
* Development of a rare genetic risk score based on copy number variations for glioblastoma multiforme (15)
* Development of a genetic risk score based on chromosomal-scale length variation of germline DNA (using Affymetrix SNP 6.0 array data and copy-number variation) for predicting whether or not a woman will develop ovarian cancer (16)
* Verification of the enrichment of multiple investigational and hypothetical resistance mechanisms in treated and nontreated patients from a pan-cancer cohort of 1,031 refractory metastatic tumors. The verification of these mechanisms confirmed their putative role in treatment resistance (17)
Researchers have made tools and data available on the cloud to more rapidly and easily gain insights into the mechanisms of cancer, including those listed in Supplementary Table S2. Of note, many of the tools used in the research above have been made easily usable by the research community in the library of tools available in the CRs.
The CRs also provide tools and readily formatted data for use in monitoring and assessing clinical research, including performing liquid biopsy detection of genomic alterations in pediatric brain tumors from cell-free DNA in peripheral blood, cerebrospinal fluid, and urine, using the Broad’s FireCloud (18), finding the best biomarkers of drug response within a breast cancer clinical trial, using the ISB-CGC Cloud Resource (19), and cataloging patient-derived xenograft models in PDXNet portal (20).
As our CRs continue to work closely with the cancer research community and other CRDC components, we will continue to develop, make available, easily enable, and demonstrate more tools and computational approaches, and increase the findability, usability, interoperability and availability of NCI datasets to make the CRDC data ecosystem more useful to researchers worldwide.

Training, Outreach, and Education
As members of the CRDC, our goal is to serve all types of users and contribute to NCI’s mission of ensuring access to cancer resources. As highlighted in the NCI Cancer Plan (https://nationalcancerplan.cancer.gov/): “to accelerate cancer research we must work together to develop strategies, share knowledge, and accelerate progress.” To facilitate adoption and use of the CRs, we offer a range of services from one-on-one scientific consultation with our team of bioinformaticians, weekly drop-in office hours where users ask questions and get support, to larger in-class and online workshops. Here, we provide a summary of some of the teaching events and lectures offered to students and faculty at research universities and global intergovernmental organizations, and provide some metrics of success in improving cloud computing literacy.
Through training, lectures, and university demonstrations, the CRs have taught undergraduate and graduate students, postdoctoral fellows, professors, and staff scientists the latest cloud technologies to leverage high-throughput data streams. Together with faculty, we incorporate the CRs in lesson plans, creating a lecture series that goes from biological concepts to posing a research question to using cloud computing. For example, ISB-CGC worked with George Washington University to give an overview of CRDC and how to work with large datasets using BigQuery and SQL. SB-CGC designed courses with faculty at Purdue University, Georgetown University, University of California, Davis, and Brigham Young University, giving lectures to students, postdocs, clinicians, and researchers covering topics such as RNA-seq, GWAS, imaging machine learning, and proteomics. Students learn how to access CRDC data, upload their own data, identify the best tool to answer their question, and visualize their results without leaving the CRDC ecosystem. Various attendees have incorporated the CRs into their research (see Supplementary Data for details).

Several organizations outside of the United States have also shown interest in the CRDC infrastructure and have requested training sessions. The ISB-CGC participated in four half-day events educating researchers at the European Molecular Biology Laboratory (EMBL) about the CRDC and CRs. EMBL consists of more than 80 independent research groups with expertise in molecular biology. The ISB-CGC demonstrated how to use BigQuery to access data, and how to use SQL and R to interact with the data on the cloud platform. Likewise, the SB-CGC participated in the Data Science for Health Discovery and Innovation in Africa Initiative (DS-I Africa), which supports a robust pan-continental network of data scientists and technologies to apply advanced data science skills and transform health. At this training, the attendees performed a bulk RNA-seq analysis using publicly available data and ran a machine learning imaging analysis using Python/Jupyter Labs. All attendees were successful at running their analysis and several continued using the SB-CGC for their research.

Synopsis and Future Implications

In summary, the CRs provide a cloud-based platform where cancer reference datasets can be securely analyzed in conjunction with a researcher’s own data, as well as with reference sets from other NIH ICs. We have described how each of the CRs has a different user focus, offers both distinct and shared datasets, and provides computational resources and tools. This breadth of resources allows cancer researchers to choose the right resources for their needs and skill sets. Each of the CRs provides support mechanisms that can assist laboratories in using tools and CRDC data, as well as teaching resources to support the education of future researchers. Our presence on the cloud democratizes access to huge datasets and powerful computational resources so that data can be securely analyzed and shared, and new insights into the causes, diagnoses, and treatments of cancer can be published and made public. Our close collaborations with all components within the CRDC will continue to be key to enabling highly curated data with appropriate targeted analysis tools to be made available to the worldwide cancer research community.
In the future, the NCI Cloud Resources will continue to collect and provide new datasets and data types for combined analysis with researchers’ data to bring even more insights. Close collaboration with the cancer research community will ensure that we make available data and tools that are relevant, timely, robust, and easy to use. Working with other teams at CRDC, we will further enhance the terms that are used to describe the datasets so that more powerful analyses can be performed by more easily combining datasets and analyzing them. Availability of easily findable, interoperable, and computable data that feeds readily into already existing or newly created Artificial Intelligence and Machine Learning algorithms is key to advancing the understanding of cancer. The NCI Cloud Resources will continue to work with the research community to make the CRDC datasets more available in order to combine these with new data using novel analysis techniques for unique insights into cancer.

Authors’ Disclosures
D.A. Pot reports other support from GDIT during the conduct of the study. Z.F. Worman reports other support from Velsera during the conduct of the study. B.N. Davis-Dusenbery reports grants and other support from NCI during the conduct of the study, and employee and equity holder in Velsera. J. Otridge reports other support from NCI during the conduct of the study. J.S. Barnholtz-Sloan reports other support from NIH/NCI during the conduct of the study. No disclosures were reported by the other authors.

Disclaimer
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government.

Acknowledgments
We appreciate all former members of Cloud Resources and Cancer Research Data Commons; specifically, we would like to acknowledge Daoud Meerzaman, Natalie Madero, Sheila Reynolds, Manisha Ray, Nicole Bolliger, Annie Kuan, and Cara Mason. The full list of CRDC Program consortium members can be found in the Supplementary Data. ISB-CGC is funded in whole or in part with federal funds from the NCI, NIH, Department of Health and Human Services, under contract no. HHSN261201400008C and ID/IQ agreement no. 17X146 under contract no. HHSN261201500003I. SB-CGC is powered by Seven Bridges and is funded in whole or in part with federal funds from the NCI, NIH, Department of Health and Human Services, under contract no. HHSN261201400008C and ID/IQ agreement no. 17X146 under contract no. HHSN261201500003I. Broad FireCloud is funded in whole or in part with federal funds from the NCI, NIH, Department of Health and Human Services, under contract no. HHSN261201500003I.

Note
Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).

Received September 9, 2023; revised January 26, 2024; accepted March 5, 2024; published first March 15, 2024.
References
1. Kim E, Davidsen T, Davis-Dusenbery BN, Baumann A, Maggio A, Chen Z, et al. NCI cancer research data commons: lessons learned and future state. Cancer Res 2024;84:1404–9.
2. Heath AP, Ferretti V, Agrawal S, An M, Angelakos JC, Arya R, et al. The NCI genomic data commons. Nat Genet 2021;53:257–62.
3. Thangudu RR, Rudnick PA, Holck M, Singhal D, MacCoss MJ, Edwards NJ, et al. Proteomic data commons: a resource for proteogenomic analysis [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27–28 and Jun 22–24. Philadelphia (PA): AACR; 2020. Abstract nr LB-242.
4. Fedorov A, Longabaugh WJR, Pot D, Clunie DA, Pieper S, Aerts HJWL, et al. NCI imaging data commons. Cancer Res 2021;81:4188–93.
5. Wang Z, Davidsen T, Kuffel G, Addepalli K, Bell A, Casas-Silva E, et al. NCI cancer research data commons: resources to share key cancer data. Cancer Res 2024;84:1388–95.
6. Wang J, Zheng J, Lee EE, Aguilar B, Phan J, Abdilleh K, et al. A cloud-based resource for genome coordinate-based exploration and large-scale analysis of chromosome aberrations and gene fusions in cancer. Genes Chromosomes Cancer 2023;62:441–8.
7. Andrade KCd, Lee EE, Tookmanian EM, Kesserwan CA, Manfredi JJ, Hatton JN, et al. The TP53 database: transition from the international agency for research on cancer to the US national cancer institute. Cell Death Differ 2022;29:1071–3.
8. Ke W, Crist RM, Clogston JD, Stern ST, Dobrovolskaia MA, Grodzinski P, et al. Trends and patterns in cancer nanotechnology research: a survey of NCI’s CaNanoLab and nanotechnology characterization laboratory. Adv Drug Deliv Rev 2022;191:114591.
9. McKerrow W, Wang X, Mendez-Dorantes C, Mita P, Cao S, Grivainis M, et al. LINE-1 expression in cancer correlates with P53 mutation, copy number alteration, and S phase checkpoint. Proc Natl Acad Sci U S A 2022;119:e2115999119.
10. Erwin GS, Gürsoy G, Al-Abri R, Suriyaprakash A, Dolzhenko E, Zhu K, et al. Recurrent repeat expansions in human cancer genomes. Nature 2023;613:96–102.
11. Yang A, Shao T-J, Bofill-De Ros X, Lian C, Villanueva P, Dai L, et al. AGO-bound mature MiRNAs are oligouridylated by TUTs and subsequently degraded by DIS3L2. Nat Commun 2020;11:2765.
12. Morton LM, Karyadi DM, Stewart C, Bogdanova TI, Dawson ET, Steinberg MK, et al. Radiation-related genomic profile of papillary thyroid carcinoma after the chernobyl accident. Science 2021;372:eabg2538.
13. Gillani R, Camp SY, Han S, Jones JK, Chu H, O’Brien S, et al. Germline predisposition to pediatric ewing sarcoma is characterized by inherited pathogenic variants in DNA damage repair genes. Am J Hum Genet 2022;109:1026–37.
14. Katzir R, Rudberg N, Yizhak K. Estimating tumor mutational burden from RNA-sequencing without a matched-normal sample. Nat Commun 2022;13:3092.
15. Ko C, Brody JP. A genetic risk score for glioblastoma multiforme based on copy number variations. Cancer Treat Res Commun 2021;27:100352.
16. Toh C, Brody JP. Genetic risk score for ovarian cancer based on chromosomal-scale length variation. BioData Mining 2021;14:18.
17. Pradat Y, Viot J, Yurchenko AA, Gunbin K, Cerbone L, Deloger M, et al. Integrative pan-cancer genomic and transcriptomic analyses of refractory metastatic cancer. Cancer Discov 2023;13:1116–43.
18. Pages M, Rotem D, Gydush G, Reed S, Rhoades J, Ha G, et al. Liquid biopsy detection of genomic alterations in pediatric brain tumors from cell-free DNA in peripheral blood, CSF, and urine. Neuro-oncol 2022;24:1352–63.
19. O’Grady N, Gibbs DL, Abdilleh K, Asare A, Asare S, Venters S, et al. PRoBE the cloud toolkit: finding the best biomarkers of drug response within a breast cancer clinical trial. JAMIA Open 2021;4:ooab038.
20. Koc S, Lloyd MW, Grover JW, Xiao N, Seepo S, Subramanian SL, et al. PDXNet portal: patient-derived xenograft model, data, workflow and tool discovery. NAR Cancer 2022;4:zcac014.

AACRJournals.org Cancer Res; 84(9) May 1, 2024 1397
