
ARTICLE

Cloud-based biomedical data storage and analysis for genomic research: Landscape analysis of data governance in emerging NIH-supported platforms
Jacklyn M. Dahlquist,1 Sarah C. Nelson,2,3,* and Stephanie M. Fullerton1,*

Summary

The storage, sharing, and analysis of genomic data poses technical and logistical challenges that have precipitated the development of cloud-based computing platforms designed to facilitate collaboration and maximize the scientific utility of data. To understand cloud platforms' policies and procedures and the implications for different stakeholder groups, in summer 2021, we reviewed publicly available documents (N = 94) sourced from platform websites, scientific literature, and lay media for five NIH-funded cloud platforms (the All of Us Research Hub, NHGRI AnVIL, NHLBI BioData Catalyst, NCI Genomic Data Commons, and the Kids First Data Resource Center) and a pre-existing data sharing mechanism, dbGaP. Platform policies were compared across seven categories of data governance: data submission, data ingestion, user authentication and authorization, data security, data access, auditing, and sanctions. Our analysis finds similarities across the platforms, including reliance on a formal data ingestion process, multiple tiers of data access with varying user authentication and/or authorization requirements, platform and user data security measures, and auditing for inappropriate data use. Platforms differ in how data tiers are organized, as well as the specifics of user authentication and authorization across access tiers. Our analysis maps elements of data governance across emerging NIH-funded cloud platforms and as such provides a key resource for stakeholders seeking to understand and utilize data access and analysis options across platforms and to surface aspects of governance that may require harmonization to achieve the desired interoperability.

Introduction

Individual-level genomic, environmental, and linked phenotypic and health outcome data are being generated at an unprecedented pace and scale in human biomedical research. The storage, sharing, and analysis of such data poses profound technical and logistical challenges that have precipitated the development of new cloud-based computing and storage platforms designed to facilitate collaboration and maximize the scientific utility of costly-to-generate genomic and linked clinical data (Figure 1). Compared with pre-existing data sharing mechanisms such as the National Center for Biotechnology Information (NCBI) database of Genotypes and Phenotypes (dbGaP), emerging cloud-based platforms offer new and potentially more efficient alternatives for accessing, storing, and analyzing data, yet their specific policies and practices are not widely known, and the extent to which they adhere to previously proposed key functions of good genomic governance remains unexamined.1

We define a "cloud-based platform" as one that pairs cloud-based data storage with search and analysis functionality via cloud-based workspaces and portals. While individual components providing data access, storage, and analysis capabilities may be shared across different platforms, we identify a "platform" as a centralized system for data sharing associated with a specific NIH Institute or research initiative. (Notably, some of the entities we refer to as "platforms" may alternatively be described as "ecosystems," in recognition of their multiple components.) At the time of analysis, there were five such platforms: the NIH Office of the Director's All of Us Research Hub (AoURH), the National Human Genome Research Institute's (NHGRI) Analysis Visualization and Informatics Lab-space (AnVIL), the National Heart, Lung, and Blood Institute's (NHLBI) BioData Catalyst (BDC), the National Cancer Institute's (NCI) Genomic Data Commons (GDC), and the NIH Common Fund Kids First Data Resource Center (Kids First DRC) (see Table 1).

At least two key differences between cloud-based platforms and pre-existing data sharing mechanisms merit attention. First, cloud-based platforms "invert" data sharing in that users come to data stored in central cloud locations for analysis, rather than downloading data to store and analyze locally.2 Second, to expedite data access and analysis, streamlined mechanisms are being developed for both user authentication and authorization for use of data stored on such platforms.3,4 Borrowing some features from traditional models such as dbGaP but innovating others, cloud-based platforms therefore represent a partial continuation of prior genomic data sharing practices but also a sea change for data stewards in the novel ways that users can find, access, and analyze data.

1 Department of Bioethics and Humanities, University of Washington School of Medicine, Seattle, WA 98195, USA; 2 Department of Biostatistics, University of Washington, Seattle, WA 98195, USA; 3 Lead contact
*Correspondence: [email protected] (S.C.N.), [email protected] (S.M.F.)
https://doi.org/10.1016/j.xhgg.2023.100196
© 2023 The Author(s). This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).



Figure 1. Traditional (left) versus cloud-based biomedical data sharing (right)
In the traditional model, data are downloaded from a central repository and stored and analyzed locally. In the cloud-based model, data are stored and analyzed remotely in cloud environments.

A clear understanding of these platforms' policies and practices is necessary to start unpacking the implications for many stakeholder groups, including research participants; researchers (data contributors and platform users); policymakers; funders; and researchers' institutions, which may be held accountable for data contribution and uses. It is also crucial as researchers and institutions begin to navigate the new NIH Data Management and Sharing Policy.5 The purpose of this paper is to (1) describe current data governance practices of emerging cloud-based platforms while (2) comparing these practices across new and pre-existing mechanisms to identify potential challenges and tradeoffs.

Methods

This study used a cross-sectional qualitative directed content analysis of publicly available documents as they were available in June and July of 2021.6 The purpose of this analysis was to identify policies and practices of cloud-based genomic platforms in regard to data submission, data ingestion, user authentication and authorization, data security, data access, auditing, and sanctions.

Platforms
We included five cloud-based platforms in our search that met our cloud platform definition above and were in development and/or early stages of active use at the time of our analysis: AoURH, AnVIL, BDC, GDC, and Kids First DRC. To situate these cloud-based platforms in the context of established data sharing mechanisms, we also included in our analysis dbGaP, whose data access request and review systems are also used by several cloud platforms. Initiatives that do not represent discrete platforms for data storage and analysis were excluded. We also did not include cloud-based platforms specific to a given academic research institution (e.g., the St Jude Cloud).7

Document sampling
To find relevant documents, we searched the public-facing platform websites, as well as PubMed, preprint servers, and science news websites (see Table 1 for list and URLs). Search terms included widely recognized categories for data governance including: "data access," "data use," "data sharing," "auditing," and "permissions." Documents were included if they relayed platform policies, procedures, or other governance-related topics, and they were excluded if they were summarized versions of longer documents, did not focus on the platform itself, or appeared to be primarily a form of marketing communication (e.g., press release) that promoted the platform's achievements rather than described how it works. Our final document count was as follows: AoURH n = 17, AnVIL n = 12, BDC n = 21, Kids First DRC n = 7, GDC n = 20, and dbGaP n = 17. All documents so identified (N = 94) were downloaded between June and July 2021 and archived for consistency (for the full document breakdown, see Table S1). We recognize that platform documentation and policies are evolving, and therefore some of the information presented here may be incomplete and/or out of date. Please see the "limitations" section for more information.

Analysis
Platform documents were coded and analyzed in ATLAS.ti8 using a codebook based on selected background literature.1,9–11 The codebook covered topics such as data protections, how data are made available and accessible, and platform history and organization. Some code examples include data access, roles and responsibilities, data ingestion, and auditing. Two coders (J.D. and S.N.) double-coded 12 of the same documents, two from each platform/mechanism, in order to assess inter-coder reliability and the robustness of the codebook (i.e., codes and definitions). After the initial pilot coding and subsequent minor adjustments to the codebook, J.D. coded the remaining documents with oversight from the rest of the research team.



Table 1. Platforms included in the document review (full name and abbreviation), primary funding body, and platform website (URL) at which document search was performed

Platform name | Platform abbreviation | Primary funder/NIH Institute | URL at which document search was performed
All of Us Research Hub | AoURH | NIH Office of the Director | https://www.researchallofus.org
Analysis Visualization and Informatics Lab-space | AnVIL | National Human Genome Research Institute (NHGRI) | https://anvilproject.org
BioData Catalyst | BDC | National Heart, Lung, and Blood Institute (NHLBI) | https://bdcatalyst.gitbook.io/biodata-catalyst-documentation
Kids First Data Resource Center | Kids First DRC | NIH Common Fund | https://kidsfirstdrc.org
Genomic Data Commons | GDC | National Cancer Institute (NCI) | https://gdc.cancer.gov
database of Genotypes and Phenotypes | dbGaP | National Center for Biotechnology Information (NCBI) | https://www.ncbi.nlm.nih.gov/books/NBK5295

Additional non-platform websites searched for documents (relevant peer reviewed and/or lay media) to analyze included PubMed, bioRxiv, medRxiv, Nature News, and Science News.

Results

Platform policies were compared across seven categories of data governance: data submission, data ingestion, user authentication and authorization, data security, data access, auditing, and sanctions. The results of these comparisons are described below and summarized in Table 3.

Data submission
Data submission is the act of data generators providing their data to a platform (see Table 2 for summary definitions of terms provided in italics in this section). Since 2008, all NIH-funded, high-throughput genomic studies have been required to submit their data to NIH-designated repositories such as dbGaP.12 Exceptions to this policy are made on a case-by-case basis, and non-NIH-funded studies are accepted into dbGaP at the discretion of NIH Institutes and Centers.12 Data submission must follow the sharing requirements and timelines in the NIH Genomic Data Sharing (GDS) policy.13

All platforms considered here require data generators to register their studies with dbGaP, with the exception of AoURH, which only contains data generated from the All of Us Research Program and does not accept external submissions. The usual dbGaP study registration process includes ethical oversight by the submitter's Institutional Review Board (IRB) and dialog with an NIH Program Officer and a Genomic Program Administrator (GPA). Additional submission details from platform-specific documentation are as follows. Data generators wishing to deposit data within AnVIL must get approval from the AnVIL Ingestion Committee. That Committee assesses whether the data are a good fit; this is determined in part by the amount of data, ethical oversight during data collection, how participants were consented, and what data use limitations (DULs) are included.14 BDC documentation highlights that when submitting data, data generators must work with an NHLBI GPA and register their data with dbGaP, the "central registration authority" for BDC.15 The GDC accepts data from genomic cancer studies; priority is given to new data types, and the study's size, quality, and ability to further understanding of cancer are taken into account.16 The Kids First DRC notes that "projects that allow for the broadest leveling of sharing ... will be prioritized for Kids First Support" and states that restrictions like disease-specific consent or requiring a letter of collaboration "impede the ability for the Kids First program to accomplish its goals."17

Data ingestion
Once data are submitted and accepted, they go through an intake or data ingestion process before becoming available to researchers. While not defined explicitly or consistently across platforms, data ingestion generally entails transforming, cleaning, processing, harmonizing, indexing, and/or otherwise curating data submitted by data generators in order to make it accessible on a platform. Platforms typically require data to undergo quality control and harmonization as part of ingestion. Data harmonization ensures that data from different studies and generators are compatible. Data are also indexed as part of the ingestion process, which involves assigning unique identifiers to each data file.18 AoURH uses the Observational Medical Outcomes Partnership Common Data Model to harmonize data before taking further steps to "ensure participant privacy is protected."19 AnVIL's Data Ingestion Committee, which includes AnVIL team members and NHGRI program officers, evaluates applications for ingestion as a form of quality control.20 BDC, while recognizing that their data does go through quality control, states that they are data "custodians" and "cannot control the quality of data ingested."21 BDC contains data ingested from dbGaP or directly from participating consortia, and it has plans for a separate center, the Data Management Core, that will help researchers with ingestion requirements.15 The GDC, while not addressing quality control explicitly, notes that it can take up to 6 months after processing data before it is released to researchers.22 Kids First DRC reports plans to use the Human Phenotype Ontology and NCI Thesaurus for phenotype harmonization, and it lists a variety of workflows that can be used for genomic harmonization.17
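To make the indexing step concrete, the following is a minimal sketch of what assigning unique identifiers to ingested files might look like; the record fields, identifier scheme (UUIDs plus an MD5 checksum), and file names are illustrative assumptions rather than any platform's documented implementation.

```python
import hashlib
import json
import uuid
from pathlib import Path

def index_submission(file_paths):
    """Assign a unique identifier and checksum to each submitted file.

    Returns index records of the kind a platform might store so that
    ingested files can later be discovered and referenced.
    """
    records = []
    for path in file_paths:
        data = Path(path).read_bytes()
        records.append({
            "file_id": str(uuid.uuid4()),           # unique identifier for this file
            "name": Path(path).name,
            "size_bytes": len(data),
            "md5": hashlib.md5(data).hexdigest(),   # integrity check after transfer
        })
    return records

if __name__ == "__main__":
    # Hypothetical submitted files; replace with real paths before running.
    print(json.dumps(index_submission(["sample1.vcf.gz", "phenotypes.csv"]), indent=2))
```

Harmonization, by contrast (e.g., mapping source records to a common data model such as OMOP), is a separate and typically far more involved step.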



Table 2. Working definitions of key terms and concepts used in this paper, created from the authors' understanding of the concepts as well as information provided in the documents analyzed

Platform: A centralized system pairing cloud-based data storage with search and analysis functionality via cloud-based workspaces and portals. Associated with a specific NIH Institute or research initiative. May also be referred to as an "ecosystem."
Data submission: The act of data generators providing their data to a platform.
Data ingestion: The action of obtaining, importing, transforming, cleaning, processing, and otherwise curating submitted data to make it available on a platform.
Data harmonization: Ensures the compatibility of data from different submitters and/or studies via quality control, processing, and post-processing.
Data indexing: Assigning unique identifiers to data to support efficient discovery of said data.
Data curation: The process of receiving, transferring, organizing, integrating, removing, and preserving data residing within a platform.
Open tier: Open to anyone, without the need to register or authenticate one's identity. Contains aggregate, de-identified data.
User authentication tier: Requires users to log in with a username and password, usually via an external user account, but does not include further requirements. Data vary from aggregate de-identified data to de-identified individual data.
User authorization tier: Requires users to not only authenticate their identity but also be authorized to access the tier through a special request (such as a DAR).

User authentication and authorization
All platforms divide ingested data into two or three data tiers; however, the types of data contained in those tiers and the requirements to access them vary between platforms. Open tiers are open to anyone, without the need to register or otherwise authenticate the identity of the data user. User authentication tiers require users to log in with a username and password, usually via an external user account, but they do not include further requirements or subsequent identity verification. User authorization tiers require users to not only authenticate their identity but also to be authorized to access the tier, usually through a specific request (such as a data access request [DAR]). User authorization tiers are included in every platform; platforms vary with regard to their use of open and/or user authentication tiers.

Open tiers
AoURH, GDC, Kids First DRC, and dbGaP have open tiers. These open tiers allow users to browse the studies contained in these tiers without logging in or making any special requests for data access or use. AoURH's open tier allows users to view "summary statistics and aggregate information that poses negligible risks to the privacy of research participants."23 With the GDC and Kids First DRC open tiers, users can search for studies or use open datasets.24–26

User authentication tiers
AnVIL, BDC, and Kids First DRC have user authentication tiers. Data contained in these tiers vary from aggregate de-identified data to de-identified individual data. AnVIL and BDC, which primarily provide aggregate de-identified data or unrestricted individual-level data to this tier (e.g., 1000 Genomes Project data in AnVIL), each require authentication with Google, ORCID, or eRA Commons credentials (see Table 3).15,27–30 The Kids First DRC has a "KidsFirst" tier that allows users to "search de-identified data" and can be accessed by logging in using Google, ORCID, Facebook, or LinkedIn credentials.24,31 This is the only platform in our analysis that allows user authentication using Facebook or LinkedIn credentials.

User authorization tiers
All platforms have user authorization tiers. User authorization tiers require that users both authenticate their identity and submit a request to access data for a specific research use, such as a DAR (see data access below for more detail). In all cases, access to individual de-identified data is allowed upon authorization.

The AoURH differs from other platforms in having two different user authorization tiers that, at the time of this analysis, both require an eRA Commons account. First is the "Registered" tier, which contains individual-level data and requires an eRA Commons account and a request to access the data.23,32 AoURH also has a second "Controlled" tier that contains individual-level genetic data, which, according to AoURH, requires more "stringent" user authentication than the individual-level data in their "Registered" tier.32 At the time of review, to access this "Controlled" tier, users must log in with an eRA Commons account as well as be "appropriately accredited" and obtain "additional approval."23 What this extra accreditation and approval might entail was unclear from the documentation available to our review. We consider both of these tiers to be authorization rather than authentication tiers because of the required project description to initiate a workspace in either tier, meaning that access to data is tied to a specific proposed use analogous to a dbGaP Research Use Statement.

Another variation is AnVIL's "Consortium" tier, which is only open to members of specific research consortia who have placed their data on the platform, typically ahead of release to the scientific community. Consortia members gain access directly from a consortium official, and it is the responsibility of the consortium to manage who is allowed access to these data.33
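Read together with the definitions in Table 2, the tier logic described above reduces to a simple decision rule. The sketch below is our illustration of that rule, not code from any platform, and the accession string is a hypothetical dbGaP-style identifier.

```python
def can_access(tier: str, authenticated: bool, approved_datasets: set, dataset_id: str) -> bool:
    """Access logic implied by the three tier types described above."""
    if tier == "open":
        return True                                  # no registration or login required
    if tier == "authentication":
        return authenticated                         # login only; no further review
    if tier == "authorization":
        # must be authenticated AND hold an approved request (e.g., a DAR) for this dataset
        return authenticated and dataset_id in approved_datasets
    raise ValueError(f"unknown tier: {tier}")

# A logged-in user with no approved data access request.
# "phs000001" is a hypothetical dbGaP-style accession used only for illustration.
print(can_access("open", True, set(), "phs000001"))           # True
print(can_access("authentication", True, set(), "phs000001")) # True
print(can_access("authorization", True, set(), "phs000001"))  # False: approval still needed
```

The platform-specific differences lie mainly in which credentials count as authentication and which approvals (a DAR, a DURA plus Authorized User status, or consortium membership) count as authorization.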



Table 3. Summary table of document review results, organized by platform and by key features of data governance

AoURH
- Data submission: all data provided through the AoU Research Project
- Data ingestion: uses Observational Medical Outcomes Partnership Common Data Model to harmonize data; further cleans data to protect participant privacy
- User authentication and authorization: no login for public tier; eRA Commons for registered tier; plans to change in the future; eRA Commons and more stringent access requirements than registered tier, users must be appropriately accredited and obtain separate authorization, for controlled tier
- Data security: user: no data screenshots, don't publish or download ppt-level data, and don't share logins; platform: data is de-identified (though the platform notes that the term "de-identified" may not be fully accurate to describe its data); analysis of registered and/or controlled data only permitted on the platform
- Data access: three access tiers: "public," "registered," and "controlled"; applicable institutional AoU DUA (referred to as a "Data Use and Registration Agreement" or DURA) required; researchers must complete research training (referred to as "Responsible Conduct of Research Training"), agree to DUCC, prove ID, share contact info and affiliations, and provide consent for release of this info; access authorization determined via "data passport" (user based); no DACs (project based)
- Auditing: reviews by RAB determine DUCC compliance; corrective action is recommended; all user uploaded work will be logged and monitored; anyone, including researchers and the public, can ask RAB to review a study
- Sanctions: termination of your account, public posting of your name and affiliation, user's institution notified along with NIH or other federal agencies; financial or legal repercussions; other sanctions

AnVIL
- Data submission: must get approval from NHGRI and the AnVIL Ingestion Committee; data must conform to the NIH GDS Policy; participants must be explicitly consented for data sharing; the AnVIL Ingestion Committee assesses data; study must be registered in dbGaP
- Data ingestion: data ingestion committee evaluates applications and coordinates with dataset stewards (unspecified) to determine time frame for retention of data, long term storage, archival, and availability of data; uses Gen3 to ingest data
- User authentication and authorization: Google account for open tier; eRA Commons for controlled tier; members of consortia are granted access directly by a designated consortium official
- Data security: user: don't re-identify ppts; platform: two-factor authentication, all data covered by Certificate of Confidentiality, systems are independently tested annually, system is continually tested and scanned, and is consistent with NIH Security Best Practices and GDS Policy
- Data access: three access tiers: "open," "controlled," and "consortium"; three authorized user groups: developers, consortia, and external researchers; submit DAR and agree to DUC; DAC determines access via DAR, dbGaP consent codes, and DULs; piloting DUOS to streamline DAR approval; upload data in accordance with all national, tribal, state laws, and relevant institutional policies; consent groups placed into different workspaces
- Auditing: potential DMIs must be reported to DAC within 24 hours; Terra & Gen3 log access to data, go through audits, and are monitored for abnormal use; all activities are logged and regularly reviewed/monitored
- Sanctions: access suspended or terminated and user's institution notified

BDC
- Data submission: study must be registered in dbGaP
- Data ingestion: data "streamed in real time" or ingested in batches; BDC is a custodian of data, and it cannot control quality of data ingested; Data Management Core works with data providers to assess data with the intent of harmonizing; datasets are added by a user, ingested from a controlled source, or transferred from collaborative programs
- User authentication and authorization: eRA Commons, Google, or ORCID for open tier; eRA Commons for controlled tier
- Data security: user: don't re-identify ppts and don't share logins; platform: Public Trust Clearance used for all staff and contractors, data within the cloud is encrypted, cannot download controlled access participant level data
- Data access: two access tiers: "open" and "controlled"; submit DAR to dbGaP; Cloud Use Statement may be required; DAC determines access via DAR, dbGaP consent codes and DULs
- Auditing: all activities are logged and regularly reviewed/monitored
- Sanctions: user institutions are accountable and may be subject to sanctions

GDC
- Data submission: accepts data from different cancer study groups; data submission adheres to the NIH and NCI GDS policies; aggregate data for patients aged 90+; submissions reviewed by considering a study's size, quality, compatibility with data already hosted, and likely impact on the field; any investigator or consortium with cancer genomic data can apply for data submission; data submitters understand and agree that data will be made available to the scientific community; data submitters retain ownership of their data
- Data ingestion: submitted data is processed, validated, and harmonized before being hosted
- User authentication and authorization: no login for open tier; eRA Commons for controlled access
- Data security: user: don't re-identify ppts and comply with your DUA; platform: data is de-identified according to Health and Human Services Safe Harbor guidelines, GDC does not house electronic health records and does not accept data for ppts over 90 years old
- Data access: two access tiers: "open" and "controlled"; apply via dbGaP and DAC approves/denies; agree to DUA and NIH GDS policy and submit data sharing plan
- Auditing: GDC DMI standard operating procedure is referenced but details of auditing are not specified
- Sanctions: data access is removed if data are discovered to contain PHI or PII or if data are shared out of compliance with sharing conditions set by DAC; sanctions for inappropriate use not specified

Kids First DRC
- Data submission: DULs that impede the ability to access, use, or analyze data will not be prioritized; data consented only for disease-specific research, data that require a letter of collaboration, or data that require local IRB approval will not be prioritized; projects that allow for broad data sharing will be prioritized
- Data ingestion: reports plans to use Human Phenotype Ontology and NCI Thesaurus for phenotype harmonization; lists a variety of workflows for genomic harmonization
- User authentication and authorization: no login for open tier; Google, ORCID, Facebook, or LinkedIn for KidsFirst tier; eRA Commons for controlled tier
- Data security: user: comply with NIH Security Best Practices for Controlled Access Data, report DMIs, don't re-identify ppts, and don't share logins; platform: N/A (a)
- Data access: three access tiers: "open," "KidsFirst," and "controlled"; submit DAR and DAC determines access via DAR, dbGaP consent codes, and DULs; agree to DUC
- Auditing: Gen3 & Cavatica monitor data use and ensure data access is appropriate; users instructed to report inadvertent data release or other DMI
- Sanctions: N/A (a)

dbGaP
- Data submission: submitters required to certify they have considered the risks to individuals, their families, and populations associated with data submitted to dbGaP; all investigators receiving NIH support to conduct genomic research submit their de-identified study data to dbGaP; non-NIH-funded data can be submitted to dbGaP; requires the local IRB to certify consistency with laws and regulations
- Data ingestion: data undergo quality control and curation by dbGaP before being released to the public
- User authentication and authorization: no login for open tier; eRA Commons for controlled access
- Data security: user: don't re-identify ppts, create secure logins, don't share logins, ensure data is secure and confidential, destroy locally stored data and officially close project when no longer needed, have a security plan and technical training and policy controls in place before data migration, adhere to DAR for approved data use and to NIH security best practices, report DMIs; users and user institutions accountable for ensuring data security, not the cloud service provider; platform: data is de-identified
- Data access: two access tiers: "open" and "controlled"; submit DAR and agree to DUC and GDS policy; DAC determines access via DAR, dbGaP consent codes, and DULs
- Auditing: when notified, NIH reviews possible DMIs
- Sanctions: user and institution are notified of problems, and appropriate steps are taken; users may face enforcement actions; access suspended or terminated

(a) N/A indicates information in the category was not found in the publicly available documents we reviewed in 2021. We recognize more information for these categories may be publicly available at the time of publication.
Data security
Platform policies reference data security in two distinct ways: individual users' responsibilities with respect to data security and platform responsibilities designed to ensure a secure environment. All platforms encourage a standard set of data security practices that users are responsible to uphold, such as not sharing login information, not taking screenshots of data, and not attempting to re-identify participants.

Platform responsibilities for ensuring data security can vary, but many aspects are similar across the board due to federal regulations in the United States. For example, AoURH, AnVIL, BDC, GDC, and dbGaP reference following the Federal Information Security Modernization Act (FISMA) or National Institute of Standards and Technology (NIST) guidelines. FISMA has set rules around information security, and NIST outlines guidance for complying with those rules. Other similar platform data security measures include encrypting data, performing routine system security checks, and de-identifying data in a manner consistent with the GDS policy.

Some platforms have clear rules around data download. For instance, AoURH and BDC specify that they do not allow data in user authorization tiers to be downloaded at all.32,34 However, BDC also notes that it is technically possible for users to download platform data, and therefore the responsibility lies with the user to not download the data.34 In contrast to the cloud-based platforms, in dbGaP the data are typically meant to be downloaded onto the applicant's local system. For this reason, dbGaP makes clear that users and their institutions are responsible for their data's security. Users are advised to avoid putting dbGaP controlled access data on portable devices, and they are encouraged to have an institutional data security plan in place before migrating data.35 dbGaP also requires that projects be closed out officially when data are no longer needed and that all locally stored files are destroyed.35 It was unclear how data access on cloud-based platforms would cease at the conclusion of a project; we found no equivalent information about project close-out procedures specific to cloud-based platforms.

A range of other data security precautions from the platform side were noted; for example, GDC does not accept data from participants over the age of 90, and similarly AoURH limits what can be reported on participants aged 90+.22,32 BDC specifies that they use network firewalls and require all platform staff and contractors to have public trust clearance, which includes a thorough background check.15 All platforms validate user identity, at least in principle, and post descriptions of research projects publicly.

Data access
All platforms contain user authorization tiers that include individual-level data, such as germline genomic data and certain phenotype or clinical data, depending on the study and/or the platform. To access these data, most platforms require the user to submit a DAR to dbGaP. Typically, a DAR is first reviewed by a signing official from the applicant's institution; alternatives to this model are discussed below. A data access committee (DAC) then reviews the DAR and determines whether or not to approve it based on the DULs for the requested dataset.36 Which DAC reviews a DAR typically depends on which NIH Institute a study is registered with; Kids First DRC is distinct in having a dedicated Kids First DRC DAC.17 Once approved, users must also agree to adhere to the NIH GDS policy and a data use agreement (DUA), depending on the platform. AoURH requires that the user's institution enter into a DUA with the All of Us Research Program (AoURP), referred to as a "Data Use and Registration Agreement" or DURA. Other platforms refer to DUAs in different contexts: BDC stipulates that users must ensure that DUAs are "approved and maintained,"37 GDC notes that applying for access via dbGaP requires users to sign a DUA established by data owners,38 and the AnVIL states that DUAs are required "as necessary."33 Kids First DRC did not provide information on the need for DUAs. AoURH also requires that researchers adhere to the All of Us Data User Code of Conduct, which prohibits data users from "attempting to re-identify participants or their relatives" and encourages them to "be careful when distributing the results of their work" to "prevent others from using this information to re-identify All of Us participants."39

There are two main variations from the process noted above: (1) models where data users, instead of data uses, are authorized and (2) models that have automated or semi-automated data access review. In the first category are AoURH and the AnVIL. The AoURH does not use the dbGaP DAC system and instead uses a "data passport" model that grants access to vetted "Authorized Users" rather than granting data access on an individual project basis.32,40 This data passport system is possible due to the single broad consent for data access and use that governs all data on the AoURH platform; other platforms do not currently have this ability. To become an "Authorized User," the user's institution must have a DURA with the AoURP. With this in place, users must then do the following before becoming "Authorized Users": establish their identity (using eRA Commons to validate), consent to public display of their name and description of their research projects, consent to public release of their name in case of a DUCC violation, complete the All of Us Responsible Conduct of Research Training, and sign an agreement testifying they have done what is required.23,32 After this is complete, "Authorized Users" will be able to access AoURH's "Registered" and "Controlled" tiers, create workspaces on the AoURH, and carry out research with the data.23,32 A similar variation is AnVIL's in-development "library card" concept, which, like the data passport, would allow researchers to be pre-authorized to request user authorization tier data.2 This process uses the Global Alliance for Genomics and Health (GA4GH) Passport Visa system and aims to reduce the number of steps researchers have to go through before gaining access to data, while still ensuring they are permissioned to do so.2,41 However, unlike the AoURH passport approach, pre-authorized AnVIL users are still required to complete a dbGaP DAR.2,20
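To illustrate the kind of comparison a DAC makes when weighing a DAR against a dataset's DULs, and which the DUOS pilot described next aims to semi-automate with GA4GH Data Use Ontology terms, here is a deliberately simplified sketch. The term codes follow common consent-code usage (GRU, HMB, DS), but the data structures and matching rules are assumptions for illustration only.

```python
def consistent_with_dul(requested_use: dict, dataset_dul: dict) -> bool:
    """Very simplified check of a proposed use against a dataset's DUL.

    requested_use: e.g. {"purpose": "disease", "disease": "asthma"}
    dataset_dul:   e.g. {"code": "DS", "disease": "asthma"}  # disease-specific consent
    Codes loosely follow common consent-code/DUO usage: GRU = general research use,
    HMB = health/medical/biomedical research, DS = disease-specific research.
    """
    code = dataset_dul["code"]
    if code == "GRU":
        return True
    if code == "HMB":
        return requested_use["purpose"] in {"health", "disease"}
    if code == "DS":
        return (requested_use["purpose"] == "disease"
                and requested_use.get("disease") == dataset_dul.get("disease"))
    return False  # unrecognized limitation: defer to manual DAC review

# Example DAR-style request reviewed against two hypothetical datasets
request = {"purpose": "disease", "disease": "asthma"}
print(consistent_with_dul(request, {"code": "DS", "disease": "asthma"}))    # True
print(consistent_with_dul(request, {"code": "DS", "disease": "diabetes"}))  # False
```

In practice, a suggestion produced this way would inform, not replace, DAC review.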



The second variation from the process above is the Data Use Oversight System (DUOS), which is being piloted by the AnVIL. DUOS uses the GA4GH Data Use Ontology (DUO) algorithm to compare DARs with the requested dataset's consent codes and DULs. The algorithm can then suggest that the DAC either approve or deny that DAR, with the hope that this expedites data access review and reduces the burden on DACs.2

Auditing
All platform actions are regularly monitored and audited for abnormal use and to ensure researchers are only accessing the data for which they have appropriate permissions. The AnVIL and dbGaP specifically ask that all data management incidents (DMIs), such as unauthorized data sharing or data security breaches, be reported as they are found, and therefore these platforms rely partially on research teams and their institutions to help with auditing.42,43 Reported DMIs are reviewed by the affected project's DAC, and corrective action is determined by the DAC if necessary. The GDC mentions that if any data are found to contain protected health information (PHI) or personally identifiable information (PII), those data will be removed and reported via the GDC DMI procedure (which includes notifying submitters and correcting the issue before re-release).22 They do not specify how this kind of DMI might be identified in the first place.

The AoURH is once again a unique platform when it comes to auditing. In addition to performing regular platform audits, their Resource Access Board (RAB) conducts ad hoc reviews of workspaces and research project descriptions to identify any that may be in violation of their DUCC. If the RAB finds a project is noncompliant, they can recommend corrective action. This model is especially unique in that anyone, including research participants, the public, or researchers themselves, can request to have any project reviewed by the RAB.32,44

Sanctions
Neither the GDC nor the Kids First DRC described sanctions for misuse in the public-facing documents included in our search. However, the remaining platforms agree on a range of such sanctions, including public posting of the sanctioned user's name and affiliation, suspension or removal from platform use, and notifying the user's home research institution. AoURH implies that NIH and other federal agencies may be notified as well and that "financial or legal repercussions" may ensue.23 In addition, AoURH reserves the right to pursue "other sanctions" as they see fit.23 The BDC notes that user institutions are accountable in addition to users and therefore may face sanctions as well.45

Discussion

Major public investment in cloud-based platforms is enabling the storage, access, and analysis of large amounts of individual-level genomic, environmental, and linked phenotypic data. Currently several such platforms are in development or early use, and our analysis focused on the publicly accessible documentation associated with the five most prominent: the AoURH, the AnVIL, BDC, the GDC, and the Kids First DRC. The aim was to describe the heterogeneous landscape of these new cloud-based data sharing mechanisms with the intent of understanding currently operating policies and procedures enacting data governance and how these may differ from pre-existing mechanisms.

Our analysis suggests many similarities across the platforms, including reliance on a formal data ingestion process, multiple tiers of data access that often require some degree of user authentication and/or authorization, platform and user data security measures, and auditing for inappropriate data use. Many of the platforms use eRA Commons credentials for user authentication and rely on dbGaP for study registration and the adjudication of DARs, so as to authorize use of controlled access data. Platforms differ in the way in which they choose to organize their tiers of data, as well as the specifics of user authentication and authorization for different types of data. The AoURH, unlike other platforms, does not use dbGaP to manage data access, choosing to rely instead on an investigator-centered "data passport" model and post-hoc vetting of publicly described research projects for compliance with its user code of conduct. This novel data access mechanism is enabled by use of a common, broad consent agreement, versus other platforms that generally provide access to numerous studies with varying (and often legacy) consent. What explains other differences between the platforms is not always immediately obvious. Regardless, the complexity and heterogeneity we observe within and across platforms could ultimately limit the goal of efficiently combining and analyzing large enough datasets to advance precision medicine goals.

The siloed nature of current NIH-supported cloud-based data sharing mechanisms is well recognized, and an effort is underway to identify avenues to enhance interoperability, i.e., the NIH Cloud Platform Interoperability (NCPI) effort.46 Platforms involved with this effort include the AnVIL, BDC, the NCI Cancer Research Data Commons (which includes the GDC), Kids First DRC, and NCBI (which manages dbGaP). The NCPI aims to establish a "federated data ecosystem" by integrating aspects of cloud platforms such as user authentication, data discovery, datasets, and workflows using different application programming interfaces (APIs).46 Using these APIs, researchers can access data on one platform and analyze it with the tools of another platform without having to download or host the data externally.
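As one illustration of what API-level federation can look like, the sketch below retrieves a data object's metadata and an access URL in the style of the GA4GH Data Repository Service (DRS), a standard commonly used for cross-platform data access. The host name and object ID are placeholders, and we are assuming, rather than asserting from the documents we reviewed, that a given NCPI platform exposes exactly this interface.

```python
import json
import urllib.request

def fetch_drs_object(drs_base: str, object_id: str) -> dict:
    """Look up a data object and one access URL via a DRS-style API.

    DRS (GA4GH Data Repository Service) exposes objects at
    GET {base}/ga4gh/drs/v1/objects/{object_id}; an access method may
    require a follow-up call to .../access/{access_id} to obtain a URL.
    """
    with urllib.request.urlopen(f"{drs_base}/ga4gh/drs/v1/objects/{object_id}") as resp:
        obj = json.load(resp)
    method = obj["access_methods"][0]
    if "access_url" not in method:
        access_id = method["access_id"]
        with urllib.request.urlopen(
            f"{drs_base}/ga4gh/drs/v1/objects/{object_id}/access/{access_id}"
        ) as resp:
            method["access_url"] = json.load(resp)
    return {"checksums": obj.get("checksums"), "access_url": method["access_url"]}

# Placeholder host and ID; a real call to a controlled-access object would also
# need to attach an authorization token obtained through the platform's auth flow.
# print(fetch_drs_object("https://drs.example.org", "example-object-id"))
```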



Pilot innovations designed to streamline data access, which might ultimately be shared across platforms, include the NIH Research Authorization System (RAS), a simplified approach to user authentication,2,15,47 as well as the AnVIL library card concept, which like the AoURH data passport model would pre-authorize investigators for controlled data access.2,4 Similarly, the AnVIL DUOS system of semi-automating data access review could ultimately expedite DAC review for other platforms that utilize dbGaP for DARs,2,3 which will remain essential for federated data accrued under varying consent understandings and thus subject to different DULs. Notably, ease of access was identified as a driving feature in the selection of genetic databases in a recent empirical study of genetic researchers.48 Whether or how platforms such as the AoURH, which employ fundamentally different approaches to data access review, can participate in cross-platform data or analytic pipeline sharing remains an open question. Achieving interoperability will require both policy and technical solutions and will likely surface tensions between enhancing security and promoting participant trust in a platform versus maximizing the scientific utility of a resource.

In a recent commentary addressing better governance of human genomic data, O'Doherty et al. (2021) note that enhanced data sharing raises concerns about potential risks, including privacy violations, misuse of data, and unauthorized data access.1 They describe five "key functions of good governance" that governance frameworks, ideally, would fulfill, including (1) enabling data access, (2) compliance with applicable national laws and international agreements, (3) supporting appropriate data use and mitigating possible harms, (4) promoting equity in access to, and use and analysis of, data, and (5) using data for public benefit. They also cite transparency as a "meta-function of good governance" and one that, "unlike the other functions, cannot legitimately vary by context or be balanced against other dimensions of good governance." The cloud-based platforms in our analysis are each designed to enable facile data access (function 1), particularly of very large genomic data, but they vary in the degree to which their policies and procedures are transparently described. Indeed, a primary motivation for our analysis was to promote greater transparency about these new data sharing mechanisms by conducting a comprehensive (and comparative) assessment of the public documentation provided by these platforms. Interestingly, but perhaps unsurprisingly, the greatest transparency we observed was for a category of platform information that we did not originally set out to measure, i.e., cost. Most platforms explicitly outlined the ways that users must pay for cloud storage, data egress (or download, where allowed), and computing time. Platforms also devote substantial documentation to how these costs work, how to set up billing accounts with Cloud Service Providers, and what the user's responsibilities are in this respect. Platforms also generally described the use of the "cloud credits" they offered to new users to help reduce initial barriers to using these platforms.

Several of the other key functions that O'Doherty et al. (2021) describe are also well represented across the platforms whose documentation we reviewed.1 All of the platforms we examined, for example, are sponsored by the NIH and so subject to US federal data sharing requirements and data security standards (function 2). Specifically, most platforms explicitly noted following the FISMA standards and/or NIST guidelines. Similarly, in making genomic and linked phenotypic data available for analysis in the cloud, these efforts also, at least in theory, promote equity in access to, and use and analysis of, shared data (function 4). Compared with earlier data sharing mechanisms, which required investigators to have access to local data storage and computational resources sufficient to manage and analyze very large datasets, each of these cloud-based platforms provides much easier (albeit remote) access to data as well as access to a wide variety of analytical tools and pipelines. Although computation time must still be paid for, the costs of data storage are typically borne by the platform, and both junior investigators and researchers at institutions without the necessary infrastructure can now access and analyze data that were previously out of reach. Lack of publicly accessible data about current or anticipated users makes it difficult to determine if this promise of more equitable data access has yet been achieved. Notably, initiatives such as the Genomic Data Science Community Network are working toward these ends.49

O'Doherty et al. also note that good governance frameworks would ideally "clarify how its operations enhance public trustworthiness and the public good" and, hence, enhance public benefit (function 5).1 While all the platforms we reviewed have user authentication and auditing mechanisms designed to promote public confidence in the security of the data they house and share, we encountered relatively little information about the degree to which these platforms solicit participant and/or public input on their operations. The AoURH RAB, which audits research projects to ensure compliance with AoU policies, does include research participant representatives.44 Where we saw the greatest heterogeneity in platform functionality was in supporting appropriate data use and mitigating possible harms (function 3), with different approaches taken to user authentication and data access review. User authentication for most data access required use of a vetted user credential, such as from the eRA Commons, but some platforms also accepted Google, Facebook, or LinkedIn credentials, at least for non-controlled access data (typically aggregated and de-identified). While widening access to non-academic researchers, this may also risk exposing data to users whose identities cannot ultimately be verified. Similarly, while the goal of streamlining data access by authorizing users rather than specific research uses (as adopted by the AoURH) works well in that research setting, it does leave open the possibility that stigmatizing or harmful research could nevertheless be pursued. While that potential is also there with DAC-vetted DARs, a prospective review does provide the opportunity to give feedback to investigators who may not otherwise recognize the potential for harm.



The AoURH Responsible Conduct of Research Training, required for Registered and Controlled tier users, may similarly "front load" protective measures against data misuse, but again, it is not feedback specific to proposed projects. In either case, not enough information about platform auditing procedures was available to judge the extent to which harmful research uses could be detected and mitigated or who might be held responsible if something goes wrong.

Limitations
This analysis was not without limitations. Mainly, our methods of sourcing documents may have led to some relevant information being overlooked. We were limited to publicly available documentation, and in an effort to keep the scope of this analysis manageable, we primarily sourced documents from platform websites (supplemented by academic literature and limited science news websites). We recognize that platform documentation is evolving as policies are established and as platforms continue to develop; therefore, our current analysis may omit more recently added or updated information. As a result, some of our outstanding questions may have answers available in sources we did not search or in sources we did search but that have since been updated. It is also possible our understanding gleaned from available sources is incomplete or not wholly accurate. However, we contend that what we have inferred from available documentation is comparable to what other researchers and platform users may understand.

In addition, we learned a lot about genomic cloud computing platforms during this analysis that would alter our approach were we starting anew. As part of learning more about these platforms, we learned that there may be other sites we should have incorporated into our document search (e.g., the NCPI). In addition, our analysis does not include more recently developed cloud platforms that would have otherwise fit our scope criteria (e.g., NIH INCLUDE).50 We are also aware there is likely a wealth of information about these platforms available from non-public sources.

Future directions
Our analysis maps elements of data governance across emerging NIH-funded cloud platforms and as such provides a key resource for a range of future investigations and stakeholders. We aim to enable investigators seeking to understand and utilize data access and analysis options across platforms. For policymakers, we surface governance decisions that may require harmonization within and across platforms to achieve the desired interoperability. To supplement and extend what we learned from our analysis of publicly available documentation reported here, we are conducting additional research, including key informant interviews with platform developers, users, and other stakeholders, to gain deeper and first-hand understanding of platform design and use. It would also be worthwhile for future work to look at costs of these cloud platforms versus traditional platforms, as this is something we did not incorporate into the analysis. Equipped with more complete information about the governance of these new data sharing mechanisms, we will be well-positioned to contribute to ongoing interoperability efforts and help promote broad public support for such initiatives.

Data and code availability
The dataset and codes supporting the current study are available from the corresponding authors on request.

Supplemental information
Supplemental information can be found online at https://doi.org/10.1016/j.xhgg.2023.100196.

Acknowledgments
Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award number R21HG011501.

Declaration of interests
The authors declare no competing interests.

Received: January 25, 2023
Accepted: April 7, 2023

References
1. O'Doherty, K.C., Shabani, M., Dove, E.S., Bentzen, H.B., Borry, P., Burgess, M.M., Chalmers, D., De Vries, J., Eckstein, L., Fullerton, S.M., et al. (2021). Toward better governance of human genomic data. Nat. Genet. 53, 2–8. https://doi.org/10.1038/s41588-020-00742-6.
2. Schatz, M.C., Philippakis, A.A., Afgan, E., Banks, E., Carey, V.J., Carroll, R.J., Culotti, A., Ellrott, K., Goecks, J., Grossman, R.L., et al. (2021). Inverting the model of genomics data sharing with the NHGRI genomic data science analysis, visualization, and Informatics lab-space. Cell Genom. 2, 100085. https://doi.org/10.1101/2021.04.22.436044.
3. Broad Institute. DUOS - Data Use Oversight System. https://duos.broadinstitute.org/.
4. Cabili, M.N., Carey, K., Dyke, S.O.M., Brookes, A.J., Fiume, M., Jeanson, F., Kerry, G., Lash, A., Sofia, H., Spalding, D., et al. (2018). Simplifying research access to genomics and health data with Library Cards. Sci. Data 5, 180039. https://doi.org/10.1038/sdata.2018.39.
5. Final NIH Policy for Data Management and Sharing (2023). https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html.
6. Hsieh, H.-F., and Shannon, S.E. (2005). Three approaches to qualitative content analysis. Qual. Health Res. 15, 1277–1288. https://doi.org/10.1177/1049732305276687.
worthwhile for future work to look at costs of these cloud J., et al. (2021). St. Jude cloud: a pediatric cancer genomic



7. McLeod, C., Gout, A.M., Zhou, X., Thrasher, A., Rahbarinia, D., Brady, S.W., Macias, M., Birch, K., Finkelstein, D., Sunny, J., et al. (2021). St. Jude cloud: a pediatric cancer genomic data-sharing ecosystem. Cancer Discov. 11, 1082–1099. https://doi.org/10.1158/2159-8290.CD-20-1230.
8. ATLAS.ti Scientific Software Development GmbH (2021). ATLAS.ti.
9. Knoppers, B.M. (2014). Framework for responsible sharing of genomic and health-related data. HUGO J. 8, 3. https://doi.org/10.1186/s11568-014-0003-1.
10. McGuire, A.L., Roberts, J., Aas, S., and Evans, B.J. (2019). Who owns the data in a medical information commons? J. Law Med. Ethics 47, 62–69. https://doi.org/10.1177/1073110519840485.
11. O'Doherty, K.C., Burgess, M.M., Edwards, K., Gallagher, R.P., Hawkins, A.K., Kaye, J., McCaffrey, V., and Winickoff, D.E. (2011). From consent to institutions: designing adaptive governance for genomic biobanks. Soc. Sci. Med. 73, 367–374. https://doi.org/10.1016/j.socscimed.2011.05.046.
12. National Institutes of Health (2007). NOT-OD-07-088: Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-wide Association Studies (GWAS). https://grants.nih.gov/grants/guide/notice-files/not-od-07-088.html.
13. National Institutes of Health (2014). NOT-OD-14-124: NIH Genomic Data Sharing Policy. https://grants.nih.gov/grants/guide/notice-files/not-od-14-124.html.
14. The AnVIL. FAQ - Data Submission. https://anvilproject.org/faq/data-submission.
15. BioData Catalyst. Data Generator Guidance. https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/data-management/biodata-catalyst-data-generator-guidance.
16. NCI Genomic Data Commons. Requesting Data Submission. Natl. Cancer Inst. - Genomic Data Commons. https://gdc.cancer.gov/submit-data/requesting-data-submission.
17. Kids First Data Resource Center. Frequently Asked Questions (FAQs) for the Kids First Funding Opportunities. Natl. Inst. Health - Off. Strateg. Coord. - Common Fund. https://commonfund.nih.gov/kidsfirst/FAQ.
18. The AnVIL. Step 4 - Ingest Data. https://anvilproject.org/learn/data-submitters/submission-guide/ingesting-data.
19. Data Methodology. Us Res. Hub. https://www.researchallofus.org/data-tools/methods/.
20. The AnVIL. Step 1 - Register Study/Obtain Approvals. https://anvilproject.org/learn/data-submitters/submission-guide/data-approval-process.
21. BioData Catalyst. Data Management Strategy. https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/data-management/data-management-strategy.
22. NCI Genomic Data Commons. Data Submission Policies. https://gdc.cancer.gov/submit-data/data-submission-policies.
23. All of Us Research Program. Data User Code of Conduct Version 2.
24. Kids First Data Resource Center. Data Access. https://kidsfirstdrc.org/portal/data-access/.
25. Kids First Data Resource Center. Applying for Access. Kids First DRC Help Cent. https://www.notion.so/Applying-for-Access-ffed3a85b29741388b30a1ad0687003f.
26. Wilson, S., Fitzsimons, M., Ferguson, M., Heath, A., Jensen, M., Miller, J., Murphy, M.W., Porter, J., Sahni, H., Staudt, L., et al. (2017). Developing cancer Informatics applications and tools using the NCI genomic data commons API. Cancer Res. 77, e15–e18. https://doi.org/10.1158/0008-5472.CAN-17-0598.
27. BioData Catalyst. Getting Started. https://sb-biodatacatalyst.readme.io/docs/getting-started.
28. BioData Catalyst. Discovering Data Using Gen3. https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/written-documentation/getting-started/explore-available-data/gen3-discovering-data.
29. The AnVIL. Getting Started with AnVIL. https://anvilproject.org/learn.
30. BioData Catalyst. Overview. https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/written-documentation/getting-started/overview.
31. Kids First Data Resource Center. Getting Started. https://kidsfirstdrc.org/support/getting-started/.
32. All of Us Research Program (2020). Framework for Access to All of Us Data Resources v1.1.
33. The AnVIL. Consortium Guidelines for AnVIL Data Access. https://anvilproject.org/learn/data-submitters/resources/consortium-data-access-guidelines.
34. BioData Catalyst. Data Upload & Download Policy & Recommendations for Users. https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/data-upload-and-download-policy-and-recommendations-for-users.
35. NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy (2021).
36. National Institutes of Health Scientific Data Sharing. Completing an Institutional Certification Form.
37. BioData Catalyst (2022). Understanding Access. https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/written-documentation/getting-started/data-access/understanding-access-requirements-for-biodata-catalyst.
38. Jensen, M.A., Ferretti, V., Grossman, R.L., and Staudt, L.M. (2017). The NCI Genomic Data Commons as an engine for precision medicine. Blood 130, 453–459. https://doi.org/10.1182/blood-2017-03-735654.
39. All of Us Research Program (2020). Data and Statistics Dissemination Policy.
40. All of Us Research Program. Data Access Tiers. Us Res. Hub. https://www.researchallofus.org/data-tools/data-access/.
41. Voisin, C., Linden, M., Dyke, S.O., Bowers, S.R., Alper, P., Barkley, M.P., Bernick, D., Chao, J., Courtot, M., Jeanson, F., et al. (2021). GA4GH Passport standard for digital identity and access permissions. Cell Genom. 1, 100030. https://doi.org/10.1016/j.xgen.2021.100030.
42. The AnVIL. FAQ - Data Security, Management, and Access Procedures. https://anvilproject.org/faq/data-security.
43. Ramos, E.M., Din-Lovinescu, C., Bookman, E.B., McNeil, L.J., Baker, C.C., Godynskiy, G., Harris, E.L., Lehner, T., McKeon, C., Moss, J., et al. (2013). A mechanism for controlled access to GWAS data: experience of the GAIN data access committee. Am. J. Hum. Genet. 92, 479–488. https://doi.org/10.1016/j.ajhg.2012.08.034.
44. All of Us Research Program. What Is the Resource Access Board (RAB)? Us Res. Hub. https://www.researchallofus.org/faq/what-is-the-resource-access-board-rab/.
45. BioData Catalyst (2021). Data Protection. https://biodatacatalyst.nhlbi.nih.gov/data-protection/.
46. NCPI. NIH Cloud Platform Interoperability Effort. https://anvilproject.org/ncpi.
47. BioData Catalyst (2021). NHLBI BioData Catalyst Ecosystem Security Statement. https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/nhlbi-biodata-catalyst-ecosystem-security-statement.



48. Trinidad, M.G., Ryan, K.A., Krenz, C.D., Roberts, J.S., McGuire, A.L., De Vries, R., Zikmund-Fisher, B.J., Kardia, S., Marsh, E., Forman, J., et al. (2023). "Extremely slow and capricious": a qualitative exploration of genetic researcher priorities in selecting shared data resources. Genet. Med. 25, 115–124. https://doi.org/10.1016/j.gim.2022.09.003.
49. Genomic Data Science Community Network (2022). Diversifying the genomic data science research community. Genome Res. 32, 1231–1241. https://doi.org/10.1101/gr.276496.121.
50. INCLUDE (2021). Incl. Data Coord. Cent. https://includedcc.org/.

