landscape-analysis-governance
Summary
The storage, sharing, and analysis of genomic data poses technical and logistical challenges that have precipitated the development of cloud-based computing platforms designed to facilitate collaboration and maximize the scientific utility of data. To understand cloud platforms’ policies and procedures and the implications for different stakeholder groups, in summer 2021, we reviewed publicly available documents (N = 94) sourced from platform websites, scientific literature, and lay media for five NIH-funded cloud platforms (the All of Us Research Hub, NHGRI AnVIL, NHLBI BioData Catalyst, NCI Genomic Data Commons, and the Kids First Data Resource Center) and a pre-existing data sharing mechanism, dbGaP. Platform policies were compared across seven categories of data governance: data submission, data ingestion, user authentication and authorization, data security, data access, auditing, and sanctions. Our analysis finds similarities across the platforms, including reliance on a formal data ingestion process, multiple tiers of data access with varying user authentication and/or authorization requirements, platform and user data security measures, and auditing for inappropriate data use. Platforms differ in how data tiers are organized, as well as the specifics of user authentication and authorization across access tiers. Our analysis maps elements of data governance across emerging NIH-funded cloud platforms and as such provides a key resource for stakeholders seeking to understand and utilize data access and analysis options across platforms and to surface aspects of governance that may require harmonization to achieve the desired interoperability.
1Department of Bioethics and Humanities, University of Washington School of Medicine, Seattle, WA 98195, USA; 2Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
3Lead contact
*Correspondence: [email protected] (S.C.N.), [email protected] (S.M.F.)
https://doi.org/10.1016/j.xhgg.2023.100196.
© 2023 The Author(s). This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Kids First Data Resource Center Kids First DRC NIH Common Fund https://kidsfirstdrc.org
Additional non-platform websites searched for documents (relevant peer reviewed and/or lay media) to analyze included PubMed, bioRxiv, medRxiv, Nature
News, and Science News.
have been required to submit their data to NIH-designated repositories such as dbGaP.12 Exceptions to this policy are made on a case-by-case basis, and non-NIH-funded studies are accepted into dbGaP at the discretion of NIH Institutes and Centers.12 Data submission must follow the sharing requirements and timelines in the NIH Genomic Data Sharing (GDS) policy.13
All platforms considered here require data generators to register their studies with dbGaP, with the exception of AoURH, which only contains data generated from the All of Us Research Program and does not accept external submissions. The usual dbGaP study registration process includes ethical oversight by the submitter’s Institutional Review Board (IRB) and dialog with an NIH Program Officer and a Genomic Program Administrator (GPA). Additional submission details from platform-specific documentation are as follows. Data generators wishing to deposit data within AnVIL must get approval from the AnVIL Ingestion Committee. That Committee assesses whether the data are a good fit; this is determined in part by the amount of data, ethical oversight during data collection, how participants were consented, and what data use limitations (DULs) are included.14 BDC documentation highlights that when submitting data, data generators must work with an NHLBI GPA and register their data with dbGaP, the “central registration authority” for BDC.15 The GDC accepts data from genomic cancer studies; priority is given to new data types, and the study’s size, quality, and ability to further understanding of cancer are taken into account.16 The Kids First DRC notes that “projects that allow for the broadest leveling of sharing … will be prioritized for Kids First Support” and states that restrictions like disease-specific consent or requiring a letter of collaboration “impede the ability for the Kids First program to accomplish its goals.”17

Data ingestion
Once data are submitted and accepted, they go through an intake or data ingestion process before becoming available to researchers. While not defined explicitly or consistently across platforms, data ingestion generally entails transforming, cleaning, processing, harmonizing, indexing, and/or otherwise curating data submitted by data generators in order to make it accessible on a platform. Platforms typically require data to undergo quality control and harmonization as part of ingestion. Data harmonization ensures that data from different studies and generators are compatible. Data are also indexed as part of the ingestion process, which involves assigning unique identifiers to each data file.18 AoURH uses the Observational Medical Outcomes Partnership Common Data Model to harmonize data before taking further steps to “ensure participant privacy is protected.”19 AnVIL’s Data Ingestion Committee, which includes AnVIL team members and NHGRI program officers, evaluates applications for ingestion as a form of quality control.20 BDC, while recognizing that their data does go through quality control, states that they are data “custodians” and “cannot control the quality of data ingested.”21 BDC contains data ingested from dbGaP or directly from participating consortia, and it has plans for a separate center, the Data Management Core, that will help researchers with ingestion requirements.15 The GDC, while not addressing quality control explicitly, notes that it can take up to 6 months after processing data before it is released to researchers.22 Kids First DRC reports plans to use the Human Phenotype Ontology and NCI Thesaurus for phenotype harmonization, and it lists a variety of workflows that can be used for genomic harmonization.17

User authentication and authorization
All platforms divide ingested data into two or three data tiers; however, the types of data contained in those tiers and requirements to access vary between platforms. Open tiers are open to anyone, without the need to register or otherwise authenticate the identity of the data user. User authentication tiers require users to log in with a username and password, usually via an external user account, but they do not include further requirements or subsequent identity verification. User authorization tiers require users to not only authenticate their identity but also to be authorized to access the tier, usually through a specific request
Platform: A centralized system pairing cloud-based data storage with search and analysis functionality via cloud-based workspaces and portals. Associated with a specific NIH Institute or research initiative. May also be referred to as an “ecosystem.”
Data submission: The act of data generators providing their data to a platform.
Data ingestion: The action of obtaining, importing, transforming, cleaning, processing, and otherwise curating submitted data to make it available on a platform.
Data harmonization: Ensures the compatibility of data from different submitters and/or studies via quality control, processing, and post-processing.
Data indexing: Assigning unique identifiers to data to support efficient discovery of said data.
Data curation: The process of receiving, transferring, organizing, integrating, removing, and preserving data residing within a platform.
Open tier: Open to anyone, without the need to register or authenticate your identity. Contains aggregate, de-identified data.
User authentication tier: Requires users to log in with a username and password, usually via an external user account, but does not include further requirements. Data vary from aggregate de-identified data to de-identified individual data.
User authorization tier: Requires users to not only authenticate their identity but also be authorized to access the tier through a special request (such as a DAR).
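The three tier definitions above can be read as a simple access rule. The following is a minimal Python sketch of that rule under stated assumptions; it is not any platform's actual implementation, and the names `Tier`, `User`, and `can_access` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """The three tier types defined in the glossary (labels are illustrative)."""
    OPEN = "open"
    AUTHENTICATION = "user authentication"
    AUTHORIZATION = "user authorization"


@dataclass
class User:
    # Hypothetical flags: a real platform would derive these from its login
    # provider and from the outcome of an access request such as a DAR.
    logged_in: bool = False
    approved_request: bool = False


def can_access(user: User, tier: Tier) -> bool:
    """Return True if the user meets the requirements of the given tier."""
    if tier is Tier.OPEN:
        # Open tiers need no registration or authentication.
        return True
    if tier is Tier.AUTHENTICATION:
        # Authentication tiers need a login but nothing further.
        return user.logged_in
    # Authorization tiers need a login and an approved request.
    return user.logged_in and user.approved_request
```

In this sketch an anonymous visitor reaches only open tiers, while a logged-in user without an approved request reaches open and authentication tiers but not authorization tiers.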
(such as a data access request [DAR]). User authorization tiers are included in every platform; platforms vary with regard to their use of open and/or user authentication tiers.

Open tiers
AoURH, GDC, Kids First DRC, and dbGaP have open tiers. These open tiers allow users to browse the studies contained in these tiers without logging in or making any special requests for data access or use. AoURH’s open tier allows users to view “summary statistics and aggregate information that poses negligible risks to the privacy of research participants.”23 With the GDC and Kids First DRC open tiers, users can search for studies or use open datasets.24–26

User authentication tiers
AnVIL, BDC, and Kids First DRC have user authentication tiers. Data contained in these tiers vary from aggregate de-identified data to de-identified individual data. AnVIL and BDC, which primarily provide aggregate de-identified data or unrestricted individual-level data to this tier (e.g., 1000 Genomes Project data in AnVIL), each require authentication with Google, ORCID, or eRA Commons credentials (see Table 3).15,27–30 The Kids First DRC has a “KidsFirst” tier that allows users to “search de-identified data” and can be accessed by logging in using Google, ORCID, Facebook, or LinkedIn credentials.24,31 This is the only platform in our analysis that allows user authentication using Facebook or LinkedIn credentials.

User authorization tiers
All platforms have user authorization tiers. User authorization tiers require that users both authenticate their identity and submit a request to access data for a specific research use, such as a DAR (see data access below for more detail). In all cases, access to individual de-identified data is allowed upon authorization.
The AoURH differs from other platforms in having two different user authorization tiers that, at the time of this analysis, both require an eRA Commons account. First is the “Registered” tier, which contains individual-level data and requires an eRA Commons account and a request to access the data.23,32 AoURH also has a second “Controlled” tier that contains individual-level genetic data, which according to AoURH, requires more “stringent” user authentication than the individual-level data in their “Registered” tier.32 At the time of review, to access this “Controlled” tier, users must log in with an eRA Commons account as well as be “appropriately accredited” and obtain “additional approval.”23 What this extra accreditation and approval might entail was unclear from the documentation available to our review. We consider both of these tiers to be authorization rather than authentication tiers because of the required project description to initiate a workspace in either tier, meaning that access to data is tied to a specific proposed use analogous to a dbGaP Research Use Statement.
Another variation is AnVIL’s “Consortium” tier that is only open to members of specific research consortia who have placed their data on the platform, typically ahead of release to the scientific community. Consortia members gain access directly from a consortium official, and it is the responsibility of the consortium to manage who is allowed access to these data.33

Data security
Platform policies reference data security in two distinct ways: individual users’ responsibilities with respect to
AoURH
Data submission: all data provided through the AoU Research Project
Data ingestion: uses Observational Medical Outcomes Partnership Common Data Model to harmonize data; further cleans data to protect participant privacy
User authentication and authorization: no login for public tier; eRA Commons for registered tier; plans to change in the future; eRA Commons and more stringent access requirements than registered tier; users must be appropriately accredited and obtain separate authorization for controlled tier
Data security: user: no data screenshots, don’t publish or download ppt-level data, and don’t share logins; platform: data is de-identified (though the platform notes that the term “de-identified” may not be fully accurate to describe its data); analysis of registered and/or controlled data only permitted on the platform
Data access: three access tiers: “public,” “registered,” and “controlled”; institutional AoU DUA (referred to as a “Data Use and Registration Agreement” or DURA) required; must complete research training (referred to as “Responsible Conduct of Research Training”), agree to DUCC, prove ID, share contact info and affiliations, and provide consent for release of this info; access authorization determined via “data passport” (user based); no DACs (project based)
Auditing: reviews by RAB determine DUCC compliance; applicable corrective action is recommended; all user uploaded work will be logged and monitored; anyone, including researchers and the public, can ask RAB to review a study
Sanctions: termination of your account, public posting of your name and affiliation, user’s institution notified along with NIH or other federal agencies; financial or legal repercussions; other sanctions
Human Genetics and Genomics Advances 4, 100196, July 13, 2023 5
AnVIL
Data submission: must get approval from NHGRI and the AnVIL Ingestion Committee; data must conform to the NIH GDS Policy; participants must be explicitly consented for data sharing; the AnVIL Ingestion Committee assesses data; study must be registered in dbGaP
Data ingestion: data ingestion committee evaluates applications and coordinates with dataset stewards (unspecified) to determine time frame for retention of data, long term storage, archival, and availability of data; uses Gen3 to ingest data
User authentication and authorization: Google account for open tier; eRA Commons for controlled tier; members of consortia are granted access directly by a designated consortium official
Data security: user: don’t re-identify ppts; platform: two-factor authentication, all data covered by Certificate of Confidentiality, systems are independently tested annually, system is continually tested and scanned, and is consistent with NIH Security Best Practices and GDS Policy
Data access: three access tiers: “open,” “controlled,” and “consortium”; three authorized user groups: developers, consortia, and external researchers; submit DAR and agree to DUC; DAC determines access via DAR, dbGaP consent codes, and DULs; piloting DUOS to streamline DAR approval; upload data in accordance with all national, tribal, state laws, and relevant institutional policies; consent groups placed into different workspaces
Auditing: potential DMIs must be reported to DAC within 24 hours; Terra & Gen3 log access to data, go through audits, and are monitored for abnormal use; all activities are logged and regularly reviewed/monitored
Sanctions: access suspended or terminated and user’s institution notified
Table 3. Continued
Platform | Data submission | Data ingestion | User authentication and authorization | Data security | Data access | Auditing | Sanctions
BDC
Data submission: study must be registered in dbGaP
Data ingestion: data “streamed in real time” or ingested in batches; BDC is a custodian of data, and it cannot control quality of data ingested; Data Management Core works with data providers to assess data with the intent of harmonizing; datasets are added by a user, ingested from a controlled source, or transferred from collaborative programs
User authentication and authorization: eRA Commons, Google, or ORCID for open tier; eRA Commons for controlled tier
Data security: user: don’t re-identify ppts and don’t share logins; platform: Public Trust Clearance used for all staff and contractors, data within the cloud is encrypted, cannot download controlled access participant level data
Data access: two access tiers: “open” and “controlled”; submit DAR to dbGaP; Cloud Use Statement may be required; DAC determines access via DAR, dbGaP consent codes and DULs
Auditing: all activities are logged and regularly reviewed/monitored
Sanctions: user institutions are accountable and may be subject to sanctions
GDC
Data submission: accepts data from different cancer study groups; data submission adheres to the NIH and NCI GDS policies; aggregate data for patients aged 90+; submissions reviewed by considering a study’s size, quality, compatibility with data already hosted, and likely impact on the field; any investigator or consortium with cancer genomic data can apply for data submission; data submitters understand and agree that data will be made available to the scientific community; data submitters retain ownership of their data
Data ingestion: submitted data is processed, validated, and harmonized before being hosted
User authentication and authorization: no login for open tier; eRA Commons for controlled access
Data security: user: don’t re-identify ppts and comply with your DUA; platform: data is de-identified according to Health and Human Services Safe Harbor guidelines, GDC does not house electronic health record and does not accept data for ppts over 90 years old
Data access: two access tiers: “open” and “controlled”; apply via dbGaP and DAC approves/denies; agree to DUA and NIH GDS policy and submit data sharing plan
Auditing: GDC DMI standard operating procedure is referenced but details of auditing are not specified
Sanctions: data access is removed if data are discovered to contain PHI or PII or if data are shared out of compliance with sharing conditions set by DAC; sanctions for inappropriate data use not specified
Kids First DRC
Data submission: DULs that impede the ability to access, use, or analyze data will not be prioritized; data consented only for disease-specific research; data that require a letter of collaboration or data that require local IRB approval will not be prioritized; projects that allow for broad data sharing will be prioritized
Data ingestion: reports plans to use Human Phenotype Ontology and NCI Thesaurus for phenotype harmonization; lists a variety of workflows for genomic harmonization
User authentication and authorization: no login for open tier; Google, ORCID, Facebook, or LinkedIn for KidsFirst tier; eRA Commons for controlled tier
Data security: user: comply with NIH Security Best Practices for Controlled Access Data, report DMIs, don’t re-identify ppts, and don’t share logins; platform: N/A [a]
Data access: three access tiers: “open,” “KidsFirst,” and “controlled”; submit DAR and DAC determines access via DAR, dbGaP consent codes, and DULs; agree to DUC
Auditing: Gen3 & Cavatica monitor data use and ensure data access is appropriate; users instructed to report inadvertent data release or other DMI
Sanctions: N/A [a]
dbGaP
Data submission: submitters required to certify they have considered the risks to individuals, their families, and populations associated with data submitted to dbGaP; all investigators receiving NIH support to conduct genomic research submit their de-identified study data to dbGaP; non-NIH-funded data can be submitted to dbGaP; requires the local IRB to certify consistency with laws and regulations
Data ingestion: data undergo quality control and curation by dbGaP before being released to the public
User authentication and authorization: no login for open tier; eRA Commons for controlled access
Data security: user: don’t re-identify ppts, create secure logins, don’t share logins, ensure data is secure and confidential, destroy locally stored data and officially close project when no longer needed, have a security plan and technical training, and policy controls in place before data migration, adhere to DAR for approved data use and to NIH security best practices, report DMIs, and users and user institutions accountable for ensuring data security, not the cloud service provider; platform: data is de-identified
Data access: two access tiers: “open” and “controlled”; submit DAR and agree to DUC and GDS policy; DAC determines access via DAR, dbGaP consent codes, and DULs
Auditing: when notified, NIH reviews possible DMIs
Sanctions: user and institution are notified of problems, and appropriate steps are taken; users may face enforcement actions; access suspended or terminated
[a] N/A indicates information in the category was not found in the publicly available documents we reviewed in 2021. We recognize more information for these categories may be publicly available at the time of publication.
data security and platform responsibilities designed to ensure a secure environment. All platforms encourage a standard set of data security practices that users are responsible to uphold, such as not sharing login information, not taking screenshots of data, and not attempting to re-identify participants.
Platform responsibilities for ensuring data security can vary, but many aspects are similar across the board due to federal regulations in the United States. For example, AoURH, AnVIL, BDC, GDC, and dbGaP reference following the Federal Information Security Modernization Act (FISMA) or National Institute of Standards and Technology (NIST) guidelines. FISMA has set rules around information security, and NIST outlines guidance for complying with those rules. Other similar platform data security measures include encrypting data, performing routine system security checks, and de-identifying data in a manner consistent with the GDS policy.
Some platforms have clear rules around data download. For instance, AoURH and BDC specify that they do not allow data in user authorization tiers to be downloaded at all.32,34 However, BDC also notes that it is technically possible for users to download platform data, and therefore the responsibility lies with the user to not download the data.34 In contrast to the cloud-based platforms, in dbGaP the data are typically meant to be downloaded onto the applicant’s local system. For this reason, dbGaP makes clear that users and their institutions are responsible for their data’s security. Users are advised to avoid putting controlled access data on portable devices, and they are encouraged to have an institutional data security plan in place before migrating data.35 dbGaP also requires that projects be closed out officially when data are no longer needed and that all locally stored files are destroyed.35 It was unclear how data access on cloud-based platforms would cease at the conclusion of a project; we found no equivalent information about project close-out procedures specific to cloud-based platforms.
A range of other data security precautions from the platform side were noted; for example, GDC does not accept data from participants over the age of 90, and similarly AoURH limits what can be reported on participants aged 90+.22,32 BDC specifies that they use network firewalls and require all platform staff and contractors to have public trust clearance, which includes a thorough background check.15 All platforms validate user identity, at least in principle, and post descriptions of research projects publicly.

Data access
All platforms contain user authorization tiers that include individual-level data, such as germline genomic data and certain phenotype or clinical data, depending on the study and/or the platform. To access these data, most platforms require the user to submit a DAR to dbGaP. Typically, a DAR is first reviewed by a signing official from the applicant’s institution; alternatives to this model are discussed below. A data access committee (DAC) then reviews the DAR and determines whether or not to approve it based on the DULs for the requested dataset.36 Which DAC reviews a DAR typically depends on which NIH Institute a study is registered with; Kids First DRC is distinct in having a dedicated Kids First DRC DAC.17 Once approved, users must also agree to adhere to the NIH GDS policy and, depending on the platform, a data use agreement (DUA). AoURH requires that the user’s institution enters into a DUA with the All of Us Research Project (AoURP), referred to as a “Data Use and Registration Agreement” or DURA. Other platforms refer to DUAs in different contexts: BDC stipulates that users must ensure that DUAs are “approved and maintained,”37 GDC notes that applying for access via dbGaP requires users to sign a DUA established by data owners,38 and the AnVIL states that DUAs are required “as necessary.”33 Kids First DRC did not provide information on the need for DUAs. AoURH also requires that researchers adhere to the All of Us Data User Code of Conduct, which prohibits data users from “attempting to re-identify participants or their relatives” and encourages them to “be careful when distributing the results of their work” to “prevent others from using this information to re-identify All of Us participants”.39
There are two main variations from the process noted above: (1) models where data users, instead of data uses, are authorized and (2) models that have automated or semi-automated data access review. In the first category are AoURH and the AnVIL. The AoURH does not use the dbGaP DAC system and instead uses a “data passport” model that grants access to vetted “Authorized Users” rather than granting data access on an individual project basis.32,40 This data passport system is possible due to the single broad consent for data access and use that governs all data on the AoURH platform; other platforms do not currently have this ability. To become an “authorized user,” the user’s institution must have a DURA with the AoURP. With this in place, users must then do the following before becoming “Authorized Users”: establish their identity (using eRA Commons to validate), consent to public display of their name and description of their research projects, consent to public release of their name in case of a DUCC violation, complete the All of Us Responsible Conduct of Research Training, and sign an agreement testifying they have done what is required.23,32 After this is complete, “Authorized Users” will be able to access AoURH’s “Registered” and “Controlled” tiers, create workspaces on the AoURH, and carry out research with the data.23,32 A similar variation is AnVIL’s in-development “library card” concept, which like the data passport, would allow researchers to be pre-authorized to request user authorization tier data.2 This process uses the Global Alliance for Genomics and Health (GA4GH) Passport Visa system and aims to reduce the number of steps researchers have to go through before gaining access to data, while still ensuring they are permissioned to do so.2,41 However, unlike the AoURH passport approach, pre-authorized AnVIL users are still required to complete a dbGaP DAR.2,20
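The contrast between authorizing data uses (a DAR reviewed by a DAC) and authorizing data users (the AoURH “data passport”) can be sketched in a few lines of Python. This is an illustrative simplification, not a description of how dbGaP or AoURH implement review: the class `PassportChecklist`, its field names, and the functions `missing_steps`, `is_authorized_user`, and `dac_approves` are all hypothetical, and representing DULs as a set of permitted use labels is an assumption made only for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class PassportChecklist:
    """Steps listed in the text for becoming an AoURH "Authorized User".

    Field names are hypothetical labels for those steps.
    """
    institution_has_dura: bool = False
    identity_validated: bool = False            # e.g., via eRA Commons
    consents_to_public_profile: bool = False    # name and project description
    consents_to_violation_disclosure: bool = False
    completed_rcr_training: bool = False
    signed_agreement: bool = False


def missing_steps(checklist: PassportChecklist) -> list:
    """Return the names of the steps that are not yet complete."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def is_authorized_user(checklist: PassportChecklist) -> bool:
    """User-based model: access follows from the user's status alone."""
    return not missing_steps(checklist)


def dac_approves(requested_use: str, permitted_uses: set) -> bool:
    """Use-based model (simplified): approve only if the requested use is
    among the uses the dataset's DULs permit."""
    return requested_use in permitted_uses
```

Under the user-based function, a completed checklist is sufficient for every project the user starts; under the use-based function, each new proposed use has to be checked against the dataset's limitations again.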