Genomics
Basic research
Personal genomes in progress: from the
Human Genome Project to the Personal
Genome Project
Jeantine E. Lunshof, PhD; Jason Bobe, MS; John Aach, PhD;
Misha Angrist, PhD; Joseph V. Thakuria, MD; Daniel B. Vorhaus, JD, MA;
Margret R. Hoehe, MD, PhD; George M. Church, PhD
The cost of a diploid human genome sequence has dropped from about $70M to $2000 since 2007, even as the standards for redundancy have increased from 7x to 40x in order to improve call rates. Coupled with the low return on investment for common single-nucleotide polymorphisms, this has caused a significant rise in interest in correlating genome sequences with comprehensive environmental and trait data (GET). The costs of electronic health records, imaging, and microbial, immunological, and behavioral data are also dropping quickly. Sharing such integrated GET datasets and their interpretations with a diversity of researchers and research subjects highlights the need for informed-consent models capable of addressing novel privacy and other issues, as well as for flexible data-sharing resources that make materials and data available with minimum restrictions on use. This article examines the Personal Genome Project’s effort to develop a GET database as a public genomics resource broadly accessible to both researchers and research participants, while pursuing the highest standards in research ethics.
© 2010, LLS SAS Dialogues Clin Neurosci. 2010;12:47-60.
Keywords: Personal Genome Project; personal genomics; DNA sequencing technology; whole-genome sequencing; phenome; envirome; microbiome; GET data set; open consent; public genome; ELSI

Author affiliations: Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA (Jason Bobe*, John Aach, George M. Church**); PersonalGenomes.org, Boston, Massachusetts, USA (Jason Bobe, Daniel B. Vorhaus, Joseph V. Thakuria, George M. Church**); European Centre for Public Health Genomics, FHML, Maastricht University, Maastricht, The Netherlands; Department of Molecular Cell Physiology, VU University Amsterdam, The Netherlands (Jeantine E. Lunshof*); Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, USA (Misha Angrist); Harvard Medical School, Massachusetts General Hospital, Boston, Massachusetts, USA (Joseph V. Thakuria); Robinson, Bradshaw & Hinson, P.A., Charlotte, North Carolina, USA (Daniel B. Vorhaus); Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany (Margret R. Hoehe**)
* Co-first authors ** Co-last authors

Address for correspondence: Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA, Tel: +1 617/432-7562; PersonalGenomes.org, 77 Avenue Louis Pasteur, Boston, MA 02115 USA (e-mail: [email protected])
Personal genomes in progress - Lunshof et al Dialogues in Clinical Neuroscience - Vol 12 . No. 1 . 2010

When the Human Genome Project published its draft results on June 26, 2000, it released a composite human genome sequence containing genetic information from several volunteers. Seventy percent of the final sequence was obtained from one anonymous individual, while the remaining 30% came from a number of different individuals. From the first amalgamated human genome sequence (which was refined in 2003 and continues to be updated and refined to this day), private and public research efforts have gone on to sequence numerous individual human genomes with increasing speed and detail and at decreasing cost. The acceleration of whole-genome sequencing in the research context necessitates new perspectives and models that enable scientists and society to learn as much as possible from this rapidly expanding dataset while still respecting important ethical, legal, and social norms.
The Personal Genome Project (PGP),1 an ambitious research study directed by faculty members in the Department of Genetics at Harvard Medical School, aims to recruit as many as 100 000 informed participants to contribute genomic sequence data, tissues, and extensive environmental, trait, and other information to a publicly accessible and identifiable research database.
In this review we describe the Personal Genome Project itself, focusing on its unique structural features and the rationale behind the project’s design. We also elucidate the changing scientific and social landscape that makes the PGP’s model of open consent and public data access increasingly important to the furtherance of human genomic research.

The PGP’s mission

In contrast to research studies that focus on small subsets of traits within narrowly defined human populations exhibiting single diseases, the PGP was conceived with an expansive mission. From the outset, the mission of the project (Table I) has been to develop a broad-based, longitudinal, and participatory research study that will facilitate a comprehensive understanding of the project’s participants at the genomic level and beyond.
The PGP is constructed with the recognition that our desire to truly understand the genesis of most complex human traits (from dread diseases to the talents and quirks that make us each uniquely human) could only be satisfied by examining genomic information in context and by surrounding it with the richest possible data from the widest possible array of supplemental sources. By supplementing genomic sequence data with the collection and analysis of tissues and extensive environmental and trait data, and by making these data publicly accessible to researchers worldwide, the PGP aims to improve understanding of the ways in which genomes plus environments ultimately equal traits (Figure 1).
The PGP is more than just a research repository. In addition to its publicly accessible research database, the PGP, which is supported by the nonprofit PersonalGenomes.org, also works to disseminate genomic technology and knowledge at a global level, thereby producing tangible and widely available improvements in the understanding and management of human health and disease. The PGP also
finds itself at the forefront of discourse surrounding the ethical, legal, and social issues (ELSI) associated with large-scale whole-genome sequencing, particularly in the areas of privacy, informed consent, and data accessibility.
The PGP is, and is intended to be, a research project that is constantly in progress, exploring the boundaries of human genomic research in a way that produces maximal advances in scientific understanding and in public understanding and well-being, while striving to reach beyond what is minimally required to satisfy its ethical, legal, and social obligations to its participants. In the sections that follow we report on unique aspects of the PGP relating to technology development, integrative genomics, and human subject research protocols, and describe the development and current state of the PGP.

[Figure 1 diagram: Personal genome (6 Gbp, 3M+ alleles) + Envirome = TRAITS (phenome); additional labels: VDJ-ome, Microbiome, personal stem cells and epigenome.]
Figure 1. Genome + Environment = Traits (GET) equation. Envirome: the totality of environmental influences; VDJ-ome: the DNA sequences of the entire repertoire of an individual’s immunoglobulin and T-cell receptors, which reflect a lifetime of antigenic exposures; Microbiome: the billions of commensal, symbiotic, and pathogenic micro-organisms that share our body space; Epigenome: the totality of programmed biochemical and structural modifications to genomic DNA that regulate organism or phenotype development (see overview in Table III).

Key developments in human genome sequencing

The PGP derives its impetus and importance from historic breakthroughs in understanding and analysis of DNA. DNA comprises only a very small fraction of a cell (~3% of dry weight in E. coli), and its role as the molecule primarily responsible for transmission of genetic traits was not recognized until a series of discoveries beginning in the 1940s. The emergence in 1953 of a clear concept of DNA as a double-helical structure comprising a pair of complementary strings of four elementary bases (the nucleotides A, C, G, and T) crystallized interest in determining the DNA sequences of genes and the sequence differences responsible for disease, and set the stage for over four decades of development of ever more efficient and comprehensive sequencing methods. Table II describes this history by a set of milestones that take one from the early beginnings of DNA sequencing up through delivery of draft human genome sequences in 2001 to 2003. In the 38 years between 1965, when Robert Holley and colleagues at Cornell and the US Department of Agriculture sequenced a 77 nt RNA gene after 4 years of effort, and 2003, when the public Human Genome Project (HGP) declared that it had met its goals regarding delivery of a ~3Gbp human genome sequence, the size of DNA sequence that could be accommodated by sequencing technology improved ~30 million-fold.

Post-HGP sequencing: towards whole diploid genomes

Notably, the HGP had delivered only a single human genome sequence that was a composite built from a small number of deidentified individuals, while the competing nonpublic human genome project merged in data from an identified individual (Craig Venter); both were haploid estimates. As recognized from the beginning of the HGP, many additional resources would be needed to understand the functions of the genes laid out in these “reference” human genomes, and to identify the sequence differences between individuals that contribute to individual traits, health, and disease. Indeed, as the HGP ended, projects were already under way to identify large numbers of genetic differences from the HGP-derived reference genome in different human populations that could subsequently be analyzed using low-cost array methods in large numbers of individuals, a strategy that has since given rise to more than 480 published genome-wide association studies.16,17 At the same time, however, interest was rising in a second approach: to significantly improve DNA sequencing technology to a point where an individual’s entire genome could be sequenced at very low cost. Two kinds of arguments were advanced in support of this approach, focusing on functional utility and economics, respectively.
The gist of the functional arguments was that sequencing of individuals is intrinsically more informative and flexible than array-based interrogation of known sites of variation and that, variation aside, any improvements in sequencing cost and capability could be quickly applied to numerous general aspects of biology that are critical to understanding gene function, traits, and health and disease.18,19 The relative advantages of sequencing have long been recognized. Unlike array analyses, sequencing: (i) does not require variations to be preidentified; (ii) can more readily accommodate more complex variations than single nucleotide changes and very short inserts or deletions; and (iii) need not focus on variations that are common in large populations vs rare or unique variations. In consequence, as sequencing technology has improved, it has increasingly been integrated into association studies of variation.20-23
However, these advantages of sequencing were counterbalanced by its high cost, a situation well illustrated by the $3 billion US cost of the HGP itself. It is here that economic arguments were advanced suggesting that dramatic improvements in sequencing were feasible that might ultimately enable an individual’s genome to be sequenced for 1000 to 10 000 USD.18 On an empirical level, sequencing technology has appeared to exhibit a historical trend of exponentially decreasing costs with time as measured by sequenced base pairs per dollar at a given error rate, a situation frequently compared with “Moore’s Law” in computing,24 which noted that computing power measured by the integrated circuit transistor density doubled roughly every 2 years at constant cost (Figure 2).18,25 To get genome sequencing costs down to $1000 would require cost and throughput improvements of an additional 4 to 5 orders of magnitude, so the question of economic feasibility ultimately turned on whether new methods could enable this very large improvement.
Here, the HGP again gave grounds for optimism, for even though the HGP itself only achieved 100-fold improvements, it achieved this largely by refining, miniaturizing, and robotically scaling up, but not fundamentally changing, a Sanger sequencing method initially developed over 20 years earlier (Table II). If such methods were capable of 100-fold improvement, considerably greater improvements might be expected from more radically changing sequencing chemistry, signal generation and detection, and instrumentation in ways that could integrate some of the vast advances in chemistry and
Date | Event | Size of sequence (bp) | Reference
1957 | First sequence mutation identified responsible for disease (sickle cell vs normal hemoglobin) | 1 amino acid | Ingram 1957 (ref 2)
1965 | First sequence of a single complete gene | 77 bases | Holley, Apgar et al 1965 (ref 3)
1976-1977 | Sequencing of first viral genomes | 3562 bases (MS2 RNA phage); 5375 bases (φX174 DNA phage) | Fiers, Contreras et al 1976 (ref 4); Sanger, Air et al 1977 (ref 5)
1975-1977 | Maxam/Gilbert and Sanger DNA sequencing methods | | Sanger and Coulson 1975 (ref 6); Maxam and Gilbert 1977 (ref 7); Sanger, Nicklen et al 1977 (ref 8)
1994 | First commercial bacterial genome sequence | 1.7Mbp (Helicobacter pylori) | Nature Genetics, May 1996 (ref 9)
1995 | First published bacterial genome sequence | 1.83Mbp (Haemophilus influenzae) | Fleischmann, Adams et al 1995 (ref 10)
1998-2000 | Genome sequences of first animals | 100Mbp (Caenorhabditis elegans); 120Mbp (Drosophila melanogaster) | C. elegans Sequencing Consortium 1998 (ref 11); Adams, Celniker et al 2000 (ref 12)
2001 | Two draft sequences of human genome | ~3Gbp | Lander, Linton et al 2001 (ref 13); Venter, Adams et al 2001 (ref 14)
2003 | Completion of public Human Genome Project | | Collins, Morgan et al 2003 (ref 15)
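As a rough check on the pace implied by the milestones in the table above, the sketch below computes the fold increase between the 77-base sequence of 1965 and the ~3 Gbp human genome of 2003, together with the doubling time that a constant exponential trend would imply. The function name and the assumption of steady exponential growth are illustrative and not taken from the article; with these rounded endpoints the ratio comes out at a few tens of millions, the same order of magnitude as the ~30 million-fold improvement quoted in the text.

```python
import math


def fold_and_doubling_time(size_start, size_end, years):
    """Return (fold increase, doubling time in years), assuming steady exponential growth."""
    fold = size_end / size_start
    doublings = math.log2(fold)
    return fold, years / doublings


# Rounded endpoints from the milestones table: 77 bases (1965) and ~3 Gbp (2003).
fold, doubling_years = fold_and_doubling_time(77, 3e9, 2003 - 1965)
print(f"fold increase: {fold:.2e}")
print(f"implied doubling time: {doubling_years:.2f} years")
```

Under these assumptions the accessible sequence size doubled roughly every year and a half over the whole period, although the real progress was far from uniform.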
enzymology, optics and electronics, materials science, microfabrication, and process control that had accrued over the preceding 20 years and been put to good use in many other fields. The HGP also directly provided an important resource for realizing this strategy: the reference human genome sequence itself, as this could serve as a template against which reads obtained by new technologies could be located, allowing new human genomes to be assembled at least initially by “resequencing” vs de novo assembly. This reduces the burden on new sequencing methods by allowing them to generate useful data with shorter reads and higher base call error rates than would generally be needed for de novo assembly, although de novo assembly of genomes using new sequencing technology remains an important goal.

Next-generation sequencing

Researchers were quick to work out sequencing approaches along the lines indicated in these arguments, and commercial products soon emerged, giving rise to next-generation sequencing (NGS). Granting agencies soon promised funding for support, and a ~10M USD competition was announced for rapid, accurate genomic sequencing, generating increased coalescence around target goals for dramatic improvements to sequencing technology.26,27,28 Detailed reviews and comparisons of NGS approaches have been published.18,29,30
Among the earliest NGS methods were polony sequencing (the Polonator) and 454 Life Sciences.31,32,33 Both methods amplify DNA templates onto microbeads that are packed onto two-dimensional arrays for sequencing, thereby achieving enormous economies of scale compared with Sanger sequencing, and each achieved ~25-fold better cost per bp compared with HGP (Figure 2). However, each uses different sequencing chemistry and arraying technology, giving rise to many technical trade-offs. Together they proved the general point that great improvements in sequencing efficiency were indeed [...]
Two complete cancer genomes were recently sequenced, one with each platform.36,37 Further rounds of innovation have yielded a diverse set of newer NGS methods. For instance, a number of “single-molecule” sequencing methods are now available or in development. These methods avoid the need to make thousands to millions of copies of DNA template molecules on microbeads or surfaces to assure that sequencing operations generate sufficient signal to read individual bases accurately, and instead use highly sensitive optics to detect bases at the single molecule level; this allows even denser packing of DNA templates and further efficiencies in sequencing chemistry. While Helicos Biosciences has commercialized a single-molecule system that simply arrays single template molecules on a surface and uses sequencing cycles similar to the methods above, Pacific Biosciences is developing a system in which enzymes and templates are tethered to the bottom of nanofabricated wells and which monitors the signals generated by sequencing chemistry in real-time vs artificial cycles.38,39 Here, the nanofabricated wells enable substantially increased accuracy of single molecule base incorporation events. Finally, on another track, the company Complete Genomics, Inc has developed a method whereby very compact self-assembling amplicons of template DNAs called “nanoballs” are flowed onto a nanofabricated grid of ~300 nm spots at 700 to 1300 nm center-to-center distances. Three complete human genomes were sequenced with this method (as of January 2010) with an average consumable cost of $4400 and as low as $1500 for 40X coverage.40

[Figure 2: plot of sequencing efficiency in bp/$ on a logarithmic axis (ticks from 100 to 10 000 000). Labeled points: 454 & Polony; Venter (3730); Watson (454); Illumina GA & ABI SOLiD; CGI $5K human genome.]
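The comparison with Moore’s Law can be made concrete with the two price points given in the abstract (about $70M in 2007 and about $2000 at the time of writing). The sketch below assumes a three-year interval (2007 to 2010) and a constant exponential decline, both simplifications introduced here for illustration; the helper name is hypothetical.

```python
import math


def halving_time_months(cost_start, cost_end, months):
    """Months per halving of cost, assuming a constant exponential decline."""
    halvings = math.log2(cost_start / cost_end)
    return months / halvings


# Price points from the abstract; the 36-month interval (2007 to 2010) is an assumption.
genome_halving = halving_time_months(70e6, 2000, 36)

# Moore's Law as described in the text: a doubling roughly every 24 months.
moore_gain = 2 ** (36 / 24)

print(f"genome cost halved about every {genome_halving:.1f} months")
print(f"a 24-month doubling gives only a {moore_gain:.1f}-fold gain over the same 36 months")
```

On these assumptions, sequencing cost fell far faster over this period than a Moore's Law pace would predict, which is the point the bp/$ plot is meant to convey.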
Towards affordable personal genomes

These developments suggest that technology capable of meeting the cost target of $1000 or less for a diploid human genome sequence is within reach. Indeed, the in-depth resequencing of individual human genomes has now been carried out several times by NGS developers to demonstrate that their methods have come of age. There are now published full genome sequences for at least seven individuals,40 with some having been sequenced by more than one method. There are also tens, and perhaps hundreds, of additional unpublished or partly published genomes (see, eg, refs 36,37), while the lower-coverage 1000 Genomes Project20,21 continues. Clearly, the age of personal genomics is now close at hand.

The PGP

As described in the first section, one of the PGP’s central aims is to develop a publicly available, fully consented database containing comprehensive human genome and phenome data for its research participants. Such integrated datasets are fundamental drivers of progress in functional genomics and enable systems biology-based insights into the mechanisms of human health and disease.41 PGP studies will look beyond inherited genomes to include somatic and epigenetic variation data, as well as relevant microbiome, transcriptome, immunity-reflecting “VDJ-ome” and phenome data to develop comprehensive profiles. By developing high-resolution data profiles for each participant, and multiplying that by a large (up to 100 000) participant population, the PGP will also generate valuable data describing the kinds and distributions of variation that exist in populations. Although an improved understanding of human health and disease is a central aim of the PGP, its focus is considerably broader and will enable research into the social and behavioral sciences using personal genomic data. Finally, the PGP’s flexible study protocol and public and distributed approach to research enable it to keep pace with sequencing and other technological advances while simultaneously driving these developments.

Integrated personal genomes: inherited, somatic, environmental genomics

If the PGP is to fulfill its mission to address the multidimensional complexity of human biology, it must encompass multiple interacting “-omes.” For example, a person’s diet will have a profound influence upon her or his somatic gene expression as well as the genomic and proteomic activity of the person’s microbiome. It will also affect the metabolome. Similarly, an individual’s environmental exposures to pollutants will have a direct bearing on her or his immunological response and, therefore, on the VDJ-ome. Germline alleles will affect how one metabolizes drugs, which will have myriad effects on an individual’s physiological and behavioral phenotypes.

Genomes (vs exomes)

In its early phase, given the then-current cost of genomic sequencing, the PGP planned to focus on exomes rather than whole genomes as a way to affordably expand the project to large numbers of participants. Despite representing only 1% to 2% of the 6 billion base pairs in a human genome, the exome contains all protein-coding exons and therefore provides access to the majority of known functional variants.48,49,50 However, continued improvements in genomic sequencing have produced price declines that have rendered whole-genome sequencing significantly cheaper per base pair than exome sequencing. The PGP, as a result, has determined that whole-genome sequencing is cost-justified given the relatively high price of exomes and the additional information supplied by whole-genome sequences of PGP participants.51 See also Table III for the various “omes.”

Phenomes

Detailed phenotype data is required to categorize and, ultimately, understand the phenotypes that the PGP seeks to explore. However, the vastness of the human phenome, defined as the physical totality of human traits at all levels, from the molecular to the behavioral, will require new strategies that permit high-throughput trait collection while yielding accurate and standardized phenotypic data. With regard to the cellular and molecular phenotypes, the PGP collects participant tissue samples and develops cell lines that are then deposited and publicly accessible through established biobanks.52,53
As the PGP expands it is exploring Web-based, high-throughput behavioral phenotype data-collection models pioneered by leading public and private researchers. While the reliability and validity of self-reported traits is a concern, particularly for phenome research conducted online,54,55 Web-based assessments provide distinct opportunities for “dynamic phenotyping” based on a particular individual’s prior genotype-phenotype associations.56 The multimodal capabilities of Web-based trait collection instruments, combined with their low cost of implementation at large scales, seem likely to accelerate the ability of studies like the PGP to effectively explore new corners of the human phenome.
The PGP is also taking advantage of recent advancements in health information technologies to assist participants and researchers alike in structuring and accessing the massive amounts of personalized data generated by the project. The emergence of online Personally Controlled Health Record (PCHR) platforms and other novel tools enables individuals to collect and manage their own health data (including health history, medication, allergy, immunization, biometric, and other data types57,58,59); these tools can also be developed for integrated data entry, access, and dissemination by both the individual and third-party researchers or data providers, including health care providers.

Enviromes

The picture of genome and phenome is incomplete without the envirome. The envirome can be described as the totality of equivalent environmental influences contributing to all disorders and organisms.60 The mode of response of an organism to the environment that is reflected in its phenotype is constrained by its unique set of genetic variations and the environmental influences on gene expression. Therefore, a comprehensive approach is required to describe the envirome systematically in conjunction with genome and phenome information. The relevant envirome data is too large and complex to be reported, managed, or analyzed manually. The creation of phenome-genome and genome-envirome networks has been suggested in order to relate phenome and envirome information to potential disease-associated genes.61

Microbiomes

Even though microbial cells are estimated to outnumber human cells in a single individual by a factor of ten, we know very little about the microbes that live in and on us, including what mixture of bacteria, viruses, and other micro-organisms constitutes a “normal” human microbiome and how those organisms impact different biological states.62 Major efforts such as the Human Microbiome Project are under way to characterize the microbiota at different body sites in humans and to assess how variation in microbial communities is associated with states of health and disease.63 The PGP takes advantage of the unique availability of comprehensive participant profiles and uses them to explore interactions between host genetic and phenotypic variability alongside the genomic variation in the microbes that colonize them.64

The VDJ-ome

The Church Lab at Harvard Medical School is developing techniques for characterizing the repertoires of B- and T-cell receptors in individual humans from blood samples and correlating them across time with personal exposure histories, with an ultimate goal of characterizing individuals’ repertoires of linked VDJ and VJ sequences.
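To give a sense of scale for these repertoires, the sketch below multiplies out the functional segment counts quoted in Table IV of this article (~40 V, 23 D, and ~5 J for heavy chains; ~35 V and ~5 J for light chains). It treats the approximate counts as exact and considers a single light-chain family, both simplifications made here for illustration; the variable names are hypothetical.

```python
import math

# Approximate functional segment counts as quoted in Table IV (treated as exact here).
HEAVY_SEGMENTS = {"V": 40, "D": 23, "J": 5}
LIGHT_SEGMENTS = {"V": 35, "J": 5}

heavy_combinations = math.prod(HEAVY_SEGMENTS.values())  # V x D x J choices
light_combinations = math.prod(LIGHT_SEGMENTS.values())  # V x J choices
paired_combinations = heavy_combinations * light_combinations

# Table IV cites more than 10**12 unique antibodies.
shortfall_orders = math.log10(1e12 / paired_combinations)

print(heavy_combinations, light_combinations, paired_combinations)
print(f"segment choice alone falls short of 10^12 by about {shortfall_orders:.0f} orders of magnitude")
```

Segment choice alone therefore accounts for fewer than a million heavy/light pairings under these assumptions; the remaining diversity must come from other mechanisms, such as the somatic hypermutation mentioned in Table IV.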
Personal genome: Entire diploid human genome of a single individual representing 6 billion base pairs.
Exome: All exons, representing 1% to 2% of the entire human genome.
Phenome: Set of all traits in an organism, at all levels, or one of its subsystems, including morphology, physiology, and behavior.42,43
Envirome: The totality of equivalent environmental influences contributing to all disorders and organisms.44
Microbiome (human): The ecological community of commensal, symbiotic, and pathogenic microorganisms that share our body space.45
VDJ-ome: The repertoire of rearranged V, D, and J genome segments present in an individual's B and T immune cells at any given time (see Table IV).
Transcriptome: The set of all RNA molecules, including mRNA, rRNA, tRNA, and noncoding RNA produced in one or a population of cells.46
Epigenome: The totality of programmed biochemical and structural modifications to genomic DNA that regulate organism or phenotype development.
Metabolome: Total set of metabolites generated by an organism, or subsystem.
Proteome: The entire set of proteins expressed by a genome, cell, tissue or organism at a given time under defined conditions. There are more proteins than genes.47
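The size relationship between the exome and the personal genome in the definitions above can be worked through directly. The sketch below uses the 6 billion base pair diploid genome and the 1% to 2% exome fraction from those definitions, plus the $4400 average consumable cost per genome reported in the text for the Complete Genomics method; the function names and the break-even framing are illustrative assumptions.

```python
GENOME_BP = 6_000_000_000  # diploid personal genome, per the definitions above


def exome_size_bp(percent):
    """Exome size in base pairs for a whole-number percentage of the genome."""
    return GENOME_BP * percent // 100


def break_even_exome_price(genome_price, percent):
    """Exome price at which the cost per base pair equals that of a whole genome."""
    return genome_price * percent / 100


for percent in (1, 2):
    size = exome_size_bp(percent)
    price = break_even_exome_price(4400, percent)
    print(f"{percent}% exome: {size:,} bp; break-even price ${price:.0f}")
```

Any exome priced above this break-even level costs more per base pair than the whole genome, which is the sense in which the text describes whole-genome sequencing as cheaper per base pair.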
These techniques will be directly applicable to PGP participants and their self-reported data, and will yield a database of unprecedented depth describing the diversity and time development of human immune responses of large numbers of individuals in their life contexts.

The adaptive immune system
The adaptive immune system enables individuals to respond to their unique exposure histories to pathogens and environmental antigens, and possibly to cancerous mutations in their own cells, by generating and modulating expression of >10^12 unique antibodies from B cells and T cell receptors.65 Antibody diversity derives from programmed stochastic rearrangements in maturing B cells of ~40 V, 23 D, and ~5 J functional genomic segments into VDJ heavy chains, and ~35 V and ~5 J segments into VJ light chains (κ or λ) in B cells, that are further randomized by somatic hypermutation; a similar process occurs in T cells.66 NGS methods are now allowing researchers to identify and analyze expressed VDJ sequences in depth.67
Table IV. The adaptive immune system and the VDJ-ome.

Tissue reprogramming

The PGP also applies advances in tissue reprogramming techniques to tissue samples collected from PGP participants. Cells from collected somatic tissues are reprogrammed into induced pluripotent stem (iPS) cells68 and made to differentiate into the cell types that are targeted for functional analysis. These methods enable experimental access to diverse tissue types that would otherwise be unobtainable from human subjects but are routinely analyzed in model organisms; PGP participants can thus effectively serve as human model organisms. By examining multiple cell types from a single individual, differences in physiological states within and between tissues can be compared within a single PGP participant and/or across the entire PGP cohort. This approach also permits researchers to elucidate connections between genetic variation and variation in other molecular traits, such as gene expression or epigenetic modifications.69 Stored fibroblast cell lines provide researchers with access to renewable supplies of different tissue types from PGP participants.

The PGP: from personal to public genomes

The potential benefits arising from large-scale and integrated human genomic datasets are immense.70 The utility of such research, however, depends upon the responsible development and widespread availability of such comprehensive datasets, which in turn depends on describing and addressing the various ethical, legal and social challenges. Those challenges include a standard set that are inherent to any research involving human subjects, as well as certain challenges that are unique to “public genomics”71 research involving publicly available, identifiable whole-genome sequence data, such as the model pioneered by the PGP. We use the term “public genomics” to denote research studies that possess the following three critical attributes.

Integrated data

The various data types, including genomic and phenomic or trait data, are accessible in a linked format, such as a PCHR or other integrated data structure. Through this explicit linkage of data it is possible to ascertain the complete list of available traits and genetic variants for any given participant. Integration also facilitates participant-researcher interactions, longitudinal study and recontact and, crucially, simultaneous investigation of the full range of complex trait associations. Although participants need not be explicitly identified, integrated data sets that include both genomic and phenomic data will be identifiable in most cases. For this reason, participants must be made explicitly aware of the probability that they will be identified with their publicly available data, rendering promises of perfect privacy, anonymity, or confidentiality impermissible within the public genomics model. However, the promise of privacy need not give way to a promise of publicity.

Open access

Data sets and tissues are made publicly available with minimal or no access restrictions (including researcher qualifications and cost), and are generally transferable outside the original research study to be utilized by and combined with data from third parties. Well-developed data structures and intellectual property licenses are important components of this characteristic. Developing datasets that are not only publicly available but also easily portable fosters the development of a genomic commons, allows data validation by third parties, and enables the use and application of data in novel contexts that may not be foreseeable at the time of collection, thereby
facilitating hypothesis generation, encouraging serendipity and broadening the genomic research community.

Voluntary and informed participation

Satisfaction of the first two criteria (publication of an integrated dataset in an open-access format) necessitates that a premium be placed on receiving truly voluntary and informed consent from participants in public genomics research projects. Given the yet-unknown outcomes and the potential personal, familial, and social risks associated with such research, enrollment is only acceptable under an informed consent protocol that is specially designed to meet the highest standards of human research subjects protection in view of these conditions.

The study protocol

The PGP aims to produce public genomics research, and to develop and evaluate associated technologies and research, on a large and expanding scale. In October of 2008, the PGP published the first integrated set of DNA sequences, traits, and tissues collected from ten participants (the “PGP-10”) enrolled in a pilot study initiated in 2005. Today, the PGP is incrementally expanding its cohort toward 100 000 participants. More than 12 000 individuals had registered to participate in the PGP as of February 2010. In the following section we highlight significant features of the PGP study protocol as it is implemented for the enrollment of the first 100 participants (“PGP-100”) and summarized in Table V.

Public genomes: adding to ELSI

The practice of public genomics poses its own challenges, especially for the organization and governance of human subjects research, forcing us to critically reassess current frameworks and practices. In order to pursue innovative research in a responsible manner, the PGP has devel-
Enrollment • Participants may be interviewed by one or more PGP staff to verify identity and consent, confirm familiarity with
study protocols, and/or review trait questionnaire responses. Blood samples, saliva sample, and/or skin cells may
be collected.
• Tissue samples prepared for DNA sequencing and other biological analyses.
• Participants opt-in to have their profiles made available on a publicly accessible Web site, or withdraw from the
study.
• Establishment, distribution and analysis of cell lines for research.
Ongoing • Information collected for 25 years. Participants can leave the study at any time.
participation • Data Safety Monitoring Board monitors the impacts of the PGP on enrolled participants. Quarterly emails
inquire about adverse events.
• Additional trait data and tissue samples may be requested periodically.
Basic research
oped a number of project-specific tools and resources Data sharing—and the risks of public genomes
relevant to ELSI.
The PGP’s informed consent process begins with an
Open consent extensive pre-enrollment educational examination
designed to ensure a potential participant’s ability to
The “open consent” model developed by the PGP is understand the specific nature of the data collected and
designed to address the set of challenges associated with the risks presented by public genomics research. For indi-
the creation of datasets where it may be possible to iden- viduals who demonstrate the needed proficiency, the spe-
tify individual participants with their genomic and other cific informed consent agreement that follows includes a
data. The open consent model assumes that, in such a lengthy but “noncomprehensive list of hypothetical sce-
context, conventional assurances of anonymity, privacy narios that could pose risks” for participants and their
and confidentially are impossible and should not serve families (Table VI). Participants are warned that “the com-
as any part of the foundation for the informed consent plete set and magnitude of the risks that the public avail-
protocol.72,73 Due to the structure of public genomics pro- ability of [your genomic data] poses to you and your rel-
jects such as the PGP, and their associated datasets, while atives is not known at this time.” It is crucial that
privacy and confidentiality can be protected they cannot participants understand that once identifying genetic and
and should not be guaranteed to participants. This prac- trait data and tissues are released into the public domain
tice ensures veracity, which we regard as a necessary— for the express intent of broad dissemination and use by
though not sufficient—prerequisite for the exertion of third parties it will be, in all likelihood, impossible to effect
substantive autonomy. It is only through veracity that the a meaningful retraction at a later date.
criteria underlying truly informed consent can be satis- The PGP’s informed consent agreements and broader
fied. study protocol are developed in continuous close interac-
Open consent is therefore based on complete openness tion with the Harvard Medical School Committee on
and transparency with regard to all aspects of participa- Human Studies. The project is also overseen by an inde-
tion, including the potential for reidentification and the pendent Data Safety Monitoring Board. Removing poten-
reality that there may be other risks that are unidenti- tially disingenuous promises of anonymity, privacy, and
fiable at the time of consent. Predicting all potential risks confidentiality, while seeking to comprehensively and
is by definition impossible and even a list of known pos- openly describe both known and unknown risks of partic-
sible risks is unlikely ever to be comprehensive. ipation, helps to ensure that research participants are as
informed as possible about the nature of public genomics research and, simultaneously, safeguards the trustworthiness of scientists and of scientific research in general.

Potential risks of participation in the PGP as described in the consent form (abbreviated)
• The risks of public disclosure of your genetic and trait information could affect your employment, insurance and financial well-being and social interactions for you and your family.
• Anyone with sufficient knowledge and resources could take your DNA sequence data and/or posted trait information and use that data, with or without modification, to: (i) infer paternity or other features of your genealogy; (ii) claim statistical evidence that could affect your employment, insurance or ability to obtain financial services; (iii) claim relatedness to criminals or incriminate relatives; (iv) make synthetic DNA and plant it at a crime scene, or otherwise use it to falsely identify you; or (v) reveal the possibility of a disease or unknown propensity for a disease.
• Whether or not it is lawful to do so, you could be subject to actual or attempted employment, insurance, financial, or other forms of discrimination or negative treatment on the basis of the public disclosure of your genetic and trait information by the PGP or by a third party.
• The distribution of your cell lines could result in the creation and further distribution by a third party of additional cell lines, organs, or tissues containing your DNA for research, commercial, clinical, or other uses, including certain forms of assisted reproduction, some of which you may find objectionable or upsetting.
• If you have previously made available or intend to make available genetic information in a confidential setting, for example in another research study or in a clinical trial, the data that you provide as part of the PGP may be used, on its own or in combination with your previously shared data, to identify you as a participant in otherwise confidential genetic research or trials.

Return of research data to participants

Research volunteers have traditionally been treated as “objects” of study who have no intrinsic rights to the data generated by their participation.74 Today, we see that study participants are increasingly asking for access to their data75 and that available information and communication technologies have turned the return of research results into a feasible option. While some researchers adhere to the traditional viewpoint that research subjects should not or cannot receive identifiable research data, some have suggested legal and ethical grounds for finding that researchers possess the obligation to inform their participants of certain results, particularly when they are clinically actionable.76

However, defining the scenarios in which research results should be reported, and how to report such results, remains a challenging issue. The medical, financial, and psychosocial risks of disclosing variants of known and unknown clinical significance require that a careful distinction be made between those variants for which convincing clinical observational data exist and those for which disease association is less robust, a distinction that can influence both when and how to return results. Other concerns that have been voiced include the uncertainty surrounding regulations governing the return of genomics research results directly to participants, the impact of false-positive and/or false-negative results, the “incidentalome,”77 and, in the context of commercial direct-to-consumer testing, the concern that obtaining results could lead to a “raiding of the medical commons.”78

As new models of genomic research and commerce emerge, new mechanisms for communicating results to participants are also being explored. Many of these new models embrace a high level of involvement from their participants and, in return, may rely on some combination of education, informed consent, and intermediation to return data in a responsible fashion.79

The public genomics model adopted by the PGP utilizes the first two approaches while forgoing the third, opting to return data directly to research participants without the required intervention of an intermediary. The advantages of direct data return and participant communication are blunted by the partial shifting of the interpretative burden from the clinician to the researcher. The PGP has approached this issue by focusing on data disclosure via the Preliminary Research Report (PRR), which contains a noncomprehensive list of genetic variants present in the participant’s DNA sequence data currently thought to have a likelihood of clinical relevance among individuals possessing such variants.

This preliminary identification of potentially significant variants is not intended to substitute in any way for professional medical advice, diagnosis, or treatment. It leverages current knowledge by combining an evolving set of filtering algorithms and the use of existing variant databases, neither of which can be expected to have 100% accuracy in identifying truly pathogenic variants given the gaps in current scientific understanding. Participants are specifically instructed to confirm any potentially significant findings in consultation with their health care provider. It is possible that the increased rate of data return from public genomics research, as well as from commercial providers of personal genomic data, will help speed the creation of universal standards for clinical genomic interpretation that will help shift some of the interpretative burden back away from public genomics researchers.

Outlook: the PGP from 10 to 100 000

After publishing initial data from its first 10 participants in 2008, the PGP has continued to broaden the scope of the information it is collecting and publishing while simultaneously commencing the next stages of participant enrollment. From exome to whole-genome sequence data, the development and release of the GET-EvidenceBase tool80 for generation of Preliminary Research Reports, and the publication of substantial scholarship based on the PGP data generated to date, the project’s progress has been substantial. The PGP is now supported by PersonalGenomes.org, a 501(c)(3) non-profit charity that coordinates the international efforts of the PGP with other collaborative public genomics research projects around the world. Both the PGP and PersonalGenomes.org continue to strive to develop and disseminate genomic technologies, phenotyping strategies, and knowledge on a global scale and to produce tangible and widely available improvements in the understanding and management of human health in a responsible fashion.
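
The Preliminary Research Report is described in the text only in general terms: participant variants are filtered against existing variant databases to produce a noncomprehensive list. As a purely illustrative sketch of that idea, and not a description of the PGP's actual software, the Python fragment below looks up each observed variant in a toy database and keeps those whose recorded evidence level falls in a reportable set. The `Variant` class, the `VARIANT_DB` contents, the evidence labels, and the `preliminary_report` function are all hypothetical names and values introduced here for illustration.

```python
from dataclasses import dataclass


# Hypothetical data model; none of these names come from the PGP itself.
@dataclass(frozen=True)
class Variant:
    gene: str
    change: str  # e.g. an amino-acid change written as "A123T"


# Toy stand-in for an existing variant database: maps a variant to the
# evidence level currently recorded for its clinical relevance.
VARIANT_DB = {
    Variant("GENE1", "A123T"): "well-established",
    Variant("GENE2", "R45Q"): "uncertain",
    Variant("GENE3", "G7S"): "likely",
}

# Evidence levels treated as worth listing (an illustrative assumption).
REPORTABLE = {"well-established", "likely"}


def preliminary_report(participant_variants):
    """Return (variant, evidence) pairs whose database entry has a
    reportable evidence level, preserving the input order."""
    report = []
    for variant in participant_variants:
        evidence = VARIANT_DB.get(variant)  # None if the database has no entry
        if evidence in REPORTABLE:
            report.append((variant, evidence))
    return report


if __name__ == "__main__":
    observed = [
        Variant("GENE1", "A123T"),
        Variant("GENE2", "R45Q"),
        Variant("GENE9", "L10P"),  # absent from the toy database
    ]
    for variant, evidence in preliminary_report(observed):
        print(f"{variant.gene} {variant.change}: {evidence}")
```

A variant that is missing from the database is simply not listed, which says nothing about whether it is benign. This mirrors the caveat in the text that such a report is noncomprehensive and that neither the filters nor the databases can be expected to be fully accurate.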
22. Siva N. 1000 Genomes project. Nat Biotechnol. 2008;26:256.
23. Nejentsev S, Walker N, Riches D, Egholm M, Todd JA. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009;324:387-389.
24. Moore GE. Cramming more components onto integrated circuits. Electronics. 1965;38:114-117.
25. Carr PA, Church GM. Genome engineering. Nat Biotechnol. 2009;27:1151-1162.
26. National Human Genome Research Institute. 2004. Near-Term Technology Development for Genome Sequencing (RFA-HG-04-002). Available at: http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04-002.html. Accessed January 30, 2010.
27. National Human Genome Research Institute. 2004. Revolutionary Genome Sequencing Technologies - the $1000 Genome (RFA-HG-04-003). Available at: http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04-003.html. Accessed January 30, 2010.
28. X PRIZE Foundation. 2006. Archon X PRIZE for Genomics: X PRIZE Foundation Announces Largest Medical Prize in History. Available at: http://genomics.xprize.org/press-release/x-prize-foundation-announces-largest-medical-prize-in-history. Accessed January 30, 2010.
29. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31-46.
30. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135-1145.
31. Ultra-Low Cost Sequencing Technology. Available at: http://arep.med.harvard.edu/Polonator/. Accessed February 5, 2010.
32. Shendure J, Porreca GJ, Reppas NB, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309:1728-1732.
33. Margulies M, Egholm M. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376-380.
34. Applied Biosystems Inc. 2010. ABI product literature. Available at: https://docs.appliedbiosystems.com/pebiodocs/00113233.pdf. Accessed January 22, 2010.
35. Applied Biosystems Inc. 2010. The ABI SOLiD 3 System: Enabling the Next Generation of Science. Available at: http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_061241.pdf. Accessed January 30, 2010.
36. Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53-59.
37. Pleasance ED, Cheetham RK, Stephens PJ, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010;463:191-196.
38. Pleasance ED, Stephens PJ, O'Meara S, et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010;463:184-190.
39. Harris TD, Buzby PR, Babcock H. Single-molecule DNA sequencing of a viral genome. Science. 2008;320:106-109.
40. Eid J, Fehr A, Gray J, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133-138.
41. Drmanac R, Sparks AB, Callow MJ, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010;327:78-81.
42. Church GM. The Personal Genome Project. Mol Systems Biol. 2005;1:2005.0030.
43. Freimer N, Sabatti C. The human phenome project. Nat Genet. 2003;34:15-21.
44. Mahner M, Kary M. What exactly are genomes, genotypes and phenotypes? And what about phenomes? J Theor Biol. 1997;186:55-63.
45. Anthony JC, Eaton WW, Henderson AS. Looking to the future in psychiatric epidemiology. Epidemiol Rev. 1995;17:240-242.
46. Lederberg J, McCray AT. ’Ome Sweet ’Omics - a genealogical treasury of words. Scientist. 2001;15:8.
47. Transcriptome. Available at: http://en.wikipedia.org/wiki/Transcriptome. Accessed January 31, 2010.
48. Proteome. Available at: http://en.wikipedia.org/wiki/Proteome. Accessed January 31, 2010.
49. Ng PC, Levy S, Huang J, et al. Genetic variation in an individual human exome. PLoS Genet. 2008;4:e1000160.
50. Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR. Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci U S A. 2009;106:3871-3876.
51. Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272-276.
52. Angrist M. Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Pers Med. 2009;6:691-699.
53. Coriell Cell Repositories at the Coriell Institute for Medical Research. Available at: http://www.coriell.org/. Accessed January 31, 2010.
54. dbGaP: the database of Genotypes and Phenotypes. Available at: http://www.ncbi.nlm.nih.gov/gap. Accessed January 31, 2010.
55. Merrill RM, Richardson JS. Validity of self-reported height, weight, and body mass index: findings from the National Health and Nutrition Examination Survey, 2001-2006. Prev Chronic Dis. 2009;6:A121.
56. Porter SC, Manzi SF, Volpe D, Stack AM. Getting the data right: information accuracy in pediatric emergency medicine. Qual Saf Health Care. 2006;15:296-301.
57. Bilder RM, Sabb FW, Cannon TD, et al. Phenomics: the systematic study of phenotypes on a genome-wide scale. Neuroscience. 2009;164:30-42.
58. GoogleHealth. Available at: https://www.google.com/health/. Accessed January 31, 2010.
59. MS Healthvault. Available at: http://www.healthvault.com/. Accessed January 31, 2010.
60. Indivo™ The Personally Controlled Health Record. Available at: http://indivohealth.org/. Accessed January 31, 2010.
61. Anthony JC, Eaton WW, Henderson AS. Looking to the future in psychiatric epidemiology. Epidemiol Rev. 1995;17:240-242.
62. Butte AJ, Kohane IS. Creation and implications of a phenome-genome network. Nat Biotechnol. 2006;24:55-62.
63. Turnbaugh PJ, Gordon JI. A core gut microbiome in obese and lean twins. Nature. 2009;457:480-484.
64. Peterson J. The NIH Human Microbiome Project. Genome Res. 2009;19:2317-2323.
65. Sommer MO, Dantas G, Church GM. Functional characterization of the antibiotic resistance reservoir in the human microflora. Science. 2009;325:1128-1131.
66. Lefranc MP, Giudicelli V, Ginestoux C, et al. IMGT, the international ImMunoGeneTics information system. Nucleic Acids Res. 2009;37:D1006-1012.
67. Number of functional IG and TR genes per haploid genome. Available at: http://www.imgt.org/textes/IMGTrepertoire/LocusGenes/tabgenes/human/geneNumber.html#functional. Accessed February 5, 2010.
68. Weinstein JA, Jiang N, White RA 3rd, Fisher DS, Quake SR. High-throughput sequencing of the zebrafish antibody repertoire. Science. 2009;324:807-810.
69. Park IH, Arora N, Huo H, et al. Disease-specific induced pluripotent stem cells. Cell. 2008;134:877-886.
70. Lee JH, Park IH, Gao Y, et al. A robust approach to identifying tissue-specific gene expression regulatory variants using personalized human induced pluripotent stem cells. PLoS Genet. 2009;5:e1000718.
71. Collins FS. The case for a US prospective cohort study of genes and environment. Nature. 2004;429:475-477.
72. Conley JM, Doerr AK, Vorhaus DB. Enabling responsible public genomics. Health-Matrix: Journal of Law-Medicine. In press.
73. Lunshof JE, Chadwick R, Vorhaus DB, Church GM. From genetic privacy to open consent. Nat Rev Genet. 2008;9:406-411.
74. Overview of PGP consent forms. Available at: http://www.personalgenomes.org/consent/. Accessed January 31, 2010.
75. Renegar G, Webster CJ, Stuerzebecher S, et al. Returning genetic research results to individuals: points-to-consider. Bioethics. 2006;20:24-36.
76. Murphy J, Scott J, Kaufman D, Geller G, LeRoy L, Hudson K. Public expectations for return of results from large-cohort genetic research. Am J Bioeth. 2008;8:36-43.
77. Wolf SM, Lawrenz FP, Kahn JP, et al. Managing [...]. J Law Med Ethics. 2008;Summer 2008:2-31.
78. Kohane IS, Masys DR, Altman RB. The incidentalome: a threat to genomic medicine. JAMA. 2006;296:212-215.
79. McGuire AL, Burke W. An unwelcome side effect of direct-to-consumer personal genome testing: raiding the medical commons. JAMA. 2008;300:2669-2671.
80. Kohane IS, Mandl KD, Taylor PL, Holm IA, Nigrin DJ, Kunkel LM. Reestablishing the researcher-patient compact. Science. 2007;316:836-837.
81. Trait-o-matic. Available at: http://snp.med.harvard.edu. Accessed February 4, 2010.