Genome-Wide Association Studies
All content following this page was uploaded by William S Bush on 04 March 2014.
Abstract: Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Bush WS, Moore JH (2012) Chapter 11: Genome-Wide Association Studies. PLoS Comput Biol 8(12): e1002822. doi:10.1371/journal.pcbi.1002822
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Bush, Moore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by NIH grants R01-LM010098, R01-LM009012, R01-AI59694, R01-EY022300, and R01-LM011360. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]

1. Important Questions in Human Genetics

A central goal of human genetics is to identify genetic risk factors for common, complex diseases such as schizophrenia and type II diabetes, and for rare Mendelian diseases such as cystic fibrosis and sickle cell anemia. There are many different technologies, study designs and analytical tools for identifying genetic risk factors. We will focus here on the genome-wide association study or GWAS that measures and analyzes DNA sequence variations from across the human genome in an effort to identify genetic risk factors for diseases that are common in the population. The ultimate goal of GWAS is to use genetic risk factors to make predictions about who is at risk and to identify the biological underpinnings of disease susceptibility for developing new prevention and treatment strategies. One of the early successes of GWAS was the identification of the Complement Factor H gene as a major risk factor for age-related macular degeneration or AMD [1–3]. Not only were DNA sequence variations in this gene associated with AMD but the biological basis for the effect was demonstrated. Understanding the biological basis of genetic effects will play an important role in developing new pharmacologic therapies.

While understanding the complexity of human health and disease is an important objective, it is not the only focus of human genetics. Accordingly, one of the most successful applications of GWAS has been in the area of pharmacology. Pharmacogenetics has the goal of identifying DNA sequence variations that are associated with drug metabolism and efficacy as well as adverse effects. For example, warfarin is a blood-thinning drug that helps prevent blood clots in patients. Determining the appropriate dose for each patient is important and believed to be partly controlled by genes. A recent GWAS revealed DNA sequence variations in several genes that have a large influence on warfarin dosing [4]. These results, and more recent validation studies, have led to genetic tests for warfarin dosing that can be used in a clinical setting. This type of genetic test has given rise to a new field called personalized medicine that aims to tailor healthcare to individual patients based on their genetic background and other biological features. The widespread availability of low-cost technology for measuring an individual's genetic background has been harnessed by businesses that are now marketing genetic testing directly to the consumer. Genome-wide association studies, for better or for worse, have ushered in the exciting era of personalized medicine and personal genetic testing. The goal of this chapter is to introduce and review GWAS technology, study design and analytical strategies as an important example of translational bioinformatics. We focus here on the application of GWAS to common diseases that have a complex multifactorial etiology.

2. Concepts Underlying the Study Design

2.1 Single Nucleotide Polymorphisms

The modern unit of genetic variation is the single nucleotide polymorphism or SNP. SNPs are single base-pair changes in the DNA sequence that occur with high frequency in the human genome [5]. For the purposes of genetic studies, SNPs are typically used as markers of a genomic region, with the large majority of them having a minimal impact on biological systems. SNPs can have functional consequences, however, causing amino acid changes, changes to mRNA transcript stability, and changes to transcription factor binding affinity [6]. SNPs are by far the most abundant form of genetic variation in the human genome.

SNPs are notably a type of common genetic variation; many SNPs are present in a large proportion of human populations [7]. SNPs typically have two alleles, meaning within a population there are two commonly occurring base-pair possibilities for a SNP location. The frequency of a SNP is given in terms of the minor allele frequency or the frequency of the less common allele. For example, a SNP with a minor allele (G) frequency of 0.40 implies that 40% of the chromosomes in a population carry the G allele, while the more common allele (the major allele) is carried by the remaining 60%.
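As an illustration of the minor allele frequency described in Section 2.1, the following minimal sketch counts alleles in a small, made-up sample of unphased diploid genotypes. The function names (`allele_frequencies`, `minor_allele`) and the genotype data are hypothetical and chosen only for this example; real studies would read genotypes from a genotyping platform or file format that is not shown here.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} for diploid genotypes written as two letters, e.g. 'AG'."""
    counts = Counter()
    for genotype in genotypes:
        if len(genotype) != 2:
            raise ValueError(f"expected a two-letter diploid genotype, got {genotype!r}")
        counts.update(genotype)  # each individual contributes two allele copies
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele(genotypes):
    """Return (allele, frequency) for the less common allele in the sample."""
    freqs = allele_frequencies(genotypes)
    allele = min(freqs, key=freqs.get)
    return allele, freqs[allele]


# Hypothetical sample: 10 individuals, i.e. 20 chromosomes.
genotypes = ["AA"] * 4 + ["AG"] * 4 + ["GG"] * 2

print(minor_allele(genotypes))  # ('G', 0.4)
```

In this made-up sample, 8 of the 20 allele copies are G, so the minor allele frequency is 0.40, even though 6 of the 10 individuals carry at least one G. The frequency is a count over chromosomes, not over individuals.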
across the genome and to characterize correlations among variants.

The International HapMap Project used a variety of sequencing techniques to discover and catalog SNPs in European-descent populations, the Yoruba population of African origin, Han Chinese individuals from Beijing, and Japanese individuals from Tokyo [15,16]. The project has since been expanded to include 11 human populations, with genotypes for 1.6 million SNPs [7]. HapMap genotype data allowed the examination of linkage disequilibrium.

3.2 Linkage Disequilibrium

Linkage disequilibrium (LD) is a property of SNPs on a contiguous stretch of genomic sequence that describes the degree to which an allele of one SNP is inherited or correlated with an allele of another SNP within a population. The term linkage disequilibrium was coined by population geneticists in an attempt to mathematically describe changes in genetic variation within a population over time. It is related to the concept of chromosomal linkage, where two markers on a chromosome remain physically joined on a chromosome through generations of a family. In figure 2, two founder chromosomes are shown (one in blue and one in orange). Recombination events within a family from generation to generation break apart chromosomal segments. This effect is amplified through generations, and in a population of fixed size undergoing random mating, repeated random recombination events will break apart segments of contiguous chromosome (containing linked alleles) until eventually all alleles in the population are in linkage equilibrium or are independent. Thus, linkage between markers on a population scale is referred to as linkage disequilibrium.

The rate of LD decay is dependent on multiple factors, including the population size, the number of founding chromosomes in the population, and the number of generations for which the population has existed. As such, different human sub-populations have different degrees and patterns of LD. African-descent populations are the most ancestral and have smaller regions of LD due to the accumulation of more recombination events in that group. European-descent and Asian-descent populations were created by founder events (a sampling of chromosomes from the African population), which altered the number of founding chromosomes, the population size, and the generational age of the population. These populations on average have larger regions of LD than African-descent groups.

Many measures of LD have been proposed [17], though all are ultimately related to the difference between the observed frequency of co-occurrence for two alleles (i.e. a two-marker haplotype) and the frequency expected if the two markers are independent. The two commonly used measures of linkage disequilibrium are D′ and r² [15,17], shown in equations 1 and 2. In these equations, $p_{ab}$ is the frequency of the ab haplotype, $p_a$ is
the frequency of the a allele, and $p_b$ is the frequency of the b allele.

$$
D' =
\begin{cases}
\dfrac{p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}}{\min(p_A p_b,\; p_a p_B)} & \text{if } p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} > 0 \\[2ex]
\dfrac{p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}}{\min(p_A p_B,\; p_a p_b)} & \text{if } p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} < 0
\end{cases}
\tag{1}
$$

$$
r^2 = \frac{(p_{AB}\,p_{ab} - p_{Ab}\,p_{aB})^2}{p_A\,p_B\,p_a\,p_b}
\tag{2}
$$

Here $p_{AB}$, $p_{ab}$, $p_{Ab}$ and $p_{aB}$ are the frequencies of the four possible two-marker haplotypes, and $p_A$, $p_a$, $p_B$ and $p_b$ are the frequencies of the corresponding alleles at the two SNPs.

D′ is a population genetics measure that is related to recombination events between markers and is scaled between 0 and 1. A D′ value of 0 indicates complete linkage equilibrium, which implies frequent recombination between the two markers and statistical independence under principles of Hardy-Weinberg equilibrium. A D′ of 1 indicates complete LD, indicating no recombination between the two markers within the population. For the purposes of genetic analysis, LD is generally reported in terms of r², a statistical measure of correlation. High r² values indicate that two SNPs convey similar information, as one allele of the first SNP is often observed with one allele of the second SNP, so only one of the two SNPs needs to be genotyped to capture the allelic variation. There are dependencies between these two statistics; r² is sensitive to the allele frequencies of the two markers, and can only be high in regions of high D′.

One often forgotten issue associated with LD measures is that current technology does not allow direct measurement of haplotype frequencies from a sample because each SNP is genotyped independently and the phase or chromosome of origin for each allele is unknown. Many well-developed and documented methods for inferring haplotype phase and estimating the subsequent two-marker haplotype frequencies exist, and generally lead to reasonable results [18].

SNPs that are selected specifically to capture the variation at nearby sites in the genome are called tag SNPs because alleles for these SNPs tag the surrounding stretch of LD. As noted before, patterns of LD are population specific and as such, tag SNPs selected for one population may not work well for a different population. LD is exploited to optimize genetic studies, preventing genotyping SNPs that provide redundant information. Based on analysis of data from the HapMap project, >80% of commonly occurring SNPs in European descent populations can be captured using a subset of 500,000 to one million SNPs scattered across the genome [19].

3.3 Indirect Association

The presence of LD creates two possible positive outcomes from a genetic association study. In the first outcome, the SNP influencing a biological system that ultimately leads to the phenotype is directly genotyped in the study and found to be statistically associated with the trait. This is referred to as a direct association, and the genotyped SNP is sometimes referred to as the functional SNP. The second possibility is that the influential SNP is not directly typed, but instead a tag SNP in high LD with the influential SNP is typed and statistically associated with the phenotype (figure 3). This is referred to as an indirect association [10]. Because of these two possibilities, a significant SNP association from a GWAS should not be assumed to be the causal variant and may require
additional studies to map the precise location of the influential SNP.

Conceptually, the end result of GWAS under the common disease/common variant hypothesis is that a panel of 500,000 to one million markers will identify common SNPs that are associated with common phenotypes. To conduct such a study practically requires a genotyping technology that can accurately capture the alleles of 500,000 to one million SNPs for each individual in a study in a cost-effective manner.

4. Genotyping Technologies

Genome-wide association studies were made possible by the availability of chip-based microarray technology for assaying one million or more SNPs. Two primary platforms have been used for most GWAS. These include products from Illumina (San Diego, CA) and Affymetrix (Santa Clara, CA). These two competing technologies have been recently reviewed [20] and offer different approaches to measure SNP variation. For example, the Affymetrix platform prints short DNA sequences as a spot on the chip that recognizes a specific SNP allele. Alleles (i.e. nucleotides) are detected by differential hybridization of the sample DNA. Illumina on the other hand uses a bead-based technology with slightly longer DNA sequences to detect alleles. The Illumina chips are more expensive to make but provide better specificity.

Aside from the technology, another important consideration is the SNPs that each platform has selected for assay. This can be important depending on the specific human population being studied. For example, it is important to use a chip that has more SNPs with better overall genomic coverage for a study of Africans than for a study of Europeans. This is because African genomes have had more time to recombine and therefore have less LD between alleles at different SNPs. More SNPs are needed to capture the variation across the African genome.

It is important to note that the technology for measuring genomic variation is changing rapidly. Chip-based genotyping platforms such as those briefly mentioned above will likely be replaced over the next few years with inexpensive new technologies for sequencing the entire genome. These next-generation sequencing methods will provide all the DNA sequence variation in the genome. It is time now to retool for this new onslaught of data.

5. Study Design

Regardless of assumptions about the genetic model of a trait, or the technology used to assess genetic variation, no genetic study will have meaningful results without a thoughtful approach to characterize the phenotype of interest. When embarking on a genetic study, the initial focus should be on identifying precisely what quantity or trait genetic variation influences.

5.1 Case Control versus Quantitative Designs

There are two primary classes of phenotypes: categorical (often binary case/control) or quantitative. From the statistical perspective, quantitative traits are preferred because they improve power to detect a genetic effect, and often have a more interpretable outcome. For some disease traits of interest, quantitative disease risk factors have already been identified. High-density lipoprotein (HDL) and low-density lipoprotein (LDL) cholesterol levels are strong predictors of heart disease, and so genetic studies of heart disease outcomes can be conducted by examining these levels as a quantitative trait. Assays for HDL and LDL levels, being already useful for clinical practice, are precise and ubiquitous measurements that are easy to obtain. Genetic variants that influence these levels have a clear interpretation, for example a unit change in LDL level per allele or by genotype class. With an easily measurable ubiquitous quantitative trait, GWAS of blood lipids have been conducted in numerous cohort studies. Their results were also easily combined to conduct an extremely well-powered massive meta-analysis, which revealed 95 loci associated with lipid traits in more than 100,000 people [21]. Here, HDL and LDL may be the primary traits of interest or can be considered intermediate quantitative traits or endophenotypes for cardiovascular disease.

Other disease traits do not have well-established quantitative measures. In these circumstances, individuals are usually classified as either affected or unaffected, a binary categorical variable. Consider the vast difference in measurement error associated with classifying individuals as either "case" or "control" versus precisely measuring a quantitative trait. For example, multiple sclerosis is a complex clinical phenotype that is often diagnosed over a long period of time by ruling out other possible conditions. However, despite the "loose" classification of case and control, GWAS of multiple sclerosis have been enormously successful, implicating more than 10 new genes for the disorder [22]. So while quantitative outcomes are preferred, they are not required for a successful study.

5.2 Standardized Phenotype Criteria

A major component of the success with multiple sclerosis and other well-conducted case/control studies is the definition of rigorous phenotype criteria, usually presented as a rule list based on clinical variables. Multiple sclerosis studies often use the McDonald criteria for establishing case/control status and defining clinical subtypes [23]. Standardized methods like the McDonald criteria establish a concise, evidence-based approach that can be uniformly applied by multiple diagnosing clinicians to ensure that consistent pheno-
Further Reading
- 1000 Genomes Project Consortium, Altshuler D, Durbin RM, Abecasis GR, Bentley DR, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
- Haines JL, Pericak-Vance MA (2006) Genetic analysis of complex disease. New York: Wiley-Liss. 512 p.
- Hartl DL, Clark AG (2006) Principles of population genetics. Sunderland (Massachusetts): Sinauer Associates, Inc. 545 p.
- NCI-NHGRI Working Group on Replication in Association Studies, Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, et al. (2007) Replicating genotype-phenotype associations. Nature 447: 655–660.
GWAS: genome-wide association study; a genetic study design that attempts to identify commonly occurring genetic variants that
contribute to disease risk
Personalized Medicine: the science of providing health care informed by individual characteristics, such as genetic variation
SNP: single nucleotide polymorphism; a single base-pair change in the DNA sequence
Linkage Analysis: the attempt to statistically relate transmission of an allele within families to inheritance of a disease
Common disease/Common variant hypothesis: The hypothesis that commonly occurring diseases in a population are caused in part
by genetic variation that is common to that population
Linkage disequilibrium: the degree to which an allele of one SNP is observed with an allele of another SNP within a population
Direct association: the statistical association of a functional or influential allele with a disease
Indirect association: the statistical association of an allele to disease that is in strong linkage disequilibrium with the allele that is
functional or influential for disease
Population stratification: the false association of an allele to disease due to both differences in population frequency of the allele and
differences in ethnic prevalence or sampling of affected individuals
False positive: from statistical hypothesis testing, the rejection of a null hypothesis when the null hypothesis is true
Genome-wide significance: a false-positive rate threshold established by empirical estimation of the independent genomic regions
present in a population
Replication: the observation of a statistical association in a second, independent dataset (often the same population as the first
association)
Generalization: the replication of a statistical association in a second population
Imputation: the estimation of unknown alleles based on the observation of nearby alleles in high linkage disequilibrium
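
As a worked illustration of the linkage disequilibrium measures D′ and r² from Section 3.2 (equations 1 and 2), the sketch below computes both from the four two-marker haplotype frequencies. It is a minimal example under stated assumptions, not the authors' implementation: the function name `ld_measures` and the input frequencies are hypothetical, the haplotype frequencies are assumed to be already known (in practice they are estimated from unphased genotypes, for example by the methods cited in [18]), and D′ is returned as an absolute value so that it lies between 0 and 1.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D', r^2) for two biallelic SNPs given their four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    # Allele frequencies are the marginal sums of the haplotype frequencies.
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    if min(p_A, p_a, p_B, p_b) <= 0:
        raise ValueError("both SNPs must be polymorphic")

    # Numerator shared by equations 1 and 2.
    d = p_AB * p_ab - p_Ab * p_aB

    # Equation 1: the scaling term depends on the sign of d.
    if d > 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = abs(d) / d_max  # assumption: report the magnitude only

    # Equation 2.
    r2 = d * d / (p_A * p_B * p_a * p_b)
    return d_prime, r2


# Hypothetical frequencies: the Ab haplotype is never observed.
d_prime, r2 = ld_measures(p_AB=0.5, p_Ab=0.0, p_aB=0.1, p_ab=0.4)
print(round(d_prime, 3), round(r2, 3))
```

With these made-up frequencies D′ is 1 because one of the four haplotypes is absent, while r² is only about 0.67 because the allele frequencies at the two SNPs differ (0.5 versus 0.6). This mirrors the point made in Section 3.2 that r² is sensitive to allele frequencies and can only be high where D′ is high.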
References
1. Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, et al. (2005) Complement factor H variant increases the risk of age-related macular degeneration. Science 308: 419–421. doi: 10.1126/science.1110359
2. Edwards AO, Ritter R, III, Abel KJ, Manning A, Panhuysen C, et al. (2005) Complement factor H polymorphism and age-related macular degeneration. Science 308: 421–424. doi: 10.1126/science.1110189
3. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308: 385–389. doi: 10.1126/science.1109557
4. Cooper GM, Johnson JA, Langaee TY, Feng H, Stanaway IB, et al. (2008) A genome-wide scan for common genetic variants with a large influence on warfarin maintenance dose. Blood 112: 1022–1027. doi: 10.1182/blood-2008-01-134247
5. 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073. doi: 10.1038/nature09534
6. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, et al. (2008) ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res 36: D107–D113. doi: 10.1093/nar/gkm967
7. Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, et al. (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58. doi: 10.1038/nature09298
8. Kerem B, Rommens JM, Buchanan JA, Markiewicz D, et al. (1989) Identification of the cystic fibrosis gene: genetic analysis. Science 245: 1073–1080.
9. MacDonald ME, Novelletto A, Lin C, Tagle D, Barnes G, et al. (1992) The Huntington's disease candidate region exhibits many different haplotypes. Nat Genet 1: 99–103. doi: 10.1038/ng0592-99
10. Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95–108. doi: 10.1038/nrg1521
11. Corder EH, Saunders AM, Strittmatter WJ, Schmechel DE, Gaskell PC, et al. (1993) Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science 261: 921–923.
12. Altshuler D, Hirschhorn JN, Klannemark M, Lindgren CM, Vohl MC, et al. (2000) The common PPARgamma Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat Genet 26: 76–80. doi: 10.1038/79216
13. Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends Genet 17: 502–510.
14. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362–9367. doi: 10.1073/pnas.0903103106
15. International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437: 1299–1320. doi: 10.1038/nature04226
16. Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, et al. (2010) Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 86: 560–572. doi: 10.1016/j.ajhg.2010.03.003
17. Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29: 311–322. doi: 10.1006/geno.1995.9003
18. Fallin D, Schork NJ (2000) Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet 67: 947–959. doi: 10.1086/303069
19. Li M, Li C, Guan W (2008) Evaluation of coverage variation of SNP chips for genome-wide association studies. Eur J Hum Genet 16: 635–643. doi: 10.1038/sj.ejhg.5202007
20. Distefano JK, Taverna DM (2011) Technological issues and experimental design of gene association studies. Methods Mol Biol 700: 3–16. doi: 10.1007/978-1-61737-954-3_1
21. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, et al. (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707–713. doi: 10.1038/nature09270
22. Habek M, Brinar VV, Borovecki F (2010) Genes associated with multiple sclerosis: 15 and counting. Expert Rev Mol Diagn 10: 857–861. doi: 10.1586/erm.10.77
23. Polman CH, Reingold SC, Edan G, Filippi M, Hartung HP, et al. (2005) Diagnostic criteria for multiple sclerosis: 2005 revisions to the "McDonald Criteria". Ann Neurol 58: 840–846. doi: 10.1002/ana.20703
24. Chew EY, Kim J, Sperduto RD, Datiles MB, III, Coleman HR, et al. (2010) Evaluation of the age-related eye disease study clinical lens grading system AREDS report No. 31. Ophthalmology 117: 2112–2119. doi: 10.1016/j.ophtha.2010.02.033
25. Denny JC, Ritchie MD, Crawford DC, Schildcrout JS, Ramirez AH, et al. (2010) Identification of genomic predictors of atrioventricular conduction: using electronic medical records as a tool for genome science. Circulation 122: 2016–2021. doi: 10.1161/CIRCULATIONAHA.110.948828
26. Wilke RA, Berg RL, Linneman JG, Peissig P, Starren J, et al. (2010) Quantification of the clinical modifiers impacting high-density lipoprotein cholesterol in the community: Personalized Medicine Research Project. Prev Cardiol 13: 63–68. doi: 10.1111/j.1751-7141.2009.00055.x
27. Kullo IJ, Fan J, Pathak J, Savova GK, Ali Z, et al. (2010) Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J Am Med Inform Assoc 17: 568–574. doi: 10.1136/jamia.2010.004366
28. McCarty CA, Wilke RA (2010) Biobanking and pharmacogenomics. Pharmacogenomics 11: 637–641. doi: 10.2217/pgs.10.13