Downloaded from genome.cshlp.org on July 10, 2009 - Published by Cold Spring Harbor Laboratory Press
Published online June 22, 2009 in advance of the print journal.
Letter
As the cost of DNA sequencing drops, we are moving beyond one genome per species to one genome per individual to improve prevention, diagnosis, and treatment of disease by using personal genotypes. Computational methods are frequently applied to predict impairment of gene function by nonsynonymous mutations in individual genomes and single nucleotide polymorphisms (nSNPs) in populations. These computational tools are, however, known to fail 15%–40% of the time. We find that accurate discrimination between benign and deleterious mutations is strongly influenced by the long-term (among species) history of positions that harbor those mutations. Successful prediction of known disease-associated mutations (DAMs) is much higher for evolutionarily conserved positions and for original–mutant amino acid pairs that are rarely seen among species. Prediction accuracies for nSNPs show opposite patterns, forecasting impediments to building diagnostic tools aiming to simultaneously reduce both false-positive and false-negative errors. The relative allele frequencies of mutations diagnosed as benign and damaging are predicted by positional evolutionary rates. These allele frequencies are modulated by the relative preponderance of the mutant allele in the set of amino acids found at homologous sites in other species (evolutionarily permissible alleles [EPAs]). The nSNPs found in EPAs are biochemically less severe than those missing from EPAs across all allele frequency categories. Therefore, it is important to consider position evolutionary rates and EPAs when interpreting the consequences and population frequencies of human mutations. The impending sequencing of thousands of human and many more vertebrate genomes will lead to more accurate classifiers needed in real-world applications.
[Supplemental material is available online at http://www.genome.org.]
Unshrouding the mysteries of human genome variation is the essential precursor to the development of personalized medicine, where the aim is to relate the genotype with the phenotype to better understand an individual’s susceptibility to disease and response to treatment. Already, complete genomes from many individual humans have been sequenced, and projects are underway to expand that number to over a thousand genomes in the near future (Levy et al. 2007; Bentley et al. 2008; Wang et al. 2008; Wheeler et al. 2008). These projects have revealed that every individual carries thousands of amino acid–altering (nonsynonymous) nucleotide mutations and that a large number of these mutations are novel in terms of their location and the type of amino acid change induced. Experimental and other functional information is rarely available for associating a phenotypic effect with these mutations, so computational methods are used instead (e.g., Miller and Kumar 2001; Ramensky et al. 2002; Ng and Henikoff 2003; Shastry 2007; Tian et al. 2007; Lohmueller et al. 2008). These in silico predictions are of great interest in detecting variants for Mendelian and complex diseases, in prioritizing polymorphisms for experimental research in humans and other species, and in analyzing data from genome-wide association studies (e.g., Rudd et al. 2005; Bhatti et al. 2006; Kryukov et al. 2007; Doniger et al. 2008). Using various prediction tools, up to one-fourth of nonsynonymous mutations have been diagnosed as not strictly neutral and are thus thought to harbor signatures of negative or positive selection (Yampolsky et al. 2005; Eyre-Walker et al. 2006; Levy et al. 2007; Shastry 2007; Bentley et al. 2008; Boyko et al. 2008; Wang et al. 2008; Wheeler et al. 2008).

De novo methods for predicting the functional effects of novel mutations often do not directly incorporate many biological attributes (e.g., interactions among multiple sites or genes, environmental influences on phenotypes, and allele state in the paired chromosome) because of the lack of information and the difficulty in modeling them mathematically. Still, these methods offer up to 80% accuracy for mutations in genes implicated in Mendelian diseases (for reviews, see Bhatti et al. 2006; Ng and Henikoff 2006; Bromberg and Rost 2007; Shastry 2007; Tian et al. 2007). PolyPhen is the most widely used method for estimating potential deleterious effects of amino acid mutations; it is available as a web-based service, and it relies on information from sequence conservation, physicochemical differences, proximity of mutations to predicted functional domains, and structural features (Sunyaev et al. 1999; Ramensky et al. 2002). PolyPhen and SIFT (Ng and Henikoff 2003) have been used in hundreds of studies, including the evaluation of nonsynonymous single nucleotide polymorphisms (nSNPs) found in complete genomes. Many other approaches have been proposed over the last decade, but these are not yet widely used (e.g., Bromberg and Rost 2007; Tian et al. 2007; Cheng et al. 2008).
3 Corresponding author.
E-mail [email protected]; fax (480) 727-6947.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.091991.109.
Genome Research 19:000–000 © 2009 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/09; www.genome.org
Kumar et al.

In recent years, scientists have employed many strategies in efforts to build super-classifiers, using sophisticated computational approaches to improve the accuracy of computational prediction
tools in diagnosing known disease-associated mutations (DAMs) to be function-altering (damaging; true-positives) and nSNPs to be neutral (benign; true-negatives). These strategies have resulted in some gains compared with classical methods such as PolyPhen and SIFT (e.g., Bromberg and Rost 2007; Tian et al. 2007). However, the anatomies of the misdiagnoses of different types of DAMs and nSNPs remain poorly understood, as the primary correlates of the observed failures are yet to be explored. With a focus on PolyPhen and the comparison of the observed patterns with those from SIFT, we have taken an evolutionary approach in examining the patterns of successes and failures of mutational diagnosis. Given that PolyPhen (and other methods) already considers a host of evolutionary and primary sequence attributes in making decisions, it is reasonable to work with the null hypothesis that the accuracy of correct prediction is similar for mutations occurring at positions evolving with different evolutionary rates and that the accuracy of the correct prediction is similar for different original and mutated amino acids. The choice of evolutionary conservation of positions, original amino acids, and mutant alleles reflects practical considerations, because these three attributes are readily available or calculable for all mutations. Other factors, such as secondary, tertiary, and higher structures undoubtedly play an important role, but that information is not yet available for an overwhelming proportion of known DAMs and nSNPs for human populations.

Results

We begin with a report on the accuracy of correctly diagnosing known DAMs implicated in Mendelian diseases (>9000 DAMs from >500 genes) (Supplemental Fig. S1). These DAMs were subjected to the most recent version of the PolyPhen web service, which classifies them into three categories (benign, possibly-damaging, and probably-damaging) based on the logarithmic ratios of the likelihood of occurrence of a given DAM at the specific position and the likelihood of that amino acid occurring at any position (Ramensky et al. 2002). A probably-damaging designation indicates that the mutation’s chance of affecting protein function is the highest, whereas a benign designation suggests little or no putative impact on the protein function.

PolyPhen designated 60% of DAMs to be probably-damaging, which is the correct inference in this case (Fig. 1A). However, 21% of DAMs were identified to be benign, which provides a lower limit on the false-negative rate of inference. Similar accuracies are reported in other studies as well (Ng and Henikoff 2006; Chan et al. 2007; Tian et al. 2007; Cheng et al. 2008; Lohmueller et al. 2008). Pooling of the benign and possibly-damaging (ambiguous) diagnoses increases the false-negative rate for DAMs to 41%, while the pooling of the possibly-damaging and probably-damaging categories increases the DAM-prediction accuracy of PolyPhen to 79%. However, it appears to be more prudent to use only the probably-damaging category to represent the correct inference for DAMs, because PolyPhen classified a very similar fraction of DAMs and nSNPs into the possibly-damaging category (20% and 18%, respectively) (cf. Figs. 1A and 2A). We compared PolyPhen results with those obtained from SIFT, which classifies mutations into only two categories: tolerant or not-tolerant (Ng and Henikoff 2006). SIFT designated 21% of DAMs to be tolerant, which is similar to the DAM misclassification rates in PolyPhen (Supplemental Fig. S2).

For identifying the correlates of successes and failures in diagnosing DAMs, we first examined the accuracy of correct diagnoses for genes having different functions (as reflected in the gene ontology). We classified each gene into one or more of 13 major categories (plus a group of unannotated genes). The accuracy of predicting DAMs in these categories varied in a relatively narrow range, even though DAMs in some of the gene function categories were significantly easier to predict than others (e.g., translation) (see Supplemental Fig. S3).

In contrast to functional categories, the long-term evolutionary rates at DAM positions correlate strongly with the success in diagnosing DAMs in both PolyPhen (Fig. 1C) and SIFT (data not shown). DAMs in completely conserved positions (lowest evolutionary rate) were 1.5 times more likely to be correctly classified than those at positions harboring any interspecies variation (67% vs. 44%; P < 0.01), while more than 70% of DAMs in the fastest-evolving positions were misdiagnosed (benign). The accuracy of DAM prediction also varies tremendously and unexpectedly among the 20 amino acids, with accuracy ranging from 22%–96% in the proteome-wide analysis (Fig. 1D).

We examined the question of whether the observed relationship between the evolutionary rate and the accuracy of prediction is reproduced for DAMs of specific proteins, because PolyPhen scores the relative likelihood of a mutation to affect function in its protein context. Analysis of the cystic fibrosis transmembrane conductance regulator (CFTR) protein, which contributed the largest number of DAMs in our data set (444), produced patterns similar to those seen in the proteome-wide analysis for rates (Pearson’s r = 0.95) as well as for original amino acids (r = 0.97) (Fig. 1E). Similar results were observed for other DAM-rich proteins (data not shown). Therefore, the dependence of correct inference of DAMs on evolutionary rate and amino acids is a fundamental attribute of positions, rather than an artifact of proteome-wide summarization of mutations in proteins evolving with vastly different conservation profiles and amino acid contents. Major differences among evolutionary rates and amino acids were also observed in the SIFT analysis, and these differences were correlated with those observed for PolyPhen (r = 0.95 and 0.59, respectively; P << 0.01). Because different amino acids are known to evolve at intrinsically different rates, we also examined the effect of evolutionary conservation on the DAM prediction accuracy for specific amino acids. In the easiest to diagnose amino acids (e.g., arginine), DAMs occurring at completely conserved positions were significantly easier to predict than those at positions with any site variability (P << 0.01) (Fig. 1F). A similar pattern is seen for amino acids whose mutations are difficult to diagnose (e.g., alanine) (P << 0.01) (Fig. 1F).

Next, we analyzed >12,000 nSNPs in order to examine the dependence of rate of evolution and the original amino acid for mutations not associated with any disease. The fraction of nSNPs identified as benign also depends strongly on evolutionary rates (Fig. 2B) and the original amino acids (Fig. 2C). However, DAMs and nSNPs show opposite patterns in terms of accuracy, assuming that a vast majority of nSNPs represent nondisease variations. For instance, alanine nSNPs are diagnosed as benign most often, and nSNPs at fast-evolving positions are also easily diagnosed to be benign.

In order to investigate why DAMs and nSNPs show complementary patterns, we further analyzed results from PolyPhen, which uses a single score metric (position-specific independent counts [PSIC] score) in its decision making. Distributions of PSIC scores overlap extensively for DAMs and nSNPs proteome-wide (Fig. 3A) and for individual amino acids (Supplemental Fig. S4). Generally, DAMs exhibit a wider range of values and carry larger PSIC scores as compared with nSNPs. Underlying the wide PSIC distributions for DAMs and nSNPs are the relationships of PSIC scores with the
Figure 2. PolyPhen classification of 12,421 nSNPs into benign, possibly-damaging, and probably-damaging categories. (A) Fraction of nSNPs classified into the three categories. The fraction of nSNPs designated to be benign at positions with different evolutionary rates (B) and original amino acids (C). Panel B also contains the accuracy of DAM inference from Figure 1C (filled squares). Error bars, 95% confidence interval based on the binomial variance of the fraction of sites.

gaps and missing data were almost identical for EPA and non-EPA nSNP sites (32% and 33%, respectively). We expect the fraction of non-EPA nSNPs to increase in the future as more individual genomes are sequenced and rarer alleles are discovered. This increase will be counteracted by discovery of more nSNPs in EPA because the use of more species in the multiple sequence alignments would expand the list of EPAs at each position. Overall, the number of non-EPA nSNPs is likely to decline slowly, if at all. This conclusion is based on the observation that more than 80% of EPA nSNPs could be identified using only 33 nonhuman mammals, and a 30% increase in the number of species (nine additional species) led to the discovery of only a small fraction of nSNPs in expanded EPA lists for each site.

The frequency of nSNP occurrence in EPA shows a marked relationship with the evolutionary rate. The nonsynonymous polymorphisms in the fastest-evolving positions are EPAs significantly

Discussion

In proteome-wide analyses, we have shown that evolutionary rate and positional amino acid composition correlate extensively with the computational assessment of a mutation’s functional effects. A large number of DAMs are found in positions that vary among species, and a majority of DAMs in these positions are misdiagnosed. Similarly, a large number of nSNPs occur in positions that are highly conserved, which, in many cases, are predicted by computational tools to carry functional consequences. Correlation between classification accuracies and PSIC scores suggests that we could improve prediction by tailoring PSIC classification thresholds to individual classes of variants (e.g., by amino acid type and by rate class). However, such efforts would likely suffer the handicap of the classical trade-off between the false-negative and false-positive prediction rates. That is, while changes in PSIC diagnostic thresholds for individual amino acids and/or rate classes might reduce false-negatives for DAMs, they might simultaneously increase false-positives for nSNPs. In such cases, it is prudent to associate a reliability indicator with inferences produced using computational methods (e.g., Bromberg and Rost 2007).

We suggest that a reliability of inference (RoI) measure be included with functional predictions to reflect their uncertainty. The RoI measure is the average of the probability of true-positives (PTP) and the probability of true-negatives (PTN). The former is calculated by applying the given computational method on all available DAMs, while the latter is calculated by using all available strictly "neutral" nSNP data. By design, the RoI does not depend on the inference made. Rather, it captures how difficult it will be to make a correct prediction for a given type of change in its evolutionary context. The RoI may only be improved by improving true-positive and true-negative rates (such efforts are already underway for PolyPhen) (S. Sunyaev, pers. comm.). Of course, PTP and PTN may be weighted unequally in calculating the RoI when analyzing nonsynonymous mutations from the genomes of "healthy" individuals, because they are expected to carry a large number of
neutral mutations. In this case, RoI = (PTN + v × PTP)/(1 + v), where v is the expected ratio of DAMs to nSNPs and will generally be less than one. Furthermore, single and multidimensional RoI matrices may be constructed, with amino acid pair and rate classes as additional dimensions, because the accuracy of diagnosis differs among classes for the same amino acid. We anticipate that sufficient data will become available in the future from the profiling of an expanded number of diseases, individuals, and populations to build such matrices.

For now, we used the estimates of PTP and PTN based on the DAM and nSNP data analyzed (see 20 × 20 matrices in the Supplemental Figs. S5, S6), respectively, to estimate the RoI for 682 mutations found in the disease-associated genes of one individual (Levy et al. 2007). The average RoI for these mutations is 57.5% when PTP and PTN are equally weighted. It rises to 71% when PTN is given a weight 10 times that of PTP (i.e., v = 0.1). This ad hoc ratio may be justifiable, because ~10% of nonsynonymous mutations are found to be fixed among species in comparative genomic analysis involving humans and chimpanzees (e.g., Subramanian and Kumar 2006b). While a 71% success rate may appear reasonably good for some academic research, it is presently too low to be useful in real-world applications (especially in making health decisions).

In addition to helping us understand the factors that modulate the accuracy of computational methods, evolutionary rates and frequencies of EPAs at positions involved in DAMs and nSNPs supply null expectations for interpreting the observed population frequencies of alleles. For example, computational methods have been used to predict the functional effects of nSNPs (benign, possibly-damaging, and probably-damaging) found in genome-scale population surveys and the distributions of frequencies of alleles in the three functional categories compared (Lohmueller et al. 2008). Lohmueller et al. (2008) noted that the mean derived allele frequency (MAF) for the benign alleles is significantly higher than that for the damaging alleles. The direction and magnitude of this difference is predictable based on the average evolutionary rates of positions in the three functional categories, because the long-term evolutionary rates at any given position will modulate allele frequencies within populations under the principles of the neutral theory (Kimura 1983; Subramanian and Kumar 2006a). Indeed, rates of evolution and the MAF are highly correlated over all nSNPs and when considering EPA and non-EPA nSNPs separately (r = 0.88; P < 0.05). The evolutionary rate ratio for probably-damaging and benign positions is quite similar to that reported for the MAF (0.49 and 0.40, respectively), but a second-degree polynomial fits the relationship

Figure 3. (A) Frequency distributions of DAM (red) and nSNPs (blue) PSIC scores. Vertical lines show the PolyPhen PSIC cut-offs for classification of variants in the absence of structural or other information; nSNPs and DAMs are from Subramanian and Kumar (2006a). (B) Mean PSIC values for DAMs in different evolutionary rate categories. The correlation (r) between mean PSIC values and mean evolutionary rate is 0.96 (P << 0.01). The 95% confidence intervals derived from the SEMs are shown. (C) Relation between mean PSIC scores for DAMs and mean PSIC scores for nSNPs, by amino acid types (solid circles) and evolutionary rates (open circles). (D) Inverse relationship of the accuracy of DAMs (probably-damaging) and nSNPs (benign) with the evolutionary interchangeability of amino acid pairs (original/variant pairs) as captured in the BLOSUM62 matrix. Each data point represents the average of all pairs for a given BLOSUM score, with the error bars displaying the 95% confidence intervals derived from binomial variance of the proportions. BLOSUM scores are log-odds substitution occurrences. Negative BLOSUM scores show amino acid pairs that are found to have a low probability of substitution, whereas a positive score indicates frequently observed amino acid pairs. Complete 20 × 20 matrices of DAM and nSNP accuracies (and their SEs) are given in the Supplemental material.
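The weighted RoI described in the Discussion, RoI = (PTN + v × PTP)/(1 + v), can be sketched as a small function. This is only an illustration under stated assumptions, not the authors' implementation: the function name, the argument names, and the example probabilities (0.60 and 0.80) are hypothetical and were chosen solely to show how the weight v shifts the result toward PTN.

```python
def reliability_of_inference(p_tp, p_tn, v=1.0):
    """Weighted reliability of inference: RoI = (P_TN + v * P_TP) / (1 + v).

    p_tp : probability of true-positives (estimated from known DAMs)
    p_tn : probability of true-negatives (estimated from neutral nSNPs)
    v    : expected ratio of DAMs to nSNPs; v = 1 gives the plain average
    """
    for name, p in (("p_tp", p_tp), ("p_tn", p_tn)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {p}")
    if v < 0:
        raise ValueError("v must be non-negative")
    return (p_tn + v * p_tp) / (1.0 + v)


# Hypothetical probabilities for one amino acid pair in one rate class.
equal_weight = reliability_of_inference(0.60, 0.80)            # v = 1
neutral_heavy = reliability_of_inference(0.60, 0.80, v=0.1)    # PTN weighted 10x
```

With v = 1 the function returns the simple average of the two probabilities; as v shrinks toward zero the result approaches PTN, which mirrors the reasoning given for genomes of healthy individuals.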
novel mutations with a high RoI. With the knowledge of information on the genotypes of nonsynonymous mutations and SNPs, the copy number variation of the protein (including paralogs), and the availability of more protein structures, it will become possible to build more accurate mutation classifiers to diagnose disease propensities of novel mutations, select and prioritize variants for experimental research, and develop baseline patterns of novel allele frequencies within populations.

Methods

We analyzed two large-scale data sets of DAMs and nSNPs (Subramanian and Kumar 2006a; Lohmueller et al. 2008). The Subramanian and Kumar (2006a) data set consisted of 10,685 DAMs and 5308 human nSNPs. This data set was constructed by downloading the human proteome from GenBank (build 34.1) with associated RefSeq identifiers for each gene. Of all available DAMs in 1307 human genes from HGMD (http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html) and all putatively-benign nSNP sites in 11,753 human genes from various genome projects (see Subramanian and Kumar 2006a), genes containing no DAMs or nSNPs were discarded. Complete proteomes of four diverse species (Homo sapiens, Mus musculus, Gallus gallus, and Takifugu rubripes) were obtained for the remaining genes from the Ensembl web server (http://www.ensembl.org/), along with orthologs identified via a reciprocal BLASTP search with each RefSeq gene (Altschul et al. 1990; Waterston et al. 2002). Additionally, the BLOSUM substitution matrix was employed using appropriate threshold scores (Subramanian and Kumar 2004). If any of the three vertebrate orthologs could not be determined for any human gene, then that gene and all DAMs and nSNPs contained within it were excluded from the data set. Each ortholog was aligned to the homologous human sequence with CLUSTALW using default settings (Thompson et al. 1994), and all sites (and thus associated DAMs and nSNPs) containing indels or missing data at homologous sites in any of the three vertebrate species were excluded in order to represent at least four species.

From the Lohmueller et al. (2008) data, we extracted all nSNPs by removing all synonymous, noncoding, and redundant SNPs. Then, we used dbSNP rsIDs for each nSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/) to generate a RefSeq identifier (Pruitt et al. 2007). This information was used to map each nSNP onto the 44-species protein alignments available in the UCSC Genome Browser (Kuhn et al. 2009). During this process, a substantial number of nSNPs was eliminated because either dbSNP records did not contain a map from rsIDs to RefSeq identifiers, not all human RefSeq identifiers were present in the UCSC data set, or the wild-type amino acid in the Lohmueller data set was not the human representative in the UCSC data set. The outcome was a set of 12,712 nSNPs with allele frequencies as reported by Lohmueller et al. (2008), and the 44-species alignment for each nSNP position. The 44-species alignments were also generated by using the RefSeq identifiers in UCSC for all the DAMs. We discarded all positions where the amino acid state of any of the species in the original four sequence alignment disagreed between the Subramanian and Kumar (2006a) data set and the UCSC alignment. This produced a total of 8696 DAMs with 44-species alignments.

We estimated the evolutionary rate for each amino acid site separately using the amino acids found in the 44-species alignment. The number of substitutions at each site was obtained by using the known phylogeny of the species (Fig. 1B) and applying the Fitch (1971) algorithm. The total of the substitutions was divided by the total time elapsed on the tree to obtain the evolutionary rate in the units of the number of substitutions per site per billion years. Species divergence times were obtained from an advanced version of the TimeTree resource (www.timetree.org, version 2.0 prerelease) (Hedges et al. 2006). For each position, all species containing alignment gaps or missing data were pruned from the tree before calculating the number of substitutions and the total evolutionary time. We repeated this procedure to calculate the evolutionary rate using only 33 mammalian species. Vertebrate and mammal rates were highly correlated for all sites used (r = 0.92; P << 0.01), and we employed the latter rates, as mammalian genomes are more appropriate models for the human genome as compared to more distantly related species. Furthermore, we have previously shown that maximum likelihood estimates of relative evolutionary rates are very highly correlated with rates obtained using the Fitch algorithm (Miller and Kumar 2001), as each site contains data from many closely and distantly related species. This was confirmed in our analysis of DAM positions for which rates from four species ML analysis from Subramanian and Kumar (2006a) and the 44-species analysis in this study showed significant correlation (r = 0.70; P << 0.01). Because the calculation of rates by our current method only requires the amino acids in all other species at a given site, it is more suitable for application in personalized diagnostics. We quantized evolutionary rates into six discrete categories such that sites showing no variation across all species comprise the slowest-evolving group (category 0), and the cut-off rates for the other five categories (1–5) were such that they each contained a similar number of sites when applied to the Lohmueller et al. (2008) nSNPs. The five categories of evolutionary rate of variable positions had average evolutionary rates of 0.6, 1.5, 2.5, 3.5, and 5.3 with standard deviations of 0.2, 0.3, 0.3, 0.3, and 1.5, respectively.

These UCSC Genome Browser alignments were also used to generate EPAs at each position, because they cover 44 diverse vertebrate species, including agnathans, fishes, amphibians, birds, and mammals (http://genome.ucsc.edu/). Under the principles of the neutral theory of molecular evolution, a vast majority of EPAs are expected to represent neutral variants at a site. For each DAM/nSNP, we estimated the percentage of evolutionary time span (%ETS) in the 44-species tree, which is the total branch length (times) in the tree obtained after pruning all nonhuman species lacking the variant allele divided by the total branch length of the tree after pruning all species containing an alignment gap or missing data. For each variant at a site, the ETS varies from 0%–100%, with constant sites containing a single EPA with an ETS of 100% and non-EPA mutations producing an ETS of 0%. A smaller ETS is frequently associated with variation that has occurred recently in species closely related to humans.

The PolyPhen web resource was used to classify mutations into benign, possibly-damaging, and probably-damaging categories for DAMs and nSNPs (Ramensky et al. 2002). After removing duplicate entries and sites for which PolyPhen returned "unknown" or was unable to return any result, the final data set contained 9460 DAMs and 4020 nSNPs for the Subramanian and Kumar (2006a) data sets. We noticed that while PolyPhen attempts to incorporate information from the protein structure (when available from databases such as Protein Data Bank) and available functional data from site annotations, the final diagnosis for this data set was rooted solely in primary sequences for >97% of the mutations we tested. Inclusion and exclusion of these mutations produce the same results, so we did not consider nonsequence attributes in any of our analyses. Because of the slowness of the web resource (http://sift.cchmc.org/), SIFT analyses are based on a subset of these DAMs and nSNPs (approximately one-third each, 2375 and 1439, respectively). Supplementary information available from Lohmueller et al. (2008) provided the PolyPhen diagnosis for all the nSNPs.
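The per-site rate and %ETS calculations described in the Methods can be illustrated with a short sketch. It is not the authors' code: the tree encoding (name, branch time, children), the helper names, the toy four-species tree, and its branch times are all hypothetical, and the sketch assumes a bifurcating tree with branch times in billions of years. Substitutions are counted with Fitch parsimony after pruning species with gaps, then divided by the total time remaining on the tree; %ETS reuses the same pruning with nonhuman species that lack the variant treated as absent.

```python
MISSING = {None, "-", "?"}  # assumed codes for gaps / missing data


def prune(node, states):
    """Drop leaves without data; merge branch times through single-child nodes."""
    name, length, children = node
    if not children:
        return None if states.get(name) in MISSING else node
    kept = [p for p in (prune(c, states) for c in children) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        child_name, child_length, grandchildren = kept[0]
        return (child_name, child_length + length, grandchildren)
    return (name, length, kept)


def fitch(node, states):
    """Return (candidate state set, minimum substitutions) for the subtree."""
    name, _, children = node
    if not children:
        return {states[name]}, 0
    current, subs = fitch(children[0], states)
    for child in children[1:]:
        child_set, child_subs = fitch(child, states)
        subs += child_subs
        common = current & child_set
        if common:
            current = common
        else:  # empty intersection implies one substitution
            current = current | child_set
            subs += 1
    return current, subs


def tree_time(node, is_root=True):
    """Sum of branch times below the root (the root's own branch is ignored)."""
    _, length, children = node
    own = 0.0 if is_root else length
    return own + sum(tree_time(c, False) for c in children)


def site_rate(tree, states):
    """Substitutions per site per unit time, after pruning missing species."""
    pruned = prune(tree, states)
    if pruned is None or not pruned[2]:
        raise ValueError("need at least two species with data at this site")
    return fitch(pruned, states)[1] / tree_time(pruned)


def percent_ets(tree, states, variant, focal="human"):
    """%ETS: time spanned by the focal species plus carriers of the variant."""
    total = tree_time(prune(tree, states))
    carriers = {sp: (aa if sp == focal or aa == variant else None)
                for sp, aa in states.items()}
    return 100.0 * tree_time(prune(tree, carriers)) / total


# Hypothetical tree and branch times (billions of years), for illustration only.
TREE = ("root", 0.0, [
    ("mammals", 0.21, [
        ("primates", 0.08, [("human", 0.01, []), ("chimp", 0.01, [])]),
        ("mouse", 0.09, []),
    ]),
    ("chicken", 0.30, []),
])
STATES = {"human": "R", "chimp": "R", "mouse": "K", "chicken": "-"}

rate = site_rate(TREE, STATES)            # chicken is pruned (gap)
ets_k = percent_ets(TREE, STATES, "K")    # K is seen in mouse, so it is an EPA
ets_w = percent_ets(TREE, STATES, "W")    # W is seen nowhere, so it is non-EPA
```

In this toy case one substitution is inferred over the 0.19 time units left after pruning the chicken, and a variant absent from every species yields an ETS of 0%, matching the description of non-EPA mutations. Treating the focal species as always retained is an interpretation of "pruning all nonhuman species lacking the variant allele".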
Acknowledgments

We thank Revak Raj Tyagi for his help with UCSC Genome browser data extraction, Antoine Ah-Foune and Veronica Shi for some early analyses, and Kristi Garboushian for providing editorial support. We thank David Cooper (HGMD) for permitting us to use the disease-associated mutation data of Subramanian and Kumar (2006a). This research was supported by a research grant from NIH HG2096 (S.K.).

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215: 403–410.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53–59.
Bhatti P, Church DM, Rutter JL, Struewing JP, Sigurdson AJ. 2006. Candidate single nucleotide polymorphism selection using publicly available tools: A guide for epidemiologists. Am J Epidemiol 164: 794–804.
Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, et al. 2008. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 4: e1000083. doi: 10.1371/journal.pgen.1000083.
Bromberg Y, Rost B. 2007. SNAP: Predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res 35: 3823–3835.
Chan PA, Duraisamy S, Miller PJ, Newell JA, McBride C, Bond JP, Raevaara T, Ollila S, Nystrom M, Grimm AJ, et al. 2007. Interpreting missense variants: Comparing computational methods in human disease genes CDKN2A, MLH1, MSH2, MECP2, and tyrosinase (TYR). Hum Mutat 28: 683–693.
Cheng TM, Lu YE, Vendruscolo M, Lio P, Blundell TL. 2008. Prediction by graph theoretic measures of structural effects in proteins arising from non-synonymous single nucleotide polymorphisms. PLoS Comput Biol 4: e1000135. doi: 10.1371/journal.pcbi.1000135.
Doniger SW, Kim HS, Swain D, Corcuera D, Williams M, Yang SP, Fay JC. 2008. A catalog of neutral and deleterious polymorphism in yeast. PLoS Genet 4: e1000183. doi: 10.1371/journal.pgen.1000183.
Eyre-Walker A, Woolfit M, Phelps T. 2006. The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics 173: 891–900.
Fitch WM. 1971. Toward defining the course of evolution: Minimum change for a specific tree topology. Syst Zool 20: 406–416.
Gao L, Zhang J. 2003. Why are some human disease-associated mutations fixed in mice? Trends Genet 19: 678–681.
Hedges SB, Dudley J, Kumar S. 2006. TimeTree: A public knowledge-base of divergence times among organisms. Bioinformatics 22: 2971–2972.
Kimura M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, UK.
Kondrashov AS. 2003. Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum Mutat 21: 12–27.
Kondrashov AS, Sunyaev S, Kondrashov FA. 2002. Dobzhansky-Muller incompatibilities in protein evolution. Proc Natl Acad Sci 99: 14878–14883.
Kryukov GV, Pennacchio LA, Sunyaev SR. 2007. Most rare missense alleles are deleterious in humans: Implications for complex disease and association studies. Am J Hum Genet 80: 727–739.
Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, et al. 2009. The UCSC Genome Browser Database: Update 2009. Nucleic Acids Res 37: D755–D761.
Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, et al. 2007. The diploid genome sequence of an individual human. PLoS Biol 5: e254. doi: 10.1371/journal.pbio.0050254.
Lohmueller KE, Indap AR, Schmidt S, Boyko AR, Hernandez RD, Hubisz MJ, Sninsky JJ, White TJ, Sunyaev SR, Nielsen R, et al. 2008. Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.
Miller MP, Kumar S. 2001. Understanding human disease mutations through the use of interspecific genetic variation. Hum Mol Genet 10: 2319–2328.
Ng PC, Henikoff S. 2003. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31: 3812–3814.
Ng PC, Henikoff S. 2006. Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7: 61–80.
Pruitt KD, Tatusova T, Maglott DR. 2007. NCBI reference sequences (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35: D61–D65.
Ramensky V, Bork P, Sunyaev S. 2002. Human non-synonymous SNPs: Server and survey. Nucleic Acids Res 30: 3894–3900.
Rudd MF, Williams RD, Webb EL, Schmidt S, Sellick GS, Houlston RS. 2005. The predicted impact of coding single nucleotide polymorphisms database. Cancer Epidemiol Biomarkers Prev 14: 2598–2604.
Shastry BS. 2007. SNPs in disease gene mapping, medicinal drug development and evolution. J Hum Genet 52: 871–880.
Subramanian S, Kumar S. 2004. Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome. Genetics 168: 373–381.
Subramanian S, Kumar S. 2006a. Evolutionary anatomies of positions and types of disease-associated and neutral amino acid mutations in the human genome. BMC Genomics 7: 306. doi: 10.1186/1471-2164-7-306.
Subramanian S, Kumar S. 2006b. Higher intensity of purifying selection on >90% of the human genes revealed by the intrinsic replacement mutation rates. Mol Biol Evol 23: 2283–2287.
Sunyaev SR, Eisenhaber F, Rodchenkov IV, Eisenhaber B, Tumanyan VG, Kuznetsov EN. 1999. PSIC: Profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng 12: 387–394.
Thompson JD, Higgins DG, Gibson TJ. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680.
Tian J, Wu N, Guo X, Guo J, Zhang J, Fan Y. 2007. Predicting the phenotypic effects of non-synonymous single nucleotide polymorphisms based on support vector machines. BMC Bioinformatics 8: 450. doi: 10.1186/1471-2105-8-450.
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, et al. 2008. The diploid genome sequence of an Asian individual. Nature 456: 60–65.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562.
Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452: 872–876.
Yampolsky LY, Kondrashov FA, Kondrashov AS. 2005. Distribution of the strength of selection against amino acid replacements in human proteins. Hum Mol Genet 14: 3191–3201.

Received February 1, 2009; accepted in revised form June 8, 2009.