Bioinformatics: Original Paper
© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected]
D.Börnigen et al.
An unbiased evaluation of gene prioritization tools
within already known genes are not considered. This process was kept active for 6 months (May 15–November 15, 2010) and led to a collection of 42 associations (see Table 1 and Supplementary Table S2). For each association, the tools are run as soon as the association is identified, following the defined workflow (see later). By doing this, we simulate as much as possible the prediction of a novel disease gene, since the underlying databases are still unaware of the association. Once an association is identified, the exact inputs for the different tools have to be defined. For instance, ToppGene, GeneDistiller, GeneWanderer, Pinta and Endeavour require training genes (genes already known to be associated with the disease under study), whereas Suspects, Posmed, GeneDistiller and Candid require keywords that describe the disease. Training genes and keywords are collected from the corresponding OMIM pages, Genetic Association Database (GAD) pages and from recently published reviews when possible. BioMart (Haider et al., 2009) is used to map between gene symbols and tool-specific gene identifiers (e.g. EntrezGene or Ensembl identifiers). As mentioned earlier, most of the tools in addition require a set of candidate genes (from the whole genome). Several tools accept chromosomal coordinates, whereas some prefer cytogenetic bands. For each association, we select the cytogenetic bands that cover 10 Mb around the novel disease gene and derive the chromosomal coordinates. We choose 10 Mb to obtain on average at least 100 candidate genes. Once again, BioMart is used to retrieve specific gene identifiers. For an overview of the inputs for the 42 associations, please see Supplementary Table S3.

The resulting 42 novel disease-gene associations do not represent a homogeneous set. Therefore, we have divided them into confirmed (for monogenic diseases, the mutation is found in at least two unrelated patients; for multifactorial diseases, a GWAS is replicated in a separate cohort), intermediate (a single study, but additional functional evidence is provided) and unconfirmed (a single study) associations.
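As an illustration of the candidate-region step described above, the following minimal sketch derives chromosomal coordinates for a window around a gene. It is a simplification and not the workflow actually used: it works directly from gene coordinates rather than from cytogenetic bands, and it assumes that "10 Mb around the gene" means a 10 Mb window centred on the gene midpoint. The names `Region`, `candidate_region`, `window_mb` and `chromosome_length` are hypothetical, as are the coordinates in the usage lines.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1_000_000  # base pairs per megabase


@dataclass(frozen=True)
class Region:
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def candidate_region(chromosome: str,
                     gene_start: int,
                     gene_end: int,
                     window_mb: float = 10.0,
                     chromosome_length: Optional[int] = None) -> Region:
    """Return a window of about `window_mb` Mb centred on the gene midpoint.

    Assumption: the window is symmetric around the midpoint; the original
    study selects cytogenetic bands instead, which this sketch does not model.
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    half = int(window_mb * MB / 2)
    midpoint = (gene_start + gene_end) // 2
    # Clip the window so that it never starts before position 1.
    start = max(1, midpoint - half)
    end = midpoint + half
    # Clip at the chromosome end only when its length is supplied.
    if chromosome_length is not None:
        end = min(end, chromosome_length)
    return Region(chromosome, start, end)


# Hypothetical gene coordinates, for illustration only.
region = candidate_region("7", 50_000_000, 50_100_000)
print(region)
```

The resulting coordinates could then be passed to a tool that accepts chromosomal regions, or used to list the genes in the window.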
2.3 Performance measures

For each tool, we then assess its ability to identify the novel disease genes as promising genes using several statistical measures. We first compute the median of the rank ratio over all associations. We prefer the rank ratio to the raw rank because tools do not necessarily return the same number of candidate genes even when fed with the same inputs. In addition, we also draw the boxplots of these rank ratios to give a more comprehensive view of tool performance. Another method to compare the tools is to build the receiver-operating characteristic (ROC) curves and to compute the area under the curve (AUC) as an estimate of the global performance. To compare the tools even further, we compute the true-positive rates when setting the threshold for validation at the top 5% (True Positive

Table 2. Results for the genome-wide and candidate set based prioritization tools

Tool            Median   Response   TPR in top   TPR in top   TPR in top
                         rate (%)   5% (%)       10% (%)      30% (%)
Genome-wide prioritization tools
Candid          18.10    100        21.4         33.3         64.3
Endeavour-GW    15.49    100        28.6         38.1         71.4
Pinta-GW        19.03    100        26.2         31.0         71.4
Integration     12.45    100        19.1         38.1         78.6
Candidate set based prioritization tools
Suspects        12.77a   88.9a      33.3a        33.3a        63.0a
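The rank-based measures of Section 2.3 can be sketched as follows. This is a minimal illustration and not the evaluation code of the study: it assumes that the rank ratio is the rank of the novel disease gene divided by the number of genes ranked, expressed as a percentage, and that the TPR in the top k% is the share of associations whose rank ratio is at most k. The function names and the example ranks are hypothetical.

```python
from statistics import median
from typing import Sequence


def rank_ratio(rank: int, n_ranked: int) -> float:
    """Rank of the true gene divided by the number of ranked genes, in percent."""
    if not 1 <= rank <= n_ranked:
        raise ValueError("rank must lie between 1 and n_ranked")
    return 100.0 * rank / n_ranked


def tpr_in_top(rank_ratios: Sequence[float], threshold_percent: float) -> float:
    """Percentage of associations whose true gene is within the top threshold."""
    if not rank_ratios:
        raise ValueError("at least one rank ratio is required")
    hits = sum(1 for r in rank_ratios if r <= threshold_percent)
    return 100.0 * hits / len(rank_ratios)


# Made-up (rank, number of ranked genes) pairs for one tool, one per association.
results = [(3, 120), (40, 100), (12, 150), (75, 110)]
ratios = [rank_ratio(rank, n) for rank, n in results]

print("median rank ratio:", median(ratios))
for threshold in (5, 10, 30):
    print(f"TPR in top {threshold}%:", tpr_in_top(ratios, threshold))
```

Dividing by the number of ranked genes makes tools that return lists of different lengths comparable, which is the reason given above for preferring the rank ratio to the raw rank.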
general, more genes than keywords for training (18.8 genes on average for six keywords). This also indicates that more keywords might be needed to model a disease and that a small text (such as an OMIM entry) might even be necessary (van Driel et al., 2006).

There is in general an agreement between the five performance measures we use throughout our study. One notable exception exists for ToppGene, whose AUC is 66%, which corresponds to the 10th rank (out of the 12 prioritization tools). In contrast, its associated TPR in top 10% is 42.9%, which corresponds to the second rank. This apparent contradiction can be explained by

us to identify the influence of the size of the gene list to prioritize. The median rank ratio is better for Endeavour-CS (11.16) than for Endeavour-GW (15.49) in our benchmark. The difference remains, albeit smaller, when considering the AUC and the TPR in top 10 and 30%.

The same training genes are used, and therefore the observed difference is only caused by extending the small candidate gene set to the whole genome. This confirms previous findings that prioritizing the whole genome is more difficult than prioritizing a rather small positive locus. The heat map indicates that the two Endeavour modes are strongly correlated as expected since the
that not all tools are influenced by the intrinsic complexity of multifactorial diseases. For instance, Endeavour and ToppGene seem to perform better for monogenic conditions while GeneWanderer and Suspects perform better for complex disorders. However, the size of our validation dataset does not allow for a complete statistical analysis. Larger validation datasets and real predictive studies will be pursued to complement our preliminary study.

We are aware of the limited coverage in our study of the available literature in human genetics that reports novel disease-gene associations. However, we aimed at estimating the real performance

projects, G.0318.05 (subfunctionalization), G.0553.06 (VitamineD), G.0302.07 (SVM/Kernel), research communities (ICCoS, ANMMM, MLDM); G.0733.09 (3UTR); G.082409 (EGFR), IWT: PhD Grants, Silicos; SBO-BioFrame, SBO-MoKa, TBM-IOTA3, FOD:Cancer plans, IBBT]; Belgian Federal Science Policy Office [IUAP P6/25 (BioMaGNet, Bioinformatics and Modeling: from Genomes to Networks, 2007–2011)]; EU-RTD [ERNSI: European Research Network on System Identification; FP7-HEALTH CHeartED].

Conflict of Interest: none declared.
Hardy,J. and Singleton,A. (2009) Genomewide association studies and human disease. N. Engl. J. Med., 360, 1759–1768.
Hirschfield,G.M. et al. (2010) Variants at IRF5-TNPO3, 17q12-21 and MMEL1 are associated with primary biliary cirrhosis. Nat. Genet., 42, 655–657.
Hüffmeier,U. et al. (2010) Common variants at TRAF3IP2 are associated with susceptibility to psoriatic arthritis and psoriasis. Nat. Genet., 42, 996–999.
Hutz,J.E. et al. (2008) CANDID: a flexible method for prioritizing candidate genes for complex human traits. Genet. Epidemiol., 32, 779–790.
Kantarci,S. et al. (2010) Characterization of the chromosome 1q41q42.12 region, and the candidate gene DISP1, in patients with CDH. Am. J. Med. Genet. A, 152, 2493–2504.
Köhler,S. et al. (2008) Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet., 82, 949–958.
Letra,A. et al. (2010) Follow-up association studies of chromosome region 9q and
Remmers,E.F. et al. (2010) Genome-wide association study identifies variants in the MHC class I, IL10, and IL23R-IL12RB2 regions associated with Behçet’s disease. Nat. Genet., 42, 698–702.
Safran,M. et al. (2010) GeneCards version 3: the human gene integrator. Database, 2010: article ID baq020; doi:10.1093/database/baq020.
Sampson,M.G. et al. (2010) Evidence for a recurrent microdeletion at chromosome 16p11.2 associated with congenital anomalies of the kidney and urinary tract (CAKUT) and Hirschsprung disease. Am. J. Med. Genet. A, 152, 2618–2622.
Schuster,S.C. (2008) Next-generation sequencing transforms today’s biology. Nat. Methods, 5, 16–18.
Seelow,D. et al. (2008) GeneDistiller—distilling candidate genes from linkage intervals. PLoS One, 3, e3874.
Sheen,V.L. et al. (2010) Mutation in PQBP1 is associated with periventricular heterotopia. Am. J. Med. Genet. A, 152, 2888–2890.