Mileidy W. Gonzalez and William R. Pearson
INTRODUCTION
Evaluating and improving protein sequence similarity searches, whether with algorithms like BLAST or Smith-Waterman (SSEARCH) or with more sophisticated methods like PSI-BLAST or HMMER, requires query sequences and reference sets curated to accurately reflect homology relationships. Because structural similarity is preserved well beyond sequence similarity, protein structures are often the gold standard for annotating homology relationships. However, structure-based reference sets do not reflect common practice in protein similarity searching, which is to characterize unknown proteins by searching large, comprehensive protein sets like RefSeq and UniProt.
To better characterize similarity searching strategies, in particular PSI-BLAST performance, against comprehensive protein databases, we identified a set of diverse protein domains from Pfam to use as queries against a set of real proteins containing those domains. Query domain families are taxonomically broad (to provide harder homology detection cases) and have long models (to better simulate full-length protein searches).
DATABASE ASSEMBLY
EVALUATION DATASETS:
From 681 initial Pfam families that met criteria for (a) domain length, (b) taxonomic diversity, (c) family size, and (d) available structure, we selected 344 query Pfam families after merging families that belonged to the same clan: 81 families belonged to distinct clans, and 263 families did not have an associated clan. This set was reduced to 320 nonhomologous domains using information from Pfam.
The target library was built from 234,505 full-length UniProt proteins containing Pfam v. 21 homologs to the original 320 Pfam families together with 1,627 other domain families. Two query sets were constructed and the members of these sets evaluated further: (a) a challenging query subset (50 hard) with the lowest family coverage with BLAST, and (b) a randomly sampled representative query set (50 sampled with replacement).
ANNOTATION EXTENSIONS:
Thousands of alignments to very similar UniProt sequences were annotated as partial homologs or non-homologs. To correct these conservative annotations, we compared the bare domain query sequences to the target library using SSEARCH and GLSEARCH. RefProtDom describes relationships and alignment boundaries between query domains and the target library homologs according to Pfam v. 21, Pfam v. 24, and the SSEARCH/GLSEARCH alignment boundaries.
Although SSEARCH/GLSEARCH searches against the target library dramatically reduced the number of apparent false positives with very low E()-values, additional searches with PSI-BLAST using the queries sometimes found unrelated UniProt sequences with significant (E() < 10^-40) scores. Structures of significant non-homologs that mapped to unrelated Pfam families were examined in SCOP and CATH; if they shared the same SCOP fold or CATH topology, they were annotated as homologs.
SUMMARY
Iterative similarity searches are usually performed against full-length proteins with complex domain architectures. RefProtDom's greatest strength is its use of a taxonomically diverse set of full-length, multi-domain proteins in the target library. RefProtDom can simulate searches against comprehensive sequence databases while evaluating success on challenging homologies.
The RefProtDom query and target libraries seek to reduce the number of un-annotated homologies with statistically significant similarities, and to more accurately estimate homologous domain boundaries.
By combining single domain queries with full-length, multi-domain proteins, RefProtDom can highlight alignment errors and evaluate improvements in alignment accuracy.
For iterative sequence comparison methods, alignment accuracy is crucial. RefProtDom's annotations identified a previously unrecognized alignment overextension error in PSI-BLAST that is responsible for the corruption of its PSSMs and its poor specificity. Domains are the basic units of protein function and evolution; thus, improved homology detection requires improved domain alignment accuracy. Large-scale automatic annotation of gene function is limited by local alignments, incomplete motif matches, and fuzzy domain boundaries.
Establishing homology is central to a wide array of bioinformatics methodologies. Improved domain alignments can improve 3-D protein structure predictions that use homology modelling and clarify how protein domain networks interact to generate disease phenotypes. RefProtDom provides a comprehensive set of full-length UniProt proteins that can be used to evaluate domain alignment accuracy.
EXAMPLE
Six types of files are provided:
1. Reference library sequence files
2. Annotation files (list homologous domains in the reference libraries)
3. Supplementary annotation files
4. A tar-gzip file with sets of query sequences
5. A tar-gzip file with the trees and multiple sequence alignments for the superfamilies
6. A file of the most frequently asked questions (FAQ.txt)
1. Reference libraries
library_all_domains.fa.gz - Full-length UniProt proteins containing homologs to the query domains.
library_all_domains_rdm.fa.gz - Random shuffles of each of the full-length UniProt proteins in library_all_domains.fa.gz.
library_long_domains.fa.gz - A subset of the library_all_domains.fa.gz library from which proteins with homologous domains less than 75% of the Pfam model length are excluded.
library_long_domains_rdm.fa.gz - Random shuffles of each of the full-length UniProt proteins in library_long_domains.fa.gz.
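The two library variants described above (shuffled copies and the long-domain subset) can be illustrated with a rough sketch. This is not the procedure used to build RefProtDom: the function names, the uniform residue shuffle, and the use of sequence coordinates to measure domain length are all assumptions made for illustration, and the toy sequence is invented.

```python
import random


def shuffle_sequence(seq, seed=None):
    """Return a uniformly shuffled copy of seq (assumed shuffle method).

    The residue composition and length are preserved; only the order changes.
    """
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def is_long_domain(domain_start, domain_end, model_len, min_fraction=0.75):
    """True if the domain covers at least min_fraction of the Pfam model length.

    Assumption: domain length is measured from 1-based, inclusive
    sequence coordinates.
    """
    domain_len = domain_end - domain_start + 1
    return domain_len >= min_fraction * model_len


toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence, not from the library
shuffled = shuffle_sequence(toy_protein, seed=1)
```

A shuffled sequence keeps the amino acid composition of its source protein, so it can serve as a non-homologous control of the same length and composition.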
2. Annotation files
Format:
>[source]|[accession]|[sequence_name]
[superfamily]<tab>[domain_start]<tab>[domain_end]<tab>[e-value]<tab>[mode]<tab>[long_domain]
For example:
>up|P53627|ABFA_STRLI
PF06964 293 494 1.3e-104 pf21ls 1
>pfam21|P19801|ABP1_HUMAN
PF01179 296 715 2.3e-35 ua_pws 1
CL47 39 125 3.3e-29 pf21ls 1
CL47 141 241 1.3e-24 pf21ls 1
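A minimal sketch of reading this annotation format is given here. It assumes that each ">" header line is followed by one domain line per annotated domain, that fields contain no internal whitespace, and that a long_domain value of 1 means the domain is long; the names Domain and parse_annotations are hypothetical and are not part of the RefProtDom distribution.

```python
from collections import namedtuple

# One annotated domain on a library sequence.
Domain = namedtuple("Domain", "superfamily start end evalue mode long_domain")


def parse_annotations(lines):
    """Map each library sequence id (header without '>') to its list of domains."""
    annotations = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: >[source]|[accession]|[sequence_name]
            current = line[1:].split()[0]
            annotations[current] = []
        else:
            if current is None:
                raise ValueError("domain line found before any header line")
            # Domain line: split on any whitespace (tabs in the real files).
            fam, start, end, evalue, mode, long_flag = line.split()
            annotations[current].append(
                Domain(fam, int(start), int(end), float(evalue), mode, long_flag == "1")
            )
    return annotations


example_lines = [
    ">up|P53627|ABFA_STRLI",
    "PF06964\t293\t494\t1.3e-104\tpf21ls\t1",
    ">pfam21|P19801|ABP1_HUMAN",
    "PF01179\t296\t715\t2.3e-35\tua_pws\t1",
    "CL47\t39\t125\t3.3e-29\tpf21ls\t1",
    "CL47\t141\t241\t1.3e-24\tpf21ls\t1",
]
annotations = parse_annotations(example_lines)
```

Keying the result by the full library id keeps lookups simple when search output reports the same id string.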
family_query.summary - Lists the size and names of the queries for each of the chosen families.
pfam_to_clan.txt - Lists the Pfam family to clan superfamily correspondence.
refprotdom_domain_bound_ext.txt - Lists the domains that were annotated as partial homologies in Pfam v. 21 and whose coordinates we extended.
refprotdom_unannot_homol.txt - Lists missed/unannotated homologs in Pfam v. 21 that we uncovered with reverse PSI-BLAST searches, pair-wise searches, or SCOP/CATH structural evidence.
4. Query sequences
queries.tgz is a gzipped tar file that unpacks into the following directories:
queries/by_difficulty/
queries/by_tree_location/
In queries/by_difficulty/, there are two classes of query sequence files, each of which contains 50 domain sequences, in 10 different random-sequence embeddings:
hard_embedded.[1-10].fa
sampled_embedded.[1-10].fa
"hard" domains are the domains that find the smallest number of related sequences in a BLAST search. "sampled" domains were chosen at random from 640 domains selected because of their length and phylogenetic diversity.
"queries/by_tree_location/" also contains two classes of query sequence files: "des", for queries from relatively deserted parts of the domain phylogenetic tree, and "pop", for queries from a populated region.
QUERY EMBEDDING: All queries are available as bare domains (non-embedded/ne) or flanked by artificial proteins (embedded/e#).
QUERY FILE NAMES: A query is a domain sequence from a family that falls under any of the 4 types of queries.
format: [type]_[embedding].[e#].fa
Query files are in FASTA format, with header lines of the form:
>[query_accession] e_d_start:# e_d_end:# from:[sequence_id] ([domain_start]-[domain_end]); pfam:[pfam_superfamily]; model_len:[#]; all_homol:[#]; long_homol:[#];
For example, qPF00589_e5 is a domain from PF00589 that has been embedded in the 5th shuffle replicate.
The header fields are:
e_d_start, e_d_end - The boundaries of the Pfam domain in the query sequence.
from - The original pfamseq_id (UniProt id) and coordinates (start-end) of the query domain.
pfam - The Pfam accession name (PF#####) or clan number (CL###). No clan accession is given when a superfamily contains a single Pfam family.
model_len - The length of the Pfam domain model.
all_homol - Number of homologs that this family has in the "library_all_domains.fa" library.
long_homol - Number of homologs that this family has in the "library_long_domains.fa" library.
descr - Description of the Pfam domain.
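The query header fields described above could be extracted with a short parser along these lines. This is a sketch, not a tool shipped with RefProtDom: the function name parse_query_header is hypothetical, the sequence id and every numeric value in the example header are invented placeholders, and the code assumes the source coordinates are written as "(start-end)".

```python
import re


def parse_query_header(header):
    """Extract the key:value fields from a RefProtDom-style query header line."""
    text = header.lstrip(">").strip()
    accession = text.split()[0]
    # Collect every "key:value" pair; values end at whitespace or ';'.
    fields = dict(re.findall(r"(\w+):([^\s;]+)", text))
    info = {
        "accession": accession,
        "superfamily": fields.get("pfam"),
        "source_id": fields.get("from"),
    }
    for key in ("e_d_start", "e_d_end", "model_len", "all_homol", "long_homol"):
        if key in fields:
            info[key] = int(fields[key])
    # Coordinates of the domain in its source protein, assumed "(start-end)".
    coords = re.search(r"\((\d+)\s*-\s*(\d+)\)", text)
    if coords:
        info["source_coords"] = (int(coords.group(1)), int(coords.group(2)))
    return info


# Hypothetical header: the sequence id and all numbers are placeholders.
example_header = (
    ">qPF00589_e5 e_d_start:101 e_d_end:270 from:EXAMPLE_SEQ (110-279); "
    "pfam:PF00589; model_len:173; all_homol:1000; long_homol:800;"
)
query_info = parse_query_header(example_header)
```

The "superfamily" entry is the value needed later to decide whether a library hit is a homolog of the query.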
"trees/all_domains_in_family/" contains trees of all domain members of each superfamily. "trees/long_domains_in_family/" contains trees of the long-domain members of each superfamily .
To determine whether an alignment is a true positive (TP) or a false positive (FP), all you need to know is the library sequence's id (e.g. up|Q1YWW7|Q1YWW7_PHOPR) and the Pfam superfamily to which the query belongs.
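The lookup just described can be sketched as follows. This is an illustration under assumptions: classify_hit is a hypothetical name, the annotation table is reduced to a mapping from library id to the set of superfamilies annotated on that sequence, and any library sequence without a matching annotation is simply counted as a false positive. It does not check whether the alignment actually overlaps the annotated domain boundaries.

```python
def classify_hit(library_id, query_superfamily, superfamilies_by_id):
    """Return "TP" if the library sequence carries the query's superfamily, else "FP"."""
    families = superfamilies_by_id.get(library_id.lstrip(">"), set())
    return "TP" if query_superfamily in families else "FP"


# Superfamilies taken from the annotation file examples in this document.
superfamilies_by_id = {
    "up|P53627|ABFA_STRLI": {"PF06964"},
    "pfam21|P19801|ABP1_HUMAN": {"PF01179", "CL47"},
}

hit_same_family = classify_hit("up|P53627|ABFA_STRLI", "PF06964", superfamilies_by_id)
hit_other_family = classify_hit("up|P53627|ABFA_STRLI", "CL47", superfamilies_by_id)
hit_multi_domain = classify_hit("pfam21|P19801|ABP1_HUMAN", "CL47", superfamilies_by_id)
```

Because library proteins are multi-domain, a single sequence can be a true positive for several different query superfamilies.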