Mileidy W. Gonzalez and William R. Pearson

The RefProtDom database assembles a set of diverse protein domain queries from Pfam and uses them to search a target library of over 200,000 full-length UniProt proteins containing those domains. This allows evaluation of sequence search tools like BLAST and PSI-BLAST on challenging homology detection cases against a comprehensive sequence database. The RefProtDom annotations seek to reduce unannotated homologies and more accurately estimate domain boundaries to improve domain alignment accuracy, which is important for functions like protein structure prediction and gene ontology annotation. Users can access the query sequences, target library, and annotation files through the RefProtDom website to evaluate search results.

INTRODUCTION
Evaluating and improving protein sequence similarity searches, whether with algorithms like BLAST or Smith-Waterman (SSEARCH) or with more sophisticated methods like PSI-BLAST or HMMER, requires query sequences and reference sets curated to accurately reflect homology relationships. Because structural similarity is preserved well beyond sequence similarity, protein structures are often the gold standard for annotating homology relationships. However, structure-based reference sets do not reflect common practice in protein similarity searching, which is to characterize unknown proteins by searching large, comprehensive protein sets like RefSeq and UniProt.

To better characterize similarity searching strategies, in particular PSI-BLAST performance, against comprehensive protein databases, we identified a set of diverse protein domains from Pfam to use as queries against a set of real proteins containing those domains. Query domain families are taxonomically broad (to provide harder homology detection cases) and have long models (to better simulate full-length protein searches).

DATABASE ASSEMBLY

EVALUATION DATASETS:
From 681 initial Pfam families that met criteria for (a) domain length, (b) taxonomic diversity, (c) family size and (d) available structure, we selected 344 query Pfam families after merging families that belonged to the same clan: 81 families belonged to distinct clans and 263 families did not have an associated clan. This set was reduced to 320 non-homologous domains using information from Pfam.

The target library was built from 234,505 full-length UniProt proteins containing Pfam v. 21 homologs to the original 320 Pfam families together with 1,627 other domain families. Two query sets were constructed and the members of these sets evaluated further: (a) a challenging query subset (50 hard) with the lowest family coverage with BLAST, and (b) a randomly sampled representative query set (50 sampled with replacement).
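The construction of the two query sets can be sketched as follows. This is a minimal illustration, not the authors' procedure: the `coverage` mapping, the family names and the function `pick_query_sets` are hypothetical, and reading "sampled with replacement" as independent random draws is an assumption.

```python
import random

def pick_query_sets(coverage, k=50, seed=0):
    """Return (hard, sampled) lists of family ids.

    coverage: dict mapping a family id to the fraction of its members
    found by BLAST (hypothetical input; the source does not say how
    coverage was stored or computed).
    """
    # "hard": the k families with the lowest BLAST family coverage
    # (ties are broken by family id so the result is deterministic)
    hard = sorted(coverage, key=lambda fam: (coverage[fam], fam))[:k]
    # "sampled": k random draws with replacement, so a family may repeat
    rng = random.Random(seed)
    sampled = rng.choices(sorted(coverage), k=k)
    return hard, sampled

# Toy data with made-up family names and coverage values
toy_coverage = {"famA": 0.90, "famB": 0.15, "famC": 0.40, "famD": 0.05}
hard, sampled = pick_query_sets(toy_coverage, k=2)
print(hard)
```

With the real data, `k` would be 50 and `coverage` would hold one entry per candidate query family.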

ANNOTATION EXTENSIONS:

Thousands of alignments to very similar UniProt sequences were annotated as partial homologs or non-homologs. To correct these conservative annotations, we compared the bare domain query sequences to the target library using SSEARCH and GLSEARCH. RefProtDom describes relationships and alignment boundaries between query domains and the target library homologs according to Pfam v. 21, Pfam v. 24, and the SSEARCH/GLSEARCH alignment boundaries.

Although SSEARCH/GLSEARCH searches against the target library dramatically reduced the number of apparent false positives with very low E()-values, additional searches with PSI-BLAST using the queries sometimes found unrelated UniProt sequences with significant (E() < 10^-40) scores. Structures of significant non-homologs that mapped to unrelated Pfam families were examined in SCOP and CATH; if they shared the same SCOP fold or CATH topology they were annotated as homologs.

SUMMARY
Iterative similarity searches are usually performed against full-length proteins with complex domain architectures. RefProtDom's greatest strength is its use of a taxonomically diverse set of full-length, multi-domain proteins in the target library. RefProtDom can simulate searches against comprehensive sequence databases while evaluating success on challenging homologies.

The RefProtDom query and target libraries seek to reduce the number of un-annotated homologies with statistically significant similarities, and to more accurately estimate homologous domain boundaries.

By combining single domain queries with full-length, multi-domain proteins, RefProtDom can highlight alignment errors and evaluate improvements in alignment accuracy.

For iterative sequence comparison methods, alignment accuracy is crucial. RefProtDom's annotations identified a previously unrecognized alignment over-extension error in PSI-BLAST that is responsible for the corruption of its PSSMs and its poor specificity. Domains are the basic units of protein function and evolution; thus, improved homology detection requires improved domain alignment accuracy. Large-scale automatic annotation of gene function is limited by local alignments, incomplete motif matches and fuzzy domain boundaries.

Establishing homology is central to a wide array of bioinformatics methodologies. Improved domain alignments can improve 3-D protein structure predictions that use homology modelling and can clarify how protein domain networks interact to generate disease phenotypes. RefProtDom provides a comprehensive set of full-length UniProt proteins that can be used to evaluate domain alignment accuracy.

EXAMPLE
Six types of files are provided:

- Reference library sequence files
- Annotation files (list homologous domains in the reference libraries)
- Supplementary annotation files
- A tar-gzip file with sets of query sequences
- A tar-gzip file with the trees and multiple sequence alignments for the superfamilies
- A file of the most frequently asked questions (FAQ.txt)

1. Reference libraries
library_all_domains.fa.gz - Full-length UniProt proteins containing homologs to the query domains.

library_all_domains_rdm.fa.gz - Random shuffles of each of the full-length UniProt proteins in library_all_domains.fa.gz.

library_long_domains.fa.gz - A subset of the library_all_domains.fa.gz library from which proteins with homologous domains less than 75% of the Pfam model length are excluded.

library_long_domains_rdm.fa.gz - Random shuffles of each of the full-length UniProt proteins in library_long_domains.fa.gz.
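The two ideas behind these libraries, shuffled decoy sequences and the 75% long-domain cutoff, can be illustrated with a short sketch. The source does not say how the shuffles were generated or how domain length was measured, so the uniform residue shuffle, the use of 1-based inclusive coordinates, and the function names `shuffle_sequence` and `is_long_domain` are all assumptions made for illustration.

```python
import random

def shuffle_sequence(seq, seed=None):
    """Return a random permutation of the residues in seq.

    The result has the same length and amino-acid composition as seq.
    A uniform shuffle is an assumption; the source only says "random shuffles".
    """
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

def is_long_domain(domain_start, domain_end, model_len, min_fraction=0.75):
    """True if the domain spans at least min_fraction of the Pfam model length."""
    # 1-based, inclusive coordinates are assumed here
    domain_len = domain_end - domain_start + 1
    return domain_len >= min_fraction * model_len

protein = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence
decoy = shuffle_sequence(protein, seed=1)
print(len(decoy) == len(protein), is_long_domain(293, 494, 250))
```

A domain at positions 293-494 spans 202 residues, so it passes the cutoff for a hypothetical model of length 250 but not for one of length 300.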

2. Annotation files

family_members.annot.gz - lists the domains in each sequence in the library_*_domains.fa.gz files.

Format:
>[source]|[accession]|[sequence_name]
[superfamily]<tab>[domain_start]<tab>[domain_end]<tab>[e-value]<tab>[mode]<tab>[long_domain]

Example:
>up|P53627|ABFA_STRLI
PF06964 293 494 1.3e-104 pf21ls 1
>pfam21|P19801|ABP1_HUMAN
PF01179 296 715 2.3e-35 ua_pws 1
CL47 39 125 3.3e-29 pf21ls 1
CL47 141 241 1.3e-24 pf21ls 1
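Reading this format into memory could look like the sketch below. It assumes that each ">" header sits on its own line and is followed by one whitespace- or tab-separated line per domain, which is an interpretation of the format description above; `Domain` and `parse_annotations` are hypothetical names.

```python
from collections import namedtuple

Domain = namedtuple("Domain", "superfamily start end evalue mode long_domain")

def parse_annotations(lines):
    """Map each header (without the leading '>') to its list of Domain records."""
    annotations = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New library sequence: start an empty domain list
            current = line[1:]
            annotations[current] = []
        elif current is not None:
            # One domain per line, six whitespace-separated fields
            fam, start, end, evalue, mode, long_flag = line.split()
            annotations[current].append(
                Domain(fam, int(start), int(end), float(evalue), mode, long_flag == "1")
            )
    return annotations

# The example records from the text, with tab separators
example = """\
>up|P53627|ABFA_STRLI
PF06964\t293\t494\t1.3e-104\tpf21ls\t1
>pfam21|P19801|ABP1_HUMAN
PF01179\t296\t715\t2.3e-35\tua_pws\t1
CL47\t39\t125\t3.3e-29\tpf21ls\t1
CL47\t141\t241\t1.3e-24\tpf21ls\t1
""".splitlines()

annot = parse_annotations(example)
print(len(annot["pfam21|P19801|ABP1_HUMAN"]))
```

For the compressed file, the same function could be fed the lines of `gzip.open(path, "rt")`, where `path` points to a local copy of family_members.annot.gz.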

3. Supplementary annotation files

family_query.summary - Lists the size and names of the queries for each of the chosen families.

pfam_to_clan.txt - Lists the Pfam family to clan superfamily correspondence.

refprotdom_domain_bound_ext.txt - Lists the domains that in Pfam v.21 were annotated as partial homologies whose coordinates we extended.

refprotdom_unannot_homol.txt - Lists missed/unannotated homologs in Pfam v.21 that we uncovered with reverse PSI-BLAST searches, pair-wise searches or through SCOP/CATH structural evidence.

4. Query sequences
queries.tgz is a gzip-ed tar file that produces the following directories:

queries/by_difficulty/
queries/by_tree_location/

In queries/by_difficulty/, there are two classes of query sequence files, each of which contains 50 domain sequences, in 10 different random-sequence embeddings:

hard_embedded.[1-10].fa
sampled_embedded.[1-10].fa

"hard" domains are domains that find the smallest number of related sequences after a BLAST search. "sampled" domains were chosen at random from 640 domains selected because of their length and phylogenetic diversity.

"queries/by_tree_location/" also contains two classes of query sequence files; the classes are "des", for queries from relatively deserted parts of the domain phylogenetic tree, and "pop", for queries from a populated region.

QUERY EMBEDDING: All queries are available as bare domains (non-embedded/ne) or flanked by artificial proteins (embedded/e#).

QUERY FILE NAMES: A query is a sequence domain from a family that falls under any of the 4 types of queries.

format: [type]_[embedding].[e#].fa

Query files are in FASTA format, with header lines of the form:

>[query_accession] e_d_start:# e_d_end:# from:[sequence_id] ([domain_start][domain_end]); pfam:[pfam_superfamily]; model_len:[#]; all_homol:[#]; long_homol:[#];

For example, qPF00589_e5 is a domain from PF00589 that has been embedded in the 5th shuffle replicate.

e_d_start/e_d_end - The boundaries of the Pfam domain in the query sequence.
from - The original pfamseq_id (UniProt id) and coordinates (start-end) of the query domain.
pfam - The Pfam accession name (PF#####) or clan number (CL###). No clan accession is given when a superfamily contains a single Pfam family.
model_len - The length of the Pfam domain model.
all_homol - Number of homologs that this family has in the "library_all_domains.fa" library.
long_homol - Number of homologs that this family has in "library_long_domains.fa".
descr - Description of the Pfam domain.
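A query accession such as qPF00589_e5 can be split into its family and embedding replicate with a small helper. This is only a sketch built from that one example: the regular expression, the function name `parse_query_accession`, and the handling of a "_ne" suffix for non-embedded queries are assumptions, since the source shows no bare-domain accession.

```python
import re

# q + Pfam/clan accession + "_" + either e<replicate> or ne (the ne suffix is assumed)
_QUERY_ID = re.compile(r"^q(?P<family>(?:PF|CL)\d+)_(?:e(?P<rep>\d+)|ne)$")

def parse_query_accession(accession):
    """Return (family, replicate); replicate is None for a non-embedded query."""
    match = _QUERY_ID.match(accession)
    if match is None:
        raise ValueError(f"unrecognized query accession: {accession!r}")
    rep = match.group("rep")
    return match.group("family"), int(rep) if rep is not None else None

family, replicate = parse_query_accession("qPF00589_e5")
print(family, replicate)
```

The family returned here is the value to compare against the superfamily column of the annotation file when scoring hits.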

5. Trees and MSAs

trees.tgz is a gzip-ed tar file that produces the following directories:

trees/all_domains_in_family/
trees/long_domains_in_family/

"trees/all_domains_in_family/" contains trees of all domain members of each superfamily. "trees/long_domains_in_family/" contains trees of the long-domain members of each superfamily.

6. Using the files -- evaluating search alignment accuracy

To determine whether the alignments are true positives (TPs) or false positives (FPs), all you need to know is the library sequence's id (e.g. up|Q1YWW7|Q1YWW7_PHOPR) and the Pfam superfamily to which the query belongs.
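That lookup can be sketched in a few lines. The sketch assumes the annotation file has already been reduced to a mapping from library sequence id to the set of superfamilies it contains; the function name `classify_hit` and the toy mapping (built from the two example records in the annotation file section) are illustrative, and alignment coordinates are deliberately ignored.

```python
def classify_hit(library_id, query_superfamily, annotations):
    """Return 'TP' if the library sequence contains the query's superfamily, else 'FP'.

    annotations: dict mapping a library sequence id to a set of superfamily
    accessions (an assumed in-memory form of family_members.annot.gz).
    Ids missing from the mapping, such as shuffled sequences, count as FPs.
    """
    families = annotations.get(library_id, set())
    return "TP" if query_superfamily in families else "FP"

annotations = {
    "up|P53627|ABFA_STRLI": {"PF06964"},
    "pfam21|P19801|ABP1_HUMAN": {"PF01179", "CL47"},
}

print(classify_hit("pfam21|P19801|ABP1_HUMAN", "CL47", annotations))
print(classify_hit("up|P53627|ABFA_STRLI", "CL47", annotations))
```

Checking whether the aligned region also overlaps the annotated domain boundaries would require the start and end columns, which this sketch does not use.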

REFERENCE

http://bioinformatics.oxfordjournals.org/content/26/18/2361.full.pdf+html?sid=8e863a03-b8cf-4b0a-90b69396115deb47
http://faculty.virginia.edu/wrpearson/fasta/PUBS/gonzalez09a/

THANK YOU
