InterPro Final Print


Bioinformatics Practical No.

InterPro

Aim: To analyse a protein sequence using InterPro.

1. Introduction:
InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium. The aim of InterPro is to combine their individual strengths to provide a single resource through which scientists can access comprehensive information about protein families, domains and functional sites.

The InterPro Consortium
The following databases make up the InterPro Consortium:
1) PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify which known protein family a new sequence belongs to. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
2) HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
3) Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. Pfam is based at the Wellcome Trust Sanger Institute, Hinxton, UK.
4) PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.
5) The ProDom protein domain database consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches. ProDom is based at PRABI Villeurbanne, France.

6) SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
7) TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. TIGRFAMs is based at the J. Craig Venter Institute, Rockville, MD, US.
8) The PIRSF protein classification system is a network with multiple levels of sequence diversity, from superfamilies to subfamilies, that reflects the evolutionary relationship of full-length proteins and domains. PIRSF is based at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, US.
9) SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.
10) The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Predicted structure and sequence domains are mapped using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College, London, UK.
11) PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function, as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences. PANTHER is based at the University of Southern California, CA, US.

Contents and coverage of InterPro 42.0
InterPro protein matches are calculated for all UniProtKB and UniParc proteins. The following statistics are for all UniProtKB proteins. InterPro release 42.0 contains 24622 entries (last entry: IPR027636), representing:
1) Family (16547)
2) Domain (6972)
3) Repeat (273)
4) Sites:
   a) Active site (105)
   b) Binding site (71)
   c) Conserved site (639)
   d) PTM (15)
InterPro cites 37735 publications in PubMed.
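As a quick consistency check, the per-type entry counts quoted above can be added up and compared with the stated release total. The short Python sketch below does only that arithmetic; the variable names are illustrative and the numbers are copied from the text, not fetched from InterPro.

```python
# Entry counts per type, copied from the InterPro 42.0 figures quoted above.
entry_counts = {
    "Family": 16547,
    "Domain": 6972,
    "Repeat": 273,
    "Active site": 105,
    "Binding site": 71,
    "Conserved site": 639,
    "PTM": 15,
}
reported_total = 24622  # total number of entries stated for release 42.0

# Sum over every entry type, and separately over the four site types.
computed_total = sum(entry_counts.values())
site_types = ("Active site", "Binding site", "Conserved site", "PTM")
site_total = sum(entry_counts[name] for name in site_types)

print("Computed total:", computed_total)
print("Site entries:", site_total)
print("Matches reported total:", computed_total == reported_total)
```

The seven counts sum to 24622, which agrees with the stated total, and the four site types together account for 830 entries.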

Features:
1) InterProScan is the software package that allows sequences to be scanned against InterPro's signatures. The software is available:
   a) as a web-based tool for the analysis of single protein sequences;
   b) programmatically via Web services that allow up to 25 sequences to be analysed per request (both SOAP and REST-based services are available);
   c) as a downloadable package for local installation from the EBI's FTP server.
   InterProScan is run regularly against UniProtKB and the results are made available via the InterPro website.
2) In July 2009, a BioMart was added to the InterPro suite of services. BioMart provides users with the ability to retrieve large sets of data, based on sophisticated queries that may incorporate multiple filters. Users are able to specify precisely which fields are included in the results returned. The InterPro BioMart has been described previously, including a detailed explanation of how to use the BioMart with several example queries. The most important benefit provided by this feature is the ability to interrogate InterPro for multiple entries, proteins or member database signatures in a single query, which is a feature not available from the main InterPro Web interface.
3) Utopia: InterPro signature match data can be visualised on multiple sequence alignments and 3D structures using Utopia tools.

4) InterPro text-based search: InterPro can be searched by entry name or identifier, UniProt accession, GO term, PDB identifier or free text to find information relating to the query.
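The limit of 25 sequences per Web service request, mentioned for InterProScan in the feature list above, means that a larger set of sequences has to be split into batches before submission. The following Python sketch shows one way to do that splitting. It makes no network calls, and the function name `batch_sequences` and the placeholder sequence labels are illustrative assumptions rather than part of any InterPro client.

```python
from typing import Iterator, List, Sequence

MAX_PER_REQUEST = 25  # per-request limit quoted in the feature list


def batch_sequences(seqs: Sequence[str],
                    size: int = MAX_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive batches containing at most `size` sequences each."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(seqs), size):
        yield list(seqs[start:start + size])


# Placeholder labels standing in for 60 protein sequences (not real data).
example = [f"SEQ{i}" for i in range(60)]
batches = list(batch_sequences(example))
print([len(b) for b in batches])  # prints [25, 25, 10]
```

Each batch could then be submitted as a separate request, so 60 sequences would need three requests.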

Protocol:
Using InterPro sequence analysis:
1. Go to http://www.ebi.ac.uk/interpro/.
2. Obtain a protein sequence in FASTA format from the NCBI site and paste it as the query sequence in the space provided.
3. Click Search.
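Before pasting a sequence into the search box, it can help to confirm that the text really is a single, well-formed FASTA record. The Python sketch below is one minimal way to check this. The function name `parse_single_fasta`, the set of accepted residue letters and the toy record are assumptions made for illustration; the toy record is not the sequence used in this practical.

```python
from typing import Tuple

# The 20 standard amino-acid letters plus commonly used extra or ambiguity
# codes (B, Z, X, U, O, J) and '*' for a stop. This accepted set is an
# assumption of the sketch, not a rule taken from InterPro.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" "BZXUOJ" "*")


def parse_single_fasta(text: str) -> Tuple[str, str]:
    """Return (header, sequence) for a single-record protein FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")
    header = lines[0][1:].strip()
    body = lines[1:]
    if any(ln.startswith(">") for ln in body):
        raise ValueError("more than one FASTA record found")
    sequence = "".join(body).upper()
    if not sequence:
        raise ValueError("no sequence lines found after the header")
    bad = sorted(set(sequence) - ALLOWED)
    if bad:
        raise ValueError(f"unexpected characters in sequence: {bad}")
    return header, sequence


# Made-up toy record used only to exercise the function.
record = ">toy_protein example\nMKTAYIAKQR\nQISFVKSHFS\n"
header, seq = parse_single_fasta(record)
print(header, len(seq))  # prints: toy_protein example 20
```

A record that passes this check has exactly one header line and a sequence made only of the accepted letters, which is the form expected in step 2.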

Sample Result for Sequence analysis:
1. The protein sequence of Insulin [Crassostrea gigas] (GenBank: EKC18433.1) was used as the query sequence.
2. The results display protein family membership, domains and repeats, detailed signature matches and gene ontology predictions for the protein.
3. The gene ontology prediction includes:
   a) Biological process in which the query protein is involved
   b) Molecular function of the query protein
   c) Cellular component that the protein is part of

Hyperlinks to individual Protein fingerprints from member databases

Predicted Molecular Function of the protein

Extracellular domain predicted

Using InterProScan:
InterProScan (v4.8) is a protein sequence analysis application that combines different protein signature recognition methods into one resource.

Protocol:
1) Click on InterProScan on the home page.
2) Enter the query protein sequence in the space provided.
3) Select the databases to search the query protein sequence against.
4) Click Submit.

Interpretation:
1) The graphical output shows the protein signatures, from the selected signature databases, that the query protein sequence matched.
2) The source protein signature database is colour-coded according to the legend displayed below the results.
3) The highlighted box on the left gives the InterPro accession number (e.g. IPR016179) and hyperlinks to the individual signature entries from the source databases.
4) Hence, in this case, the query protein sequence has protein signature matches from:

a) Gene3D: No description
b) Pfam: Insulin
c) SMART: Insulin/ Insulin-like growth factor
d) Superfamily: Insulin like
e) PRINTS: INSULINFAMLY
f) Prosite: INSULIN
g) Panther: Insulin/ Insulin like growth factor
h) PIR: Signal peptide
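One simple way to read the list of matches is to count how many member databases independently point to the same family. The Python sketch below records the matches listed above as a dictionary and picks out those whose description mentions insulin. The helper name `databases_mentioning` is illustrative, and the descriptions are copied from the list rather than retrieved from InterProScan.

```python
# Signature matches as listed above: member database -> match description.
matches = {
    "Gene3D": "No description",
    "Pfam": "Insulin",
    "SMART": "Insulin/ Insulin-like growth factor",
    "Superfamily": "Insulin like",
    "PRINTS": "INSULINFAMLY",
    "Prosite": "INSULIN",
    "Panther": "Insulin/ Insulin like growth factor",
    "PIR": "Signal peptide",
}


def databases_mentioning(term, hits):
    """Return the databases whose description contains `term` (case-insensitive)."""
    term = term.lower()
    return [db for db, desc in hits.items() if term in desc.lower()]


insulin_dbs = databases_mentioning("insulin", matches)
print(f"{len(insulin_dbs)} of {len(matches)} databases report an insulin signature:")
print(", ".join(insulin_dbs))
```

Six of the eight listed databases describe the match as insulin or insulin-like, which is the kind of agreement between independent signatures that supports the family assignment.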

Conclusion: InterPro combines signatures from multiple, diverse databases into a single searchable resource, reducing redundancy and helping users interpret their sequence analysis results. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool and integrated resource.

Application:
InterPro is used by research scientists interested in the large-scale analysis of whole proteomes, genomes and metagenomes, as well as researchers seeking to characterise individual protein sequences. Within the EBI, InterPro is used to help annotate protein sequences in UniProtKB. It is also used by the Gene Ontology Annotation group to automatically assign Gene Ontology terms to protein sequences.

References:
1) Hunter et al., InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Research, 2012, Vol. 40, Database issue, doi:10.1093/nar/gkr948
2) www.ebi.ac.uk/interpro
