Essential Info Notes-1
Biological databases serve two main functions:

Making biological data available to scientists: As much information as possible should be available in one single place (book, site, database). Public data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article.

Making biological data available in computer-readable form: Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than in print or on paper) is a necessary first step.

One of the first biological sequence databases was probably the book Atlas of Protein Sequence and Structure by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s. Computers became the storage medium of choice as soon as they came within the reach of ordinary scientists. Databases were distributed on tapes, and later on various kinds of discs. Once universities and research institutions were connected to the Internet or its precursors (national computer networks), it is easy to understand why the network became the medium of choice. It is equally easy to see why the World Wide Web (WWW), based on HTTP (Hypertext Transfer Protocol), has been the standard method of communication and access for nearly all biological databases since the beginning of the 1990s.

As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR. A new field of science dealing with the issues, challenges and new possibilities created by these databases has emerged: bioinformatics.
Other types of data that are, or soon will be, available in databases include metabolic pathways (KEGG), gene expression data (microarrays), protein-protein interactions and other data related to biological function and processes. Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps in the fight against diseases, assists in the development of medications and helps uncover basic relationships amongst species in the history of life.

Biological knowledge is distributed amongst many different general and specialized databases, which sometimes makes it difficult to ensure the consistency of information. Biological databases cross-reference other databases with accession numbers as one way of linking their related knowledge together. An important resource for finding biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available, and categorizes many of the publicly available online databases related to biology and bioinformatics.
Most important public databases for molecular biology
DDBJ (DNA Data Bank of Japan)
EMBL Nucleotide DB (European Molecular Biology Laboratory)
GenBank (National Center for Biotechnology Information)
Meta-DBs
Entrez Gene: Unified retrieval of gene-centred information (NCBI)
euGenes: Assembled information on eukaryotic genomes (Univ. of Indiana)
GeneCards (Weizmann Inst.)
GenLoc / UDB (Weizmann Inst.)
SOURCE (Univ. of Stanford)
LocusLink (National Center for Biotechnology Information)
UniGene: Automatic partitioning of GenBank sequences (NCBI)
Golden Path / UCSC (Univ. of California, Santa Cruz)
Specialized DBs
CGAP: Cancer Genes (National Cancer Institute)
Clone Registry: Clone Collections (National Center for Biotechnology Information)
I.M.A.G.E.: Clone Collections (IMAGE Consortium)
DBGET: H. sapiens, retrieval system (Univ. of Kyoto)
DIP: Interacting Proteins (Univ. of California)
GDB (Human Genome Organization)
KEGG: Functional DB (Univ. of Kyoto)
MGI: Mouse Genome (Jackson Lab.)
OMIM: Inherited Diseases (National Center for Biotechnology Information)
SWISS-PROT: Protein DB (Swiss Institute of Bioinformatics)
PEDANT: Protein DB (Forschungszentrum f. Umwelt & Gesundheit)
List with SNP-Databases
Reactome, The Genome Knowledgebase (EBI)
Microarray-DBs
ArrayExpress (European Bioinformatics Institute)
Gene Expression Omnibus (National Center for Biotechnology Information)
maxd (Univ. of Manchester)
SMD (Univ. of Stanford)
How to deal with changed, updated and deleted entries in databases is a tricky problem, and the policies for how accession codes and identifiers are changed or kept constant are not completely consistent between databases, or even over time within a single database. The exact definition of what the identifier and accession code are supposed to denote varies between databases, but the basic idea is the following.
Identifier
An identifier (locus in GenBank, entry name in SWISS-PROT) is a string of letters and digits that is generally interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name. SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens. An identifier can usually change: for example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this happens so rarely that it is not really a big problem.
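The two-part entry name convention described above can be illustrated with a minimal sketch. The function name split_entry_name is hypothetical, and the code assumes only what the text states: a protein part and a species part joined by an underscore.

```python
def split_entry_name(entry_name):
    """Split a SWISS-PROT style entry name into (protein, species) parts.

    Assumes the PROTEIN_SPECIES convention described in the text;
    anything that does not fit that shape is rejected.
    """
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not a PROTEIN_SPECIES entry name: {entry_name!r}")
    return protein, species


print(split_entry_name("KRAF_HUMAN"))  # ('KRAF', 'HUMAN')
```

Because identifiers can change, code that needs a stable key for a record should not rely on a parsed entry name alone.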
NUCLEOTIDE DATABASES
NCBI's sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.

GenBank: An annotated collection of all publicly available nucleotide and amino acid sequences.
EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).
GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.
HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.
HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.
SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.
RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support the data-gathering efforts.
STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.
UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene annotated with mapping and expression information and cross-references to other sources.
Organelle Genome Databases:
OGMP: Organelle genome megasequencing program
GOBASE: An organelle genome database
MitoMap: Human mitochondrial genome database
RNA Databases:
Rfam: RNA family database
RNA base: Database of RNA structures
tRNA database: Database of tRNAs
tRNA: tRNA sequences and genes
sRNA: Small RNA database

Comparative & Phylogenetic Databases:
COG: Phylogenetic classification of proteins
DHMHD: Human-mouse homology database
HomoloGene: Gene homologies across species
Homophila: Human disease to Drosophila gene database
HOVERGEN: Database of homologous vertebrate genes
TreeBase: A database of phylogenetic knowledge
XREF: Cross-referencing with model organisms

SNPs, Mutations & Variations Databases:
ALPSbase: Database of mutations causing human ALPS
dbSNP: Single nucleotide polymorphism database at NCBI
HGVbase: Human Genome Variation database

Alternative Splicing Databases:
ASAP: Alternate splicing analysis tool at UCLA
ASG: Alternate splicing gallery
HASDB: Human alternative splicing database at UCLA
AsMamDB: Alternatively spliced genes in human, mouse and rat
ASD: Alternative splicing database at CSHL

Specialised Databases:
ABIM: Links to several genomics databases
ACUTS: Ancient conserved untranslated sequences
AGSD: Animal genome size database
AmiGO: The Gene Ontology database
ARGH: The acronym database
ASDB: Database of alternatively spliced genes
BACPAC: BAC and PAC genomic DNA library info
BBID: Biological Biochemical image database
Cardiac gene database:
CHLC: Genetic markers on chromosomes
COGENT: Complete genome tracking database
COMPEL: Composite regulatory elements in eukaryotes
CUTG: Codon usage database
dbEST: Database of expressed sequences or mRNA
dbGSS: Genome survey sequence database
dbSTS: Sequence tagged sites (STS)
DBTSS: Database of transcriptional start sites
DOGS: Database of genome sizes
EID: The exon-intron database, Harvard
Exon-Intron: Exon-Intron database, Singapore
EPD: Eukaryotic promoter database
FlyTrap: HTML based gene expression database
GDB: The genome database
GenLink: Resources for human genetic and telomere research
GeneKnockouts: Gene knockout information
GENOTK: Human cDNA database
GEO: Gene Expression Omnibus at NCBI
GOLD: Information on genome projects around the world
GSDB: The Genome Sequence DataBase
HGI: TIGR human gene index
HTGS: High-throughput genomic sequence at NCBI
IMAGE: The largest collection of DNA sequence clones
IMGT: The international ImMunoGeneTics information system
IPCN: Index to Plant Chromosome Numbers database
LocusLink: Single query interface to sequence and genetic loci
TelDB: The telomere database
MitoDat: Mitochondrial nuclear genes
Mouse EST: NIA mouse cDNA project
MPSS: Searchable databases of several species
NDB: Nucleic acid database
NEDO: Human cDNA sequence database
NPD: Nuclear protein database
Oomycetes DB: Oomycetes database at Virginia Bioinformatics Institute
PLACE: Database of plant cis-acting regulatory DNA elements
RDP: Ribosomal database project
RDB: Receptor database at NIHS, Japan
RefSeq: The NCBI reference sequence project
RHdb: Radiation hybrid physical map of chromosomes
SHIGAN: SHared Information of GENetic resources, Japan
SpliceDB: Canonical and non-canonical splice site sequences
STACK: Consensus human EST database
TAED: The adaptive evolution database
TIGR: Curated databases of microbes, plants and humans
TRANSFAC: The Transcription Factor Database
TRRD: Transcription Regulatory region database
UniGene: Cluster of sequences for unique genes at NCBI
UniSTS: Non-redundant collection of STS
Protein Databases
Protein Sequence Databases:
Antibodies: Sequence and Structure
BRENDA: Enzyme database
CD Antigens: Database of CD antigens
dbCFC: Cytokine family database
Histones: Histone sequence database
HPRD: Human protein reference database
InterPro: Integrated documentation resources for protein families
iProClass: An integrated protein classification database
KIND: A non-redundant protein sequence database
MHCPEP: Database of MHC binding peptides
MIPS: Munich information centre for protein sequences
PIR: Annotated and non-redundant protein sequence database
PIR-ALN: Curated database of protein sequence alignments
PIR-NREF: PIR non-redundant reference protein database
PMD: Protein mutant database
PRF: Protein research foundation, Japan
ProClass: Non-redundant protein database
ProtoMap: Hierarchical classification of SwissProt proteins
REBASE: Restriction enzyme database
RefSeq: Reference sequence database at NCBI
SwissProt: Curated protein sequence database
SPTR: Comprehensive protein sequence database
Transfac: Transcription factor database
TrEMBL: Annotated translations of EMBL nucleotide sequences
Tumor gene database: Genes with cancer-causing mutations
WD repeats: WD-repeat family of proteins

Protein Structure Databases:
Cath: Protein structure classification
HIV Protease: HIV protease database 3D structure
PDB: 3-D macromolecular structure data
PSI: Protein structure initiative
S2F: Structure to function project
Scop: Structural Classification of Proteins

Protein Domains, Motifs & Signatures:
BLOCKS: Multiple aligned segments of conserved protein regions
CDD: Conserved domain database and search service
DOMO: Homologous protein domain families
Pfam: Database of protein domains and HMMs
ProDom: Protein domain database
Prints: Protein motif fingerprint database
Prosite: Database of protein families and domains
SMART: Simple modular architecture research tool
TIGRFAM: Protein families based on HMMs

Others:
Phospho Site: Database of phosphorylation sites
PROW: Protein reviews on the web
Protein Lounge: Complete systems biology
Other Databases:
Carbohydrate Databases:
Carb DB: Carbohydrate Sequence and Structure Database
GlycoWord: Glycoscience related information
SPECARB: Raman Spectra of carbohydrates

Other Databases:
AlzGene: Alzheimer's disease
Polygenic pathways: Alzheimer's disease, Bipolar disorder or Schizophrenia
ATIDB: Arabidopsis insertion database
CSHL: Arabidopsis genome analysis at Cold Spring
ESSA: Arabidopsis thaliana project at MIPS
Genoscope: AGI in France
Kazusa: Arabidopsis thaliana genome info, Japan
MPSS: Massively parallel signature sequencing
NASC: Nottingham Arabidopsis stock center
Stanford: Sequencing of the Arabidopsis genome at Stanford
TAIR: Arabidopsis information resource
TIGR: TIGR Arabidopsis genome annotation database
Wustl: Arabidopsis genome at Washington University
Trees: A forest tree genome database

Bacterial genomes:
B. subtilis: Bacillus subtilis database
Chlamydomonas: Chlamydomonas genetics center
E. coli: E. coli genome project
MGD: Microbial germ plasm database
Microbial: Microbial Genome Gateway
Microbial: Microbial genomes
Micado: Genetic maps of B. subtilis and E. coli
MycDB: An integrated Mycobacterial database
Neisseria: Neisseria meningitidis genome
Neurospora: Neurospora crassa database
OralGen: Oral pathogen database
Salmonella: Salmonella information
STDGen: Sexually transmitted disease database

Bass:
Bass: Sea Bass Mapping project

Cat (Felis catus):
Cat ArkDB: Cat mapping database

Cattle (Bos taurus):
ARK: Farm animals
BoLA: Bovine MHC information
Bovin: Bovine genome database
BovMap: Mapping the bovine genome
CaDBase: Genetic diversity in cattle
ComRad: Comparative radiation hybrid mapping
Cow ArkDB: Bovine ArkDB
GemQual: Genetics of meat quality

Chicken (Gallus gallus):
Chicken: Poultry gene mapping project
ChickMap: Chicken genome project
Chicken ArkDB: Chicken database
ChickEST: Chick EST database
Poultry: Poultry genome project

Cotton:
Cotton: Cotton data collection site

Cyanobacteria (Blue green algae):
Cyanobacteria: Anabaena genome

Daphnia (Crustacea):
Daphnia pulex: Daphnia genomics consortium

Deer:
Deer ArkDB: Deer mapping database

Dictyostelium discoideum:
Dicty_cDB: Dictyostelium discoideum cDNA project
DGP: Dictyostelium discoideum genome project
Dictybase: Online informatics resources for Dictyostelium

Dog (Canis familiaris):
Dog: Dog genome project
Dog genome project:

Frog (Xenopus):
Xenbase: A Xenopus web resource
Xenopus: Xenopus tropicalis genome

Fruit fly (Drosophila melanogaster):
ENSEMBL: Drosophila Genome Browser at ENSEMBL
Fruitfly: Drosophila genome project at Berkeley
FlyBase: A Database of the Drosophila Genome
FlyMove: A Drosophila multimedia database
FlyView: A Drosophila image database

Fungus:
Aspergillus: Aspergillus Genomics
Candida: Candida albicans information page
FungalWeb: Fungi database
FGSC: Fungal genetic stocks center

Goat (Capra hircus):
Goat: GoatMap, mapping the caprine genome

Horse (Equus caballus):
Horse ArkDB: Horse mapping database

Medaka Fish:
Medaka: Medaka fish home page

Maize:
Maize: Maize genome database

Malaria (Plasmodium spp):
Malaria: Malaria genetics and genomics
PlasmoDB: Plasmodium falciparum genome database
Parasites: Parasite databases of clustered ESTs
Parasite Genome: Parasite genome databases

Mosquito:
Mosquito: Mosquito genome web server

Mouse (Mus musculus):
ENSEMBL: Mouse genome server at ENSEMBL
Jackson Lab: Mouse Resources
MRC: Mouse genome center at MRC, UK
MGI: Mouse genome informatics at Jackson Labs
MGD: Mouse genome database
MGS: Mouse genome sequencing at NIH
MIT: Genetic and physical maps of the mouse genome
Mouse SNP: Mouse SNP database
NCI: Mouse repository
NIH: NIH mouse initiative
ORNL: Mutant mouse database
RIKEN: Mouse resources
Rodentia: The whole mouse catalog

Pig (Sus scrofa):
INCO: Pig trait gene mapping
Pig: Pig EST database
Pig: Pig gene mapping project
PiGBase: Pig genome mapping
Pig ArkDB: Pig Ark DB

Plants:
PlantGDB: Resources for plant comparative genomics

Protozoa:
Protozoa: Protozoan genomes

Pufferfish:
Fugu: Puffer fish project, UK site
Fugu: Fugu genome project, Singapore
Fugu: Puffer fish project, USA

Rat (Rattus norvegicus):
MIT: Genetic maps of the Rat genome
NIH: Rat genomics and genetics
Rat: RatMap
RGD: Rat genome database

Rice (Oryza sativa):
MPSS: Massively parallel signature sequencing
Rice-research: Rice genome sequence database
Rice: Rice genome project

Rickettsia:
RicBase: Rickettsia genome database

Salmon:
Salmon ArkDB: Salmon mapping database

Sheep (Ovis aries):
Sheep: Sheep gene mapping
SheepBase: Sheep gene mapping
Sheep ArkDB: Sheep mapping database

Soy:
Soy: Soybeans database

Sorghum:
Sorghum: Sorghum Genomics

Tetraodon:
Tetraodon: Tetraodon nigroviridis genome
Tetraodon: Tetraodon nigroviridis genome at Whitehead

Tilapia:
HCGS: Tilapia genome
Tilapia ArkDB: Tilapia mapping database

Turkey:
Turkey ArkDB: Turkey mapping database

Viruses:
HIV: HIV sequence database
Herpes: Human herpes virus 5 database

Worm (Caenorhabditis elegans):
C. elegans: C. elegans genome sequencing project
NemBase: Resource for nematode sequence and functional data
WormAtlas: Anatomy of C. elegans
WormBase: The Genome and biology of C. elegans
ACEDB: A C. elegans database
WWW Server: C. elegans web server

Yeast:
SCPD: The promoter database of Saccharomyces cerevisiae
SGD: Saccharomyces genome database
S. pombe: Schizosaccharomyces pombe genome project
TRIPLES: Functional analysis of Yeast genome at Yale
Yeast Intron database: Spliceosomal introns of the yeast

Zebra fish (Danio rerio):
ZFIN: Zebrafish information network
ZGR: Zebrafish genome resources
ZIS: Zebrafish information server
Zebrafish: Zebrafish webserver
DOMAIN DATABASE
Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often: what is found as an independently folding unit of a polypeptide chain frequently also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We can define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences.
The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD Curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, links to electronic literature resources.
CDD
The Conserved Domain Database (CDD) is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI manually curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query. The results of CD-Search are presented as an annotation of protein domains on the user query sequence, and can be visualized as domain multiple sequence alignments with embedded user queries. High-confidence associations between a query sequence and conserved domains are shown as specific hits. The CD-Search Help provides additional details, including information about running CD-Search locally.

Batch CD-Search serves as both a web application and a script interface for a conserved domain search on multiple protein sequences, accepting up to 100,000 proteins in a single job. It enables you to view a graphical display of the concise or full search result for any individual protein from your input list, or to download the results for the complete set of proteins. The Batch CD-Search Help provides additional details.
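To make the idea of scanning a query against a PSSM more concrete, here is a deliberately simplified sketch. It is not RPS-BLAST: the matrix, its scores, the default penalty and the function names are all invented for illustration, and the scan is an ungapped sliding window with no statistics or heuristics.

```python
# Toy PSSM: one dict per model position, mapping residue -> score.
# All scores are invented for illustration; unlisted residues get DEFAULT.
TOY_PSSM = [
    {"G": 3, "A": 1},
    {"K": 4, "R": 2},
    {"S": 3, "T": 3},
]
DEFAULT = -2


def window_score(pssm, window):
    """Sum the position-specific scores of one ungapped window."""
    return sum(col.get(res, DEFAULT) for col, res in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the best-scoring window; start is 0-based."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence shorter than the model")
    return max(
        (window_score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


print(best_hit(TOY_PSSM, "MAGKSLL"))  # (10, 2): the window "GKS"
```

The point of the position-specific matrix is that the same residue can score differently at different model positions, which a single substitution matrix cannot express.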
Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity. Proteins similar to the query are grouped and scored by architecture. You can search CDART directly with a query protein sequence, or, if a sequence of interest is already in the Entrez Protein database, simply retrieve the record, open its "Links" menu, and select "Domain Relatives" to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.
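The notion of grouping proteins by domain architecture can be sketched as follows. This is only an illustration under stated assumptions: the protein identifiers and annotations are hypothetical, and the ranking by number of shared domain types is a simplistic stand-in, since the text does not describe how CDART actually scores architectures.

```python
from collections import defaultdict

# Hypothetical annotations: protein id -> domains in N- to C-terminal order.
ANNOTATIONS = {
    "protA": ["SH3", "SH2", "Pkinase"],
    "protB": ["SH3", "SH2", "Pkinase"],
    "protC": ["SH2", "Pkinase"],
    "protD": ["PH"],
}


def group_by_architecture(annotations):
    """Group proteins that share the same ordered list of domains."""
    groups = defaultdict(list)
    for protein, domains in annotations.items():
        groups[tuple(domains)].append(protein)
    return dict(groups)


def rank_architectures(query_domains, groups):
    """Rank architectures by how many domain types they share with the query.

    The shared-domain count is an assumed placeholder score.
    """
    query = set(query_domains)
    scored = [
        (len(query & set(arch)), arch, sorted(members))
        for arch, members in groups.items()
    ]
    scored = [entry for entry in scored if entry[0] > 0]
    return sorted(scored, key=lambda entry: (-entry[0], entry[1]))


groups = group_by_architecture(ANNOTATIONS)
for shared, arch, members in rank_architectures(["SH2", "Pkinase"], groups):
    print(shared, arch, members)
```

Using the ordered tuple of domains as the grouping key is what makes this an architecture comparison rather than a comparison of unordered domain content.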
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI. CDTree works with Cn3D as its alignment viewer/editor. It is used in the CDD curation process and is both a classification and a research tool for functional annotation and the study of protein and protein domain families.
Content
CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAMs). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a nonredundant view of the data, CDD clusters similar domain models from various sources into superfamilies.
research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD). Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.

Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies.

Links to electronic literature resources: NCBI-curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
PFAM
Pfam 27.0 (March 2013, 14831 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein. The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases. Pfam-B families are un-annotated and of lower quality, as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
Family: A collection of related protein regions
Domain: A structural unit
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
The Pfam website supports the following tasks:
1. Analyze your protein sequence for Pfam matches
2. View Pfam family annotation and alignments
3. See groups of related families
4. Look at the domain organisation of a protein sequence
5. Find the domains on a PDB structure
6. Query Pfam by keywords
On the Pfam website, entering any type of accession or ID jumps to the page for a Pfam family or clan, UniProt sequence, PDB structure, etc.
Browse Pfam
The Pfam website provides lists of families, clans or proteomes, indexed by their first letter (or number). It also offers a list of Pfam families which are new to this release and a list of the twenty largest families, in terms of number of sequences.
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein.
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit.
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function.
Envelope coordinates
See Alignment coordinates.
Family
A collection of related protein regions.
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related, using other methods such as BLAST. For Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a 'match' or 'insert' state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability, and red ones indicate a lower posterior probability.
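As a rough illustration of the scale described above, the sketch below decodes a posterior-probability annotation string into numbers and flags low-confidence positions. The function names, the example strings and the cutoff of 5 are assumptions made for illustration; they are not Pfam's or HMMER's own thresholds.

```python
def decode_pp(pp_line):
    """Map a posterior-probability string to integers.

    '*' is read as 10 and a digit as its own value, following the scale
    in the text. Any other character (for example '.') is treated as a
    column without an aligned residue and decoded as None.
    """
    values = []
    for ch in pp_line:
        if ch == "*":
            values.append(10)
        elif ch in "0123456789":
            values.append(int(ch))
        else:
            values.append(None)
    return values


def uncertain_positions(pp_line, cutoff=5):
    """Return 0-based positions whose decoded value is below the cutoff."""
    return [
        i for i, value in enumerate(decode_pp(pp_line))
        if value is not None and value < cutoff
    ]


print(decode_pp("69**.8"))            # [6, 9, 10, 10, None, 8]
print(uncertain_positions("39**42"))  # [0, 4, 5]
```

Flagged positions correspond to the parts of an alignment that would appear towards the red end of the heat map view.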
Repeat
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
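The relationship between domain scores, the sequence score and full-alignment membership can be sketched as below. The function names and the numeric values are hypothetical, and the simple sum follows the rule as stated in this glossary; as noted under "Domain score", it is not quite how HMMER3 computes the sequence score.

```python
def sequence_score(domain_scores):
    """Sum the per-domain scores of one sequence for one Pfam entry."""
    return sum(domain_scores)


def in_full_alignment(domain_scores, threshold):
    """True if the sequence score is above the entry's manually set threshold."""
    return sequence_score(domain_scores) > threshold


# Hypothetical scores: two domain matches versus a single match,
# both tested against an assumed threshold of 20.0.
print(in_full_alignment([12.5, 8.0], threshold=20.0))  # True
print(in_full_alignment([12.5], threshold=20.0))       # False
```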
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa. Features: You can use SMART in two different modes: normal or genomic.The main difference is in the underlying protein database used. In Normal SMART, the database contains Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes. In Genomic SMART, only the proteomes of completely sequenced genomes are used; Ensembl for metazoans and Swiss-Prot for the rest.
The protein database in Normal SMART has significant redundancy, even though identical proteins are removed. If you use SMART to explore domain architectures, or want to find exact domain counts in various genomes, consider switching to Genomic mode. The numbers in the domain annotation pages will be more accurate, and there will not be many protein fragments corresponding to the same gene in the architecture query results. Remember, though, that you are then exploring only a limited set of genomes.
Different color schemes are used to easily identify the mode we are in.
Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.
Alignment block
Ungapped alignments that usually represent a single secondary structure.
Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
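As a small worked example of the definition above, the bits score is the base-2 logarithm of the ratio between the two likelihoods. The sketch below is illustrative only: the function name and the probability values are made up, and real tools work with log-probabilities internally rather than with raw likelihoods.

```python
import math


def bits_score(p_homologue_model: float, p_random_model: float) -> float:
    """Return log2 of the likelihood ratio (homologue model vs. random model)."""
    if p_homologue_model <= 0 or p_random_model <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(p_homologue_model / p_random_model)


# A sequence that is 4 times more likely under the homologue model
# than under the random model scores 2 bits.
example = bits_score(0.5, 0.125)
```

Each additional bit therefore doubles the odds in favour of the homologue model; a score of 0 bits means that both models explain the sequence equally well.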
Cellular Role
Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus.
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration etc.
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules.
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands.
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response.
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane.
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
Coiled coils
Intimately-associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4], COILS home). Coiled-coil predictions are indicated on the second line in SMART's graphical output.
Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed).
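The difference between domain composition and domain organisation can be made concrete with a short Python sketch. The function names and the example architectures are hypothetical. The sketch also assumes that additional domains may appear anywhere between the query domains, which is one reading of "additional domains are allowed" and not something SMART's documentation is quoted on here.

```python
from typing import Sequence


def same_composition(query: Sequence[str], target: Sequence[str]) -> bool:
    """True if the target has at least one copy of every query domain."""
    return set(query) <= set(target)


def same_organisation(query: Sequence[str], target: Sequence[str]) -> bool:
    """True if the query domains occur in the target in the same order.

    Extra domains in the target are tolerated (subsequence test).
    """
    remaining = iter(target)
    # 'domain in remaining' advances the iterator, so order is enforced.
    return all(domain in remaining for domain in query)


query = ["SH3", "SH2", "TyrKc"]
shuffled = ["TyrKc", "SH2", "SH3"]        # same domains, different order
extended = ["SH3", "PH", "SH2", "TyrKc"]  # same order, one extra domain
```

Here `shuffled` shares the composition of the query but not its organisation, while `extended` satisfies both tests.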
Entrez
A WWW-based system that allows easy retrieval of sequence, structure, molecular biology and literature data (Entrez). SMART's domain annotation pages contain links to the Entrez system thereby providing extensive literature, structure and sequence information.
E-value
This represents the number of sequences with a score greater than, or equal to, X, expected absolutely by chance. The E-value connects the score ("X") of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with the number of alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0 E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
Extracellular Domains
Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.
Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.
Genomic database
Protein database used in SMART's 'Genomic' mode. It contains data from completely sequenced genomes only. Ensembl data is used for Metazoan genomes and Swiss-Prot for others. A complete list of genomes in the database is available.
HMM consensus
The HMM consensus is a 'one line summary' of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean "highly conserved" residues (probability > 0.5 for protein models). (modified from the HMMer User's Guide)
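The rule for building the consensus line can be sketched as follows. This is a minimal illustration that assumes the HMM is available as one dictionary of emission probabilities per match position; that data layout, the function name and the example probabilities are all assumptions, and HMMer's own implementation is not reproduced here.

```python
from typing import Dict, List


def hmm_consensus(emissions: List[Dict[str, float]], threshold: float = 0.5) -> str:
    """Build a one-line consensus from per-position emission probabilities.

    The most probable residue is shown at each position, in upper case
    when its probability exceeds the threshold and in lower case otherwise.
    """
    letters = []
    for column in emissions:
        residue, prob = max(column.items(), key=lambda item: item[1])
        letters.append(residue.upper() if prob > threshold else residue.lower())
    return "".join(letters)


toy_model = [
    {"A": 0.70, "G": 0.30},             # highly conserved -> 'A'
    {"L": 0.40, "I": 0.35, "V": 0.25},  # weakly conserved -> 'l'
    {"K": 0.50, "R": 0.50},             # not above 0.5    -> 'k'
]
consensus = hmm_consensus(toy_model)
```

The threshold of 0.5 is the value quoted above for protein models.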
HMMer
The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities. There are several programs in the package (see the documentation) including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM), and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.
Homology
Evolutionary descent from a common ancestor due to gene duplication.
Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm.
Localisation
Numbers of domains that are thought from SwissProt annotations to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.
Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
ORF
Open reading frame.
Outlier homologues
These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments, using BLAST. Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS_Blast profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al. J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).
P-value
This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
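E-values and P-values are closely related. The sketch below uses the conversion commonly quoted for BLAST-style statistics, P = 1 - exp(-E), which assumes that the number of chance hits follows a Poisson distribution. That formula is not stated in these notes and is included here as an assumption; the function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected number E.

    Assumes chance hits are Poisson distributed: P = 1 - exp(-E).
    """
    if e_value < 0:
        raise ValueError("E-value cannot be negative")
    # expm1 keeps precision for very small E-values.
    return -math.expm1(-e_value)


small = p_value_from_e_value(0.01)  # close to 0.01
large = p_value_from_e_value(10.0)  # close to 1
```

For small values the two numbers are nearly identical, which is why they are often used interchangeably for significant hits; they diverge once E approaches 1.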
PFAM
Pfam is a database of protein domain families represented as (i) multiple alignments, and (ii) HMM profiles ([1], [2]). Pfam WWW servers allow comparison of user-supplied sequences with the Pfam database (Sanger Center and Washington Univ.). SMART contains a facility to search the Pfam collection using HMMer.
Prokaryotic domains
SMART now also searches for domains found in two component regulatory systems. These can be found mainly in Prokaryotes, but a few were also found in eukaryotes like yeast and plants.
Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref.: [1], [2], [3]). In CLUSTAL-W-derived profiles those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].
ProfileScan
An excellent WWW server that allows a user to compare a protein or DNA sequence against a database of profiles (located at the ISREC).
PROSITE
This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.
Searched domains
In the first version of SMART, only eukaryotic signalling domains could be searched. In 1998 we extended this set with prokaryotic signalling and extracellular domains. On the input page you can choose whether you want to search only cytoplasmic (prokaryotic + eukaryotic) domains, only extracellular domains, or all domains.
Secondary Literature
The secondary literature is derived by the following procedure. For each of the hand selected papers referenced by a domain, 100 neighbouring papers are retrieved using Medline. If one of these neighbouring papers is referenced from more than two original papers, it is included into the secondary literature list.
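The selection rule for the secondary literature can be sketched as a simple counting step. The sketch assumes the neighbouring papers have already been retrieved and are supplied as a mapping from each hand-selected paper to its list of neighbours; the retrieval from Medline itself is not shown, and the function name and paper identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def secondary_literature(neighbours: Dict[str, List[str]]) -> List[str]:
    """Return neighbouring papers linked to more than two original papers."""
    counts: Counter = Counter()
    for original_paper, neighbour_list in neighbours.items():
        # Count each neighbour once per original paper.
        for paper in set(neighbour_list):
            counts[paper] += 1
    return sorted(paper for paper, n in counts.items() if n > 2)


toy_neighbours = {
    "orig_A": ["x", "y"],
    "orig_B": ["x", "z"],
    "orig_C": ["x", "y"],
}
selected = secondary_literature(toy_neighbours)
```

In this toy input only paper "x" is a neighbour of more than two original papers, so it is the only one kept.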
Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article).
SEG
A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].
Sequence ID or ACC
Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either Uniprot or Ensembl sequence identifiers.
Signalling Domains
The original set of domains used in SMART were collected as those that satisfied one or both of two criteria:
1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1
These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.
SignalP
This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).
Species
Numbers of domains present in a variety of selected taxa (animal, archaea, bacteria, fungi, plants and protozoa) are shown in annotation pages.
SwissProt
The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.
TMHMM2
This program predicts the location and topology of transmembrane helices (TMHMM2).
Thresholds
For each of the domains found by SMART a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.
WiseTools
A package that is based on database searches using profiles. Profiles may be generated using PairWise, and then compared with sequence databases using SearchWise. Scores are generated for alignments that match the whole of the profile (using a "positive" profile) or else that match at least part of the profile (using a "negative" profile). Only the top-scoring optimal alignment of each sequence is reported; hence, SMART relies on re-iterating the search for new repeats until none are reported that score above threshold. Score thresholds have been set manually that are considered to represent a score just above the top-scoring true negative. Additional thresholds have been estimated for domains that are repeated in single polypeptides. A more recent package allows comparison of DNA sequences at the level of their conceptual translations, regardless of sequence error and introns (see Wise2).
What's new?
Changes from version 6.0 to 7.0
Full text search engine: Performs a full text search of SMART and Pfam domain annotations, plus the complete protein descriptions for Uniprot and Ensembl proteins.
metaSMART: Explore and compare domain architectures in various publicly available metagenomics datasets.
iTOL export and visualization: Domain architecture analysis results can be exported and visualized in the interactive Tree Of Life. The new option can be found in the protein list function select list.
User interface cleanup: Various small changes to the UI, resulting in faster and easier navigation.
Metabolic pathways information: SMART domains and proteins available in the "genomic" mode now have basic metabolic pathways information. It is generated by mapping our genomic mode protein database to the KEGG orthologous groups.
Updated genomic mode protein database: The greatly expanded genomic mode database now includes sequences from 630 completely sequenced species.
SMART webservice: You can access SMART using our webservice. Check the WSDL file for details. The webservice is still under active development and only works for sequences and IDs which are in our database (complete Uniref100 and stable Ensembl genomes).
SMART DAS server: The SMART DAS server is available at http://smart.embl.de/smart/das. It provides all the protein features from our database (SMART domains, signal peptides, transmembrane regions and coiled coils) for all Uniref100 and Ensembl proteins.
If you need help with these services or have questions/feedback, please contact us.
New protein database in 'Normal' mode: SMART now uses Uniprot as the main source of protein sequences. All Ensembl proteomes (except pre-releases) are also included. To lower redundancy in the database, the following procedure is used:
o only one copy of 100% identical proteins is kept (different IDs are still available)
o each species' proteins are separated
o CD-HIT clustering with 96% identity cutoff is performed on each species separately
o the longest member of each protein cluster is used as the representative
o only representative cluster members and single proteins (i.e. proteins which are not members of any clusters) are used in all domain architecture queries and for domain counts in the annotation pages
Even though the number of proteins in the database is almost doubled (current version has around 2.9 million proteins), the redundancy should be minimal.
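Two steps of the redundancy-reduction procedure above are simple enough to sketch in Python: collapsing 100% identical proteins while keeping their IDs, and choosing the longest member of each cluster as its representative. The clustering itself is done by CD-HIT in SMART and is not reimplemented here; the clusters are assumed to be given. Function names and the toy sequences are hypothetical.

```python
from typing import Dict, List, Tuple


def collapse_identical(records: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Keep one copy of each identical sequence, remembering all its IDs.

    records: (protein_id, sequence) pairs.
    Returns a mapping from sequence to the list of IDs that share it.
    """
    ids_by_sequence: Dict[str, List[str]] = {}
    for protein_id, sequence in records:
        ids_by_sequence.setdefault(sequence, []).append(protein_id)
    return ids_by_sequence


def pick_representatives(clusters: List[List[str]]) -> List[str]:
    """Return the longest sequence of each (precomputed) cluster."""
    return [max(cluster, key=len) for cluster in clusters]


unique = collapse_identical([("P1", "MKV"), ("P2", "MKV"), ("P3", "MKVL")])
reps = pick_representatives([["MKV", "MKVLA"], ["MA"]])
```

A single protein that is not part of any cluster is represented here as a one-member cluster, so it is kept as its own representative.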
Domain architecture invention dating: As a further step from the single domain to the understanding of multi-domain proteins, SMART now predicts the taxonomic class in which the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto NCBI's taxonomy. The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin.
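The last-common-ancestor step can be illustrated with a short Python sketch. It assumes each organism is given as a root-to-leaf lineage of taxon names, which is a simplification of the NCBI taxonomy; the lineages shown are abbreviated for illustration and the function name is hypothetical.

```python
from typing import List, Optional


def last_common_ancestor(lineages: List[List[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        raise ValueError("need at least one lineage")
    shared = None
    # Walk down from the root while every lineage agrees.
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            shared = taxa_at_depth[0]
        else:
            break
    return shared


human = ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"]
fly = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"]
yeast = ["cellular organisms", "Eukaryota", "Fungi"]

# Architecture seen in human and fly only: dated to Metazoa.
animal_only = last_common_ancestor([human, fly])
# Architecture also seen in yeast: dated to Eukaryota.
with_yeast = last_common_ancestor([human, fly, yeast])
```

Adding a single protein from a more distant organism pushes the predicted point of origin further back, which is why the result depends on the set of genomes in the database.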
Intron positions are shown in protein schematics: For proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (see example). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
The vertical line at the end of the protein is not an actual intron, but a mark to show that intron mapping was performed. If that is the only line, there are no introns annotated. If there is no line at all, there is no data available in Ensembl for that particular sequence. You can switch off intron display on your SMART preferences page.
Alternative splicing information: Since SMART now incorporates Ensembl genomes, the 'Additional information' page shows a list of alternative splices of the gene encoding the analyzed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
Orthology information: There are 2 separate sets of orthologs for each Ensembl protein: 1:1 reciprocal best matches in other genomes and orthologous groups with reciprocal best hits from all genomes analyzed (i.e. each of these proteins has exactly one ortholog in all 6 genomes). This data is displayed on the 'Additional information' page.
Graphical multiple sequence alignments: Orthologous groups and different alternative splices can be displayed as graphical multiple sequence alignments. Proteins are aligned using ClustalW. Domains, intrinsic features and introns are mapped onto the alignment with their positions adjusted according to gaps (black boxes).
Search structure based profiles using RPS-Blast: Clicking on the search schnipsel and structures checkbox will now also initiate a search of profiles based on SCOP domain families, using RPS-Blast. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al. J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).
Improved Architecture analysis: In addition to standard 'Domain selection' querying, it is now possible to do queries based on GO (Gene Ontology) terms associated with domains. In the first step, you get a list of domains matching the GO terms entered. After selecting the domains of interest from the list, proteins containing those domains are displayed. Use the 'Taxonomic selection' box to limit the results based on taxonomic ranges.
Pfam domains are stored in the database: The SMART database now contains precomputed results for all Pfam domains. To use Pfam domains in the architecture queries, prepend the domain name with 'Pfam:' (for example, "TyrKc AND Pfam:Fz AND TRANS").
SMART Toolbar: A SMART Toolbar for the Mozilla web browser is also available.
Fantastic new protein picture generator: Proteins are now displayed as dynamically generated PNG images. This means that you can download the entire protein representation as a single image. You have our permission to do this and use these diagrams in any way that you like; do acknowledge us though. All domain bubbles have been script-generated using The Gimp and its Perl-Fu extension. The script is GPL'd and is available for download.
Fancy colour alignments: All alignments we generate are coloured in using Leo Goodstadt's excellent CHROMA program. You may experience some problems if you're using a klunky browser. This will be fixed when you change your browser.
Excellent transmembrane domains: These are now calculated using the fine TMHMM2 program, kindly provided by Anders Krogh and co-workers.
Selective SMART: We now store intrinsic features such as transmembrane domains and signal peptides in our database. This means that these can be queried for (e.g. SIGNAL AND TyrKc). The feature names are SIGNAL (signal peptide), TRANS (transmembrane domain) and COIL (coiled coil).
Technical changes: SMART code has been modified to run under the Apache mod_perl module. SMART is now using Apache::DBI for persistent database connections. The database engine has been updated to PostgreSQL 7.1. These changes resulted in significant speed improvements.
Bug fixes as usual: The show_many_proteins script now uses the POST method, so there is no longer a limit on the number of proteins you can display. Fixed a display problem with proteins having thousands of different representations (if you try to display those, make sure you have a good browser, like Mozilla or Opera, and a bunch of RAM).
Startup page: The start page now includes selective SMART and allows searching for keywords in the annotation of domains.
Schnipsel Blast: The results of a schnipsel Blast search are included in the bubble diagram. Additionally, PDB is searched.
Taxonomic breakdown: When selecting multiple proteins, e.g. via selective SMART or the annotation pages, an overview of the taxonomy of all species (tax break) is offered.
Links: Links don't contain version numbers. This allows stable links from external sources.
Selective SMART: Allows searching for multiple copies of domains and is case insensitive. You can now search for e.g. 'Sh3 AND sH3 AND sh3'.
Annotation: You can align your query sequence to the SMART alignment using hmmalign.
Update of underlying database: SMART now uses PostgreSQL 6.5.2.
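The selective SMART query syntax mentioned in these notes (terms joined by AND, a 'Pfam:' prefix for Pfam domains, case-insensitive names, and repeated terms to ask for multiple copies) can be mimicked with a small Python sketch. This is only an interpretation of the behaviour described here, not SMART's actual query engine; the function names and the example feature lists are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable


def parse_query(query: str) -> Counter:
    """Split 'A AND B AND A' into a case-insensitive count of required terms."""
    terms = [t.strip() for t in re.split(r"\s+AND\s+", query.strip())]
    return Counter(t.lower() for t in terms if t)


def protein_matches(query: str, features: Iterable[str]) -> bool:
    """True if the protein carries at least the requested number of each term."""
    required = parse_query(query)
    present = Counter(f.lower() for f in features)
    return all(present[term] >= copies for term, copies in required.items())


receptor = ["SIGNAL", "Pfam:Fz", "TRANS", "TyrKc"]
adaptor = ["SH3", "SH2", "SH3"]

hit = protein_matches("TyrKc AND Pfam:Fz AND TRANS", receptor)
two_copies = protein_matches("Sh3 AND sH3", adaptor)
three_copies = protein_matches("Sh3 AND sH3 AND sh3", adaptor)
```

Repeating a term raises the number of copies required, so the adaptor with two SH3 domains satisfies the two-copy query but not the three-copy one.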
Digest output: SMART now only produces a single diagram representing a 'best' interpretation of all the annotation that has been performed. A comprehensive summary of the results is also provided in table format.
Selective SMART: Selective SMART allows you to look for proteins with combinations of specific domains in different species or taxonomic ranges.
Alert SMART: The SMART database gets updated about once a week. If you are interested in specific domains or combinations of them in specific taxonomic ranges, you can use the SMART alerting service. This provides the identities of newly-deposited proteins that match your query.
Domain queries: You can ask for proteins having the same domain order / composition as your query protein.
SMARTed Genomes: You can get the result of SMARTing the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae via the annotation pages.
Faster PFAM searches: The PFAM searches now run on a PVM cluster.
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton, and Christine A. Orengo
WHAT'S NEW?
We report the latest release (version 3.0) of the CATH protein domain database (http://www.cathdb.info). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS) which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database which is a projection of CATH structural data onto 2 million sequences in completed genomes and UniProt.
GENERAL INTRODUCTION
The number of new structures being deposited in the Protein Data Bank (PDB) continues to grow at a considerable rate. In addition, structures being targeted by worldwide structural genomics initiatives are more likely to be novel or only very remotely related to domains previously classified. Only 2% of structures currently solved by conventional crystallography or NMR are likely to adopt novel folds (see Figures 1 and 2). A higher proportion of new folds is expected among the structures solved by structural genomics initiatives. Although the influx of more diverse structures and subsequent analysis will inform our understanding of how domains evolve, it has resulted in increasing lags between the numbers of structures being deposited and classified. In response to this situation we have significantly improved our automated and manual protocols for domain boundary assignment and homologue recognition.
Figure 1 Annual decrease in the percentage of new structures classified in CATH which are observed to possess a novel fold. The raw data for years 1972-2005 was fit to a single exponential equation by nonlinear regression using Sigma Plot (SPSS, Version ...
Figure 2 Annual proportion of protein structures deposited in the PDB which are classified in CATH, rejected or pending classification. The colour scheme reflects different categories of PDB chains. Black: not accepted by the CATH criteria; Red: unprocessed chains; ...
Significant changes have been implemented in the CATH classification protocol to achieve a more highly automated system. A seamless flow of structures between the constituent programs has been achieved by building a pipeline which integrates web services for each major comparison stage in the classification (see Figure 3). Secondly, completely automatic decisions are now being made for new protein chains with close relatives already assigned in the CATH database. There are two situations that preclude the CATH update process from being fully automated. We rely on expert manual curation for particularly challenging protein domain boundary assignments (DomChop stage) and also for classifications of remote folds and homologues (HomCheck stage). These two manual stages will remain an integral part of the system (Figure 3).
Figure 3 Flow diagram of the CATH classification pipeline. This schematic illustrates the processes involved in classifying newly determined structures in CATH. The CATH update protocol workflow from new chain to assigned domain is split into two main processes; ...
In this paper we report our ongoing development of the automated procedures. These critical new features should better enable CATH to keep pace with the PDB (3) and facilitate its development. Key statistics on the domain structure populations and characteristics are also presented.
being classified in CATH (e.g. http://www.cathdb.info/cgi-bin/cath/Chain.pl?chain_id=2g3aA). For protein chains which are closely related to chains that are already chopped in CATH, an automated protocol has been developed (ChopClose). ChopClose identifies any previously chopped chains that have sufficiently high sequence identity and overlap with the query chain. Using SSAP (6), the query is aligned against each of these chains in turn and in each case the domain boundaries are inherited across the alignment. The process of inheriting the boundaries often requires some adjustments to be made to account for insertions, deletions or unresolved residues. If the inheritance from one of the chains meets various criteria (SSAP score of at least 80, sequence identity >80%, RMSD of at most 6.0 Å, longest end extension of at most 10 residues, etc.) then the resulting boundaries are used to chop the chain automatically (AutoChop). For cases where ChopClose's best result does not meet all the criteria for automatic chopping, it is provided as support information for a manual domain boundary assignment. Refer to Figure 3 for the location of AutoChop/ChopClose within the CATH update protocol.
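The decision between automatic chopping and manual curation can be sketched as a simple threshold check. The four thresholds are the ones listed in the paragraph above; the direction of each comparison is our reading of the text (high score and identity, low RMSD and end extension), and the "etc." in the original means that the real protocol applies further criteria that are not modelled here. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InheritedChopping:
    """Summary of boundaries inherited from one previously chopped chain."""
    ssap_score: float
    sequence_identity: float    # percent
    rmsd: float                 # angstroms
    longest_end_extension: int  # residues


def can_autochop(result: InheritedChopping) -> bool:
    """Apply the four listed criteria; comparison directions are assumed."""
    return (
        result.ssap_score >= 80
        and result.sequence_identity > 80
        and result.rmsd <= 6.0
        and result.longest_end_extension <= 10
    )


close_relative = InheritedChopping(92.0, 95.0, 1.2, 3)
loose_relative = InheritedChopping(85.0, 83.0, 7.5, 4)  # RMSD too high

decisions = [can_autochop(close_relative), can_autochop(loose_relative)]
```

A result that fails any one criterion is not discarded; as described above, it is passed on as supporting information for manual boundary assignment.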
scheme for comparing GO terms, based on a method developed by Lord et al. (13). On a separate validation set of 14 000 homologous pairs and 14 000 non-homologous pairs, 97% of the homologues can be recognised at an error rate of <4%.
Previously, CATH data was generated using a group of independent programs and flat files. Over the past two years we have developed an update protocol for CATH that is driven by a suite of programs with a central library and a PostgreSQL database system. A classification pipeline has been established which links, in a completely automated fashion, the different programs that analyse the sequences and structures of both protein chains and domains. The CATH update protocol can essentially be divided into two parts, domain boundary assignment and domain homology classification (see Figure 3). The aim of the protocol is to minimise manual assignment and provide as much support as possible when manual validation is necessary. Processing in both parts of the classification protocol is similar, requiring related meta-data and the triggering of the same automated algorithms. Methods include pairwise sequence similarity comparisons and scans by other homologue detection or fold recognition algorithms such as HMMscan and CATHEDRAL that provide data for either manual or automated assignment. Many of the automated steps in the protocol have been established as a web service, and the pipeline integrates the automated steps with holding stages in which domains are held prior to processing and await the completion of manual validation of predictions (see below).
Web pages to support manual validation
For each manual stage (domain boundary assignment, DomChop, and homologue recognition, HomCheck) in the classification we have developed a suite of web pages bringing together all available meta-data from prediction algorithms (e.g. DBS, CATHEDRAL and HMMscan for DomChop; CATHEDRAL and HMMscan for HomCheck) and information from the literature and from other family classifications with relevant data (e.g. Pfam). For each protein or domain shown on the pages, information on the statistical significance of matches is presented. The web pages will shortly be made viewable and will provide interim data on protein chains and domains not fully classified in CATH for biologists interested in any entries pending classification.
An analysis of the percentage of new folds arising from the early 1970s to the present is shown in Figure 1. The number of new folds has been decreasing over time with respect to the number of new structures being deposited, and it can be seen that currently approximately 2% of new structures classified in CATH are observed to be novel folds. For comparison the number of domain structures solved over time is also graphically represented in Figure 1.
Number of domains within a protein chain
Integral to the construction of the CATH database is designating domain boundaries. We conducted an analysis of the number of chains versus number of domains in a chain. It is interesting to note that 64% of all protein structures currently solved and classified in CATH are single domain chains (data not shown). The next most prevalent are two domain chains (27%) and following this we find that the number of chains containing three or more domains rapidly decreases. The average size of the single domain chains is 159 residues in length.
The CATH Dictionary of Homologous Superfamilies (CATH-DHS)
The CATH-DHS has also been recently updated. Data on structural similarity and superfamily variability is presented as a significant update to the Dictionary of Homologous Superfamilies (DHS) web-resource (14). The DHS also provides functional annotations of domains within each H-level (superfamily) in CATH v 2.5.1.
For each superfamily, pair-wise structural similarity scores between relatives, measured by SSAP, are presented. The DHS now contains 3307 multiple structural alignments for 1459 superfamilies. For each superfamily, multiple alignments are generated for all the relatives and also for subgroups of structurally similar relatives and sequence similar relatives. Alignments are performed using the residue-based CORA algorithm (15) and presented both as CORAPLOTS (14) and in the form of a 2DSEC diagram (16), alongside co-ordinate data of the superposed structures in PDB format. Sequence representations of the alignments are available to download in FASTA format. In the CORAPLOT images of the multiple alignment, residues in each domain are coloured according to ligand binding and residue type. EquivSEC plots are also shown that describe the variability in orientation and packing between equivalent secondary structures (16). To identify sequence relatives for CATH superfamilies, sequences from UniProt (17) were scanned against HMMs of all CATH domains (12). Homologous sequences were identified as those hits with an E-value < 0.01 and a 60% residue overlap with the CATH domain. This protocol recognised over one million domain sequences in UniProt which could be integrated in the CATH-DHS. The harvested sequences in each superfamily were compared against other relatives by BLAST (18) to determine the pair-wise sequence identity, and then clustered at appropriate levels of sequence identity (35 and 95%) using multi-linkage clustering. Information and links to other functional databases, ENZYME (19), GO (Gene Ontology Consortium, 2000), KEGG (20), COG (21) and SWISSPROT (22), are also included by BLASTing the sequences from each superfamily against sequences provided by these resources. Only hits with 95% sequence identity and an 80% residue overlap were used to annotate sequences.
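The filter used to harvest sequence relatives can be written as a short predicate. The two thresholds (E-value below 0.01, 60% residue overlap with the CATH domain) come from the text above; treating 60% as a minimum and measuring overlap as aligned residues divided by domain length are assumptions made for this sketch, and the function name is hypothetical.

```python
def is_homologous_hit(
    e_value: float,
    aligned_residues: int,
    domain_length: int,
    max_e_value: float = 0.01,
    min_overlap: float = 0.60,
) -> bool:
    """Accept an HMM hit that is significant and covers enough of the domain."""
    if domain_length <= 0:
        raise ValueError("domain length must be positive")
    overlap = aligned_residues / domain_length
    return e_value < max_e_value and overlap >= min_overlap


good_hit = is_homologous_hit(1e-5, aligned_residues=90, domain_length=120)
weak_hit = is_homologous_hit(0.05, aligned_residues=120, domain_length=120)
partial_hit = is_homologous_hit(1e-5, aligned_residues=30, domain_length=120)
```

Both conditions must hold: a full-length match with a poor E-value is rejected, and so is a highly significant match that covers only a quarter of the domain.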
Recent analysis of structural and functional divergence in highly populated CATH superfamilies (>5 structural relatives with <35% sequence identity) has been undertaken using data from the DHS. The 2DSEC algorithm was used to analyse multiple structural alignments of families and identify highly conserved structural cores and secondary structure embellishments or decorations to the common core. In some large superfamilies, extensive embellishments were observed outside the core, and although these secondary structure insertions were frequently discontinuous in the protein chain, they were often co-located in 3D space (16). In many cases, manual inspection revealed that the embellishment had aggregated to form a larger structural feature that was modifying the active site of the domain or creating new surfaces for domain or protein interactions. Data collected in the DHS clearly shows a relationship between structural divergence within a superfamily, sequence divergence of this superfamily amongst predicted domains in the genomes and the number of distinct functional groups that can be identified for the superfamily (see Figure 4).
Figure 4 Relationship between sequence variability, structural variability and functional diversity in CATH superfamilies. Structural variation in a CATH superfamily as measured by the number of diverse structural subgroups (SSAP score <80 between groups) ...
LATERAL LINKS ACROSS THE CATH HIERARCHY TO CAPTURE EVOLUTIONARY DIVERGENCE AND EXPLORE THE STRUCTURAL CONTINUUM
Our analysis of structural divergence in CATH superfamilies (16) has revealed families in which significant structural changes have occurred. In some cases 5-fold differences in the sizes of domains were identified, and sometimes it was apparent that the folds of these very diverse relatives had effectively changed. In these superfamilies more than one fold group can therefore be identified, which breaks the hierarchical nature of the CATH classification: the hierarchy implies that each relative within a C.A.T.H. homologous superfamily should belong to the same C.A.T. fold group. In addition, an all-versus-all HMM-HMM scan between all superfamily representatives revealed several cases of extremely remote homologues which had been classified into separate superfamilies and yet match with significant E-values. Structure comparison had failed to detect the relationship between these superfamilies because the structural divergence of the relatives was so extreme, sometimes constituting a change in architecture as well as fold group. In these cases homology was only suggested by the HMM-based scans and then manually validated by considering functional information and detailed evidence from the literature. In order to capture information on these distant homologies, links have been created between the superfamilies both on our web pages and in the CATH database. The data can now be found as a link from the CATH homepage (http://www.cathdb.info). In the near future, we also plan to provide web pages presenting cases of significant structural overlaps between superfamilies or fold groups. For these cases we are not currently able to find any additional evidence to support a distant evolutionary relationship. These examples highlight the recurrence of large structural motifs between some folds and the existence of a structural continuum in some regions of fold space.
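The hierarchical constraint mentioned above, that every relative in a C.A.T.H. superfamily belongs to the same C.A.T. fold group, follows from how the identifiers are built. A minimal sketch, assuming four-part dotted identifiers such as 3.40.50.200 (an identifier quoted elsewhere in these notes); the function names are hypothetical and are not part of any CATH software.

```python
def parse_cath_id(cath_id):
    """Split a C.A.T.H. identifier into its four named levels."""
    parts = cath_id.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    names = ("class", "architecture", "topology", "homologous_superfamily")
    return dict(zip(names, (int(p) for p in parts)))


def fold_group(cath_id):
    """Return the C.A.T. prefix (fold group) of a C.A.T.H. identifier."""
    parse_cath_id(cath_id)  # validate the identifier first
    return cath_id.rsplit(".", 1)[0]


def same_fold_group(id_a, id_b):
    """True when two superfamilies sit under the same fold group."""
    return fold_group(id_a) == fold_group(id_b)


print(parse_cath_id("3.40.50.200"))
print(fold_group("3.40.50.200"))  # 3.40.50
```

Under a strict hierarchy a superfamily has exactly one fold group, which is why the cases described above are recorded as lateral links between superfamilies rather than as changes to the identifiers.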
FEATURES
The CATH database can be accessed at http://www.cathdb.info. The web interface may be browsed, or searched with PDB codes, CATH domain identifiers or keywords. With the version 3.0 release we now also make the raw and processed data files available (for example CATH domain PDB files, sequences and DSSP files); they can be accessed through the CATH database main page. The Gene3D resource can be accessed through the CATH database or directly at http://www.cathdb.info/Gene3D. The DHS can be accessed through the CATH database or directly at http://www.cathdb.info/bsm/dhs.
Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6).
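The SSAP score ranges quoted above can be read as rough rules of thumb. The sketch below encodes them as a small helper; the function name and the wording of the returned labels are assumptions, and because the text says scores "generally" fall in these ranges, the cut-offs should not be treated as hard classification rules.

```python
def interpret_ssap(score):
    """Map an SSAP score (0-100) to the rough interpretation given in the text."""
    if not 0 <= score <= 100:
        raise ValueError("SSAP scores are expected to lie between 0 and 100")
    if score > 80:
        return "likely homologous"
    if score > 70:
        return "similar fold (topology level)"
    return "no clear fold similarity"


for example in (100, 85, 75, 50):
    print(example, interpret_ssap(example))
```

As the text notes, a score above 70 without supporting sequence or functional similarity may reflect convergent evolution rather than common ancestry, so the label is a starting point for inspection rather than a conclusion.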
Abstract
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature), we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a structural genomics initiative, will make a major contribution to interpreting genome data.
(Homologous families), for having either significant sequence similarity (35% identity) or high structural similarity and some sequence similarity (20% identity).
The
organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, it had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously and the probable effects of substitutions in key functional residues must be carefully assessed.
Figure 1
Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
Figure 2
Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies, for the subtilisin family (CATH id: 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g., barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7). Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached, and checking the relationships of very distant homologues and proteins having borderline fold similarity. Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually. A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. 2).
Multiple structure-based alignments are also available, coloured according to secondary structure assignments or residue properties, and there are schematic plots showing domain representations annotated by protein-ligand interactions (DOMPLOTS) (A.E.Todd, C.A.Orengo and J.M.Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops; 10).
Figure 3
CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH whilst each band in the outer wheel corresponds to a single fold family. The size of each fold band therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands, containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
We have also recently set up a Web Server (11), which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence. The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed β-propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level). The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure 3.
It can be seen that several highly populated fold families, which we describe as superfolds (6), as they support a diverse range of sequences and more than three different functions, still account for nearly 30% of non-homologous structures.
Implications for Structural Genomics
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships and thereby functions can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures? In the current genomes, on average only 30–46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only 600 unique structures currently in the PDB, compared with 20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure 4a illustrates the distribution of 2159 new structural domains classified in the 10 months from June 1997 to March 1998. A large proportion of these (79%) were clearly homologous (≥30% identity) to proteins of known structure. Of the remaining 443 structures (Fig. 4b) corresponding to new sequences, we found only 8% were novel folds, the remainder resembling a previously determined structure. Many of these, 199 (45%), could be identified as clear homologues by having significant structure and sequence similarity (SSAP ≥80 and ≥20% sequence identity). A further 169 (38%) were probable homologues as, although the sequence identity was below 20%, they had functional similarity and/or gave significant scores using sequence search methods designed to detect very distant homologues (PSI-BLAST) (12). There remained a further 40 (9%) proteins which were analogous, i.e., they had the same fold as a previous entry, but neither the sequence nor the function gave definite evidence of a common ancestor.
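The percentages in the paragraph above can be checked directly from the quoted counts. The short script below is only a worked arithmetic check; the count of novel folds is not stated in the text and is derived here as the remainder, which is an assumption about how the 8% figure was obtained.

```python
# Counts quoted in the text for June 1997 to March 1998.
total_new = 2159       # new structural domains classified
non_homologous = 443   # no clear sequence match to a known structure

by_structure = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}
# Derived here, not quoted in the text: whatever is left must be novel folds.
by_structure["novel folds"] = non_homologous - sum(by_structure.values())


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


for label, count in by_structure.items():
    print(f"{label}: {count} ({percent(count, non_homologous)}%)")

sequence_matched = total_new - non_homologous
print(f"matched by sequence: {sequence_matched} "
      f"({percent(sequence_matched, total_new)}%)")
```

The remainder works out to 35 structures, about 8% of 443, and the clear plus probable homologues together account for 45% + 38% = 83%, in line with the figures quoted in these notes.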
Figure 4
Pie charts showing the proportion of 2159 recently deposited structures which match structures in CATH. (a) Proportion of new structures matching by sequence alignment (21) or structure alignment (SSAP) (3). (b) Proportion of new non-homologous structures (<30% sequence identity to any previous CATH entry) which match previous CATH entries by structure. Those which have more than 20% sequence identity, measured after structural alignment, or functional similarity, are assigned as homologues. The remaining structures are analogues, having no clear evolutionary relationship.
Relationship between Protein Structure and Function
We now need to consider at what levels of structural similarity or evolutionary distance it is reasonable to inherit functional information within a protein family. Data on the CATH evolutionary families and structural groupings are stored in a Postgres relational database (11) with links to a ligand database containing information about protein-ligand interactions (2). This allows us to analyse the relationship between the 3D structure and function, using stored data on EC identifiers, SWISS-PROT keywords and protein-ligand interactions (11). Considering the degree of functional similarity observed in structures with similar folds, the vast majority (>96%) of fold groups in the PDB derive from a single homologous family, with similar or closely related functions within the family. However, for the very common folds (superfolds, see above) which derive from three or more apparently unrelated homologous families, the proteins can perform quite unrelated functions even though they have the same fold. We have described these as analogous folds, which may or may not have a common ancestor.
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity, we observed that 98% had a single EC code. Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of gene recruitment is especially clear in the eye lens proteins, which function as enzymes in other cellular environments, but which are used as structural proteins in this context (17). The extent of such gene recruitment and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id: 2.40.130.10), several proteins are found with very similar structures, which bind different fatty acids in the same region at the base of the β-barrel (e.g., retinol, bilin, biotin). Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place: in the base of the β-barrel for the TIMs, and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
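The comparison of "the first three EC identifiers" described above can be sketched as a prefix comparison on EC numbers. This is a minimal illustration with hypothetical function names; the EC strings in the example are generic inputs chosen to exercise the code and are not a family taken from CATH.

```python
def shared_ec_levels(ec_a, ec_b):
    """Count how many leading EC levels two EC numbers have in common."""
    count = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        # A dash marks an unassigned level, so it cannot count as agreement.
        if level_a != level_b or level_a == "-":
            break
        count += 1
    return count


def family_is_conserved(ec_numbers, levels=3):
    """True when every EC number shares at least `levels` leading levels.

    Comparing everything with the first entry is enough, because sharing a
    prefix of a given length is transitive.
    """
    first = ec_numbers[0]
    return all(shared_ec_levels(first, ec) >= levels for ec in ec_numbers[1:])


print(shared_ec_levels("3.4.21.4", "3.4.21.1"))
print(family_is_conserved(["3.4.21.4", "3.4.21.1", "3.4.21.62"]))
print(family_is_conserved(["3.4.21.4", "3.4.24.1"]))
```

Setting levels=4 would correspond to the stricter "single EC identifier" criterion mentioned in the same paragraph.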
Assignment of Function Through Structure
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:
i. The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by gene recruitment discussed above).
ii. The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
iii. For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g., in the TIM barrel or Rossmann fold structures, the ligand always binds at the same end of the barrel or sheet.
iv. Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly, critical surface patches, which are used for molecular recognition in binding other proteins or ligands, may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new genome, we can expect that, of the 54–70% of sequences which currently have no obvious sequence matches in the PDB, we will find nearly 80–90% to be homologous to a known family using the structural data alone. For the singlet folds, this will almost certainly reveal some clues to the function. For the superfolds, some folds will reveal information on the functional class (e.g., enzyme for TIM barrels) or the location of the active site, if not the specific function. Only 10–20% will be expected to be novel folds. For these, the ab initio methods referred to above may provide some clues to guide experiments. Therefore, it is clear that determining structures, as part of a structural genomics initiative, for example, will make a major contribution to interpreting genome data.
JAIL
Just another interface library? Of course not!
Interfaces of macromolecules are a valuable basis to analyse the process of molecular recognition. JAIL classifies not only the interfaces between domain architectures but also those between protein chains and those between proteins and nucleic acids.
Gt-alpha/Gi-alpha chimera (PDB-ID: 1GOT). Interfaces of 1GOT
Interacting proteins are difficult to crystallize and are rarely present in the Protein Data Bank. Nevertheless, it is essential to analyse the interacting parts of the proteins to understand the process of protein-protein docking. To overcome this problem we have built up the JAIL database. Since interacting domains exhibit structural features similar to those of proteins, all known interfaces between interacting domains of the SCOP database were extracted and classified in JAIL. Only a part of all protein structures is included in SCOP; in particular, new PDB entries are not yet annotated. To overcome this problem, all interfaces between protein chains were additionally calculated and included in the database. This type of interface also comprises the interacting parts of the assumed biological units. The last important type of interface provided here is composed of the interacting parts between proteins and nucleic acids. Overall the data set consists of about 180,000 interfaces. JAIL is a convenient tool for browsing the interface library and analysing single interfaces. However, more general questions require large-scale analysis. For this purpose, a detailed form allows comprehensive non-redundant data sets to be compiled for download.
How is an interface defined?
A complete residue is part of an interface if at least one atom of the amino acid is located within 4.5 Å of any atom of the interacting domain or chain. One part of an interface must consist of at least 5 C-alpha atoms in the case of protein chains. In the case of nucleic acids the relevant atoms are calculated on the basis of the phosphorus atoms of the RNA/DNA backbone.
What are biounits?
The primary coordinate file deposited in the PDB generally contains one asymmetric unit. The asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one cell. The biological molecule (biounit) is believed to be the functional unit of the protein. Frequently those units can be assumed or calculated when additional information is available. The biological units of many proteins are deposited in a separate section of the PDB database and can be used for interface calculations.
In which way are redundant interfaces excluded in the download section?
The redundancy is excluded in two different ways, by structure and by sequence. The sequential clustering is based on the CD-HIT program. The structural clustering is defined by the protein families and superfamilies of the SCOP classification. The database classifies proteins by domain architecture.
Which settings in the download section are best for my own research?
The selection of the datasets depends on the type of interactions (protein-protein or protein-nucleic acid) and the level of diversity that is desired. A maximum sequence identity of 50% results in a higher diversity than a setting of 95%. The default settings include interfaces of domain-domain interactions as well as interfaces between interacting chains. All interfaces of chains that were already treated by the SCOP domain interfaces are excluded by default. This procedure results in a high number of interfaces that are still diverse enough for statistical analysis.
What is meant by "show conservation" in Jmol?
The conservation of protein sequences is defined by the mutation rates at each amino acid position. For JAIL this information was retrieved from ConSurf. ConSurf is a derived database merging structural and sequence information.
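The interface definition given above (a residue belongs to the interface if any of its atoms lies within 4.5 Å of any atom of the partner) can be sketched in a few lines. This is an illustration only: the data layout (a dictionary from residue identifier to a list of atom coordinates), the function name and the toy coordinates are all assumptions, and the additional JAIL rules (the minimum of 5 C-alpha atoms and the phosphorus-based treatment of nucleic acids) are not implemented.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=4.5):
    """Return residues of chain_a with any atom within `cutoff` of chain_b.

    Each chain is a dict mapping a residue id to a list of (x, y, z)
    atom coordinates in Angstrom (an assumed layout, for illustration).
    """
    partner_atoms = [atom for atoms in chain_b.values() for atom in atoms]
    selected = []
    for residue_id, atoms in chain_a.items():
        # The whole residue is kept as soon as one atom pair is close enough.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in partner_atoms):
            selected.append(residue_id)
    return selected


# Toy coordinates, invented for the example.
chain_a = {"A1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(5.0, 0.0, 0.0)]}

print(interface_residues(chain_a, chain_b))
print(interface_residues(chain_b, chain_a))
```

The all-against-all loop is quadratic in the number of atoms, which is acceptable for a single pair of chains; a library-scale calculation would normally use a spatial index instead.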
MMDB
Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features. (i) Sequence and structure neighbors: one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases: one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization: having identified a homolog with known structure, one may view molecular-graphic and alignment displays, to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can: submit sequences for SCOP classification; view domain organisation, sequence alignments and protein sequence details.
For each genome you can: examine superfamily assignments, phylogenetic trees, domain organisation lists and networks; check for over- and under-represented superfamilies within a genome.
For each superfamily you can: inspect SCOP classification, functional annotation, Gene Ontology annotation, InterPro abstract and genome assignments; explore taxonomic distribution of a superfamily across the tree of life.
All annotation, models and the database dump are freely available for download to everyone.
Purpose
SUPERFAMILY classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. The superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Major Features
Sequence search: Submit your protein, or DNA, sequence for SCOP superfamily and family level classification.
Keyword search: Search for superfamily, family or species names plus sequence, SCOP, PDB or hidden Markov model IDs.
Domain assignments: Domain assignments, alignments and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative genomics tools: Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism.
Genome statistics: For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs and number of unique domain architectures.
Gene Ontology: Domain-centric Gene Ontology (GO) automatically annotated by Hai Fang.
Phenotype Ontology: Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, Arabidopsis Plant.
Superfamily annotation: InterPro abstracts for 1,052 superfamilies, and Gene Ontology (GO) annotation for 763 superfamilies.
Functional annotation: Functional annotation of SCOP 1.73 superfamilies, by Christine Vogel.
Phylogenetic trees: Trees are generated using heuristic parsimony methods, and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations, or specific clades, can be displayed as individual trees.
Similar domain architectures: Find the 10 domain architectures which are most similar to a domain architecture of interest.
Hidden Markov models: Produce SCOP domain assignments for your sequences using the SUPERFAMILY models. HMM visualisation by Martin Madera, e.g. model 0045110.
Profile comparison: Find remote domain matches when the HMM search fails to find a significant match. Profile comparison (PRC) for aligning and scoring two profile hidden Markov models by Martin Madera.
Web services: Distributed Annotation Server and linking to SUPERFAMILY.
Downloads: Sequences, assignments, models, MySQL database and scripts, updated weekly.
The SUPERFAMILY database provides protein domain assignments, at the SCOP 'superfamily' level, for the predicted protein sequences in over 400 completed genomes. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and sequence data. SUPERFAMILY domain assignments are generated using an expert curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download from http://supfam.org. The web interface includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches. In this update we describe the SUPERFAMILY database and outline two major developments: (i) incorporation of family level assignments and (ii) a superfamily-level functional annotation. The SUPERFAMILY database can be used for general protein evolution and superfamily-specific studies, genomic annotation, and structural genomics target suggestion and assessment.