PFAM Database

The Pfam database is a collection of protein families that uses profile hidden Markov models to classify protein sequences and domains. It was created to aid in genome annotation by assigning sequences to existing protein families. Pfam is continuously updated as new sequences are deposited and now contains over 16,000 families. It provides a useful resource for researchers to identify functional domains and relationships between protein families.

Uploaded by

Nadish Kumar


PFAM database

Introduction

• Over a million protein sequences have been
deposited in the UniProt sequence database
(Apweiler et al., 2004).
• The number of new sequences deposited shows
no signs of abating.
• Thus, protein sequence analysis may seem like an
endless task.
• However, the majority of protein sequences
appear to fall into a few thousand protein
families (Chothia, 1992).
• If we can place new sequences into existing
families, we can make our task simpler.
• Very often, these protein families are
representative of proteins at the domain level.
• Proteins are generally composed of one or more
functional regions, commonly termed domains.
• Different combinations of domains give rise to
the diverse range of proteins found in nature.
• The identification of domains that occur within
proteins can therefore provide insights into their
function.
• These combinations of domains also underlie
functional diversity (Vogel et al., 2004), including
the ability of proteins to form different
interactions.
The Pfam database

• Pfam is a database of protein domain families
(Bateman et al., 2004).
• Each family is represented by multiple sequence
alignments and profile hidden Markov models (HMMs).
• In addition, each family has associated annotation,
literature references, and links to other databases.
• Pfam was founded in 1995 by Erik Sonnhammer, Sean Eddy and
Richard Durbin as a collection of commonly occurring protein
domains that could be used to annotate the protein-coding genes of
multicellular animals.
• One of its major aims at inception was to aid in the annotation of
the C. elegans genome. The project was partly driven by the
assertion in ‘One thousand families for the molecular biologist’ by
Cyrus Chothia that there were around 1500 different families of
proteins and that the majority of proteins fell into just 1000 of
these.
• Counter to this assertion, the Pfam database currently contains
16,306 entries corresponding to unique protein domains and
families. However, many of these families contain structural and
functional similarities indicating a shared evolutionary origin
(see Clans)
• The entries in Pfam are freely available via the
web and in flatfile format. Pfam is available in
Europe at http://www.sanger.ac.uk/Software/Pfam/ (UK),
http://www.cgb.ki.se/Pfam/ (Sweden), and
http://pfam.jouy.inra.fr/ (France), and in the
United States at http://pfam.wustl.edu/.
• Pfam is a founding member database of InterPro
and is therefore also available via the InterPro site
at http://ebi.ac.uk/interpro.
• Pfam also generates higher-level groupings of related
entries, known as clans. A clan is a collection of Pfam
entries which are related by similarity of sequence,
structure or profile-HMM.

• The data presented for each entry is based on the UniProt
Reference Proteomes, but information on individual
UniProtKB sequences can still be found by entering the
protein accession.
• Pfam full alignments are available from searching a variety
of databases, either to provide different accessions (e.g. all
UniProt and NCBI GI) or different levels of redundancy.
• The general purpose of the Pfam database is to
provide a complete and accurate classification of
protein families and domains.
• Originally, the rationale behind creating the
database was to have a semi-automated method
of curating information on known protein families
to improve the efficiency of annotating genomes.
• The Pfam classification of protein families has
been widely adopted by biologists because of its
wide coverage of proteins and sensible naming
conventions.
• It is used by experimental biologists researching
specific proteins, by structural biologists to
identify new targets for structure determination,
by computational biologists to organise
sequences and by evolutionary biologists tracing
the origins of proteins.
• Early genome projects, such as those for human and fly,
used Pfam extensively for functional annotation
of genomic data.
• The Pfam website allows users to submit protein or
DNA sequences to search for matches to families in the
database.
• If DNA is submitted, a six-frame translation is
performed, then each frame is searched.

• Rather than performing a typical BLAST search, Pfam
uses profile hidden Markov models, which give
greater weight to matches at conserved sites. This
allows better detection of remote homology and makes
them more suitable for annotating genomes of organisms
with no well-annotated close relatives.
• Pfam has also been used in the creation of
other resources such as iPfam, which catalogs
domain-domain interactions within and
between proteins, based on information in
structure databases and mapping of Pfam
domains onto these structures
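
The six-frame translation applied to submitted DNA, mentioned above, can be sketched in a few lines of Python. This is a minimal illustration and not the code used by the Pfam website: it assumes the standard genetic code, plain uppercase or lowercase A/C/G/T input, and marks any codon it cannot translate as "X". The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate three forward and three reverse-strand reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six resulting protein strings would then be searched against the family models separately, which is why a DNA query costs roughly six protein searches.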
Features of PFAM
• For each family in Pfam one can:
  - View a description of the family
  - Look at multiple alignments
  - View protein domain architectures
  - Examine species distribution
  - Follow links to other databases
  - View known protein structures
• Entries can be of several types: family, domain, repeat
or motif.
  - Family is the default class, which simply indicates that
    members are related.
  - A domain is an autonomous structural unit
    or reusable sequence unit that can be found in multiple
    protein contexts.
  - Repeats are not usually stable in isolation; they
    usually need to occur in tandem in order to
    form a domain or extended structure.
  - Motifs are usually shorter sequence units found
    outside of globular domains.
Creation of new entries
• New families come from a range of sources, primarily the PDB and analysis
of complete proteomes to find genes with no Pfam hit.
• For each family, a representative subset of sequences is aligned into a
high-quality seed alignment.
• Sequences for the seed alignment are taken primarily from pfamseq (a
non-redundant database of reference proteomes) with some
supplementation from UniProtKB.

• This seed alignment is then used to build a profile hidden Markov model
using HMMER. This HMM is then searched against sequence databases,
and all hits that reach a curated gathering threshold are classified as
members of the protein family. The resulting collection of members is
then aligned to the profile HMM to generate a full alignment.
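
To make the idea of building a model from a seed alignment more concrete, the sketch below derives a position-specific log-odds profile from a toy ungapped alignment and uses it to score sequences. It is a deliberate simplification and not what HMMER does: a real profile HMM also models insertions and deletions and uses more careful priors. The uniform background frequency, the pseudocount of 1 and the toy sequences are all assumptions chosen for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies


def build_profile(seed_alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    profile = []
    for column in zip(*seed_alignment):
        # Gap characters and non-standard residues are simply ignored.
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / BACKGROUND)
            for res in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum column scores for an ungapped sequence as long as the profile."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal the number of columns")
    return sum(column[res] for column, res in zip(profile, sequence))


if __name__ == "__main__":
    # Toy seed alignment: column 1 is fully conserved, column 4 is variable.
    seed = ["GKSTL", "GKTSL", "GKSAL", "GRSTL"]
    profile = build_profile(seed)
    print(score(profile, "GKSTL"))  # matches the seed consensus
    print(score(profile, "WKSTL"))  # mismatch at the conserved column
    print(score(profile, "GKSWL"))  # mismatch at the variable column
```

In this toy profile a mismatch at the conserved first column costs more than the same mismatch at the variable fourth column, which is the sense in which profile methods weight conserved sites more heavily than a single-sequence comparison does.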
• For each family, a manually curated gathering
threshold is assigned that maximises the number of
true matches to the family while excluding any false
positive matches.
• False positives are estimated by observing overlaps
between Pfam family hits that are not from the same
clan. This threshold is used to assess whether a match
to a family HMM should be included in the protein
family.
• Upon each update of Pfam, gathering thresholds are
reassessed to prevent overlaps between new and
existing families
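
The role of the gathering threshold and the overlap check described above can be sketched as follows. This is an illustrative model only: the `Hit` record, the family names, the clan identifier and all scores are hypothetical, and a single bit-score cutoff per family is a simplification of how thresholds are stored in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    family: str
    sequence_id: str
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    bit_score: float


def apply_gathering_thresholds(hits, thresholds):
    """Keep only hits that reach their family's gathering threshold."""
    return [h for h in hits if h.bit_score >= thresholds[h.family]]


def find_overlaps(hits, clan_of):
    """Return pairs of overlapping hits from different families that do not
    share a clan; such pairs are treated as likely false positives."""
    conflicts = []
    for i, a in enumerate(hits):
        for b in hits[i + 1:]:
            if a.sequence_id != b.sequence_id or a.family == b.family:
                continue
            clan_a, clan_b = clan_of.get(a.family), clan_of.get(b.family)
            same_clan = clan_a is not None and clan_a == clan_b
            regions_overlap = a.start <= b.end and b.start <= a.end
            if regions_overlap and not same_clan:
                conflicts.append((a, b))
    return conflicts


if __name__ == "__main__":
    hits = [
        Hit("FamA", "P1", 10, 100, 35.0),
        Hit("FamB", "P1", 80, 150, 22.0),
        Hit("FamC", "P1", 90, 160, 40.0),
        Hit("FamA", "P2", 5, 60, 18.0),
    ]
    thresholds = {"FamA": 25.0, "FamB": 20.0, "FamC": 30.0}
    clan_of = {"FamA": "CL_demo", "FamC": "CL_demo"}  # FamB has no clan

    members = apply_gathering_thresholds(hits, thresholds)
    for a, b in find_overlaps(members, clan_of):
        print(f"{a.family} overlaps {b.family} on {a.sequence_id}")
```

A curator seeing such a conflict could raise the threshold of one family until the overlap disappears, which corresponds to the reassessment of thresholds at each update.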
Domains of unknown function
• Domains of Unknown Function (DUFs) represent a growing fraction
of the Pfam database.
• The families are so named because they have been found to be
conserved across species, but perform an unknown role.
• Each newly added DUF is named in order of addition.
• Names of these entries are updated as their functions are
identified. Normally, when the function of at least one protein
belonging to a DUF has been determined, the function of the entire
DUF is updated and the family is renamed. Some named families
are still domains of unknown function and are simply named after a
representative protein, e.g. YbbR.
• Numbers of DUFs are expected to continue increasing as conserved
sequences of unknown function continue to be identified in
sequence data. It is expected that DUFs will eventually outnumber
families of known function.
Clans
• Over time both sequence and residue coverage have
increased, and as families have grown, more
evolutionary relationships have been discovered,
allowing the grouping of families into clans.
• Clans were first introduced to the Pfam database in
2005.
• They are groupings of related families that share a
single evolutionary origin, as confirmed by structural,
functional, sequence and HMM comparisons.
• As of release 29.0, approximately one third of protein
families belonged to a clan
• A major point of difference between Pfam and other databases at
the time of its inception was the use of two alignment types for
entries: a smaller, manually checked seed alignment, as well as a
full alignment built by aligning sequences to a profile hidden
Markov model built from the seed alignment.
• This smaller seed alignment was easier to update as new releases
of sequence databases came out, and thus represented a promising
solution to the dilemma of how to keep the database up to date as
genome sequencing became more efficient and more data needed
to be processed over time.
• A further improvement to the speed at which the database could
be updated came in version 24.0, with the introduction of
HMMER3, which is ~100 times faster than HMMER2 and more
sensitive.
• Because the entries in Pfam-A do not cover all known proteins, an
automatically generated supplement called Pfam-B was provided. Pfam-B
contained a large number of small families derived from clusters produced
by an algorithm called ADDA. Although of lower quality, Pfam-B families
could be useful when no Pfam-A families were found. Pfam-B was
discontinued as of release 28.0.
• Pfam was originally hosted on three mirror sites around the world to
provide redundancy. However, between 2012 and 2014, the Pfam resource
was moved to EMBL-EBI, which allowed for hosting of the website from
one domain (xfam.org), using duplicate independent data centres.
• This allowed for better centralisation of updates, and grouping with other
Xfam projects such as Rfam, TreeFam, iPfam and others, whilst retaining
critical resilience provided by hosting from multiple centres.
• Pfam has undergone a substantial reorganisation over the last two years
to further reduce manual effort involved in curation and allow for more
frequent updates.
The use of Pfam
• The use of Pfam by molecular biologists as a protein
information resource and analysis tool is widespread.
• The multiple sequence alignments around which Pfam
families are built are important for understanding both
protein structure and function.
• The alignments are also the basis for techniques such
as secondary structure prediction, fold recognition,
and phylogenetic analysis and can guide mutation
design.
• In addition to the identification of domains in novel
protein sequences
