
PFAM Database

The Pfam database is a collection of protein families that uses profile hidden Markov models to classify protein sequences and domains. It was created to aid in genome annotation by assigning sequences to existing protein families. Pfam is continuously updated as new sequences are deposited and now contains over 16,000 families. It provides a useful resource for researchers to identify functional domains and relationships between protein families.

Uploaded by

Nadish Kumar

PFAM database

Introduction

• Over a million protein sequences have been
deposited in the UniProt sequence database
(Apweiler et al., 2004).
• The number of new sequences deposited shows
no signs of abating.
• Thus, protein sequence analysis may seem like an
endless task.
• However, the majority of protein sequences
appear to fall into a few thousand protein
families (Chothia, 1992).
• If we can place new sequences into existing
families, we can make our task simpler.
• Very often, these protein families are
representative of proteins at the domain level.
• Proteins are generally composed of one or more
functional regions, commonly termed domains.
• Different combinations of domains give rise to
the diverse range of proteins found in nature.
• The identification of domains that occur within
proteins can therefore provide insights into their
function.
• Different combinations of domains give rise to
functional diversity (Vogel et al., 2004), which
includes their ability to form different protein
interactions.
The Pfam database

• Pfam is a database of protein domain families
(Bateman et al., 2004).
• Each family is represented by multiple sequence
alignments and profile hidden Markov models (HMMs).
• In addition, each family has associated annotation,
literature references, and links to other databases.
• Pfam was founded in 1995 by Erik Sonnhammer, Sean Eddy and
Richard Durbin as a collection of commonly occurring protein
domains that could be used to annotate the protein coding genes of
multicellular animals.
• One of its major aims at inception was to aid in the annotation of
the C. elegans genome. The project was partly driven by the
assertion in ‘One thousand families for the molecular biologist’ by
Cyrus Chothia that there were around 1500 different families of
proteins and that the majority of proteins fell into just 1000 of
these.
• Counter to this assertion, the Pfam database currently contains
16,306 entries corresponding to unique protein domains and
families. However, many of these families contain structural and
functional similarities indicating a shared evolutionary origin
(see Clans).
• The entries in Pfam are freely available via the
web and in flatfile format (Pfam is available in
Europe at http://www.sanger.ac.uk/Software/Pfam/ (UK),
http://www.cgb.ki.se/Pfam/ (Sweden),
and http://pfam.jouy.inra.fr/ (France), and in the
United States at http://pfam.wustl.edu/).
• Pfam is a founding member database of InterPro
and is therefore also available via the InterPro site
at http://ebi.ac.uk/interpro.
• Pfam also generates higher-level groupings of related
entries, known as clans. A clan is a collection of Pfam
entries which are related by similarity of sequence,
structure or profile-HMM.

• The data presented for each entry is based on the UniProt
Reference Proteomes, but information on individual
UniProtKB sequences can still be found by entering the
protein accession.
• Pfam full alignments are available from searching a variety
of databases, either to provide different accessions (e.g. all
UniProt and NCBI GI) or different levels of redundancy.
• The general purpose of the Pfam database is to
provide a complete and accurate classification of
protein families and domains.
• Originally, the rationale behind creating the
database was to have a semi-automated method
of curating information on known protein families
to improve the efficiency of annotating genomes.
• The Pfam classification of protein families has
been widely adopted by biologists because of its
wide coverage of proteins and sensible naming
conventions.
• It is used by experimental biologists researching
specific proteins, by structural biologists to
identify new targets for structure determination,
by computational biologists to organise
sequences, and by evolutionary biologists tracing
the origins of proteins.
• Early genome projects, such as human and fly,
used Pfam extensively for functional annotation
of genomic data.
• The Pfam website allows users to submit protein or
DNA sequences to search for matches to families in the
database.
• If DNA is submitted, a six-frame translation is
performed, then each frame is searched.

• Rather than performing a typical BLAST search, Pfam
uses profile hidden Markov models, which give
greater weight to matches at conserved sites. This allows
better remote homology detection and makes them
more suitable for annotating genomes of organisms
with no well-annotated close relatives.
• Pfam has also been used in the creation of
other resources such as iPfam, which catalogs
domain-domain interactions within and
between proteins, based on information in
structure databases and mapping of Pfam
domains onto these structures.
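The six-frame translation applied to submitted DNA sequences can be illustrated with a minimal Python sketch. It assumes the standard genetic code and plain A/C/G/T input; the function names (reverse_complement, translate, six_frame_translation) are illustrative only and are not part of Pfam or HMMER. Each of the six resulting protein strings would then be searched against the family HMMs.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unrecognised codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three offsets on the forward strand and three on the reverse."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical query sequence, used only to demonstrate the call.
frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Stop codons are kept as "*" rather than used to split the translation, which is a simplification; a real search front end may handle stops differently.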
Features of PFAM
• For each family in Pfam one can:
  - View a description of the family
  - Look at multiple alignments
  - View protein domain architectures
  - Examine species distribution
  - Follow links to other databases
  - View known protein structures
• Entries can be of several types: family, domain, repeat
or motif.
  - Family is the default class, which simply indicates that
members are related.
  - Domains are defined as autonomous structural units
or reusable sequence units that can be found in multiple
protein contexts.
  - Repeats are not usually stable in isolation, but rather
are usually required to form tandem repeats in order to
form a domain or extended structure.
  - Motifs are usually shorter sequence units found
outside of globular domains.
Creation of new entries
• New families come from a range of sources, primarily the PDB and analysis
of complete proteomes to find genes with no Pfam hit.
• For each family, a representative subset of sequences is aligned into a
high-quality seed alignment.
• Sequences for the seed alignment are taken primarily from pfamseq (a
non-redundant database of reference proteomes) with some
supplementation from UniProtKB.
• This seed alignment is then used to build a profile hidden Markov model
using HMMER. This HMM is then searched against sequence databases,
and all hits that reach a curated gathering threshold are classified as
members of the protein family. The resulting collection of members is
then aligned to the profile HMM to generate a full alignment.
• For each family, a manually curated gathering
threshold is assigned that maximises the number of
true matches to the family while excluding any false
positive matches.
• False positives are estimated by observing overlaps
between Pfam family hits that are not from the same
clan. This threshold is used to assess whether a match
to a family HMM should be included in the protein
family.
• Upon each update of Pfam, gathering thresholds are
reassessed to prevent overlaps between new and
existing families.
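How a gathering threshold decides family membership, and how overlaps between families from different clans can flag likely false positives, can be sketched in Python. This is a simplified illustration and not Pfam's actual curation pipeline: the Hit record, the family names, the clan label and the threshold values are all hypothetical, and scores are assumed to be HMMER bit scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One match of a family HMM to a region of a sequence (hypothetical record)."""
    sequence_id: str
    family: str
    bit_score: float
    start: int
    end: int


def family_members(hits, gathering_thresholds):
    """Keep only hits whose score reaches their family's gathering threshold."""
    return [h for h in hits if h.bit_score >= gathering_thresholds[h.family]]


def suspicious_overlaps(members, clan_of):
    """Return pairs of accepted hits that overlap on the same sequence
    and whose families do not belong to the same clan."""
    flagged = []
    for i, a in enumerate(members):
        for b in members[i + 1:]:
            same_sequence = a.sequence_id == b.sequence_id
            overlapping = a.start <= b.end and b.start <= a.end
            clan_a, clan_b = clan_of.get(a.family), clan_of.get(b.family)
            same_clan = clan_a is not None and clan_a == clan_b
            if same_sequence and a.family != b.family and overlapping and not same_clan:
                flagged.append((a, b))
    return flagged


# Made-up example data: three families, two of them in one hypothetical clan.
hits = [
    Hit("P1", "FamA", 30.0, 1, 100),
    Hit("P1", "FamB", 25.0, 50, 150),
    Hit("P2", "FamA", 18.0, 1, 90),    # below FamA's threshold, so rejected
    Hit("P1", "FamC", 40.0, 90, 200),
]
gathering_thresholds = {"FamA": 20.0, "FamB": 22.0, "FamC": 35.0}
clan_of = {"FamA": "CL_X", "FamC": "CL_X"}

members = family_members(hits, gathering_thresholds)
flagged = suspicious_overlaps(members, clan_of)
for a, b in flagged:
    print(f"{a.sequence_id}: {a.family} overlaps {b.family}")
```

In this sketch a curator would respond to a flagged pair by raising one of the thresholds until the overlap disappears, which mirrors the reassessment of thresholds described above.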
Domains of unknown function
• Domains of Unknown Function (DUFs) represent a growing fraction
of the Pfam database.
• The families are so named because they have been found to be
conserved across species, but perform an unknown role.
• Each newly added DUF is named in order of addition.
• Names of these entries are updated as their functions are
identified. Normally when the function of at least one protein
belonging to a DUF has been determined, the function of the entire
DUF is updated and the family is renamed. Some named families
are still domains of unknown function and are named after a
representative protein, e.g. YbbR.
• Numbers of DUFs are expected to continue increasing as conserved
sequences of unknown function continue to be identified in
sequence data. It is expected that DUFs will eventually outnumber
families of known function.
Clans
• Over time both sequence and residue coverage have
increased, and as families have grown, more
evolutionary relationships have been discovered,
allowing the grouping of families into clans.
• Clans were first introduced to the Pfam database in
2005.
• They are groupings of related families that share a
single evolutionary origin, as confirmed by structural,
functional, sequence and HMM comparisons.
• As of release 29.0, approximately one third of protein
families belonged to a clan.
• A major point of difference between Pfam and other databases at
the time of its inception was the use of two alignment types for
entries: a smaller, manually checked seed alignment, as well as a
full alignment built by aligning sequences to a profile hidden
Markov model built from the seed alignment.
• This smaller seed alignment was easier to update as new releases
of sequence databases came out, and thus represented a promising
solution to the dilemma of how to keep the database up to date as
genome sequencing became more efficient and more data needed
to be processed over time.
• A further improvement to the speed at which the database could
be updated came in version 24.0, with the introduction of
HMMER3, which is ~100 times faster than HMMER2 and more
sensitive.
• Because the entries in Pfam-A do not cover all known proteins, an
automatically generated supplement was provided called Pfam-B. Pfam-B
contained a large number of small families derived from clusters produced
by an algorithm called ADDA. Although of lower quality, Pfam-B families
could be useful when no Pfam-A families were found. Pfam-B was
discontinued as of release 28.0.
• Pfam was originally hosted on three mirror sites around the world to
preserve redundancy. However, between 2012 and 2014, the Pfam resource
was moved to EMBL-EBI, which allowed for hosting of the website from
one domain (xfam.org), using duplicate independent data centres.
• This allowed for better centralisation of updates, and grouping with other
Xfam projects such as Rfam, TreeFam, iPfam and others, whilst retaining
critical resilience provided by hosting from multiple centres.
• Pfam has undergone a substantial reorganisation over the last two years
to further reduce manual effort involved in curation and allow for more
frequent updates.
The use of Pfam

• The use of Pfam by molecular biologists as a protein
information resource and analysis tool is widespread.
• The multiple sequence alignments around which Pfam
families are built are important for understanding both
protein structure and function.
• The alignments are also the basis for techniques such
as secondary structure prediction, fold recognition,
and phylogenetic analysis and can guide mutation
design.
• In addition to the identification of domains in novel
protein sequences
