PFAM Database
Introduction
• Sequences for the seed alignment are taken primarily from pfamseq (a
non-redundant database of reference proteomes) with some
supplementation from UniProtKB.
• This seed alignment is then used to build a profile hidden Markov model
using HMMER. This HMM is then searched against sequence databases,
and all hits that reach a curated gathering threshold are classified as
members of the protein family. The resulting collection of members is
then aligned to the profile HMM to generate a full alignment.
• For each family, a manually curated gathering
threshold is assigned that maximises the number of
true matches to the family while excluding any false
positive matches.
• False positives are estimated by observing overlaps
between Pfam family hits that are not from the same
clan. The gathering threshold is then used to decide
whether a match to a family HMM should be included in
the protein family.
• Upon each update of Pfam, gathering thresholds are
reassessed to prevent overlaps between new and
existing families.
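The thresholding and overlap checks described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Hit` record, the family names (FamA, FamB, FamC), the clan label CL_X, and the scores and thresholds are hypothetical, and this is not Pfam's actual curation code. It only shows the idea that a hit becomes a family member when its bit score reaches the family's gathering threshold, and that overlapping members from families outside a shared clan are flagged as possible false positives.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Hit:
    """One match of a family HMM to a region of a sequence (hypothetical record)."""
    family: str
    sequence_id: str
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    bit_score: float


def apply_gathering_threshold(hits: List[Hit], thresholds: Dict[str, float]) -> List[Hit]:
    """Keep only hits whose bit score reaches their family's gathering threshold."""
    return [h for h in hits if h.bit_score >= thresholds[h.family]]


def find_suspect_overlaps(hits: List[Hit], clan_of: Dict[str, str]) -> List[Tuple[Hit, Hit]]:
    """Return overlapping hit pairs from different families that do not share a clan."""
    suspects = []
    for i, a in enumerate(hits):
        for b in hits[i + 1:]:
            if a.sequence_id != b.sequence_id or a.family == b.family:
                continue
            clan_a = clan_of.get(a.family)
            clan_b = clan_of.get(b.family)
            # Overlaps between families of the same clan are not treated as suspect.
            if clan_a is not None and clan_a == clan_b:
                continue
            # Two inclusive intervals overlap when each starts before the other ends.
            if a.start <= b.end and b.start <= a.end:
                suspects.append((a, b))
    return suspects


# Hypothetical example data: thresholds, clan membership and raw search hits.
thresholds = {"FamA": 25.0, "FamB": 30.0, "FamC": 22.0}
clan_of = {"FamA": "CL_X", "FamC": "CL_X"}  # FamB belongs to no clan
raw_hits = [
    Hit("FamA", "seq1", 10, 120, 35.2),
    Hit("FamA", "seq2", 5, 110, 18.0),    # below the FamA threshold
    Hit("FamB", "seq1", 100, 200, 40.1),
    Hit("FamC", "seq1", 90, 130, 30.0),
]

members = apply_gathering_threshold(raw_hits, thresholds)
suspects = find_suspect_overlaps(members, clan_of)

for a, b in suspects:
    print(f"{a.sequence_id}: {a.family} {a.start}-{a.end} overlaps {b.family} {b.start}-{b.end}")
```

In this toy data the FamA/FamC overlap is ignored because both families sit in the same clan, while the overlaps involving FamB are flagged. In curation terms, a flagged pair would be a cue to revisit one of the thresholds.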
Domains of unknown function
• Domains of Unknown Function (DUFs) represent a growing fraction
of the Pfam database.
• The families are so named because they have been found to be
conserved across species, but perform an unknown role.
• Each newly added DUF is named in order of addition.
• Names of these entries are updated as their functions are
identified. Normally, when the function of at least one protein
belonging to a DUF has been determined, the function of the entire
DUF is updated and the family is renamed. Some named families
are still domains of unknown function and are simply named after a
representative protein, e.g. YbbR.
• Numbers of DUFs are expected to continue increasing as conserved
sequences of unknown function continue to be identified in
sequence data. It is expected that DUFs will eventually outnumber
families of known function.
Clans
• Over time both sequence and residue coverage have
increased, and as families have grown, more
evolutionary relationships have been discovered,
allowing the grouping of families into clans.
• Clans were first introduced to the Pfam database in
2005.
• They are groupings of related families that share a
single evolutionary origin, as confirmed by structural,
functional, sequence and HMM comparisons.
• As of release 29.0, approximately one third of protein
families belonged to a clan.
• A major point of difference between Pfam and other databases at
the time of its inception was the use of two alignment types for
entries: a smaller, manually checked seed alignment, as well as a
full alignment built by aligning sequences to a profile hidden
Markov model built from the seed alignment.
• This smaller seed alignment was easier to update as new releases
of sequence databases came out, and thus represented a promising
solution to the dilemma of how to keep the database up to date as
genome sequencing became more efficient and more data needed
to be processed over time.
• A further improvement to the speed at which the database could
be updated came in version 24.0, with the introduction of
HMMER3, which is ~100 times faster than HMMER2 and more
sensitive.
• Because the entries in Pfam-A do not cover all known proteins, an
automatically generated supplement was provided called Pfam-B. Pfam-B
contained a large number of small families derived from clusters produced
by an algorithm called ADDA. Although of lower quality, Pfam-B families
could be useful when no Pfam-A families were found. Pfam-B was
discontinued as of release 28.0.
• Pfam was originally hosted on three mirror sites around the world to
provide redundancy. However, between 2012 and 2014, the Pfam resource
was moved to EMBL-EBI, which allowed for hosting of the website from
one domain (xfam.org), using duplicate independent data centres.
• This allowed for better centralisation of updates, and grouping with other
Xfam projects such as Rfam, TreeFam, iPfam and others, whilst retaining
critical resilience provided by hosting from multiple centres.
• Pfam has undergone a substantial reorganisation over the last two years
to further reduce manual effort involved in curation and allow for more
frequent updates.
The use of Pfam
• The use of Pfam by molecular biologists as a protein
information resource and analysis tool is widespread.
• The multiple sequence alignments around which Pfam
families are built are important for understanding both
protein structure and function.
• The alignments are also the basis for techniques such
as secondary structure prediction, fold recognition,
and phylogenetic analysis and can guide mutation
design.
• In addition to the identification of domains in novel
protein sequences