Libraries/RepeatPeps.readme


Transposable element protein database
=====================================

The transposable element protein database now contains 18,011
predicted proteins, with a combined length of 16.1 million amino acids. 

A breakdown of the main classes is given in the following table:

                   (sub)classes    proteins     AA (x 1000)

LTR elements
  Gypsy                   1           3530           3716
  Copia                   1           2116           2847
  ERV                     7            975            684
  Pao                     1            592           1036
  DIRS                    1            424            288
  Ngaro                   1             86             42
  Other                   3             62             65

DNA
  TcMar                  17           1127            464
  hAT                    15            824            565
  MULE                    4            581            370
  PIF                     4            421            168
  CMC                     5            376            306
  Maverick                1            349            152
  PiggyBac                3            195            145
  Crypton                 9            141             84
  Sola                    3            133             97
  P-element               2            110             91
  Zisupton                1             85             75
  Merlin                  1             80             22
  Kolobok                 4             70             43
  Academ                  3             44             48
  Dada                    1             34             36
  Ginger                  2             31             20
  IS3EU                   1             23             13
  Other                   3             18             15

LINEs
  L1                      4           1701           1461
  L2                      1            695            498
  CR1                     2            699            529
  I/Jockey                2            759            636
  RTE                     4            219            211
  Penelope                1            185            123
  CRE                     3             65             67
  Other                  14           1107           1009

Rolling Circle            2            124            182

Other                     6             30             22


About 1102 of these proteins are entries extracted from the nr
GenBank databases. Most of the remaining peptides are translations of
interspersed repeat consensus sequences. Many of the coding elements in
repeat libraries contain frame shifts and stop codons reflecting errors
in the consensus. In order to represent these elements currently some
( 1500 entries ) interrupted coding regions have been translated by 
using TFASTY or GeneWise, guided by the closest related proteins in 
the database.

Very similar proteins are represented by only one sequence (e.g. there
is only one HIV-1 pol entry). A cut-off of 90% similarity and/or 85%
identity has been used, but if many differences appear to be due to
ambiguities or errors, more distant matching proteins have been excluded.

The protein database is still very much under development but functions
fine for the purpose of masking repetitive DNA that could give rise to
spurious matches in translated database searches.  As most proteins are
classified to the level RepeatMasker classifies transposable elements,
it is also already useful for classifying transposable elements. This
will be its central role in the interspersed repeat identification
program that we are developing. In the future phylogenetic comparisons
of transposable elements based on their coding regions can be performed
using this database, though we will need to (and will) curate the database
significantly before this can be done properly.

Tandem repeats and low complexity DNA
=====================================

Prior to being compared with WU_BLASTX to the transposable element protein
database, the query is checked for the presence of tandem repeats using
Tandem Repeat Finder followed by the simple repeat / low complexity
algorithm of RepeatMasker. This will avoid false annotations in the
comparison against the transposable element proteins. The button allows
you to turn off the masking/annotating of low complexity and simple
repeats in the final output. Low complexity and simple repeat analysis
will still occur prior to looking for matches to the RepeatPep database.

False positives
===============
We have run RepeatProteinMasker on several megabases of genomic DNA
and compared it to RepeatMasker output to identify false positive
matches. This number could be reduced to zero by eliminating certain
low-complexity proteins from the database and reporting matches based on
a complexity adjusted alignment score, very similar to that implemented
in RepeatMasker. False matches to inverse, non-complimented genomic DNA
are also absent, except for an occasional match in very (>70%) GC-rich
DNA (no such cases were seen with a score above 30).

A surprisingly large number of genes are derived from transposable
element proteins. In 2001 I had identified 50 in the human genome and
it is more than likely that other genomes are very similar if not more
so, considering that most other genomes are exposed to a much larger
variety of transposable elements and DNA transposons in general. So,
be aware that some genes may be (partially) masked.


Developed by Arian Smit
Institute for Systems Biology