-
Notifications
You must be signed in to change notification settings - Fork 49
/
RepeatPeps.readme
114 lines (94 loc) · 5.42 KB
/
RepeatPeps.readme
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
Transposable element protein database
=====================================
The transposable element protein database now contains 18,011
predicted proteins, with a combined length of 16.1 million amino acids.
A breakdown of the main classes is given in the following table:
(sub)classes proteins AA (x 1000)
LTR elements
Gypsy 1 3530 3716
Copia 1 2116 2847
ERV 7 975 684
Pao 1 592 1036
DIRS 1 424 288
Ngaro 1 86 42
Other 3 62 65
DNA
TcMar 17 1127 464
hAT 15 824 565
MULE 4 581 370
PIF 4 421 168
CMC 5 376 306
Maverick 1 349 152
PiggyBac 3 195 145
Crypton 9 141 84
Sola 3 133 97
P-element 2 110 91
Zisupton 1 85 75
Merlin 1 80 22
Kolobok 4 70 43
Academ 3 44 48
Dada 1 34 36
Ginger 2 31 20
IS3EU 1 23 13
Other 3 18 15
LINEs
L1 4 1701 1461
L2 1 695 498
CR1 2 699 529
I/Jockey 2 759 636
RTE 4 219 211
Penelope 1 185 123
CRE 3 65 67
Other 14 1107 1009
Rolling Circle 2 124 182
Other 6 30 22
About 1102 of these proteins are entries extracted from the nr
GenBank databases. Most of the remaining peptides are translations of
interspersed repeat consensus sequences. Many of the coding elements in
repeat libraries contain frame shifts and stop codons reflecting errors
in the consensus. In order to represent these elements currently some
( 1500 entries ) interrupted coding regions have been translated by
using TFASTY or GeneWise, guided by the closest related proteins in
the database.
Very similar proteins are represented by only one sequence (e.g. there
is only one HIV-1 pol entry). A cut-off of 90% similarity and/or 85%
identity has been used, but if many differences appear to be due to
ambiguities or errors, more distant matching proteins have been excluded.
The protein database is still very much under development but functions
fine for the purpose of masking repetitive DNA that could give rise to
spurious matches in translated database searches. As most proteins are
classified to the level RepeatMasker classifies transposable elements,
it is also already useful for classifying transposable elements. This
will be its central role in the interspersed repeat identification
program that we are developing. In the future phylogenetic comparisons
of transposable elements based on their coding regions can be performed
using this database, though we will need to (and will) curate the database
significantly before this can be done properly.
Tandem repeats and low complexity DNA
=====================================
Prior to being compared with WU_BLASTX to the transposable element protein
database, the query is checked for the presence of tandem repeats using
Tandem Repeat Finder followed by the simple repeat / low complexity
algorithm of RepeatMasker. This will avoid false annotations in the
comparison against the transposable element proteins. The button allows
you to turn off the masking/annotating of low complexity and simple
repeats in the final output. Low complexity and simple repeat analysis
will still occur prior to looking for matches to the RepeatPep database.
False positives
===============
We have run RepeatProteinMasker on several megabases of genomic DNA
and compared it to RepeatMasker output to identify false positive
matches. This number could be reduced to zero by eliminating certain
low-complexity proteins from the database and reporting matches based on
a complexity adjusted alignment score, very similar to that implemented
in RepeatMasker. False matches to inverse, non-complimented genomic DNA
are also absent, except for an occasional match in very (>70%) GC-rich
DNA (no such cases were seen with a score above 30).
A surprisingly large number of genes are derived from transposable
element proteins. In 2001 I had identified 50 in the human genome and
it is more than likely that other genomes are very similar if not more
so, considering that most other genomes are exposed to a much larger
variety of transposable elements and DNA transposons in general. So,
be aware that some genes may be (partially) masked.
Developed by Arian Smit
Institute for Systems Biology