A Novel Model for DNA Sequence Similarity Analysis Based on Graph Theory
Open Access. Original Research.
Full open access to this and thousands of other papers at http://www.la-press.com.
Xingqin Qi2, Qin Wu1, Yusen Zhang2, Eddie Fuller1 and Cun-Quan Zhang1
1Department of Mathematics, West Virginia University, Morgantown, WV, USA, 26506. 2School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China, 264209.
Corresponding authors email: [email protected]; [email protected]
Abstract: Determination of sequence similarity is one of the major steps in computational phylogenetic studies. During evolutionary history, not only mutations of individual nucleotides but also subsequent rearrangements occurred. It has been one of the major tasks of computational biologists to develop novel mathematical descriptors for similarity analysis such that information about various mutation phenomena is captured simultaneously. In this paper, unlike traditional methods that use, eg, nucleotide frequency or geometric representations as the basis for constructing mathematical descriptors, we construct novel mathematical descriptors based on graph theory. In particular, for each DNA sequence, we set up a weighted directed graph. The adjacency matrix of the directed graph is used to induce a representative vector for the DNA sequence. This new approach measures similarity based on both the ordering and the frequency of nucleotides, so that much more information is involved. As an application, the method is tested on a set of 0.9-kb mtDNA sequences of twelve different primate species. All output phylogenetic trees with various distance estimations have the same topology and are generally consistent with the results reported in earlier studies, which supports the new method's efficiency. We also test the new method on a simulated data set, which shows that it performs better than the traditional global alignment method when subsequent rearrangements happen frequently during evolutionary history.
doi: 10.4137/EBO.S7364
This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.
simulated test is discussed in Section 5; conclusions are made in Section 6.

Construction of Representative Vector for DNA Sequence
The alphabet representation of a DNA sequence is a string of letters A, C, G and T. Assume S = s1 s2 … sn is a DNA sequence of length n, where si ∈ {A, C, G, T}.

… and x_W be the number of loops incident with the vertex W in Gm, respectively. Clearly x_W = n_W (n_W − 1) / 2 for every W ∈ {A, C, G, T}, thus we can get each n_W from x_W. The length of the DNA sequence can be obtained by n = n_A + n_G + n_C + n_T. Note that there is only one arc (W′, W′′) with weight 1 / (n − 1)^α in Gm. Thus, the first nucleotide base in the sequence S is W′. The jth nucleotide base W* in S is determined by the arc (W′, W*) with weight 1 / (j − 1)^α.
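The relation between loop counts and base counts can be checked numerically. The following is a minimal sketch, assuming that the multigraph has one arc for every position pair i < j, so that a loop at W corresponds to two occurrences of W. The function names are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import isqrt

BASES = "ACGT"

def loop_counts(seq):
    """x_W: number of position pairs i < j with s_i = s_j = W, counted directly."""
    x = dict.fromkeys(BASES, 0)
    for a, b in combinations(seq, 2):
        if a == b:
            x[a] += 1
    return x

def base_count_from_loops(x):
    """Invert x = n(n - 1)/2 for an integer n >= 1.

    Caveat: x = 0 is returned as n = 1, although n = 0 also gives x = 0.
    Telling the two apart needs extra information, for example whether
    the vertex W has any incident arc at all.
    """
    disc = 1 + 8 * x
    root = isqrt(disc)
    if root * root != disc:
        raise ValueError("x is not of the form n(n - 1)/2")
    return (1 + root) // 2

x = loop_counts("ACGTATC")
n_w = {w: base_count_from_loops(x[w]) for w in BASES}
n = sum(n_w.values())  # recovered sequence length
```

For S = ACGTATC every base occurs at least once, so the caveat about x_W = 0 does not affect the recovered length.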
Figure 1. Directed multi-graph Gm for S = ACGTATC with α = 1/2.
Figure 2. Simplified graph Gs for S = ACGTATC.
a α is a user specified parameter.
b As an approximation and for practical purpose, we only keep 4 decimals for wm(si, sj).
but we will see later that the simplified graph still contains enough accurate information to characterize DNA sequences.
We call the method of constructing the representative vector for a DNA sequence the directed Euler tour (DET) method.
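The construction of the representative vector can be sketched in a few lines. This is a minimal sketch under two assumptions inferred from the worked example for S = ACGTATC rather than stated in this excerpt: parallel arcs between the same ordered pair of bases are merged by summing their weights 1/(j − i)^α, and the 16 entries are read row by row in the order AA, AC, AG, AT, CA, …, TT. The function name is hypothetical.

```python
from itertools import product

BASES = "ACGT"

def representative_vector(seq, alpha=0.5, digits=4):
    """16-dimensional vector of summed arc weights, ordered AA, AC, ..., TT.

    Every position pair i < j contributes weight 1 / (j - i)**alpha to the
    ordered pair (seq[i], seq[j]); sums are rounded to `digits` decimals.
    The double loop takes O(n^2) time for a sequence of length n.
    """
    weights = {pair: 0.0 for pair in product(BASES, repeat=2)}
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n):
            weights[(seq[i], seq[j])] += 1.0 / (j - i) ** alpha
    return [round(weights[pair], digits) for pair in product(BASES, repeat=2)]

vec = representative_vector("ACGTATC", alpha=0.5)
print(vec[:4])  # entries for AA, AC, AG, AT
```

With α = 1/2 the sketch yields 2.1154 for AC and 2.0246 for AT, matching the corresponding entries of the example vector for S = ACGTATC.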
R_S = [0.5000, 2.1154, 0.7071, 2.0246, 0.5774, 0.4472, 1.0000, 1.2071, 0.7071, 0.5000, 0, 1.5774, 1.0000, 1.5774, 0, 0.7071].

The third distance measurement is based on the correlation coefficient. The linear correlation coefficient r(s, h) between the representative vectors R_s and R_h is calculated with the conventional Pearson formula:

$$
r(s,h) = \frac{K \sum_{i=1}^{K} \vec{R}_s(i)\,\vec{R}_h(i) - \sum_{i=1}^{K} \vec{R}_s(i) \sum_{i=1}^{K} \vec{R}_h(i)}{\sqrt{K \sum_{i=1}^{K} \big(\vec{R}_s(i)\big)^2 - \Big(\sum_{i=1}^{K} \vec{R}_s(i)\Big)^2} \times \sqrt{K \sum_{i=1}^{K} \big(\vec{R}_h(i)\big)^2 - \Big(\sum_{i=1}^{K} \vec{R}_h(i)\Big)^2}}
$$

where K is the dimension of R_s or R_h (here K = 16). Thus we define the third distance measurement as:

d3(s, h) = 1 − r(s, h).

A comparison between a pair of DNA sequences to judge their similarity or dissimilarity can then be carried out by calculating the distances between the corresponding mathematical descriptors. We test the utility of the DET method and the proposed distance measurements in Section 4.

Applications and Experimental Results
Data description
To test the utility of the DET method and the proposed distance measurements, we use the 0.9-kb mtDNA fragments of twelve species from four different groups of primates, which were first reported by Hayasaka24 and subsequently used by Zhang.25,26 The data source consists of four species of old-world monkeys (Macaca fascicularis, Macaca fuscata, Macaca sylvanus, Macaca mulatta), one species of new-world monkeys (Saimiri sciureus), two species of prosimians (Lemur catta, Tarsius syrichta), and five hominoid species (Human, Chimpanzee, Gorilla, Orangutan and Hylobates); for detailed information please see Table 1.

Previous experimental results for these species based on different methods
Hayasaka et al24 calculated the number of nucleotide substitutions for a given pair of species by the six-parameter method. Using these calculations, they gave phylogenetic trees for these twelve species with the same topology under three different grouping methods. Thus the phylogenetic relationships derived from these mtDNA comparisons appear reliable. Zhang et al25,26 also obtained results consistent with ref. 24 using their newly proposed methods for DNA sequence comparison, where only eleven species (all except human) were involved. For the sake of later comparison, we re-construct these previous phylogenetic trees based on the upper triangular part of the dissimilarity matrix reported in the references, see Figure 3.

Selection of the parameter α
In this subsection, we show how to choose the value of α for this data set. Denote the weight function f(l) = 1 / l^α, where l is an integer. Because the maximum value of f(l) for any α is just 1, the arcs with weights not less than 0.1 should be regarded as relatively important. We thus define l0 as the preference distance of the function f(l) when f(l0) ≥ 0.1 while f(l0 + 1) < 0.1. Pairs of nucleotides within l0 are assigned bigger weights (at least 0.1) when we construct the representative vector.
Here we list the preference distance for f(l) with different α. Considering the lengths of the sequences of these twelve species (890–900), when α = 2 or α = 1, l0 is too small; while when α = 1/3 or α = 1/4, l0 is too big. Thus, for this data set, we prefer to use α = 1/2 with l0 = 100. The nucleotides within distance 100 are considered to have stronger interactive relationships. For data sets with very long DNA sequences, one could choose higher order roots to make l0 correspondingly bigger.
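The preference distance can be computed directly from its definition. The sketch below is illustrative only: it assumes α is a positive rational number p/q and the threshold is fixed at 0.1, so that f(l) ≥ 0.1 is equivalent to l^p ≤ 10^q and can be tested with exact integer arithmetic. The function name is hypothetical.

```python
from fractions import Fraction

def preference_distance(alpha):
    """Largest integer l0 with 1 / l0**alpha >= 0.1 (alpha a positive rational).

    For alpha = p/q the condition 1 / l**(p/q) >= 1/10 is equivalent to
    l**p <= 10**q, which avoids floating-point rounding at the boundary.
    """
    alpha = Fraction(alpha)
    p, q = alpha.numerator, alpha.denominator
    if p <= 0:
        raise ValueError("alpha must be positive")
    bound = 10 ** q
    l = 1
    while (l + 1) ** p <= bound:
        l += 1
    return l

for a in (Fraction(2), Fraction(1), Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)):
    print(a, preference_distance(a))
# prints l0 = 3, 10, 100, 1000, 10000 for these five values of alpha
```

Against sequence lengths of roughly 900, l0 = 3 or 10 covers only immediate neighbours, while l0 = 1000 or 10000 exceeds the whole sequence, which is in line with the choice α = 1/2.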
Figure 3. Previous phylogenetic trees for these 12 species based on different methods. (A) Figure 3 in ref. 24; (B) Figure 1 in ref. 26; (C) Figure 1 in ref. 25.

Similarity matrix based on DET
By the DET method of Section 2, each sequence can be represented by a 16-dimensional vector, and the similarity between each pair of these twelve mtDNA fragments can then be computed under the proposed distance measurements. In Table 2, we present the upper triangular part of the similarity matrix among these twelve species by the DET method with weight function f(l) = 1/√l based on the first distance measurement d1.
When we examine Table 2, we notice that the smallest entries are associated with the pairs (Gorilla, Human), (M.fas, M.mul), (M.fus, M.mul), (Chi, Gorilla), (M.fas, M.fus), (Human, Chi) and (M.syl, M.mul). These observations are similar to those reported in previous studies.25,26 They are also consistent with the biological classification in refs. 27 and 28: Gorilla, Chimpanzee and Human are in the same family Hominidae and the same subfamily Homininae; and Macaca fascicularis, Macaca fuscata, Macaca mulatta and Macaca sylvanus are in the same family Cercopithecidae and the same genus Macaca.
We also present the upper triangular part of the similarity matrices based on the second and the third distance measurements in Tables 3 and 4, respectively. There is overall qualitative agreement among the similarities based on these three distinct distance measurements, which provides strong evidence that the DET method works well for DNA representation and comparison.
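The second and third distance measurements can be sketched for a pair of representative vectors as follows. This is a minimal sketch that takes d2 as one minus the cosine of the included angle (following the footnote that defines the DET similarity as 1 − d2) and d3 as one minus the Pearson coefficient. The first measurement d1 is not defined in this part of the text and is therefore left out. Function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """d2 = 1 - cos(angle between u and v)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def pearson_distance(u, v):
    """d3 = 1 - r, with r the Pearson correlation written with K = len(u)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    k = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(a * b for a, b in zip(u, v))
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    denom = math.sqrt(k * suu - su * su) * math.sqrt(k * svv - sv * sv)
    if denom == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - (k * suv - su * sv) / denom

print(cosine_distance([1, 2, 3], [2, 4, 6]))   # close to 0: same direction
print(pearson_distance([1, 2, 3], [3, 2, 1]))  # close to 2: perfectly anti-correlated
```

Both functions return 0 for identical directions or perfectly correlated vectors, so smaller table entries correspond to more similar sequences.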
Table 2. Upper triangular part of the similarity matrix based on the first distance measurement d1.
Species  Lemur   Chi     S.sci   M.fas   Gorilla M.fus   M.mul   M.syl   Hyl     Ora     T.syr   Human
Lemur    0       0.0511  0.0169  0.0358  0.0539  0.0373  0.0327  0.0221  0.0510  0.0702  0.0171  0.0591
Chi              0       0.0528  0.0183  0.0072  0.0171  0.0211  0.0325  0.0179  0.0264  0.0654  0.0098
S.sci                    0       0.0362  0.0545  0.0395  0.0347  0.0286  0.0496  0.0716  0.0201  0.0592
M.fas                            0       0.0210  0.0085  0.0059  0.0172  0.0196  0.0391  0.0488  0.0255
Gorilla                                  0       0.0186  0.0233  0.0354  0.0133  0.0211  0.0679  0.0058
M.fus                                            0       0.0063  0.0181  0.0174  0.0342  0.0514  0.0238
M.mul                                                    0       0.0131  0.0212  0.0397  0.0463  0.0284
M.syl                                                            0       0.0326  0.0509  0.0355  0.0406
Hyl                                                                      0       0.0243  0.0631  0.0169
Ora                                                                              0       0.0841  0.0198
T.syr                                                                                    0       0.0730
Human                                                                                            0
Construction of dendrogram tree
To see the phylogenetic relationships more easily, we use the average linkage clustering method for the phylogenetic tree construction. In Figure 4, the dendrogram trees based on Tables 2–4 are presented, respectively. One can find that the three dendrogram trees of these twelve species have the same topology, which is generally consistent with the previous works in Figure 3.

Simulated Test for DET Method
In a DNA sequence of four letters, there are sixteen possible ordered XY pairs: AA, AC, AT, AG, GC, …, etc. Randić29 introduced a condensed characterization of DNA sequences by (4 × 4) matrices Md that give the count of occurrences of all pairs XY of bases at distance precisely d. For example, when the distance d = 1, the matrix gives the frequencies of all pairs XY in which X and Y are adjacent in the DNA sequence; when the distance d = 2, the matrix gives the frequencies of all pairs XY in which X and Y are separated by one nucleic base. The advantage of such representations of DNA sequences is that they offer, upon inspection, useful information that is hidden in the lengthy DNA sequence.
We should notice that, in Randić's work, information from all matrices Md was not combined for analysis, so not enough information is obtained from such a characterization. In our method, for a DNA sequence of length n, we consider all pairs XY at every possible distance d = 1, 2, …, (n − 1) apart, and assign different weights to the pairs XY according to their locations and distributions. Furthermore, all information on ordered pairs of nucleotides is assembled in one matrix (ie, the adjacency matrix M). So in some sense, the DET method is an extension of Randić's, but it involves more information hidden in the DNA sequence.
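The link between the matrices Md and the adjacency matrix M can be made concrete. The sketch below counts pairs at distance d and then combines all distances into one matrix; it assumes that the weight attached to distance d is 1/d^α and that contributions are summed, which is inferred from the worked example for S = ACGTATC rather than stated in this section. Function names are hypothetical.

```python
BASES = "ACGT"
INDEX = {base: k for k, base in enumerate(BASES)}

def pair_count_matrix(seq, d):
    """M_d: 4 x 4 counts of ordered pairs XY with Y exactly d positions after X."""
    counts = [[0] * 4 for _ in range(4)]
    for i in range(len(seq) - d):
        counts[INDEX[seq[i]]][INDEX[seq[i + d]]] += 1
    return counts

def weighted_pair_matrix(seq, alpha=0.5):
    """Sum over d = 1 .. n-1 of M_d / d**alpha (assumed form of the matrix M)."""
    total = [[0.0] * 4 for _ in range(4)]
    for d in range(1, len(seq)):
        m_d = pair_count_matrix(seq, d)
        weight = 1.0 / d ** alpha
        for r in range(4):
            for c in range(4):
                total[r][c] += weight * m_d[r][c]
    return total

m1 = pair_count_matrix("ACGTATC", 1)
m = weighted_pair_matrix("ACGTATC", alpha=0.5)
print(m1)                  # adjacent pairs only
print(round(m[0][1], 4))   # entry for the pair AC
```

A single Md keeps one distance at a time, whereas the weighted sum keeps every distance but lets close pairs dominate.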
Table 3. Upper triangular part of the similarity matrix based on the second distance measurement d2.
Species  S.sci   Chi     Lemur   M.fas   Gorilla M.fus   M.mul   M.syl   Hyl     Ora     T.syr   Human
S.sci    0       0.0163  0.0016  0.0080  0.0182  0.0087  0.0067  0.0030  0.0163  0.0305  0.0018  0.0219
Chi              0       0.0177  0.0021  0.0003  0.0018  0.0028  0.0066  0.0020  0.0043  0.0269  0.0006
Lemur                    0       0.0084  0.0190  0.0099  0.0076  0.0050  0.0160  0.0321  0.0024  0.0225
M.fas                            0       0.0028  0.0004  0.0002  0.0018  0.0025  0.0094  0.0150  0.0042
Gorilla                                  0       0.0022  0.0034  0.0078  0.0011  0.0027  0.0290  0.0002
M.fus                                            0       0.0002  0.0020  0.0019  0.0072  0.0166  0.0036
M.mul                                                    0       0.0011  0.0028  0.0098  0.0135  0.0051
M.syl                                                            0       0.0066  0.0160  0.0079  0.0103
Hyl                                                                      0       0.0034  0.0252  0.0018
Ora                                                                              0       0.0438  0.0023
T.syr                                                                                    0       0.0336
Human                                                                                            0
Table 4. Upper triangular part of the similarity matrix based on the third distance measurement d3.
Species  S.sci   Chi     Lemur   M.fas   Gorilla M.fus   M.mul   M.syl   Hyl     Ora     T.syr   Human
S.sci    0       0.0749  0.0024  0.0360  0.0840  0.0396  0.0302  0.0136  0.0752  0.1348  0.0082  0.1015
Chi              0       0.0869  0.0098  0.0014  0.0087  0.0134  0.0302  0.0081  0.0183  0.1252  0.0027
Lemur                    0       0.0429  0.0959  0.0473  0.0366  0.0196  0.0851  0.1485  0.0078  0.1142
M.fas                            0       0.0137  0.0016  0.0007  0.0068  0.0123  0.0409  0.0709  0.0205
Gorilla                                  0       0.0103  0.0166  0.0358  0.0046  0.0103  0.1367  0.0011
M.fus                                            0       0.0012  0.0090  0.0076  0.0316  0.0773  0.0171
M.mul                                                    0       0.0044  0.0127  0.0431  0.0629  0.0247
M.syl                                                            0       0.0284  0.0708  0.0357  0.0476
Hyl                                                                      0       0.0109  0.1210  0.0083
Ora                                                                              0       0.1956  0.0086
T.syr                                                                                    0       0.1586
Human                                                                                            0
x2 might be regarded as two distinct related species. The following simulated test is designed to show that, with the DET method, x2 can with high probability still identify x1 as its ancestor even if x2 is an offspring of x1 after many generations.
Constructing the Simulated Data set. This simulated test is designed to test the advantage of the DET method over the alignment method when sequence rearrangements happen frequently. One child genome is copied from the initial parent with the shuffle model, which involves two operations: (i) transposition, which exchanges the positions of two adjacent random sequence fragments, and (ii) reversal, which reverses the order of a random sequence fragment and reinserts it in the same position. To reach more quickly the generations at which the alignment method is no longer useful for an offspring to find its ancestor, we require that the size of the random sequence fragments involved in both operations be at least 0.1n (where n is the length of the genome sequence).
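The two shuffle operations can be sketched as follows. The text fixes only the minimum fragment size of 0.1n, so the remaining choices here are assumptions made for illustration: fragment boundaries are drawn uniformly at random, one of the two operations is picked with equal probability per generation, and a seeded generator is used for reproducibility. All function names are hypothetical.

```python
import random

def min_fragment(n):
    """Smallest allowed fragment length: at least n/10, using integer arithmetic."""
    return max(1, (n + 9) // 10)

def transposition(seq, rng):
    """Swap two adjacent fragments seq[i:j] and seq[j:k], each of length >= 0.1n."""
    n, m = len(seq), min_fragment(len(seq))
    if n < 2 * m:
        raise ValueError("sequence too short for two fragments")
    i = rng.randint(0, n - 2 * m)
    j = rng.randint(i + m, n - m)
    k = rng.randint(j + m, n)
    return seq[:i] + seq[j:k] + seq[i:j] + seq[k:]

def reversal(seq, rng):
    """Reverse one fragment seq[i:j] of length >= 0.1n in place."""
    n, m = len(seq), min_fragment(len(seq))
    i = rng.randint(0, n - m)
    j = rng.randint(i + m, n)
    return seq[:i] + seq[i:j][::-1] + seq[j:]

def generations(root, count, seed=0):
    """Return `count` successive generations derived from `root`."""
    rng = random.Random(seed)
    current, out = root, []
    for _ in range(count):
        operation = rng.choice((transposition, reversal))  # assumed 50/50 choice
        current = operation(current, rng)
        out.append(current)
    return out

root = "ACGT" * 25
descendants = generations(root, 20, seed=1)
selected = [descendants[g - 1] for g in (1, 5, 10, 15, 20)]
```

Both operations only rearrange existing bases, so every generation keeps the length and the base composition of the root sequence; only the ordering information changes.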
Figure 4. Phylogenetic trees for these 12 species based on Tables 2–4 (panel titles recovered from the figure: Cosine, Correlation).
For computational expediency, the 0.9 kb mtDNA sequence of Human (L00016) is chosen as the root ancestor test sequence, and the 0.9 kb mtDNA sequence of Chimpanzee (V00672) is chosen as its "brother"
sequence. We generate up to the 20th generation of Human (L00016) using the shuffle model. For simplicity, we select its 1st, 5th, 10th, 15th and 20th generations to use.
We then compute the similarities, based on both alignment (footnote c) and the DET method (footnote d), between these five generations and Chimpanzee (V00672) and Human (L00016), respectively. Because these five generations are generated randomly, to be fair, we repeat the above process several times. Each time, if all of these five generations have bigger similarities with Human (L00016) than with Chimpanzee (V00672) based on some method, this method gets one score. Let suc_rate = score/l, where l is the total number of tests. Clearly, the method with the higher suc_rate would be regarded as much more robust when shuffling happens frequently.
In the following table we list the suc_rate obtained when we run the above simulated test once for each of the different l values. To our surprise, the suc_rate for the DET method is 1 throughout, while the overall suc_rate for alignment, pooled over all 550 tests, is just 0.2709, which demonstrates the utility and tolerance of the DET method when shuffling happens frequently.

Suc_rate    l = 10  l = 20  l = 30  l = 40  l = 50  l = 60  l = 70  l = 80  l = 90  l = 100
DET method  1       1       1       1       1       1       1       1       1       1
Alignment   0.2000  0.3000  0.4000  0.2000  0.3000  0.2667  0.3857  0.2250  0.2222  0.2500

Concluding Remarks
The graph model presented in this paper is a new method for the mathematical characterization of DNA sequences. When it is used to compare DNA sequences, we get results consistent with previous studies and biological classifications. We have no intention to compete with other existing methods, since, as we know, for any particular research project we will have to identify which measure is most meaningful or useful. The first advantage of our new method is greater efficiency compared with the alignment method when shuffling happens frequently during the evolutionary process; the second advantage of our method is low time complexity compared with other alignment-free methods. The core of the method is the construction of the representative vector for each DNA sequence and then the computation of the similarity matrix for a set of DNA sequences. The representative vector for one DNA sequence of length n can be obtained in O(n^2) time units. Then, for a set of m sequences with maximum length n, the similarity matrix can be computed in O(mn^2) time units. The similarity matrix is relatively simple to calculate. Our novel method is very different from all traditional methods and is shown to be effective and accurate for similarity comparison of DNA sequences.

Acknowledgement
We thank the Editor and the two anonymous referees for their valuable comments, which helped improve this work a lot. This work is supported partly by the Shandong Province Natural Science Foundation of China with No. ZR2010AQ018 and partly by the Independent Innovation Foundation of Shandong University with No. 2010ZRJQ005, and has been partially supported by a WV EPSCoR grant.

Disclosure
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
c Here, we use the Needleman-Wunsch global alignment algorithm to get the alignment of any two DNA sequences, and define the similarity between them as r / N, where r is the number of matches in the alignment and N is the length of the alignment.
d The DET similarity between two sequences is defined as 1 − d2, ie, the cosine value of the included angle between their representative vectors.

References
1. Huynen MA, Bork P. Measuring genome evolution. Proc Natl Acad Sci U S A. 1998;95:5849–56.
2. Wildman DE, et al. Genomics, biogeography, and the diversification of placental mammals. Proc Natl Acad Sci U S A. 2007;104:14395–400.
3. Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A. 1986;83:5155–9.
4. Hamori E, Ruskin J. H curves: a novel method of representation of nucleotide series especially suited for long DNA sequences. Journal of Biological Chemistry. 1983;258:1318–27.
5. Nandy A. A new graphical representation and analysis of DNA sequence structure: I. Methodology and Application to Globin Genes. Curr Sci. 1994;66:309–14.
6. Guo X, Randić M, Basak SC. A novel 2-D graphical representation of DNA sequences of low degeneracy. Chemical Physics Letters. 2001;350:106–12.
7. Randić M, Vraćko M, Lerś N. Novel 2-D graphical representation of DNA sequences and their numerical characterization. Chemical Physics Letters. 2003a;368:1–6.
8. Randić M, Vraćko M, Lerś N, Plavśić D. Analysis of similarity/dissimilarity of DNA sequence based on novel 2-D graphical representation. Chemical Physics Letters. 2003b;371:202–7.
9. Randić M, Vraćko M, Zupan J, Novic M. Compact 2-D graphical representation of DNA. Chemical Physics Letters. 2003c;373:558–62.
10. Randić M. Graphical representations of DNA as 2-D map. Chemical Physics Letters. 2004;386:468–71.
11. Zhang Y, Liao B, Ding K. On 2D graphical representation of DNA sequence of nondegeneracy. Chemical Physics Letters. 2005;411:28–32.
12. Liu X, Dai Q, Xiu Z, Wang T. PNN-curve: a new 2D graphical representation of DNA sequences and its application. Journal of Theoretical Biology. 2006;243:555–61.
13. Huang G, Liao B, Li Y, Liu Z. H curves: a novel 2D graphical representation for DNA sequences. Chemical Physics Letters. 2008;462:129–32.
14. Randić M, Vraćko M, Nandy A, Basak SC. On 3-D graphical representation of DNA primary sequences and their numerical characterization. Journal of Chemical Information and Computer Sciences. 2000;40:1235–44.
15. Liao B, Wang T. 3-D graphical representation of DNA sequences and their numerical characterization. Journal of Molecular Structure (THEOCHEM). 2004;681:209–12.
16. Zhang Y, Liao B, Ding K. On 3D D-curves of DNA sequences. Mol Simul. 2006;32:29–34.
17. Qi X, Wen J, Qi Z. New 3D graphical representation of DNA sequence based on dual nucleotides. Journal of Theoretical Biology. 2007;249:681–90.
18. Qi Z, Fan T. PN-curve: a 3D graphical representation of DNA sequences and their numerical characterization. Chemical Physics Letters. 2007;442:434–40.
19. Cao Z, Liao B, Li R. A group of 3D graphical representation of DNA sequences based on dual nucleotides. International Journal of Quantum Chemistry. 2008;108:1485–90.
20. Yu J, Sun X, Wang J. TN curve: A novel 3D graphical representation of DNA sequence based on trinucleotides and its applications. Journal of Theoretical Biology. 2009;261:459–68. doi:10.1016/j.jtbi.2009.08.005.
21. Chi R, Ding K. Novel 4D numerical representation of DNA sequences. Chemical Physics Letters. 2005;407:63–7.
22. Liao B, Li R, Zhu W. On the similarity of DNA primary sequences based on 5-D representation. Journal of Mathematical Chemistry. 2007;42:47–57.
23. Liao B, Wang T. Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping trinucleotides of nucleotide bases. Journal of Chemical Information and Computer Sciences. 2004;44:1666–70.
24. Hayasaka K, Gojobori T, Horai S. Molecular phylogeny and evolution of primate mitochondrial DNA. Mol Biol Evol. 1988;5:626–44.
25. Zhang Y, Chen W. New invariant of DNA sequences. Match Commun Comput Chem. 2007;58:197–208.
26. Zhang Y. A simple method to construct the similarity matrices of DNA sequence. Match Commun Comput Chem. 2008;60:313–24.
27. Goodman M, et al. Primate evolution at the DNA level and a classification of hominoids. Journal of Molecular Evolution. 1990;30:260–6.
28. Groves C, Wilson DE, Reeder DM, editors. Mammal Species of the World (3rd ed.), Johns Hopkins University Press; 2005:161–5.
29. Randić M. Condensed representation of DNA primary sequences. Journal of Chemical Information and Computer Sciences. 2000;40:50–6.