A Novel Model For Dna Sequence Similarity Analysis Based On Graph Theory

This paper presents a novel method for DNA sequence similarity analysis using graph theory, specifically by constructing a weighted directed graph for each DNA sequence. The approach generates a representative vector from the graph's adjacency matrix, allowing for a more comprehensive measurement of similarity that accounts for both nucleotide ordering and frequency. The method has been tested on mitochondrial DNA sequences from primates, demonstrating efficiency and consistency with traditional results while outperforming conventional alignment methods in scenarios with frequent rearrangements.


Evolutionary Bioinformatics

Original Research

Open Access: full open access to this and thousands of other papers at http://www.la-press.com.

A Novel Model for DNA Sequence Similarity Analysis Based on Graph Theory

Xingqin Qi2, Qin Wu1, Yusen Zhang2, Eddie Fuller1 and Cun-Quan Zhang1
1 Department of Mathematics, West Virginia University, Morgantown, WV, USA, 26506.
2 School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China, 264209.
Corresponding authors email: [email protected]; [email protected]

Abstract: Determining sequence similarity is one of the major steps in computational phylogenetic studies. Over evolutionary history, DNA has undergone not only mutations of individual nucleotides but also subsequence rearrangements. A major task for computational biologists is therefore to develop mathematical descriptors for similarity analysis that capture these different kinds of mutation simultaneously. In this paper, rather than building descriptors on traditional bases (eg, nucleotide frequency or geometric representations), we construct novel mathematical descriptors based on graph theory. In particular, we set up a weighted directed graph for each DNA sequence and use the adjacency matrix of this graph to induce a representative vector for the sequence. This approach measures similarity based on both the ordering and the frequency of nucleotides, so that much more information is involved. As an application, the method is tested on a set of 0.9-kb mtDNA sequences of twelve different primate species. All output phylogenetic trees, under various distance estimations, have the same topology and are generally consistent with the results reported in earlier studies, which demonstrates the new method's efficiency. We also test the new method on a simulated data set, which shows that it performs better than the traditional global alignment method when subsequence rearrangements happen frequently during evolutionary history.

Keywords: DNA sequence, mathematical descriptor, similarity analysis, weighted graph

Evolutionary Bioinformatics 2011:7 149–158

doi: 10.4137/EBO.S7364

This article is available from http://www.la-press.com.

© the author(s), publisher and licensee Libertas Academica Ltd.

This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.


Introduction

The number of DNA sequences in DNA databases is rapidly increasing, and analyzing the large volume of genomic DNA sequence data is one of the challenges facing bio-scientists. Many schemes have been proposed to numerically characterize DNA sequences and analyze their similarities.

Sequence alignment has been frequently used as a powerful tool to compare two closely related genomes at the base-by-base nucleotide sequence level. This method is mainly based on the orderings of nucleotides appearing in the sequence. But with the divergence of species over time, subsequence rearrangements occurring during evolution make sequence alignment similarity scores less reliable. Improvements1,2 have been proposed to overcome the difficulty, but most of them rely on the correct definition and selection of common genes to be compared, and on significant homology among the aligned gene sequences.

Blaisdell3 introduced a measure of similarity of sets of sequences not requiring sequence alignment. It was the first use of feature (l-mer) counts for biological sequence comparison. Geometric representations of DNA sequences have been regarded as another powerful alignment-free tool for the analysis of DNA sequences since Hamori and Ruskin4 first proposed a 3D geometric representation for DNA sequences. This methodology always starts with a graphical representation of the DNA sequence, which could be based on 2D,5–13 3D,14–20 4D,21 5D,22 or 6D23 spaces, and represents DNA as matrices associated with the selected geometrical objects; vectors composed of the invariants of these matrices are then used to compare DNA sequences.

The sequence alignment method is mainly based on the orderings of nucleotides appearing in the sequence. But with the divergence of species over time, subsequence rearrangements (eg, reversal, transposition or block-exchange) occurring during evolution make sequence alignment similarity scores less reliable. Feature count methods focus only on the frequencies of appearance (l-mers), which loses significant amounts of information. Geometric representation schemes have an advantage in that they offer an instant, though visual and qualitative, summary of lengthy DNA sequences. But this approach also involves many unresolved questions, for example, how to obtain suitable matrices to characterize DNA sequences and how to select invariants suitable for sequence comparisons. Another difficulty we must face is that the calculation of the matrices or the invariants becomes more and more difficult as the length of the sequences grows.

One of the major challenges for computational biologists is therefore to develop alignment-free methods of low time complexity that measure sequence similarity properly, taking into account not only single-nucleotide mutations but also subsequence rearrangements.

In this paper, we introduce a novel method, based on graph theory, to represent DNA sequences mathematically for similarity analysis. In particular, for each DNA sequence, we set up a weighted directed graph, whose adjacency matrix gives us a representative vector. Three distance measurements for representative vectors are then defined to assess the similarity/dissimilarity of DNA sequences. As an application, the method is tested on a set of 0.9-kb mtDNA sequences of twelve different primate species, and the output phylogenetic trees based on these three distance measurements have the same topology and are all generally consistent with the results reported in previous studies.24–26 Furthermore, to show its robustness and tolerance to rearrangements of DNA subsequences, we test it on a synthetic data set, showing that under our method an offspring after various generations can still find its original ancestor with high probability. This method is significantly different from all traditional methodologies and is a promising approach for future studies.

The paper is organized as follows. In Section 2, we describe the method of constructing the weighted directed graph and the representative vector for a given DNA sequence; in Section 3, three distance measurements are introduced to assess the similarity/dissimilarity of DNA sequences; the experimental results for 0.9-kb mtDNA sequences of twelve different primate species are presented in Section 4; and the


simulated test is discussed in Section 5; conclusions are made in Section 6.

Construction of Representative Vector for DNA Sequence

The alphabet representation of a DNA sequence is a string of the letters A, C, G and T. Assume $S = s_1 s_2 \ldots s_n$ is a DNA sequence of length $n$, where $s_i \in \{A, C, G, T\}$.

Directed multi-graph

We will show how to construct the weighted directed multi-graph for $S = s_1 s_2 \ldots s_n$, which is denoted by $G_m = (V(G_m), A(G_m))$. The vertex set is $V(G_m) = \{A, C, G, T\}$. For each pair of nucleotides $s_i$ and $s_j$ in $S$ with $i < j$, put an arc from $s_i$ to $s_j$, and define the weight of the arc as $1/(j-i)^{\alpha}$, where $\alpha > 0$ is a user-specified parameter, so that $1/(j-i)^{\alpha}$ is a decreasing function of $(j-i)$. This reflects the fact that two nucleotides at a smaller distance have a stronger interactive relationship than those at a bigger distance.

An example with parameter $\alpha = 1/2$ is illustrated in Figure 1. (As an approximation and for practical purposes, we only keep 4 decimals for $w_m(s_i, s_j)$.)

Figure 1. Directed multi-graph $G_m$ for $S$ = ACGTATC with $\alpha = 1/2$.

Theorem 1. There is a one-to-one mapping between a DNA sequence $S$ and its corresponding weighted directed multi-graph $G_m$.

Proof. It is sufficient to show that we can get only one DNA sequence from the graph $G_m$. Let $n_W$ be the number of occurrences of the nucleotide base $W \in \{A, C, G, T\}$ in the DNA sequence and $x_W$ be the number of loops incident with the vertex $W$ in $G_m$, respectively. Clearly $x_W = n_W (n_W - 1)/2$ for every $W \in \{A, C, G, T\}$, thus we can get each $n_W$ from $x_W$. The length of the DNA sequence can be obtained by $n = n_A + n_G + n_C + n_T$. Note that there is only one arc $(W', W'')$ with weight $1/(n-1)^{\alpha}$ in $G_m$. Thus, the first nucleotide base in the sequence $S$ is $W'$. The $j$th nucleotide base $W^*$ in $S$ is determined by the arc $(W', W^*)$ with weight $1/(j-1)^{\alpha}$.

The simplified weighted directed graph

$G_m$ is a directed multi-graph. That is, there may be parallel arcs from one vertex to another. In the following, we will simplify $G_m$ to $G_s$ by merging parallel arcs into one arc.

Let the vertex set $V(G_s) = V(G_m)$. Denote by $A^m_{u,v}$ the set of all arcs from the vertex $u$ to $v$ in $G_m$; for any pair of vertices $u$ and $v$, if $A^m_{u,v} \neq \emptyset$, put an arc $(u, v)$ from $u$ to $v$ in $G_s$, and assign the weight of the arc $(u, v)$ in $G_s$ as

$$w_s(u, v) = \sum_{(u,v) \in A^m_{u,v}} w_m(u, v), \qquad A^m_{u,v} \neq \emptyset.$$

Based on this simplification rule, the directed multi-graph in Figure 1 is simplified and illustrated in Figure 2.

Figure 2. Simplified graph $G_s$ for $S$ = ACGTATC.

Note that the one-to-one mapping between a DNA sequence $S$ and the simplified graph does not exist then, which is also the source of error of our strategy,


but we will see later that the simplified graph still contains enough accurate information to characterize DNA sequences.

The representative vector

From the above subsections, we get one weighted directed graph associated with a DNA sequence. The weighted directed graph $G_s$ corresponds to a $(4 \times 4)$ adjacency matrix $M$, which is defined as follows:

$$M = \begin{pmatrix}
w_s(A,A) & w_s(A,C) & w_s(A,G) & w_s(A,T) \\
w_s(C,A) & w_s(C,C) & w_s(C,G) & w_s(C,T) \\
w_s(G,A) & w_s(G,C) & w_s(G,G) & w_s(G,T) \\
w_s(T,A) & w_s(T,C) & w_s(T,G) & w_s(T,T)
\end{pmatrix}$$

Then we rewrite matrix $M$ as one 16-dimensional vector $\vec{R}$ in row order,

$$\vec{R}^{T} = [\, w_s(A,A), \ldots, w_s(A,T),\ w_s(C,A), \ldots, w_s(C,T),\ \ldots,\ w_s(T,A), \ldots, w_s(T,T) \,].$$

We call this 16-dimensional vector the representative vector of a DNA sequence. We admit that there is a loss of information when one condenses the sequence $S$ to a 16-dimensional vector, but we will see later that it is still enough to make comparisons between DNA sequences.

For the given example of $S$ = ACGTATC, when the weight function is $f(l) = 1/\sqrt{l}$, the $(4 \times 4)$-matrix and the 16-dimensional vector are as follows.

$$M = \begin{pmatrix}
0.5000 & 2.1154 & 0.7071 & 2.0246 \\
0.5774 & 0.4472 & 1.0000 & 1.2071 \\
0.7071 & 0.5000 & 0 & 1.5774 \\
1.0000 & 1.5774 & 0 & 0.7071
\end{pmatrix}$$

$$\vec{R}_S = [\, 0.5000, 2.1154, 0.7071, 2.0246,\ 0.5774, 0.4472, 1.0000, 1.2071,\ 0.7071, 0.5000, 0, 1.5774,\ 1.0000, 1.5774, 0, 0.7071 \,].$$

We call the method of constructing the representative vector for a DNA sequence the directed euler tour (DET) method.

Three Distance Measurements for Similarity Calculation

In the above section, we obtained a mapping from a set of DNA sequences to a set of vectors in the 16-dimensional linear space by the DET method. Comparison between DNA sequences becomes comparison between these 16-dimensional vectors. We will introduce three popular ways of defining the distance between two 16-dimensional vectors to reflect the dissimilarity of the two corresponding DNA sequences. The smaller the distance is, the more similar the two sequences are. For two DNA sequences $s$ and $h$, we denote the representative vectors by $\vec{R}_s$ and $\vec{R}_h$ respectively.

The first distance measurement $d_1(s, h)$ is defined to be the Euclidean distance between the end points of $\vec{R}_s$ and $\vec{R}_h$, which is based on the assumption that two DNA sequences are similar if the corresponding 16-vectors have similar magnitudes, ie,

$$d_1(s, h) = \sqrt{\sum_{i=1}^{16} \left( \vec{R}_s(i) - \vec{R}_h(i) \right)^2}.$$

The second distance measurement $d_2(s, h)$ between $s$ and $h$ is defined to be one minus the cosine of the included angle between $\vec{R}_s$ and $\vec{R}_h$, which is based on the assumption that two DNA sequences are similar if the corresponding 16-dimensional vectors in the 16-dimensional space have similar directions, ie,

$$d_2(s, h) = 1 - \cos(\vec{R}_s, \vec{R}_h) = 1 - \frac{\sum_{i=1}^{16} \vec{R}_s(i) \cdot \vec{R}_h(i)}{\sqrt{\sum_{i=1}^{16} (\vec{R}_s(i))^2} \cdot \sqrt{\sum_{i=1}^{16} (\vec{R}_h(i))^2}}.$$

The third distance measurement is based on the correlation coefficient. The calculation of the linear correlation coefficient $r(s, h)$ between $\vec{R}_s$ and $\vec{R}_h$ uses


the conventional Pearson formalism as detailed in the following:

$$r(s, h) = \frac{K \sum_{i=1}^{K} \left[ \vec{R}_s(i) \cdot \vec{R}_h(i) \right] - \sum_{i=1}^{K} \vec{R}_s(i) \cdot \sum_{i=1}^{K} \vec{R}_h(i)}{\sqrt{K \sum_{i=1}^{K} (\vec{R}_s(i))^2 - \left( \sum_{i=1}^{K} \vec{R}_s(i) \right)^2} \times \sqrt{K \sum_{i=1}^{K} (\vec{R}_h(i))^2 - \left( \sum_{i=1}^{K} \vec{R}_h(i) \right)^2}}$$

where $K$ is the dimension of $\vec{R}_s$ or $\vec{R}_h$ (here $K = 16$). Thus we define the third distance measurement as:

$$d_3(s, h) = 1 - r(s, h).$$

Then a comparison between a pair of DNA sequences to judge their similarity or dissimilarity can be carried out by calculating the distances between the corresponding mathematical descriptors. We will test the utility of the DET method and the proposed distance measurements in Section 4.

Applications and Experimental Results

Data description

To test the utility of the DET method and the proposed distance measurements, we use the 0.9-kb mtDNA fragments of twelve species from four different groups of primates, which were first reported by Hayasaka24 and subsequently used by Zhang.25,26 The data source consists of four species of old-world monkeys (Macaca fascicularis, Macaca fuscata, Macaca sylvanus, Macaca mulatta), one species of new-world monkeys (Saimiri sciureus), two species of prosimians (Lemur catta, Tarsius syrichta), and five hominoid species (Human, Chimpanzee, Gorilla, Orangutan and Hylobates); for detailed information please see Table 1.

Previous experimental results for these species based on different methods

Hayasaka et al24 calculated the number of nucleotide substitutions for a given pair of species by the six-parameter method. Using these calculated numbers, they gave phylogenetic trees for these twelve species with the same topology under three different grouping methods. Thus the phylogenetic relationships derived from these mtDNA comparisons appear reliable. Zhang et al25,26 also obtained results consistent with Ref. 24 based on their newly proposed methods for DNA sequence comparison, where only eleven species (all except human) were involved. For the sake of later comparison, we re-construct these previous

Table 1. 0.9-kb mtDNA fragments of 12 species.

| Species | ID/accession | Abbreviation | Length (bp) | Database |
|---|---|---|---|---|
| Macaca fascicularis | M22653 | M.fas | 896 | NCBI |
| Macaca fuscata | M22651 | M.fus | 896 | NCBI |
| Macaca mulatta | M22650 | M.mul | 896 | NCBI |
| Macaca sylvanus | M22654 | M.syl | 896 | NCBI |
| Saimiri sciureus | M22655 | S.sci | 893 | NCBI |
| Chimpanzee | V00672 | Chi | 896 | NCBI |
| Lemur catta | M22657 | Lemur | 895 | NCBI |
| Gorilla | V00658 | Gorilla | 896 | NCBI |
| Hylobates | V00659 | Hyl. | 896 | NCBI |
| Orangutan | V00675 | Ora | 895 | NCBI |
| Tarsius syrichta | M22656 | T.syr | 895 | NCBI |
| Human | L00016 | Human | 896 | NCBI |


phylogenetic trees based on the upper triangular part of the dissimilarity matrices reported in the references; see Figure 3.

Selection of the parameter α

In this subsection, we show how to choose the value of $\alpha$ for this data set. Denote the weight function $f(l) = 1/l^{\alpha}$, where $l$ is a positive integer. Because the maximum value of $f(l)$ for any $\alpha$ is just 1, the arcs with weights not less than 0.1 should be thought of as relatively important. We thus define $l_0$ as the preference distance of the function $f(l)$ when $f(l_0) \geq 0.1$ while $f(l_0 + 1) < 0.1$. Pairs of nucleotides within distance $l_0$ are assigned bigger weights (at least 0.1) when we construct the representative vector.

Here we list the preference distance for $f(l)$ with different $\alpha$. Considering the lengths of the sequences of these twelve species (890 ∼ 900), when $\alpha = 2$ or $\alpha = 1$, $l_0$ is too small, while when $\alpha = 1/3$ or $\alpha = 1/4$, $l_0$ is too big. Thus, for this data set, we prefer to use $\alpha = 1/2$ with $l_0 = 100$. Nucleotides within distance 100 are then considered to have stronger interactive relationships. But we admit that, for data sets with very long DNA sequences, one could choose higher-order roots to make $l_0$ correspondingly bigger.

| | α = 2 | α = 1 | α = 1/2 | α = 1/3 | α = 1/4 |
|---|---|---|---|---|---|
| l0 | 3 | 10 | 100 | 1000 | 10000 |
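
The preference distance defined above can be checked numerically. The following is a minimal sketch, not code from the paper: the function name `preference_distance` and the tolerance `eps` are assumptions introduced here, and the loop simply searches for the largest integer $l_0$ with $f(l_0) \geq 0.1$ for $f(l) = 1/l^{\alpha}$ and $\alpha > 0$.

```python
def preference_distance(alpha, threshold=0.1, eps=1e-12):
    """Largest integer l0 with f(l0) >= threshold for f(l) = 1 / l**alpha.

    Requires alpha > 0, otherwise f never drops below the threshold.
    eps is a small tolerance (an assumption of this sketch) so that
    floating-point rounding in l**alpha does not exclude the boundary
    case f(l0) == threshold.
    """
    l0 = 1
    while 1.0 / (l0 + 1) ** alpha >= threshold - eps:
        l0 += 1
    return l0


# The five exponents considered in the text.
for alpha in (2, 1, 1 / 2, 1 / 3, 1 / 4):
    print(alpha, preference_distance(alpha))
```

For the five exponents considered in the text, this search gives 3, 10, 100, 1000 and 10000, the values listed for $l_0$.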
Figure 3. Previous phylogenetic trees for these 12 species based on different methods. (A) Figure 3 in Ref. 24; (B) Figure 1 in Ref. 26; (C) Figure 1 in Ref. 25. Panel A labels T.syr as Tarsier, S.sci as Squirrel monkey, Hyl. as Gibbon, M.syl as Barbary macaque, M.fas as Crab-eating macaque, M.fus as Japanese macaque and M.mul as Rhesus macaque.

Similarity matrix based on DET

By the DET method of Section 2, each sequence can be represented by a 16-dimensional vector, and the similarities between each pair of these twelve mtDNA fragments can then be computed under the proposed distance measurements. In Table 2, we present the upper triangular part of the similarity matrix among these twelve species obtained by the DET method with weight function $f(l) = 1/\sqrt{l}$, based on the first distance measurement $d_1$.

When we examine Table 2, we notice that the smallest entries are associated with the pairs (Gorilla, Human), (M.fas, M.mul), (M.fus, M.mul), (Chi, Gorilla), (M.fas, M.fus), (Human, Chi) and (M.syl, M.mul). These observations are similar to those reported in previous studies.25,26 They are also consistent with the biological classification in Refs. 27 and 28, in which Gorilla, Chimpanzee and Human are in the same family Hominidae and the same subfamily Homininae, and Macaca fascicularis, Macaca fuscata, Macaca mulatta and Macaca sylvanus are in the same family Cercopithecidae and the same genus Macaca.

We also present the upper triangular part of the similarity matrices based on the second and the third distance measurements in Tables 3 and 4 respectively. We will see that there is overall qualitative agreement among the similarities based on these three distinct distance measurements. This provides strong evidence that the DET method works well for DNA representation and comparison.
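
The construction of Section 2 and the three distances of Section 3 can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names are assumptions introduced here, the input is assumed to contain only the letters A, C, G and T, and no rounding of individual arc weights is applied.

```python
import math

BASES = "ACGT"  # row/column order of the 4 x 4 matrix M


def representative_vector(seq, alpha=0.5):
    """Return the 16 entries of M in row order for a string over A, C, G, T.

    M[u][v] accumulates 1 / (j - i)**alpha over all positions i < j with
    seq[i] == u and seq[j] == v, i.e. the merged arc weights w_s(u, v).
    The double loop visits every ordered pair once, so it is quadratic in n.
    """
    index = {base: k for k, base in enumerate(BASES)}
    m = [[0.0] * 4 for _ in range(4)]
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            m[index[seq[i]]][index[seq[j]]] += 1.0 / (j - i) ** alpha
    return [w for row in m for w in row]


def d1(r, h):
    """Euclidean distance between two representative vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, h)))


def d2(r, h):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(r, h))
    norms = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in h))
    return 1.0 - dot / norms


def d3(r, h):
    """One minus the Pearson correlation coefficient of the entries.

    Undefined (division by zero) if either vector has all entries equal.
    """
    k = len(r)
    sum_r, sum_h = sum(r), sum(h)
    num = k * sum(a * b for a, b in zip(r, h)) - sum_r * sum_h
    den = (math.sqrt(k * sum(a * a for a in r) - sum_r ** 2)
           * math.sqrt(k * sum(b * b for b in h) - sum_h ** 2))
    return 1.0 - num / den


example = representative_vector("ACGTATC")
print([round(w, 4) for w in example])
```

Rounded to four decimals, the vector printed for ACGTATC agrees with the example vector given in Section 2. Whether the published tables were computed from raw or rescaled vectors is not stated in the text, so this sketch makes no attempt to reproduce the table entries.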


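Table 2. The upper triangular part of similarity/dissimilarity matrix based on d1.

| Species | Lemur | Chi | S.sci | M.fas | Gorilla | M.fus | M.mul | M.syl | Hyl | Ora | T.syr | Human |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lemur | 0 | 0.0511 | 0.0169 | 0.0358 | 0.0539 | 0.0373 | 0.0327 | 0.0221 | 0.0510 | 0.0702 | 0.0171 | 0.0591 |
| Chi | | 0 | 0.0528 | 0.0183 | 0.0072 | 0.0171 | 0.0211 | 0.0325 | 0.0179 | 0.0264 | 0.0654 | 0.0098 |
| S.sci | | | 0 | 0.0362 | 0.0545 | 0.0395 | 0.0347 | 0.0286 | 0.0496 | 0.0716 | 0.0201 | 0.0592 |
| M.fas | | | | 0 | 0.0210 | 0.0085 | 0.0059 | 0.0172 | 0.0196 | 0.0391 | 0.0488 | 0.0255 |
| Gorilla | | | | | 0 | 0.0186 | 0.0233 | 0.0354 | 0.0133 | 0.0211 | 0.0679 | 0.0058 |
| M.fus | | | | | | 0 | 0.0063 | 0.0181 | 0.0174 | 0.0342 | 0.0514 | 0.0238 |
| M.mul | | | | | | | 0 | 0.0131 | 0.0212 | 0.0397 | 0.0463 | 0.0284 |
| M.syl | | | | | | | | 0 | 0.0326 | 0.0509 | 0.0355 | 0.0406 |
| Hyl | | | | | | | | | 0 | 0.0243 | 0.0631 | 0.0169 |
| Ora | | | | | | | | | | 0 | 0.0841 | 0.0198 |
| T.syr | | | | | | | | | | | 0 | 0.0730 |
| Human | | | | | | | | | | | | 0 |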

Construction of dendrogram tree

To see the phylogenetic relationships more easily, we use the average linkage clustering method for the phylogenetic tree construction. In Figure 4, the dendrogram trees based on Tables 2–4 are presented respectively. One can find that the three dendrogram trees of these twelve species have the same topology, which is generally consistent with the previous works in Figure 3.

Simulated Test for DET Method

In a DNA sequence of four letters, there are sixteen possible ordered XY pairs: AA, AC, AT, AG, GC, …, etc. Randić29 introduced a condensed characterization of DNA sequences by $(4 \times 4)$-matrices $M_d$ that give the count of occurrences of all pairs XY of bases at distance precisely $d$. For example, when the distance $d = 1$, the matrix gives the frequencies of all such pairs XY in which X and Y are adjacent in the DNA sequence; when the distance $d = 2$, the matrix gives the frequencies of all such pairs XY in which X and Y are separated by one nucleic base. The advantage of such representations of DNA sequences is that they offer, upon inspection, useful information that is hidden in the lengthy sequence of the DNA.

We should notice that, in Randić's work, the information from all matrices $M_d$ was not combined together for analysis, thus not enough information is obtained from such a characterization. In our method, for a DNA sequence of length $n$, we consider all pairs XY at every possible distance $d = 1, 2, \ldots, (n - 1)$ apart, and assign different weights to each pair XY according to their locations and distributions. Furthermore, all information on ordered pairs of nucleotides is assembled in one matrix (ie, the adjacency matrix $M$). So in some sense, the DET method is an extension of Randić's, but it involves more information hidden in the DNA sequence.
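
This relationship can be written out explicitly in the notation of Section 2. The entrywise notation $M_d(X, Y)$ is an assumption introduced here for illustration and is not taken from Randić's paper. Grouping the arcs from X to Y by their distance $d = j - i$ gives

$$M(X, Y) = w_s(X, Y) = \sum_{\substack{1 \le i < j \le n \\ s_i = X,\ s_j = Y}} \frac{1}{(j - i)^{\alpha}} = \sum_{d=1}^{n-1} \frac{M_d(X, Y)}{d^{\alpha}}, \qquad M_d(X, Y) = \left| \{\, i : s_i = X,\ s_{i+d} = Y \,\} \right|.$$

For $S$ = ACGTATC with $\alpha = 1/2$ and $(X, Y) = (A, T)$, the only nonzero counts are $M_1(A,T) = M_3(A,T) = M_5(A,T) = 1$, so $M(A, T) = 1 + 1/\sqrt{3} + 1/\sqrt{5} \approx 2.0246$, in agreement with the example matrix of Section 2.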

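Table 3. The upper triangular part of similarity matrix based on d2.

| Species | S.sci | Chi | Lemur | M.fas | Gorilla | M.fus | M.mul | M.syl | Hyl | Ora | T.syr | Human |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.sci | 0 | 0.0163 | 0.0016 | 0.0080 | 0.0182 | 0.0087 | 0.0067 | 0.0030 | 0.0163 | 0.0305 | 0.0018 | 0.0219 |
| Chi | | 0 | 0.0177 | 0.0021 | 0.0003 | 0.0018 | 0.0028 | 0.0066 | 0.0020 | 0.0043 | 0.0269 | 0.0006 |
| Lemur | | | 0 | 0.0084 | 0.0190 | 0.0099 | 0.0076 | 0.0050 | 0.0160 | 0.0321 | 0.0024 | 0.0225 |
| M.fas | | | | 0 | 0.0028 | 0.0004 | 0.0002 | 0.0018 | 0.0025 | 0.0094 | 0.0150 | 0.0042 |
| Gorilla | | | | | 0 | 0.0022 | 0.0034 | 0.0078 | 0.0011 | 0.0027 | 0.0290 | 0.0002 |
| M.fus | | | | | | 0 | 0.0002 | 0.0020 | 0.0019 | 0.0072 | 0.0166 | 0.0036 |
| M.mul | | | | | | | 0 | 0.0011 | 0.0028 | 0.0098 | 0.0135 | 0.0051 |
| M.syl | | | | | | | | 0 | 0.0066 | 0.0160 | 0.0079 | 0.0103 |
| Hyl | | | | | | | | | 0 | 0.0034 | 0.0252 | 0.0018 |
| Ora | | | | | | | | | | 0 | 0.0438 | 0.0023 |
| T.syr | | | | | | | | | | | 0 | 0.0336 |
| Human | | | | | | | | | | | | 0 |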


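Table 4. The upper triangular part of similarity matrix based on d3.

| Species | S.sci | Chi | Lemur | M.fas | Gorilla | M.fus | M.mul | M.syl | Hyl | Ora | T.syr | Human |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.sci | 0 | 0.0749 | 0.0024 | 0.0360 | 0.0840 | 0.0396 | 0.0302 | 0.0136 | 0.0752 | 0.1348 | 0.0082 | 0.1015 |
| Chi | | 0 | 0.0869 | 0.0098 | 0.0014 | 0.0087 | 0.0134 | 0.0302 | 0.0081 | 0.0183 | 0.1252 | 0.0027 |
| Lemur | | | 0 | 0.0429 | 0.0959 | 0.0473 | 0.0366 | 0.0196 | 0.0851 | 0.1485 | 0.0078 | 0.1142 |
| M.fas | | | | 0 | 0.0137 | 0.0016 | 0.0007 | 0.0068 | 0.0123 | 0.0409 | 0.0709 | 0.0205 |
| Gorilla | | | | | 0 | 0.0103 | 0.0166 | 0.0358 | 0.0046 | 0.0103 | 0.1367 | 0.0011 |
| M.fus | | | | | | 0 | 0.0012 | 0.0090 | 0.0076 | 0.0316 | 0.0773 | 0.0171 |
| M.mul | | | | | | | 0 | 0.0044 | 0.0127 | 0.0431 | 0.0629 | 0.0247 |
| M.syl | | | | | | | | 0 | 0.0284 | 0.0708 | 0.0357 | 0.0476 |
| Hyl | | | | | | | | | 0 | 0.0109 | 0.1210 | 0.0083 |
| Ora | | | | | | | | | | 0 | 0.1956 | 0.0086 |
| T.syr | | | | | | | | | | | 0 | 0.1586 |
| Human | | | | | | | | | | | | 0 |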
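
The average linkage step used to turn a distance table into a dendrogram can be sketched on a small subset of Table 2. This is a minimal sketch under stated assumptions: the function `average_linkage` and the choice of four species are introduced here for illustration, the linkage is taken to be the unweighted mean of all pairwise distances between the members of two clusters, and the exact clustering software used for Figure 4 is not stated in the text.

```python
from itertools import combinations

# d1 values copied from Table 2 for four of the twelve species.
LABELS = ["Chi", "Gorilla", "Ora", "Human"]
DIST = {
    frozenset(("Chi", "Gorilla")): 0.0072,
    frozenset(("Chi", "Ora")): 0.0264,
    frozenset(("Chi", "Human")): 0.0098,
    frozenset(("Gorilla", "Ora")): 0.0211,
    frozenset(("Gorilla", "Human")): 0.0058,
    frozenset(("Ora", "Human")): 0.0198,
}


def average_linkage(labels, dist):
    """Agglomerative clustering with average linkage.

    labels: list of distinct names; dist: maps frozenset({a, b}) to a distance.
    Returns the merges in order as (cluster_a, cluster_b, height), where a
    cluster is a tuple of names and height is the mean pairwise distance.
    """
    clusters = [(name,) for name in labels]
    merges = []

    def avg(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((a, b, avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


for left, right, height in average_linkage(LABELS, DIST):
    print(left, "+", right, "at", round(height, 4))
```

On these four species the first merge joins Gorilla and Human, Chimpanzee joins that pair next, and Orangutan joins last.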

Figure 4. Phylogenetic trees for these 12 species based on Tables 2–4 (panels: Euclidean, Cosine, Correlation).

It is known that the similarity of two sequences determined by alignment is completely based on the ordering of nucleotides, while the similarity by DET is based on both the frequency of pairs of nucleotides and the distance between them. Assume that x1 is a DNA sequence and x2 is its descendant after several rearrangements. The alignment method would estimate a relatively low similarity between x1 and x2 because of the significant change in ordering, such that x1 and x2 might be regarded as two distinct related species. The following simulated test is designed to show that, by the DET method, x2 can still regard x1 as its ancestor with high probability, even if x2 is the offspring of x1 after many generations.

Constructing the Simulated Data set. This simulated test is designed to test the advantage of the DET method over the alignment method when sequence rearrangements happen frequently. One child genome is copied from the initial parent with a shuffle model, which involves two operations: (i) transposition, exchanging the positions of two adjacent random sequence fragments, and (ii) reversal, reversing the order of a random sequence fragment and reinserting it in the same position. To reach quickly the generations at which the alignment method is no longer useful for an offspring to find its ancestor, we require that the size of the random sequence fragments involved in both operations be at least 0.1n (where n is the length of the genome sequence).

For computational expediency, the 0.9 kb mtDNA sequence of Human (L00016) is chosen as a root ancestor test sequence, and the 0.9 kb mtDNA sequence of Chimpanzee (V00672) is chosen as its "brother"


sequence. We generate up to the 20th generation of Human (L00016) using the shuffle model. For simplicity, we select its 1st, 5th, 10th, 15th and 20th generations to use.

We then compute the similarities, based on both the alignment method and the DET method, between these five generations and Chimpanzee (V00672) and Human (L00016), respectively. Here, we use the Needleman-Wunsch global alignment algorithm to get the alignment of any two DNA sequences, and define the similarity between them as r/N, where r is the number of matches in the alignment and N is the length of the alignment. The DET similarity between two sequences is defined as 1 − d2, ie, the cosine of the included angle between their representative vectors. Because these five generations are generated randomly, to be fair, we repeat the above process several times. Each time, if all of these five generations have bigger similarities with Human (L00016) than with Chimpanzee (V00672) under some method, this method gets one score. Let suc_rate = score/l, where l is the total number of tests. Clearly, the method with the higher suc_rate would be regarded as much more robust when shuffling happens frequently.

In the following table we list the suc_rate obtained when we run the above simulated test once for different l values. To our surprise, the average suc_rate for the DET method is 1, while the average suc_rate for alignment is just 0.2709, which proves the utility and tolerance of the DET method when shuffling happens frequently.

| Suc_rate | l = 10 | l = 20 | l = 30 | l = 40 | l = 50 | l = 60 | l = 70 | l = 80 | l = 90 | l = 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| DET method | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Alignment | 0.2000 | 0.3000 | 0.4000 | 0.2000 | 0.3000 | 0.2667 | 0.3857 | 0.2250 | 0.2222 | 0.2500 |

Concluding Remarks

The graph model presented in this paper is a new method for the mathematical characterization of DNA sequences. When it is used to compare DNA sequences, we get results consistent with previous studies and biological classifications. We have no intention to compete with other existing methods, since, as we know, for any particular research project we will have to identify which measure is most meaningful or useful. The first advantage of our new method is greater efficiency compared with the alignment method when shuffling happens frequently during the evolutionary process; the second advantage of our method is low time complexity compared with other alignment-free methods. The core of the method is the construction of the representative vector for each DNA sequence, followed by the computation of the similarity matrix for a set of DNA sequences. The representative vector for one DNA sequence of length n can be obtained in $O(n^2)$ time units. Then, for a set of m sequences with maximum length n, the similarity matrix can be computed in $O(mn^2)$ time units. The similarity matrix is relatively simple to calculate. Our novel method is very different from all traditional methods and is proven to be effective and accurate for similarity comparison of DNA sequences.

Acknowledgement

We thank the Editor and the two anonymous referees for their valuable comments, which helped improve this work a lot. This work is supported partly by the Shandong Province Natural Science Foundation of China with No. ZR2010AQ018 and partly by the Independent Innovation Foundation of Shandong University with No. 2010ZRJQ005, and has been partially supported by a WV EPSCoR grant.

Disclosure

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

References

1. Ma H, Bork P. Measuring genome evolution. Proc Natl Acad Sci U S A. 1988;95:5849–56.
2. Wildman DE, et al. Genomics, biogeography, and the diversification of placental mammals. Proc Natl Acad Sci U S A. 2007;104:14395–400.
3. Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A. 1986;83:5155–9.


4. Hamori E, Ruskin J. H curves: a novel method of representation of nucleotide series especially suited for long DNA sequences. Journal of Biological Chemistry. 1983;258:1318–27.
5. Nandy A. A new graphical representation and analysis of DNA sequence structure: I. Methodology and Application to Globin Genes. Curr Sci. 1994;66:309–14.
6. Guo X, Randić M, Basak SC. A novel 2-D graphical representation of DNA sequences of low degeneracy. Chemical Physics Letters. 2001;350:106–12.
7. Randić M, Vraćko M, Lerś N. Novel 2-D graphical representation of DNA sequences and their numerical characterization. Journal of Chemical Information and Computer Science. 2003a;368:1–6.
8. Randić M, Vraćko M, Lerś N, Plavśić D. Analysis of similarity/dissimilarity of DNA sequence based on novel 2-D graphical representation. Journal of Chemical Information and Computer Science. 2003b;371:202–7.
9. Randić M, Vraćko M, Zupan J, Novic M. Compact 2-D graphical representation of DNA. Chemical Physics Letters. 2003c;373:558–62.
10. Randić M. Graphical representations of DNA as 2-D map. Chemical Physics Letters. 2004;386:468–71.
11. Zhang Y, Liao B, Ding K. On 2D graphical representation of DNA sequence of nondegeneracy. Journal of Chemical Information and Computer Science.
17. Qi X, Wen J, Qi Z. New 3D graphical representation of DNA sequence based on dual nucleotides. Journal of Theoretical Biology. 2007;249:681–90.
18. Qi Z, Fan T. PN-curve: a 3D graphical representation of DNA sequences and their numerical characterization. Chemical Physics Letters. 2007;442:434–40.
19. Cao Z, Liao B, Li R. A group of 3D graphical representation of DNA sequences based on dual nucleotides. International Journal of Quantum Chemistry. 2008;108:1485–90.
20. Yu J, Sun X, Wang J. TN curve: A novel 3D graphical representation of DNA sequence based on trinucleotides and its applications. Journal of Theoretical Biology. 2009;261:459–68. doi:10.1016/j.jtbi.2009.08.005.
21. Chi R, Ding K. Novel 4D numerical representation of DNA sequences. Chemical Physics Letters. 2005;407:63–7.
22. Liao B, Li R, Zhu W. On the similarity of DNA primary sequences based on 5-D representation. Journal of Mathematical Chemistry. 2007;42:47–57.
23. Liao B, Wang T. Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping trinucleotides of nucleotide bases. Journal of Chemical Information and Computer Science. 2004;44:1666–70.
24. Hayasaka K, Gojobori T, Horai S. Molecular phylogeny and evolution of
2005;411:28–32. primate mitochondrial DNA. Mol Biol Evol. 1998;5:626–44.
12. Liu X, Dai Q, Xiu Z, Wang T. PNN-curve: a new 2D graphical representa- 25. Zhang Y, Chen W. New invariant of DNA sequences. Match Commun
tion of DNA sequences and its application. Journal of Theoretical Biology. ­Comput Chem. 2007;58:197–208.
2006;243:555–61. 26. Zhang Y. A simple method to construct the similarity matrices of DNA
13. Huang G, Liao B, Li Y, Liu Z. H curves: a novel 2D graphical representation sequence. Match Commun Comput Chem. 2008;60:313–24.
for DNA sequences. Chemical Physics Letters. 2008;462:129–32. 27. Goodman M, et al. Primate evolution at the DNA level and a classification
14. Randić M, Vracko M, Nandy A. Basak SC. On 3-D graphical representation of hominoids. Journal of Molecular Evolution. 1990;30:260–6.
of DNA primary sequences and their numerical characterization. Journal of 28. Groves C, Wilson DE, Reeder DM, editors. Mammal Species of the World
Chemical Information and Computer Science. 2000;40:1235–44. (3rd ed.), Johns Hopkins University Press; 2005:161–5.
15. Liao B, Wang T. 3-D graphical representation of DNA sequences and their 29. Randić M. Condensed representation of DNA primary sequences. Journal
numerical characterization. Journal of Molecular Structure (THEOCHEM). of Chemical Information and Computer Science. 2000;40:50–6.
2004;681:209–12.
16. Zhang Y, Liao B, Ding K. On 3D D-curves of DNA sequences. Mol Simul.
2006;32:29–34.
