Genmex Tool: (Gene Microsatellite Extractor) : Identification of Tandem Repeats
Abstract--
... biological problems in a much more sophisticated manner. Biological data is huge and growing at an ever faster rate, so a computational (in silico) approach is needed to analyze it. Pattern matching has emerged as a powerful tool for locating nucleotide or amino acid sequence patterns in genomic sequence databases. Although several pattern matching algorithms are available in the literature, their efficiency depends on fast and exact identification of the pattern in the sequence. In this article a novel approach is proposed for finding tandem repeat patterns in a given sequence by combining a preprocessing method (PBFMCSP) with the pattern searching method TSW. PBFMCSP preprocesses the sequence string using the concepts of an inverted matrix and frequently occurring patterns. The frequently occurring patterns are then searched in the input sequence string using the Two Sliding Windows method (TSW), in which the string is scanned from both ends at the same time. The search stops when the two windows converge.
Keywords: Tandem repeats, TSW, PBFMCSP.
INTRODUCTION
Computer science has been extending into many other fields, making work easier, simpler and faster. It deals with many types of research problems, of which string matching is one [1,2,3]. String matching has a wide range of applications in areas such as search engines, speech recognition, data compression, information retrieval, computational biology, virus detection, network intrusion detection, and DNA, RNA and protein sequence searching. One of the most prominent problems in the analysis of biological sequences is searching for similar sequences in the primary structure of related proteins or genes, and several methods have been proposed to solve it.
Pattern matching focuses on finding the occurrences of a particular pattern P of length m in a text T of length n. Both the pattern and the text are built over a finite alphabet set Σ of size σ. Generally, pattern matching algorithms make use of a single window whose size is equal to the pattern length [4]. The Two Sliding Windows algorithm (TSW) [5] concentrates on both the pattern and the text and makes use of two windows, each of size equal to that of the pattern. The first window is aligned with the left end of the text, while the second window is aligned with the right end of the text. Both windows slide over the text at the same time (in parallel) in the searching phase to locate the pattern. The windows slide towards each other until the first occurrence of the pattern from either side is found or they reach the middle of the text. If required, all the occurrences of the pattern in the text can be found.
The goal of data mining, or knowledge discovery, is to use existing data to find new facts and uncover previously unknown relationships efficiently, with minimum use of space and time [10]. Frequent itemset mining plays an essential role in many data mining tasks and applications, such as mining association rules, correlations, sequential patterns, classification and clustering. Frequent itemset construction has been a major research area over the years, and several algorithms have been proposed in the literature to address the problem of mining association rules.
Microsatellites, also known as simple sequence repeats (SSR), short tandem repeats (STR), or variable number tandem repeats (VNTR), are tandem repeats of nucleotide motifs of size 1-6 bp [6] found in every genome known so far. A microsatellite consists of a specific sequence of DNA bases or nucleotides which contains mono, di, tri, or tetra tandem repeats. For example, AAAAAAAAAAA would be referred to as (A)11, GTGTGTGTGTGT would be referred to as (GT)6, and ACTCACTCACTCACTC would be referred to as (ACTC)4. Their importance in genomes is well known. Microsatellites are associated with various disease genes, have been used as molecular markers in linkage analysis and DNA fingerprinting studies, and also seem to play an important role in genome evolution. Microsatellite instability has also been implicated in the induction of cancer [7]. Owing to their high mutability, microsatellites are thought to be one of the sources of genetic diversity [8]. In recent times, efforts have also been made to study the possible functional roles of microsatellites in giving rise to a certain amount of plasticity and in the evolution of genomes [9].
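The (motif)n notation can be illustrated with a short Python sketch. The function name repeat_notation and the limit of 6 on the motif length are assumptions made for this illustration only; the sketch handles pure repeats (a whole number of copies of one motif) and is not part of the Genmex tool itself.

```python
def repeat_notation(sequence: str, max_motif_len: int = 6) -> str:
    """Return '(motif)n' when `sequence` is a whole number of copies of a
    motif of length 1..max_motif_len; the shortest such motif is used."""
    for k in range(1, max_motif_len + 1):
        if len(sequence) % k == 0:
            motif = sequence[:k]
            copies = len(sequence) // k
            if motif * copies == sequence:
                return f"({motif}){copies}"
    # Not a pure tandem repeat of a short motif: return it unchanged.
    return sequence


print(repeat_notation("A" * 11))            # (A)11
print(repeat_notation("GTGTGTGTGTGT"))      # (GT)6
print(repeat_notation("ACTCACTCACTCACTC"))  # (ACTC)4
```

Trying the shortest motif first means that a run such as AAAA is reported as (A)4 rather than (AA)2.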
METHODOLOGY: PREPROCESSING PHASE
In the preprocessing phase, the input DNA sequence, which is a string, is first represented in the form of an inverted matrix. The frequencies of the various possible patterns are then found using the inverted matrix. For example, if the different characters found in a string are A, T, G, C, then the patterns that can be obtained from these are A, T, G, C, AA, AT, AG, AC, GG, GA, GT, GC, AAA, ATC, AGC and so on. After the frequencies are obtained, the frequent patterns are extracted by pruning the infrequent patterns using the Apriori algorithm. The preprocessing phase therefore includes 3 main steps:
1. Inverted matrix generation.
2. Finding the frequencies of the various patterns.
3. Pruning the infrequent patterns using the Apriori algorithm.
Step 2 - Finding frequencies of various patterns.
The patterns are of the following types:
1. Single-character patterns: A, G, T, C.
2. Patterns with length more than 2.
Single-character patterns: For single-character patterns, the frequency is given by the column count of that particular character. From our previous example, the frequency of A is 5, G is 4, C is 5 and D is 3.
Patterns with length more than 2: Here we traverse the inverted matrix to find the frequencies. For example, to find the frequency of the pattern ATG, we start by finding AT. It is found 3 times in row 1 (row A), since the index 3 is present 3 times. For the first occurrence of index 3 in row A, the value of the pair is 3,2, so we check the 3rd row, 2nd column of the matrix. The character found is A, since the index is 1. The pattern present is therefore ATA, not ATG, so we do not count it. For the second occurrence of T in row A, the pair is 3,4, so we check the character at the 3rd row, 4th column. The character is G, as the index is 2. The pattern found is ATG, which is the pattern we are looking for, so it is counted towards the frequency. For the third occurrence of T in row A, the value of the pair is 3,5. We go to the 3rd row, 5th column and find $,$. An index of $ means that only AT is present and there is no third character. The pattern is therefore AT and is not counted. Hence the frequency of the pattern ATG is 1.
The total number of patterns of length 1, 2, 3, 4, 5 and 6 that can be obtained from four characters is \(4 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6 = 5460\). For each of those patterns we find the frequency using the method described above. The algorithm given supports patterns up to length 6.
ALGORITHM
INPUT: Inverted matrix
OUTPUT: Frequent patterns found in the given biological sequence
PROCESS:
% The input characters are A, T, G, C; their indexes are assigned in the inverted matrix %
count = 0;
FOR i = 1 to 4
    pattern1 = i;
    access the i-th row of the inverted matrix and search for any element in the row with index i;
    flag = number of occurrences of index i;
    frequency[count] = flag; count++;
    FOR j = 1 to 4
        pattern2 = pattern1*10 + j;
        flag = number of occurrences of index j in the i-th row;
        frequency[count] = flag; count++;
        FOR k = 1 to 4
            pattern3 = pattern2*10 + k;
            flag = number of occurrences of index k in the (a,b)-th element of the inverted matrix;
            frequency[count] = flag; count++;
            FOR l = 1 to 4
                pattern4 = pattern3*10 + l;
                flag = number of occurrences of index l in the (a,b)-th element of the inverted matrix;
                frequency[count] = flag; count++;
                FOR m = 1 to 4
                    pattern5 = pattern4*10 + m;
                    flag = number of occurrences of index m in the (a,b)-th element of the inverted matrix;
                    frequency[count] = flag; count++;
                    FOR n = 1 to 4
                        pattern6 = pattern5*10 + n;
                        flag = number of occurrences of index n in the (a,b)-th element of the inverted matrix;
                        frequency[count] = flag; count++;
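The enumeration of all patterns up to length 6 and their frequencies can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the inverted-matrix traversal itself: it assumes the frequency of a pattern is the number of (possibly overlapping) positions at which it occurs, and it obtains the counts by scanning substrings of the sequence directly. The names ALPHABET and pattern_frequencies are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ATGC"  # assumed nucleotide alphabet


def pattern_frequencies(sequence, max_len=6):
    """Map every pattern of length 1..max_len over ALPHABET to the number
    of (possibly overlapping) positions where it occurs in `sequence`."""
    # Count every substring of length 1..max_len that is actually present.
    observed = Counter(
        sequence[i:i + length]
        for length in range(1, max_len + 1)
        for i in range(len(sequence) - length + 1)
    )
    # Report all candidate patterns, including those with frequency 0.
    return {
        "".join(letters): observed["".join(letters)]
        for length in range(1, max_len + 1)
        for letters in product(ALPHABET, repeat=length)
    }


freqs = pattern_frequencies("ATGCATATATATATAT")
print(len(freqs))                                # 5460
print(freqs["AT"], freqs["GC"], freqs["ATG"])    # 7 1 1
```

Scanning the sequence once per pattern length keeps the work proportional to the sequence length, instead of testing each of the 5460 candidates separately.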
The frequencies of the various patterns are entered into an array, and the corresponding patterns are fed into the next step.
Step 3: Pruning Infrequent Patterns Using Apriori Algorithm
We take as input the minimum repeat count for patterns of each length 1, 2, 3, 4, 5, 6 and so on. For example, if the minimum repeat count for di patterns is given as 4, then all patterns of length 2 that are repeated at least 4 times, such as ATATATAT and GCGCGCGCGC, are found. If the minimum repeat value for tri patterns is given as 5, all patterns like AGTAGTAGTAGTAGT are found. We take the minimum repeat values from the input and compute the minimum threshold and support values. If the values are given as
Mono = 5, Di = 3, Tri = 2, Tetra = 2, Penta = 3, Hexa = 2
then the support values for each of them are obtained as follows. Let
\[ \text{total} = 4 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6 \]
Mono = 5*100/total, Di = 3*100/total, Tri = 2*100/total, Tetra = 2*100/total, Penta = 3*100/total, Hexa = 2*100/total.
For each pattern we have found the frequency. We find the support for each of those patterns using the formula
\[ \text{Support} = \frac{\text{frequency} \times 100}{4 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6} \]
Example: Let the input sequence be ATGCATATATATATAT. The frequency of AT is 7, so the support of AT is 7*100/total. The minimum threshold for di patterns is 3*100/total. The support is greater than the minimum threshold, so AT is considered a frequent pattern. Now consider the pattern GC. Its frequency is 1, so its support is 1*100/total. This is less than the minimum support value for di patterns, 3*100/total, so this pattern is pruned. In this way the infrequent patterns are pruned.
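The support computation and pruning in this example can be sketched in Python. The names TOTAL, MIN_REPEATS, support and prune are hypothetical, and treating a support equal to the threshold as frequent is an assumption, since the text only covers the strictly greater and strictly smaller cases.

```python
# total = 4 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6 = 5460
TOTAL = sum(4 ** k for k in range(1, 7))

# Minimum repeat values from the example: mono, di, tri, tetra, penta, hexa.
MIN_REPEATS = {1: 5, 2: 3, 3: 2, 4: 2, 5: 3, 6: 2}


def support(frequency):
    """Support = frequency * 100 / total."""
    return frequency * 100 / TOTAL


def prune(frequencies):
    """Keep only patterns whose support reaches the minimum threshold
    for their length (assumption: equality counts as frequent)."""
    kept = {}
    for pattern, freq in frequencies.items():
        threshold = support(MIN_REPEATS[len(pattern)])
        if support(freq) >= threshold:
            kept[pattern] = freq
    return kept


print(prune({"AT": 7, "GC": 1}))  # {'AT': 7}
```

Because the support and the threshold share the same denominator, comparing them is equivalent to comparing the raw frequency with the minimum repeat value.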
In the searching phase, both windows slide over the text at the same time (in parallel) to locate the pattern. The windows slide towards each other until they converge. We use the Berry-Ravindran bad character shift rule, which shifts the sliding windows quickly and so speeds up the search.
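The idea of two windows sliding towards each other can be sketched in Python as follows. This is a simplified sketch and not the TSW algorithm of [5]: both windows are moved by one position per step instead of using bad character shift values, and the function name two_window_search is hypothetical.

```python
def two_window_search(text, pattern):
    """Return the start index of the first occurrence found by either
    window, or -1 if the windows converge without a match."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    left, right = 0, n - m          # start positions of the two windows
    while left <= right:            # stop once the windows have converged
        if text[left:left + m] == pattern:
            return left
        if text[right:right + m] == pattern:
            return right
        left += 1                   # simplification: unit shifts only
        right -= 1
    return -1


text = "ATGCATATATATATAT"
print(two_window_search(text, "GC"))      # 2  (found by the left window)
print(two_window_search(text, "ATATAT"))  # 10 (found by the right window)
print(two_window_search(text, "GGG"))     # -1
```

With unit shifts every start position is examined by one of the two windows before they cross, so no occurrence is missed; the shift rule only changes how quickly the windows advance.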
TSW ALGORITHM
The Two Sliding Windows algorithm (TSW) scans the text from both sides simultaneously. It uses two sliding windows, each of size m, the same size as the pattern. The two windows search the text in parallel. The text is divided into two parts, the left and the right, each of size n/2. The left part is scanned from left to right using the left window, and the right part is scanned from right to left using the right window. Because both windows slide in parallel, the TSW algorithm is suitable for parallel processor structures. The TSW algorithm stops when one of the two sliding windows finds the pattern or when the two windows converge. If necessary, the algorithm can easily be modified to find all the occurrences of the pattern. If the pattern lies exactly in the middle of the text, TSW can still find it.
The TSW algorithm uses the idea of the BR bad character shift function to get better shift values during the searching phase. The BR algorithm provides a maximum shift value in most cases without losing any characters. The main differences between the TSW algorithm and the BR algorithm are:
1. TSW uses two sliding windows rather than one sliding window to scan all text characters, as in the BR algorithm.
2. TSW uses two arrays, each a one-dimensional array of size (m-1), to store the calculated shift values for the two sliding windows. The shift values are calculated only for the pattern characters, whereas the original BR algorithm uses a two-dimensional array to store the shift values for all the characters of the alphabet. Using one-dimensional arrays reduces the search processing time and, at the same time, the memory needed to store the shift values.
PRE-PROCESSING PHASE:
The pre-processing phase generates two one-dimensional arrays, nextl and nextr. The values of the nextl array are calculated according to the Berry-Ravindran bad character algorithm (BR). nextl contains the shift values needed to search the text from the left side. To calculate the shift values, the algorithm considers two consecutive text characters a and b, which are aligned immediately after the sliding window. Initially, the indexes of these two consecutive characters in the text string from the left are (m+1) and (m+2) for a and b respectively. The values of the nextr array, on the other hand, are calculated according to the shift function proposed for TSW [5]. nextr contains the shift values needed to search the text from the right side; here the two consecutive characters are those immediately to the left of the right window.
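The Berry-Ravindran bad character shift on which nextl is based can be sketched in Python. This sketch follows the standard single-window BR rule as given in the Handbook of Exact String Matching Algorithms [4], stored sparsely in a dictionary; it is not the one-dimensional nextl/nextr layout of TSW, and the names br_pair_shifts, br_shift and br_search are hypothetical.

```python
def br_pair_shifts(pattern):
    """Shift m - i for each adjacent pair pattern[i], pattern[i+1]
    (0-based); a later, rightmost occurrence overwrites an earlier one."""
    m = len(pattern)
    return {(pattern[i], pattern[i + 1]): m - i for i in range(m - 1)}


def br_shift(pattern, pair_shifts, a, b):
    """BR shift for the two text characters a, b just right of the window."""
    m = len(pattern)
    if a == pattern[-1]:
        return 1                      # a can align with the last character
    if (a, b) in pair_shifts:
        return pair_shifts[(a, b)]    # the pair ab occurs inside the pattern
    if b == pattern[0]:
        return m + 1                  # b can align with the first character
    return m + 2                      # skip past both characters


def br_search(text, pattern):
    """Return all start positions of pattern in text (left-to-right scan)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    pair_shifts = br_pair_shifts(pattern)
    padded = text + "\0\0"            # sentinels so a and b always exist
    hits, j = [], 0
    while j <= n - m:
        if padded[j:j + m] == pattern:
            hits.append(j)
        j += br_shift(pattern, pair_shifts, padded[j + m], padded[j + m + 1])
    return hits


print(br_search("ATGCATATATATATAT", "ATAT"))  # [4, 6, 8, 10, 12]
print(br_search("ATGCATATATATATAT", "GC"))    # [2]
```

The two sentinel characters assume that the NUL character does not occur in the pattern; for DNA sequences over A, T, G, C this holds.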
SEARCHING THE FREQUENT PATTERNS IN THE INPUT SEQUENCE USING TWO-SLIDING WINDOW APPROACH:
The frequent patterns found in the preprocessing phase are searched in the input biological sequence using the two sliding window approach. The patterns to be searched are generated dynamically in the previous phase.
Placing the pattern in the sliding window: First we take the length of the pattern. If it is a mono pattern, it is repeated 5 times and placed in the window. E.g., if the pattern is A, the pattern placed in the sliding window is AAAAA. It is repeated 5 times because the minimum repeat unit for mono patterns was given as 5 in the previous phase. Similarly, if the pattern is AGC, it is repeated 2 times (since the minimum repeat unit for tri patterns is 2), so AGCAGC is placed in the sliding window and searched for in the sequence.
INTRODUCTION TO TSW (TWO SLIDING WINDOW):
The algorithm concentrates on both the pattern and the text. It makes use of two windows, each of size equal to that of the pattern. The first window is aligned with the left end of the text, while the second window is aligned with the right end of the text.
The inverted matrix preprocessing also identifies the positions of the characters A, T, G, C in the sequence, which can serve as the bad character shift preprocessing used to shift the windows in the TSW method. Hence this novel method considerably reduces the time needed to find tandem repeats in the sequence.
REFERENCES
[1] G. Navarro, M. Raffinot, Fast and Flexible Pattern Matching in Strings - Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, 2002.
[2] M. Crochemore, W. Rytter, Jewels of Stringology, World Scientific, Singapore, 2002.
[3] W. F. Smyth, Computing Patterns in Strings, Pearson Addison Wesley, 2003.
[4] C. Charras, T. Lecroq, Handbook of Exact String Matching Algorithms, First Edition, King's College London Publications, 2004. ISBN: 0954300645.
[5] A. Hudaib et al., A Fast Pattern Matching Algorithm with Two Sliding Windows (TSW), Journal of Computer Science 4 (5): 393-401, 2008.
[6] C. Schlotterer, Evolutionary dynamics of microsatellite DNA, Chromosoma, 109, 365-371, 2000.
[7] S. N. Thibodeau et al., Microsatellite instability in cancer of the proximal colon, Science, 260, 816-819, 1993.
[8] Y. Kashi, D. G. King, Simple sequence repeats as advantageous mutators in evolution, Trends Genet., 22, 253-259, 2006.
[9] V. B. Sreenu et al., Microsatellite polymorphism across the M. tuberculosis and M. bovis genomes: implications on genome evolution and plasticity, BMC Genomics, 7, 78-88, 2006.
[10] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers, 2006.
Searching phase:
STEP 1: Compare the characters of the two sliding windows with the corresponding text characters from both sides. If there is a mismatch during comparison from both sides, the algorithm goes to step 2; otherwise the comparison continues until a complete match is found, at which point the algorithm stops and displays the corresponding position of the pattern in the text string. If we are searching for all the occurrences of the pattern in the text string, the algorithm continues to step 2.
STEP 2: In this step, we use the shift values from the next arrays, depending on the two text characters placed immediately after the pattern window. The two characters lie to the right of the left window and to the left of the right window. The windows are shifted to the correct positions based on the shift values: the left window is shifted to the right and the right window is shifted to the left.
Both steps are repeated until the first occurrence of the pattern is found from either side or until both windows are positioned beyond n/2. If the first occurrence of the pattern lies in the middle of the text, the TSW algorithm continues comparing pattern characters with text characters through the inner loops before it terminates the searching process through the outer loop. The outcome of TSW gives the tandem repeats (microsatellites) present in the given input sequence.
CONCLUSION
In this novel approach we presented a method that combines frequent pattern search with a fast pattern matching method (Two Sliding Windows) to reduce the time complexity of finding microsatellites in a given nucleotide sequence. The approach preprocesses the sequence to identify its frequent patterns by using the inverted matrix method.