International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
DOI:10.5121/ijcsa.2015.5402
K-Mer Index Of DNA Sequence Based On Hash Algorithm
Jinlin Liu1, Qiang Chen2 and Chen Zhang3
1 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science,
Shanghai 201620, China.
2 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science,
Shanghai 201620, China.
3 School of Management, Shanghai University of Engineering Science, Shanghai 201620, China.
ABSTRACT
K-mer frequency statistics of biological sequences is an important problem in biological
information processing. This paper addresses the problem of indexing the k-mers of large-scale DNA
sequence data within limited memory space and time. A hash algorithm is used to establish the index:
the index model is set up for base matching, so that the statistics for k-mers of a given length can
be obtained quickly without searching all the indexed sequences. The program uses a hash table to
establish the index and build the search model, and uses the zipper method (separate chaining) to
resolve address conflicts. Time complexity analysis and experimental results show that, compared
with traditional indexing methods, the algorithm improves performance noticeably and is well suited
to applications in which the k-mer length varies over a wide range.
KEYWORDS
K-mer index; hash algorithm; DNA detecting; index model;
1.INTRODUCTION
With the rapid development of DNA sequencing technology in recent years, massive amounts of
biological sequence data have been generated, and they need to be analyzed and processed by
effective computational means. Among the many problems of biological sequence analysis and
processing is that of the k-mer: a short subsequence of k consecutive bases of a DNA sequence.
When the value of k is appropriate, the k-mer frequency distribution of a sequence contains all the
information in the genome that constitutes equivalent sequences. We can therefore learn about the
base distribution characteristics, functions, structures and evolution of biological sequences by
analyzing the k-mer distribution of DNA sequences and the information of different k-mers.
2.QUESTIONS
This paper aims to solve the problem of building a k-mer index of DNA sequences. For a given k,
an index is established over 1 million DNA sequences. The computer reads every fragment of length
k from the start to the end of each sequence, then moves on to the next sequence and reads again,
until the positions at which each individual k-mer appears in the sequences have been recorded.
Because DNA sequencing produces many fragments and the data are large in scale, we have to handle
large data sets under the condition of limited memory and disk space, and optimize the space
complexity and computational complexity as far as possible. This leads to the following two
questions.
Q1.
Establish the index for the given k, then search every sequence. A hash algorithm encodes the
bases of each sequence; the input k-base fragment is converted into a decimal number and matched
against the 1 million sequences. Finally, the computer outputs the row and column at which the
base fragment occurs.
Q2.
After the index is established, we build the hash table in memory, and on every traversal we
store the frequency and the positions of each k-mer in the hash table. In this way the 1 million
DNA sequences can be traversed within limited memory space.
3.PROBLEM ANALYSIS
3.1.Problem abstraction
There are 1 million gene sequences and each has length 100, so the data are equivalent to a
two-dimensional matrix A whose rows are numbered 1 to 1000000 and whose columns are numbered
1 to 100. The problem is thus abstracted as the analysis of the matrix A[i][j],
i=1,2,...,1000000; j=1,2,...,100.
3.2.Method solution
The sequences consist of four kinds of bases: A, C, G and T. Using the hash algorithm, the four
bases are converted into base-4 (quaternary) digits by setting A=0, C=1, G=2, T=3, and the
resulting base-4 number is then converted to a decimal number for use in the matching query.
The hash value formula is Hash(value)=value*[4^(k-m-1)], where value is the digit corresponding
to the character, k is the length of the k-mer, and m is the position of the character in the
string, in the range [0, k-1]; the hash of a k-mer is the sum of these terms over all of its
characters. For example, for k=4 the sequence ATCG is converted into the decimal number
ATCG=[0*(4^3)+3*(4^2)+1*(4^1)+2*(4^0)]=54. Each row of 100 bases can thus be converted into
100-k+1 decimal numbers. Applying the same principle to all 1 million rows gives the
corresponding decimal numbers, which are stored in the two-dimensional array A[i][j]. When a
decimal number matches, the program converts it back into base-4 form of the corresponding
length k, and hence into a base fragment such as ATCG in the example, and prints the fragment
together with its row and column labels.
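The encoding and its inverse can be sketched in a few lines of Python. This is only an illustration of the formula described above, not the authors' program; the function names `encode_kmer` and `decode_kmer` are hypothetical, and the sum of value*4^(k-m-1) is evaluated in Horner form.

```python
# Digit assignment taken from the text: A=0, C=1, G=2, T=3.
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}
DIGIT_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Return the sum of value * 4^(k-m-1) over positions m = 0..k-1."""
    code = 0
    for base in kmer:
        # Horner form: shift the digits read so far and append the new one.
        code = code * 4 + BASE_TO_DIGIT[base]
    return code


def decode_kmer(code: int, k: int) -> str:
    """Turn a decimal code back into its k-base fragment."""
    bases = []
    for _ in range(k):
        code, digit = divmod(code, 4)  # peel off the lowest base-4 digit
        bases.append(DIGIT_TO_BASE[digit])
    return "".join(reversed(bases))


print(encode_kmer("ATCG"))  # 54
print(decode_kmer(54, 4))   # ATCG
```

Because the mapping is a plain base-4 positional code, every k-mer of a fixed length k has a unique code in the range 0 to 4^k-1, so decoding needs only the code and k.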
After the index is established, we use the division method to build the hash table in memory and
to determine the addresses in the hash table. Every time a k-mer occurs, its row and column
labels, that is, its location, are stored in the hash table. This improves the efficiency of
searching the 1 million DNA sequences under limited memory space.
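As a rough sketch of this indexing step, the code below records the (row, column) position of every k-mer occurrence, so that the frequency of a k-mer is simply the length of its position list. It makes several assumptions that are not in the paper: Python's built-in dictionary stands in for the hand-built hash table, rows and columns are numbered from 1, each window code is derived from the previous one rather than recomputed, and the names `window_codes` and `build_index` are hypothetical.

```python
from collections import defaultdict

BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def window_codes(seq, k):
    """Yield (column, code) for each of the len(seq)-k+1 windows, columns from 1."""
    if len(seq) < k:
        return
    top = 4 ** (k - 1)  # weight of the leading base in a window
    code = 0
    for base in seq[:k]:
        code = code * 4 + BASE_TO_DIGIT[base]
    yield 1, code
    for j in range(k, len(seq)):
        # Drop the leading base, shift by one digit, append the new base.
        code = (code - BASE_TO_DIGIT[seq[j - k]] * top) * 4 + BASE_TO_DIGIT[seq[j]]
        yield j - k + 2, code


def build_index(rows, k):
    """Map each k-mer code to the list of (row, column) positions where it occurs."""
    index = defaultdict(list)
    for i, seq in enumerate(rows, start=1):
        for j, code in window_codes(seq, k):
            index[code].append((i, j))
    return index


rows = ["ATCGAT", "CGATCG"]  # toy data; the paper uses rows of length 100
index = build_index(rows, 4)
print(index[54])       # positions of ATCG (code 54)
print(len(index[54]))  # its frequency
```

With rows of length 100, `window_codes` produces exactly 100-k+1 codes per row, matching the count given in the text.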
4.MODEL ESTABLISHMENT AND SOLUTION
A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed
length; this shorter binary value is called the hash value.
In this paper, following the principle of the hash algorithm, the four bases A, C, G and T are
labelled 0, 1, 2 and 3 respectively, so that a fragment becomes a base-4 number, which is then
transformed into a decimal number. The decimal number of the query fragment is matched against
the 100-k+1 decimal numbers of the first row, and then against those of each following row;
wherever the base sequences match, the program outputs the row and column labels.
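The matching procedure of this model can be illustrated by a direct scan, shown here as a minimal sketch rather than the authors' implementation. The function names `encode` and `find_matches` are hypothetical, and rows and columns are assumed to be numbered from 1.

```python
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(fragment):
    """Decimal value of a fragment read as a base-4 number."""
    code = 0
    for base in fragment:
        code = code * 4 + BASE_TO_DIGIT[base]
    return code


def find_matches(rows, query):
    """Return (row, column) for every window whose code equals the query's code."""
    k = len(query)
    target = encode(query)
    hits = []
    for i, seq in enumerate(rows, start=1):
        for j in range(len(seq) - k + 1):
            if encode(seq[j:j + k]) == target:
                hits.append((i, j + 1))
    return hits


rows = ["ATCGAT", "CGATCG", "TTTTTT"]  # toy data
for row, col in find_matches(rows, "ATCG"):
    print(f"row {row}, column {col}")
```

This scan visits every window of every row for each query, which is the cost that the hash table of the next model is meant to avoid.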
The flow chart is shown below:
[Figure: flow chart]
4.1.Model two: search model based on hash table
The main task here is to design a hash function and to build the hash table with the k-mer as
the keyword.
There are many methods of constructing a hash function, such as the digit analysis method, the
direct addressing method and the random number method; the random number method is usually used
when the keywords differ in length. This paper selects the division method: the hash value
obtained for a nucleotide sequence is divided by 1000, and the remainder is used as the address
in the hash table. All hash values with the same remainder fall into the same bucket, so the
entries within one bucket share a remainder but differ in their quotients. An address conflict
therefore arises and has to be resolved.
The zipper method (separate chaining) resolves the conflict: all nodes whose keywords are
synonyms, that is, have the same hash address, are linked into the same singly linked list. If
the selected hash table length is m, the hash table can be defined as an array of m pointers
T[0..m-1]. Every node whose hash address is i is inserted into the singly linked list whose head
pointer is T[i]. The initial value of each component of T should be the null pointer. In the
zipper method the load factor α can be greater than 1, but generally α is taken to be less than 1.
Hash search: the k-mer is taken as the keyword, and the program uses the hash function to
calculate its address. The base arrangement stored in each node at that address is compared with
the base sequence being searched for: if they are the same, all the information in the node is
output; if not, the search continues with the next node of the list.
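The zipper method and the hash search can be sketched together as follows. This is an illustrative Python version under stated assumptions, not the authors' code: keys are the decimal k-mer codes, each node stores a list of (row, column) positions, new nodes are inserted at the head of the chain, and the class and method names (`Node`, `ChainedKmerTable`, `add`, `search`, `load_factor`) are hypothetical.

```python
class Node:
    """One entry of a chain: a k-mer code, its positions, and the next node."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.positions = []
        self.next = next_node


class ChainedKmerTable:
    def __init__(self, size=1000):
        self.size = size
        self.slots = [None] * size  # T[0..m-1], every head pointer starts as null
        self.count = 0              # number of distinct keys stored

    def _find(self, key):
        """Walk the chain at the key's address until the key matches or the chain ends."""
        node = self.slots[key % self.size]
        while node is not None and node.key != key:
            node = node.next
        return node

    def add(self, key, position):
        node = self._find(key)
        if node is None:
            addr = key % self.size
            node = Node(key, self.slots[addr])  # insert at the head of the chain
            self.slots[addr] = node
            self.count += 1
        node.positions.append(position)

    def search(self, key):
        node = self._find(key)
        return [] if node is None else list(node.positions)

    def load_factor(self):
        return self.count / self.size


table = ChainedKmerTable(size=1000)
table.add(54, (1, 1))
table.add(1054, (7, 3))    # same address as 54, so it joins the same chain
table.add(54, (2, 3))
print(table.search(54))    # [(1, 1), (2, 3)]
print(table.search(1054))  # [(7, 3)]
print(table.search(2054))  # []
```

A lookup costs one address computation plus a walk along one chain, so keeping the load factor low keeps the chains, and therefore the searches, short.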
4.2.Model three: analysis of the memory space occupied by the hash table
Data definition analysis: the int keyword denotes a 32-bit integer, whose range is from
-2147483648 to +2147483647 (inclusive). The number of bytes occupied per int type is 4 B. The
char type is treated here as an unsigned 8-bit (single-byte) code, whose values range from
0 to 255. The number of bytes occupied per char type is 1 B.
Overall data analysis:
Row: up to 1000000, stored as an int type variable (4 bytes)
Column: up to 100, stored as a char type variable (1 byte)
In theory, the index information for a given k takes up 5 × 4^k bytes (B) of memory, that is,
5 bytes (one int plus one char) for each of the 4^k possible k-mers; this size can also be
converted into gigabytes (GB).
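The memory estimate can be checked numerically with a short sketch. It assumes 5 bytes per entry (a 4-byte int plus a 1-byte char), one entry for each of the 4^k possible k-mers, and 1 GB = 1024 × 1024 × 1024 bytes; the function names `index_bytes` and `index_gigabytes` are hypothetical.

```python
BYTES_PER_ENTRY = 4 + 1  # assumed: one int (row) plus one char (column)


def index_bytes(k):
    """Theoretical index size in bytes for k-mers of length k."""
    return BYTES_PER_ENTRY * 4 ** k


def index_gigabytes(k):
    """The same size expressed in GB (1 GB = 1024^3 bytes)."""
    return index_bytes(k) / (1024 ** 3)


for k in (1, 8, 12, 16):
    print(k, f"{index_gigabytes(k):.8f}")
```

Under these assumptions the size grows by a factor of 4 for each additional base and reaches 20 GB at k=16, which agrees with the last row of Table 4.1.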
For different values of k, the corresponding memory space in GB, (5 × 4^k) / (1024 × 1024 × 1024),
is shown in the table below.
Table4.1 The Memory Space
K     Memory Space (GB)
1     0.00000002
2     0.00000007
3     0.00000030
4     0.00000119
5     0.00000477
6     0.00001907
7     0.00007629
8     0.00030518
9     0.00122070
10    0.00488281
11    0.01953125
12    0.07812500
13    0.31250000
14    1.25000000
15    5.00000000
16    20.00000000
5.RUN RESULTS SHOW
5.1.The interface
Figure5.1 The interface
5.2.Search interface
Figure5.2 the search interface
5.3.File generated results
The K_mer.txt file is shown in Figure 5.3.
Figure5.3 the text file shown
5.4.The results output interface
Figure5.4 the output interface
5.5.The complexity of the algorithm
(1) Complexity analysis of establishing the index
Time complexity O (1) + O (m), where m is the length of the zipper chain, that is, its depth,
when a conflict occurs.
Space complexity O ( )
(2) Complexity analysis of using the index
Time complexity O (1)
Space complexity O (1)
6.CONCLUSIONS
To solve the problem of k-mer indexing of DNA sequences, three models are proposed: the hash
algorithm index model, the hash table query model, and the memory analysis model of the hash
table. The design uses the visual2010 software to carry out the traversal. The memory occupancy
is small, the traversal efficiency is high and the result is accurate, which provides a good
solution to the problem of k-mer indexing of DNA sequences.
REFERENCES
[1] Singh, M.; Garg, D., "Choosing Best Hashing Strategies and Hash Functions," Advance Computing
Conference, 2009. IACC 2009. IEEE International , vol., no., pp.50,55, 6-7 March 2009
[2] Rizk G, Lavenier D, Chikhi R. DSK: k-mer counting with very low memory
usage[J].Bioinformatics, 2013, 29(5): 652-653
[3] Deorowicz S, Debudaj-Grabysz A, Grabowski S. Disk-based k-mer counting on a PC[J].BMC
bioinfonnatics, 2013, 14(1): 160.
[4] Roy K S, Bhattacharya D, Schliep A. Turtle: Identifying frequent k-mers with cache-efficient
algorithms[J]. arXiv preprint arXiv:1305.1861,2013.
[5] Chor B, Horn D, Goldman N, et al. Genomic DNA k-mer spectra: models and modalities[J].Genome
Biol, 2009, 10(10): 8108.
[6] Hao B, Lee H C, Zhang S. Fractals related to long DNA sequences and complete
genomes[J].Chaos,Solitions&Fractals,2000,11(6):825-836.
[7] Yang Xu; Lei Ma; Zhaobo Liu; Chao, H.J., "A Multi-dimensional Progressive Perfect Hashing for
High-Speed String Matching," Architectures for Networking and Communications Systems (ANCS),
2011 Seventh ACM/IEEE Symposium on , vol., no., pp.167,177, 3-4 Oct. 2011
[8] Yasuda, K.; Miura, T.; Shioya, I., "Distributed Processes on Tree Hash," Computer Software and
Applications Conference, 2006. COMPSAC '06. 30th Annual International , vol.2, no., pp.10,13, 17-
21 Sept. 2006
[9] Bradford, P.G.; Gavrylyako, O.V., "Hash chains with diminishing ranges for sensors," Parallel
Processing Workshops, 2004. ICPP 2004 Workshops. Proceedings. 2004 International Conference
on , vol., no., pp.77,83, 18-18 Aug. 2004
[10] Jian-Wei Fan; Chao-Wen Chan; Ya-Fen Chang, "A random increasing sequence hash chain and
smart card-based remote user authentication scheme," Information, Communications and Signal
Processing (ICICS) 2013 9th International Conference on , vol., no., pp.1,5, 10-13 Dec. 2013
Authors
Jinlin Liu is currently studying Mechanical and Electronic Engineering at Shanghai University of
Engineering Science, China, where he is working towards the Master's degree. His current research
interests include FPGAs and the design and development of embedded systems.

More Related Content

DOC
Data structure-questions
PDF
Data structure-question-bank
PPTX
Linear search-and-binary-search
PDF
COMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERING
PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
PDF
SIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFIC
PDF
8074.pdf
PPT
358 33 powerpoint-slides_15-hashing-collision_chapter-15
Data structure-questions
Data structure-question-bank
Linear search-and-binary-search
COMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERING
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
SIMILARITY SEARCH FOR TRAJECTORIES OF RFID TAGS IN SUPPLY CHAIN TRAFFIC
8074.pdf
358 33 powerpoint-slides_15-hashing-collision_chapter-15

What's hot (20)

PPTX
Datastructures using c++
PPTX
Bsc cs ii dfs u-1 introduction to data structure
PDF
Binary Similarity : Theory, Algorithms and Tool Evaluation
DOCX
Datastructures and algorithms prepared by M.V.Brehmanada Reddy
PPTX
Data Structure & Algorithms | Computer Science
PDF
M v bramhananda reddy dsa complete notes
PPT
Ch17 Hashing
PDF
Introduction to Data Structure
PDF
Data structure
DOCX
Mit203 analysis and design of algorithms
PDF
Searching and Sorting Techniques in Data Structure
PPTX
C programming
PDF
IRJET- A Survey on Different Searching Algorithms
PDF
Ii pu cs practical viva voce questions
PPSX
Lecture 1 an introduction to data structure
DOCX
Bc0038– data structure using c
PDF
UNIT I LINEAR DATA STRUCTURES – LIST
PDF
Data structures Basics
PPTX
Efficient Sparse Coding Algorithms
PPT
Binary Search
Datastructures using c++
Bsc cs ii dfs u-1 introduction to data structure
Binary Similarity : Theory, Algorithms and Tool Evaluation
Datastructures and algorithms prepared by M.V.Brehmanada Reddy
Data Structure & Algorithms | Computer Science
M v bramhananda reddy dsa complete notes
Ch17 Hashing
Introduction to Data Structure
Data structure
Mit203 analysis and design of algorithms
Searching and Sorting Techniques in Data Structure
C programming
IRJET- A Survey on Different Searching Algorithms
Ii pu cs practical viva voce questions
Lecture 1 an introduction to data structure
Bc0038– data structure using c
UNIT I LINEAR DATA STRUCTURES – LIST
Data structures Basics
Efficient Sparse Coding Algorithms
Binary Search
Ad

Viewers also liked (20)

PDF
A countermeasure for flooding
PDF
Handling ambiguities and unknown words in named entity recognition using anap...
PDF
Energy efficient sensor selection in visual sensor networks based on multi ob...
PDF
Quantifying the impact of flood attack on
PDF
INTELLIGENT QUERY PROCESSING IN MALAYALAM
PDF
INVESTIGATION OF NONLINEAR DYNAMICS IN THE BOOST CONVERTER: EFFECT OF CAPACIT...
PDF
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
PDF
SCHEDULING IN GRID TO MINIMIZE THE IMPOSED OVERHEAD ON THE SYSTEM AND TO INC...
PDF
tScene classification using pyramid histogram of
PDF
Theta θ(g,x) and pi π(g,x) polynomials of hexagonal trapezoid system tb,a
PDF
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
PDF
A LOCATION-BASED RECOMMENDER SYSTEM FRAMEWORK TO IMPROVE ACCURACY IN USERBASE...
PDF
Application of Taguchi Experiment Design for Decrease of Cogging Torque in P...
PDF
PORTFOLIO SELECTION BY THE MEANS OF CUCKOO OPTIMIZATION ALGORITHM
PDF
COUPLER, POWER DIVIDER AND CIRCULATOR IN V-BAND SUBSTRATE INTEGRATED WAVEGUID...
PDF
A COMPARATIVE PERFORMANCE STUDY OF OFDM SYSTEM WITH THE IMPLEMENTATION OF COM...
PDF
Data analysis by using machine
PDF
Automatic rectification of perspective distortion from a single image using p...
DOCX
JAVA 2013 IEEE IMAGEPROCESSING PROJECT Query adaptive image search with hash ...
PDF
Enhanced Hashing Approach For Image Forgery Detection With Feature Level Fusion
A countermeasure for flooding
Handling ambiguities and unknown words in named entity recognition using anap...
Energy efficient sensor selection in visual sensor networks based on multi ob...
Quantifying the impact of flood attack on
INTELLIGENT QUERY PROCESSING IN MALAYALAM
INVESTIGATION OF NONLINEAR DYNAMICS IN THE BOOST CONVERTER: EFFECT OF CAPACIT...
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
SCHEDULING IN GRID TO MINIMIZE THE IMPOSED OVERHEAD ON THE SYSTEM AND TO INC...
tScene classification using pyramid histogram of
Theta θ(g,x) and pi π(g,x) polynomials of hexagonal trapezoid system tb,a
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
A LOCATION-BASED RECOMMENDER SYSTEM FRAMEWORK TO IMPROVE ACCURACY IN USERBASE...
Application of Taguchi Experiment Design for Decrease of Cogging Torque in P...
PORTFOLIO SELECTION BY THE MEANS OF CUCKOO OPTIMIZATION ALGORITHM
COUPLER, POWER DIVIDER AND CIRCULATOR IN V-BAND SUBSTRATE INTEGRATED WAVEGUID...
A COMPARATIVE PERFORMANCE STUDY OF OFDM SYSTEM WITH THE IMPLEMENTATION OF COM...
Data analysis by using machine
Automatic rectification of perspective distortion from a single image using p...
JAVA 2013 IEEE IMAGEPROCESSING PROJECT Query adaptive image search with hash ...
Enhanced Hashing Approach For Image Forgery Detection With Feature Level Fusion
Ad

Similar to K mer index of dna sequence based on hash (20)

PDF
Text encryption
PDF
Symmetric Key Generation Algorithm in Linear Block Cipher Over LU Decompositi...
PDF
Computational intelligence based simulated annealing guided key generation in...
PDF
A new dna based approach of generating keydependentmixcolumns
PDF
A design of parity check matrix for short irregular ldpc codes via magic
PDF
Design of ternary sequence using msaa
PDF
Design and Analysis of an Improved Nucleotide Sequences Compression Algorithm...
PDF
Truncated boolean matrices for dna
PDF
C6 agramakrishnan1
PDF
A MODIFIED DNA COMPUTING APPROACH TO TACKLE THE EXPONENTIAL SOLUTION SPACE OF...
PDF
Loss less DNA Solidity Using Huffman and Arithmetic Coding
PPTX
BCS304 Module 5 slides DSA notes 3rd sem
PDF
Welcome to International Journal of Engineering Research and Development (IJERD)
PDF
Combining text and pattern preprocessing in an adaptive dna pattern matcher
PDF
Analytical Study of AES and Proposed Variant with Enhance Block Length and Ke...
PDF
A Novel Design For Generating Dynamic Length Message Digest To Ensure Integri...
PDF
A Cryptographic Hardware Revolution in Communication Systems using Verilog HDL
PDF
I1803014852
PPT
Advance algorithm hashing lec II
Text encryption
Symmetric Key Generation Algorithm in Linear Block Cipher Over LU Decompositi...
Computational intelligence based simulated annealing guided key generation in...
A new dna based approach of generating keydependentmixcolumns
A design of parity check matrix for short irregular ldpc codes via magic
Design of ternary sequence using msaa
Design and Analysis of an Improved Nucleotide Sequences Compression Algorithm...
Truncated boolean matrices for dna
C6 agramakrishnan1
A MODIFIED DNA COMPUTING APPROACH TO TACKLE THE EXPONENTIAL SOLUTION SPACE OF...
Loss less DNA Solidity Using Huffman and Arithmetic Coding
BCS304 Module 5 slides DSA notes 3rd sem
Welcome to International Journal of Engineering Research and Development (IJERD)
Combining text and pattern preprocessing in an adaptive dna pattern matcher
Analytical Study of AES and Proposed Variant with Enhance Block Length and Ke...
A Novel Design For Generating Dynamic Length Message Digest To Ensure Integri...
A Cryptographic Hardware Revolution in Communication Systems using Verilog HDL
I1803014852
Advance algorithm hashing lec II

Recently uploaded (20)

PDF
Co-training pseudo-labeling for text classification with support vector machi...
PDF
Flame analysis and combustion estimation using large language and vision assi...
PDF
Advancing precision in air quality forecasting through machine learning integ...
DOCX
search engine optimization ppt fir known well about this
PPTX
AI-driven Assurance Across Your End-to-end Network With ThousandEyes
PPTX
Microsoft User Copilot Training Slide Deck
PDF
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
PPTX
Configure Apache Mutual Authentication
PDF
5-Ways-AI-is-Revolutionizing-Telecom-Quality-Engineering.pdf
PDF
Transform-Quality-Engineering-with-AI-A-60-Day-Blueprint-for-Digital-Success.pdf
PDF
Improvisation in detection of pomegranate leaf disease using transfer learni...
PDF
Convolutional neural network based encoder-decoder for efficient real-time ob...
PDF
Taming the Chaos: How to Turn Unstructured Data into Decisions
PDF
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
PPTX
Build Your First AI Agent with UiPath.pptx
PDF
4 layer Arch & Reference Arch of IoT.pdf
PDF
sustainability-14-14877-v2.pddhzftheheeeee
PDF
NewMind AI Weekly Chronicles – August ’25 Week IV
PDF
Data Virtualization in Action: Scaling APIs and Apps with FME
PDF
The-Future-of-Automotive-Quality-is-Here-AI-Driven-Engineering.pdf
Co-training pseudo-labeling for text classification with support vector machi...
Flame analysis and combustion estimation using large language and vision assi...
Advancing precision in air quality forecasting through machine learning integ...
search engine optimization ppt fir known well about this
AI-driven Assurance Across Your End-to-end Network With ThousandEyes
Microsoft User Copilot Training Slide Deck
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
Configure Apache Mutual Authentication
5-Ways-AI-is-Revolutionizing-Telecom-Quality-Engineering.pdf
Transform-Quality-Engineering-with-AI-A-60-Day-Blueprint-for-Digital-Success.pdf
Improvisation in detection of pomegranate leaf disease using transfer learni...
Convolutional neural network based encoder-decoder for efficient real-time ob...
Taming the Chaos: How to Turn Unstructured Data into Decisions
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
Build Your First AI Agent with UiPath.pptx
4 layer Arch & Reference Arch of IoT.pdf
sustainability-14-14877-v2.pddhzftheheeeee
NewMind AI Weekly Chronicles – August ’25 Week IV
Data Virtualization in Action: Scaling APIs and Apps with FME
The-Future-of-Automotive-Quality-is-Here-AI-Driven-Engineering.pdf

K mer index of dna sequence based on hash

  • 1. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 DOI:10.5121/ijcsa.2015.5402 19 K-Mer Index Of DNA Sequence Based On Hash Algorithm Jinlin Liu1 , Qiang Chen2 and Chen Zhang3 ] 1 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620,China. 2 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620,China. 3 School of Management, Shanghai University of Engineering Science Shanghai, 201620, China. ABSTRACT K-mer frequency statistics of biological sequences is a very important and important problem in biological information processing. This paper addresses the problem of index k-mer for large scale data reading DNA sequences in a limited memory space and time. Using the hash algorithm to establish index, the index model is set up to base pairing, and get the length of k-mer statistic information quickly, so as to avoid searching all the sequences of the index. At the same time, the program uses hash table to establish index and build search model, and uses the zipper method to resolve the conflict in the case of address conflict. Algorithm of time complexity analysis and experimental results show that compared with the traditional indexing methods, the algorithm of the performance improvement is obvious, and very suitable for to be used in the k-mer length change with a wide range . KEYWORDS K-mer index; hash algorithm; DNA detecting; index model; 1.INTRODUCTION With the rapid development of DNA sequencing technology in recent years, human generated massive biological sequence data, and we need to analyze and process through effective calculation means. Among the numerous biological sequence analysis and processing problems, the k-mer of biological sequence data is a short sequence of DNA sequences of k sequences. 
When the K value is appropriate, sequence k-mer frequency distribution contains all the information in the genome constituting equivalent sequences .So we can learn biological sequences of base distribution characteristics, functions, structures and evolution information by analyzing DNA sequence k-mer distribution and different k-mer information 2.QUESTIONS This paper aims to solve the problem of k-mer index of DNA sequence.According to the given K, 100 million DNA sequences will establish index, Then the computer will read every K length DNA from the start to end for each sequence. Then move on to the next sequence to read again, until the positions of the individual K-mer appeared in the sequence were recorded. Because
  • 2. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 20 DNA sequencing fragments, large scale of data, so we have to handle large data sets under the condition of limited memory and disk space, and make the space complexity and computational complexity as much as possible has been optimized. So we have to solve these problems. Q1. According to the given K to establish index, then search every sequence. Each sequence uses a hash algorithm to encode the base, and then convert the input specific K base fragment into the decimal data, and then match in the 100 million sequence. In the end, the computer output line and column base fragment. Q2. After the index is established, we build the hash table in memory, and every time we traverse, we store the frequency and the position of the k-mer in the hash table. Under the limited memory space, we can traverse a million DNA sequences. 3.PROBLEM ANALYSIS 3.1.problem abstraction. First according to the 100 million genetic sequence, because the length of each gene sequence is 100, so gene sequence is equivalent to a two bit matrix array a, corresponding to the rows of a as: 1-1 000000, it is listed as the 1-100. The problem is abstracted from the matrix A[i][j] analysis, i=1,2... 1000000; j=1,2,... 100. 3.2.Method solution The base species of the sequence are: C, A, G, T. Using the hash algorithm, the four bases are converted into four binary digits, and then the conversion sequence is converted, which is set A=0, C=1, G=2, T=3,and then convert the four numbers to decimal digits in the matching query .Hash value algorithm formula is Hash(value)=value*[4^(k-m-1)], value represents the corresponding value of the character, K represents the length of M, and k-mer represents the position range of the character in the string [0- (m-1)].For example, the sequence k=4 of a given ATCG is converted into the corresponding decimal ATCG=[0* (4^3) +3* (4^2) +2* (4^1) +1* (4^0)]=54. 
The base sequence of each row length of 100 can be converted to a 100-k+1 decimal number. The same principle can be used for the same 1 million line base sequence, you can get the corresponding decimal number and then stored in the two-dimensional array A[i][j].when the same decimal number is matched, the program converts decimal conversion into a four - band form of a corresponding length of K, like the example ATCG form. Then program will print base fragment corresponding row and column labels mark. After the establishment of the index, we use division method to build hash tables in memory, and determine the address of the hash table. The column headers and corresponding location is stored in the hash table every k-mer occurs. The search efficiency of the query million DNA sequences is improved under the limited memory space.
  • 3. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 21 4.MODEL ESTABLISHMENT AND SOLUTION Hash algorithm is the binary value of arbitrary length is mapped into a shorter fixed length of the binary value, this small binary value called hash value. In this paper, according to the principle of hash algorithm, the identity of the four bases of the ACGT respectively 0123, converted to four hexadecimal number is then transformed into a decimal number, let base conversion of decimal number and the first line of 100-k+1 to a decimal number to match, if the base sequence matching, the program will output the row and column label mark. Flow chart as shown below:
  • 4. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 22 4.1.Model two: search model based on hash table The main requirement of this paper is to design hash function, according to the keyword k-mer to build hash table. There are a lot of methods of constructing hash function, digital analysis method, the direct method of definite value, random numbers, random number method is usually used in the key word length, this paper selects division method. The obtained nucleotide sequence of hash values divided by 1000 to take over, get the number as the address of the hash table. All to take over the business of the same number into the bucket, and in each bucket will remainder exists is not the same, but business the same. Therefore, in order to solve the address conflict. The method of the zipper is to resolve the conflict: the nodes of all keywords are synonymous with the same single linked list.. If the selected hash table length is m, the hash table can be defined as an array of pointers consisting of a m pointer T[0..M-1]. All the hash address for the node of I, are inserted into the single T[i] pointer to the single chain table. The initial values of each component in T should be null pointer. In the zipper method, the load factor can be greater than 1, but generally take α less than 1. Hash search: first of all, k-mer as the keyword, and program needs to use the hash function to calculate the address. If the base arrangement is the same as the base sequence of the searched sequence, if the same output of the node is all the information, if the relative should be found, then returns continue to search.
  • 5. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 23
  • 6. 4.2. Model three: analysis of the memory space occupied by the hash table
    Data definition analysis: the int keyword denotes a 32-bit integer whose range is -2147483648 to +2147483647 (inclusive). Each int occupies 4 B. The char type holds an 8-bit character code; each char occupies 1 B.
    Overall data analysis:
    Rows: 1000000, defined as int variables (4 bytes each)
    Columns: 100, defined as char variables (1 byte each)
    In theory, each index occupies a memory space of
    $4^k \times 5$ (B),
    which can also be converted into a memory occupancy of
    $\frac{4^k \times 5}{1024 \times 1024 \times 1024}$ (GB).
    For example, k = 16 gives $4^{16} \times 5$ B $= 20 \times 1024^3$ B $= 20$ GB.
    For different k values, the memory space corresponding to each index is shown in the table below.
    Table 4.1 The memory space
    K     Memory space (GB)
    1     0.00000002
    2     0.00000007
    3     0.00000030
    4     0.00000119
    5     0.00000477
    6     0.00001907
    7     0.00007629
    8     0.00030518
    9     0.00122070
    10    0.00488281
    11    0.01953125
    12    0.07812500
    13    0.31250000
    14    1.25000000
    15    5.00000000
    16    20.00000000
  • 7. 5. RUN RESULTS SHOW
    5.1. The interface
    Figure 5.1 The interface
  • 8. 5.2. Search interface
    Figure 5.2 The search interface
    5.3. File generated results
    The K_mer.txt file is shown in Figure 5.3.
    Figure 5.3 The text file shown
  • 9. 5.4. The results output interface
    Figure 5.4 The output interface
    5.5. The complexity of the algorithm
    (1) Complexity of establishing the index
    Time complexity: O(1) + O(m), where m is the length of the zipper when a conflict occurs, that is, its depth.
    Space complexity: O( )
    (2) Complexity of using the index
    Time complexity: O(1)
    Space complexity: O(1)
    6. CONCLUSIONS
    To solve the problem of k-mer indexing of DNA, three models are proposed: the hash algorithm index model, the hash table query model, and the memory analysis model of the hash table. The design uses the Visual Studio 2010 software to traverse for the optimal results; the memory occupancy is small, the traversal efficiency is high, and the result is accurate. This provides a good solution to the problem of k-mer indexing of DNA.
  • 10. Authors
    Jinlin Liu is currently studying Mechanical and Electronic Engineering at Shanghai University of Engineering Science, China, where he is working towards the Master's degree. His current research interests include FPGA and the design and development of embedded systems.