DNA Lossless Differential Compression Algorithm based on Similarity of Genomic Sequence Database

Afify, Heba; Islam, Muhammad; Wahed, Manal Abdel

Computer Science > Data Structures and Algorithms

arXiv:1109.0094 (cs)

[Submitted on 1 Sep 2011]

Abstract: Modern biological science produces vast amounts of genomic sequence data, which fuels the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques from information theory are usually regarded as tools for data communication and storage. In recent years, however, substantial effort has gone into applying textual data compression techniques to computational biology tasks, ranging from storage and indexing of large datasets to comparison of genomic databases. This paper presents a differential compression algorithm that produces difference sequences according to an op-code table in order to optimize the compression of homologous sequences in a dataset. Instead of storing each sequence individually, the stored data consist of a reference sequence, the set of differences, and the locations of those differences. The algorithm does not require a priori knowledge of the statistics of the sequence set. Applied to three different datasets of genomic sequences, it achieved up to a 195-fold compression rate, corresponding to 99.4% space saving.
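The storage layout described in the abstract (one reference sequence plus, for each other sequence, a list of differences and their locations) can be illustrated with a minimal Python sketch. The abstract does not reproduce the paper's op-code table, so the single-letter codes `R`, `D`, and `I` below, the function names `encode` and `decode`, and the use of the standard-library `difflib.SequenceMatcher` to find differences are all assumptions made for illustration, not the authors' method.

```python
from difflib import SequenceMatcher


def encode(reference, target):
    """Describe target as a list of edits against reference.

    Each edit is (op, position, length, payload), where position and length
    refer to the reference. The op codes are hypothetical stand-ins for the
    paper's op-code table: R = replace, D = delete, I = insert.
    """
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical to the reference, nothing to store
        if tag == "replace":
            edits.append(("R", i1, i2 - i1, target[j1:j2]))
        elif tag == "delete":
            edits.append(("D", i1, i2 - i1, ""))
        elif tag == "insert":
            edits.append(("I", i1, 0, target[j1:j2]))
    return edits


def decode(reference, edits):
    """Rebuild the original sequence from the reference and its edit list."""
    pieces = []
    cursor = 0  # next unread position in the reference
    for _op, pos, length, payload in edits:
        pieces.append(reference[cursor:pos])  # unchanged stretch
        pieces.append(payload)                # substituted or inserted bases
        cursor = pos + length                 # skip replaced or deleted bases
    pieces.append(reference[cursor:])
    return "".join(pieces)


# Store one reference in full and only the differences for the rest.
reference = "ACGTACGTACGTTTGACA"
homologs = {
    "seq1": "ACGTACGTACGTTTGACA",  # identical to the reference
    "seq2": "ACGTACCTACGTTTGACA",  # one substitution
    "seq3": "ACGTACGTTTGACAGG",    # a deletion and a trailing insertion
}
database = {name: encode(reference, seq) for name, seq in homologs.items()}

# Lossless: every sequence is recovered exactly.
restored = {name: decode(reference, edits) for name, edits in database.items()}
```

In this sketch a sequence identical to the reference costs an empty edit list, which is why highly similar (homologous) sequences compress well under a differential scheme. A real implementation would also need a compact binary encoding of the edits, which is omitted here.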

Subjects:	Data Structures and Algorithms (cs.DS); Computational Engineering, Finance, and Science (cs.CE); Software Engineering (cs.SE)
Cite as:	arXiv:1109.0094 [cs.DS]
	(or arXiv:1109.0094v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.1109.0094

Submission history

From: Heba Affify
[v1] Thu, 1 Sep 2011 05:39:35 UTC (117 KB)
