CSE3068-Sequential and Spatial Data Mining: School of Computing Science and Engineering
CSE3068-Sequential and Spatial Data Mining: School of Computing Science and Engineering
VIT Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127
FALL SEM 22-23
DIGITAL ASSIGNMENT- 3
Submitted to
Prof. Vinothini A,
Assistant Professor Senior,
SCOPE, VIT, Chennai
ABSTRACT:
DNA Sequencing plays a vital role in the modern research. It allows a large
number of multiple areas to progress, as well as genetics, meta-genetics, and
phylogenetics. DNA Sequencing involves extracting and reading the strands of
DNA. this paper discusses the application of machine learning algorithms in
demystifying DNA sequencing. DNA sequencing is a complex process that
involves identifying the order of nucleotides in a DNA molecule. With the
advent of next-generation sequencing technologies, the cost of sequencing has
decreased significantly, making it more accessible. However, the process still
generates a large amount of data that needs to be processed and analysed.
Machine learning algorithms offer a way to analyse and make sense of this data.
This paper explores the different machine learning algorithms used in DNA
sequencing and how they can be applied to improve accuracy and reduce errors.
The paper also discusses the challenges of using machine learning in DNA
sequencing. The aim of our proposed system is to implement a better prediction
model for DNA research and get the most accurate results out of it. The
“machine learning models” which are being considered are the most used and
reputed. The proposed models include “Naive Bayes”. The Naive Bayes method
gave greater accuracy of 98.00 percent in machine learning. The model seems to
produce good results on human data. It also does on Chimpanzee which is
because the Chimpanzee and humans share the same genetic hierarchy. The
performance of the dog is not quite as good which is because the dog is more
diverging from humans than the chimpanzee.
INTRODUCTION:
A genome is a complete collection of DNA in an organism. All living species
possess a genome, but they differ considerably in size. The human genome, for
instance, is arranged into 23 chromosomes, which is a little bit like an
encyclopaedia being organized into 23 volumes. And if you counted all the
characters (individual DNA “base pairs”), there would be more than 6 billion in
each human genome. So, it’s a huge compilation.
A human genome has about 6 billion characters or letters. If you think the
genome (the complete DNA sequence) is like a book, it is a book about 6 billion
letters of “A”, “C”, “G” and “T”. Now, everyone has a unique genome.
Nevertheless, scientists find most parts of the human genomes are alike to each
other.
As a data-driven science, genomics extensively utilizes machine learning to
capture dependencies in data and infer new biological hypotheses. Nonetheless,
the ability to extract new insights from the exponentially increasing volume of
genomics data requires more powerful machine learning models. By efficiently
leveraging large data sets, deep learning has reconstructed fields such as
computer vision and natural language processing. It has become the method of
preference for many genomics modelling tasks, including predicting the
influence of genetic variation on gene regulatory mechanisms such as DNA
receptiveness and splicing.
So, in this article, we will understand how to interpret a DNA structure and how
machine learning algorithms can be used to build a prediction model on DNA
sequence data.
LITERATURE SURVEY:
1. S. Bai and S. -X. Bai, "The Maximal Frequent Pattern mining of DNA
sequence," 2009 IEEE International Conference on Granular Computing, 2009,
pp. 23-26, doi: 10.1109/GRC.2009.5255169.
2. T. Zhu and S. Bai, "A Parallel Mining Algorithm for Closed Sequential
Patterns," 21st International Conference on Advanced Information Networking
and Applications Workshops (AINAW'07), 2007, pp. 392-395, doi:
10.1109/AINAW.2007.40.
3. P. Hoffman, G. Grinstein, K. Marx, I. Grosse and E. Stanley, "DNA visual
and analytic data mining," Proceedings. Visualization '97 (Cat. No.
97CB36155), 1997, pp. 437-441, doi: 10.1109/VISUAL.1997.663916.
4. Hemalatha Gunasekaran, K. Ramalakshmi, A. Rex Macedo Arokiaraj, S.
Deepa Kanmani, Chandran Venkatesan, C. Suresh Gnana Dhas, "Analysis of
DNA Sequence Classification Using CNN and Hybrid Models", Computational
and Mathematical Methods in Medicine, vol. 2021, Article ID 1835056, 12
pages, 2021. https://doi.org/10.1155/2021/1835056
5. Shadman Shadab, Md Tawab Alam Khan, Nazia Afrin Neezi, Sheikh
Adilina, Swakkhar Shatabda, DeepDBP: Deep neural networks for
identification of DNA-binding proteins, Informatics in Medicine Unlocked.
6. Ashlock, Daniel & Warner, Elizabeth. (2008). Side effect machines for
sequence classification. Canadian Conference on Electrical and Computer
Engineering. 001453 - 001456. 10.1109/CCECE.2008.4564782.
7. Sansom, Clare. (2000). Database searching with DNA and protein sequences:
An introduction. Briefings in bioinformatics.
8. Nandy, Ashesh, Marissa Harle and Subhash C. Basak. “Mathematical
descriptors of DNA sequences: development and applications.” Arkivoc 2006
9. P. Hoffman, G. Grinstein, K. Marx, I. Grosse and E. Stanley, "DNA visual
and analytic data mining,"
10.Yang Aimin, Zhang Wei, Wang Jiahao, Yang Ke, Han Yang, Zhang
Limin“Review on the Application of Machine Learning Algorithms in the
Sequence Data Mining of DNA.
DATASET SEQUENCE:
• k-mer counting:
It returns a list of k-mer “words.” You can then join the “words” into a
“sentence”, then apply your favorite natural language processing methods on the
“sentences” as you normally would.
Loading Dataset:
PROPOSED SYSTEM:
• PREPROCESSING
There are 3 general approaches to encode sequence data:
1. Ordinal encoding DNA sequence data
In this approach, we need to encode each nitrogen bases as an ordinal value. For
example, “ATGC” becomes [0.25, 0.5, 0.75, 1.0]. Any other base such as “N”
can be a 0rdinal encoding DNA Sequence.
2. One-hot encoding DNA Sequence
Use one-hot encoding to represent the DNA sequence. This is widely used in
deep learning methods and lends itself well to algorithms like convolutional
neural networks. In this example, “ATGC” would become [0,0,0,1], [0,0,1,0],
[0,1,0,0], [1,0,0,0]. And these one-hot encoded vectors can either be
concatenated or turned into 2-dimensional arrays.
3. DNA sequence as a “language”, known as k-mer counting
DNA and protein sequences can be seen as the language of life. The language
encodes instructions as well as functions for the molecules that are found in all
life forms. The sequence language resemblance continues with the genome as
the book, subsequences (genes and gene families) are sentences and chapters, k-
mers and peptides are words, and nucleotide bases and amino acids are the
alphabets. Since the relationship seems so likely, it stands to reason that the
natural language processing(NLP) should also implement the natural language
of DNA and protein sequences.
The method we use here is manageable and easy. We first take the long
biological sequence and break it down into k-mer length overlapping “words”.
RESULTS:
The model seems to produce good results on human data. It also does
on Chimpanzee which is because the Chimpanzee and humans share
the same genetic hierarchy. The performance of the dog is not quite as
good which is because the dog is more diverging from humans than
the chimpanzee.