
School of Computing Science and Engineering

VIT Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127
FALL SEM 22-23

CSE3068- Sequential and Spatial Data Mining

DIGITAL ASSIGNMENT- 3

"Unlocking Biological Mysteries: Applying Machine


Learning to DNA Sequence classification Across Species"
By
19MIA1025 Arya Dadhich
19MIA1049 Sayantan Nandy

M.Tech Integrated CSE with Specialization in Business Analytics

Submitted to

Prof. Vinothini A,
Assistant Professor Senior,
SCOPE, VIT, Chennai
ABSTRACT:
DNA sequencing plays a vital role in modern research. It enables progress in
multiple areas, including genetics, metagenomics, and phylogenetics. DNA
sequencing involves extracting and reading the strands of DNA. This paper
discusses the application of machine learning algorithms to demystifying DNA
sequencing. DNA sequencing is a complex process that involves identifying the
order of nucleotides in a DNA molecule. With the advent of next-generation
sequencing technologies, the cost of sequencing has decreased significantly,
making it more accessible. However, the process still generates a large amount
of data that needs to be processed and analysed, and machine learning
algorithms offer a way to make sense of this data. This paper explores the
different machine learning algorithms used in DNA sequencing, how they can be
applied to improve accuracy and reduce errors, and the challenges of using
machine learning in DNA sequencing. The aim of our proposed system is to
implement a better prediction model for DNA research and to obtain the most
accurate results from it. The model considered, Naive Bayes, is among the most
widely used and well-established machine learning methods; it achieved an
accuracy of 98.00 percent. The model produces good results on the human data.
It also performs well on the chimpanzee data, because chimpanzees and humans
share a similar genetic hierarchy. Performance on the dog data is not quite as
good, because the dog genome has diverged further from the human genome than
the chimpanzee genome has.
INTRODUCTION:
A genome is the complete collection of DNA in an organism. All living species
possess a genome, but genomes differ considerably in size. The human genome,
for instance, is arranged into 23 chromosomes, a little like an encyclopaedia
organized into 23 volumes. And if you counted all the characters (individual
DNA "base pairs"), there would be more than 6 billion in each human genome.
So, it is a huge compilation.
A human genome has about 6 billion characters or letters. If you think of the
genome (the complete DNA sequence) as a book, it is a book of about 6 billion
letters drawn from "A", "C", "G" and "T". Everyone has a unique genome;
nevertheless, scientists find that most parts of human genomes are alike.
As a data-driven science, genomics extensively utilizes machine learning to
capture dependencies in data and infer new biological hypotheses. Nonetheless,
extracting new insights from the exponentially increasing volume of genomics
data requires more powerful machine learning models. By efficiently
leveraging large data sets, deep learning has transformed fields such as
computer vision and natural language processing. It has become the method of
choice for many genomics modelling tasks, including predicting the influence
of genetic variation on gene regulatory mechanisms such as DNA accessibility
and splicing.
So, in this paper, we will understand how to represent a DNA sequence and how
machine learning algorithms can be used to build a prediction model on DNA
sequence data.
LITERATURE SURVEY:
1. S. Bai and S. -X. Bai, "The Maximal Frequent Pattern mining of DNA
sequence," 2009 IEEE International Conference on Granular Computing, 2009,
pp. 23-26, doi: 10.1109/GRC.2009.5255169.
2. T. Zhu and S. Bai, "A Parallel Mining Algorithm for Closed Sequential
Patterns," 21st International Conference on Advanced Information Networking
and Applications Workshops (AINAW'07), 2007, pp. 392-395, doi:
10.1109/AINAW.2007.40.
3. P. Hoffman, G. Grinstein, K. Marx, I. Grosse and E. Stanley, "DNA visual
and analytic data mining," Proceedings. Visualization '97 (Cat. No.
97CB36155), 1997, pp. 437-441, doi: 10.1109/VISUAL.1997.663916.
4. Hemalatha Gunasekaran, K. Ramalakshmi, A. Rex Macedo Arokiaraj, S.
Deepa Kanmani, Chandran Venkatesan, C. Suresh Gnana Dhas, "Analysis of
DNA Sequence Classification Using CNN and Hybrid Models", Computational
and Mathematical Methods in Medicine, vol. 2021, Article ID 1835056, 12
pages, 2021. https://doi.org/10.1155/2021/1835056
5. Shadman Shadab, Md Tawab Alam Khan, Nazia Afrin Neezi, Sheikh
Adilina, Swakkhar Shatabda, DeepDBP: Deep neural networks for
identification of DNA-binding proteins, Informatics in Medicine Unlocked.
6. Ashlock, Daniel & Warner, Elizabeth. (2008). Side effect machines for
sequence classification. Canadian Conference on Electrical and Computer
Engineering. 001453 - 001456. 10.1109/CCECE.2008.4564782.
7. Sansom, Clare. (2000). Database searching with DNA and protein sequences:
An introduction. Briefings in bioinformatics.
8. Nandy, Ashesh, Marissa Harle and Subhash C. Basak. “Mathematical
descriptors of DNA sequences: development and applications.” Arkivoc 2006
9. Yang Aimin, Zhang Wei, Wang Jiahao, Yang Ke, Han Yang, Zhang Limin,
"Review on the Application of Machine Learning Algorithms in the Sequence
Data Mining of DNA."
DATASET SEQUENCE:


Loading Dataset:
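The dataset contains DNA sequences labelled with a gene-family class for three
species: human (4380 genes), chimpanzee (1682 genes), and dog (820 genes). A
minimal loading sketch follows; the tab-separated file names and the column
names ("sequence", "class") are assumptions about how the data is stored, not
details confirmed by the assignment.

    import pandas as pd

    # Assumed layout: tab-separated files with a raw DNA string in a
    # "sequence" column and a gene-family label in a "class" column.
    human_data = pd.read_csv("human_data.txt", sep="\t")
    chimp_data = pd.read_csv("chimp_data.txt", sep="\t")
    dog_data = pd.read_csv("dog_data.txt", sep="\t")

    print(human_data.shape)  # expected (4380, 2) given the counts reported later
    print(human_data.head())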
PROPOSED SYSTEM:
• PREPROCESSING
There are three general approaches to encoding sequence data:
1. Ordinal encoding DNA sequence data
In this approach, we encode each nitrogenous base as an ordinal value. For
example, "ATGC" becomes [0.25, 0.5, 0.75, 1.0]. Any other symbol, such as "N",
can be assigned its own ordinal value (for example, 0).
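A minimal ordinal-encoding sketch using the mapping above, with 0 as the
assumed fallback for unrecognized symbols:

    import numpy as np

    def ordinal_encode(sequence):
        # A=0.25, T=0.5, G=0.75, C=1.0, per the example above;
        # any other symbol (e.g. "N") falls back to 0.0.
        mapping = {"A": 0.25, "T": 0.5, "G": 0.75, "C": 1.0}
        return np.array([mapping.get(base, 0.0) for base in sequence.upper()])

    print(ordinal_encode("ATGCN"))  # [0.25 0.5  0.75 1.   0.  ]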
2. One-hot encoding DNA Sequence
Here we use one-hot encoding to represent the DNA sequence. This is widely
used in deep learning methods and lends itself well to algorithms like
convolutional neural networks. In this encoding, "ATGC" becomes [0,0,0,1],
[0,0,1,0], [0,1,0,0], [1,0,0,0]. These one-hot encoded vectors can either be
concatenated or stacked into a 2-dimensional array.
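A minimal one-hot sketch using the bit order shown in the example (that order
is the example's own convention, not a standard); an all-zero row is our
assumed handling for other symbols:

    import numpy as np

    def one_hot_encode(sequence):
        # One row per base, shape (len(sequence), 4), in the example's bit order.
        mapping = {"A": [0, 0, 0, 1], "T": [0, 0, 1, 0],
                   "G": [0, 1, 0, 0], "C": [1, 0, 0, 0]}
        # Unrecognized symbols (e.g. "N") become an all-zero row.
        return np.array([mapping.get(base, [0, 0, 0, 0])
                         for base in sequence.upper()])

    print(one_hot_encode("ATGC"))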
3. DNA sequence as a “language”, known as k-mer counting
DNA and protein sequences can be seen as the language of life. This language
encodes instructions as well as functions for the molecules found in all life
forms. The resemblance continues: the genome is the book; subsequences (genes
and gene families) are sentences and chapters; k-mers and peptides are words;
and nucleotide bases and amino acids are the alphabet. Since the resemblance
is so strong, it stands to reason that natural language processing (NLP)
techniques should also apply to the natural language of DNA and protein
sequences.
The method we use here is simple and straightforward. We first take the long
biological sequence and break it down into overlapping "words" of length k.
For example, if we use "words" of length 6 (hexamers), "ATGCATGCA" becomes:
'ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA'. Hence our example sequence is broken
down into 4 hexamer words.
In genomics, we refer to these manipulations as "k-mer counting": counting the
occurrences of each possible k-mer sequence. Python natural language
processing tools make this easy (a short sketch follows this list).
Tuning the word length and the amount of overlap determines how much DNA
sequence information is retained and how large the vocabulary becomes. For
example, if you use words of length 6 over an alphabet of 4 letters, you have
a vocabulary of 4^6 = 4096 possible words. You can then go on and create a
bag-of-words model just as you would in NLP.
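A minimal k-mer sketch, assuming the hexamer (k = 6) setting from the example
above; the function name get_kmers is our own choice:

    def get_kmers(sequence, k=6):
        # Break the sequence into overlapping, lower-case k-mer "words".
        return [sequence[i:i + k].lower()
                for i in range(len(sequence) - k + 1)]

    words = get_kmers("ATGCATGCA")
    print(" ".join(words))  # atgcat tgcatg gcatgc catgca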
• FEATURE EXTRACTION
Preprocessing leaves us with a list of k-mer "words" for each sequence. We
join the "words" into a "sentence" and then apply standard natural language
processing methods to these "sentences". Both the word length and the amount
of overlap can be tuned, controlling how much sequence information is captured
and how large the vocabulary grows. We build the feature matrix using a count
vectorizer.
CountVectorizer is a technique in Natural Language Processing (NLP) that
converts text data into a matrix of token counts. It works by first tokenizing the
text, then counting the frequency of each word in the text, creating a vocabulary
of unique words, and finally creating a matrix where the rows represent
documents and the columns represent unique words in the vocabulary, with the
cells representing the frequency of each word in each document. The resulting
matrix can be used as input for machine learning models.
We create the bag-of-words model using CountVectorizer(); this is equivalent
to k-mer counting. An n-gram size of 4 was determined by prior testing.
CountVectorizer converts our k-mer words into uniform-length numerical vectors
that hold the count of every k-mer n-gram in the vocabulary.
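A minimal sketch of this step, building on the loading and get_kmers sketches
above; the variable names (human_texts, chimp_texts, dog_texts) are our own:

    from sklearn.feature_extraction.text import CountVectorizer

    # Join each sequence's k-mer words into one space-separated "sentence".
    human_texts = [" ".join(get_kmers(s)) for s in human_data["sequence"]]
    chimp_texts = [" ".join(get_kmers(s)) for s in chimp_data["sequence"]]
    dog_texts = [" ".join(get_kmers(s)) for s in dog_data["sequence"]]

    # ngram_range=(4, 4): each feature is a run of 4 consecutive k-mer words,
    # matching the 4-gram setting described above.
    cv = CountVectorizer(ngram_range=(4, 4))
    X_human = cv.fit_transform(human_texts)  # vocabulary fitted on human data

    # Chimp and dog are transformed with the same fitted vocabulary, so all
    # species share one feature space (hence "the same number of features").
    X_chimp = cv.transform(chimp_texts)
    X_dog = cv.transform(dog_texts)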
So, for humans we have 4380 genes converted into uniform length feature
vectors of 4-gram k-mer (length 6) counts. For chimp and dog, we have the
same number of features with 1682 and 820 genes respectively.
Now that we know how to transform our DNA sequences into uniform-length
numerical vectors of k-mer n-gram counts, we can go ahead and build a
classification model that predicts a sequence's function from the sequence
alone.
Here we use the human data to train the model, holding out 20% of the human
data to test it. We then challenge the model's generalizability by predicting
sequence function in the other species (chimpanzee and dog).
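A minimal split sketch, assuming the label column is named "class" (an
assumption about the dataset layout):

    from sklearn.model_selection import train_test_split

    y_human = human_data["class"].values  # assumed label column

    # Hold out 20% of the human data for testing, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X_human, y_human, test_size=0.20, random_state=42)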
• MODEL BUILDING
We will create a multinomial naive Bayes classifier.
Naive Bayes classifier is based on Bayes' theorem, which can be expressed
mathematically as follows:
P(C | X) = P(X | C) * P(C) / P(X)
where,
P(C | X) is the probability of class C given the input X
P(X | C) is the conditional probability of input X given class C
P(C) is the prior probability of class C
P(X) is the marginal probability of input X
The Naive Bayes classifier makes the naive assumption that the features X are
independent given the class C, i.e.,
P(X | C) = P(X1 | C) * P(X2 | C) * ... * P(Xn | C)
where, X1, X2, ..., Xn are the features of input X.
Using this assumption, we can rewrite Bayes' theorem as:
P(C | X) = P(C) * P(X1 | C) * P(X2 | C) * ... * P(Xn | C) / P(X)
Now, we can classify a new input X by selecting the class C that maximizes
P(C | X). Since the denominator P(X) is the same for every class, it can be
dropped, giving:
C_hat = argmax over C of P(C) * P(X1 | C) * P(X2 | C) * ... * P(Xn | C)
where, C_hat is the predicted class. To estimate the probabilities P(C) and
P(Xi | C), we can use the maximum likelihood estimator or other methods. For
example, we can count the number of occurrences of each class and feature in
the training data and normalize them by the total number of examples or
features. We can also use smoothing techniques, such as Laplace smoothing or
Bayesian smoothing, to handle zero counts and avoid overfitting.
The Naive Bayes classifier is a simple and effective way to perform
classification based on probabilistic reasoning and can work well in many
situations, especially when the independence assumption holds approximately.
However, it may not be suitable for complex or correlated features or when the
class distribution is highly skewed.
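A minimal sketch of the classifier, following the split above. The smoothing
value alpha=0.1 is an assumed choice (it applies the additive/Laplace
smoothing discussed above), not one fixed by the assignment:

    from sklearn.naive_bayes import MultinomialNB

    # Multinomial naive Bayes over k-mer n-gram counts, with additive smoothing.
    classifier = MultinomialNB(alpha=0.1)
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)         # held-out human data
    y_pred_chimp = classifier.predict(X_chimp)  # cross-species generalization
    y_pred_dog = classifier.predict(X_dog)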
• MODEL EVALUATION
We evaluate the models using confusion matrices.
A confusion matrix is a table that summarizes the performance of a
classification model by comparing its predicted labels with the actual labels of a
test dataset. Mathematically, it can be represented as follows:
                      Actual Positive        Actual Negative
Predicted Positive    True Positive (TP)     False Positive (FP)
Predicted Negative    False Negative (FN)    True Negative (TN)
where, the rows represent the predicted labels and the columns represent the
actual labels. The four elements of the confusion matrix are:
True Positive (TP): The number of examples that are correctly predicted as
positive. This means the model predicted the example as positive, and the actual
label is also positive.
False Positive (FP): The number of examples that are incorrectly predicted as
positive. This means the model predicted the example as positive, but the actual
label is negative.
False Negative (FN): The number of examples that are incorrectly predicted as
negative. This means the model predicted the example as negative, but the
actual label is positive.
True Negative (TN): The number of examples that are correctly predicted as
negative. This means the model predicted the example as negative, and the
actual label is also negative.
Using these elements, we can calculate various performance metrics of a
classification model, such as accuracy, precision, recall, and F1 score. For
example,
Accuracy: The proportion of correctly classified examples out of the total
number of examples.
accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision: The proportion of true positives out of the total number of positive
predictions.
precision = TP / (TP + FP)
Recall: The proportion of true positives out of the total number of actual
positive examples.
recall = TP / (TP + FN)
F1 score: The harmonic mean of precision and recall.
F1 = 2 * (precision * recall) / (precision + recall)
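A minimal evaluation sketch following the classifier above; the "weighted"
averaging is our assumed choice for this multi-class problem:

    import pandas as pd
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    # Confusion matrix for the held-out human test set
    # (rows: actual labels, columns: predicted labels).
    print(pd.crosstab(pd.Series(y_test, name="Actual"),
                      pd.Series(y_pred, name="Predicted")))

    # Accuracy, precision, recall, and F1 as defined above; "weighted"
    # averages the per-class scores, weighted by class frequency.
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted")
    print("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f"
          % (accuracy, precision, recall, f1))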

Confusion Matrix for Chimpanzee:

Confusion Matrix for Human:


Confusion Matrix for Dog:

RESULTS:

The model produces good results on the human data. It also performs well on
the chimpanzee data, because chimpanzees and humans share a similar genetic
hierarchy. Performance on the dog data is not quite as good, because the dog
genome has diverged further from the human genome than the chimpanzee genome
has.
