DNA Sequencing With Machine Learning

This document discusses using machine learning classifiers to predict gene function from DNA sequences. It introduces using k-mers to represent DNA sequences as bags of words that can be analyzed using natural language processing and machine learning techniques. The document walks through preparing human, chimpanzee and dog DNA sequence and label data, using k-mers and CountVectorizer to represent the sequences as word counts, splitting the human data into train and test sets, training a multinomial naive Bayes classifier on the k-mer counts, and evaluating the classifier's performance on the test set.

DNA sequencing and applying a classifier with ML

INTRODUCTION

In medical informatics research, genetic sequences are widely used as a basis for classification. One of the application areas of ML is biochemistry. Bioinformatics is an interdisciplinary science that uses computer and information science to understand biological data. One of its most difficult tasks is to distinguish between normal genes and disease-causing genes.

The classification of gene sequences into existing categories is used in genomic research to discover the functions of novel proteins. As a result, it is critical to identify and categorize such genes. We employ ML classification methods to distinguish between disease-causing and normal genes.

I will apply a classification model that can predict a gene's function based on the DNA of its coding sequence alone.

You will need some libraries, such as numpy and pandas.
I will load the human data, which contains DNA sequences of coding regions and a class label for each sequence.

I also load data for the chimpanzee and for a more divergent species, the dog.
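
Loading the data could look roughly like the sketch below. It assumes each species has a tab-separated file with one DNA sequence and one class label per row. The column names `sequence` and `class`, the helper `load_dna_table` and the file name in the comment are assumptions for illustration, not taken from the slides. A small in-memory table stands in for the real file.

```python
import io

import pandas as pd

# Toy stand-in for a tab-separated file: one coding sequence and one class label per row.
# The column names "sequence" and "class" are assumptions.
toy_file = io.StringIO(
    "sequence\tclass\n"
    "ATGCCCCAACTAAATACTACCGTATGGCCCACCATAATTACCCCCA\t4\n"
    "ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAG\t3\n"
)


def load_dna_table(source):
    """Read a tab-separated table of sequences and labels into a DataFrame."""
    return pd.read_csv(source, sep="\t")


human = load_dna_table(toy_file)
# With real data, pass a file path instead, e.g. load_dna_table("human_data.txt")
# (hypothetical file name), and repeat the call for the chimpanzee and dog files.
print(human.head())
```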
Here are the definitions for each of the 7 classes and how many of each there are in the human training data. They are gene sequence function groups.

Since the sequences are not of equal length, we will apply k-mers to the complete sequences, using a getKmers function.

Now our coding sequence data is changed to lowercase and split up into all possible k-mer words of length 6.
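
A minimal sketch of such a k-mer function is shown below. The slides name the function but its body is not visible here, so this implementation is an assumption: it lowercases the sequence and slides a window of length 6 along it one base at a time.

```python
def getKmers(sequence, size=6):
    """Return all overlapping k-mers of length `size`, in lowercase."""
    sequence = sequence.lower()
    # A sequence of length n yields n - size + 1 overlapping words.
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


words = getKmers("ATGCATGCA")
print(words)
```

Because the window moves one base at a time, sequences of different lengths simply produce different numbers of words, which is why no padding or truncation is needed.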

Since we are going to use scikit-learn natural language processing tools to do the k-mer counting, we now need to convert the list of k-mers for each gene into a string sentence of words that the count vectorizer can use.
We can also make a y variable to hold the class labels.
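
The conversion to sentences and the label vector can be sketched as follows. The toy DataFrame, the column names `sequence`, `class` and `words`, and the variable names are assumptions for illustration. The k-mer helper is repeated so the block runs on its own.

```python
import pandas as pd


def getKmers(sequence, size=6):
    """Return all overlapping k-mers of length `size`, in lowercase."""
    sequence = sequence.lower()
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


# Toy data standing in for the human table (column names are assumptions).
human = pd.DataFrame({"sequence": ["ATGCATGCA", "GGGTTTAAACCC"], "class": [4, 3]})

# One list of k-mer words per gene.
human["words"] = human["sequence"].apply(getKmers)

# Join each list into a single space-separated "sentence" for the vectorizer.
human_texts = [" ".join(words) for words in human["words"]]

# Class labels as a NumPy array.
y_human = human["class"].values

print(human_texts[0])
print(y_human)
```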

We will perform the same steps for chimpanzee and dog.
We will apply the bag-of-words model from NLP using CountVectorizer. This is equivalent to k-mer counting.
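
A sketch of the bag-of-words step, assuming the k-mer sentences are already built. The toy sentences and variable names are assumptions. The n-gram size of 4 follows the tuning note in these slides. The vectorizer is fitted on the human data only and then reused, so the other species are mapped onto the same feature columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy k-mer sentences standing in for the real data.
human_texts = [
    "atgcat tgcatg gcatgc catgca atgcaa",
    "gggttt ggttta gtttaa tttaaa ttaaac",
]
chimp_texts = ["atgcat tgcatg gcatgc catgca atgcat"]

# Each feature is a run of 4 consecutive k-mer words.
cv = CountVectorizer(ngram_range=(4, 4))

# Learn the vocabulary from the human sentences and count occurrences.
X_human = cv.fit_transform(human_texts)

# Reuse the same vocabulary for another species; unseen n-grams are ignored.
X_chimp = cv.transform(chimp_texts)

print(X_human.shape, X_chimp.shape)
```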

If we look at the class balance, we can see that we have a relatively balanced dataset.
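
One way to check the class balance is to count the rows per label, as in this sketch. The toy labels are made up for illustration and do not reflect the real class counts.

```python
import pandas as pd

# Made-up labels for illustration; the real data has 7 classes (0 to 6).
labels = pd.Series([0, 1, 2, 3, 4, 5, 6, 6, 4, 3], name="class")

# Number of sequences per class, ordered by class label.
class_counts = labels.value_counts().sort_index()
print(class_counts)
```

Calling `class_counts.plot.bar()` would draw the same counts as a bar chart if matplotlib is installed.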

Splitting the human dataset into the training set and test set.
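
The split can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio and the random seed below are assumptions, since the slides do not state them, and the toy arrays stand in for the k-mer count matrix and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (20 samples, 2 features) and labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 20% of the samples for testing (assumed ratio); fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print(X_train.shape, X_test.shape)
```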

A multinomial naive Bayes classifier will be created. I previously did some parameter tuning and found that an n-gram size of 4 (reflected in the CountVectorizer() instance) and a model alpha of 0.1 worked best.
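
Training the classifier with the tuned alpha might look like this sketch. The tiny count matrix and the variable names are assumptions standing in for the real k-mer counts.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: rows are genes, columns are n-gram counts.
X_train = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
y_train = np.array([0, 0, 1, 1])

# alpha is the additive smoothing parameter; 0.1 is the value reported in the slides.
classifier = MultinomialNB(alpha=0.1)
classifier.fit(X_train, y_train)

# Predict the class of two unseen count vectors.
X_new = np.array([[4, 0, 0], [0, 5, 1]])
y_pred = classifier.predict(X_new)
print(y_pred)
```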
Let's look at some model performance metrics, such as the confusion matrix, accuracy, precision, recall and F1 score. We are getting really good results on our unseen data.
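
These metrics can be computed as in the sketch below. The helper name `get_metrics`, the weighted averaging across classes and the toy labels are assumptions. The numbers it prints are for the toy labels only, not the results of the model in these slides.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def get_metrics(y_true, y_pred):
    """Return accuracy plus class-weighted precision, recall and F1."""
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
    recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
    f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
    return accuracy, precision, recall, f1


# Toy true and predicted labels for illustration only.
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(
    pd.Series(y_test, name="Actual"), pd.Series(y_pred, name="Predicted")
)
print(confusion)

accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The same helper can be applied to the chimpanzee and dog predictions to see how well the human-trained model carries over to other species.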
THANK YOU
