This document discusses using machine learning classifiers to predict gene function from DNA sequences. It introduces using k-mers to represent DNA sequences as bags of words that can be analyzed using natural language processing and machine learning techniques. The document walks through preparing human, chimpanzee and dog DNA sequence and label data, using k-mers and CountVectorizer to represent the sequences as word counts, splitting the human data into train and test sets, training a multinomial naive Bayes classifier on the k-mer counts, and evaluating the classifier's performance on the test set.
DNA Sequencing With Machine Learning
DNA sequencing and applying a classifier with ML
INTRODUCTION
In medical informatics research, gene sequences are widely used as the basis for classification. One application area of ML is biochemistry. Bioinformatics is an interdisciplinary science that uses computer and information science to understand biological data. One of its most difficult tasks is to distinguish between normal genes and disease-causing genes.
Classifying gene sequences into existing categories is used in genomic research to discover the functions of novel proteins, so it is critical to identify and categorize such genes. We employ ML classification methods to distinguish between infected and normal genes. I will apply a classification model that predicts a gene's function from the DNA of its coding sequence alone.
You will need some libraries, such as numpy and pandas. I will load the human data and read it, which gives us human DNA coding-region sequences and a class label for each one.
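A minimal sketch of the loading step, assuming the data is a tab-separated table with one sequence column and one label column. The column names `sequence` and `class`, the in-memory stand-in for the file, and the rows themselves are all assumptions made for illustration; they are not taken from the slides.

```python
import io

import pandas as pd

# Hypothetical stand-in for a tab-separated data file on disk.
# The column names and the three rows below are made up.
toy_file = io.StringIO(
    "sequence\tclass\n"
    "ATGCATGCAA\t4\n"
    "GGGTTTAAAC\t3\n"
    "ATGAACGAAA\t4\n"
)

# With a real file, pass its path instead of toy_file.
human_data = pd.read_csv(toy_file, sep="\t")
print(human_data.head())
```

The same call, pointed at a different file, would serve for the chimpanzee and dog data.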
I also load and read the data for the chimpanzee and for a more divergent species, the dog. Here are the definitions of the 7 classes and how many examples of each there are in the human training data. They are gene sequence function groups.
Since the sequences are not all the same length, we will apply k-mers to the complete sequences, using a getKmers function.
Now our coding sequence data is converted to lowercase and split up into all possible k-mer words of length 6.
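A minimal sketch of such a k-mer helper. The name `getKmers` follows the slide, but the body and the default `size=6` parameter are an assumed implementation, and the input sequence is made up.

```python
def getKmers(sequence, size=6):
    """Return every overlapping k-mer of length `size`, in lowercase."""
    sequence = sequence.lower()
    # A sequence of length n yields n - size + 1 overlapping words.
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


print(getKmers("ATGCATGCA"))
# ['atgcat', 'tgcatg', 'gcatgc', 'catgca']
```

Because the windows overlap by one base at a time, sequences of different lengths simply produce different numbers of words, which is what makes unequal lengths manageable.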
Since we are going to use scikit-learn natural language processing tools to do the k-mer counting, we now need to convert the list of k-mers for each gene into a string sentence of words that the count vectorizer can use. We can also make a y variable to hold the class labels.
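The conversion into sentences can be sketched as follows. The helper name `kmer_sentence`, the variable names, and the two toy sequences with their labels are hypothetical.

```python
def kmer_sentence(sequence, size=6):
    """Join the overlapping k-mers of a sequence into one space-separated string."""
    seq = sequence.lower()
    words = [seq[i:i + size] for i in range(len(seq) - size + 1)]
    return " ".join(words)


# Made-up sequences and class labels, for illustration only.
toy_sequences = ["ATGCATGCAA", "GGGTTTAAAC"]
toy_labels = [4, 3]

human_texts = [kmer_sentence(s) for s in toy_sequences]
y_human = toy_labels  # one class label per sentence

print(human_texts[0])
# atgcat tgcatg gcatgc catgca atgcaa
```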
We will perform the same steps for chimpanzee and dog. We will then apply the bag-of-words model from NLP using CountVectorizer. This is equivalent to k-mer counting.
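A rough sketch of the bag-of-words step with scikit-learn's `CountVectorizer`. The toy sentences are made up. Fitting the vocabulary on the human data and reusing it for the other species is an assumption about the workflow, and `ngram_range=(4, 4)` reflects the n-gram size of 4 that the slides report from tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up k-mer sentences standing in for the real data.
human_texts = [
    "atgcat tgcatg gcatgc catgca atgcaa",
    "gggttt ggttta gtttaa tttaaa ttaaac",
]
chimp_texts = ["atgcat tgcatg gcatgc catgca atgcag"]

# Each feature is a run of 4 consecutive k-mer words.
cv = CountVectorizer(ngram_range=(4, 4))
X_human = cv.fit_transform(human_texts)  # learn the vocabulary and count
X_chimp = cv.transform(chimp_texts)      # count with the same vocabulary

print(X_human.shape, X_chimp.shape)
```

Reusing one fitted vectorizer keeps the feature columns aligned across species, so a model trained on the human matrix can be applied to the others.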
If we look at the class balance, we can see that we have a relatively balanced dataset.
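One way to inspect the class balance is to count the labels, sketched here with pandas. The label values are invented and say nothing about the real distribution.

```python
import pandas as pd

# Invented labels for the 7 classes (0 to 6), for illustration only.
toy_labels = pd.Series([0, 1, 2, 3, 4, 5, 6, 3, 4, 6, 6])

# Number of examples per class, ordered by class label.
class_counts = toy_labels.value_counts().sort_index()
print(class_counts)
```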
Next, we split the human dataset into a training set and a test set.
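The split can be sketched with scikit-learn's `train_test_split`. The slides do not state the test fraction or a random seed, so `test_size=0.20`, `random_state=42` and the stratified split are assumptions, and the arrays are toy stand-ins for the k-mer count matrix and its labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 20 samples with 2 features each, and two classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 20% of the samples; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```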
A multinomial naive Bayes classifier will be created. I previously did some parameter tuning and found that an n-gram size of 4 (reflected in the CountVectorizer() instance) and a model alpha of 0.1 did the best. Let's look at some model performance metrics, such as the confusion matrix, accuracy, precision, recall and F1 score. We are getting really good results on our unseen data.
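A minimal sketch of training and scoring the classifier, using `alpha=0.1` as reported above. The count matrices and labels are tiny made-up examples, so the scores they produce say nothing about performance on the real data, and the weighted averaging of the metrics is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.naive_bayes import MultinomialNB

# Made-up count features and labels, for illustration only.
X_train = np.array([[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[4, 0, 1], [0, 1, 1]])
y_test = np.array([0, 1])

classifier = MultinomialNB(alpha=0.1)  # alpha is the smoothing parameter
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(confusion_matrix(y_test, y_pred))
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")
print(accuracy, precision, recall, f1)
```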