Bio Report El

This document describes a study on classifying genes from DNA sequences using machine learning. It discusses: 1) Extracting features from DNA sequences and using machine learning models like SVM, Random Forests and Neural Networks to classify genes. 2) Evaluating model performance using metrics like accuracy, precision and recall. 3) Using k-mer counting to extract features from variable length DNA sequences and make them uniform for machine learning. 4) Choosing naive Bayes classification as it provides high scalability and accuracy for this gene classification task.

DEPARTMENT OF BIOTECHNOLOGY

Gene Classification from DNA sequences using Machine Learning

Submitted by
JATEEN SATISH RATHOD

1RV20EC074

SHARANYA D

1RV20EC142

ISHAN RAHMAN

1RV20EC074

Under the guidance of

Dr. A H Manjunatha Reddy,


Associate Professor

2023-2024

Go, Change the World


RV COLLEGE OF ENGINEERING®
(Autonomous Institution Affiliated to Visvesvaraya Technological University, Belagavi)

DEPARTMENT OF BIOTECHNOLOGY

CERTIFICATE

Certified that the FCPS experiential work titled ‘Gene Classification from DNA sequences
using Machine Learning’ is carried out by Jateen Rathod (1RV20EC074), Sharanya D
(1RV20EC142), Ishan Rahman (1RV20EC074) in partial fulfilment for the requirement of
degree of Bachelor of Engineering in Electronics and Communication Engineering of the
Visvesvaraya Technological University, Belagavi during the year 2022-2023.

Dr. A H Manjunatha Reddy,


Associate Professor,
Department of Biotechnology
RVCE, Bengaluru

1. Introduction
1.1 Gene classification
Gene classification from DNA sequences using machine learning is a multifaceted area of
bioinformatics with numerous applications in genomics, medicine, and evolutionary biology.
The process involves analyzing the sequence of nucleotides in DNA to predict the function or
classification of genes. This section gives an overview of the topic.

Understanding DNA Sequences:

DNA sequences are composed of four nucleotides: Adenine (A), Thymine (T), Cytosine (C),
and Guanine (G). Genes, which are segments of DNA, contain instructions for building
proteins, essential for various biological functions.

Challenges in Gene Classification:

DNA sequences can be lengthy and intricate, making manual classification laborious.
Moreover, genes can have diverse functions, necessitating accurate classification for
understanding their roles in cellular processes.

Machine Learning Approaches:

Feature Extraction: DNA sequences are converted into numerical or categorical features.
Techniques include k-mer counting and one-hot encoding.

Model Selection: Various models like Support Vector Machines, Random Forest,
Convolutional Neural Networks, Recurrent Neural Networks, and Gradient Boosting
Machines are employed.

Training and Evaluation: Models are trained on a dataset split into training and testing sets,
with evaluation using metrics like accuracy, precision, recall, and F1-score.

Cross-validation: Techniques like k-fold cross-validation ensure model robustness.
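
As an illustration of k-fold cross-validation, the sketch below splits sample indices into k folds so that every sample is used for testing exactly once. It is a minimal pure-Python sketch, not the code used in this report; the function name k_fold_indices is hypothetical, and in practice the data would usually be shuffled (and often stratified by class) before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Hypothetical helper for illustration; no shuffling is performed.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Example: 10 samples split into 3 folds of sizes 4, 3 and 3.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be trained on each train split and scored on the matching test split, and the k scores averaged.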

Data Preprocessing:

This involves cleaning data, handling missing values, addressing class imbalance, and scaling
or normalizing features.

Feature Selection and Dimensionality Reduction:

Techniques like recursive feature elimination or principal component analysis can reduce the
dimensionality of the feature space.

Model Interpretability:

Interpretability techniques such as SHAP values or saliency maps aid in understanding the
reasons behind model predictions.

Application Areas:

Gene classification using machine learning has applications in disease prediction and
diagnosis, drug discovery, personalized medicine, understanding evolutionary relationships,
and functional annotation of genomes.

Challenges and Future Directions:

Challenges include the need for larger and more diverse datasets, addressing class imbalance,
and improving model interpretability. Future research may focus on integrating multi-omics
data to enhance classification accuracy.

1.2 Sequence Based Classification

DNA sequence-based gene classification refers to the process of categorizing genes or DNA
sequences into different functional or structural groups based on their sequence features. This
classification is often performed using computational methods and machine learning
techniques applied to DNA sequence data.

DNA sequence-based gene classification typically works in the following steps:

Data Collection: DNA sequences are collected from various sources, such as genome
databases, sequencing experiments, or public repositories.

Feature Extraction: Features are extracted from the DNA sequences to represent their
characteristics. These features can include:
• Nucleotide composition: Frequencies of different nucleotides (A, T, C, G) in the sequence.
• K-mer frequencies: Frequencies of short subsequences of length k (e.g., di-nucleotide
frequencies, tri-nucleotide frequencies).

Training Data Preparation: The extracted features are used to create a training dataset,
where each DNA sequence is associated with a label indicating its gene class or functional
category. This dataset is split into training and validation sets for model training and
evaluation.

Model Training: Machine learning algorithms, such as Support Vector Machines (SVM),
Random Forests, Neural Networks, or Gradient Boosting Machines, are trained on the training
dataset using the extracted sequence features. The model learns to classify DNA sequences
into different gene classes based on the provided features.

Model Evaluation: The trained model is evaluated using the validation dataset to assess its
performance in classifying DNA sequences into the correct gene classes. Evaluation metrics
such as accuracy, precision, recall, and F1-score are commonly used to measure the model's
performance.
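
The evaluation metrics named in the last step can be computed directly from the true and predicted labels. The following is a minimal sketch that treats one class as "positive" and all others as negative; the function name class_metrics and the toy labels are assumptions made for illustration and are not results from this study.

```python
def class_metrics(y_true, y_pred, positive):
    """Accuracy over all classes, plus precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for three gene classes (0, 1, 2); not data from the report.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
metrics = class_metrics(y_true, y_pred, positive=1)
print(metrics)
```

For a multi-class problem such as this one, the per-class values are typically averaged across classes to give a single precision, recall, and F1-score.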

1.3 ML Model

We use a text file containing over 4000 sequences belonging to 6 different classes, together
with their class labels.
We first compute the length of each sequence in the dataset that is given as input to the
model.

The complements and the reverse complements of the sequences are found. These concepts
are particularly important in bioinformatics and molecular biology for tasks such as primer
design, sequence alignment, and various computational analyses of DNA sequences.
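
A minimal sketch of how the complement and reverse complement can be computed is shown below. The function names are illustrative assumptions, and the sketch handles only the four standard bases A, T, C, and G.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT_TABLE = str.maketrans("ATCG", "TAGC")


def complement(sequence):
    """Return the base-by-base complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT_TABLE)


def reverse_complement(sequence):
    """Return the complement read in the opposite direction."""
    return complement(sequence)[::-1]


print(complement("ATGCATGCA"))          # TACGTACGT
print(reverse_complement("ATGCATGCA"))  # TGCATGCAT
```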

The length of the sequences varies a lot. We have to find a way to represent every sequence
with a fixed number of features so that we can apply ML techniques to this problem.

We can use the k-mer method to address this issue. We take a long biological sequence and
break it down into overlapping “words” of length k. For example, if we use words of
length 6 (hexamers), “ATGCATGCA” becomes: ‘ATGCAT’, ‘TGCATG’, ‘GCATGC’,
‘CATGCA’. Hence our example sequence is broken down into 4 hexamer words.
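
The hexamer example above can be reproduced with a short function. This is a minimal sketch; the name get_kmers and the default k=6 are assumptions chosen to match the example.

```python
def get_kmers(sequence, k=6):
    """Return all overlapping substrings ("words") of length k.

    A sequence of length n yields n - k + 1 words, or none if n < k.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(get_kmers("ATGCATGCA"))
# ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA']
```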

In genomics, we refer to this type of manipulation as "k-mer counting", or counting the
occurrences of each possible k-mer sequence.

The k-mer counting itself is done with the natural language processing tools in scikit-learn.
To do so, we need to convert the list of k-mers for each gene into a "sentence" of
space-separated words that the count vectorizer can use.
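
To make the idea concrete, the sketch below builds the gene "sentences" and then counts words per sentence in plain Python. The counting part is only a stand-in for what a count vectorizer does, not scikit-learn's implementation; the function names and the two toy sequences are assumptions. It shows that sequences of different lengths end up as count vectors of the same length.

```python
from collections import Counter


def get_kmers(sequence, k=6):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def to_sentence(sequence, k=6):
    """Join the k-mers of one sequence into a space-separated 'sentence'."""
    return " ".join(get_kmers(sequence, k))


def count_matrix(sentences):
    """Build a shared vocabulary and one row of word counts per sentence."""
    vocabulary = sorted({word for s in sentences for word in s.split()})
    rows = []
    for s in sentences:
        counts = Counter(s.split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Two toy sequences of different lengths, using k=3 to keep the output small.
sentences = [to_sentence(seq, k=3) for seq in ["ATGATG", "ATGC"]]
vocabulary, rows = count_matrix(sentences)
print(sentences)   # ['ATG TGA GAT ATG', 'ATG TGC']
print(vocabulary)  # ['ATG', 'GAT', 'TGA', 'TGC']
print(rows)        # [[2, 1, 1, 0], [1, 0, 0, 1]]
```

Each row has one entry per vocabulary word, so the rows can be passed to a classifier regardless of the original sequence lengths.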

1.4 Classifier Choice

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" which
assume that the features are conditionally independent, given the target class. The strength
(naivety) of this assumption is what gives the classifier its name. These classifiers are among
the simplest Bayesian network models.

Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the
number of variables (features/predictors) in a learning problem. Maximum-likelihood training
can be done by evaluating a closed-form expression, which takes linear time, rather than by
expensive iterative approximation as used for many other types of classifiers.
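
The closed-form training described above can be sketched for the multinomial variant of naive Bayes, which suits count features such as k-mer counts. This is an illustrative sketch, not the implementation used in this report: the function names and toy data are assumptions, and it adds Laplace smoothing (the alpha parameter) to the pure maximum-likelihood estimate so that unseen words do not produce zero probabilities.

```python
import math


def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and per-class log word probabilities from counts.

    Training needs only one pass of summation over the data, and the model
    stores one parameter per (class, feature) pair plus one prior per class.
    """
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        feature_totals = [sum(column) for column in zip(*rows)]
        # alpha is an assumed smoothing constant; alpha=0 would be plain maximum likelihood.
        denominator = sum(feature_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denominator) for t in feature_totals]
    return log_prior, log_likelihood


def predict(x, log_prior, log_likelihood):
    """Return the class with the highest log posterior score for count vector x."""
    scores = {
        c: log_prior[c] + sum(n * w for n, w in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy count vectors for two classes; not data from the report.
X = [[2, 1, 0], [3, 0, 0], [0, 1, 2], [0, 0, 3]]
y = ["a", "a", "b", "b"]
log_prior, log_likelihood = train_multinomial_nb(X, y)
print(predict([1, 0, 0], log_prior, log_likelihood))  # a
print(predict([0, 0, 2], log_prior, log_likelihood))  # b
```

The prediction score is a linear function of the count vector, which is the sense in which the classifier is linear.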

The same dataset was used to evaluate the performance of the different methods generally
used for such purposes. In this comparison, naive Bayes had better accuracy and F1-score
than the other methods.

1.5 Testing

K-mer counting is used to check for repeating sequences and to find the best match.

Metrics such as accuracy and F1-score are used to measure the performance of the model. The
model is also applied to chimpanzee and dog DNA sequences to check their similarity to the
training dataset.
