Bio Report
Submitted by
JATEEN SATISH RATHOD
1RV20EC074
SHARANYA D
1RV20EC142
ISHAN RAHMAN
1RV20EC074
2023-2024
DEPARTMENT OF BIOTECHNOLOGY
CERTIFICATE
Certified that the FCPS experiential work titled ‘Gene Classification from DNA sequences
using Machine Learning’ is carried out by Jateen Rathod (1RV20EC074), Sharanya D
(1RV20EC142), Ishan Rahman (1RV20EC074) in partial fulfilment of the requirements for
the degree of Bachelor of Engineering in Electronics and Communication Engineering of
Visvesvaraya Technological University, Belagavi, during the year 2022-2023.
1. Introduction
1.1 Gene classification
Gene classification from DNA sequences using machine learning is a multifaceted and crucial
area of bioinformatics with numerous applications in genomics, medicine, and evolutionary
biology. The process involves analyzing the sequence of nucleotides in DNA to predict the
function or classification of genes. This section gives an overview of the topic.
DNA sequences are composed of four nucleotides: Adenine (A), Thymine (T), Cytosine (C),
and Guanine (G). Genes, which are segments of DNA, contain instructions for building
proteins, essential for various biological functions.
DNA sequences can be lengthy and intricate, making manual classification laborious.
Moreover, genes can have diverse functions, necessitating accurate classification for
understanding their roles in cellular processes.
Feature Extraction: DNA sequences are converted into numerical or categorical features.
Techniques include k-mer counting and one-hot encoding.
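One-hot encoding mentioned above can be sketched as follows; this is a minimal illustration, and the function name `one_hot_encode` is an assumption, not part of the project code:

```python
import numpy as np

def one_hot_encode(seq):
    """Map each nucleotide to a 4-dimensional indicator vector (A, C, G, T)."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype=int)
    for i, base in enumerate(seq.upper()):
        encoded[i, mapping[base]] = 1
    return encoded

matrix = one_hot_encode("ACGT")  # one row per base, one column per symbol
```

Each row contains a single 1 in the column of its nucleotide, so the sequence becomes a fixed-width numeric matrix that models can consume.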
Model Selection: Various models like Support Vector Machines, Random Forest,
Convolutional Neural Networks, Recurrent Neural Networks, and Gradient Boosting
Machines are employed.
Training and Evaluation: Models are trained on a dataset split into training and testing sets,
with evaluation using metrics like accuracy, precision, recall, and F1-score.
Data Preprocessing:
This involves cleaning data, handling missing values, addressing class imbalance, and scaling
or normalizing features.
Techniques like recursive feature elimination or principal component analysis can reduce the
dimensionality of the feature space.
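As a sketch of the dimensionality-reduction step, the snippet below applies scikit-learn's PCA to a synthetic feature matrix standing in for per-sequence k-mer counts; the shapes and data are illustrative assumptions only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 64))          # e.g. 100 sequences x 64 tri-nucleotide counts

# Project the 64-dimensional feature space onto its 10 leading components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)   # shape: (100, 10)
```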
Model Interpretability:
Interpretability techniques such as SHAP values or saliency maps aid in understanding the
reasons behind model predictions.
Application Areas:
Gene classification using machine learning has applications in disease prediction and
diagnosis, drug discovery, personalized medicine, understanding evolutionary relationships,
and functional annotation of genomes.
Challenges include the need for larger and more diverse datasets, addressing class imbalance,
and improving model interpretability. Future research may focus on integrating multi-omics
data to enhance classification accuracy.
1.2 DNA sequence-based gene classification
DNA sequence-based gene classification refers to the process of categorizing genes or DNA
sequences into different functional or structural groups based on their sequence features. This
classification is often performed using computational methods and machine learning
techniques applied to DNA sequence data.
Data Collection: DNA sequences are collected from various sources, such as genome
databases, sequencing experiments, or public repositories.
Feature Extraction: Features are extracted from the DNA sequences to represent their
characteristics. These features can include:
• Nucleotide composition: frequencies of the different nucleotides (A, T, C, G) in the sequence.
• K-mer frequencies: frequencies of short subsequences of length k (e.g., di-nucleotide
frequencies, tri-nucleotide frequencies).
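The nucleotide-composition feature above can be computed with a few lines of standard-library Python; this is a minimal sketch, and `nucleotide_composition` is an illustrative name rather than the project's actual code:

```python
from collections import Counter

def nucleotide_composition(seq):
    """Return each base's frequency as a fraction of the sequence length."""
    counts = Counter(seq.upper())
    total = len(seq)
    return {base: counts.get(base, 0) / total for base in "ATCG"}

comp = nucleotide_composition("ATGCATGCA")  # A appears 3 times in 9 bases
```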
Training Data Preparation: The extracted features are used to create a training dataset,
where each DNA sequence is associated with a label indicating its gene class or functional
category. This dataset is split into training and validation sets for model training and
evaluation.
Model Training: Machine learning algorithms, such as Support Vector Machines (SVM),
Random Forests, Neural Networks, or Gradient Boosting Machines, are trained on the training
dataset using the extracted sequence features. The model learns to classify DNA sequences
into different gene classes based on the provided features.
Model Evaluation: The trained model is evaluated using the validation dataset to assess its
performance in classifying DNA sequences into the correct gene classes. Evaluation metrics
such as accuracy, precision, recall, and F1-score are commonly used to measure the model's
performance.
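The training-and-evaluation workflow described above can be sketched end to end with scikit-learn. The feature matrix here is synthetic (generated by `make_classification` as a stand-in for k-mer counts), and the choice of Random Forest is one of the models the text lists, not necessarily the one the project settled on:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Stand-in features: in practice X would hold per-sequence k-mer counts
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
```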
1.3 ML Model
We use a text file containing over 4,000 sequences belonging to six different classes, together
with their labels.
We first find the length of each sequence in the dataset that is given as input to the model.
The complements and the reverse complements of the sequences are found. These concepts
are particularly important in bioinformatics and molecular biology for tasks such as primer
design, sequence alignment, and various computational analyses of DNA sequences.
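Computing a complement and reverse complement is a standard operation; the sketch below shows one minimal way to do it (the function name is illustrative, not taken from the project code):

```python
def reverse_complement(seq):
    """Complement each base (A<->T, C<->G), then reverse the strand."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq.upper()))

rc = reverse_complement("ATGC")  # -> "GCAT"
```

Applying the operation twice recovers the original sequence, which is a quick sanity check for any implementation.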
Sequence length varies considerably. We have to find a way to make the representation
constant-length so that we can apply ML techniques to this problem.
We can use the K-mer method to rectify this issue. We can take a long biological sequence
and break it down into k-mer length overlapping “words”. For example, if we use words of
length 6 (hexamers), “ATGCATGCA” becomes: ‘ATGCAT’, ‘TGCATG’, ‘GCATGC’,
‘CATGCA’. Hence our example sequence is broken down into 4 hexamer words.
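The k-mer decomposition above can be sketched as a one-line sliding window; `get_kmers` is an illustrative name:

```python
def get_kmers(seq, k=6):
    """Slide a window of length k across the sequence, one step at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

words = get_kmers("ATGCATGCA", k=6)
# -> ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA']
```

A sequence of length n yields n − k + 1 overlapping words, which is why the 9-base example produces 4 hexamers.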
Scikit-learn's natural language processing tools perform the k-mer counting. To do so, we
convert each gene's list of k-mers into a "sentence" of words that the count vectorizer can
process.
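The conversion into gene sentences and the subsequent counting can be sketched with scikit-learn's `CountVectorizer`; the two short sequences here are illustrative, not drawn from the project's dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_kmers(seq, k=6):
    """Break a sequence into overlapping k-mer words."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Join each gene's k-mers into a space-separated "sentence"
sequences = ["ATGCATGCAT", "TTGCATGCAA"]
sentences = [" ".join(get_kmers(s)) for s in sequences]

# Fit a bag-of-words model over the hexamer vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)   # sparse matrix: genes x hexamer counts
```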
In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" that
assume the features are conditionally independent given the target class. The strength
(naivety) of this assumption is what gives the classifier its name. These classifiers are among
the simplest Bayesian network models.
Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the
number of variables (features/predictors) in a learning problem. Maximum-likelihood training
can be done by evaluating a closed-form expression, which takes linear time, rather than by
expensive iterative approximation as used for many other types of classifiers.
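A multinomial naive Bayes classifier, the variant commonly paired with count features like k-mer counts, can be sketched as follows. The toy count matrix below is an assumption for illustration, not the project's data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features: rows are sequences, columns are k-mer counts
X = np.array([[3, 0, 1],
              [2, 0, 2],
              [0, 4, 1],
              [1, 3, 0]])
y = np.array([0, 0, 1, 1])

# fit() computes class priors and per-feature likelihoods in closed form
# from the counts -- no iterative optimization is required
clf = MultinomialNB().fit(X, y)
pred = clf.predict(np.array([[2, 1, 1]]))
```

The closed-form fit is what makes naive Bayes fast to train even on large k-mer vocabularies.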
The same dataset was used to evaluate the performance of the different methods generally
used for such purposes. As seen, Naive Bayes achieves better accuracy and F1-score than the
other methods.
1.4 Testing
K-mer counting is used to check for repeating sequences and find the best matching.
Metrics such as accuracy and F1-score are used to measure the performance of the model.
The model is tested on chimpanzee and dog DNA sequences to check their similarity to the
training dataset.
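Computing the metrics named above is straightforward with scikit-learn; the label vectors below are hypothetical stand-ins for a held-out set (e.g., chimpanzee sequences), not the project's actual results:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true vs. predicted class labels for six test sequences
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 2, 0, 0, 2]

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1, weighted by support
```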