Bio Report
Submitted by
JATEEN SATISH RATHOD
1RV20EC074
SHARANYA D
1RV20EC142
ISHAN RAHMAN
1RV20EC074
2023-2024
DEPARTMENT OF BIOTECHNOLOGY
CERTIFICATE
Certified that the FCPS experiential work titled ‘Gene Classification from DNA sequences
using Machine Learning’ is carried out by Jateen Rathod (1RV20EC074), Sharanya D
(1RV20EC142), Ishan Rahman (1RV20EC074) in partial fulfilment of the requirements for
the degree of Bachelor of Engineering in Electronics and Communication Engineering of
Visvesvaraya Technological University, Belagavi, during the year 2022-2023.
1. Introduction
1.1 Gene classification
Gene classification from DNA sequences using machine learning is a multifaceted and crucial
area of bioinformatics with numerous applications in genomics, medicine, and evolutionary
biology. The process involves analyzing the sequence of nucleotides in DNA to predict the
function or classification of genes. This section gives an overview of the topic.
DNA sequences are composed of four nucleotides: Adenine (A), Thymine (T), Cytosine (C),
and Guanine (G). Genes, which are segments of DNA, contain instructions for building
proteins, essential for various biological functions.
DNA sequences can be lengthy and intricate, making manual classification laborious.
Moreover, genes can have diverse functions, necessitating accurate classification for
understanding their roles in cellular processes.
Feature Extraction: DNA sequences are converted into numerical or categorical features.
Techniques include k-mer counting and one-hot encoding.
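One-hot encoding mentioned above can be sketched as follows; this is a minimal illustration, and the function name `one_hot_encode` is an assumption, not part of the project code:

```python
import numpy as np

def one_hot_encode(seq):
    """Map each nucleotide to a 4-dimensional indicator vector (A, C, G, T)."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype=int)
    for i, base in enumerate(seq.upper()):
        encoded[i, mapping[base]] = 1
    return encoded

matrix = one_hot_encode("ACGT")  # one row per base, one column per symbol
```

Each row contains a single 1 in the column of its nucleotide, so the sequence becomes a fixed-width numeric matrix that models can consume.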
Model Selection: Various models like Support Vector Machines, Random Forest,
Convolutional Neural Networks, Recurrent Neural Networks, and Gradient Boosting
Machines are employed.
Training and Evaluation: Models are trained on a dataset split into training and testing sets,
with evaluation using metrics like accuracy, precision, recall, and F1-score.
Data Preprocessing:
This involves cleaning data, handling missing values, addressing class imbalance, and scaling
or normalizing features.
Techniques like recursive feature elimination or principal component analysis can reduce the
dimensionality of the feature space.
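As a sketch of the dimensionality-reduction step, the snippet below applies scikit-learn's PCA to a synthetic feature matrix standing in for per-sequence k-mer counts; the shapes and data are illustrative assumptions only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 64))          # e.g. 100 sequences x 64 tri-nucleotide counts

# Project the 64-dimensional feature space onto its 10 leading components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)   # shape: (100, 10)
```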
Model Interpretability:
Interpretability techniques such as SHAP values or saliency maps aid in understanding the
reasons behind model predictions.
Application Areas:
Gene classification using machine learning has applications in disease prediction and
diagnosis, drug discovery, personalized medicine, understanding evolutionary relationships,
and functional annotation of genomes.
Challenges include the need for larger and more diverse datasets, addressing class imbalance,
and improving model interpretability. Future research may focus on integrating multi-omics
data to enhance classification accuracy.
1.2 DNA sequence-based gene classification
DNA sequence-based gene classification refers to the process of categorizing genes or DNA
sequences into different functional or structural groups based on their sequence features. This
classification is often performed using computational methods and machine learning
techniques applied to DNA sequence data.
Data Collection: DNA sequences are collected from various sources, such as genome
databases, sequencing experiments, or public repositories.
Feature Extraction: Features are extracted from the DNA sequences to represent their
characteristics. These features can include:
• Nucleotide composition: frequencies of the different nucleotides (A, T, C, G) in the sequence.
• K-mer frequencies: frequencies of short subsequences of length k (e.g., di-nucleotide
frequencies, tri-nucleotide frequencies).
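The nucleotide-composition feature above can be computed with a few lines of standard-library Python; this is a minimal sketch, and `nucleotide_composition` is an illustrative name rather than the project's actual code:

```python
from collections import Counter

def nucleotide_composition(seq):
    """Return each base's frequency as a fraction of the sequence length."""
    counts = Counter(seq.upper())
    total = len(seq)
    return {base: counts.get(base, 0) / total for base in "ATCG"}

comp = nucleotide_composition("ATGCATGCA")  # A appears 3 times in 9 bases
```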
Training Data Preparation: The extracted features are used to create a training dataset,
where each DNA sequence is associated with a label indicating its gene class or functional
category. This dataset is split into training and validation sets for model training and
evaluation.
Model Training: Machine learning algorithms, such as Support Vector Machines (SVM),
Random Forests, Neural Networks, or Gradient Boosting Machines, are trained on the training
dataset using the extracted sequence features. The model learns to classify DNA sequences
into different gene classes based on the provided features.
Model Evaluation: The trained model is evaluated using the validation dataset to assess its
performance in classifying DNA sequences into the correct gene classes. Evaluation metrics
such as accuracy, precision, recall, and F1-score are commonly used to measure the model's
performance.
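The training-and-evaluation workflow described above can be sketched end to end with scikit-learn. The feature matrix here is synthetic (generated by `make_classification` as a stand-in for k-mer counts), and the choice of Random Forest is one of the models the text lists, not necessarily the one the project settled on:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Stand-in features: in practice X would hold per-sequence k-mer counts
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
```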
1.3 ML Model
We use a text file containing over 4,000 sequences belonging to six different classes, together
with their labels.
We first find the length of each sequence in the dataset that is given as input to the model.
The complements and the reverse complements of the sequences are found. These concepts
are particularly important in bioinformatics and molecular biology for tasks such as primer
design, sequence alignment, and various computational analyses of DNA sequences.
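Computing a complement and reverse complement is a standard operation; the sketch below shows one minimal way to do it (the function name is illustrative, not taken from the project code):

```python
def reverse_complement(seq):
    """Complement each base (A<->T, C<->G), then reverse the strand."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq.upper()))

rc = reverse_complement("ATGC")  # -> "GCAT"
```

Applying the operation twice recovers the original sequence, which is a quick sanity check for any implementation.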
Sequence length varies considerably. We have to find a way to make the representation
constant-length so that we can apply ML techniques to this problem.
We can use the K-mer method to rectify this issue. We can take a long biological sequence
and break it down into k-mer length overlapping “words”. For example, if we use words of
length 6 (hexamers), “ATGCATGCA” becomes: ‘ATGCAT’, ‘TGCATG’, ‘GCATGC’,
‘CATGCA’. Hence our example sequence is broken down into 4 hexamer words.
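The k-mer decomposition above can be sketched as a one-line sliding window; `get_kmers` is an illustrative name:

```python
def get_kmers(seq, k=6):
    """Slide a window of length k across the sequence, one step at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

words = get_kmers("ATGCATGCA", k=6)
# -> ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA']
```

A sequence of length n yields n − k + 1 overlapping words, which is why the 9-base example produces 4 hexamers.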
Scikit-learn's natural language processing tools perform the k-mer counting. To do so, we
convert each gene's list of k-mers into a "sentence" of words that the count vectorizer can
process.
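The conversion into gene sentences and the subsequent counting can be sketched with scikit-learn's `CountVectorizer`; the two short sequences here are illustrative, not drawn from the project's dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_kmers(seq, k=6):
    """Break a sequence into overlapping k-mer words."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Join each gene's k-mers into a space-separated "sentence"
sequences = ["ATGCATGCAT", "TTGCATGCAA"]
sentences = [" ".join(get_kmers(s)) for s in sequences]

# Fit a bag-of-words model over the hexamer vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)   # sparse matrix: genes x hexamer counts
```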
In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" that
assume the features are conditionally independent given the target class. The strength
(naivety) of this assumption is what gives the classifier its name. These classifiers are among
the simplest Bayesian network models.
Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the
number of variables (features/predictors) in a learning problem. Maximum-likelihood training
can be done by evaluating a closed-form expression, which takes linear time, rather than by
expensive iterative approximation as used for many other types of classifiers.
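A multinomial naive Bayes classifier, the variant commonly paired with count features like k-mer counts, can be sketched as follows. The toy count matrix below is an assumption for illustration, not the project's data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features: rows are sequences, columns are k-mer counts
X = np.array([[3, 0, 1],
              [2, 0, 2],
              [0, 4, 1],
              [1, 3, 0]])
y = np.array([0, 0, 1, 1])

# fit() computes class priors and per-feature likelihoods in closed form
# from the counts -- no iterative optimization is required
clf = MultinomialNB().fit(X, y)
pred = clf.predict(np.array([[2, 1, 1]]))
```

The closed-form fit is what makes naive Bayes fast to train even on large k-mer vocabularies.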
The same dataset was used to evaluate the performance of the different methods generally
used for such purposes. As seen, Naive Bayes achieves better accuracy and F1-score than the
other methods.
1.4 Testing
K-mer counting is used to check for repeating sequences and find the best matching.
Metrics such as accuracy and F1-score are used to measure the performance of the model.
The model is tested on chimpanzee and dog DNA sequences to check their similarity to the
training dataset.
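Computing the metrics named above is straightforward with scikit-learn; the label vectors below are hypothetical stand-ins for a held-out set (e.g., chimpanzee sequences), not the project's actual results:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true vs. predicted class labels for six test sequences
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 2, 0, 0, 2]

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1, weighted by support
```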