Metagenomics Classification: Project Synopsis

This document provides a project synopsis for a metagenomics classification project. The project aims to build an inference engine using a deep learning model called GeNet to classify metagenomic DNA sequences without requiring a large database. The project involves collecting DNA sequence data from online sources, preprocessing the data to create balanced training, validation and test datasets, developing the GeNet model architecture using convolutional and residual blocks, and training and evaluating the model to classify sequences by taxonomy. The project scope includes data collection and preprocessing, model development and training, and model evaluation.


METAGENOMICS CLASSIFICATION

Project Synopsis
Version 1.0

(ECS799)
Degree
BACHELOR OF TECHNOLOGY (CSE)

PROJECT GUIDE:
Prof. Ajay Chakravarti (Internal)
Mr. Kolli Sarath (External)

SUBMITTED BY:
Samyak Jain (TCA1709030)
Samyak Jain (TCA1709021)

August, 2020

COLLEGE OF COMPUTING SCIENCES AND INFORMATION TECHNOLOGY


TEERTHANKER MAHAVEER UNIVERSITY, MORADABAD
TMU-CCSIT Version 1.1 T001-Project Synopsis

Table of Contents

1 Project Title
2 Domain
3 Problem Statement
4 Project Description
4.1 Scope of the Work
4.2 Project Modules
5 Implementation Methodology
6 Technologies to be used
6.1 Software Platform
6.2 Hardware Platform
6.3 Tools
7 Advantages of this Project
8 Future Scope and further enhancement of the Project
9 Team Details
10 Conclusion
11 References


1 Project Title
This project is based on metagenomics classification, a problem from computational biology; hence its title, Metagenomics Classification.

2 Domain
This is a research project that uses deep learning to achieve the required results.

3 Problem Statement
This project aims to build an inference engine based on GeNet, a deep representation for metagenomics classification, following this research paper (https://arxiv.org/pdf/1901.11015.pdf). The goal is to replace the Kraken and Centrifuge DNA classification methods, which require a large database; that requirement makes them expensive and hard to transfer, and they become challenging when the amount of noise in the data increases.

4 Project Description
To counter the problem mentioned above, a deep learning system is required that can learn from the noise distribution of the input reads. Moreover, a classification model learns a mapping from an input read to class probabilities, and thus does not require a database at run-time. Deep learning systems also provide representations of DNA sequences which can be leveraged for downstream tasks.
A DNA sequence, also called a read, is represented by the characters (G, T, A, C) and varies from organism to organism.
Taxonomy is the classification of an organism in this order: Kingdom, Phylum, Class, Order, Family, Genus, Species.
We predict the taxonomy by passing a read to seven models simultaneously; each model classifies one rank of the taxa above, and the combined results of these models classify the read.
There are six kingdoms in the biological system: Plants, Animals, Protists, Fungi, Archaebacteria, Eubacteria.


This project is vast and is divided into sub-projects by kingdom; each sub-project needs eight models (the seven taxa plus the organism name). This report covers Eubacteria only.
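The rank-by-rank scheme above can be sketched in Python. This is an illustrative sketch, not the project's actual code: `combine_predictions` and the toy stand-in models are hypothetical names; the real per-rank classifiers would be the trained GeNet networks.

```python
# Each taxonomic rank gets its own classifier; the combined prediction
# is simply the tuple of per-rank outputs.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def combine_predictions(read, models):
    """models: dict mapping rank name -> callable(read) -> label."""
    return {rank: models[rank](read) for rank in RANKS}

# Toy stand-in models (real ones would be trained GeNet networks):
models = {rank: (lambda r, rk=rank: f"{rk}_label") for rank in RANKS}
taxonomy = combine_predictions("GATC", models)
```

Note the `rk=rank` default argument in the toy lambdas: without it, every closure would capture the same loop variable.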

4.1 Scope of the Work


Step 1- Collection of data from online resources.
Step 2- Data processing.
Step 3- Preparing balanced datasets.
Step 4- Creating the model structure.
Step 5- Model training.
Step 6- Model testing.
Step 7- Model evaluation.

4.2 Project Modules

Collection of Data
This is the first step of the project. The project needs data that contains reads together with the taxonomy each read belongs to.
Data is collected from these resources for each kingdom:
1. NCBI (https://www.ncbi.nlm.nih.gov/) for all kingdoms.
2. DairyDB (https://github.com/marcomeola/DAIRYdb) for bacteria.
3. PlantGDB (http://www.plantgdb.org/) for plants.
4. RVDB (https://rvdb.dbi.udel.edu/) for viruses.
5. PlutoF (https://www.gbif.org/dataset/search) for fungi.
6. GreenGenes (https://greengenes.secondgenome.com/) for archaea.

All these data are available in FASTA format; they need preprocessing to filter out the required fields, which are then stored in CSV format.
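A minimal, dependency-free sketch of the FASTA-reading step. The function name and toy records are illustrative; the project's real preprocessing would also extract the taxonomy from each header and write the results as CSV rows.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    A FASTA record is a '>' header line followed by one or more
    sequence lines, which we join into a single string.
    """
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

# Toy input; real files come from the sources listed above.
fasta = [">seq1 Escherichia coli", "GATCGATC", "GGTA",
         ">seq2 Bacillus subtilis", "TTAACC"]
records = list(parse_fasta(fasta))
```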

Data Preprocessing


After filtering out the required data using Python, train, valid and test CSVs are prepared.

Each column of the main CSV file is treated as the target labels of one particular model, and each column contains n labels.
Data Balancing
I took 35 rows for each label: labels with fewer than 35 rows were removed, and labels with more than 35 rows were truncated. I created a CSV of 35 rows for each label and placed it under the folder for its column.
Next I prepared the train, valid and test CSV files: from each label's CSV I took 20 rows as training data, 10 rows as validation data and 5 rows as test data for that column.
This processing creates a balanced dataset, which helps the model learn equally for each label; an imbalanced dataset decreases accuracy.
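The balancing rules above (35 rows per label, split 20/10/5) can be sketched as follows; the function and variable names are my own, not from the project:

```python
from collections import defaultdict

ROWS_PER_LABEL = 35            # labels with fewer rows are dropped; extras truncated
N_TRAIN, N_VALID, N_TEST = 20, 10, 5

def balance_and_split(rows):
    """rows: list of (read, label) pairs. Returns dict split-name -> rows."""
    by_label = defaultdict(list)
    for read, label in rows:
        by_label[label].append((read, label))
    splits = {"train": [], "valid": [], "test": []}
    for label, items in by_label.items():
        if len(items) < ROWS_PER_LABEL:
            continue                          # drop under-represented label
        items = items[:ROWS_PER_LABEL]        # truncate over-represented label
        splits["train"] += items[:N_TRAIN]
        splits["valid"] += items[N_TRAIN:N_TRAIN + N_VALID]
        splits["test"] += items[N_TRAIN + N_VALID:]
    return splits

# Toy example: label "A" has 40 rows (kept, truncated to 35), "B" has 10 (dropped).
rows = [("GATC", "A")] * 40 + [("TTAA", "B")] * 10
splits = balance_and_split(rows)
```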

GeNet Model Architecture


GeNet is a model for metagenomic classification based on convolutional neural networks. It is trained end-to-end from raw DNA sequences using standard backpropagation with cross-entropy loss.
(The GeNet architecture diagram from the original report appears here.)


The first layer of the model is a 2-D convolutional layer, which takes the matrix form of a read as input.

The ResNet blocks in the diagram are residual blocks, also called 'skip connections'. They allow gradients to flow through the network directly, without passing through non-linear activation functions. As network depth increases, accuracy saturates and then degrades rapidly; residual blocks are used here to remove that problem.

(The residual block diagram from the original report appears here.)

Each residual block contains two convolutional layers with pooling and batch normalization. Finally, the input is added to the output of the second convolutional layer as the 'skip connection', which reduces the gradient-degradation problem.
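A minimal NumPy sketch of the skip-connection idea, assuming 1-D convolutions and omitting the pooling and batch-normalization layers for brevity (a real implementation would use a deep learning framework):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d(x, kernel):
    """'Same'-padded 1-D convolution with stride 1 -- enough to show the idea."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(kernel)], kernel) for i in range(len(x))])

def residual_block(x, k1, k2):
    """Two convolutions, then add the input back (the 'skip connection')."""
    out = relu(conv1d(x, k1))
    out = conv1d(out, k2)
    return relu(out + x)   # skip connection: gradients flow directly through '+ x'

x = np.array([1.0, 2.0, 3.0, 4.0])
identity_kernel = np.array([0.0, 1.0, 0.0])  # passes the input through unchanged
y = residual_block(x, identity_kernel, identity_kernel)  # -> 2 * x
```

With identity kernels the two convolutions leave the (positive) input unchanged, so the output is exactly input plus input, which makes the effect of the skip connection easy to verify.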


Pooling layers down-sample feature maps by summarizing the presence of features in patches of the feature map. Two common pooling methods are average pooling and max pooling, which summarize the average presence of a feature and the most activated presence of a feature, respectively. Average pooling is used in this model.
A batch normalization layer normalizes each input channel across a mini-batch, to speed up the training of convolutional neural networks and reduce sensitivity to network initialization.
ReLU (Rectified Linear Unit) is the most commonly deployed activation function for the outputs of CNN neurons. It introduces non-linearity into the model's learning process, which helps it learn features more efficiently and makes it robust. ReLU is simple to compute, and therefore faster than the sigmoid function, and it avoids the vanishing-gradient problem.

Models Evaluation
I use a test dataset of 875 rows to evaluate the combined accuracy of all the models.
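One plausible way to compute the combined accuracy is exact match over all ranks, i.e. a read counts as correct only if every taxon is predicted correctly; this metric is an assumption for the sketch below, not a definition taken from the report.

```python
def exact_taxonomy_accuracy(predictions, targets):
    """Fraction of reads whose predicted taxonomy matches at every rank.

    predictions/targets: parallel lists of per-read taxonomy tuples.
    """
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Toy example with two-rank taxonomies; the real data would have seven ranks.
preds = [("Bacteria", "Firmicutes"), ("Bacteria", "Proteobacteria"),
         ("Archaea", "Euryarchaeota"), ("Bacteria", "Firmicutes")]
truth = [("Bacteria", "Firmicutes"), ("Bacteria", "Firmicutes"),
         ("Archaea", "Euryarchaeota"), ("Bacteria", "Firmicutes")]
acc = exact_taxonomy_accuracy(preds, truth)  # 3 of 4 reads fully correct
```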


Inference Output

I pass a DNA read to obtain the combined result as a taxonomy prediction.

'GATGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAGCGGAGTTTAACTGGAAGCACTTGTGCGACCGGATAAACTTA
GCGGCGGACGGGTGAGTAACACGTGAGCAACCTACCTATCGCAGGGGAACAACATTGGGAAACCAGTGCTAATACCGCAT
AACATCTTTTGGGGGCATCCCCGGAAGATCAAAGGATTTCGATCCGGCGACAGATGGGCTCGCGTCCGATTAGCTAGTTG
GTAAGGTAAAAGCTTACCAAGGCAACGATCGGTAGCCGAACTGAGAGGTTGATCGGCCACATTGGGACTGAGACACGGCC
CAGGCTCCTACGGGAGGCAGCAGTGGGGAATATTGGGCAATGGGGGAAACCCTGACCCAGCAACGCCGCGTGAAGGAAGA
AGGCCTTCGGGTTGTAAACTTCTTTGATCAGGGACGAAACAAATGACGGTACCTGAAGAACAAGTCACGGCTAACTACGT
GCCAGCAGCCGCGGTAATACGTAGGTGACAAGCGTTATCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCGTAA
GTTGGATGTGAAATTCTCAGGCTTAACCTGAGAGGGTCATCCAAAACTGCAAAACTTGAGTACTGGAGAGGATAGTGGAA
TTCCTAGTGTAGCGGTAAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTATCTGGACAGTAACTGACGC
TGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAATACTAGGTGTAGG
GGGTATCGACCCCCCCTGTGCCGCAGCTAACGCAATAAGTATTCCACCTGGGGAGTACGACCGCAAGGTTGAAACTCAAA
GGAATTGACGGGGGCCCGCACAAGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGGCTTGAC
ATCCTCTGACGGCTGTAGAGATACAGCTTTCCCTTCGGGGACAGAGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGT
CGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATGGTCAGTTGCCAGCACGTAATGGTGGGCACTCTGGCA
AGACTGCCGTTGATAAAACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGTCCTGGGCTACACACGTAC
TACAATGGCAACAACAGAGGGCAGCCAGGTCGCGAGGCCGAGCGAATCCCAAAATGTTGTCTCAGTTCAGATTGCAGGCT
GCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATGGCAGGTCAGCATACTGCCGTGAATACGTTCCCGGGTCTTGTAC
ACACCGCCCGTCACACCATGAGAGTTTGTAACACCCGAAGTCAGTAGTCTGACCGTAAGGAGGGCGCTGCCGAAGGTGGG
ACAGATAATTGGGGTG’

All taxa are predicted correctly for the DNA read above.

5 Implementation Methodology

GeNet: Deep Representations for Metagenomics


The pipeline of the deep representation for metagenomic classification is divided into four parts. (The pipeline diagram from the original report appears here.)


Dataset-
Each read is a string of the characters (G, T, A, C) and varies from organism to organism.

Vector Representation-
Each character of a read is encoded as a number, the resulting numeric sequence is converted into a 2-D array, and the array is normalized before moving to the next step.
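The encoding step can be sketched as follows; the base-to-integer mapping, the matrix width, and the normalization constant are assumptions for illustration, not values taken from the report.

```python
import numpy as np

BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed encoding

def encode_read(read, width=8):
    """Map bases to integers, zero-pad to a multiple of `width`,
    reshape into a 2-D matrix, and scale values into [0, 1]."""
    codes = [BASE_TO_INT[b] for b in read]
    pad = (-len(codes)) % width
    codes += [0] * pad
    matrix = np.array(codes, dtype=float).reshape(-1, width)
    return matrix / 3.0        # normalize: the largest code is 3

m = encode_read("GATCGATCGATC", width=4)   # 12 bases -> 3x4 matrix
```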
Learning Process-
The GeNet architecture model, based on convolutional neural networks, is used for each taxon.

Trained Models-
Eight models (the seven taxa plus the organism name) produce the combined result.

Approach for High Accuracy

I chose the following hyperparameters, which helped the models reach an average accuracy of 65%.
Fully connected (FC) layers are feed-forward neural network layers; the input to the fully connected layers is the output of the convolutional layers.

I used three FC layers -



Layer 1- {768,384}
Layer 2- {384,768}
Layer 3- {768,total classes}
I used a dropout layer after the first two FC layers and at the end of the residual blocks, so the model's neurons run effectively: each neuron stops learning with a probability of 20%.

I chose three FC layers without much variance in the number of neurons, making the head deeper rather than wider, because wide layers memorize the output and do not work as a general model for different data.
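An illustrative NumPy forward pass through a head with these shapes. The weights, the ReLU placement, and the class count (10) are assumptions for the sketch; inverted dropout is used so activations keep their expected scale during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.2, training=True):
    """Zero each unit with probability p during training; identity at inference."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)        # inverted dropout: preserves expectation

# Layer sizes as in the report: 768 -> 384 -> 768 -> total classes (assume 10).
sizes = [(768, 384), (384, 768), (768, 10)]
weights = [(rng.standard_normal(s) * 0.01, np.zeros(s[1])) for s in sizes]

def head(x, training=True):
    for i, (w, b) in enumerate(weights):
        x = x @ w + b                          # fully connected layer
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)             # ReLU between layers
            x = dropout(x, 0.2, training)      # dropout after the first two layers
    return x

logits = head(rng.standard_normal(768), training=False)
```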

An optimizer is an optimization algorithm used to iteratively update the network weights using the training data.
Optimizer algorithms used-
Adam
SGD
The learning rate is a tuning parameter of an optimization algorithm that determines the step size at each iteration while moving toward a minimum of the loss function.
I used 0.001 and 0.0001 as learning rates, alternating between them.
Training approach-


(Training graph from the original report.) The graph shows spikes while the Adam optimizer trains the model with a learning rate of 0.001.


(Training graph from the original report.) The graph shows a downward slope while the SGD optimizer trains the model with a learning rate of 0.001.

Step 1-
I started training with the SGD optimizer and a learning rate of 0.001. SGD (Stochastic Gradient Descent) selects a few samples from the whole dataset at random for each iteration, so it performs many iterations and lets the model learn slowly; the 0.001 learning rate helps it move toward the global minimum, which corresponds to correct predictions of the input.

Step 2-
After training with a learning rate of 0.001, I trained with 0.0001 so the model learns even more slowly and picks up more features, while the gradient moves steadily toward the global minimum.

Step 3-
In this step I replaced SGD with Adam, which is a combination of RMSprop and stochastic gradient descent with momentum: it uses squared gradients to scale the learning rate, like RMSprop, and it takes advantage of momentum by using a moving average of the gradient instead of the raw gradient, like SGD with momentum. Adam converges faster than SGD; with a learning rate of 0.001 it moves the model toward the global minimum. This step sometimes increases accuracy abruptly.

Step 4- Repeat the steps above until high accuracy is reached.
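The four-step schedule can be written down as data that a training loop could follow; the epoch counts per phase are placeholders, not values from the report:

```python
# Three phases per round, matching steps 1-3; step 4 repeats the rounds.
SCHEDULE = [
    {"optimizer": "SGD",  "lr": 1e-3, "epochs": 10},  # step 1: coarse SGD
    {"optimizer": "SGD",  "lr": 1e-4, "epochs": 10},  # step 2: fine SGD
    {"optimizer": "Adam", "lr": 1e-3, "epochs": 5},   # step 3: fast convergence
]

def training_plan(rounds):
    """Step 4: repeat the three-phase schedule for the given number of rounds."""
    plan = []
    for _ in range(rounds):
        plan.extend(SCHEDULE)
    return plan

plan = training_plan(rounds=2)
```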
The freezing-and-unfreezing method improves the accuracy of the models further, by 3 to 4%. I freeze the earlier layers of a model, truncate the last classification layer, and add two new FC layers, {768,384} and {384,192}, with a new classification layer {192,total classes} at the end.
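A toy sketch of the freeze-and-replace step, using a plain dictionary as a stand-in parameter registry (all layer names here are illustrative; a real implementation would set `requires_grad=False` on the backbone of the trained network and attach a fresh head):

```python
# Stand-in registry for a trained model's parameter groups.
params = {
    "conv1":       {"trainable": True},
    "resblock1":   {"trainable": True},
    "resblock2":   {"trainable": True},
    "fc_old_head": {"trainable": True},
}

def freeze_backbone_and_replace_head(params):
    for name in params:                    # freeze everything learned so far
        params[name]["trainable"] = False
    params.pop("fc_old_head")              # truncate the old classification layer
    # New head: 768 -> 384 -> 192 -> total classes, trained from scratch.
    for name in ("fc_new1_768x384", "fc_new2_384x192", "fc_new_cls_192xC"):
        params[name] = {"trainable": True}
    return params

params = freeze_backbone_and_replace_head(params)
trainable = [n for n, p in params.items() if p["trainable"]]
```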

6 Technologies to be used
6.1 Software Platform
a) Google Colab
b) Anaconda Distribution

6.2 Hardware Platform


RAM, Hard Disk

6.3 Tools
Laptop

7 Advantages of this Project


It will help replace the Kraken and Centrifuge DNA classification methods, which require a large database; that requirement makes them expensive and hard to transfer, and they become challenging when the amount of noise in the data increases.

8 Future Scope and further enhancement of the Project


My teammate and I have so far worked on bacteria and fungi; this work will help classify bacteria and fungi with low memory requirements and easy transferability.
The project is not complete yet: we still have to collect data for plants, animals and viruses and build models for them.

9 Team Details
Group#   Course Name          Student ID    Student Name   Role        Signature
         Industrial Project   TCA1709030    Samyak Jain    Developer
                              TCA1709021    Samyak Jain    Developer

10 Conclusion
Here the overall accuracy does not go above 28%, because the dataset is small and few rows have all taxa predicted correctly. This can only be improved by using a larger dataset and training the models again.

11 References
https://arxiv.org/pdf/1901.11015.pdf
https://www.ncbi.nlm.nih.gov/
https://github.com/marcomeola/DAIRYdb
http://www.plantgdb.org/
https://rvdb.dbi.udel.edu/
https://www.gbif.org/dataset/search
https://greengenes.secondgenome.com/

