Metagenomics Classification: Project Synopsis
Project Synopsis
Version 1.0
(ECS799)
Degree
BACHELOR OF TECHNOLOGY (CSE)
August, 2020
Table of Contents
1 Project Title.........................................................................................................................................3
2 Domain................................................................................................................................................3
3 Problem Statement.............................................................................................................................3
4 Project Description..............................................................................................................................3
4.1 Scope of the Work.......................................................................................................................3
4.2 Project Modules...........................................................................................................................3
5 Implementation Methodology............................................................................................................3
6 Technologies to be used......................................................................................................................4
6.1 Software Platform........................................................................................................................4
6.2 Hardware Platform......................................................................................................................4
6.3 Tools............................................................................................................................................4
7 Advantages of this Project...................................................................................................................4
8 Future Scope and further enhancement of the Project.......................................................................4
9 Team Details........................................................................................................................................4
10 Conclusion.......................................................................................................................................5
11 References.......................................................................................................................................5
Title: Page 2 of 15
TMU-CCSIT Version 1.1 T001-Project Synopsis
1 Project Title
This project is based on metagenomics classification, a problem from biology; hence it is titled Metagenomics Classification.
2 Domain
This is a research project that uses Deep Learning to achieve the required results.
3 Problem Statement
This project aims to build an inference engine using GeNet, a deep representation for metagenomics classification described in this research paper (https://arxiv.org/pdf/1901.11015.pdf), to replace the Kraken and Centrifuge DNA classification methods. Those methods require large databases, which makes them unaffordable and untransferable, and they become unreliable as the amount of noise in the data increases.
4 Project Description
To counter the above-mentioned problem, a deep learning system is required that can learn from the noise distribution of the input reads. Moreover, a classification model learns a mapping from an input read to class probabilities, and thus does not require a database at run-time. Deep learning systems also provide representations of DNA sequences that can be leveraged for downstream tasks.
A DNA sequence is also called a read; it is represented by the characters (G, T, A, C) and varies from organism to organism.
Taxonomy is the classification of an organism in this order: Kingdom, Phylum, Class, Order, Family, Genus, Species.
We predict the taxonomy by passing a read to seven models simultaneously; each model classifies one rank of the above taxa. The combined results of these models classify the read.
There are six kingdoms in the biological system: Plants, Animals, Protists, Fungi, Archaebacteria, Eubacteria.
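The per-rank combination described above can be sketched as follows. This is a minimal illustration in Python; the labels and probabilities are invented for the example, not real model output.

```python
# Sketch: combine per-rank predictions from seven independent models.
# The labels and probabilities below are illustrative, not real model output.

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def argmax_label(probs):
    """Return the label with the highest predicted probability."""
    return max(probs, key=probs.get)

def classify_read(per_rank_probs):
    """per_rank_probs: dict mapping rank -> {label: probability}."""
    return {rank: argmax_label(per_rank_probs[rank]) for rank in RANKS}

# Hypothetical output of the seven models for one read:
example = {
    "kingdom": {"Eubacteria": 0.9, "Archaebacteria": 0.1},
    "phylum":  {"Firmicutes": 0.7, "Proteobacteria": 0.3},
    "class":   {"Bacilli": 0.8, "Clostridia": 0.2},
    "order":   {"Lactobacillales": 0.6, "Bacillales": 0.4},
    "family":  {"Lactobacillaceae": 0.75, "Streptococcaceae": 0.25},
    "genus":   {"Lactobacillus": 0.65, "Pediococcus": 0.35},
    "species": {"L. acidophilus": 0.55, "L. casei": 0.45},
}
print(classify_read(example)["kingdom"])  # -> Eubacteria
```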
This project is vast, so it is divided into sub-projects by kingdom, and each sub-project needs eight models (seven taxa plus the organism name). This report covers only Eubacteria.
Collection of Data
This is the first step of the project. It requires data that contains reads together with the taxonomy each read belongs to.
Data is collected from these resources for each kingdom:
1. NCBI (https://www.ncbi.nlm.nih.gov/) for all kingdoms.
2. DairyDB (https://github.com/marcomeola/DAIRYdb) for bacteria.
3. PlantGDB (http://www.plantgdb.org/) for plants.
4. RVDB (https://rvdb.dbi.udel.edu/) for viruses.
5. PlutoF (https://www.gbif.org/dataset/search) for fungi.
6. Greengenes (https://greengenes.secondgenome.com/) for archaea.
All these data are available in FASTA file format, which needs preprocessing to filter out the required records and store them in CSV format.
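The FASTA-to-CSV step can be sketched in plain Python as below. This is only an illustration with a tiny inline FASTA string; the real pipeline reads the downloaded files, and the record headers shown are invented.

```python
import csv
import io

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

# Tiny invented FASTA fragment standing in for a downloaded file:
fasta = ">seq1 Lactobacillus\nGATC\nGGTA\n>seq2 Bacillus\nTTAC\n"
records = list(parse_fasta(fasta))
# records == [("seq1 Lactobacillus", "GATCGGTA"), ("seq2 Bacillus", "TTAC")]

# Write the filtered records to CSV (an in-memory buffer here):
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "read"])
writer.writerows(records)
```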
Data Preprocessing
After filtering out the required data using Python, we prepare train, validation and test CSVs. Each column of the main CSV file is treated as the target labels of a particular model, and there are n labels in each column.
Data Balancing
I took 35 rows for each label: labels with fewer than 35 rows were removed, and labels with more than 35 rows were truncated to 35. I created a CSV of 35 rows for each label and put it under the folder of its column.
Next I prepared the train, validation and test CSV files: from each label's CSV I took 20 rows as training data, 10 rows as validation data and 5 rows as test data for that column.
This processing creates a balanced dataset, which helps the model learn each label equally; an imbalanced dataset decreases accuracy.
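The balancing and splitting described above can be sketched like this. The function and data names are illustrative; the 35-row cap and the 20/10/5 split come from the text.

```python
import random

ROWS_PER_LABEL = 35
TRAIN, VALID, TEST = 20, 10, 5  # split described in the text

def balance_and_split(rows_by_label, seed=0):
    """Keep labels with >= 35 rows, truncate to 35, then split 20/10/5.
    rows_by_label: dict mapping label -> list of rows."""
    rng = random.Random(seed)
    splits = {"train": [], "valid": [], "test": []}
    for label, rows in rows_by_label.items():
        if len(rows) < ROWS_PER_LABEL:
            continue                      # drop under-represented labels
        rows = rows[:ROWS_PER_LABEL]      # truncate over-represented labels
        rng.shuffle(rows)
        splits["train"] += [(r, label) for r in rows[:TRAIN]]
        splits["valid"] += [(r, label) for r in rows[TRAIN:TRAIN + VALID]]
        splits["test"]  += [(r, label) for r in rows[TRAIN + VALID:]]
    return splits

# Toy data: label "C" has too few rows and is dropped.
data = {"A": list(range(40)), "B": list(range(35)), "C": list(range(10))}
s = balance_and_split(data)
print(len(s["train"]), len(s["valid"]), len(s["test"]))  # -> 40 20 10
```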
The first layer of the model is a 2D convolutional layer which takes the matrix form of a read as input.
The ResNet blocks in the figure are residual blocks, also called 'skip connections'. They allow gradients to flow through the network directly, without passing through non-linear activation functions. As network depth increases, accuracy saturates and then degrades rapidly; residual blocks are used here to remove that problem.
Each residual block contains two convolutional layers with pooling and batch normalization. Finally, the input is added to the output of the second convolutional layer as the skip connection, which reduces the gradient degradation problem.
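The skip connection can be illustrated with a toy forward pass in NumPy. The two lambda functions below merely stand in for the conv/batch-norm sub-layers; this is a sketch of the residual pattern, not the actual GeNet block.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, f1, f2):
    """Toy residual block: y = relu(f2(relu(f1(x)))) + x.
    f1 and f2 stand in for the conv + batch-norm sub-layers; their output
    shapes must match x so the skip connection can add the input back."""
    out = relu(f2(relu(f1(x))))
    return out + x  # skip connection

x = np.ones(4)
y = residual_block(x, lambda v: 2 * v, lambda v: v - 1)
# f1: 2*1 = 2 -> relu 2; f2: 2-1 = 1 -> relu 1; + x -> 2
print(y)  # -> [2. 2. 2. 2.]
```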
Pooling layers provide an approach to down-sampling feature maps by summarizing the presence of features in patches of the feature map. Two common pooling methods are average pooling and max pooling, which summarize the average presence of a feature and the most activated presence of a feature, respectively. This model uses average pooling.
The batch normalization layer normalizes each input channel across a mini-batch, to speed up training of convolutional neural networks and reduce sensitivity to network initialization.
ReLU refers to the Rectified Linear Unit, the most commonly deployed activation function for the outputs of CNN neurons. It introduces non-linearity into the model's learning process, which helps it learn features more efficiently and makes it robust.
ReLU is simple to compute, and therefore faster than the sigmoid function, and it avoids the vanishing gradient problem.
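ReLU and average pooling are both small enough to show directly. This is a 1D NumPy sketch for illustration only; the model itself operates on 2D feature maps.

```python
import numpy as np

def relu(x):
    """ReLU: pass positives through, clamp negatives to zero."""
    return np.maximum(x, 0.0)

def avg_pool_1d(x, size):
    """Average pooling: mean over non-overlapping windows of `size`."""
    return x[: len(x) // size * size].reshape(-1, size).mean(axis=1)

feature_map = np.array([-2.0, 4.0, 6.0, 2.0])
activated = relu(feature_map)          # -> [0. 4. 6. 2.]
pooled = avg_pool_1d(activated, 2)     # -> [2. 4.]
print(pooled)
```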
Model Evaluation
I use a test dataset of 875 rows to evaluate the combined accuracy of all the models.
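One way to read "combined result" is that a test row counts as correct only when every taxon is predicted correctly, which is the interpretation sketched below with dummy data; the real evaluation code may differ.

```python
# Sketch: a read counts as correct only if all seven taxa are predicted
# correctly, so combined accuracy is stricter than any single model's.

def combined_accuracy(predictions, targets):
    """predictions/targets: lists of 7-tuples (one entry per taxon)."""
    hits = sum(1 for p, t in zip(predictions, targets) if p == t)
    return hits / len(targets)

# Four dummy test rows; the last prediction gets one taxon wrong.
targets = [("Eubacteria",) * 7] * 4
preds = [("Eubacteria",) * 7] * 3 + [("Eubacteria",) * 6 + ("wrong",)]
print(combined_accuracy(preds, targets))  # -> 0.75
```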
Inference Output
'GATGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAGCGGAGTTTAACTGGAAGCACTTGTGCGACCGGATAAACTTA
GCGGCGGACGGGTGAGTAACACGTGAGCAACCTACCTATCGCAGGGGAACAACATTGGGAAACCAGTGCTAATACCGCAT
AACATCTTTTGGGGGCATCCCCGGAAGATCAAAGGATTTCGATCCGGCGACAGATGGGCTCGCGTCCGATTAGCTAGTTG
GTAAGGTAAAAGCTTACCAAGGCAACGATCGGTAGCCGAACTGAGAGGTTGATCGGCCACATTGGGACTGAGACACGGCC
CAGGCTCCTACGGGAGGCAGCAGTGGGGAATATTGGGCAATGGGGGAAACCCTGACCCAGCAACGCCGCGTGAAGGAAGA
AGGCCTTCGGGTTGTAAACTTCTTTGATCAGGGACGAAACAAATGACGGTACCTGAAGAACAAGTCACGGCTAACTACGT
GCCAGCAGCCGCGGTAATACGTAGGTGACAAGCGTTATCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCGTAA
GTTGGATGTGAAATTCTCAGGCTTAACCTGAGAGGGTCATCCAAAACTGCAAAACTTGAGTACTGGAGAGGATAGTGGAA
TTCCTAGTGTAGCGGTAAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTATCTGGACAGTAACTGACGC
TGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAATACTAGGTGTAGG
GGGTATCGACCCCCCCTGTGCCGCAGCTAACGCAATAAGTATTCCACCTGGGGAGTACGACCGCAAGGTTGAAACTCAAA
GGAATTGACGGGGGCCCGCACAAGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGGCTTGAC
ATCCTCTGACGGCTGTAGAGATACAGCTTTCCCTTCGGGGACAGAGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGT
CGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATGGTCAGTTGCCAGCACGTAATGGTGGGCACTCTGGCA
AGACTGCCGTTGATAAAACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGTCCTGGGCTACACACGTAC
TACAATGGCAACAACAGAGGGCAGCCAGGTCGCGAGGCCGAGCGAATCCCAAAATGTTGTCTCAGTTCAGATTGCAGGCT
GCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATGGCAGGTCAGCATACTGCCGTGAATACGTTCCCGGGTCTTGTAC
ACACCGCCCGTCACACCATGAGAGTTTGTAACACCCGAAGTCAGTAGTCTGACCGTAAGGAGGGCGCTGCCGAAGGTGGG
ACAGATAATTGGGGTG'
5 Implementation Methodology
Dataset-
Each read is a string of the characters (G, T, A, C) and varies from organism to organism.
Vector Representation-
Each character of a read is encoded as a number, and the resulting numeric sequence is converted into a 2D array. After that we perform normalization and move to the next step.
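One possible encoding is sketched below. The base-to-integer mapping, the fixed matrix shape, and the normalization scheme are all assumptions for illustration; the source does not specify them.

```python
import numpy as np

# Assumed encoding: map each base to an integer, pad/truncate to a fixed
# length, reshape to 2D, then normalize to [0, 1].
BASE_TO_INT = {"G": 0, "T": 1, "A": 2, "C": 3}

def encode_read(read, shape=(4, 4)):
    """Encode a read string as a normalized 2D matrix of the given shape."""
    length = shape[0] * shape[1]
    codes = [BASE_TO_INT[b] for b in read[:length]]
    codes += [0] * (length - len(codes))            # pad short reads
    matrix = np.array(codes, dtype=float).reshape(shape)
    return matrix / 3.0                             # normalize to [0, 1]

m = encode_read("GATGAACGCTGGCGGC")
print(m.shape)  # -> (4, 4)
```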
Learning Process-
A GeNet architecture model, based on a convolutional neural network, is used for each taxon.
Trained Models-
There are eight models (seven taxa plus the organism name) whose results are combined.
Layer 1: {768, 384}
Layer 2: {384, 768}
Layer 3: {768, total classes}
I used a dropout layer after the first two FC layers and at the end of the residual blocks, so that each neuron learns effectively: during training, neurons are dropped with a probability of 20%.
I chose three FC layers without much variance in the number of neurons, making the network deeper rather than wider, because very wide layers memorize the output and do not generalize to different data.
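The three FC layers with 20% dropout can be sketched as a NumPy forward pass. The weight initialization and the placeholder class count are assumptions; only the layer shapes {768, 384}, {384, 768}, {768, classes} come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.2, training=True):
    """Inverted dropout: zero each activation with probability p."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def relu(x):
    return np.maximum(x, 0.0)

n_classes = 10  # placeholder; the real value depends on the taxon
W1 = rng.standard_normal((768, 384)) * 0.01
W2 = rng.standard_normal((384, 768)) * 0.01
W3 = rng.standard_normal((768, n_classes)) * 0.01

def head(x, training=True):
    h = dropout(relu(x @ W1), training=training)   # Layer 1: {768, 384}
    h = dropout(relu(h @ W2), training=training)   # Layer 2: {384, 768}
    return h @ W3                                  # Layer 3: {768, classes}

logits = head(rng.standard_normal(768))
print(logits.shape)  # -> (10,)
```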
The graph above shows spikes when the Adam optimizer trains the model with a learning rate of 0.001.
The graph above shows a downward slope when the SGD optimizer trains the model with a learning rate of 0.001.
Step 1-
I started training with the SGD optimizer and a learning rate of 0.001. SGD (Stochastic Gradient Descent) selects a few samples from the whole dataset at random for each iteration, so it performs many iterations and lets the model learn slowly; the 0.001 learning rate helps it move toward the global minimum, which corresponds to correct predictions for the input.
Step 2-
After training with a 0.001 learning rate, I trained with a 0.0001 learning rate so the model learns even more slowly and picks up more features, as the gradient moves gradually toward the global minimum.
Step 3-
In this step I replaced the SGD optimizer with Adam, which is a combination of RMSprop and stochastic gradient descent with momentum: it uses squared gradients to scale the learning rate, like RMSprop, and takes advantage of momentum by using a moving average of the gradient instead of the gradient itself, like SGD with momentum. Adam converges faster than SGD; with a learning rate of 0.001 it moves the model toward the global minimum. This step sometimes increases the accuracy abruptly.
Step 4- Repeat the above steps until the accuracy is high.
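The learning-rate hand-off in Steps 1 and 2 can be illustrated with plain gradient descent on a toy one-dimensional loss. This sketch omits the Adam phase and the real model entirely; it only shows how a coarse rate followed by a finer rate approaches the minimum.

```python
# Toy sketch of the schedule above: gradient descent on f(w) = w**2,
# first with lr 0.001, then lr 0.0001 (standing in for the SGD phases).

def grad(w):
    """Derivative of the toy loss f(w) = w**2."""
    return 2.0 * w

def train(w, lr, steps):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w = 10.0
w = train(w, lr=0.001, steps=500)    # Step 1: coarse progress
w = train(w, lr=0.0001, steps=500)   # Step 2: finer progress
print(0.0 < w < 10.0)  # -> True (closer to the minimum at 0)
```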
The freezing and unfreezing method improves the accuracy of the model by a further 3 to 4%.
I freeze the earlier layers of the model, truncate the last classification layer, and add two new FC layers, {768, 384} and {384, 192}, with a new classification layer {192, total classes} at the end.
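Freezing can be sketched as updating only the new head's parameters during the optimizer step. The parameter dict, shapes of the backbone weight, and the placeholder class count are assumptions; only the new head shapes {768, 384}, {384, 192}, {192, classes} come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# A parameter dict stands in for the real network.
n_classes = 10  # placeholder
params = {
    "backbone.W": rng.standard_normal((768, 768)) * 0.01,   # frozen
    "head.W1": rng.standard_normal((768, 384)) * 0.01,      # new FC {768, 384}
    "head.W2": rng.standard_normal((384, 192)) * 0.01,      # new FC {384, 192}
    "head.W3": rng.standard_normal((192, n_classes)) * 0.01,  # new classifier
}
trainable = {name: name.startswith("head.") for name in params}

def sgd_step(params, grads, lr=0.001):
    """Update only unfrozen parameters; frozen ones keep their values."""
    for name in params:
        if trainable[name]:
            params[name] -= lr * grads[name]

frozen_before = params["backbone.W"].copy()
head_before = params["head.W1"].copy()
grads = {name: np.ones_like(w) for name, w in params.items()}
sgd_step(params, grads)
print(np.array_equal(params["backbone.W"], frozen_before))  # -> True
```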
6 Technologies to be used
6.1 Software Platform
a) Google Colab
b) Anaconda Distribution
6.2 Hardware Platform
Laptop
7 Advantages of this Project
Unlike Kraken and Centrifuge, this approach does not require a large database; large databases make those methods unaffordable and untransferable, and they become challenging as the amount of noise in the data increases.
9 Team Details
Course Name: Industrial Project
Student ID    Student Name    Role         Signature
TCA1709030    Samyak Jain     Developer
TCA1709021    Samyak Jain     Developer
10 Conclusion
The overall accuracy does not go above 28%, because the dataset is small and few rows have all taxa predicted correctly. It can only be improved by using more data.
11 References
https://arxiv.org/pdf/1901.11015.pdf
https://www.ncbi.nlm.nih.gov/
https://github.com/marcomeola/DAIRYdb
http://www.plantgdb.org/
https://rvdb.dbi.udel.edu/
https://www.gbif.org/dataset/search
https://greengenes.secondgenome.com/