
Dataset

I have converted the sequences to k-mers. In bioinformatics, k-mers are substrings of a sequence with a length of k. The k-mer method has been used by other works such as [2], [5], and [6] and has proven to be effective. Using k-mers, we convert each window of k nucleotides into a word; e.g. with k = 3 the sequence 'CCAGCTG' turns into the list ['cca', 'cag', 'agc', 'gct', 'ctg']. We build a dictionary of these k-mer words and assign a number to each word in the dictionary, so that when the sentences are given to the network each word is represented by an integer. Before we feed the words (their integer equivalents) to the neural network layers, we use word embeddings to convert each token to a vector. The words are transformed into a vector space of arbitrary dimension such that words that are closer in the vector space are expected to have the same or similar meanings. I used an embedding size of n = 50, which generates a vector of size 50 for each k-mer that is learned by the model. The k-mer method with k = 3 is applied to the sequences, and 259 unique k-mers of size 3 are generated from the training set.
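A minimal sketch of this k-mer tokenization and integer encoding, assuming overlapping windows as in the example above; the function and variable names are illustrative and not taken from the project code.

# Minimal sketch of the k-mer tokenization and integer encoding (names are illustrative).
def to_kmers(sequence, k=3):
    """Split a nucleotide sequence into overlapping k-mers (lower-cased words)."""
    sequence = sequence.lower()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Example: 'CCAGCTG' -> ['cca', 'cag', 'agc', 'gct', 'ctg']
print(to_kmers("CCAGCTG", k=3))

def build_vocab(sequences, k=3):
    """Map each unique k-mer in the training sequences to an integer id (0 is kept for padding)."""
    vocab = {}
    for seq in sequences:
        for kmer in to_kmers(seq, k):
            if kmer not in vocab:
                vocab[kmer] = len(vocab) + 1
    return vocab

def encode(sequence, vocab, k=3):
    """Turn one sequence into a list of integer ids, skipping k-mers unseen in training."""
    return [vocab[kmer] for kmer in to_kmers(sequence, k) if kmer in vocab]

In the models, these integer ids are then mapped to 50-dimensional vectors by an embedding layer, e.g. tf.keras.layers.Embedding(input_dim=vocab_size + 1, output_dim=50), whose weights are learned during training.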

The average sequence length in the training set is 24861 nucleotides, and the longest sequence in the training dataset is 30121 nucleotides. Each of the six classes has 250 samples, which makes the training set well balanced.
I use the t-SNE method to reduce the dimensionality of my data in order to visualize the training and validation data in a 2D space. As can be observed in figure 2, the sequences related to SARS-CoV-1, SARS-CoV-2, and MERS appear to be more closely related to each other. This is because these three pathogens belong to the same genus, Betacoronavirus, and we can see that this is reflected in the data as well. I expect that differentiating these three pathogens from each other will be the more challenging part of the classification task. The visualization of the sequences in the validation set demonstrates the same pattern, with the mentioned three pathogens being closer to each other and the other three being less closely related.

Figure 2: 2D visualization of data samples from the training and validation sets using the t-SNE feature reduction technique
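A sketch of how such a 2D t-SNE projection can be produced with scikit-learn; the feature representation passed to t-SNE is an assumption here (X stands for a 2D array of encoded sequences, e.g. padded integer ids or k-mer count vectors, and y for the class labels).

# Sketch of the 2D t-SNE visualization of the dataset (X and y are assumptions, see above).
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X, y, title="t-SNE projection of the training set"):
    # Reduce the high-dimensional sequence representation to two dimensions.
    coords = TSNE(n_components=2, random_state=0).fit_transform(X)
    scatter = plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="tab10", s=10)
    plt.legend(*scatter.legend_elements(), title="class")
    plt.title(title)
    plt.show()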

1.1 Transformer Networks


Transformer networks were introduced in the paper "Attention Is All You Need" by Vaswani et al. The original paper uses the transformer in an encoder-decoder architecture; however, I only utilize the encoder part for the classification task.
Multi-head attention computes the attention scores on multiple sets of query, key, and value vectors, and the operations are performed in parallel on these sets. The h sets of attention results are then concatenated and fed to a feed-forward layer. The number of attention heads is set to 2 for the transformer models in my project. Due to the long length of the input sequences, I was not able to feed them directly to the transformer model; therefore, I have used a convolution block that includes a conv1d layer and a max-pooling layer to reduce the dimensions of the input. The positionally encoded embeddings are fed into the conv block, and the result of the convolution is then passed on to the attention block. The output of the attention block is fed to a global average pooling layer and then to a feed-forward layer of 20 nodes before the final softmax layer.
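A compact Keras sketch of this architecture (embedding with positional encoding, a conv block, 2-head self-attention with 10% dropout, global average pooling, a 20-node feed-forward layer, and a softmax output). Details not pinned down in the text, such as key_dim, the learned positional embedding, and the residual connections with layer normalization, are assumptions based on the standard transformer encoder.

import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    # Token embedding of size 50 plus a learned positional embedding
    # (the exact positional-encoding scheme is an assumption).
    def __init__(self, max_len, vocab_size, embed_dim=50):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)

def build_cnn_transformer(vocab_size, max_len, n_filters=32, n_classes=6):
    inputs = layers.Input(shape=(max_len,), dtype="int32")
    x = TokenAndPositionEmbedding(max_len, vocab_size, embed_dim=50)(inputs)
    # Conv block: conv1d + max-pooling reduces the sequence length before attention.
    x = layers.Conv1D(n_filters, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
    # Encoder block: 2-head self-attention with 10% dropout, followed by a feed-forward layer
    # (residual connections, layer normalization, and key_dim are assumptions).
    attn = layers.MultiHeadAttention(num_heads=2, key_dim=32, dropout=0.1)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(n_filters, activation="relu")(x)
    x = layers.LayerNormalization()(x + ff)
    # Classification head: global average pooling, 20-node feed-forward layer, softmax.
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(20, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

Changing n_filters to 32, 64, or 128 yields the three CNN_Transformer variants discussed later.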

1.2 Baseline CNN


Although Convolutional Neural Networks (CNNs) are mostly used with image data, they are also applied to Natural Language Processing (NLP) tasks. Since DNA sequence classification is similar to text classification, these networks can be utilized for the task. I use a simple CNN as my baseline method, which has two blocks of convolution. Each block includes a 1D conv layer with a kernel size of 3 followed by a max-pooling layer with a pool size and stride of 2. The first conv layer has 32 filters of size 3 and the second has 64 filters of the same size. I expect the transformer models to outperform this simple CNN easily, as transformer models should be better suited to the lengthy sequences of my dataset.
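A sketch of this baseline in Keras; the embedding size and the use of global average pooling mirror the rest of the report (section 2.1), while the output head is otherwise an assumption.

import tensorflow as tf
from tensorflow.keras import layers

def build_baseline_cnn(vocab_size, max_len, n_classes=6):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,), dtype="int32"),
        layers.Embedding(vocab_size, 50),            # same 50-dimensional k-mer embedding
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2, strides=2),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2, strides=2),
        layers.GlobalAveragePooling1D(),             # used instead of flattening (section 2.1)
        layers.Dense(n_classes, activation="softmax"),
    ])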

1.3 Metric
The authors of the paper use the plain accuracy metric, which is the number of correct predictions divided by the total number of predictions. In addition, I use the F1 score to evaluate my models, because there is a class imbalance in the test files. In such scenarios, a model that only outputs the class to which the majority of the samples belong can still achieve high accuracy even if its predictions for the other classes are wrong. The F1 score is defined in equation 1. This metric takes into account both precision and recall, so the model with the smallest number of false positives and false negatives will have the higher score.
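For reference, the standard definitions of precision, recall, and the F1 score are:

\[
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]

For the six-class setting, the per-class F1 scores are then combined into a single score; the exact averaging scheme (e.g. macro- or weighted-averaging) is not stated in the text.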
have created three different models by combining CNN and Transformers which I will call CNN_transformers. The first
CNN_Tranformer has 32 conv1d filters as its first layer, the second has 64 filters and the third has 128. I have depicted the architecture
of CNN_Tranformer_32 in figure 3. I have used drop out with the rate of 10% as the regularization technique with my
CNN_Transformer models inside the attention block.
The reported score for each model is the result of averaging the F1 scores it achieved on the five test files.

2 Results
2.1 Experimental setup
I have used the TensorFlow framework to implement the models in this work, and the models are trained on an NVIDIA GTX 1080Ti GPU. Due to resource limitations, the batch size has been set to 4 and all models have been trained for 20 epochs. The models' weights have been initialized using TensorFlow's default method, "glorot uniform". Global average pooling has been used instead of flattening in all the models.
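A sketch of this training configuration, reusing the build_cnn_transformer sketch from section 1.1; the optimizer and loss are assumptions (they are not stated in the text), and X_train/y_train, X_val/y_val stand for the encoded, padded data arrays.

# Sketch of the training setup: batch size 4, 20 epochs, default glorot_uniform initialization.
model = build_cnn_transformer(vocab_size=260,   # 259 unique k-mers + 1 for padding
                              max_len=30121,    # roughly the longest training sequence
                              n_filters=32)
model.compile(optimizer="adam",                          # assumption: optimizer not stated
              loss="sparse_categorical_crossentropy",    # assumption: integer class labels
              metrics=["accuracy"])
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=4,   # kept small due to limited resources
          epochs=20)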
2.2 Maximum-length sequences
The longest sequence in the training dataset was 30121 nucleotides long. Using zero-padding, I have padded all shorter sequences to the same length as the longest one, and the sequences in the test files that are longer than this length are trimmed from the right end.
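A sketch of this padding and trimming step using Keras' pad_sequences; encoded_train and encoded_test stand for the lists of integer-encoded sequences from the dataset section, and placing the padding zeros at the end ("post") is my assumption, since the text only specifies zero-padding and right-end trimming.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Length of the longest training sequence (the exact token count after k-mer
# conversion may differ slightly from the nucleotide count).
max_len = 30121
X_train = pad_sequences(encoded_train, maxlen=max_len, padding="post", truncating="post")
# Longer test sequences are trimmed from the right end, as described above.
X_test = pad_sequences(encoded_test, maxlen=max_len, padding="post", truncating="post")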
The scores attained by the baseline and the CNN_Transformer models are very close. The CNN model manages to reach 0.996, and CNN_Transformer_32 reaches 0.985, which is a higher score than the other two transformer-based models. The baseline model can be trained much faster than the other models, with only 3 seconds required to complete each epoch. On the other hand, the transformer-based models train relatively slowly, with epoch times of 47, 49, and 60 seconds for the models with 32, 64, and 128 filters in their input, respectively. The results of the training are displayed in table 1.
The median length of the sequences in the training dataset is 20557. All sequences longer than this were resized by removing nucleotides from the right side, and zero-padding was applied to all shorter sequences to bring them to the same length. In this setting, the baseline only attains a score of 0.906, while the transformer models with 32 and 64 filters outperform the baseline and achieve scores of 0.986 and 0.959, respectively. The transformer with 128 filters attains a score of 0.896, which is almost the same as the baseline. As in the previous experiment, the CNN model trains much faster than the transformers, with virtually the same epoch times for each model as before.

3 Discussion
When using the whole sequences, all models manage to achieve high accuracies and F1 scores. The CNN_Transformer model with 32 filters attains almost the same average score as the baseline; however, the other two transformers fail to do so. I believe this might be due to overfitting of the bigger models; perhaps higher dropout rates and longer training durations could alleviate this problem. In the case of median-sized sequences, the transformer with 32 filters outperforms the other models in terms of the F1 score. The baseline, which did very well with the complete sequences, shows a 0.09 drop in its score. After visualizing the confusion matrix of the predictions, depicted in figure 4, I realized that the baseline cannot differentiate between classes 1 and 2, which are SARS-CoV-1 and MERS. If we pay attention to figure 2, we can observe that the samples from these classes appear very close to each other in 2D space. In all five test sets, the average length of the sequences in these two classes is greater than 29,000 nucleotides. When we limit the length of the sequences to 20557, about 9000 nucleotides on average are lost from each sample. I believe this loss makes the detection of these already somewhat similar classes harder for the simple baseline model, whereas the transformer models, which benefit from a sophisticated attention mechanism, can separate these classes more easily. Another interesting insight from the training with median-size sequences is that even when we throw away a considerable proportion of the sequences, the models are still able to learn a great deal about the data.
Ultimately, my experiments demonstrated the utility of CNNs when dealing with sequential data. They can be trained much faster than the other networks, as demonstrated in figure 5, and they show excellent results; the baseline achieved the highest score among all the models when using the whole sequences. Additionally, we can always benefit from CNNs as feature extractors: the input sequences are downsized by a factor of 4 by the conv layers and the follow-up max-pooling layers, yet the attention heads can still learn the data well and achieve high scores on the test files.
This project demonstrates that the biggest model does not always yield the best result. It is perhaps better to start with simpler models, such as a simple CNN, and then try more complex models like transformers.

4 Conclusion
In this project, I aimed to classify six classes of pathogens, three of which were closely related to each other because of the origins of the viruses. I utilized the transformer and attention model for the classification and compared the performance of the transformer with a simple CNN network. Because the sequences in my dataset were thousands of nucleotides long, I was not able to feed them directly to the transformer model, so I used a convolution block to reduce the input size before feeding it to the transformers. When I used the complete sequences for training, the simple CNN showed slightly better results than the CNN_Transformer, and it was trained much faster. However, when I resized the sequences to the median length, the transformer model displayed superior results, with an F1 score 0.08 higher than the baseline.

Model       Maximum-length sequences    Median-length sequences

Baseline    0.996                       0.906

Table 1: Average F1 scores on the five test files produced by the models

REFERENCES
[1] Mikhail S. Gelfand, "Prediction of function in DNA sequence analysis", Journal of Computational Biology, 2(1):87-115, 1995.

[2] Indrajit Saha, et al., "COVID-DeepPredictor: Recurrent Neural Network to Predict SARS-CoV-2 and Other Pathogenic Viruses", Frontiers in Genetics, volume 12, 83, 2021.

[3] Gurjit S. Randhawa, et al., "Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study", PLoS ONE, 15(4):e0232391, 24 Apr. 2020, doi:10.1371/journal.pone.0232391.

[4] Hilal Arslan, "Machine learning methods for COVID-19 prediction using human genomic data", Multidisciplinary Digital Publishing Institute Proceedings, vol. 74, no. 1, 2021.

[5] Daniel Jurafsky and James H. Martin, Speech and Language Processing (draft), [cited 2020 June 1], available from: https://web.stanford.edu/~jurafsky/slp3, 2018.

[6] Masanori Higashihara, Jovan David Rebolledo-Mendez, Yoichi Yamada, and Kenji Satou, "Application of a feature selection method to nucleosome data: accuracy improvement and comparison with other methods", WSEAS Transactions on Biology and Biomedicine, 5(5):95-104, 2008.

Figure 4: Confusion matrix for the predictions of the baseline model. The left side shows the predictions when using the whole sequences; the right side shows the predictions with median-length sequences.

Figure 5: Training history of the baseline model and CNN_Transformer_32. The upper graphs show the training with full-length sequences and the lower ones with median-length sequences. Unlike the transformer model, the baseline reaches high accuracy within the first few epochs in both scenarios.
