
Jurnal Informatika Universitas Pamulang (ISSN: 2541-1004, e-ISSN: 2622-4615)
Publisher: Fakultas Ilmu Komputer Universitas Pamulang
Vol. 9, No. 2, June 2024 (34-41). https://doi.org/10.32493/informatika.v9i2.39355

A Hybrid Model for Human DNA Sequence Classification Using Convolutional Neural Networks and Random Forests

Gregorius Airlangga 1*
1 Information System Study Program, Universitas Katolik Indonesia Atma Jaya, Jakarta, Indonesia
e-mail: [email protected]
*Corresponding author

Submitted Date: May 15th, 2024 Reviewed Date: June 29th, 2024
Revised Date: July 15th, 2024 Accepted Date: July 29th, 2024

Abstract

Human DNA sequence classification is a fundamental task in genomics, essential for understanding genetic variations and their implications in disease susceptibility, personalized medicine, and evolutionary biology. This study proposes a novel hybrid model combining a Convolutional Neural Network (CNN) for feature extraction and a Random Forest classifier for final classification. The model was evaluated on a dataset of human DNA sequences, achieving an accuracy of 75.34%. Performance metrics, including precision, recall, and F1-scores across multiple classes, showed significant improvements over traditional models. The CNN component effectively captures local dependencies and patterns within the sequences, while the Random Forest classifier handles complex decision boundaries, resulting in enhanced classification accuracy. Comparative analysis demonstrated the superiority of our hybrid approach, with the CNN-LSTM model achieving only 59.47% accuracy and other RNN-based models such as CNN-GRU and CNN-BiLSTM performing at similarly low levels. These results suggest that hybrid models can leverage the strengths of both deep learning and traditional machine learning techniques, offering a more effective tool for DNA sequence classification. Future work will optimize the model architecture and explore larger, more diverse datasets to validate our approach's generalizability and robustness.

Keywords: DNA classification; CNN; Random Forests; Hybrid models; Genomic data analysis

1. Introduction

Advancements in the field of genomics have significantly enhanced our understanding of the human genome, paving the way for breakthroughs in medical research, personalized medicine, and biotechnology (Satam et al., 2023; Sindelar, 2024; Wilson et al., 2022). One of the key challenges in genomics is the accurate classification of DNA sequences, which is crucial for identifying genetic disorders, understanding evolutionary relationships, and discovering new genetic markers (Laskar et al., 2021; Maharachchikumbura et al., 2021; Theodoridis et al., 2020). Traditional methods for DNA sequence classification often rely on manual feature extraction and domain-specific knowledge, which can be both time-consuming and prone to human error (Alamro et al., 2024; Landolsi et al., 2024; Papoutsoglou et al., 2023). In recent years, machine learning techniques have emerged as powerful tools for automating the analysis of genomic data, offering the potential for greater accuracy and efficiency (Li et al., 2022; Tan et al., 2021; Waring et al., 2020).

The classification of DNA sequences involves determining the class or category to which a given sequence belongs, based on its nucleotide composition (Tao et al., 2023). This task is challenging due to the vast amount of data and the complex patterns inherent in genomic sequences (Cortés-Ciriano et al., 2022). Traditional approaches, such as k-mer counting and motif analysis, have been used extensively but often require significant preprocessing and domain expertise (Nisa et al., 2021). Machine learning models, particularly deep learning architectures, offer a promising alternative by automating feature extraction and learning directly from raw sequence data (Goshisht, 2024).

This study offers a novel approach for human DNA sequence classification using a combination of deep learning and ensemble learning techniques. Specifically, we employ a Convolutional Neural Network (CNN) for automatic feature extraction from DNA sequences, followed by a Random Forest classifier to perform the final classification. The CNN is designed to capture local patterns in the DNA sequences through convolutional layers, while the Random Forest, an ensemble classifier, leverages the extracted features to make robust predictions. Ensemble classifiers like Random Forest work by combining the predictions of multiple base classifiers, typically decision trees, to enhance overall prediction performance. By aggregating the outputs of these individual trees, the Random Forest reduces the risk of overfitting and increases the model's accuracy and generalizability. This hybrid approach aims to leverage the strengths of both deep learning and traditional machine learning methods, potentially improving classification accuracy and generalizability. The urgency of developing accurate and efficient methods for DNA sequence classification cannot be overstated. With the increasing availability of genomic data, driven by advances in sequencing technologies, there is a pressing need for scalable and reliable analytical methods (Goshisht, 2024). Accurate classification of DNA sequences has far-reaching implications, including the early detection of genetic diseases, identification of therapeutic targets, and advancements in evolutionary biology (Satam et al., 2023). Moreover, the ability to automate this process can significantly reduce the time and resources required for genomic research, accelerating the pace of discovery and innovation (Liu et al., 2020).

Our literature survey reveals a diverse array of approaches for DNA sequence classification, ranging from traditional statistical methods to cutting-edge machine learning algorithms (Cheng et al., 2023). Early methods focused on alignment-based techniques, such as BLAST, which compare DNA sequences to known reference sequences (Wang et al., 2022). While effective, these methods are computationally intensive and may not scale well with large datasets (Rashed et al., 2021). Alignment-free methods, such as k-mer frequency analysis, offer an alternative by representing sequences as fixed-length vectors, enabling faster comparisons. However, these methods often require extensive feature engineering and may not capture complex patterns in the data (Narayanan et al., 2021). Recent advances in machine learning, particularly deep learning, have shown great promise in the field of genomics. Convolutional Neural Networks (CNNs) have been successfully applied to various genomic tasks, including sequence classification, motif discovery, and variant calling (Avanzo et al., 2020). CNNs are well-suited for genomic data due to their ability to capture local dependencies and hierarchical patterns (Walkowiak et al., 2020). However, training deep learning models on genomic data can be challenging due to the high dimensionality and limited availability of labeled data (Meharunnisa et al., 2024). Ensemble learning methods, such as Random Forests, provide a complementary approach by aggregating predictions from multiple models to improve accuracy and robustness (Mahmud et al., 2021).

State-of-the-art methods for DNA sequence classification often combine deep learning with traditional machine learning techniques to leverage their respective strengths (Luo et al., 2021). For instance, hybrid models that integrate CNNs with support vector machines (SVMs) or decision trees have shown improved performance over individual models (Khan et al., 2020). These approaches benefit from the feature extraction capabilities of deep learning and the interpretability and robustness of traditional classifiers (Balamurugan & Gnanamanoharan, 2023; Bian & Priyadarshi, 2024). Our proposed method builds on this paradigm by using a CNN for feature extraction and a Random Forest for classification, aiming to achieve a balance between accuracy, efficiency, and interpretability.

The objective of this study is to develop a robust and accurate method for human DNA sequence classification that can outperform traditional approaches. We aim to demonstrate that the combination of CNN and Random Forest can effectively capture complex patterns in DNA sequences and provide reliable predictions. Additionally, we seek to compare our method with other traditional models, such as k-mer frequency analysis and alignment-based techniques, to highlight the advantages and limitations of each approach. Gap analysis reveals several areas where current methods fall short. Traditional approaches often require extensive preprocessing and feature engineering, which can be both time-consuming and prone to human error. Deep learning models, while powerful, may suffer from overfitting and require large amounts of labeled data for training.
Hybrid models, which combine deep learning and traditional machine learning techniques, offer a promising solution but have not been extensively explored in the context of DNA sequence classification. Our research aims to address these gaps by developing a hybrid model that is both accurate and efficient, with minimal preprocessing requirements.

Our contributions to the field are threefold. First, we propose a novel hybrid model that combines a CNN for feature extraction with a Random Forest for classification, offering a balance between accuracy and interpretability. Second, we conduct a comprehensive comparison of our method with traditional models, demonstrating its advantages in terms of accuracy and efficiency. Third, we provide a detailed analysis of the model's performance, highlighting its ability to capture complex patterns in DNA sequences and its potential for scalability to large datasets. The remainder of this article is organized as follows. In the Methods section, we provide a detailed description of the dataset, preprocessing steps, and model architecture. The Results section presents the performance metrics of our proposed method, along with a comparison to traditional models. Finally, the Conclusion section summarizes our contributions and outlines potential directions for future research.

2. Research Methodology
2.1. Dataset
The dataset used in this study consists of human DNA sequences, each associated with a specific class label indicating its category or function. These sequences are drawn from a comprehensive genomic database, and the dataset encompasses seven distinct classes representing different functional categories. Each DNA sequence is composed of the four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). The sequences vary in length but have an average length of approximately 150 nucleotides. The dataset is stored in a tab-separated text file with columns representing the DNA sequences and their corresponding class labels. The dataset can be downloaded from (Vasani, 2022).

2.2. Preprocessing Steps
Preprocessing is a crucial step in preparing the dataset for model training. The steps involved in preprocessing the dataset are as follows. First, missing values are handled and the data is cleansed. The dataset is checked for missing values and inconsistencies, and any sequences with missing nucleotides or ambiguous characters (e.g., 'N' for unknown bases) are either removed or replaced based on the overall quality and importance of the data. This ensures that the input data is clean and reliable, which is essential for both the CNN and the Random Forest to learn effectively.

Furthermore, outlier detection and treatment are conducted. Outliers in the DNA sequences, which could be unusually short or long sequences or sequences with atypical nucleotide distributions, are identified. These outliers are either corrected, if possible, or removed to prevent them from skewing the model's learning process. Then, k-mer transformation is conducted: the DNA sequences are converted into k-mers of length 3. A k-mer is a substring of length $k$ from a sequence. For a DNA sequence $S = s_1, s_2, \ldots, s_n$, where $s_i$ represents the i-th nucleotide, the sequence is transformed into overlapping k-mers such that each k-mer is $(s_i, s_{i+1}, \ldots, s_{i+k-1})$. For example, for $k = 3$, the sequence AGCTCGA would be represented as AGC, GCT, CTC, TCG, CGA. This transformation helps capture local patterns in the sequences.

Next, the class labels are encoded into numerical values using a label encoder. Let the class labels be $C = \{c_1, c_2, \ldots, c_m\}$, where $c_i$ represents the i-th class. The label encoder assigns a unique integer to each class, transforming the labels into $C' = \{y_1, y_2, \ldots, y_m\}$, where $y_i$ is the encoded value of class $c_i$. The k-mers are then tokenized, converting them into sequences of integers. Let the vocabulary of k-mers be $V = \{v_1, v_2, \ldots, v_{|V|}\}$, where $v_i$ represents the i-th unique k-mer. The tokenizer maps each k-mer to a unique integer, transforming the sequence of k-mers into a sequence of integers $T = \{t_1, t_2, \ldots, t_n\}$, where $t_i$ is the integer representation of the i-th k-mer.

To ensure uniform input dimensions for the neural network, the tokenized sequences are padded to a fixed length. Let $L$ be the desired sequence length. If the length of a tokenized sequence $T$ is less than $L$, it is padded with zeros to obtain a sequence of length $L$. This results in a padded sequence $T' = \{t'_1, t'_2, \ldots, t'_L\}$, where $t'_i$ is either an integer token or zero. Finally, the dataset is split into training and testing sets. Let $X$ represent the set of padded sequences and $Y$ represent the set of encoded labels. The dataset is split into a training set $(X_{\text{train}}, Y_{\text{train}})$ and a testing set $(X_{\text{test}}, Y_{\text{test}})$ using an 80-20 split, where 80% of the data is used for training and 20% for testing. There are seven class labels: G-protein coupled receptor, tyrosine kinase, tyrosine phosphatase, synthetase, synthase, ion channel, and transcription factor.
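To make these steps concrete, the following minimal Python sketch (illustrative only, not taken from the paper) walks through the preprocessing pipeline described above: 3-mer extraction, label encoding, k-mer tokenization, zero-padding, and the 80-20 split. The file name, column names, padding length of 150, and the Keras and scikit-learn utilities are assumptions made for the example.

# Minimal preprocessing sketch (file/column names and padding length are assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def to_kmers(seq, k=3):
    # Overlapping k-mers: AGCTCGA -> AGC GCT CTC TCG CGA for k = 3.
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

df = pd.read_csv("human_dna.txt", sep="\t")            # columns 'sequence', 'class' (assumed)
df = df[~df["sequence"].str.contains("N")]             # drop sequences with ambiguous bases
kmer_sentences = [" ".join(to_kmers(s)) for s in df["sequence"]]

labels = LabelEncoder().fit_transform(df["class"])     # seven classes -> integers 0..6

tokenizer = Tokenizer()                                # maps each unique k-mer to an integer id
tokenizer.fit_on_texts(kmer_sentences)
token_seqs = tokenizer.texts_to_sequences(kmer_sentences)
X = pad_sequences(token_seqs, maxlen=150, padding="post")   # zero-pad to fixed length L = 150

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)         # 80-20 split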
2.3. Model Architecture
As presented in Figure 1, the proposed model architecture combines a Convolutional Neural Network (CNN) for feature extraction with a Random Forest classifier for final classification. The CNN is designed to capture local patterns in the DNA sequences, while the Random Forest leverages these features for robust predictions.

Figure 1. Model's Architecture

2.3.1. Convolutional Neural Network (CNN)
The CNN consists of several layers designed to extract features from the input sequences. The architecture is as follows: firstly, an embedding layer maps the integer-encoded k-mers into dense vectors of fixed size. Let $E$ be the embedding matrix of size $|V| \times d$, where $|V|$ is the size of the k-mer vocabulary and $d$ is the embedding dimension. The embedding layer transforms the input sequence $T'$ into a sequence of dense vectors $Z = \{z_1, z_2, \ldots, z_L\}$, where $z_i \in \mathbb{R}^d$ is the embedding of the i-th k-mer. A convolutional layer applies a set of filters to the embedded sequences to capture local patterns. Let $F$ be the number of filters and $k_f$ be the filter size. Each filter $W \in \mathbb{R}^{k_f \times d}$ is convolved with the input sequence to produce a feature map. The convolution operation is defined as $h_i = f(W \cdot z_{i:i+k_f-1} + b)$, where $h_i$ is the i-th element of the feature map, $f$ is the activation function (ReLU), $\cdot$ denotes the dot product, and $b$ is the bias term.

A global max pooling layer reduces the dimensionality of the feature maps by taking the maximum value over each feature map. This operation produces a fixed-length feature vector $h = \{h_1, h_2, \ldots, h_F\}$, where $h_i$ is the maximum value in the i-th feature map. Fully connected layers further process the extracted features. Let $W_f \in \mathbb{R}^{F \times H}$ and $W_g \in \mathbb{R}^{H \times G}$ be the weight matrices of the fully connected layers, where $H$ and $G$ are the number of units in the respective layers. The output of the fully connected layers is given by $y_f = f(W_f \cdot h + b_f)$ and $y_g = f(W_g \cdot y_f + b_g)$, where $b_f$ and $b_g$ are the bias terms, and $f$ is the ReLU activation function. The output $y_g$ of the second fully connected layer is used as the feature vector for the subsequent classifier.
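The layer sequence described above (embedding, convolution, global max pooling, and two fully connected layers) can be sketched as follows. The paper does not report the embedding dimension $d$, the number of filters $F$, the filter size $k_f$, or the dense-layer sizes $H$ and $G$, so the values below, and the choice to train the extractor with a temporary softmax head, are assumptions for illustration.

# Illustrative CNN feature extractor; all layer sizes are assumed values.
from tensorflow.keras import layers, models

vocab_size = len(tokenizer.word_index) + 1        # |V| + 1 to account for the padding index 0

inputs = layers.Input(shape=(150,))               # padded k-mer ids, length L = 150 (assumed)
x = layers.Embedding(input_dim=vocab_size, output_dim=64)(inputs)    # d = 64 (assumed)
x = layers.Conv1D(filters=128, kernel_size=5, activation="relu")(x)  # F = 128, k_f = 5 (assumed)
x = layers.GlobalMaxPooling1D()(x)                # fixed-length vector h
x = layers.Dense(128, activation="relu")(x)       # y_f, H = 128 (assumed)
features = layers.Dense(64, activation="relu")(x) # y_g, G = 64 (assumed), used as features
outputs = layers.Dense(7, activation="softmax")(features)  # temporary head over the 7 classes

cnn = models.Model(inputs, outputs)
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
cnn.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.1)

# Reuse the trained layers up to y_g as the feature extractor for the next stage.
feature_extractor = models.Model(inputs, features)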

2.3.2. Random Forest Classifier


The extracted features $y_g$ are used to train a Random Forest classifier. A Random Forest is an ensemble learning method that constructs multiple decision trees and aggregates their predictions. Let $F_i$ be the i-th decision tree in the forest, and $n$ be the total number of trees. The prediction of the Random Forest for an input feature vector $y_g$ is given by the majority vote of the individual trees, $\hat{y} = \text{mode}\{F_i(y_g) \mid i = 1, \ldots, n\}$, where $\hat{y}$ is the predicted class label. The Random Forest classifier is trained on the features extracted from the training set $(X_{\text{train}}, Y_{\text{train}})$ and evaluated on the test set $(X_{\text{test}}, Y_{\text{test}})$.
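Assuming the feature_extractor and data splits from the previous sketches, the second stage can be written as below; the number of trees is an assumed value, since the paper does not report the Random Forest hyperparameters.

# Train the Random Forest on the CNN features y_g and score it on the held-out split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train_feats = feature_extractor.predict(X_train)   # y_g for the training sequences
test_feats = feature_extractor.predict(X_test)

rf = RandomForestClassifier(n_estimators=200, random_state=42)   # number of trees assumed
rf.fit(train_feats, y_train)

y_pred = rf.predict(test_feats)                    # majority vote over the individual trees
print("Hybrid CNN + Random Forest accuracy:", accuracy_score(y_test, y_pred))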
2.4. Evaluation
The performance of the Random Forest classifier is evaluated using accuracy, precision, recall, and F1-score metrics. Accuracy is the ratio of correctly predicted instances to the total number of instances, $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$. Precision is the ratio of correctly predicted positive instances to the total predicted positive instances, $\text{Precision} = \frac{TP}{TP + FP}$. Furthermore, recall is the ratio of correctly predicted positive instances to the total actual positive instances, $\text{Recall} = \frac{TP}{TP + FN}$. The F1-score is the harmonic mean of precision and recall, $\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$, where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively.
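Under the same assumptions as the earlier sketches, these metrics can be computed directly with scikit-learn; macro averaging corresponds to the unweighted per-class mean used in the results below.

# Accuracy, macro-averaged precision/recall/F1, and a per-class breakdown.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))
print("f1-score :", f1_score(y_test, y_pred, average="macro"))
print(classification_report(y_test, y_pred))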
2.5. Comparison with Traditional Models
The proposed method is compared with traditional models, including k-mer frequency analysis and alignment-based techniques. These methods involve manually extracting features from DNA sequences and using standard classifiers like Support Vector Machines (SVMs) or k-Nearest Neighbors (k-NN). In k-mer frequency analysis, k-mer counts are extracted from the DNA sequences and used as features for classification. Let $C(k)$ be the k-mer count vector for a sequence, representing the frequency of each k-mer in the sequence. These count vectors are used to train classifiers such as SVMs or k-NN.

In alignment-based techniques, DNA sequences are aligned to known reference sequences using tools like BLAST. The alignment scores are used as features for classification. Let $A(s)$ be the alignment score vector for a sequence, representing the similarity scores to reference sequences. These score vectors are used to train classifiers. The performance of the traditional models is evaluated using the same metrics as the proposed method, allowing for a comprehensive comparison.
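A minimal sketch of the k-mer frequency baseline is given below, assuming the same data frame and encoded labels as in the preprocessing sketch; the CountVectorizer configuration and SVM hyperparameters are illustrative assumptions, and a k-NN classifier could be substituted in the same way.

# k-mer frequency baseline: bag-of-3-mers count vectors C(k) fed to an SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

seq_train, seq_test, y_tr, y_te = train_test_split(
    df["sequence"], labels, test_size=0.2, random_state=42)

vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))  # character 3-grams = 3-mers
svm = SVC(kernel="rbf", C=1.0)                                     # hyperparameters assumed
svm.fit(vectorizer.fit_transform(seq_train), y_tr)

pred = svm.predict(vectorizer.transform(seq_test))
print("k-mer frequency + SVM accuracy:", accuracy_score(y_te, pred))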
3. Results and Discussion
The proposed model, which integrates a Convolutional Neural Network (CNN) for feature extraction and a Random Forest classifier for final classification, demonstrated an overall accuracy of 0.753 on the test set. The detailed performance metrics, including precision, recall, and F1-score for each class, are presented in Table 2. The model achieved a balanced performance across different classes, with precision values ranging from 0.65 to 0.98, recall values ranging from 0.64 to 0.90, and F1-scores ranging from 0.65 to 0.88. The highest precision (0.98) was observed for class 2, indicating a strong ability to correctly identify positive instances of this class. Class 6 had the highest recall (0.90), reflecting the model's effectiveness in capturing most of the actual positive instances for this class. The macro-averaged F1-score, which considers the F1-score for each class and computes their unweighted mean, was 0.76, highlighting the model's overall balanced performance. The results indicate that the proposed model outperforms several other models in terms of accuracy.

As summarized in Table 1, the CNN-LSTM model achieved an accuracy of 0.5947, precision of 0.7628, recall of 0.4660, and F1-score of 0.5756. The CNN-GRU model had an accuracy of 0.5571, precision of 0.7607, recall of 0.4025, and F1-score of 0.5239. The CNN-BiLSTM model achieved an accuracy of 0.6110, precision of 0.7690, recall of 0.5039, and F1-score of 0.6042. The standalone CNN achieved an accuracy of 0.7486, precision of 0.8918, recall of 0.6934, and F1-score of 0.7800. The LSTM model achieved an accuracy of 0.7395, precision of 0.8667, recall of 0.6856, and F1-score of 0.7646. The GRU model had an accuracy of 0.7263, precision of 0.8908, recall of 0.6258, and F1-score of 0.7342. The BiLSTM model achieved an accuracy of 0.7397, precision of 0.8881, recall of 0.6575, and F1-score of 0.7546.

The performance comparison reveals several important insights. Firstly, the proposed hybrid model (CNN + Random Forest) exhibits superior performance compared to the CNN-LSTM, CNN-GRU, and CNN-BiLSTM models. This suggests that while combining a CNN with recurrent neural network (RNN) architectures like LSTM, GRU, or BiLSTM can capture sequential dependencies in the data, the Random Forest classifier is more effective in leveraging the features extracted by the CNN for classification purposes. The Random Forest's ability to aggregate the decisions from multiple trees contributes to its robustness and improved classification performance.

Secondly, standalone deep learning models, including CNN, LSTM, GRU, and BiLSTM, also demonstrate competitive performance. The CNN model, with an accuracy of 0.7486, performs nearly on par with the proposed hybrid model, indicating the strength of the CNN in capturing spatial patterns within the DNA sequences. The LSTM, GRU, and BiLSTM models, which are designed to handle sequential data, also achieve reasonable accuracies of 0.7395, 0.7263, and 0.7397, respectively. These models excel in capturing long-term dependencies
and temporal patterns, which are inherent in DNA sequences.

However, the hybrid approach of combining a CNN for feature extraction with a Random Forest for classification provides an optimal balance, leveraging the strengths of both deep learning and traditional machine learning techniques. The CNN efficiently extracts hierarchical features from the DNA sequences, while the Random Forest, with its ensemble of decision trees, effectively handles the classification task by reducing the risk of overfitting and improving generalization. The macro-averaged metrics (precision, recall, and F1-score) provide further insights into the model's performance across different classes. The proposed model achieved a macro-averaged precision of 0.81, recall of 0.73, and F1-score of 0.76, indicating a balanced performance across classes. This is particularly important in the context of DNA sequence classification, where it is crucial to accurately identify sequences belonging to different functional categories.

In terms of precision, the proposed model excels in classifying sequences of classes 1, 2, and 5, with precision values of 0.93, 0.98, and 0.92, respectively. These high precision values suggest that the model is effective in minimizing false positives for these classes. The high recall value of 0.90 for class 6 indicates the model's ability to correctly identify most of the true positive instances for this class, although the precision for this class is relatively lower (0.70). The balanced F1-scores across different classes, ranging from 0.65 to 0.88, reflect the model's overall robustness. The F1-score, which considers both precision and recall, is a crucial metric for evaluating classification performance, particularly when dealing with imbalanced datasets. The macro-averaged F1-score of 0.76 further supports the effectiveness of the proposed model in maintaining a balance between precision and recall across all classes.

Comparing the hybrid model's performance with standalone models, it is evident that the CNN model achieves the highest precision (0.8918) among all models, followed by GRU (0.8908), BiLSTM (0.8881), and LSTM (0.8667). These precision values highlight the capability of these models to accurately identify positive instances. However, their recall values are slightly lower, indicating potential challenges in capturing all true positive instances. This trade-off between precision and recall is common in classification tasks, and the F1-score provides a balanced measure to evaluate overall performance. The LSTM and BiLSTM models, with their ability to capture long-range (and, in the BiLSTM's case, bidirectional) dependencies, demonstrate strong performance, with F1-scores of 0.7646 and 0.7546, respectively. The GRU model, although slightly lower in performance, achieves a respectable F1-score of 0.7342. These results highlight the effectiveness of RNN-based models in handling sequential data, such as DNA sequences. The proposed hybrid model (CNN + Random Forest) outperforms several other models in terms of accuracy and balanced performance metrics. The integration of deep learning techniques for feature extraction with traditional machine learning classifiers for final classification proves to be an effective approach for DNA sequence classification. The results underscore the potential of hybrid models in leveraging the strengths of both paradigms to achieve superior predictive performance.

Table 1. Performance Results of Models

Model          Accuracy   Precision   Recall   F1-Score
CNN_LSTM       0.5947     0.7628      0.4660   0.5756
CNN_GRU        0.5571     0.7607      0.4025   0.5239
CNN_BiLSTM     0.6110     0.7690      0.5039   0.6042
CNN            0.7486     0.8918      0.6934   0.7800
LSTM           0.7395     0.8667      0.6856   0.7646
GRU            0.7263     0.8908      0.6258   0.7342
BiLSTM         0.7397     0.8881      0.6575   0.7546
Hybrid Model   0.7534     0.81        0.73     0.7699

Table 2. Per-Class Performance Results of the Proposed Hybrid Model

Class      Precision   Recall   F1-Score
0          0.84        0.72     0.77
1          0.93        0.70     0.80
2          0.98        0.79     0.88
3          0.65        0.65     0.65
4          0.67        0.64     0.66
5          0.92        0.69     0.79
6          0.70        0.90     0.79
Accuracy                        0.75
Average    0.81        0.73     0.76

4. Conclusions
In this study, we introduced a novel hybrid model for human DNA sequence classification that combines a Convolutional Neural Network (CNN) for feature extraction with a Random Forest classifier for final classification. Our model achieved a significant performance improvement, with an accuracy of 75.34%, outperforming several other models, including CNN-LSTM, CNN-GRU, and other standalone deep learning approaches. The hybrid model's superior performance in precision, recall, and F1-score across multiple classes demonstrates its effectiveness in accurately
classifying DNA sequences into their respective categories. The significance of our findings lies in the innovative integration of CNNs and Random Forests, which effectively captures local dependencies within DNA sequences while also handling complex decision boundaries. This combination allows for a more nuanced understanding and classification of genomic data, setting our approach apart from traditional models. Notably, the CNN-LSTM model, which achieved an accuracy of 59.47%, was less effective than our hybrid model, underscoring the potential of combining deep learning with traditional machine learning techniques.

Our research contributes to the existing body of knowledge by offering a scalable and efficient solution for genomic data analysis, demonstrating that hybrid models can leverage the strengths of both deep learning and traditional machine learning to improve predictive accuracy. This advancement has the potential to lead to more accurate and robust predictive models in the field of human DNA analysis, facilitating better understanding and classification of genomic sequences. Future work will focus on optimizing the model architecture, including fine-tuning hyperparameters and experimenting with different combinations of feature extraction and classification techniques. Additionally, applying the proposed model to larger and more diverse genomic datasets could provide further insights into its generalizability and robustness. Exploring other hybrid approaches, such as combining different deep learning architectures or incorporating domain-specific knowledge, could also be a promising direction for improving DNA sequence classification.

References

Alamro, H., Gojobori, T., Essack, M. & Gao, X. (2024). BioBBC: a multi-feature model that enhances the detection of biomedical entities. Scientific Reports, 14(1), 7697.

Avanzo, M., Wei, L., Stancanello, J., Vallieres, M., Rao, A., Morin, O., Mattonen, S. A. & El Naqa, I. (2020). Machine and deep learning methods for radiomics. Medical Physics, 47(5), e185–e202.

Balamurugan, T. & Gnanamanoharan, E. (2023). Brain tumor segmentation and classification using hybrid deep CNN with LuNetClassifier. Neural Computing and Applications, 35(6), 4739–4753.

Bian, K. & Priyadarshi, R. (2024). Machine learning optimization techniques: a survey, classification, challenges, and future research issues. Archives of Computational Methods in Engineering, 1–25.

Cheng, K., Guo, Q., He, Y., Lu, Y., Gu, S. & Wu, H. (2023). Exploring the potential of GPT-4 in biomedical engineering: the dawn of a new era. Annals of Biomedical Engineering, 51(8), 1645–1653.

Cortés-Ciriano, I., Gulhan, D. C., Lee, J. J.-K., Melloni, G. E. M. & Park, P. J. (2022). Computational analysis of cancer genome sequencing data. Nature Reviews Genetics, 23(5), 298–314.

Goshisht, M. K. (2024). Machine Learning and Deep Learning in Synthetic Biology: Key Architectures, Applications, and Challenges. ACS Omega, 9(9), 9921–9945.

Khan, S., Sajjad, M., Hussain, T., Ullah, A. & Imran, A. S. (2020). A review on traditional machine learning and deep learning models for WBCs classification in blood smear images. IEEE Access, 9, 10657–10673.

Landolsi, M. Y., Hlaoua, L. & Romdhane, L. Ben. (2024). Extracting and structuring information from the electronic medical text: state of the art and trendy directions. Multimedia Tools and Applications, 83(7), 21229–21280.

Laskar, P., Bhattacharya, S., Chaudhuri, A. & Kundu, A. (2021). Exploring the GRAS gene family in common bean (Phaseolus vulgaris L.): characterization, evolutionary relationships, and expression analyses in response to abiotic stresses. Planta, 254, 1–21.

Li, R., Li, L., Xu, Y. & Yang, J. (2022). Machine learning meets omics: applications and perspectives. Briefings in Bioinformatics, 23(1), bbab460.

Liu, C., Ma, Y., Zhao, J., Nussinov, R., Zhang, Y.-C., Cheng, F. & Zhang, Z.-K. (2020). Computational network biology: data, models, and applications. Physics Reports, 846, 1–66.

Luo, D., Cheng, W., Yu, W., Zong, B., Ni, J., Chen, H. & Zhang, X. (2021). Learning to drop: Robust graph neural network via topological denoising. Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 779–787.

Maharachchikumbura, S. S. N., Chen, Y., Ariyawansa, H. A., Hyde, K. D., Haelewaters, D., Perera, R. H., Samarakoon, M. C., Wanasinghe, D. N., Bustamante, D. E., Liu, J.-K. & others. (2021). Integrative approaches for species delimitation in Ascomycota. Fungal Diversity, 109(1), 155–179.

Mahmud, M., Kaiser, M. S., McGinnity, T. M. & Hussain, A. (2021). Deep learning in mining biological data. Cognitive Computation, 13(1), 1–33.

Meharunnisa, M., Sornam, M. & Ramesh, B. (2024). An optimized hybrid model for classifying bacterial genus using an integrated CNN-RF approach on 16S rDNA sequences. Journal of Scientific & Industrial Research (JSIR), 83(4), 392–404.

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B. & others. (2021). Efficient large-scale language model training on GPU clusters using Megatron-LM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–15.

Nisa, I., Pandey, P., Ellis, M., Oliker, L., Buluç, A. & Yelick, K. (2021). Distributed-memory k-mer counting on GPUs. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 527–536.

Papoutsoglou, G., Tarazona, S., Lopes, M. B., Klammsteiner, T., Ibrahimi, E., Eckenberger, J., Novielli, P., Tonda, A., Simeon, A., Shigdel, R. & others. (2023). Machine learning approaches in microbiome research: challenges and best practices. Frontiers in Microbiology, 14, 1261889.

Rashed, A. E. E.-D., Amer, H. M., El-Seddek, M. & Moustafa, H. E.-D. (2021). Sequence alignment using machine learning-based Needleman–Wunsch algorithm. IEEE Access, 9, 109522–109535.

Satam, H., Joshi, K., Mangrolia, U., Waghoo, S., Zaidi, G., Rawool, S., Thakare, R. P., Banday, S., Mishra, A. K., Das, G. & others. (2023). Next-generation sequencing technology: current trends and advancements. Biology, 12(7), 997.

Sindelar, R. D. (2024). Genomics, other "OMIC" technologies, precision medicine, and additional biotechnology-related techniques. In Pharmaceutical Biotechnology: Fundamentals and Applications (pp. 209–254). Springer.

Tan, X., Su, A. T., Hajiabadi, H., Tran, M. & Nguyen, Q. (2021). Applying machine learning for integration of multi-modal genomics data and imaging data to quantify heterogeneity in tumour tissues. Artificial Neural Networks, 209–228.

Tao, J., Bauer, D. E. & Chiarle, R. (2023). Assessing and advancing the safety of CRISPR-Cas tools: from DNA to RNA editing. Nature Communications, 14(1), 212.

Theodoridis, S., Fordham, D. A., Brown, S. C., Li, S., Rahbek, C. & Nogues-Bravo, D. (2020). Evolutionary history and past climate change shape the distribution of genetic diversity in terrestrial mammals. Nature Communications, 11(1), 2557.

Vasani, N. (2022). Human DNA Data. https://www.kaggle.com/datasets/neelvasani/humandnadata

Walkowiak, S., Gao, L., Monat, C., Haberer, G., Kassa, M. T., Brinton, J., Ramirez-Gonzalez, R. H., Kolodziej, M. C., Delorean, E., Thambugala, D. & others. (2020). Multiple wheat genomes reveal global variation in modern breeding. Nature, 588(7837), 277–283.

Wang, Z., Jiang, Y., Liu, Z., Tang, X. & Li, H. (2022). Machine learning and ensemble learning for transcriptome data: principles and advances. 2022 5th International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), 676–683.

Waring, J., Lindvall, C. & Umeton, R. (2020). Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artificial Intelligence in Medicine, 104, 101822.

Wilson, S., Steele, S. & Adeli, K. (2022). Innovative technological advancements in laboratory medicine: Predicting the lab of the future. Biotechnology & Biotechnological Equipment, 36(sup1), S9–S21.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License. Copyright © 2024 Gregorius Airlangga.
