
Research on Text Classification using Machine Learning on Large Datasets and Scope of Transfer Learning Using BERT

Authors:
Firoz Anjum Chowdhury, Hillol Pratim Kalita, Nilakhya Mandita Bordoloi,
Abir Akash Baruah, Mrinmoy Shyam

Supervised by:
Professor Rajib Chakrabarty
Department of Computer Science
Jorhat Engineering College

Abstract: This study explores the scope of transfer learning using BERT for large-scale text
classification, employing a diverse range of machine learning models, including Naive Bayes,
Decision Trees, Random Forest, Logistic Regression, Support Vector Machines (SVM), RNNs,
LSTMs, and XGBoost. We emphasize the critical role of model selection and hyperparameter optimization, and show how pre-trained models like BERT can significantly enhance accuracy and efficiency in text classification. Our findings reveal that select models, such as SVM, RNN,
LSTM, and XGBoost, outperform others, underscoring the potential of transfer learning with
BERT. To validate our approach’s versatility, we conducted experiments on four distinct
datasets from various domains, each differing in size and class distribution, demonstrating the
adaptability of our models to real-world text classification tasks. This research significantly
contributes to understanding the scope and effectiveness of transfer learning for large-scale text
data analysis.

Date: October 20, 2023


1 Introduction
1.1 Background
Text classification is a pivotal task in natural language processing (NLP), with applications ranging from sentiment
analysis to spam detection and news categorization. This research study focuses on text classification using machine
learning (ML) techniques, particularly when applied to substantial datasets. We also explore the limitations of
conventional ML algorithms in handling such datasets and investigate the potential of transfer learning using BERT,
a pre-trained language model.

1.2 Objectives
The primary objectives of this research are as follows:
1. Preprocessing and cleaning a large news classification dataset.
2. Evaluating the performance of various ML algorithms on the cleaned dataset.
3. Identifying the limitations of traditional ML algorithms in handling substantial datasets.
4. Investigating the potential of transfer learning using BERT for text classification.
5. Drawing conclusions and providing insights into the effectiveness of different ML algorithms and transfer learning approaches.

2 Methodology
2.1 Data Pre-processing and Dataset Description
2.1.1 Dataset Description
In this section, we provide a comprehensive description of the datasets used in this study, including their sizes, the number of distinct labels, and the domain of each dataset. This data serves as the foundation for our research and analysis.

Dataset     Size of Dataset   No. of Distinct Labels   Class Distribution      Domain
Dataset 1   7570              3                        Balanced                Academia (Healthcare)
Dataset 2   43125             6                        Moderately Imbalanced   News
Dataset 3   50425             4                        Moderately Balanced     E-Commerce
Dataset 4   108112            27                       Highly Imbalanced       Entertainment

Table 1: Data Description

2.1.2 Data Pre-processing Tasks


Data pre-processing is deemed a crucial phase in our research, with a primary focus on the refinement and cleaning
of datasets to enhance their quality and suitability for analysis. Encoding issues were addressed to ensure uniform
character encoding across the data, thus mitigating inconsistencies and potential misinterpretations.
Furthermore, the removal of HTML tags and specific data patterns, common in web-sourced data, was undertaken to enhance text clarity and reduce noise in the datasets. Standard Natural Language Processing (NLP) cleaning techniques, including lemmatization, stemming, and lowercasing, were applied to normalize text, reducing data dimensionality and maintaining uniformity.
To improve the efficiency of NLP analysis, stop words (low-information words) and punctuation were removed. This preparation laid the foundation for subsequent analyses, ensuring data integrity and consistency throughout.
Additionally, label encoding was utilized to transform categorical variables into numerical representations,
enabling non-numeric data to be effectively operated upon by our machine learning models.

For textual data, TF-IDF[12] (Term Frequency-Inverse Document Frequency) vectorization was employed,
converting text documents into numerical representations. This quantitative measure captures term relevance
within both the document and the dataset as a whole, especially beneficial for our text-based data.
Through the application of label encoding and TF-IDF vectorization in our data pre-processing pipeline, both categorical and textual data were structured and made ready for advanced analytics and machine learning tasks.
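To make this pipeline concrete, a minimal Python sketch follows; the file name, column names, and the NLTK-based cleaning choices are illustrative assumptions rather than the study's exact setup.

```python
# Minimal sketch of the pre-processing pipeline described above. The file name,
# column names, and cleaning choices are illustrative assumptions.
# Requires: nltk.download("stopwords"); nltk.download("wordnet")
import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # drop punctuation and digits
    tokens = [lemmatizer.lemmatize(t)          # lowercase, lemmatize,
              for t in text.lower().split()    # and remove stop words
              if t not in stop_words]
    return " ".join(tokens)

df = pd.read_csv("dataset.csv")                # hypothetical file and columns
df["clean_text"] = df["text"].apply(clean_text)

y = LabelEncoder().fit_transform(df["label"])  # label encoding
X = TfidfVectorizer(max_features=50_000).fit_transform(df["clean_text"])  # TF-IDF
```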

Figure 1: Tree Map of Datasets

2.2 Model Building [18]


We explored various ML algorithms for text classification on the cleaned dataset, including:

2.2.1 Naive Bayes[17] Classifier


Naive Bayes classifiers are based on Bayes' theorem, a fundamental concept in probability theory for calculating conditional probabilities. These classifiers make a "naive" assumption that the features are independent given the class label, which simplifies the calculations.
Bayes’ Theorem[3]: Bayes’ theorem calculates the probability of a hypothesis (in this context, the class label)
based on the probability of the observed evidence (the features). The formula for Bayes’ theorem is as follows:

$$P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}$$
Where:
• P(C|X): the posterior probability of class C given evidence X.
• P(X|C): the likelihood of evidence X given class C.
• P(C): the prior probability of class C.
• P(X): the probability of evidence X.

These classifiers assume that the features are conditionally independent, a "naive" assumption that nonetheless simplifies the calculations significantly.
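As a minimal scikit-learn sketch, a multinomial Naive Bayes classifier can be fit on the X and y produced by the preprocessing sketch above; the split ratio and smoothing value are assumptions.

```python
# Hedged sketch: multinomial Naive Bayes on TF-IDF features.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

nb = MultinomialNB(alpha=1.0)    # alpha: Laplace smoothing strength (assumed)
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))  # held-out accuracy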

2.2.2 Logistic Regression [6]
Logistic Regression models the probability of a binary outcome using the logistic function (sigmoid). It minimizes
the logistic loss (log loss) function to optimize the model.
Mathematical Formulation: Logistic Regression uses the logistic loss function:
$$\text{Logistic Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i))\right] \tag{1}$$
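A short scikit-learn sketch, reusing the TF-IDF split from the Naive Bayes example, with the loss of Eq. (1) evaluated on held-out data:

```python
# Hedged sketch: logistic regression on TF-IDF features, with the log loss
# of Eq. (1) computed on the test split.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

lr = LogisticRegression(max_iter=1000)  # the solver minimizes the log loss
lr.fit(X_train, y_train)
print(log_loss(y_test, lr.predict_proba(X_test)))  # Eq. (1), averaged over N
```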

2.2.3 Decision Trees[15]


Decision Trees recursively split the data to minimize impurity, using measures such as Gini impurity and entropy.
Mathematical Formulation: Decision Trees use impurity measures like Gini impurity and entropy:
$$\text{Gini Impurity} = \sum_{i=1}^{C} p_i (1 - p_i) \tag{2}$$

$$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i) \tag{3}$$
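As a worked example of Eqs. (2) and (3), consider a hypothetical node with three class proportions:

```python
# Worked example of Eqs. (2) and (3) for a hypothetical node distribution.
import numpy as np

p = np.array([0.5, 0.3, 0.2])       # class proportions at a node (assumed)
gini = np.sum(p * (1 - p))          # Eq. (2): 0.62
entropy = -np.sum(p * np.log2(p))   # Eq. (3): ~1.49 bits
print(gini, entropy)
```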

2.2.4 Support Vector Machines (SVM) [5]


SVM aims to find the hyperplane that maximizes the margin between classes. It is evaluated using metrics like
Accuracy Score, Weighted Precision, and Recall.
Mathematical Formulation: SVM uses the hinge loss function:

$$L(y, \hat{y}) = \max(0,\, 1 - y \cdot \hat{y}) \tag{4}$$
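A corresponding scikit-learn sketch follows; note that LinearSVC minimizes the squared hinge loss by default, so loss="hinge" is passed to match Eq. (4).

```python
# Hedged sketch: a linear SVM on the TF-IDF features from the earlier sketches.
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0, loss="hinge")  # Eq. (4); C trades margin width vs. violations
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```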

2.2.5 Random Forest[2]


Random Forest is an ensemble learning method that combines multiple decision trees to improve predictive
accuracy and reduce overfitting. Key concepts include:
1. Decision Trees: Individual decision trees are trained on bootstrapped subsets of the data with random
feature selection.
2. Ensemble Averaging: Predictions from multiple trees are averaged for regression tasks or combined by majority voting for classification.
Mathematical concepts:
1. Bootstrapping: Random subsets of the data are used for training each tree.
2. Feature Selection: Random feature subsets are considered at each node split.
Random Forest is known for its simplicity and robustness, making it a valuable tool for various machine
learning tasks.
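A brief scikit-learn sketch; the tree count is an illustrative assumption:

```python
# Hedged sketch: random forest with bootstrapping and per-split feature sampling.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees (assumed)
    max_features="sqrt",   # random feature subset considered at each split
    random_state=42,
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```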

2.2.6 XGBoost: Extreme Gradient Boosting[4]


XGBoost, short for Extreme Gradient Boosting, is a powerful ensemble machine learning algorithm. Key
concepts include:
1. Gradient Boosting: XGBoost iteratively minimizes a loss function, combining many weak learners into a single strong learner.
2. Regularization: It incorporates L1 and L2 regularization terms to control model complexity and prevent
overfitting.
3. Parallel Processing: XGBoost optimizes efficiency through parallel computation, making it suitable for
large datasets.
Mathematical formulas:
- The XGBoost objective function combines a loss function and regularization terms.
- The loss function depends on the specific problem, e.g., squared error for regression or log loss for classification.
XGBoost is versatile and widely used in both academia and industry for various machine learning tasks.
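A sketch with the xgboost package [4]; the hyperparameter values here are illustrative assumptions:

```python
# Hedged sketch: gradient-boosted trees with L1/L2 regularization.
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,      # boosting rounds (assumed)
    learning_rate=0.1,
    reg_alpha=0.0,         # L1 regularization term
    reg_lambda=1.0,        # L2 regularization term
    n_jobs=-1,             # parallel tree construction
)
xgb.fit(X_train, y_train)  # a multiclass log loss objective is selected automatically
print(xgb.score(X_test, y_test))
```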

2.2.7 Recurrent Neural Networks (RNN)[8]
Recurrent Neural Networks (RNN) are a class of artificial neural networks designed for sequential data
processing. Key concepts include:
1. Temporal Modeling: RNNs are capable of modeling sequences by maintaining hidden states that capture
temporal dependencies.
2. Looping Architecture: RNNs use a looping mechanism that allows information to persist within the
network and be passed from one step to the next.
3. Applications: RNNs are widely used in tasks such as natural language processing, speech recognition, and
time series prediction.
Mathematical concepts:
- RNNs use recurrent connections and hidden states to update their internal memory over time.
- The network’s output at each step is influenced by the current input and previous hidden states.
Recurrent Neural Networks are well-suited for sequential data analysis and have found applications in a variety
of domains.
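A minimal Keras sketch of such a network follows; the vocabulary size, embedding width, and class count are illustrative assumptions, and the inputs are assumed to be padded sequences of token ids rather than TF-IDF vectors.

```python
# Hedged sketch of a simple RNN text classifier (Keras); all sizes are
# illustrative assumptions. Inputs: padded sequences of token ids.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(20_000, 128),          # token ids -> dense vectors
    layers.SimpleRNN(64),                   # hidden state carries context step to step
    layers.Dense(6, activation="softmax"),  # hypothetical 6-class output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```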

2.2.8 Long Short-Term Memory (LSTM)[9]


Long Short-Term Memory (LSTM) is a specialized type of recurrent neural network designed to address the
vanishing gradient problem and capture long-range dependencies. Key concepts include:
1. Memory Cells: LSTMs contain memory cells with self-contained memory that can store information over
long sequences.
2. Gates: LSTMs use three gates (input, forget, and output) to control the flow of information into and out
of the memory cells.
3. Vanishing Gradient: LSTMs mitigate the vanishing gradient problem, allowing them to learn and remem-
ber patterns over extended sequences.
Mathematical concepts:
- LSTMs have a more complex internal structure compared to traditional RNNs, involving interactions between
gates and memory cells.
- The gates’ activities are determined by various activation functions and weights.
Long Short-Term Memory networks are powerful tools for tasks involving sequential data, such as natural
language processing, time series analysis, and speech recognition.
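Under the same assumptions as the RNN sketch, replacing the SimpleRNN layer with an LSTM layer introduces the gated memory cells described above:

```python
# Hedged sketch: the LSTM's input, forget, and output gates regulate what
# enters, persists in, and leaves the cell state.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(20_000, 128),
    layers.LSTM(64),                        # gated memory mitigates vanishing gradients
    layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```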

2.3 Hyperparameter Tuning[1]


Hyperparameter tuning is a critical phase in machine learning model development, influencing the performance
and generalization of models. This research adopts a systematic approach to hyperparameter tuning, striving to
optimize model performance effectively.

2.3.1 Hyperparameter Selection


The initial step in hyperparameter tuning involves identifying the pertinent hyperparameters specific to the machine
learning algorithm under consideration. These hyperparameters significantly impact the model’s learning process
and must be configured prior to training.

2.3.2 Hyperparameter Search Space


Each hyperparameter’s search space is defined, specifying the range or set of possible values. The determination
of this space can rely on domain expertise, prior knowledge, or empirical experimentation.

2.3.3 Search Strategy


Several search strategies are available to navigate the hyperparameter space efficiently:
Grid Search[16]: Exhaustively evaluates model performance across all possible combinations of hyperparam-
eters within the defined search space. While comprehensive, it can be computationally demanding for larger search
spaces.
Random Search: Randomly samples hyperparameter configurations from the search space. This approach
offers computational efficiency while often delivering competitive results.

Bayesian Optimization[19]: An advanced approach, Bayesian Optimization leverages probabilistic models
to guide the search for optimal hyperparameters. It expertly balances exploration and exploitation, making it
particularly suitable when objective function evaluations are resource-intensive.
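As a brief illustration of the first two strategies, the following sketch tunes a linear SVM with scikit-learn's GridSearchCV and RandomizedSearchCV; the search space over C and the fold count are illustrative assumptions.

```python
# Hedged sketch: grid search vs. random search over an SVM's C parameter.
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import LinearSVC

# Grid search: exhaustively tries every value in the grid.
grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]},
                    cv=3, scoring="f1_weighted")
grid.fit(X_train, y_train)

# Random search: samples 20 configurations from a log-uniform distribution.
rand = RandomizedSearchCV(LinearSVC(), {"C": loguniform(1e-3, 1e2)},
                          n_iter=20, cv=3, scoring="f1_weighted",
                          random_state=42)
rand.fit(X_train, y_train)
print(grid.best_params_, rand.best_params_)
```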

2.3.4 Mathematical Insights


Bayesian Optimization involves probabilistic modeling. At its core, it utilizes a surrogate model, often a Gaussian
Process (GP), to represent the objective function’s behavior. The GP models the relationship between hyperpa-
rameters and the objective function as a probability distribution, capturing uncertainty in the model’s predictions.
The GP is characterized by a mean function (often assumed to be zero) and a covariance function (kernel) that
quantifies the correlations between different hyperparameter configurations. A popular choice for the covariance
function is the Radial Basis Function (RBF) kernel:
 
$$k(x, x') = \exp\left(-\frac{1}{2}\,\|x - x'\|^2\right) \tag{5}$$
where x and x′ represent different hyperparameter configurations.
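For concreteness, Eq. (5) can be evaluated directly; the two configuration vectors below are hypothetical.

```python
# Worked example of the RBF kernel in Eq. (5).
import numpy as np

def rbf_kernel(x, x_prime):
    return np.exp(-0.5 * np.linalg.norm(x - x_prime) ** 2)

# Nearby configurations correlate strongly (value near 1).
print(rbf_kernel(np.array([0.1, 1.0]), np.array([0.2, 0.8])))  # ~0.975
```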

2.3.5 Model Evaluation


Each hyperparameter configuration is subjected to model training on a dedicated training dataset, followed by
performance evaluation on an independent validation dataset. Standard evaluation metrics such as accuracy, F1-
score, or domain-specific metrics are employed to assess model performance.

2.3.6 Optimal Configuration Selection


Post-evaluation, the hyperparameter configuration that yields the best performance on the validation dataset is
selected as the optimal choice.

2.3.7 Testing on Holdout Data


To ensure the chosen model configuration’s generalization, it undergoes testing on a holdout dataset that was not
involved in the hyperparameter tuning process.

2.4 Bayesian Optimization


Bayesian Optimization represents an advanced and resource-efficient method for hyperparameter tuning, particu-
larly valuable in scenarios involving computationally expensive objective function evaluations.

2.4.1 Surrogate Model - Gaussian Process


Bayesian Optimization relies on a surrogate model, frequently a Gaussian Process (GP), to approximate the ob-
jective function. The GP models the hyperparameter-objective function relationship as a probability distribution,
offering valuable uncertainty estimates for each configuration.

2.4.2 Acquisition Function


Central to Bayesian Optimization is the acquisition function, a mathematical construct guiding the selection of the
next hyperparameter configuration to evaluate. Prominent acquisition functions include Probability of Improvement
(PI), Expected Improvement (EI), and Upper Confidence Bound (UCB). These functions balance exploration of
uncharted regions in the hyperparameter space with exploitation of configurations likely to enhance performance.

2.4.3 Mathematical Insights


The Expected Improvement (EI) acquisition function, for instance, is defined as:
$$\mathrm{EI}(x) = \mathbb{E}\left[\max(f(x) - f(x^{*}),\, 0)\right] \tag{6}$$
where x is the hyperparameter configuration being considered, f(x) represents the GP's predicted performance for that configuration, and f(x*) is the best observed performance so far. EI measures the expected improvement over the current best performance.

2.4.4 Tree of Parzen Estimators (TPE)[19]
Building upon Bayesian Optimization principles, the Tree of Parzen Estimators (TPE) refines the search process. TPE employs a tree-based structure to model the probabilistic relationships between hyperparameters and the objective function more efficiently. This hierarchical approach results in improved optimization performance.

2.4.5 Adaptive TPE [11]


Adaptive TPE extends the capabilities of TPE by dynamically adjusting its search strategy as it explores the
hyperparameter space. It uses the information gathered during optimization to concentrate the search in regions
with higher potential for improvement, further enhancing efficiency.
Bayesian Optimization, augmented by Tree of Parzen Estimators (TPE) and Adaptive TPE, offers a sophis-
ticated and efficient means of navigating the hyperparameter space, particularly beneficial when dealing with
resource-intensive objective functions.
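A minimal sketch of TPE-based tuning with the Hyperopt library [11] follows; the objective cross-validates a linear SVM, and the search space and evaluation budget are illustrative assumptions. Adaptive TPE is selected by swapping in atpe.suggest.

```python
# Hedged sketch: TPE hyperparameter search with Hyperopt.
from hyperopt import Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def objective(params):
    score = cross_val_score(LinearSVC(C=params["C"]), X_train, y_train,
                            cv=3, scoring="f1_weighted").mean()
    return -score  # fmin minimizes, so negate the weighted F1

space = {"C": hp.loguniform("C", -5, 3)}         # C in roughly [e^-5, e^3] (assumed)
best = fmin(objective, space, algo=tpe.suggest,  # atpe.suggest for Adaptive TPE
            max_evals=30, trials=Trials())
print(best)
```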

2.5 Evaluation Metrics


Before exploring transfer learning with BERT, we defined the evaluation metrics used to assess the performance of
our machine learning models:
Accuracy Score: Measures the fraction of correctly classified instances among all instances.
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{7}$$
F1 Score[10]: The F1 Score is a metric that combines precision (P ) and recall (R) to evaluate the performance
of a classification model. It is defined as the harmonic mean of precision and recall and is often used when dealing
with imbalanced datasets.
The formula for calculating the F1 Score is as follows:

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

where P is precision (the ratio of true positive predictions to all positive predictions) and R is recall (the ratio of true positive predictions to all actual positive instances).

Precision: Measures the fraction of true positive predictions among all positive predictions.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{8}$$
Recall (Sensitivity): Measures the fraction of true positive predictions among all actual positives.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{9}$$
These metrics provide a comprehensive evaluation of the classification models, taking into account both accuracy
and the ability to correctly classify positive instances.
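These metrics, including the weighted averages reported in Section 3, can be computed with scikit-learn; y_test and the fitted svm model are carried over from the earlier sketches.

```python
# Hedged sketch: Eqs. (7)-(9) and the weighted F1 used in Tables 2-4.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_pred = svm.predict(X_test)  # any fitted model from Section 2.2
print(accuracy_score(y_test, y_pred))                       # Eq. (7)
print(precision_score(y_test, y_pred, average="weighted"))  # Eq. (8), weighted
print(recall_score(y_test, y_pred, average="weighted"))     # Eq. (9), weighted
print(f1_score(y_test, y_pred, average="weighted"))         # harmonic mean of P and R
```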

2.6 Transfer Learning[14]


In the realm of natural language processing (NLP), transfer learning has revolutionized the field by harnessing
the power of pre-trained models to enhance performance in downstream tasks. This approach involves leveraging
knowledge gained from one task and applying it to another, often related, task. Within this scope, two pivotal
components play a significant role: Language Models (LM)[13] and BERT[7] (Bidirectional Encoder Representations
from Transformers).
Language Models (LM): Language models are the cornerstone of transfer learning in NLP. These models
are trained on massive corpora of text and are designed to predict the likelihood of a word or sequence of words
within a specific context. Through this process, LMs capture intricate linguistic patterns, syntactic structures, and

semantic relationships. LMs are adaptable, and their transfer learning capability allows for fine-tuning on more
specific downstream tasks. This flexibility makes LMs the basis for many advanced NLP applications.
BERT (Bidirectional Encoder Representations from Transformers): BERT is a groundbreaking in-
novation in NLP and transfer learning. Unlike traditional LMs, BERT’s distinctive feature is its ability to learn
bidirectional context representations, meaning it considers both the preceding and following words when under-
standing a word’s meaning. This bidirectional context modeling significantly enhances its capacity to capture
nuanced language structures and semantics. BERT’s pre-training on a vast corpus of text data makes it highly
effective for a diverse range of NLP tasks. In this research, we employed BERT-based models for text classification on four distinct datasets, offering a comprehensive evaluation of their performance compared to conventional machine learning models. These experiments provide valuable insights into the applicability and effectiveness of transfer learning with BERT across varied domains.
The scope of transfer learning, encompassing Language Models and BERT, is central to this research, which seeks to explore the versatility and advantages of these techniques in addressing real-world text classification challenges. By examining their performance across multiple datasets, this work contributes to the growing body of knowledge on the practical applications of transfer learning in NLP.
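While the exact fine-tuning configuration is not reproduced here, the following sketch shows a typical BERT fine-tuning setup with the Hugging Face Transformers library; the checkpoint, hyperparameters, label count, and the raw_train/raw_eval datasets (with "text" and "label" columns) are illustrative assumptions.

```python
# Hedged sketch: fine-tuning a BERT checkpoint for text classification.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)  # label count is an assumption

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_ds = raw_train.map(tokenize, batched=True)  # raw_train/raw_eval: hypothetical
eval_ds = raw_eval.map(tokenize, batched=True)    # datasets.Dataset objects

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
```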

3 Findings
In this section, we present the findings from our research:

3.1 Evaluation of Models

Model                 Dataset 1   Dataset 2   Dataset 3   Dataset 4
SVM                   0.946       0.766       0.9591      0.52
Logistic Regression   0.944       0.766       0.9546      0.534
Naive Bayes           0.93        0.73        0.9379      0.52
Random Forest         0.9587      0.72        0.9310      0.43
XGBoost               0.9592      0.746       0.9312      0.49
Decision Trees        0.94        0.72        0.88        0.39
RNN                   0.95        0.63        0.888       0.39
LSTM                  0.9583      0.75        0.9300      0.5
BERT                  0.9585      0.835       0.9611      0.62
Best Model            XGBoost     BERT        BERT        BERT

Table 2: Weighted F1 Score of Models on All Datasets

3.2 Scope of Transfer Learning


In this section, we delve into the extensive potential of transfer learning, with a specific focus on BERT (Bidirectional
Encoder Representations from Transformers). Our research findings offer invaluable insights into the realm of
transfer learning, with quantified evidence highlighting the prowess of BERT in various data distribution scenarios.

3.2.1 Performance on Balanced Datasets


Our initial exploration encompassed datasets characterized by a balanced class distribution, shedding light on the
scope of transfer learning using BERT:

• Dataset 1 - The dataset showcasing class equilibrium revealed a marginal difference in F1 Score, with BERT
achieving an F1 Score of 0.9585, slightly lower than the ”Best Non-BERT Model” at 0.9592, resulting in a
negligible 0.073% decrease in F1 Score.

• Dataset 3 - This moderately balanced dataset illustrated the strength of BERT, which outperformed the best non-BERT model with an F1 Score of 0.9611 compared to 0.9591, a 0.208% advantage in favor of BERT.

Dataset     Class Distribution    Best Non-BERT Model   BERT     Difference (in %)
Dataset 1   Balanced              0.9592                0.9585   -0.073
Dataset 3   Moderately Balanced   0.9591                0.9611   0.208
Average     -                     0.95915               0.9598   0.06

Table 3: Comparison of BERT with Other Models (Balanced Datasets), Weighted F1 Score

The calculated average F1 Score for BERT across both balanced datasets stood at 0.9598, while non-BERT
models achieved an average F1 Score of 0.95915, resulting in a subtle average difference of 0.06%. It is perti-
nent to emphasize that in scenarios with moderate class balance, BERT and non-BERT models exhibited similar
performance with minor, negligible differences, underscoring BERT’s versatility.

3.2.2 Performance on Imbalanced Datasets


Our investigation further extended to datasets manifesting varying degrees of class imbalance, highlighting the
exceptional scope of transfer learning using BERT:

Dataset     Class Distribution      Best Non-BERT Model   BERT    Difference (in %)
Dataset 2   Moderately Imbalanced   0.766                 0.85    10.96
Dataset 4   Highly Imbalanced       0.534                 0.62    16.1
Average     -                       0.65                  0.735   13.1

Table 4: Comparison of BERT with Other Models (Imbalanced Datasets), Weighted F1 Score

• Dataset 2 - Representing moderate class imbalance, this dataset revealed a significant 10.96% difference
in favor of BERT, with an F1 Score of 0.85 compared to the best non-BERT model’s 0.766. These results
underscore BERT’s capability when faced with datasets that are not linearly separable and exhibit moderate
class imbalance.
• Dataset 4 - Portraying pronounced class imbalance, this dataset accentuated the substantial advantage of
BERT, with an F1 Score of 0.62 compared to the best non-BERT model’s 0.534, signifying a remarkable
16.1% superiority of BERT. These findings highlight BERT’s marked proficiency when dealing with highly
imbalanced datasets.

In summary, our research unequivocally substantiates that transfer learning, particularly involving BERT, sig-
nificantly enhances the performance of text classification models. While BERT maintains competitive performance
on balanced datasets, its strengths are most conspicuous when dealing with imbalanced datasets, especially those
that are linearly inseparable or highly imbalanced. In such cases, BERT offers a substantial advantage, with an
average difference of 13.1% in favor of BERT over non-BERT models in imbalanced dataset scenarios. This firmly
establishes BERT’s role in the scope of transfer learning for text classification.

4 Conclusion
In the course of our research endeavor, ”Research on Text Classification using Machine Learning on Large
Datasets and Scope of Transfer Learning Using BERT,” we embarked on an empirical journey to unravel
the intricate landscape of text classification, especially in the context of large datasets. Our exploration extended
to elucidate the tremendous potential of transfer learning, with a primary focus on BERT (Bidirectional En-
coder Representations from Transformers). Our findings, meticulously quantified and analyzed,
provide valuable insights into the interplay between machine learning and text classification in a contemporary,
data-rich environment.
Our research encompassed datasets representing varying class distribution scenarios. In balanced datasets, we
discerned that BERT, a state-of-the-art transformer-based model, exhibited competitive performance. However, the
hallmark of our findings lay in imbalanced datasets, where BERT consistently and significantly outperformed

non-BERT models. Particularly in cases of linearly inseparable or highly imbalanced datasets, BERT
showcased its remarkable potential, delivering an average difference of 13.1% in favor of BERT.
In the realm of text classification, where the nature of data is often nuanced and complex, the implications of
our research are profound. The real-world applicability of machine learning in handling large datasets is undeniably
elevated, as BERT, an exemplar of transfer learning, emerges as a formidable tool. Whether the data
is balanced or markedly imbalanced, the persuasive evidence of BERT’s superior performance underscores its
prowess in the domain of text classification.
As we conclude this research, we assert that the scope of transfer learning, particularly involving BERT,
significantly enhances the performance of text classification models. BERT’s versatility, adaptability,
and superior predictive accuracy on imbalanced datasets illuminate the path forward in handling the data-intensive
challenges of the modern era. The implications of this research extend beyond the theoretical realm, paving the way
for the practical integration of BERT-based models in a myriad of text classification applications, from sentiment
analysis to information retrieval.
In a dynamic world where information is abundant, our research underscores the pivotal role of machine learning,
the promise of large datasets, and the transformative power of transfer learning. We conclude with the convic-
tion that our findings will fuel further exploration, innovation, and applications in the ever-evolving field of text
classification.

5 References
[1] James Bergstra, Brent Komer, Chris Eliasmith, Daniel Yamins, and David D. Cox. Hyperparameter Optimiza-
tion. University of Waterloo, 2015.
[2] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[3] George Casella and Roger L. Berger. Statistical Inference. Duxbury/Thomson Learning, 2002.
[4] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[6] David R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B
(Methodological), 20(2):215–242, 1958.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186, 2019.

[8] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[9] Aurélien Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media,
2019.
[10] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.

[11] Hyperopt contributors. Hyperopt GitHub Repository, 2023.


[12] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval.
Cambridge University Press, 2008.
[13] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT
Press, 1999.
[14] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[15] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[16] Sebastian Raschka and Vahid Mirjalili. Python Machine Learning. Packt Publishing, 2015.
[17] Irina Rish. An empirical study of the naive bayes classifier. In IJCAI 2001 workshop on empirical methods in
artificial intelligence, pages 41–46, 2001.

[18] scikit-learn developers. scikit-learn Documentation, 2023.


[19] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

6 GitHub Repository
For access to the code and datasets used in this research, please visit our GitHub repository: GitHub Repository

