Academic Internship Final Report
Authors:
Firoz Anjum Chowdhury, Hillol Pratim Kalita, Nilakhya Mandita Bordoloi,
Abir Akash Baruah, Mrinmoy Shyam
Supervised by:
Professor Rajib Chakrabarty
Department of Computer Science
Jorhat Engineering College
Abstract: This study explores the scope of transfer learning using BERT for large-scale text
classification, employing a diverse range of machine learning models, including Naive Bayes,
Decision Trees, Random Forest, Logistic Regression, Support Vector Machines (SVM), RNNs,
LSTMs, and XGBoost. We emphasize the critical role of model selection and hyperparameter
optimization, and show how pre-trained models like BERT can significantly enhance accuracy and
efficiency in text classification. Our findings reveal that select models, such as SVM, RNN,
LSTM, and XGBoost, outperform others, underscoring the potential of transfer learning with
BERT. To validate our approach’s versatility, we conducted experiments on four distinct
datasets from various domains, each differing in size and class distribution, demonstrating the
adaptability of our models to real-world text classification tasks. This research significantly
contributes to understanding the scope and effectiveness of transfer learning for large-scale text
data analysis.
1.2 Objectives
The primary objectives of this research are as follows:
1. Preprocessing and cleaning a large news classification dataset.
5. Drawing conclusions and providing insights into the effectiveness of different ML algorithms and transfer
learning approaches.
2 Methodology
2.1 Data Pre-processing and Dataset Description
2.1.1 Dataset Description
In this section, we provide a comprehensive description of the datasets used in this study, including their sizes,
number of distinct labels, and domains. These data serve as the foundation for our research and analysis.
For textual data, TF-IDF[12] (Term Frequency-Inverse Document Frequency) vectorization was employed,
converting text documents into numerical representations. This quantitative measure captures term relevance
within both the individual document and the dataset as a whole, which is especially beneficial for our text-based data.
Through the application of label encoding and TF-IDF vectorization in our data pre-processing pipeline, both
categorical and textual data were structured and prepared, ensuring readiness for the subsequent machine learning
tasks.
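As an illustration of this pre-processing step, the following is a minimal sketch using scikit-learn's TfidfVectorizer and LabelEncoder; the toy documents and class names are hypothetical stand-ins for the news datasets described above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy corpus and class names standing in for the actual news datasets.
docs = ["stocks rally after strong earnings",
        "the home team wins the championship",
        "a new smartphone was released today"]
labels = ["business", "sports", "technology"]

# TF-IDF vectorization: each document becomes a sparse vector weighted by
# term frequency and inverse document frequency.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)        # shape: (n_documents, n_terms)

# Label encoding: categorical class names are mapped to integer ids.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)         # e.g. array([0, 1, 2])

print(X.shape, y)

In such a pipeline the same fitted vectorizer would be reused to transform the test split, so that training and test data share a common vocabulary.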
\[ P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)} \]
Where:
• P(C|X): The posterior probability of class C given evidence X.
• P(X|C): The likelihood of evidence X given class C.
• P(C): The prior probability of class C.
• P(X): The probability of evidence X, acting as a normalizing constant.
These classifiers assume that the features are conditionally independent, which is a "naive" assumption but
simplifies the calculations significantly.
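For concreteness, a minimal sketch of a Naive Bayes text classifier with scikit-learn is shown below; the documents and labels are hypothetical, and MultinomialNB is one common choice for TF-IDF features.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training documents and integer class labels.
train_docs = ["stocks rally after strong earnings",
              "the home team wins the championship",
              "a new smartphone was released today"]
train_labels = [0, 1, 2]

# The pipeline applies Bayes' rule over TF-IDF features, under the
# conditional-independence ("naive") assumption described above.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["quarterly earnings beat expectations"]))   # predicted class id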
2.2.2 Logistic Regression [6]
Logistic Regression models the probability of a binary outcome using the logistic function (sigmoid). It minimizes
the logistic loss (log loss) function to optimize the model.
Mathematical Formulation: Logistic Regression uses the logistic loss function:
\[ \text{Logistic Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log\big(p(y_i)\big) + (1 - y_i) \log\big(1 - p(y_i)\big) \right] \tag{1} \]
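As a small numerical illustration of Eq. (1), the following sketch computes the logistic loss for a handful of hypothetical predictions; the clipping constant is only there to avoid log(0).

import numpy as np

def logistic_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss as in Eq. (1): -(1/N) * sum[y*log(p) + (1-y)*log(1-p)]."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Four hypothetical samples with predicted probabilities of the positive class.
print(logistic_loss([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))   # ~0.30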
\[ \text{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i) \tag{3} \]
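Eq. (3) can be computed directly from a vector of class proportions; a minimal NumPy sketch follows, with illustrative example proportions.

import numpy as np

def entropy(class_probs):
    """Shannon entropy as in Eq. (3): -sum_i p_i * log2(p_i) over C classes."""
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]                     # classes with zero probability contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))           # 1.0 bit: a perfectly balanced two-class split
print(entropy([0.9, 0.1]))           # ~0.47 bits: a much purer split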
2.2.7 Recurrent Neural Networks (RNN)[8]
Recurrent Neural Networks (RNN) are a class of artificial neural networks designed for sequential data
processing. Key concepts include:
1. Temporal Modeling: RNNs are capable of modeling sequences by maintaining hidden states that capture
temporal dependencies.
2. Looping Architecture: RNNs use a looping mechanism that allows information to persist within the
network and be passed from one step to the next.
3. Applications: RNNs are widely used in tasks such as natural language processing, speech recognition, and
time series prediction.
Mathematical concepts:
- At each time step t, the hidden state is updated from the current input and the previous hidden state, commonly as \( h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \), so the network maintains an internal memory over time.
- The network’s output at each step is therefore influenced by both the current input and, through the hidden state, all previous inputs.
Recurrent Neural Networks are well-suited for sequential data analysis and have found applications in a variety
of domains.
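To make the recurrence above concrete, here is a minimal NumPy sketch of a vanilla RNN cell unrolled over a short sequence; the dimensions and random weights are illustrative only and are not taken from our experiments.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8-dimensional inputs, 16-dimensional hidden state, 5 time steps.
input_dim, hidden_dim, seq_len = 8, 16, 5
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))    # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # recurrent weights
b_h = np.zeros(hidden_dim)

x_seq = rng.normal(size=(seq_len, input_dim))   # a hypothetical input sequence
h = np.zeros(hidden_dim)                        # initial hidden state

# The looping architecture: the same weights are applied at every step, and the
# hidden state carries information forward from earlier steps.
for x_t in x_seq:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h.shape)   # (16,): the final hidden state summarizing the whole sequence

In a classifier, this final hidden state would typically be passed to a softmax output layer.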
Bayesian Optimization[19]: An advanced approach, Bayesian Optimization leverages probabilistic models
to guide the search for optimal hyperparameters. It expertly balances exploration and exploitation, making it
particularly suitable when objective function evaluations are resource-intensive.
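The sketch below shows the general shape of such a search using the scikit-optimize library's gp_minimize; the toy objective stands in for a cross-validated model score, and the hyperparameter names are hypothetical.

from skopt import gp_minimize
from skopt.space import Integer, Real

def objective(params):
    c, max_iter = params
    # In practice this would train a model with these hyperparameters and return,
    # e.g., 1 - cross-validated F1; a toy quadratic is used here for illustration.
    return (c - 1.0) ** 2 + 0.001 * max_iter

search_space = [Real(1e-3, 10.0, prior="log-uniform", name="C"),
                Integer(100, 1000, name="max_iter")]

# The Gaussian-process surrogate trades off exploring uncertain regions against
# exploiting regions that already look promising.
result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
print(result.x, result.fun)    # best hyperparameters found and their objective value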
2.4.4 Tree of Parzen Estimators (TPE)[19]
Building upon Bayesian Optimization principles, the Tree of Parzen Estimators (TPE) refines the search process.
TPE uses tree-structured Parzen density estimators to model the relationship between hyperparameter configurations
and the objective function, separating past trials into well- and poorly-performing groups and proposing configurations
that are more likely to fall into the former. This approach typically results in improved optimization performance.
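A minimal sketch of a TPE search using the hyperopt library follows; the search space and toy objective are illustrative, not our actual configuration.

from hyperopt import Trials, fmin, hp, tpe

# Two hypothetical hyperparameters: a log-uniform learning rate and a discrete depth.
space = {
    "learning_rate": hp.loguniform("learning_rate", -7, 0),   # roughly e^-7 .. 1
    "max_depth": hp.quniform("max_depth", 3, 12, 1),
}

def objective(params):
    # A real objective would train a model and return a validation loss;
    # hyperopt minimizes the returned value.
    return (params["learning_rate"] - 0.1) ** 2 + 0.01 * params["max_depth"]

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)   # the hyperparameter settings of the best trial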
Precision: Measures the fraction of true positive predictions among all positive predictions.
\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{8} \]
Recall (Sensitivity): Measures the fraction of true positive predictions among all actual positives.
\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{9} \]
These metrics provide a comprehensive evaluation of the classification models, taking into account both accuracy
and the ability to correctly classify positive instances.
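These quantities are available directly in scikit-learn; a small sketch with hypothetical binary predictions is shown below.

from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions for a small binary task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 4 = 0.75, as in Eq. (8)
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4 = 0.75, as in Eq. (9)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall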
semantic relationships. LMs are adaptable, and their transfer learning capability allows for fine-tuning on more
specific downstream tasks. This flexibility makes LMs the basis for many advanced NLP applications.
BERT (Bidirectional Encoder Representations from Transformers): BERT is a groundbreaking in-
novation in NLP and transfer learning. Unlike traditional LMs, BERT’s distinctive feature is its ability to learn
bidirectional context representations, meaning it considers both the preceding and following words when under-
standing a word’s meaning. This bidirectional context modeling significantly enhances its capacity to capture
nuanced language structures and semantics. BERT’s pre-training on a vast corpus of text data makes it highly
effective for a diverse range of NLP tasks. In our research, we leveraged BERT-based models for text classification
on four distinct datasets, offering a comprehensive evaluation of their performance compared to conventional
machine learning models. These experiments provide valuable insights into the applicability and effectiveness of
transfer learning with BERT across varied domains.
The scope of transfer learning, encompassing Language Models and BERT, is central to our research, which seeks
to explore the versatility and advantages of these techniques in addressing real-world text classification challenges.
By examining their performance across multiple datasets, our research contributes to the growing body of knowledge
on the practical applications of transfer learning in NLP.
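As a sketch of how such a BERT-based classifier can be set up with the Hugging Face transformers library: the checkpoint name, the 4-class head, and the example texts below are illustrative choices and not necessarily the exact configuration used in our experiments.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a pre-trained BERT encoder with a freshly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

texts = ["quarterly earnings beat expectations", "the home team clinched the title"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits          # shape (2, 4): one row of class scores per text

print(logits.argmax(dim=-1))                # predicted class ids (arbitrary until fine-tuned)

Fine-tuning would then minimize cross-entropy over the labelled training set, for example with the library's Trainer API or a standard PyTorch loop, before evaluating weighted F1 on the held-out split.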
3 Findings
In this section, we present the findings from our research:
• Dataset 1 - The balanced dataset revealed only a marginal difference in F1 Score, with BERT achieving an
F1 Score of 0.9585, slightly below the best non-BERT model at 0.9592, a negligible 0.073% decrease.
• Dataset 3 - The moderately balanced dataset illustrated the strength of BERT, which outperformed the best
non-BERT model with an F1 Score of 0.9611 compared to 0.9591, a 0.208% advantage in favor of BERT.
Datasets  | Class Distribution  | Best Non-BERT Model F1 (Weighted) | BERT F1 (Weighted) | Difference (in %)
Dataset 1 | Balanced            | 0.9592                            | 0.9585             | -0.073
Dataset 3 | Moderately Balanced | 0.9591                            | 0.9611             | 0.208
Average   | -                   | 0.95915                           | 0.9598             | 0.06
The calculated average F1 Score for BERT across both balanced datasets stood at 0.9598, while non-BERT
models achieved an average F1 Score of 0.95915, resulting in a subtle average difference of 0.06%. It is perti-
nent to emphasize that in scenarios with moderate class balance, BERT and non-BERT models exhibited similar
performance with minor, negligible differences, underscoring BERT’s versatility.
• Dataset 2 - Representing moderate class imbalance, this dataset revealed a significant 10.96% difference
in favor of BERT, with an F1 Score of 0.85 compared to the best non-BERT model’s 0.766. These results
underscore BERT’s capability when faced with datasets that are not linearly separable and exhibit moderate
class imbalance.
• Dataset 4 - Portraying pronounced class imbalance, this dataset accentuated the substantial advantage of
BERT, with an F1 Score of 0.62 compared to the best non-BERT model’s 0.534, signifying a remarkable
16.1% superiority of BERT. These findings highlight BERT’s marked proficiency when dealing with highly
imbalanced datasets.
In summary, our research unequivocally substantiates that transfer learning, particularly involving BERT, sig-
nificantly enhances the performance of text classification models. While BERT maintains competitive performance
on balanced datasets, its strengths are most conspicuous when dealing with imbalanced datasets, especially those
that are linearly inseparable or highly imbalanced. In such cases, BERT offers a substantial advantage, with an
average difference of 13.1% in favor of BERT over non-BERT models in imbalanced dataset scenarios. This firmly
establishes BERT’s role in the scope of transfer learning for text classification.
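For reference, the "Difference (in %)" values reported above appear consistent with the relative difference of the weighted F1 scores, computed as in this small sketch with hypothetical predictions.

from sklearn.metrics import f1_score

# Hypothetical multi-class predictions from BERT and from the best non-BERT model.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_bert = [0, 0, 1, 1, 2, 2, 2, 1]
y_non_bert = [0, 1, 1, 1, 2, 2, 1, 1]

f1_bert = f1_score(y_true, y_bert, average="weighted")
f1_non_bert = f1_score(y_true, y_non_bert, average="weighted")

# Relative difference in percent; positive values favour BERT.
print(round(100 * (f1_bert - f1_non_bert) / f1_non_bert, 3))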
4 Conclusion
In the course of our research, "Research on Text Classification using Machine Learning on Large Datasets
and Scope of Transfer Learning Using BERT," we carried out an empirical study of text classification, especially
in the context of large datasets. Our exploration extended to the potential of transfer learning, with a primary
focus on BERT (Bidirectional Encoder Representations from Transformers). Our findings, carefully quantified and
analyzed, provide valuable insights into the interplay between machine learning and text classification in a
contemporary, data-rich environment.
Our research encompassed datasets representing varying class distribution scenarios. In balanced datasets, we
discerned that BERT, a state-of-the-art transformer-based model, exhibited competitive performance. However, the
hallmark of our findings lay in imbalanced datasets, where BERT consistently and significantly outperformed
non-BERT models. Particularly in cases of linearly inseparable or highly imbalanced datasets, BERT
showcased its remarkable potential, delivering an average difference of 13.1% in favor of BERT.
In the realm of text classification, where the nature of data is often nuanced and complex, the implications of
our research are profound. The real-world applicability of machine learning in handling large datasets is undeniably
elevated, as BERT, an exemplar of transfer learning, emerges as a formidable tool. Whether the data
is balanced or markedly imbalanced, the persuasive evidence of BERT’s superior performance underscores its
prowess in the domain of text classification.
As we conclude this research, we assert that the scope of transfer learning, particularly involving BERT,
significantly enhances the performance of text classification models. BERT’s versatility, adaptability,
and superior predictive accuracy on imbalanced datasets illuminate the path forward in handling the data-intensive
challenges of the modern era. The implications of this research extend beyond the theoretical realm, paving the way
for the practical integration of BERT-based models in a myriad of text classification applications, from sentiment
analysis to information retrieval.
In a dynamic world where information is abundant, our research underscores the pivotal role of machine learning,
the promise of large datasets, and the transformative power of transfer learning. We conclude with the convic-
tion that our findings will fuel further exploration, innovation, and applications in the ever-evolving field of text
classification.
5 Bibliography
References
[1] James Bergstra, Brent Komer, Chris Eliasmith, Daniel Yamins, and David D. Cox. Hyperparameter Optimization. University of Waterloo, 2015.
[2] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[3] George Casella and Roger L. Berger. Statistical Inference. Duxbury/Thomson Learning, 2002.
[4] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[6] David R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B
(Methodological), 20(2):215–242, 1958.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186, 2019.
[8] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[9] Aurélien Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media,
2019.
[10] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.
[16] Sebastian Raschka and Vahid Mirjalili. Python Machine Learning. Packt Publishing, 2015.
[17] Irina Rish. An empirical study of the naive bayes classifier. In IJCAI 2001 workshop on empirical methods in
artificial intelligence, pages 41–46, 2001.
6 GitHub Repository
For access to the code and datasets used in this research, please visit our GitHub repository.