BS Thesis MidSem Report
Bachelor of Science
in
Aditya Manoj
21347
February, 2023
Abstract
Clickbait is a widespread problem in online content, deceiving users with dramatic headlines that fall short of their claims. Conventional clickbait detection techniques concentrate mostly on textual analysis, which frequently fails to detect complex multimodal clickbait that incorporates deceptive images, videos, or thumbnails. To improve classification accuracy, this study presents a multimodal approach to clickbait detection that makes use of both textual and visual cues. Our model uses natural language processing (NLP) techniques for textual analysis and deep learning-based image processing to extract visual patterns frequently linked to misleading material. To train and assess the model, we use publicly accessible datasets such as the Clickbait17 and YouTube Clickbait datasets.
Our results show that the multimodal strategy outperforms unimodal approaches by a large margin, successfully capturing a variety of clickbait tactics. By making clickbait detection systems more reliable and scalable, our research helps preserve content integrity and increase user confidence in online platforms.
Keywords: Clickbait Detection, Multimodal Learning, Natural Language
Processing, Deep Learning, Image Processing, Fake News Prevention.
Contents
1 Introduction 1
1.1 Introduction to Clickbait and Its Challenges . . . . . . . . . . . . 1
1.2 Motivation for Multimodal Clickbait Detection . . . . . . . . . . . 1
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Objectives of the Research . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Scope of the Study . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.6 Contributions of This Study . . . . . . . . . . . . . . . . . . . . . 3
1.7 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Definition and Characteristics of Clickbait . . . . . . . . . . . . . 5
2.2 Evolution of Clickbait in Online Media . . . . . . . . . . . . . . . 5
2.3 Impact of Clickbait on Digital Ecosystems . . . . . . . . . . . . . 6
2.4 Existing Approaches to Clickbait Detection . . . . . . . . . . . . . 6
2.4.1 Text-Based Clickbait Detection . . . . . . . . . . . . . . . 6
2.4.2 Image-Based Clickbait Detection . . . . . . . . . . . . . . 7
2.4.3 Multimodal Clickbait Detection . . . . . . . . . . . . . . . 7
2.5 Challenges in Clickbait Detection . . . . . . . . . . . . . . . . . . 7
2.6 Research Gaps and Motivation . . . . . . . . . . . . . . . . . . . . 8
3 Methodology 9
3.1 Dataset Collection and Preprocessing . . . . . . . . . . . . . . . . 9
3.1.1 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Text Preprocessing . . . . . . . . . . . . . . . . . . . . . . 9
3.1.3 Image Preprocessing . . . . . . . . . . . . . . . . . . . . . 10
3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Text Feature Extraction . . . . . . . . . . . . . . . . . . . 11
3.2.2 Image Feature Extraction . . . . . . . . . . . . . . . . . . 11
3.3 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.1 Fusion Techniques . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Training and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 12
3.4.1 Training Strategy . . . . . . . . . . . . . . . . . . . . . . . 12
3.4.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . 12
3.5 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.6 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.7 Recall (Sensitivity) . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.8 F1-Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.9 ROC-AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 Conclusion 19
5.1 Summary of Research . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2 Key Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6 Work Plan 21
6.1 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 1
Introduction
• The increasing reliance on thumbnails, memes, and infographics to
mislead users.
• To employ state-of-the-art deep learning models for classification,
such as transformer-based NLP models (BERT, RoBERTa) and
CNN-based image models (ResNet, Vision Transformers).[4][8][9]
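As a concrete illustration of this objective, the minimal sketch below loads one pretrained text encoder and one pretrained image backbone of the kinds listed above. It assumes the Hugging Face transformers library and torchvision, neither of which is prescribed by this report, and the checkpoint names are illustrative.

```python
# Minimal sketch: loading the kinds of pretrained backbones named above.
# Assumes Hugging Face `transformers` and `torchvision`; checkpoints are illustrative.
from transformers import AutoTokenizer, AutoModel, ViTModel
import torchvision.models as tv_models

# Transformer-based NLP encoder (BERT here; RoBERTa would use "roberta-base").
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

# CNN-based image encoder (ResNet-50) and a Vision Transformer alternative.
resnet = tv_models.resnet50(weights=tv_models.ResNet50_Weights.DEFAULT)
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
```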
This work lays the foundation for future research in automated misinformation
detection, particularly in the realm of multimodal content analysis.
• Conducts an extensive comparative study against existing unimodal
and multimodal detection methods.
Chapter 2
Background
Clickbait has become a major concern in digital content, where misleading headlines, exaggerated claims, and deceptive visuals are used to attract user engagement. This practice manipulates user behavior, contributes to misinformation, and undermines the integrity of content across digital platforms. Addressing this issue requires an understanding of the history, characteristics, and current research in clickbait detection. This chapter provides a comprehensive overview of clickbait, its evolution, its impact on digital ecosystems, existing detection methodologies, and open research challenges. [2] [6]
• Traditional Media Era: Sensationalized headlines in newspapers aimed
at boosting sales.
• Transformer-based models like BERT and GPT fine-tuned for clickbait
classification.
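To make this approach concrete, the sketch below fine-tunes a pretrained BERT model as a binary clickbait classifier. It is an assumed setup using the Hugging Face transformers Trainer on a tiny in-memory example dataset, not the exact configuration used in this work.

```python
# Illustrative sketch: fine-tuning BERT as a binary clickbait classifier.
# Assumes Hugging Face `transformers`; the two example headlines are made up.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = non-clickbait, 1 = clickbait

headlines = ["You won't BELIEVE what happened next",
             "Central bank raises interest rates by 25 basis points"]
labels = [1, 0]
enc = tokenizer(headlines, truncation=True, padding=True, return_tensors="pt")

class HeadlineDataset(torch.utils.data.Dataset):
    """Wraps tokenized headlines and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clickbait-bert",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=HeadlineDataset(enc, labels),
)
trainer.train()
```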
2.6 Research Gaps and Motivation
While existing studies have made progress in detecting clickbait, there remain
significant gaps:
Chapter 3
Methodology
Figure 3.1: Flowchart of Multimodal Clickbait Detection
• Data Augmentation - Includes flipping, rotation, and contrast adjust-
ments to improve model generalization.
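A possible augmentation pipeline covering these operations is sketched below, assuming torchvision; the specific parameter values are illustrative rather than taken from our experiments.

```python
# Assumed augmentation pipeline for thumbnail images (torchvision);
# parameter values are illustrative only.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),            # match ResNet/ViT input size
    transforms.RandomHorizontalFlip(p=0.5),   # flipping
    transforms.RandomRotation(degrees=15),    # rotation
    transforms.ColorJitter(contrast=0.3),     # contrast adjustment
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```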
• Image Processing Module - Utilizes deep vision models (ResNet, Vision Transformers) for visual analysis.
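One way to combine the text and image modules is a late-fusion head that concatenates the pooled embeddings from the two encoders before classification. The sketch below is an assumed PyTorch design for such a head, not the exact architecture evaluated in this report; the hidden size, dropout rate, and pooling choice are illustrative.

```python
# Assumed late-fusion design: concatenate pooled text and image embeddings,
# then classify with a small MLP head. Encoder choices follow the modules above.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_encoder, image_encoder, text_dim=768, image_dim=768):
        super().__init__()
        self.text_encoder = text_encoder    # e.g. BERT/RoBERTa from `transformers`
        self.image_encoder = image_encoder  # e.g. a ViT model from `transformers`
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 2),              # clickbait vs. non-clickbait
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        image_out = self.image_encoder(pixel_values=pixel_values)
        # Use each encoder's [CLS]-token embedding as its pooled representation.
        text_vec = text_out.last_hidden_state[:, 0]
        image_vec = image_out.last_hidden_state[:, 0]
        fused = torch.cat([text_vec, image_vec], dim=-1)
        return self.classifier(fused)
```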
3.5 Accuracy
Accuracy measures the overall correctness of the model and is defined as:
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.1}
\]
where TP and TN represent the correctly predicted clickbait and non-clickbait samples, respectively, while FP and FN denote the false positives and false negatives.
3.6 Precision
Precision measures how many of the predicted clickbait samples are actually
clickbait and is given by:
\[
\text{Precision} = \frac{TP}{TP + FP} \tag{3.2}
\]
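3.7 Recall (Sensitivity)
Recall measures how many of the actual clickbait samples are correctly identified and is given by:
\[
\text{Recall} = \frac{TP}{TP + FN} \tag{3.3}
\]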
3.8 F1-Score
The F1-score is the harmonic mean of precision and recall and is given by:
\[
\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.4}
\]
3.9 ROC-AUC
ROC-AUC evaluates the model’s ability to distinguish between classes and is
calculated based on the true positive rate (TPR) and false positive rate (FPR):
\[
\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN} \tag{3.5}
\]
The AUC value ranges from 0 to 1, where a higher score indicates better model
performance.
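In practice, all of the metrics above can be computed directly from the model's predictions. The sketch below uses scikit-learn, an assumed tooling choice not specified in this report, on small illustrative label and score vectors.

```python
# Computing the evaluation metrics defined above with scikit-learn
# (assumed tooling; labels and scores here are illustrative only).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = clickbait, 0 = non-clickbait
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]   # hard predictions
y_score = [0.92, 0.10, 0.78, 0.41, 0.22, 0.63, 0.85, 0.30]  # predicted P(clickbait)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))
```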
Chapter 4
Results and Discussion
This chapter presents the results of our multimodal clickbait detection experiments and discusses their implications. Various deep learning models, including BERT, RoBERTa, CLIP, and Vision Transformers (ViT), were evaluated on benchmark datasets such as Clickbait17 and YouTube Clickbait. The models were assessed on classification metrics including accuracy, precision, recall, and F1-score.
We also analyze the impact of feature fusion techniques, dataset variations, and adversarial robustness on model performance, comparing unimodal and multimodal approaches.
Model                 Dataset              Accuracy   Precision   F1-Score
BERT + ViT (Exp 2)    YouTube Clickbait    89%        0.90        0.88
BERT + ViT (Exp 3)    YouTube Clickbait    90%        0.91        0.89
RoBERTa + CLIP        YouTube Clickbait    91%        0.92        0.90
Model                 Dataset                     Accuracy   Precision   F1-Score
ViT + CLIP            Clickbait17-Test-170720     85%        0.91        0.84
ViT-Exp1              Clickbait17-Train-170331    78%        0.83        0.76
ViT-Exp2              Clickbait17-Train-170331    81%        0.86        0.79
ViT-Exp3              Clickbait17-Train-170331    84%        0.89        0.82
ViT-Exp4              Clickbait17-Train-170331    86%        0.91        0.84
ViT-Exp5              Clickbait17-Train-170331    88%        0.93        0.86
RoBERTa + CLIP        Clickbait-Train-170331      89%        0.94        0.88
BERT + ViT (Exp 2)    YouTube-Clickbait           90%        0.92        0.89
BERT + ViT (Exp 3)    YouTube-Clickbait           91%        0.93        0.90
RoBERTa + CLIP        YouTube-Clickbait           92%        0.94        0.91
4.3 Conclusion
This chapter presented an extensive evaluation of different multimodal clickbait detection models, comparing performance across datasets and architectures. The results confirm that multimodal deep learning significantly enhances clickbait detection accuracy, particularly with transformer-based language models and vision-language fusion techniques.
The next chapter will discuss future improvements and potential applications
of this research.
Chapter 5
Conclusion
• Multimodal models outperform unimodal approaches, indicating
that a combination of text and visual analysis significantly enhances click-
bait detection accuracy.
Chapter 6
Work Plan
Bibliography
[3] Carmela Comito, Luciano Caroprese, and Ester Zumpano. Multimodal fake
news detection on social media: a survey of deep learning techniques. Social
Network Analysis and Mining, 13(1):101, 2023.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert:
Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 conference of the North American chapter of
the association for computational linguistics: human language technologies,
volume 1 (long and short papers), pages 4171–4186, 2019.
[5] Ayse Geçkil, Ahmet Anil Müngen, Esra Gündogan, and Mehmet Kaya.
A clickbait detection method on news sites. In 2018 IEEE/ACM Inter-
national Conference on Advances in Social Networks Analysis and Mining
(ASONAM), pages 932–937. IEEE, 2018.
[7] Mini Jain, Peya Mowar, Ruchika Goel, and Dinesh K Vishwakarma. Click-
bait in social media: detection and analysis of the bait. In 2021 55th annual
conference on information sciences and systems (CISS), pages 1–6. IEEE,
2021.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, Minneapolis, Minnesota, 2019.
[9] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi
Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692, 2019.
[10] Abhishek Mallik and Sanjay Kumar. Word2vec and lstm based deep learn-
ing technique for context-free fake news detection. Multimedia Tools and
Applications, 83(1):919–940, 2024.
[11] Qing Meng, Bo Liu, Xiangguo Sun, Hui Yan, Chengyu Liang, Jiuxin Cao,
Roy Ka-Wei Lee, and Xing Bao. Attention-fused deep relevancy matching
network for clickbait detection. IEEE Transactions on Computational Social
Systems, 10(6):3120–3131, 2022.