Final Presentation Main
Language Detection on Social Media for Bangla Language
Introduction
Research Problem: Challenges in Cyberbullying Detection
Bias in Data
• Demographic and cultural biases
• Subjective labeling and language nuances
Overfitting
• Model specialization to training data
• Poor adaptation to new data
Generalization
• Struggle with new, unseen data
• Domain shift and distribution mismatches
• Degradation of model performance over time
Research Problem: Challenges in Cyberbullying Detection (Cont.)
Ethical Concerns
• Privacy and security issues
• Balancing harmful behavior recognition and free expression
Interpretability
• Understanding model decisions
• Dealing with symbols and local slang
Scalability
• Efficient processing of large datasets
• Real-time detection and adaptability
Research Objective
Research Objective (Cont.)
- Requires large datasets for optimal results; smaller datasets don't provide the best outcomes due to model
complexity.
- Focuses only on text, ignoring other mediums like images or videos.
- Limited by dataset size and may not be representative of all types of online content.
- Initial experimentation focuses only on specific platforms, such as tweets.
- Narrow focus on just two datasets limits generalizability.
Dataset Description
Data Collection
• Initial Dataset: 44,001 comments compiled and publicly available on Mendeley Data.
• Source of Comments: Gathered from social media platforms, primarily Facebook.
• Diverse Interactions: Comments from actors, influencers, politicians, athletes, etc.
• Expansion of Dataset: Additional 16,000 data points collected from Instagram, YouTube, etc.
Data Collection (Cont.)
Original Dataset
Data Analysis
Gender
• Distribution: Around 36,000 comments relate to females, with the rest to males.
• Insights: Provides nuanced views on online discourse across social spheres.
• Comprehensiveness: Dataset spans diverse topics, offering insights into online interactions.
• Validation: Rigorous validation ensures data accuracy, reliability, and trustworthiness.
• Potential Insights: Offers understanding of harassment, discrimination, social dynamics, and gender attitudes.
Gender Distribution of the Dataset
Data Analysis (Cont.)
Category Distribution
Data Analysis (Cont.)
Label Distribution
• Sexual Comments: 17.19% of dataset, often include offensive language and unwanted approaches.
• Not Bully Comments: 32.98% of dataset, may contribute to hostile environment despite seeming benign.
• Troll Comments: 25.15% of dataset, aimed at evoking strong emotions or sabotaging conversations.
• Religious Comments: 14.87% of dataset, often polarizing and promoting hate speech.
• Threat Comments: 9.83% of dataset, pose direct risk to individuals' safety and well-being.
Label Distribution (Type of Bully)
Methodology
Preprocessing
- Remove noise: missing values, emojis, non-Bengali characters, whitespace.
- Split data: training (80%), validation (10%), testing (10%).
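The two preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: the helper names `clean_comment` and `split_80_10_10`, the fixed seed, and the choice to treat everything outside the Bengali Unicode block (U+0980 to U+09FF) as noise are all assumptions. Note that this simple filter also drops the danda (।) and zero-width joiners, which a production pipeline may want to keep.

```python
import random
import re

# Assumption: any character outside the Bengali block that is not whitespace is noise.
NON_BENGALI = re.compile(r"[^\u0980-\u09FF\s]")
WHITESPACE = re.compile(r"\s+")


def clean_comment(text):
    """Return the cleaned comment, or None when nothing usable remains."""
    if text is None:                       # missing value
        return None
    text = NON_BENGALI.sub(" ", text)      # drops emojis, Latin letters, symbols
    text = WHITESPACE.sub(" ", text).strip()
    return text or None


def split_80_10_10(rows, seed=42):
    """Shuffle rows and split them into 80% train, 10% validation, 10% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10
    n_val = len(rows) // 10
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]          # remainder goes to the test set
    return train, val, test
```

Comments for which `clean_comment` returns `None` would be dropped before splitting.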
Model Architecture
BERT Overview:
- Model: `bert-base-multilingual-cased`.
- Understands word context bidirectionally.
Input Representation:
- Token Embeddings
- Segment Embeddings
- Position Embeddings
Transformer layers:
- Multi-Head Self-Attention
- Feed-Forward Neural Network
- Residual Connections and Layer Normalization
Output Representations:
- [CLS] Token
- [SEP] Token
Sequence Classification with BERT:
- Input Layer
- BERT Encoder
- Classification Layer
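To make the input representation concrete, the sketch below shows how a single comment could be laid out before it reaches the encoder. It is an illustration only: `build_bert_inputs` is a hypothetical helper, the tokens are placeholders (real WordPiece tokens come from the model's own tokenizer), and `max_len=16` is an arbitrary choice.

```python
def build_bert_inputs(tokens, max_len=16):
    """Wrap tokens as [CLS] ... [SEP], pad to max_len, and build the index lists."""
    body = list(tokens)[: max_len - 2]         # leave room for [CLS] and [SEP]
    seq = ["[CLS]"] + body + ["[SEP]"]
    attention_mask = [1] * len(seq)            # 1 = real token, 0 = padding
    pad = max_len - len(seq)
    seq += ["[PAD]"] * pad
    attention_mask += [0] * pad
    return {
        "tokens": seq,
        "segment_ids": [0] * max_len,          # single-sentence input: one segment
        "position_ids": list(range(max_len)),  # one position index per slot
        "attention_mask": attention_mask,
    }
```

For each slot, BERT sums the token, segment, and position embeddings; the classification layer then reads the final hidden state at the `[CLS]` position.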
Result Analysis
Training Phase (Cont.)
Here is the detailed report on Fold 2
Training Phase (Cont.)
Class-wise Evaluation:
Interpretation:
- Overall Metrics: Above 94%.
- Balanced Performance.
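For reference, class-wise precision, recall, and F1 are usually derived from a confusion matrix as sketched below. The helper names `per_class_metrics` and `accuracy` are hypothetical, and any matrix passed in here is a toy input, not the results reported in this presentation.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.

    Returns one (precision, recall, f1) tuple per class.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # row total minus hits
        fp = sum(cm[i][k] for i in range(n)) - tp   # column total minus hits
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total if total else 0.0
```

"Balanced performance" then means these per-class tuples are close to each other, rather than one class carrying the overall score.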
Testing Phase Result
Comprehensive Performance Analysis of Online Behavior Classification Model
Classes Evaluated:
Confusion Matrix Analysis
Improvement Areas:
ROC Curve and AUC
ROC Curve:
AUC:
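As background, the ROC curve plots the true positive rate against the false positive rate as the decision threshold is lowered, and the AUC is the area under that curve. The sketch below computes both for one class treated one-vs-rest. It assumes labels are 0 or 1 with at least one of each; `roc_points` and `auc` are hypothetical helper names, and the scores shown in any usage are made up.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points, starting at (0, 0)."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An AUC of 1.0 means every positive is scored above every negative; 0.5 corresponds to random ranking.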
Future Work Plans
• Partner with social platforms for real-time application.
• Incorporate user feedback for algorithm refinement.
• Explore advanced NLP techniques.
• Develop automated moderation and response.
• Conduct longitudinal studies on cyberbullying trends.
• Enhance system for linguistic and cultural nuances.
• Collaborate with mental health organizations for support.
• Launch educational campaigns for awareness.
THANK YOU
QUESTIONS?