Private Sensitive Content on Social Media: An Analysis and Automated Detection for Norwegian

Authors

  • Haldis Borgen
  • Oline Zachariassen
  • Pelin Mise
  • Ahmet Yildiz
  • Özlem Özgöbek

DOI:

https://doi.org/10.3384/ecp208001

Abstract

This study addresses a notable gap in research on detecting private-sensitive content in Norwegian social media by creating and annotating a dataset tailored to the linguistic and cultural nuances of Norwegian social media discourse. Entries were collected from Reddit, the primary data source, and cleaned, resulting in a dataset of 4482 rows. We evaluated a variety of computational models, including machine learning, deep learning, and transformer models, to assess their effectiveness in identifying sensitive content. Among these, the NB BERT-based classifier performed best, achieving an accuracy of 82.75% and an F1-score of 82.39%. This work paves the way for improved privacy-sensitive content detection in Norwegian social media and sets a precedent for future research in the domain, emphasizing the critical role of tailored datasets in advancing the field.
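
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and F1-score can be computed from gold labels and predictions. It is not the authors' evaluation code. The abstract does not state the label set or how F1 was averaged, so the binary labels (1 = sensitive, 0 = not sensitive), the macro averaging, and the function names here are all assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (averaging choice is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# Hypothetical toy labels, not data from the paper.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 0, 1]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```

Reporting both numbers is useful because accuracy alone can look good on an imbalanced dataset, whereas F1 also reflects how many sensitive posts were missed or wrongly flagged.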

Published

2024-06-14