Abstract
Nowadays, children access the Internet on a regular basis. Like the real world, the Internet has many unsafe locations where children may be exposed to inappropriate content in the form of obscene, aggressive, erotic or rude comments. In this work, we address the problem of detecting erotic/sexual content in text documents using Natural Language Processing (NLP) techniques. Following a Machine Learning approach, we assessed twelve models resulting from the combination of three text encoders (Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec) with four classifiers (Support Vector Machines (SVMs), Logistic Regression, k-Nearest Neighbors and Random Forests). We evaluated these alternatives on a newly created dataset extracted from public data on the Reddit website. The best performance was achieved by combining the TF-IDF text encoder with a linear-kernel SVM classifier, which reached an accuracy of 0.97 and an F-score of 0.96 (precision 0.96, recall 0.95). This study demonstrates that it is possible to detect erotic content in text documents and, therefore, to develop filters for minors or filters tailored to users' preferences.
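As a rough illustration of the experimental grid described in the abstract, the following sketch enumerates the twelve encoder/classifier pairings and computes a TF-IDF encoding in plain Python. The weighting variant (relative term frequency multiplied by ln(N/df)), the function name `tfidf`, and the toy corpus are assumptions made for illustration; the abstract does not state the exact weighting or implementation used.

```python
import math
from collections import Counter
from itertools import product

ENCODERS = ["Bag of Words", "TF-IDF", "Word2vec"]
CLASSIFIERS = ["SVM", "Logistic Regression", "k-Nearest Neighbors", "Random Forest"]

# Three encoders x four classifiers = the twelve models named in the abstract.
MODELS = list(product(ENCODERS, CLASSIFIERS))


def tfidf(docs):
    """Encode tokenized documents as sparse TF-IDF dictionaries.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    Other weightings (smoothing, normalization) are equally common.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vectors = tfidf(corpus)
```

In this sketch, a term that occurs in a single document (such as "cat") receives a higher weight than a term shared across documents (such as "the"), which is the property that makes TF-IDF features useful as input to a linear classifier.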
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/).
About this article
Cite this article
Barrientos, G.M., Alaiz-Rodríguez, R., González-Castro, V. et al. Machine Learning Techniques for the Detection of Inappropriate Erotic Content in Text. Int J Comput Intell Syst 13, 591–603 (2020). https://doi.org/10.2991/ijcis.d.200519.003
DOI: https://doi.org/10.2991/ijcis.d.200519.003