Machine Learning Techniques for the Detection of Inappropriate Erotic Content in Text

Barrientos, Gonzalo Molpeceres; Alaiz-Rodríguez, Rocío; González-Castro, Víctor; Parnell, Andrew C.

doi:10.2991/ijcis.d.200519.003

Research Article
Open access
Published: 11 June 2020

Volume 13, pages 591–603, (2020)

International Journal of Computational Intelligence Systems

Abstract

Nowadays, children have regular access to the Internet. Just like the real world, the Internet has many unsafe locations where children may be exposed to inappropriate content in the form of obscene, aggressive, erotic or rude comments. In this work, we address the problem of detecting erotic/sexual content in text documents using Natural Language Processing (NLP) techniques. Following a Machine Learning approach, we assessed twelve models resulting from the combination of three text encoders (Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec) with four classifiers (Support Vector Machines (SVMs), Logistic Regression, k-Nearest Neighbors and Random Forests). We evaluated these alternatives on a newly created dataset extracted from public data on the Reddit website. The best performance was achieved by combining the TF-IDF text encoder with a linear-kernel SVM classifier, which reached an accuracy of 0.97 and an F-score of 0.96 (precision 0.96, recall 0.95). This study demonstrates that it is possible to detect erotic content in text documents and therefore to develop filters for minors or filters adapted to the user's preferences.
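The experimental design summarized in the abstract can be illustrated with a small sketch. It only enumerates the twelve encoder/classifier pairings and shows how an F-score is derived from precision and recall; it does not train anything. The names `ENCODERS`, `CLASSIFIERS`, `MODELS` and `f1_score` are hypothetical, and treating the reported F-score as the balanced F1 (the harmonic mean of precision and recall) is an assumption, since the abstract does not state the variant used.

```python
from itertools import product

# Encoder and classifier names as listed in the abstract.
ENCODERS = ["Bag of Words", "TF-IDF", "Word2vec"]
CLASSIFIERS = ["SVM", "Logistic Regression", "k-Nearest Neighbors", "Random Forest"]

# Every encoder paired with every classifier: 3 x 4 = 12 candidate models.
MODELS = list(product(ENCODERS, CLASSIFIERS))


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F-score)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for encoder, classifier in MODELS:
        print(f"{encoder} + {classifier}")
    # Rounded precision and recall reported for TF-IDF + linear SVM.
    print(round(f1_score(0.96, 0.95), 3))
```

With the rounded inputs 0.96 and 0.95 this gives about 0.955. That is compatible with the reported 0.96 if the published precision and recall were themselves rounded from slightly higher values, which is an assumption rather than something the abstract confirms.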

Author information

Authors and Affiliations

Department of Electrical, Systems and Automation Engineering, Universidad de León Campus de Vegazana s/n, León, Spain
Gonzalo Molpeceres Barrientos, Rocío Alaiz-Rodríguez & Víctor González-Castro
Hamilton Institute, Maynooth University, Maynooth, Ireland
Andrew C. Parnell

Authors

Gonzalo Molpeceres Barrientos
Rocío Alaiz-Rodríguez
Víctor González-Castro
Andrew C. Parnell

Corresponding author

Correspondence to Rocío Alaiz-Rodríguez.

Rights and permissions

This is an open access article distributed under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/).

About this article

Cite this article

Barrientos, G.M., Alaiz-Rodríguez, R., González-Castro, V. et al. Machine Learning Techniques for the Detection of Inappropriate Erotic Content in Text. Int J Comput Intell Syst 13, 591–603 (2020). https://doi.org/10.2991/ijcis.d.200519.003

Received: 03 March 2020
Accepted: 05 May 2020
Published: 11 June 2020
Issue Date: January 2020
DOI: https://doi.org/10.2991/ijcis.d.200519.003
