Machine Learning Techniques for the Detection of Inappropriate Erotic Content in Text

  • Research Article
  • Open access
  • Published: 11 June 2020
  • Volume 13, pages 591–603, (2020)

International Journal of Computational Intelligence Systems

  • Gonzalo Molpeceres Barrientos (1)
  • Rocío Alaiz-Rodríguez (1), ORCID: orcid.org/0000-0003-4164-5887
  • Víctor González-Castro (1)
  • Andrew C. Parnell (2), ORCID: orcid.org/0000-0001-7956-7939

Abstract

Nowadays, children have regular access to the Internet. Just like the real world, the Internet has many unsafe locations where children may be exposed to inappropriate content in the form of obscene, aggressive, erotic or rude comments. In this work, we address the problem of detecting erotic/sexual content in text documents using Natural Language Processing (NLP) techniques. Following a Machine Learning approach, we assessed twelve models resulting from the combination of three text encoders (Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec) with four classifiers (Support Vector Machines (SVMs), Logistic Regression, k-Nearest Neighbors and Random Forests). We evaluated these alternatives on a newly created dataset extracted from public data on the Reddit website. The best performance was achieved by combining the TF-IDF text encoder with a linear-kernel SVM classifier, which reached an accuracy of 0.97 and an F-score of 0.96 (precision 0.96, recall 0.95). This study demonstrates that it is possible to detect erotic content in text documents and, therefore, to develop filters for minors or filters tailored to users' preferences.
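
To make the TF-IDF encoding named in the abstract more concrete, the following is a minimal, standard-library sketch of how documents can be turned into TF-IDF vectors. It is not the authors' implementation: the tokenizer, the function names (tokenize, tfidf_vectors), the toy corpus, and the particular weighting (term count divided by document length, multiplied by log(N / df)) are all assumptions chosen for illustration, and library implementations typically apply different smoothing and normalization.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    This is one common textbook variant among several.
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical toy corpus, unrelated to the paper's Reddit dataset.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]
vocab, vecs = tfidf_vectors(docs)
```

Each resulting vector has one weight per vocabulary term, with terms that occur in few documents weighted more heavily. Vectors of this kind would then be passed to a classifier such as a linear-kernel SVM; in practice an established library would normally handle both steps.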

Author information

Authors and Affiliations

  1. Department of Electrical, Systems and Automation Engineering, Universidad de León, Campus de Vegazana s/n, León, Spain

    Gonzalo Molpeceres Barrientos, Rocío Alaiz-Rodríguez & Víctor González-Castro

  2. Hamilton Institute, Maynooth University, Maynooth, Ireland

    Andrew C. Parnell

Corresponding author

Correspondence to Rocío Alaiz-Rodríguez.

Rights and permissions

This is an open access article distributed under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/).

About this article

Cite this article

Barrientos, G.M., Alaiz-Rodríguez, R., González-Castro, V. et al. Machine Learning Techniques for the Detection of Inappropriate Erotic Content in Text. Int J Comput Intell Syst 13, 591–603 (2020). https://doi.org/10.2991/ijcis.d.200519.003

  • Received: 03 March 2020

  • Accepted: 05 May 2020

  • Published: 11 June 2020

  • Issue Date: January 2020

  • DOI: https://doi.org/10.2991/ijcis.d.200519.003

Keywords

  • Inappropriate content
  • Machine learning
  • Text classification
  • Natural language processing
  • Text encoders