Machine Learning Based Spam Comments Detection on YouTube
P. Tarun
Dept. Of CSE
MLR Institution of Technology
(of JNTUH Affiliation)
Hyderabad, India
Abstract—YouTube provides only limited tools for moderating the comments in its comment section. Because of this, the volume of spam comments is increasing rapidly. Using machine learning, such comments can be detected and prevented, and there are many ML approaches to spam detection. YouTube is an application where people watch videos for many purposes, such as entertainment or learning, and it lets users interact with creators through the comment section. Some people use this channel to post scam comments, which can be quite harmful. These comments can include links to pages that steal information or confidential details. In some cases a link redirects to a page that lures people with the promise of earning money by playing a game, and many people have lost money by clicking on links of this type in comments or messages. The purpose of this project is to detect spam comments posted on the internet in order to avoid scams and unrelated information. In this study, the Naive Bayes classification algorithm is used. The detection accuracy of the proposed system is 92.78%.

Keywords—Spam comments, Machine learning, YouTube, Naive Bayes.

I. INTRODUCTION

There are many content creators on the YouTube platform. Every creator has their own content to post or stream. They gain a following for that content in the form of subscribers and views. YouTube provides a comment section for every video a creator posts so that the creator can learn the opinions of viewers. Some viewers like the posted video and some might not, and the latter might post negative or abusive comments. There is, however, another category of comments: unwanted, unsolicited electronic messages known as spam comments. These spam comments are sent in large volumes. The dangerous threat of spam arises when a spam comment leads to inappropriate websites. Using machine learning, we can predict and detect spam. The algorithm used for detection is Naive Bayes, which predicts which comments are spam and which are not. Various algorithms have been used for spam comment detection, but Naive Bayes is the most suitable because it is faster than the other algorithms and handles probabilistic calculations well.

1.1 What is a Spam Comment?
Spam comments are comments whose express intent is to collect personal data from readers, to deceive readers into leaving YouTube, or to engage in other banned actions. Posting large numbers of duplicate, repetitive, or untargeted remarks also counts as spam.

II. EASE OF USE

Machine learning has been expanding into various sectors including health, business, retail, finance and education. This project gives a clear understanding of the spam comments posted in applications like YouTube using machine learning algorithms and techniques. It provides a better way to determine automatically, without human involvement, which comments are spam and which are not. We often come across spam messages or comments on the internet. Some can be harmful, leading to another page that can potentially steal our data, and some might scam us out of money. We detect spam comments using the Naive Bayes machine learning algorithm, which is a supervised classification algorithm. We classify comments as spam or not spam by grouping items that are similar and share common characteristics.

III. SCOPE

This project works on detecting the spam comments that are
being posted on applications like YouTube. Not all comments that are posted are genuine; some can be fake, that is, scams. YouTube provides a way for users to interact with the owners of a video by posting comments, and it is an open platform on which anyone can post. Some comments are spam, that is, unrelated to the content of the video. Some spam comments include links. When a person clicks such a link, it can redirect to a new page that may be harmful, may contain a virus, or may steal confidential information or money. This project works by accepting a comment through a website with a detect button that predicts whether the comment is spam or not. The algorithm used is Naive Bayes, which is considered fast, well suited to probabilistic calculations, and easy to build. The data is first cleaned and processed, and exploratory data analysis is performed to understand it. The data is then split for training and testing, and finally it is fit to the model for detection.

IV. SURVEY AND RESEARCH

4.1 Using K-Nearest Neighbor and Support Vector Machine to Detect Spam Comments on YouTube
This project provides a framework for detecting spam comments on YouTube videos using K-Nearest Neighbor (KNN) and Support Vector Machine (SVM). The study involves five phases: data collection, preprocessing, feature selection, classification, and detection. Weka and RapidMiner are used for the experiments.

4.2 Spammer Detection: A Study of Spam Filter Comments on YouTube Videos
The goal of this study is to identify users who leave spam comments: those who comment with the aim of promoting themselves, and those whose comments are irrelevant to the given video.

4.3 Analysis and Classification of User Comments on YouTube Videos
User comments on the video-sharing website YouTube are grouped according to how closely they relate to the information provided in the description of the uploaded video. Positive and negative comments are further separated using polarity analysis.

4.4 Detection of Spam in YouTube Comments Using Different Classifiers
These days, thousands of videos are shared on YouTube every minute, and users instantly begin to like and comment. Millions of comments are left on some famous and viral videos. Some of these comments are positive and healthy, while others are spam or abusive, and occasionally include a URL for commercial advertising or a site redirect. This study used datasets of YouTube comments from five well-known singers to identify spam using both conventional and artificial neural network-based classifiers. The suggested method compares the classifiers' results and recommends the top classifiers for identifying spam comments.

4.5 TubeSpam: Comment Spam Filtering on YouTube
Since YouTube offers few tools for comment moderation, the amount of spam is rising sharply, forcing owners of well-known channels to turn off the comments feature on their videos. According to the statistical analysis of the results, decision trees, logistic regression, Bernoulli Naive Bayes, random forests, linear SVM, and Gaussian SVM are statistically equivalent with a degree of confidence of 99.9%.

V. EXISTING SYSTEM

Google Safe Browsing and YouTube book marker are some of the tools used by YouTube to protect users from spammers. These tools can block malicious links but cannot protect the user at an early stage, in real time. This is because they rely on the SVM and K-nearest neighbour algorithms for detecting spam. These algorithms cannot predict the accuracy rate. In some cases a neural network (NN) algorithm is also used.

Algorithms used are
1. SVM - Support Vector Machine.
2. K-nearest neighbour algorithm.
3. NN - Neural Network algorithm.

5.1 Limitations of Existing System
The limitations of the algorithms used in the existing systems stated above are listed below.
1. SVM limitations.
i. The SVM method is not a good fit for large data sets.
ii. SVM does not work very well when the data set is noisy, such as when the target classes overlap.
2. KNN limitations.
i. The prediction step may take a long time if there is a lot of data.
ii. High RAM is necessary because all of the training data must be kept in memory.
3. NN limitations.
i. Computationally expensive.
ii. Long development time.

VI. PROPOSED SYSTEM

Spam comments can be detected using machine learning techniques. Spam comments have the potential to spread malware throughout a system and to exploit users' machines. The Naive Bayes algorithm is used to predict and detect spam comments. This algorithm gives accurate results and takes less time to predict spam. The Naive Bayes classifier can be applied to any classification problem, such as yes/no or true/false decisions, and the classification can be binomial or multinomial.
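As an illustration of how a Naive Bayes spam classifier of the kind proposed here might be built, the following is a minimal sketch using only the Python standard library. The paper does not state which Naive Bayes variant, smoothing scheme, or tokenizer was used, so the multinomial model, the add-one (Laplace) smoothing, the `tokenize` helper, the `NaiveBayesSpamFilter` class, and the six training comments are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (assumed variant)."""

    def fit(self, comments, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)              # comments per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(comments, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / self.total_docs)  # log prior
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                smoothed = (self.word_counts[c][word] + 1) / (total_words + vocab_size)
                score += math.log(smoothed)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
train_comments = [
    "Check out my new website for free money",
    "Click this link to win free prizes",
    "Subscribe to my channel and visit my website",
    "I love this song so much",
    "Great video, the vocals are amazing",
    "This song reminds me of my childhood",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = NaiveBayesSpamFilter().fit(train_comments, train_labels)
print(model.predict("Hey, check out my new website!! This site is about kids stuff."))
print(model.predict("I love this song"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing keeps a word that appears in only one class from forcing the other class's probability to zero.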
VII. DIAGRAMS
A. Equation
The Naive Bayes algorithm is based on Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B)

Here,
P(A|B) is the posterior probability of class A given predictor B.
P(B|A) is the likelihood, that is, the probability of the predictor given the class.
P(A) is the prior probability of the class.
P(B) is the probability of the evidence.

A. Use-Case Diagram
In the Unified Modelling Language (UML), a use case diagram is a specific kind of behavioral diagram that results from and is defined by a use-case analysis. The system is started by the user, who then uploads the input comment. This input comment is examined against the current data set, and the machine learning algorithm is applied. The text is separated into segments, and this split information is analyzed to determine whether or not the input comment is spam.
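As a rough illustration of the flow described above, in which the comment text is split into segments and Bayes' theorem is used to judge it, the following sketch tokenizes a comment and computes P(spam | word) for a single word. The function names `tokenize` and `posterior_spam_given_word` and all probability values are hypothetical, chosen only for this example and not taken from the project's data.

```python
import re


def tokenize(comment: str) -> list[str]:
    """Lowercase a comment and split it into alphanumeric word segments."""
    return re.findall(r"[a-z0-9]+", comment.lower())


def posterior_spam_given_word(p_word_given_spam: float,
                              p_word_given_ham: float,
                              p_spam: float) -> float:
    """Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)."""
    p_ham = 1.0 - p_spam
    # Evidence P(word), by the law of total probability over the two classes.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


tokens = tokenize("Hey, check out my new website!!")

# Hypothetical estimates: "website" occurs in 30% of spam comments and
# 2% of ham comments, and 40% of all comments are spam.
p = posterior_spam_given_word(0.30, 0.02, 0.40)

print(tokens)
print(round(p, 3))
```

With these assumed numbers the posterior is 0.12 / 0.132, or about 0.909: a word that is rare in ham comments but common in spam shifts the probability strongly toward spam even when the prior is below one half.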
C. Sequence Diagram
A sequence diagram is a type of interaction diagram in the Unified Modelling Language (UML) that shows the interactions between a group of objects and the order in which they take place. Software engineers and business experts use these diagrams to understand the requirements for a new system or to describe an existing process. A sequence diagram is a construct of a Message Sequence Chart. Sequence diagrams are also known as event diagrams, event scenarios, and timing diagrams.
IX. FRONT END DESIGN

The front end is the layer above the back end and includes all software or hardware that is part of a user interface. Human or digital users interact directly with various aspects of the front end of a program, including user-entered data, buttons, programs, websites, and other features. It is often referred to as the user interface. In this project, the front end consists of a text box and a button labeled detect. The user enters a comment in the comment box and clicks the detect button, which reports whether the comment is spam or not.

X. RESULTS

This project's output indicates whether a comment is spam or ham. When an input is given and the program is told to make a prediction, it returns spam if the comment is deemed spam and not spam if it is deemed ham. For example, the comment "Hey, check out my new website!! This site is about kids stuff. kidsmediausa . com" is classified as spam, so the output is shown as spam.
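The detect flow (a comment is entered, the detect button is pressed, and the result is shown as spam or not spam) could be wired up roughly as in the sketch below. The paper does not describe the web framework or the handler code, so `handle_detect` and `stand_in_model` are hypothetical names, and `stand_in_model` is only a placeholder for the trained Naive Bayes classifier, not the project's model.

```python
from typing import Callable


def handle_detect(comment: str, is_spam: Callable[[str], bool]) -> str:
    """Return the text shown to the user after the detect button is pressed."""
    text = comment.strip()
    if not text:
        # Nothing to classify, so ask for input instead of guessing.
        return "Please enter a comment."
    return "spam" if is_spam(text) else "not spam"


def stand_in_model(text: str) -> bool:
    """Hypothetical placeholder for the trained classifier: flags the word 'website'."""
    return "website" in text.lower()


print(handle_detect("Hey, check out my new website!!", stand_in_model))
print(handle_detect("Great video", stand_in_model))
```

Passing the classifier in as a function keeps the user-interface logic independent of the model, so the placeholder can be swapped for the real predictor without changing the handler.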
XI. CONCLUSION

10.1 Conclusion.
Several methods are employed to categorize comments as spam or not spam. This strategy is 18% more effective than the previous strategy. Every user on YouTube has access to its open platform. There may be a shift in the spammers' behaviour over time.

10.2 Future Scope.
This project aims to eliminate unwanted spam comments from YouTube and to identify ham comments with high accuracy. The project's output enhances the findings for future comparison and serves as a baseline for anyone interested in YouTube spam comments. Information on spam comments on YouTube is gathered from social networking sites. A data mining technique will be used to compare the accuracy of the results using this data.

REFERENCES
[1] Sah, U. K., & Parmar, N. (2017). An approach for Malicious Spam Detection in Email with comparison of different classifiers.
[2] Alberto, T. C., Lochter, J. V., & Almeida, T. A. (2015, December). TubeSpam: Comment spam filtering on YouTube. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on (pp. 138-143). IEEE.
[3] Alsaleh, M., Alarifi, A., Al-Quayed, F., & Al-Salman, A. (2016). Combating comment spam with machine learning approaches. Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015, 295–300. https://doi.org/10.1109/ICMLA.2015.192
[4] Scheltus, P., Dorner, V., & Lehner, F. (2013). Leave a Comment! An In-Depth Analysis of User Comments on YouTube. Wirtschaftsinformatik, 42.
[5] A. Kantchelian, J. Ma, L. Huang, S. Afroz, A. Joseph, J. D. Tygar, Robust detection of comment spam using entropy rate, in: Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence, AISec '12, ACM, New York, NY, USA, 2012, pp. 59-70. doi:10.1145/2381896.2381907.
[6] S. Aiyar and N. P. Shetty, "N-gram assisted Youtube spam comment detection", Proc. Comput. Sci., vol. 132, pp. 174-182, Jan. 2018.
[7] A. Kantchelian, J. Ma, L. Huang, S. Afroz, A. Joseph and J. D. Tygar, "Robust detection of comment spam using entropy rate", Proc. 5th ACM Workshop Secur. Artif. Intell. (AISec), pp. 59-70, 2012.
[8] A. Madden, I. Ruthven and D. Mcmenemy, "A classification scheme for content analyses of Youtube video comments", J. Documentation, vol. 69, no. 5, pp. 693-714, Sep. 2013.
[9] A. Severyn, A. Moschitti, O. Uryupina, B. Plank and K. Filippova, "Opinion mining on Youtube", Proc. 52nd Annu. Meeting Assoc. Comput. Linguistics (Long Papers), vol. 1, pp. 1-10, 2014.
[10] M. Z. Asghar, S. Ahmad, A. Marwat and F. M. Kundi, "Sentiment analysis on Youtube: A brief survey", arXiv:1511.09142, 2015, [online] Available: http://arxiv.org/abs/1511.09142.
[11] T. C. Alberto, J. V. Lochter and T. A. Almeida, "TubeSpam: Comment spam filtering on Youtube", Proc. IEEE 14th Int. Conf. Mach. Learn. Appl. (ICMLA), pp. 138-143, Dec. 2015.
[12] A. U. R. Khan, M. Khan and M. B. Khan, "Naïve multi-label classification of Youtube comments using comparative opinion mining", Proc. Comput. Sci., vol. 82, pp. 57-64, Jan. 2016.