
Review On Detection of Spam Comments Using NLP Algorithm

This document summarizes a research paper that proposes a method to detect spam comments on social media using natural language processing (NLP). It begins with an abstract that describes detecting trending topics from social media and the challenge of spam. The paper then reviews related work on spam detection using Naive Bayes classification and term frequency–inverse document frequency. It notes that these methods do not consider semantic information. The proposed method uses NLP techniques like subjectivity judgment, category judgment, and weight judgment in a deep learning model and statistical analysis to better detect spam comments based on semantics. It aims to address limitations in existing approaches by incorporating full semantic analysis of words.


www.ijecs.in
International Journal Of Engineering And Computer Science ISSN:2319-7242
Volume 7 Issue 1 January 2018, Page No. 23386-23489
Index Copernicus Value (2015): 58.10, 76.25 (2016) DOI: 10.18535/ijecs/v7i1.03

Review On: Detection of Spam Comments Using NLP Algorithm


Miss Rohini D. Warkar¹, Mr. I. R. Shaikh²
¹,² Computer Engineering Department, Savitribai Phule University, Pune, India

Abstract:
Detecting trending topics is a good way to summarize the information coming from social media, and extracting which topics are becoming hot on online media is one of the challenges. Because social media services are open to their users, they also create opportunities for spamming, which greatly reduces the value of real-time search. The next task is therefore to control spamming on social networking sites. Different data mining concepts will be used to meet these challenges. The work done so far is described below, such as spam control using natural language processing for preprocessing and clustering. One account has been created to demonstrate the system in a real setting.

Keywords: Control Spamming, Text mining, Information filtering, Social Networking site, NLP.
1. Introduction
Nowadays the popularity and importance of social media sites in people’s daily lives are growing. Social media allows online users to share their feelings by posting comments. However, more and more spam comments are also being posted to users’ accounts on social media, so the need for spam detection has increased. Traditional systems use different methods for spam detection, such as the Naive Bayes classifier and tf-idf (term frequency – inverse document frequency). But these methods do not take the semantic information of the spam words or phrases into account, which leads to incomplete results.
Therefore, the requirement of the research is to take into full account the significance of the semantic information of the words within all comments posted, including the vector expression of a word and the vector distance between words. The technique of mining additional semantic features from words has been widely used in emotion classification and text classification, both of which have achieved good results.

This problem is addressed by detecting the spam comments posted on a social media site through a combination of methods based on a deep learning model and statistical analysis. The Self-Extensible Spam Dictionary employs the Skip-Gram deep learning model, and its construction is divided into three progressive stages:
(1) Subjectivity Judgment: It is used to find the semantic distinctions of words, dividing each word into either normal or spam;
(2) Category Judgment: It is used to assign a word or phrase from the comments to the AD or vulgar category;
(3) Weight Judgment: It is used to measure the extent of subjectivity and category, that is, the spam extent of a word or phrase in the AD or vulgar category.
The Proportion-Weight Filter Model uses statistical analysis to select the proportion and total weight of spam words contained in a single comment as the two key factors in deciding whether the comment is spam or not. This model addresses the problem that the distribution of spam words in short and long comments is different. If we only detect spam comments by the factor of spam-weight, then the longer a normal comment is, the greater the likelihood of matching more low-weighted words in the spam dictionary. This can cause the spam-weight of the normal comment to be high, which can in turn reduce the Precision Rate.
Moreover, shorter spam comments with just one or two middle-weighted spam words will have a total weight below the threshold and thus be identified as normal, which in turn reduces the Recall Rate. Through a combination of the two critical factors, we can obtain more accurate results for detecting spam comments.

The improvement in the system is achieved by using the Natural Language Processing (NLP) technique.

NLP Technique:
NLP belongs to the CS taxonomy as a child of Artificial Intelligence (AI). Natural Language Processing is a technique used for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. “Naturally occurring texts” can be of any language, mode, genre, etc. The texts can be oral or written and must be in a language used by humans to communicate with one another.
Significantly, the text being analyzed should not be specifically constructed for the purpose of the analysis, but rather it should be collected from actual usage. In simple terms, NLP is the use of computers to process written and spoken language for some useful purpose: to translate languages, to get information from the web on text data banks so as to answer questions, to carry on conversations with machines. Natural language processing approaches fall roughly into four categories: symbolic, statistical, connectionist, and hybrid.
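As a rough illustration of the Proportion-Weight Filter Model described in the Introduction, the sketch below scores a comment by the two factors named there: the proportion of its words found in a spam dictionary and the total weight of those words. The dictionary contents, the three thresholds, and the rule for combining the two factors are illustrative assumptions, since the text does not specify them.

```python
# Hypothetical spam dictionary: word -> spam weight (values are made up).
SPAM_WEIGHTS = {"free": 0.3, "click": 0.4, "winner": 0.6, "viagra": 0.9}


def spam_factors(comment, weights=SPAM_WEIGHTS):
    """Return (proportion of spam words, total spam weight) for a comment."""
    words = comment.lower().split()  # whitespace tokenization only
    if not words:
        return 0.0, 0.0
    hits = [w for w in words if w in weights]
    proportion = len(hits) / len(words)
    total_weight = sum(weights[w] for w in hits)
    return proportion, total_weight


def is_spam(comment, high_proportion=0.5, low_proportion=0.25, min_weight=1.0):
    """Assumed decision rule combining both factors.

    - A high proportion alone flags the comment, so short spam with a small
      total weight is still caught (helps recall).
    - A high total weight only counts when the proportion is not tiny, so a
      long normal comment that happens to contain a few low-weighted
      dictionary words is not flagged (helps precision).
    """
    proportion, total_weight = spam_factors(comment)
    if proportion >= high_proportion:
        return True
    return total_weight >= min_weight and proportion >= low_proportion


short_spam = "free click"
long_normal = ("this free tutorial shows where to click to download "
               "the free report about a contest winner last year")
print(is_spam(short_spam), is_spam(long_normal))
```

With a weight-only rule the long comment in this example would be flagged, because its matched words add up to a total weight above `min_weight`; the proportion factor is what keeps it in the normal class.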

Miss Rohini D.Warkar, IJECS Volume 7 Issue 1 January 2018 Page No. 23386-23489 Page 23486
Figure 1: Design of NLP Processing

2. Literature Review

1. M. Cataldi, C. Schifanella and L. Di Caro proposed two measures: term frequency, used to calculate the nutrition of each word, and a page rank measure. Bursty keywords are then obtained using the nutrition trend, and a graph-based approach over the bursty keywords generates the topic boundary. Related work also analyzes how numeric features such as new lines, punctuation marks, links, white spaces, capital letters and vulgar words help eliminate incoherent and offensive comments, and how the topic of the comment influences the detection of spam comments.
2. Sayyadi, Maykov and Hurst used a graph approach in which keywords are clustered by matching pairs. They used a community detection algorithm on a graph whose nodes are clustered, and topic extraction is carried out by identifying documents with similar terms. A related approach detects emerging topics on Twitter in real time: it formalizes the keyword life cycle using a novel aging theory intended to mine terms that occur frequently in the specified time interval and are relatively rare in the past, and it studies the social relationships in the user network in order to quantify the importance of each analyzed content.
3. Leskovec, Backstrom and Kleinberg used a graph of short phrases in which phrases are connected by edges. They developed a framework for tracking short, distinctive phrases that travel relatively intact through on-line text and presented scalable algorithms for identifying and clustering textual variants of such phrases. The algorithms scale to a collection of 90 million articles, which makes that study one of the largest analyses of on-line news in terms of data scale. The work offers some of the first quantitative analyses of the global news cycle and the dynamics of information propagation between mainstream and social media.
4. David M. Blei, Andrew Y. Ng and Michael I. Jordan [6] used Latent Dirichlet Allocation (LDA), a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document set.

3. Existing System
In the existing system, structured methods are used to identify spam comments, such as the Naive Bayes classifier and tf-idf (term frequency – inverse document frequency). They do not take the semantic information of the spam words or phrases into account, which leads to incomplete results. Therefore, our research attempts to take into full account the significance of the semantic information of the words within all comments posted, including the vector expression of a word and the vector distance between words. The technique of mining additional semantic features from words has been widely used in emotion classification and text classification, both of which have achieved good results.

Disadvantages:
1. Not able to detect non-English words and spam messages which are encrypted.
2. Incomplete spam selection, as semantic analysis is not considered.

4. Proposed System

There are two tasks in the proposed work: identifying current topics and controlling spamming. The first step is pre-processing, which is important for mining or filtering the data. Pre-processing has been completed, followed by spam control, which is part of feature extraction. The bisecting K-means clustering algorithm is used here, because clustering is an important step for quality results. Natural language processing (NLP) techniques are thus used for pre-processing, clustering, etc.

Figure 2: Design of system

1. Preprocessing
2. Feature Extraction

Preprocessing
Preprocessing consists of filtering the data. Natural language processing concepts are used for preprocessing of data.
Lexical analysis
The lexical analyzer converts sentences into words and then words into characters.
Elimination of punctuation
Remove punctuation marks such as commas, full stops, etc.
Elimination of symbols
Remove symbols such as @, #, etc.
Elimination of stop words
Remove words such as in, of, the, is, and, for, etc.
Feature Extraction
Feature extraction is used for dimensionality reduction. Before classification, the feature space needs to be reduced.
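The preprocessing steps of the proposed system (lexical analysis, then elimination of punctuation, symbols and stop words) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library: the stop-word list is a small hypothetical sample, the function name `preprocess` is not from the paper, and the character-level split mentioned under lexical analysis is omitted.

```python
import string

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"in", "of", "the", "is", "and", "for", "a", "to"}

# string.punctuation covers punctuation marks (, .) as well as symbols (@ #).
# Each such character is mapped to a space so adjacent words stay separated.
_STRIP_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(comment):
    """Tokenize a comment and drop punctuation, symbols and stop words."""
    cleaned = comment.lower().translate(_STRIP_TABLE)  # remove punctuation/symbols
    tokens = cleaned.split()                           # lexical analysis: text -> words
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word elimination


print(preprocess("Check the #deal, @user: it is FREE for you!"))
```

Lower-casing is an added normalization choice here, so that "FREE" and "free" map to the same token before any dictionary lookup.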

Spam control is itself a feature reduction task. Therefore, slang word reduction is performed for spam control.
Removal of slang words
For spam control, a dictionary of slang words is created. Whenever a user uses a slang word in a post or comment, the word is matched against the words in the dictionary and replaced with stars (****).
Removal of repeated comments
When extracting data from a document, repeated comments are removed to avoid ambiguity in the result.

5. Algorithm
Natural Language Processing (NLP) is used to process the data for spam detection, and an iterative algorithm is used for constructing the spam dictionary.

NLP Algorithm:

It contains a flow of different phases, given below:

Figure 3: NLP Algorithm

Iterative Algorithm

Iterative algorithm for constructing the Domain Spam (DS) Dictionary
1. Procedure: Construct (Spam words)
2. Input: spam words of the Basic vulgar and AD dictionary, or those added in the result of the previous iteration.
3. For each input word s, acquire the 15 most similar words by comparing the semantic similarity between them;
4. Add each such word s' into the candidate spam dictionary;
5. Delete words from the candidate dictionary if they exist in the Basic vulgar and AD dictionary or the DS dictionary;
6. Calculate the avg-weight of same words in the candidate spam dictionary.
7. For each word s'' in the candidate spam dictionary, acquire the 15 most similar words for s'' by comparing the semantic similarity between them;
8. If more than 4 of these words exist in the Basic vulgar and AD dictionary, then add word s'' into the DS dictionary; else drop it;
9. Empty the candidate dictionary.
10. Output: the newly added spam words in this iteration.

Advantages:

1. As the system uses NLP, it detects non-English words and spam messages which are encrypted.
2. The semantic analysis and the Proportion-Weight Filter technique perform the complete spam selection.

6. Conclusion
The main aims of the proposed work are to detect the current topics of the real world and to control the spamming created by spammers. Pre-processing has been done. One account has been created for showing results. Feature extraction, which is part of spam control, has also been done.

References
1. Chenwei Liu, Jiawei Wang, Kai Lei, “Detecting Spam Comments Posted in Micro-Blogs Using the Self-Extensible Spam Dictionary”, IEEE ICC 2016 SAC Social Networking.
2. Cristina Radulescu, Mihaela Dinsoreanu, and Rodica Potolea, “Identification of spam comments using natural language processing techniques”, In Intelligent Computer Communication and Processing (ICCP), 2014 IEEE International Conference on, pages 29–35. IEEE, 2014.
3. M. Cataldi, L. Di Caro and C. Schifanella, “Emerging topic detection on Twitter based on temporal and social terms evaluation,” in Proc. MDMKDD: 10th Int. Workshop Multimedia Data Mining, New York, NY, USA, 2010, pp. 4:1–4:10, ACM.
4. Sayyadi, M. Hurst and A. Maykov, “Event detection and tracking in social streams,” in ICWSM, E. Adar, M. Hurst, T. Finin, N. S. Glance, N. Nicolov, and B. L. Tseng, Eds. Palo Alto, CA, USA: AAAI Press, 2009.
5. J. Leskovec, L. Backstrom, and J. Kleinberg, “Meme-tracking and the dynamics of the news cycle,” in Proc. KDD: 15th ACM Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2009, pp. 497–506.
6. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
7. Joseph Lilleberg, Yun Zhu, and Yanqing Zhang, “Support vector machines and word2vec for text classification with semantic features”, In Cognitive Informatics & Cognitive Computing (ICCI*CC), IEEE 14th International Conference on, pages 136–140, 2015.
8. Bai Xue, Chen Fu, and Zhan Shaobin, “A study on sentiment computing and classification of sina weibo with word2vec”, In Big Data (BigData Congress), IEEE International Congress on, pages 358–363. IEEE, 2014.
9. Huiyu Wang, Kai Lei, and Kuai Xu, “Profiling the followers of the most influential and verified users on sina weibo”, In Communications (ICC), IEEE International Conference on, pages 1158–1163. IEEE, 2015.
10. Ala’ M. Al-Zoub, Ja’far Alqatawna, Hossam Faris, “Spam Profile Detection in Social Networks Based on Public Features”, 8th International Conference on Information and Communication Systems (ICICS), 2017.
11. Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon, “What is Twitter, a Social Network or a News Media?”, In Proceedings of the 19th international conference on World wide web, pages 591–600. ACM, 2010.
12. Rohit Giyanani, Mukti Desai, “Spam Detection using Natural Language Processing”, IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 5, Ver. IV (Sep – Oct. 2014), PP 116-119.
13. Cristina Rădulescu, Mihaela Dinsoreanu, Rodica Potolea, “Identification of Spam Comments using Natural Language Processing Techniques”, IEEE 2014.
14. Zengcai Su, Hua Xu, Dongwen Zhang and Yunfeng Xu, “Chinese Sentiment Classification Using A Neural Network Tool - Word2vec”, In Multisensor Fusion and Information Integration for Intelligent Systems (MFI), International Conference on, pages 1–6. IEEE, 2014.
15. https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_natural_language_processing.htm
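The iterative dictionary-construction procedure of Section 5 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that word vectors are already available as a plain dict (for example, exported from a Skip-Gram model), that cosine similarity serves as the semantic similarity measure, and that a candidate's weight is its average similarity to the seed words that proposed it. The function and parameter names are hypothetical; the defaults `top_k=15` and `min_hits=4` follow the numbers given in the procedure.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_k):
    """Return up to top_k (neighbour, similarity) pairs, most similar first."""
    scored = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def construct(seed_words, basic_dict, ds_dict, vectors, top_k=15, min_hits=4):
    """One iteration of the dictionary construction (steps 2 to 10).

    seed_words: basic dictionary words, or words added by the previous iteration.
    basic_dict: set of Basic vulgar and AD words.
    ds_dict:    dict word -> weight (Domain Spam dictionary), updated in place.
    Returns the list of words newly added in this iteration.
    """
    candidates = defaultdict(list)
    for s in seed_words:                                  # steps 3-4
        if s not in vectors:
            continue
        for neighbour, sim in most_similar(s, vectors, top_k):
            candidates[neighbour].append(sim)
    for word in list(candidates):                         # step 5
        if word in basic_dict or word in ds_dict:
            del candidates[word]
    avg_weight = {w: sum(sims) / len(sims)                # step 6 (assumed weight)
                  for w, sims in candidates.items()}
    added = []
    for word, weight in avg_weight.items():               # steps 7-8
        neighbours = [n for n, _ in most_similar(word, vectors, top_k)]
        hits = sum(1 for n in neighbours if n in basic_dict)
        if hits > min_hits:
            ds_dict[word] = weight
            added.append(word)
    candidates.clear()                                    # step 9
    return added                                          # step 10


# Toy 2-D vectors (made up) with a reduced top_k / min_hits for illustration.
toy_vectors = {
    "spam1": [1.0, 0.0], "spam2": [0.9, 0.1], "promo": [0.95, 0.05],
    "hello": [0.0, 1.0], "world": [0.1, 0.9],
}
toy_basic = {"spam1", "spam2"}
toy_ds = {}
toy_added = construct(toy_basic, toy_basic, toy_ds, toy_vectors, top_k=2, min_hits=1)
print(toy_added, toy_ds)
```

Calling `construct` again with the returned words as `seed_words` gives the next iteration; the process can stop once an iteration adds no new words.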
