Multi-Purpose Chat Bot: Team Formation Team Members
Team Formation
Team Members: -
1. Kartik Rawal (16BCE0368).
2. Aman Sandilya (16BCB0030).
3. Shubham Chaudhary (16BCE0529).
Title: -
Multi-Purpose Chat Bot
Abstract: -
Communication has become stronger because of the exponential increase in the use of social
media over the last few years. People use it to communicate with friends, find new friends,
share important activities of their lives, and so on. Among the various kinds of social media,
the most significant are social networking sites and mobile networks. Because of their growing
popularity and deep reach, these media are flooded with a huge volume of spam messages. In
this paper, we discuss machine learning techniques for detecting spam in short text messages.
The performance of different classifiers is evaluated with metrics such as precision, recall
and accuracy. The project delivers a classifier that decides whether a given message is spam
or not, text analysis, and a fully functional chat bot.
Keywords - Spam Detection, Machine learning, Traditional classifiers, Twitter spam,
SMS spam, Text Classification
Introduction: -
Spam refers to irrelevant or unsolicited messages sent over a network with the sole aim of
attracting the attention of a large number of people. Spam may or may not be harmful to the
intended recipient. It can range from a merely funny text message to a destructive virus that
may corrupt the whole machine, or code written to steal all the data on your machine. At
first, spam spread through email, but with the increase in the use of the Internet and the
advent of social media, it began to spread like an epidemic. According to a technical report
by Ferris Research Group, these kinds of mails occupy a chunk of bandwidth and storage space,
with users wasting their valuable time and energy in avoiding them. This has resulted in
financial strain on organizations, increased storage requirements, the spreading of offensive
material such as explicit content and, above all, it violates the privacy of the people at the
receiving end.
Other vehicles of spam are social networking sites, spam blogs, and so on, which are used to
send and receive messages, and SMS, which carries spam over mobile networks. Increasing
awareness of email spam has reduced its return rate drastically, so traditional spammers are
now using mobile and Internet technologies as a spam medium. With the widespread availability
of smartphones, the volume of data exchanged over the network has increased. SMS is a very
cost-effective method for exchanging messages, and so it can be used to reach users
individually. It has a higher response rate than email spam. Apart from email and SMS, social
networking services like Twitter and Facebook and instant messengers like WhatsApp also
contribute a significant share of the spam sent over the network.
Spam detection is a tedious task, and without an automatic measure for filtering messages,
the task of spam filtering falls to the person at the receiving end. One measure for spam
protection is to use ad hoc classifiers. These classifiers apply to a particular kind of spam
or restrict spam messages from a particular source. An example of this kind of classifier is
an email client that blocks incoming messages from a particular source because it knows that
the source is on a blacklist.
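A blacklist check of the kind described above can be sketched in a few lines of Python. This is only an illustration: the addresses, the `BLACKLIST` set and the `is_blocked` helper are hypothetical and do not come from the project.

```python
# Hypothetical blacklist of sender addresses; not taken from the project.
BLACKLIST = {"promo@example.com", "win@example.net"}


def is_blocked(sender, blacklist=BLACKLIST):
    """Return True when the sender address is on the blacklist."""
    return sender.strip().lower() in blacklist


messages = [
    ("promo@example.com", "You won a prize"),
    ("friend@example.org", "Lunch tomorrow?"),
]
# Keep only the messages whose sender is not blacklisted.
inbox = [(sender, text) for sender, text in messages if not is_blocked(sender)]
print(inbox)
```

Such a filter only catches sources it already knows, which is why the statistical classifiers discussed next are needed for unseen spam.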
The best-known and simplest statistical classifier is Naïve Bayes [13]. It is the most widely
used and studied. It assumes that the features extracted from the word vector are independent
of one another [14, 15]. Much of the work in the area of spam detection uses Naïve Bayes.
Yang [16] proposed a Naïve Bayes classifier ensemble based on bagging, which improved the
accuracy of the classifier. Kim [17] experimented with different numbers of features for spam
classification using the same algorithm. Androutsopoulos [18] also compared Naïve Bayes with
keyword-based spam filtering on a social bookmarking system and concluded that Naïve Bayes
performs better of the two. Almeida [19] used different techniques such as document frequency,
information gain, and so on for term selection and used them with four different versions of
Naïve Bayes for spam filtering. He concluded that Boolean attributes perform better than the
others and that MV Bernoulli performs best with this technique. Apart from Naïve Bayes,
another popular traditional classifier is the Support Vector Machine (SVM) [20]. Like Naïve
Bayes, SVM is also used for detecting spam in various social media such as Twitter [21], blogs
[22], and so on. Several variations have been introduced to further improve the performance of
the SVM classifier. For example, Wang [23] proposed the GA-SVM algorithm, in which a genetic
algorithm is used for feature selection and SVM for the classification of spam, and its
performance was better than that of SVM alone. Tseng [24] created an algorithm that gave
incremental support to SVM by extracting features from the users in the network.
This model proved effective for the detection of spam in email. Effective spam detection is
done with the help of temporal position in the multi-dimensional space. Besides SVM, another
useful classifier is kNN [25]. It has also given good results in the area of spam detection.
Many other researchers have used kNN for spam detection in various applications [26, 27, 28].
Artificial Neural Networks (ANN) have also shown promising results in the area of spam
detection. Sabri [29] used an ANN for spam detection in which the useless input layers could
be replaced over time with useful ones. Silva [30] compared various kinds of ANN such as MLP,
SOM, the Levenberg-Marquardt algorithm and RBF for content-based spam detection and concluded
that some of them have high potential. Ensemble methods have proven their ability as
classifiers in the field of spam detection. One of the highly efficient classifiers is random
forest. DeBarr [31] used clustering along with Random Forest for spam classification. Several
researchers have compared the classifiers discussed above in the literature [32, 33, 34].
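The independence assumption behind Naïve Bayes mentioned above can be illustrated with a minimal from-scratch sketch: the score of a class is the log prior plus the sum of per-word log likelihoods, each word treated independently. The training sentences are made up, and add-one smoothing is an assumption of this sketch rather than something stated in the cited work.

```python
import math
from collections import Counter

# Toy training data, made up for illustration.
train = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(word_counts["spam"]) | set(word_counts["ham"])


def log_score(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score


def classify(text):
    """Pick the label with the higher log score."""
    return max(("spam", "ham"), key=lambda label: log_score(text, label))


print(classify("claim your free prize"))
print(classify("lunch at noon"))
```

Because each word contributes its own term to the sum, training reduces to counting words per class, which is why the classifier is simple and fast.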
Modules: -
1. Machine Learning: -
The machine learning part of the project involved the creation of an application named BotSpot,
which is based on a machine learning library written in Java called Deeplearning4j (DL4J)
(Gibson and Nicholson, 2017). A basic classification network was developed using the machine
learning theory of Multilayer Perceptrons (MLP). DL4J allows for the configuration and building
of an MLP; within the configuration, aspects such as the learning rate, activation function and
number of hidden layers are defined.
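To make the configuration items listed above concrete, here is a small pure-Python sketch of an MLP whose learning rate, activation function and hidden layers are set in one place. It is not the DL4J API and not the BotSpot code; `MLPConfig`, `init_weights` and `forward` are hypothetical names, and only the forward pass is shown (the learning rate is stored but would only be used during training).

```python
import math
import random
from dataclasses import dataclass


@dataclass
class MLPConfig:
    # Hypothetical names that mirror the settings listed above, not DL4J.
    n_inputs: int
    hidden_layers: tuple  # number of nodes in each hidden layer
    n_outputs: int
    learning_rate: float = 0.05
    activation: str = "sigmoid"


ACTIVATIONS = {
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "tanh": math.tanh,
    "relu": lambda z: max(0.0, z),
}


def init_weights(cfg, seed=0):
    """One matrix per layer; each row holds a node's weights plus a bias."""
    rng = random.Random(seed)
    sizes = [cfg.n_inputs, *cfg.hidden_layers, cfg.n_outputs]
    return [
        [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def forward(cfg, weights, x):
    """Propagate one input vector through every layer."""
    act = ACTIVATIONS[cfg.activation]
    for layer in weights:
        # Append 1.0 so the last weight of each node acts as its bias.
        x = [act(sum(w * v for w, v in zip(node, x + [1.0]))) for node in layer]
    return x


cfg = MLPConfig(n_inputs=3, hidden_layers=(4,), n_outputs=2)
weights = init_weights(cfg)
print(forward(cfg, weights, [0.2, 0.5, 0.1]))
```

Keeping the settings in a single configuration object makes it easy to rerun the same network with a different learning rate or number of hidden nodes.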
From the pre-processor module, a data set was created, and from this data set a training set
and a test set were made. Creating the data sets involved a manual process of reviewing the
data and checking the chat bot account to determine whether it was a bot account or not. Each
record in the training or test set was then given a label of either 'HUMAN' or 'BOT'. The
neural network model is then given the training data, which it analyses and learns from based
on the labels. It then uses the test data set to evaluate itself, giving results. The initial
training of the neural network used a training set comprising bot accounts. The initial neural
network was not very accurate and failed to classify any bot accounts.
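The labelling, splitting and evaluation steps described above can be sketched as follows. The accounts are made up, the 67/33 split is an assumption, and `evaluate` is a hypothetical helper; the predictor that never answers 'BOT' stands in for a network that fails to classify any bot account.

```python
import random

# Made-up labelled accounts: (account_id, label).
accounts = [(i, "BOT" if i % 4 == 0 else "HUMAN") for i in range(40)]

rng = random.Random(53)
rng.shuffle(accounts)
cut = int(len(accounts) * 0.67)  # roughly two thirds for training
train_set, test_set = accounts[:cut], accounts[cut:]


def evaluate(predict, data, positive="BOT"):
    """Return accuracy, precision and recall for the positive label."""
    tp = fp = fn = tn = 0
    for item, label in data:
        guess = predict(item)
        if guess == positive and label == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif label == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(data)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def always_human(item):
    """A model that never predicts BOT."""
    return "HUMAN"


print(evaluate(always_human, test_set))
```

A model like this can still reach a fair accuracy when most accounts are human, while its recall for bots is zero, so accuracy alone does not show the failure.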
2. Configuring the neural network: -
After the results of the initial training, it was clear that the neural network was not configured
correctly, as it was completely failing to identify bot accounts within the test data. To improve
the accuracy of the neural network, some of the variables were changed. After each change, the newly
configured neural network was tested and the results analysed; the results of which can be
found. The three variables that were changed in the configuration were the learning rate, the number
of passes through the data and the number of hidden nodes. From the results of the tests it was found
that a small increase in the number of hidden nodes had no effect on the accuracy of the neural
network; further tests would need to be done to confirm that this was conclusive. The learning rate
was the variable that had the most significant effect on the accuracy of the neural network. When the
learning rate was increased to 0.5, the accuracy of the neural network decreased to just below 18%,
with the network predicting that most of the test data were bots. The learning rate was then lowered
to 0.05, where the accuracy increased to around 50%.
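Why a large learning rate can hurt may be seen on a toy problem. The sketch below runs gradient descent on a one-variable loss chosen for illustration, so that a rate of 0.5 overshoots the minimum while 0.05 converges; it is not the project's network, and only the two rate values are taken from the text above.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimise f(w) = 2.5 * w**2, whose gradient is 5 * w."""
    for _ in range(steps):
        w -= learning_rate * 5 * w
    return w


for lr in (0.5, 0.05):
    # Distance from the minimum at w = 0 after 20 steps.
    print(lr, abs(gradient_descent(lr)))
```

With the larger rate every step jumps past the minimum and lands further away, so the distance grows; with the smaller rate each step shrinks it.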
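import string

import gensim
import nltk
import pandas as pd
from gensim.corpora import Dictionary
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def filter(x):
    # Load the SMS data; column 0 holds the label, column 1 the message text.
    df = pd.read_csv("spam.csv", encoding='latin-1')
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    X_train, X_test, y_train, y_test = train_test_split(
        df.iloc[:, 1], df.iloc[:, 0], test_size=0.33, random_state=53)
    # Bag-of-words counts, fitted on the training messages only.
    count_vectorizer = CountVectorizer(stop_words='english')
    count_train = count_vectorizer.fit_transform(X_train)
    count_test = count_vectorizer.transform([x])
    nb_classifier = MultinomialNB()
    nb_classifier.fit(count_train, y_train)
    pred = nb_classifier.predict(count_test)
    if pred[0] == "ham":
        print("This message is not spam!")
    elif pred[0] == "spam":
        print("This message is spam !")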
lemmatizer = WordNetLemmatizer()


def print_topics(res):
    # res is a list of (topic_id, '0.1*"word" + 0.05*"other"') pairs.
    topics = []
    for i in res:
        k = i[1].split("+")
        for l in k:
            m = l.split("*")
            c = m[1]
            c = c.replace('"', '')
            c = c.replace(' ', '')
            # Keep only the terms that are tagged as nouns.
            tagged_sent = word_tokenize(c)
            tagged_sent = nltk.pos_tag(tagged_sent)
            if tagged_sent and tagged_sent[0][1] in ('NNP', 'NN'):
                topics.append(c)
    return list(set(topics))
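def rec_topic(x):
    # Split the paragraph into sentences and strip punctuation.
    my_doc = x.split(".")
    my_doc_fi = []
    exclude = set(string.punctuation)
    for doc in my_doc:
        s = ''.join(ch for ch in doc if ch not in exclude)
        my_doc_fi.append(s)
    # Assumed step: stop_tok holds the tokens of each sentence with English
    # stop words removed.
    stop_words = set(stopwords.words('english'))
    stop_tok = [[t for t in word_tokenize(doc) if t.lower() not in stop_words]
                for doc in my_doc_fi]
    lemma = []
    for doc in stop_tok:
        temp = []
        for t in doc:
            temp.append(lemmatizer.lemmatize(t))
        lemma.append(temp)
    # Fit a three-topic LDA model and keep the top three words per topic.
    dic = Dictionary(lemma)
    corpus = [dic.doc2bow(i) for i in lemma]
    Lda = gensim.models.ldamodel.LdaModel
    ldamodel = Lda(corpus, num_topics=3, id2word=dic, passes=50)
    res = ldamodel.print_topics(num_topics=3, num_words=3)
    r = print_topics(res)
    return r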
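from chatterbot import ChatBot

# Assumption: the chat bot is a ChatterBot instance; the listing does not
# show how it is created or trained.
chatbot = ChatBot("computer")

print("\n\n\n\n\n\n\n\n\n")
print("computer: Hi !what is your name?")
y = input()
y = y.split()
name = ""
for i in y:
    tagged_sent = word_tokenize(i)
    tagged_sent = nltk.pos_tag(tagged_sent)
    # Treat proper nouns that are not stop words as the user's name.
    if tagged_sent[0][1] == 'NNP' and i not in stopwords.words('english'):
        name = name + i
while True:
    print(name, end=": ")
    y = input()
    # Menu: 1 = topic extraction, 2 = spam check, 3 = quit; anything else
    # is passed to the chat bot.
    choice = y.strip()
    if choice == "1":
        state = input("enter the paragraph")
        print("The important words are:")
        for i in rec_topic(state):
            print(i)
    elif choice == "2":
        state = input("enter the paragraph")
        filter(state)
    elif choice == "3":
        print("bye " + name + "..have a great day!")
        break
    else:
        response = chatbot.get_response(y)
        print("computer: ", end=" ")
        print(response)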
Datasets: -
Data is the essential ingredient before we can develop any meaningful algorithm. Knowing
where to get your data can be very handy, especially when you are just a beginner.
Below are a few well-known repositories where you can easily get thousands of data sets
for free:
2. Kaggle datasets
3. AWS datasets
The email spam data set is distributed by SpamAssassin. The data fall into a few categories;
the readme.html gives more background information on the data.
In short, there are two types of data in this repository: ham (non-spam) and spam.
Furthermore, the ham data is split into easy and hard, which means that some non-spam data is
very similar to spam data. This might make it difficult for our system to make a decision.
https://www.kaggle.com/uciml/sms-spam-collection-dataset
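Before training, it helps to check how many ham and spam rows a labelled file contains. The sketch below counts labels with the standard csv module on a few made-up rows; the column names `v1` (label) and `v2` (text) are an assumption about the file layout, since the listing above only relies on the label being the first column and the text the second.

```python
import csv
import io
from collections import Counter

# Made-up rows in the assumed layout: label column "v1", text column "v2".
sample = io.StringIO(
    "v1,v2\n"
    "ham,See you at lunch\n"
    "spam,WIN a free prize now\n"
    "ham,\"Ok, call me later\"\n"
)

rows = list(csv.DictReader(sample))
counts = Counter(row["v1"] for row in rows)
print(counts)
```

For a real file, `sample` would be replaced by something like `open("spam.csv", encoding="latin-1", newline="")`.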
Outcome of Project: -
In this project we are developing a prototype product for spam detection using text analysis.
The product will act as a chat bot that also detects spam messages using text analysis. We
are using Natural Language Processing concepts to implement our project, together with the
algorithms mentioned above.
References: -
14. An empirical study of the naive Bayes classifier In IJCAI 2001 workshop on
empirical methods in artificial intelligence, Vol.3, Issue.22, pp.41-46, 2001.
16. Kim, Naive Bayes classifier learning with feature selection for spam detection
in social bookmarking In Proceedings of European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases
(ECML/PKDD), US, pp.32, 2008.
21. P. Kolari, T. Finin, A. Joshi, SVMs for the Blogosphere: Blog
Identification and Splog Detection In AAAI Spring Symposium: Computational
Approaches to Analyzing Weblogs, Baltimore, pp.92-99, March 2006.
22. H.B. Wang, Y. Yu, SVM classifier incorporating feature selection using GA for
spam detection In International Conference on Embedded and Ubiquitous
Computing, Japan, pp.1147-1154, 2005.
23. Incremental SVM model for spam detection on dynamic email social networks
In Int. Conf. on Computational Science and Engineering, Vancouver, pp.128-
135, 2009.
24. An assessment of case base reasoning for short text message classification In N.
Creaney (Ed.), Proceedings of 16th Irish Conference on Artificial Intelligence and
Cognitive Science, Castlebar, pp.257-266, 2005.
25. A. Harisinghaney, A. Dixit, S. Gupta, Text and image based spam email
classification using KNN Naïve Bayes and Reverse DBSCAN algorithm In
Optimization Reliability and Information Technology (ICROIT), India,
pp.153-155, 2014.
26. Graph-based KNN Algorithm for Spam SMS Detection Journal of Universal
Computer Science, Vol.19, Issue.16, pp.2404-2419, 2013.
27. Using cellular automata for improving knn based spam filtering International
Arab Journal of Information Technology, Vol.11, Issue.4, pp.345-353, 2014.
29. MR. Nagpure, SS. Mesakar, SR. Raut and Vanita P.Lonkar, "Image Retrieval
System with Interactive Genetic Algorithm Using Distance", International Journal
of Computer Sciences and Engineering, Vol.2, Issue.12, pp.109-113, 2014.
30. Spam detection using clustering, random forests, and active learning In
Sixth Conference on Email and Anti-Spam. Mountain View, California, pp.1-6,
2009.
31. A. Karami, Improving static SMS spam detection by using new content-based
features In 20th Americas Conference on Information systems (AMCIS),
Savannah, pp.1-9, 2014.
33. Y. Zhang, S. Wang, P. Phillips G.Binary PSO with mutation operator for feature
selection using decision tree applied to spam detection Knowledge-Based
Systems, Vol.64, Issue.3, pp.22-31, 2014.