
Multi-Purpose Chat Bot: Team Formation Team Members

This document describes a student project to build a multi-purpose chat bot using machine learning techniques. The project involves developing modules for machine learning, designing the neural network, and assessing and labeling data. The machine learning module uses a multi-layer perceptron neural network trained on a dataset of human and bot accounts to classify accounts. The neural network is optimized by adjusting the learning rate and number of hidden layers. An assessment module is used to analyze new unlabeled data using the trained neural network model.

Fall Semester: 2019-2020

Course Code: CSE3999

Course Title: Technical Answers for Real World Problems (TARP)

Class No: VL2019201002871


Slot: TD1

Team Formation
Team Members: -
1. Kartik Rawal (16BCE0368).
2. Aman Sandilya (16BCB0030).
3. Shubham Chaudhary (16BCE0529).

Title: -

• Multi-Purpose Chat Bot

Abstract: -
Communication has grown stronger with the exponential increase in the use of social media over
the last few years. People use it to communicate with friends, find new friends, share
significant events in their lives, and so on. Among the various kinds of social media, the most
significant are social networking sites and mobile networks. Because of their growing
popularity and deep reach, these media are flooded with a huge volume of spam messages. In
this paper, we discuss machine learning techniques for detecting spam in short text messages.
The performance of different classifiers is evaluated with metrics such as precision, recall
and accuracy. The outcome is a fully functional chat bot that reports whether a given message
is spam or not and performs text analysis.
Keywords - Spam Detection, Machine Learning, Traditional Classifiers, Twitter Spam,
SMS Spam, Text Classification
Introduction: -
Spam refers to irrelevant or unsolicited messages sent over a network with the sole aim of
attracting the attention of a large number of people. Spam may or may not be harmful to the
intended recipient. It can range from a merely funny text message to a destructive virus that
corrupts the whole machine, or code written to steal all the data on it. Spam first spread
through email, but with the growth of Internet use and the advent of social media it began to
spread like an epidemic. According to a technical report by Ferris Research Group, such mail
occupies a chunk of bandwidth and storage space, and users waste valuable time and energy
avoiding it. This has put financial strain on organisations, increased storage requirements,
spread offensive material such as explicit content and, above all, violated the privacy of the
people at the receiving end.
Other vehicles of spam include social networking sites and spam blogs, which are used to send
and receive messages, and SMS, which carries spam over mobile networks. Growing awareness of
email spam has reduced its response rate drastically, so traditional spammers now use mobile
and Internet technologies as a spam medium. With the widespread availability of smartphones,
the volume of data exchanged over the network has increased. SMS is a very practical way of
exchanging messages, so it can be used to reach users individually, and it has a higher
response rate than email spam. Apart from email and SMS, social networks such as Twitter and
Facebook and instant messengers such as WhatsApp also contribute a significant share of the
spam on the network.
Spam detection is a tedious task, and without an automatic filtering measure the job of
separating spam falls to the person at the receiving end. One protective measure is to use
ad hoc classifiers. These classifiers apply to one specific kind of spam or restrict spam
messages from one specific source. An example is an email client that blocks incoming
messages from a source it knows to be on a blacklist.
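The blacklist idea described above can be sketched in a few lines. This is only an illustration: the addresses, the `BLACKLIST` set and the `is_blocked` helper are hypothetical names, not part of the project.

```python
# Hypothetical blacklist of sender addresses; a real mail client would load this from its settings.
BLACKLIST = {"offers@spam.example", "win@lottery.example"}

def is_blocked(sender, blacklist=BLACKLIST):
    """Return True when the sender is on the blacklist (case-insensitive match)."""
    return sender.strip().lower() in blacklist

# Made-up incoming messages: (sender, text).
incoming = [
    ("alice@mail.example", "Lunch tomorrow?"),
    ("Offers@Spam.example", "You won a prize!"),
]

# Keep only the messages whose sender is not blacklisted.
inbox = [(sender, text) for sender, text in incoming if not is_blocked(sender)]
print(inbox)
```

Such a filter only handles sources that are already known, which is why the statistical classifiers discussed next are needed for unseen spam.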
The best-known simple statistical classifier is Naïve Bayes [13]. It is the most widely used
and studied. It assumes that the features extracted from the word vector are independent of
one another [14, 15]. Much work has been done on spam detection using Naïve Bayes. Yang [16]
proposed a Naïve Bayes classifier ensemble based on bagging, which improved the accuracy of
the classifier. Kim [17] experimented with different numbers of features for spam
classification using the same algorithm. Androutsopoulos [18] compared Naïve Bayes with
keyword-based spam filtering on a social bookmarking system and concluded that Naïve Bayes
performs better of the two. Almeida [19] used several techniques such as document frequency
and information gain for term selection, and combined them with four different versions of
Naïve Bayes for spam filtering. He concluded that Boolean attributes perform better than
others and that MV Bernoulli performs best with this technique. Apart from Naïve Bayes,
another popular traditional classifier is the Support Vector Machine (SVM) [20]. Like Naïve
Bayes, SVM is used to detect spam on various social media such as Twitter [21] and blogs [22].
Several variations have been introduced to further improve the performance of the SVM
classifier. For example, Wang [23] proposed the GA-SVM algorithm, in which a genetic algorithm
is used for feature selection and SVM for spam classification; its performance was better
than plain SVM. Tseng [24] created an algorithm that provides incremental support to SVM by
extracting features from the users in the network.
This model proved effective for detecting spam in email. Functional spam detection is done
with the help of temporal position in a multi-dimensional space. Besides SVM, another
functional classifier is kNN [25], which has also given good results in spam detection. Many
other researchers have used kNN for spam detection in different applications [26, 27, 28].
Artificial Neural Networks (ANN) have also shown promising results in spam detection. Sabri
[29] used an ANN for spam detection in which useless input layers could be replaced over time
with useful ones. Silva [30] compared several kinds of ANN, such as MLP, SOM, the
Levenberg-Marquardt algorithm and RBF, for content-based spam detection and concluded that
some of them have high potential. Ensemble methods have also proven their ability as
classifiers in spam detection. One very efficient classifier is random forest. DeBarr [31]
used clustering along with Random Forest for spam classification. Several researchers have
compared the classifiers discussed above in the literature [32, 33, 34].

Modules: -
1. Machine Learning: -
The machine learning part of the project involved creating an application named BotSpot, which
is built on a machine learning library written in Java called Deeplearning4j (DL4J) (Gibson and
Nicholson, 2017). A basic classification network was built using the machine learning theory
of Multilayer Perceptrons (MLP). DL4J allows an MLP to be configured and run; within the
configuration, aspects such as the learning rate, activation function and number of hidden
layers are defined.
From the pre-processor module a data set was created, and from this data set a training set
and a test set were made. Creating the data sets involved a manual process of reviewing the
data and checking each account to decide whether it was a bot account or not. Each record in
the training or test set was then given a label of either 'HUMAN' or 'BOT'. The neural network
model is then given the training data, which it analyses and learns from based on the labels.
It then uses the test data set to evaluate itself and report results. The initial training of
the neural network used a training set containing bot accounts. The initial neural network
was not very accurate and failed to classify any bot accounts.
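A minimal sketch of turning hand-labelled records into a training set and a test set is given here. The records, the `split_labelled` helper, the 0.33 test fraction and the seed are assumptions chosen for illustration; the report does not state how BotSpot actually split its data.

```python
import random

def split_labelled(rows, test_fraction=0.33, seed=53):
    """Shuffle labelled rows reproducibly and split them into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

# Hypothetical hand-labelled records: (feature vector, label).
labelled = [([0.1 * i, 1.0 - 0.1 * i], "BOT" if i % 3 == 0 else "HUMAN") for i in range(9)]

train, test = split_labelled(labelled)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed keeps the split reproducible, so different network configurations can be compared on the same test rows.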
2. Designing the neural network: -
After the results of the initial training, it was clear that the neural network was not configured
correctly, as it completely failed to identify bot accounts in the test data. To improve the accuracy
of the neural network, some of the variables were changed. After each change the newly configured
neural network was tested and the results analysed. The three variables changed in the configuration
were the learning rate, the number of passes through the data and the number of hidden nodes. The
tests showed that a small increase in the number of hidden nodes had no effect on the accuracy of the
neural network, although further tests would be needed to confirm that this is conclusive. The learning
rate was the variable with the most significant effect on accuracy. When the learning rate was
increased to 0.5, the accuracy of the neural network dropped to just below 18%, with the network
predicting that most of the test data were bots. The learning rate was then lowered to 0.05, where
the accuracy rose to around 50%.
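Why a learning rate that is too large can hurt training may be easier to see on a toy problem. The sketch below is not the BotSpot network: it runs plain gradient descent on an assumed one-parameter loss, f(w) = 5 * (w - 3)**2, using the two learning rates mentioned above.

```python
def gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimise the toy loss f(w) = 5 * (w - 3)**2 and return the final w."""
    w = start
    for _ in range(steps):
        grad = 10.0 * (w - 3.0)      # derivative of 5 * (w - 3)**2
        w -= learning_rate * grad
    return w

for lr in (0.5, 0.05):
    w = gradient_descent(lr)
    print(f"learning rate {lr}: final w = {w:.4g}, distance from optimum = {abs(w - 3):.4g}")
```

On this loss each step multiplies the distance from the optimum by (1 - 10 * learning_rate), so 0.5 overshoots further every step while 0.05 halves the error. A real network has a far more complicated loss surface, so this only illustrates the general effect.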

3. Assessing and labelling the data: -

Inside BotSpot there is an evaluation class whose purpose is to allow unclassified data to be
analysed and evaluated using a neural network. The class reads in data produced by the
pre-processor, separates out the ID of each account and stores it for later use. The ID must
be split from the data because the neural network cannot process it. Once this is done, the
evaluator class reconstructs a previously trained neural network and uses it to evaluate the
data. The neural network makes two predictions for each row of data: a human percentage and a
bot percentage. The class then appends both predictions and the user ID back to the data and
outputs everything as a CSV file. The evaluated data is then passed to a labelling class. The
purpose of this class is to update the data in the Neo4j database with a 'Bot' or 'Human'
label based on the predictions made by the neural network. The class works by first
extracting the user ID and the two percentages given to the data by the neural network. The
first percentage is how confident the neural network is that the account is human, and the
second is how confident it is that the account is a bot.
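The evaluation flow described above (split off the ID, predict, re-attach, write CSV) can be sketched as follows. Everything here is an assumption made for illustration: `stub_predict` stands in for the trained network, the column names are invented, the rule "label as Bot when the bot percentage is higher" is a guess, and the Neo4j update is left out.

```python
import csv
import io

def stub_predict(features):
    """Hypothetical stand-in for the trained network: returns (human %, bot %)."""
    score = sum(features) / len(features)          # pretend score in [0, 1]
    bot = round(100.0 * min(max(score, 0.0), 1.0), 1)
    return round(100.0 - bot, 1), bot

def evaluate(rows, predict=stub_predict):
    """Split off the ID, predict on the features, then re-attach ID and predictions."""
    out = []
    for row in rows:
        account_id, features = row[0], [float(v) for v in row[1:]]
        human, bot = predict(features)
        label = "Bot" if bot > human else "Human"  # assumed labelling rule
        out.append([account_id, *features, human, bot, label])
    return out

raw = "id,f1,f2\nu1,0.9,0.8\nu2,0.1,0.2\n"          # made-up pre-processor output
reader = csv.reader(io.StringIO(raw))
header = next(reader)
results = evaluate(reader)

# Write the evaluated rows back out as CSV (to memory here, to a file in practice).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header + ["human_pct", "bot_pct", "label"])
writer.writerows(results)
print(buffer.getvalue())
```

Keeping the ID outside the feature vector means the model only ever sees numeric inputs, while the output rows can still be matched back to accounts.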
Analysis of modules as per Project: -
1. Naive Bayes: -
One way a spam message can be classified is with a Naive Bayes classifier. The Naive
Bayes algorithm is based on Bayes' Rule. The algorithm classifies each item by looking
at all of its features individually. Bayes' Rule tells us how to compute the posterior
probability for a single feature. The posterior probability of the item is computed for
each feature, and these probabilities are then multiplied together to get a final
probability. This probability is computed for the other class as well. Whichever class
has the greater probability determines the class the item is assigned to.
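The procedure above can be sketched by hand on a tiny data set. The four training messages are made up, and two details are assumptions not stated in the text: add-one (Laplace) smoothing, so unseen words do not zero out the product, and summing logarithms instead of multiplying raw probabilities, which gives the same ranking while avoiding numerical underflow.

```python
from collections import Counter
import math

# Tiny made-up training set: (message, label).
train = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch tomorrow", "ham"),
]

# Count words per class and messages per class.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())
vocabulary = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label):
    """Unnormalised log posterior: log P(label) + sum of log P(word | label)."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        likelihood = (word_counts[label][word] + 1) / (total + len(vocabulary))
        score += math.log(likelihood)
    return score

def classify(text):
    """Pick the class with the larger posterior."""
    return max(("spam", "ham"), key=lambda label: log_posterior(text, label))

print(classify("free cash prize"))
print(classify("lunch tomorrow"))
```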

2. Latent Dirichlet allocation: -


LDA represents documents as mixtures of topics that emit words with certain probabilities. It
assumes that documents are produced in the following way: when writing each document,
you decide on the number of words N the document will have (say, according to a Poisson
distribution), then choose a topic mixture for the document (according to a Dirichlet
distribution over a fixed set of K topics). For example, assuming that we have the two
topics food and cute animals, you might choose the document to consist of 1/3 food and
2/3 cute animals.

Generate each word w_i in the document by:

• First picking a topic (according to the multinomial distribution that you sampled
above; for example, you might pick the food topic with 1/3 probability and the
cute animals topic with 2/3 probability).
• Using the topic to generate the word itself (according to the topic's multinomial
distribution). For example, if we selected the food topic, we might generate
"broccoli" with 30% probability, "bananas" with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to
backtrack from the documents to find a set of topics that are likely to have
generated the collection.
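The generative story above can be simulated directly with the standard library. This sketch only runs the forward process (it does not fit an LDA model). The two topics come from the example in the text; the word lists beyond "broccoli" and "bananas", their probabilities, the mean length and the Dirichlet parameter are assumptions.

```python
import math
import random

rng = random.Random(0)

# Hypothetical topics: each is a multinomial distribution over words.
TOPICS = {
    "food": {"broccoli": 0.30, "bananas": 0.15, "breakfast": 0.55},
    "cute animals": {"kitten": 0.5, "puppy": 0.3, "hamster": 0.2},
}

def sample_poisson(mean):
    """Knuth's method for drawing a Poisson-distributed integer."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_dirichlet(alphas):
    """Draw from a Dirichlet by normalising independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(mean_length=8, alpha=1.0):
    n_words = sample_poisson(mean_length)                  # number of words N
    names = list(TOPICS)
    mixture = sample_dirichlet([alpha] * len(names))       # topic mixture for this document
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mixture)[0]     # pick a topic from the mixture
        vocab = TOPICS[topic]
        # Pick a word from the chosen topic's word distribution.
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

mixture, words = generate_document()
print(mixture, words)
```

Fitting LDA runs this story in reverse: given only the words, it infers topic mixtures and word distributions that make the observed documents likely.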
Work Break Down Structure: -
Code: - Python: -
import os
import string

import gensim
import nltk
import pandas as pd
from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer
from gensim.corpora.dictionary import Dictionary
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

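def filter(x):
    # Note: this name shadows Python's built-in filter(); it is kept because the menu calls it.
    # Load the data set: column 0 holds the label ("ham"/"spam"), column 1 the message text.
    df = pd.read_csv("spam.csv", encoding='latin-1')
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    X_train, X_test, y_train, y_test = train_test_split(
        df.iloc[:, 1], df.iloc[:, 0], test_size=0.33, random_state=53)
    # Turn the messages into word-count vectors, ignoring English stop words.
    count_vectorizer = CountVectorizer(stop_words='english')
    count_train = count_vectorizer.fit_transform(X_train)
    # Vectorize the single input message with the vocabulary learned from the training set.
    count_test = count_vectorizer.transform([x])
    nb_classifier = MultinomialNB()
    nb_classifier.fit(count_train, y_train)
    pred = nb_classifier.predict(count_test)
    if pred[0] == "ham":
        print("This message is not spam!")
    elif pred[0] == "spam":
        print("This message is spam !")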

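lemmatizer = WordNetLemmatizer()

def print_topics(res):
    # res is the list of (topic_id, 'weight*"word" + weight*"word" ...') pairs
    # returned by LdaModel.print_topics().
    topics = []
    for i in res:
        k = i[1].split("+")
        for l in k:
            m = l.split("*")
            # Strip the quotes and spaces around the word.
            c = m[1].replace('"', '').replace(' ', '')
            tagged_sent = nltk.pos_tag(word_tokenize(c))
            # Keep only nouns (NN) and proper nouns (NNP); skip empty terms.
            if tagged_sent and tagged_sent[0][1] in ('NNP', 'NN'):
                topics.append(c)
    return list(set(topics))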

def rec_topic(x):
    # Treat each sentence as a separate document.
    my_doc = x.split(".")

    # Remove punctuation from every sentence.
    my_doc_fi = []
    exclude = set(string.punctuation)
    for doc in my_doc:
        s = ''.join(ch for ch in doc if ch not in exclude)
        my_doc_fi.append(s)

    tokenized_docs = [word_tokenize(i) for i in my_doc_fi]

    # Drop English stop words (the list is loaded once, not once per token).
    stop_words = set(stopwords.words('english'))
    stop_tok = []
    for doc in tokenized_docs:
        temp = []
        for t in doc:
            if t.lower() not in stop_words:
                temp.append(t)
        stop_tok.append(temp)

    # Lemmatize the remaining tokens.
    lemma = []
    for doc in stop_tok:
        temp = []
        for t in doc:
            temp.append(lemmatizer.lemmatize(t))
        lemma.append(temp)

    # Build the bag-of-words corpus and fit a 3-topic LDA model.
    dic = Dictionary(lemma)
    corpus = [dic.doc2bow(i) for i in lemma]

    Lda = gensim.models.ldamodel.LdaModel
    ldamodel = Lda(corpus, num_topics=3, id2word=dic, passes=50)

    res = ldamodel.print_topics(num_topics=3, num_words=3)

    r = print_topics(res)
    return r

chatbot = ChatBot("Ron Obvious")


trainer = ListTrainer(chatbot)
#for files in os.listdir('./english/'):
#data=open('./english/'+files,'r').readlines()
#print(data)
#trainer.train(dat

print("\n\n\n\n\n\n\n\n\n")
print("computer: Hi! What is your name?")
y = input()
y = y.split()
name = ""
for i in y:
    tagged_sent = word_tokenize(i)
    tagged_sent = nltk.pos_tag(tagged_sent)
    # Keep the words tagged as proper nouns as the user's name.
    if tagged_sent[0][1] == 'NNP' and i not in stopwords.words('english'):
        name = name + i

# Fall back to a default before greeting, so the greeting never shows an empty name.
if len(name) == 0:
    name = "user"

print("computer: oh! That's a nice name....so how are you " + name + "?")

while True:
    print(name, end=": ")
    y = input()

    # Typing "help" opens the menu; anything else is passed to the chat bot.
    if set(y.split()) == {"help"}:
        print("press 1 for topic recognition")
        print("press 2 for spam filter")
        print("press 3 for exit")
        # Compare as text so that a non-numeric choice does not raise ValueError.
        x = input().strip()

        if x == "1":
            state = input("enter the paragraph")
            print("The important words are:")
            for i in rec_topic(state):
                print(i)
        elif x == "2":
            state = input("enter the paragraph")
            filter(state)
        elif x == "3":
            print("bye " + name + "..have a great day!")
            break
    else:
        response = chatbot.get_response(y)
        print("computer: ", end=" ")
        print(response)
Datasets: -
Data is the essential ingredient before we can develop any meaningful algorithm. Knowing
where to get data is very handy, especially for a beginner.

Below are a few well-known repositories where you can easily get thousands of data sets
for free:

1. UC Irvine Machine Learning Repository

2. Kaggle datasets

3. AWS datasets

The email spam data set is distributed by SpamAssassin. There are a few categories of data,
and you can read the readme.html to get more background information on them.

In short, there are two types of data in this repository: ham (non-spam) and spam.
Furthermore, the ham data is divided into easy and hard, which means that some non-spam data
is very similar to spam data. This might make it difficult for our system to make a decision.

https://www.kaggle.com/uciml/sms-spam-collection-dataset
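A quick way to inspect such a file before training is to read the label and text columns and count the classes. The sketch below uses an in-memory sample instead of the real file. It assumes the layout implied by the project code (label in the first column, message in the second, extra empty columns after that); the header names and the sample rows are made up.

```python
import csv
import io
from collections import Counter

def load_sms(handle):
    """Read (label, text) pairs from the first two columns, skipping the header row."""
    reader = csv.reader(handle)
    next(reader)                                   # header row (names assumed)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

# In-memory sample standing in for open("spam.csv", encoding="latin-1", newline="").
sample = io.StringIO(
    "v1,v2,,,\n"
    'ham,"Ok, see you soon",,,\n'
    "spam,WINNER!! Claim your prize now,,,\n"
    "ham,See you at lunch,,,\n"
)

messages = load_sms(sample)
print(Counter(label for label, _ in messages))
```

Checking the class balance first is useful because a data set dominated by ham can make raw accuracy look better than the spam detection really is.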

Outcome of Project: -
In this project we are developing a prototype product for spam detection using text analysis.
The product acts as a chat bot that also detects spam messages using text analysis. We are
using Natural Language Processing concepts and the algorithms mentioned above to implement
our project.

Base Journal paper: -


https://www.mdpi.com/2076-3417/9/5/987/pdf-vor
The idea is drawn from the above journal paper, with enhancements of our own.
References: -
11. Prominent feature extraction for review analysis: an empirical study, Journal of
Experimental & Theoretical Artificial Intelligence, Vol.28, Issue.3, pp.485-498, 2016.

12. Sentiment analysis using conceptnet ontology and context information, In Prominent
Feature Extraction for Sentiment Analysis (Springer), US, pp.63-75, 2016.

13. L. Zhang, J. Zhu, T. Yao, An evaluation of statistical spam filtering techniques,
ACM Transactions on Asian Language Information Processing (TALIP), Vol.3, Issue.4,
pp.243-269, 2004.

14. An empirical study of the naive Bayes classifier, In IJCAI 2001 workshop on
empirical methods in artificial intelligence, Vol.3, Issue.22, pp.41-46, 2001.

15. F. Sebastiani, categorization, ACM Computing Surveys (CSUR), Vol.34, Issue.1,
pp.1-47, 2002.

16. Z. Yang, X. Nie, W. Xu, An approach to spam detection by naive Bayes ensemble based
on decision induction, In Sixth International Conference on Intelligent Systems Design
and Applications, China, pp.861-866, 2006.

17. C. Kim, Naive Bayes classifier learning with feature selection for spam detection
in social bookmarking, In Proceedings of European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), US, pp.32, 2008.

18. I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, C. D., An experimental
comparison of naive Bayesian and keyword-based anti-spam filtering with personal
e-mail messages, In Proceedings of the 23rd annual international ACM SIGIR conference
on Research and development in information retrieval, Greece, pp.160-167, 2000.

19. T. A. Almeida, A. Yamakami, Evaluation of approaches for dimensionality reduction
applied with naive bayes anti-spam filter, International Conference on Machine
Learning and Applications, Miami, pp.517-522, 2009.

20. Support-vector networks, Machine Learning, Vol.20, Issue.3, pp.273-297, 1995.

21. M. Mccord, Spam detection on twitter using traditional classifier, In International
Conference on Autonomic and Trusted Computing, Heidelberg, pp.175-186, 2011.

22. P. Kolari, T. Finin, A. Joshi, SVMs for the Blogosphere: Blog Identification and
Splog Detection, In AAAI Spring Symposium: Computational Approaches to Analyzing
Weblogs, Baltimore, pp.92-99, March 2006.

23. H.B. Wang, Y. Yu, SVM classifier incorporating feature selection using GA for spam
detection, In International Conference on Embedded and Ubiquitous Computing, Japan,
pp.1147-1154, 2005.

24. Incremental SVM model for spam detection on dynamic email social networks, In Int.
Conf. on Computational Science and Engineering, Vancouver, pp.128-135, 2009.

25. An assessment of case base reasoning for short text message classification, In N.
Creaney (Ed.), Proceedings of 16th Irish Conference on Artificial Intelligence and
Cognitive Science, Castlebar, pp.257-266, 2005.

26. A. Harisinghaney, A. Dixit, S. Gupta, Text and image based spam email
classification using KNN Naïve Bayes and Reverse DBSCAN algorithm, In Optimization
Reliability and Information Technology (ICROIT), India, pp.153-155, 2014.

27. Graph-based KNN Algorithm for Spam SMS Detection, Journal of Universal Computer
Science, Vol.19, Issue.16, pp.2404-2419, 2013.

28. Using cellular automata for improving knn based spam filtering, International Arab
Journal of Information Technology, Vol.11, Issue.4, pp.345-353, 2014.

29. A.T. Sabri, A. H. Mohammads, B. Al-Shargabi, M. A. Hamdeh, detection using
artificial neural network (CLA_ANN), European Journal of Scientific Research, Vol.42,
Issue.3, pp.525-535, 2011.

30. MR. Nagpure, SS. Mesakar, SR. Raut and Vanita P. Lonkar, "Image Retrieval System
with Interactive Genetic Algorithm Using Distance", International Journal of Computer
Sciences and Engineering, Vol.2, Issue.12, pp.109-113, 2014.

31. Spam detection using clustering, random forests, and active learning, In Sixth
Conference on Email and Anti-Spam, Mountain View, California, pp.1-6, 2009.

32. A. Karami, Improving static SMS spam detection by using new content-based
features, In 20th Americas Conference on Information Systems (AMCIS), Savannah,
pp.1-9, 2014.

33. A. Garg, N. Batra, I. Taneja, A. Bhatnagar, A. Yadav, S. Kumar, "Cluster Formation
based Comparison of Genetic Algorithm and Particle swarm Optimization Algorithm in
Wireless Sensor Network", International Journal of Scientific Research in Computer
Science and Engineering, Vol.5, Issue.2, pp.14-20, 2017.

34. Y. Zhang, S. Wang, P. Phillips, G., Binary PSO with mutation operator for feature
selection using decision tree applied to spam detection, Knowledge-Based Systems,
Vol.64, Issue.3, pp.22-31, 2014.
