Final Document
Submitted to
CERTIFICATE
This is to certify that the project report entitled “SPAM MESSAGE
IDENTIFICATION USING MACHINE LEARNING APPROACH” is submitted by
in partial fulfillment of the requirements for the award of Degree of Bachelor of Technology
in “Computer Science and Engineering” for the academic year 2022-23.
Ph.D.,
Assistant Professor in CSE, AITS, Rajampet.
Professor & Head, Dept. of CSE, Dean of Student Affairs, AITS, Rajampet.
Department of Computer Science and Engineering
Annamacharya Institute of Technology and Sciences
(An Autonomous Institution)
(Approved by AICTE, New-Delhi and affiliated to J.N.T.U, Anantapur)
(Accredited by NBA & NAAC)
New Boyanapalli, Rajampet, Annamaiah (Dt), A.P-516 126
ANTI-PLAGIARISM CERTIFICATE
This is to certify that the project report entitled “SPAM MESSAGE
IDENTIFICATION USING MACHINE LEARNING APPROACH” is submitted by
This is a record of bonafide work carried out by me/us, and the results embodied in this project report have not been reproduced or copied from any source. The results embodied in this project report have not been submitted to any other University or Institute for the award of any other Degree or Diploma.
PROJECT ASSOCIATES
CHAPTER
ABSTRACT
1. INTRODUCTION
BIBLIOGRAPHY
PLAGIARISM REPORT
JOURNAL PUBLICATION
LIST OF FIGURES & TABLES
We use digital communication tools to convey messages; these tools allow two or more persons to coordinate with each other, and the communication can be textual, visual, or audio. Smart devices, including cell phones, are the major sources of communication these days, and intensive communication through SMS is causing spamming as well. Unwanted text messages are junk information received on our devices. Many companies promote their products or services by sending unwelcome spam texts, and spam messages often outnumber genuine ones. In this project, we use text classification techniques for SMS spam filtering, which segregates messages into spam and legitimate classes. We apply several machine learning algorithms, such as SVM, XGBoost, Naïve Bayes, and AdaBoost, compare them on the dataset, and select SVM, the algorithm with the best accuracy, for the detection of spam messages.
1. INTRODUCTION
In just five years, there will be 3.8 billion mobile phone (or "smartphone") users. China, India, and the US are the top three countries in terms of mobile usage. Short Messaging Service, commonly known as SMS, is a text messaging service that has been around for a long time. SMS can be used even without an internet connection, so the service is accessible on both smartphones and low-end mobile devices. Although there are numerous text messaging apps for smartphones, such as WhatsApp, those apps can only be used online, whereas SMS is available at all times. As a result, SMS traffic is growing daily.
The majority of these mobile consumers have signed up for the Do Not Call Registry run by the Telecom Regulatory Authority of India (TRAI), yet unwanted calls and messages are still typically placed by telemarketers, spammers, vendors, banks, vehicle dealers, insurance brokers, real estate agents, and other fraudsters.
A spam detection system derives its knowledge mostly from data, and numerous methods are available for that goal, including classification, clustering, and many others. SMS stands for Short Message Service: each SMS is limited to 160 characters, and lengthy messages must be broken up into numerous smaller messages. Short text messages can be exchanged between cell phones using the same communication protocols. The government intends to keep up with the rapid pace of technological progress, and prior years saw an increase in text messaging.
In certain ways, SMS spam is a harder problem. Low SMS prices and the limited availability of cell phone spam filtering software have allowed customers and service providers to largely ignore the issue. SMS spam volume is lower than email spam, although it accounts for roughly 30% of text messages sent in parts of Asia and about 1% of text messages in the United States. Under the Telephone Consumer Protection Act, SMS spam is prohibited in the United States, yet recipients of spam SMS rarely pursue cases of real legal significance. Since 2009, China's top three mobile operators have agreed to a joint effort to combat mobile spam by limiting the number of text messages that can be sent over a period of time.
We treat spam detection as a text classification problem in this study. Texts sent by individual people are considered to belong to the human (ham) class, while spam messages are typically sent by businesses and organisations to advertise their goods. Mobile phones and smartphones are now the primary communication devices used by a wide range of people.
Because SMS spam datasets are typically small, far more datasets exist for email spam filtering than for SMS spam, and because SMS messages themselves are short, spam filtering methods developed for email cannot simply be extended to SMS. In some countries, including Korea, spamming via email is less common than via SMS, whereas in western regions email spam is more prevalent because it is cheaper to send than SMS spam, which is more expensive and therefore infrequent. On mobile devices, about 50% of received SMS messages are flagged as spam, so an SMS filtering system must also work within the limited resources of cell phone hardware. We used real ham and spam messages as the data in our analysis, and we apply a number of different classification algorithms, some of which have been used in earlier research and some of which are new to this task.
Machine learning is a technology that allows computers to learn from the past and make predictions about the future. The majority of problems in the real world may now be solved using machine learning and deep learning in all fields, including health, security, market analysis, etc. Machine learning can be divided roughly into two types: supervised learning and unsupervised learning. Supervised learning is one of the important subcategories of machine learning; also called predictive modelling, it is the process of producing predictions from data. There are no sizable datasets of spam SMS that are publicly accessible, and even if there were, there is no guarantee that training on those datasets would yield good performance in our situation. As a result, the only option is to use the real data streaming into the system to create a bespoke dataset. Classification and regression are two instances of supervised learning: for classification problems, the training data set has pre-assigned labels, and for regression problems, the function values are known. Once training is complete and the model has a minimal cost function on the training data set, we move on to scoring, where we can forecast values for new data.
Classification: It determines which group an instance belongs to. We use it when we want our system to decide which label to assign to each of many events, where each event is defined by input parameters and might be labelled in a variety of ways.
We applied a variety of supervised learning techniques for SMS spam detection using a labelled dataset from the UCI repository.
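A minimal sketch of this supervised workflow is shown below. It assumes the UCI SMS Spam Collection has been saved as a CSV file with 'Category' and 'Message' columns (matching the file used later in the implementation); the split ratio and classifier choice are only illustrative.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the labelled SMS dataset (file name and column names assumed)
df = pd.read_csv("spam.csv", encoding="latin-1")[["Category", "Message"]]

# Hold out part of the data for scoring, keeping the ham/spam ratio
x_train, x_test, y_train, y_test = train_test_split(
    df["Message"], df["Category"], test_size=0.2, stratify=df["Category"], random_state=42)

# Turn text into token-count features and fit a baseline classifier
vec = CountVectorizer(stop_words="english")
clf = MultinomialNB().fit(vec.fit_transform(x_train), y_train)

# Scoring: predict labels for unseen messages and measure accuracy
print(accuracy_score(y_test, clf.predict(vec.transform(x_test))))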
2. LITERATURE SURVEY
1. Sridevi Gadde (2021), S. Satyanarana, A. Lakshmanarao: Spam detection using machine learning algorithms is nothing new. In the past, a number of researchers used machine learning techniques to classify SMS spam. With the use of a random forest classifier and the TF-IDF approach, Nilam Nur Amir Sjarif et al. were able to attain an accuracy of 97.5%. With the help of two metrics, Term Frequency and Inverse Document Frequency, the words in a document can be quantified using the TF-IDF method. For email spam filtering, A. Lakshmanarao et al. used four machine learning classifiers: Decision Trees, Naive Bayes, Logistic Regression, and Random Forest. The random forest classifier had an accuracy of 97%.
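As an illustration of this kind of pipeline, the sketch below combines TF-IDF weighting with a random forest classifier using scikit-learn; the toy messages and parameter values are illustrative assumptions, not the setup used in the cited work.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Toy labelled messages; the cited studies use full SMS/email corpora
texts = ["win a free prize now", "are we meeting for lunch",
         "free entry in a weekly draw", "call me when you get home"]
labels = ["spam", "ham", "spam", "ham"]

# TF-IDF turns each message into term-frequency / inverse-document-frequency weights,
# and the random forest is trained on those weighted features
model = make_pipeline(TfidfVectorizer(),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))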
Using support vector machines, Pavas Navaney et al. evaluated a number of machine learning techniques and attained an accuracy of 97.4%. By using a logistic regression classifier, Luo GuangJun et al. were able to attain a high accuracy rate with a variety of shallow machine learning techniques. The Hidden Markov Model was suggested by Tian Xia et al. for the identification of SMS spam; their model addressed problems with low term frequency by making use of word order information, and with their suggested HMM model they were able to obtain an accuracy of 98%. With a deep neural network, M. Nivaashini et al. were able to detect SMS spam with an accuracy of 98%. The effectiveness of the DNN was also contrasted with that of NB, Random Forest, Support Vector Machine, and KNN. Mehul Gupta and others compared various machine learning models for spam detection with deep learning models and demonstrated that the latter had a high rate of success in detecting SMS spam.
According to Hidalgo (2002), spam is "unrestricted mass email" containing "info made to be delivered to numerous recipients, notwithstanding their longings." In his 2007 illustration, Cormack described how spam that contains an enticing ingredient or compelling information is distributed by mass mailing. Nonetheless, given the various media employed for spam, including email spam and SMS spam, such spam may be easily identifiable. Spammers inundate Short Message Service users and deliver large amounts of unsolicited SMS to end users. From a commercial standpoint, Short Messaging Service users must invest time in eliminating spam-filled messages, which undoubtedly results in profit loss and could pose problems for partnerships. From this point on, how to accurately and proficiently identify Short Messaging Service spam with high precision becomes a significant research problem.
This strategy is used to emphasise the issues that still need to be resolved and the
differences from our existing analysis. Delany et al. conducted a survey on the filtering of
mobile SMS spam and developments. The difficulties in gathering the research dataset and
making it accessible were discussed by the authors. Further investigation in this area was
promoted by the publication. The results of a subsequent early benchmark experiment
revealed a lack of agreement regarding the most effective strategies for mobile SMS spam
detection and filtering. Also, it demonstrated the text classification techniques used in thorough SMS filtering. Nevertheless, the explicit SMS features were not taken into account.
The report, however, focused more on methods for detecting spam in emails and left
out artificial immune systems and other mobile SMS spam approaches. An overview of the
legal frameworks for spam in the mobile SMS industry in Switzerland, the EU, and the USA
was provided by Camponovo and Cerutti.
The initiative also looked into the conclusions that may be drawn for the commercial
mobile sector. In a study on SMS anti-spam systems, Wang et al. integrated temporal or
spectral testing with behavior-based social network analysis to identify spam with extremely
high recall and accuracy. The authors tackled the scalability issue of social networks by
outlining the classification architecture and presenting a reasonably precise neighbourhood
index solution. Chou and Lien investigated "mobile teaser ads" by running two distinct studies on brand awareness, representative friendliness, and representative expertise, as well as the ways in which these could influence brand interest among subscribers with various SMS mind-sets. The results showed that a likeable and well-known representative decreased consumers' curiosity about teaser advertisements presenting higher-awareness products.
In blogs that feature product advertisements, Jindal and Liu examined spam and spam recognition. Unfortunately, SMS spam was not mentioned in that review; it only covered spam associated with blogs that posted product advertisements. A similar survey of Web spam evaluated numerous algorithms for filtering questionable behaviour over a ten-year period; these algorithms were divided into four classes: classic spam, phoney reviews, social spam, and link farming. This evaluation, however, did not fully analyse SMS spam; it simply addressed e-mail spam, false blog reviews, and social media spam. Chan et al. presented a good word attack strategy that manipulates the classifier using the fewest inserted characters by combining weight values with word length in the SMS.
The feature reweighting method was put forth together with a novel scaling methodology that diminished the value of the feature denoting a short word, in order to necessitate additional inserted characters for successful evasion. Using text messages and a sample comment spam bank, this strategy was evaluated empirically. The results of the experiment demonstrated that word length was a crucial component of SMS spam filtering's resistance to good word attacks. We outline the procedures utilised to review the previous studies in the next section.
4. Inwhee Joe and Hyetaek Shim, Division of Computer Science and Engineering,
Hanyang University, Seoul, 133-791 South Korea:
This project describes an SVM (Support Vector Machine) and thesaurus-based spam
filtering system for SMS (Short Messaging Service). The system uses a pre-processing tool to
recognize words from sample data, combines their meanings with the help of a thesaurus,
generates features of integrated words using chi-square statistics, and then examines these
traits. The system is implemented within the Windows environment, and its effectiveness has
been empirically verified.
Spam filtering is the task of automatically determining whether a document is spam or not. Automated document categorization entails grouping together related documents and assigning each one to the appropriate category using a classification scheme. The classifying process comprises two stages.
After indexing a large number of documents, the first phase (feature selection) extracts the features necessary for classification. The second phase is the decision-making process that determines the appropriate category from the first phase's results. Through a machine learning process, automatic document classification is able to assign the correct category automatically.
A specific word is associated with a group of learnt documents for this process. The word stands for the document, and feature extraction is a batch operation that chooses words from the document's learnt words. However, if every word in the learnt documents is chosen as a feature, the process wastes time and loses its ability to discriminate. To avoid this issue, the information contained in each word is assessed before featured terms are chosen for automatic classification. In text categorization there are many different feature spaces to consider, so a feature selection technique is required. Document frequency thresholding (DF), the chi-square statistic (CHI), term strength (TS), information gain (IG), and mutual information (MI) are the most widely used feature selection techniques.
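As a small illustration of one of these techniques, the sketch below uses scikit-learn's chi-square (CHI) feature selection to keep only the terms that are most informative about the class labels; the toy messages and the value of k are assumptions for demonstration only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["free prize waiting claim now", "see you at dinner tonight",
         "claim your free ringtone now", "running late see you soon"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Build a word-count matrix, then keep the k terms with the highest chi-square score
vec = CountVectorizer()
X = vec.fit_transform(texts)
selector = SelectKBest(chi2, k=3).fit(X, labels)
print([term for term, keep in zip(vec.get_feature_names_out(), selector.get_support()) if keep])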
Due to the rise in mobile phone usage, there has been a rapid increase in SMS spam messages. It is difficult to battle mobile phone spam in practice due to the lower rate of SMS spam, which has allowed many consumers and service providers to ignore the issue, as well as the limited availability of mobile phone spam-filtering software. On the other hand, the lack of publicly accessible datasets for SMS spam, which are essential for testing and comparing different classifiers, is a significant disadvantage in academic environments. Also, because SMS messages are frequently short, content-based spam filters may perform worse on them.
With this project, we present the largest authentic, accessible, and unencrypted SMS
spam collection we are aware of. Also, we contrast the results of various tried-and-true
machine learning techniques. The findings show that Support Vector Machine performs
better than other assessed classifiers, making it a useful foundation for further comparison.
6. Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification”:
7. B. G. Becker. Visualizing Decision Table Classifiers. Pages 102-105, IEEE (1998):
Decision trees, decision networks, and decision tables are all classification models used for forecasting, and they are produced by machine learning algorithms. A decision table is made up of a hierarchy of tables where each entry is broken down into its component parts by the values of two more attributes to create a new table; the structure is comparable to dimensional stacking. Here, a visualization technique is shown that enables even non-experts in machine learning to comprehend a model built on a variety of attributes. This representation is more practical than other static designs thanks to a variety of interactions.
8. Draško Radovanović, Božo Krstajić, Member, IEEE “Review Spam Detection using
Machine Learning”:
People typically educate themselves before making a purchase by reading online reviews. Sellers frequently try to imitate genuine user experience to increase their profits. Recognizing and eliminating bogus reviews is crucial because customers are being duped in this manner. This project examines machine learning-based spam detection techniques and gives an overview of the findings. According to the authors, review spam can be split into three categories: (1) untruthful (false) reviews, (2) reviews about brands only, and (3) non-reviews. False reviews are intentionally false opinions. Reviews that are solely about brands are not concerned with the items at all, but with the brands or producers. Ads and other unrelated reviews without opinions are considered non-reviews. Types two and three do not mention specific products, but they are not dishonest. These spam subtypes are also simple to identify manually, and conventional classification techniques have little trouble identifying them. It has been demonstrated that detecting false reviews is substantially more difficult for both a machine and a human observer. These are the reasons why this project takes this kind of spam into account.
The Short Messaging Service (SMS) gained popularity after being first made available as a service in the second-generation (2G) terrestrial mobile network architecture (Global System for Mobile Communication, GSM). SMS usage on phones has expanded to such a degree, due to technological developments and an increase in content-based advertising, that devices are occasionally inundated with a large number of spam SMS. Private data loss is another risk posed by these spam mailings. There are numerous content-based machine learning methods that have been successfully used to filter spam emails, and contemporary studies have classified text messages as spam or ham using certain stylistic characteristics.
The ability to detect SMS spam can be significantly impacted by the use of well-
known terms, phrases, abbreviations, and idioms. This project compares various categorizing
methods using various datasets gathered from earlier research projects, and evaluates them
according to their accuracy, precision, recall, and CAP Curve. Deep learning approaches and
conventional machine learning techniques have been compared.
10. Milivoje Popovac, Mirjana Karanovic, Srdjan Sladojevic, Marko Arsenovic, Andras
Anderla :
Using various data analysis techniques, the classification challenge of spam message detection can be resolved. The papers that apply artificial intelligence approaches are given below, with reference to the methodology selected for this study. The designers of Tiago's dataset tested a number of machine learning methods and laid the groundwork for additional study. They combined classifiers with two tokenizers (to recognize a domain and preserve symbols that help distinguish spam from ham messages), with the highest performance coming from SVM, Boosted NB, Boosted C4.5, and PART. SVM had the best performance, catching 83.10% of spam and blocking just 0.18% of non-spam messages; its accuracy was above 97.5%. The SMS Spam Corpus and SMS Spam Collection datasets were utilized in that project, both separately and combined. Both of the employed approaches, the FP-Growth algorithm and the Naive Bayes classifier, achieved an accuracy rate higher than 90%.
The FP-Growth method applied on Tiago's dataset provided the best average accuracy (98.5%). The research for that project was done using the Tiago dataset; Naive Bayes performed better in the experiment than the Random Forest and Logistic Regression algorithms. Accuracy findings of about 98.5% for convolutional neural network based SMS spam detection were reported by Milivoje Popovac, Mirjana Karanovic, Srdjan Sladojevic, Marko Arsenovic, and Andras Anderla, whose project made use of the same dataset. The authors came to the conclusion that the best outcomes came from boosting the Random Forest and SVM algorithms. Outcomes can be improved by utilizing both Linguistic Inquiry and Word Count (LIWC) features and SMS-specific content-based capabilities.
The Gentle Boost classifier was selected by the authors of another project as the best method to use on Tiago's dataset after evaluating a number of other algorithms. The specified algorithm, which combines the AdaBoostM1 and LogitBoost techniques, is useful for binary classification and unbalanced data, and has produced accuracy of more than 98.3%. The following algorithms were utilized in a further project by the authors: NB, SVM, k-NN, RF, and AdaBoost. All of the aforementioned algorithms achieved an accuracy rate of more than 97%; however, the best classifiers are multinomial Naive Bayes with Laplace smoothing and SVM with a linear kernel.
In another paper, a further study using Tiago's dataset was presented. The experiment involved clustering using the K-Means technique or the NMF model, in addition to pre-processing and classification using several classifiers. Using the aforementioned processes, an SMS thread identification solution was put forth. The study came to the conclusion that the NMF and SVM algorithm combination produced good results in thread identification and that the SVM algorithm performed better at categorising SMS messages. On four spam datasets, the authors of another study compared the performance of several methods. Incorporating N-gram TF-IDF feature selection, a modified distribution-based balancing algorithm, and a regularised deep multi-layer perceptron NN model with rectified linear units, they proposed a novel spam filter (DBB-RDNN-ReL).
The suggested model's accuracy, FP rate, and AUC for Tiago's dataset were 98.5%, 0.0024, and 0.961 respectively. Other applications for similar techniques include text categorization, sentiment analysis, and email spam detection. The authors of another paper trained a CNN to classify sentences; their experiment demonstrates that even a basic CNN with just one layer of convolution can produce remarkable results.
A study was conducted as part of another project to evaluate the CNN-RNN model's use in multi-label text categorization. CNN was utilised for feature extraction, and RNN was employed to extract local semantic data and model label correlation. It was demonstrated that the performance of the model is significantly impacted by the size of the dataset: large datasets can produce impressive performance, whereas small datasets may result in overfitting. This study was carried out to assess how well CNN applies to the problem under consideration.
11. Arijit Chandra, Sunil Kumar Khatri “Spam SMS Filtering using Recurrent Neural
Network and Long Short Term Memory”
Short Messaging Service, or SMS, is the standard mobile device protocol used for information sharing via brief text messages. Nowadays, SMS messages are a quick, affordable, and commonly accepted alternative to phone calls for communication. Spam is defined as ad hoc, uninvited messages distributed widely without the recipient's consent. Consumers still have to deal with spammers that use SMS to advertise fraudulent claims and invade user privacy. With the development of the Internet, spammers have tried to enter everywhere, including emails, social media sites, reviews, and even Twitter. Spammers frequently use these various sorts of spam to get money, including comments, emails, search
12. Sakshi Agarwal, Sanmeet Kaur, Sunita Garhwal “SMS Spam Detection for Indian
Messages”
The number of mobile phone users is rising, which has caused a sharp rise in SMS
spam. While the majority of the globe still views mobile messaging as "clean" and reliable,
recent surveys have shown that the amount of mobile phone spam is drastically rising year
over year. It is a growing setback, particularly in Asia and the Middle East. SMS spam
filtering is a relatively new task to address this issue. Several issues and straightforward remedies carry over from email spam screening; however, SMS spam also presents some concerns and issues of its own. By including Indian communications in the global SMS dataset, this effort encourages
further work on the issue of filtering mobile messages for Indian users as Ham or Spam. In
the project, various machine learning classifiers are analysed using a significant corpus of
SMS texts for Indian citizens.
In a landmark study to identify mobile phone spam, Gomez Hidalgo et al. evaluated a
number of Bayesian-based classifiers. The authors of this study suggested the Spanish and
English test databases as the first two well-known SMS spam datasets. For those two
datasets, the authors evaluated various machine learning and message presentation
techniques. They arrived at the conclusion that the Bayesian filter could be effectively used to classify SMS spam. According to Cormack et al., even content-based spam filtering can be employed for brief text messages, which can be found in three different contexts: SMS, blog comments, and email summary information.
Their article concluded that SMS messages contain too few words to adequately support word or word-bigram based spam classifiers. As a result, efficiency was boosted by expanding the collection of features to include orthogonal sparse word bigrams, character bigrams, and trigrams. Nuruzzaman et al. examined the effectiveness of utilizing text classification techniques to filter out message spam on independent mobile phones.
They gathered 1,353 spam messages, with no duplicates, and used them as the dataset to study the behaviour of SMS spam.
They used orthogonal-initialization k-way spectral clustering. By applying spectral clustering to their own dataset, a small number of clusters (ten in total, with their connected top-8 terms and an assumed annotation) were produced. The details of a brand-new collection of SMS spam, containing the greatest possible number of messages and being authentic, open, and unencrypted, were presented by Tiago A. Almeida et al.; 4,827 mobile ham messages and 747 mobile spam messages make up this collection. The authors also ran their dataset through a number of well-known machine learning algorithms and came to the conclusion that SVM is a superior method and a good baseline for further evaluation. Houshmand Shirani-Mehr applied various machine learning algorithms to the challenge of classifying SMS spam, compared the results to gain insight and further investigate the issue, and created a program based on one of these methods that can accurately filter SMS spam. A database of 5,574 text messages was used.
13. Lu CAO, Guihua NIE, Pingfeng LIU “Ontology-based Spam Detection Filtering
System”
A systematic frame model of an ontology-based mobile phone spam message detection system is described, together with the pertinent essential technologies, on the basis of the existing methods of spam message identification.
Spam may be found and filtered using a variety of methods. Depending on whether the detection process takes place during the transmission of a message, it comprises real-time detection and non-real-time detection; according to the detection point, it includes outgoing and terminating spam detection and filtering.
3. SYSTEM ANALYSIS
Due to a lack of support for data visualization, it is difficult to deploy machine learning algorithms in the current system. The current method uses manual mathematical calculations to generate models, which can be time-consuming and inaccurate when using the Naive Bayes algorithm.
3.1.1 Disadvantages
Low Accuracy
Processing Time is High
3.3.2 USER:
Upload data: User can upload messages data into the system.
View data: Here user can view the uploaded data.
View results: User can view the predicted results.
The following machine learning algorithms are used to train and test the sample dataset.
Linear SVM: Linear SVM is used for linearly separable data, which is defined as data that can
be divided into two classes using just one straight line. The classifier used for such data is
known as a Linear SVM classifier.
Non-linear SVM: When a dataset cannot be classified using a straight line, it is said to be non-linearly separable, and the classifier employed for it is known as a non-linear SVM classifier.
Hyperplane: In n-dimensional space, there may be several lines or decision boundaries used
to separate the classes, but we must identify the optimum decision boundary that best aids in
classifying the data points. The hyperplane of SVM is a name for this optimal boundary.
The dataset's features determine the hyperplane's dimensions: if there are just two features (as in the example image), the hyperplane will be a straight line, and if there are three features, the hyperplane will be a two-dimensional plane. We always build the hyperplane with a maximum margin, that is, the greatest possible separation between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. These vectors are called support vectors because they support the hyperplane.
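The sketch below illustrates the difference on synthetic data: a linear SVM suffices for blob-shaped, linearly separable classes, while an RBF-kernel SVM is needed for the interleaved "moons" data. The datasets and parameters are illustrative assumptions, not part of the project.

from sklearn.datasets import make_blobs, make_moons
from sklearn.svm import LinearSVC, SVC

# Linearly separable data: a straight-line decision boundary is enough
X_lin, y_lin = make_blobs(n_samples=200, centers=2, random_state=0)
print(LinearSVC(max_iter=5000).fit(X_lin, y_lin).score(X_lin, y_lin))

# Non-linearly separable data: the RBF kernel bends the boundary
X_non, y_non = make_moons(n_samples=200, noise=0.1, random_state=0)
print(SVC(kernel="rbf").fit(X_non, y_non).score(X_non, y_non))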
3.3.3.2 XGBOOST
Recently, applied machine learning and Kaggle competitions for structured or tabular data have been dominated by the XGBoost method. XGBoost is a gradient-boosted decision tree implementation designed for speed and performance.
Bagging: Assume that there is now a panel of interviewers, each of whom has a vote, as
opposed to just one. Using a democratic voting procedure, bagging or bootstrap aggregating
combines the information from each interviewer to determine the final outcome.
Both XGBoost and Gradient Boosting Machines (GBMs) are ensemble tree approaches that use the gradient descent architecture to boost weak learners (CARTs in general). However, XGBoost enhances the fundamental GBM architecture with system optimizations and algorithmic improvements.
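A minimal sketch of training an XGBoost classifier is given below; it assumes the xgboost Python package is installed, and the synthetic data and hyperparameters are illustrative only.

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted decision trees; n_estimators and learning_rate control the boosting rounds
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))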
A Naive Bayes classifier is a probabilistic machine learning model utilized for classification tasks. The Bayes theorem serves as the foundation of the classifier.
When B has already happened, we may use the Bayes theorem to calculate the likelihood that A will also occur; A is the hypothesis and B is the supporting evidence. Here, it is assumed that the predictors (features) are independent, that is, the presence of one feature does not change the behaviour of another. This assumption is why the classifier is termed "naive".
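Written out, the Bayes theorem that the classifier relies on is

P(A | B) = P(B | A) * P(A) / P(B)

and, with the naive independence assumption, the class c chosen for an instance with features x1, ..., xn is the one that maximises P(c) * P(x1 | c) * ... * P(xn | c).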
Let's use an illustration to comprehend it. I've included a training set of weather data and its
matching goal variable, "Play," below (suggesting possibilities of playing). We must now
categorize whether participants will participate in games based on the weather. Let's carry it
out by following the steps below.
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities, for example Overcast probability = 0.29 and probability of playing = 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of prediction.
Here we have P (Sunny |Yes) = 3/9 = 0.33, P (Sunny) = 5/14 = 0.36, P (Yes) = 9/14 = 0.64
Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
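The posterior computed above can be checked with a couple of lines of arithmetic (a small sketch using the probabilities stated in the example):

# P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes, p_yes, p_sunny = 3 / 9, 9 / 14, 5 / 14
print(p_sunny_given_yes * p_yes / p_sunny)  # ~0.60, so "Yes" is the more likely class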
• It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and it needs less training data.
• It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Real-time Prediction: Naive Bayes is a fast, eager-learning classifier, so it can be applied to real-time prediction. Multi-class Prediction: The ability of this method to predict many classes is widely recognized; we can forecast the likelihood of several target variable classes.
Text classification / Spam Filtering / Sentiment Analysis: Because they perform better in multi-class situations and rely on the independence assumption, Naive Bayes classifiers are frequently employed in text classification and have a greater success rate than other methods. They are therefore frequently used in Spam Filtering (to identify spam e-mail) and Sentiment Analysis (in social media analysis, to identify positive and negative customer sentiments).
Let's first discuss how boosting functions. During the data training phase, 'n' decision trees are created. As the first decision tree or model is constructed, the improperly classified records from the first model are given priority, and only these records are sent as input to the second model. The procedure continues until we decide how many base learners to produce. Keep in mind that repetition of records is permitted with all boosting procedures.
The first model is created, and the algorithm notes any errors from the initial model. The improperly classified records are utilized as input for the next model, and this process is repeated until the given condition is met. The graphic shows that, by using the errors from the previous model, 'n' other models were created. This is how boosting functions. The individual models, 1, 2, 3, ..., N, are decision trees, and the basic operating mechanism of all boosting schemes is the same.
Knowing the boosting principle makes it simple to comprehend the AdaBoost algorithm. When random forest is employed, the algorithm creates 'n' full trees, each with a root node and many leaf nodes; although some trees may be larger than others, a random forest has no fixed depth. AdaBoost's approach, in contrast, creates only stumps, which have a single decision node and two leaves. These stumps are weak learners, and that is exactly what boosting methods favor. In AdaBoost, the order of the stumps matters a great deal: the first stump's error affects how subsequent ones are created.
This is a sample dataset that has only three attributes and produces categorical results; the dataset is represented in the image. Because of the output's binary/categorical nature, it is a classification problem. In reality, the dataset may contain any number of features and records; for the purposes of explanation, let's look at 5 records. The result is categorical and is presented as Yes or No. A sample weight is given to each of these records using the formula W = 1/N, where N is the total number of records. Since there are only 5 records in this dataset, the sample weight is initially set to 1/5, and the weight of each record is the same.
AdaBoost, which was designed for binary classification problems, is best utilised to improve the performance of decision trees.
The author first referred to AdaBoost as AdaBoost.M1; Discrete AdaBoost is a more modern name for it, because classification rather than regression is the intended use. AdaBoost can improve the performance of any machine learning algorithm, and it works best with weak learners.
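A minimal AdaBoost sketch with scikit-learn is shown below; by default its weak learner is a one-node decision stump, matching the description above. The synthetic data and parameter values are illustrative assumptions.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The default weak learner is a depth-1 decision stump; each boosting round
# re-weights the samples that the previous stumps misclassified
model = AdaBoostClassifier(n_estimators=100, random_state=0)
print(model.fit(X, y).score(X, y))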
For spam message detection based on feature analysis, data analysis was carried out on the selected dataset. The confusion matrix shows the performance in terms of accuracy when the predictions are compared with the actual classifications in the dataset. Accuracy was used for performance evaluation and was calculated from the confusion matrix, which uses a specific table layout to present the performance.
ACCURACY: The proportion of correct predictions generated on test data is known as accuracy. It is calculated by dividing the number of correct predictions by the total number of predictions. Although it is the most common statistic for model evaluation, it doesn't really give a complete picture of how well the model is doing, and it is least informative when the classes are imbalanced.
CONFUSION MATRIX
A confusion matrix, whose rows and columns correspond to the target classes, is a matrix used to assess the effectiveness of classification models. The matrix compares actual target values with the values predicted by the machine learning model. This gives us a comprehensive picture of how our classification model is operating and the kinds of mistakes it is making.
True Positive (TP): Actual spam messages that were correctly predicted as spam messages.
False Negative (FN): Actual spam messages that were incorrectly classified and detected as legitimate messages.
False Positive (FP): Actual legitimate messages that were incorrectly classified and detected as spam messages.
True Negative (TN): The actual class and the predicted class are the same: actual legitimate messages that were correctly predicted as legitimate messages.
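As a small illustration (with made-up labels, not the project's results), the confusion matrix and accuracy can be computed as follows:

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = ["spam", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

# Rows are the actual classes and columns the predicted classes (order: ham, spam)
print(confusion_matrix(y_true, y_pred, labels=["ham", "spam"]))
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy_score(y_true, y_pred))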
IDE : PyCharm
RAM : 8GB
Finding the optimum solution to meet performance requirements is the goal of a feasibility study. It includes an identification and description of candidate systems, an evaluation of those candidates, and the selection of the best one.
Economic Feasibility
Technical Feasibility
Behavioral Feasibility
Economic analysis is the most popular way of determining whether a proposed system is effective. The process, more popularly known as cost/benefit analysis, entails calculating the expected savings and benefits to see if they outweigh the costs. If they do, the decision is made to design and implement the system; otherwise, further justification or alterations are needed before the enhancement can be approved.
Technical analysis focuses on the existing computer system's capabilities (hardware, software, etc.) to accommodate the planned expansion. Technical advancement also raises financial concerns: the project is deemed unfeasible if funding is a severe constraint.
The strength of the user staff's expected opposition to the introduction of a computerised system should be estimated. Introducing a new system requires extra effort to inform, persuade, and train staff in new ways of thinking about the business, since it is well known that the success of computer installations depends on user understanding and acceptance.
The following list summarises some of the benefits of conducting a feasibility study.
The analysis portion of this study, which is carried out as the first stage of the software development life cycle, assists in thoroughly examining the system requirements.
It aids in determining the risk factors associated with creating and implementing the system.
The feasibility study aids in planning for risk analysis.
Cost-benefit analyses made possible by feasibility studies enable effective operation of the system and the organisation.
Feasibility studies aid in planning the training of the developers who will put the system into place.
1) Whenever a user logs into the system, they must authenticate themselves.
2) In the event of a cyberattack, shut down the system.
3) When a user registers for the first time on a software system, a verification email is
automatically sent to them.
4.4.2. Non-functional requirements
In essence, they are the quality requirements that the system must meet in accordance with
the project contract. Depending on the project, different aspects may be given varying
degrees of priority or implementation. These are also known as non-behavioral requirements.
1) Emails related to such an activity should be sent no more than 12 hours afterwards.
3) If there are more than 10,000 simultaneous users, the website should load in 3
seconds.
5. SYSTEM DESIGN
Sequence Diagram
Collaboration Diagram
State chart Diagram
Component Diagram
Deployment Diagram
5.2.1 GOALS
The following are the primary goals of the UML design:
1. Make available to users a ready-to-use, expressive visual modeling language that enables them to create and share meaningful models.
2. Provide mechanisms for extendibility and specialisation in order to broaden the scope of the core concepts.
3. Refrain from using specific programming languages or development processes.
4. Lay the groundwork for a formal understanding of the modeling language.
5. Encourage the growth of the market for OO tools.
6. Help with the implementation of higher-level development concepts like collaborations, frameworks, patterns, and components.
7. Implement best practices.
5.3. UML NOTATIONS
7. Object: a real-time entity.
8. Message: used to communicate between the lifelines of objects.
9. State: depicts events that occur during an object's lifetime.
10. Initial State: represents the object's initial state.
11. Final State: represents the object's final state.
[Figure: Use case diagram. Actors User and System; use cases Upload Data, View Data, Preprocess, Model Training, Prediction, Generating Result, View Results.]
It is utilised in analysis to show the system's details. The architect examines the class diagram to determine which classes have an excessive number of functions and, if any do, whether they should be divided; the connections between the classes are then made. The class diagram is a tool used by developers to create classes. A class diagram shows a group of related objects that share the same attributes, operations, relationships, and constraints, collectively referred to as semantics; a class is a large group of items in a system.
[Figure: Sequence diagram. User and System lifelines exchanging the messages upload data, pre-processing, model training, prediction, generating results, and view result.]
The software artefacts of a system are the primary emphasis of most UML diagrams, which are used to manage logical components. These two particular diagrams, however, are meant to highlight the hardware and software parts: deployment diagrams are designed to concentrate on a system's hardware topology. System engineers utilise deployment diagrams, whose function can be characterised as follows:
[Figure: Deployment diagram. User and System nodes.]
An activity diagram illustrates the entire control flow; it is similar to a flowchart with specific states. With the activity diagram, you can keep track of the sequence of actions occurring in your system. Activities look like states, but they are a little more rounded. They are stateless in the sense that they take place and then pass immediately to the following state. The "diamond" conditional branch determines which activity to switch to based on a condition and is also stateless. An activity diagram includes
1. Action states.
2. Transition.
3. Objects.
4. Contains Fork, Join and branching relations along with flow Chart symbols.
[Figure: Activity diagram. Upload data, pre-processing, view data, model training, prediction, generating result, view result.]
Component diagrams are used to represent the physical parts of a system from that
perspective. These parts include files, libraries, and other things. A static implementation
view of a system is another way to explain component diagrams. The arrangement of the
components at a specific time is represented by static implementation. The entire system
cannot be represented by a single component diagram; instead, a collection of diagrams is
employed. The component diagram's goal can be summed up as follows:
[Figure: Component diagram. User and System components.]
A state chart diagram shows the transfer of control between states. A state is described as a situation in which an object exists and which changes as a result of an event. Modelling an object's lifetime from creation to termination is the primary goal of a state chart diagram; state chart diagrams are also used in the forward and reverse engineering of a system, though modelling reactive systems is their main purpose.
5.5. DATA FLOW DIAGRAM:
The typical method for representing the information flows inside a system is a data
flow diagram (DFD). A good deal of the system requirements can be graphically represented
by a tidy and understandable DFD. It can be done manually, automatically, or both. It
demonstrates how data enters and exits the system, what modifies data, and where it is stored.
A DFD is used to illustrate the scope and bounds of a system as a whole. It can be
applied as a method for communication between a systems analyst and any participant in the
system that serves as the foundation for system redesign.
Features:
• Simple
• Easy
• Portable
• Object oriented
• High Level
• Open Source and Free
• Support for GUI
• Interpreted
• Dynamic
• Readable
Since Python's beginnings, the idea of a "scripting language" has evolved significantly
because it is now used to create huge, sophisticated systems rather than just simple ones. As
the internet became more widely used, so did this dependency on Python. Python is used by a
vast majority of web platforms and applications, including Google's search engine, YouTube,
and the New York Stock Exchange's web-based transaction system (NYSE). When a
language is used to power a stock exchange system, we know it must be fairly significant.
Python can also be used to solve mathematical problems, display numbers or graphics,
process text, and save data. In essence, it is used in the background to process many elements
that you might need or encounter on your device(s), including mobile.
1) Python may be used to create prototypes, and because it is so simple to use and read, it can
be done rapidly.
2) The majority of platforms for automation, data mining, and big data rely on Python.
3) Compared to large languages like C# and Java, Python offers a more productive coding
environment. By using Python, seasoned programmers tend to stay more organised and
productive.
4) Even if you're not an experienced programmer, Python is simple to read. Everyone can
start using the language; all it needs is some perseverance and lots of practise. Additionally,
this makes it a perfect choice for use by large development teams and teams with multiple
programmers.
5) Django is a complete and open-source web application framework powered by Python. Frameworks like this (similar to Ruby on Rails in other languages) can simplify the software development process.
6) Because it was built by the community and is open source, it has a huge fan base. Millions of
like-minded programmers use the language every day and keep its foundational features up to
date. As time goes on, Python's most recent version continues to get updates and
improvements. This is a fantastic method of connecting with other developers.
Moreover, Python permits the use of modules and packages, allowing for the modular
architecture of programmes and code reuse across numerous projects. After a module or
package has been created, it may be scaled for usage in other applications and is simple to
import or export.
A well-known open-source Python toolbox for data science, data analysis, and machine learning
is called Pandas. It was created using NumPy, another Python library that supports
multidimensional arrays.
NumPy: NumPy is a Python module for working with arrays. It also provides Fourier transforms and linear algebra functions.
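A small illustration of the two libraries working together (the values are arbitrary):

import numpy as np
import pandas as pd

# NumPy holds the raw numeric array; Pandas wraps it with labels for analysis
scores = np.array([[0.91, 0.88], [0.97, 0.95]])
df = pd.DataFrame(scores, columns=["precision", "recall"], index=["NB", "SVM"])
print(df.mean())  # column-wise averages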
Sklearn: Scikit-learn, originally known as scikits.learn, is a free, open-source Python machine learning library. It features approaches for classification, regression, and clustering, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and it offers a variety of supervised and unsupervised learning techniques through a consistent Python interface. It is distributed under a permissive simplified BSD licence, promoting both academic and commercial use, and it is built on SciPy (Scientific Python), which must be installed before scikit-learn can be used.
Modules or extensions for SciPy conventionally go by the moniker SciKits; as a result, the module that offers learning methods is known as scikit-learn. The library is intended to have the robustness and support necessary for use in production systems, which entails a strong emphasis on issues like usability, code quality, teamwork, documentation, and performance.
Pymysql: PyMySQL is a Python library that connects to a MySQL server, allowing Python programmes to communicate with it, with the connection settings (such as host and port) supplied as Python arguments. PyMySQL, a MySQL driver written entirely in Python, was originally created as a rough port of the MySQL-Python driver. PyMySQL is completely open source, hosted on GitHub, distributed via PyPI, and continuously updated, so it satisfies all the requirements for a driver. Because it is written entirely in Python, it is fully compatible with Python 3 and eventlet monkey-patching.
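A minimal PyMySQL connection sketch is shown below; the host, credentials, and database name are placeholders, since the report does not specify an actual database.

import pymysql

# Placeholder connection settings
conn = pymysql.connect(host="localhost", user="root", password="secret", database="spamdb")
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())
conn.close()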
Pycharm
One of the most well-liked Python IDEs is PyCharm. There are many reasons for this,
one of which is that it was created by JetBrains, the company that also created the well-
known IntelliJ IDEA IDE, one of the "big 3" Java IDEs, and WebStorm, the "smartest
JavaScript IDE." Another solid justification is having Django support for web development.
PyCharm was developed primarily for Python programming and runs on different operating systems, including Windows, Linux, and macOS. The IDE includes version control options, a debugger, testing tools, and tools for code analysis. It also helps programmers create Python plugins with the aid of the many available APIs. The IDE enables us to work directly with a number of databases without integrating them with other programmes. Despite being specifically made for Python, this IDE also allows the creation of HTML, CSS, and JavaScript files. It also has a clean user interface that can be customised with plugins to suit one's needs.
app.py
# Imports and app object inferred from the calls used in this listing
from flask import Flask, render_template, request
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

webapp = Flask(__name__)

@webapp.route('/load')  # route name assumed; the original listing is truncated at this point
def load():
    global dataset
    # Load the SMS dataset so that view() below can display it (assumed behaviour)
    dataset = pd.read_csv("spam (1).csv", encoding='latin-1')
    return render_template('load.html')

@webapp.route('/view')
def view():
    return render_template('view.html', columns=dataset.columns.values,
                           rows=dataset.values.tolist())

def preprocess_data(df):
    # Convert text to lowercase and strip surrounding whitespace
    df['Message'] = df['Message'].str.strip().str.lower()
    return df
@webapp.route('/preprocess', methods=['POST', 'GET'])
def preprocess():
    global x, y, x_train, x_test, y_train, y_test, vec
    if request.method == "POST":
        size = int(request.form['split'])
        size = size / 100
        df = pd.read_csv("spam (1).csv", encoding='latin-1')
        df = preprocess_data(df)
        # Split into training and testing data
        x = df['Message']
        y = df['Category']
        x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=size,
                                                            random_state=42)
        # Vectorize text messages into token-count features
        vec = CountVectorizer(stop_words='english')
        x_train = vec.fit_transform(x_train).toarray()
        x_test = vec.transform(x_test).toarray()
        return render_template('preprocess.html', msg='Data Preprocessed and Trained Successfully')
    return render_template('preprocess.html')
@webapp.route('/model', methods=['POST', 'GET'])
def model():
    if request.method == "POST":
        s = int(request.form['algo'])
        if s == 0:
            return render_template('model.html', msg='Please Choose an Algorithm to Train')
        elif s == 1:
            multinomialnb = MultinomialNB()
            multinomialnb.fit(x_train, y_train)
            # Evaluate on the held-out test set
            acc_nb = multinomialnb.score(x_test, y_test) * 100
            msg = 'The accuracy obtained by Naive Bayes Classifier is ' + str(acc_nb) + '%'
            return render_template('model.html', msg=msg)
        elif s == 2:
            linearsvc = LinearSVC()
            linearsvc.fit(x_train, y_train)
            acc_svc = linearsvc.score(x_test, y_test) * 100
            msg = 'The accuracy obtained by Support Vector Classifier is ' + str(acc_svc) + '%'
            return render_template('model.html', msg=msg)
    return render_template('model.html')
@webapp.route('/prediction', methods=['POST', 'GET'])
def prediction():
    global x_train, y_train, vec
    if request.method == "POST":
        f1 = request.form['text']
        # Train a Naive Bayes model on the vectorized training data
        multinomialnb = MultinomialNB()
        multinomialnb.fit(x_train, y_train)
        # Reuse the CountVectorizer fitted in preprocess(); a fresh, unfitted
        # vectorizer cannot transform new text
        result = multinomialnb.predict(vec.transform([f1]))
        if result[0] == 'ham':
            msg = 'This is a Ham Message'
        else:
            msg = 'This is a Spam Message'
        return render_template('prediction.html', msg=msg)
    return render_template('prediction.html')
@webapp.route('/news')
def news():
    return render_template('news.html')

if __name__ == '__main__':
    webapp.run(debug=True)
Prediction.html
<!DOCTYPE html>
<html lang="en">
<head>
<!-- basic -->
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<!-- mobile metas -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="viewport" content="initial-scale=1, maximum-scale=1">
<!-- site metas -->
<title>Spam Message Classification</title>
<link rel="icon" href="static/images/images.jpg" type="image/icon type">
<meta name="keywords" content="">
<meta name="description" content="">
<meta name="author" content="">
<!-- site icons -->
<link rel="icon" href="static/images/fevicon/logo.jpg" type="image/png" />
<!-- bootstrap css -->
<link rel="stylesheet" href="static/css/bootstrap.min.css" />
</div>
</div>
<div class="col-md-3 col-lg-2">
<div class="right_bt"><a class="bt_main"
href="{{url_for('prediction')}}">Prediction</a> </div>
</div>
</div>
</div>
</header>
<div class="overlay"></div>
<div class="gtco-container">
<div class="row">
<div class="col-md-12 col-md-offset-0 text-center">
<div class="display-t">
<div class="display-tc animate-box" data-animate-effect="fadeIn">
<center><h3 style="bottom: 151px;color:rgb(11, 203, 236);top: -
222;">{{msg}}</h3></center>
<h3 style="color:rgb(11, 203, 236);bottom: 115px;">Spam Message
Identification </h3>
</div>
</div>
</div>
</div>
<div class="row step_section">
<div class="offset-xl-1 col-xl-10 col-md-12">
<div class="row">
<div class="col-lg-3 col-md-6 col-sm-12 col-xs-12">
<div class="step_blog arrow_right_step">
<div class="step_inner">
<!-- <i class="fa fa-diamond"></i><br> -->
<!-- <p>Go app store</p> -->
</div>
</div>
</div>
<div class="col-lg-3 col-md-6 col-sm-12 col-xs-12">
<div class="step_blog">
<div class="step_inner">
<!-- <i class="fa fa-user"></i><br> -->
<!-- <p>Create an Account</p> -->
</div>
</div>
</div>
<div class="col-lg-3 col-md-6 col-sm-12 col-xs-12">
<div class="step_blog">
<div class="step_inner">
<!-- <i class="fa fa-download"></i><br> -->
<!-- <p>Download & Install</p> -->
</div>
</div>
</div>
<div class="col-lg-3 col-md-6 col-sm-12 col-xs-12">
<div class="step_blog">
<div class="step_inner">
<!-- <i class="fa fa-thumbs-up"></i><br> -->
<!-- <p>Enjoy & Rate us!</p> -->
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- end section -->
<!-- footer -->
<footer class="footer_style_2">
<div class="footer_top">
<div class="container">
<div class="row">
<div class="col-xs-12 col-sm-6 col-md-6 col-lg-4 margin_bottom_30">
<div class="full width_9" style="margin-bottom:25px;"> <a
href="index.html"><img class="img-responsive" width="250"
src="static/images/LOGO5.png" alt="#"></a> </div>
<div class="full width_9">
<p align = "justify"style="width: 600px;">Spam messages are messages
sent to a large group of recipients without their prior consent, typically advertising for goods
and services or business opportunities.
</p>
</div>
<div class="full width_9">
<!-- <p>the vero eos et accusamus et iusto odio dignissimos ducimus qui
blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias
excepturi sint occaecati..</p> -->
</div>
</div>
<!-- <div class="col-xs-12 col-sm-6 col-md-6 col-lg-3 margin_bottom_30">
<div class="full">
<div class="footer_blog_2 width_9">
<h3>Twitter Feed</h3>
<p><i class="fa fa-twitter"></i> Creative_Talent - 26 mins
Te invitamos a seguir la cta. de WEntrepreneur_ ¡Atrévete!
#Emprendimiento #PyMES #Economía #Bussines #Negocios https://t.co/Y7tZMmxGHn
</p>
<p><i class="fa fa-twitter"></i> Creative_Talent - 26 mins
Te invitamos a seguir la cta. de WEntrepreneur_ ¡Atrévete!
#Emprendimiento #PyMES #Economía #Bussines #Negocios https://t.co/Y7tZMmxGHn
</p>
</div>
</div>
</div> -->
<div class="col-xs-12 col-sm-6 col-md-6 col-lg-2
margin_bottom_30"style="left: 0px;margin-left: 290px;">
<div class="full">
<div class="footer_blog_2">
<h3>Social</h3>
</div>
</div>
<div class="full">
<ul class="footer-links">
<li><a href="#"><i class="fa fa-facebook"></i> 256 Likes</a></li>
<li><a href="#"><i class="fa fa-github"></i> 57+ Projects</a></li>
<li><a href="#"><i class="fa fa-twitter"></i> 1,258 Followers</a></li>
<li><a href="#"><i class="fa fa-pinterest"></i> 2538+ Pins</a></li>
</ul>
</div>
</div>
<div class="col-xs-12 col-sm-6 col-md-6 col-lg-3 margin_bottom_30">
<div class="full">
<div class="footer_blog_2 width_9">
<h3>Blog</h3>
</div>
<div class="blog_post_footer">
<div class="blog_post_img"> <img width="80" height="80"
src="static/images/scr1.png" alt="#"> </div>
<div class="blog_post_cont">
<p class="date">July 22, 2015</p>
<p class="post_head">Round and round like a carousel</p>
</div>
</div>
<div class="blog_post_footer">
<div class="blog_post_img"> <img width="80" height="80"
src="static/images/scr2.png" alt="#"> </div>
<div class="blog_post_cont">
<p class="date">July 22, 2015</p>
<p class="post_head">Round and round like a carousel</p>
</div>
</div>
<div class="blog_post_footer">
<div class="blog_post_img"> <img width="80" height="80"
src="static/images/scr3.png" alt="#"> </div>
<div class="blog_post_cont">
7. SYSTEM TESTING
Software testing is a method for evaluating the quality of software products and
identifying defects so that they can be rectified. Testing strives to accomplish its goals, but it
operates under significant constraints; for testing to be effective, dedication to the set
objectives is required. Typical objectives of testing include:
1. Evaluating the work products, such as user stories, designs, specifications, and code
2. Verifying that all specified conditions are satisfied
3. Ensuring that the test object is complete and meets the expectations of users and
stakeholders
7.1.2. Test Case Design
Black box testing and white box testing are the two main software testing
methodologies. White box testing, also known as structural testing, clear box testing, open
box testing, or transparent box testing, is considered first. It evaluates the software's
underlying code against given inputs and the anticipated, desired outputs, emphasising
analysis of the program's internal structure and internal behaviour. The fundamental goal of
white box testing is to exercise the software's inputs and outputs while also assuring its
security. The phrases "clear box," "white box," and "transparent box" all allude to being able
to see through the exterior covering of the software into its inner workings.
White box testing is performed by the designers and developers, and at this stage every line
of the program's code is exercised. Prior to handing the programme over to the testing team,
the developers run white box testing on it to ensure that it conforms with the requirements
and to identify any mistakes.
Before releasing the project to the testing team, the developer fixes the known issues
and performs one round of white box testing. Here, fixing a problem means removing the
defect and getting the specific functionality of the application working. Test engineers,
however, should not be the ones fixing defects, for the following reasons:
o Resolving a problem might impair other features, so developers should keep making
corrections while the test engineers continue to look for faults.
o If the test engineers spend most of their time fixing problems, they may not be able to find
any new flaws in the programme.
Common white box testing techniques include:
o Path testing
o Loop testing
o Condition testing
o Memory testing
o Testing the performance of the programme
7.1.4. Black Box Testing
In black box testing, any software can be treated as a black box: an Oracle database,
a Google website, the Windows operating system, or even a custom programme. Such
applications are tested by focusing only on their inputs and outputs, without any knowledge
of how their underlying code is implemented.
The levels of testing carried out include:
Unit Testing
Integration Testing
Validation Testing
System Testing
Security Testing
Performance Testing
7.2.1. Unit Testing
The module is the smallest piece of software architecture that is tested as part of unit
testing. Within the constraints of the module, significant control channels are analysed using
the procedural design description as a guide. The smallest testable parts of a programme,
called units, are reviewed separately and independently during unit testing to guarantee
proper operation. This testing process is used by software engineers and, on occasion, QA
staff throughout the development phase. The main objective of unit testing is to test and
validate written code separately to ensure that it operates as intended.
When done correctly, unit testing can help detect coding flaws that would otherwise be
difficult to locate. Test-driven development (TDD), of which unit testing is a core element, is
a practical technique that continually tests and improves the code as the product is developed.
Unit testing is the initial phase of software testing and precedes integration testing and the
other levels of testing. It verifies each unit in isolation from any external code or
functionality. Manual unit testing is still an option, even though automated testing is more
popular.
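As a concrete illustration of unit testing in this project's context, the sketch below exercises the smallest testable behaviour, the spam/ham decision, in isolation. The helper build_classifier, the toy messages, and the label coding (1 = spam, 0 = ham) are assumptions made for the example and are not taken from the project code.

import unittest
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def build_classifier(messages, labels):
    """Fit a tiny Naive Bayes pipeline; illustrative only."""
    vec = CountVectorizer()
    model = MultinomialNB().fit(vec.fit_transform(messages), labels)
    return vec, model

class TestSpamClassifier(unittest.TestCase):
    def setUp(self):
        # Minimal hand-made training data (0 = ham, 1 = spam), assumed coding.
        msgs = ["win a free prize now", "claim your free reward",
                "see you at lunch", "meeting at 5 pm"]
        self.vec, self.model = build_classifier(msgs, [1, 1, 0, 0])

    def test_spam_message_is_flagged(self):
        pred = self.model.predict(self.vec.transform(["free prize waiting, claim now"]))[0]
        self.assertEqual(pred, 1)

    def test_ham_message_is_not_flagged(self):
        pred = self.model.predict(self.vec.transform(["see you at the meeting"]))[0]
        self.assertEqual(pred, 0)

if __name__ == "__main__":
    unittest.main()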
Integration testing is the process of building a program's structure while running tests
to uncover interface problems. Unit-tested modules are combined according to the design to
create the programme structure. Integration testing therefore logically connects software
components and puts them to the test together. A typical software project consists of several
modules written by different programmers, and the goal of this level of testing is to find
issues in how these components interact once they are integrated. The interactions between
these modules are examined during integration testing, which is also referred to as "string
testing" or "thread testing."
Top-Down Integration:
The next step in the testing process is top-down integrations, a method for building
and testing a program's structure progressively. Different modules in a software, product, or
application are integrated by moving downward through the systematic control hierarchy
between the modules, starting with the main control or home control or index programme.
The project's framework includes a variety of breadth- or depth-first activities or modules
related to the primary programme.
Bottom-up Integration:
This approach begins with the construction and testing of atomic modules, the
product's most basic components. Since modules are integrated from the bottom up, the
processing required for modules subordinate to a given level is always available, and there is
no need for stubs.
Validation testing ensures that the software that has been developed and tested
satisfies the client's or user's requirements. The business logic and scenarios behind these
requirements need to be thoroughly tested, and every significant component of the
application must be examined. As a tester, you are given business logic or scenarios that must
always be verifiable. The validation process is one such method that encourages a careful
examination of the application's functioning.
System testing's main goal is to rigorously exercise the computer-based system as a
whole. Even though each test has a distinct purpose, all of them check that every system part
is properly integrated so that the overall objectives are met. Examining the entirely integrated
software system is the core of system testing. A computer system is rarely built from software
alone: the programme is made up of modules that, when placed together with other pieces of
software and hardware, form a complete computer system. In other words, a computer system
is made up of numerous software programmes performing various jobs, and software cannot
carry out these duties on its own.
System testing therefore requires that the appropriate hardware be in place. It is a set
of processes used to verify the overall functionality of a computer system running the
integrated software. The practise of system testing involves examining an application's
end-to-end flow from the viewpoint of a user: each module required by the application is
examined in detail, and the product is tested as a whole to ensure that the final features and
functionality work as planned. Since the testing environment mirrors the production
environment, this is also known as end-to-end testing.
Load testing is the simplest technique for evaluating how well a system will perform
under a particular load. A load test's findings will show how much work is put on the
application server, database, and other systems as well as the importance of key business
transactions. Stress testing is carried out to ascertain the system's maximum capacity and how
it will operate if the present load is greater than the predicted maximum.
Soak tests, often called endurance tests, are used to evaluate a system's performance
under a steady load. During soak testing, memory usage is monitored to identify performance
issues like memory leaks. Monitoring the system's performance over time is the main
objective. When testing during a "spike," the user base is rapidly expanded and the system's
performance is swiftly examined. The main objective is to assess the system's workload
management capabilities.
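As a hedged illustration of how such a load test could be applied to this project's Flask application, the sketch below uses the Locust tool. Locust is not part of the project; the endpoint, host, and sample message are assumptions based on the /prediction route shown earlier.

# locustfile.py (illustrative); run with:
#   locust -f locustfile.py --host http://127.0.0.1:5000
from locust import HttpUser, task, between

class SpamPredictionUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def classify_message(self):
        # Posts a form field named 'text', matching the /prediction route.
        self.client.post("/prediction",
                         data={"text": "Congratulations! You have won a free prize"})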
8. RESULTS
HOME PAGE:
Here the user views the home page of the Spam Message Identification Using Machine
Learning Approach application.
ABOUT PAGE:
UPLOAD PAGE:
VIEW PAGE:
PREPROCESSING PAGE:
Here we prepare our data in a form the system can understand, i.e., we make the data noise
free.
PREDICTION PAGE:
This page shows the prediction result, indicating whether the entered message is SPAM or HAM.
Graph:
After applying the machine learning algorithms, the results are shown in the graph below.
[Figure: Accuracy comparison of Naive Bayes, SVM, XGBoost and AdaBoost; the accuracy axis ranges from 94% to 100%.]
CONCLUSION AND FUTURE ENHANCEMENTS
This study concludes that most of the proposed spam message identification methods
are based on supervised machine learning techniques. Preparing a labeled dataset for
supervised model training is a crucial and time-consuming task. We tested the model on only
one dataset; in future work we will evaluate it on several datasets. The study provides
comprehensive insights into these algorithms and outlines some future research directions for
spam message identification. In the future, we also plan to explore additional models for
predicting whether a message is spam.
[1] J. Han, M. Kamber. Data Mining Concepts and Techniques. by Elsevier inc., Ed: 2nd,
2006
[2] Tiago A. Almeida, José María Gómez, Akebo Yamakami. Contributions to the Study of
SMS Spam Filtering. University of Campinas, Sao Paulo, Brazil.
[3] M. Bilal Junaid, Muddassar Farooq. Using Evolutionary Learning Classifiers To Do
Mobile Spam (SMS) Filtering. National University of Computer & Emerging Sciences
(NUCES) Islamabad, Pakistan.
[4] Inwhee Joe and Hyetaek Shim, "An SMS Spam Filtering System Using Support Vector
Machine," Division of Computer Science and Engineering, Hanyang University, Seoul,
133-791 South Korea.
[5] Xu, Qian, Evan Wei Xiang, Qiang Yang, Jiachun Du, and Jieping Zhong. "Sms spam
detection using noncontent features." IEEE Intelligent Systems 27, no. 6 (2012): 44-51.
Yadav, K., Kumaraguru, P., Goyal, A., Gupta, A., and Naik, V. "SMSAssassin:
Crowdsourcing driven mobile-based system for SMS spam filtering," Proceedings of the
12th Workshop on Mobile Computing Systems and Applications, ACM, 2011, pp. 1-6.
[6] Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification” 2009
First International Workshop on Education Technology and Computer Science, 168-171.
[7] Weka The University of Waikato, Weka 3: Data Mining Software in Java, viewed on
2011 September 14.
[8] Mccallum, A., & Nigam, K. (1998). “A comparison of event models for naive Bayes text
classification”. AAAI-98 Workshop on 'Learning for Text Categorization'
[9] Bayesian Network Classifiers in Weka, viewed on 2011 September 14.
[10] Llora, Xavier, and Josep M. Garrell (2001) Evolution of decision trees, edn., Forth
Catalan Conference on Artificial Intelligence (CCIA2001).
[11] B. G. Becker. Visualizing Decision Table Classifiers. Pages 102- 105, IEEE (1998).
Dogo Rangsang Research Journal UGC Care Group I Journal
ISSN : 2347-7180 Vol-13, Issue-1, No. 2, January 2023
S.Mahammad Rafi Assistant Professor in Department of Computer Science and Engineering, Annamacharya
Institute of Technology and Sciences(Autonomous), Rajampet, Andhra Pradesh, India.
S.Venkata Padma Meghana, D.Sushma, P.Venkata Giridhar, T.Venkata Srinivasulu, A.C.Venkateshwara Reddy
ABSTRACT:
We use some communication means to convey messages digitally. Digital tools allow two or
more persons to coordinate with each other. This communication can be textual, visual, audio, and
written. Smart devices including cell phones are the major sources of communication these days.
Intensive communication through SMS is also giving rise to spamming. Unwanted text messages
are junk information received on our devices; many companies promote their products or services
by sending such unwelcome spam texts, and spam messages often outnumber genuine ones. In this
paper, we give a short overview of SMS text classification and spam filtering techniques that
segregate messages accordingly, and we apply several classification methods together with machine
learning algorithms to identify whether an SMS is spam or not. For this purpose, we compared
different classification methods on a collected dataset using the Weka tool.
I. INTRODUCTION:
In five years, there will be 3.8 billion mobile phone (smartphone) users, up from 1 billion. China, India,
and the US are the top three countries in terms of mobile usage. Short Message Service, sometimes
known as SMS, is a text messaging service that has been around for a while. You can use SMS
services even without an internet connection. SMS service is thus accessible on both smartphones
and low-end mobile devices. Although there are numerous text messaging apps on smart phones,
such as WhatsApp, this service can only be used online. However, SMS is available at all times.
Consequently, the need for SMS services is growing daily.
Tiago A. Almeida, José María Gómez, Akebo Yamakami. Contributions to the Study of SMS
Spam Filtering. University of Campinas, Sao Paulo, Brazil.
The number of mobile phone users has increased, which has resulted in a sharp rise in SMS spam
messages. The lower SMS usage rate, which has allowed many users and service providers to
We present the largest genuine, public, and unencrypted SMS spam collection we are aware of in this
project. Additionally, we contrast the results of various tried-and-true machine learning techniques.
The findings show that Support Vector Machine performs better than the other assessed classifiers,
making it an excellent starting point for future comparison.
Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification” 2009 First
International Workshop on Education Technology and Computer Science, 168-171.
This work suggests a dual-filtering method for messages. The KNN classification algorithm and
rough set theory are combined to first separate spam messages from other messages; some messages
must then be re-filtered with the KNN classifier so that the reduction step does not lower precision.
Built on a rough-set reduction of the KNN classification algorithm, the method not only increases
classification speed but also maintains excellent accuracy.
Inwhee Joe and Hyetaek Shim, "An SMS Spam Filtering System Using Support Vector
Machine," Division of Computer Science and Engineering, Hanyang University, Seoul, 133-791
South Korea.
An SMS spam filtering system based on SVM (Support Vector Machine) and thesaurus is presented
in this paper (Short Messaging Service). The system identifies words from sample data using a pre-
processing tool, integrates their meanings using a thesaurus, derives features of integrated words
using chi-square statistics, and then analyses these characteristics. The system has been tried out, and
it works well in a Windows context
B. G. Becker. Visualizing Decision Table Classifiers. Pages 102- 105, IEEE (1998).
Decision trees, decision networks, and decision tables are all categorization models used for
forecasting. Machine learning algorithms produce these. A decision table is made up of a hierarchy
of tables where each entry is split down into its component parts by the values of two more
characteristics to create a new table. Dimensional stacking is comparable to the structure [4]. Here, a
visualisation technique is shown that enables even non-experts in machine learning to comprehend a
decision table classifier.
Inwhee Joe and Hyetaek Shim, "An SMS Spam Filtering System Using Support Vector
Machine," Division of Computer Science and Engineering, Hanyang University, Seoul, 133-791
South Korea:
This project describes a powerful and adaptive spam filtering system for SMS (Short
Messaging Service) that uses SVM (Support Vector Machine) and a thesaurus. The system isolates
words from sample data using a pre-processing device and integrates meanings of isolated words
using a thesaurus, generates features of integrated words through chi-square statistics, and studies
these features. The system is realized in a Windows environment and its performance is
experimentally confirmed.
The first phase is feature selection, in which the features needed for classification are extracted
after a collection of documents has been indexed. The second phase is the decision process that
chooses the right category based on the result of the first phase. Automatic document classification
thus gains the ability to assign the correct category automatically through a machine learning process.
For this process, specific words are tagged in the set of learning documents. These words represent
the documents, and feature extraction is the batch job of selecting the words that appear in the
learning documents. However, if every word in the learning documents is selected as a feature,
classification takes too much time and loses accuracy. To prevent this, an information weight is
calculated for each word and only the highest-weighted words are selected as features for automatic
classification. In text categorization we deal with very large feature spaces, which is why a feature
selection mechanism is needed. The most popular feature selection methods are document frequency
thresholding (DF), the chi-square statistic (CHI), term strength (TS), information gain (IG), and
mutual information.
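As a hedged illustration of one of these methods, the sketch below applies the chi-square (CHI) statistic with scikit-learn's SelectKBest to retain only the most informative terms. The toy corpus, the label coding (1 = spam, 0 = ham), and the choice of k are assumptions made for demonstration, not the project's actual configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus with assumed labels (1 = spam, 0 = ham); illustrative only.
messages = ["win a free prize now", "free entry in a weekly draw",
            "are we still meeting for lunch", "call me when you reach home"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(messages)

# Keep the k terms whose chi-square score against the class label is highest.
selector = SelectKBest(chi2, k=5)
x_reduced = selector.fit_transform(x, labels)

selected_terms = vectorizer.get_feature_names_out()[selector.get_support()]
print(selected_terms)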
In this review, data mining is used to carry out machine learning by employing various classifiers
for training and testing, together with filters for data preprocessing and feature selection. The aim
is to find the ideal combination model with higher accuracy, or a better result on other evaluation
metrics. To date, several evaluation studies have been carried out using data mining techniques.
So far, much effort has concentrated on single classifiers. However, spammers keep changing their
strategies to evade spam filters [18]. Therefore, in this study, we focus on an overall framework for
managing SMS spam using data mining techniques. Questions such as whether a combined (hybrid)
model gives better accuracy than any single classifier used for spam identification will be examined
through experimentation.
III. METHODOLOGY:
[Figure: Workflow of the proposed system - the input text is passed to the Naive Bayes classifier, which predicts whether the message is Spam or Ham.]
A. Dataset
In this model, we've combined data sets that we've created with spam datasets that we've obtained
from several online resources, including Kaggle. We test our model using 30% of the Kaggle
spam dataset, and we train our model using the remaining 70%. Data from spam and genuine
messages are included in the dataset.
B. Data preprocessing
The steps involved in data preprocessing include cleaning, instance selection, feature extraction,
normalization, transformation, etc. The training dataset as a whole is the end outcome of data
preprocessing. How data is pre-processed could have an impact on how the final results are
understood. Filling in the gaps in the data, reducing noise, identifying and eliminating outliers,
and resolving incompatibilities are all steps in the data cleaning process. The addition of certain
databases or data sets may be accomplished through a technique called data integration. When
collecting and normalizing data to measure a certain set of data, data transformation is taking
place.
C. Train-test split
The dataset is divided into two subsets, a training set and a testing set, so that a model trained on
the training data can be used to detect spam messages in the testing data. 30% of the data is
reserved for the testing set and the remaining 70% is used to train the model.
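A minimal sketch of this 70/30 split (including vectorizing the cleaned messages beforehand) using scikit-learn is shown below. The toy messages, the label coding (1 = spam, 0 = ham), and the random_state value are assumptions added for illustration and reproducibility, not the project's actual data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy messages and assumed labels (1 = spam, 0 = ham); illustrative only.
messages = ["win a free prize now", "claim your free reward today",
            "urgent cash prize waiting", "see you at lunch tomorrow",
            "meeting moved to monday", "call me when you get home"]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(messages)       # bag-of-words matrix

# 70% of the samples for training, 30% held out for testing.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.30, random_state=42, stratify=labels)
print(x_train.shape[0], "training samples,", x_test.shape[0], "test samples")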
IV. ALGORITHMS:
Support Vector Machine (SVM):
Support Vector Machine is a supervised machine learning algorithm used for classification and
regression analysis, although it is mostly used for classification. Each data item is plotted as a point
in feature space, with the value of each feature being the value of the corresponding coordinate; we
then find the ideal hyperplane that differentiates the two classes. The SVM thus represents the
training data as points in space, mapped so that the two categories are separated by a gap that is as
wide as possible. It is effective and efficient in high-dimensional spaces and uses only a subset of
the training points (the support vectors) in the decision function, which is why it is also known for
its memory efficiency. The algorithm can indirectly provide probability estimates; these are
calculated using five-fold cross-validation.
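In scikit-learn, the probability estimates mentioned above correspond to enabling Platt scaling on the SVC, which is fitted internally with five-fold cross-validation. The snippet below is a hedged sketch on toy data (message list and label coding are assumptions), not the configuration used in the reported experiments.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

messages = ["win a free prize", "claim free cash now", "urgent prize waiting",
            "free entry in the draw", "lunch tomorrow?", "see you at home",
            "meeting at five", "call me tonight"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]          # 1 = spam, 0 = ham (assumed coding)

vec = CountVectorizer()
x = vec.fit_transform(messages)

# probability=True turns on Platt scaling, which scikit-learn fits with an
# internal five-fold cross-validation, matching the behaviour described above.
svm = SVC(kernel="linear", probability=True).fit(x, labels)
print(svm.predict_proba(vec.transform(["free prize now"])))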
Naive Bayes:
Naive Bayes is a classification technique based on Bayes' Theorem with the presumption of
independence among predictors; it is used to predict the class of a data item and can also perform
multi-class prediction. If the independence assumption holds, Naive Bayes is more capable than
other algorithms such as logistic regression, and it requires less training data for classification. The
Naive Bayes classifier works efficiently in real-world situations such as document classification and
spam filtering, although it is known to be a poor probability estimator. It is an easy and quick
technique.
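Concretely, under the independence assumption the classifier compares, for each class c (spam or ham), the posterior probability of c given the words w1, ..., wn of the message. This is a standard restatement of Bayes' theorem rather than a formula taken from the paper:

P(c \mid w_1, \ldots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)

The message is labelled spam when the spam posterior is the larger of the two.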
XGBOOST:
XGBoost stands for eXtreme Gradient Boosting. It is an implementation of gradient-boosted decision
trees designed for speed and performance. Boosting is an ensemble learning method in which new
models are added to correct the errors made by the models already in the ensemble; models are added
consecutively until no further improvement can be made, and a gradient descent technique is used to
minimise the loss when each new model is added. The algorithm is engineered for efficient
computation time and memory usage, with the design goal of making the best use of the available
resources to train the model. Execution speed and model performance are the two main reasons to
work with XGBoost, and the approach supports both classification and regression.
ADABOOST:
AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used as an Ensemble
Method in Machine Learning. It is called Adaptive Boosting as the weights are re-assigned to each
instance, with higher weights assigned to incorrectly classified instances. Boosting is used to reduce
bias as well as variance for supervised learning. It works on the principle of learners growing
sequentially. Except for the first, each subsequent learner is grown from previously grown learners.
In simple words, weak learners are converted into strong ones. The AdaBoost algorithm works on
the same principle as boosting with a slight difference.
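To make the comparison among the four classifiers concrete, the hedged sketch below trains all of them on the same bag-of-words features and prints their test accuracies. The toy messages, the label coding (1 = spam, 0 = ham), and the split are assumptions for illustration; the accuracies shown in the Results come from the Kaggle dataset described in the Methodology.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

messages = ["win a free prize now", "claim free cash today", "urgent prize waiting",
            "free entry in the weekly draw", "see you at lunch", "meeting moved to monday",
            "call me when you get home", "are we still on for dinner"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]        # 1 = spam, 0 = ham (assumed coding)

x = CountVectorizer().fit_transform(messages)
x_train, x_test, y_train, y_test = train_test_split(
    x, labels, test_size=0.3, random_state=0, stratify=labels)

# Train each classifier on the same features and report test accuracy.
models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "XGBoost": XGBClassifier(n_estimators=100, learning_rate=0.1),
    "AdaBoost": AdaBoostClassifier(n_estimators=50),
}
for name, model in models.items():
    model.fit(x_train, y_train)
    print(name, round(model.score(x_test, y_test) * 100, 2), "%")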
V. RESULTS:
[Figure: Accuracy comparison of Naive Bayes, SVM, XGBoost and AdaBoost; the accuracy axis ranges from 94% to 100%.]
VI. CONCLUSION:
We presented a spam message categorization utilising several algorithms, namely Naive Bayes,
Support Vector Machine, XGBoost and AdaBoost. For the assessment of the spam dataset with the
Weka tool, two evaluation techniques are employed: cross-validation and evaluation on the training
set. With the training-set option the same data is utilised for both training and testing, while for
cross-validation the training data is separated into several folds. Following implementation and
experimental analysis, the classifier evaluated on the training set provides the better accuracy.
Overall, the Support Vector Machine is the strategy that produces the best results for spam message
categorization. We tested this model on just one dataset; future tests of our model on various
datasets are planned.
VII. REFERENCES
[1] J. Han, M. Kamber. Data Mining Concepts and Techniques. by Elsevier inc., Ed: 2nd, 2006
[2] Tiago A. Almeida, José María Gómez, Akebo Yamakami. Contributions to the Study of SMS
Spam Filtering. University of Campinas, Sao Paulo, Brazil.
[3] M. Bilal Junaid, Muddassar Farooq. Using Evolutionary Learning Classifiers To Do Mobile
Spam (SMS) Filtering. National University of Computer & Emerging Sciences (NUCES)
Islamabad, Pakistan.
[4] Inwhee Joe and Hyetaek Shim, "An SMS Spam Filtering System Using Support Vector
Machine," Division of Computer Science and Engineering, Hanyang University, Seoul, 133-791
South Korea.
[5] Xu, Qian, Evan Wei Xiang, Qiang Yang, Jiachun Du, and Jieping Zhong. "Sms spam detection
using noncontent features." IEEE Intelligent Systems 27, no. 6 (2012): 44-51.
[6] Yadav, K., Kumaraguru, P., Goyal, A., Gupta, A., and Naik, V. "SMSAssassin: Crowdsourcing
driven mobile-based system for SMS spam filtering," Proceedings of the 12th Workshop on
Mobile Computing Systems and Applications, ACM, 2011, pp. 1-6.
[7] Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification” 2009 First
International Workshop on Education Technology and Computer Science, 168-171.
[8] Weka The University of Waikato, Weka 3: Data Mining Software in Java, viewed on 2011
September 14.
[9] Mccallum, A., & Nigam, K. (1998). “A comparison of event models for naive Bayes text
classification”. AAAI-98 Workshop on 'Learning for Text Categorization'
[10] Bayesian Network Classifiers in Weka, viewed on 2011 September 14.
[11] Llora, Xavier, and Josep M. Garrell (2001) Evolution of decision trees, edn., Forth Catalan
Conference on Artificial Intelligence (CCIA2001).
[12] B. G. Becker. Visualizing Decision Table Classifiers. Pages 102- 105, IEEE (1998). A. Bantukul
and P. J. Marsico, ‘‘Methods, systems, and computer program products for short message service
(SMS) spam filtering using E-mail spam filtering resources,’’ U.S. Patent 7 751 836 B2, Jul. 6,
2010.
[13] H.-Y. Chou and N.-H. Lien, ‘‘Effects of SMS teaser ads on product curiosity,’’ Int. J. Mobile
Commun., vol. 12, no. 4, pp. 328–345, Jul. 2014.
[14] N. Jindal and B. Liu, ‘‘Review spam detection,’’ in Proc. 16th Int. Conf. World Wide Web, 2007,
pp. 1189–1190.
[15] M. Jiang, P. Cui, and C. Faloutsos, ‘‘Suspicious behavior detection: Current trends and future
directions,’’ IEEE Intell. Syst., vol. 31, no. 1, pp. 31–39, Jan./Feb. 2016.
Authored By
S.Mahammad Rafi
Assistant Professor in Department of Computer Science and Engineering, Annamacharya Institute
of Technology and Sciences(Autonomous), Rajampet, Andhra Pradesh, India.
Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled
Authored By
Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled
Authored By
D.Sushma,
Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.
Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled
Authored By
P.Venkata Giridhar,
Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.
Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled
Authored By
T.Venkata Srinivasulu,
Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.
Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled
Authored By
A.C.Venkateshwara Reddy
Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.
Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal