Social Bot

[DETECTION OF MALICIOUS SOCIAL BOTS]
ABSTRACT
Malicious social bots generate fake tweets and automate their social relationships either by
pretending to be followers or by creating multiple fake accounts that carry out malicious activities.
Moreover, malicious social bots post shortened malicious URLs in the tweet in order to
redirect the requests of online social networking participants to some malicious servers. Hence,
distinguishing malicious social bots from legitimate users is one of the most important tasks
in the Twitter network. To detect malicious social bots, extracting URL-based features (such
as URL redirection, frequency of shared URLs, and spam content in URLs) takes less
time than extracting social graph-based features (which rely on the social
interactions of users). Furthermore, malicious social bots cannot easily manipulate URL
redirection chains. In this article, a learning automata-based malicious social bot detection
(LA-MSBD) algorithm is proposed by integrating a trust computation model with URL-based
features for identifying trustworthy participants (users) in the Twitter network. The proposed
trust computation model contains two parameters, namely, direct trust and indirect trust.
Moreover, the direct trust is derived from Bayes’ theorem, and the indirect trust is derived from
the Dempster–Shafer theory (DST) to determine the trustworthiness of each participant
accurately. Experimentation has been performed on two Twitter data sets, and the results
illustrate that the proposed algorithm achieves improvement in precision, recall, F-measure,
and accuracy compared with existing approaches for MSBD.
DETECTION OF MALICIOUS SOCIAL BOTS USING MACHINE LEARNING
INTRODUCTION:
What is Machine Learning?
Machine learning is a family of computer algorithms that learn from examples and improve
with experience without being explicitly coded by a programmer. Machine learning is a part
of artificial intelligence that combines data with statistical tools to predict an output, which
can then be turned into actionable insights.
The breakthrough comes with the idea that a machine can learn on its own from the data (i.e.,
examples) to produce accurate results. Machine learning is closely related to data mining and
Bayesian predictive modelling. The machine receives data as input and uses an algorithm to
formulate answers.
A typical machine learning task is to provide a recommendation. For those who have a
Netflix account, all recommendations of movies or series are based on the user's historical
data. Tech companies are using unsupervised learning to improve the user experience with
personalized recommendations.
Machine learning is also used for a variety of tasks like fraud detection, predictive
maintenance, portfolio optimization, task automation and so on.
Machine Learning vs. Traditional Programming:

Traditional programming differs significantly from machine learning. In traditional
programming, a programmer codes all the rules in consultation with an expert in the industry
for which the software is being developed. Each rule is based on a logical foundation; the machine
produces an output by following the logical statements. When the system grows complex, more
rules need to be written, and the rule set can quickly become unsustainable to maintain.
Machine learning is supposed to overcome this issue. The machine learns how the input and
output data are correlated and derives the rules itself. The programmers do not need to write new rules
each time there is new data. The algorithms adapt in response to new data and experiences to
improve efficacy over time.
How does Machine Learning Work?

Machine learning is the brain where all the learning takes place. The way the machine learns
is similar to the way a human being learns. Humans learn from experience. The more we know, the more
easily we can predict. By analogy, when we face an unknown situation, the likelihood of
success is lower than in a known situation. Machines are trained the same way. To make an accurate
prediction, the machine first sees examples. When we give the machine a similar example, it can
figure out the outcome. However, like a human, if it is fed a previously unseen kind of example, the
machine has difficulty predicting.
The core objectives of machine learning are learning and inference. First of all, the machine
learns through the discovery of patterns. This discovery is made thanks to the data. One crucial
part of the data scientist's job is to choose carefully which data to provide to the machine. The list
of attributes used to solve a problem is called a feature vector. You can think of a feature
vector as a subset of data that is used to tackle a problem.
The machine uses algorithms to simplify reality and transform this discovery
into a model. Therefore, the learning stage is used to describe the data and summarize it into
a model.
For instance, suppose the machine is trying to understand the relationship between the wage of an
individual and the likelihood of going to a fancy restaurant. It turns out the machine finds a
positive relationship between wage and going to a high-end restaurant: this is the model.
Inferring
When the model is built, it is possible to test how powerful it is on never-seen-before data. The
new data are transformed into a feature vector, go through the model and give a prediction.
This is the beautiful part of machine learning. There is no need to update the rules or train
the model again. You can use the previously trained model to make inferences on new data.
The life cycle of a machine learning program is straightforward and can be summarized in the
following points:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback
7. Refine the algorithm
8. Repeat steps 4-7 until the results are satisfactory
9. Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies that knowledge to
new sets of data.
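The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy data set, the one-feature threshold classifier, and the helper names `train`, `predict` and `accuracy` are all made up for the example and do not come from any particular library.

```python
# Hypothetical toy data: (feature value, label) pairs.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

# Collect data, then hold part of it back for testing.
train_set = data[0::2]   # examples at even positions
test_set = data[1::2]    # examples at odd positions

def train(examples):
    """Train: place a threshold halfway between the two class means."""
    class0 = [x for x, y in examples if y == 0]
    class1 = [x for x, y in examples if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, x):
    """Predict label 1 when the feature exceeds the learned threshold."""
    return 1 if x > threshold else 0

def accuracy(threshold, examples):
    """Test: fraction of examples whose label is predicted correctly."""
    correct = sum(1 for x, y in examples if predict(threshold, x) == y)
    return correct / len(examples)

threshold = train(train_set)                   # step 4
test_accuracy = accuracy(threshold, test_set)  # step 5
new_prediction = predict(threshold, 5.0)       # step 9, on unseen input
print(threshold, test_accuracy, new_prediction)
```

In a real project, steps 6 to 8 would feed the test results back into the choice of features and algorithm before the model is used for prediction.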
Machine Learning Algorithms and Where they are Used?
Machine learning can be grouped into two broad learning tasks: supervised and unsupervised.
Many algorithms exist within each of them.
Supervised learning
An algorithm uses training data and feedback from humans to learn the relationship of given
inputs to a given output. For instance, a practitioner can use marketing expense and weather
forecast as input data to predict the sales of cans.
You can use supervised learning when the output data is known. The algorithm will then predict
the output for new data.
There are two categories of supervised learning:
• Classification task
• Regression task
Classification
Imagine you want to predict the gender of a customer for a commercial. You will start
gathering data on the height, weight, job, salary, purchasing basket, etc. from your customer
database. You know the gender of each of your customers, which can only be male or female. The
objective of the classifier will be to assign a probability of being male or female (i.e., the
label) based on the information (i.e., the features you have collected). Once the model has learned
how to recognize male or female, you can use new data to make a prediction. For instance,
you just got new information from an unknown customer, and you want to know whether the customer is male
or female. If the classifier predicts male = 70%, it means the algorithm is 70% sure that this
customer is male and 30% sure that the customer is female.
The label can have two or more classes. The above machine learning example has only two
classes, but if a classifier needs to predict objects, it can have dozens of classes (e.g., glass, table,
shoes, etc., where each object represents a class).
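As a rough sketch of how a classifier can output such a probability, the following Python code fits a one-feature logistic regression by gradient descent. The heights and labels are invented for illustration, and the choice of logistic regression, the learning rate and the `predict_proba` helper are assumptions of this example rather than anything prescribed by the text.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training data: one feature (height in cm) and a binary
# label (1 = male, 0 = female). Real data would have many more features.
heights = [150, 155, 160, 165, 175, 180, 185, 190]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Centre and scale the feature so that gradient descent behaves well.
mean_height = sum(heights) / len(heights)
xs = [(h - mean_height) / 10.0 for h in heights]

# Fit weight w and bias b by gradient descent on the logistic loss.
w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(1000):
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
    w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    b -= learning_rate * sum(errors) / len(xs)

def predict_proba(height):
    """Return (P(female), P(male)) for a new, unlabelled customer."""
    p_male = sigmoid(w * (height - mean_height) / 10.0 + b)
    return 1.0 - p_male, p_male

p_female, p_male = predict_proba(182)
print(f"male = {p_male:.0%}, female = {p_female:.0%}")
```

The two probabilities always sum to one, which is why a prediction of male = 70% implies female = 30% in the two-class case.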
Regression
When the output is a continuous value, the task is a regression. For instance, a financial analyst
may need to forecast the value of a stock based on a range of features such as equity, previous stock
performance and macroeconomic indices. The system will be trained to estimate the price of the
stocks with the lowest possible error.
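One common way to make "lowest possible error" concrete is ordinary least squares, which picks the line that minimizes the sum of squared errors. The sketch below assumes a single feature and made-up numbers; the variable names `past_performance` and `stock_value` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: one input feature and the continuous value to predict.
past_performance = [1.0, 2.0, 3.0, 4.0, 5.0]
stock_value = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(past_performance, stock_value)
forecast = slope * 6.0 + intercept  # prediction for an unseen input
print(slope, intercept, forecast)
```

With several features, the same idea generalizes to multiple linear regression, which is usually solved with a numerical library rather than by hand.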
Algorithm Name | Description | Type
Linear regression | Finds a way to correlate each feature to the output to help predict future values. | Regression
Logistic regression | Extension of linear regression that's used for classification tasks. The output variable is binary (e.g., only black or white) rather than continuous (e.g., an infinite list of potential colours). | Classification
Decision tree | Highly interpretable classification or regression model that splits data-feature values into branches at decision nodes (e.g., if a feature is a colour, each possible colour becomes a new branch) until a final decision output is made. | Regression, Classification
Naive Bayes | A classification method that makes use of Bayes' theorem. The theorem updates the prior knowledge of an event with the independent probability of each feature that can affect the event. | Regression, Classification
Support vector machine | Support Vector Machine, or SVM, is typically used for the classification task. The SVM algorithm finds a hyperplane that optimally divides the classes. It is best used with a non-linear solver. | Regression (not very common), Classification
Random forest | The algorithm is built upon a decision tree to improve the accuracy drastically. Random forest builds many simple decision trees and uses the 'majority vote' method to decide on which label to return. For the classification task, the final prediction will be the one with the most votes; for the regression task, the average prediction of all the trees is the final prediction. | Regression, Classification
AdaBoost | Classification or regression technique that uses a multitude of models to come up with a decision but weighs them based on their accuracy in predicting the outcome. | Regression, Classification
Gradient-boosting trees | Gradient-boosting trees is a state-of-the-art classification/regression technique. It focuses on the error committed by the previous trees and tries to correct it. | Regression, Classification
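The way a random forest combines its trees, as described in the table, can be sketched as follows. Only the aggregation step is shown; the per-tree outputs are hypothetical values standing in for the predictions of already trained decision trees.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Classification: return the label predicted by the most trees."""
    return Counter(labels).most_common(1)[0][0]

def average_prediction(values):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(values)

# Hypothetical outputs of five trees for one input.
tree_labels = ["spam", "ham", "spam", "spam", "ham"]
tree_values = [10.0, 12.0, 11.0, 13.0, 14.0]

final_label = majority_vote(tree_labels)
final_value = average_prediction(tree_values)
print(final_label, final_value)
```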
Unsupervised learning
In unsupervised learning, an algorithm explores input data without being given an explicit
output variable (e.g., explores customer demographic data to identify patterns).
You can use it when you do not know how to classify the data, and you want the algorithm to
find patterns and classify the data for you.
Algorithm | Description | Type
K-means clustering | Puts data into some groups (k) that each contain data with similar characteristics (as determined by the model, not in advance by humans). | Clustering
Gaussian mixture model | A generalization of k-means clustering that provides more flexibility in the size and shape of groups (clusters). | Clustering
Hierarchical clustering | Splits clusters along a hierarchical tree to form a classification system. Can be used, for example, to cluster loyalty-card customers. | Clustering
Recommender system | Helps to define the relevant data for making a recommendation. | Clustering
PCA/t-SNE | Mostly used to decrease the dimensionality of the data. The algorithms reduce the number of features to 3 or 4 vectors with the highest variances. | Dimension Reduction
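To make the k-means row more concrete, here is a minimal sketch of the usual assign-then-update iteration in plain Python. The points, the fixed starting centroids and the fixed iteration count are assumptions chosen to keep the example deterministic; practical implementations choose starting centroids more carefully and stop when the assignments no longer change.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        centroids = updated
    return labels, centroids

# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels, centroids)
```

No labels are supplied anywhere: the grouping comes only from the distances between the observations, which is what distinguishes this from the supervised examples.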
How to Choose Machine Learning Algorithm
Machine Learning (ML) algorithm:

There are plenty of machine learning algorithms. The choice of the algorithm is based on the
objective.
In the following machine learning example, the task is to predict the type of flower among
three varieties. The predictions are based on the length and the width of the petal. The picture
depicts the results of ten different algorithms. The picture on the top left is the dataset. The
data is classified into three categories: red, light blue and dark blue. There are some groupings.
For instance, in the second image, everything in the upper left belongs to the red category;
in the middle part, there is a mixture of uncertainty and light blue, while the bottom corresponds
to the dark blue category. The other images show different algorithms and how they try to classify
the data.
Challenges and Limitations of Machine Learning

The primary challenge of machine learning is a lack of data or a lack of diversity in the dataset. A
machine cannot learn if there is no data available. Besides, a dataset with a lack of diversity
gives the machine a hard time. A machine needs heterogeneous data to learn meaningful
insights. It is rare that an algorithm can extract information when there are no or few variations.
It is recommended to have at least 20 observations per group to help the machine learn. Failing to meet this
constraint leads to poor evaluation and prediction.
Application of Machine Learning :

Augmentation:
• Machine learning that assists humans with their day-to-day tasks, personally or
commercially, without having complete control of the output. Such machine learning is
used in different ways, such as virtual assistants, data analysis and software solutions. The
primary use is to reduce errors due to human bias.
Automation:
• Machine learning that works entirely autonomously in any field without the need for
any human intervention. For example, robots performing the essential process steps in
manufacturing plants.
Finance Industry
• Machine learning is growing in popularity in the finance industry. Banks are mainly
using ML to find patterns inside the data but also to prevent fraud.
Government organization
• The government makes use of ML to manage public safety and utilities. Take the
example of China, with its large-scale face recognition. The government uses artificial
intelligence to deter jaywalking.
Healthcare industry
• Healthcare was one of the first industries to use machine learning, with image detection.
Marketing
• AI is widely used in marketing thanks to abundant access to data. Before the age
of mass data, researchers developed advanced mathematical tools like Bayesian analysis
to estimate the value of a customer. With the boom of data, marketing departments rely
on AI to optimize customer relationships and marketing campaigns.
Example of application of Machine Learning in Supply Chain

Machine learning gives terrific results for visual pattern recognition, opening up many
potential applications in physical inspection and maintenance across the entire supply chain
network.
Unsupervised learning can quickly search for comparable patterns in a diverse dataset. In
turn, the machine can perform quality inspection throughout the logistics hub, identifying shipments with
damage and wear.
For instance, IBM's Watson platform can determine shipping container damage. Watson
combines visual and systems-based data to track, report and make recommendations in real time.
In the past, stock managers relied extensively on basic methods to evaluate and forecast
inventory. By combining big data and machine learning, better forecasting techniques
have been implemented (an improvement of 20 to 30% over traditional forecasting tools). In
terms of sales, this means an increase of 2 to 3% due to the potential reduction in inventory costs.
Example of Machine Learning Google Car
A well-known example is the Google car. The car is fitted with lasers on the roof, which
tell it where it is relative to the surrounding area. It has radar in the front, which informs
the car of the speed and motion of all the cars around it. It uses all of that data to figure out not
only how to drive the car but also to predict what the drivers around the
car are likely to do. What's impressive is that the car is processing almost a gigabyte of data
per second.
Why is Machine Learning Important?

Machine learning is the best tool so far to analyse, understand and identify patterns in
data. One of the main ideas behind machine learning is that the computer can be trained to
automate tasks that would be exhausting or impossible for a human being. The clear break
from traditional analysis is that machine learning can make decisions with minimal human
intervention.
Take the following example: a real estate agent can estimate the price of a house
based on his own experience and his knowledge of the market.
A machine can be trained to translate the knowledge of an expert into features. The features
are all the characteristics of a house, neighbourhood, economic environment, etc. that make
the price difference. It probably took the expert some years to master the art of
estimating the price of a house. His expertise gets better and better after each sale.
For the machine, it takes millions of data points (i.e., examples) to master this art. At the very
beginning of its learning, the machine makes mistakes, somewhat like a junior salesman.
Once the machine has seen all the examples, it has enough knowledge to make its own estimates, and
with incredible accuracy. The machine is also able to adjust for its mistakes accordingly.
Most big companies have understood the value of machine learning and of holding data.
McKinsey has estimated that the value of analytics ranges from $9.5 trillion to $15.4 trillion,
while $5 trillion to $7 trillion can be attributed to the most advanced AI techniques.
Machine learning (ML) is the study of computer algorithms that improve automatically
through experience. It is seen as a part of artificial intelligence. Machine learning algorithms
build a model based on sample data, known as "training data", in order to make predictions or
decisions without being explicitly programmed to do so. Machine learning algorithms are used
in a wide variety of applications, such as email filtering and computer vision, where it is
difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
A subset of machine learning is closely related to computational statistics, which focuses on
making predictions using computers; but not all machine learning is statistical learning. The
study of mathematical optimization delivers methods, theory and application domains to the
field of machine learning. Data mining is a related field of study, focusing on exploratory data
analysis through unsupervised learning. In its application across business problems, machine
learning is also referred to as predictive analytics.
Overview
Machine learning involves computers discovering how they can perform tasks without being
explicitly programmed to do so. It involves computers learning from data provided so that they
carry out certain tasks. For simple tasks assigned to computers, it is possible to program
algorithms telling the machine how to execute all steps required to solve the problem at hand;
on the computer's part, no learning is needed. For more advanced tasks, it can be challenging
for a human to manually create the needed algorithms. In practice, it can turn out to be more
effective to help the machine develop its own algorithm, rather than having human
programmers specify every needed step.
The discipline of machine learning employs various approaches to teach computers to
accomplish tasks where no fully satisfactory algorithm is available. In cases where vast
numbers of potential answers exist, one approach is to label some of the correct answers as
valid. This can then be used as training data for the computer to improve the algorithm(s) it
uses to determine correct answers. For example, to train a system for the task of digital
character recognition, the MNIST dataset of handwritten digits has often been used.
Machine learning approaches

Machine learning approaches are traditionally divided into three broad categories, depending
on the nature of the "signal" or "feedback" available to the learning system:
Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own
to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).
Reinforcement learning: A computer program interacts with a dynamic environment in
which it must perform a certain goal (such as driving a vehicle or playing a game against an
opponent). As it navigates its problem space, the program is provided feedback that's
analogous to rewards, which it tries to maximize.
Other approaches have been developed which don't fit neatly into this three-fold
categorisation, and sometimes more than one is used by the same machine learning system.
Examples include topic modelling, dimensionality reduction and meta learning.
As of 2020, deep learning has become the dominant approach for much ongoing work in the
field of machine learning.
History and relationships to other fields

The term machine learning was coined in 1959 by Arthur Samuel, an American IBMer and
pioneer in the field of computer gaming and artificial intelligence. A representative book of
machine learning research during the 1960s was Nilsson's book on Learning Machines,
dealing mostly with machine learning for pattern classification. Interest related to pattern
recognition continued into the 1970s, as described by Duda and Hart in 1973. In 1981 a report
was given on using teaching strategies so that a neural network learns to recognize 40
characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal.
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied
in the machine learning field: "A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P if its performance at tasks in T, as
measured by P, improves with experience E." This definition of the tasks with which machine
learning is concerned offers a fundamentally operational definition rather than defining the
field in cognitive terms. This follows Alan Turing's proposal in his paper "Computing
Machinery and Intelligence", in which the question "Can machines think?" is replaced with
the question "Can machines do what we (as thinking entities) can do?".
Modern-day machine learning has two objectives: one is to classify data based on models
which have been developed; the other is to make predictions for future outcomes based
on these models. A hypothetical algorithm specific to classifying data may use computer vision
of moles coupled with supervised learning in order to train it to classify the cancerous moles,
whereas a machine learning algorithm for stock trading may inform the trader of future
potential predictions.
Artificial intelligence
[Figure: Machine learning as a subfield of AI]
[Figure: Part of machine learning as a subfield of AI, or part of AI as a subfield of machine learning]
As a scientific endeavour, machine learning grew out of the quest for artificial intelligence. In
the early days of AI as an academic discipline, some researchers were interested in having
machines learn from data. They attempted to approach the problem with various symbolic
methods, as well as what was then termed "neural networks"; these were mostly perceptrons
and other models that were later found to be reinventions of the generalized linear models of
statistics. Probabilistic reasoning was also employed, especially in automated medical
diagnosis.
However, an increasing emphasis on the logical, knowledge-based approach caused a rift
between AI and machine learning. Probabilistic systems were plagued by theoretical and
practical problems of data acquisition and representation. By 1980, expert systems had come
to dominate AI, and statistics was out of favor. Work on symbolic/knowledge-based learning
did continue within AI, leading to inductive logic programming, but the more statistical line
of research was now outside the field of AI proper, in pattern recognition and information
retrieval. Neural networks research had been abandoned by AI and computer science around
the same time. This line, too, was continued outside the AI/CS field, as "connectionism", by
researchers from other disciplines including Hopfield, Rumelhart and Hinton. Their main
success came in the mid-1980s with the reinvention of backpropagation.
Machine learning (ML), reorganized as a separate field, started to flourish in the 1990s. The
field changed its goal from achieving artificial intelligence to tackling solvable problems of a
practical nature. It shifted focus away from the symbolic approaches it had inherited from AI,
and toward methods and models borrowed from statistics and probability theory.
As of 2020, many sources continue to assert that machine learning remains a subfield of AI.
The main disagreement is whether all of ML is part of AI, as this would mean that anyone
using ML could claim they are using AI. Others have the view that not all of ML is part of AI
where only an 'intelligent' subset of ML is part of AI.
The question of what the difference is between ML and AI is addressed by Judea Pearl in The
Book of Why. According to Pearl, ML learns and predicts based on passive observations, whereas AI
implies an agent interacting with the environment to learn and take actions that maximize its
chance of successfully achieving its goals.
Data mining
Machine learning and data mining often employ the same methods and overlap significantly,
but while machine learning focuses on prediction, based on known properties learned from the
training data, data mining focuses on the discovery of (previously) unknown properties in the
data (this is the analysis step of knowledge discovery in databases). Data mining uses many
machine learning methods, but with different goals; on the other hand, machine learning also
employs data mining methods as "unsupervised learning" or as a preprocessing step to improve
learner accuracy. Much of the confusion between these two research communities (which do
often have separate conferences and separate journals, ECML PKDD being a major exception)
comes from the basic assumptions they work with: in machine learning, performance is usually
evaluated with respect to the ability to reproduce known knowledge, while in knowledge
discovery and data mining (KDD) the key task is the discovery of previously unknown
knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised)
method will easily be outperformed by other supervised methods, while in a typical KDD task,
supervised methods cannot be used due to the unavailability of training data.
Optimization
Machine learning also has intimate ties to optimization: many learning problems are
formulated as minimization of some loss function on a training set of examples. Loss functions
express the discrepancy between the predictions of the model being trained and the actual
problem instances (for example, in classification, one wants to assign a label to instances, and
models are trained to correctly predict the pre-assigned labels of a set of examples).
Generalization
The difference between optimization and machine learning arises from the goal of
generalization: while optimization algorithms can minimize the loss on a training set, machine
learning is concerned with minimizing the loss on unseen samples. Characterizing the
generalization of various learning algorithms is an active topic of current research, especially
for deep learning algorithms.
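The distinction between minimizing a loss on the training set and performing well on unseen samples can be illustrated with a small sketch. Everything here is an assumption made for the example: the data, the threshold classifier, the list of candidate thresholds, and the use of the zero-one loss (the fraction of wrong predictions).

```python
def zero_one_loss(predictions, labels):
    """Fraction of examples whose predicted label differs from the true one."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

def predict(threshold, x):
    """Predict label 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0

def loss_on(threshold, xs, ys):
    return zero_one_loss([predict(threshold, x) for x in xs], ys)

# Hypothetical training set and held-out (unseen) set.
train_x, train_y = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0], [0, 0, 0, 1, 1, 1]
test_x, test_y = [0.8, 1.9, 2.2, 3.8], [0, 1, 1, 1]

# "Optimization": pick the candidate with the lowest training loss.
candidates = [0.0, 1.0, 2.0, 3.0, 4.0]
training_losses = {t: loss_on(t, train_x, train_y) for t in candidates}
best_threshold = min(training_losses, key=training_losses.get)

# "Generalization": measure the loss of that choice on unseen samples.
train_loss = training_losses[best_threshold]
test_loss = loss_on(best_threshold, test_x, test_y)
print(best_threshold, train_loss, test_loss)
```

In this toy case the selected threshold fits the training set perfectly yet still misclassifies one held-out example, which is the gap that generalization is concerned with.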
Statistics
Machine learning and statistics are closely related fields in terms of methods, but distinct in
their principal goal: statistics draws population inferences from a sample, while machine
learning finds generalizable predictive patterns. According to Michael I. Jordan, the ideas of
machine learning, from methodological principles to theoretical tools, have had a long
pre-history in statistics. He also suggested the term data science as a placeholder to call the overall
field.
Leo Breiman distinguished two statistical modelling paradigms: data model and algorithmic
model, wherein "algorithmic model" means more or less the machine learning algorithms like
Random Forest.
Some statisticians have adopted methods from machine learning, leading to a combined field
that they call statistical learning.
Theory
A core objective of a learner is to generalize from its experience. Generalization in this context
is the ability of a learning machine to perform accurately on new, unseen examples/tasks after
having experienced a learning data set. The training examples come from some generally
unknown probability distribution (considered representative of the space of occurrences) and
the learner has to build a general model about this space that enables it to produce sufficiently
accurate predictions in new cases.
The computational analysis of machine learning algorithms and their performance is a branch
of theoretical computer science known as computational learning theory. Because training sets
are finite and the future is uncertain, learning theory usually does not yield guarantees of the
performance of algorithms. Instead, probabilistic bounds on the performance are quite
common. The bias–variance decomposition is one way to quantify generalization error.
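For reference, the standard form of the bias–variance decomposition for squared error can be written as follows. The notation is an assumption of this note, not taken from the text: the data are modelled as y = f(x) + ε with noise of mean zero and variance σ², and \hat{f}(x; D) denotes the model learned from a randomly drawn training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat{f}(x; D)\bigr)^{2}\right]
  = \underbrace{\bigl(\mathbb{E}_{D}[\hat{f}(x; D)] - f(x)\bigr)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}_{D}\!\left[\bigl(\hat{f}(x; D) - \mathbb{E}_{D}[\hat{f}(x; D)]\bigr)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

A hypothesis that is too simple tends to have a large bias term, while one that is too complex tends to have a large variance term.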
For the best performance in the context of generalization, the complexity of the hypothesis
should match the complexity of the function underlying the data. If the hypothesis is less
complex than the function, then the model has underfitted the data. If the complexity of the
model is increased in response, then the training error decreases. But if the hypothesis is too
complex, then the model is subject to overfitting and generalization will be poorer.
In addition to performance bounds, learning theorists study the time complexity and feasibility
of learning. In computational learning theory, a computation is considered feasible if it can be
done in polynomial time. There are two kinds of time complexity results. Positive results show
that a certain class of functions can be learned in polynomial time. Negative results show that
certain classes cannot be learned in polynomial time.
Approaches
Types of learning algorithms
The types of machine learning algorithms differ in their approach, the type of data they input
and output, and the type of task or problem that they are intended to solve.
Supervised learning
[Figure: A support vector machine is a supervised learning model that divides the data into regions
separated by a linear boundary; in the figure, the linear boundary divides the black circles from the
white.]
Supervised learning algorithms build a mathematical model of a set of data that contains both
the inputs and the desired outputs. The data is known as training data, and consists of a set of
training examples. Each training example has one or more inputs and the desired output, also
known as a supervisory signal. In the mathematical model, each training example is
represented by an array or vector, sometimes called a feature vector, and the training data is
represented by a matrix. Through iterative optimization of an objective function, supervised
learning algorithms learn a function that can be used to predict the output associated with new
inputs. An optimal function will allow the algorithm to correctly determine the output for
inputs that were not a part of the training data. An algorithm that improves the accuracy of its
outputs or predictions over time is said to have learned to perform that task.
Types of supervised learning algorithms include active learning, classification and regression.
Classification algorithms are used when the outputs are restricted to a limited set of values,
and regression algorithms are used when the outputs may have any numerical value within a
range. As an example, for a classification algorithm that filters emails, the input would be an
incoming email, and the output would be the name of the folder in which to file the email.
Similarity learning is an area of supervised machine learning closely related to regression and
classification, but the goal is to learn from examples using a similarity function that measures
how similar or related two objects are. It has applications in ranking, recommendation systems,
visual identity tracking, face verification, and speaker verification.
Unsupervised learning
Unsupervised learning algorithms take a set of data that contains only inputs, and find structure
in the data, like grouping or clustering of data points. The algorithms, therefore, learn from
test data that has not been labelled, classified or categorized. Instead of responding to feedback,
unsupervised learning algorithms identify commonalities in the data and react based on the
presence or absence of such commonalities in each new piece of data. A central application of
unsupervised learning is in the field of density estimation in statistics, such as finding the
probability density function, though unsupervised learning also encompasses other domains
involving summarizing and explaining data features.
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that
observations within the same cluster are similar according to one or more predesignated
criteria, while observations drawn from different clusters are dissimilar. Different clustering
techniques make different assumptions on the structure of the data, often defined by some
similarity metric and evaluated, for example, by internal compactness, or the similarity
between members of the same cluster, and separation, the difference between clusters. Other
methods are based on estimated density and graph connectivity.
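As a minimal sketch of cluster analysis, the following uses k-means (Lloyd's algorithm), chosen here only as one common clustering technique; the text above does not prescribe a particular method. The sample points, the initial centers and the function name `k_means` are illustrative assumptions.

```python
from math import dist

def k_means(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Similarity metric (assumption): Euclidean distance.
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Two visually separated groups of 2-D observations (made-up data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (6.0, 6.0)])
print(centers)
print(clusters)
```

Points within each resulting cluster are close to one another (internal compactness), while the two cluster centers are far apart (separation).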
Semi-supervised learning
Semi-supervised learning falls between unsupervised learning (without any labelled training
data) and supervised learning (with completely labelled training data). Some of the training
examples are missing training labels, yet many machine-learning researchers have found that
unlabelled data, when used in conjunction with a small amount of labelled data, can produce
a considerable improvement in learning accuracy.
In weakly supervised learning, the training labels are noisy, limited, or imprecise; however,
these labels are often cheaper to obtain, resulting in larger effective training sets.
Reinforcement learning
Reinforcement learning is an area of machine learning concerned with how software agents
ought to take actions in an environment so as to maximize some notion of cumulative reward.
Due to its generality, the field is studied in many other disciplines, such as game theory, control
theory, operations research, information theory, simulation-based optimization, multi-agent
systems, swarm intelligence, statistics and genetic algorithms. In machine learning, the
environment is typically represented as a Markov decision process (MDP). Many
reinforcement learning algorithms use dynamic programming techniques. Reinforcement
learning algorithms do not assume knowledge of an exact mathematical model of the MDP,
and are used when exact models are infeasible. Reinforcement learning algorithms are used in
autonomous vehicles or in learning to play a game against a human opponent.
Self-learning
Self-learning as a machine learning paradigm was introduced in 1982 along with a neural
network capable of self-learning named crossbar adaptive array (CAA). It is learning with
no external rewards and no external teacher advice. The CAA self-learning algorithm
computes, in a crossbar fashion, both decisions about actions and emotions (feelings) about
consequence situations. The system is driven by the interaction between cognition and
emotion. The self-learning algorithm updates a memory matrix W = ||w(a,s)|| such that, in each
iteration, it executes the following machine learning routine:
1. In situation s, perform an action a;
2. Receive consequence situation s';
3. Compute emotion of being in consequence situation v(s');
4. Update crossbar memory w'(a,s) = w(a,s) + v(s').
It is a system with only one input, situation s, and only one output, action (or behaviour) a.
There is neither a separate reinforcement input nor an advice input from the environment. The
backpropagated value (secondary reinforcement) is the emotion toward the consequence
situation. The CAA exists in two environments, one is the behavioural environment where it
behaves, and the other is the genetic environment, wherefrom it initially and only once receives
initial emotions about situations to be encountered in the behavioural environment. After
receiving the genome (species) vector from the genetic environment, the CAA learns a goal-
seeking behaviour, in an environment that contains both desirable and undesirable situations.
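The four-step routine above can be sketched as follows. This is a simplified illustration, not the original CAA implementation: the toy environment, the greedy action choice and the fixed emotion vector (standing in for the genome vector) are all assumptions, and the names `caa_self_learning`, `transitions` and `emotion` are hypothetical.

```python
import random

def caa_self_learning(transitions, emotion, n_actions, start, steps=200, seed=0):
    """transitions[s][a] gives the consequence situation s';
    emotion[s] gives v(s), assumed here to be fixed by the genome vector."""
    rng = random.Random(seed)
    n_situations = len(transitions)
    # Crossbar memory W = ||w(a, s)||: one row per action, one column per situation.
    w = [[0.0] * n_situations for _ in range(n_actions)]
    s = start
    for _ in range(steps):
        # 1. In situation s, perform an action a (greedy, ties broken at random).
        best = max(w[a][s] for a in range(n_actions))
        a = rng.choice([a for a in range(n_actions) if w[a][s] == best])
        # 2. Receive consequence situation s'.
        s_next = transitions[s][a]
        # 3. Compute emotion of being in the consequence situation, v(s').
        v = emotion[s_next]
        # 4. Update crossbar memory: w'(a, s) = w(a, s) + v(s').
        w[a][s] += v
        s = s_next
    return w

# Toy environment: situation 1 is undesirable, situation 2 is desirable.
transitions = [[1, 2], [0, 0], [0, 0]]
emotion = [0.0, -1.0, 1.0]
w = caa_self_learning(transitions, emotion, n_actions=2, start=0)
print(w[0][0], w[1][0])
```

In situation 0 the memory entry for the action leading to the desirable situation ends up larger than the entry for the other action, which is the goal-seeking behaviour described above.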
Feature learning
Several learning algorithms aim at discovering better representations of the inputs provided
during training. Classic examples include principal components analysis and cluster analysis.
Feature learning algorithms, also called representation learning algorithms, often attempt to
preserve the information in their input but also transform it in a way that makes it useful, often
as a pre-processing step before performing classification or predictions. This technique allows
reconstruction of the inputs coming from the unknown data-generating distribution, while not
being necessarily faithful to configurations that are implausible under that distribution. This
replaces manual feature engineering, and allows a machine to both learn the features and use
them to perform a specific task.
Feature learning can be either supervised or unsupervised. In supervised feature learning,
features are learned using labelled input data. Examples include artificial neural networks,
multilayer perceptrons, and supervised dictionary learning. In unsupervised feature learning,
features are learned with unlabelled input data. Examples include dictionary learning,
independent component analysis, autoencoders, matrix factorization and various forms of
clustering.
Manifold learning algorithms attempt to do so under the constraint that the learned
representation is low-dimensional. Sparse coding algorithms attempt to do so under the
constraint that the learned representation is sparse, meaning that the mathematical model has
many zeros. Multilinear subspace learning algorithms aim to learn low-dimensional
representations directly from tensor representations for multidimensional data, without
reshaping them into higher-dimensional vectors. Deep learning algorithms discover multiple
levels of representation, or a hierarchy of features, with higher-level, more abstract features
defined in terms of (or generating) lower-level features. It has been argued that an intelligent
machine is one that learns a representation that disentangles the underlying factors of variation
that explain the observed data.
Feature learning is motivated by the fact that machine learning tasks such as classification
often require input that is mathematically and computationally convenient to process.
However, real-world data such as images, video, and sensory data has not yielded to attempts
to algorithmically define specific features. An alternative is to discover such features or
representations through examination, without relying on explicit algorithms.
Sparse dictionary learning

Sparse dictionary learning is a feature learning method where a training example is represented
as a linear combination of basis functions, and is assumed to be a sparse matrix. The method
is strongly NP-hard and difficult to solve approximately. A popular heuristic method for sparse
dictionary learning is the K-SVD algorithm. Sparse dictionary learning has been applied in
several contexts. In classification, the problem is to determine the class to which a previously
unseen training example belongs. For a dictionary where each class has already been built, a
new training example is associated with the class that is best sparsely represented by the
corresponding dictionary. Sparse dictionary learning has also been applied in image de-
noising. The key idea is that a clean image patch can be sparsely represented by an image
dictionary, but the noise cannot.
Anomaly detection
In data mining, anomaly detection, also known as outlier detection, is the identification of rare
items, events or observations which raise suspicions by differing significantly from the
majority of the data. Typically, the anomalous items represent an issue such as bank fraud, a
structural defect, medical problems or errors in a text. Anomalies are referred to as outliers,
novelties, noise, deviations and exceptions.
In particular, in the context of abuse and network intrusion detection, the interesting objects
are often not rare objects, but unexpected bursts of activity. This pattern does not adhere to
the common statistical definition of an outlier as a rare object, and many outlier detection
methods (in particular, unsupervised algorithms) will fail on such data unless it has been
aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-
clusters formed by these patterns.
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly
detection techniques detect anomalies in an unlabelled test data set under the assumption that
the majority of the instances in the data set are normal, by looking for instances that seem to
fit least to the remainder of the data set. Supervised anomaly detection techniques require a
data set that has been labelled as "normal" and "abnormal" and involve training a classifier
(the key difference to many other statistical classification problems is the inherently
unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques
construct a model representing normal behaviour from a given normal training data set and
then test the likelihood of a test instance to be generated by the model.
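A minimal sketch of the semi-supervised variant described above: a model of normal behaviour is built from normal training data only, and each test instance is then checked against it. The Gaussian model, the z-score threshold of 3 and the sample values are illustrative assumptions, not part of the source.

```python
from statistics import mean, stdev

def fit_normal_model(normal_values):
    """Model normal behaviour as a mean and a sample standard deviation."""
    return mean(normal_values), stdev(normal_values)

def is_anomaly(value, model, threshold=3.0):
    """Flag a test instance whose distance from the mean exceeds
    `threshold` standard deviations (an assumed cut-off)."""
    mu, sigma = model
    return abs(value - mu) / sigma > threshold

# Training data known to be normal (made-up measurements).
normal_training = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
model = fit_normal_model(normal_training)

test_instances = [10.1, 9.6, 14.0]
flags = [is_anomaly(x, model) for x in test_instances]
print(flags)
```

Only the last test instance lies far outside the range the model learned as normal, so it is the only one flagged.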
Robot learning
In developmental robotics, robot learning algorithms generate their own sequences of learning
experiences, also known as a curriculum, to cumulatively acquire new skills through self-
guided exploration and social interaction with humans. These robots use guidance mechanisms
such as active learning, maturation, motor synergies and imitation.
Association rules
Association rule learning is a rule-based machine learning method for discovering
relationships between variables in large databases. It is intended to identify strong rules
discovered in databases using some measure of "interestingness".
Rule-based machine learning is a general term for any machine learning method that identifies,
learns, or evolves "rules" to store, manipulate or apply knowledge. The defining characteristic
of a rule-based machine learning algorithm is the identification and utilization of a set of
relational rules that collectively represent the knowledge captured by the system. This is in
contrast to other machine learning algorithms that commonly identify a singular model that
can be universally applied to any instance in order to make a prediction. Rule-based machine
learning approaches include learning classifier systems, association rule learning, and artificial
immune systems. Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński
and Arun Swami introduced association rules for discovering regularities between products in
large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For
example, the rule {onions, potatoes} ⇒ {burger} found in
the sales data of a supermarket would indicate that if a customer buys onions and potatoes
together, they are likely to also buy hamburger meat. Such information can be used as the basis
for decisions about marketing activities such as promotional pricing or product placements. In
addition to market basket analysis, association rules are employed today in application areas
including Web usage mining, intrusion detection, continuous production, and bioinformatics.
In contrast with sequence mining, association rule learning typically does not consider the
order of items either within a transaction or across transactions.
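To make the onions-and-potatoes rule concrete, the sketch below computes support and confidence, two commonly used measures of "interestingness". The transaction data is invented for illustration, and the helper names `support` and `confidence` are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up point-of-sale transactions; item order within a transaction is ignored.
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burger"},
]

rule_support = support(transactions, {"onions", "potatoes", "burger"})
rule_confidence = confidence(transactions, {"onions", "potatoes"}, {"burger"})
print(rule_support, rule_confidence)
```

A rule is usually treated as strong when both values exceed thresholds chosen by the analyst.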
Learning classifier systems (LCS) are a family of rule-based machine learning algorithms that
combine a discovery component, typically a genetic algorithm, with a learning component,
performing either supervised learning, reinforcement learning, or unsupervised learning. They
seek to identify a set of context-dependent rules that collectively store and apply knowledge
in a piecewise manner in order to make predictions.
Inductive logic programming (ILP) is an approach to rule-learning using logic programming
as a uniform representation for input examples, background knowledge, and hypotheses.
Given an encoding of the known background knowledge and a set of examples represented as
a logical database of facts, an ILP system will derive a hypothesized logic program that entails
all positive and no negative examples. Inductive programming is a related field that considers
any kind of programming language for representing hypotheses (and not only logic
programming), such as functional programs.
Inductive logic programming is particularly useful in bioinformatics and natural language
processing. Gordon Plotkin and Ehud Shapiro laid the initial theoretical foundation for
inductive machine learning in a logical setting. Shapiro built their first implementation (Model
Inference System) in 1981: a Prolog program that inductively inferred logic programs from
positive and negative examples. The term inductive here refers to philosophical induction,
suggesting a theory to explain observed facts, rather than mathematical induction, proving a
property for all members of a well-ordered set.
Models
Performing machine learning involves creating a model, which is trained on some training data
and then can process additional data to make predictions. Various types of models have been
used and researched for machine learning systems.
Artificial neural networks

An artificial neural network is an interconnected group of nodes, akin to the vast network of
neurons in a brain. Here, each circular node represents an artificial neuron and an arrow
represents a connection from the output of one artificial neuron to the input of another.
Artificial neural networks (ANNs), or connectionist systems, are computing systems vaguely
inspired by the biological neural networks that constitute animal brains. Such systems "learn"
to perform tasks by considering examples, generally without being programmed with any task-
specific rules.
An ANN is a model based on a collection of connected units or nodes called "artificial
neurons", which loosely model the neurons in a biological brain. Each connection, like the
synapses in a biological brain, can transmit information, a "signal", from one artificial neuron
to another. An artificial neuron that receives a signal can process it and then signal additional
artificial neurons connected to it. In common ANN implementations, the signal at a connection
between artificial neurons is a real number, and the output of each artificial neuron is computed
by some non-linear function of the sum of its inputs. The connections between artificial
neurons are called "edges". Artificial neurons and edges typically have a weight that adjusts
as learning proceeds. The weight increases or decreases the strength of the signal at a
connection. Artificial neurons may have a threshold such that the signal is only sent if the
aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers.
Different layers may perform different kinds of transformations on their inputs. Signals travel
from the first layer (the input layer) to the last layer (the output layer), possibly after traversing
the layers multiple times.
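The computation performed by a single artificial neuron and by a layer of neurons can be sketched as follows. The sigmoid activation, the weights and the input values are illustrative assumptions; the function names are hypothetical.

```python
import math

def sigmoid(x):
    """A common non-linear activation function (one choice among many)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Output = non-linear function of the weighted sum of the inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def layer(inputs, weight_rows, biases):
    """A layer applies several neurons to the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Signals travel from the input layer through a hidden layer to the output layer.
hidden = layer([0.5, -1.0], [[0.8, 0.2], [-0.4, 0.9]], [0.0, 0.1])
output = layer(hidden, [[1.5, -2.0]], [0.3])
print(output)
```

Learning then consists of adjusting the weights and biases so that the output moves closer to the desired value.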
The original goal of the ANN approach was to solve problems in the same way that a human
brain would. However, over time, attention moved to performing specific tasks, leading to
deviations from biology. Artificial neural networks have been used on a variety of tasks,
including computer vision, speech recognition, machine translation, social network filtering,
playing board and video games and medical diagnosis.
Deep learning consists of multiple hidden layers in an artificial neural network. This approach
tries to model the way the human brain processes light and sound into vision and hearing.
Some successful applications of deep learning are computer vision and speech recognition.
Decision trees
Decision tree learning uses a decision tree as a predictive model to go from observations about
an item (represented in the branches) to conclusions about the item's target value (represented
in the leaves). It is one of the predictive modeling approaches used in statistics, data mining,
and machine learning. Tree models where the target variable can take a discrete set of values
are called classification trees; in these tree structures, leaves represent class labels and
branches represent conjunctions of features that lead to those class labels. Decision trees where
the target variable can take continuous values (typically real numbers) are called regression
trees. In decision analysis, a decision tree can be used to visually and explicitly represent
decisions and decision making. In data mining, a decision tree describes data, but the resulting
classification tree can be an input for decision making.
Support vector machines

Support vector machines (SVMs), also known as support vector networks, are a set of related
supervised learning methods used for classification and regression. Given a set of training
examples, each marked as belonging to one of two categories, an SVM training algorithm
builds a model that predicts whether a new example falls into one category or the other. An
SVM training algorithm is a non-probabilistic, binary, linear classifier, although methods such
as Platt scaling exist to use SVM in a probabilistic classification setting. In addition to
performing linear classification, SVMs can efficiently perform a non-linear classification
using what is called the kernel trick, implicitly mapping their inputs into high-dimensional
feature spaces.
Regression analysis
Regression analysis encompasses a large variety of statistical methods to estimate the
relationship between input variables and their associated features. Its most common form is
linear regression, where a single line is drawn to best fit the given data according to a
mathematical criterion such as ordinary least squares. The latter is often extended by
regularization methods to mitigate overfitting and bias, as in ridge regression.
When dealing with non-linear problems, go-to models include polynomial regression (for
example, used for trendline fitting in Microsoft Excel), logistic regression (often used in
statistical classification) or even kernel regression, which introduces non-linearity by taking
advantage of the kernel trick to implicitly map input variables to higher-dimensional space.
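As a small worked example of linear regression with ordinary least squares, the sketch below fits a single line using the closed-form solution for one input variable. The data points and the function name `fit_line` are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations that roughly follow y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted line minimizes the sum of squared vertical distances between the data points and the line.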
Bayesian networks
A simple Bayesian network. Rain influences whether the sprinkler is activated, and both rain
and the sprinkler influence whether the grass is wet.
A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic
graphical model that represents a set of random variables and their conditional independence
with a directed acyclic graph (DAG). For example, a Bayesian network could represent the
probabilistic relationships between diseases and symptoms. Given symptoms, the network can
be used to compute the probabilities of the presence of various diseases. Efficient algorithms
exist that perform inference and learning. Bayesian networks that model sequences of
variables, like speech signals or protein sequences, are called dynamic Bayesian networks.
Generalizations of Bayesian networks that can represent and solve decision problems under
uncertainty are called influence diagrams.
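The rain, sprinkler and wet-grass network can be used for inference by enumerating the joint distribution, as in the sketch below. The probability values are illustrative assumptions (the source gives none), and the names are hypothetical.

```python
from itertools import product

# Assumed conditional probability tables for the three-node network.
P_RAIN = 0.2
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}
# Keys are (sprinkler, rain); values are P(grass wet | sprinkler, rain).
P_WET_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """Joint probability factorized along the directed acyclic graph."""
    p_r = P_RAIN if rain else 1 - P_RAIN
    p_s = P_SPRINKLER_GIVEN_RAIN[rain] if sprinkler else 1 - P_SPRINKLER_GIVEN_RAIN[rain]
    p_w = P_WET_GIVEN[(sprinkler, rain)] if wet else 1 - P_WET_GIVEN[(sprinkler, rain)]
    return p_r * p_s * p_w

def prob_rain_given_wet():
    """P(rain | grass wet), summing out the sprinkler variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / denominator

p = prob_rain_given_wet()
print(round(p, 4))
```

Enumeration is only practical for very small networks; larger networks rely on the more efficient inference algorithms mentioned above.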
Genetic algorithms
A genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process
of natural selection, using methods such as mutation and crossover to generate new genotypes
in the hope of finding good solutions to a given problem. In machine learning, genetic
algorithms were used in the 1980s and 1990s. Conversely, machine learning techniques have
been used to improve the performance of genetic and evolutionary algorithms.
Training models
Machine learning models usually require a lot of data in order to perform well.
When training a machine learning model, one needs to collect a large, representative
sample of data from a training set. Data from the training set can be as varied as a corpus of
text, a collection of images, and data collected from individual users of a service. Overfitting
is something to watch out for when training a machine learning model. Trained models derived
from biased data can result in skewed or undesired predictions. Algorithmic bias is a potential
result from data not fully prepared for training.
Federated learning
Federated learning is an adapted form of distributed artificial intelligence for training machine
learning models that decentralizes the training process, allowing for users' privacy to be
maintained by not needing to send their data to a centralized server. This also increases
efficiency by decentralizing the training process to many devices. For example, Gboard uses
federated machine learning to train search query prediction models on users' mobile phones
without having to send individual searches back to Google.
SYSTEM STUDY
Feasibility Study
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with
a very general plan for the project and some cost estimates. During system analysis, the
feasibility study of the proposed system is carried out. This is to ensure that the
proposed system is not a burden to the company. For feasibility analysis, some
understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are:
➢ ECONOMIC FEASIBILITY
➢ TECHNICAL FEASIBILITY
➢ SOCIAL FEASIBILITY
Economic Feasibility
This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and
development of the system is limited, so the expenditures must be justified. The developed
system is well within the budget, which was achieved because most of the technologies used
are freely available. Only the customized products had to be purchased.
Technical Feasibility
This study is carried out to check the technical feasibility, that is, the technical requirements
of the system. Any system developed must not place a high demand on the available technical
resources, since this would lead to high demands being placed on the client. The developed
system must have modest requirements, as only minimal or no changes are required for
implementing this system.
Social Feasibility
This aspect of the study checks the level of acceptance of the system by the user. This includes
the process of training the user to use the system efficiently. The user must not feel threatened
by the system, but must instead accept it as a necessity. The level of acceptance by the users
depends solely on the methods that are employed to educate users about the system and to make
them familiar with it. Their level of confidence must be raised so that they are also able to make
constructive criticism, which is welcomed, as they are the final users of the system.
SYSTEM ANALYSIS
EXISTING SYSTEM:
➢ The existing malicious URL detection approaches are based on DNS information and
lexical properties of URLs. The malicious social bots use URL redirections in order to
avoid detection.
➢ Besel et al. analysed a social botnet attack on Twitter. The authors showed that
social bots use URL shortening services and URL redirection in order to redirect users
to malicious web pages.
➢ Echeverria and Zhou presented methods to detect, retrieve, and analyse botnets over
thousands of users to observe the social behaviour of bots.
➢ Dorri et al. proposed a social bot hunter model based on user
behavioural features, such as follower ratio, the number of URLs, and reputation score.
➢ M. Agarwal et al. designed a trust model to detect malicious activities
in an OSN.
Disadvantages Of Existing System:
➢ The malicious social bots can manipulate profile features, such as hashtag ratio, follower
ratio, URL ratio, and the number of retweets. The malicious social bots can also
manipulate tweet-content features, such as sentimental words, emoticons, and most
frequent words used in the tweets, by manipulating the content of each tweet. The social
relationship-based features are highly robust because the malicious social bots cannot
easily manipulate the social interactions of users in the Twitter network.
➢ The existing approaches rely on statistical features instead of analyzing the social
behaviour of users. Moreover, these approaches are not highly robust in detecting the
temporal data patterns with noisy data (i.e., where the data is biased with untrustworthy
or fake information) because the behaviour of malicious bots changes over time in order
to avoid detection.
PROPOSED SYSTEM:
➢ In the proposed system, the malicious behaviour of participants is analysed by
considering features extracted from the posted URLs (in the tweets), such as URL
redirection, frequency of shared URLs, and spam content in URL, to distinguish
between legitimate and malicious tweets. To protect against the malicious social bot
attacks, our proposed LA-based malicious social bot detection (LA-MSBD) algorithm
integrates a trust computational model with a set of URL-based features for the detection
of malicious social bots.
➢ In the proposed system, we analyse the malicious behaviour of a participant by
considering URL-based features, such as URL redirection, the relative position of URL,
frequency of shared URLs, and spam content in URL.
➢ In the proposed system, we evaluate the trustworthiness of tweets (posted by each
participant) by using Bayesian learning and Dempster–Shafer theory (DST).
➢ Also, we design the system by integrating a trust model with a set of URL-based
features.
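The URL-based features named in the list above could be extracted along the following lines. This is a minimal sketch under stated assumptions, not the feature extractor of the proposed system: the spam word list, the feature names, the sample tweets and the pre-resolved `redirect_chains` mapping are all hypothetical, and no network requests are made.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")
SPAM_WORDS = {"free", "win", "prize", "click"}  # hypothetical word list

def url_features(tweets, redirect_chains):
    """tweets: texts posted by one participant.
    redirect_chains: assumed mapping from a URL to its already-resolved redirect hops."""
    urls, positions = [], []
    for text in tweets:
        for match in URL_PATTERN.finditer(text):
            urls.append(match.group())
            # Relative position of the URL inside the tweet, between 0 and 1.
            positions.append(match.start() / len(text))
    counts = Counter(urls)
    return {
        "max_redirects": max((len(redirect_chains.get(u, [])) for u in urls), default=0),
        "mean_relative_position": sum(positions) / len(positions) if positions else 0.0,
        "max_url_frequency": max(counts.values(), default=0) / len(tweets) if tweets else 0.0,
        "spam_word_hits": sum(word in u.lower() for u in urls for word in SPAM_WORDS),
    }

tweets = [
    "Win a prize http://short.example/free-prize",
    "hello world",
    "again http://short.example/free-prize",
]
redirect_chains = {
    "http://short.example/free-prize": ["http://a.example/x", "http://b.example/y"],
}
features = url_features(tweets, redirect_chains)
print(features)
```

A feature dictionary of this kind would then be combined with the trust model before classification.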
Advantages Of Proposed System:
➢ The proposed system helps to detect malicious social bots accurately.
➢ The experimental results illustrate that our proposed system gives better performance
compared with conventional machine learning algorithms in terms of precision.
➢ The precision value obtained for The Fake Project data set is better than that for the Social
Honeypot data set because the Social Honeypot data set contains more noisy and
untrustworthy information in its user content features than The Fake Project data set.
➢ The proposed system achieves the highest precision level. This is due to the fact that the
proposed system executes a finite set of learning actions to update the action
probability value and achieves the advantages of incremental learning. Hence, the LA
model with a trust component identifies the malicious tweets that are posted by
malicious social bots.
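To illustrate what updating an action probability value means in a learning automaton, the sketch below uses the linear reward-inaction scheme. This scheme is an assumption chosen for illustration; the text above does not state which update rule the proposed system uses, and the function name and learning rate are hypothetical.

```python
def reward_inaction_update(probs, chosen, rewarded, rate=0.1):
    """Linear reward-inaction: on reward, shift probability toward the
    chosen action; on penalty, leave the probabilities unchanged."""
    if not rewarded:
        return list(probs)
    return [
        p + rate * (1 - p) if i == chosen else (1 - rate) * p
        for i, p in enumerate(probs)
    ]

# Two actions, for example "label as legitimate" (0) and "label as malicious" (1).
probs = [0.5, 0.5]
for _ in range(20):
    # Assume the environment rewards action 0 on every step.
    probs = reward_inaction_update(probs, chosen=0, rewarded=True)
print(probs)
```

Because each step only adjusts the current probabilities, the automaton learns incrementally rather than retraining from scratch.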
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
➢ System : Pentium IV 2.4 GHz.
➢ Hard Disk : 40 GB.
➢ Floppy Drive : 1.44 MB.
➢ Monitor : 15-inch VGA Colour.
➢ Mouse : Logitech.
➢ RAM : 512 MB.
SOFTWARE REQUIREMENTS:
➢ Operating system : Windows 7.
➢ Coding Language : Python.
➢ Database : MySQL.
SYSTEM DESIGN
SYSTEM ARCHITECTURE:
DATA FLOW DIAGRAM:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be
used to represent a system in terms of the input data to the system, the various processing
carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modelling tools. It is used to
model the system components. These components are the system process, the data used
by the process, an external entity that interacts with the system and the information
flows in the system.
3. A DFD shows how the information moves through the system and how it is modified by
a series of transformations. It is a graphical technique that depicts information flow and
the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. It may be
partitioned into levels that represent increasing information flow and functional detail.
[Data flow diagram: Input data, Preprocessing, Training dataset, Testing data, Feature extraction, Prediction/Classification, output: spam or no spam]
UML DIAGRAMS
UML stands for Unified Modelling Language. UML is a standardized general-purpose
modelling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented
computer software. In its current form, UML comprises two major components: a
meta-model and a notation. In the future, some form of method or process may also be added to, or
associated with, UML.
The Unified Modelling Language is a standard language for specifying, visualizing,
constructing and documenting the artifacts of software systems, as well as for business
modelling and other non-software systems.
The UML represents a collection of best engineering practices that have proven successful in
the modelling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software
development process. The UML uses mostly graphical notations to express the design of
software projects.
Goals:
The primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modelling language so that they can
develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modelling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks, patterns
and components.
7. Integrate best practices.
USE CASE DIAGRAM:
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram
defined by and created from a Use-case analysis. Its purpose is to present a graphical overview
of the functionality provided by a system in terms of actors, their goals (represented as use
cases), and any dependencies between those use cases. The main purpose of a use case diagram
is to show what system functions are performed for which actor. Roles of the actors in the
system can be depicted.
[Use case diagram: actor User; use cases Input data, Preprocessing, Training, Classification]
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modelling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, their attributes, operations (or methods), and the relationships among the classes. It
explains which class contains information.
[Class diagram: class Input (Input data; Preprocessing()) and class Output (Features extraction; Classification; finally get classified and display result: spam or no spam)]
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modelling Language (UML) is a kind of interaction diagram
that shows how processes operate with one another and in what order. It is a construct of a
Message Sequence Chart. Sequence diagrams are sometimes called event diagrams, event
scenarios, and timing diagrams.
[Sequence diagram: lifelines Data collection, Training and Testing. Messages, in order:
collect the data from the user feature on cnn; send the data to the training stage; perform
preprocessing; train the data; extracted feature with images sent to the testing stage; give
input; predict the type using the proposed algorithm.]
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modelling Language,
activity diagrams can be used to describe the business and operational step-by-step workflows
of components in a system. An activity diagram shows the overall flow of control.
[Activity diagram: Input dataset, then Preprocessing, then Training, then Prediction using
the proposed algorithm (Passive-Aggressive), then Predicted label as spam or no spam.]
SOFTWARE ENVIRONMENT
PYTHON:
Python is a high-level, interpreted, interactive and object-oriented scripting language. Python
is designed to be highly readable. It frequently uses English keywords whereas other
languages use punctuation, and it has fewer syntactical constructions than other languages.
• Python is Interpreted − Python is processed at runtime by the interpreter. You do not
need to compile your program before executing it. This is similar to PERL and PHP.
• Python is Interactive − You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
• Python is Object-Oriented − Python supports Object-Oriented style or technique of
programming that encapsulates code within objects.
• Python is a Beginner's Language − Python is a great language for the beginner-level
programmers and supports the development of a wide range of applications from simple
text processing to WWW browsers to games.
History of Python
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science (CWI) in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, and Unix shell and other scripting languages.
Python is copyrighted. Its source code is freely available under the Python Software
Foundation License, an open-source license that is compatible with the GNU General Public
License (GPL).
Python is now maintained by a core development team under the Python Software Foundation;
Guido van Rossum directed its progress for many years.
Python Features
Python's features include −
• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.
• Easy-to-read − Python code is more clearly defined and visible to the eyes.
• Easy-to-maintain − Python's source code is fairly easy-to-maintain.
• A broad standard library − Python's bulk of the library is very portable and cross-
platform compatible on UNIX, Windows, and Macintosh.
• Interactive Mode − Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.
• Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.
• Extendable − You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.
• Databases − Python provides interfaces to all major commercial databases.
• GUI Programming − Python supports GUI applications that can be created and ported
to many system calls, libraries and windows systems, such as Windows MFC,
Macintosh, and the X Window system of Unix.
• Scalable − Python provides a better structure and support for large programs than shell
scripting.
Apart from the above-mentioned features, Python has a long list of good features, a few of
which are listed below −
• It supports functional and structured programming methods as well as OOP.
• It can be used as a scripting language or can be compiled to byte-code for building large
applications.
• It provides very high-level dynamic data types and supports dynamic type checking.
• It supports automatic garbage collection.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
Python is available on a wide variety of platforms including Linux and Mac OS X. Let's
understand how to set up our Python environment.
Getting Python
The most up-to-date source code, binaries, documentation, news, etc., are available
on the official website of Python, https://www.python.org.
Windows Installation
Here are the steps to install Python on a Windows machine.
• Open a Web browser and go to https://www.python.org/downloads/.
• Follow the link for the Windows installer python-XYZ.msi file, where XYZ is the
version you need to install.
• To use this installer python-XYZ.msi, the Windows system must support Microsoft
Installer 2.0. Save the installer file to your local machine and then run it to find out if
your machine supports MSI.
• Run the downloaded file. This brings up the Python install wizard, which is really easy
to use. Just accept the default settings, wait until the install is finished, and you are done.
The Python language has many similarities to Perl, C, and Java. However, there are some
definite differences between the languages.
First Python Program
Let us execute programs in different modes of programming.
Interactive Mode Programming
Invoking the interpreter without passing a script file as a parameter brings up the following
prompt −
$ python
Python 2.4.3 (#1, Nov 11 2010, 13:34:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Type the following text at the Python prompt and press Enter −
>>> print "Hello, Python!"
If you are running a newer version of Python (Python 3), print is a function and must be
called with parentheses, as in print("Hello, Python!"). In Python version 2.4.3, the statement
above produces the following result −
Hello, Python!
Script Mode Programming

Invoking the interpreter with a script parameter begins execution of the script and continues
until the script is finished. When the script is finished, the interpreter is no longer active.
Let us write a simple Python program in a script. Python files have extension .py. Type the
following source code in a test.py file −
print("Hello, Python!")
We assume that you have Python interpreter set in PATH variable. Now, try to run this
program as follows −
$ python test.py
This produces the following result −

Hello, Python!
Flask Framework:
Flask is a web application framework written in Python. It was created by Armin Ronacher, who
led an international group of Python enthusiasts named Pocoo. Flask is based on the
Werkzeug WSGI toolkit and the Jinja2 template engine. Both started as Pocoo projects. The HTTP
protocol is the foundation of data communication on the World Wide Web. Different methods of
data retrieval from a specified URL are defined in this protocol.
The following table summarizes different HTTP methods −
Sr.No  Method  Description
1      GET     Sends data in unencrypted form to the server. Most common method.
2      HEAD    Same as GET, but without a response body.
3      POST    Used to send HTML form data to the server. Data received by the POST method is not cached by the server.
4      PUT     Replaces all current representations of the target resource with the uploaded content.
5      DELETE  Removes all current representations of the target resource given by a URL.
By default, the Flask route responds to the GET requests. However, this preference can be
altered by providing methods argument to route() decorator.
In order to demonstrate the use of POST method in URL routing, first let us create an HTML
form and use the POST method to send form data to a URL.
Save the following script as login.html
<html>
  <body>
    <form action="http://localhost:5000/login" method="post">
      <p>Enter Name:</p>
      <p><input type="text" name="nm"/></p>
      <p><input type="submit" value="submit"/></p>
    </form>
  </body>
</html>
Now enter the following script in Python shell.
from flask import Flask, redirect, url_for, request

app = Flask(__name__)

# Greets the user whose name is passed as the variable part of the URL.
@app.route('/success/<name>')
def success(name):
    return 'welcome %s' % name

# Accepts both POST (form data) and GET (query string) requests.
@app.route('/login', methods=['POST', 'GET'])
def login():
    if request.method == 'POST':
        user = request.form['nm']
        return redirect(url_for('success', name=user))
    else:
        user = request.args.get('nm')
        return redirect(url_for('success', name=user))

if __name__ == '__main__':
    app.run(debug=True)
After the development server starts running, open login.html in the browser, enter name in
the text field and click Submit.
Form data is POSTed to the URL in action clause of form tag.

http://localhost:5000/login is mapped to the login() function. Since the server has received data
by the POST method, the value of the ‘nm’ parameter is obtained from the form data by −
user = request.form['nm']
It is passed to the ‘/success’ URL as the variable part. The browser displays a welcome message in
the window.
Change the method parameter to ‘GET’ in login.html and open it again in the browser. The
data is now received on the server by the GET method. The value of the ‘nm’ parameter is now
obtained by −
user = request.args.get('nm')
Here, args is a dictionary-like object containing the pairs of form parameters and their
corresponding values. The value corresponding to the ‘nm’ parameter is passed on to the
‘/success’ URL as before.
What is Python?
Python is a popular programming language. It was created by Guido van Rossum and first released in 1991.
It is used for:
• web development (server-side),
• software development,
• mathematics,
• system scripting.
What can Python do?
• Python can be used on a server to create web applications.
• Python can be used alongside software to create workflows.
• Python can connect to database systems. It can also read and modify files.
• Python can be used to handle big data and perform complex mathematics.
• Python can be used for rapid prototyping, or for production-ready software
development.
Why Python?
• Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
• Python has a simple syntax similar to the English language.
• Python has syntax that allows developers to write programs with fewer lines than some
other programming languages.
• Python runs on an interpreter system, meaning that code can be executed as soon as it
is written. This means that prototyping can be very quick.
• Python can be treated in a procedural way, an object-orientated way or a functional way.
Good to know
• The most recent major version of Python is Python 3, which we shall be using in this
tutorial. Python 2 reached its end of life on January 1, 2020 and no longer receives updates,
including security updates, although it still appears in older code.
• In this tutorial Python will be written in a text editor. It is possible to write Python in an
Integrated Development Environment, such as Thonny, PyCharm, NetBeans or Eclipse,
which are particularly useful when managing larger collections of Python files.
Python Syntax compared to other programming languages
• Python was designed for readability, and has some similarities to the English language
with influence from mathematics.
• Python uses new lines to complete a command, as opposed to other programming
languages which often use semicolons or parentheses.
• Python relies on indentation, using whitespace, to define scope; such as the scope of
loops, functions and classes. Other programming languages often use curly-brackets for
this purpose.
Python Install
Many PCs and Macs will have Python already installed.
To check if you have Python installed on a Windows PC, search in the start bar for Python
or run the following on the Command Line (cmd.exe):
C:\Users\Your Name>python --version
To check if you have Python installed on Linux or Mac, open the command line (on Linux)
or the Terminal (on Mac) and type:
python --version
If you find that you do not have Python installed on your computer, you can
download it for free from the following website: https://www.python.org/
Python Quickstart
Python is an interpreted programming language. This means that as a developer you
write Python (.py) files in a text editor and then pass those files to the Python interpreter
to be executed.
The way to run a python file is like this on the command line:
C:\Users\Your Name>python helloworld.py

Where "helloworld.py" is the name of your python file.
Let's write our first Python file, called helloworld.py, which can be done in any text editor.
helloworld.py
print("Hello, World!")
Simple as that. Save your file. Open your command line, navigate to the directory where
you saved your file, and run:
C:\Users\Your Name>python helloworld.py
The output should read:
Hello, World!
Congratulations, you have written and executed your first Python program.
The Python Command Line

To test a short amount of code in python sometimes it is quickest and easiest not to write
the code in a file. This is made possible because Python can be run as a command line
itself.
Type the following on the Windows, Mac or Linux command line:
C:\Users\Your Name>python
From there you can write any python, including our hello world example from earlier in
the tutorial:
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello, World!")
Which will write "Hello, World!" in the command line:
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
Hello, World!
Whenever you are done in the python command line, you can simply type the following to
quit the python command line interface:
exit()
Execute Python Syntax

As we learned in the previous section, Python syntax can be executed by writing directly in the
Command Line:
>>> print("Hello, World!")
Hello, World!
Or by creating a python file on the server, using the .py file extension, and running it in the
Command Line:
C:\Users\Your Name>python myfile.py
Python Indentations
Where in other programming languages the indentation in code is for readability only, in
Python the indentation is very important.
Python uses indentation to indicate a block of code.
Example
if 5 > 2:
    print("Five is greater than two!")
Python will give you an error if you skip the indentation:
Example
if 5 > 2:
print("Five is greater than two!")
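The error mentioned above can be observed without crashing a script by asking Python to compile the source text. The following is a minimal sketch; the helper name `check` and the two source strings are illustrative assumptions, not part of the project code.

```python
# Source text whose body is NOT indented under the if statement.
bad_source = 'if 5 > 2:\nprint("Five is greater than two!")\n'
# The same statement with the body indented by four spaces.
good_source = 'if 5 > 2:\n    print("Five is greater than two!")\n'

def check(source):
    """Return 'ok' if the source compiles, else the name of the error raised."""
    try:
        compile(source, "<example>", "exec")
        return "ok"
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return type(exc).__name__

print(check(good_source))  # ok
print(check(bad_source))   # IndentationError
```

Because the check uses compile() rather than running the code, nothing is printed by the snippets themselves; only the outcome of parsing is reported.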
Comments
Python has commenting capability for the purpose of in-code documentation.
Comments start with a #, and Python will render the rest of the line as a comment:
Example
Comments in Python:
#This is a comment.
Docstrings
Python also has extended documentation capability, called docstrings.
Docstrings can be one line, or multiline.
Python uses triple quotes at the beginning and end of the docstring:
Example
Docstrings are also comments:
"""This is a
multiline docstring."""
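One practical difference from # comments is that a docstring placed as the first statement of a function is kept on the object and can be read back at runtime. A small sketch, where the function `greet` is a hypothetical example:

```python
def greet(name):
    """Return a greeting
    for the given name."""
    # This is an ordinary comment: the interpreter discards it.
    return "Hello, " + name

# The docstring is stored in the __doc__ attribute of the function.
print(greet.__doc__)
print(greet("Python"))
```

The built-in help() function displays the same text, which is why docstrings are the usual place for documentation that users of a function should see.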
IMPLEMENTATION
MODULES:
❖ Data Collection
❖ Dataset
❖ Data Preparation
❖ Model Selection
❖ Analyse and Prediction
❖ Accuracy on test set
❖ Saving the Trained Model
❖ Database connecting using MySQL
MODULES DESCRIPTION:
Data Collection:
Collecting data is the first real step in developing a machine learning model. It is a critical
step that determines how good the model can be: the more and better data we get, the better
our model will perform.
There are several techniques for collecting data, such as web scraping and manual
intervention.
The dataset for the detection of malicious social bots was taken from Kaggle and other sources.
Dataset:
The dataset consists of 969,812 individual records. There are 3 columns in the dataset, which are
described below:
1. Id: unique id
2. Labels: class label, one of
   Malicious
   No Malicious
3. URL: URL comment
Data Preparation:
We transform the data by getting rid of missing data and removing some columns. First,
we create a list of the column names that we want to keep.
Next, we drop all columns except the ones that we want to retain.
Finally, we drop the rows that have missing values from the data set.
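The three operations just described (choose the columns to keep, drop the rest, drop rows with missing values) can be sketched with the standard library alone. The sample rows below are hypothetical and only mimic the Id, Labels and URL layout; the project itself may well use a data-frame library for this step.

```python
import csv
import io

# Hypothetical sample in the three-column layout (Id, Labels, URL).
raw_csv = """Id,Labels,URL
1,Malicious,http://example.com/free-prize
2,No Malicious,
3,No Malicious,http://example.org/news
"""

keep_columns = ["URL", "Labels"]  # the columns we want to retain

rows = []
for record in csv.DictReader(io.StringIO(raw_csv)):
    # Keep only the retained columns.
    reduced = {name: record[name] for name in keep_columns}
    # Drop rows in which any retained value is missing (empty).
    if all(value.strip() for value in reduced.values()):
        rows.append(reduced)

print(len(rows))  # 2: the row with the empty URL was dropped
```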
Steps to follow:
1. Removing extra symbols
2. Removing punctuations
3. Removing the Stopwords
4. Stemming
5. Tokenization
6. Feature extractions
7. Count Vectorizer
8. Count Vectorizer with TF-IDF transformer
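The steps above can be sketched in plain Python to show what each one does to the text. This is an illustrative sketch under stated assumptions, not the project's pipeline: the stop-word list is a tiny hypothetical one, stemming is omitted, and the TF-IDF weighting uses the simple form tf * (ln(N / df) + 1), which may differ from the variant used by the library in the actual implementation.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "to", "and", "is", "your"}

def tokenize(text):
    """Lower-case, replace symbols and punctuation, split, drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

def count_vectors(docs):
    """Return (vocabulary, term-count vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tfidf(vectors):
    """Weight each count by inverse document frequency: tf * (ln(N / df) + 1)."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    return [[v[j] * (math.log(n_docs / df[j]) + 1.0) for j in range(n_terms)]
            for v in vectors]

docs = ["Win a FREE prize!!! Click http://spam.example/win",
        "Read the news at http://news.example/today"]
vocab, counts = count_vectors(docs)
weights = tfidf(counts)
print(vocab)
```

Terms that occur in every document (here "http" and "example") keep a weight equal to their raw count, while terms specific to one document are weighted up, which is the point of the TF-IDF step.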
Model Selection:
We used the logistic regression algorithm.
Logistic regression is a classification algorithm, used when the value of the target variable
is categorical in nature. It is most commonly used when the data in question
has a binary output, that is, when each sample belongs to one class or another, labelled 0 or 1.
Remember that classification tasks have discrete categories, unlike regression tasks.
Since we are using a regression-style model to solve a classification problem, a natural
question is whether we can find a hypothesis function that fits a binary dataset. For
simplicity, we consider only the binary classification problem.
The answer is that we have to use a type of function, different from linear functions,
called a logistic function, or sigmoid function.
(Note: although the algorithm is called “Logistic
Regression”, it is, in fact, a classification algorithm, not a regression algorithm. This can be
confusing at first.)
The Sigmoid Function
The sigmoid function (logistic function) is a function that resembles an “S” shaped curve when
plotted on a graph. It takes any real value and “squishes” it into the range between 0 and 1,
pushing large negative inputs towards 0 and large positive inputs towards 1.
The equation for the sigmoid function is this:
y = 1 / (1 + e^(-x))
What is e in this instance? It is Euler's number, the base of the natural logarithm and of the
exponential function, and it has a value of approximately 2.71828.
Let’s see how the sigmoid function represents the given dataset.
This gives a value y that is extremely close to 0 if x is a large negative value and close to 1
if x is a large positive value. In logistic regression, the input features are first combined by a
typical linear function; the sigmoid then squeezes that result into a value between 0 and 1, which
is compared with a threshold (commonly 0.5) so that each input can be put into one of two distinct
categories.
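The sigmoid and the way it turns a linear score into a class label can be sketched in a few lines. The weights, bias and feature values below are hypothetical numbers chosen for illustration; in the project they would come from training.

```python
import math

def sigmoid(x):
    """Logistic function y = 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # for negative x, use the equivalent form e^x / (1 + e^x)
    return e / (1.0 + e)

def predict(features, weights, bias, threshold=0.5):
    """Linear combination, then sigmoid, then threshold: (probability, label)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, (1 if p >= threshold else 0)

print(sigmoid(0))                                  # 0.5
print(sigmoid(-10) < 0.001, sigmoid(10) > 0.999)   # True True

# Hypothetical model: z = 1.5*2.0 + (-0.5)*1.0 - 1.0 = 1.5
probability, label = predict([2.0, 1.0], [1.5, -0.5], -1.0)
print(label)                                       # 1
```

The two-branch form of sigmoid is mathematically the same function; it only avoids computing e raised to a very large positive power.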
Analyse and Prediction:

In the actual dataset, we chose only 2 features:
1. URL: URL comment
2. Labels: class label, one of
   Malicious
   No Malicious
Accuracy on test set:
We got an accuracy of 96.02% on the test set.
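For reference, accuracy and the related measures named in the abstract (precision, recall, F-measure) can be computed from true and predicted labels as sketched below. The five labels are hypothetical and are not the project's test data, so the numbers printed here are unrelated to the reported 96.02%.

```python
def evaluate(y_true, y_pred, positive="Malicious"):
    """Return accuracy, precision, recall and F-measure for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical labels for five test samples.
y_true = ["Malicious", "Malicious", "No Malicious", "No Malicious", "Malicious"]
y_pred = ["Malicious", "No Malicious", "No Malicious", "No Malicious", "Malicious"]
metrics = evaluate(y_true, y_pred)
print(metrics["accuracy"])  # 0.8 (4 of 5 predictions are correct)
```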
Saving the Trained Model:
Once you’re confident enough to take your trained and tested model into a production-ready
environment, the first step is to save it to a file, for example a .pkl file written with a
library like pickle (or a .h5 file for libraries that store models in the HDF5 format).
The pickle module is part of the Python standard library, so no separate installation is needed.
Next, let’s import the module and dump the model into a .pkl file.
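A minimal sketch of the save and reload round trip follows. The dictionary stands in for the trained classifier, and the file name model.pkl in the temporary directory is a hypothetical location; any picklable Python object is handled the same way.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier (assumption: the real model is picklable).
model = {"algorithm": "logistic regression", "weights": [1.5, -0.5], "bias": -1.0}

# Hypothetical location for the saved model.
model_path = os.path.join(tempfile.gettempdir(), "model.pkl")

# Dump the model into the .pkl file (binary mode is required).
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Later, for example inside the web application, load it back.
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == model)  # True
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.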
Database connecting using MySQL:
Start a Python prompt and import the MySQLdb module. So long as that works, press Ctrl+D to
exit the Python instance.

Next, we want to make a Python file that can connect to the database. Generally, you will have
a separate "connect" file, outside of any main files you may have. This is usually true across
languages, and here's why. Initially, you may have just a simple __init__.py, or app.py, or
whatever, and that file does all of your operations. What can happen in time, however, is that
your website does other things. For example, with one of my websites, Sentdex.com, I perform
a lot of analysis, store that analysis to a database, and I also operate a website for users to use.
Generally, for tasks, you will use what is called a "cron." A cron is a scheduled task that runs
when you program it to run. Generally this runs another file, almost certain to not be your
website's file. So then, to connect to a database, you'd have to write the database connecting
code again in the file being run by your cron.
As time goes on, these sorts of needs stack up where you have some files modifying the
database, but you still want the website to be able to access it, and maybe modify it too. Then,
consider what might happen if you change your database password. You'd then need to go to
every single file that connects to the database and change that too. So, usually, you will find
the smartest thing to do is to just create one file, which houses the connection code.
Import the module.
Create a connection function to run our code. Here we specify where we're connecting to, the
user, the user's password, and then the database that we want to connect to.
As a note, we use "localhost" as our host. This just means we'll use the same server that this
code is running on. You can connect to databases remotely as well, which can be pretty neat.
To do that, you would connect to a host by their IP, or their domain. To connect to a database
remotely, you will need to first allow it from the remote database that will be
accessed/modified.
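The connection function described above might look like the following sketch, kept in its own file so that every script shares one copy of the credentials. The user name, password and database name are placeholders, and the optional `connect` parameter is an addition made here so the function can be exercised without a running MySQL server; by default it uses MySQLdb.connect from the third-party mysqlclient package.

```python
def connection(host="localhost", user="root", passwd="change-me",
               db="socialbots", connect=None):
    """Open a database connection and return (cursor, connection).

    All default values are placeholders, not the project's real settings.
    """
    if connect is None:
        # Imported lazily so this file can be loaded without MySQL installed.
        import MySQLdb
        connect = MySQLdb.connect
    conn = connect(host=host, user=user, passwd=passwd, db=db)
    c = conn.cursor()
    return c, conn
```

Any other file then only needs to import this function and call it, so a changed password is updated in one place.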
Next, let's go ahead and edit our __init__.py file, adding a register function. For now we'll
keep it simple, mostly just to test our connection functionality.
We allow for GET and POST, but aren't handling them just yet.
We're going to just try to run the imported connection function, which returns c and conn
(cursor and connection objects). If the connection is successful, we just have the page say okay;
otherwise it will output the error.
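The behaviour of that register function can be sketched independently of the web framework. The name `register_page` and the two demo connection functions are hypothetical; in the Flask application the same body would sit under a route decorator that allows GET and POST, and would call the imported connection function.

```python
def register_page(connection):
    """Try to connect; return "okay" on success, otherwise the error text."""
    try:
        c, conn = connection()
        conn.close()  # nothing else to do yet, so release the connection
        return "okay"
    except Exception as e:
        return str(e)

# Demo stand-ins so the sketch runs without a database server.
class _DemoConn:
    def close(self):
        pass

def working_connection():
    return "cursor", _DemoConn()

def failing_connection():
    raise RuntimeError("Access denied for user")

print(register_page(working_connection))  # okay
print(register_page(failing_connection))  # Access denied for user
```

Returning the raw error text is convenient while testing the connection, but a production page would normally log it and show a generic message instead.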
INPUT DESIGN AND OUTPUT DESIGN
INPUT DESIGN
The input design is the link between the information system and the user. It comprises
developing the specifications and procedures for data preparation, that is, the steps necessary to
put transaction data into a usable form for processing. This can be achieved by instructing the
computer to read data from a written or printed document, or by having people
key the data directly into the system. The design of input focuses on controlling the amount
of input required, controlling errors, avoiding delay, avoiding extra steps and keeping the
process simple. The input is designed so that it provides security and ease of use
while retaining privacy. Input design considered the following things:
➢ What data should be given as input?
➢ How should the data be arranged or coded?
➢ The dialog to guide the operating personnel in providing input.
➢ Methods for preparing input validations and the steps to follow when errors occur.
OBJECTIVES
1. Input design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important to avoid errors in the data input process and
to show management the correct direction for getting correct information from the
computerized system.
2. It is achieved by creating user-friendly screens for data entry that can handle large volumes of
data. The goal of designing input is to make data entry easier and free from errors. The
data entry screen is designed in such a way that all the data manipulations can be performed. It
also provides record viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the help of
screens. Appropriate messages are provided as and when needed so that the user is never left in a
maze. Thus, the objective of input design is to create an input layout that is easy to
follow.
OUTPUT DESIGN
A quality output is one which meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the users and to
other systems through outputs. In output design it is determined how the information is to be
displayed for immediate need and also as hard copy output. It is the most important and direct
source of information for the user. Efficient and intelligent output design improves the system’s
relationship with the user and helps decision-making.
1. Designing computer output should proceed in an organized, well thought out manner; the
right output must be developed while ensuring that each output element is designed so that
people will find the system easy and effective to use. When analysts design computer
output, they should identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following
objectives.
❖ Convey information about past activities, current status or projections of the future.
❖ Signal important events, opportunities, problems, or warnings.
❖ Trigger an action.
❖ Confirm an action.
SCREEN SHOTS
CODING
1. USER.HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta content="width=device-width, initial-scale=1.0" name="viewport">
<title>Malicious Social Bots </title>
<meta content="" name="description">
<meta content="" name="keywords">


<link
href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,600,600i,700,700i|Roboto:300,300i,400,400i,500,500i,600,600i,700,700i|Poppins:300,300i,400,400i,500,500i,600,600i,700,700i" rel="stylesheet">

<link href="../static/vendor/animate.css/animate.min.css" rel="stylesheet">
<link href="../static/vendor/bootstrap/css/bootstrap.min.css" rel="stylesheet">
<link href="../static/vendor/bootstrap-icons/bootstrap-icons.css" rel="stylesheet">
<link href="../static/vendor/boxicons/css/boxicons.min.css" rel="stylesheet">
<link href="../static/vendor/glightbox/css/glightbox.min.css" rel="stylesheet">
<link href="../static/vendor/swiper/swiper-bundle.min.css" rel="stylesheet">

<link href="../static/css/style.css" rel="stylesheet">

</head>
<body>

<header id="header" class="fixed-top d-flex align-items-center">
<div class="container d-flex align-items-center justify-content-between">
<h1 class="logo"><a href="index.html">Malicious</a></h1>


<nav id="navbar" class="navbar">
<ul>
<li><a class="nav-link scrollto " href="{{ url_for('profile')}}">profile</a></li>
<li><a class="nav-link scrollto " href="{{ url_for('prediction')}}">Tweets</a></li>
<li><a class="nav-link scrollto " href="{{ url_for('users')}}">Timeline</a></li>
<li><a class="nav-link scrollto " href="{{ url_for('index')}}">Logout</a></li>
</ul>
<i class="bi bi-list mobile-nav-toggle"></i>
</nav>
</div>
</header>

<section id="hero">
<div class="hero-container">
<div id="heroCarousel" data-bs-interval="5000" class="carousel slide carousel-fade"
data-bs-ride="carousel">
<ol class="carousel-indicators" id="hero-carousel-indicators"></ol>
<div class="carousel-inner" role="listbox">


<div class="carousel-item active" style="background: url(../static/img/mal6.jpg);">
<div class="carousel-container">
<div class="carousel-content">
<h2 class="animate__animated animate__fadeInDown">Detection of Malicious
Social Bots Using machine Learning</h2>
<div>
</div>
</div>
</div>
</div>


</div>
</div>
</div>
</section>
<main id="main">
<section id="services" class="services">
<div class="container">
<div class="section-title">
<h2>tweet</h2>
<div id="fields">
<center>
<table>
{% for user in userDetails %}
<tr>
<td> <h3>Tweet :</h3> <font
style="color: #c43c35;font-size:25px;">@{{user[1]}}</font> </td>
</tr>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
<tr>
<td>
<textarea readonly="" style="width:
400px; height: 90px; border-color: white; color: black">{{user[9]}} </textarea>
</td>
</tr><br>
{% endfor %}
</table>
</center>
</div>
</div>
</section>
<a href="#" class="back-to-top d-flex align-items-center justify-content-center"><i
class="bi bi-arrow-up-short"></i></a>

<script src="../static/vendor/bootstrap/js/bootstrap.bundle.min.js"></script>
<script src="../static/vendor/glightbox/js/glightbox.min.js"></script>
<script src="../static/vendor/isotope-layout/isotope.pkgd.min.js"></script>
<script src="../static/vendor/php-email-form/validate.js"></script>
<script src="../static/vendor/purecounter/purecounter.js"></script>
<script src="../static/vendor/swiper/swiper-bundle.min.js"></script>

<script src="../static/js/main.js"></script>
</body>
</html>
2. PROFILE.HTML
<!DOCTYPE html>
<html lang="en">
<head>

<link

</head>
<body>


<ul>
<li><a class="nav-link scrollto " href="{{ url_for('index')}}">Logout</a></li>
</ul>
</div>

<section id="hero">


Social Bots Using machine Learning</h2>
<div>
</div>
</div>
</div>
</div>


</div>
</div>
</div>
<main id="main">
<h2> Your Profile details</h2>
<div id="fields">
<form class="form-horizontal">
<div class="row">
<div class="span4">
<div class="control-group">
<div class="controls">
<label style="color:black"><b>Your name :</b></label>
<input class="span4" readonly="" value="{{account[1]}}">
</div>
</div>
</br>
<label style="color:black"><b>Your email :</b></label>

</div>
</div>
</br>
<label style="color:black"><b>Password :</b></label>
</div>
</div>
</div>
</div>
</form>
</div>
</div>
</section>
</body>
</html>
3. PREVIEW.HTML
<!DOCTYPE html>
<html lang="en">
<html lang="en">
<head>

<link

</head>
<body>

<ul>
<li><a class="nav-link scrollto active" href="{{ url_for('index')}}">Home</a></li>
<li><a class="nav-link scrollto" href="#services">Abstract</a></li>
url_for('upload')}}">upload</a></li>
</ul>
</div>

<section id="hero">


<div>
</div>
</div>
</div>
</div>


</div>
</div>
</div>
<main id="main">
<h2>Preview</h2>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="description" content="">
<meta name="author" content="">

<link href="../static/vendor/fontawesome-free/css/all.min.css" rel="stylesheet"
type="text/css">
<link href="https://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet"
type="text/css">
<link href="https://fonts.googleapis.com/css?family=Lato:400,700,400italic,700italic"
rel="stylesheet" type="text/css">

<link href="../static/css/freelancer.min.css" rel="stylesheet">
<style>
#loading {
background: url('../static/ajax-loader.gif') no-repeat center center;
position: absolute;
top: 0;
left: 99;
height: 100%;
width: 90%;
z-index: 9999999;
}
</style>
</head>
<body id="page-top">


<section class="page-section" id="contact">
<br>
<br>

<h2 class="text-center text-uppercase text-secondary mb-0"> </h2>


<div class="divider-custom">
<div class="divider-custom-line"></div>
<div class="divider-custom-icon">
<i class="fas fa-star"></i>
</div>
</div>


<div class="row" style="margin-left:-150px">
<div class="col-lg-8 mx-auto">

{{ df_view.to_html(classes="table striped",na_rep="-") | safe}}
</div>
</div>
</section>
<div class="form-group" style="padding:0px 250px 10px 40px;height:200px">
<input style="margin-left:300px" type="button" onclick="hideLoader()"
class="btn btn-primary" value="Click to Train | Test" />
<div id="loading" style="display:None;margin-top:552950px"></div>
</div>

<section class="copyright py-4 text-center text-white">
</div>
</section>

<div class="scroll-to-top d-lg-none position-fixed ">
<a class="js-scroll-trigger d-block text-center text-white rounded" href="#page-top">
<i class="fa fa-chevron-up"></i>
</a>
</div>

<script type='text/javascript'
src='https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js'></script>
<script type='text/javascript'>
function hideLoader() {
$('#loading').show(0).delay(1000).hide(0,function(){
alert("Training finished!");
window.location = "{{url_for('register')}}";
});
}
</script>
</body>
</html>
</div>
</div>
</section>
<footer id="footer">
<div class="footer-top">
<div class="row">
<div class="col-lg-3 col-md-6">
<div class="footer-info">
</div>
</div>
<div class="col-lg-2 col-md-6 footer-links">
</div>
</div>
<div class="col-lg-4 col-md-6 footer-newsletter">
</div>
</div>
</div>
</div>
<div class="copyright">
</div>
<div class="credits">
</div>
</footer>
</body>
</html>
4. USERDETAIL.HTML
<!DOCTYPE html>
<html lang="en">
<head>

<link


<style>
#customers {
font-family: "Trebuchet MS", Arial, Helvetica, sans-serif;
font-size: 20px;
border-collapse: collapse;
width: 100%;
}
#customers td, #customers th {
border: 1px solid #ddd;
padding: 15px;
}
#customers tr:nth-child(even){background-color: #f2f2f2;}
#customers tr:hover {background-color: #ddd;}
#customers th {
padding-top: 12px;
padding-bottom: 12px;
text-align: left;
background-color: #1DA1F2;
color: white;
}
</style>
</head>
<body>

<ul>
<li><a class="nav-link scrollto" href="{{ url_for('index')}}">Home</a></li>
<li><a class="nav-link scrollto" href="{{ url_for('userdetail')}}">Register
details</a></li>
<li><a class="nav-link scrollto " href="{{ url_for('admin')}}">Full
details</a></li>
<li><a class="nav-link scrollto " href="{{ url_for('user')}}">Analysis</a></li>
</ul>
</div>

<section id="hero">


<div>
</div>
</div>
</div>
</div>

</div>
</div>
</div>
<main id="main">

<h2>Register details</h2>
<table id="customers" style="margin-right: 300px">
<tr>
<th>user_id</th>
<th>user_name</th>
<th>Email</th>
<th>password</th>
</tr>
{% for userdetail in useradmin %}

<tr>
<td> <p style="color:blue">{{userdetail[0]}} </p></td>

</tr>
{% endfor %}
</table>
</div>
</div>
</section>



</body>
</html>
5. USER.HTML
<!DOCTYPE html>
<html lang="en">
<head>


<link

</head>
<body>


<ul>
details</a></li>
details</a></li>
url_for('user')}}">Analysis</a></li>
</ul>
</div>

<section id="hero">

<div>
</div>
</div>
</div>
</div>


</div>
</div>
</div>
</section>
<link
<link href="../static/vendor/bootstrap/css/bootstrap.min.css" rel="stylesheet">

<style>
#customers {
font-family: "Trebuchet MS", Arial, Helvetica, sans-serif;
font-size: 20px;
border-collapse: collapse;
width: 100%;
}
#customers td, #customers th {
border: 1px solid #ddd;
padding: 15px;
}
#customers th {
padding-top: 12px;
padding-bottom: 12px;
text-align: left;
background-color: #1DA1F2;
color: white;
}
</style>
</head>
<body>

<ul>
details</a></li>
details</a></li>
<li><a class="nav-link scrollto " href="{{ url_for('user')}}">Analysis</a></li>
</ul>
</div>

<section id="hero">

<div>
</div>
</div>
</div>
</div>


</div>
</div>
</div>
<main id="main">
<h2>Full details users</h2>
<center>
<table id="customers" style="margin-right: 300px">
<tr>
<th>user_id</th>
<th>user_name</th>
<th>Email</th>
<th>tweets</th>
<th> prediction</th>
<th> status</th>
<th> action</th>
</tr>
{% for admin in userDetails %}
<tr>
<form action="{{ url_for('blockUser')}}" method="post" autocomplete="off">
<td> <input name="fid" size= "5" style="color:blue;border:none"
value="{{admin[0]}}" readonly /> </td>
<td> <p style="color:blue">{{admin[1]}} </p></td>
<td> <p style="color:blue">{{admin[2]}}</p> </td>
<td> <p style="color:blue">{{admin[9]}}</p> </td>
<td><button type="submit" id="Geeks" class="btn btn-danger">Block</button></td>
</form>
</tr>
{% endfor %}
</table>
</center>
</div>
</div>
</section>

</body>
</html>
7. PREDECTION.HTML
<!DOCTYPE html>
<html lang="en">
<head>

<link


</head>
<body>

<ul>
</ul>
</div>

<section id="hero">

<div>
</div>
</div>
</div>
</div>


</div>
</div>
</div>
<main id="main">
<h2>Tweet</h2>
<body>
<div class="login">
<h1 class="cover-heading" ></h1>
<div class="col-lg-4 ml-auto" style="margin-left:390px;" >


<form action="{{ url_for('chart') }}" method="post">
<textarea type="text" class="form-control text-black" style="width:300px;
height:100px" name="news" placeholder="post your tweet" ></textarea><br>
<div class="col-lg-4 ml-auto" style="margin-left:90px;" >
<center><button type="submit" class="btn btn-primary btn-large">Predict</button></a></center>
</div>
</form>
</div>
</div>
</body>
</div>
</div>
</section>
<div class="row">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>
8. INDEX.HTML
<!DOCTYPE html>
<html lang="en">
<head>


<link
<link href="../static/style.css" rel="stylesheet">

</head>
<body>

<ul>

<li><a class="nav-link scrollto " href="{{ url_for('login')}}">Login</a></li>
url_for('register')}}">Register</a></li>
url_for('loginadmin')}}">Admin</a></li>
</ul>
</div>

<section id="hero">

<div class="carousel-item active" style="background: url(../static/img/mal.jpg);">
Social Bots Using Learning</h2>
<p class="animate__animated animate__fadeInUp">Automata With URL Features
in Twitter Network</p>
<div>
</div>
</div>
</div>
</div>

<div class="carousel-item" style="background: url(../static/img/mal5.png);">
Social Bots Using Learning</h2>
<p class="animate__animated animate__fadeInUp">Automata With URL Features
in Twitter Network</p>
<div>
</div>
</div>
</div>
</div>

</div>
<a class="carousel-control-prev" href="#heroCarousel" role="button" data-bs-slide="prev">
<span class="carousel-control-prev-icon bi bi-chevron-left" aria-hidden="true"></span>
</a>
<a class="carousel-control-next" href="#heroCarousel" role="button" data-bs-slide="next">
<span class="carousel-control-next-icon bi bi-chevron-right" aria-hidden="true"></span>
</a>
</div>
</div>
<main id="main">
<h2>Abstract</h2>
<p style="color:black">Malicious social bots generate fake tweets and
automate their social relationships either by posing as
followers or by creating multiple fake accounts that carry out malicious
activities. Moreover, malicious social bots post shortened malicious URLs in tweets to
redirect the requests of
online social networking participants to malicious servers.
Hence, distinguishing malicious social bots from legitimate users
is one of the most important tasks in the Twitter network.
To detect malicious social bots, extracting URL-based features
(such as URL redirection, frequency of shared URLs, and spam
content in URLs) takes less time than extracting
social graph-based features (which rely on the social interactions of users).
Furthermore, malicious social bots cannot easily
manipulate URL redirection chains. In this article, a learning
automata-based malicious social bot detection (LA-MSBD) algorithm is proposed that
integrates a trust computation model with
URL-based features to identify trustworthy participants
(users) in the Twitter network. The proposed trust computation model contains two
parameters, namely, direct trust and
indirect trust. The direct trust is derived from Bayes’
theorem, and the indirect trust is derived from the Dempster–Shafer
theory (DST), to determine the trustworthiness of each
participant accurately. Experiments were performed on
two Twitter data sets, and the results show that the proposed
algorithm improves precision, recall, F-measure,
and accuracy compared with existing approaches for MSBD.</p>
</div>
</div>
</section>
<div class="row">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>
-------------------------------
9. LOGINADMIN.HTML
<!DOCTYPE html>
<html lang="en">
<head>

<link

</head>
<body>

<ul>
</ul>
</div>

<section id="hero">

<div>
</div>
</div>
</div>
</div>

</div>
</div>
</div>
<main id="main">
<h2> Admin login</h2>
<head>
<style>
body {
background-image: url("../static/images/email3.jpg");
background-color: #cccccc;
}
</style>
<script>
addEventListener("load", function () {
setTimeout(hideURLbar, 0);
}, false);
function hideURLbar() {
window.scrollTo(0, 1);
}
function login(){
var uname = document.getElementById("uname").value;
var pwd = document.getElementById("pwd").value;
if(uname == "admin" && pwd == "admin")
{
alert("Login Success!");
window.location = "{{url_for('userdetail')}}";
return false;
}
else
{
alert("Invalid Credentials!")
}
}
</script>
</head>

<section class="page-section portfolio" id="portfolio">
<br>
<br>


<div class="row">

<div class="col-md-6 col-lg-4" style="margin-left:380px">


<label class="control-label"
for="username"><b>Username</b></label>
<input type="text" id="uname" name="uname" placeholder=""
class="form-control">
</div>
</div>
</br>

<label class="control-label"
for="password"><b>Password</b></label>
<input type="password" id="pwd" name="pwd" placeholder=""
class="form-control">
</div>
</div>
<div class="col-md-6 col-lg-4" style="margin-left:-150px">

<br>
<input type="button" class="btn btn-success" value="Login"
style="margin-left: 280px" onclick="login()">
</div>
</div>
</div>
</div>
</div>
</section>
</body>
</div>
</div>
</section>
<div class="row">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>
10. UPLOAD.HTML
<!DOCTYPE html>
<html lang="en">
<head>

<link


</head>
<body>

<h1 class="logo"><a href="index.html"> Malicious </a></h1>
<ul>
</ul>
</div>

<section id="hero">


<div>
</div>
</div>
</div>
</div>


</div>
</div>
</div>
<main id="main">
<h2>upload</h2>

<section class="col-md-12 col-sm-12 col-xs-12" id="about">



<div class="col-md-12 col-sm-12 col-xs-12" style="margin-left:100px;">
<form action="http://localhost:5000/preview" name="fs" id="fs" method="post"
enctype="multipart/form-data">
<br/>
<input type="file" name="datasetfile" id="file1" required />
<br/>
<br/>
<br/>
<input type="submit" style="margin-right:230px" class="btn btn-success btn-large" value="Upload">
</form>
</div>
</section>
</body>
</div>
</div>
</section>
<div class="row">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>
11. REGISTER.HTML
<!DOCTYPE html>
<html lang="en">
<head>


<link

</head>
<body>


<ul>
<li><a class="nav-link scrollto " href="{{ url_for('login')}}">Login</a></li>
</ul>
</div>

<section id="hero">

<div>
</div>
</div>
</div>
</div>


</div>
</div>
</div>
<main id="main">
<h2></h2>
<head>
</head>
<br>
<br>
<div class="login-title text-center">
<div class="row">
</div>
</div>
</div>
</div>
<div class="row">
<h2>Register</h2>
<form action="{{ url_for('register') }}" method="post" >

<label style="color:black"><b>Username :</b></label>
<input type="text" name="username" placeholder="Username"
required>
</div>
</div>
</br>

<label style="color:black"><b>Email ID : </b></label>
<input type="email" name="email" placeholder="Email" required>
</div>
</div>
</br>
<div class="form-group">

<input type="password" name="password"
placeholder="Password" required>
</div>
<div class="col-md-6 col-lg-4" style="margin-left:-100px">

<br>
<div class="msg">{{ msg }}</div>
<input type="submit" class="btn btn-success" value="submit"
style="margin-left: 240px" >
</div>
</div>
</div>
</div>
</div>
</form>
</div>
</div>
</section>
</body>
</div>
</div>
</section>
<div class="row">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>
12. LOGIN.HTML
<!DOCTYPE html>
<html lang="en">
<head>


<link
href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,600,600i,700,700i|Roboto:300,300i,400,400i,500,500i,600,600i,700,700i|Poppins:300,300i,400,400i,500,500i,600,600i,700,700i" rel="stylesheet">
<link href="../static/vendor/bootstrap-icons/bootstrap-icons.css"
rel="stylesheet">



<ul>
url_for('login')}}">Login</a></li>
</ul>
</div>

<section id="hero">
<div id="heroCarousel" data-bs-interval="4000" class="carousel slide
carousel-fade" data-bs-ride="carousel">

<div class="carousel-item active" style="background:
url(../static/img/mal6.jpg);">
<h2 class="animate__animated animate__fadeInDown">Detection of
Malicious Social Bots Using Machine Learning</h2>
<div>
</div>
</div>
</div>
</div>

</div>

</div>
</div>
</div>
<main id="main">
<h2></h2>
<head>
</head>
<br>
<br>
<div class="login-title text-center">

<div class="row">
</div>
</div>
</div>
</div>
<form action="{{ url_for('login') }}" method="post">
<div class="row">
<h2>Login</h2>

<label style="color:black"><b>Username :</b></label>
<input type="text" name="username"
placeholder="Username" >
</div>
</div>
</br>

<input type="password" name="password"
placeholder="Password" required>
</div>
</div>
<div class="col-md-6 col-lg-4" style="margin-left:-120px">

<br>
<div class="msg">{{ msg }}</div>
<input type="submit" class="btn btn-success"
value="Login" style="margin-left: 240px" >
</div>
</div>
</div>
</div>
</div>
</form>
</div>
</div>
</section>
{% with messages = get_flashed_messages() %}
{% if messages %}
<script>
var messages = {{ messages | tojson }};
for (var i=0; i<messages.length; i++) {

alert(messages[i]);
}
</script>
{% endif %}
{% endwith %}
</body>
</div>
</div>
</section>
<div class="row">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</footer>
<a href="#" class="back-to-top d-flex align-items-center justify-content-center"><i class="bi bi-arrow-up-short"></i></a>
</body>
</html>
--------------------------
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub-assemblies, assemblies, and/or a finished product. It is the process of
exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests, and each type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application, and it is done after the completion of an individual unit and before integration.
This is structural testing that relies on knowledge of the unit's construction and is invasive.
Unit tests perform basic tests at the component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly defined
inputs and expected results.
Integration testing
Integration tests are designed to test integrated software components to determine whether they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that, although the components were
individually satisfactory, as shown by successful unit testing, the combination of
components is correct and consistent. Integration testing is specifically aimed at exposing the
problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that the functions tested are available as
specified by the business and technical requirements, system documentation, and user
manuals.
Functional testing is centred on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or
special test cases. In addition, systematic coverage of identified business process
flows, data fields, predefined processes, and successive processes must be considered for
testing. Before functional testing is complete, additional tests are identified and the effective
value of current tests is determined.
System Test
System testing ensures that the entire integrated software system meets requirements. It tests
a configuration to ensure known and predictable results. An example of system testing is the
configuration-oriented system integration test. System testing is based on process descriptions
and flows, emphasizing pre-driven process links and integration points.
White Box Testing

White Box Testing is testing in which the software tester has knowledge of the inner
workings, structure, and language of the software, or at least its purpose. It is used
to test areas that cannot be reached from a black box level.
Black Box Testing
Black Box Testing is testing the software without any knowledge of the inner workings,
structure, or language of the module being tested. Black box tests, like most other kinds of tests,
must be written from a definitive source document, such as a specification or requirements
document. The software under test is treated as a black box: you cannot “see” into it. The
test provides inputs and responds to outputs without considering how the software works.
6.1 Unit Testing:

Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two
distinct phases.
Test strategy and approach
Field testing will be performed manually and functional tests will be written in detail.
Test objectives
➢ All field entries must work properly.
➢ Pages must be activated from the identified link.
➢ The entry screen, messages and responses must not be delayed.
Features to be tested
➢ Verify that the entries are of the correct format

➢ No duplicate entries should be allowed
➢ All links should take the user to the correct page.
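
The format and duplicate-entry checks listed above could be exercised with a small helper such as the following. This is a minimal sketch, not the project's actual code: the function name `validate_registration`, the `EMAIL_RE` pattern, and the error strings are all assumptions chosen for illustration, and the field names simply mirror the registration form (username, email, password).

```python
import re

# Hypothetical pattern: a deliberately simple e-mail shape check, not RFC-complete.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_registration(username, email, password, existing_usernames):
    """Return a list of error strings; an empty list means the entry is accepted."""
    errors = []
    if not username.strip():
        errors.append("username is required")
    elif username in existing_usernames:
        # "No duplicate entries should be allowed"
        errors.append("username already registered")
    if not EMAIL_RE.match(email):
        # "Verify that the entries are of the correct format"
        errors.append("email format is invalid")
    if not password:
        errors.append("password is required")
    return errors


if __name__ == "__main__":
    print(validate_registration("alice", "alice@example.com", "s3cret", {"bob"}))
    print(validate_registration("bob", "not-an-email", "pw", {"bob"}))
```

Returning a list of messages rather than raising on the first problem lets a single test case assert on every rejected field at once.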
6.2 Integration Testing

Software integration testing is the incremental testing of two or more integrated
software components on a single platform to expose failures caused by interface defects.
The task of the integration test is to check that components or software applications (for
example, components in a software system or, one step up, software applications at the
company level) interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
6.3 Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
CONCLUSION
This article presents an LA-MSBD algorithm that integrates a trust computation model with
a set of URL-based features for MSBD. In addition, we evaluate the trustworthiness of tweets
(posted by each participant) by using Bayesian learning and DST. Moreover, the proposed
LA-MSBD algorithm executes a finite set of learning actions to update the action probability
value (i.e., the probability of a participant posting malicious URLs in the tweets). The proposed
LA-MSBD algorithm achieves the advantages of incremental learning. Two Twitter data sets are
used to evaluate the performance of the proposed LA-MSBD algorithm. The experimental
results show that the proposed LA-MSBD algorithm achieves up to a 7% improvement in
accuracy compared with other existing algorithms. For The Fake Project and Social Honeypot
data sets, the proposed LA-MSBD algorithm achieves precisions of 95.37% and 91.77%
for MSBD, respectively. As a future research challenge, we would like to
investigate the dependence among the features and its impact on MSBD.
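
To illustrate what updating an action probability value can look like, the sketch below applies the standard linear reward-penalty scheme from the learning-automata literature. It is an assumption-laden illustration, not necessarily the exact update rule used by LA-MSBD: the function name and the learning rates `a` (reward) and `b` (penalty) are hypothetical.

```python
def update_action_probabilities(probs, chosen, rewarded, a=0.1, b=0.1):
    """One step of a linear reward-penalty learning automaton.

    probs    -- current action probabilities (at least two actions, summing to 1)
    chosen   -- index of the action that was just performed
    rewarded -- True if the environment rewarded the action, False if it penalised it
    a, b     -- reward and penalty learning rates (illustrative defaults)
    """
    r = len(probs)
    updated = []
    for j, p in enumerate(probs):
        if rewarded:
            # Move probability mass towards the chosen action.
            new_p = p + a * (1.0 - p) if j == chosen else (1.0 - a) * p
        else:
            # Move probability mass away from the chosen action,
            # sharing it equally among the other actions.
            new_p = (1.0 - b) * p if j == chosen else b / (r - 1) + (1.0 - b) * p
        updated.append(new_p)
    return updated


if __name__ == "__main__":
    probs = [0.5, 0.5]  # e.g. [P(malicious URL), P(benign URL)] for one participant
    probs = update_action_probabilities(probs, chosen=0, rewarded=True)
    print(probs)
```

Both branches keep the probabilities summing to one, so the vector can be updated incrementally as each new tweet is observed.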
LITERATURE SURVEY
1) Detecting malicious social bots based on clickstream sequences
AUTHORS: P. Shi, Z. Zhang, and K.-K.-R. Choo

With the significant increase in the volume, velocity, and variety of user data (e.g.,
user-generated data) in online social networks, there have been attempts to design new ways of
collecting and analyzing such big data. For example, social bots have been used to perform
automated analytical services and provide users with improved quality of service. However,
malicious social bots have also been used to disseminate false information (e.g., fake news),
and this can result in real-world consequences. Therefore, detecting and removing malicious
social bots in online social networks is crucial. Most existing detection methods for
malicious social bots analyze the quantitative features of their behavior. These features are
easily imitated by social bots, resulting in low detection accuracy. This paper presents a novel
method of detecting malicious social bots that combines feature selection based on the
transition probability of clickstream sequences with semi-supervised clustering. The method
not only analyzes the transition probability of user behavior clickstreams
but also considers the time feature of behavior. Findings from the authors' experiments on real
online social network platforms demonstrate that the detection accuracy for different types of
malicious social bots using the method based on the transition
probability of user behavior clickstreams increases by an average of 12.8% in comparison with
the detection method based on quantitative analysis of user behavior.
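
As a rough illustration of the transition-probability idea summarised above, first-order transition probabilities can be estimated from clickstream sequences by counting consecutive action pairs. The sketch below is an assumption for illustration only: the function name and the action labels are hypothetical, and it omits the time feature and the semi-supervised clustering step described in the paper.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next action | current action) from a list of clickstream sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count every consecutive (current, next) pair in the sequence.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


if __name__ == "__main__":
    clicks = [
        ["login", "post", "post", "logout"],  # hypothetical action labels
        ["login", "like", "post"],
    ]
    for state, row in transition_probabilities(clicks).items():
        print(state, row)
```

Each row of the resulting table can then serve as a feature vector for a user, on the assumption that bots and humans move between actions with different regularity.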
2) Adaptive deep Q-learning model for detecting social bots and influential users in
online social networks
AUTHORS: G. Lingam, R. R. Rout, and D. V. L. N. Somayajulu

In an online social network (like Twitter), a botmaster (i.e., leader among a group of social
bots) establishes a social relationship among legitimate participants to reduce the probability
of social bot detection. Social bots generate fake tweets and spread malicious information by
manipulating the public opinion. Therefore, the detection of social bots in an online social
network is an important task. In this paper, we consider social attributes, such as tweet-based
attributes, user profile-based attributes and social graph-based attributes for detecting the
social bots among legitimate participants. We design a deep Q-network architecture by
incorporating a Deep Q-Learning (DQL) model using the social attributes in the Twitter
network for detection of social bots based on updating Q-value function (i.e., state-action value
function). We consider each social attribute of a user as a state and the learning agent’s
movement from one state to another state is considered as an action. For Q-value function, we
consider all the state-action pairs in order to construct the state transition probability values
between the state-action pairs. In the proposed DQL algorithm, the learning agent chooses a
specific learning action with an optimal Q-value in each state for social bot detection. Further,
we also propose an approach that identifies the most influential users (which are influenced
by the social bots) based on tweets and the users’ interactions. Experimentation using
datasets collected from the Twitter network illustrates the efficacy of the proposed model.
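
The Q-value update mentioned above can be sketched in its simplest tabular form. This is a simplification for illustration: the paper uses a deep Q-network rather than a lookup table, and the function name, state and action labels, learning rate `alpha`, and discount factor `gamma` below are assumptions.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q = defaultdict(float)  # unseen state-action pairs start at 0.0
    actions = ["to_profile_attr", "to_graph_attr"]  # hypothetical action labels
    q[("profile_attr", "to_graph_attr")] = 2.0
    new_value = q_update(q, "tweet_attr", "to_profile_attr", 1.0, "profile_attr", actions)
    print(new_value)
```

Here each state stands for a social attribute and each action for a move to another attribute, mirroring the description in the abstract.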
3) Fluxing botnet command and control channels with URL shortening services
AUTHORS: S. Lee and J. Kim

URL shortening services (USSes), which provide short aliases to registered long URLs, have
become popular owing to Twitter. Despite their popularity, researchers do not carefully
consider their security problems. In this paper, we explore botnet models based on USSes to
prepare for new security threats before they evolve. Specifically, we consider using USSes for
alias flux to hide botnet command and control (C&C) channels. In alias flux, a botmaster
obfuscates the IP addresses of his C&C servers, encodes them as URLs, and then registers
them to USSes with custom aliases generated by an alias generation algorithm. Later, each bot
obtains the encoded IP addresses by contacting USSes using the same algorithm. For USSes
that do not support custom aliases, the botmaster can use shared alias lists instead of the shared
algorithm. DNS-based botnet detection schemes cannot detect an alias flux botnet, and
network-level detection and blacklisting of the fluxed aliases are difficult. We also discuss
possible countermeasures to cope with these new threats and investigate operating USSes.
4) A neural network-based ensemble approach for spam detection in Twitter
AUTHORS: S. Madisetty and M. S. Desarkar

As the social networking sites get more popular, spammers target these sites to spread spam
posts. Twitter is one of the most popular online social networking sites where users
communicate and interact on various topics. Most of the current spam filtering methods in
Twitter focus on detecting the spammers and blocking them. However, spammers can create a
new account and start posting new spam tweets again. So there is a need for robust spam
detection techniques to detect the spam at tweet level. These types of techniques can prevent
the spam in real time. To detect the spam at tweet level, often features are defined, and
appropriate machine learning algorithms are applied in the literature. Recently, deep learning
methods are showing fruitful results on several natural language processing tasks. We want to
use the potential benefits of these two types of methods for our problem. Toward this, we
propose an ensemble approach for spam detection at tweet level. We develop various deep
learning models based on convolutional neural networks (CNNs). Five CNNs and one
feature-based model are used in the ensemble. Each CNN uses different word embeddings (GloVe,
Word2vec) to train the model. The feature-based model uses content-based, user-based, and
n-gram features. Our approach combines both deep learning and traditional feature-based models
using a multilayer neural network which acts as a meta-classifier. We evaluate our method on
two data sets, one data set is balanced, and another one is imbalanced. The experimental results
show that our proposed method outperforms the existing methods.
5) Comparisons of machine learning techniques for detecting malicious webpages
AUTHORS: H. B. Kazemian and S. Ahmed
This paper compares machine learning techniques for detecting malicious webpages. The
conventional method of detecting malicious webpages is to go through a blacklist and
check whether the webpages are listed. A blacklist is a list of webpages that are classified
as malicious from a user’s point of view. These blacklists are created by trusted organizations
and volunteers. They are then used by modern web browsers such as Chrome, Firefox, and
Internet Explorer. However, a blacklist is ineffective because of the frequently changing nature
of webpages, the growing number of webpages, which poses scalability issues, and the crawlers’
inability to visit intranet webpages that require computer operators to log in as authenticated
users. The paper therefore applies machine learning algorithms as alternative and novel
approaches to detecting malicious webpages. Three supervised machine
learning techniques (K-Nearest Neighbor, Support Vector Machine, and Naive Bayes
Classifier) and two unsupervised machine learning techniques (K-Means and Affinity
Propagation) are employed. The authors note that K-Means and Affinity Propagation had not been
applied to the detection of malicious webpages by other researchers. All these machine learning
techniques were used to build predictive models to analyze a large number of malicious
and safe webpages. These webpages were downloaded by a concurrent crawler taking
advantage of gevent. The webpages were parsed, and various features such as content, URL,
and screenshot of webpages were extracted to feed into the machine learning models.
Computer simulation results produced an accuracy of up to 98% for the supervised
techniques and a silhouette coefficient of close to 0.96 for the unsupervised techniques. These
predictive models have been applied in a practical context whereby Google Chrome can
harness the predictive capabilities of the classifiers that have the advantages of both the
lightweight and the heavyweight classifiers.
REFERENCES
[1] P. Shi, Z. Zhang, and K.-K.-R. Choo, “Detecting malicious social bots based on clickstream
sequences,” IEEE Access, vol. 7, pp. 28855–28862, 2019.
[2] G. Lingam, R. R. Rout, and D. V. L. N. Somayajulu, “Adaptive deep Q-learning model for
detecting social bots and influential users in online social networks,” Appl. Intell., vol. 49, no.
11, pp. 3947–3964, Nov. 2019.
[3] D. Choi, J. Han, S. Chun, E. Rappos, S. Robert, and T. T. Kwon, “Bit.ly/practice:
Uncovering content publishing and sharing through URL shortening services,” Telematics
Inform., vol. 35, no. 5, pp. 1310–1323, 2018.
[4] S. Lee and J. Kim, “Fluxing botnet command and control channels with URL shortening
services,” Comput. Commun., vol. 36, no. 3, pp. 320–332, Feb. 2013.
[5] S. Madisetty and M. S. Desarkar, “A neural network-based ensemble approach for spam
detection in Twitter,” IEEE Trans. Comput. Social Syst., vol. 5, no. 4, pp. 973–984, Dec. 2018.
[6] H. B. Kazemian and S. Ahmed, “Comparisons of machine learning techniques for detecting
malicious webpages,” Expert Syst. Appl., vol. 42, no. 3, pp. 1166–1177, Feb. 2015.
[7] H. Gupta, M. S. Jamal, S. Madisetty, and M. S. Desarkar, “A framework for real-time spam
detection in Twitter,” in Proc. 10th Int. Conf. Commun. Syst. Netw. (COMSNETS), Jan. 2018,
pp. 380–383.
[8] T. Wu, S. Liu, J. Zhang, and Y. Xiang, “Twitter spam detection based on deep learning,” in
Proc. Australas. Comput. Sci. Week Multiconf. (ACSW), 2017, p. 3.
[9] Y. Boshmaf, I. Muslukhov, K. Beznosov, and M. Ripeanu, “Key challenges in defending
against malicious socialbots,” Presented at the 5th USENIX Workshop Large-Scale Exploits
Emergent Threats, 2012, pp. 1–4.
[10] G. Yan, “Peri-watchdog: Hunting for hidden botnets in the periphery of online social
networks,” Comput. Netw., vol. 57, no. 2, pp. 540–555, Feb. 2013.
[11] D. Canali, M. Cova, G. Vigna, and C. Kruegel, “Prophiler: A fast filter for the large-scale
detection of malicious Web pages,” in Proc. 20th Int. Conf. World Wide Web (WWW), 2011,
pp. 197–206.
[12] A. K. Jain and B. B. Gupta, “A machine learning based approach for phishing detection
using hyperlinks information,” J. Ambient Intell. Hum. Comput., vol. 10, no. 5, pp. 2015–
2028, May 2019.
[13] C. Chen, J. Zhang, X. Chen, Y. Xiang, and W. Zhou, “6 million spam tweets: A large
ground truth for timely Twitter spam detection,” in Proc. IEEE Int. Conf. Commun. (ICC),
Jun. 2015, pp. 7065–7070.
[14] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia, “Detecting automation of Twitter
accounts: Are you a human, bot, or cyborg?” IEEE Trans. Dependable Secure Comput., vol.
9, no. 6, pp. 811–824, Nov. 2012.
[15] C. Chen, Y. Wang, J. Zhang, Y. Xiang, W. Zhou, and G. Min, “Statistical features-based
real-time detection of drifted Twitter spam,” IEEE Trans. Inf. Forensics Security, vol. 12, no.
4, pp. 914–925, Apr. 2017.
[16] N. Šrndić and P. Laskov, “Practical evasion of a learning-based classifier: A case study,”
in Proc. IEEE Symp. Secur. Privacy, May 2014, pp. 197–211.
[17] A. Yazidi, O.-C. Granmo, and B. J. Oommen, “Learning-automaton based online
discovery and tracking of spatiotemporal event patterns,” IEEE Trans. Cybern., vol. 43, no. 3,
pp. 1118–1130, Jun. 2013.
[18] M. R. Khojasteh and M. R. Meybodi, “Evaluating learning automata as a model for
cooperation in complex multi-agent domains,” in Robot Soccer World Cup. Springer, 2006,
pp. 410–417.
[19] C.-M. Chen, D. J. Guan, and Q.-K. Su, “Feature set identification for detecting suspicious
URLs using Bayesian classification in social networks,” Inf. Sci., vol. 289, pp. 133–147, Dec.
2014.
[20] T. M. Chen and V. Venkataramanan, “Dempster-Shafer theory for intrusion detection in
ad hoc networks,” IEEE Internet Comput., vol. 9, no. 6, pp. 35–41, Nov. 2005.
[21] S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi, “The paradigm-shift
of social spambots: Evidence, theories, and tools for the arms race,” in Proc. 26th Int. Conf.
World Wide Web Companion (WWW Companion), 2017, pp. 963–972.
[22] K. Lee, B. D. Eoff, and J. Caverlee, “Seven months with the devils: A long-term study of
content polluters on Twitter,” in Proc. ICWSM, 2011, pp. 1–8.
[23] C. Besel, J. Echeverria, and S. Zhou, “Full cycle analysis of a large scale botnet attack on
Twitter,” in Proc. IEEE/ACM Int. Conf. Adv. Social Netw. Anal. Mining (ASONAM), Aug.
2018, pp. 170–177.
[24] J. Echeverria and S. Zhou, “Discovery, retrieval, and analysis of the ‘Star Wars’ botnet in
Twitter,” in Proc. IEEE/ACM Int. Conf. Adv. Social Netw. Anal. Mining (ASONAM), 2017, pp.
1–8.
[26] M. Agarwal and B. Zhou, “Using trust model for detecting malicious activities in Twitter,”
in Proc. Int. Conf. Social Comput., Behav.-Cultural Modeling, Predict. Springer, 2014, pp.
207–214.
[27] G. Lingam, R. R. Rout, and D. V. L. N. Somayajulu, “Detection of social botnet using a
trust model based on spam content in Twitter network,” in Proc. IEEE 13th Int. Conf. Ind. Inf.
Syst. (ICIIS), Dec. 2018, pp. 280–285.
[28] C. Yang, R. Harkreader, and G. Gu, “Empirical evaluation and new design for fighting
evolving Twitter spammers,” IEEE Trans. Inf. Forensics Security, vol. 8, no. 8, pp. 1280–1293,
Aug. 2013.
[30] S. Lee and J. Kim, “WarningBird: A near real-time detection system for suspicious URLs
in Twitter stream,” IEEE Trans. Dependable Secure Comput., vol. 10, no. 3, pp. 183–195, May
2013.
[31] D. R. Patil and J. B. Patil, “Malicious URLs detection using decision tree classifiers and
majority voting technique,” Cybern. Inf. Technol., vol. 18, no. 1, pp. 11–29, Mar. 2018.
[32] H. Guo, S. Li, B. Li, Y. Ma, and X. Ren, “A new learning automata-based pruning method
to train deep neural networks,” IEEE Internet Things J., vol. 5, no. 5, pp. 3263–3269, Oct.
2018.
[33] A. A. Rahmanian, M. Ghobaei-Arani, and S. Tofighy, “A learning automata-based
ensemble resource usage prediction algorithm for cloud computing environment,” Future
Gener. Comput. Syst., vol. 79, pp. 54–71, Feb. 2018.
[34] A. Moayedikia, K.-L. Ong, Y. L. Boo, and W. G. S. Yeoh, “Task assignment in microtask
crowdsourcing platforms using learning automata,” Eng. Appl. Artif. Intell., vol. 74, pp. 212–
225, Sep. 2018.
[35] G. Lingam, R. R. Rout, and D. Somayajulu, “Learning automata-based trust model for
user recommendations in online social networks,” Comput. Electr. Eng., vol. 66, pp. 174–188,
Feb. 2018.
[36] Manju, S. Chand, and B. Kumar, “Target coverage heuristic based on learning automata
in wireless sensor networks,” IET Wireless Sensor Syst., vol. 8, no. 3, pp. 109–115, Jun. 2018.
[37] Q. Sang, Z. Lin, and S. T. Acton, “Learning automata for image segmentation,” Pattern
Recognit. Lett., vol. 74, pp. 46–52, Apr. 2016.
[38] F. Morstatter, J. Pfeffer, H. Liu, and K. M. Carley, “Is the sample good enough? Comparing
data from Twitter’s streaming API with Twitter’s firehose,” in Proc. ICWSM, 2013, pp. 1–9.
[39] A. Neumann, J. Barnickel, and U. Meyer, “Security and privacy implications of URL
shortening services,” in Proc. Workshop Web 2.0 Secur. Privacy, 2010, pp. 1–31.
[40] WHOIS Database. Accessed: Nov. 24, 2018. [Online]. Available:
http://whois.domaintools.com/
[41] K. Hans, L. Ahuja, and S. K. Muttoo, “Detecting redirection spam using multilayer
perceptron neural network,” Soft Comput., vol. 21, no. 13, pp. 3803–3814, Jul. 2017.
[42] X. Yang, Y. Guo, and Y. Liu, “Bayesian-inference-based recommendation in online social
networks,” IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 4, pp. 642–651, Apr. 2013.
[43] N. Abokhodair, D. Yoo, and D. W. McDonald, “Dissecting a social botnet: Growth,
content and influence in Twitter,” in Proc. 18th ACM Conf. Comput. Supported Cooperat.
Work Social Comput., 2015, pp. 839–851.
[44] W. Li and H. Song, “ART: An attack-resistant trust management scheme for securing
vehicular ad hoc networks,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 4, pp. 960–969, Apr.
2016.
[45] Z. Wei, H. Tang, F. R. Yu, M. Wang, and P. Mason, “Security enhancements for mobile
ad hoc networks with trust management using uncertain reasoning,” IEEE Trans. Veh.
Technol., vol. 63, no. 9, pp. 4647–4658, 2014.
[46] L. Zhao, T. Hua, C.-T. Lu, and I.-R. Chen, “A topic-focused trust model for Twitter,”
Comput. Commun., vol. 76, pp. 1–11, Feb. 2016.
[47] A. Rezvanian, M. Rahmati, and M. R. Meybodi, “Sampling from complex networks using
distributed learning automata,” Phys. A, Stat. Mech. Appl., vol. 396, pp. 224–234, Feb. 2014.
[48] H. Huang, X. Wei, and Y. Zhou, “Twin support vector machines: A survey,”
Neurocomputing, vol. 300, pp. 34–43, Jul. 2018.
[49] A. A. Heidari, H. Faris, I. Aljarah, and S. Mirjalili, “An efficient hybrid multilayer
perceptron neural network with grasshopper optimization,” Soft Comput., vol. 23, no. 17, pp.
7941–7958, Sep. 2019.
[50] M. C. Simmonds and J. P. Higgins, “A general framework for the use of logistic regression
models in meta-analysis,” Stat. Methods Med. Res., vol. 25, no. 6, pp. 2858–2877, Dec. 2016.
[51] Y. Zhou and G. Qiu, “Random forest for label ranking,” Expert Syst. Appl., vol. 112, pp.
99–109, Dec. 2018.