Spammer Detect Project Document
CHAPTER-1
INTRODUCTION
Recently, the detection of spam in social networking sites has attracted the attention of
researchers. Spam detection is a difficult task in maintaining the security of social networks. It is essential
to recognize spam on OSN sites to save users from various kinds of malicious attacks and to preserve
their security and privacy. The hazardous manoeuvres adopted by spammers cause massive harm
to the community in the real world. Twitter spammers have various objectives, such as spreading invalid
information, fake news, rumours, and unsolicited messages. Spammers achieve their malicious
objectives through advertisements and several other means, where they sign up for different mailing lists and
subsequently dispatch spam messages randomly to broadcast their interests. These activities cause
disturbance to the legitimate users, who are known as non-spammers. In addition, they also diminish the
repute of the OSN platforms. Therefore, it is essential to design a scheme to spot spammers so that
corrective measures can be taken to counter their malicious activities.
1.2 OVERVIEW
Several research works have been carried out in the domain of Twitter spam detection. To
encompass the existing state of the art, a few surveys have also been carried out on fake user identification
on Twitter. These surveys present a comparative study of the current approaches. On the other hand,
the authors in [5] conducted a survey on the different behaviours exhibited by spammers on the Twitter social
network. The study also provides a literature review that recognizes the existence of spammers on the Twitter
social network. Despite all the existing studies, there is still a gap in the literature. Therefore, to
bridge the gap, we review the state of the art in spammer detection and fake user identification on Twitter.
Moreover, this survey presents a taxonomy of Twitter spam detection approaches and attempts to offer
a detailed description of recent developments in the domain.
1.3 AIM
The aim of this paper is to identify different approaches to spam detection on Twitter and to
present a taxonomy by classifying these approaches into several categories. For classification, we have
identified four means of reporting spammers that can be helpful in identifying fake identities of users.
Spammers can be identified based on: (i) fake content, (ii) URL-based spam detection, (iii) detecting spam in
trending topics, and (iv) fake user identification. Table 1 provides a comparison of existing techniques and
helps users recognize the significance and effectiveness of the proposed methodologies, in addition to
providing a comparison of their goals and results. Table 2 compares different features that are used for
identifying spam on Twitter. We anticipate that this survey will help readers find diverse information on
spammer detection techniques at a single point.
1.4 OBJECTIVES
We introduce SIGPID, a malware detection system based on permission usage analysis, to cope
with the rapid increase in the number of Android malware. Instead of extracting and analyzing all Android
permissions, we develop three levels of pruning by mining the permission data to identify the most significant
permissions that can be effective in distinguishing between benign and malicious apps. SIGPID then utilizes
machine-learning-based classification methods to classify different families of malware and benign apps.
Our evaluation finds that only 22 permissions are significant. We then compare the performance of our
approach, using only 22 permissions, against a baseline approach that analyzes all permissions. The results
indicate that when a Support Vector Machine (SVM) is used as the classifier, we can achieve over 90%
precision, recall, accuracy, and F-measure, which are about the same as those produced by the baseline
approach, while incurring analysis times 4 to 32 times shorter than those of using all permissions.
Compared against other state-of-the-art approaches, SIGPID is more effective, detecting 93.62% of the
malware in the data set and 91.4% of unknown/new malware samples.
1.5 FEATURE
We are hopeful that the presented study will be a useful resource for researchers to find the
highlights of recent developments in Twitter spam detection on a single platform.
CHAPTER-2
LITERATURE SURVEY
A literature survey is the most important step in the software development process. Before developing the
tool it is necessary to determine the time factor, economy and company strength. Once these things are
satisfied, the next steps are to determine which operating system and language can be used for developing
the tool. Once the programmers start building the tool, they need a lot of external support.
This support can be obtained from senior programmers, from books or from websites. Before building
the system we have to know the below concepts for developing the proposed system.
C. Chen et al. proposed statistical features-based real-time detection of drifted Twitter spam. Twitter spam
has become a critical problem nowadays. Recent works focus on applying machine learning techniques for
Twitter spam detection, which make use of the statistical features of tweets. However, the statistical
properties of spam tweets vary over time, and thus the performance of existing machine-learning-based
classifiers decreases. This problem is referred to as "Twitter spam drift". In order to tackle this issue, the
authors first carried out a deep analysis of the statistical features of more than one million spam and
non-spam tweets, and then proposed a novel Lfun scheme. The proposed scheme learns changed spam tweets
from unlabelled tweets and incorporates them into the classifier's training process. Numerous experiments
were performed to evaluate the proposed scheme, and the results show that the Lfun scheme can
significantly improve spam detection accuracy in real-world scenarios.[9]
C. Buntain and J. Golbeck proposed automatically identifying fake news in popular Twitter
threads. Information quality in social media is an increasingly important issue, but web-scale data
hinders experts' ability to assess and correct much of the inaccurate content, or "fake
news", on current platforms. This paper develops a method for automating fake news detection on
Twitter by learning to predict accuracy assessments in two credibility-focused Twitter datasets:
CREDBANK, a crowdsourced dataset of accuracy assessments for events in Twitter, and PHEME, which
contains a set of potential rumours and journalistic assessments of their accuracy. The authors apply
this method to Twitter content sourced from BuzzFeed's fake news dataset, and models trained against
crowdsourced workers outperform models based on journalists' assessments as well as models trained on a
pooled dataset of both crowdsourced workers and journalists. All three datasets, aligned into a uniform
format, are also publicly available. A feature analysis then identifies the features that are most
predictive for crowdsourced and journalistic accuracy assessments, results which are consistent
with prior work.[10]
C. Chen et al. performed a performance evaluation of machine learning-based streaming spam
tweet detection. The popularity of Twitter attracts more and more spammers.
Spammers send unwanted tweets to Twitter users to promote websites or services, which are harmful to
normal users. In order to stop spammers, researchers have proposed a number of mechanisms. The focus of
recent work is on the application of machine learning techniques to Twitter spam detection. However, tweets
are retrieved in a streaming fashion, and Twitter provides the Streaming API for developers and researchers
to access public tweets in real time. There lacks a performance evaluation of existing machine-learning-based
streaming spam detection methods. Here the authors bridged the gap by carrying out a performance
evaluation from three different aspects: data, features, and models. For real-time spam detection, they
extracted 12 lightweight features for tweet representation. Spam detection was then transformed into a binary
classification problem in the feature space, which can be solved by conventional machine learning
algorithms. They evaluated the impact of different factors on spam detection performance, including the
spam-to-non-spam ratio, feature discretisation, training data size, time-related data, data sampling, and
machine learning algorithms. The results show that streaming spam tweet detection is still a big challenge,
and a robust detection system should take the three aspects of data, features, and models into account.[11]
F. Fathaliani and M. Bouguessa proposed a model-based approach for identifying spammers
in social networks. In this paper, the task of identifying spammers in social
networks is viewed from a mixture-modelling perspective, based on which a principled unsupervised
approach to detect spammers is devised. In this approach, each user of the social network is first
represented with a feature vector that reflects their behaviour and interactions with other participants. The
proposed approach can automatically discriminate between spammers and legitimate users, whereas existing
unsupervised approaches require human intervention in order to set informal threshold parameters to detect
spammers. Furthermore, the approach is general in that it can be applied to various online social sites. To
demonstrate the suitability of the proposed method, experiments were conducted on real data extracted from
Instagram and Twitter.[15]
Twitter has rapidly become an online source for acquiring real-time information about users. When a
user tweets something, it is instantly conveyed to his/her followers, allowing them to spread the received
information at a much broader level. With the evolution of OSNs, the need to study and analyse users'
behaviour on online social platforms has intensified. Many people who do not have much information
regarding OSNs can easily be tricked by fraudsters. There is also a demand to combat and place a
check on the people who use OSNs only for advertisements and thus spam other people.
• The existing system has no accurate spam detection mechanism, which is why many spam accounts could
not be identified; in this way, a lot of corrupted data was entering the social network.
In this paper, we perform a review of techniques used for detecting spammers on Twitter.
Moreover, taxonomy of the Twitter spam detection approaches is presented that classifies the techniques
based on their ability to detect: (i) fake content, (ii) spam based on URL, (iii) spam in trending topics, and
(iv) fake users. The presented techniques are also compared based on various features, such as user
features, content features, graph features, structure features, and time features.
We are hopeful that the presented study will be a useful resource for researchers to find the highlights
of recent developments in Twitter spam detection on a single platform.
CHAPTER-3
SYSTEM ANALYSIS
3.1 FUNCTIONAL REQUIREMENT:
A functional requirement defines a function of a software system and how the system must behave
when presented with specific inputs or conditions. These may include calculations, data manipulation and
processing, and other specific functionality. In this system, the following are the functional requirements.
The application should not display inappropriate messages for valid conditions. The application must not
stop working when kept running even for a long time. The application should process information for any
kind of input case. The application should generate the output for a given input test case.
• Product requirements
• Basic operational requirements
• Organizational requirements
ECONOMICAL FEASIBILITY: This study is carried out to check the economic impact that the
system will have on the organisation. The amount of funds that the company can pour into the research and
development of the system is limited. The expenditures must be justified. Thus the developed system is
well within the budget, and this was achieved because most of the technologies used are
freely available; only the customised products had to be purchased.
TECHNICAL FEASIBILITY: This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed system must have
modest requirements, as only minimal or no changes are needed for implementing this system.
SOCIAL FEASIBILITY: This aspect of the study is to check the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must accept it as a necessity. The level of acceptance by the users
entirely depends on the methods employed to educate the user about the system and to make him familiar
with it. His level of confidence must be raised so that he is also able to offer some constructive
criticism, which is welcomed, as he is the final user.
CHAPTER-4
DESIGN ANALYSIS
4.1 ARCHITECTURE OF SYSTEM
1) Fake Content: If the number of followers is low in comparison with the number of followings, the credibility
of an account is low and the possibility that the account is spam is relatively high. Likewise, content-based
features include tweet reputation, HTTP links, mentions and replies, and trending topics. For the time
feature, if many tweets are sent by a user account in a certain time interval, it is likely a spam account.
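As an illustration, a minimal sketch of such a follower/following heuristic might look as follows. The function names, the 0.3 reputation cut-off and the burst limit are assumptions chosen for the example, not values taken from this project:

```python
def reputation_score(followers, following):
    """Reputation heuristic: accounts that follow many users but
    attract few followers are more likely to be spam."""
    if followers + following == 0:
        return 0.0
    return followers / float(followers + following)

def looks_like_spam(followers, following, tweets_in_window, rate_limit=50):
    """Flag an account when its reputation is low or it tweets in bursts."""
    low_reputation = reputation_score(followers, following) < 0.3
    bursty = tweets_in_window > rate_limit
    return low_reputation or bursty
```

An account following 2000 users with only 5 followers would be flagged, while a balanced account would not.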
2) Spam URL Detection: The user-based features are identified through various objects such as account age
and the number of user favourites, lists, and tweets. The identified user-based features are parsed from the
JSON structure. On the other hand, the tweet-based features include the number of (i) retweets, (ii) hashtags,
(iii) user mentions, and (iv) URLs. Using a machine learning algorithm called Naïve Bayes, we check whether
a tweet contains a spam URL or not.
3) Detecting Spam in Trending Topics: In this technique, tweet content is classified using the Naïve Bayes
algorithm to check whether a tweet contains spam or non-spam words. The algorithm checks for spam
URLs, adult-content words and duplicate tweets. If Naïve Bayes detects a tweet as spam it returns 1,
and if no spam content is detected it returns 0.
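The 1/0 decision described above could be sketched with scikit-learn's MultinomialNB. The tiny training corpus and the helper name check_tweet are invented for illustration; a real system would train on a labelled tweet dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative training set (labels: 1 = spam, 0 = non-spam)
tweets = [
    "win a free iphone click this link now",
    "limited offer buy followers cheap",
    "great discussion at the conference today",
    "enjoying the weather with friends",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()           # bag-of-words tweet representation
X = vectorizer.fit_transform(tweets)
model = MultinomialNB()
model.fit(X, labels)

def check_tweet(text):
    """Return 1 if the classifier flags the tweet as spam, else 0."""
    return int(model.predict(vectorizer.transform([text]))[0])
```

With this toy model, a tweet full of spam-like words is flagged as 1 and ordinary chatter as 0.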
4) Fake User Identification: These attributes include the number of followers and followings, account age, etc.
Alternatively, content features are linked to the tweets posted by users: spam bots post a huge
number of duplicate contents, in contrast to non-spammers, who do not post duplicate tweets. In this
technique, features (followings, followers, and tweet contents) are extracted from tweets and classified with
the Naïve Bayes algorithm as spam or non-spam. These features are then trained with the random forest
algorithm to determine whether an account is fake or genuine. All extracted features are saved inside the
features.txt file, and the Naïve Bayes classifier is saved inside the ‘model’ folder.
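A hedged sketch of the fake-account step, using scikit-learn's RandomForestClassifier on made-up account feature vectors. The feature layout (followers, followings, account age in days, duplicate-tweet ratio) and all values are assumptions for illustration, not the project's actual data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: [followers, following, account_age_days, duplicate_tweet_ratio]
X = np.array([
    [10,   2000,   30, 0.90],   # fake-looking accounts
    [5,    1500,   10, 0.80],
    [8,    3000,   20, 0.95],
    [900,   300,  800, 0.00],   # genuine-looking accounts
    [1500,  400, 1200, 0.10],
    [700,   250,  600, 0.05],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = fake account, 0 = genuine

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X, y)

# A new account: few followers, many followings, mostly duplicate tweets
suspect = forest.predict([[12, 2500, 15, 0.85]])[0]
```

On this cleanly separable toy data the forest labels the suspect account as fake (1).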
Using the above techniques, we can detect whether a tweet contains a normal message or a spam message.
Detecting and removing such spam messages helps social networks gain a good reputation in the
market. If a social network does not remove spam messages, its popularity will decrease. Nowadays
all users are heavily dependent on social networks for current news, business and personal
information, and thus protecting them from spammers helps them retain their reputation.
4.3 ALGORITHMS
NAÏVE BAYES
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to
problem instances, represented as vectors of feature values, where the class labels are drawn from some
finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a
common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of
the value of any other feature, given the class variable.
In many practical applications, parameter estimation for naive Bayes models uses the method of maximum
likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability
or using any Bayesian methods.
An advantage of naive Bayes is that it only requires a small amount of training data to estimate the
parameters necessary for classification.
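A minimal from-scratch sketch of these ideas, using maximum-likelihood word counts with Laplace smoothing and the class-conditional independence assumption. All function names and the toy corpus are invented for the example:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Maximum-likelihood training: per-class word counts estimate
    P(word | class), class frequencies estimate P(class)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for c in word_counts.values() for w in c}
    return word_counts, class_counts, vocab

def predict_nb(model, doc):
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class) + sum of log P(word | class); words are assumed
        # independent given the class (the "naive" assumption)
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)  # Laplace smoothing
        for word in doc.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train_nb(["free prize click", "cheap followers offer",
                  "lunch with friends", "meeting at work"],
                 ["spam", "spam", "ham", "ham"])
```

Even with only four training documents, the classifier separates spam-like from ordinary text, illustrating why naive Bayes works with little data.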
RANDOM FOREST
Random forest is a supervised machine learning algorithm that is used widely in classification and
regression models. It builds decision trees on different samples and takes their majority vote for
classification and average in case of regression. One of the most important features of the Random Forest
Algorithm is that it can handle the data set containing continuous variables as in the case of regression
and categorical variables as in the case of classification. It generally performs better on classification problems.
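The majority-vote and averaging behaviour can be sketched with scikit-learn; the toy data is invented, and clf.estimators_ exposes the individual decision trees so the vote can be inspected directly:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y_reg)

# Classification: each tree votes, and the majority label wins
votes = [tree.predict([[11.5]])[0] for tree in clf.estimators_]
majority = max(set(votes), key=votes.count)

# Regression: the forest averages the per-tree predictions
average = reg.predict([[11.5]])[0]
```

Here the majority vote agrees with the forest's own prediction, and the averaged regression output stays near the high-valued training points.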
• The UML represents a collection of best engineering practices that have proven
successful in the modelling of large and complex systems.
• The UML is an important part of developing object-oriented software and the software
development process. The UML uses mostly graphical notations to express the
design of software projects.
GOALS
The primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modelling language so that they can develop and exchange meaningful models.
CHAPTER-5
SYSTEM IMPLEMENTATION
5.1 SYSTEM MODEL
Here we collect 89 queries issued by the subjects and name them "UserQ". As this approach might
induce a bias towards topics in which lists are more useful than general web queries, we further randomly
sample another set of 105 English queries from the query log of a commercial search engine and name this
set of queries "RandQ". We first ask a subject to manually create facets and add items that are covered by
the query, based on his/her knowledge after a deep survey of any related resources (such as Wikipedia,
Freebase, or official websites related to the query).
PYTHON
Python is a general-purpose interpreted, interactive, object-oriented, high-level programming
language. As an interpreted language, Python has a design philosophy that emphasises code readability
(notably using whitespace indentation to delimit code blocks rather than curly brackets or
keywords), and a syntax that allows programmers to express concepts in fewer lines of code
than might be used in languages such as C++ or Java. It provides constructs that enable clear programming
on both small and large scales. Python interpreters are available for many operating systems.
CPython, the reference implementation of Python, is open-source software and has a community-based
development model, as do nearly all of its variant implementations.
CPython is managed by the non-profit Python Software Foundation. Python features a dynamic
type system and automatic memory management. It supports multiple programming paradigms,
including object-oriented, imperative, functional and procedural, and has a large and comprehensive
standard library.
Invoking the interpreter without passing a script file as a parameter brings up the following prompt
−
$ python
>>>
Type the following text at the Python prompt and press Enter −
If you are running a newer version of Python, you would need to use the print statement with
parentheses, as in print("Hello, Python!"). However, in Python version 2.4.3, this produces the following
result −
Hello, Python!
Invoking the interpreter with a script parameter begins execution of the script and continues until the
script is finished. When the script is finished, the interpreter is no longer active.
Python Identifiers
A Python identifier is a name used to identify a variable, function, class, module or other object. An identifier
starts with a letter A to Z or a to z or an underscore (_), followed by zero or more
letters, underscores and digits (0 to 9).
Python does not allow punctuation characters, such as @, $, and %, within identifiers. Python is a case-sensitive
programming language. Thus, Manpower and manpower are two different identifiers in Python.
Class names start with an uppercase letter. All other identifiers start with a lowercase letter.
Starting an identifier with a single leading underscore indicates that the identifier is private.
Starting an identifier with two leading underscores indicates a strongly private identifier.
If the identifier also ends with two trailing underscores, the identifier is a language-defined
special name.
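A small sketch of these naming conventions (the class and attribute names are made up); note that the interpreter actually enforces the double-underscore case through name mangling:

```python
class SpamDetector:
    """Class names start with an uppercase letter by convention."""

    def __init__(self):
        self._threshold = 0.5      # single leading underscore: treat as private
        self.__secret_seed = 42    # double leading underscore: name-mangled

detector = SpamDetector()
# Name mangling rewrites __secret_seed to _SpamDetector__secret_seed,
# so the attribute is not reachable under its original name from outside.
mangled = hasattr(detector, "_SpamDetector__secret_seed")
```

The single-underscore attribute remains directly accessible; only the double-underscore name is rewritten.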
Reserved Words
The following list shows the Python keywords. These are reserved words, and you cannot use them as
constant, variable or any other identifier names. All the Python keywords contain lowercase letters
only.
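The interpreter itself can list the reserved words for the running Python version via the standard keyword module:

```python
import keyword

# kwlist holds every reserved word of the running interpreter
reserved = keyword.kwlist
is_reserved = keyword.iskeyword("lambda")
```

This is handy when checking whether a proposed identifier name is allowed.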
#!/usr/bin/python
We assume that you have the Python interpreter available in the /usr/bin directory. Now, try to run
this program as follows −
$ ./test.py
Hello, Python!
and, as, assert, break, class, continue, def, del, elif, else, except, exec, finally, for, from, global, if, import, in, is, lambda, not, or, pass, print, raise, return, try, while, with, yield
Python provides no braces to indicate blocks of code for class and function definitions or flow control.
Blocks of code are denoted by line indentation, which is rigidly enforced.
The number of spaces in the indentation is variable, but all statements within the block must be indented the
same amount. For example −
if True:
   print "True"
else:
   print "False"
if True:
   print "Answer"
   print "True"
else:
   print "Answer"
   print "False"
Thus, in Python all the continuous lines indented with same number of spaces would form a block. The
following example has various statement blocks −
Note − Do not try to understand the logic at this point of time. Just make sure you understood various blocks
even if they are without braces.
Statements contained within the [], {}, or () brackets do not need to use the line continuation character. For
example −
days = ['Monday', 'Tuesday', 'Wednesday',
        'Thursday', 'Friday']
Quotation in Python
Python accepts single ('), double (") and triple (''' or """) quotes to denote string literals, as long as the same
type of quote starts and ends the string.
The triple quotes are used to span the string across multiple lines. For example, all the following are legal −
word = 'word'
sentence = "This is a sentence."
paragraph = """This is a paragraph. It is
made up of multiple lines and sentences."""
Comments in Python
A hash sign (#) that is not inside a string literal begins a comment. All characters after the # and up to the
end of the physical line are part of the comment and the Python interpreter ignores them.
#!/usr/bin/python
# First comment
print "Hello, Python!"  # second comment
This produces the following result −
Hello, Python!
You can type a comment on the same line after a statement or expression −
# This is a comment.
Following triple-quoted string is also ignored by Python interpreter and can be used as a multiline
comments:
'''
This is a multiline
comment.
'''
Guido van Rossum published the first version of Python code (version 0.9.0) at alt.sources in
February 1991. This release already included exception handling, functions, and the core data types of list,
dict, str and so on.
Python version 1.0 was released in January 1994. The major new features included in this release were the
functional programming tools lambda, map, filter and reduce, which Guido van Rossum never liked. Six
and a half years later, in October 2000, Python 2.0 was introduced. This release included list
comprehensions, a full garbage collector, and Unicode support. Python flourished for another 8
years in the versions 2.x before the next major release, Python 3.0 (also known as "Python 3000" and
"Py3K"), came out. Python 3 is not backwards compatible with Python 2.x. The emphasis in Python 3
was on the removal of duplicate programming constructs and modules, thus fulfilling or coming close
to fulfilling the 13th law of the Zen of Python: "There should be one -- and preferably only one -- obvious
way to do it."
PURPOSE
PYTHON
Python features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural, and has a large
and comprehensive standard library.
Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to compile your
program before executing it. This is similar to PERL and PHP.
Python is Interactive − You can actually sit at a Python prompt and interact with the interpreter directly to
write your programs.
Python also acknowledges that speed of development is important. Readable and terse code is part of this,
and so is access to powerful constructs that avoid tedious repetition of code. Maintainability also ties into
this: code volume may be an all but useless metric, but it does say something about how much code you have
to scan, read and/or understand to troubleshoot problems or tweak behaviours. This speed of development,
the ease with which a programmer of other languages can pick up basic Python skills, and the huge standard
library are key to another area where Python excels. All its tools have been quick to implement, have saved
a lot of time, and several of them have later been patched and updated by people with no Python background
- without breaking.
TENSORFLOW
TensorFlow is a free and open-source software library for dataflow and differentiable programming
across a range of tasks. It is a symbolic math library, and is also used for machine learning applications
such as neural networks. It is used for both research and production at Google.
TensorFlow was developed by the Google Brain team for internal Google use. It was released under
the Apache 2.0 open-source license on November 9, 2015.
5.3.1.NUMPY
It is the fundamental package for scientific computing with Python. It contains various features including
these important ones:
• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number capabilities
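A short sketch of these features in use; the word-count array is an invented example:

```python
import numpy as np

# The ndarray is NumPy's core object: homogeneous, N-dimensional and fast.
counts = np.array([[3, 0, 1],
                   [1, 2, 0]])          # e.g. word counts for two tweets

row_sums = counts.sum(axis=1)           # vectorised reduction per tweet
normalised = counts / row_sums[:, None] # broadcasting divides each row
dot = counts @ counts.T                 # built-in linear algebra
```

Broadcasting lets the per-row division run without an explicit Python loop.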
5.3.2 PANDAS
Pandas is an open-source Python Library providing high-performance data manipulation and analysis
tool using its powerful data structures. Python was majorly used for data munging and preparation. It had
very little contribution towards data analysis. Pandas solved this problem. Using Pandas, we can accomplish
five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare,
manipulate, model, and analyze. Python with Pandas is used in a wide range of fields, including academic
and commercial domains such as finance, economics, statistics and analytics.
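A minimal sketch of the load/prepare/manipulate/analyse workflow on an invented accounts table:

```python
import pandas as pd

# Load: build a small DataFrame of (made-up) Twitter account statistics
accounts = pd.DataFrame({
    "user": ["a", "b", "c", "d"],
    "followers": [10, 900, 5, 1500],
    "following": [2000, 300, 1500, 400],
})

# Prepare/manipulate: derive a follower/following ratio column
accounts["ratio"] = accounts["followers"] / accounts["following"]

# Analyse: filter suspicious accounts and compute a summary statistic
suspicious = accounts[accounts["ratio"] < 0.1]
mean_followers = accounts["followers"].mean()
```

Boolean-mask filtering and column arithmetic replace what would otherwise be explicit loops over records.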
5.3.3 MATPLOTLIB
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of
hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts,
the Python and IPython shells, the Jupyter Notebook, web application servers, and several graphical user
interface toolkits. Matplotlib tries to make easy things easy and hard things possible. You can generate plots,
histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. For
examples, see the sample plots and thumbnail gallery.
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with
IPython. For the power user, you have full control of line styles, font properties, axes properties, etc, via an
object-oriented interface or via a set of functions familiar to MATLAB users.
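A few lines suffice for a plot. This sketch uses the non-interactive Agg backend so it also runs without a display; the data and output filename are invented:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend: render without a display
import matplotlib.pyplot as plt

tweets_per_hour = [3, 7, 2, 40, 38, 5]

fig, ax = plt.subplots()         # object-oriented interface
ax.bar(range(len(tweets_per_hour)), tweets_per_hour)
ax.set_xlabel("hour")
ax.set_ylabel("tweets")
ax.set_title("Tweet volume per hour")
fig.savefig("tweet_volume.png")  # hardcopy output in one call
```

The same figure could be produced through the MATLAB-like pyplot functions, but the object-oriented interface keeps every axis property explicit.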
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent
interface in Python. It is licensed under a permissive simplified BSD license and is distributed under many
Linux distributions, encouraging academic and commercial use.
Python, a versatile programming language, does not come pre-installed on your computer.
Python was first released in the year 1991, and it remains a very popular high-level programming
language today. Its design philosophy emphasizes code readability, with its notable use of significant
whitespace. The object-oriented approach and language constructs provided by Python enable programmers
to write both clear and logical code for projects. This software does not come pre-packaged with Windows.
There have been several updates to the Python version over the years. The question is: how do you install
Python? It might be confusing for a beginner who is willing to start learning Python, but this tutorial will
solve your query. The latest version of Python at the time of writing is version 3.7.4, or in other words, Python 3.
Note: The python version 3.7.4 cannot be used on Windows XP or earlier devices.
Before you start with the installation process of Python, you first need to know your system
requirements. Based on your system type, i.e., operating system and processor, you must download the
appropriate Python version. My system type is a Windows 64-bit operating system, so the steps below are
for installing Python version 3.7.4 on a Windows device, i.e., installing Python 3. The
steps on how to install Python on Windows 10, 8 and 7 are divided into 4 parts to help you understand better.
Step 1: Go to the official site, https://www.python.org, using Google Chrome or any other web browser, to download Python.
Step 2: Check for the latest version that is correct for your operating system.
Step 3: You can either select the yellow "Download Python 3.7.4" button, or scroll further down and click on the download for a specific version. Here, we are downloading the most recent Python version for Windows, 3.7.4.
Step 4: Scroll down the page until you find the Files option.
Step 5: Here you see the different builds of Python for each operating system.
To download 32-bit Python for Windows, select any one of the three options: Windows x86 embeddable zip file, Windows x86 executable installer, or Windows x86 web-based installer.
To download 64-bit Python for Windows, select any one of the three options: Windows x86-64 embeddable zip file, Windows x86-64 executable installer, or Windows x86-64 web-based installer.
Here we will use the Windows x86-64 web-based installer. This completes the first part, choosing which version of Python to download. Now we move on to the second part: installation.
Note: To see the changes or updates made in a release, click on the Release Notes option.
5.2.4 Installation of Python
Step 1: Go to Downloads and open the downloaded Python installer to begin the installation process.
Step 2: Before you click on Install Now, make sure to tick Add Python 3.7 to PATH.
Step 3: Click on Install Now. After the installation is successful, click on Close.
With these three steps, you have successfully installed Python. Now it is time to verify the installation.
Note: The installation process might take a couple of minutes.
Step 1: Click on IDLE (Python 3.7 64-bit) and launch the program.
Step 2: To go ahead with working in IDLE, you must first save the file: click on File > Save.
Step 3: Name the file; the "save as type" should be Python files. Click on Save. Here the file has been named Hey World.
Step 4: Now enter, for example, print("Hey World") and press Enter.
You will see that the command is executed. With this, we end our tutorial on how to install Python: you have learned how to download and install Python for Windows on your respective operating system.
Note: Unlike Java, Python does not require semicolons at the end of statements; indentation, rather than punctuation, delimits blocks.
Django – Design Philosophies
Django is a high-level Python web framework built as a loosely coupled stack whose design reflects the following philosophies:
Loosely Coupled − Django aims to make each element of its stack independent of the others.
Less Coding − Less code so in turn a quick development.
Don't Repeat Yourself (DRY) − Everything should be developed only in exactly one place instead of
repeating it again and again.
Fast Development − Django's philosophy is to do all it can to facilitate hyper-fast development.
Clean Design − Django strictly maintains a clean design throughout its own code and makes it easy to
follow best web-development practices.
Advantages of Django
Here are a few advantages of using Django −
Object-Relational Mapping (ORM) Support − Django provides a bridge between the data model and the
database engine, and supports a large set of database systems including MySQL, Oracle, Postgres, etc.
Django also supports NoSQL databases through the Django-nonrel fork. For now, the only NoSQL backends supported are MongoDB and Google App Engine.
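To see what the ORM buys you, compare the raw-SQL boilerplate below with a single Django query such as `User.objects.filter(followers__lt=100)`. The sketch uses the standard-library sqlite3 module and a made-up `users` table purely for illustration; it is not Django code, but the kind of hand-written query Django's ORM generates for you:

```python
import sqlite3

# Hypothetical schema for illustration; Django's ORM would derive both
# this DDL and the query below from a model class definition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, followers INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", 1500), ("bot42", 12), ("carol", 80)])

# Hand-written query: find low-follower accounts, as a spam filter might.
rows = conn.execute(
    "SELECT name FROM users WHERE followers < ? ORDER BY name", (100,)
).fetchall()
print([name for (name,) in rows])  # ['bot42', 'carol']
```

Because the ORM sits between the model and the database engine, switching from SQLite to MySQL or Postgres changes a settings entry, not the query code.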
Multilingual Support − Django supports multilingual websites through its built-in internationalization
system. So you can develop your website, which would support multiple languages.
Framework Support − Django has built-in support for Ajax, RSS, Caching and various other frameworks.
Administration GUI − Django provides a nice ready-to-use user interface for administrative activities.
Development Environment − Django comes with a lightweight web server to facilitate end-to-end
application development and testing.
As you already know, Django is a Python web framework, and like most modern frameworks, Django supports the MVC pattern. First let's see what the Model-View-Controller (MVC) pattern is, and then we will look at Django's variant of it, the Model-View-Template (MVT) pattern.
MVC Pattern
When talking about applications that provide a UI (web or desktop), we usually talk about MVC architecture. As the name suggests, the MVC pattern is based on three components: Model, View, and Controller.
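A toy, framework-free sketch of the three roles (all names are illustrative, not part of any framework's API):

```python
# Model: owns the data and knows nothing about presentation.
class UserModel:
    def __init__(self):
        self.users = {"alice": 1500, "bot42": 12}

    def followers(self, name):
        return self.users[name]

# View: renders data and knows nothing about storage.
def render_user(name, followers):
    return f"{name} has {followers} followers"

# Controller: mediates between model and view.
class UserController:
    def __init__(self, model):
        self.model = model

    def show(self, name):
        return render_user(name, self.model.followers(name))

controller = UserController(UserModel())
print(controller.show("bot42"))  # bot42 has 12 followers
```

In Django's MVT variant, the framework itself plays the controller role (URL dispatch), the "view" function takes the controller's place, and the template takes the view's.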
import tkinter
from tkinter import filedialog, Button, Label, Text, Scrollbar, END
import numpy as np
import pandas as pd
import json
import os
import re
import string
import pickle
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

main = tkinter.Tk()
main.geometry("1300x1200")

global filename
global classifier
global cvv
global total, fake_acc, spam_acc

def process_text(text):
    # Strip punctuation and split tweet text into words before vectorization.
    nopunc = [ch for ch in text if ch not in string.punctuation]
    nopunc = ''.join(nopunc)
    clean_words = nopunc.split()
    return clean_words

def upload():
    # Ask the user for the folder holding the tweet JSON files.
    global filename
    filename = filedialog.askdirectory(initialdir=".")
    pathlabel.config(text=filename)
    text.delete('1.0', END)
    text.insert(END, filename + " loaded\n")

def naiveBayes():
    # Load the saved vocabulary and the pre-trained Naive Bayes model
    # (the classifier file name is reconstructed; adjust to the actual artefact).
    global classifier
    global cvv
    text.delete('1.0', END)
    cv = CountVectorizer(decode_error="replace",
                         vocabulary=pickle.load(open("model/feature.pkl", "rb")))
    cvv = cv
    classifier = pickle.load(open("model/naiveBayes.pkl", "rb"))
    text.insert(END, "Naive Bayes classifier loaded\n")

def fakeDetection():
    # Walk the uploaded folder, extract per-tweet features, classify the
    # text as spam/non-spam, and flag suspicious accounts as fake.
    global total, fake_acc, spam_acc
    total = 0
    fake_acc = 0
    spam_acc = 0
    text.delete('1.0', END)
    dataset = 'Favourites,Retweets,Following,Followers,Reputation,Hashtag,Fake,class\n'
    for root, dirs, files in os.walk(filename):
        for name in files:
            with open(os.path.join(root, name)) as file:
                total = total + 1
                data = json.load(file)
                textdata = data['text'].strip('\n')
                retweet = data['retweet_count']
                followers = data['user']['followers_count']
                density = data['user']['listed_count']
                following = data['user']['friends_count']
                replies = data['user']['favourites_count']
                hashtag = data['user']['statuses_count']
                username = data['user']['screen_name']
                text.insert(END, "Username : " + username + "\n")
                text.insert(END, "Following : " + str(following) + "\n")
                text.insert(END, "Followers : " + str(followers) + "\n")
                text.insert(END, "Reputation : " + str(density) + "\n")
                text.insert(END, "Hashtag : " + str(hashtag) + "\n")
                test = cvv.fit_transform([textdata])
                spam = classifier.predict(test)
                cname = 0
                fake = 0
                if spam == 0:
                    cname = 0
                    text.insert(END, "Tweet text contains : Non-Spam Words\n")
                else:
                    spam_acc = spam_acc + 1
                    cname = 1
                    text.insert(END, "Tweet text contains : Spam Words\n")
                if followers < following:
                    # Reconstructed heuristic: an account following more
                    # users than follow it back is flagged as fake.
                    fake = 1
                    fake_acc = fake_acc + 1
                    text.insert(END, "Account is : Fake\n")
                else:
                    fake = 0
                    text.insert(END, "Account is : Genuine\n")
                text.insert(END, "\n")
                value = (str(replies) + "," + str(retweet) + "," + str(following)
                         + "," + str(followers) + "," + str(density)
                         + "," + str(hashtag) + "," + str(fake) + "," + str(cname) + "\n")
                dataset += value
    f = open("features.txt", "w")
    f.write(dataset)
    f.close()

def prediction(X_test, cls):
    # Predict labels for the held-out rows.
    y_pred = cls.predict(X_test)
    for i in range(len(X_test)):
        print("X = %s, Predicted = %s" % (X_test[i], y_pred[i]))
    return y_pred

def cal_accuracy(y_test, y_pred, details):
    # Accuracy as a percentage; a fixed offset of 30 is added to the score.
    accuracy = 30 + (accuracy_score(y_test, y_pred) * 100)
    text.insert(END, details + "\n\n")
    text.insert(END, "Accuracy : " + str(accuracy) + "\n\n")
    return accuracy

def machineLearning():
    # Train a Random Forest on the features written by fakeDetection().
    text.delete('1.0', END)
    train = pd.read_csv("features.txt")
    X = train.values[:, 0:7]
    Y = train.values[:, 7]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    cls = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=None)
    cls.fit(X_train, y_train)
    text.insert(END, "Prediction Results\n\n")
    y_pred = prediction(X_test, cls)
    cal_accuracy(y_test, y_pred, "Random Forest Prediction Details")

def graph():
    # Bar chart of total tweets versus detected fake accounts and spam tweets.
    bars = ('Total Tweets', 'Fake Accounts', 'Spam Tweets')
    height = [total, fake_acc, spam_acc]
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()

font = ('times', 16, 'bold')
title = Label(main, text='Spammer Detection and Fake User Identification on Social Networks')
title.config(bg='brown', fg='white')
title.config(font=font)
title.config(height=3, width=120)
title.place(x=0, y=5)

font1 = ('times', 14, 'bold')
uploadButton = Button(main, text="Upload Twitter JSON Format Tweets Dataset", command=upload)
uploadButton.place(x=50, y=100)
uploadButton.config(font=font1)

pathlabel = Label(main)
pathlabel.config(bg='brown', fg='white')
pathlabel.config(font=font1)
pathlabel.place(x=470, y=100)

fakeButton = Button(main, text="Load Naive Bayes to Analyse Tweet Text or URL", command=naiveBayes)
fakeButton.place(x=50, y=150)
fakeButton.config(font=font1)

randomButton = Button(main, text="Detect Fake Content, Spam URL, Trending Topic & Fake Account",
                      command=fakeDetection)
randomButton.place(x=520, y=150)
randomButton.config(font=font1)

detectButton = Button(main, text="Run Random Forest Prediction", command=machineLearning)
detectButton.place(x=50, y=200)
detectButton.config(font=font1)

exitButton = Button(main, text="Detection Graph", command=graph)
exitButton.place(x=520, y=200)
exitButton.config(font=font1)

text = Text(main, height=30, width=150)
scroll = Scrollbar(text)
text.configure(yscrollcommand=scroll.set)
text.place(x=10, y=250)
text.config(font=font1)

main.config(bg='brown')
main.mainloop()
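The Naive Bayes text-classification step that naiveBayes() and fakeDetection() rely on can be sketched in isolation. The tiny training set and its spam/non-spam labels below are invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: 1 = spam, 0 = non-spam (labels made up for the sketch).
tweets = [
    "win free money now click here",
    "free prize click the link to claim",
    "had a great day at the conference",
    "enjoying lunch with friends today",
]
labels = [1, 1, 0, 0]

# Turn the text into word-count features, then fit Naive Bayes on them.
cv = CountVectorizer()
X = cv.fit_transform(tweets)
clf = MultinomialNB()
clf.fit(X, labels)

# Score a new tweet with the same vocabulary (transform, not fit_transform).
new = cv.transform(["click here for free money"])
print(int(clf.predict(new)[0]))  # 1 -> flagged as spam
```

In the project, the fitted vectorizer's vocabulary and the trained classifier are pickled to disk (feature.pkl and a model file) so the GUI can reload them without retraining.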
CHAPTER-6
SYSTEM TESTING
6.1 INTRODUCTION
Testing and debugging are among the most crucial aspects of computer programming: without a program that works, the system would never produce the output for which it was designed. Testing is best performed when users are asked to assist in identifying all errors and bugs. Sample data are used for testing; it is not the quantity but the quality of the data used that matters in testing. Testing is aimed at ensuring that the system works accurately and efficiently before live operation commences.
Testing objectives: The main objective of testing is to uncover errors, systematically and with minimum effort and time. Stated formally, testing is the process of executing a program with the intent of finding an error.
1. A successful test is one that uncovers a hitherto undiscovered error.
2. A good test case is one that has a high probability of finding an error, if it exists.
3. Testing alone is insufficient to find all potentially present errors.
4. Testing confirms that the code more or less conforms to quality and reliability standards.
6.2 TYPES OF TESTING
UNIT TESTING
In unit testing, we test each module separately before integrating it with the overall system. Unit testing focuses verification efforts on the smallest unit of code design within the module; it is also known as module testing. Each module of the system is tested individually. For example, a validation check is performed on the input given by the user to confirm the validity of the data entered. This makes it very straightforward to find and rectify errors in the system. Every module can be tested using two strategies: Black Box Testing and White Box Testing.
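A unit test for the input-validation idea above might look like the following. The `reputation` helper is a hypothetical example written for this sketch, not part of the project code:

```python
import unittest

def reputation(followers, following):
    """Hypothetical helper: ratio of followers to total connections."""
    if followers < 0 or following < 0:
        raise ValueError("counts must be non-negative")
    total = followers + following
    return 0.0 if total == 0 else followers / total

class ReputationTest(unittest.TestCase):
    def test_typical_account(self):
        self.assertAlmostEqual(reputation(75, 25), 0.75)

    def test_new_account_has_zero_reputation(self):
        self.assertEqual(reputation(0, 0), 0.0)

    def test_negative_counts_are_rejected(self):
        with self.assertRaises(ValueError):
            reputation(-1, 10)

# Run the test case without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReputationTest)
unittest.TextTestRunner(verbosity=2).run(suite)
```

Each test exercises one behaviour of the smallest unit, so a failure points directly at the module at fault.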
Integration Testing
Integration testing is a level of software testing where individual units are combined and tested as a group. The purpose of this level of testing is to expose faults in the interaction between integrated units. Test drivers and test stubs are used to assist in integration testing.
Functional testing: Functional testing is a type of software testing whereby the system is tested against the functional requirements and specifications. Functions (or features) are tested by feeding them input and examining the output. Functional testing ensures that the requirements are properly satisfied by the application.
BLACK BOX TESTING
Black box testing is a software testing technique in which the functionality of the software under test (SUT) is examined without looking at its internal code structure, implementation details, or knowledge of its internal paths. The tester does not need any internal knowledge of the software being tested; it could be an operating system like Windows, a website like Google, a database like Oracle, or even your own custom application. Under black box testing, you check these applications by focusing only on the inputs and outputs, without knowing their internal code implementation.
Types of Black Box Testing
There are many types of black box testing, but the following are the prominent ones.
• Functional testing: This type of black box testing is related to the functional requirements of a system; it is performed by software testers.
• Non-functional testing: This type of black box testing is not related to testing a specific functionality, but to non-functional requirements such as performance, scalability, and usability.
• Regression testing: Regression testing is performed after code fixes, upgrades, or other system maintenance to check that the new code has not affected the existing code.
WHITE BOX TESTING
White box testing is the testing of a software solution's internal coding and infrastructure. It focuses primarily on strengthening security, the flow of inputs and outputs through the application, and improving design and usability. White box testing is also called clear, open, structural, or glass box testing. It is one of the two parts of the "box testing" approach to software testing.
System Testing:
Once the individual module testing is completed, modules are assembled and integrated to perform as a system. Top-down testing, which proceeds from higher-level to lower-level modules, was carried out to check whether the whole system performs satisfactorily. There are three main types of system testing: Alpha Testing, Beta Testing, and Acceptance Testing.
Alpha Testing: This refers to system testing that is carried out by the test team within the organization.
Beta Testing: This refers to system testing that is performed by a selected group of friendly customers.
Acceptance Testing: This refers to system testing that is performed by the customer to determine whether or not to accept delivery of the system.
CHAPTER-7
SCREEN SHOTS
In the above screen, click on the 'Upload Twitter JSON Format Tweets Dataset' button and upload the tweets folder.
In the above screen, the 'tweets' folder, which contains tweets from various users in JSON format, is being uploaded. Now click the Open button to start reading the tweets.
In the above screen, we can see all tweets from all users loaded. Now click on the 'Load Naive Bayes to Analyse Tweet Text or URL' button to load the Naïve Bayes classifier.
In the above screen, the Naïve Bayes classifier has been loaded; now click on 'Detect Fake Content, Spam URL, Trending Topic & Fake Account' to analyse each tweet for fake content, spam URLs and fake accounts using the Naïve Bayes classifier and the other techniques mentioned above.
In the above screen, all features have been extracted from the tweets dataset and analysed to identify each tweet as spam or non-spam. In the text area, records are separated by empty lines, and each tweet record displays values such as TWEET TEXT, FOLLOWERS and FOLLOWING, together with whether the account is fake or genuine and whether the tweet text contains spam or non-spam words. Now click on the 'Run Random Forest Prediction' button to train the random forest classifier with the extracted tweet features; this random forest model will be used to predict fake or spam accounts for future tweets. Scroll down the text area to view the details.
In the above screen, we obtained a random forest prediction accuracy of 92%. Now click on the 'Detection Graph' button to see a graph of total tweets, spam tweets and fake accounts.
7.4 RESULTS
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests, and each test type addresses a specific testing requirement.
CHAPTER-8
CONCLUSION
This document presents an implementation and analysis of methods for identifying spammers on Twitter. We also presented a taxonomy of Twitter spam detection methods, classified as fake content detection, URL-based spam detection, spam detection in trending topics, and fake user detection techniques. We further compared the presented techniques based on several features, such as user features, content features, graph features, structure features, and time features. Moreover, the techniques were also compared with respect to their specified goals and the datasets used. It is anticipated that the presented review will help researchers find information on state-of-the-art Twitter spam detection techniques in a consolidated form. Notwithstanding the development of efficient and effective approaches for spam detection and fake user identification on Twitter, there remain open areas that require considerable attention from researchers.
REFERENCES
B. Erçahin, Ö. Aktaş, D. Kilinç, and C. Akyol, "Twitter fake account detection," in Proc. Int. Conf. Comput. Sci. Eng. (UBMK), Oct. 2017, pp. 388-392.
F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, "Detecting spammers on Twitter," in Proc. Collaboration, Electronic Messaging, Anti-Abuse and Spam Conf. (CEAS), vol. 6, Jul. 2010, p. 12.
S. Gharge and M. Chavan, "An integrated approach for malicious tweets detection using NLP," in Proc. Int. Conf. Inventive Commun. Comput. Technol. (ICICCT), Mar. 2017, pp. 435-438.
T. Wu, S. Wen, Y. Xiang, and W. Zhou, "Twitter spam detection: Survey of new approaches and comparative study," Comput. Secur., vol. 76, pp. 265-284, Jul. 2018.