Detection of Malicious Social Bots Using Learning Automata With URL Features in Twitter Network
CHAPTER 1
ABOUT THE ORGANIZATION
1.1 COMPANY PROFILE
Website: www.vidvek.com
Email: [email protected]
Affiliations: Karnataka, Telangana
They also provide training and projects in Embedded systems, Power systems,
Power Electronics, Electronic Drivers, Machines, DSP/DIP, VLSI, Data warehousing, .Net,
C#, Java/J2EE and Linux, and develop their own range of quality Embedded products.
Vidvek InfoTech has successfully trained thousands of students and
professionals. Its teaching philosophy strives to create in-depth knowledge about
the subject at hand. We believe that depth is an essential ingredient to achieve heights in
training and development.
The company strongly believes in customer success through cooperation, unity and service.
We specialize in providing services in Custom IC development, which includes
Analog/Mixed Signal Design, Foundation IP Development (Memory Compilers, IOs and
Standard Cells), IBIS and VerilogA Modelling. The team has extensive experience in Circuit
Design, Characterization and Layout up to 20nm/22nm process nodes.
TechKshetra Info Solutions Pvt. Ltd is an ISO 9001:2015 certified software development
firm with its corporate office in Bangalore and registered office at KEONICS IT Park,
Kalaburagi. TechKshetra Info Solutions Pvt. Ltd is a software development company
providing IT services and solutions, focusing on delivering beautiful, scalable and
high-quality products and apps globally. It specializes in product development, high-end
mobile apps and custom software development.
TANNA EDUCATION
Tanna Educational Services is located in the heart of the city of Rajkot, which is one
of the fastest-developing educational and industrial centers in Gujarat. They have another
branch in Santa Clara, CA, USA, which is a hub of worldwide semiconductor and software
development known as Silicon Valley.
It is a leading educational institute specializing in the fields of Embedded
Systems, VLSI and Database Management. It has highly trained staff
with many years of experience working in multinational companies. They offer a variety of
teaching and training programs to help engineers sharpen their skills and broaden their
knowledge.
INNOVETECH
Innovetech Pro Solutions trainers have published several books with world-class publishers
like PHI India Pvt. Ltd. Innovetech Pro Solutions trainers have been recognized as the best
trainers by top MNC companies like Wipro, TCS, Logical CMG, I flex, IGATE, L&T,
Satyam, etc. Innovetech Pro Solutions is involved in training and academic project programs
in more than 100 engineering colleges in Tamil Nadu and Pondicherry.
SOOXMA TECHNOLOGIES
KEONICS
TURNING POINT
Turning Point Computers is a leading training institute in North Karnataka, with
the mission of providing the best quality computer education to all classes of people. Turning
Point Computers is a franchisee of KEONICS (a Government of Karnataka enterprise) for
government certification and employment. Over the past few years the growth of the
computer industry has been quite remarkable, and today it is the fastest growing industry. Not
just students or housewives, but even experienced professionals are helped greatly by
upgrading themselves at Turning Point Computers. Our organization not only provides a
platform to build a bright professional career in the computer field but also provides
placement opportunities. As computer knowledge has become a primary requirement for
everyone, our institute provides the best quality computer education. Our motive is to make all
classes of people computer literate and to help them take every possible advantage to make
their future much brighter.
2. Software Services
3. Development Board
4. Academic Projects
In the manufacture of Microprocessor Trainers and Interface Boards. With the efforts
from our Research and Development Team, the company has expanded its activities in
various areas as follows. Microcontroller, Process Control Instrumentation, Digital Signal
Processing, Power Electronics & Drives, Data Acquisition Systems, Personal Computer
Trainer Systems, VLSI & Embedded Systems etc.
At present, the company is concentrating on various new fields like Advanced Control
Systems, Solar Heat Pump Trainers, Advanced Process Control, Chemical Reactors,
Distillation Columns, Image Processing, Nuclear Electronics, Defense projects, etc.
1.4.2 SOFTWARE SERVICES
They provide software services in the areas of software development, mobile app
development and digital marketing, as well as in Electronics and Communication, Computer
Science and Mechanical, with advanced tools as required by customer, college and industry needs.
A development board is a printed circuit board containing a controller and the minimal
support logic needed for an engineer to become acquainted with the microprocessor on the
board and to learn to program it. It also serves users of the microprocessor as a way to
prototype applications in products. Examples include microcontroller, FPGA and PCB boards.
This Institute offers a 24-week Advanced Course in Embedded Systems. The course is
designed to offer application-oriented training and real-time exposure to students, thereby
bridging the gap between industry's requirements and students' academic skill
set. By pursuing the Institute's Program in Embedded Systems, students gain ready
acceptance in the market.
EMBEDDED SYSTEM
design and web development are often used interchangeably, web design is technically a
subset of the broader category of web
MATLAB
VLSI
MECHATRONICS
CLOUD COMPUTING
DATAMINING:
BIG DATA
JAVA/.Net
Whereas .Net needs a very heavy framework to be installed, which also has higher
hardware requirements compared to Java. C# is the most popular language
of .Net and is used to create many kinds of programs, such as web applications.
ANDROID
NS2
CHAPTER 2
Digital signal processing (DSP) is the numerical manipulation of signals, usually with
the intention to measure, filter, produce or compress continuous analog signals. It is
characterized by the use of digital signals to represent these signals as discrete time, discrete
frequency, or other discrete domain signals in the form of a sequence of numbers or symbols
to permit the digital processing of these signals.
Digital signal processing and analog signal processing are subfields of signal
processing. DSP applications include audio and speech signal processing, sonar and radar
signal processing, sensor array processing, spectral estimation, statistical signal processing,
digital image processing, signal processing for communications, control of systems,
Dept. of CSE, Sharnbasva University Kalaburgi 11
biomedical signal processing, seismic data processing, among others. DSP algorithms have
long been run on standard computers, as well as on specialized processors called digital
signal processors, and on purpose-built hardware such as application-specific integrated
circuits (ASICs). Currently, there are additional technologies used for digital signal processing
including more powerful general purpose microprocessors, field-programmable gate arrays
(FPGAs), digital signal controllers (mostly for industrial applications such as motor control),
and stream processors, among others.
Digital signal processing can involve linear or nonlinear operations. Nonlinear signal
processing is closely related to nonlinear system identification and can be implemented in
the time, frequency, and spatio-temporal domains.
APPLICATIONS OF DSP
The main applications of DSP are audio signal processing, audio compression, digital
image processing, video compression, speech processing, speech recognition, digital
communications, radar, sonar, financial signal processing, seismology and biomedicine.
Specific examples are speech compression and transmission in digital mobile phones, room
correction of sound in hi-fi and sound reinforcement applications, weather forecasting,
economic forecasting, seismic data processing, analysis and control of industrial processes,
medical imaging such as CAT scans and MRI, MP3 compression, computer graphics, image
manipulation, hi-fi loudspeaker crossovers and equalization, and audio effects for use with
electric guitar amplifiers.
Digital image processing deals with the manipulation of digital images through a digital
computer. It is a subfield of signals and systems but focuses particularly on images. DIP
focuses on developing a computer system that is able to perform processing on an image. The
input of that system is a digital image; the system processes that image using efficient
algorithms and gives an image as output. The most common example is Adobe
Photoshop, one of the most widely used applications for processing digital images.
Some of the major fields in which digital image processing is widely used are listed
below:
Medical field
Remote sensing
Machine/Robot vision
Color processing
Pattern recognition
Video processing
Microscopic Imaging
A video signal is the term used to describe any sequence of time varying images. A
still image is a spatial distribution of intensities that remain constant with time while a time
varying image has a spatial intensity distribution that varies with time. Movies (films) and
television are both examples of video signals, as are the signals that drive computer monitor,
laptop and PDA displays. It is widely expected that video communications in particular will
be the next application driving the mobile and handheld device market. This course should
give you the tools to understand the components that are necessary for such systems to
operate effectively.
Motion tracking
HD videos
Video Editing
Surveillance cameras
video compression
Video coding
Modern embedded systems are often based on microcontrollers (i.e. CPUs with
integrated memory or peripheral interfaces) but ordinary microprocessors (using external
chips for memory and peripheral interface circuits) are also still common, especially in more
complex systems. In either case, the processor(s) used may be types ranging from general
purpose to those specialised in a certain class of computations, or even custom designed for the
application at hand. A common standard class of dedicated processors is the digital signal
processor (DSP).
Since the embedded system is dedicated to specific tasks, design engineers can
optimize it to reduce the size and cost of the product and increase the reliability and
performance. Some embedded systems are mass-produced, benefiting from economies of
scale.
Embedded systems range from portable devices such as digital watches and MP3
players, to large stationary installations like traffic lights, factory controllers, and largely
complex systems like hybrid vehicles, MRI, and avionics. Complexity varies from low, with
a single microcontroller chip, to very high with multiple units, peripherals and networks
mounted inside a large chassis or enclosure.
2.1.5 ROBOTICS
These technologies deal with automated machines that can take the place of humans
in dangerous environments or manufacturing processes, or resemble humans in appearance,
behavior, and/or cognition. Many of today's robots are inspired by nature contributing to the
field of bio-inspired robotics.
The concept of creating machines that can operate autonomously dates back to
classical times, but research into the functionality and potential uses of robots did not grow
substantially until the 20th century. Throughout history, it has been frequently assumed that
robots will one day be able to mimic human behavior and manage tasks in a human-like
fashion. Today, robotics is a rapidly growing field, as technological advances continue;
researching, designing, and building new robots serve various practical purposes, whether
domestically, commercially, or militarily. Many robots do jobs that are hazardous to people
such as defusing bombs, mines and exploring shipwrecks.
APPLICATIONS OF ROBOTICS
Space Robotics
Underwater Robotics
Electric Mobility
Agricultural Robotics
2.1.6 JAVA
limited set of functions they could perform. An electronic circuit might consist of a CPU,
ROM, RAM and other glue logic. VLSI lets IC designers add all of these into one chip.
APPLICATIONS OF JAVA
Poly silicon
SOFTWARE ADOPTED
Vivado Xilinx
MATLAB & Simulink
DSCH & Microwind
HARDWARE ADOPTED
DSP - TMS320C6416
Wireless Communication
Internet of Things
Embedded systems,
CHAPTER 3
TASK PERFORMED
3.1 INTRODUCTION
The Internet has made it very easy to obtain any kind of information from any source
across the world. The growing popularity of social sites allows users to collect an
abundance of information and data about other users. The huge volumes of data available on
these sites also draw the attention of fake users [1]. Twitter has quickly become
an online source for acquiring real-time information about users. Twitter is an Online Social
Network (OSN) where users can share anything and everything, such as news,
opinions, and even their moods. Arguments can be held over various
topics, such as politics, current affairs, and significant events. When
a user tweets something, it is instantly passed on to his/her followers, allowing them to
spread the received information at a much broader level [2]. With the evolution of OSNs, the
need to study and analyze users' behavior on online social platforms has intensified.
Many people who do not have much information about OSNs can easily be
deceived by fraudsters. There is also a need to combat and place
controls on people who use OSNs only for advertisements and thus spam others' accounts.
Recently, the detection of spam on social networking sites has attracted the attention
of researchers. Spam detection is a difficult task in maintaining the security of social
networks.
The majority of people use the internet and trust the content on it. Because
anyone can post a review, spammers have an opening to generate fake reviews
about products and services. Identifying these intruders and their fake content is a widely
debated research issue. Although a tremendous number of studies have been carried out so far,
the existing work still falls short in differentiating spam reviews, and none of it
gives a significant result for the collected feature types.
Due to the continuous growth of data size on platforms with large data such as social
media, the detection of fraudulent accounts on these platforms is becoming more difficult.
Although social media is preferred for communication purposes, it is becoming an
increasingly attractive target for spammers and fraudsters. A recommendation system can be
developed to provide better products to customers by analyzing the shares and
interactions of people on social media. But if the messages are not sent by real people, the
analysis will be wrong.
Survey on Spam Filtering Using Netspam Framework [4] This paper presents an
application of a new framework, called NetSpam, which uses spam features to model
product review data sets as heterogeneous information networks in order to design a spam
review detection method for such networks. Using the importance of spam features helps to
achieve better results on review data sets with respect to different metrics. The results
show that NetSpam outperforms the previous methods and that, among the four
categories of features (review-behavioral, user-behavioral, review-linguistic and
user-linguistic), the first category performs better than the others.
Detection of Fake Twitter Accounts with Multiple Classifier and Data Augmentation
Technique [5] In this study, malicious accounts are identified from the messages sent by
social media users in order to prevent dirty and false information from circulating on the
internet. For this purpose, a system has been developed to classify accounts as automated
or normal using intelligent techniques. The nearest neighbor, logistic regression and random
forest algorithms were used for the identification of counterfeit accounts. The classification
performances of these methods were compared, and SMOTE and majority voting techniques
were applied to the related algorithms to improve performance.
Machine Learning Based Twitter Spam Account Detection: A Review [6] Twitter is one
of the biggest microblogging platforms; more than half a billion tweets are
posted every day on average by millions of users. With such versatility and widespread
use, Twitter is easily intruded upon by malicious activities. Malicious activities
include malware intrusion, spam distribution, social attacks, etc. Spammers use social
engineering attack strategies to send spam tweets, spam URLs, etc. This has made Twitter an
ideal arena for the proliferation of anomalous spam accounts. The impact motivates researchers
to develop models that analyze, detect and recover from defamatory actions on Twitter.
The Twitter network is inundated with tens of millions of fake spam profiles, which may
jeopardize normal users' security and privacy. Improving real users' safety and
identifying spam profiles have become key parts of the research.
Trust-Aware Review Spam Detection [7] In this paper, we aim at providing an efficient and
effective method to identify review spammers by incorporating social relations based on two
assumptions that people are more likely to consider reviews from those connected with them
as trustworthy, and review spammers are less likely to maintain a large relationship network
with normal users. The contributions of this paper are two-fold: (1) We elaborate how social
relationships can be incorporated into review rating prediction and propose a trust-based
rating prediction model using proximity as trust weight, and (2) We design a trust-aware
detection model based on rating variance which iteratively calculates user-specific overall
trustworthiness scores as the indicator for spamicity. Experiments on the dataset collected
from Yelp.com show that the proposed trust-based prediction achieves a higher accuracy than
the standard CF method, and there exists a strong correlation between social relationships and
the overall trustworthiness scores.
TWEET SERVER
In this module, the admin has to log in using a valid user name and password. After a
successful login, the admin can perform operations such as viewing and authorizing users,
adding short URLs, listing all friend requests and responses, listing all user-posted tweets,
listing all tweets and re-tweets with comments, viewing all spammers' URLs, viewing
URL-shortening users and post details, finding all clicked shortened URLs and the
corresponding users, and viewing chart results.
Viewing and Authorizing Users - In this module, the admin views all users' details and
authorizes them for login permission. User details include user name, address, email ID and
mobile number.
Viewing all Friends Request and Response - In this module, the admin can see the history of
all friend requests and responses. Details include the requesting user's name and image, the
requested user's name and image, status and date.
List all User Posted Tweets - In this module, the admin can see all the tweets posted by the
users. Tweet details include tweet name, tweet image, tweet description, tweet uses,
date of post and posted user name.
View all Inference Attackers - In this module, the details of all inference attackers are
listed. The details consist of the comment that contains shortened URLs (like t.co, goo.gl,
bit.ly), the tweet name, and the date of attack.
View URL Shortening Users and Post Details - In this, the admin can see all URL-shortening
users and post details. This contains the number of times a particular user used
these URLs (t.co, goo.gl, bit.ly) while commenting on tweets.
View all Clicked Shortened URLs and Corresponding End Users - In this, the admin can
view all the users who clicked on these URLs (t.co, goo.gl, bit.ly) and the number of clicks.
Find Number of times Posted URL shortening users in Chart - In this, the admin can see
a graph showing the number of times a particular user used these shortened URLs
while tweeting or re-tweeting (posting their comment).
Find Number of times used URL shortening users in Chart - In this, the admin can see a
graph showing the number of times a particular shortened URL is used by the users
while tweeting or re-tweeting (posting their comment).
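The two chart views described above amount to counting shortened-URL occurrences per user and per shortening service. The following is a minimal sketch of that counting step, not the project's actual implementation: the function name `count_short_url_usage`, the input format (pairs of user name and comment text) and the regular expression are assumptions made for illustration. Only the three domains come from the text.

```python
import re
from collections import Counter

# Shortening services named in the report; the tuple can be extended.
SHORTENER_DOMAINS = ("t.co", "goo.gl", "bit.ly")

# Hypothetical pattern: optional scheme, one of the shortener domains, then a path.
# The lookbehind stops "t.co" from matching inside a longer host such as "visit.co".
SHORT_URL_RE = re.compile(
    r"(?<![\w.-])(?:https?://)?("
    + "|".join(re.escape(d) for d in SHORTENER_DOMAINS)
    + r")/\S+",
    re.IGNORECASE,
)


def count_short_url_usage(comments):
    """Count shortened URLs per user and per shortening domain.

    comments: iterable of (user_name, comment_text) pairs (assumed format).
    Returns two Counters that could feed the two charts.
    """
    per_user = Counter()
    per_domain = Counter()
    for user, text in comments:
        for match in SHORT_URL_RE.finditer(text):
            per_user[user] += 1
            per_domain[match.group(1).lower()] += 1
    return per_user, per_domain


sample_comments = [
    ("alice", "see http://bit.ly/abc and t.co/xyz"),
    ("bob", "no links here"),
    ("carol", "https://goo.gl/q1"),
    ("alice", "again bit.ly/zzz"),
]
per_user, per_domain = count_short_url_usage(sample_comments)
```

Here `per_user` would drive the per-user chart and `per_domain` the per-URL chart; users who never post a shortened URL simply do not appear in the counter.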
USER
This module supports n users. A user must register before
performing any operations. Once a user registers, their details are stored in the database.
After successful registration, the user has to log in using an authorized user name and
password. Once login is successful, the user can perform operations such as viewing their
profile details, searching for friends and sending friend requests, accepting friend requests,
viewing friends' details, posting their own tweets, finding friends' tweets and re-tweets,
listing user tweets and comments, and finding tweets posted by inference-attack users.
Viewing Profile Details, Search and Request Friends - In this module, the user can see their
own profile details, such as their address, email, mobile number and profile image. The user
can search for friends and can send or accept friend requests.
Post Tweets - In this, the user can post their own tweets by providing details such as tweet
image, tweet name, tweet description and tweet uses.
View Friends Tweets on Posts and Re-Tweet - In this, the user can see all of his/her friends'
tweets on posts and can re-tweet them by adding a comment of their own (if the comment
contains shortened URLs, that is, t.co, goo.gl or bit.ly, the user is treated as an inference
attacker).
View Inference Attack on User Posts (Tweets) - In this, the user can see all the inference
attackers who have posted shortened URLs in their comments on the user's posts.
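The rule stated above (a comment containing a t.co, goo.gl or bit.ly link marks its author as an inference attacker) can be sketched as a simple host check. This is an illustrative sketch under assumptions: the function name `contains_shortened_url` and the whitespace tokenization are hypothetical, and the project's own check may differ.

```python
from urllib.parse import urlparse

# Shortening hosts named in the report.
SHORTENER_HOSTS = {"t.co", "goo.gl", "bit.ly"}


def contains_shortened_url(comment: str) -> bool:
    """Return True if any whitespace-separated token points at a shortener host."""
    for token in comment.split():
        # Drop punctuation that commonly surrounds a link in running text.
        token = token.strip("()[]<>,.;!?\"'")
        # urlparse only fills in the host when a scheme or a leading '//' is present.
        if "://" not in token:
            token = "//" + token
        try:
            host = urlparse(token).hostname
        except ValueError:
            # Malformed tokens (for example unbalanced brackets) are not links.
            continue
        if host and host.lower() in SHORTENER_HOSTS:
            return True
    return False
```

Comparing the parsed host, rather than searching for the substring "t.co", avoids flagging longer URLs that merely contain a shortener name in their path.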
Dataset: The benefit of choosing words for the feature set based on their entropy score
is that it reduces uncertainty in the final predictions, because those
words differ markedly in frequency between spam and non-spam URLs.
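The report does not state how the entropy score is computed, so the following is a sketch under an assumed definition: for each word, the Shannon entropy of its occurrences split between spam and non-spam text. A word used equally in both classes scores 1 bit (uninformative), while a word seen in only one class scores 0 (highly informative). The function names are hypothetical.

```python
import math
from collections import Counter


def word_entropy(spam_count, ham_count):
    """Shannon entropy (in bits) of a word's spam/non-spam occurrence split."""
    total = spam_count + ham_count
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in (spam_count, ham_count):
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def rank_words_by_entropy(spam_texts, ham_texts):
    """Return (word, entropy) pairs, most class-specific words first."""
    spam_counts = Counter(w for t in spam_texts for w in t.lower().split())
    ham_counts = Counter(w for t in ham_texts for w in t.lower().split())
    vocabulary = set(spam_counts) | set(ham_counts)
    scores = {w: word_entropy(spam_counts[w], ham_counts[w]) for w in vocabulary}
    # Sort by entropy, then alphabetically so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))


ranked = rank_words_by_entropy(
    ["free prize click", "click free"],
    ["meeting notes", "free lunch meeting"],
)
```

In this toy data, "free" appears in both classes and therefore ends up last in the ranking, while words seen in only one class share the lowest score.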
Feature Selection: The main advantage of using the words present in the dataset is that it
reduces uncertainty in the final predictions, because those words
differ markedly in frequency between spam and ham comments containing URLs.
N-grams: N-grams are used to improve accuracy. A unigram model deals with single words, but
when two words occur together the complete meaning can change. Accuracy therefore
tends to improve when text is split into tokens of two or more words rather than
single words.
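As a small illustration of the point above, the sketch below splits a comment into word n-grams. The helper name `word_ngrams` and the lowercase, whitespace-based tokenization are assumptions for illustration.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams of `text`, in order of appearance."""
    tokens = text.lower().split()
    # A text with fewer than n tokens yields no n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


unigrams = word_ngrams("Not a spam link", 1)
bigrams = word_ngrams("Not a spam link", 2)
```

With unigrams the token "spam" is seen on its own, whereas the bigram "a spam" keeps more of the surrounding context, which is the effect the paragraph describes.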
Analyzer: “Whether the feature should be made of word or character n-grams. Option
'char_wb' creates character n-grams only from text inside word boundaries; n-grams at
the edges of words are padded with space.”
Vocabulary: “Either a Mapping (e.g., a dict) where keys are terms and values are
indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is
determined from the input documents. Indices in the mapping should not be repeated and
should not have any gap between 0 and the largest index.”
Binary: If True, all non-zero counts are set to 1. This is useful for discrete probabilistic
models that model binary events rather than integer counts.
Max_Features: If not None, build a vocabulary that only considers the top max_features
ordered by term frequency across the corpus. This parameter is ignored if vocabulary is
not None.
Model Building: After pre-processing, a model has to be constructed that captures the
features of the project in accordance with the labelled data, using a supervised
learning algorithm.
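The parameter descriptions above appear to correspond to the options of scikit-learn's text vectorizers. To show what `max_features`, `binary` and the vocabulary mapping do without depending on that library, here is a plain-Python sketch. The function names `build_vocabulary` and `vectorize` are hypothetical, and this imitates the described behaviour rather than reproducing the library's implementation.

```python
from collections import Counter


def build_vocabulary(documents, max_features=None):
    """Map each kept term to a column index, keeping the most frequent terms."""
    totals = Counter(w for doc in documents for w in doc.lower().split())
    # Highest corpus frequency first; ties broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    if max_features is not None:
        ranked = ranked[:max_features]
    # Indices are consecutive, so there is no gap between 0 and the largest index.
    return {term: index for index, term in enumerate(sorted(ranked))}


def vectorize(document, vocabulary, binary=False):
    """Turn one document into a row of term counts (or 0/1 flags if binary)."""
    counts = Counter(document.lower().split())
    row = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        value = counts[term]
        row[index] = (1 if value else 0) if binary else value
    return row


docs = ["free free prize", "free meeting", "meeting notes"]
small_vocab = build_vocabulary(docs, max_features=2)
count_row = vectorize("free free prize", small_vocab)
binary_row = vectorize("free free prize", small_vocab, binary=True)
```

With `max_features=2` only the two most frequent terms survive, and the binary row records presence rather than how many times "free" occurs.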
AdaBoost is a boosting algorithm that is widely adopted in practice. It combines
many weak classifiers into a single strong classifier. Its weak learners are
decision stumps, that is, decision trees with a single split. It then reweights the
dataset based on the level of difficulty: it puts more weight on the instances that are
trickier and more difficult, and less weight on the ones that are handled properly.
Each decision stump divides the data into two subsets using a calculated threshold value,
so every data point falls either above or below the threshold. It is only moderately
accurate on the dataset because it fails on values that are exceptions to the threshold.
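The reweighting idea described above can be sketched in a few lines of plain Python. This is a textbook-style illustration, not the project's code: labels are assumed to be +1 (spam) and -1 (ham), all function names are hypothetical, and the stump search is a brute-force scan over observed feature values.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """A decision stump: one feature, one threshold, one split."""
    return polarity if row[feature] > threshold else -polarity


def train_stump(X, y, weights):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                error = sum(
                    w for row, label, w in zip(X, y, weights)
                    if stump_predict(row, feature, threshold, polarity) != label
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def adaboost_fit(X, y, n_rounds=10):
    """Train an ensemble of weighted stumps; y must contain +1 / -1 labels."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, feature, threshold, polarity = train_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, feature, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose it.
        weights = [
            w * math.exp(-alpha * label * stump_predict(row, feature, threshold, polarity))
            for row, label, w in zip(X, y, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(row, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1


X_demo = [[1], [2], [3], [4], [5], [6]]
y_demo = [1, 1, -1, -1, 1, 1]
single = adaboost_fit(X_demo, y_demo, n_rounds=1)
boosted = adaboost_fit(X_demo, y_demo, n_rounds=3)
```

The toy labels cannot be separated by any single threshold, which matches the remark that one stump fails on exceptions. Combining three reweighted stumps is enough to fit this particular pattern.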
A decision tree is a series of true-or-false questions asked about the data, eventually
leading to a predicted class or continuous value. It tries to form nodes that contain a
high proportion of data points from a single class by finding the values of the
features that divide the data into classes.
It is a nonlinear model built from many linear boundaries. The model is given
both labels and features so that it learns to classify points based on the features. Due to
overfitting on the data, it is less accurate than the other algorithms. Random forest
combines a number of decision trees into a single model; here it was also less accurate than
the other algorithms. Logistic regression is used for the prediction of binomial or multinomial
values of a variable. It uses a statistical approach to find the outcome, which here is binary
in nature. It uses a logit function to predict the probability of occurrence of a binary
outcome. The outcome follows a Bernoulli distribution, so it is exactly one of two values, x or y.
Here it works on the dataset and predicts x or y, that is, spam or ham.
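To make the logistic regression step concrete, the sketch below fits the model with plain batch gradient descent and maps the predicted probability to spam or ham. It is an illustrative sketch under assumptions: the function names, learning rate, number of epochs and the single centred toy feature are hypothetical, and the project itself may rely on a library implementation.

```python
import math


def sigmoid(z):
    """Inverse of the logit function: maps a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(row, weights, bias):
    """Probability that `row` belongs to the positive (spam) class."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def fit_logistic(X, y, learning_rate=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            error = predict_proba(row, weights, bias) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_label(row, weights, bias, threshold=0.5):
    """Bernoulli outcome: exactly one of the two labels."""
    return "spam" if predict_proba(row, weights, bias) >= threshold else "ham"


# Toy data: one centred feature, label 1 = spam, 0 = ham.
X_toy = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_weights, toy_bias = fit_logistic(X_toy, y_toy)
toy_labels = [predict_label(row, toy_weights, toy_bias) for row in X_toy]
```

The 0.5 threshold is a common default; lowering it would flag more comments as spam at the cost of more false alarms.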
CHAPTER 4
SPECIFIC OUTCOME
4.1 RESULTS
My experience at the company was satisfactory; the people work in coordination and the
company environment is very safe and studious. I chose this company because
it offered an internship in wireless, which is my core specialization in my PG degree, and I
wanted to benefit from this experience. I also got to learn new tools like JAVA/J2EE (JSP,
SERVLET).
I used to spend nearly 5 to 6 hours daily at the company trying out different
circuits and making their layouts manually. I thank my guide, who was always by my side
throughout my internship, giving me advice, feedback and tips on how people
work in an industry environment.
On the whole, this internship was a useful experience. I have gained new knowledge and
skills and met many new people. Internships help us to learn more about ourselves. Through
an internship, we gain clarity on our strengths, weaknesses, and interests.
Internships increase our professional confidence and also improve our communication
skills. Through an internship, we get a chance to learn what it is really like to work in a
company, in an industry, and in various job functions. Internships help us to develop better
work habits, learn how to manage tasks and projects, and learn how to carry ourselves in a
professional environment. We can also learn from our colleagues by observing their positive
and negative work habits.
Interpersonal skills influence business cultures because they affect job performance,
which in turn helps to decide the outcome of a company's success. Interpersonal skills include
interaction with others, good communication skills, listening skills and attitude. Companies
should realize that interpersonal skills are not learned in a classroom; rather they are
characteristics that an individual may possess naturally.
thinking on their feet and manage diverse relationships both internally and externally.
Measuring a potential employee's ability to interact with others in a respectful and
appropriate manner determines how likely they are to thrive in a team-oriented environment.
Some of the major factors that make up a person's interpersonal skills are diplomacy,
helpfulness, optimism, influence and flexibility. Also vital are collaboration skills, empathy,
tolerance and frankness. These characteristics often align with corporate culture as well as
small business culture.
Ways to improve interpersonal skills include touring different sites, managing by
walking around, arranging lunches and corresponding consistently via phone or email.
Having good interpersonal skills promotes approachability, likability and comfort. Managers
who possess strong interpersonal skills motivate their staff to challenge themselves and do a
better job. Most importantly, they make workers feel as if they can go to their bosses with
any problems or concerns.
Both verbal and non-verbal interpersonal skills are extremely important when it
comes to a company's success. When you can speak to people in an articulate manner, you
avoid communication errors and are more likely to have happy customers. It's just as
important to maintain the correct tone of voice as well. Non-verbal communication consists
of facial expressions, hand gestures and body language. It can also determine whether or not
your interaction results in a satisfied customer. When you combine both verbal and
non-verbal skills, the result is a powerful demeanor that may help to determine the success of a
company. In addition, superb interpersonal skills encompass listening skills, problem-solving,
decision-making and negotiation skills. The ability to communicate internally with
employees and coworkers is just as important as building and maintaining solid relationships
with customers.
CHAPTER 5
CONCLUSION AND FUTURE WORK
REFERENCES
[1]. B. Erşahin, Ö. Aktaş, D. Kilinç, and C. Akyol, "Twitter fake account detection," in Proc.
Int. Conf. Comput. Sci. Eng. (UBMK), Oct. 2017, pp. 388–392.
[5]. M. Sevi and İ. Aydin, "Detection of fake Twitter accounts with multiple
classifier and data augmentation technique," in Proc. Int. Artif.
Intell. Data Process. Symp. (IDAP), 2019.