1822 B.E Cse Batchno 220
by
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
MAY 2022
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Ranga
Avinash (Reg No: 38110449) who carried out the project entitled “A
COMPARATIVE STUDY ON FAKE JOB POST PREDICTION
USING DIFFERENT MACHINE LEARNING TECHNIQUES” under my
supervision from June 2021 to November 2021.
INTERNAL GUIDE
_____________________________________________________________________
DECLARATION
I, Ranga Avinash (Reg No: 38110449), hereby declare that the Project
Report entitled “A COMPARATIVE STUDY ON FAKE JOB POST
PREDICTION USING DIFFERENT MACHINE LEARNING
TECHNIQUES” done by me under the guidance of Dr. Prayla Shyry, M.E.,
Ph.D., is submitted in partial fulfillment of the requirements for the award of
the Bachelor of Engineering degree in 2018-2022.
DATE:
ACKNOWLEDGEMENT
I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many
ways for the completion of the project.
ABSTRACT
TABLE OF CONTENTS
1 INTRODUCTION 8
1.1 OVERVIEW 8
1.3 OBJECTIVE 9
2 LITERATURE SURVEY 10
3 METHODOLOGY 15
3.6 DIAGRAMS 19
3.7 MODULES 28
4 SYSTEM STUDY 40
4.1 FEASIBILITY STUDY 40
5 CONCLUSION 48
5.1 CONCLUSION 48
REFERENCES 49
APPENDICES 51
A. SOURCE CODE 51
B. SCREENSHOTS 55
C. PLAGIARISM REPORT 74
D. JOURNAL PAPER 76
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
In modern times, developments in industry and technology have opened up a
huge range of new and diverse jobs for job seekers. Through advertisements for
these job offers, job seekers evaluate their options according to their time,
qualifications, experience, suitability and so on. The recruitment process is now
influenced by the power of the internet and social media. Since the successful
completion of a recruitment process depends on its advertisement, the impact of
social media on it is tremendous. Social media and advertisements in electronic
media have created ever newer opportunities to share job details. However, the
rapid growth of opportunities to share job posts has also increased the
percentage of fraudulent job postings, which causes harassment to job seekers.
As a result, people hesitate to show interest in new job postings in order to
preserve the security and consistency of their personal, academic and
professional information. Thus the true motive of valid job postings through
social and electronic media faces an extremely hard challenge in earning
people's trust. Technologies are around us to make our lives easy and
developed, not to create an insecure environment for professional life. If job
posts can be filtered properly by predicting false ones, this will be a great
advancement for recruiting new employees. Fake job posts make it difficult for
job seekers to find their preferred jobs and cause a huge waste of their time. An
automated system to predict false job posts opens a new window for tackling
these difficulties in the field of Human Resource Management.
1.2 MACHINE LEARNING
1.3 OBJECTIVE
In the era of modern technology and social communication, advertising new job
posts has become very common. The task of predicting fake job postings is
therefore a great concern for all. Like many other classification tasks, fake job
post prediction presents many challenges. The objective of this project is to
compare different machine learning techniques for detecting fraudulent job posts.
CHAPTER 2
LITERATURE SURVEY
Twitter spam has become a critical problem nowadays. Recent works focus on
applying machine learning techniques for Twitter spam detection, which make use
of the statistical features of tweets. In our labeled tweets data set, however, we
observe that the statistical properties of spam tweets vary over time, and thus, the
performance of existing machine learning-based classifiers decreases. This issue is
referred to as “Twitter Spam Drift”. In order to tackle this problem, we first carry
out a deep analysis on the statistical features of one million spam tweets and one
million non-spam tweets, and then propose a novel Lfun scheme. The proposed
scheme can discover “changed” spam tweets from unlabeled tweets and
incorporate them into the classifier's training process. A number of experiments are
performed to evaluate the proposed scheme. The results show that our proposed
Lfun scheme can significantly improve the spam detection accuracy in real-world
scenarios.
Information quality in social media is an increasingly important issue, but web-
scale data hinders experts' ability to assess and correct much of the inaccurate
content, or "fake news," present in these platforms. This paper develops a method
for automating fake news detection on Twitter by learning to predict accuracy
assessments in two credibility-focused Twitter datasets: CREDBANK, a
crowdsourced dataset of accuracy assessments for events in Twitter, and PHEME,
a dataset of potential rumors in Twitter and journalistic assessments of their
accuracies. We apply this method to Twitter content sourced from BuzzFeed's fake
news dataset and show models trained against crowdsourced workers outperform
models based on journalists' assessment and models trained on a pooled dataset of
both crowdsourced workers and journalists. All three datasets, aligned into a
uniform format, are also publicly available. A feature analysis then identifies
features that are most predictive for crowdsourced and journalistic accuracy
assessments, results of which are consistent with prior work. We close with a
discussion contrasting accuracy and credibility and why models of non-experts
outperform models of journalists for fake news detection in Twitter.
The popularity of Twitter attracts more and more spammers. Spammers send
unwanted tweets to Twitter users to promote websites or services, which are
harmful to normal users. In order to stop spammers, researchers have proposed a
number of mechanisms. The focus of recent works is on the application of machine
learning techniques into Twitter spam detection. However, tweets are retrieved in a
streaming way, and Twitter provides the Streaming API for developers and
researchers to access public tweets in real time. However, a performance
evaluation of existing machine learning-based streaming spam detection methods has been lacking.
In this paper, we bridged the gap by carrying out a performance evaluation, which
was from three different aspects of data, feature, and model. A big ground-truth of
over 600 million public tweets was created by using a commercial URL-based
security tool. For real-time spam detection, we further extracted 12 lightweight
features for tweet representation. Spam detection was then transformed to a binary
classification problem in the feature space and can be solved by conventional
machine learning algorithms. We evaluated the impact of different factors on the
spam detection performance, including the spam-to-nonspam ratio, feature
discretization, training data size, data sampling, time-related data, and machine
learning algorithms. The results show the streaming spam tweet detection is still a
big challenge and a robust detection technique should take into account the three
aspects of data, feature, and model.
In this paper, we view the task of identifying spammers in social networks from a
mixture modeling perspective, based on which we devise a principled unsupervised
approach to detect spammers. In our approach, we first represent each user of the
social network with a feature vector that reflects its behaviour and interactions with
other participants. Next, based on the estimated users' feature vectors, we propose a
statistical framework that uses the Dirichlet distribution in order to identify
spammers. The proposed approach is able to automatically discriminate between
spammers and legitimate users, while existing unsupervised approaches require
human intervention in order to set informal threshold parameters to detect
spammers. Furthermore, our approach is general in the sense that it can be applied
to different online social sites. To demonstrate the suitability of the proposed
method, we conducted experiments on real data extracted from Instagram and
Twitter.
Law Enforcement Agencies cover a crucial role in the analysis of open data and
need effective techniques to filter troublesome information. In a real scenario, Law
Enforcement Agencies analyze Social Networks, i.e. Twitter, monitoring events
and profiling accounts. Unfortunately, among the huge number of internet users
there are people who use microblogs to harass other people or spread
malicious content. User classification and spammer identification are useful
techniques for relieving Twitter traffic of uninformative content. This work
proposes a framework that exploits a non-uniform feature sampling inside a gray
box Machine Learning System, using a variant of the Random Forests Algorithm
to identify spammers inside Twitter traffic. Experiments are performed on a
popular Twitter dataset and on a newly provided dataset of Twitter users, which is
made up of users labeled as spammers or legitimate users, described by 54
features. Experimental results demonstrate the effectiveness of the enriched
feature-sampling method.
CHAPTER 3
METHODOLOGY
• Because of privacy issues, the Facebook dataset is very limited and many
details are not made public.
• Existing approaches have lower accuracy.
• They are more complex.
• Social networking sites make our social lives better, but nevertheless there
are many issues with using them.
• These issues include privacy, online bullying, potential for misuse, trolling,
etc. These are mostly carried out using fake job posts.
• In this project, we came up with a framework through which we can detect
fake job posts using machine learning algorithms, so that people's social
lives become more secure.
Random forest is a supervised learning algorithm that can be used for both
classification and regression problems in ML. It is based on the concept of
ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.
A random forest algorithm consists of many decision trees. The ‘forest’ generated
by the random forest algorithm is trained through bagging or bootstrap
aggregating. Bagging is an ensemble meta-algorithm that improves the accuracy of
machine learning algorithms.
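The bagging step described above can be sketched with scikit-learn's BaggingClassifier, whose default base learner is a decision tree. The dataset below is synthetic and for illustration only, not the project's job-post data:

```python
# A minimal sketch of bagging (bootstrap aggregating) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 50 decision trees, each trained on a bootstrap sample of the training rows;
# their majority vote gives the final prediction.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=42)
bag.fit(X_train, y_train)
acc = bag.score(X_test, y_test)
print(round(acc, 2))
```

Because each tree sees a different bootstrap sample, averaging their votes reduces the variance of a single decision tree.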
The below diagram explains the working of the Random Forest algorithm:
ALGORITHM USED
The random forest algorithm begins by randomly selecting “k” features out
of the total “m” features. In the image, you can observe that features and
observations are taken at random.
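The “k out of m features” idea corresponds to the max_features parameter of scikit-learn's RandomForestClassifier; with max_features="sqrt", each split considers only sqrt(m) randomly chosen features (k = 4 of m = 16 below). The data and parameter values here are illustrative:

```python
# Sketch of per-split random feature selection in a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=16, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)
print(rf.n_features_in_)  # 16
```

Restricting each split to a random feature subset decorrelates the trees, which is what lets the forest outperform any single tree.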
3.4 HARDWARE REQUIREMENTS
RAM: 2 GB
3.6 DIAGRAMS
A data flow diagram (DFD) depicts information flow and the transformations
that are applied as data moves from input to output. The DFD is also known as
a bubble chart. A DFD may be used to represent a system at any level of
abstraction, and may be partitioned into levels that represent increasing
information flow and functional detail.
FIG 3.1
UML DIAGRAMS
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so
that they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.
Fig.3.2
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities
and actions with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. An activity
diagram shows the overall flow of control.
Fig 3.3
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is
a construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event diagrams, event scenarios, and timing diagrams.
FIG 3.4
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language
(UML) is a type of static structure diagram that describes the structure of a system
by showing the system's classes, their attributes, operations (or methods), and the
relationships among the classes. It explains which class contains information.
FIG 3.5
3.7 MODULES
➢ Data Collection
➢ Pre-Processing
➢ Train and Test
➢ Machine Learning Technique
➢ Detection of Fake Job post
MODULE DESCRIPTION
3.7.2 Pre-Processing
We convert the data into scalar format and then create new features, which are
passed to the algorithm; the features are saved in X and the labels in y.
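The scaling step can be sketched with MinMaxScaler, the scaler imported in the appendix source code; the numbers below are placeholders standing in for encoded job-post fields:

```python
# Minimal sketch of rescaling features into [0, 1] with MinMaxScaler.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

raw = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0]])

scaler = MinMaxScaler()          # rescales each column into [0, 1]
X = scaler.fit_transform(raw)    # features saved in X
print(X.shape)
```

Scaling keeps no single feature (e.g. a large salary field) from dominating distance-based learners such as KNN.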
We then split the dataset into a training set and a test set. We use 70% of our
data for training and the remaining 30% for testing. To do this, we create a split
parameter that divides the data frame in a 70:30 ratio.
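The 70/30 split described above can be done with scikit-learn's train_test_split; the column names below are placeholders, not necessarily the project's exact schema:

```python
# Sketch of the 70/30 train/test split on a placeholder data frame.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"f1": range(100),
                   "f2": range(100, 200),
                   "fraudulent": [0, 1] * 50})

X = df.drop(columns="fraudulent")   # features saved in X
y = df["fraudulent"]                # labels saved in y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(len(X_train), len(X_test))  # 70 30
```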
3.7.4 Machine Learning Technique
After splitting the dataset into training and test sets, we instantiate a Random
Forest classifier and a KNN classifier and fit them to the training data using the
‘fit’ function. The fitted classifier is then stored as the model.
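A minimal sketch of this step, fitting both classifiers and persisting one with pickle. Synthetic data stands in for the preprocessed training set; the file name matches the one loaded by the Flask app in the appendix:

```python
# Fit a Random Forest and a KNN classifier, then pickle the model.
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X_train, y_train = make_classification(n_samples=200, n_features=10,
                                       random_state=1)

rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

with open('random.pickle', 'wb') as f:
    pickle.dump(rf, f)            # stored as the model

with open('random.pickle', 'rb') as f:
    model = pickle.load(f)        # reloaded, as the Flask app does
print(round(model.score(X_train, y_train), 2))
```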
Describing the overall features of the software is concerned with defining the
requirements and establishing the high-level design of the system. During
architectural design, the various web pages and their interconnections are
identified and designed. The major software components are identified and
decomposed into processing modules and conceptual data structures, and the
interconnections among the modules are identified. The following modules are
identified in the proposed system.
FIG 3.6
3.8.1 Problem Statement:
In the era of modern technology and social communication, advertising new job
posts has become very common. The task of predicting fake job postings is
therefore a great concern for all. Like many other classification tasks, fake job
post prediction presents many challenges.
INPUT DESIGN
The input design is the link between the information system and the user. It
comprises developing specifications and procedures for data preparation, the
steps necessary to put transaction data into a usable form for processing. This
can be achieved by having the computer read data from a written or printed
document, or by having people key the data directly into the system. The design
of input focuses on controlling the amount of input required, controlling errors,
avoiding delay, avoiding extra steps and keeping the process simple. The input
is designed in such a way that it provides security and ease of use while
retaining privacy. Input design considered the following things:
OBJECTIVES
2. It is achieved by creating user-friendly screens for data entry that can handle
large volumes of data. The goal of designing input is to make data entry easier
and free from errors. The data entry screen is designed in such a way that all
data manipulations can be performed. It also provides record-viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the
help of screens, and appropriate messages are provided as needed so that the
user is never left in confusion. Thus the objective of input design is to create an
input layout that is easy to follow.
OUTPUT DESIGN
A quality output is one which meets the requirements of the end user and
presents the information clearly. In any system, the results of processing are
communicated to the users and to other systems through outputs. In output
design it is determined how the information is to be displayed for immediate
need, as well as the hard copy output. It is the most important and direct source
of information to the user. Efficient and intelligent output design improves the
system's relationship with the user and helps user decision-making.
1. Designing computer output should proceed in an organized, well thought out
manner; the right output must be developed while ensuring that each output
element is designed so that people will find the system easy to use effectively.
When analysts design computer output, they should identify the specific output
that is needed to meet the requirements.
The output form of an information system should accomplish one or more of the
following objectives.
CHAPTER 4
SYSTEM STUDY
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can pour into
the research and development of the system is limited. The expenditures must
be justified. Thus the developed system is well within the budget, and this was
achieved because most of the technologies used are freely available. Only the
customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high
demand on the available technical resources, as this would lead to high
demands being placed on the client. The developed system must have modest
requirements, as only minimal or no changes are required to implement this
system.
SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently.
The user must not feel threatened by the system, but must instead accept it as a
necessity. The level of acceptance by the users depends solely on the methods
that are employed to educate the user about the system and to make him
familiar with it. His level of confidence must be raised so that he is also able to
offer some constructive criticism, which is welcomed, as he is the final user of
the system.
SYSTEM TESTING
System testing verifies that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. There are various
types of test; each test type addresses a specific testing requirement.
TYPES OF TESTS
UNIT TESTING
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is
the testing of individual software units of the application, and is done after the
completion of an individual unit, before integration. This is structural testing
that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at the component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path
of a business process performs accurately to the documented specifications and
contains clearly defined inputs and expected results.
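As an illustration, a unit test for the label-mapping logic used by the prediction routes in the appendix might look like the following; label_for is a hypothetical helper, written here only to make the test self-contained:

```python
# Hypothetical helper mirroring the label logic in the /predict route.
def label_for(pred):
    """Map a classifier output (1 = fake, 0 = legit) to the UI label."""
    return "Fake Job Post" if pred == 1 else "Legit Job Post"

def test_label_for():
    # One defined input and expected result per unique path.
    assert label_for(1) == "Fake Job Post"
    assert label_for(0) == "Legit Job Post"

test_label_for()
print("label_for unit tests passed")
```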
INTEGRATION TESTING
FUNCTIONAL TEST
Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, systematic coverage pertaining to
identified business process flows, data fields, predefined processes, and
successive processes must be considered for testing. Before functional testing is
complete, additional tests are identified and the effective value of current tests
is determined.
SYSTEM TEST
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An
example of system testing is the configuration oriented system integration test.
System testing is based on process descriptions and flows, emphasizing pre-driven
process links and integration points.
BLACK BOX TESTING
Black box testing treats the software as a “black box”: you cannot “see” into it.
The test provides inputs and responds to outputs without considering how the
software works.
Unit testing is usually conducted as part of a combined code and unit test
phase of the software lifecycle, although it is not uncommon for coding and unit
testing to be conducted as two distinct phases.
Test objectives
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.
Features to be tested
• Verify that the entries are of the correct format
• No duplicate entries should be allowed
• All links should take the user to the correct page.
6.2 INTEGRATION TESTING
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
CHAPTER-5
CONCLUSION
5.1 CONCLUSION
Job scam detection has become a great concern all over the world at present. In
this paper, we have analyzed the impact of job scams, which can be a very
promising research area that poses many challenges in detecting fraudulent job
posts. We have experimented with the EMSCAD dataset, which contains real-life
fake job posts. We have experimented with both machine learning algorithms
(SVM, KNN, Naive Bayes, Random Forest and MLP) and a deep learning model
(Deep Neural Network). This work presents a comparative study on the
evaluation of traditional machine learning and deep learning based classifiers.
Among the traditional machine learning algorithms, the Random Forest classifier
achieved the highest classification accuracy, while the Deep Neural Network
achieved 99% accuracy (fold 9) and 97.7% classification accuracy on average.
REFERENCES
[1] C. Chen, S. Wen, J. Zhang, Y. Xiang, J. Oliver, A. Alelaiwi, and M. M.
Hassan, ‘‘Investigating the deceptive information in Twitter spam,’’ Future Gener.
Comput. Syst., vol. 72, pp. 319–326, Jul. 2017.
[2] I. David, O. S. Siordia, and D. Moctezuma, ‘‘Features combination for the
detection of malicious Twitter accounts,’’ in Proc. IEEE Int. Autumn Meeting
Power, Electron. Comput. (ROPEC), Nov. 2016, pp. 1–6.
[3] M. Babcock, R. A. V. Cox, and S. Kumar, ‘‘Diffusion of pro- and anti-false
information tweets: The black panther movie case,’’ Comput. Math. Org. Theory,
vol. 25, no. 1, pp. 72–84, Mar. 2019.
[4] S. Keretna, A. Hossny, and D. Creighton, ‘‘Recognising user identity in Twitter
social networks via text mining,’’ in Proc. IEEE Int. Conf. Syst., Man, Cybern.,
Oct. 2013, pp. 3079–3082.
[5] C. Meda, F. Bisio, P. Gastaldo, and R. Zunino, ‘‘A machine learning approach
for Twitter spammers detection,’’ in Proc. Int. Carnahan Conf. Secur. Technol.
(ICCST), Oct. 2014, pp. 1–6.
[6] W. Chen, C. K. Yeo, C. T. Lau, and B. S. Lee, ‘‘Real-time Twitter content
polluter detection based on direct features,’’ in Proc. 2nd Int. Conf. Inf. Sci. Secur.
(ICISS), Dec. 2015, pp. 1–4.
[7] H. Shen and X. Liu, ‘‘Detecting spammers on Twitter based on content and
social interaction,’’ in Proc. Int. Conf. Netw. Inf. Syst. Comput., pp. 413–417, Jan.
2015.
[8] G. Jain, M. Sharma, and B. Agarwal, ‘‘Spam detection in social media using
convolutional and long short term memory neural network,’’ Ann. Math. Artif.
Intell., vol. 85, no. 1, pp. 21–44, Jan. 2019.
[9] M. Washha, A. Qaroush, M. Mezghani, and F. Sedes, ‘‘A topic-based hidden
Markov model for real-time spam tweets filtering,’’ Procedia Comput. Sci., vol.
112, pp. 833–843, Jan. 2017.
[10] F. Pierri and S. Ceri, ‘‘False news on social media: A data-driven survey,’’
2019, arXiv:1902.07539. [Online]. Available: https://arxiv.org/abs/1902.07539
[11] S. Sadiq, Y. Yan, A. Taylor, M.-L. Shyu, S.-C. Chen, and D. Feaster,
‘‘AAFA: Associative affinity factor analysis for bot detection and stance
classification in Twitter,’’ in Proc. IEEE Int. Conf. Inf. Reuse Integr. (IRI), Aug.
2017, pp. 356–365.
[12] M. U. S. Khan, M. Ali, A. Abbas, S. U. Khan, and A. Y. Zomaya,
‘‘Segregating spammers and unsolicited bloggers from genuine experts on
Twitter,’’ IEEE Trans. Dependable Secure Comput., vol. 15, no. 4, pp. 551–560,
Jul./Aug. 2018.
APPENDICES
A. Source code:
import numpy as np
import pandas as pd
from flask import Flask, request, jsonify, render_template, redirect, flash, send_file
from sklearn.preprocessing import MinMaxScaler
from werkzeug.utils import secure_filename
import pickle

app = Flask(__name__)  # Initialize the Flask app

# Load the fitted model and the text pipeline (vectorizer + classifier)
model = pickle.load(open('random.pickle', 'rb'))
vecs = pickle.load(open('vectorizers.pickle', 'rb'))
classifiers = pickle.load(open('classifiers.pickle', 'rb'))


@app.route('/')
@app.route('/index')
def index():
    return render_template('index.html')


@app.route('/chart')
def chart():
    return render_template('chart.html')


@app.route('/performance')
def performance():
    return render_template('performance.html')


@app.route('/login')
def login():
    return render_template('login.html')


@app.route('/upload')
def upload():
    return render_template('upload.html')


@app.route('/preview', methods=["POST"])
def preview():
    if request.method == 'POST':
        dataset = request.files['datasetfile']
        df = pd.read_csv(dataset, encoding='unicode_escape')
        df.set_index('Id', inplace=True)
        return render_template("preview.html", df_view=df)


@app.route('/fake_prediction')
def fake_prediction():
    return render_template('fake_prediction.html')


@app.route('/predict', methods=['POST'])
def predict():
    # Numeric form fields become the feature vector for the pickled model
    features = [float(x) for x in request.form.values()]
    final_features = [np.array(features)]
    y_pred = model.predict(final_features)
    if y_pred[0] == 1:
        label = "Fake Job Post"
    else:
        label = "Legit Job Post"
    return render_template('fake_prediction.html', prediction_texts=label)


@app.route('/text_prediction')
def text_prediction():
    return render_template("text_prediction.html")


@app.route('/job')
def job():
    abc = request.args.get('news')
    input_data = [abc.rstrip()]
    # transforming the input text with the fitted vectorizer
    tfidf_test = vecs.transform(input_data)
    # predicting the input
    y_preds = classifiers.predict(tfidf_test)
    if y_preds[0] == 1:
        labels = "Fake Job Post"
    else:
        labels = "Legit Job Post"
    return render_template('text_prediction.html', prediction_text=labels)


if __name__ == "__main__":
    app.run(debug=True)
B. SCREENSHOTS
Fig 5.1
• It is the home page of our website.
Fig 5.2
• This is the static login page for the user. It was created as a static login so
that everyone can access it. The username and password are common for all users.
Fig 5.3
Fig 5.4
Fig 5.5
• This is the preview section of the dataset. It displays details of the dataset,
such as its parameters.
• In this preview section, all the sample data is displayed.
Fig 5.6
Fig 5.7
• At the end of the preview page we can train the data using the KNN algorithm.
Fig 5.8
• This is the main part of the project: we can predict whether a job post is legit
or fake by choosing different parameters of the job post, such as employment
type, required education, required experience and function.
Fig 5.9
Fig 5.10
• For the corresponding parameters of the job post, the result is a legit job post.
Fig 5.11
• For the corresponding parameters, the given job post is a fake post.
Fig 5.12
Fig 5.13
Fig 5.14
Fig 5.15
Fig 5.16
Fig 5.17
C. PLAGIARISM REPORT