Complete Final Sem Report PDF


Project Report on

March 2020
Table of Contents
1) Introduction
1.1 Purpose
1.2 Scope
1.3 Objective
1.4 Technology and Tools
2) Project Management
2.1 Project Planning
2.2 Project Scheduling
2.3 Risk Management
3) System Requirements Study
3.1 User Characteristics
3.2 Hardware and Software Requirements
3.3 Constraints, Assumptions and Dependencies
4) System Analysis
4.1 Study of Current System
4.2 Problems and Weaknesses of Current System
4.3 Requirements of New System
4.4 Feasibility Study
4.5 Requirements Validation
4.6 Features of New System
4.7 Data Flow Diagram
4.8 ER Diagram
4.9 UML Diagrams
4.10 Selection of Hardware and Software and Justification
5) System Design
5.1 Overview
5.2 Product Function
5.3 User Characteristics
5.4 Constraints
5.5 User Requirements
5.6 Performance Requirements
5.7 Code Snippet
6) Proposed Solution and Code Implementation
6.1 Proposed Solution
6.2 Implementation Environment
6.3 Program/Module Specification
6.4 Coding Standards
6.5 Coding
7) Results and Discussion
7.1 Take a Valid News Article URL
7.2 Extract Relevant Text From URL
7.3 Extracting Features from Relevant Text
7.4 Applying Machine Learning Algorithms for Classification
7.5 Store Classification Result in Database
7.6 User Login and Sign up
7.7 User Feedback
7.8 Verifying Results
7.9 Retraining of Machine Learning Models
7.10 Non-Functional Requirements Achieved
8) Testing
8.1 Testing Plan
8.2 Testing Strategy
8.3 Testing Methods
8.4 Test Cases
9) Limitations and Future Enhancement
9.1 Limitations and Future Enhancement
10) Conclusion and Discussion
10.1 Self Analysis and Project Viabilities
10.2 Problems Encountered and Possible Solutions
10.3 Summary of Project
11) References
Acknowledgement

We put considerable effort into this project. However, it would not have been possible
without the kind support and help of many individuals and organizations. We would
like to extend our sincere thanks to all of them.
We are highly indebted to our guide, Om Prakash Sir, for his guidance and constant
supervision, as well as for providing the necessary information regarding the project
titled “Fake News Detection System”. We would also like to express our gratitude
to our classmates for their kind co-operation and encouragement, which helped
us complete this project.
We also thank our parents and families for their support; without them we
could do nothing, not just in this project but in life.
Gratitude for anyone's help arises from the bottom of the heart.
A small but important and timely help can prove to be a milestone in one's life.
We are very thankful to God for giving us such good people and for providing
everything before we need it; we always feel that without him we are
nothing.

Bhanu Pratap Mishra


Abstract

The scourge of cyberbullying has assumed alarming proportions with an
ever-increasing number of adolescents admitting to having dealt with it either as a
victim or as a bystander.
Anonymity and the lack of meaningful supervision in the electronic medium are
two factors that have exacerbated this social menace.
Comments or posts involving sensitive topics that are personal to an individual are
more likely to be internalized by a victim, often resulting in tragic outcomes.
We decompose the overall detection problem into detection of sensitive topics,
lending itself to text classification sub-problems.
We find that binary classifiers for individual labels outperform multiclass
classifiers.
Our findings show that the detection of textual cyberbullying can be tackled by
building individual topic-sensitive classifiers.
List of Figures

Figure Title

Gantt Chart
Data Flow Diagram
Component Diagram
E-R Diagram
Use Case Diagram 1
Use Case Diagram 2
Use Case Diagram 3
Layered Architecture
Workflow Diagram
Accuracy vs Number of Features
Accuracy vs SVM Kernels
Accuracy vs Depth of Random Forest and Decision Tree
Accuracy vs Train/Test Split
Feature Reduction (Graph)


List of Tables

Title

Milestones and Deliverables
Server-side Hardware Requirement
Software Requirement
Client-side Hardware Requirement
Hardware Requirements
Software Requirements
Performance Requirement
Features with Importance
Feature Reduction (Tabulated)
Decision Tree
Random Forest
SVM
Accuracy vs Training Algorithms
Performance Requirement
Security Requirements
Usability Requirements
Chapter 1
Introduction

• Purpose
• Scope
• Objective
• Technology and Tools


INTRODUCTION

• PURPOSE:
The purpose of this project is to use machine learning algorithms to detect fake news
on online social media that circulates as if it were real, much like clickbait.
It aims to improve the user experience on online social media platforms and to
save users the time they might otherwise spend on fake news.

• SCOPE:
The scope of this project is broad: it ranges from online social media such as
Facebook, Twitter and Instagram to fake blogs and fake websites that deceive users in
one way or another.

• OBJECTIVE:
This is a standalone application that uses a dataset containing a mixture of
fake news, real news, and news that appears real but is fake.

• TECHNOLOGY AND TOOLS:

Front End: The following technologies are used for designing the structure of the
project:
1) Jupyter Notebook:

It is an open-source web application that allows you to create and share documents that
contain live code, equations, visualizations and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical modeling, data
visualization and machine learning. The Notebook is a server-client application that
allows editing and running notebook documents via a web browser. It can be run
on a local desktop with no internet access, or installed on a remote server and
accessed through the internet.

In addition to displaying, editing and running notebook documents, it has a “Dashboard”
(Notebook Dashboard), a “control panel” that shows local files and allows opening
notebook documents or shutting down their kernels.
Jupyter Notebook (formerly IPython Notebook) is a web-based interactive
computational environment for creating, executing, and visualizing Jupyter notebooks.
It is similar to the notebook interface of other programs such as Maple, Mathematica,
and SageMath, a computational interface style that originated with Mathematica in the
1980s. It supports execution environments (also known as kernels) in dozens of languages. By
default Jupyter Notebook ships with the IPython kernel, but over 100 Jupyter
kernels were available as of May 2018.
2) Anaconda:
Anaconda is a free and open-source distribution of the Python and R programming
languages for data science and machine learning applications (large-scale data
processing, predictive analytics, scientific computing) that aims to simplify package
management and deployment. Package versions are managed by the package
management system conda.

Anaconda is a scientific Python distribution. It has no IDE of its own. It bundles
many Python packages that are commonly used for scientific computing and
data science.

It provides a single download and an install program/script that installs all the packages
in one go. The alternative is to install Python and then install each required package
individually using pip. Additionally, it provides its own package manager (conda) and package
repository, but it also allows packages to be installed from PyPI using pip if they are
not in the Anaconda repositories. It is especially convenient on Microsoft
Windows, where it can easily install packages that would otherwise require
C/C++ compilers and libraries if installed with pip. A further advantage is
that conda, in addition to being a package manager, is also a virtual environment
manager that lets you create independent development environments and switch
from one to another (similar to virtualenv).

3) Python:
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics.
Its high-level built-in data structures, combined with dynamic typing and binding, make it
very attractive for Rapid Application Development, as well as for use as a scripting or
glue language to connect existing components together.
Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the
cost of program maintenance. It supports modules and packages, which encourages
program modularity and code reuse. The Python interpreter and the extensive standard
library are available in source or binary form without charge for all major platforms, and
can be freely distributed.
Debugging Python programs is easy: a bug or bad input will never cause a segmentation
fault. Instead, when the interpreter discovers an error, it raises an exception. When the
program doesn't catch the exception, the interpreter prints a stack trace. A source-level
debugger allows inspection of local and global variables, evaluation of arbitrary
expressions, setting breakpoints, stepping through the code a line at a time, and so on.
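
The exception behaviour described above can be illustrated with a small sketch. The function names and the "a/b" input format are made up for this illustration and are not part of the project code.

```python
import traceback


def parse_ratio(text):
    """Convert a string such as '3/4' to a float; bad input raises an exception."""
    numerator, denominator = text.split("/")
    return int(numerator) / int(denominator)


def safe_parse_ratio(text):
    """Catch the exception, print the stack trace, and return None instead of crashing."""
    try:
        return parse_ratio(text)
    except (ValueError, ZeroDivisionError):
        traceback.print_exc()  # the same kind of trace the interpreter prints for uncaught errors
        return None


print(safe_parse_ratio("1/2"))
print(safe_parse_ratio("1/0"))
```

The second call prints a stack trace for the ZeroDivisionError and the program keeps running, which is the behaviour the paragraph describes.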
4) Dataset:
A dataset is a collection of data. Most commonly a dataset corresponds to the contents
of a single database table, or a single statistical data matrix, where every column of the
table represents a particular variable, and each row corresponds to a given member of
the dataset in question. It lists values for each of the variables, such as height and
weight of an object, for each member of the dataset. Each value is known as a datum.
The dataset may comprise data for one or more members, corresponding to the number
of rows. The term dataset may also be used more loosely, to refer to the data in a
collection of closely related tables, corresponding to a particular experiment or event.
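
As a small illustration of the row and column structure described above, the sketch below reads a three-row table with Python's standard csv module. The column names and headlines are invented for the example; they are not taken from the project's dataset.

```python
import csv
import io

# A made-up table: each row is one member (a news item), each column one variable.
raw_table = """headline,word_count,label
Miracle cure found,3,fake
Budget bill passes,3,real
Aliens endorse candidate,3,fake
"""

rows = list(csv.DictReader(io.StringIO(raw_table)))
columns = list(rows[0].keys())
labels = [row["label"] for row in rows]

print(len(rows), "rows,", len(columns), "columns")
print(labels)
```

Each value read this way (for example the label of the second row) is a single datum in the sense used above.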
5) Machine learning:
Machine learning gives computers the ability to learn without being explicitly
programmed (Arthur Samuel, 1959). It is a subfield of computer science.

Machine learning explores the construction of algorithms which can learn from and make
predictions on data. Such algorithms follow programmed instructions, but can also make
predictions or decisions based on data. They build a model from sample inputs.
Machine learning is used where designing and programming explicit algorithms is
not feasible. Examples include spam filtering, detection of network intruders or malicious
insiders working towards a data breach, and fake news detection in online social media.
6) Deep Learning:
Deep learning is part of a broader family of machine learning methods based on learning data
representations, as opposed to task-specific algorithms. Learning can be supervised,
semi-supervised or unsupervised.
Deep learning architectures such as deep neural networks, deep belief networks and
recurrent neural networks have been applied to fields including computer vision, speech
recognition, natural language processing, social network filtering, machine translation, bioinformatics,
drug design and board game programs, where they have produced results comparable to,
and in some cases exceeding, those of human experts.

7) Naive Bayes Algorithm:


In machine learning, naive Bayes classifiers are a family of simple "probabilistic
classifiers" based on applying Bayes' theorem with strong (naive) independence
assumptions between the features. It is a popular method for text categorization, the
problem of judging documents as belonging to one category or the other (such as spam or
legitimate, sports or politics, etc.) with word frequencies as the features. With
appropriate pre-processing, it is competitive in this domain with more advanced methods
including support vector machines.
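
To make the idea concrete, here is a minimal from-scratch sketch of a multinomial naive Bayes classifier that uses word frequencies as features. It is an illustration only: the class name, the add-one (Laplace) smoothing and the four training sentences are assumptions made for this example and are not the project's implementation or data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # log P(class) + sum of log P(word | class): the "naive"
            # assumption is that words are independent given the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny made-up training set, for illustration only.
train_docs = [
    "shocking miracle cure doctors hate",
    "you will not believe this shocking secret",
    "parliament passes annual budget bill",
    "central bank holds interest rates steady",
]
train_labels = ["fake", "fake", "real", "real"]
model = NaiveBayesText().fit(train_docs, train_labels)

print(model.predict("shocking secret cure"))
print(model.predict("bank budget bill"))
```

Working in log-probabilities avoids numerical underflow when many small word probabilities are multiplied together.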
Chapter 2
Project Management

• Project Planning
• Project Scheduling
• Risk Management
2.0. PROJECT MANAGEMENT

• PROJECT PLANNING

Project planning is concerned with identifying and measuring the activities, milestones
and deliverables produced by the project. Project planning is undertaken, and sometimes
completed, before any development activity starts. It consists of the
following essential activities:

• Scheduling manpower and other resources needed to develop the system.
• Staff organization and staffing plans.
• Risk identification, analysis, and accurate planning.
• Estimating basic attributes of the project such as cost, duration
and effort. The effectiveness of the subsequent planning activities
depends on the accuracy of these estimates.
• Miscellaneous plans such as the quality assurance plan, configuration management plan, etc.

Project management involves planning, monitoring and control of the process and of the
events that occur as the software evolves from a preliminary concept to an operational
implementation. Cost estimation is a related activity that is concerned with the resources
required to accomplish the project plan.

1.1) Project Development Approach and Justification:

A software process model is a simplified, abstract representation of a software process,
presented from a particular perspective. A process model for software
engineering is chosen based on the nature of the project and application, the methods and
tools to be used, and the controls and deliverables that are required. All software
development can be characterized as a problem-solving loop in which four distinct stages
are encountered:

• Requirement analysis
• Coding
• Testing
• Deployment
1.2) Milestones and Deliverables:

As software is intangible, this information can only be provided as documents that describe
the state of the software being developed. Without this information it is impossible to
judge progress at different phases, and therefore schedules cannot be determined or
updated.
A milestone is an end point of a software process activity. At each milestone there should
be a formal output, such as a report, that can be presented to the guide. Milestones are the
completion of the outputs for each activity. Deliverables are the requirements definition
and the requirements specification.
A milestone represents the end of a distinct, logical stage in the project. Milestones may be
internal project results that are used by the project manager to check progress.
Deliverables are usually milestones, but the reverse need not be true. We have divided the
software process into activities, with the following milestones to be achieved.

Software Process Activity    Milestone
Project Plan                 Project schedule
Requirement Collection       User requirements, System Requirements
Analysis of Dataset          Choosing of appropriate dataset
Implementation               Algorithm implementation

Table: Milestones and Deliverables

1.3) Roles and Responsibilities:

This phase defines the roles and responsibilities of every member involved in
developing the system. Only one person worked on the whole application and was
responsible for every part of developing the system. Our team structure is a
single-control team organization, consisting of me and my guide in a chief
programmer organization.
1.4) Group Dependencies:

The structure chosen for the system is the chief programmer team structure. In this
structure, a senior engineer provides the technical leadership and is designated as
the chief programmer. The chief programmer partitions the task into small activities
and assigns them to me with deadlines. He also verifies and integrates the products
I develop, and I work under his constant supervision. For this system, I am the
reporting entity and the role of chief programmer is played by my internal guide.

• PROJECT SCHEDULING

Scheduling is the culmination of the planning activity and a primary component of software
project management. When combined with estimation methods and risk analysis,
scheduling establishes a roadmap for project management. The characteristics of the
project are used to adapt an appropriate task set for doing the work.

Task                        Period            Duration
Develop project proposal    1Dec-25Dec        25 days
Analysis                    31Jan-10Feb       11 days
Designing                   10Feb-20Feb       10 days
Coding                      20Feb-30Feb       10 days
Unit Testing                30Feb-5March      5 days
Implementation              5March-10March    5 days

Fig. Gantt chart of this project

• RISK MANAGEMENT

Risk management consists of a series of steps that help a software development team to
understand and manage uncertain problems that may arise during the course of software
development and can plague a software project.
Risks are dangerous conditions or potential problems that may damage the system's
functionality to an unacceptable degree. To keep our system stable and performing at
its best, we must identify those risks, analyze their likelihood and their effects on
our project, and prevent them from occurring.

3.1) Risk Identification

Risk identification is a first systematic attempt to specify risks to the project plan,
scheduling, resources and project development. It may be carried out as a team process
using a brainstorming approach.

Technology Risks: Technical risks concern implementation and testing problems.
• Dataset enlargement.
• Algorithm output.

People Risks: These risks concern the team and its members who are taking
part in developing the system.
• Lack of knowledge.
• Lack of clear vision.
• Poor communication between people.

Tools Risks:
These concern the tools used to develop the project.
• Tools containing viruses.

General Risks:
General risks concern mentality and resources.
• Rapidly changing datasets.
• Lack of resources can greatly harm the efficiency and timelines of the project.
• Changes in the dataset can greatly harm the implementation and
schedule of developing the system.
• Insufficient planning and task identification.
• Decision-making conflicts.
3.2) Risk Analysis

“Risk analysis = risk assessment + risk management + risk communication.”
Risk analysis is employed in its broadest sense to include:

Risk assessment
Involves identifying sources of potential harm, assessing the likelihood that
harm will occur and the consequences if harm does occur.
For this project, a possible risk is the software tool crashing.

Risk management
Evaluates which risks identified in the risk assessment process require management, and
selects and implements the plans or actions required to ensure that those risks are
controlled.
The precaution taken to minimize risks is as follows:
Keeping the software tool up to date by updating it periodically.

Risk communication
Involves an interactive dialogue between the guide and us, which actively informs
the other processes.
The step taken for risk communication is as follows:
• All possible risks are listed during communication, and the project is
developed with those risks in mind.
Chapter 3

System Requirements Study

• User Characteristics
• Hardware and Software Requirements
• Constraints, Assumptions and Dependencies
• SYSTEM REQUIREMENT STUDY

• USER CHARACTERISTICS
Admin:
• Show full project and user details
• Manage users
• Manage project
• Manage dataset

User:
• Upload pieces of news
• Circulation of news
• Analyze the news

• HARDWARE AND SOFTWARE REQUIREMENT SPECIFICATION
This section lists the minimum requirements for running the system efficiently.

1.2.1) Hardware Requirements
Server-side Hardware Requirement:

Devices      Description
Processor    Intel Core Duo 2.0 GHz or more
RAM          512 MB or more
Hard Disk    10 GB or more

Table: Server-side Hardware Requirement

1.2.2) Software Requirements

For which             Software
Operating System      Windows XP/2003/Vista/7/8/10, Linux, Mac OS X
Front End             Jupyter Notebook
Back End              NumPy, pandas
Scripting Language    Python

Table: Software Requirements

1.2.3) Client-side Requirements

For which    Requirement
Browser      Any compatible browser device

Table: Client-side Requirements

• CONSTRAINTS
1.3.1) Hardware Limitations
The major hardware limitations faced by the system are as follows:

If appropriate hardware (processor, RAM, hard disk) is not available:
- client requests may not be processed properly;
- the whole database may crash for lack of storage, because the system's
main requirement is large storage.

1.3.2) Interfacing with Other Systems
A compatible browser is needed for the system to detect fake news correctly. The
system should be designed so that it can be used as a sub-module of
a larger application.

1.3.3) Reliability Constraints
The major reliability constraints are as follows:

• The software should be designed efficiently so that it recognizes fake news
reliably and can be used for more pragmatic purposes.
• The design should be versatile and user friendly.
• The application should be fast, reliable and time saving.
• The system should be universally adaptable.
• The system should be compatible with future upgrades.

• DEPENDENCIES

The entire project depends on various Python libraries. The libraries are as follows:

NumPy: NumPy is the fundamental package for scientific computing with
Python. It contains, among other things:
• a powerful N-dimensional array object
• sophisticated (broadcasting) functions
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities
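
As a brief illustration of the N-dimensional array object and broadcasting, the sketch below turns a small matrix of word counts into per-document word frequencies. The numbers are invented for the example and are not project data.

```python
import numpy as np

# 2 documents x 3 vocabulary words (made-up counts).
counts = np.array([[2, 0, 1],
                   [0, 3, 1]], dtype=float)

# keepdims=True keeps the result as shape (2, 1) so it can broadcast
# across the 3 columns of `counts` in the division below.
row_totals = counts.sum(axis=1, keepdims=True)
frequencies = counts / row_totals

print(frequencies)
```

Broadcasting lets the (2, 3) array be divided by the (2, 1) column of totals without writing an explicit loop over documents.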

pandas: pandas is an open-source, BSD-licensed library providing
high-performance, easy-to-use data structures and data analysis tools for the Python
programming language.
pandas is a NumFOCUS sponsored project. This helps ensure the success of the
development of pandas as a world-class open-source project, and makes it
possible to donate to the project.

itertools: This Python standard library module implements a number of iterator
building blocks inspired by constructs from APL, Haskell and SML. Each has been
recast in a form suitable for Python.

Matplotlib: Matplotlib is a Python 2D plotting library which produces
publication-quality figures in a variety of hardcopy formats and interactive
environments across platforms. Matplotlib can be used in Python scripts, the
Python and IPython shells, the Jupyter notebook, web application servers, and
four graphical user interface toolkits.

Scikit-learn: Simple and efficient tools for data mining and data analysis. Accessible
to everybody, and reusable in various contexts. Built on NumPy, SciPy, and
matplotlib. Open source, commercially usable (BSD license).
Chapter 4

System Analysis
• Study of Current System
• Problems and Weaknesses of Current System
• Requirements of New System
• Feasibility Study
• Requirements Validation
• Features of New System
• Data Flow Diagram
• ER Diagram
• UML Diagrams
• Selection of Hardware and Software and Justification
• STUDY OF CURRENT SYSTEM
Current systems focus on classifying online reviews and publicly
available social media posts.
• PROBLEMS AND WEAKNESSES OF CURRENT SYSTEM
The current system is well designed for detecting
deception, but it has the following limitations:

“Conroy, Rubin, and Chen outline several approaches that seem
promising towards the aim of perfectly classifying misleading articles.
They note that simple content-related n-grams and shallow parts-of-
speech (POS) tagging have proven insufficient for the classification
task, often failing to account for important context information. Rather,
these methods have been shown useful only in tandem with more
complex methods of analysis. Deep syntax analysis using Probabilistic
Context Free Grammars (PCFG) has been shown to be particularly
valuable in combination with n-gram methods. Feng, Banerjee, and
Choi are able to achieve 85%-91% accuracy in deception-related
classification tasks using online review corpora.
Feng and Hirst implemented a semantic analysis looking at 'object:
descriptor' pairs for contradictions with the text on top of Feng's initial
deep syntax model for additional improvement. Rubin, Lukoianova and
Tatiana analyze rhetorical structure using a vector space model with
similar success. Ciampaglia et al. employ language pattern similarity
networks requiring a pre-existing knowledge base.”
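
For readers unfamiliar with the n-gram features mentioned in the quotation, the following sketch counts word unigrams and bigrams in a sentence. The function names and the example sentence are illustrative assumptions; the quoted works may tokenize and weight n-grams differently.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the list of n-word sequences in the text, in order."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(text, max_n=2):
    """Count all n-grams for n = 1 .. max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


features = ngram_counts("the senator denied the report")
print(features.most_common(3))
```

Counts like these ignore word order beyond the n-gram window, which is one reason the quoted authors find them insufficient on their own.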

• Lack of awareness of this system.
• Implementation is difficult and complex.
• Some security-related issues may arise.
• Cost effectiveness.

• REQUIREMENTS SPECIFICATION
Requirements specification adds further information to the requirements
definition.
3.1) Algorithm Requirements
• Dataset
• Input
• Appropriate functions
• Training
• Efficiency
• Output
3.2) System Requirements
• Usability:
The system should be able to detect deception in blogs or news
on online social media easily.
• Efficiency:
The system should provide an easy and fast response.

• FEASIBILITY STUDY
An important outcome of the preliminary investigation is the
determination of whether the system is feasible. The main aim of the
feasibility study is to determine whether it would be financially
and technically feasible to develop the project.

The feasibility study involves the analysis of the problem and the
collection of all relevant information relating to the product, such as the
different datasets which would be input to the system, the processing
required to be carried out on these datasets, the output required to be
produced by the system, and the various constraints on the
behavior of the system.
4.1) Does the system contribute to the overall objectives of the organization?

The main aim behind the development of this system is to provide fake
news detection that can prevent the social bullying of people who need
such protection, and also to serve people who do not want to waste their time
on fake news.

4.2) Can the system be implemented using the current technology
and within the given cost and schedule constraints?
• The system can be easily implemented using existing technology. The
technology used is Anaconda, which is user friendly and
free. Given the functionality that the system provides, the
cost of developing the application is not a concern.
• Taking the schedule constraints into consideration, the time
available is approximately 9 months. This period is enough
to develop the system.
5. REQUIREMENT VALIDATION
Requirements validation is concerned with checking whether the requirements
actually define the system that the customer wants. Requirements validation
is important because errors in a requirements document can lead to extensive
rework costs when they are subsequently discovered. We have performed the
following validation checks:

• Validity checks
Check whether the information entered is in a valid format.

• Consistency checks
Requirements in the document do not conflict with each other.

• Completeness checks
The requirements document includes requirements which define all functions
and constraints intended by the system user.

• Realism checks
Using knowledge of existing technology, the requirements are checked to
ensure that they could actually be implemented.

• Verifiability
The requirements are stated in a verifiable manner (e.g. using quantifiable
measures) to reduce disputes between client and developer.

6. FEATURES OF NEW SYSTEM

We will try to develop the application as follows:

• The system is available in regional languages.
• Provide more awareness of this concept in our country, India.
• A user can upload his/her idea through a description, team information,
videos of his/her work, the form of reward, and the main
purpose for which he/she needs the money.
• One can pledge money if one likes the idea.
• Communication is provided between innovators and investors.
• Safe money transfer and assured security of ideas.
7. FLOWCHART OF NEW SYSTEM:
Data Flow Diagram
8. Component Diagram
9. E-R Diagram
10. Use Case Diagrams
Following are the use case diagrams for our system. They describe a set of actions
(use cases) that the system should or can perform in collaboration with one or more
external users of the system (actors).

10.1 Use Case Diagram 1

Figure: Use Case Diagram 1

The classification system is the backbone of the entire software. Figure 1 shows
the use case related to the classification system. The classification system extracts
text from the news URL and uses NLP to extract the required features.
Different machine learning algorithms are then applied to the features, and the
results are displayed to the user and stored in the database.
10.2 Use Case Diagram 2

Figure: Use Case Diagram 2

The use case related to user feedback is shown in Figure 2. To give feedback
on the accuracy of a classification, a user must sign up.
The system displays all the recently processed/classified URLs to the user. If
the user is logged in, he can choose to vote on any classification result. After
some time (1 week) the system checks the votes for the classification and,
based on the votes, verifies whether the classification
was correct or not. If the classification is verified, the system adds the features
of the correct classification to the training set.
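
The vote-checking step can be sketched as follows. The text only states that votes are checked after about one week; the simple-majority rule, the minimum number of votes, and all names in this sketch are assumptions made for illustration.

```python
from datetime import datetime, timedelta

VOTING_WINDOW = timedelta(weeks=1)  # the "1 week" mentioned in the text


def verify_classification(classified_at, votes, now, min_votes=5):
    """Decide whether users confirmed a classification.

    votes: list of booleans, True meaning "the classification was correct".
    Returns True or False once the voting window has closed and enough votes
    exist, otherwise None. min_votes and the majority rule are assumptions.
    """
    if now - classified_at < VOTING_WINDOW:
        return None  # voting is still open
    if len(votes) < min_votes:
        return None  # too few votes to decide
    agree = sum(votes)
    return agree > len(votes) - agree


classified_at = datetime(2020, 3, 1)
result = verify_classification(
    classified_at,
    votes=[True, True, True, True, False, False],
    now=classified_at + timedelta(days=8),
)
print(result)
```

Only a result of True would lead to the features being added to the training set, as described above.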
10.3 Use Case Diagram 3

Figure: Use Case Diagram 3

Figure 3 shows the use case related to basic use of the software. The user enters a
news URL. The system verifies the URL, extracts the relevant text from it
using a web crawler, and then classifies the news article as fake or credible
using machine learning algorithms. After the result is computed, the user can
view the result.
11 SELECTION OF HARDWARE AND SOFTWARE
The tables below list the hardware and software required for the system
and the client-side requirements.

 Hardware Selection

Devices      Description
Processor    Intel Core Duo 2.0 GHz or more
RAM          512 MB or more
Hard Disk    10 GB or more

Table Hardware Requirements

 Software Selection
For which            Software
Operating System     Windows XP/2003/Vista/7/8/10, Linux, Mac OS X
Front End            Jupyter Notebook
Back End             NumPy, pandas
Scripting Language   Python

Table Software Requirements

 Client side requirements:


For which    Requirement
Browser      Any compatible browser

Table Client Side Requirements
Chapter 5

System Design

 Overview

 Product Function

 User Characteristics

 Constraints

 User Requirements

 Performance Requirements

 Code Snippet
1. Overview
The system works on already trained Machine Learning algorithms. Multiple
machine learning algorithms have been trained by providing a data set of both fake
and authentic news. The summary of overall procedure is as follows.

1. The user enters a URL.
2. The entered text is checked to confirm that it is in URL format, then a web
crawler extracts the relevant text from that news URL.
3. NLP is applied to the extracted text.
4. Features extracted by NLP are fed to the ML algorithms.
5. A voting mechanism among the ML algorithms predicts whether the
news is fake or authentic.
6. Each classified article is stored in the database.

7. A user can login to give a feedback if previously classified news was

Figure Layered Architecture
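
The URL format check in step 2 and the vote in step 5 can be sketched as follows. This is only a minimal illustration, not the project's actual code: the function names `looks_like_url` and `majority_vote` and the label strings are assumptions, and resolving ties as "fake" is an arbitrary choice made for the sketch.

```python
from collections import Counter
from urllib.parse import urlparse


def looks_like_url(text):
    """Return True if the text has an http(s) scheme and a host part."""
    parts = urlparse(text.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def majority_vote(predictions):
    """Combine per-model labels ('fake' / 'authentic') into one label.

    Ties are resolved as 'fake' (an assumption made for this sketch).
    """
    counts = Counter(predictions)
    if counts["authentic"] > counts["fake"]:
        return "authentic"
    return "fake"
```

A format check of this kind only confirms that the input is shaped like a URL; whether the page exists is discovered later, when the crawler tries to fetch it.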

2 Product Functions
1. A URL of a news article must be entered.
2. NLP is performed on the text extracted from the URL, and relevant features are
extracted from the result.
3. News articles are classified as fake or authentic from the extracted features.
4. Classified news articles are stored in the database to maintain a list of URLs with
the predicted output (Fake/Authentic), and each user can view that list.
5. A user can vote on an entry in the list if that specific news article isn't classified correctly.

3 User Characteristics
Moderator: The moderator will be monitoring the rating submitted by the users,
to maintain the credibility of ratings.

Administrator: Will maintain the overall aspects of web application and will be
responsible for giving users appropriate roles and authority.

User: The main actor using the web application to analyze the URLs.

4 Constraints
1. Our software can never guarantee the authenticity of the result. For this, we need
user feedback.
2. Our software is available only in English, and news articles provided to the
software must also be in English.
3. We don't have access to a huge amount of data for training of machine learning
algorithms.
4. The software will not work without an internet connection.
5. Our software does not perform well when the article's body is plain, short and
emotionless.

5 User Requirements
The following user requirements describe what the user expects the software
to do.

5.1 External Interface Requirements


The user interface will be web based and provided to the user through a web browser.
The first screen will consist of a login form. Upon logging in, the user will be presented
with a dashboard. The dashboard will consist of a header, a sidebar menu and a body. The
menu for managing user preferences will be provided at the top right. The body will
contain a dialogue box used to get input from the user, with a button to submit the
query entered in the dialogue box. Below the dialogue box and button, a list of
previously processed URLs with their user ratings will be displayed. Against each list
item, the user will be able to rate the corresponding result as either good or bad.

1. NumPy: a scientific computing package that provides N-dimensional array
objects. In this project, several machine learning models use NumPy arrays as the
data container; the implementation of our random tree and random forest also
depends on it.
2. Scikit-learn: a Python library built on NumPy. This project uses it mainly for
data classification.
3. NLTK: a Python library used for NLP (natural language processing). We will be
using NLTK for feature extraction from the news article.
4. Angular: Angular 4 will be used to implement the web-based interface and
client side of the application.
5. Scrapy: a Python library for scraping websites. We will be using Scrapy to fetch
the text of the news article's header from the URL provided by the user.

5.2 Functional Requirements


1. Take a valid news article URL from the user.
2. Extract relevant text from the URL provided by the user, using Scrapy.
3. Extract relevant features from the text using NLP (Natural
Language Processing).
4. Correctly classify the news article as fake news or credible news using different
machine learning models (SVM and Random Forest).
5. Store the classification results in the database to maintain a list of URLs which are
already processed and classified.
6. A user can sign up and log in.
7. Each user can view all the recently processed and classified news articles and
verify the correctness of the classification by voting (sign-in required).
8. After a predefined limit of time and number of votes, we can verify whether
the software classified a given news article correctly or not.
9. We can then modify our classification if needed and add the news article to the
training set to improve the accuracy of future predictions.
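
Requirements 7 to 9 can be illustrated with a small sketch of the vote-based check. The report does not give the exact thresholds, so the one-week window, the minimum of five votes, and the name `verify_classification` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

VOTING_WINDOW = timedelta(days=7)  # assumed voting period
MIN_VOTES = 5                      # assumed minimum number of votes


def verify_classification(classified_at, up_votes, down_votes, now):
    """Return 'pending', 'confirmed', or 'rejected' for one classified URL."""
    total = up_votes + down_votes
    # Wait until both the time limit and the vote limit have been reached.
    if now - classified_at < VOTING_WINDOW or total < MIN_VOTES:
        return "pending"
    # A simple majority decides whether the prediction is accepted.
    return "confirmed" if up_votes > down_votes else "rejected"
```

Only articles that come back as "confirmed" (or that have been manually corrected after being "rejected") would be candidates for addition to the training set.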
6. Performance Requirements

Table Performance Requirements

ID   Performance Requirement
1    Feature extraction must be done in reasonable time.
2    Time taken by machine learning algorithms should be in milliseconds.
3    System should be able to handle multiple simultaneous requests.

7. CODESNIPPET:

Jupyter Notebook is used to implement our machine learning algorithms. The project
contains many files, including dataset files and Python notebooks, with the
extensions ".tsv" and ".ipynb".
We also tried Python libraries such as torch and the well-known NumPy. A small-scale
implementation of our project is shown below.

Dataset files
Chapter 6

Proposed Solution and


Code Implementation
 Proposed Solution
 Implementation Environment
 Program/Module Specification
 Coding Standards
 Coding
1. Proposed Solution
The solution to the problem defined in the earlier section was to design and implement a
web-based application that takes a news URL as input and reports its authenticity with
high accuracy. Achieving higher accuracy was difficult because of the limited dataset.
We still achieve 85.7% test accuracy, which is much higher than that reported in the
research papers we have been following. To tackle the dataset issue, we have implemented a
mechanism in which processed URLs are stored in the database and are then fed to the training
algorithms. In this way our system keeps getting smarter with time.

1.1 Methodology
Developing an automatic fake news detector was a challenging problem. To make sure that we
accomplished this task efficiently, without facing major problems that would have caused
major redesigns and re-engineering of the software architecture in a time- and
cost-constrained project environment, we started off by developing the SRS (Software
Requirement Specifications) and the detailed design of the system. A Gantt chart and a work
breakdown structure were created in that phase to monitor the project and to decide when
each phase should start or end.

After that we started to gather a dataset for training. We were able to gather a dataset of
about 6,500 labeled news articles from multiple sources. We then researched which
machine learning algorithms to apply and what kind of NLP to use. We used SVM and Random
Forest as our machine learning algorithms, which gave us an accuracy of 85.7%.

The overall process is as follows.


 A labeled dataset of about 6,500 news articles, each with text and title, is gathered.
 NLP is applied to each news article to extract relevant features,
e.g., punctuation count, text difficulty index, etc.
 In total, 38 features are extracted.
 Training is done with SVM (linear kernel) and Random Forest.
 When a URL is entered, the text and title of the news from that URL are scraped using a
web crawler.
 The same NLP is applied to the extracted text and title, and the 38 features are fed to
the machine learning algorithms.
 We have combined the strong points of both algorithms, which increases our
accuracy.
 SVM is better at detecting fake news, while Random Forest is better for authentic
news.
 When a user enters a URL and checks the authenticity of the news, the result is stored in
the database.
 The system maintains a list of already processed URLs which users can see.
 A user can also give feedback on any already processed news article with a dislike
button, if the news has been predicted wrongly by our algorithm.
 Predicted news articles with low user ratings are then reviewed manually.
 After some time, these already processed news articles are fed to the machine
learning algorithms.
 The size of our dataset keeps increasing and the system keeps getting smarter with
time.

Figure Workflow Diagram
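
As a rough illustration of the NLP feature-extraction step, the sketch below computes a few of the count-based features named in this chapter for one piece of text. It uses only the standard library; the function name `basic_features`, the dictionary keys, and the reading of "count of numbers" as a digit count are assumptions, and the readability scores (which the project obtains from Textstat) are deliberately left out.

```python
import string


def basic_features(text):
    """Compute a handful of count-based features for one piece of text."""
    char_count = len(text)
    punctuation_count = sum(ch in string.punctuation for ch in text)
    return {
        "word_count": len(text.split()),
        "char_count": char_count,
        "punctuation_count": punctuation_count,
        "uppercase_count": sum(ch.isupper() for ch in text),
        "digit_count": sum(ch.isdigit() for ch in text),
        "bracket_count": sum(ch in "()[]{}" for ch in text),
        "asterisk_count": text.count("*"),
        # Guard against empty input when forming the ratio feature.
        "punct_per_char": punctuation_count / char_count if char_count else 0.0,
    }
```

The same function would be applied once to the title and once to the body, which is how a per-text feature set doubles into the combined feature vector.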

Feature Selection

We have used 38 features in total. These features were extracted from both the title and the
news article. Previous research on this topic used only the title of the news for training. We
couldn't reach our desired accuracy using the title only.
The following table lists the features selected for the text, with the weight/importance of
each feature as calculated by the machine learning algorithm. The same features were selected
for the title but are not listed in the table.
Table Features with importance

Feature                                Importance
Word Count                             0.03223736
Character Count                        0.11497973
Punctuation Count                      0.0979961
Uppercase Count                        0.07135418
Gunning Fog                            0.0166595
Automated Readability Index            0.03313012
Linsear Write Formula                  0.01666274
Difficult Words                        0.0262762
Dale-Chall Readability Score           0.01767803
Punctuation Count / Character Count    0.21654589
Count of numbers                       0.01909209
Count of brackets                      0.00145834
Count of Asterisk (offensive words)    0.01956875

The above table shows which features matter most for news classification by giving
each a weight or score. For example, the ratio of punctuation count to character count
has the highest score (0.2165), i.e., an importance of 21.65%, which makes it the most
influential single feature for the classifier. The bracket count has the lowest importance,
which means that it helps least in classifying a news article as fake or authentic.

Normalization

We have used min-max normalization, in which the feature values are rescaled to the range
[0, 1]. There was a clear increase in our accuracy after applying this normalization method.
The formula is given as:
x' = (x - min(x)) / (max(x) - min(x))
where x is an original value and x' is the normalized value.
For example, if the punctuation count ranges over [10, 200], x' is calculated by subtracting
10 from each article's punctuation count and dividing by 190.
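
The rescaling described above can be sketched in a few lines. This is a plain-Python illustration rather than the project's code; the function name `min_max_normalize` is an assumption, and a constant feature column is mapped to zeros to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Punctuation counts ranging over [10, 200], as in the example above.
print(min_max_normalize([10, 105, 200]))  # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum should be taken from the training set only and reused for test data and for newly entered URLs, so that all inputs are scaled consistently.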

1.2 Training

After cleaning and normalizing the data, we moved on to training. We tried multiple
algorithms and techniques for training and selected the two (Random Forest and SVM) which
gave the highest accuracy. Training took up most of the project's development time, because
there were a great many combinations to try in order to achieve the highest accuracy with a
limited dataset. We tried changing the normalization method, training algorithm, number
of iterations, SVM kernel and number of features.

1.2.1 Number of Features

Following is the graph of Accuracy vs Number of Features.

Figure Accuracy vs Number of Features

The above graph clearly shows underfitting and overfitting. With 19 features,
we get the highest accuracy (85.7%) with the SVM linear kernel. Beyond that, the model starts
to overfit the data and test accuracy starts to decline.
Note that 19 features are used for the title and for the text separately, so 38 features are
used in total.
1.2.2 SVM Kernels

Following graph shows the difference in accuracy with different SVM kernels.

Figure Accuracy vs SVM Kernel

In the above graph, it can be seen that the linear kernel gives the highest accuracy (85.7%).
That is because most textual data is linearly separable, and a linear kernel works well when
the data is linearly separable or has a high number of features. In such cases, mapping the
data to a higher-dimensional space does not really improve performance (L Arras, F Horn et al., 2017).

1.2.3 Random Forest and Decision Tree

Following is graph of Accuracy vs Maximum Depth of Random Forest and Decision Tree.

Figure Accuracy vs Depth of Random Forest and Decision Tree


Here it can be seen that the maximum accuracy (83.8%) is reached by Random Forest at depth 10.
It can also be observed that the Decision Tree never surpasses the accuracy obtained by Random Forest.

1.2.4 Train/Test Split

Right now we’re splitting the data into 80/20, with 80 being training set and 20 being the test set.
Following is the graph that shows Accuracy vs Machine Learning models with different splits.

Figure Accuracy vs train/test split

It can be seen from this graph that highest accuracy is achieved when the dataset is split 80/20,
with 20% being test set. Phenomenon of over and underfitting can be observed in this graph as
well.
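
For illustration, an 80/20 split can be written with the standard library alone, as sketched below. The function name `split_80_20` and the fixed seed are assumptions; scikit-learn's `train_test_split`, which appears in the code listings later in this chapter, serves the same purpose.

```python
import random


def split_80_20(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle the data and return (x_train, x_test, y_train, y_test)."""
    indices = list(range(len(samples)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    test_size = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test
```

Shuffling before splitting matters when the dataset is ordered by source or by label, since an unshuffled split could otherwise leave one class under-represented in the test set.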

1.2.5 Feature Reduction

We have used PCA and LDA for feature reduction.

Following is the graph of Accuracy with PCA and LDA, and without feature reduction vs
number of features.
Note: Feature reduction is applied to Random Forest, and the accuracy reported is that of
Random Forest. The algorithm was trained multiple times, and in each run the accuracy of the
plain Random Forest was compared with its accuracy after PCA and LDA.
Figure Feature Reduction (Graph)

Above graph is given below in tabulated form

Table Feature Reduction (Tabulated)

Features   Without Feature Reduction   PCA     LDA
10         82.39                       83.5    80.2
15         82.767                      85      79.63
20         82.7                        84.9    79.9

It can be clearly seen that Random Forest with PCA always achieves higher accuracy than
Random Forest trained without feature reduction.

1.2.6 Summary of Training

As depicted in the previous graphs, we experimented with the data, features and machine
learning algorithms to achieve the desired accuracy. We also implemented neural networks, but
they gave very low accuracy (53%) because of the insufficient dataset size. We therefore
decided not to include neural networks in our work; we will add them in future when we have
a sufficiently large dataset. We hope that, once we have a large number of news articles,
deep learning will greatly increase the accuracy of our system.
The following is an overall summary of what has been discussed previously about training.
Decision Tree:
Table Decision Tree

Features   Depth   Accuracy %
6          5       35
-          8       38
-          10      41.12
-          14      39
13         5       53.5
-          8       55
-          10      58.78
-          14      54.2
19         5       68.12
-          8       69.5
-          10      77
-          14      74.2
Random Forest:

Table Random Forest

Features   Depth   Accuracy %
6          5       37
-          8       39.75
-          10      43
-          14      41.2
13         5       54.5
-          8       59
-          10      61.25
-          14      58
19         5       79.54
-          8       84
-          10      82.3
-          14      78
SVM:

Table SVM

Kernel     Features   Accuracy %
Default    6          39.25
-          13         51.8
-          19         56
-          25         58.7
Linear     6          68.12
-          13         82.35
-          19         85.7
-          25         84.2

Over-All:

This is the overall summary. Table 4.2-8 is constructed using the following values.

 No. of Features = 19 (For title and text separate, total 38)



 Maximum depth for Random Forest and Decision Tree = 10

 SVM Kernel = Linear

 Train/Test Split = 80/20
Table Accuracy vs Training Algorithms

Training Algorithm        Accuracy %
Random Forest             84
SVM                       85.7
Decision Tree             77
ANN (2 hidden layers)     51
ANN (5 hidden layers)     57
ANN (10 hidden layers)    53

Here we can see that SVM gives the highest accuracy among the machine learning
algorithms, for the reason described previously: textual data is very often linearly
separable, and SVM with a linear kernel is a good choice for linearly separable data.

1.3 Server-side Implementation

The main part of our server is the machine learning algorithms. The classification and web
back-end parts of the project are implemented in Python. Django is used for the back end, and
the scikit-learn (sklearn) library is used for training. We started our project with the
Decision Tree algorithm with 19 features and got 53% accuracy after splitting the dataset
80/20 into training and testing sets. After going through research papers and taking the
strong points from each of them, we were able to reach 85.7% accuracy. We combined Random
Forest and SVM (linear kernel) to get the highest accuracy. We wanted to use deep learning
and hoped to get much higher accuracy with it, but failed due to the small size of the
dataset. For the NLP part, we used NLTK and Textstat (Python APIs) for complex feature
extraction such as adverb count or text difficulty.
One of our main hurdles was scraping HTML pages properly. Online news articles are not
written in a standard form, e.g., news on Facebook is written in a different HTML format than
the news on bbc.com. We couldn't handle this variety ourselves, so we used the Python library
Newspaper3k, which is made specifically for scraping news articles.
1.4 Database Design

SQLite is chosen as our database. SQLite is a self-contained, high-reliability, embedded,
full-featured, public-domain SQL database engine. There are two main tables: Users and
URLs. The Users table keeps a record of the username, password, etc., so that a user can log
in to the system. The URLs table keeps a record of already processed news articles, so if any
new user enters the same URL again, the system doesn't have to repeat all the processing
and can return the result from the database. A Voting table is also maintained, which
keeps a record of the votes given to each URL.
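
The three tables and the cache lookup described above might look like the following sketch, written with Python's built-in `sqlite3` module. The table and column names, and the helper names `open_db`, `cached_label` and `store_label`, are assumptions made for illustration; the report does not list the actual schema.

```python
import sqlite3

# Hypothetical schema: one table each for users, processed URLs and votes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    username TEXT UNIQUE NOT NULL,
    password_hash TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS urls (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    label TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS votes (
    user_id INTEGER NOT NULL REFERENCES users(id),
    url_id INTEGER NOT NULL REFERENCES urls(id),
    agrees INTEGER NOT NULL,
    PRIMARY KEY (user_id, url_id)
);
"""


def open_db(path=":memory:"):
    """Open the database and create the tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def cached_label(conn, url):
    """Return the stored label for a URL, or None if it was never processed."""
    row = conn.execute("SELECT label FROM urls WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None


def store_label(conn, url, label):
    """Save (or overwrite) the classification result for a URL."""
    conn.execute(
        "INSERT OR REPLACE INTO urls (url, label) VALUES (?, ?)", (url, label)
    )
    conn.commit()
```

Checking `cached_label` before running the crawler and classifiers is what lets the system skip repeated processing for a URL it has already seen. The composite primary key on the votes table limits each user to one vote per URL.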

2. IMPLEMENTATION ENVIRONMENT

Our project is a study-based project, and the tool we found most suitable at the
undergraduate level is "Anaconda". It bundles several environments in which we can code; for
our project we used Jupyter Notebook, which is used for high-level Python programming.
Jupyter Notebook provides a browser-based environment, as it opens in the browser. It can
also connect to a kernel and a terminal.

3. PROGRAM/MODULE SPECIFICATION

The naive Bayes classifier is the algorithm we found most applicable for implementing fake
news detection, as it works on conditional probability and other major data mining concepts
that are used in this project. We also studied it in the 4th semester, which made the code
easy to understand.
The final output is generated with the help of Python's Matplotlib, which helps in
understanding the counts of various words in the circulated articles. It is shown below
with colour coding in the given matrix and its respective range.

4. CODING STANDARDS

Normally, a good software development organization requires its programmers
to adhere to a well-defined, standard style of coding called a coding
standard.

4.1 Variable Standards:

Our project implementation uses apt variable names that make the
domain easy to understand.

4.2 Comment Standards:

Comments increase the readability of our code and make it easy for a third
party to understand it. We have used comments wherever needed and have also
cited the online code we referred to.

Every code block and module starts with comments
briefly describing the code and its details.

Comments may also be used between and alongside lines of code to
explain one specific line or several lines.

In Python, '#' starts a single-line comment, and triple quotes (''') delimit
text spanning multiple lines. We have used both during programming.
5. Coding

LSTM.py:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.preprocessing import sequence
from collections import Counter
import os
import getEmbeddings2
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt

top_words = 5000
epoch_num = 5
batch_size = 64

def plot_cmat(yte, ypred):
    '''Plotting confusion matrix'''
    skplt.plot_confusion_matrix(yte, ypred)
    plt.show()

# Build the shuffled train/test files if any of them is missing
if not os.path.isfile('./xtr_shuffled.npy') or \
        not os.path.isfile('./xte_shuffled.npy') or \
        not os.path.isfile('./ytr_shuffled.npy') or \
        not os.path.isfile('./yte_shuffled.npy'):
    getEmbeddings2.clean_data()

# The text arrays hold Python strings (object dtype), so recent NumPy
# versions need allow_pickle=True to load them
xtr = np.load('./xtr_shuffled.npy', allow_pickle=True)
xte = np.load('./xte_shuffled.npy', allow_pickle=True)
y_train = np.load('./ytr_shuffled.npy')
y_test = np.load('./yte_shuffled.npy')

cnt = Counter()
x_train = []
for x in xtr:
    x_train.append(x.split())
    for word in x_train[-1]:
        cnt[word] += 1

# Storing most common words
most_common = cnt.most_common(top_words + 1)
word_bank = {}
id_num = 1
for word, freq in most_common:
    word_bank[word] = id_num
    id_num += 1

# Encode the sentences (words outside the word bank are dropped)
for news in x_train:
    i = 0
    while i < len(news):
        if news[i] in word_bank:
            news[i] = word_bank[news[i]]
            i += 1
        else:
            del news[i]

y_train = list(y_train)
y_test = list(y_test)

# Delete the short news
i = 0
while i < len(x_train):
    if len(x_train[i]) > 10:
        i += 1
    else:
        del x_train[i]
        del y_train[i]

# Generating test data
x_test = []
for x in xte:
    x_test.append(x.split())

# Encode the sentences
for news in x_test:
    i = 0
    while i < len(news):
        if news[i] in word_bank:
            news[i] = word_bank[news[i]]
            i += 1
        else:
            del news[i]

# Truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(x_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(x_test, maxlen=max_review_length)

# Convert to numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)

# Create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words+2, embedding_vecor_length,
                    input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epoch_num,
          batch_size=batch_size)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy= %.2f%%" % (scores[1]*100))

# Draw the confusion matrix
# (threshold the sigmoid output at 0.5; predict_classes is not available
# in recent Keras releases)
y_pred = (model.predict(X_test) > 0.5).astype("int32").ravel()
plot_cmat(y_test, y_pred)

getEmbeddings.py

import numpy as np
import re
import string
import pandas as pd
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim import utils
from nltk.corpus import stopwords

def textClean(text):
    # Keep letters, digits and a few symbols; lowercase; drop stop words
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = text.lower().split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]
    text = " ".join(text)
    return (text)

def cleanup(text):
    text = textClean(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text

def constructLabeledSentences(data):
    # Tag each document as 'Text_<index>' so its vector can be looked up later
    sentences = []
    for index, row in data.items():
        sentences.append(TaggedDocument(utils.to_unicode(row).split(),
                                        ['Text' + '_%s' % str(index)]))
    return sentences

def getEmbeddings(path, vector_dimension=300):
    data = pd.read_csv(path)

    # NaN != NaN, so this comparison finds rows with missing text
    missing_rows = []
    for i in range(len(data)):
        if data.loc[i, 'text'] != data.loc[i, 'text']:
            missing_rows.append(i)
    data = data.drop(missing_rows).reset_index().drop(['index','id'],axis=1)

    for i in range(len(data)):
        data.loc[i, 'text'] = cleanup(data.loc[i,'text'])

    x = constructLabeledSentences(data['text'])
    y = data['label'].values

    text_model = Doc2Vec(min_count=1, window=5,
                         vector_size=vector_dimension, sample=1e-4, negative=5,
                         workers=7, epochs=10, seed=1)
    text_model.build_vocab(x)
    text_model.train(x, total_examples=text_model.corpus_count,
                     epochs=text_model.epochs)

    # 80/20 train/test split over the document vectors
    train_size = int(0.8 * len(x))
    test_size = len(x) - train_size
    text_train_arrays = np.zeros((train_size, vector_dimension))
    text_test_arrays = np.zeros((test_size, vector_dimension))
    train_labels = np.zeros(train_size)
    test_labels = np.zeros(test_size)

    for i in range(train_size):
        text_train_arrays[i] = text_model.docvecs['Text_' + str(i)]
        train_labels[i] = y[i]

    j = 0
    for i in range(train_size, train_size + test_size):
        text_test_arrays[j] = text_model.docvecs['Text_' + str(i)]
        test_labels[j] = y[i]
        j = j + 1

    return text_train_arrays, text_test_arrays, train_labels, test_labels

getEmbeddings2.py

import numpy as np
import re
import string
import pandas as pd
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim import utils
from nltk.corpus import stopwords

def textClean(text):
    # Keep letters, digits and a few symbols; lowercase; drop stop words
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = text.lower().split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]
    text = " ".join(text)
    return (text)

def cleanup(text):
    text = textClean(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text

def constructLabeledSentences(data):
    sentences = []
    for index, row in data.items():
        sentences.append(TaggedDocument(utils.to_unicode(row).split(),
                                        ['Text' + '_%s' % str(index)]))
    return sentences

def clean_data():
    path = 'datasets/train.csv'
    vector_dimension=300

    data = pd.read_csv(path)

    # NaN != NaN, so this comparison finds rows with missing text
    missing_rows = []
    for i in range(len(data)):
        if data.loc[i, 'text'] != data.loc[i, 'text']:
            missing_rows.append(i)
    data = data.drop(missing_rows).reset_index().drop(['index','id'],axis=1)

    for i in range(len(data)):
        data.loc[i, 'text'] = cleanup(data.loc[i,'text'])

    # Shuffle the rows before splitting
    data = data.sample(frac=1).reset_index(drop=True)

    x = data.loc[:,'text'].values
    y = data.loc[:,'label'].values

    # 80/20 train/test split
    train_size = int(0.8 * len(y))
    test_size = len(x) - train_size

    xtr = x[:train_size]
    xte = x[train_size:]
    ytr = y[:train_size]
    yte = y[train_size:]

    np.save('xtr_shuffled.npy',xtr)
    np.save('xte_shuffled.npy',xte)
    np.save('ytr_shuffled.npy',ytr)
    np.save('yte_shuffled.npy',yte)

naive-bayes.py

from getEmbeddings import getEmbeddings
from sklearn.naive_bayes import GaussianNB
import numpy as np
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt

def plot_cmat(yte, ypred):
    '''Plotting confusion matrix'''
    skplt.plot_confusion_matrix(yte,ypred)
    plt.show()

# Compute the document embeddings once and cache them on disk
xtr,xte,ytr,yte = getEmbeddings("datasets/train.csv")
np.save('./xtr', xtr)
np.save('./xte', xte)
np.save('./ytr', ytr)
np.save('./yte', yte)

xtr = np.load('./xtr.npy')
xte = np.load('./xte.npy')
ytr = np.load('./ytr.npy')
yte = np.load('./yte.npy')

# Train Gaussian naive Bayes and report test accuracy
gnb = GaussianNB()
gnb.fit(xtr,ytr)
y_pred = gnb.predict(xte)
m = yte.shape[0]
n = (yte != y_pred).sum()
print("Accuracy = " + format((m-n)/m*100, '.2f') + "%") # 72.94%

plot_cmat(yte, y_pred)

neural-net-keras.py

from getEmbeddings import getEmbeddings
import matplotlib.pyplot as plt
import numpy as np
import keras
from keras import backend as K
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Embedding, Input, RepeatVector
from keras.optimizers import SGD
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import scikitplot.plotters as skplt

def plot_cmat(yte, ypred):
    '''Plotting confusion matrix'''
    skplt.plot_confusion_matrix(yte, ypred)
    plt.show()

# Compute the document embeddings once and cache them on disk
xtr,xte,ytr,yte = getEmbeddings("datasets/train.csv")
np.save('./xtr', xtr)
np.save('./xte', xte)
np.save('./ytr', ytr)
np.save('./yte', yte)

xtr = np.load('./xtr.npy')
xte = np.load('./xte.npy')
ytr = np.load('./ytr.npy')
yte = np.load('./yte.npy')

def baseline_model():
    '''Neural network with 3 hidden layers'''
    model = Sequential()
    model.add(Dense(256, input_dim=300, activation='relu',
                    kernel_initializer='normal'))
    model.add(Dropout(0.3))
    model.add(Dense(256, activation='relu', kernel_initializer='normal'))
    model.add(Dropout(0.5))
    model.add(Dense(80, activation='relu', kernel_initializer='normal'))
    model.add(Dense(2, activation="softmax", kernel_initializer='normal'))

    # gradient descent
    sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)

    # configure the learning process of the model
    model.compile(loss='categorical_crossentropy', optimizer=sgd,
                  metrics=['accuracy'])
    return model

model = baseline_model()
model.summary()
x_train, x_test, y_train, y_test = train_test_split(xtr, ytr, test_size=0.2,
                                                    random_state=42)
# One-hot encode the labels for the softmax output
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
encoded_y = np_utils.to_categorical((label_encoder.transform(y_train)))
label_encoder.fit(y_test)
encoded_y_test = np_utils.to_categorical((label_encoder.transform(y_test)))
estimator = model.fit(x_train, encoded_y, epochs=20, batch_size=64)
print("Model Trained!")
score = model.evaluate(x_test, encoded_y_test)
print("")
print("Accuracy = " + format(score[1]*100, '.2f') + "%") # 92.69%

# predict() returns the softmax class probabilities
probabs = model.predict(x_test)
y_pred = np.argmax(probabs, axis=1)

plot_cmat(y_test, y_pred)

neural-net-tf.py

import numpy as np
import tensorflow as tf
from getEmbeddings import getEmbeddings
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt
import pickle
import os.path

IN_DIM = 300
CLASS_NUM = 2
LEARN_RATE = 0.0001
TRAIN_STEP = 20000
tensorflow_tmp = "tmp_tensorflow/three_layer2"

def plot_cmat(yte, ypred):
    '''Plotting confusion matrix'''
    skplt.plot_confusion_matrix(yte,ypred)
    plt.show()

def dummy_input_fn():
    return np.array([1.0] * IN_DIM)

def model_fn(features, labels, mode):
    """The model function for tf.Estimator"""
    # Input layer
    input_layer = tf.reshape(features["x"], [-1, IN_DIM])
    # Dense layer1
    dense1 = tf.layers.dense(inputs=input_layer, units=300, \
        activation=tf.nn.relu)
    # Dropout layer1
    dropout1 = tf.layers.dropout(inputs=dense1, rate=0.4, \
        training=(mode == tf.estimator.ModeKeys.TRAIN))
    # Dense layer2
    dense2 = tf.layers.dense(inputs=dropout1, units=300, \
        activation=tf.nn.relu)
    # Dropout layer2
    dropout2 = tf.layers.dropout(inputs=dense2, rate=0.4, \
        training=(mode == tf.estimator.ModeKeys.TRAIN))
    # Dense layer3
    dense3 = tf.layers.dense(inputs=dropout2, units=300, \
        activation=tf.nn.relu)
    # Dropout layer3
    dropout3 = tf.layers.dropout(inputs=dense3, rate=0.4, \
        training=(mode == tf.estimator.ModeKeys.TRAIN))
    # Logits layer
    logits = tf.layers.dense(inputs=dropout3, units=CLASS_NUM)

    # prediction result in PREDICT and EVAL phases
    predictions = {
        # Class id
        "classes": tf.argmax(input=logits, axis=1),
        # Probabilities
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss for TRAIN and EVAL
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Configure the training Op
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=LEARN_RATE)
        train_op = optimizer.minimize(\
            loss=loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(\
            mode=mode, loss=loss, train_op=train_op)

    # Add evaluation metrics
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(\
            labels=labels, predictions=predictions["classes"])
    }
    return tf.estimator.EstimatorSpec(\
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

def main():
    # Get the training and testing data from getEmbeddings
    train_data, eval_data, train_labels, eval_labels = \
        getEmbeddings("datasets/train.csv")
    train_labels = train_labels.reshape((-1, 1)).astype(np.int32)
    eval_labels = eval_labels.reshape((-1, 1)).astype(np.int32)

    # Create the Estimator
    classifier = \
        tf.estimator.Estimator(model_fn=model_fn, model_dir=tensorflow_tmp)

    # Setup logging hook for prediction
    tf.logging.set_verbosity(tf.logging.INFO)
    tensors_to_log = {"probabilities": "softmax_tensor"}
    logging_hook = tf.train.LoggingTensorHook(
        tensors=tensors_to_log, every_n_iter=200)

    # Train the model
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": train_data},
        y=train_labels,
        batch_size=50,
        num_epochs=None,
        shuffle=True)
    classifier.train(
        input_fn=train_input_fn,
        steps=TRAIN_STEP,
        hooks=[logging_hook])

    # Evaluate the model and print results
    eval_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": eval_data},
        y=eval_labels,
        num_epochs=1,
        shuffle=False)
    eval_results = classifier.evaluate(input_fn=eval_input_fn)
    print(eval_results) # 81.42%

    # Draw the confusion matrix
    predict_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": eval_data},
        num_epochs=1,
        shuffle=False)
    predict_results = classifier.predict(input_fn=predict_input_fn)
    predict_labels = [label["classes"] for label in predict_results]
    plot_cmat(eval_labels, predict_labels)

if __name__ == "__main__":
    main()

svm.py

from getEmbeddings import getEmbeddings
import numpy as np
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt


def plot_cmat(yte, ypred):
    '''Plotting confusion matrix'''
    skplt.plot_confusion_matrix(yte, ypred)
    plt.show()


# Compute the embeddings once and cache the four arrays on disk as .npy files
xtr, xte, ytr, yte = getEmbeddings("datasets/train.csv")
np.save('./xtr', xtr)
np.save('./xte', xte)
np.save('./ytr', ytr)
np.save('./yte', yte)

# Reload the cached arrays
xtr = np.load('./xtr.npy')
xte = np.load('./xte.npy')
ytr = np.load('./ytr.npy')
yte = np.load('./yte.npy')

# Train an SVM with default parameters and predict on the test split
clf = SVC()
clf.fit(xtr, ytr)
y_pred = clf.predict(xte)

# m = number of test samples, n = number of misclassified samples
m = yte.shape[0]
n = (yte != y_pred).sum()
print("Accuracy = " + format((m-n)/m*100, '.2f') + "%")  # 88.42%

plot_cmat(yte, y_pred)
Chapter 7
Results and Discussion
• Take a Valid News Article URL
• Extract Relevant Text From URL
• Extracting Feature from Relevant Text
• Applying Machine Learning Algorithms for Classification
• Store Classification Result in Database
• User Login and Sign up
• User Feedback
• Verifying Results
• Retraining of Machine Learning Models
• Non-Functional Requirement Achieved
Results and Discussion

We integrated all the system components successfully. The system classified news articles
correctly with 85.7% accuracy. Our main goal was to develop a user-friendly web application
that classifies a news article as fake or credible simply by taking its URL from the user. We
achieved this goal by fulfilling all the user requirements that were crucial to the success of
the project.
There were also requirements related to performance. We improved the system iteratively to
maximise performance, and the results were satisfactory: the response time of the system was
adequately fast. Throughout, we applied software engineering processes to keep track of all
the functional and non-functional requirements.

1. Take a Valid News Article URL (FR-01)

This functional requirement was critical to our system. For all the system components to work
correctly, the system must get a valid news article URL from the user, from which it extracts
text. If the system does not get a news article URL, the web crawler raises an exception. To
fulfil this requirement we used a form input of URL type, so that it accepts only a URL as
input, and we used exception handling to catch the exception raised when the URL provided is
not that of a news article.
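
The two safeguards described above (input validation plus exception handling) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual view code; the names InvalidArticleURL, validate_url and handle_submission are hypothetical.

```python
from urllib.parse import urlparse


class InvalidArticleURL(ValueError):
    """Raised when the submitted string is not a usable http(s) URL."""


def validate_url(raw):
    """Return the stripped URL if it is an absolute http(s) URL, else raise."""
    candidate = raw.strip()
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidArticleURL(f"not a valid http(s) URL: {raw!r}")
    return candidate


def handle_submission(raw):
    """Validate first and report an error to the caller instead of crashing."""
    try:
        url = validate_url(raw)
    except InvalidArticleURL as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "url": url}
```

A check of this kind only confirms that the input is a well-formed URL; whether the page is really a news article can only be decided later, when the crawler tries to extract text from it.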

2. Extract Relevant Text from URL (FR-02)

This was a very challenging problem in our project. To classify a news article as fake or
credible we needed only the relevant text from the page source, to which our system applies
Natural Language Processing to build feature vectors. This was particularly hard because we
had to make a generic scraper that works for every news website. We used the newspaper3k
library to solve this problem, which made it easier to extract only the news article title
and text (body).
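
The project relies on newspaper3k for this step. Purely to illustrate what extracting only the title and body involves, the sketch below uses Python's standard html.parser instead. It is a simplified stand-in, not the project's scraper, and the names TitleAndParagraphs and extract_article are hypothetical. It keeps the title text and the text inside paragraph elements, and ignores scripts, styles and navigation.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._skip = 0   # nesting depth inside <script>/<style>
        self._buf = []   # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            # Collapse runs of whitespace and keep non-empty paragraphs only
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


def extract_article(html):
    """Return (title, body) where body joins the paragraphs with newlines."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), "\n".join(parser.paragraphs)
```

Real news pages are far messier than this sketch assumes (comments, captions and adverts are often wrapped in paragraph tags too), which is why a dedicated library was used in the project.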

3. Extracting Features from Relevant Text (FR-03)

The system uses nltk to apply NLP to the news article title and text to build feature vectors,
which are then fed to the machine learning algorithms. We used 38-dimensional feature vectors.
This step is necessary because it converts text into numeric form, which machine learning
algorithms can use directly.
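
To make the idea of converting text into numeric form concrete, the sketch below builds a small feature vector from a title and body. The 38 features used in the project are not listed in this chapter, so the five features here are hypothetical examples, and a simple regular expression stands in for nltk tokenisation.

```python
import re


def text_features(title, body):
    """Map an article to a fixed-length list of floats (hypothetical features)."""
    words = re.findall(r"[A-Za-z']+", body)
    n_words = len(words)
    n_chars = max(len(body), 1)  # avoid division by zero for an empty body
    return [
        float(len(title.split())),                                   # words in title
        float(n_words),                                              # words in body
        sum(len(w) for w in words) / n_words if n_words else 0.0,    # mean word length
        float(body.count("!")),                                      # exclamation marks
        sum(ch.isupper() for ch in body) / n_chars,                  # share of capitals
    ]
```

Every article yields a vector of the same length regardless of how long the text is, which is what allows the vectors to be stacked into a matrix and passed to a classifier.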

4. Applying Machine Learning Algorithms for Classification (FR-04)

This requirement is the backbone of our system. The success of the system depended on how
accurately our machine learning models predicted whether a news article is fake or not. To
achieve maximum accuracy with finite resources, we trained our machine learning models on a
labelled dataset of 7000 news articles. We used two different machine learning models, SVM
and Random Forest, for classification, and we combined the results of both models. We
achieved a maximum of 86% test accuracy.
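
The report does not state how the two model outputs were combined. One common option, shown here only as an assumption, is to average the probability of the "fake" class from each model and apply a threshold. The function name combine_predictions and the 0.5 threshold are hypothetical.

```python
def combine_predictions(p_fake_svm, p_fake_rf, threshold=0.5):
    """Average two 'fake' probabilities and turn the mean into a label.

    Assumption: each model reports a probability in [0, 1] that the article
    is fake. Returns (label, averaged_score).
    """
    for p in (p_fake_svm, p_fake_rf):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    score = (p_fake_svm + p_fake_rf) / 2
    label = "fake" if score >= threshold else "credible"
    return label, score
```

Averaging lets a confident model outweigh an uncertain one, whereas a plain vote between two models would have no way to break a disagreement.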

5. Store Classification Result in Database (FR-05)

We stored the result of every URL processed by our system in our database alongside its title
and text. This requirement helped us improve the performance of our system, as it eliminated
redundant work. If two users enter the same URL, the system processes it only once and stores
its classification result in the database for subsequent queries.
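
The look-up-before-processing behaviour described above can be sketched with the standard sqlite3 module. The project itself stores results through Django models, so the table layout and the names open_store and classify_once below are hypothetical; the classify argument stands for whatever function runs the crawler and the models.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store keyed by URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article ("
        " url TEXT PRIMARY KEY, title TEXT, body TEXT, label TEXT)"
    )
    return conn


def classify_once(conn, url, classify):
    """Return the stored label for url, running classify(url) only on a miss.

    classify must return a (title, body, label) tuple.
    """
    row = conn.execute(
        "SELECT label FROM article WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0]  # already processed: reuse the stored result
    title, body, label = classify(url)
    with conn:  # commit the new row
        conn.execute(
            "INSERT INTO article (url, title, body, label) VALUES (?, ?, ?, ?)",
            (url, title, body, label),
        )
    return label
```

The second request for the same URL costs one indexed lookup instead of a crawl, feature extraction and two model predictions.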

6. User Login and Sign up (FR-06)

We used the Django user model to implement this requirement. This was also a necessary
requirement, as users need to log in to give feedback on the classification results.

7. User Feedback (FR-07)

After logging in to the system, a user can give feedback on the classification results of all
processed URLs. We implemented this as a voting system in which a user can like or dislike a
URL's classification result. We also created a voting table in the database, associated with
both the user model and the URL model, to make sure that a user can vote only once for a
particular URL.
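
The one-vote-per-user rule can be enforced by the database itself with a uniqueness constraint on the (user, URL) pair. The sketch below shows the idea with the standard sqlite3 module rather than the project's Django models; the table and function names are hypothetical.

```python
import sqlite3


def open_votes(path=":memory:"):
    """Create a vote table that allows at most one row per (user, URL)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vote ("
        " user_id INTEGER NOT NULL,"
        " url TEXT NOT NULL,"
        " liked INTEGER NOT NULL,"
        " UNIQUE (user_id, url))"
    )
    return conn


def cast_vote(conn, user_id, url, liked):
    """Record a like/dislike; return False if this user already voted on url."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO vote (user_id, url, liked) VALUES (?, ?, ?)",
                (user_id, url, int(liked)),
            )
    except sqlite3.IntegrityError:
        return False  # the UNIQUE constraint rejected a second vote
    return True
```

Putting the rule in the schema means a duplicate vote is rejected even if two requests from the same user arrive at the same time.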

8. Verifying Results (FR-08)

One month after a URL is processed, the system automatically checks the rating given to it by
the users. If the rating is more than 50%, the system retains the classification result. If
the rating is less than 50%, the classification result is altered, as a poor rating indicates
an incorrect classification by the system.
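
The verification rule can be written as a small function. The text above does not say what happens at exactly 50% or when nobody has voted; in this sketch both cases keep the original label, which is an assumption. The function name verify_label and the label strings are hypothetical.

```python
def verify_label(label, likes, dislikes):
    """Keep or flip a 'fake'/'credible' label according to user approval."""
    total = likes + dislikes
    if total == 0:
        return label  # assumption: no feedback means nothing to verify against
    approval = likes / total
    if approval >= 0.5:  # assumption: an exact tie keeps the label
        return label
    # Poor rating: treat the classification as wrong and flip it
    return "credible" if label == "fake" else "fake"
```

For example, a result labelled "fake" with 1 like and 3 dislikes has 25% approval and would be changed to "credible".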

9. Retraining of Machine Learning Models (FR-09)

After a month, all the URLs that have been verified are added to our dataset along with their
classification results. All the machine learning models are then trained and saved again.
This ensures that the system improves with time as more data becomes available for training,
so that it evolves continuously and its accuracy keeps improving.

10. Non-Functional Requirements Achieved

Table: Performance Requirements

• The system should respond to a user query and return a result in less than 5 seconds.
• Web crawling should be fast.
• Feature extraction must be done in milliseconds.
• Time taken by the ML algorithms should be in milliseconds.
• The system should be able to handle multiple simultaneous requests.

Table: Security Requirements

• The user should be able to log in securely.
• The user password should be encrypted. It is stored in the database in encrypted form.
• The user password should be long and contain special characters.

Table: Usability Requirements

• The system should be user friendly and easy to use.
• The system should not need an extra instruction manual to use.
• The user should be able to learn the system in less than 5 minutes.
Chapter 8
Testing
• Testing Plan
• Testing Strategy
• Testing Methods
• Test Cases
Testing

Parameters such as the implementation environment, program modules and coding standards were
explained in the previous chapter; this chapter gives a brief account of how the software was
tested.

There are two principal motives for testing the software:

• To rectify errors in execution
• To check the viability of the software

Testing ensures that the software conforms to the required specification standards and
performs the task it is meant for. Testing was carried out in-house by a tester acting as a
novice user, who exercised the application in every possible way to find bugs and errors and
to check validation.

1. TESTING PLAN

Testing is carried out at the following three stages:

• Design
• Implementation
• Coding

1.1 Design Testing:

Design errors should be rectified at the initial stage. Such errors are very difficult to
repair after the software has been executed.

1.2 Implementation Testing:

Errors that occur at this stage cannot be overlooked, because they prevent the process from
going any further.

1.3 Coding Testing:

The coding procedure plays a significant role in software design. Improper coding of any
software can generate inconsistent results. Such errors may occur due to incorrect syntax or
faulty logic. If errors at the coding stage remain unnoticed, they may give rise to grave
failure of the system.

2. TESTING STRATEGY

A strategy for software testing integrates software test case design methods into a
well-planned series of steps that result in the successful construction of the software.
The strategy provides a roadmap that describes the steps to be conducted as part of testing,
when these steps are planned and then undertaken, and how much effort, time and resources
will be required.

• We have tested our whole system using the bottom-up testing strategy.

• Bottom-up testing involves integrating and testing the modules at the lower levels in the
hierarchy, and then working up the hierarchy of modules until the final module is tested.

• The bottom-up testing strategy shows how the actual testing is to be done for the whole
system, but it does not give any detail about the testing of each module.

• When all modules have been tested successfully, I will move one step up and continue with
the white box testing strategy.

• When all modules have been tested successfully, I will integrate those modules and test
the integrated system using the black box testing strategy.

Why Black Box Testing in my Project?

Whatever I implemented in my project was to be tested by my guide, Mr. Rajesh Davda, so black
box testing was directly involved.

3. TESTING METHOD

3.1 Unit Testing

Unit testing is meant for testing the smallest unit of the software. There are two
approaches, namely bottom-up and top-down.

In the bottom-up approach the last module is tested first, and testing then moves towards
the first module, while the top-down approach reverses this order. In the present work we
opted for the bottom-up approach.

The bottom up approach for the current project is carried out as shown in.
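
As an illustration of a unit test on one of the smallest units, the sketch below tests an accuracy helper modelled on the (m - n) / m * 100 computation in svm.py. The helper accuracy_percent, the test class and run_tests are hypothetical and are not part of the project's code; the test uses Python's standard unittest module.

```python
import unittest


def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sample")
    m = len(y_true)                                        # number of samples
    n = sum(1 for t, p in zip(y_true, y_pred) if t != p)   # misclassified
    return (m - n) / m * 100


class AccuracyPercentTest(unittest.TestCase):
    def test_all_correct(self):
        self.assertEqual(accuracy_percent([1, 0, 1], [1, 0, 1]), 100.0)

    def test_one_wrong_out_of_four(self):
        self.assertEqual(accuracy_percent([1, 0, 1, 1], [1, 0, 0, 1]), 75.0)

    def test_length_mismatch_is_rejected(self):
        with self.assertRaises(ValueError):
            accuracy_percent([1, 0], [1])


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccuracyPercentTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Once low-level helpers like this pass, the modules that call them can be tested in turn, which is the order the bottom-up approach prescribes.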

3.2 Integration Testing

Integration testing is meant to test all the modules together, because modules that function
correctly when tested individually may still fail to work together and lead to unexpected
outcomes.

3.3 Validation Testing

After integration testing, the software is completely assembled as a package, interfacing
errors have been uncovered and corrected, and validation testing may begin. Validation can
be defined in many ways, but a simple definition is that validation succeeds when the
software functions in a manner that can reasonably be expected by the customer.

3.4 Storage Testing

The dataset of the system has to be stored on the hard disk, so the storage capacity of the
hard disk should be sufficient to hold all the data required for the efficient running of
the software.

4. TEST CASES

4.1 Purpose

The purpose of this project is to use machine learning algorithms to detect fake news on
online social media that circulates as if it were real, much like clickbait. It aims to
enhance the user experience on online social media platforms and to save users the time
they might otherwise spend on fake news.
Chapter 9
Limitations and Future Enhancement

• Limitations and Future Enhancement


1.1 LIMITATIONS:

Although we did our best in developing this system, every system has limitations, and ours
is no exception. Some of its limitations are:

• The present software relies on high-quality external hardware at the input level. If the
quality of the input document is poor, the output may suffer as a result.
• The platform used is Anaconda (Jupyter Notebook), which is open-source software. This
limits the cost of the project.
• Limited dataset
• Limited processing speed
• Compared with real-world applications, our system is not yet directly applicable, as it
is entirely study-based.

1.2 FUTURE ENHANCEMENT:

There is always scope for enhancement in any developed system, especially as the iterative
nature of our project allows us to rethink the development method and adopt changes. Listed
below are some of the changes possible in the future to increase the adaptability and
efficiency of the system:

• Increase the size of the dataset.
• Increase the processing speed.
• Bring the system as close as possible to real-world use.
• Improve the quality of the dataset.
Chapter 10
Conclusion and Discussion

• Self analysis and Project viabilities
• Problem encountered and possible solutions
• Summary of project
1. SELF ANALYSIS AND PROJECT VIABILITIES

This work shows a simple approach to fake news detection using a naive Bayes classifier.
The approach was implemented as a software system and tested against a data set of
Facebook news posts. We achieved a classification accuracy of approximately 74% on the
test set, which is a decent result considering the relative simplicity of the model.
These results may be improved in several ways, which are described in the article as well.
The results obtained suggest that the fake news detection problem can be addressed with
artificial intelligence methods.

2. PROBLEM ENCOUNTERED AND POSSIBLE SOLUTIONS:

2.1 Resource Availability:

An important part of checking the veracity of a specific claim is to evaluate the stance
that different news sources take towards the assertion. Automatic stance evaluation, i.e.
stance detection, would arguably facilitate the process of fact checking.

2.2 Requirement Understanding:

Automatic fake news detection is a challenging problem in deception detection, and it has
tremendous real-world political and social impacts. However, statistical approaches to
combating fake news have been dramatically limited by the lack of labeled benchmark
datasets.

2.3 Problem Encountered and Possible Solutions:

Problem:

Fake news

Solution:

To detect fake news and analyze it.

3. SUMMARY OF PROJECT

The scourge of cyberbullying has assumed alarming proportions, with an ever-increasing
number of adolescents admitting to having dealt with it either as a victim or as a
bystander.

Anonymity and the lack of meaningful supervision in the electronic medium are two factors
that have exacerbated this social menace.

Fake news is a phenomenon that is having a significant impact on our social life, in
particular in the political world. Fake news detection is an emerging research area which
is gaining interest but involves some challenges due to the limited amount of resources
available.

We propose in this paper a fake news detection model that uses machine learning
techniques. We investigate and compare two different feature extraction techniques and
six different machine learning classification techniques. Experimental evaluation yields
the best performance using Term Frequency-Inverse Document Frequency (TF-IDF) as the
feature extraction technique and Linear Support Vector Machine (LSVM) as the classifier,
with an accuracy of more than 74%.

We find that binary classifiers for individual labels outperform multiclass classifiers.
Our findings show that the detection of textual cyberbullying can be tackled by building
individual topic-sensitive classifiers.
