Complete Final Sem Report PDF
March 2020
Table of Contents
1) Introduction
1.1 Purpose
1.2 Scope
1.3 Objective
1.4 Technology and Tools
2) Project Management
2.1 Project Planning
2.2 Project Scheduling
2.3 Risk Management
3) System Requirements Study
3.1 User Characteristics
3.2 Hardware and Software Requirements
3.3 Constraints, Assumptions and Dependencies
4) System Analysis
4.1 Study of Current System
4.2 Problems and Weaknesses of Current System
4.3 Requirements of New System
4.4 Feasibility Study
4.5 Requirements Validation
4.6 Features of New System
4.7 Data Flow Diagram
4.8 ER Diagram
4.9 UML Diagrams
4.10 Selection of Hardware and Software and Justification
5) System Design
5.1 Overview
5.2 Product Functions
5.3 User Characteristics
5.4 Constraints
5.5 User Requirements
5.6 Performance Requirements
5.7 Code Snippet
6) Proposed Solution and Code Implementation
6.1 Proposed Solution
6.2 Implementation Environment
6.3 Program/Module Specification
6.4 Coding Standards
6.5 Coding
7) Results and Discussion
7.1 Take a Valid News Article URL
7.2 Extract Relevant Text From URL
7.3 Extracting Features from Relevant Text
7.4 Applying Machine Learning Algorithms for Classification
7.5 Store Classification Result in Database
7.6 User Login and Sign Up
7.7 User Feedback
7.8 Verifying Results
7.9 Retraining of Machine Learning Models
7.10 Non-Functional Requirements Achieved
8) Testing
8.1 Testing Plan
8.2 Testing Strategy
8.3 Testing Methods
8.4 Test Cases
9) Limitations and Future Enhancement
9.1 Limitations and Future Enhancement
10) Conclusion and Discussion
10.1 Self Analysis and Project Viabilities
10.2 Problems Encountered and Possible Solutions
10.3 Summary of Project
11) References
Acknowledgement
We have put a great deal of effort into this project. However, it would not have been possible without the kind support and help of many individuals and organizations. We would like to extend our sincere thanks to all of them.
We are highly indebted to our guide, Om Prakash Sir, for his guidance and constant supervision, as well as for providing the necessary information regarding the project titled “Fake News Detection System”. We would also like to express our gratitude to our classmates for their kind co-operation and encouragement, which helped us in the completion of this project.
We also say a big thank you to our parents for their support; without them we could do nothing, not just in this project but in life. We are thankful to our families for their support.
The feeling of gratefulness for anyone’s help arises directly from the bottom of the heart; a small but important and timely help can prove to be a milestone in one’s life. We are very thankful to the Almighty, God, for giving us such good people and for providing everything before we needed it; we always feel that without Him we are nothing.
List of Figures
Gantt Chart
Component Diagram
E-R Diagram
List of Tables
Performance Requirements
Decision Tree
Random Forest
SVM
Performance Requirements
Security Requirements
Usability Requirements
Chapter 1
Introduction
Purpose
Scope
Objective
PURPOSE:
The purpose of this project is to use machine learning algorithms to detect fake news on online social media that circulates as if it were real, much like clickbait.
It will try to enhance the user experience on online social media platforms and will also save users a lot of the time they might otherwise spend on fake news.
SCOPE:
The scope of this project is very diverse: it ranges from various online social media platforms such as Facebook, Twitter and Instagram to fake blogs and fake websites that deceive users in one way or another.
OBJECTIVE:
This is a standalone application that uses a dataset consisting of a mixture of information: it contains fake news, real news, and news that appears real but is fake.
TECHNOLOGY AND TOOLS:
1). Jupyter Notebook:
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, etc. The Notebook is a server-client application that allows editing and running notebook documents via a web browser. It can be executed on a local desktop requiring no internet access, or it can be installed on a remote server and accessed through the internet.
2). Anaconda:
Anaconda is a scientific Python distribution. It has no IDE of its own, but it bundles a whole set of Python packages that are commonly used by people doing scientific computing and/or data science in Python.
It provides a single download and an install program/script that installs all the packages in one go. The alternative is to install Python and then individually install all the required packages using pip. Additionally, Anaconda provides its own package manager (conda) and package repository, while still allowing installation of packages from PyPI using pip if a package is not in the Anaconda repositories. It is especially useful on Microsoft Windows, as it can easily install packages that would otherwise require you to install C/C++ compilers and libraries if you were using pip. It is an added advantage that conda, in addition to being a package manager, is also a virtual environment manager, allowing you to create independent development environments and switch from one to the other (similar to virtualenv).
3). Python:
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics.
Its high-level built-in data structures, combined with dynamic typing and binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.
Python’s simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. It supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn’t catch the exception, the interpreter prints a stack trace. A source-level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on.
4). Dataset:
A dataset is a collection of data. Most commonly, a dataset corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as the height and weight of an object, for each member of the dataset. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows. The term dataset may also be used more loosely to refer to the data in a collection of closely related tables, corresponding to a particular experiment or event.
5). Machine learning:
Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science.
Machine learning explores the construction of algorithms that can learn from and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data; they build a model from sample inputs.
Machine learning is used where designing and programming explicit algorithms is infeasible. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, and fake news detection on online social media.
6). Deep Learning:
Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised.
Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, social network filtering, machine translation, bioinformatics, drug design and board game programs, where they have produced results comparable to, and in some cases exceeding, those of human experts.
Chapter 2
Project Management
Project Planning
Project Scheduling
Risk Management
2.0. PROJECT MANAGEMENT
PROJECT PLANNING
Project planning is concerned with identifying and measuring the activities, milestones and deliverables produced by the project. Project planning is undertaken and completed sometimes even before any development activity starts. Project planning consists of the following essential activities:
Estimating the basic attributes of the project, such as cost, duration and effort; the effectiveness of the subsequent planning activities is based on the accuracy of these estimations.
Project management involves planning, monitoring and control of the process, and of the events that occur as the software evolves from a preliminary concept to an operational implementation. Cost estimation is a related activity that is concerned with the resources required to accomplish the project plan.
Requirement analysis
Coding
Testing
Deployment
1.2) Milestones and Deliverables:
As software is intangible, progress information can only be provided as documents that describe the state of the software being developed. Without this information it is impossible to judge progress at different phases, and therefore schedules cannot be determined or updated.
A milestone is an end point of a software process activity. At each milestone there should be a formal output, such as a report, that can be presented to the guide. Milestones are the completion of the outputs for each activity. Deliverables are the requirements definition and the requirements specification.
A milestone represents the end of a distinct, logical stage in the project. Milestones may be internal project results that are used by the project manager to check progress. Deliverables are usually milestones, but the reverse need not be true. We have divided the software process into activities with the following milestones that should be achieved.
1.3) Roles and Responsibilities:
This phase defines the roles and responsibilities of each and every member involved in developing the system. To develop this system, only one person was involved in working on the whole application, and the same person was responsible for each and every part of developing the system. Our team structure is a single-control team organization (chief programmer organization), consisting of me and my guide.
1.4) Group Dependencies:
The structure chosen for the system is the chief programmer structure. In this structure, a senior engineer provides the technical leadership and is designated as the chief programmer. The chief programmer partitions the task into small activities and assigns them to me on a time-deadline basis. He also verifies and integrates the products developed by me, and I work under the constant supervision of the chief programmer. For this system, the reporting entity is myself, and the role of chief programmer is played by my internal guide.
PROJECT SCHEDULING
RISK MANAGEMENT
Risk management consists of a series of steps that help a software development team understand and manage uncertain problems that may arise during the course of software development and can plague a software project.
Risks are dangerous conditions or potential problems for the system which may damage the system's functionality to a degree that would not be acceptable at any cost. So, in order to make our system stable and deliver its full performance, we must identify those risks, analyze their occurrence and effects on our project, and prevent them from occurring.
Risk identification is the first systematic attempt to specify risks to the project plan, scheduling, resources and project development. It may be carried out as a team process using a brainstorming approach.
People Risks: These risks concern the team and its members who are taking part in developing the system.
Lack of knowledge.
Lack of clear vision.
Poor communication between people.
Tools Risks: These risks concern the tools used to develop the project.
Tools containing viruses.
General Risks: These risks concern mentality and resources.
Rapidly changing datasets.
Lack of resources can cause great harm to the efficiency and timeline of the project.
Changes in the dataset can cause great harm to the implementation and schedule of developing the system.
Insufficient planning and task identification.
Decision-making conflicts.
3.2) Risk Analysis
Risk assessment
Risk management
Evaluates which risks identified in the risk assessment process require management and
selects and implements the plans or actions that are required to ensure that those risks are
controlled.
Risk communication
Involves an interactive dialogue between the guide and us, which actively informs the other processes.
The steps taken for risk communication are as follows:
All the possible risks are listed out during communication, and the project is developed taking care of those risks.
Chapter 3
System Requirements Study
User Characteristics
Hardware and Software Requirements
Constraints, Assumptions and Dependencies
USER CHARACTERISTICS
Admin:
Show project and user full details.
Manage users.
Manage projects.
Manage datasets.
User:
Upload pieces of news.
Circulation of news.
Analyze the news.
HARDWARE AND SOFTWARE REQUIREMENTS
Table: Software Requirements
Devices               Description
Scripting Language    Python
CONSTRAINTS
1.3.1) Hardware Limitations
The major hardware limitations faced by the system are as follows:
If appropriate hardware, such as a suitable processor, sufficient RAM or adequate hard disk space, is not available, the system cannot perform as intended.
DEPENDENCIES
The entire project depends on various Python libraries. The libraries are as follows:
Scikit-learn: Simple and efficient tools for data mining and data analysis. Accessible to everybody, and reusable in various contexts. Built on NumPy, SciPy, and matplotlib. Open source and commercially usable (BSD license).
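As a hedged illustration (package versions are not specified in this report), these dependencies can be installed into the Anaconda environment with conda or pip and verified as follows.
check_dependencies.py (illustrative sketch):
# Install, if needed, with:
#   conda install scikit-learn numpy scipy matplotlib
# or:
#   pip install scikit-learn numpy scipy matplotlib
import sklearn
import numpy
import scipy
import matplotlib

for name, module in [("scikit-learn", sklearn), ("NumPy", numpy),
                     ("SciPy", scipy), ("matplotlib", matplotlib)]:
    print(name, module.__version__)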
Chapter 4
System Analysis
Study of Current System
Problem and Weaknesses of
Current System
Requirements of New System
Feasibility Study
Requirements Validation
Features of New System
Data Flow Diagram
ER Diagram
UML Diagrams
Selection of Hardware and
Software and Justification
STUDY OF CURRENT SYSTEM
Current systems focus on classifying online reviews and publicly available social media posts.
PROBLEMS AND WEAKNESSES OF CURRENT SYSTEM
The current system is undoubtedly well-designed for detecting deception, but it has the following limitations:
REQUIREMENTS SPECIFICATION
Requirements specification adds further information to the requirements
definition.
3.1) Algorithm Requirements
Dataset
Input
Appropriate functions
Training
Efficiency
Output
3.2) System Requirements
Usability:
The system should be easily able to detect the deception in blogs or news
in online social media.
Efficiency:
The system should provide easy and fast response.
FEASIBILITY STUDY
An important outcome of the preliminary investigation is the determination of whether the system is feasible or not. The main aim of the feasibility study activity is to determine whether it would be financially and technically feasible to develop the project.
The feasibility study activity involves the analysis of the problem and
collection of all relevant information relating to the product such as the
different datasets which would be input to the system, the processing
required to be carried out on these datasets, the output required to be
produced by the system as well as the various constraints on the
behaviors of the system.
4.1) Does the system contribute to the overall objectives of the organization?
Validity checks
Check whether the information entered is in a valid format.
Consistency checks
Requirements in the document should not conflict with one another.
Completeness checks
Realism checks
Verifiability
The requirements are given in a verifiable manner (e.g., using quantifiable measures) to reduce disputes between the client and the developer.
The use case related to user feedback is shown in Figure 2. In order for a user to give feedback on the accuracy of a classification, the user must sign up. The system displays all the recently processed/classified URLs to the user. If the user is logged in, he can choose to vote for any classification result. After some time (one week), the system will check the votes for the classification, and based on the votes the system will be able to verify whether the classification was correct or not. If the classification is verified, the system adds the features of the correct classification to the training set.
10.3 Use Case Diagram 3
Figure 3 shows the use case related to the basic use of the software. The user enters a news URL. The system verifies the URL, extracts the relevant text from the URL using a web crawler, and then classifies the news article as fake or credible using machine learning algorithms. After the result is computed, the user can view it.
11 SELECTION OF HARDWARE AND SOFTWARE
The tables below give an idea of the hardware and software required for the system, including the client-side requirements.
Table: Hardware Selection (Devices | Description)
Table: Software Selection (Software | For which purpose)
Chapter 5
System Design
Overview
Product Function
User Characteristics
Constraints
User Requirements
Performance Requirements
Code Snippet
1. Overview
The system works on already-trained machine learning algorithms. Multiple machine learning algorithms have been trained by providing a dataset of both fake and authentic news. A summary of the overall procedure is as follows.
2 Product Functions
1. A URL of a news article must be entered.
2. NLP is performed on the text extracted from the URL, and the relevant features are extracted.
3 User Characteristics
Moderator: The moderator will monitor the ratings submitted by the users, to maintain the credibility of the ratings.
Administrator: Will maintain the overall aspects of web application and will be
responsible for giving users appropriate roles and authority.
User: The main actor using the web application to analyze the URLs.
4 Constraints
1. Our software will never assure the authenticity of the result on its own; for this, we need user feedback.
2. Our software will only be available in English, and news articles provided to the software should also be in English.
3. We do not have access to a huge amount of data for training of the machine learning models.
4. The software will not work without an internet connection.
5. Our software does not perform well when an article's body is plain, short and emotionless.
5 User Requirements
Following are the user requirements that describe what the user expects the software to do.
6 Performance Requirements
Table: Performance Requirements (ID | Performance Requirement)
7. CODE SNIPPET:
The Jupyter Notebook is used for implementing our machine learning algorithms. The project contains many files, including dataset files and Python notebooks, with extensions such as “.tsv” and “.ipynb”.
We also used Python libraries such as torch and the well-known NumPy. A small-scale implementation of our project is shown below.
Dataset files
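The actual notebook cells are not reproduced in this section; purely as a hedged sketch of how such a dataset file might be loaded in a Jupyter Notebook cell (the path datasets/train.tsv and the column layout are assumptions), consider:
load_dataset.py (illustrative sketch):
import pandas as pd

# Load a tab-separated dataset file of the kind described above.
# The path "datasets/train.tsv" is assumed for illustration only.
df = pd.read_csv("datasets/train.tsv", sep="\t")
print(df.shape)    # number of labeled articles and columns
print(df.head())   # first few rows of the dataset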
Chapter 6
Proposed Solution and Code Implementation
1.1 Methodology
Developing an automatic fake news detector was a challenging problem. To make sure that we accomplished this task efficiently, without facing major problems that would have caused major redesigns and re-engineering of the software architecture in a time- and cost-constrained project environment, we started off by developing the SRS (Software Requirements Specification) and a detailed design of the system. A Gantt chart and a work breakdown structure were created in that phase to monitor the project and to determine when each phase should start or end.
After that, we started to gather a dataset for training purposes. We were able to gather a dataset of about 6,500 labeled news articles from multiple sources. We then researched which machine learning algorithms to apply and what kind of NLP to use. We used SVM and Random Forest as our machine learning algorithms, which gave us an accuracy of 85.7%.
When a URL is entered, the text and title of the news from that URL are scraped using a web crawler.
The same NLP is applied to the extracted text and title, and 38 features are fed to the machine learning algorithms.
We have combined the strong points of both algorithms, which increases our accuracy: SVM is better at detecting fake news, while Random Forest is better for authentic news.
When a user enters a URL and checks the authenticity of the news, it gets stored in the database.
The system maintains a list of already processed URLs which users can see.
Users can also give feedback on any already processed news article using a dislike button, if the news has been predicted wrongly by our algorithm.
Predicted news with low user ratings is then manually reviewed.
After some time, these already processed news articles are fed back to the machine learning algorithms.
The size of our dataset keeps increasing and the system keeps getting smarter with time.
Feature Selection
We used 38 features in total. These features were extracted from both the title and the news article body. Previous research on this topic used only the title of the news for training; we could not reach our desired accuracy using the title alone.
Following is the table of features selected for the text, with the weight/importance of each feature as calculated by the machine learning algorithm. The same features were selected for the title but are not repeated in the table.
Table: Features with importance (Feature | Importance)
The above table shows which features are most important for news classification by giving them a weight or score. For example, according to this table, the ratio of punctuation count to character count has the highest score (0.2165). It means that this feature has 21.65% importance and the highest ability to classify the news, while bracket count has the least importance, which means that this feature helps least in classifying a news article as fake or authentic.
Normalization
We used normalization in which we rescaled the feature values to the range [0, 1]. There was a quite obvious increase in our accuracy after the use of this normalization method.
The formula is given as:
x' = (x - min(x)) / (max(x) - min(x))
where x is an original value and x' is the normalized value.
For example, if the punctuation count ranges over [10, 200], x' can be calculated by subtracting 10 from each news article's punctuation count and dividing by 190.
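A small sketch of this min-max rescaling is shown below; the sample punctuation counts are made up purely to mirror the example above.
normalize.py (illustrative sketch):
import numpy as np

def min_max_normalize(values):
    """Rescale feature values to [0, 1]: x' = (x - min(x)) / (max(x) - min(x))."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

# Example from the text: punctuation counts ranging from 10 to 200.
punctuation_counts = [10, 57, 105, 200]
print(min_max_normalize(punctuation_counts))   # 10 maps to 0.0 and 200 maps to 1.0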
1.2 Training
After cleaning and normalizing the data, we proceeded to training. We tried multiple algorithms and techniques for training, and selected the two (Random Forest and SVM) which gave the highest accuracy. Training took most of the project development time, because we had endless combinations and possibilities to try out in order to achieve the highest accuracy with a limited dataset size. We tried changing the normalization method, the training algorithm, the number of iterations, the kernel in SVM and the number of features.
The graph above clearly shows the phenomenon of over- and underfitting. With 19 features, we get the highest accuracy (85.7%) using the SVM linear kernel. After that, the model starts to overfit the data and test accuracy starts to decline.
Note that 19 features are used for the title and the text separately, so in total 38 features are used.
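The full training scripts appear in the Coding section; purely as a hedged sketch of the procedure described here (80/20 split, linear-kernel SVM and Random Forest on the 38-dimensional feature vectors), the scikit-learn calls could look roughly as follows. The feature/label arrays, their file names and the Random Forest depth are assumptions.
train_models.py (illustrative sketch):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# features: (n_articles, 38) matrix of normalized feature values (assumed file names)
# labels:   0 = credible, 1 = fake (assumed encoding)
features = np.load("features.npy")
labels = np.load("labels.npy")

# 80/20 split into training and test sets, as described in the report.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

models = {
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(max_depth=8),
}
for name, clf in models.items():
    clf.fit(x_train, y_train)
    print(name, "test accuracy:", clf.score(x_test, y_test))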
1.2.2 SVM Kernels
The following graph shows the difference in accuracy with different SVM kernels.
In the above graph, it can be seen that the linear kernel gives the highest accuracy (85.7%). That is because most textual data is linearly separable, and the linear kernel works really well when data is linearly separable or has a high number of features: mapping the data to a higher-dimensional space does not really improve the performance (L. Arras, F. Horn et al., 2017).
Following is the graph of accuracy vs. maximum depth of Random Forest and Decision Tree.
Right now we are splitting the data 80/20, with 80% being the training set and 20% being the test set. Following is the graph that shows accuracy vs. machine learning models with different splits.
It can be seen from this graph that the highest accuracy is achieved when the dataset is split 80/20, with 20% being the test set. The phenomenon of over- and underfitting can be observed in this graph as well.
Following is the graph of accuracy with PCA and LDA, and without feature reduction, vs. the number of features.
Note: Feature reduction was applied to Random Forest, and the accuracy of Random Forest was used. The algorithm was trained multiple times, and the accuracy of plain Random Forest in each try was compared with Random Forest's accuracy after PCA and LDA.
Figure: Feature Reduction (Graph)
It can be clearly seen that the accuracy with PCA has consistently been greater than that of the Random Forest trained without feature reduction.
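As a hedged sketch of this comparison (the feature/label file names and the number of retained PCA components are assumptions, and LDA projects two-class data onto a single axis), the feature-reduction experiment could be reproduced roughly as follows.
feature_reduction.py (illustrative sketch):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier

def rf_accuracy(xtr, xte, ytr, yte):
    """Train a plain Random Forest and return its test accuracy."""
    clf = RandomForestClassifier()
    clf.fit(xtr, ytr)
    return clf.score(xte, yte)

features = np.load("features.npy")   # assumed file names, as in the earlier sketch
labels = np.load("labels.npy")
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

pca = PCA(n_components=15)           # retained components chosen for illustration
lda = LinearDiscriminantAnalysis()   # projects onto at most (n_classes - 1) axes

acc_plain = rf_accuracy(x_train, x_test, y_train, y_test)
acc_pca = rf_accuracy(pca.fit_transform(x_train), pca.transform(x_test),
                      y_train, y_test)
acc_lda = rf_accuracy(lda.fit_transform(x_train, y_train), lda.transform(x_test),
                      y_train, y_test)
print("No reduction:", acc_plain, "PCA:", acc_pca, "LDA:", acc_lda)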
As depicted in the previous graphs, we experimented with the data, the features and the machine learning algorithms to achieve the desired accuracy. We also implemented neural networks, but they gave really low accuracy (53%) due to the insufficient dataset size, so we decided not to include a neural network in our work; we will add it in the future when we have a sufficiently large dataset. We hope that when we have a large number of news articles, deep learning will greatly increase the accuracy of our system.
Following is the overall summary of what has been discussed previously related to training the data.
Decision Tree:
Table: Decision Tree
Features   Max Depth   Accuracy (%)
6          5           35
6          8           38
6          10          41.12
6          14          39
13         5           53.5
13         8           55
13         10          58.78
13         14          54.2
19         5           68.12
19         8           69.5
19         10          77
19         14          74.2
Random Forest:
Table: Random Forest
Features   Max Depth   Accuracy (%)
6          5           37
6          8           39.75
6          10          43
6          14          41.2
13         5           54.5
13         8           59
13         10          61.25
13         14          58
19         5           79.54
19         8           84
19         10          82.3
19         14          78
SVM:
Table: SVM
Kernel     Features   Accuracy (%)
Default    6          39.25
Default    13         51.8
Default    19         56
Default    25         58.7
Linear     6          68.12
Linear     13         82.35
Linear     19         85.7
Linear     25         84.2
Over-All:
This is the overall summary; Table 4.2-8 is constructed considering the following values.
Algorithm        Highest Accuracy (%)
Random Forest    84
SVM              85.7
Decision Tree    77
Here we can see that SVM gives us the highest accuracy among the machine learning algorithms; the reason has been described previously. SVM performs well on textual data because textual data is almost always linearly separable, and SVM is a good choice for linearly separable data.
The main part of our server is the machine learning algorithms. The classification and web back-end parts of the project have been implemented in Python: Django is used for the back-end, and the Sklearn library is used for training. We started our project with the Decision Tree algorithm with 19 features, and got 53% accuracy after splitting the dataset 80/20 into training and testing sets. After going through research papers and combining strong points from each of them, we were able to reach 85.7% accuracy. We combined Random Forest and SVM (linear kernel) to obtain the highest accuracy. We wanted to use deep learning and hoped to get much higher accuracy with it, but failed due to the small size of the dataset. For the NLP part, we used NLTK and Textstat (Python APIs) for complex feature extraction such as adverb count or text difficulty.
One of our main hurdles was scraping HTML pages properly. Online news articles are not written in a standard form; for example, news on Facebook is written in a different HTML format than news on bbc.com. We could not tackle this generality ourselves, so we used Python's Newspaper3k library, which is made specifically for scraping news articles.
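A minimal sketch of the Newspaper3k calls used for this kind of extraction is shown below; the example URL is only a placeholder.
scrape_article.py (illustrative sketch):
from newspaper import Article

def extract_title_and_text(url):
    """Download a news page and return only its title and article body."""
    article = Article(url)
    article.download()
    article.parse()
    return article.title, article.text

# Placeholder URL; any news article URL entered by the user is handled the same way.
title, text = extract_title_and_text("https://www.bbc.com/news/example")
print(title)
print(text[:200])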
1.4 Database Design
SQLite was chosen for our database. SQLite is a self-contained, high-reliability, embedded, full-featured, public-domain SQL database engine. There are two main tables, Users and URLs. The User table keeps a record of the password, username, etc., so that the user can log in to the system, while the URL table keeps a record of already processed news articles, so that if a new user enters the same URL again, the system does not have to go through all the processing again and can simply return the result from the database. A Voting table has also been maintained, which keeps a record of the votes given to each URL.
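A hedged sketch of how these tables could be expressed as Django models is given below; the model and field names are assumptions, since the actual schema is not reproduced in this report.
models.py (illustrative sketch):
from django.db import models
from django.contrib.auth.models import User  # Django's built-in user model


class ProcessedURL(models.Model):
    # One row per already-processed news article URL (assumed fields).
    url = models.URLField(unique=True)
    title = models.TextField()
    text = models.TextField()
    classified_fake = models.BooleanField()
    processed_at = models.DateTimeField(auto_now_add=True)


class Vote(models.Model):
    # A user may vote only once per URL (enforced by unique_together).
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    processed_url = models.ForeignKey(ProcessedURL, on_delete=models.CASCADE)
    agrees = models.BooleanField()  # like / dislike of the stored classification

    class Meta:
        unique_together = ("user", "processed_url")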
2. IMPLEMENTATION ENVIRONMENT
As our project is a study-based project, the best tool to use at the undergraduate level is “Anaconda”. It consists of different modules in which we can code, but for our project we have used Jupyter Notebook, which is used for high-level Python programming. Jupyter Notebook provides a browser environment, as it opens up in the browser; it can also connect to a kernel and a terminal.
3. PROGRAM/MODULE SPECIFICATION
The naive Bayes classifier algorithm is the most applicable algorithm for implementing fake news detection, as it works on conditional probability and other major concepts of data mining that are used in this project; we also studied it in the 4th semester, which made understanding the code quite easy.
The final output is generated with the help of Python's matplotlib, which helps to understand the count of various words in the circulated articles. Below, it is shown with the help of color coding in the given matrix and the respective range.
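As a hedged sketch (the counts used here are placeholders, not the project's actual output), such a color-coded matrix can be drawn with matplotlib roughly as follows.
plot_matrix.py (illustrative sketch):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Placeholder labels; in the project these come from the classifier's predictions.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
plt.imshow(cm, cmap="Blues")   # color-coded matrix
plt.colorbar()                 # shows the value range
plt.xlabel("Predicted label")
plt.ylabel("True label")
for (i, j), count in np.ndenumerate(cm):
    plt.text(j, i, str(count), ha="center", va="center")
plt.show()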
4. CODING STANDARDS
Our project implementation uses apt variable names that make understanding of the domain quite easy.
Comments increase the readability of our code and make it easy for a third party to understand it. We have used comments wherever needed and have also included references to the online code we consulted.
Every code block and each module starts with comments describing the code and its details in brief.
Comments may also be used in between and along with lines of code to explain one specific line or a group of lines.
In Python, '#' is used for a single-line comment, and for multiple lines we can use triple-quote delimiters (''' '''). We have used both during programming.
5. Coding
LSTM.py:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from collections import Counter
import os
import getEmbeddings2
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt

top_words = 5000
epoch_num = 5
batch_size = 64

# Regenerate the cleaned, shuffled data if it is not already cached on disk.
if not os.path.isfile('./xtr_shuffled.npy') or \
        not os.path.isfile('./xte_shuffled.npy') or \
        not os.path.isfile('./ytr_shuffled.npy') or \
        not os.path.isfile('./yte_shuffled.npy'):
    getEmbeddings2.clean_data()

xtr = np.load('./xtr_shuffled.npy')
xte = np.load('./xte_shuffled.npy')
y_train = np.load('./ytr_shuffled.npy')
y_test = np.load('./yte_shuffled.npy')

# Count word frequencies in the training articles.
cnt = Counter()
x_train = []
for x in xtr:
    x_train.append(x.split())
    for word in x_train[-1]:
        cnt[word] += 1

# Storing most common words
most_common = cnt.most_common(top_words + 1)
word_bank = {}
id_num = 1
for word, freq in most_common:
    word_bank[word] = id_num
    id_num += 1

y_train = list(y_train)
y_test = list(y_test)
# (The remaining LSTM model construction and training steps are not
#  reproduced in this report extract.)
getEmbeddings.py
import numpy as np
import re
import string
import pandas as pd
from gensim.models import Doc2Vec
from gensim.models.doc2vec import LabeledSentence
from gensim import utils
from nltk.corpus import stopwords


def textClean(text):
    # Keep alphanumeric characters and basic punctuation, lower-case the text
    # and remove English stop words.
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = text.lower().split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]
    text = " ".join(text)
    return (text)


def cleanup(text):
    text = textClean(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text


def constructLabeledSentences(data):
    sentences = []
    for index, row in data.iteritems():
        sentences.append(LabeledSentence(utils.to_unicode(row).split(),
                                         ['Text' + '_%s' % str(index)]))
    return sentences


def getEmbeddings(path, vector_dimension=300):
    data = pd.read_csv(path)

    # Drop rows whose 'text' field is NaN (NaN != NaN).
    missing_rows = []
    for i in range(len(data)):
        if data.loc[i, 'text'] != data.loc[i, 'text']:
            missing_rows.append(i)
    data = data.drop(missing_rows).reset_index().drop(['index', 'id'], axis=1)

    for i in range(len(data)):
        data.loc[i, 'text'] = cleanup(data.loc[i, 'text'])

    x = constructLabeledSentences(data['text'])
    y = data['label'].values

    # (The Doc2Vec training and the initialisation of text_model, train_size,
    #  test_size and the output arrays are not reproduced in this extract.)
    for i in range(train_size):
        text_train_arrays[i] = text_model.docvecs['Text_' + str(i)]
        train_labels[i] = y[i]

    j = 0
    for i in range(train_size, train_size + test_size):
        text_test_arrays[j] = text_model.docvecs['Text_' + str(i)]
        test_labels[j] = y[i]
        j = j + 1
getEmbeddings2.py
import numpy as np
import re
import string
import pandas as pd
from gensim.models import Doc2Vec
from gensim.models.doc2vec import LabeledSentence
from gensim import utils
from nltk.corpus import stopwords


def textClean(text):
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = text.lower().split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]
    text = " ".join(text)
    return (text)


def cleanup(text):
    text = textClean(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text


def constructLabeledSentences(data):
    sentences = []
    for index, row in data.iteritems():
        sentences.append(LabeledSentence(utils.to_unicode(row).split(),
                                         ['Text' + '_%s' % str(index)]))
    return sentences


def clean_data():
    path = 'datasets/train.csv'
    vector_dimension = 300

    data = pd.read_csv(path)

    # Drop rows whose 'text' field is NaN (NaN != NaN).
    missing_rows = []
    for i in range(len(data)):
        if data.loc[i, 'text'] != data.loc[i, 'text']:
            missing_rows.append(i)
    data = data.drop(missing_rows).reset_index().drop(['index', 'id'], axis=1)

    for i in range(len(data)):
        data.loc[i, 'text'] = cleanup(data.loc[i, 'text'])

    # Shuffle the dataset before splitting.
    data = data.sample(frac=1).reset_index(drop=True)

    x = data.loc[:, 'text'].values
    y = data.loc[:, 'label'].values

    train_size = int(0.8 * len(y))   # assumed 80/20 split, as described earlier

    xtr = x[:train_size]
    xte = x[train_size:]
    ytr = y[:train_size]
    yte = y[train_size:]

    np.save('xtr_shuffled.npy', xtr)
    np.save('xte_shuffled.npy', xte)
    np.save('ytr_shuffled.npy', ytr)
    np.save('yte_shuffled.npy', yte)
naive-bayes.py
# Imports and the plot_cmat helper are added here; they are not shown in the
# report extract.
import numpy as np
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt
from sklearn.naive_bayes import GaussianNB
from getEmbeddings import getEmbeddings


def plot_cmat(yte, ypred):
    '''Plot a confusion matrix of the predictions.'''
    skplt.plot_confusion_matrix(yte, ypred)
    plt.show()


xtr, xte, ytr, yte = getEmbeddings("datasets/train.csv")
np.save('./xtr', xtr)
np.save('./xte', xte)
np.save('./ytr', ytr)
np.save('./yte', yte)
xtr = np.load('./xtr.npy')
xte = np.load('./xte.npy')
ytr = np.load('./ytr.npy')
yte = np.load('./yte.npy')

gnb = GaussianNB()
gnb.fit(xtr, ytr)
y_pred = gnb.predict(xte)
m = yte.shape[0]
n = (yte != y_pred).sum()
print("Accuracy = " + format((m - n) / m * 100, '.2f') + "%")  # 72.94%
plot_cmat(yte, y_pred)
neural-net-keras.py
# Imports are added here; only part of this script appears in the report extract.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD
from getEmbeddings import getEmbeddings

xtr, xte, ytr, yte = getEmbeddings("datasets/train.csv")
np.save('./xtr', xtr)
np.save('./xte', xte)
np.save('./ytr', ytr)
np.save('./yte', yte)
xtr = np.load('./xtr.npy')
xte = np.load('./xte.npy')
ytr = np.load('./ytr.npy')
yte = np.load('./yte.npy')


def baseline_model():
    '''Neural network with 3 hidden layers'''
    model = Sequential()
    model.add(Dense(256, input_dim=300, activation='relu',
                    kernel_initializer='normal'))
    model.add(Dropout(0.3))
    model.add(Dense(256, activation='relu', kernel_initializer='normal'))
    model.add(Dropout(0.5))
    model.add(Dense(80, activation='relu', kernel_initializer='normal'))
    model.add(Dense(2, activation="softmax", kernel_initializer='normal'))
    # gradient descent
    sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    # (Model compilation with this optimizer and the fitting steps are not
    #  reproduced in this report extract.)
    return model


# model, x_test and y_test refer to the trained model and the encoded test
# data; their preparation is not reproduced in this extract. plot_cmat is the
# confusion-matrix helper used in the other scripts.
probabs = model.predict_proba(x_test)
y_pred = np.argmax(probabs, axis=1)
plot_cmat(y_test, y_pred)
neural-net-tf.py
import numpy as np
import tensorflow as tf
from getEmbeddings import getEmbeddings
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt
import pickle
import os.path

IN_DIM = 300
CLASS_NUM = 2
LEARN_RATE = 0.0001
TRAIN_STEP = 20000
tensorflow_tmp = "tmp_tensorflow/three_layer2"


def model_fn(features, labels, mode):
    # (Conventional Estimator model-function name assumed; only this fragment
    #  of the function appears in the report extract, and the network layers
    #  that produce `predictions` are omitted.)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)


def main():
    # Get the training and testing data from getEmbeddings
    train_data, eval_data, train_labels, eval_labels = \
        getEmbeddings("datasets/train.csv")
    train_labels = train_labels.reshape((-1, 1)).astype(np.int32)
    eval_labels = eval_labels.reshape((-1, 1)).astype(np.int32)
    # (Estimator construction, training and evaluation are not reproduced in
    #  this report extract.)


if __name__ == "__main__":
    main()
svm.py
# Imports and the plot_cmat helper are added here; they are not shown in the
# report extract.
import numpy as np
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt
from sklearn.svm import SVC
from getEmbeddings import getEmbeddings


def plot_cmat(yte, ypred):
    '''Plot a confusion matrix of the predictions.'''
    skplt.plot_confusion_matrix(yte, ypred)
    plt.show()


xtr, xte, ytr, yte = getEmbeddings("datasets/train.csv")
np.save('./xtr', xtr)
np.save('./xte', xte)
np.save('./ytr', ytr)
np.save('./yte', yte)
xtr = np.load('./xtr.npy')
xte = np.load('./xte.npy')
ytr = np.load('./ytr.npy')
yte = np.load('./yte.npy')

clf = SVC()
clf.fit(xtr, ytr)
y_pred = clf.predict(xte)
m = yte.shape[0]
n = (yte != y_pred).sum()
print("Accuracy = " + format((m - n) / m * 100, '.2f') + "%")  # 88.42%
plot_cmat(yte, y_pred)
Chapter 7
Results and Discussion
Take a Valid News Article URL
Extract Relevant Text From URL
Extracting Features from Relevant Text
Applying Machine Learning Algorithms
for Classification
Store Classification Result in Database
User Login and Sign up
User Feedback
Verifying Results
Retraining of Machine Learning Models
Non-Functional Requirement Achieved
Results and Discussion
We integrated all the system components successfully. Our system's accuracy was quite good: it correctly classified news articles with 85.7% accuracy. Our main goal was to develop a user-friendly web application which classifies a news article as fake or credible simply by taking its URL from the user. We achieved this goal by fulfilling all the user requirements that were crucial to the success of our project.
There were also requirements related to performance. We constantly improved our system to achieve maximum performance, and the results were quite satisfactory. The response time of our system was adequately fast. We constantly applied software engineering processes to keep track of all the functional and non-functional requirements.
1. Take a Valid News Article URL (FR-01)
This functional requirement was critical to our system. In order for all the system components to work flawlessly, the system must get a valid news article URL from the user, from which it extracts text. If the system does not get a news article URL, the web crawler will raise an exception. To fulfil this requirement, we used a form input of URL type so that it only accepts a URL, and we also used exception handling to catch the exception if the URL provided is not a news article.
2. Extract Relevant Text From URL (FR-02)
This was a very challenging problem in our project. In order to classify a news article as fake or credible, we only needed the relevant text from the page source, on which our system applies Natural Language Processing to make feature vectors. This was particularly hard, as we had to make a generic scraper that works for every news website. We used the Newspaper3k API to solve this problem, which made it easier for us to extract only the news article's title and text (body).
3. Extracting Features from Relevant Text (FR-03)
The system uses NLTK to apply NLP to the news article's title and text to make feature vectors, which are then fed to the machine learning algorithms. We used 38-dimensional feature vectors. This is a necessary step, as it allows us to convert text into a numeric form which is then easy to use with machine learning algorithms.
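The full list of 38 features is not reproduced here; as a hedged sketch of the kind of features mentioned in this report (punctuation-to-character ratio, adverb count, text difficulty), NLTK and Textstat can be used roughly as follows.
extract_features.py (illustrative sketch; only a few of the 38 features are shown):
import string
import nltk
import textstat

# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# may be required once before running this.

def simple_features(text):
    """Compute a few features of the kind described above (not the full 38)."""
    tokens = nltk.word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    punctuation = sum(text.count(p) for p in string.punctuation)
    return {
        "punct_to_char_ratio": punctuation / max(len(text), 1),
        "adverb_count": sum(1 for _, tag in tags if tag.startswith("RB")),
        "reading_difficulty": textstat.flesch_reading_ease(text),
    }

print(simple_features("Shocking!!! You won't believe what happened next..."))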
4. Applying Machine Learning Algorithms for Classification (FR-04)
This requirement is the backbone of our system. The success of our system depended on how accurately our machine learning models predicted whether a news article is fake or not. In order to achieve maximum accuracy with finite resources, we trained our machine learning models on a labelled dataset of 7,000 news articles. We used two different machine learning models, SVM and Random Forest, for classification, and we combined the results of both models. We achieved a maximum of 86% test accuracy.
5. Store Classification Result in Database (FR-05)
We stored the result of every URL processed by our system in our database, alongside its title and text. This requirement helped us improve the performance of our system, as it eliminated redundancy: if two users enter the same URL, our system will only process it once and will store its classification result in the database for subsequent queries.
6. User Login and Sign Up (FR-06)
We used the Django user model to implement this requirement. This was also a necessary requirement, as users need to log in to give feedback on the classification results.
7. User Feedback (FR-07)
After a user logs into the system, the user can give feedback on all the classification results of the processed URLs. We implemented this by creating a voting system, in which a user can like or dislike a URL's classification result. We also made a voting table in the database which is associated with both the user model and the URL model, to make sure that a user can vote only once for a particular URL.
8. Verifying Results (FR-08)
A month after processing a URL, our system automatically checks the rating of the URL given by the users. If the rating is more than 50%, our system retains the classification result; but if the rating is less than 50%, the classification result is altered, as a poor rating indicates an incorrect classification by the system.
9. Retraining of Machine Learning Models (FR-09)
After a month, all the URLs which have been verified are added to our dataset along with their classification result, and all the machine learning models are trained and saved again. This ensures that our system improves with time as more and more data becomes available for training. This will help our system evolve continuously, and our accuracy will keep getting better.
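A plain-Python sketch of the verification rule described above is shown below; the function name and the like/dislike counts are purely illustrative, and in the running system this logic operates on the voting table before the verified articles are added back to the training set.
verify_classification.py (illustrative sketch):
def verify_classification(classified_fake, likes, dislikes):
    """Keep the stored result if more than 50% of voters agree; otherwise flip it."""
    total = likes + dislikes
    if total == 0:
        return classified_fake            # no feedback yet: keep the original result
    approval = likes / total
    return classified_fake if approval > 0.5 else (not classified_fake)

# Example: 3 likes and 7 dislikes reverse the stored classification; the verified
# article and its (possibly corrected) label are then added to the training set.
print(verify_classification(True, 3, 7))   # False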
10. Non-Functional Requirements Achieved
Performance Requirements: The system should respond to a user query and return a result in less than 5 seconds.
Security Requirements: Users should be able to log in securely.
Usability Requirements: The system should be user friendly and easy to use.
Chapter 8
Testing
Testing ensures that the software conforms to the required specification standards and performs the tasks it is meant for. The testing was done by our in-house tester acting as a novice user, who tested the application in all possible ways to find bugs and errors as well as to check validation.
1. TESTING PLAN
Design
Implementation
Coding
Design errors are to be rectified at the initial stage; such errors are very difficult to repair after the execution of the software.
Errors that occur at this stage cannot be overlooked, because such errors do not allow the process to continue further.
2. TESTING STRATEGY
When all modules have been tested successfully, I will move one step up and continue with the white-box testing strategy.
When all modules have been tested successfully, I will integrate those modules and try to test the integrated system using the black-box testing strategy.
3. TESTING METHODS
3.1 Unit Testing
Unit testing is meant for testing the smallest unit of software. There are two approaches, namely bottom-up and top-down. In the bottom-up approach, the last module is tested first and we then move towards the first module, while the top-down approach reverses this order. In the present work we opted for the first one.
The bottom-up approach for the current project was carried out module by module, as illustrated below.
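As a hedged illustration of unit testing at this level, a single test for the textClean helper listed in the Coding section could look roughly as follows (it assumes getEmbeddings.py is importable and the NLTK stopwords corpus has been downloaded).
test_text_clean.py (illustrative sketch):
import unittest
from getEmbeddings import textClean   # helper listed in the Coding section


class TextCleanTest(unittest.TestCase):
    def test_lowercases_and_removes_stopwords(self):
        cleaned = textClean("The QUICK brown fox!")
        self.assertNotIn("the", cleaned.split())   # stop word removed
        self.assertIn("quick", cleaned.split())    # text is lower-cased


if __name__ == "__main__":
    unittest.main()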
3.2 Integration Testing
Integration testing is meant to test all the modules together, because it is possible that all the modules function correctly when tested individually but do not work together and lead to unexpected outcomes.
3.3 Validation Testing
The dataset of the system has to be stored on the hard disk, so the storage capacity of the hard disk should be enough to store all the data required for the efficient running of the software.
4. TEST CASES
4.1 Purpose
The purpose of this project is to use machine learning algorithms to detect fake news on online social media that circulates as if it were real, much like clickbait. It will try to enhance the user experience on online social media platforms and will also save users a lot of the time they might otherwise spend on fake news.
Chapter 9
Limitations and Future Enhancement
The present software uses high-quality external hardware at the input level. If the quality of the input document is poor, the output may suffer due to this limitation.
The platform used is Anaconda (Jupyter Notebook), which is open-source software; this limits the cost of the project.
Limited dataset.
Limited processing speed.
When compared to real-world applications, our domain is not directly applicable, as the project is entirely study-based.
Chapter 10
Conclusion and Discussion
This project shows a simple approach to fake news detection using a naive Bayes classifier. This approach was implemented as a software system and tested against a dataset of Facebook news posts. We achieved a classification accuracy of approximately 74% on the test set, which is a decent result considering the relative simplicity of the model. These results may be improved in several ways, as described earlier. The results obtained suggest that the fake news detection problem can be addressed with artificial intelligence methods.
Problem:
Fake news
Solution:
3. SUMMARY OF PROJECT
Anonymity and the lack of meaningful supervision in the electronic medium are
two factors that have exacerbated this social menace.