Full Document - Fake News Detection
1. INTRODUCTION
A different way to detect fake news is through stance detection, which is the
focus of our study. Stance detection is the process of automatically detecting the
relationship between two pieces of text. In this study, we explore ways to predict
the stance given a news article and news headline pair. Depending on how similar
the article content and the headline are, the stance between them can be defined as
‘agree’, ‘disagree’, ‘discuss’ or ‘unrelated’. We experiment with several traditional
machine learning models to set a baseline and then compare the results with
state-of-the-art deep networks for classifying the stance between an article body
and a headline. Fake news can come in many forms, including unintentional errors
committed by news aggregators, outright false stories, and stories developed to
mislead and influence readers’ opinions.
While fake news may take multiple forms, its effect on people, governments
and organizations is generally negative because it departs from the facts. Detecting
fake news is hard for many reasons. First, the manual task of identifying fake news
is very subjective: assessing the veracity of a news story is a complex and
cumbersome task, even for trained experts. Second, news is no longer spread only
through traditional media outlets but also through various social media channels.
Finally, an automated solution requires natural language understanding, which is
difficult and complex. These complexities make it a daunting task to classify text
as fake news.
In the existing fake news literature, there have been multiple instances where
both supervised and unsupervised learning algorithms are used to classify text.
However, most of the literature focuses on specific datasets or domains, most
prominently the politics domain. As a result, the trained algorithm works best on
articles from a particular domain and does not achieve optimal results when
exposed to articles from other domains. Since articles from different domains
have a unique textual structure, it is difficult to train a generic algorithm that
works well across all news domains. In this paper, we propose a solution to
the fake news detection problem using a machine learning ensemble approach.
Our study explores different textual properties that can be used to distinguish
fake content from real content.
Deep learning models are trained using large sets of labeled data and neural
network architectures that contain many layers. Deep learning achieves recognition
accuracy at higher levels than ever before, which helps consumer electronics meet
user expectations and is crucial for safety-critical applications like driverless
cars. Recent advances have improved deep learning to the point where it
outperforms humans on some tasks, such as classifying objects in images.
While deep learning was first theorized in the 1980s, there are two main reasons
it has only recently become useful. First, deep learning requires large amounts of
labeled data; for example, driverless car development requires millions of images
and thousands of hours of video. Second, deep learning requires substantial
computing power. High-performance GPUs have a parallel architecture that is
efficient for deep learning, and when combined with clusters or cloud computing,
this enables development teams to reduce the training time of a deep learning
network from weeks to hours or less.
Aerospace and Defense: Deep learning is used to identify objects in satellite
imagery, locate areas of interest, and identify safe or unsafe zones for troops.
1.4 HOW DEEP LEARNING WORKS
CNNs eliminate the need for manual feature extraction, so you do not need
to identify features used to classify images. The CNN works by extracting
features directly from images. The relevant features are not pretrained; they are
learned while the network trains on a collection of images. This automated feature
extraction makes deep learning models highly accurate for computer vision tasks
such as object classification. CNNs learn to detect different features of an image
using tens or hundreds of hidden layers. Every hidden layer increases the
complexity of the learned image features. For example, the first hidden layer
could learn how to detect edges, and the last learns how to detect more complex
shapes specifically catered to the shape of the object we are trying to recognize.
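As an illustration of this layered feature learning, the following is a minimal sketch of a small CNN in Keras; the layer sizes, input shape and number of classes are placeholders chosen for the example, not values from this project.

from tensorflow.keras import layers, models

# Small illustrative CNN: early convolutional layers pick up simple features such
# as edges, while deeper layers combine them into more complex shapes.
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(16, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')  # 10 object classes, assumed
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()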
With end-to-end learning, a network is given raw data and a task to perform, such
as classification, and it learns how to do this automatically. Another key difference
is that deep learning algorithms scale with data, whereas shallow learning
converges. Shallow learning refers to machine learning methods that plateau at a
certain level of performance when you add more examples and training data to the
network. A key advantage of deep learning networks is that they often continue to
improve as the size of your data increases. The three most common ways people
use deep learning to perform object classification are:
To train a deep network from scratch, you gather a very large labeled data
set and design a network architecture that will learn the features and model. This
is good for new applications, or applications that will have a large number of
output categories. This is a less common approach because with the large amount
of data and rate of learning, these networks typically take days or weeks to train.
Transfer Learning
Feature Extraction
A slightly less common, more specialized approach to deep learning is to
use the network as a feature extractor. Since all the layers are tasked with learning
certain features from images, we can pull these features out of the network at any
time during the training process. These features can then be used as input to
a machine learning model such as support vector machines (SVM).
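As a sketch of this feature-extraction workflow: the pretrained network (VGG16 is only an example choice), the image shapes and the placeholder data below are assumptions for illustration, not part of this project.

import numpy as np
from tensorflow.keras.applications import VGG16
from sklearn.svm import SVC

# Pretrained CNN used as a fixed feature extractor
extractor = VGG16(weights='imagenet', include_top=False, pooling='avg')

# Placeholder images and labels standing in for a real dataset
images = np.random.rand(8, 224, 224, 3)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

features = extractor.predict(images)   # deep features pulled out of the network
svm = SVC(kernel='linear')
svm.fit(features, labels)              # conventional classifier trained on those features
print(svm.predict(features[:2]))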
With all this data, tools are necessary to extract insights and
trends. Machine learning techniques are used to find patterns in data and to build
models that predict future outcomes. A variety of machine learning algorithms
are available, including linear and nonlinear regression, neural networks, support
vector machines, decision trees, and other algorithms. Predictive analytics helps
teams in industries as diverse as finance, healthcare, pharmaceuticals,
automotive, aerospace, and manufacturing.
Predictive analytics is the process of using data analytics to make predictions
based on data. This process uses data along with analysis, statistics, and machine
learning techniques to create a predictive model for forecasting future events.
• Computational biology, for tumor detection, drug discovery, and DNA
sequencing
• Energy production, for price and load forecasting
• Automotive, aerospace, and manufacturing, for predictive maintenance
• Natural language processing, for voice recognition applications
Supervised Learning
segmentation. Common algorithms for performing classification include support
vector machine (SVM), boosted and bagged decision trees, k-nearest neighbor,
Naïve Bayes, discriminant analysis, logistic regression, and neural networks.
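A minimal sketch of fitting a few of the classifiers listed above through scikit-learn's common estimator interface; the toy data generated here is illustrative only and not related to the project dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Toy data standing in for real features and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every estimator shares the same fit/score interface
for clf in [SVC(), KNeighborsClassifier(), GaussianNB(), LogisticRegression(max_iter=500)]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))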
Unsupervised Learning
AI also involves prodigious amounts of data. Yet labeling data and images
is tedious and time-consuming. Sometimes, you don’t have enough data,
especially for safety-critical systems. Generating accurate synthetic data can
improve your data sets. In both cases, automation is critical to meeting deadlines.
Deployment
The deployment process is accelerated when you generate code from your
models and target your devices. Using code generation optimization techniques
and hardware-optimized libraries, you can tune the code to fit the low power
profile required by embedded and edge devices or the high-performance needs of
enterprise systems and the cloud.
1.8 REINFORCEMENT LEARNING
In reinforcement learning, an agent learns the desired behavior on its own, without
(human) supervision. Deep learning spans all three types of machine learning;
reinforcement learning and deep learning are not mutually exclusive. Complex
reinforcement learning problems often rely on deep neural networks, a field known
as deep reinforcement learning.
Reinforcement learning is used in robotics, for example for pick-and-place
applications. Other robotics applications include human-robot and robot-robot
collaboration.
CHAPTER 2
2. LITERATURE SURVEY
information. We want to contribute to the debate on how to deal with fake news
and related online phenomena with technological means, by providing means to
separate related from unrelated headlines and further classifying the related
headlines. We present a system for stance detection of headlines with regard to
their corresponding article bodies. Our system is based on simple, lemmatization-
based n-gram matching for the binary classification of “related” vs. “unrelated”
headline/article pairs. The best results were obtained using a setup where the more
fine-grained classification of the “related” pairs (into “agree”, “disagree”,
“discuss”) is carried out using a Logistic Regression classifier at first, then three
binary classifiers with slightly different training procedures for the cases where
the first classifier lacked confidence (i.e., the difference between the best and
second-best scoring class was below a threshold). For the more fine-grained
classification of articles that have been classified as “related”, the three-way
classification is a relevant first step, but other classes may need to be added to the
set, or a more detailed division may need to be made in order to take the next
steps in tackling the fake news challenge. Additionally, we see the integration of
known facts and general discourse knowledge (possibly through Linked Data),
and the incorporation of source credibility information as important and
promising suggestions for future research.
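A minimal sketch of the confidence-threshold idea described above, assuming an already fitted scikit-learn LogisticRegression and a dictionary of fitted binary fallback classifiers; the function name, the threshold value and the way the binary classifiers are combined are hypothetical, since the cited work does not spell them out here.

import numpy as np

def classify_related(features, logreg, binary_clfs, threshold=0.2):
    """Three-way stance prediction with a fallback when the top-2 margin is small."""
    probs = logreg.predict_proba(features)[0]
    best, second = np.sort(probs)[-1], np.sort(probs)[-2]
    if best - second >= threshold:
        return logreg.classes_[np.argmax(probs)]
    # Low confidence: consult one binary classifier per class and keep the most confident
    scores = {label: clf.predict_proba(features)[0][1]
              for label, clf in binary_clfs.items()}
    return max(scores, key=scores.get)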
1’s performance metric. As we developed our approach to FNC-1, we first
explored existing research as it pertains to related NLP problems in entailment as
well as stance detection. First, we examined papers regarding the Stanford
Natural Language Inference (SNLI) dataset, which has become popular in recent
years for developing models that classify entailment and contradiction among
hypothesis-premise pairs. From the original SNLI paper (Bowman et al., 2015)
we derived two of our baseline models: a Bag of Words (BOW) Multilayer
Perceptron (MLP) and a Long Short-Term Memory (LSTM) network that receives
concatenated hypothesis-premise pairs as inputs. Additionally, we drew heavily
upon Tim Rocktaschel’s Reasoning about Entailment with Neural Attention. In
the paper, Rocktaschel proposes an architecture of conditionally encoded LSTMs
upon which attention is applied in order to classify entailment on the SNLI
Dataset.
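A minimal sketch of the kind of concatenated-pair LSTM baseline referred to above; the vocabulary size, sequence length, layer sizes and the four stance labels used as output classes are placeholders, not values from the cited work.

from tensorflow.keras import layers, models

# Headline and body are tokenized, concatenated into one padded sequence, then classified
model = models.Sequential([
    layers.Input(shape=(200,)),
    layers.Embedding(input_dim=20000, output_dim=100),
    layers.LSTM(64),
    layers.Dense(4, activation='softmax')  # agree / disagree / discuss / unrelated
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()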
Todor Mihaylov et al. [5]: The reason could be that most troll comments
are replies to other comments, while those by non-trolls are mostly not replies.
Adding other features such as sentiment-based features, bad words, POS, and
punctuation hurts the performance significantly. Features such as bad words are
at the very bottom: they do not apply to all comments and thus are of little use
alone; similarly for mentions and sentiment features, which are also quite weak
in isolation. These results suggest that mentioned trolls are not that different from
non-trolls in terms of language use, but have mainly different behavior in terms
of replying to other users. We have presented experiments in predicting whether
a comment is written by a troll or not, where we define troll as somebody who
was called such by other people. We have shown that this is a useful definition
and that comments by mentioned trolls are similar to those by confirmed paid
trolls. Overall, we have seen that our classifier for telling apart comments by
mentioned trolls vs. non-trolls performs almost equally well for paid
trolls vs. non-trolls, where the non-troll comments are sampled from the same
threads that the troll comments come from. Moreover, the most and the least
important features ablated from all are also similar. This suggests that mentioned
trolls are very similar to paid trolls (except for their reply rate and their time and
day of posting patterns).
CHAPTER 3
3. EXISTING SYSTEM
Generally, for the news creators, besides the articles written by them, we
are also able to retrieve his/her profile information from either the social network
website or external knowledge libraries, e.g., Wikipedia or government-internal
database, which will provide fundamental complementary information for his/her
background check. Based on various types of heterogeneous information sources,
including both textual contents/profile/descriptions and the authorship and article
subject relationships among them, we aim at identifying fake news from the
online social networks simultaneously. We formulate the fake news detection
problem as a credibility inference problem, where real news has a higher
credibility while unauthentic news has a lower one.
A random forest builds each weak learner as a small decision tree that uses only a
few features. If we construct multiple small, weak decision trees in parallel, we
can then average their outputs or take a majority vote to combine them into a
single, strong learner. In practice, random forests are frequently found to be among
the most accurate learning techniques available.
Precondition: A training set S := (x1, y1), . . . , (xn, yn), features F, and the number
of trees in the forest B.
function RandomForest(S, F, B)
1  H ← ∅
2  for i ∈ 1, . . . , B do
3      S(i) ← a bootstrap sample from S
4      hi ← RandomizedTreeLearn(S(i), F)
5      H ← H ∪ {hi}
6  end for
7  return H
8  end function
9  function RandomizedTreeLearn(S, F)
10     At each node:
11         f ← a very small subset of F
12         split on the best feature in f
13     return the learned tree
14 end function
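For comparison, a minimal scikit-learn sketch of training a random forest in this bagged fashion; the generated data is a toy placeholder, and the parameter values (100 trees, square-root feature subsets) are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for extracted text features and labels
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# B = 100 bootstrapped trees, each split considering a small random subset of features
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))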
Input: Training dataset T and the values F = (f1, f2, ..., fn) of the predictor
variables for the instance to be classified
Output: The predicted class of the instance
Steps:
1. Read the training dataset T;
2. Calculate the mean and standard deviation of the predictor variables in each
class;
3. Repeat: calculate the probability of each value fi using the Gaussian density
equation in each class, until the probabilities of all predictor variables have been
calculated.
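For reference, the Gaussian density referred to in step 3 is the standard normal density (stated here for completeness; the symbols are generic and not taken from the original text):

$$P(f_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_c}\,\exp\!\left(-\frac{(f_i - \mu_c)^2}{2\sigma_c^2}\right)$$

where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the predictor variable in class $c$.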
3.2 CHALLENGES
While using machine learning classifiers for fake news detection can be effective,
there are also several challenges that need to be addressed to achieve accurate
results. Here are some of the main challenges:
• Adversarial attacks: Adversarial attacks are deliberate attempts to
manipulate the model's predictions by injecting subtle changes into the
input data. Adversarial attacks can make it difficult for the model to detect
fake news accurately.
• Dynamic nature of fake news: Fake news is constantly evolving, and new
types of fake news can emerge quickly. This means that the model must be
able to adapt to new forms of fake news and continue to perform accurately
over time.
3.3 DISADVANTAGES
• Lower accuracy
• Requires a large amount of data for training
• Produces a high false positive rate
• Performs only supervised classification
CHAPTER 4
4. PROPOSED SYSTEM
Research that studied the velocity of fake news concluded that tweets
containing false information reach people on Twitter six times faster than truthful
tweets. Technologies such as Machine learning and Natural Language Processing
(NLP) tools offer great promise for researchers to build systems which could
automatically detect fake news. However, detecting fake news is a challenging
task to accomplish as it requires models to summarize the news and compare it
to the actual news in order to classify it as fake. Moreover, the task of comparing
proposed news with the original news itself is a daunting task, as it is highly
subjective and opinionated. In this project, we implement a text mining algorithm
to extract key terms using natural language processing, together with a deep
learning classification algorithm, the multi-layer perceptron.
A deep neural network (DNN) is based on the feed-forward algorithm: data flows
from the input layer to the output layer. The DNN creates a number of virtual
neurons whose connection weights are initialized with random numerical values.
Each weight is multiplied with its input to produce an output between 0 and 1.
The training process adjusts the weights so that the network classifies the output
efficiently. Adding layers lets the model learn rare patterns, which can lead to
overfitting; dropout layers, which randomly deactivate a fraction of neurons during
training, help the model generalize. In this project, we use a sequential model of
dense layers, relu as the activation function, and adam as the optimizer. During
training, adam calculates individual learning rates for different parameters, as it is
an adaptive learning method. The training proceeds in the following three steps:
• Starting with the input layer, propagate data forward to the output layer.
This step is the forward propagation.
• Based on the output, calculate the error (the difference between the
predicted and known outcome). The error needs to be minimized.
• Backpropagate the error. Find its derivative with respect to each weight in
the network, and update the model.
Repeat the three steps given above over multiple epochs to learn ideal weights.
Finally, the output is passed through a threshold function to obtain the predicted
class labels. Figure 4.1 displays the framework of the multi-layer perceptron
algorithm.
Fig 4.1 MLP ALGORITHM
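A minimal sketch of the training setup described above, built as a Keras sequential model with dense layers, relu, dropout and the adam optimizer; the layer sizes, dropout rate and the commented-out training parameters are illustrative assumptions, not values taken from this project.

from tensorflow.keras import layers, models

# Sequential model of dense layers with dropout, relu and the adam optimizer
model = models.Sequential([
    layers.Input(shape=(500,)),            # 500 = assumed length of the padded input vectors
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # fake vs. real
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=10, validation_split=0.1)  # training call, parameters assumed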
4.2 ADVANTAGES
4.3 APPLICATIONS
Fake news detection using deep learning algorithms has numerous applications,
some of which include:
• Social media platforms: Social media platforms can use deep learning
algorithms to automatically detect and flag fake news articles shared by
users. This can help reduce the spread of misinformation and improve the
overall quality of content on these platforms.
• News organizations: News organizations can use deep learning algorithms
to verify the authenticity of news articles before publishing them. This can
help prevent the spread of fake news and maintain the credibility of news
outlets.
• Government agencies: Government agencies can use deep learning
algorithms to monitor the spread of fake news and detect potential threats
to national security. This can help them take appropriate action to
counteract the effects of fake news.
• Educational institutions: educational institutions can use deep learning
algorithms to teach students how to identify fake news and distinguish it
from real news. This can help improve media literacy and critical thinking
skills among students.
• Fact-checking organizations: Fact-checking organizations can use deep
learning algorithms to automate the fact-checking process and speed up the
verification of news articles. This can help them keep up with the high
volume of news articles that need to be fact-checked on a daily basis.
Learning Approach:
Flexibility and Adaptability:
• The multi-layer perceptron algorithm can adapt its internal weights and biases
during training, allowing it to learn and adapt to different datasets and problem
domains.
• This contrasts with Random Forest, which relies on an ensemble of decision
trees with fixed structures, and Naive Bayes, which assumes fixed probabilistic
relationships between features.
CHAPTER 5
5. SYSTEM DESIGN
5.2 MODULES
• TEXT MINING
• CLASSIFICATION
In the first step, the text documents, which are stored in .TXT format, are collected.
In this process, the given input document is cleaned: redundancies and
inconsistencies are removed, words are separated, stemming is applied, and the
documents are prepared for the next step. The stages performed are as follows:
Tokenization
In this step, the input text is split into individual words (tokens).
Stop Word Removal
In this step, common words such as a, an, but, and, of, the, etc. are removed.
Stemming
In this step, words are reduced to their root form.
In this module, the term frequency and inverse document frequency are calculated.
In information retrieval, tf-idf or TFIDF, short for term frequency-inverse
document frequency, is a numerical statistic that is intended to reflect how
important a word is to a document in a collection or corpus. It is often used as a
weighting factor in searches, information retrieval, text mining, and user
modelling. The tf-idf value increases proportionally with the number of times a
word appears in the document and is offset by the frequency of the word in the
corpus, which helps to adjust for the fact that some words appear more frequently
in general. The module also calculates entropy and probabilistic IDF (ProbIDF)
weights. Entropy gives higher weight to terms that occur with low frequency in
few documents. Normalization is used to correct discrepancies in document
lengths and to normalize the document vectors. ProbIDF is similar to IDF and
assigns a very low negative weight to terms occurring in every document.
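A minimal sketch of computing tf-idf weights with scikit-learn; the example documents below are placeholders, not entries from the project dataset.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "breaking news about the election",
    "the election results were announced today",
    "celebrity gossip and entertainment news",
]

# Each row is a document, each column a term, each value a tf-idf weight
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))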
5.2.4 CLASSIFICATION
The user can input news datasets or Twitter datasets. In this module, we
implement the multi-layer perceptron algorithm to classify the extracted keywords.
A multilayer perceptron (MLP) is a feed-forward artificial neural network that
generates a set of outputs from a set of inputs. An MLP is characterized by several
layers of nodes connected as a directed graph between the input and output layers.
The MLP uses back propagation for training the network and is a deep learning
method. The
input layer receives the input signal to be processed. The required task such as
prediction and classification is performed by the output layer. An arbitrary
number of hidden layers that are placed in between the input and output layer are
the true computational engine of the MLP.
Similar to a feed-forward network, in an MLP the data flows in the forward
direction from the input to the output layer. The neurons in the MLP are trained with
the back propagation learning algorithm. MLPs are designed to approximate any
continuous function and can solve problems which are not linearly separable. The
major use cases of MLP are pattern classification, recognition, prediction and
approximation. MLPs are global approximators and can be trained to implement
any given nonlinear input-output mapping. In a subsequent testing phase, they
prove their interpolation ability by generalizing even in sparse data space regions.
When designing a neural network, specifically deciding for a fixed architecture,
performance and computational complexity considerations play a crucial role.
Mathematically, it has been proved that even an MLP with a single hidden layer
is able to approximate the mapping of any continuous function.
As with all neural networks, the dimension of the input vector dictates the
number of neurons in the input layer, while the number of classes to be learned
dictates the number of neurons in the output layer. The number of chosen hidden
layers and the number of neurons in each layer have to be empirically determined.
5.2.5 FAKE NEWS DETECTION:
Classifying any news item, post or blog as fake or real has generated great
interest from researchers around the globe. Several research studies have been
carried out to find the effect of falsified and fabricated news on the masses and
the reactions of people upon coming across such news items. Falsified or
fabricated news is any textual or non-textual content that is fake and is generated
so that readers will start believing in something which is not true. Based on the
classification, fake news items are predicted. The proposed system provides an
improved accuracy rate in fake news detection. If a user continuously posts fake
news, the user is warned or blocked. The accuracy parameter is calculated in
terms of the true positive and false positive rates.
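For completeness, the standard definitions of these quantities in terms of true/false positives (TP, FP) and true/false negatives (TN, FN) are:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$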
CHAPTER 6
6. SYSTEM SPECIFICATIONS
6.3 SOFTWARE DESCRIPTION
Fig 6.1 PYTHON LOGO
Python is well suited for Rapid Application Development, as well as for use as a
scripting or glue language to connect existing components together. Python's
simple, easy-to-learn syntax
emphasizes readability and therefore reduces the cost of program maintenance.
Python supports modules and packages, which encourages program modularity
and code reuse.
The Python interpreter and the extensive standard library are available in
source or binary form without charge for all major platforms, and can be freely
distributed. Often, programmers fall in love with Python because of the increased
productivity it provides. Since there is no compilation step, the edit-test-debug
cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input
will never cause a segmentation fault. Instead, when the interpreter discovers an
error, it raises an exception. When the program doesn't catch the exception, the
interpreter prints a stack trace. Python also has a large and active community of
developers who contribute to a wide range of open-source libraries and tools,
making it easy to find and use pre-built code to solve complex problems.
Data Science: Python is one of the most popular languages for data science,
thanks to libraries like NumPy, Pandas, and Matplotlib that make it easy to
manipulate and visualize data.
Machine Learning: Python is also widely used in machine learning and artificial
intelligence, with libraries like TensorFlow, Keras, and Scikit-learn that provide
powerful tools for building and training machine learning models.
Scientific Computing: Python is used extensively in scientific computing, with
libraries like SciPy and SymPy that provide powerful tools for numerical analysis
and symbolic mathematics.
In addition to its versatility and ease of use, Python is also known for its
portability and compatibility. Python code can be run on a wide range of
platforms, including Windows, macOS, and Linux, and it can be integrated with
other languages like C and Java.
There are two attributes that make development time in Python faster than in other
programming languages:
Python has a rich ecosystem of libraries and frameworks spanning scientific
computing and data analysis to web development and machine learning. Some
popular Python libraries and frameworks include:
NumPy: a library for numerical computing in Python, providing support for large,
multi-dimensional arrays and matrices, along with a large collection of
mathematical functions to operate on these arrays.
Pandas: a library for data manipulation and analysis in Python, providing support
for reading and writing data in a variety of formats, as well as powerful tools for
manipulating and analyzing data.
Python's popularity has also led to a large and active community of developers
who contribute to open-source projects and share code and resources online. This
community provides a wealth of resources for learning Python, including
tutorials, online courses, and forums for asking and answering questions.
Scikit-learn is a machine learning library for Python built on top of NumPy and
SciPy, and is designed to be user-friendly, efficient, and extensible. One of the
key advantages of scikit-learn is its easy-to-use interface, which provides a
consistent way of implementing machine learning algorithms. This makes it easy
for both beginners and advanced users to use the library for various applications.
Scikit-learn also supports a wide range of machine learning algorithms, including
classification, regression, clustering, and dimensionality reduction, making it a
versatile tool for solving a variety of machine learning problems. Overall, scikit-
learn is a highly useful library for anyone interested in implementing machine
learning algorithms in Python.
Its easy-to-use interface, wide range of algorithms, and various other features
make it a popular choice for both beginners and advanced users in the machine
learning community.
6.3.2 MATPLOTLIB
CHAPTER 7
7. SYSTEM TESTING
7.1 TESTING
Software testing is an investigation conducted to provide stakeholders with
information about the quality of the software under test. It verifies that the
software:
1. meets the requirements that guided its design and development,
2. works as expected, and
3. can be implemented with the same characteristics and satisfies the needs of
stakeholders.
Testing typically occurs after the requirements have been defined and the coding
process has been completed.
Testing can never completely identify all the defects within software.
Instead, it furnishes a criticism or comparison that compares the state and
behaviour of the product against oracles: principles or mechanisms by which
someone might recognize a problem. These oracles may include (but are not
limited to) specifications, contracts, comparable products, past versions of the
same product, inferences about intended or expected purpose, user or customer
expectations, relevant standards, applicable laws, or other criteria. A primary
purpose of testing is to detect software failures so that defects may be discovered
and corrected.
7.2 TESTING METHODS
White-box testing (also known as clear box testing, glass box testing,
transparent box testing, and structural testing) tests internal structures or workings
of a program, as opposed to the functionality exposed to the end users. In white-
box testing an internal perspective of the system, as well as programming skills,
are used to design test cases. The tester chooses inputs to exercise paths through
the code and determine the appropriate outputs. This is analogous to testing nodes
in a circuit, e.g., in-circuit testing (ICT). While white-box testing can be applied
at the unit, integration and system levels of the software testing process, it is
usually done at the unit level. It can test paths within a unit, paths between units
during integration, and between subsystems during a system–level test. Though
this method of test design can uncover many errors or problems, it might not
detect unimplemented parts of the specification or missing requirements.
Techniques used in white-box testing include:
1. API testing (application programming interface): testing of the application
using public and private APIs.
2. Code coverage: creating tests to satisfy some criteria of code coverage (e.g.,
the test designer can create tests to cause all statements in the program to be
executed at least once).
3. Fault injection methods: intentionally introducing faults to gauge the efficacy
of testing strategies.
Code coverage tools can evaluate the completeness of a test suite that was
created with any method, including black-box testing. This allows the software
team to examine parts of a system that are rarely tested and ensures that the most
important function points have been tested. Code coverage as a software metric
can be reported as a percentage for:
1. Function coverage, which reports on functions executed.
2. Statement coverage, which reports on the number of lines executed to complete
the test.
100% statement coverage ensures that all code paths, or branches (in terms of
control flow), are executed at least once. This is
helpful in ensuring correct functionality, but not sufficient since the same code
may process different inputs correctly or incorrectly.
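As a small illustration of that last point, the test below reaches 100% statement coverage of the function yet misses a defect that only appears for other inputs; both the function and the test are hypothetical examples, not project code.

def absolute_value(x):
    return x  # deliberately wrong for negative inputs

def test_absolute_value():
    # This single test executes every statement above (100% statement coverage),
    # yet it never exercises a negative input, so the defect is not detected.
    assert absolute_value(3) == 3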
Black-box testing treats the software as a black box, examining functionality
without any knowledge of internal implementation: the observed output is
compared against the expected value specified in the test case. Test cases are
built around specifications
and requirements, i.e., what the application is supposed to do. It uses external
descriptions of the software, including specifications, requirements, and designs
to derive test cases. These tests can be functional or non-functional, though
usually functional. Specification based testing may be necessary to assure correct
functionality, but it is insufficient to guard against complex or high-risk
situations. One advantage of the black box technique is that no programming
knowledge is required.
Whatever biases the programmers may have had, the tester likely has a
different set and may emphasize different areas of functionality. On the other
hand, black-box testing has been said to be "like a walk in a dark labyrinth without
a flashlight." Because they do not examine the source code, there are situations
when a tester writes many test cases to check something that could have been
tested by only one test case, or leaves some parts of the program untested. This
method of test can be applied to all levels of software testing: unit, integration,
system and acceptance. It typically comprises most if not all testing at higher
levels, but can also dominate unit testing.
Grey-box testing involves having knowledge of internal data structures and
algorithms for the purpose of designing tests, while executing those tests at the
user, or black-box, level. Grey-box testing may also include reverse engineering
to determine, for instance, boundary
values or error messages.
By knowing the underlying concepts of how the software works, the tester
makes better-informed testing choices while testing the software from outside.
Typically, a grey-box tester will be permitted to set up his testing environment;
for instance, seeding a database; and the tester can observe the state of the product
being tested after performing certain actions. For instance, in testing a database
product he/she may fire an SQL query on the database and then observe the
database, to ensure that the expected changes have been reflected. Grey-box
testing implements intelligent test scenarios, based on limited information. This
will particularly apply to data type handling, exception handling, and so on.
Visual testing
Visual testing is particularly well-suited for environments that deploy agile methods in
their development of software, since agile methods require greater
communication between testers and developers and collaboration within small
teams. Visual testing is gathering recognition in customer acceptance and
usability testing, because the test can be used by many individuals involved in the
development process. For the customer, it becomes easy to provide detailed bug
reports and feedback, and for program users, visual testing can record user actions
on screen, as well as their voice and image, to provide a complete picture at the
time of software failure for the developer.
Tests are frequently grouped by where they are added in the software
development process, or by the level of specificity of the test. The main levels
during the development process as defined by the SWEBOK guide are unit-,
integration-, and system testing that are distinguished by the test target without
implying a specific process model. Other test levels are classified by the testing
objective.
Unit testing, also known as component testing, refers to tests that verify the
functionality of a specific section of code, usually at the function level. In an
object-oriented environment, this is usually at the class level, and the minimal
unit tests include the constructors and destructors. These types of tests are usually
written by developers as they work on code (white-box style), to ensure that the
specific function is working as expected. One function might have multiple tests,
to catch corner cases or other branches in the code. Unit testing alone cannot
verify the functionality of a piece of software, but rather is used to assure that the
building blocks the software uses work independently of each other.
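A minimal sketch of such a developer-written unit test using Python's built-in unittest framework; the clean_text helper is a hypothetical example, not a function from this project.

import unittest

def clean_text(text):
    # Hypothetical helper: lowercase the text and strip surrounding whitespace
    return text.strip().lower()

class TestCleanText(unittest.TestCase):
    def test_strips_and_lowercases(self):
        self.assertEqual(clean_text("  Fake NEWS  "), "fake news")

    def test_empty_string(self):
        self.assertEqual(clean_text(""), "")

if __name__ == "__main__":
    unittest.main()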
Integration testing is any type of software testing that seeks to verify the
interfaces between components against a software design. Software components
may be integrated in an iterative way or all together ("big bang"). Normally the
former is considered a better practice since it allows interface issues to be
localised more quickly and fixed. Integration testing works to expose defects in
the interfaces and interaction between integrated components (modules).
Progressively larger groups of tested software components corresponding to
elements of the architectural design are integrated and tested until the software
works as a system.
7.4.1 Bottom-Up
7.4.2 Top-Down
7.5.4 Regression Testing
Beta testing comes after alpha testing and can be considered a form of
external user acceptance testing. Versions of the software, known as beta versions,
are released to a limited audience outside of the programming team. The software
is released to groups of people so that further testing can ensure the product has
few faults or bugs. Sometimes, beta versions are made available to the open
public to increase the feedback field to a maximal number of future users.
Functional testing refers to activities that verify a specific action or
function of the code. These are usually found in the code requirements
documentation, although some development methodologies work from use cases
or user stories. Functional tests tend to answer the question of "can the user do
this" or "does this particular feature work." Non-functional testing refers to
aspects of the software that may not be related to a specific function or user action,
such as scalability or other performance, behaviour under certain constraints, or
security.
when certain components (for example a file or database) increase radically in
size. Stress testing is a way to test reliability under unexpected or rare workloads.
Usability testing is needed to check whether the user interface is easy to use and
understand. It is concerned mainly with the use of the application, and it is
important to check that the interface works as planned.
CHAPTER 8
8.1 CONCLUSION
In this project, we have studied the fake news article, creator and subject
detection problem. Based on the news-augmented heterogeneous social network,
a set of explicit and latent features can be extracted from the textual information
of news articles, creators and subjects respectively. Furthermore, based on the
connections among news articles, creators and news subjects, a deep diffusive
network model has been proposed to incorporate the network structure
information into model learning. The accuracy could presumably be improved
further by using progressively more complex models. It is worth noting that, even
with the given dataset, only part of the available information was used; the current
project did not include domain knowledge related features, such as entity
relationships. The proposed system shows that the multi-layer perceptron neural
network algorithm provides an improved accuracy rate. We formulated fake news
detection on social media as an inference problem in a deep learning model that
can be solved using a multi-layer neural network algorithm, and we conclude that
the proposed system provides an improved accuracy rate in fake news detection.
Experiments on well-known benchmark datasets show that the proposed model
consistently improves over the state of the art in fake news detection in both the
late and early detection settings.
APPENDIX
A1. CODING
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
# scikit-learn and Keras utilities used in the preprocessing, training and evaluation steps below
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

data = pd.read_csv("DataSet/DataSet.csv")
print(data.shape)
data = data.dropna()  # drop rows with missing values
print(data.shape)
# Preprocessing: encode the labels, tokenize the text and pad the sequences
def preprocess(data):
    le = LabelEncoder()
    data['label'] = le.fit_transform(data['label'])
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(data['text'])
    sequences = tokenizer.texts_to_sequences(data['text'])
    X = pad_sequences(sequences, maxlen=500)
    y = data['label']
    return X, y

X, y = preprocess(data)
# Train/test split (the split parameters are not shown in the original listing; 80/20 assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train)
print(X_test)
print(y_train)
print(y_test)
# Train the multi-layer perceptron classifier
classifier = MLPClassifier(random_state=0, max_iter=200)
classifier.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))

# Confusion matrix (the plotting call itself is not in the original listing; imshow assumed)
cm = confusion_matrix(y_test, y_pred)
plt.imshow(cm, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
filename = 'model.pkl'
pickle.dump(classifier, open(filename, 'wb'))
'''tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['description'])
sequences = tokenizer.texts_to_sequences(data['description'])
X = pad_sequences(sequences, maxlen=500)'''
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from flask import Flask, render_template, request
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)
@app.route('/')
def home():
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        comment = request.form['comment']
        # As in the original listing, a new tokenizer is fitted on the submitted text
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts([comment])
        sequences = tokenizer.texts_to_sequences([comment])
        X = pad_sequences(sequences, maxlen=500)
        filename = 'model.pkl'
        classifier = pickle.load(open(filename, 'rb'))  # loading call assumed; only the filename appears in the original
        my_prediction = classifier.predict(X)
        # warnings.filterwarnings("ignore", category=DeprecationWarning)
        print(my_prediction[0])
        if my_prediction[0] == 0:
            result = 'Real'   # label mapping assumed; the original branch bodies are not shown
        else:
            result = 'Fake'
        print(result)
    return render_template('result.html', prediction=my_prediction[0])

if __name__ == '__main__':
    app.run(debug=True)
A2 SCREENSHOTS
L:\Python2023\UG\Cavery\NLTKDatasetPy\venv\Scripts\python.exe
L:/Python2023/UG/Cavery/NLTKDatasetPy/NewModel.py
2023-03-28 12:00:01.455064: W
tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load
dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2023-03-28 12:00:01.456089: I
tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart
dlerror if you do not have a GPU set up on your machine.
(999, 5)
(868, 5)
[Console output: padded token-id sequences for the training and test sets, followed by the encoded label values (truncated).]
Accuracy on training set: 0.99
Fig A2.3 WEB APPLICATION LINK
Fig A2.5 DETECTING FAKE NEWS
REFERENCES
5. J. Lu, P. Zhao, and S. C. H. Hoi, "Online passive aggressive active learning
and its applications," in Proc. Asian Conference on Machine Learning (ACML),
PMLR, 2015.
8. T. Mihaylov, ‘‘Finding opinion manipulation trolls in news community
forums,’’ in Proc. 19th Conf. Comput. Natural Lang. Learn., Beijing, China, Jul.
2015, pp. 310–314. [Online].