174820-Fake News Detection Using Python
INTRODUCTION
The extensive spread of fake news can have a serious negative impact on
individuals and society. First, fake news can break the authenticity balance of the
news ecosystem. For example, it is evident that the most popular fake news was even more widely spread on Facebook than the most popular authentic mainstream news during the 2016 U.S. presidential election. Second, fake news
intentionally persuades consumers to accept biased or false beliefs. Fake news is
usually manipulated by propagandists to convey political messages or influence.
For example, some reports show that Russia has created fake accounts and social
bots to spread false stories. Third, fake news changes the way people interpret
and respond to real news. For example, some fake news was just created to
trigger people’s distrust and make them confused, impeding their abilities to
differentiate what is true from what is not. To help mitigate the negative effects caused by fake news, both to benefit the public and the news ecosystem, it is critical that we develop methods to automatically detect fake news on social media.
Detecting fake news on social media poses several new and challenging
research problems. Though fake news itself is not a new problem–nations or
groups have been using the news media to execute propaganda or influence
operations for centuries –the rise of web-generated news on social media makes
fake news a more powerful force that challenges traditional journalistic norms.
There are several characteristics of this problem that make it uniquely
challenging for automated detection. First, fake news is intentionally written to mislead readers, which makes it nontrivial to detect simply based on news content. The content of fake news is rather diverse in terms of topics, styles and media platforms, and fake news attempts to distort truth with diverse linguistic styles while simultaneously mocking true news. For example, fake news may cite true evidence within the incorrect context to support a non-factual claim. Thus,
existing hand-crafted and data-specific textual features are generally not
sufficient for fake news detection. Other auxiliary information must also be
applied to improve detection, such as knowledge base and user social
engagements. Second, exploiting this auxiliary information actually leads to
another critical challenge: the quality of the data itself. Fake news is usually related to newly emerging, time-critical events, which may not have been properly verified by existing knowledge bases due to the lack of corroborating evidence or claims. In addition, users' social engagements with fake news produce data that is big, incomplete, unstructured, and noisy. Effective methods to differentiate credible users, extract useful post features, and exploit network interactions are an open area of research and need further investigation.
In this article, we present an overview of fake news detection and discuss
promising research directions. The key motivations of this survey are
summarized as follows:
• Fake news on social media has been occurring for several years; however, there is no agreed-upon definition of the term "fake news". To better guide the future directions of fake news detection research, appropriate clarifications are necessary.
• Social media has proved to be a powerful source for fake news dissemination.
There are some emerging patterns that can be utilized for fake news detection in
social media. A review of existing fake news detection methods under various social media scenarios can provide a basic understanding of state-of-the-art fake news detection methods.
• Fake news detection on social media is still at an early stage of development, and there are still many challenging issues that need further investigation. It is necessary to discuss potential research directions that can improve fake news detection and mitigation capabilities.
We discuss the narrow and broad definitions of fake news that cover most existing definitions in the literature, and further present the unique characteristics of fake news on social media and its implications compared with the traditional media. We give an overview of existing fake news detection methods with a principled way to group representative methods into different categories, and we discuss several open issues and provide future directions of fake news detection in social media.
Giant man-bats that spent their days collecting fruit and holding animated
conversations; goat-like creatures with blue skin; a temple made of polished sapphire.
These were the astonishing sights witnessed by John Herschel, an eminent British
astronomer, when, in 1835, he pointed a powerful telescope “of vast dimensions” towards
the Moon from an observatory in South Africa. Or that, at least, was what readers of the New
York Sun were told in a series of newspaper reports.
This caused a sensation. People flocked to buy each day’s edition of the Sun. The
paper’s circulation shot up from 8,000 to over 19,000 copies, overtaking the Times of
London to become the world’s bestselling daily newspaper. There was just one small hitch.
The fantastical reports had in fact been concocted by Richard Adams Locke, the Sun’s
editor. Herschel was conducting genuine astronomical observations in South Africa. But
Locke knew it would take months for his deception to be revealed, because the only means
of communication with the Cape was by letter. The whole thing was a giant hoax – or, as
we would say today, “fake news”. This classic of the genre illuminates the pros and cons of
fake news as a commercial strategy – and helps explain why it has re-emerged in the
internet era.
That fake news shifted copies had been known since the earliest days of printing. In the
16th and 17th centuries, printers would crank out pamphlets, or newsbooks, offering
detailed accounts of monstrous beasts or unusual occurrences. A newsbook published in
Catalonia in 1654 reports the discovery of a monster with “goat’s legs, a human body,
seven arms and seven heads”; an English pamphlet from 1611 tells of a Dutch woman who
lived for 14 years without eating or drinking. So what if they weren’t true? Printers argued,
as internet giants do today, that they were merely providing a means of distribution, and
were not responsible for ensuring accuracy.
But newspapers were different. They contained a bundle of different stories, not just
one, and appeared regularly under a consistent title. They therefore had reputations to
maintain. The Sun, founded in 1833, was the first modern newspaper, funded primarily by
advertisers rather than subscriptions, so it initially pursued readership at all costs. At first it
prospered from the Moon hoax, even collecting its reports in a bestselling pamphlet. But it
was soon exposed by rival papers. Editors also realised that an infinite supply of genuine
human drama could be found by sending reporters to the courts and police stations to write
true-crime stories – a far more sustainable model. As the 19th century progressed,
impartiality and objectivity were increasingly venerated at the most prestigious newspapers.
But in recent years search engines and social media have blown apart newspapers’
bundles of stories. Facebook shows an endless stream of items from all over the web. Click
an interesting headline and you may end up on a fake-news site, set up by a political
propagandist or a teenager in Macedonia to attract traffic and generate advertising revenue.
Peddlers of fake stories have no reputation to maintain and no incentive to stay honest; they
are only interested in the clicks. Hence the bogus stories, among the most popular of 2016,
that the pope had endorsed Donald Trump, or that Hillary Clinton had sold weapons to
Islamic State. The impetus behind these was commercial rather than political; it transpired
that Trump supporters were more likely to click and share bogus stories.
Fake news has existed for a very long time, nearly as long as news has circulated widely, which began after the printing press was invented in 1439.
However, there is no agreed definition of the term “fake news”. Therefore, we first
discuss and compare some widely used definitions of fake news in the existing
literature, and provide our definition of fake news that will be used for the
remainder of this survey. A narrow definition of fake news is news articles that
are intentionally and verifiably false and could mislead readers. There are two key
features of this definition: authenticity and intent. First, fake news includes false
information that can be verified as such. Second, fake news is created with
dishonest intention to mislead consumers. This definition has been widely
adopted in recent studies. Broader definitions of fake news focus on either the authenticity or the intent of the news content. Some papers regard satire news as fake
news since the contents are false even though satire is often entertainment-
oriented and reveals its own deceptiveness to the consumers. Other literature
directly treats deceptive news as fake news, which includes serious fabrications,
hoaxes, and satires.
In this article, we use the narrow definition of fake news. Formally, we state this definition as follows:

Definition 1 (Fake News) Fake news is a news article that is intentionally and verifiably false.
The reasons for choosing this narrow definition are threefold. First, the underlying intent of fake news provides both theoretical and practical value that enables a deeper understanding and analysis of this topic. Second, any techniques for truth verification that apply to the narrow conception of fake news can also be applied under the broader definition. Third, this definition is able to eliminate the ambiguities between fake news and related concepts that are not considered in this article. The following concepts are not fake news according to our definition:
(1) satire news with proper context, which has no intent to mislead or deceive
consumers and is unlikely to be misperceived as factual;
(2) rumors that did not originate from news events;
(3) conspiracy theories, which are difficult to verify as true or false;
(4) misinformation that is created unintentionally;
(5) hoaxes that are only motivated by fun or to scam targeted individuals.
Fake news itself is not a new problem. The media ecology of fake news has
been changing over time from newsprint to radio/television and, recently, online news and social media. We denote "traditional fake news" as the fake news
problem before social media had important effects on its production and
dissemination. Next, we will describe several psychological and social science
foundations that describe the impact of fake news at both the individual and
social information ecosystem levels.
Psychological Foundations of Fake News. Humans are naturally not very good
at differentiating between real and fake news. There are several psychological
and cognitive theories that can explain this phenomenon and the influential power
of fake news. Traditional fake news mainly targets consumers by exploiting their individual vulnerabilities. There are two major factors which make consumers naturally vulnerable to fake news: (i) Naı̈ve Realism: consumers tend to believe
that their perceptions of reality are the only accurate views, while others who
disagree are regarded as uninformed, irrational, or biased; and (ii) Confirmation
Bias: consumers prefer to receive information that confirms their existing views.
Due to these cognitive biases inherent in human nature, fake news can often be
perceived as real by consumers. Moreover, once the misperception is formed, it is
very hard to correct. Psychological studies show that correcting false information (e.g., fake news) by presenting true, factual information is not only unhelpful in reducing misperceptions, but may sometimes even increase them, especially among ideological groups.
Social Foundations of the Fake News Ecosystem. Considering the entire news
consumption ecosystem, we can also describe some of the social dynamics that
contribute to the proliferation of fake news. Prospect theory describes decision
making as a process by which people make choices based on the relative gains
and losses as compared to their current state. This desire for maximizing the
reward of a decision applies to social gains as well, for instance, continued
acceptance by others in a user’s immediate social network. As described by social
identity theory and normative influence theory, this preference for social
acceptance and affirmation is essential to a person’s identity and self-esteem, making users likely to choose “socially safe” options when consuming and disseminating news information, following the norms established in the community even if the news being shared is fake news.
The news publishing process can be modeled as a mapping from an original signal s to a resultant news report a with an effect of distortion bias b, i.e., s →ᵇ a, where b ∈ {−1, 0, 1} indicates [left, no, right] bias taking effect on the news publishing process. Intuitively, this captures the degree to which a news article may be biased or distorted to produce fake news. The utility for the publisher stems from two perspectives:
(i) Short-term utility: the incentive to maximize profit, which is positively
correlated with the number of consumers reached;
(ii) Long-term utility: their reputation in terms of news authenticity. Utility of
consumers consists of two parts:
(i) Information utility: obtaining true and unbiased information (usually extra
investment cost needed);
(ii) Psychology utility: receiving news that satisfies their prior opinions and
social needs, e.g., confirmation bias and prospect theory. Both publisher and
consumer try to maximize their overall utilities in this strategy game of the
news consumption process. We can capture the fact that fake news happens
when the short-term utility dominates a publisher’s overall utility and
psychology utility dominates the consumer’s overall utility, and an equilibrium
is maintained. This explains the social dynamics that lead to an information
ecosystem where fake news can thrive.
1.1.4 FAKE NEWS ON SOCIAL MEDIA
Malicious Accounts on Social Media for Propaganda. While many users on social
media are legitimate, social media users may also be malicious, and in some
cases are not even real humans. The low cost of creating social media accounts
also encourages malicious user accounts, such as social bots, cyborg users, and
trolls. A social bot refers to a social media account that is controlled by a computer algorithm to automatically produce content and interact with humans
(or other bot users) on social media. Social bots can become malicious entities
designed specifically with the purpose to do harm, such as manipulating and
spreading fake news on social media. Studies show that social bots distorted the 2016 U.S. presidential election online discussions on a large scale, and that
around 19 million bot accounts tweeted in support of either Trump or Clinton in
the week leading up to election day. Trolls, real human users who aim to disrupt
online communities and provoke consumers into an emotional response, are also
playing an important role in spreading fake news on social media. For example,
evidence suggests that there were 1,000 paid Russian trolls spreading fake news about Hillary Clinton. Trolling behaviors are highly affected by people’s mood and the context of online discussions, which enables the easy dissemination of fake news among otherwise “normal” online communities. The effect of trolling is to
trigger people’s inner negative emotions, such as anger and fear, resulting in
doubt, distrust, and irrational behavior. Finally, cyborg users can spread fake
news in a way that blends automated activities with human input. Usually, cyborg accounts are registered by humans as a camouflage and set automated programs to perform activities on social media. The easy switch of functionalities between human and bot offers cyborg users unique opportunities to spread fake news. In a nutshell, these highly active and partisan malicious accounts on social media become powerful sources for the production and proliferation of fake news.
Fake news has become increasingly prevalent over the last few years, with over 100 incorrect articles and rumors spread incessantly just with regard to the 2016 United States presidential election. These fake news articles tend to come from satirical news websites
or individual websites with an incentive to propagate false information, either as clickbait
or to serve a purpose. Since they typically hope to intentionally promote incorrect
information, such articles are quite difficult to detect. When identifying a source of
information, one must look at many attributes, including but not limited to the content of the article and its social media engagements. Specifically, the language is typically more inflammatory in fake news than in real articles, in part because the purpose is to confuse and generate clicks. Furthermore, modeling techniques such as n-gram encodings and bag-of-words representations have served as other linguistic techniques to determine the legitimacy of a news source. On top of that, researchers have determined that visual-based cues also play a factor in categorizing an article; specifically, some features can be designed to assess whether a picture is legitimate and provides more clarity on the news. There are also many social context features that can play a role, as well as the model of spreading the news. Websites such as “Snopes” try to detect this information manually, while certain universities are trying to build mathematical models to do this themselves.
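As a rough illustration of the n-gram and bag-of-words techniques mentioned above, here is a minimal Python sketch using scikit-learn's CountVectorizer; the two headlines are invented for the example:

from sklearn.feature_extraction.text import CountVectorizer

headlines = [
    "Pope endorses presidential candidate",      # hypothetical fake-style headline
    "City council approves new transit budget",  # hypothetical real-style headline
]

# Count unigrams and bigrams; each column of X counts one term.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(headlines)

print(vectorizer.get_feature_names_out())
print(X.toarray())

Each row of X is a numeric representation of one headline that can then be fed to a classifier.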
Fake news in India has led to episodes of violence between castes and religions,
interfering with public policies. It often spreads through the smartphone instant
messenger WhatsApp, which had 200 million monthly active users in the country as of
February 2017.
Prabhakar Kumar, of the Indian media research agency CMS, told The Guardian that India was hit harder by fake news because the country lacked a media policy for verification.
Law enforcement officers in India have arrested individuals on charges of creating fictitious articles, predominantly when there was a likelihood that the articles would inflame societal conflict.
In April 2018, the Information and Broadcasting Ministry said the government
would cancel the accreditation of journalists found to be sharing fake news, but this was
quickly retracted after criticism that this was an attack on freedom of the press.
To tackle the menace of fake news in Kashmir, Amir Ali Shah, a youth from south Kashmir's Anantnag district, has developed a website called "Stop Fake in Kashmir" where
news and facts can be verified. The website is the first of its kind developed in the Kashmir
valley.
The internet gave everyone the opportunity to enter the online news business, including many who had already rejected the traditional news sources that had earned a high level of public trust and credibility. According to one survey, general trust in the mass media has collapsed to its lowest point in the history of the business, especially along political lines: 51% of Democrats but only 14% of Republicans in the USA express a great deal of trust in the mass media as a news source.
It is well established that information that is repeated is more likely to be rated true than information that has not been heard before; familiarity with false news increases its perceived truthfulness. Worse, the effect does not stop there, as repeated false stories can even create false memories. The authors who first observed this “illusory-truth effect” reported that subjects rated repeated statements as truer than new statements. They present a case study in which participants who had read false news stories over five consecutive weeks judged those stories to be more truthful and more plausible than participants who had not been exposed.
News can seem true simply because the information it expresses is more familiar. Familiarity is an automatic consequence of exposure, so its influence on perceived truth is entirely unintentional. Even in cases where the agency that circulated a story warns that the source may not be credible, people do not stop believing the story, because of its familiarity. Another study presented statements of which half were true and half were false; the results showed that participants preferred repeated statements and, although they were false, rated them as more true than stories they heard for the first time, due to familiarity (Bacon et al., 1979). Source monitoring is the ability to check and identify the origin of the news we read. Some studies clearly indicate that participants use familiarity to identify the source of their memories. Another study proposed that general knowledge and semantic memory do not preserve the conditions of learning; they help a person with the information itself rather than with when and where it was learned. Similarly, a person may have some knowledge about an event but not remember the event itself, because that knowledge comes from semantic memory.
State-of-the-art semantic models do an excellent job of detecting semantic similarity, for example by calculating the cosine similarity between two word vectors. Such a model will be able to tell that cappuccino, espresso and americano are similar to each other. However, it cannot predict semantic differences between words. If you can tell that americano is similar to cappuccino and espresso but you cannot tell the difference between them, you do not know what an americano is. As a consequence, any semantic model that is only good at similarity detection will be of limited practical use.
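As a small illustration, here is a minimal Python sketch of cosine similarity between two word vectors; the four-dimensional vectors are toy values, whereas real embeddings would come from a trained model such as word2vec or GloVe:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between u and v; values near 1.0 mean similar direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

espresso = np.array([0.9, 0.8, 0.1, 0.0])
cappuccino = np.array([0.8, 0.9, 0.2, 0.1])
print(cosine_similarity(espresso, cappuccino))  # high similarity for related words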
1.3 CONCEPTUAL FRAMEWORK
The first part of the framework concerns the approach to determining fake news sources. There is no established network of fake news sources; for that reason, the creation of fake news can be seen as initiated by entities, usually unreliable ones. Due to technical constraints, most of these unreliable entities are difficult to identify, so barring them from further spreading fake news is a challenging method to implement. The entity combines fabrication and truth to produce the fake news item, which can be published using a fake headline with true content, a true headline with fake content, or a combination of fake and true headline and content. The fake article produced is then published online using any website hosted by the entity. In most cases, the website hosted by the entity looks similar to a reputable, authentic and reliable website. The links from these websites or articles are shared on microblogging sites because the news is considered credible by some parties, who share this so-called credible news to provide information to other microblogging site users. The paths followed are hence outlined and then drafted.
Following this path, the node identifiers are tagged; these are the first points where a verification check is required. As a result, a verification function for the legitimacy of the node(s), mostly the domain name or IP address, is set. Furthermore, the work performed by the function in this verification step is explicitly named; a name like "the verification of IP address" is most appropriate. In the examination of the node(s), a network analyzer tool like the Wireshark application is used as a preliminary tool. Although there are online tools that can be used to check the information regarding an entered IP address or domain name, this framework requires a function suitable for the direct evaluation of domains and IP addresses in order to examine the source patterns that constitute fake news. Thus, examining the source identifier provides a lead for determining fake news. For instance, if the IP address is constantly changing, the occurrence is known as a DNS hijack and the result will be an invalid IP address. If the IP address is static and correct, the system will report a valid IP address; otherwise, the result will be invalid.
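A minimal Python sketch of this IP verification step might look as follows, assuming that an address which changes between two lookups is flagged as suspicious; this is only a stand-in for the DNS-hijack check, since legitimate round-robin DNS can behave the same way:

import socket
import time

def verify_domain(domain, delay_seconds=2.0):
    # Resolve the domain twice and compare the results.
    try:
        first = socket.gethostbyname(domain)
        time.sleep(delay_seconds)
        second = socket.gethostbyname(domain)
    except socket.gaierror:
        return "invalid: domain does not resolve"
    if first != second:
        # A constantly changing address may indicate DNS hijacking, though
        # this heuristic alone cannot distinguish it from load balancing.
        return "suspicious: address changed from %s to %s" % (first, second)
    return "valid: stable address %s" % first

print(verify_domain("example.com"))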
In the next phase, the proposed detection evaluates the information content that makes up a fake news item from the source. A function is set whose responsibility is to check the source node's content, namely the article, the title of the article, the author of the article, and the background information of the article. This check is appended with the validity guide and guidelines. A dictionary of those guides is provided in a database, and access to the database is open. For each piece of information to be analyzed from a source, a check function against the database should be invoked. If the entry is found in the database, "authentic source" is returned. If it is not found in the database, the system returns "ambiguous" and saves the entry in a different database, where it will be analyzed by the verification team later. Next, the title of the article is analyzed: the topic is automatically searched online. If the title is found, a true value is returned and the output is confirmed to come from a legitimate website. If the title is not found, the title is sent to the verification team for confirmation and a false value is returned.
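A minimal Python sketch of this database-lookup step might look as follows, assuming the "database" is a simple in-memory set of known values rather than the persistent store the framework envisions (all entries are hypothetical):

KNOWN_AUTHORS = {"jane doe", "john smith"}   # hypothetical database entries
PENDING_REVIEW = []                          # entries queued for the verification team

def check_attribute(value, known):
    # Return "authentic" if the value is known; otherwise queue it as ambiguous.
    if value.lower() in known:
        return "authentic"
    PENDING_REVIEW.append(value)             # saved for later manual verification
    return "ambiguous"

print(check_attribute("Jane Doe", KNOWN_AUTHORS))        # authentic
print(check_attribute("Unknown Writer", KNOWN_AUTHORS))  # ambiguous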
The content of the article is manually checked by the validation team, where validation is done by supporting the claims. If the content is right, a true value is returned for the content; otherwise, false is returned. On the article, if the author's name is available, the database is checked. If the author is found in the database, the system returns a true value for the author; if the author's name is not available on the article or not found in the database, false is returned. If the website value is true, the title value is true, the content value is true and the author value is true, the article is verified; otherwise the article is considered ambiguous.
The next aspect of detection is how to determine the status of the news, based on the result of source verification and the validity of the content. If the source verification result is legitimate and the content is valid, the status of the news is "verified". If the source of the news is ambiguous but the content is valid, the news is still considered valid, and its status is "claim". If the source of the news is verified but the content is invalid, the status of the news is labelled "fake" or "claim". If the source of the news is ambiguous and the content is invalid, the status of the news is likewise "fake". Various unforeseen events are possible, but in general a matrix of source node against content of news is the backbone of this proposed conceptual framework.
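A minimal Python sketch of this status matrix might look as follows, simplified so that any invalid content yields "fake" (where the text above allows "fake or claim" for a verified source):

def news_status(source_legitimate, content_valid):
    # Map the source/content matrix onto a status label.
    if content_valid:
        return "verified" if source_legitimate else "claim"
    return "fake"

print(news_status(True, True))    # verified
print(news_status(False, True))   # claim
print(news_status(True, False))   # fake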
Finally, after a news item is labelled either fake or reliable by the status label, a tag is appended on each round displaying the status of what has been requested by the user. These tags are then filtered and the ordering of the news is determined. If the status is verified, the news is labelled "verified" by the tag and placed at the top of the newsfeed of microblogging sites. If the news is considered a claim, it is labelled "claim" and placed below all the verified news. If the news is marked as fake, it is labelled "fake" and placed below the list of claim news.
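A minimal Python sketch of this ordering step, with illustrative field names:

STATUS_RANK = {"verified": 0, "claim": 1, "fake": 2}

feed = [
    {"title": "Story A", "status": "fake"},
    {"title": "Story B", "status": "verified"},
    {"title": "Story C", "status": "claim"},
]

# Sort so that verified items come first, then claims, then fake items.
feed.sort(key=lambda item: STATUS_RANK[item["status"]])
print([item["title"] for item in feed])  # ['Story B', 'Story C', 'Story A']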
1.4 STATEMENT OF THE PROBLEM
• Let a refer to a News Article. It consists of two major components: Publisher and Content. Publisher p_a includes a set of profile features describing the original author, such as name, domain and age, among other attributes. Content c_a consists of a set of attributes that represent the news article, including the headline, text, images, etc.
Definition 2 (Fake News Detection) Given the social news engagements E among n users for news article a, the task of fake news detection is to predict whether the news article a is a fake news piece or not, i.e., to learn a prediction function F : E → {0,1} such that

F(a) = 1 if a is a piece of fake news, and F(a) = 0 otherwise,

where F is the prediction function we want to learn. Note that we define fake news detection as a binary classification problem for the following reason: fake news is essentially a distortion bias on information manipulated by the publisher. According to previous research on media bias theory, distortion bias is usually modeled as a binary classification problem.
Next, we propose a general data mining framework for fake news detection
which includes two phases:
(i) feature extraction and
(ii) model construction. The feature extraction phase aims to represent news content and related auxiliary information in a formal mathematical structure, and the model construction phase further builds machine learning models to better differentiate fake news from real news based on the feature representations.
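As a rough, self-contained illustration of these two phases (not the specific models surveyed here), the following Python sketch pairs TF-IDF feature extraction with a logistic regression classifier; the four training texts and their labels are invented:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Shocking! Celebrity secretly endorses miracle cure",  # invented fake example
    "Council publishes minutes of budget meeting",          # invented real example
    "You won't believe what this politician did next",      # invented fake example
    "University releases annual enrollment statistics",     # invented real example
]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real, matching F : E -> {0, 1}

# Phase (i): feature extraction (TF-IDF); phase (ii): model construction.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Miracle cure endorsed by shocked celebrity"]))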
Feature Extraction:
Fake news detection on traditional news media mainly relies on news content, while on social media, extra social context can be used as auxiliary information to help detect fake news. Thus, we will present the details of how to extract and represent useful features from news content and social context.
• Headline: Short title text that aims to catch the attention of readers and describes the main topic of the article.
• Body Text: Main text that elaborates the details of the news story; there is usually a major claim that is specifically highlighted and that shapes the angle of the publisher.
• Image/Video: Part of the body content of a news article that provides visual cues to frame the story.
Based on these raw content attributes, different kinds of feature representations can be built to extract discriminative characteristics of fake news. Typically, the news content we are looking at will mostly be linguistic-based and visual-based, described in more detail below.
Linguistic-based: Since fake news pieces are intentionally created for financial or political gain rather than to report objective claims, they often contain opinionated and inflammatory language, crafted as “clickbait” (i.e., to entice users to click on the link to read the full article) or to incite confusion. Thus, it is reasonable to exploit linguistic features that capture the different writing styles and sensational headlines to detect fake news. Linguistic-based features are extracted
from the text content in terms of document organization at different levels, such as characters, words, sentences, and documents. In order to capture the different
aspects of fake news and real news, existing work utilized both common linguistic
features and domain-specific linguistic features. Common linguistic features are
often used to represent documents for various tasks in natural language processing.
Typical common linguistic features are:
(i) lexical features, including character-level and word-level features, such as total words, characters per word, frequency of large words, and unique words;
(ii) syntactic features, including sentence-level features, such as the frequency of function words and phrases or punctuation, and parts-of-speech (POS) tagging.
Domain-specific linguistic features are specifically aligned to the news domain, such as quoted words, external links, number of graphs, and the average length of graphs. Moreover, other features can be specifically designed to capture the deceptive cues in writing styles to differentiate fake news, such as lying-detection features.
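As a small illustration, the following Python sketch computes a few of the lexical features listed above; the six-character threshold for a "large" word is an arbitrary assumption:

def lexical_features(text):
    words = text.split()
    total = len(words)
    return {
        "total_words": total,
        "chars_per_word": sum(len(w) for w in words) / total,
        "large_word_freq": sum(len(w) > 6 for w in words) / total,
        "unique_words": len({w.lower() for w in words}),
    }

print(lexical_features("You will never believe this absolutely shocking story"))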
based on various user-level and tweet-level hand-crafted features using a classification framework. Recently, various visual and statistical features have been extracted for news verification. Visual features include clarity score, coherence
score, similarity distribution histogram, diversity score, and clustering score.
Statistical features include count, image ratio, multi-image ratio, hot image ratio,
long image ratio, etc.
Group level features capture overall characteristics of groups of users related to the news. The
assumption is that the spreaders of fake news and real news may form different
communities with unique characteristics that can be depicted by group level
features. Commonly used group level features come from aggregating (e.g., averaging and weighting) individual level features, such as ‘percentage of verified users’ and ‘average number of followers’.
Post-based: People express their emotions or opinions towards fake news through social media posts, such as skeptical opinions, sensational reactions, etc. Thus, it is reasonable to extract post-based features to help find potential fake news via reactions from the general public as expressed in posts. Post-based features focus on identifying useful information to infer the veracity of news from various aspects of relevant social media posts. These features can be categorized as post level, group level, and temporal level. Post level features generate feature values for each post. The aforementioned linguistic-based features and some embedding approaches for news content can also be applied for each post. Specifically, there are unique features for posts that represent the social response from the general public, such as stance, topic, and credibility. Stance features (or viewpoints) indicate the users’ opinions towards the news, such as supporting, denying, etc. Topic features can be extracted using topic models, such as latent Dirichlet allocation (LDA). Credibility features for posts assess the degree of
reliability. Group level features aim to aggregate the feature values for all relevant posts for specific news articles by using the “wisdom of crowds”. For example, the average credibility scores are used to evaluate the credibility of news. A more comprehensive list of group-level post features can also be found in the literature. Temporal level features consider the temporal variations of post level feature values. Unsupervised embedding methods, such as recurrent neural networks (RNN), are utilized to capture the changes in posts over time. Based on the shape of this time series for various metrics of relevant posts (e.g., number of posts), mathematical features can be computed, such as SpikeM parameters.
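As a small illustration of topic features, the following Python sketch fits LDA over a handful of invented posts using scikit-learn; the choice of two topics is arbitrary:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "this story is fake do not share it",
    "sharing this amazing news with everyone",
    "sources deny the claim in this article",
]

counts = CountVectorizer().fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is one post's distribution over two latent topics, usable
# directly as post level topic features.
topic_features = lda.fit_transform(counts)
print(topic_features)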
Network-based: Users form different networks on social media in terms of
interests, topics, and relations. As mentioned before, fake news dissemination
processes tend to form an echo chamber cycle, highlighting the value of extracting
network-based features to represent these types of network patterns for fake news
detection. Network-based features are extracted via constructing specific networks
among the users who published related social media posts. Different types of
networks can be constructed. The stance network can be built with nodes indicating all the tweets relevant to the news and the edges indicating the weights of similarity of stances. Another type of network is the co-occurrence network, which is built based
on the user engagements by counting whether those users write posts relevant to the
same news articles. In addition, the friendship network indicates the
following/followees structure of users who post related tweets. An extension of this
friendship network is the diffusion network, which tracks the trajectory of the spread
of news, where nodes represent the users and edges represent the information
diffusion paths among them. That is, a diffusion path between two users u_i and u_j exists if and only if (1) u_j follows u_i, and (2) u_j posts about a given news article only after u_i does so. After these networks are properly built, existing network metrics
can be applied as feature representations. For example, degree and clustering
coefficient have been used to characterize the diffusion network and friendship
network. Other approaches learn the latent node embedding features by using
SVD or network propagation algorithms.
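As a small illustration, the following Python sketch builds a toy diffusion network with the networkx library and computes the degree and clustering coefficient mentioned above; the users and edges are invented:

import networkx as nx

g = nx.DiGraph()
# Edge u -> v means: v follows u and posted the story only after u did.
g.add_edges_from([("alice", "bob"), ("alice", "carol"), ("bob", "carol")])

degrees = dict(g.degree())                     # per-user degree
clustering = nx.clustering(g.to_undirected())  # per-user clustering coefficient

print(degrees)
print(clustering)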
Model Construction
In the previous section, we introduced features extracted from different sources,
i.e., news content and social context, for fake news detection. In this section, we
discuss the details of the model construction process for several existing approaches.
Specifically we categorize existing methods based on their main input sources as:
News Content Models and Social Context Models.
• Expert-oriented fact-checking heavily relies on human domain experts to investigate relevant data and documents to construct the verdicts of claim veracity, for example PolitiFact, Snopes, etc. However, expert-oriented fact-checking is an intellectually demanding and time-consuming process, which limits the potential for high efficiency and scalability.
• Computational-oriented fact-checking aims to provide an automatic, scalable system to classify true and false claims. It addresses two major issues: (i) identifying check-worthy claims and
(ii) discriminating the veracity of fact claims. To identify check-worthy claims,
factual claims in news content are extracted that convey key statements and
viewpoints, facilitating the subsequent fact-checking process. Fact-checking for
specific claims largely relies on external resources to determine the truthfulness
of a particular claim. Two typical external sources include the open web and
structured knowledge graph. Open web sources are utilized as references that
can be compared with given claims in terms of both the consistency and
frequency. Knowledge graphs are integrated from the linked open data as a
structured network topology, such as DB- pedia and Google Relation
Extraction Corpus. Fact- checking using a knowledge graph aims to check
whether the claims in news content can be inferred from existing facts in the
knowledge graph.
and grandparent rules. Rhetorical structure theory can be utilized to capture the
differences between deceptive and truthful sentences. Deep network models, such as
convolutional neural networks (CNN), have also been applied to classify fake news
veracity.
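As a rough sketch of such a deep model (not a reproduction of any published architecture), the following Keras example builds a one-dimensional convolutional text classifier over tokenized articles; the vocabulary size, sequence length, layer sizes, and random dummy batch are all illustrative:

import numpy as np
from tensorflow.keras import layers, models

VOCAB, SEQ_LEN = 5000, 100

model = models.Sequential([
    layers.Embedding(VOCAB, 64),              # token ids -> dense vectors
    layers.Conv1D(128, 5, activation="relu"), # filters act like learned n-grams
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),    # outputs P(fake)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy batch: 8 "articles" of 100 token ids each, with random 0/1 labels.
x = np.random.randint(0, VOCAB, size=(8, SEQ_LEN))
y = np.random.randint(0, 2, size=(8,))
model.fit(x, y, epochs=1, verbose=0)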
Stance-based approaches utilize users’ viewpoints from relevant posts to infer the veracity of original news articles. Stances can be expressed explicitly or implicitly: explicit stances are direct expressions of emotion or opinion, such as the “thumbs up” and “thumbs down” reactions expressed in Facebook. Implicit stances can be automatically
extracted from social media posts. Stance detection is the task of automatically
determining from a post whether the user is in favor of, neutral toward, or against
some target entity, event, or idea. Previous stance classification methods mainly
rely on hand-crafted linguistic or embedding features of individual posts to predict stances. Topic model methods, such as latent Dirichlet allocation (LDA), can be applied to learn latent stances from topics. Using these methods, we can infer the news veracity based on the stance values of relevant posts. Tacchini et al. proposed to construct a bipartite network of users and Facebook posts using the “like” stance information; based on this network, a semi-supervised probabilistic
model was used to predict the likelihood of Facebook posts being hoaxes. Jin et
al. explored topic models to learn latent viewpoint values and further exploited
these viewpoints to learn the credibility of relevant posts and news content.
1.5 SCOPE AND LIMITATIONS OF THE FAKE NEWS DETECTION
The U.S. government is currently developing programs that detect fake news and
false information.
These fake media outlets use visuals to show false information from the Internet. Whether this specific program was developed due to Russian interference in the 2016 Presidential Elections is unclear, but it is a great starting point by the government in the fight to counter the dissemination of false information.
Political Polarization:
The foundation of any democracy contains the freedoms of speech, protest, and
right to vote. Governments around the world have plans to address the fake news issue, but
to varying degrees. One question comes to mind: does a law to discourage or halt fake news
truly promote democracy?
This will vary on a case-by-case basis, but authoritarian countries (such as China and Russia) jail dissidents whom they claim are spreading false information online. This is
often not the case and is only being used as justification to jail political critics and
journalists. The Chinese model of media censorship limits the freedoms of speech in the
public sphere on social media.
Western nations and institutions, such as the European Union, United Kingdom, and
the United States have enacted laws and regulations. They are attempting to remove social
media posts and accounts of individuals who disseminate fake news and conspiracies.
Many groups interpret this as an attempt to suppress free speech. The fine line is
being drawn on a country-by-country basis. In Europe, large social media companies such as Facebook and Twitter may face fines and tougher regulations if they disregard or fail to remove posts and accounts that are deemed “harmful and/or illegal”.
The European Union voted in favor of Articles 11 and 13. To sum up the articles,
they will call for companies like Google “to pay media companies a so-called “link tax”
when sharing their content.” Article 13 wants social media platforms to monitor content
uploaded to posts “ahead of their publication by using automated software that would
detect and filter out intellectual property violations” (Quartz).
The fine line between regulating fake news while allowing the freedom of speech is
challenging. This challenge is something governments worldwide will have to figure out
and come to terms with soon enough.
Cybersecurity threats are not always the conventional, short-term attacks such as the hacking of voter systems, banks, or government entities; they are now also long-term plays aimed at undermining democracy and deepening the political polarization of countries.
1. Distil Networks promises to help their clients fight the bad bots and gain
visibility over web-based traffic. Distil Networks was founded in April 2011 by Rami
Essaid, Engin Akyol and Andrew Stein. They claim to be the pioneers of ‘bot mitigation’.
They can identify a bot’s source with the help of a technology called ‘device
fingerprinting’.
2. PressCoin is a platform that offers trustworthy news, while the people using the
platform can earn PressCoins.
3. Digital Shadows, a cybersecurity startup, helps fight fake news by monitoring
activity on the dark web. The company manages and remediates digital risk across data
sources within the open, deep, and dark web to protect an organization’s business, brand,
and reputation.
4. Another website that works on setting the record straight is AltNews. Founded by
Pratik Sinha and the anonymous ‘Unofficial Sususwamy’, AltNews busts propaganda and
misinformation.
5. PerimeterX provides protection against automated attacks by detecting malicious
web behavior. It uses human behavior analysis as well as analysis of applications and
networks to catch automated attacks in real-time. The company was founded in 2014 by
CEO Omri Iluz.
6. Indian startup SM Hoax Slayer began in 2015 as a Facebook page. Founded by
Pankaj Jain, the website now deals with fake news in any form, be it, religious, political, or
communal.
7. Headed by CEO Dhruv Ghulati, Factmata, is a startup that calls itself a ‘fact
checking community’. It uses AI to help journalists and fact checkers detect, verify and fact
check media information in close to real time. They also help advertisers avoid placing
advertising on fake news, hate speech, and extremist content.
8. An ad-tech startup called Storyzy, launched in 2012, helps verify quotes that are attributed to public figures or celebrities using Natural Language Processing (NLP) in real time.
9. Userfeeds is using blockchain technology to protect news. The idea is that if every piece of information is swaddled in encrypted protection, its integrity can be vouched for.
10. Crisp Thinking, which started in 2005, provides services that protect the social media reputations of companies, and also protects children and teens from cyberbullying and inappropriate content.
11. Check4Spam was founded by Shammas Oliyath and Bal Krishn Birla in 2015,
to help bust fake news and hoaxes.
12. Rappler, a Philippines-based startup, has won the 2017 Democracy Award from
the National Democratic Institute for its journalistic efforts in curbing the spread of fake
news. The name, ‘Rappler’, is an amalgamation of ‘rap’ (discuss) and ‘ripple’ (create
waves).
The ‘Postcard News’ incident is a wake-up call for the millions of social media
users who blindly believe every forward and every morphed photograph. However, awareness among the public is being spread by the above-mentioned startups and
companies. Given time, better technology, and funding, a whole army of startups could be
out there fighting fake news.
Lately the fact-checking world has been in a bit of a crisis. Sites like Politifact and
Snopes have traditionally focused on specific claims, which is admirable but tedious; by the
time they’ve gotten through verifying or debunking a fact, there’s a good chance it’s
already traveled across the globe and back again.
Social media companies have also had mixed results limiting the spread of
propaganda and misinformation. Facebook plans to have 20,000 human moderators by the end of the year, and is putting significant resources into developing its own fake-news-detecting algorithms.
Researchers from MIT’s Computer Science and Artificial Intelligence Lab (CSAIL)
and the Qatar Computing Research Institute (QCRI) believe that the best approach is to
focus not only on individual claims, but on the news sources themselves. Using this tack,
they’ve demonstrated a new system that uses machine learning to determine if a source is
accurate or politically biased.
“If a website has published fake news before, there’s a good chance they’ll do it
again,” says postdoc Ramy Baly, the lead author on a new paper about the system. “By
automatically scraping data about these sites, the hope is that our system can help figure out
which ones are likely to do it in the first place.”
Baly says the system needs only about 150 articles to reliably detect if a news
source can be trusted — meaning that an approach like theirs could be used to help stamp
out new fake-news outlets before the stories spread too widely.
The system is a collaboration between computer scientists at MIT CSAIL and
QCRI, which is part of the Hamad Bin Khalifa University in Qatar. Researchers first took
data from Media Bias/Fact Check (MBFC), a website with human fact-checkers who
analyze the accuracy and biases of more than 2,000 news sites; from MSNBC and Fox
News; and from low-traffic content farms.
They then fed those data to a machine learning algorithm, and programmed it to
classify news sites the same way as MBFC. When given a new news outlet, the system was
then 65 percent accurate at detecting whether it has a high, low or medium level of
factuality, and roughly 70 percent accurate at detecting if it is left-leaning, right-leaning, or
moderate.
The team determined that the most reliable ways to detect both fake news and
biased reporting were to look at the common linguistic features across the source’s stories,
including sentiment, complexity, and structure.
For example, fake-news outlets were found to be more likely to use language that is
hyperbolic, subjective, and emotional. In terms of bias, left-leaning outlets were more likely
to have language that related to concepts of harm/care and fairness/reciprocity, compared to
other qualities such as loyalty, authority, and sanctity. (These qualities represent a popular
theory — that there are five major moral foundations — in social psychology.)
Co-author Preslav Nakov, a senior scientist at QCRI, says that the system also found correlations with an outlet’s Wikipedia page, which it assessed for general length (longer is more credible) as well as for target words such as “extreme” or “conspiracy theory.” It
even found correlations with the text structure of a source’s URLs: Those that had lots of
special characters and complicated subdirectories, for example, were associated with less
reliable sources.
“Since it is much easier to obtain ground truth on sources [than on articles], this
method is able to provide direct and accurate predictions regarding the type of content distributed by these sources,” says Sibel Adali, a professor of computer science at
Rensselaer Polytechnic Institute who was not involved in the project.
Nakov is quick to caution that the system is still a work in progress, and that, even with improvements in accuracy, it would work best in conjunction with traditional fact-checkers.
“If outlets report differently on a particular topic, a site like Politifact could instantly
look at our fake news scores for those outlets to determine how much validity to give to
different perspectives,” says Nakov.
Baly and Nakov co-wrote the new paper with MIT Senior Research Scientist James
Glass alongside graduate students Dimitar Alexandrov and Georgi Karadzhov of Sofia
University. The team will present the work later this month at the 2018 Empirical Methods
in Natural Language Processing (EMNLP) conference in Brussels, Belgium.
The researchers also created a new open-source dataset of more than 1,000 news
sources, annotated with factuality and bias scores, that is the world’s largest database of its
kind. As next steps, the team will be exploring whether the English-trained system can be
adapted to other languages, as well as to go beyond the traditional left/right bias to explore
region-specific biases (like the Muslim world’s division between religious and secular).
“This direction of research can shed light on what untrustworthy websites look like
and the kind of content they tend to share, which would be very useful for both web
designers and the wider public,” says Andreas Vlachos, a senior lecturer at the University
of Cambridge who was not involved in the project.
Nakov says that QCRI also has plans to roll out an app that helps users step out of
their political bubbles, responding to specific news items by offering users a collection of
articles that span the political spectrum.
“It’s interesting to think about new ways to present the news to people,” says
Nakov. “Tools like this could help people give a bit more thought to issues and explore
other perspectives that they might not have otherwise considered."
Fake news – or we can also just call it lies – is certainly not a phenomenon peculiar
to the modern, digital world. Nor is using it to achieve ideological, political or economic
aims anything new. Nevertheless, the issue of fake news is at the centre of public debate –
at the latest since the election of the current President of the USA, Donald Trump – and it is
associated directly with digitalisation and the social media. Social media are usually the
first channels that rumours, lies and fake news appear on, from which they are extensively
distributed and from which they find their way into public debate and awareness. You will
no doubt have heard how, in 2015, Chancellor Angela Merkel had her photo taken with the Brussels assassin, whom she had let into the country as a refugee. This is, of course, an absurd piece of fake news. But everyone who saw it remembers the accompanying photo. Worse still, in many cases the photo brings back the memory of the fake news item even in entirely unrelated contexts.
The strategy behind this story is nothing new. Much of it may even, with a great
deal of effort, have been possible 200 years ago. What is new is that the organization,
technology and manpower needed are much less today. And it is precisely this that makes
fake news seem to us to be so dangerous in the “Digital Age”.
The openness and anonymity provided by social networks make possible a great
amount of diversity and freedom of opinion, as well as protection wherever the free
expression of opinion in an “analogue” world is dangerous. But, to the same extent, they
offer opportunities for abuse. The abstract nature of a simple user account and, at the same
time, the availability of programming interfaces with social networks make it possible to
spread and duplicate all types of content in huge amounts – sometimes even automatically.
Unlike the (likewise automatic) distribution of spam mails, the aim is not to reach as many, largely uninterested, users as possible by means of massive replication. The metadata on users which are freely available on social networks, as well as the networking of interest groups, permit content to
be distributed in a highly targeted way. If a (usually ideologically motivated) piece of fake
news is placed in a suitable environment, it is often forwarded without being checked, or
even deliberately.
The path taken by fake news does not, however, end there. It really becomes
important when it makes the jump from social media to those media working with editors
and which are often considered to be trustworthy and have a large reach. This jump
succeeds because topics from social media increasingly serve as triggers for journalistic
stories. And it is not rare for social media communities to be seen as a group representing society.
The question remains: can we not protect ourselves from fake news by using
technology? An automatic recognition of fake news would mean being able to state
mechanically whether the content of a piece of news is true or false. At present, there is no
method of reliably doing this – nor is there any in sight. Not for nothing does Facebook
deploy a host of checkers to detect fake news. Demonstrating the existence of so-called
social bots is not always effective. Even when automated profiles are discovered, it is not as
a rule clear whether they are part of a campaign or simply utility software. And in any case,
not all campaigns are carried out in a fully automated way.
One universally applicable approach for identifying automation and fake news
would appear to be the detection of campaigns themselves. If the existence of these is
proven, then both content and players can be easily extracted and checked. Currently, this approach has not been researched to any great extent, and it remains a highly interesting open issue for research.
Using social media as a medium for news updates is a double-edged sword. On one
hand, social media provides for easy access, little to no cost, and the spread of information
at an impressive rate (Shu, Sliva, Wang, Tang, & Liu, 2017). However, on the other hand,
social media provides the ideal place for the creation and spread of fake news. Fake news can become extremely influential and has the ability to spread exceedingly fast. With the
increase of people using social media, they are being exposed to new information and
stories every day. Misinformation can be difficult to correct and may have lasting
implications. For example, people can base their reasoning on what they are exposed to
either intentionally or subconsciously, and if the information they are viewing is not
accurate, then they are establishing their logic on lies. In addition, since false information is
able to spread so fast, not only does it have the ability to harm people, but it can also be
detrimental to huge corporations and even the stock market. For instance, in October of
2008, a journalist posted a false report that Steve Jobs had a heart attack. This report was
posted through CNN’s iReport.com, which is an unedited and unfiltered site, and
immediately people retweeted the fake news report. There was much confusion and
uncertainty because of how widespread it became in such a short amount of time. The stock of Jobs’ company, Apple Inc., fluctuated dramatically that day due to one false news report that had been mistaken for authentic news reporting.
However, the biggest reason why false information is able to thrive continuously is
that humans fall victim to Truth-Bias, Naïve Realism, and Confirmation Bias. When
referring to people being naturally “truth-biased” this means that they have “the
presumption of truth” in social interactions, and “the tendency to judge an interpersonal
message as truthful, and this assumption is possibly revised only if something in the
situation evokes suspicion” (Rubin, 2017). Basically, humans are very poor lie detectors and often fail to realize that there is a possibility they are being lied to. Users of
social media tend to be unaware that there are posts, tweets, articles or other written
documents that have the sole purpose of shaping the beliefs of others in order to influence
their decisions. Information manipulation is not a well-understood topic and generally not
on anyone’s mind, especially when fake news is being shared by a friend. Users tend to let
their guard down on social media and potentially absorb all the false information as if it
were the truth. This is also even more detrimental considering how young users tend to rely
on social media to inform them of politics, important events, and breaking news (Rubin,
2017). For instance, “Sixty-two percent of U.S. adults get news on social media in 2016, while in 2012, only forty-nine percent reported seeing news on social media,” which demonstrates how more and more people are becoming tech savvy and relying on social media to keep them updated (Shu et al., 2017). In addition, people tend to believe that their own views on life are the only ones that are correct, and if others disagree then those people are labeled as “uninformed, irrational, or biased,” otherwise known as Naïve Realism.
This leads to the problem of confirmation bias, the notion that people favor information that verifies their current views. Consumers only want to hear what they already believe and do not want to find any evidence against their views. For instance, a strong believer in unrestricted gun ownership may want to use any information they come across to support and justify their beliefs further, whether that is random articles from non-credible sites, posts from friends, re-shared tweets, or anything online that agrees with their principles. Consumers do not wish to find anything that contradicts what they believe because that is simply not how humans function. People cannot help favoring what they like to hear; they have a predisposition for confirmation bias. Only those who strive for certain academic standards may be able to avoid or limit such bias, while the average person, unaware of false information to begin with, will not be able to fight these unintentional urges. In addition, fake news is harmful not only to individuals but also to society in the long run. With all this false information floating around, fake news is capable of ruining the "balance of the news ecosystem" (Shu et al., 2017). For instance, in the 2016 Presidential Election, the "most popular fake news was even more widely spread on Facebook" than the "most popular authentic mainstream news" (Shu et al., 2017), which demonstrates how users may pay more attention to manipulated information than to authentic facts. This is a problem not only because fake news "persuades consumers to accept biased or false beliefs" in order to communicate a manipulator's agenda and gain influence, but also because fake news changes how consumers react to real news (Shu et al., 2017). People who engage in information manipulation want to cause confusion, so that a person's ability to separate the true from the false is further impeded. This, along with influence, political agendas, and manipulation, is one of the many motives for generating fake news.
1.6.1 CONTRIBUTORS OF FAKE NEWS
While many social media users are very much real, those who are malicious and out to spread lies may or may not be real people. There are three main types of fake news contributors: social bots, trolls, and cyborg users (Shu et al., 2017). Since the cost of creating social media accounts is very low, there is little to discourage the creation of malicious accounts. A social media account controlled by a computer algorithm is referred to as a social bot. A social bot can automatically generate content and even interact with social media users. Social bots are not necessarily harmful; it depends entirely on how they are programmed. If a social bot is designed with the sole purpose of causing harm, such as spreading fake news on social media, it can be a very malicious entity and contribute greatly to the spread of fake news. For example, "studies show that social bots distorted the 2016 US presidential election discussions on a large scale, and around 19 million bot accounts tweeted in support of either Trump or Clinton in the week leading up to the election day," which demonstrates how influential social bots can be on social media.
However, fake humans are not the only contributors to the dissemination of false information; real humans are very much active in the domain of fake news. As the name implies, trolls are real humans who "aim to disrupt online communities" in hopes of provoking social media users into an emotional response (Shu et al., 2017). For instance, there is evidence that "1,000 Russian trolls were paid to spread fake news on Hillary Clinton," which reveals how actual people perform information manipulation in order to change the views of others (Shu et al., 2017). The main goal of trolling is to stir up negative feelings in social media users, such as fear and even anger, so that users develop strong emotions of doubt and distrust (Shu et al., 2017). When users have doubt and distrust in their minds, they won't know what to believe and may start doubting the truth and believing the lies instead.
While contributors of fake news can be either real or fake, what happens when they are a blend of both? Cyborg users combine "automated activities with human input" (Shu et al., 2017). These accounts are typically registered by real humans as a cover but use programs to perform activities on social media. What makes cyborg users even more powerful is that they can switch "functionalities between human and bot," which gives them a great opportunity to spread false information.
Now that we know some of the reasons why and how fake news spreads, it is worth discussing methods for detecting online deception in text-based formats, such as e-mails. The two main categories for detecting false information are the Linguistic Cue and Network Analysis approaches; a small illustration of the former is sketched below.
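Linguistic Cue approaches look for stylistic signals of deception in the text itself. As a purely illustrative sketch (not drawn from any of the studies cited here), the following Python function counts a few such cues, exclamation marks, all-caps words and first-person pronouns, that could feed a downstream classifier; the particular cue set is an assumption chosen for demonstration.

import re

def linguistic_cues(text):
    # Return a dictionary of simple linguistic-cue features for `text`.
    words = re.findall(r"[A-Za-z']+", text)
    first_person = {'i', 'me', 'my', 'mine', 'we', 'us', 'our'}
    return {
        'num_words': len(words),
        'num_exclamations': text.count('!'),
        'num_all_caps': sum(1 for w in words if w.isupper() and len(w) > 1),
        'first_person_ratio': sum(1 for w in words if w.lower() in first_person)
                              / max(len(words), 1),
    }

print(linguistic_cues("SHOCKING!!! You won't believe what I saw..."))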
2. LITERATURE REVIEW
Data Mining techniques have been used in past to explore and analyze the data in
order to find better business ways in an organization. The huge amount of unexplored data
is freely available on web and data mining (DM) techniques have been applied to extract
some hidden useful information which may be useful to enhance the business of an
organization. There is literature available that supports this fact that DM techniques have
been used in past to develop new business opportunities. There are various applications of
DM techniques, sentiment analysis and opinion mining is one of them which can be applied
on un-structured data. Sentiment analysis or Opinion Mining is a deterministic technique
for classifying and evaluating other people's opinions. Now a day's people builds their
perception and make decisions by analyzing the facts and reviews of other people either
manually or computationally. Since everything is online now a day's , hence internet has
become an integrated part of human lives and is thus used for exchanging all aspects of
human life viz. sentiments, emotions, affection, support, opinions, trade, business etc. With
the onset of social media there has been numerous platform such as blogs, discussion
forums, reviews and social networks where an individual can post his or her reviews,
feedbacks and list their likes and dislikes for a product's attributes or features or
comparison of different products (same or different feature). These reviews are
gathered and are analyzed to evaluate the overall orientation of the collected reviews. This
chapter focuses the past work done related to sentiment analysis and opinion mining. We
have presented the outcome of research papers which have shown the application of
machine learning techniques on online reviews. This chapter also discusses the research
papers in which methods and techniques used for gathering and analyzing the reviews,
extracting the phrases based on the Subjectivity and thereafter some work is also discussed
for calculating the semantic orientation of the collected reviews. Sentiment analysis is the
part of Subjectivity analysis which is also very popular by the name Opinion mining.
Opinion mining is mainly concerned with analyzing natural-language expressions of individuals' opinions about a product, or about any other area where public opinion or reviews matter most. Subjectivity analysis aims at determining the attitude of the writer or author of an opinion with respect to some topic, product or service, or the overall contextual polarity or tonality of a document or review. The attitude may involve the user's experience, evaluation, judgment, emotional state or intended emotional effect. It is a natural language processing and information extraction task that identifies the writer's feelings and experiences expressed in positive and negative comments, questions and requests, by analyzing the massive amount of information available on the web. The major force behind the emergence of opinion mining today is the exponential increase in internet usage and in the exchange of public views and opinions. It has been observed that some opinion classification is topic based, where documents are classified into predefined topic classes, e.g. science, sports, entertainment, politics. Topic-related words are important in topic-based classification.
Subjectivity Analysis
The investigation of online product reviews has received a great deal of attention from researchers, and recently a growing number of studies have explored different aspects of product reviews. The empirical findings of various studies on sentiment analysis of product reviews show that detailed analysis of these reviews is highly beneficial from both the customer's and the organization's point of view. An early study of online word-of-mouth data investigated the effect of negative customer reviews. The results demonstrate that negative online word of mouth hurts perceived retailer reliability and consumer purchase intentions, and that this negative effect depends on familiarity with the retailer: customers who are less acquainted with a retailer are more likely to be influenced by negative reviews. The study also suggests that the extent of word-of-mouth search depends on the customer's reasons for choosing an online vendor. Another study analyzed reviews collected on books from the Amazon website and observed that customer feedback influenced other consumers toward better deals. (Sorensen and Rasmussen, 2004) did similar work, studying the effect of New York Times reviews of different fiction titles; they concluded that consumer feedback matters a great deal to the purchaser's mind, with positive reviews influencing a large segment of customers. The impact of online product reviews on the relative sales of two online bookshops has also been investigated.
(Sen and Lerman, 2007) found that the polarity or orientation of customer reviews influences the customer mindset; such reviews have a significant impact. Similar research was conducted by Li and Hitt (2008); there has been substantial progress in research on online reviews and their effect on buyers' behavioral intentions, attitudes and purchase decision-making processes. Their study proposed a model to analyze the particular preferences of early buyers and their effect on long-term customer purchase behavior, as well as on the social welfare created by review systems. The influence of negative and positive electronic word of mouth for different product types (search versus experience goods) was likewise examined by Park and Lee (2007). The results showed that the electronic word-of-mouth effect was greater for negative messages than for positive ones. The experiments also reported that established websites showed a stronger electronic word-of-mouth effect, and that the effect was greater for experience goods than for search goods (Park and Lee, 2007).
It was observed by (Lee and Bradlow, 2011) that online reviews have also been used for automated market research to support the analysis and visualization of market structure. Their study proposes that market structure analysis can be performed by automatically eliciting product attributes from online consumer reviews; this kind of analysis can facilitate the study of product substitutes and complements. The impact of third-party product reviews on the financial value of firms introducing new products was also studied in later research (Chen, Liu, and Zhang, 2012). The results suggested that such reviews play a significant role in influencing firm value, as investors update their expectations about a new product's sales potential (Chen et al., 2012). Some recent studies have further proposed that analyzing product reviews at different granularity levels can reveal product attribute strengths and weaknesses (Zhang, Xu, and Wan, 2012), which in turn can explain the particular preferences of each customer (Wei, Chen, Yang, and Yang, 2010).
(Pang & Lee, 2002) studied sentiment classification in the domain of film reviews, performing it without prior experience, domain knowledge or hand-built training cues. (Pang et al., 2004) used a very simple machine learning setup in their paper, employing it to learn the performance of words chosen by subject. This work was carried further by a pair of computer science students, who systematically tested the prevailing practice of using human-selected seed words for sentiment analysis, such as that used by Turney in the same year. The two students proposed lists of favorable and unfavorable words, respectively, which were then vetted against a list derived from the data itself and basic statistics. The resulting list proved considerably more consistent than those built from human-recommended seed words. Although the study was limited in scale, it suggested that automatic supervised feature selection can produce features better than those generated by human intuition. (Pang et al., 2008) then exploited this insight to produce a very simple, fully supervised opinion classifier using machine learning.
Pang and Lee (2008) wrote a book that offers a complete overview of the research in this area. Pang et al. (2002) performed initial polarity classification of reviews using supervised approaches. The techniques explored were Support Vector Machines (SVMs), Naive Bayes and Maximum Entropy; the study used datasets with different feature sets, for instance unigrams, bigrams, binary and term-frequency feature weights, among others. They observed that sentiment classification is harder than regular topic-based classification, and concluded that an SVM classifier with binary unigram-based features generates the best output. A subsequent advance was the identification and removal of the neutral portions of documents and the application of a polarity classifier to the remainder (Pang and Lee, 2004). This exploited text coherence: contiguous text spans were expected to belong to the same subjectivity or objectivity class. Documents were portrayed as graphs with sentences as nodes and association scores between them as edges. Two additional nodes characterized the subjective and objective classes. The weights between nodes were derived using three different empirical decaying functions. Finding a partition that minimizes the cost function separates the objective from the subjective sentences. They reported a statistically significant improvement over a Naive Bayes baseline using the full text, though with only a very small gain compared to using an SVM classifier on the whole document (Pang, B. and L. Lee, 2008).
Mullen and Collier (2004) used SVMs and extended the feature set for representing documents with favorability measures from a range of sources. They introduced features based on Osgood's theory of semantic differentiation (Osgood, 1967), using WordNet to derive the potency, activity and evaluative values of adjectives, together with Turney's semantic orientation (Turney, 2002). Their results showed that a hybrid SVM classifier, which uses as features the distance of documents from the separating hyperplane, together with all the stated features, yields the best outcomes (Mullen, T. and N. Collier, 2004).
Whitelaw et al. (2005) contributed fine-grained semantic distinctions to the feature set. Their approach was built on a lexicon formed in a semi-supervised fashion and then manually fine-tuned. It consisted of 1,329 adjectives and their modifiers, categorized under several taxonomies of appraisal attributes based on Martin and White's Appraisal Theory (2005). They combined appraisal groups with unigram-based document representations as features for a Support Vector Machine classifier (Witten and Frank, 1999), resulting in a substantial increase in accuracy. Lexicon-based procedures rely on a sentiment lexicon, a collection of known, precompiled sentiment terms; major contributions were made by (Popescu and Etzioni, 2005; Scharl and Weichselbraun, 2008; Taboada et al., 2011). Machine learning methodologies make use of syntactic and/or linguistic features (Pak and Paroubek, 2010b; Go et al., 2009; Boiy and Moens, 2009). Hybrid approaches also exist, in which sentiment lexicons play a vital part, e.g. (Diakopoulos et al., 2010). For example, (Moghaddam and Popowich, 2010) establish the polarity of reviews by recognizing the polarity of the adjectives that occur in them, with a reported accuracy about 10% higher than pure machine learning approaches.
However, such relatively effective approaches often fail when moved to different domains or text genres, owing to the context-dependent ambiguity of sentiment terms. A sentiment term might indicate polarity, but there may be insufficient context to compute its semantic orientation, particularly for adjectives in sentiment lexicons (Mullaly et al., 2010). Several evaluations have shown the importance of contextual information (Weichselbraun et al., 2010; Wilson et al., 2009) and have identified context words with a strong influence on the polarity of ambiguous terms (Gindl et al., 2010). For example, the adjective "unpredictable" might have a negative orientation in an automotive review, in an expression such as "unpredictable steering", but a positive orientation in a movie review, in an expression such as "unpredictable plot". Consequently, pairs of consecutive words are extracted, where one member of the pair is an adjective or an adverb and the other provides context.
In the recent past, opinion mining procedures have begun to target various social media, together with a tendency to apply them proactively rather than reactively. Understanding public sentiment can have important consequences for assessing and forecasting upcoming events and trends. One of the most prominent and recognizable applications is review rating: (Peter D. Turney, 2002) found that, on 410 reviews from Epinions, the algorithm achieves an average accuracy of 74%. Movie reviews appear harder to classify, because the whole is not necessarily the sum of the parts; accuracy on movie reviews is therefore around 66%. By contrast, for banks and automobiles it appears that the whole is the sum of the parts, and accuracy is 80% to 84%. Travel reviews are an intermediate case. Another application is stock market prediction: (Bollen and Mao, 2011) found that, contrary to the expectation that a falling stock market would shift the general population's mood toward the negative, a fall in the general population's mood in fact acts as a precursor to a collapse in the stock market.
Practically all work on opinion mining from Twitter has used machine learning techniques. (Pak and Paroubek, 2010b) aimed to classify arbitrary tweets as carrying positive, negative or neutral sentiment, building a simple binary classifier which uses n-gram and POS features, trained on instances that had been annotated according to the presence of positive and negative emoticons.
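As a minimal sketch of an n-gram tweet classifier in this spirit (the tiny dataset and its labels are invented for illustration, and the POS features of the original study are omitted):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled tweets; in the original study, labels came from emoticons.
tweets = ["i love this phone :)", "great game today :)",
          "this traffic is terrible :(", "worst service ever :("]
labels = ["positive", "positive", "negative", "negative"]

# Unigram and bigram counts stand in for the n-gram features.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(tweets, labels)
print(clf.predict(["what a terrible day"]))  # expected: ['negative']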
Opinion mining can be helpful in several ways. For instance, in marketing it aids in gauging the success of an advertising campaign or new product launch, in figuring out which versions of a product or service are popular, and even in identifying which demographics like or dislike specific features. Hence, sentiment classification is beneficial both to prospective consumers (buyers) and to product manufacturers.
Opinion mining can capture not only the voice of prospective customers (which can already be done at some sites), but also the voice of existing customers. For a product vendor, assessing customers' thoughts about its products and those of its competitors, in order to find their weaknesses and strengths, is critical for marketing intelligence and for product benchmarking. This type of work is usually performed manually, which is labor-intensive, tedious and time-consuming, so automated opinion mining is very supportive and comes in handy here.
Applying free-text processing procedures to web-based reviews can uncover patterns and topics that may be important for larger consumer and business groups (Gamon, Aue, Corston-Oliver, and Ringger, 2005). Most relevant research on handling web-based consumer reviews concentrates on sentiment analysis and opinion mining, which aim to discover reviewers' attitudes, whether positive or negative, toward an item as a whole or toward different components of the item.
The basic approach to classifying opinion is to treat the problem as a topic-based text classification problem; any text classification algorithm can then be applied to determine the semantic orientation of the tagged reviews, such as Naive Bayes, SVM or kNN (Yugowati P; Shaou-Gang M; Hui-Ming W, 2013). The orientation can also be determined using a score function. We discuss three main approaches: Naïve Bayes, Support Vector Machines and Maximum Entropy. (Bo Pang et al., 2008) proposed a novel machine learning strategy that applies text categorization procedures to just the subjective part of the document, using the following procedure: (1) label the sentences in the document as either subjective or objective, discarding the latter; and then (2) apply a standard machine learning classifier to the resulting extract. The authors used an efficient and intuitive graph-based formulation relying on finding minimum cuts. Their experiments involve classifying movie reviews as either positive or negative. To gather subjective sentences or phrases, the authors collected 5,000 movie review snippets from www.rottentomatoes.com, and for objective data they drew from the internet movie database (www.imdb.com). Both Naïve Bayes and SVMs can be trained on the subjectivity dataset and then used as a basic subjectivity detector; a simplified sketch of this two-step procedure follows.
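As a greatly simplified sketch of that two-step procedure (the graph min-cut machinery of the original is replaced here by an ordinary sentence-level classifier, and the handful of training sentences is invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Step 1 training data: subjective vs. objective sentences (invented here;
# the original work used Rotten Tomatoes snippets and IMDb plot sentences).
subj_train = ["a stunning, heartfelt performance", "the story is set in 1920s Paris"]
subj_labels = ["subjective", "objective"]
# Step 2 training data: polarity of subjective text (also invented).
pol_train = ["a stunning, heartfelt performance", "a dull, lifeless mess"]
pol_labels = ["positive", "negative"]

subjectivity = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(subj_train, subj_labels)
polarity = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(pol_train, pol_labels)

# Keep only the sentences judged subjective, then classify the extract.
review = ["the film is set in Chicago", "a dull, lifeless script sinks it"]
extract = " ".join(s for s in review if subjectivity.predict([s])[0] == "subjective")
print(polarity.predict([extract]))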
(N. Kobayashi et al., 2004) experimented with machine learning techniques and suggested that proper investigation of useful attributes in reviews can help extract useful information from online reviews. The authors conducted experiments with Japanese web documents, used an SVM as the machine learning classifier, and applied a feature selection method in order to identify the best attributes for dimensionality reduction.
(Xavier et al., 2011) proposed a deep learning approach which learns to extract a meaningful representation for every review in an unsupervised manner. It relies on an algorithm for finding intermediate representations built in a hierarchical fashion. The data was collected in the form of reviews from Amazon. Their analysis demonstrates that linear classifiers trained with this higher-level learnt feature representation of reviews beat the current state of the art.
(Andrew L. Maas et al., 2011) exhibited a model that uses a blend of unsupervised and supervised techniques to learn word vectors capturing semantic term-document information as well as rich sentiment content. The model captures both semantic and sentiment similarities among words. The authors evaluate the model on document-level and sentence-level classification tasks in the domain of online movie reviews. At both levels they compare the model's word representations with several bag-of-words weighting methods and with alternative approaches to inducing word vectors. For the experiments they used the IMDB review dataset, evaluating classifier performance after cross-validating classifier parameters on the training set, using a linear SVM in all cases. Their model showed superior performance to other approaches, and performed best when concatenated with a bag-of-words representation. (Andrew et al., 2011) likewise performed sentence-level subjectivity classification. For this task a classifier is trained to decide whether a given sentence is subjective, expressing the writer's opinions, or objective, expressing purely facts. They used the dataset of (Pang and Lee, 2004), which contains subjective sentences from movie review summaries and objective sentences from movie plot synopses. The authors randomly split the 10,000 examples into 10 folds and report 10-fold cross-validation accuracy using the SVM training protocol of Pang and Lee (2004). They found that their model provided superior features when compared against other SVM baselines. (Gizem et al., 2012) proposed and evaluated new features to be used in a word-polarity-based approach to sentiment classification.
(Wei Wang et al., 2013) proposed a novel hybrid association-rule mining technique for implicit feature identification in Chinese product reviews. The authors first extract candidate feature indicators based on word segmentation, part-of-speech tagging and feature clustering, then compute the co-occurrence degree between the candidate feature indicators and feature words using five collocation extraction algorithms. For the experiments, data was crawled from the Chinese shopping site 360buy.com. The authors designed five rules for implicit feature identification and found that the basic rule performed best among the five.
(S. Saha et al., 2011) show that it is possible to develop a model using multi-objective optimization techniques based on genetic algorithms. For their experiments the authors used BART, a modular toolkit for anaphora resolution that supports state-of-the-art statistical approaches to the task and enables efficient feature engineering. They evaluated their approach on the ACE-02 dataset, which is divided into three subsets: bnews, npaper, and nwire. The authors claim that optimizing according to multiple metrics simultaneously may give better results with respect to each individual metric than optimizing according to that metric alone.
number of reviewers and reviews of manufactured products from Amazon.com. The results show that the proposed ranking method is effective and that its rankings replicate people's perceptions of spam and non-spam. (M. Ott et al.) developed and tested three methodologies for identifying deceptive opinion spam: genre identification, psycholinguistic deception detection and text categorization, and built a classifier to detect opinion spam. The researchers extracted all 6,977 reviews from the 20 most popular Chicago hotels on TripAdvisor and compared the three automated methodologies for identifying deceptive opinion spam. The authors used SVM-light to train linear SVM models on all three approaches and found that the automated classifiers outperform human judges on every metric except truthful recall, where JUDGE 2 performs best. The authors achieve nearly 90% accuracy on a gold-standard opinion spam dataset.
(Zheng, L., et al., 2014) presented in their work the significance of customer attitude toward products. They presented a multidimensional approach to sentiment analysis, used a sentiment lexicon approach, and also removed word ambiguity across the different dimensions. They proposed a new algorithm and conducted experiments on a very large dataset consisting of approximately 28 million reviews. (Zhang, Yongfeng, et al., 2014) made use of an Explicit Factor Model (EFM) for a recommendation system and applied phrase-level sentiment analysis to online user reviews. Their work generated results on real-world datasets and predicted top-k recommendations; it also shows that different features can be useful for different categories of users.
(Zhang, Y., et al., 2015) used phrase-level sentiment analysis on user reviews in a recommender system. Their approach mainly focused on explicit features in user reviews, and they used a collaborative filtering approach in their work. (Vinodhini, G., et al., 2014) presented a hybrid model using principal component analysis for the classification of product reviews. They used logistic regression and support vector machines as the machine learning methods and showed experimentally that a hybrid model for opinion mining is more promising. (D'Avanzo, et al., 2015) investigated how helpful online reviews are for making purchasing decisions; most buyers consult online reviews before shopping, and the shopping experience of various buyers can be found in the reviews they post. The authors presented a cognitively based procedure that mines users' opinions from specific kinds of market data.
3. SOFTWARE
3.1 PYTHON SOFTWARE
Python's name is derived from the British comedy group Monty Python, whom Python creator Guido van Rossum enjoyed while developing the language, which was first released in 1991. Python's design philosophy emphasizes code readability, notably through its use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
History: Python was conceived in the late 1980s as a successor to the ABC
language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible,
and much Python 2 code does not run unmodified on Python 3. Due to concern about the
amount of code written for Python 2, support for Python 2.7 (the last release in the 2.x
series) was extended to 2020. Language developer Guido van Rossum shouldered sole
responsibility for the project until July 2018 but now shares his leadership as a member of a
five-person steering council.
The Python 2 language, i.e. Python 2.7.x, was "sunset" on January 1, 2020, and the Python team of volunteers will not fix security issues or otherwise improve it after that date. With that end-of-life, only Python 3.6.x and later are supported.
Python interpreters are available for many operating systems. A global community
of programmers develops and maintains CPython, an open source reference
implementation. A non-profit organization, the Python Software Foundation, manages and
directs resources for Python and CPython development.
Python's large standard library, commonly cited as one of its greatest
strengths, provides tools suited to many tasks. For Internet-facing applications, many
standard formats and protocols such as MIME and HTTP are supported. It includes
modules for creating graphical user interfaces, connecting to relational
databases, generating pseudorandom numbers, arithmetic with arbitrary-precision
decimals, manipulating regular expressions, and unit testing.
Some parts of the standard library are covered by specifications (for example, the Web Server Gateway Interface (WSGI) implementation wsgiref follows PEP 333), but most modules are not. They are specified by their code, internal documentation, and test suites (if supplied). However, because most of the standard library is cross-platform Python code, only a few modules need altering or rewriting for variant implementations.
3.1.1 PACKAGES
As of March 2018, the Python Package Index (PyPI), the official repository for
third-party Python software, contains over 130,000 packages with a wide range of
functionality, including:
Human vision is unique and superior in that it detects and discriminates surrounding objects with ease. It can perceive 3-D structures with precision and also categorize them efficiently, and the texture of an object is well distinguished by the human eye. Computer vision, in contrast, is a broad term describing a computer performing the function of an eye by applying mathematical algorithms to a digital image. Computer vision researchers have made tremendous progress, and much of their work has been applied practically in many fields of day-to-day human life. Computer vision is precise in its identification and quick in execution when given good, uncomplicated data. Human visual perception is limited at very high spatial frequencies due to physical constraints, unlike computer vision, which has no such constraints; on the other hand, computer vision involves intensive processing of huge amounts of data, which consumes quite a lot of the computer's resources and memory.
The basic types of image are binary, grayscale and true colour. A binary image has only two possible values, 0 or 1, for each pixel. In an 8-bit grayscale image, each pixel is a shade of gray, with values ranging from 0 (black) to 255 (white). In a true colour (24-bit) image, each pixel represents different amounts of red, green and blue; with 256 possible values per channel, each pixel can take over 16 million possible colours.
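As an illustrative sketch (not from the original text), these three image types can be represented directly as NumPy arrays:

import numpy as np

# Binary image: each pixel is 0 or 1.
binary = np.array([[0, 1], [1, 0]], dtype=np.uint8)
# 8-bit grayscale image: each pixel is a shade of gray, 0 (black) to 255 (white).
gray = np.array([[0, 128], [200, 255]], dtype=np.uint8)
# 24-bit true-colour image: one 0-255 value per red, green and blue channel,
# giving 256**3 (over 16 million) possible colours per pixel.
colour = np.zeros((2, 2, 3), dtype=np.uint8)
colour[0, 0] = [255, 0, 0]  # a pure red pixel
print(binary.shape, gray.dtype, colour.shape)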
Spatial resolution is the density of pixels over the image, and it determines the smallest discernible detail in an image: greater spatial resolution means that more pixels are used to display the image. The number of gray levels used to represent an image is called quantization. Convolution is a process by which a mask is moved from pixel to pixel over an image, and at each pixel a predefined quantity is computed using mathematical operations; the output image so obtained is the same size as the input image. Convolution can be used to implement operators such as spatial filters and feature detectors, achieving image smoothing and image sharpening. A small sketch of the operation follows.
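As a minimal NumPy sketch of this mask-based operation (zero padding is assumed so the output matches the input size; strictly, the loop computes cross-correlation, which coincides with convolution for the symmetric mask used here):

import numpy as np

def convolve2d(image, mask):
    # Slide `mask` over `image`, computing a weighted sum at each pixel.
    m, n = mask.shape
    pad_y, pad_x = m // 2, n // 2
    padded = np.pad(image, ((pad_y, pad_y), (pad_x, pad_x)), mode='constant')
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + m, j:j + n] * mask)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
smoothing = np.ones((3, 3)) / 9.0  # 3x3 averaging mask for image smoothing
print(convolve2d(image, smoothing))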
The similar textural elements that are replicated over a region of an image are called texels. Texels are the basic elements of textures, and they have certain specific characteristics, including the following.
4. The contrast they exhibit shows different magnitudes and variations.
6. The variations which form the texture may have varying degrees of randomness versus regularity.
The texture has intuitive properties of its own, summed up by Tuceryan et al. as given below.
• Texture characterises areas or regions; the texture of a point remains undefined. Texture can therefore be considered a contextual property, defined in terms of gray values in a spatial neighbourhood. The size of the spatial neighbourhood depends on the type of texture and on the size of the primitives which define the texture.
• Since texture involves the spatial distribution of gray levels, co-occurrence matrices are considered a favourable texture analysis tool.
• A region in an image is observed to have texture when the primitives in the region are numerous. When only a few primitives are present, a group of countable objects is perceived instead. Therefore, a texture can be observed only when notable individual forms are absent.
The two significant characteristics of a texture are directionality and coarseness, and the two prime texture analysis approaches are statistical and structural. Textures in an image can be created artificially, or in the case of natural images they can be observed in the captured image itself: natural images have naturally created patterns or textures, while artificially created images have textures produced by mechanical influence. The repeated arrangement of texels with the same intensity values over an area is called spatial frequency, which describes the spatial distribution of gray values.
Basically, human vision detects the patterns, geometrical forms and variations in an image, which helps it to identify and determine an object. Human vision can therefore assess textures in an image only qualitatively; since quantitative assessment of textures is needed, texture properties must be defined and computed mathematically.
Texture analysis is used to extract features from an image for recognition. The useful information extracted by texture analysis is then interpreted by various methods for identification or classification. The mathematical computations for quantitative texture analysis are performed through various algorithms.
A region in an image is an area with similar pixel values, computed over a large neighbourhood. In natural images, homogeneous regions may be surrounded by non-homogeneous regions with irregular boundaries, and such regions can easily be studied with segmentation and edge detection techniques. The non-homogeneous regions have varying attributes of intensity and colour, which provide the cue for segmented analysis of the texture. The structural approach is well suited to textures that have an even structure, with textural primitives large enough to be segmented individually; it represents texture by well-defined textural elements that appear repeatedly according to placement rules.
In general, texture analysis involves a few standard steps: preprocessing, feature extraction, texture classification and texture segmentation. Preprocessing is used for noise attenuation, correction of image orientation and so on. Preprocessing techniques such as homomorphic filtering, histogram equalization, adaptive histogram equalization, contrast-limited adaptive histogram equalization and gamma correction are widely used. Preprocessing is considered to improve the contrast of the image, especially before the textures are computed.
Feature extraction recognizes and determines a set of unique and well described
features to characterise a texture. Detecting the perceived qualities of texture in an image is
the first important step towards building mathematical models for texture. The intensity
alterations in an image which characterize texture are mostly due to some underlying
physical variations in the scene.
Feature extraction methods for describing the characteristics of texture fall into different types of models, which include statistical, structural, geometrical, model-based and signal processing methods. Since the textures of different objects vary and are composed of complicated parameters, a variety of approaches is necessary to characterize textures and classify them.
The sentiment analysis phase identifies sentiment-based cliques with respect to various issues or events. The normalized and transformed tweets, represented as unigrams in the preprocessing phase, are used to identify the sentiments of the tweets. Twitter Sentiment Analysis (TSA) has been used for several applications, including product reviews, political orientation extraction and stock market prediction. Moreover, real-time analysis of tweets on various issues has become a strong indicator for analysing human behaviour and reactions. For example, the tweets generated in the names of the leading chief ministerial candidates in Delhi during the declaration of election results illustrate this; the analysis of such tweets gives strong insight into the popularity of these persons in the community.
SenticNet 2.0 is a publicly available semantic resource which contains commonly used polarity concepts. The general framework for the sentiment analysis is as follows: the Twitter corpus consists of n-grams generated from the tweets by the preprocessing module, and the sentiment analyser makes use of three sentiment lexicons (SentiWordNet, SenticNet and SentiSlangNet) to find the polarity of each tweet, using this information to generate cliques based on the sentiments on each issue. A small lexicon-lookup sketch follows.
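As a small sketch of such a lexicon lookup using NLTK's SentiWordNet interface (the corpora must be downloaded first; SenticNet and SentiSlangNet lookups would follow the same pattern through their own APIs):

import nltk
nltk.download('wordnet')        # required by SentiWordNet
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

# Look up the adjective senses of a word and print their polarity scores.
for s in swn.senti_synsets('confusing', 'a'):
    print(s.synset.name(), 'pos =', s.pos_score(), 'neg =', s.neg_score())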
4. ANALYSIS
4.1 CODING
#!/usr/bin/env python
# coding: utf-8

# In[1]:
# Imports: pandas for data handling, TfidfVectorizer for text features,
# PassiveAggressiveClassifier for learning, and metrics for evaluation.
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# In[2]:
# Read the dataset into a DataFrame and inspect its shape and first records.
df=pd.read_csv('D:\\DATA FLAIR\\news.csv')
df.shape
df.head()

# In[3]:
# Extract the target labels (FAKE / REAL).
labels=df.label
labels.head()

# In[4]:
# Split into 80% training and 20% test data.
x_train,x_test,y_train,y_test=train_test_split(df['text'],labels,test_size=0.2, random_state=7)

# In[5]:
# Build TF-IDF features; English stop words are removed and terms that
# appear in more than 70% of the documents are discarded.
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)

# In[6]:
# Train the classifier and measure accuracy on the held-out test set.
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

# In[7]:
# Confusion matrix: rows are true labels and columns are predicted labels,
# in the order ['FAKE','REAL'].
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
4.2 ADVANCED PYTHON PROJECT-DETECTING FAKE NEWS
Do you trust all the news you hear from social media? Not all news is real, right? So how will you detect the fake news? The answer is Python. By practicing this advanced Python project of detecting fake news, you will easily be able to tell real news from fake news. Before moving ahead in this advanced Python project, become familiar with the related terms: fake news, TfidfVectorizer, and PassiveAggressiveClassifier.
TF (Term Frequency): The number of times a word appears in a document is its term frequency. A higher value means a term appears more often than others, so the document is a good match when the term is part of the search terms.
IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur many times in many other documents, may be irrelevant. IDF is a measure of how significant a term is in the entire corpus.
The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF
features.
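Concretely, a common form of the combined weight is tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t (sklearn applies a smoothed variant of this formula). A toy sketch of the vectorizer, with invented example documents:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the mayor said the report was false",
        "the report was confirmed by the mayor",
        "aliens have landed, claims anonymous post"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # 3 documents -> 3 rows of TF-IDF weights
print(X.shape)               # (3, number of distinct terms)
# 'the' appears in two documents, so its IDF is lower than that of a rarer,
# more discriminative term such as 'aliens'.
print(dict(zip(vec.get_feature_names_out(), vec.idf_)))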
This advanced python project of detecting fake news deals with fake and real news.
Using sklearn, we build a TfidfVectorizer on our dataset. Then, we initialize a
PassiveAggressive Classifier and fit the model. In the end, the accuracy score and the
confusion matrix tell us how well our model fares.
4.3 INSTALL PACKAGES
4.3.1 WHAT IS NUMPY?
NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. It contains various features, including these important ones:
• a powerful N-dimensional array object;
• sophisticated (broadcasting) functions;
• tools for integrating C/C++ and Fortran code;
• useful linear algebra, Fourier transform, and random number capabilities.
Installation:
Windows does not have a package manager analogous to those in Linux or Mac, so you can download the pre-built Windows installer for NumPy (according to your system configuration and Python version) and install the package manually. Alternatively, to install NumPy with pip, open a command prompt and type the following:
pip install numpy
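Once installed, a quick check from the same prompt confirms that the package imports correctly:

python -c "import numpy; print(numpy.__version__)"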
4.4 INTERPRETATION
Follow the steps below to detect fake news and complete your first advanced Python project.
1. First, make the necessary imports (NumPy, pandas, itertools and the sklearn classes shown in Section 4.1).
2. Now, let’s read the data into a DataFrame, and get the shape of the data and the first 5
records.
3. And get the labels from the DataFrame.
4. Next, split the dataset into training and testing sets, keeping 20% of the data for testing (test_size=0.2).
5. Let's initialize a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded). Stop words are the most common words in a language, which are filtered out before processing natural language data. A TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.
Now, fit and transform the vectorizer on the train set, and transform the vectorizer on the
test set.
6. Next, we'll initialize a PassiveAggressiveClassifier. This is an online learning algorithm that remains passive when a prediction is correct and turns aggressive when it is wrong, updating its weights just enough to fix the mistake. We'll fit this on tfidf_train and y_train.
Then, we’ll predict on the test set from the TfidfVectorizer and calculate the accuracy with
accuracy_score() from sklearn.metrics.
7. We got an accuracy of 93.05% with this model. Finally, let’s print out a confusion matrix
to gain insight into the number of false and true negatives and positives.
5. CONCLUSION
Fake news is an important challenge that takes place over social and computational
infrastructure. In this paper, we have proposed multiple hypotheses related to three different
characteristics of fake news: origin, proliferation and linguistic tone. The hypotheses are
tested using statistical methods on the Fake News Net dataset collected via Twitter. The
results of these hypotheses suggest the following: 1) fake news is not published on popular websites; rather, it is published by lesser-known media outlets or websites; 2) fake news is proliferated more by unverified users than by verified users; and 3) fake news stories are written in a specific linguistic tone, though it is inconclusive which one (negative, positive or neutral). The results expand the understanding of fake news as a
phenomenon and motivate future work, which includes: expanding the study on additional
hypotheses, and designing and developing a multifarious fusion model to better detect
given news for fakeness or legitimacy.
6. BIBLIOGRAPHY
Sadia Afroz, Michael Brennan, and Rachel Greenstadt. Detecting hoaxes, frauds, and deception in writing style online. In ISSP'12.
Hunt Allcott and Matthew Gentzkow. Social media and fake news in the 2016
election. Technical report, National Bureau of Economic Research, 2017.
Solomon E Asch and H Guetzkow. Effects of group pressure upon the modification and distortion of judgments. Groups, leadership, and men, pages 222–236, 1951.
Meital Balmas. When fake news becomes real: Combined exposure to multiple news
sources and political attitudes of inefficacy, alienation, and cynicism. Communication
Research, 41(3):430–454, 2014.
Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead,
and Oren Etzioni. Open information extraction from the web. In IJCAI’07.
Alessandro Bessi and Emilio Ferrara. Social bots distort the 2016 us presidential election online discussion. First Monday, 21, 2016.
Prakhar Biyani, Kostas Tsioutsiouliklis, and John Blackmer. "8 amazing secrets for getting more clicks": Detecting clickbaits in news streams using article informality. In AAAI'16.
Jonas Nygaard Blom and Kenneth Reinecke Hansen. Click bait: Forward-reference as lure in online news headlines. Journal of Pragmatics, 76:87–100, 2015.
Paul R Brewer, Dannagal Goldthwaite Young, and Michelle Morreale. The impact of
real news about fake news: Intertextual processes and political satire. International
Journal of Public Opinion Research, 25:323–343, 2013.
Carlos Castillo, Mohammed El-Haddad, Jurgen Pfeffer, and Matt Stempeck. Characterizing the life cycle of online news stories using social media reactions. In CSCW'14.
Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. Information credibility
on twitter. In WWW’11.
Abhijnan Chakraborty, Bhargavi Paranjape, Sourya Kakarla, and Niloy Ganguly.
Stop clickbait: Detecting and preventing clickbaits in online news media. In
ASONAM’16.