AAT Cover Page

This document is a project report submitted as part of an alternate assessment for a Machine Learning elective course. It presents a model for detecting fake news articles using machine learning techniques. The report includes an index listing the contents, which are introductions, problem definition and proposed solution, literature survey, methodology, implementation, result analysis, and conclusion. The objective is to develop an accurate model for identifying fake news. The proposed solution uses natural language processing and machine learning algorithms like logistic regression to preprocess and analyze text data, extracting features to distinguish real from fake news.


B.M.S. COLLEGE OF ENGINEERING
(An Autonomous College under VTU, Belagavi)
Bull Temple Road, Bangalore - 560 019

A Project Report- 2021-22

On
“Fake News Detection using Machine Learning”
Submitted as a part of Alternate Assessment for the cluster Elective - course
MACHINE LEARNING
offered by
DEPARTMENT of ELECTRONICS AND COMMUNICATIONS
ENGINEERING
Submitted By

NAME USN
HARSH BAFNA 1BM20EI018
ANAGHA R NAYAK 1BM20EI005
AMULYA M S 1BM20EC018
ANKITHA S 1BM20EC020

FIC: Dr. Geetishree Mishra
Designation: Assistant Professor
INDEX:

Sl. No. Content

1. Introduction

2. Problem Definition and Proposed solution

3. Literature survey

4. Methodology

5. Implementation

6. Result Analysis

7. Conclusion
1. Introduction:

The spread of fake news has become a significant issue in
today's digital age. With the rapid growth of online platforms
and social media, it has become increasingly challenging to
distinguish between real and fabricated news articles. Fake
news can have severe consequences, including
misinformation, manipulation, and erosion of public trust. In
this report, we present a fake news detection model that
utilizes machine learning techniques to automatically identify
and classify fake news articles.
2. Problem Definition and Proposed Solution:

The primary objective of this project is to develop a reliable
model capable of detecting fake news articles with high
accuracy. Fake news detection is a complex task due to the vast
amount of online content and the subtle manipulation techniques
employed by malicious entities.

Our proposed solution uses Natural Language Processing (NLP)
and a machine learning algorithm (Logistic Regression) to
preprocess and analyze textual data, extracting informative
features to distinguish between real and fake news articles.
3. Literature survey

| Sl. No. | Journal/Conference/Thesis | Date of Publication | Title |
|---|---|---|---|
| I. | https://www.researchgate.net/publication/339022255_A_smart_System_for_Fake_News_Detection_Using_Machine_Learning | September 2019 | A smart System for Fake News Detection Using Machine Learning |
| II. | https://ieeexplore.ieee.org/document/8843612 | 28-30 June 2019 | Detecting Fake News using Machine Learning and Deep Learning Algorithms |
| III. | https://doi.org/10.1002/spy2.9 | 29 December 2017 | Detecting opinion spams and fake news using text classification |
| IV. | https://ieeexplore.ieee.org/document/8620051 | 20 January 2019 | Detecting Fake News with Machine Learning Method |
| V. | https://link.springer.com/article/10.1007/s00500-019-04436-y | 02 November 2019 | Automating fake news detection system using multi-level voting model |

Main Aspect:
i. A novel multi-level voting ensemble model is proposed to
detect fake content from social media. Different feature
extraction techniques are compared, and the proposed
model outperforms individual models, achieving improved
accuracy.
ii. A new n-gram model is introduced for automatic detection
of fake content. Various feature extraction techniques and
machine learning classification techniques are studied,
showing promising performance compared to existing
methods.
iii. A model using machine learning and natural language
processing techniques is proposed to authenticate news
articles circulated on social media. Support Vector
Machine is employed for fake news detection, achieving
high accuracy.
iv. A model for detecting fake news messages from Twitter is
proposed, utilizing machine learning algorithms. Support
Vector Machine and Naïve Bayes classifiers outperform
others, showcasing the efficiency of classification
performance.
v. Machine learning techniques, including Naive Bayes,
Neural Network, and Support Vector Machine, are used for
fake news detection. High accuracy rates are achieved,
with Naive Bayes having an accuracy of 96.08% and
Neural Network and Support Vector Machine reaching
99.90% accuracy.
4. Methodology:

**Model:**
The fake news detection model is based on Logistic Regression, a
supervised learning algorithm commonly used for classification tasks.
It passes a linear function of the input features (preprocessed and
transformed news articles) through the logistic (sigmoid) function to
estimate the probability of the target variable (real or fake news).
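
As a rough illustration of the idea behind the model, the sketch below computes a logistic regression prediction by hand for one article. The weights, bias, and feature values are made-up numbers chosen for the example, not values learned by the model in this report.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of class 1 (fake) for a single feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical weights for a three-term vocabulary (illustrative only).
weights = [1.5, -2.0, 0.5]
bias = -0.25
article = [0.8, 0.1, 0.3]  # e.g. TF-IDF weights of the three terms

p_fake = predict_proba(article, weights, bias)
label = 1 if p_fake >= 0.5 else 0  # 1 = fake, 0 = real
print(round(p_fake, 3), label)
```

Training amounts to choosing the weights and bias so that these probabilities match the labels in the training set as closely as possible.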

**Working:**
1. Data Collection & Preprocessing: The news articles undergo
preprocessing steps such as removing non-alphabetic characters,
converting text to lowercase, removing English stopwords, and
stemming using the Porter stemmer. This reduces noise and
standardizes the text data.
2. Feature Extraction: The author and title fields are merged into a
new 'content' column, which is the text that gets preprocessed. Then,
TF-IDF vectorization is applied to convert the textual data into
numerical representations. TF-IDF weights each word by how often it
appears in an article, discounted by how common the word is across
the whole dataset.
3. Splitting Data: The dataset is split into training and test sets, with
80% for training and 20% for testing.
4. Model Selection & Training: A Logistic Regression model is
initialized and trained on the training set. The model learns the
patterns and relationships between the transformed features and the
corresponding labels (real or fake news).
5. Model Evaluation: The trained model is evaluated on the test set.
Accuracy scores are calculated for both the training and test data to
measure the model's performance.
6. Prediction: A sample from the test set is used to demonstrate
prediction. The model predicts the label for the sample: 0 indicates
real news and 1 indicates fake news.
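
To make the TF-IDF step more concrete, here is a small hand-computed sketch using the textbook formula (term frequency times the log of inverse document frequency). The three-document corpus is invented for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# Hypothetical, already-stemmed toy corpus (not from the report's dataset).
docs = [
    "senat vote budget",
    "alien vote shock",
    "senat pass budget",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc):
    """Textbook TF-IDF weights for the terms of one tokenized document."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


weights = tfidf(tokenized[1])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 3))
```

In the second document, "alien" and "shock" appear in only one document and so receive a higher weight than "vote", which appears in two.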
**Flow of Code:**
1. Libraries and dependencies are imported.
2. Stopwords are downloaded and printed to verify the list of
stopwords in English.
3. The dataset is loaded into a pandas DataFrame.
4. Basic exploration of the dataset is performed, such as checking its
shape and displaying the first few rows.
5. Missing values in the dataset are identified and replaced with
empty strings.
6. The 'author' name and 'title' columns are merged to create a new
'content' column.
7. Data and labels are separated, storing the feature columns in X and
the 'label' column in Y.
8. Text data is preprocessed by applying stemming and removing
stopwords.
9. Data and labels are separated again, this time storing the stemmed
'content' values in X and the labels in Y.
10. Textual data is converted to numerical data using TF-IDF
vectorization.
11. The dataset is split into training and test sets.
12. A Logistic Regression model is initialized.
13. The model is trained on the training set.
14. The accuracy score is calculated on the training data and printed.
15. The accuracy score is calculated on the test data and printed.
16. A sample from the test data is used for prediction.
17. The predicted label and true label of the sample are printed.
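
The flow above fits the TF-IDF vectorizer on the full dataset before splitting it. One possible variation, sketched here under assumptions, wraps the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary and IDF weights are learned from the training split only. The toy headlines and labels are invented for illustration and are not taken from the report's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = real, 1 = fake.
texts = [
    "senate passes annual budget bill",
    "city council approves new school funding",
    "central bank keeps interest rates unchanged",
    "court upholds ruling on water rights",
    "ministry publishes quarterly trade figures",
    "aliens secretly control the stock market",
    "miracle fruit cures every known disease",
    "celebrity clone spotted voting twice",
    "moon landing replaced by hologram insiders say",
    "secret law bans sleeping on weekends",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Split the raw text first, so the test articles stay unseen by the vectorizer.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=2
)

# The pipeline fits TF-IDF and the classifier on the training split only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
test_accuracy = pipeline.score(X_test, y_test)
print(predictions, test_accuracy)
```

With a dataset this small the accuracy value itself is not meaningful; the point is only the order of operations.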
5. Implementation:
The proposed fake news detection model is implemented in Python. We use
the numpy, pandas, nltk, and scikit-learn libraries for data manipulation,
preprocessing, and machine learning operations. The model is trained and
tested on a labeled dataset containing real and fake news articles. The
implementation code includes loading the dataset, preprocessing the data,
feature extraction, model training, and evaluation.
1. Import necessary libraries:
- numpy: for numerical operations
- pandas: for data manipulation and analysis
- re: for regular expression operations
- nltk.corpus: for accessing NLTK's built-in stopwords corpus
- nltk.stem.porter: for Porter stemming algorithm
- sklearn.feature_extraction.text: for TF-IDF vectorization
- sklearn.model_selection: for train-test split
- sklearn.linear_model: for logistic regression
- sklearn.metrics: for accuracy score calculation
2. Download NLTK stopwords corpus using `nltk.download('stopwords')`.
3. Print the stopwords in English using `print(stopwords.words('english'))`.
4. Load the dataset into a pandas DataFrame using `pd.read_csv('<path>')`.
Replace `<path>` with the actual path to the dataset file.
5. Check the shape of the dataset using `news_dataset.shape`.
6. Print the first 5 rows of the dataset using `news_dataset.head()`.
7. Count the number of missing values in the dataset using
`news_dataset.isnull().sum()`.
8. Replace null values with empty strings in the dataset using `news_dataset =
news_dataset.fillna('')`.
9. Merge the author name and news title into a new column called 'content'
using `news_dataset['content'] = news_dataset['author'] + ' ' +
news_dataset['title']`.
10. Print the content column of the dataset using
`print(news_dataset['content'])`.
11. Separate the data and labels. Assign the dataset without the 'label' column
to variable X using `X = news_dataset.drop(columns='label', axis=1)`. Assign
the 'label' column to variable Y using `Y = news_dataset['label']`.
12. Create an instance of the PorterStemmer using `port_stem =
PorterStemmer()`.
13. Define a function called `stemming` that takes a content parameter. Inside
the function, perform the following steps:
- Remove non-alphabetic characters using `re.sub('[^a-zA-Z]', ' ', content)`.
- Convert the content to lowercase using `.lower()`.
- Split the content into individual words using `.split()`.
- Filter out words found in `stopwords.words('english')` and stem the
remaining words with `port_stem.stem(word)` in a single list comprehension.
- Join the stemmed words back into a string using
`' '.join(stemmed_content)`.
- Return the stemmed content.
14. Apply the `stemming` function to the 'content' column of the dataset and
store the result back in that column using `news_dataset['content'] =
news_dataset['content'].apply(stemming)`.
15. Separate the data and labels again. Assign the 'content' column values to
variable X using `X = news_dataset['content'].values`. Assign the 'label'
column values to variable Y using `Y = news_dataset['label'].values`.
16. Convert the textual data in X to numerical data using TF-IDF vectorization.
Create an instance of TfidfVectorizer using `vectorizer = TfidfVectorizer()`. Fit
the vectorizer on X using `vectorizer.fit(X)`. Transform X using `X =
vectorizer.transform(X)`.
17. Split the data into training and testing sets using train_test_split. Assign the
train-test split results to variables X_train, X_test, Y_train, Y_test using
`train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)`.
18. Create an instance of LogisticRegression for the model using `model =
LogisticRegression()`.
19. Fit the model on the training data using `model.fit(X_train, Y_train)`.
20. Calculate the accuracy score on the training data. Predict the labels
for X_train using `X_train_prediction = model.predict(X_train)` and compare
them with Y_train. Assign the accuracy score to variable
training_data_accuracy using `accuracy_score(Y_train, X_train_prediction)`
(true labels first, then predictions). Print the accuracy score using
`print('Accuracy score of the training data: ', training_data_accuracy)`.
21. Calculate the accuracy score on the test data. Predict the labels for
X_test using `X_test_prediction = model.predict(X_test)` and compare them
with Y_test. Assign the accuracy score to variable test_data_accuracy using
`accuracy_score(Y_test, X_test_prediction)`. Print the accuracy score using
`print('Accuracy score of the test data: ', test_data_accuracy)`.
22. Optional: Select a sample from the test data using `X_new = X_test[0]` and
predict its label using `prediction = model.predict(X_new)`. Print the
prediction using `print(prediction)`; a value of 0 means real news and 1
means fake news.
23. Optional: Print the actual label of the sample from the test data using
`print(Y_test[0])`.
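
The cleaning part of step 13 can be sketched without any external library, as below. This is a simplified illustration, not the report's function: it uses a small hypothetical stopword set in place of NLTK's English list and leaves out Porter stemming. Storing the stopwords in a set also shows one way to avoid rebuilding the stopword list for every word.

```python
import re

# Hypothetical mini stopword set; the report uses NLTK's English list instead.
STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "in", "to", "and"})


def clean(content: str) -> str:
    """Keep letters only, lowercase, and drop stopwords (no stemming here)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", content)
    words = letters_only.lower().split()
    kept = [word for word in words if word not in STOP_WORDS]
    return " ".join(kept)


cleaned = clean("Jane Doe: The Senate votes 52-48 to pass the 2022 budget!")
print(cleaned)
```

Digits and punctuation are replaced by spaces, so the vote count and the year disappear from the cleaned text.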
6. Result Analysis:
The performance of the fake news detection model is
assessed using accuracy scores on both the training and
test sets. The accuracy score on the training data
indicates how well the model has learned the patterns
from the training set, while the accuracy score on the
test data reflects the model's ability to generalize and
classify unseen news articles. The results are analyzed
to determine the effectiveness of the model in
accurately detecting fake news articles.
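
Accuracy alone does not show which kind of mistake the model makes. As a supplementary sketch, the snippet below derives accuracy, precision, and recall from confusion-matrix counts, treating label 1 (fake) as the positive class. The label lists are invented for illustration and are not the report's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Hypothetical true and predicted labels (1 = fake, 0 = real).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted-fake articles that are fake
recall = tp / (tp + fn)     # share of fake articles that were caught
print(tp, fp, fn, tn, accuracy, precision, recall)
```

A large gap between training and test accuracy would suggest overfitting, while low recall would mean that many fake articles pass undetected.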
Result picture:

Accuracy picture:
7. Conclusion:
In conclusion, we have presented a fake news detection model that
utilizes machine learning techniques to automatically identify and
classify fake news articles. Through the use of preprocessing, TF-IDF
vectorization, and Logistic Regression, the model demonstrates
promising results in distinguishing between real and fake news. While
the presented model serves as an effective solution, further
improvements can be explored by incorporating additional features,
exploring different algorithms, and increasing the size and diversity of
the training data. The development of reliable fake news detection
models is crucial in combating misinformation and promoting the
dissemination of accurate information in the digital era.
CODE:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
--------------------------------------------------------------------------
import nltk
nltk.download('stopwords')
--------------------------------------------------------------------------
# printing the stopwords in English
print(stopwords.words('english'))
--------------------------------------------------------------------------
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('<path>')
--------------------------------------------------------------------------
news_dataset.shape
--------------------------------------------------------------------------
# print the first 5 rows of the data frame
news_dataset.head()
--------------------------------------------------------------------------
#counting the number of missing values in the dataset
news_dataset.isnull().sum()
--------------------------------------------------------------------------
#replacing null values with empty string
news_dataset = news_dataset.fillna('')
--------------------------------------------------------------------------
#merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']
print(news_dataset['content'])
--------------------------------------------------------------------------
#separating the data & label
X = news_dataset.drop(columns='label', axis=1)
Y = news_dataset['label']
print(X)
print(Y)
--------------------------------------------------------------------------
port_stem = PorterStemmer()
--------------------------------------------------------------------------
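# cache the stopwords in a set so each membership test is fast
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content
                       if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content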
---------------------------------------------------------------------------
news_dataset['content'] = news_dataset['content'].apply(stemming)
print(news_dataset['content'])
---------------------------------------------------------------------------
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values
print(X)
print(Y)
Y.shape
----------------------------------------------------------------------------
#converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)
print(X)
----------------------------------------------------------------------------
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2)
----------------------------------------------------------------------------
model = LogisticRegression()
----------------------------------------------------------------------------
model.fit(X_train, Y_train)
----------------------------------------------------------------------------
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy score of the training data : ', training_data_accuracy)
----------------------------------------------------------------------------
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score of the test data : ', test_data_accuracy)
----------------------------------------------------------------------------
X_new = X_test[0]

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('The news is real')
else:
    print('The news is fake')
-----------------------------------------------------------------------------
print(Y_test[0])
