AAT Cover Page
S COLLEGE OF ENGINEERING
(An Autonomous College under VTU, Belagavi)
Bull Temple Road, Bangalore - 560 019
On
“Fake News Detection using Machine Learning”
Submitted as a part of Alternate Assessment for the cluster Elective - course
MACHINE LEARNING
offered by
DEPARTMENT of ELECTRONICS AND COMMUNICATIONS
ENGINEERING
Submitted By
NAME              USN
HARSH BAFNA       1BM20EI018
ANAGHA R NAYAK    1BM20EI005
AMULYA M S        1BM20EC018
ANKITHA S         1BM20EC020
1. Introduction
3. Literature survey
4. Methodology
5. Implementation
6. Result Analysis
7. Conclusion
1. Introduction:
**Model:**
The fake news detection model is based on Logistic Regression, a
supervised learning algorithm commonly used for classification tasks.
It computes a linear combination of the input features (the
preprocessed, TF-IDF-transformed news articles) and passes the result
through the sigmoid function to estimate the probability of the target
class (real or fake news).
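A minimal sketch of how a logistic regression prediction is computed by hand: a weighted sum of the features passes through the sigmoid function, and the resulting probability is thresholded at 0.5. The weights, bias, and feature values are made-up numbers for illustration, not parameters learned from the news dataset, and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of class 1 for one sample (linear score, then sigmoid)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical weights and TF-IDF-like feature values, for illustration only.
weights = [1.5, -2.0, 0.5]
bias = -0.25
sample = [0.4, 0.1, 0.7]

print(predict_proba(sample, weights, bias))
print(predict_label(sample, weights, bias))
```

In this report's labeling, class 0 stands for real news and class 1 for fake news, so a probability above the threshold would be read as "fake".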
**Working:**
1. Data Collection & Preprocessing: The author and title of each
article are merged into a new 'content' column. This text is then
cleaned by removing non-alphabetic characters, converting it to
lowercase, removing English stopwords, and stemming with the Porter
stemmer. This reduces noise and standardizes the text data.
2. Feature Extraction: TF-IDF vectorization is applied to convert the
cleaned text into numerical representations. TF-IDF weights a term by
how often it occurs in an article, scaled down for terms that occur in
many articles, so distinctive words receive higher weights than
common ones.
3. Splitting Data: The dataset is split into training and test sets, with
80% for training and 20% for testing.
4. Model Selection & Training: A Logistic Regression model is
initialized and trained on the training set. The model learns the
patterns and relationships between the transformed features and the
corresponding labels (real or fake news).
5. Model Evaluation: The trained model is evaluated on the test set.
Accuracy scores are calculated for both the training and test data to
measure the model's performance.
6. Prediction: A sample from the test set is used to demonstrate
prediction. The model predicts the label for the sample: 0 indicates
real news and 1 indicates fake news.
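To make the TF-IDF step concrete, the sketch below computes the weights by hand for three toy headlines. It uses raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization, which is the formula scikit-learn documents for the default `TfidfVectorizer` settings. It is a simplified illustration with plain whitespace tokenization, not a replacement for the library, and the headlines and the `tfidf` helper are invented for this example.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Uses raw term counts, idf = ln((1 + n) / (1 + df)) + 1, and L2
    normalisation of each document vector.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: v / norm for t, v in raw.items()} if norm else {})
    return vectors


# Toy, already-preprocessed headlines (invented for illustration).
docs = [
    "senat vote budget",
    "alien vote budget",
    "budget deal reach",
]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in sorted(vec.items())})
```

In the first headline, "budget" appears in every document and receives the lowest weight, while "senat" appears only once in the corpus and receives the highest.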
**Flow of Code:**
1. Libraries and dependencies are imported.
2. Stopwords are downloaded and printed to verify the list of
stopwords in English.
3. The dataset is loaded into a pandas DataFrame.
4. Basic exploration of the dataset is performed, such as checking its
shape and displaying the first few rows.
5. Missing values in the dataset are identified and replaced with
empty strings.
6. The 'author' and 'title' columns are merged to create a new
'content' column.
7. Data and labels are separated: X holds every column except 'label',
and Y holds the 'label' column.
8. Text data is preprocessed by removing stopwords and applying
stemming.
9. X and Y are reassigned after preprocessing: X now holds only the
values of the 'content' column, and Y holds the label values.
10. Textual data is converted to numerical data using TF-IDF
vectorization.
11. The dataset is split into training and test sets.
12. A Logistic Regression model is initialized.
13. The model is trained on the training set.
14. The accuracy score is calculated on the training data and printed.
15. The accuracy score is calculated on the test data and printed.
16. A sample from the test data is used for prediction.
17. The predicted label and true label of the sample are printed.
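The split in step 11 is stratified in the implementation (`stratify=Y`, `random_state=2`), meaning each label keeps roughly the same proportion in the training and test sets. The sketch below shows that idea in plain Python. It is a simplified illustration, not scikit-learn's implementation, and the `stratified_split` helper and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_size=0.2, seed=2):
    """Split per class so train and test keep similar label ratios."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible

    # Group sample indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    X_train, Y_train = pick(train_idx)
    X_test, Y_test = pick(test_idx)
    return X_train, X_test, Y_train, Y_test


# Toy data: 20 samples, 12 labelled 0 (real) and 8 labelled 1 (fake).
samples = [f"article {i}" for i in range(20)]
labels = [0] * 12 + [1] * 8
X_train, X_test, Y_train, Y_test = stratified_split(samples, labels, test_size=0.25)
print("train:", Y_train.count(0), Y_train.count(1))
print("test: ", Y_test.count(0), Y_test.count(1))
```

Both subsets keep the original 3:2 ratio of real to fake labels, which keeps the test accuracy from being skewed by an unrepresentative split.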
5. Implementation:
The proposed fake news detection model is implemented in Python. We use
the NumPy, pandas, NLTK, and scikit-learn libraries for data manipulation,
preprocessing, and machine learning operations. The model is trained and
tested on a labeled dataset containing real and fake news articles. The
implementation code covers loading the dataset, preprocessing the data,
feature extraction, model training, and evaluation.
1. Import necessary libraries:
- numpy: for numerical operations
- pandas: for data manipulation and analysis
- re: for regular expression operations
- nltk.corpus: for accessing NLTK's built-in stopwords corpus
- nltk.stem.porter: for Porter stemming algorithm
- sklearn.feature_extraction.text: for TF-IDF vectorization
- sklearn.model_selection: for train-test split
- sklearn.linear_model: for logistic regression
- sklearn.metrics: for accuracy score calculation
2. Download NLTK stopwords corpus using `nltk.download('stopwords')`.
3. Print the stopwords in English using `print(stopwords.words('english'))`.
4. Load the dataset into a pandas DataFrame using `pd.read_csv('<path>')`.
Replace `<path>` with the actual path to the dataset file.
5. Check the shape of the dataset using `news_dataset.shape`.
6. Print the first 5 rows of the dataset using `news_dataset.head()`.
7. Count the number of missing values in the dataset using
`news_dataset.isnull().sum()`.
8. Replace null values with empty strings in the dataset using `news_dataset =
news_dataset.fillna('')`.
9. Merge the author name and news title into a new column called 'content'
using `news_dataset['content'] = news_dataset['author'] + ' ' +
news_dataset['title']`.
10. Print the content column of the dataset using
`print(news_dataset['content'])`.
11. Separate the data and labels. Assign the dataset without the 'label' column
to variable X using `X = news_dataset.drop(columns='label')` (the `columns`
keyword already selects the axis, so `axis=1` is not needed). Assign
the 'label' column to variable Y using `Y = news_dataset['label']`.
12. Create an instance of the PorterStemmer using `port_stem =
PorterStemmer()`.
13. Define a function called `stemming` that takes a `content` parameter. Inside
the function, perform the following steps:
- Replace non-alphabetic characters with spaces using `re.sub('[^a-zA-Z]', ' ', content)`.
- Convert the content to lowercase using `.lower()`.
- Split the content into individual words using `.split()`.
- In a single list comprehension, drop the English stopwords (taken from
`stopwords.words('english')`) and stem each remaining word using
`port_stem.stem(word)`.
- Join the stemmed words back into a string using `' '.join(stemmed_content)`.
- Return the stemmed content.
14. Apply the `stemming` function to the 'content' column of the dataset and
store the result back in the same column using
`news_dataset['content'] = news_dataset['content'].apply(stemming)`.
15. Separate the data and labels again. Assign the 'content' column values to
variable X using `X = news_dataset['content'].values`. Assign the 'label'
column values to variable Y using `Y = news_dataset['label'].values`.
16. Convert the textual data in X to numerical data using TF-IDF vectorization.
Create an instance of TfidfVectorizer using `vectorizer = TfidfVectorizer()`. Fit
the vectorizer on X using `vectorizer.fit(X)`. Transform X using `X =
vectorizer.transform(X)`.
17. Split the data into training and testing sets using train_test_split. Assign the
train-test split results to variables X_train, X_test, Y_train, Y_test using
`train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)`.
18. Create an instance of LogisticRegression for the model using `model =
LogisticRegression()`.
19. Fit the model on the training data using `model.fit(X_train, Y_train)`.
20. Calculate the accuracy score on the training data. Predict the labels for
X_train using `X_train_prediction = model.predict(X_train)`, then compare
them with Y_train and assign the result to variable training_data_accuracy
using `accuracy_score(Y_train, X_train_prediction)`. Print the accuracy score
using `print('Accuracy score of the training data: ', training_data_accuracy)`.
21. Calculate the accuracy score on the test data. Predict the labels for
X_test using `X_test_prediction = model.predict(X_test)`, then compare them
with Y_test and assign the result to variable test_data_accuracy using
`accuracy_score(Y_test, X_test_prediction)`. Print the accuracy score using
`print('Accuracy score of the test data: ', test_data_accuracy)`.
22. Optional: Select a sample from the test data using `X_new = X_test[0]` and
predict its label using `prediction = model.predict(X_new)`. Print the
prediction using `print(prediction)`; a value of 0 means the news is real and
1 means it is fake.
23. Optional: Print the actual label of the sample from the test data using
`print(Y_test[0])`.
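One possible variation on steps 16 to 21, not the approach used in this report: the steps above fit the TF-IDF vectorizer on the full dataset before splitting, so the vocabulary and document frequencies also reflect the test articles. The sketch below splits the raw text first and wraps the vectorizer and classifier in a scikit-learn pipeline, so both are fitted on the training texts only. The toy texts and labels are invented for illustration and are far too small to give a meaningful accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy, already-preprocessed texts and labels (invented; 0 = real, 1 = fake).
texts = [
    "senat pass budget bill", "court uphold rule", "mayor open new bridg",
    "council approv school fund", "alien endors candid", "miracl pill cure all",
    "celebr clone secret lab", "moon made chees confirm",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, so the vectorizer never sees the test articles.
X_train, X_test, Y_train, Y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(X_train, Y_train)

test_predictions = pipeline.predict(X_test)
print("Test accuracy:", accuracy_score(Y_test, test_predictions))
```

A fitted pipeline also accepts new preprocessed text directly through `pipeline.predict([...])`, because it applies the stored TF-IDF transformation before the classifier.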
6. Result Analysis:
The performance of the fake news detection model is
assessed using accuracy scores on both the training and
test sets. The accuracy score on the training data
indicates how well the model has fit the patterns in
the training set, while the accuracy score on the
test data reflects the model's ability to generalize and
classify unseen news articles. Comparing the two scores
shows how well the model generalizes: a test accuracy
close to the training accuracy suggests little
overfitting, whereas a much lower test accuracy suggests
the model has memorized the training data.
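Accuracy alone does not show which kind of mistake the model makes. As a hedged illustration of a complementary check, the sketch below counts true and false positives and negatives and derives precision and recall for the "fake" class. The labels are invented for the example and are not results from this report; the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def summarize(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for the positive class."""
    c = confusion_counts(y_true, y_pred, positive)
    total = sum(c.values())
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / total if total else 0.0,
        "precision": c["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": c["tp"] / actual_pos if actual_pos else 0.0,
    }


# Invented labels for illustration (0 = real, 1 = fake); not the report's results.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
print(confusion_counts(y_true, y_pred))
print(summarize(y_true, y_pred))
```

Here recall for the fake class is lower than precision, meaning this hypothetical classifier misses some fake articles even though most of its "fake" predictions are correct.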
Result picture:
Accuracy picture:
7. Conclusion:
In conclusion, we have presented a fake news detection model that
utilizes machine learning techniques to automatically identify and
classify fake news articles. Through the use of preprocessing, TF-IDF
vectorization, and Logistic Regression, the model demonstrates
promising results in distinguishing between real and fake news. While
the presented model serves as an effective solution, further
improvements can be explored by incorporating additional features,
exploring different algorithms, and increasing the size and diversity of
the training data. The development of reliable fake news detection
models is crucial in combating misinformation and promoting the
dissemination of accurate information in the digital era.
CODE:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
--------------------------------------------------------------------------
import nltk
nltk.download('stopwords')
--------------------------------------------------------------------------
# printing the stopwords in English
print(stopwords.words('english'))
--------------------------------------------------------------------------
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('<path>')
--------------------------------------------------------------------------
news_dataset.shape
--------------------------------------------------------------------------
# print the first 5 rows of the data frame
news_dataset.head()
--------------------------------------------------------------------------
#counting the number of missing values in the dataset
news_dataset.isnull().sum()
--------------------------------------------------------------------------
#replacing null values with empty string
news_dataset = news_dataset.fillna('')
--------------------------------------------------------------------------
#merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']
print(news_dataset['content'])
--------------------------------------------------------------------------
#separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']
print(X)
print(Y)
--------------------------------------------------------------------------
port_stem = PorterStemmer()
--------------------------------------------------------------------------
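# build the stopword set once; a set lookup is much faster than
# rebuilding the stopword list for every word
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords, then stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content
                       if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content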
---------------------------------------------------------------------------
news_dataset['content'] = news_dataset['content'].apply(stemming)
print(news_dataset['content'])
---------------------------------------------------------------------------
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values
print(X)
print(Y)
Y.shape
----------------------------------------------------------------------------
#converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)
print(X)
----------------------------------------------------------------------------
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2)
----------------------------------------------------------------------------
model = LogisticRegression()
----------------------------------------------------------------------------
model.fit(X_train, Y_train)
----------------------------------------------------------------------------
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy score of the training data : ', training_data_accuracy)
----------------------------------------------------------------------------
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score of the test data : ', test_data_accuracy)
----------------------------------------------------------------------------
X_new = X_test[0]
prediction = model.predict(X_new)
print(prediction)
if prediction[0] == 0:
    print('The news is real')
else:
    print('The news is fake')
-----------------------------------------------------------------------------
print(Y_test[0])