A Machine Learning Project Report
on
Fake News Prediction
Bachelor of Technology
in
Computer Science and Engineering (AI&ML)
by
M.SUMAIYA (22261A6638)
V.AARTHI (22261A6659)
2024 - 2025
TABLE OF CONTENTS
List of Figures
List of Tables
Abstract
1. Introduction
1.1 Motivation
1.5 Requirements Specification
2. Literature Survey
3. Methodology
3.1 Implementation
3.2 Project Architecture
4. Testing and Results
5. Conclusion and Future Work
5.1 Conclusion
6. Bibliography
LIST OF FIGURES
Figure 3.2.1 Activity Diagram
Figure 4.1 Classification Report and ROC of Logistic Regression
Figure 4.2 Classification Report of SVM
Figure 4.3 Classification Report of Decision Tree
Figure 4.4 Classification Report of Naïve Bayes
Figure 4.5 Classification Report of RBF
Figure 4.6 Classification Report and ROC of Random Forest
LIST OF TABLES
Table 2.1 Comparison of Literature Survey
ABSTRACT
This project aims to develop a machine learning system capable of classifying news articles as
either real or fake using textual data. The goal is to enhance information integrity in an era
where misinformation proliferates.
The project begins by importing the libraries needed for data manipulation and machine
learning model development. A significant preprocessing
step involves the creation of a 'content' column by combining the 'author' and 'title' of each
article, augmenting the feature set used for classification.
To prepare the text data for analysis, techniques such as stemming are employed, reducing
words to their root forms to maintain consistency in the dataset. All text is converted to
lowercase to eliminate case sensitivity issues. The textual data is then transformed into
numerical format using the Term Frequency-Inverse Document Frequency (TF-IDF) method,
which quantifies the importance of words in relation to the entire dataset.
The classification task is executed using a logistic regression model, which predicts the
authenticity of articles based on the computed features. The model demonstrates high efficacy,
achieving an accuracy score of 98% on the training data. This project underscores the effective
use of machine learning techniques in distinguishing between legitimate and misleading news
content, offering a potential tool for combating fake news on digital platforms.
1. INTRODUCTION
In recent years, the proliferation of misinformation and fake news has emerged as a significant
challenge in the digital age. With the rapid expansion of online platforms and social media,
individuals are often exposed to a vast array of news articles, making it increasingly difficult
to discern credible information from falsehoods. The consequences of spreading fake news
can be detrimental, leading to public confusion and misinformed decisions. As a response to
this pressing issue, the need for automated systems that can effectively classify news content
has become more critical than ever.
This project aims to build a machine learning system capable of accurately classifying news
articles as real or fake, leveraging textual data to facilitate its predictions. By utilizing
established natural language processing techniques, the system processes various textual
features, including the article's title, author, and content. The project incorporates a
comprehensive approach that includes data preprocessing steps such as stemming, word
normalization, and vectorization using the TF-IDF method. These steps ensure that the model
can interpret and analyze the text data effectively, paving the way for robust predictions.
To achieve the classification goal, we implement a logistic regression model, which is adept
at handling binary classification tasks. The model is trained on a dataset of labeled
news articles, allowing it to learn the underlying patterns that differentiate real news from fake
news. With a training accuracy score of 98%, the project demonstrates the potential of machine
learning in combating misinformation. This system not only showcases the capabilities of AI
in text classification but also serves as a valuable tool for users seeking to verify the
authenticity of news articles in an increasingly complex information landscape.
1.1 Motivation
In the digital age, access to information is unprecedented, but so is the proliferation of
misinformation and fake news. With social media and online platforms serving as primary
sources of news, the public is increasingly exposed to misleading and inaccurate information.
This not only distorts public perception but can also lead to serious societal consequences,
including a loss of trust in legitimate news sources, increased polarization, and public health
risks, particularly when false information is spread regarding critical issues such as health or
political events.
1.5 Requirements Specification
2. LITERATURE SURVEY
The literature on fake news detection highlights a growing interest in utilizing machine learning
and natural language processing (NLP) techniques to combat misinformation. Early studies
primarily focused on identifying the unique features of fake news articles compared to reliable
sources, such as linguistic cues, sentiment analysis, and credibility indicators. Researchers like
Lazer et al. (2018) emphasized the role of social media in amplifying false information,
prompting a wave of investigation into how algorithms could be employed to recognize and
stop the spread of fake news. This backdrop established a foundation for further exploration
into effective detection methodologies.
Recent advancements have led to a variety of machine learning models being applied to the
task of fake news detection. Techniques such as support vector machines, random forests, and
neural networks have shown promising results. For example, the work of Shang et al. (2020)
employed deep learning approaches to improve accuracy in classification tasks by leveraging
large datasets of news articles. Moreover, the integration of NLP techniques, like tokenization,
stemming, and the use of TF-IDF vectors, has significantly enhanced the feature extraction
process. These developments underscore the importance of sophisticated text processing
methodologies in building reliable classification systems.
The ongoing research in this field continues to evolve, with scholars exploring hybrid models
that combine multiple algorithms and approaches for greater accuracy. Recent studies have
introduced ensemble methods that aggregate the predictions of various classifiers to improve
performance. Furthermore, there is a growing focus on the ethical implications of automated
fake news detection, including bias in the training data and in the resulting algorithms.
This literature survey highlights the dynamic nature of fake news detection research and the
critical need for continuous innovation in machine learning techniques to adapt to emerging
trends in misinformation dissemination.
Table 2.1: Comparison of Literature Survey
3. METHODOLOGY
The implementation of the machine learning project for classifying news articles as real or fake
involves several key steps, ranging from data collection and preprocessing to model training
and evaluation. Below is a detailed breakdown of the implementation process.
3.1 IMPLEMENTATION
1. Data Collection
The first step in implementing the project is to gather a suitable dataset comprising labeled
news articles. The dataset should contain a diverse array of news articles classified as either
real or fake. Publicly available datasets, such as the "Fake News Dataset" from Kaggle or the
"LIAR dataset," can be used for this purpose. The data typically consists of columns for the
article's title, author, content, and label (real or fake).
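For illustration, a minimal loading step might look as follows; the file name train.csv is an
assumption, so adjust it to wherever the chosen dataset is saved:
python
import pandas as pd

# Load the labeled dataset (file name is illustrative)
df = pd.read_csv('train.csv')
print(df.shape)
print(df.columns.tolist())  # e.g. ['id', 'title', 'author', 'text', 'label']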
2. Data Preprocessing
Once the data is collected, preprocessing is essential to prepare it for machine learning:
Combining Features: Create a new column `content` by combining the `author` and `title`
columns. This new column serves as the main input for classification.
python
# Combine author and title; fill missing values so concatenation does not yield NaN
df['content'] = df['author'].fillna('') + ' ' + df['title'].fillna('')
Text Normalization: Convert all text to lowercase to maintain consistency and facilitate
analysis.
python
# Lowercase all text so matching is case-insensitive
df['content'] = df['content'].str.lower()
Stemming: Apply stemming to reduce words to their root form, which helps standardize the
text. This can be done with libraries such as NLTK or spaCy.
python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stem every whitespace-separated token in the combined text
df['content'] = df['content'].apply(
    lambda x: ' '.join(stemmer.stem(word) for word in x.split())
)
Vectorization: Use the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization
to convert the text data into numerical representations that can be fed into a machine learning
model.
python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['content'])  # sparse TF-IDF feature matrix
y = df['label']  # assuming 'label' encodes real/fake (e.g. 0 = real, 1 = fake)
3. Model Training
With the preprocessed data in hand, the next step is to train a machine learning model. A logistic
regression model can be selected for its ease of implementation and effectiveness in binary
classification tasks.
Splitting the Data: Split the dataset into training and testing sets to evaluate model
performance.
python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Training the Model: Fit the logistic regression model on the training data.
python
from sklearn.linear_model import LogisticRegression

# Fit a logistic regression classifier on the TF-IDF features
model = LogisticRegression()
model.fit(X_train, y_train)
4. Model Evaluation
After training the model, it's important to evaluate its performance using the test set.
Predictions and Accuracy: Use the model to predict labels for the test set and calculate the
accuracy.
python
from sklearn.metrics import accuracy_score

# Predict on the held-out test set and compute accuracy
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy * 100:.2f}%')
5. Classification Models
The following classifiers were evaluated (a combined training sketch appears after this list):
Logistic Regression: The model passes a linear combination of the features through a sigmoid
function, σ(z) = 1 / (1 + e^(-z)), producing a probability that is thresholded (typically at
0.5) into a binary outcome (0 or 1).
Support Vector Machine: SVM is a powerful machine learning algorithm for classification
tasks that aims to find the optimal hyperplane separating the classes, in this case real and
fake news articles.
Decision Tree: A decision tree is a flowchart-like structure that classifies or predicts
outcomes through a sequence of feature-based splits.
Random Forest: Random Forest is an ensemble learning method that constructs multiple
decision trees during training and outputs the mode of their predicted classes for
classification tasks.
Naive Bayes: Naive Bayes is a simple and efficient probabilistic classifier based on Bayes'
theorem, under the assumption that features are independent.
Radial Basis Function (RBF): A kernel function used in algorithms such as Support Vector
Machines and RBF neural networks, which measures the similarity of two data points as a
function of the distance between them, K(x, x') = exp(-γ ||x - x'||²).
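As a sketch, the six classifiers above can be trained and compared on the TF-IDF features as
follows. This assumes X_train, X_test, y_train, y_test from the earlier steps; the
hyperparameters shown are illustrative defaults, not the report's tuned settings:
python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (linear)': SVC(kernel='linear'),
    'SVM (RBF kernel)': SVC(kernel='rbf'),   # RBF-kernel variant described above
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Naive Bayes': MultinomialNB(),          # suited to nonnegative TF-IDF features
}

# Fit each model and report its test-set accuracy
for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f'{name}: {acc:.3f}')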
3.2 Project Architecture
UML Diagram
3.2.1 Activity Diagram
4. TESTING AND RESULTS
Testing is a crucial phase that determines the quality of the trained models and the
importance of the features under consideration. The algorithms used in this project
have been rigorously evaluated on several metrics, including accuracy, recall,
precision, F1 score, and the kappa statistic.
Accuracy - It measures how many observations, both positive and negative, were
correctly classified.
Recall - It measures how many of all truly positive observations were classified as
positive. For fake news detection, it tells us how many of the fake articles in the
test set the model actually flagged as fake. While optimizing recall, you want to make
sure the model identifies ALL of the fake articles.
Precision - It measures how many of the observations predicted as positive are in fact
positive. For fake news detection, it tells us what fraction of the articles flagged
as fake are actually fake. While optimizing precision, you want to make sure that the
articles the model flags as fake ARE ACTUALLY FAKE.
F1 score - It combines precision and recall into one metric: their harmonic mean. A
perfect F1 score is 1.0 (100%); the closer it is to 1.0, the better the model. It is
calculated as:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
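As a brief sketch, all of these metrics can be computed with scikit-learn on the test-set
predictions from Section 3.1 (this assumes the labels are encoded as 0/1):
python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

# predictions = model.predict(X_test) from the evaluation step above
print(f"Accuracy : {accuracy_score(y_test, predictions):.3f}")
print(f"Precision: {precision_score(y_test, predictions):.3f}")
print(f"Recall   : {recall_score(y_test, predictions):.3f}")
print(f"F1 score : {f1_score(y_test, predictions):.3f}")
print(f"Kappa    : {cohen_kappa_score(y_test, predictions):.3f}")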
4.1 Model Performances
The classification reports and ROC curves for the individual models are shown in Figures
4.1 through 4.6.
4.1.1 Logistic Regression
4.1.2 Support Vector Machine
4.1.3 Decision Tree
4.1.4 Naive Bayes
4.1.5 RBF
4.1.6 Random Forest
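Since the figures are not reproduced inline here, a minimal sketch of how each per-model
report and ROC curve can be generated with scikit-learn follows (model stands for any fitted
classifier from the list above; this is an assumption, not the report's exact plotting code):
python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, RocCurveDisplay

# Per-class precision, recall, F1, and support
print(classification_report(y_test, model.predict(X_test)))

# ROC curve computed from the fitted estimator's scores
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.show()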
4.2 Comparison of Models
From the results above, we observe that the Random Forest algorithm performs best, as is
evident from its high accuracy, precision, and F1 score.
5. CONCLUSION AND FUTURE WORK
5.1 Conclusion
In conclusion, the project successfully demonstrates the process of building a machine learning
system to classify news articles as real or fake using textual data. By employing various natural
language processing techniques such as stemming and the TF-IDF vectorization method, the
project effectively converts unstructured text into a structured format suitable for analysis. The
implementation of a logistic regression model yielded a high accuracy score of 98% on the
training data, indicating the model's capability to distinguish between genuine and misleading
news articles. This project highlights the importance of machine learning in combating
misinformation and the potential for automated systems to assist readers in evaluating the
credibility of news.
For future work, several avenues can be explored to enhance the performance and robustness
of the classification system:
1. Larger and Diverse Datasets: Expanding the dataset to include a broader range of news
topics, sources, and styles can improve the model's generalizability and ability to handle
different types of misinformation.
2. Real-time Detection: Developing a real-time news classification system that can analyze and
label articles as they are published would be a valuable tool for combating fake news on social
media platforms.
BIBLIOGRAPHY