
A Machine Learning Project Report Fake News Prediction

Uploaded by

prasunagummadi
Copyright
© © All Rights Reserved

A Machine Learning Project Report (CM551PC)

on

Fake News Prediction


Submitted
in partial fulfilment of the requirements for
the award of the degree of

Bachelor of Technology
in
Computer Science and Engineering (AI&ML)
by
M.SUMAIYA (22261A6638)
V.AARTHI (22261A6659)

Under the guidance of


Mrs.J.Sreedevi
(Assistant Professor)

DEPARTMENT OF EMERGING TECHNOLOGIES


Mahatma Gandhi Institute of Technology (Autonomous)
(Affiliated to Jawaharlal Nehru Technological University Hyderabad)
Kokapet(V), Gandipet(M), Hyderabad.
Telangana - 500 075.

2024 - 2025
TABLE OF CONTENTS

List of Figures
List of Tables
Abstract

1. Introduction
1.1 Motivation
1.2 Problem Definition
1.3 Existing System
1.4 Proposed System
1.5 Requirements Specification
1.5.1 Software Requirements
1.5.2 Hardware Requirements

2. Literature Survey

3. Methodology
3.1 Implementation
3.2 Project Architecture
3.2.1 Activity Diagram

4. Testing and Results
4.1 Model Performances
4.2 Comparison of Models

5. Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

6. Bibliography
LIST OF FIGURES

Figure 3.2.1 Activity Diagram
Figure 4.1 Classification Report and ROC of LR
Figure 4.2 Classification Report of SVM
Figure 4.3 Classification Report of Decision Tree
Figure 4.4 Classification Report of Naïve Bayes
Figure 4.5 Classification Report of RBF
Figure 4.6 Classification Report and ROC of Random Forest

LIST OF TABLES

Table 2.1 Comparison of Literature survey
Table 4.1 Comparison of Results
ABSTRACT

This project aims to develop a machine learning system capable of classifying news articles as
either real or fake using textual data. The goal is to enhance information integrity in an era
where misinformation proliferates.

The project begins with the importation of essential libraries, which provide the necessary tools
for data manipulation and machine learning model development. A significant preprocessing
step involves the creation of a 'content' column by combining the 'author' and 'title' of each
article, augmenting the feature set used for classification.

To prepare the text data for analysis, techniques such as stemming are employed, reducing
words to their root forms to maintain consistency in the dataset. All text is converted to
lowercase to eliminate case sensitivity issues. The textual data is then transformed into
numerical format using the Term Frequency-Inverse Document Frequency (TF-IDF) method,
which quantifies the importance of words in relation to the entire dataset.

The classification task is executed using a logistic regression model, which predicts the
authenticity of articles based on the computed features. The model demonstrates high efficacy,
achieving an accuracy score of 98% on the training data. This project underscores the effective
use of machine learning techniques in distinguishing between legitimate and misleading news
content, offering a potential tool for combating fake news in digital platforms.

1. INTRODUCTION

In recent years, the proliferation of misinformation and fake news has emerged as a significant
challenge in the digital age. With the rapid expansion of online platforms and social media,
individuals are often exposed to a vast array of news articles, making it increasingly difficult
to discern credible information from falsehoods. The consequences of spreading fake news
can be detrimental, leading to public confusion and misinformed decisions. As a response to
this pressing issue, the need for automated systems that can effectively classify news content
has become more critical than ever.

This project aims to build a machine learning system capable of accurately classifying news
articles as real or fake, leveraging textual data to facilitate its predictions. By utilizing
established natural language processing techniques, the system processes various textual
features, including the article's title, author, and content. The project incorporates a
comprehensive approach that includes data preprocessing steps such as stemming, word
normalization, and vectorization using the TF-IDF method. These steps ensure that the model
can interpret and analyze the text data effectively, paving the way for robust predictions.

To achieve the classification goal, we implement a logistic regression model, which is adept
at handling binary classification tasks. The model is trained on a dataset comprised of labeled
news articles, allowing it to learn the underlying patterns that differentiate real news from fake
news. With a training accuracy score of 98%, the project demonstrates the potential of machine
learning in combating misinformation. This system not only showcases the capabilities of AI
in text classification but also serves as a valuable tool for users seeking to verify the
authenticity of news articles in an increasingly complex information landscape.
1.1 Motivation
In the digital age, access to information is unprecedented, but so is the proliferation of
misinformation and fake news. With social media and online platforms serving as primary
sources of news, the public is increasingly exposed to misleading and inaccurate information.
This not only distorts public perception but can also lead to serious societal consequences,
including a loss of trust in legitimate news sources, increased polarization, and public health
risks, particularly when false information is spread regarding critical issues such as health or
political events.

1.2 Problem Definition


The project aims to develop a machine learning system that classifies news articles as either
real or fake based on their textual content, addressing the critical issue of misinformation in
today's digital landscape. By preprocessing the text data through normalization techniques,
stemming, and TF-IDF vectorization, the project will utilize a logistic regression model to
predict article authenticity. The goal is to achieve a high accuracy rate (targeting around 98%)
in differentiating between real and fake news, ultimately providing a valuable tool for
individuals and organizations to combat the spread of misinformation and promote informed
decision-making.

1.3 Existing System


The existing systems for fake news detection include rule-based approaches that use heuristics
and predefined criteria, traditional machine learning models like Naive Bayes and SVMs
relying on manual feature extraction, and advanced deep learning techniques such as RNNs
and CNNs that learn features from large datasets. While these systems can vary in
effectiveness, they face significant limitations, such as inflexibility to adapt to evolving
misinformation tactics, the need for extensive labeled datasets, and scalability issues due to
labor-intensive verification processes. Additionally, many current platforms lack transparency,
impacting user trust in their capabilities. The proposed machine learning system aims to
address these challenges by implementing a comprehensive approach with advanced
preprocessing and modeling techniques to achieve high accuracy in classifying news articles
efficiently.

1.4 Proposed System


The proposed system aims to develop a robust machine learning framework for classifying
news articles as real or fake by leveraging advanced text preprocessing techniques and an
efficient logistic regression model. It will begin by combining relevant text features, such as
author and title, to create a comprehensive content representation. The system will then employ
normalization methods, including stemming and TF-IDF vectorization, to convert textual data
into a format suitable for analysis. By focusing on a streamlined, scalable approach, the model
is designed to achieve high accuracy in distinguishing between authentic and misleading news
articles while enhancing adaptability to emerging trends in misinformation. This system not
only aims for superior classification performance but also emphasizes transparency and user
trust, providing a valuable tool for individuals and organizations to navigate the complexities
of news authenticity effectively.

1.5 Requirements Specification

1.5.1 Software Requirements


I. Programming Language: Python 3.12
II. Operating System: Windows / Linux / macOS
III. Tools and Libraries:
Data Analysis and Modeling: NumPy, Pandas, Scikit-learn,
TensorFlow/PyTorch, Matplotlib, Seaborn
Data Visualization: Plotly, Dash (optional for interactive visualizations)
Data Preprocessing: SciPy, Statsmodels
Notebook Environment: Jupyter Notebook or JupyterLab

1.5.2 Hardware Requirements


I. Processor: Intel Core i5 or equivalent
II. RAM: 8 GB minimum, to handle large datasets efficiently
III. Storage: 20 GB free space for dataset storage and model checkpoints

2. LITERATURE SURVEY

The literature on fake news detection highlights a growing interest in utilizing machine learning
and natural language processing (NLP) techniques to combat misinformation. Early studies
primarily focused on identifying the unique features of fake news articles compared to reliable
sources, such as linguistic cues, sentiment analysis, and credibility indicators. Researchers like
Lazer et al. (2018) emphasized the role of social media in amplifying false information,
prompting a wave of investigation into how algorithms could be employed to recognize and
stop the spread of fake news. This backdrop established a foundation for further exploration
into effective detection methodologies.

Recent advancements have led to a variety of machine learning models being applied to the
task of fake news detection. Techniques such as support vector machines, random forests, and
neural networks have shown promising results. For example, the work of Shang et al. (2020)
employed deep learning approaches to improve accuracy in classification tasks by leveraging
large datasets of news articles. Moreover, the integration of NLP techniques, like tokenization,
stemming, and the use of TF-IDF vectors, has significantly enhanced the feature extraction
process. These developments underscore the importance of sophisticated text processing
methodologies in building reliable classification systems.

The ongoing research in this field continues to evolve, with scholars exploring hybrid models
that combine multiple algorithms and approaches for greater accuracy. Recent studies have
introduced ensemble methods that aggregate the predictions of various classifiers to improve
performance. Furthermore, there is a growing focus on the ethical implications of automated
fake news detection, including biased training data and the algorithmic bias that can result from it.
This literature survey highlights the dynamic nature of fake news detection research and the
critical need for continuous innovation in machine learning techniques to adapt to emerging
trends in misinformation dissemination.

Table 2.1: Comparison of Literature survey

3. METHODOLOGY

The implementation of the machine learning project for classifying news articles as real or fake
involves several key steps, ranging from data collection and preprocessing to model training
and evaluation. Below is a detailed breakdown of the implementation process.

3.1 IMPLEMENTATION

1. Data Collection
The first step in implementing the project is to gather a suitable dataset comprising labeled
news articles. The dataset should contain a diverse array of news articles classified as either
real or fake. Publicly available datasets, such as the "Fake News Dataset" from Kaggle or the
"LIAR dataset," can be used for this purpose. The data typically consists of columns for the
article's title, author, content, and label (real or fake).

2. Data Preprocessing
Once the data is collected, preprocessing is essential to prepare it for machine learning:

Combining Features: Create a new column `content` by combining the `author` and `title`
columns. This new column serves as the main input for classification.

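python
# Fill missing values first; otherwise a NaN author or title makes the whole string NaN
df['content'] = df['author'].fillna('') + ' ' + df['title'].fillna('')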

Text Normalization: Convert all text to lowercase to maintain consistency and facilitate
analysis.

python
df['content'] = df['content'].str.lower()

Stemming: Apply stemming to reduce words to their root form, which helps in text
standardization. This can be done with NLTK's PorterStemmer (spaCy provides lemmatization
rather than stemming).

python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Stem every whitespace-separated word and join the stems back into one string
df['content'] = df['content'].apply(
    lambda text: ' '.join(stemmer.stem(word) for word in text.split())
)

Vectorization: Use the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization
to convert the text data into numerical representations that can be fed into a machine learning
model.

python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['content'])
y = df['label'] # Assuming 'label' contains real or fake
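
To make the TF-IDF weighting concrete, the following is a minimal pure-Python sketch of the textbook formula, tf × ln(N / df). It is an illustration only: the toy headlines and the `tf_idf` helper are hypothetical, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, not taken from the project dataset.
toy_docs = [
    "senator denies report",
    "aliens endorse senator",
    "report confirms budget",
]
toy_weights = tf_idf(toy_docs)
print(toy_weights[0])
```

In the first headline, "denies" occurs in only one document and therefore receives a higher weight than "senator" or "report", which each occur in two.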

3. Model Training
With the preprocessed data in hand, the next step is to train a machine learning model. A logistic
regression model can be selected for its ease of implementation and effectiveness in binary
classification tasks.

Splitting the Data: Split the dataset into training and testing sets to evaluate model
performance.

python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
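
As an illustration of what a hold-out split does, the sketch below shuffles row indices with a fixed seed and reserves 20% of them for testing. The `holdout_split` helper and the toy rows are hypothetical, and the sketch does not reproduce the exact rows that `train_test_split` selects. One caveat worth noting: in the listing above the vectorizer was fitted on all rows before the split, so document frequencies from the test rows already shaped the features; fitting it on the training rows only is the stricter protocol.

```python
import random

def holdout_split(items, labels, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then reserve a test_size share for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

# Hypothetical toy rows; label 1 is assumed to mean "fake".
toy_titles = [f"headline {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
train_x, test_x, train_y, test_y = holdout_split(toy_titles, toy_labels)
print(len(train_x), len(test_x))
```

Because the seed is fixed, repeated runs produce the same partition, which keeps accuracy figures comparable between experiments.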

Training the Model: Fit the logistic regression model on the training data.

python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

4. Model Evaluation
After training the model, it's important to evaluate its performance using the test set.

Predictions and Accuracy: Use the model to predict labels for the test set and calculate the
accuracy.
python
from sklearn.metrics import accuracy_score
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy * 100:.2f}%')

5. Classification Models

Logistic Regression: The model passes a weighted sum of the features through a sigmoid
function to obtain a probability, which is then thresholded into a binary outcome (0 or 1).
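
The prediction step of logistic regression can be sketched in a few lines. The weights, bias and feature values below are hypothetical numbers chosen for illustration, not coefficients from the trained model, and label 1 is assumed to mean "fake".

```python
import math

def sigmoid(z):
    """Numerically stable logistic function mapping any real score to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return (1 if probability >= threshold else 0), probability

# Hypothetical TF-IDF values for three terms and hypothetical learned coefficients.
toy_features = [0.4, 0.1, 0.3]
toy_weights = [2.0, -1.0, 0.5]
toy_bias = -0.25
label, probability = predict_label(toy_features, toy_weights, toy_bias)
print(label, round(probability, 3))
```

Training consists of finding the weights and bias that make these probabilities agree with the labelled examples; that optimization is what `LogisticRegression.fit` performs.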

Support Vector Machine: SVM is a machine learning algorithm used for classification tasks.
It aims to find the optimal hyperplane that separates the classes, in this case real and fake
news articles.
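
Once a separating hyperplane w·x + b = 0 has been found, classification reduces to checking on which side of it a point lies. The sketch below shows only that decision step, with a hypothetical `w` and `b`; finding the maximum-margin hyperplane is the training problem and is not shown.

```python
import math

def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Return 1 on the non-negative side of the hyperplane, otherwise 0."""
    return 1 if decision_value(x, w, b) >= 0 else 0

def distance_to_hyperplane(x, w, b):
    """Geometric distance from x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(x, w, b)) / norm

# Hypothetical hyperplane parameters, chosen so the arithmetic is easy to follow.
toy_w, toy_b = [3.0, 4.0], -5.0
print(classify([3.0, 4.0], toy_w, toy_b), distance_to_hyperplane([3.0, 4.0], toy_w, toy_b))
print(classify([0.0, 0.0], toy_w, toy_b), distance_to_hyperplane([0.0, 0.0], toy_w, toy_b))
```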

Decision tree: A decision tree is a flowchart-like structure used in machine learning for
making decisions based on feature splits to classify or predict outcomes.
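
A single feature split can be illustrated with Gini impurity, a common split criterion. The sketch below searches one numeric feature for the threshold that gives the purest left and right groups. The feature values and labels are hypothetical, and a full tree would repeat this search recursively over all features.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = pure, 0.5 = evenly mixed)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_positive = sum(labels) / n
    return 1.0 - p_positive ** 2 - (1.0 - p_positive) ** 2

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must send rows to both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical TF-IDF values of one term, with label 1 assumed to mean "fake".
toy_values = [0.0, 0.1, 0.2, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(toy_values, toy_labels))
```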

Random Forest: Random Forest is an ensemble learning method that constructs multiple
decision trees during training and outputs the mode of their classes for classification tasks.
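
The voting step described above can be sketched as follows. The three "trees" are hypothetical one-rule stand-ins operating on made-up features, not trained models. Note also that scikit-learn's `RandomForestClassifier` averages per-tree class probabilities rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most frequent class among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_predict(trees, sample):
    """Ask every tree for a prediction and return the mode of their answers."""
    return majority_vote([tree(sample) for tree in trees])

# Hypothetical one-rule "trees"; label 1 is assumed to mean "fake".
toy_trees = [
    lambda s: 1 if s["exclamations"] > 2 else 0,
    lambda s: 1 if s["all_caps_words"] > 1 else 0,
    lambda s: 1 if s["has_author"] == 0 else 0,
]
toy_sample = {"exclamations": 4, "all_caps_words": 0, "has_author": 0}
print(forest_predict(toy_trees, toy_sample))
```

Using an odd number of trees avoids ties in a two-class problem.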

Naive Bayes: Naive Bayes is a simple and efficient probabilistic classifier based on Bayes'
theorem, assuming feature independence.
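
The independence assumption means that the score of a class is its prior multiplied by one factor per word. Below is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing, working in log space to avoid underflow. The four training headlines are hypothetical, and label 1 is assumed to mean "fake".

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(doc, model):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in doc.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data.
toy_docs = [
    "shocking miracle cure revealed",
    "you will not believe this shocking trick",
    "parliament passes budget bill",
    "court upholds election result",
]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb("shocking cure", toy_model), predict_nb("budget bill", toy_model))
```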

Radial Basis Function (RBF): It is a type of kernel function used in machine learning
algorithms, such as Support Vector Machines and neural networks, that calculates the similarity
between data points based on their distance from a center point.
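
The usual Gaussian form of this kernel is exp(-gamma × squared distance). The sketch below evaluates it directly; the `gamma` value and the points are arbitrary choices for illustration.

```python
import math

def rbf_kernel(x, center, gamma=1.0):
    """Similarity that is 1.0 at the center and decays with squared Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-gamma * squared_distance)

# Hypothetical points: identical vectors score 1.0, distant ones approach 0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1))
```

A larger `gamma` makes the similarity fall off faster, so each training point influences a smaller neighbourhood.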

3.2 Project Architecture
UML Diagram
3.2.1 Activity Diagram

4. TESTING AND RESULTS
Testing is a crucial phase that determines the quality of the models used as well as the
importance of the features under consideration. The algorithms used in this project
were evaluated on several metrics, including accuracy, recall, precision, F1 score and
the kappa statistic.
Accuracy - It measures how many observations, both positive and negative, were
correctly classified.

Recall - It measures how many of all the actually positive observations were classified
as positive. Taking fake articles as the positive class, it tells us what share of all
the fake articles the model caught. When optimizing recall, the aim is to identify as
many of the fake articles as possible.

Precision - It measures how many of the observations predicted as positive are in fact
positive. Here, it is the share of articles flagged as fake that really are fake. When
optimizing precision, the aim is to make sure that the articles flagged as fake are
actually fake.

F-1 score - It combines precision and recall into one metric: the harmonic mean of the
two. A perfect F1 score is 1.0 (100%), and the closer the score is to 1.0, the better
the model. It is calculated as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
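
As a worked example, the metrics listed in this section, including Cohen's kappa, can be computed from the four confusion-matrix counts. The labels and predictions below are hypothetical and do not come from the project's results; label 1 is assumed to mean "fake".

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and Cohen's kappa."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Kappa compares observed agreement with the agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}

# Hypothetical labels and predictions.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
toy_scores = evaluate(toy_true, toy_pred)
print(toy_scores)
```

With 3 true positives, 1 false positive, 1 false negative and 5 true negatives, accuracy is 0.8 while precision, recall and F1 are each 0.75, which shows how accuracy alone can look more favourable than the per-class metrics.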

4.1 Model Performances
4.1.1 Logistic Regression

Figure 4.1. Classification Report and ROC of LR

4.1.2 Support Vector Machine

Figure 4.2. Classification Report of SVM

4.1.3 Decision Tree

Figure 4.3. Classification Report of Decision Tree

4.1.4 Naive Bayes

Figure 4.4. Classification Report of Naïve Bayes

4.1.5 Radial Basis Function (RBF)

Figure 4.5. Classification Report of RBF

4.1.6 Random Forest

Figure 4.6. Classification Report and ROC of Random Forest

4.2 Comparison of Models

A thorough comparison of the algorithms based on the metrics mentioned above gives a
comprehensive insight into the performance and efficiency of each of them.

Table 4.1: Comparison of Results

From the above table, we observe that the Random Forest algorithm performs best, as is
evident from its high accuracy, precision and F1 score.

5. CONCLUSION AND FUTURE WORK

5.1 Conclusion

In conclusion, the project successfully demonstrates the process of building a machine learning
system to classify news articles as real or fake using textual data. By employing various natural
language processing techniques such as stemming and the TF-IDF vectorization method, the
project effectively converts unstructured text into a structured format suitable for analysis. The
implementation of a logistic regression model yielded a high accuracy score of 98% on the
training data, indicating the model's capability to distinguish between genuine and misleading
news articles. This project highlights the importance of machine learning in combating
misinformation and the potential for automated systems to assist readers in evaluating the
credibility of news.

5.2 Future Work

For future work, several avenues can be explored to enhance the performance and robustness
of the classification system:

1. Model Optimization: Experimenting with various advanced algorithms beyond logistic
regression, such as support vector machines, random forests, or deep learning models like
recurrent neural networks (RNN) and transformers, can potentially improve classification
accuracy.

2. Larger and Diverse Datasets: Expanding the dataset to include a broader range of news
topics, sources, and styles can improve the model's generalizability and ability to handle
different types of misinformation.

3. Handling Imbalanced Data: Implementing techniques to manage class imbalance, which is
often present in real vs. fake news datasets, by using oversampling, undersampling, or synthetic
data generation methods can result in improved model performance.

4. Sentiment Analysis Integration: Combining sentiment analysis with news classification
could provide deeper insights into the nature of the articles and help in distinguishing subtle
forms of misinformation.

5. Real-time Detection: Developing a real-time news classification system that can analyze and
label articles as they are published would be a valuable tool for combating fake news on social
media platforms.

BIBLIOGRAPHY

• O'Sullivan, D. (2020). Machine Learning for Text: A Comprehensive Guide to Data
Science for Text Classification. New York: Springer.
• Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information
Retrieval. Cambridge: Cambridge University Press.
• Raschka, S., & Mirjalili, V. (2019). Python Machine Learning: Machine
Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2.
Birmingham: Packt Publishing.
• K. D. O. G. J. N. P. (2018). Fake News Detection on Social Media: A Data Mining
Perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22-36.
doi:10.1145/3287560.3287598.
• Cohen, J. (2021). "The Role of Natural Language Processing in Fake News
Detection." Journal of Machine Learning Research, 22(1), 1-15. Retrieved
from http://www.jmlr.org/papers/volume22/19-088/19-088.pdf
• Pang, B., & Lee, L. (2008). "Opinion Mining and Sentiment Analysis." Foundations
and Trends in Information Retrieval, 2(1-2), 1-135. doi:10.1561/1500000001.
• Zhang, L., & Wang, S. (2020). "Fine-Tuning Pretrained Transformers for Text
Classification." Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, 9-15. Retrieved
from https://www.aclweb.org/anthology/2020.acl-main.1
• Scikit-learn documentation. (n.d.). Retrieved
from https://scikit-learn.org/stable/user_guide.html
• S. J. M. (2021). "TF-IDF: A Comprehensive Explanation." Towards Data Science.
Retrieved from https://towardsdatascience.com/tf-idf-a-comprehensive-explanation-1c094499e332
• B. S. (2023). "Understanding Stemming and Lemmatization." DataCamp Community.
Retrieved from https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
