Internshipreport 15
Women(Autonomous)
GUIDE DETAILS:
I hereby declare that the summer internship program entitled "Fake News
Detection", submitted to the Stanley College of Engineering and Technology for
Women, is a record of original work done by me. This project work is submitted
in partial fulfilment of the requirements for the award of the degree of B.E. in
Computer Science and Engineering.
I wish to express my sincere thanks to Sri. Kodali Krishna Rao, Correspondent and Secretary,
for providing all the necessary facilities.
I place on record my sincere gratitude to Dr. Satya Prasad Lanka, Principal, for his constant
encouragement.
I express my sincere thanks to my Head of the Department, Dr. YVS Sai Pragathi, for
encouraging me and allowing me to present the internship on the topic "Fake News Detection" at my
department premises in partial fulfilment of the requirements leading to the award of the
B.E. degree.
I thank my Project Coordinator, Mrs. B.G. Prasuna, for her guidance and support throughout the
duration of my project. I take this opportunity to thank all my faculty, who have directly or
indirectly helped with my project.
Last but not least, I express my thanks to my friends for their cooperation and support.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Overview
1.2 Purpose
2. LITERATURE SURVEY
3. THEORETICAL ANALYSIS
4. EXPERIMENTAL INVESTIGATION
5. FLOWCHART
6. RESULT
8. APPLICATIONS
9. CONCLUSION
10. FUTURE SCOPE
11. BIBLIOGRAPHY
APPENDIX
A. Source code
LIST OF FIGURES
This project addresses the critical issue of fake news detection through a machine learning
approach, employing a logistic regression model to classify news articles as either true or fake.
The dataset comprises two CSV files containing labeled news articles, which are combined and
preprocessed for analysis. Key preprocessing steps include text cleaning, which involves
converting text to lowercase, removing URLs, punctuation, and stopwords, as well as merging
the title and body of the articles into a single content feature.
The text data is transformed into numerical features using the Term Frequency-Inverse
Document Frequency (TF-IDF) vectorization technique, which effectively captures the
significance of words in the context of the dataset. The dataset is split into training and testing
subsets, with 75% allocated for training the model and 25% for evaluating its performance.
Model performance is assessed using accuracy, classification reports, and confusion matrices,
which provide insights into the model's predictive capabilities. Additionally, the Receiver
Operating Characteristic (ROC) curve is plotted to evaluate the trade-off between the true
positive rate and the false positive rate, while the area under the ROC curve (AUC) is computed
to quantify model performance. A validation curve is generated to analyze the effect of the
logistic regression regularization parameter C on accuracy, enhancing understanding of model
behavior under varying regularization strengths.
The results indicate a robust capability for distinguishing between true and fake news articles,
thereby demonstrating the potential of machine learning in combating misinformation in
today’s digital landscape.
Keywords: Fake News Detection, Machine Learning, Logistic Regression, Preprocessing,
TF-IDF Vectorization, Training and Testing, Model Performance, ROC Curve, AUC (Area Under
Curve)
1.1 OVERVIEW
This project focuses on developing a machine learning-based fake news detection system that
employs natural language processing (NLP) techniques to analyze and classify news articles as
either true or fake. Given the prevalence of misinformation and the rapid dissemination of false
information on social media and other online platforms, this project aims to create a reliable
solution that aids in identifying fake news. By leveraging logistic regression and feature
extraction methods such as TF-IDF, the system is designed to achieve high accuracy in
distinguishing between legitimate and misleading news articles. The ultimate purpose of this
project is to contribute to a more informed society by providing users with tools to critically
evaluate news content, thereby fostering media literacy and reducing the impact of fake news
on public perception and opinion.
In the broader context, the project underscores the importance of employing advanced analytics
and machine learning methodologies to tackle pressing social issues. Fake news has emerged
as a formidable challenge that affects political landscapes, public health, and societal norms.
By integrating machine learning with NLP, this project not only showcases the technical
feasibility of such solutions but also emphasizes the need for continuous improvement and
innovation in combating misinformation. The project's implementation can serve as a
foundation for further research and development, enhancing the effectiveness of fake news
detection systems and their applicability across various media platforms.
1.2 PURPOSE
The primary purpose of this project is to develop a robust and effective fake news detection
system that utilizes machine learning techniques to automatically classify news articles as
either true or fake. As misinformation continues to proliferate across various platforms, the
necessity for reliable tools that can help individuals and organizations assess the credibility of
news content has never been more critical. This project aims to:
1. Enhance Media Literacy: By providing users with an automated tool to evaluate news
articles, the project promotes critical thinking and informed decision-making among
users regarding the information they consume.
2. Support Journalistic Integrity: The system serves as a valuable resource for
journalists and news organizations, enabling them to fact-check information quickly
and accurately before publication.
3. Combat Misinformation: By identifying and flagging false information, the project
contributes to mitigating the impact of fake news on public perception, political
discourse, and societal behavior.
4. Advance Research in NLP and ML: The project serves as a foundation for further
research and exploration in natural language processing and machine learning,
providing insights that can be leveraged to improve detection methodologies and
explore new approaches to combat misinformation.
2. LITERATURE SURVEY
The fake news detection project centers on the exploration of machine learning algorithms
and NLP techniques applied to the domain of online misinformation. Fake news detection has
become increasingly relevant due to the rapid dissemination of false information on social
media and other platforms. Various research efforts have examined the role of text classification
models, leveraging machine learning techniques such as Logistic Regression, Support Vector
Machines (SVM), and neural networks. These models are trained on large labeled datasets of fake
and true news articles and are used to distinguish between reliable and misleading
content. One of the critical elements in such studies is feature extraction using techniques like
Term Frequency-Inverse Document Frequency (TF-IDF) to transform text data into numerical
vectors, which can be processed by machine learning models.
In this project, the Logistic Regression model is employed, which is widely recognized for its
effectiveness in binary classification problems like fake news detection. Numerous studies
highlight Logistic Regression as a baseline model for text classification, comparing its
performance against more advanced models such as Random Forest and deep learning approaches
like Convolutional Neural Networks (CNN) or Long Short-Term Memory (LSTM) networks. However,
Logistic Regression remains favored for its simplicity, ease of interpretation, and relatively
fast training process. Additionally, the use of TF-IDF for feature extraction aligns with
established literature, which emphasizes its role in transforming text into a format that
preserves important word-level distinctions.
Preprocessing steps such as data cleaning, tokenization, and the removal of stopwords are also
widely discussed in previous research. This project implements such preprocessing techniques to
enhance the model's performance by filtering out noise in the dataset, including punctuation,
digits, URLs, and common stopwords. These steps ensure that only meaningful textual content is
analyzed, contributing to the model's ability to classify fake and true news articles accurately.
Furthermore, studies have highlighted the importance of combining feature extraction techniques
with traditional models to boost classification accuracy. This aligns with the current project's
approach of leveraging TF-IDF with Logistic Regression. The survey also reveals ongoing
challenges in addressing false positives and improving model robustness. Many research papers
emphasize the limitations of classical machine learning models in dealing with highly deceptive
or contextually ambiguous content. As a result, future directions in this area point toward
integrating transformer-based models such as BERT (Bidirectional Encoder Representations from
Transformers), which are capable of capturing semantic relationships within text.
3. THEORETICAL ANALYSIS
3.1 BLOCK DIAGRAM
HARDWARE DESIGNING
The fake news detection project leverages a combination of machine learning
techniques, natural language processing (NLP), and logistic regression to differentiate between
fake and true news articles. Below is a step-by-step breakdown of the software requirements
based on the project code.
1. Google Colab
Use the "Upload" button in Colab or the files.upload() function (from google.colab) to upload the dataset files.
Install OpenCV: If OpenCV isn’t already installed in Colab, install it by running the following
command:
Once you upload your models, load them into your Colab notebook using OpenCV’s
cv2.dnn.readNet() method.
Objective: To use the trained logistic regression model to predict whether the news
articles in the test set are real (True) or fake.
The logistic regression model, after being trained on the training data (X_train_tfidf,
y_train), is now ready to make predictions on the test data (X_test_tfidf). This involves
inputting the TF-IDF features of the test dataset into the model to generate class
predictions (either "True" or "Fake").
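The training and prediction step described above can be sketched as follows. This is a minimal illustration, not the project's exact code: the toy corpus, the random_state value, the max_iter setting, and the choice to fit the vectorizer on the training text only are assumptions; the variable names X_train_tfidf, X_test_tfidf, y_train, and y_pred follow the report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical, for illustration only); 1 = true news, 0 = fake news
texts = [
    "government publishes annual budget report",
    "central bank holds interest rates steady",
    "city council approves new transit plan",
    "university releases peer reviewed climate study",
    "shocking miracle cure doctors hate revealed",
    "celebrity secretly replaced by alien clone",
    "you will not believe this secret trick",
    "secret plot exposed shocking truth revealed",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# 75% / 25% split as described in the report; stratify keeps both classes in each subset
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Learn the TF-IDF vocabulary on the training text, then reuse it for the test text
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train the classifier and predict a label (1 = true, 0 = fake) for each test article
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
print(list(zip(X_test, y_pred)))
```

With only eight toy articles the predictions themselves carry no meaning; the sketch is intended to show the order of operations only.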
Objective: To assess the performance of the model by comparing the predicted labels
(y_pred) against the actual labels (y_test), and to understand the quality of the
predictions using accuracy and detailed classification metrics.
sklearn (confusion_matrix), matplotlib, and seaborn: These are used to create and
display a confusion matrix plot.
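The confusion matrix gives a direct comparison of the actual labels versus the
predicted labels: the diagonal cells count correctly classified articles, and the
off-diagonal cells count misclassifications.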
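As a worked illustration of how the confusion matrix is read, the sketch below uses ten hypothetical labels (not results from this project) and derives accuracy, precision, and recall from the four cells. Treating "true news" (label 1) as the positive class is an assumption made for this example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration; 1 = true news, 0 = fake news
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# Rows are actual classes, columns are predicted classes, both ordered [fake, true]
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])

# With label 1 as the positive class, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()   # share of all articles classified correctly
precision = tp / (tp + fp)                 # share of "true" predictions that are correct
recall = tp / (tp + fn)                    # share of actual true articles that were found

print(conf_matrix)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```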
Objective: To evaluate the model’s ability to distinguish between true and fake news
through the ROC (Receiver Operating Characteristic) curve, which plots the True
Positive Rate (TPR) against the False Positive Rate (FPR) for different threshold values.
sklearn (roc_curve, auc) and matplotlib: These are used to compute and plot the
ROC curve.
The ROC curve is a graphical representation of the trade-off between true positives
and false positives. A model that performs well will have a curve that is closer to the
top-left corner.
The AUC (Area Under the Curve) score quantifies the overall performance of the
model, where a score closer to 1 indicates better performance.
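To make the relationship between thresholds, TPR, FPR, and AUC concrete, the following sketch computes the rates by hand at a single threshold and then lets scikit-learn sweep all thresholds. The four labels and scores are toy values chosen for illustration, and rates_at is a hypothetical helper, not part of the project code.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy data for illustration: actual labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted as class 1."""
    y_hat = (y_prob >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return tpr, fpr

# One point on the ROC curve, computed by hand
tpr_half, fpr_half = rates_at(0.5)

# The full curve and the area under it
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)

print(f"threshold 0.5: TPR={tpr_half:.2f}, FPR={fpr_half:.2f}")
print(f"AUC={roc_auc:.2f}")
```

Lowering the threshold moves along the curve toward higher TPR at the cost of higher FPR, which is the trade-off the ROC plot visualizes.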
4. EXPERIMENTAL INVESTIGATION
The dataset used for this fake news detection project consists of two CSV files, one containing
true news articles and the other containing fake news articles. Each dataset includes textual
information in the form of article titles and the full body of the text. Labels were manually
assigned to these datasets: a label of "1" was given to the true news articles, and "0" to the fake
news articles. The two datasets were merged into a single dataset for simplicity, combining the
titles and text into a "content" field that forms the basis for feature extraction. By using
both titles and article bodies, this approach lets the model draw on cues from the headline as
well as the body when assessing the veracity of a news item.
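The labeling and merging steps described above might look like the sketch below. The two small DataFrames are hypothetical stand-ins for the two CSV files (whose file names are not given here), and the column names "title" and "text" as well as the shuffle step are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; in practice each would be loaded with pd.read_csv
true_df = pd.DataFrame({
    "title": ["Budget approved", "Rates unchanged"],
    "text": ["Parliament approved the annual budget.", "The central bank kept rates steady."],
})
fake_df = pd.DataFrame({
    "title": ["Miracle cure found", "Aliens in city hall"],
    "text": ["Doctors hate this one trick.", "Officials secretly replaced by aliens."],
})

# Assign labels: 1 for true news, 0 for fake news
true_df["label"] = 1
fake_df["label"] = 0

# Merge both sources into one dataset and build the combined "content" field
df = pd.concat([true_df, fake_df], ignore_index=True)
df["content"] = df["title"] + " " + df["text"]

# Shuffle so that true and fake articles are interleaved (assumed step)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df[["content", "label"]])
```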
Text preprocessing steps, including converting text to lowercase, removing URLs, punctuation,
numbers, and stopwords, are applied to the content field to clean and standardize the input data.
The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer is used to convert the
processed text into numerical features that represent the importance of words relative to the
entire dataset. The vectorizer is restricted to 5,000 features to balance between the model’s
efficiency and the richness of the textual information. These numerical features serve as input
to the machine learning model, allowing it to capture linguistic patterns that distinguish fake
news from real news, based on word frequency and distribution.
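A minimal sketch of the cleaning and vectorization steps follows. The clean_text helper, the regular expressions, the two sample sentences, and the short STOP_WORDS set are illustrative assumptions; the project code uses NLTK's English stopword list, and only the 5,000-feature limit is taken from the report.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer

# Small hypothetical stopword list; the project code uses NLTK's English stopwords instead
STOP_WORDS = {"the", "a", "an", "is", "at", "of", "and", "to", "in", "on"}

def clean_text(text):
    """Lowercase the text and strip URLs, punctuation, digits, and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                  # URLs
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation
    text = re.sub(r"\d+", "", text)                                    # numbers
    return " ".join(w for w in text.split() if w not in STOP_WORDS)    # stopwords

docs = [
    "The Senate voted on 3 bills today! See https://example.com/news",
    "SHOCKING: Aliens land in the city at 9pm...",
]
cleaned = [clean_text(d) for d in docs]

# Keep at most 5,000 terms, as in the report
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(cleaned)

print(cleaned)
print(X.shape)
```

Each row of X is one article and each column one retained term, so the matrix can be passed directly to the classifier.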
DESIGN PRINCIPLES
The design principles of the fake news detection system prioritize accuracy,
efficiency, and scalability, ensuring that the system meets the needs of its users
effectively.
Accuracy is of utmost importance; the system must reliably differentiate between true
and false news articles to build and maintain user trust.
To achieve accuracy, the system employs rigorous preprocessing steps, which include
text normalization, removal of stopwords, and the application of TF-IDF
vectorization to transform textual data into meaningful numerical features suitable
for machine learning algorithms.
The selection of logistic regression as the classification model is deliberate,
emphasizing its interpretability and proven effectiveness in binary classification
tasks.
Efficiency in processing and analyzing large volumes of text data is essential for the
system's performance, ensuring quick results and seamless user experience.
The system utilizes optimized data handling and feature extraction techniques to
guarantee rapid processing times, enabling real-time analysis of news articles as they
are published.
The design also considers scalability to handle increasing datasets, ensuring the system
remains functional as more articles are introduced over time.
By incorporating a modular architecture, the system can be easily updated or
expanded with new algorithms and features, allowing it to stay relevant and effective
in addressing emerging challenges in the realm of fake news detection.
The design principles focus on creating a system that is both robust and adaptive,
catering to the evolving nature of digital news consumption and misinformation.
Overall, the emphasis on accuracy, efficiency, and scalability ensures that the fake news
detection system can effectively serve its purpose while maintaining user trust and
adapting to future developments.
USER BENEFITS
1. The primary user benefit of this fake news detection system is its potential to
significantly enhance media literacy among individuals and organizations.
2. By providing users with a powerful tool to evaluate the credibility of news articles, the
system empowers them to make informed decisions based on reliable information.
3. This capability is particularly valuable in an age where misinformation can
significantly influence public opinion and societal behavior.
4. Furthermore, the system can be instrumental for journalists, content creators, and
educators by serving as a supplementary resource in the fact-checking and news
validation processes.
5. By automating the detection of fake news, users can save time and resources, enabling
them to focus on content creation and dissemination rather than on fact-checking.
6. This automation contributes to a more informed public discourse, fostering critical
engagement with news content.
7. The system also supports the integrity of journalism by promoting accurate and
trustworthy information.
8. As users become more skilled in evaluating news sources, they enhance their ability to
combat the spread of false narratives.
9. Overall, the fake news detection system plays a crucial role in promoting critical
thinking and responsible media consumption.
10. By empowering users and improving the quality of information shared, the system
ultimately contributes to a well-informed society.
5. FLOW CHART
[Figure: flowchart of the evaluation stage, with the steps Make Predictions (predict news type),
Evaluate Accuracy (check model accuracy), Classification Report (generate report), and
Confusion Matrix (create matrix)]
ADVANTAGES
High Accuracy: The machine learning model, when trained on a well-labeled dataset,
can achieve high accuracy in classifying news articles.
Automation: The system automates the process of identifying fake news, reducing the
manual effort required for fact-checking.
Scalability: The modular design allows for easy updates and the integration of new
algorithms, enhancing adaptability to evolving fake news tactics.
User-Friendly: A well-designed user interface can make the system accessible to non-
technical users, promoting broader adoption.
Enhanced Credibility: Effective detection of fake news can help maintain the
credibility of information sources and media outlets, fostering trust among readers.
Informed Public: By identifying and filtering out false information, detection systems
contribute to a more informed public, allowing individuals to make better decisions
based on accurate data.
Mitigation of Misinformation Spread: Fake news detection helps limit the spread of
misinformation, especially during critical events like elections or public health crises,
reducing the potential for panic or confusion.
Promotion of Accountability: Detection systems can hold individuals and
organizations accountable for disseminating false information, promoting responsible
journalism and content creation.
Improved Media Literacy: The existence of detection tools encourages individuals to
develop critical thinking skills and media literacy, making them more discerning
consumers of news.
Support for Research: Fake news detection contributes to academic and journalistic
research by providing data on misinformation trends, helping scholars understand the
implications of fake news in society.
Protection Against Manipulation: By identifying false narratives, detection tools
protect users from being manipulated by malicious actors, such as those spreading
propaganda or disinformation for political gain.
Facilitating Fact-Checking: Detection systems often work in tandem with fact-
checking organizations, streamlining the verification process and making accurate
information more accessible.
DISADVANTAGES
Dependence on Data Quality: The accuracy of the model heavily relies on the
quality and representativeness of the training data. If the dataset contains biases or
unrepresentative samples, the model may perform poorly.
False Positives/Negatives: There is a risk of misclassifying legitimate news as fake
(false positives) or vice versa (false negatives), which could undermine user trust.
Complexity of Language: The subtleties of language and context can pose
challenges for automated detection systems, potentially limiting their effectiveness.
8. APPLICATIONS
The fake news detection system has a wide range of applications across various domains. In
media and journalism, it can serve as an essential tool for reporters and editors to fact-check
information before publication, thereby upholding journalistic integrity. Social media
platforms can implement this system to monitor content shared by users, providing alerts for
potentially misleading articles, which can enhance user experience and trust.
Educational institutions can leverage the system to teach students about media literacy,
encouraging critical thinking about news sources and the information they consume.
Additionally, governmental and non-governmental organizations can use this technology to
monitor misinformation campaigns during elections or public health crises, helping to protect
democratic processes and public welfare.
Social Media Platforms: Many social media sites, such as Facebook and Twitter,
implement fake news detection algorithms to flag or limit the spread of false
information, helping to create a more reliable online environment.
News Aggregators: Websites and apps that aggregate news articles use detection
systems to filter out fake news, ensuring users receive credible information from
various sources.
Search Engines: Search engines like Google employ fake news detection techniques
to prioritize credible sources in search results, providing users with more trustworthy
information.
Fact-Checking Organizations: Fact-checking services leverage detection tools to
analyze and verify claims made in news articles, social media posts, and other content,
enhancing their ability to combat misinformation.
Educational Tools: Fake news detection applications are used in educational settings
to teach students about media literacy, helping them identify and critically evaluate
misinformation.
Political Campaigns: During elections, political organizations use fake news detection
tools to monitor misinformation related to candidates or issues, allowing them to
respond promptly to false narratives.
Public Health: In times of public health crises, such as the COVID-19 pandemic,
detection systems are crucial for filtering out misleading health information and
providing accurate guidelines to the public.
Corporate Communications: Businesses use fake news detection to monitor their
brand reputation online, identifying and addressing false claims or misinformation that
could damage their image.
Legal and Regulatory Frameworks: Governments and regulatory bodies may utilize
fake news detection tools to enforce laws against the spread of false information,
especially during critical events like elections or public emergencies.
Journalism and Reporting: News organizations employ fake news detection
techniques to verify information before publication, ensuring the accuracy of their
reporting and maintaining journalistic integrity.
9. CONCLUSION
In conclusion, the development of a machine learning-based fake news detection system marks
a pivotal advancement in the ongoing battle against misinformation in today’s increasingly
digital landscape. As the prevalence of misleading information continues to rise, the ability to
accurately discern credible news from false narratives becomes paramount. This system
combines TF-IDF-based natural language processing (NLP) with logistic regression
modeling to analyze and classify news articles as true or fake.
By employing these methods, the system not only enhances its
ability to identify fake news but also fosters a culture of media literacy, empowering users to
make well-informed decisions based on reliable information.
Moreover, the insights gleaned from this project have far-reaching implications for future
research and the development of improved fake news detection methodologies. As this field
evolves, the findings can inform the creation of more robust algorithms that adapt to the ever-
changing tactics employed by purveyors of misinformation. Additionally, the successful
implementation of this system can inspire collaboration among researchers, technologists, and
policymakers to devise comprehensive strategies that address the complexities of
misinformation. Ultimately, the ongoing refinement of fake news detection tools will
contribute to cultivating a more trustworthy information environment, where individuals
can confidently engage with media content and participate in informed public discourse.
10. FUTURE SCOPE
Looking ahead, the future scope of the fake news detection project presents numerous
opportunities for enhancement and innovation. One of the primary areas for development is the
integration of more advanced machine learning algorithms. By utilizing ensemble
methods or deep learning approaches, the system can achieve improved classification
accuracy, enabling it to better distinguish between genuine news and misinformation.
Additionally, expanding the dataset to include a wider variety of news sources and languages
will enhance the model's robustness and make it applicable across different cultural contexts.
This diversification can significantly improve the system’s effectiveness in various regions and
demographic groups, addressing the global nature of misinformation.
Moreover, the incorporation of user feedback into the detection system is crucial for fostering
continuous learning and adaptation. By allowing users to report inaccuracies or provide
insights, the algorithms can evolve alongside emerging trends in misinformation tactics. This
iterative process will ensure that the detection methods remain relevant and effective in the
face of constantly changing strategies employed by purveyors of fake news. Exploring the
potential for real-time detection capabilities will further enhance the system's utility, enabling
users to receive immediate feedback on news articles as they are shared online. In summary,
ongoing research and innovation in this field will be essential to tackle the dynamic challenges
posed by fake news, ultimately leading to a more informed and discerning public.
11. BIBLIOGRAPHY
Research Papers:
Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake News Detection on Social Media:
A Data Mining Perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22-36.
Ahmed, H., Traore, I., & Saad, S. (2018). Detecting Fake News with Machine Learning
Methods: A Survey. IEEE Access, 7, 90596-90600.
Conroy, N. J., Rubin, V. L., & Chen, Y. (2015). A Survey of Machine Learning Algorithms for
Fake News Detection. Proceedings of the Association for Information Science and Technology,
52(1), 1-9.
Zhou, X., & Zafarani, R. (2019). Fake News: A Survey of Research, Detection Methods, and
Opportunities. ACM Computing Surveys (CSUR), 53(5), 1-40.
Oshikawa, R., Qian, J., & Wang, W. Y. (2020). A Survey on Natural Language Processing for
Fake News Detection. Proceedings of the 12th Language Resources and Evaluation
Conference, 6086-6093.
Web Resources:
APPENDIX
A. Source code
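import nltk
from nltk.corpus import stopwords

# Download the NLTK stopword list (only needed once per environment)
nltk.download('stopwords')

# Remove English stopwords from the combined content field
stop_words = set(stopwords.words('english'))
df['content'] = df['content'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words)
)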
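import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Accuracy: fraction of test articles whose predicted label matches the actual label
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Classification report: per-class precision, recall, and F1-score
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix: rows are actual classes, columns are predicted classes
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(
    conf_matrix, annot=True, fmt="d", cmap="Blues",
    xticklabels=['Fake', 'True'], yticklabels=['Fake', 'True'],
)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()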
from sklearn.metrics import auc, roc_curve

# ROC curve: uses the predicted probability of the positive class (label 1, true news)
y_prob = model.predict_proba(X_test_tfidf)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
# Diagonal reference line: performance of a random classifier
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
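# Validation curve plot
# Note: param_range, train_mean and test_mean are not defined in this listing; they are
# presumably the tested C values and the mean training / cross-validation scores.
plt.figure(figsize=(6, 4))
plt.plot(param_range, train_mean, label="Training score", color="r")
plt.plot(param_range, test_mean, label="Cross-validation score", color="g")
plt.xlabel("Parameter C")
plt.ylabel("Accuracy")
plt.title("Validation Curve for Logistic Regression")
plt.xscale('log')  # C is varied over several orders of magnitude
plt.legend(loc="best")
plt.show()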
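The plotting code above uses param_range, train_mean, and test_mean without showing how they are obtained. One plausible way to compute them is scikit-learn's validation_curve, sketched below on a toy corpus. The range of C values, the number of folds, the max_iter setting, and the corpus itself are all assumptions and may differ from what the project used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Toy corpus (hypothetical); 1 = true news, 0 = fake news
texts = [
    "government publishes annual budget report",
    "central bank holds interest rates steady",
    "city council approves new transit plan",
    "university releases peer reviewed climate study",
    "court upholds ruling on water rights",
    "ministry reports rise in export figures",
    "shocking miracle cure doctors hate revealed",
    "celebrity secretly replaced by alien clone",
    "you will not believe this secret trick",
    "secret plot exposed shocking truth revealed",
    "miracle diet melts fat overnight shocking",
    "aliens secretly control the weather revealed",
]
labels = np.array([1] * 6 + [0] * 6)

X = TfidfVectorizer().fit_transform(texts)

# Candidate values of the inverse regularization strength C (assumed range)
param_range = np.logspace(-2, 2, 5)

# Cross-validated accuracy on the training and held-out folds for each value of C
train_scores, test_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, labels,
    param_name="C", param_range=param_range, cv=3, scoring="accuracy",
)

# Average over the folds to get one training and one validation score per C
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
print(train_mean, test_mean)
```

Small values of C correspond to strong regularization and large values to weak regularization, which is why the plot uses a logarithmic x-axis.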