Project Report
It is with immense gratitude that I take this opportunity to express my heartfelt thanks
to all those who have helped and supported me during the course of my six-week
internship at MSME.
First and foremost, I would like to express my deep sense of gratitude to Dr.
Ilavarasan E, Professor and Head, Department of Computer Science and Engineering,
Puducherry Technological University, for his unwavering support, valuable guidance,
and consistent encouragement throughout this internship. His expertise, insightful
suggestions, and continuous motivation have been pivotal in enriching my learning
experience and helping me complete this internship successfully.
I am also deeply thankful to Dr. Thenmozhi and Dr. Salini for their excellent
coordination, timely support, and constant encouragement during the internship period.
Their guidance and support were crucial in helping me navigate through challenges and
maintain focus on my learning objectives.
I would also like to sincerely thank my Class Advisor and Faculty Advisor for their
continuous mentorship and for providing a strong academic foundation that greatly
aided me during the internship. Their advice and support have been instrumental in
shaping my professional and personal growth.
This internship has been a significant learning experience, and I am truly grateful to everyone who contributed to making it rewarding and enriching.
Table of Contents
Bonafide Certificate
Internship Certificate
Institution Profile
Acknowledgement
Abstract
1. Introduction
5.1 Introduction
5.3 Methodology
6. Conclusion
LIST OF FIGURES
Figure 5: A box plot showing total bill distribution by day of the week
Figure 10: A bar chart showing values for five different categories
Figure 11: A box plot showing total bill distribution by day of the week
Figure 14: A scatter plot with actual data points and the linear regression line
Figure 17: A scatter plot showing data points colored by cluster assignment
Figure 18: Two plots showing the elbow method and silhouette scores
Chapter 1
Introduction
1.1 Overview of Data Science
Data Science is an interdisciplinary field that combines statistics, computer science, and domain expertise to extract meaningful insights from data. This report documents my learning journey during the MSME Data Science training program, focusing on Python programming, data visualization, and machine learning concepts.
During the internship, I applied the concepts learned throughout the training to real-
world projects. I developed skills in data cleaning, exploratory data analysis, and
model building using machine learning algorithms. Additionally, I gained hands-on
experience with popular libraries such as Pandas, NumPy, Matplotlib, Seaborn, and
Scikit-learn. The internship enhanced my problem-solving abilities, improved my
coding practices, and provided a deeper understanding of how data science solutions
are applied in industry settings.
Chapter 2
Python for Data Science
2.1 Introduction to Python
Python is a high-level, interpreted programming language that has become the language of choice for data science and analytics. Its simple syntax, readability, and vast ecosystem of libraries make it accessible to beginners and powerful for professionals. Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming, which allows users to write code in a style that best suits their problem. In data science, Python is valued for its ability to handle data manipulation, statistical analysis, and machine learning tasks with ease. The language's dynamic typing and automatic memory management simplify the development process, while its extensive standard library and third-party packages provide tools for everything from web development to scientific computing. Python's interactive shells, such as IPython and Jupyter Notebook, further enhance productivity by allowing for rapid prototyping and visualization. Whether you are cleaning data, building predictive models, or visualizing results, Python offers a flexible and efficient environment for all stages of the data science workflow.
Python’s relevance in data science is also driven by its integration capabilities with
big data platforms, databases, and cloud services. Libraries like PySpark allow
Python to process massive datasets in distributed environments, while connectors
to SQL, NoSQL, and cloud APIs make it easy to pull data from various sources.
Furthermore, the combination of Python with machine learning frameworks such
as TensorFlow, PyTorch, and Scikit-learn has enabled the development of state-of-the-art models in artificial intelligence. As industries continue to generate more
data, Python’s adaptability, scalability, and vibrant ecosystem position it as an
essential skill for any aspiring data scientist.
# Variables and data types
integer_var = 42
float_var = 3.1415
string_var = "Data Science"
boolean_var = True

# Greeting loop (as reflected in the listing's output)
name = "Alice"
for _ in range(3):
    print(f"Hello, {name}!")

# List comprehensions
my_list = [1, 2, 3, 4, 5]
squared = [x**2 for x in my_list]
print("Squared values:", squared)

# Exception handling
try:
    result = 10 / 0
except ZeroDivisionError:
    result = None
    print("Cannot divide by zero.")
print("Result:", result)
Listing 1: Comprehensive Python Basics Example
Output
Hello, Alice!
Hello, Alice!
Hello, Alice!
Squared values: [1, 4, 9, 16, 25]
Cannot divide by zero.
Result: None
2.1.1 NumPy
NumPy, which stands for Numerical Python, is one of the foundational libraries for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a vast collection of high-level mathematical functions to operate on these arrays efficiently. NumPy arrays are more compact and faster than Python's built-in lists, making them ideal for numerical computations and data analysis tasks. The library is widely used in data science, machine learning, and engineering applications due to its speed and versatility. With NumPy, you can perform a variety of operations such as element-wise arithmetic, linear algebra, statistical analysis, and reshaping of data structures. Its broadcasting feature allows for operations between arrays of different shapes, which is particularly useful in data manipulation and transformation. NumPy also integrates seamlessly with other libraries like Pandas, Matplotlib, and Scikit-learn, forming the backbone of the Python data science ecosystem. Whether you are handling simple arrays or complex mathematical computations, NumPy provides the tools necessary to work efficiently and effectively with numerical data. Mastery of NumPy is essential for anyone aspiring to become proficient in data science or scientific programming with Python.
import numpy as np

# Create arrays
arr1 = np.array([10, 20, 30, 40, 50])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# Array attributes and indexing
print("Shape of arr1:", arr1.shape)
print("Shape of arr2:", arr2.shape)
print("Data type of arr1:", arr1.dtype)
print("First three elements of arr1:", arr1[:3])
print("Element at row 1, col 2 of arr2:", arr2[1, 2])

# Element-wise arithmetic
sum_arr = arr1 + 5
prod_arr = arr1 * 2
print("Sum array:", sum_arr)
print("Product array:", prod_arr)

# Mathematical functions
mean_val = np.mean(arr1)
std_val = np.std(arr1)
max_val = np.max(arr2)
min_val = np.min(arr2)

# Reshaping arrays
reshaped = arr2.reshape(3, 2)

# Matrix multiplication
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
mat_product = np.dot(mat1, mat2)

# Random number generation
rand_arr = np.random.rand(2, 3)
Output
Shape of arr1: (5,)
Shape of arr2: (2, 3)
Data type of arr1: int64
First three elements of arr1: [10 20 30]
Element at row 1, col 2 of arr2: 6
Sum array: [15 25 35 45 55]
Product array: [ 20  40  60  80 100]
Mean: 30.0
Std Dev: 14.142135623730951
Max: 6
Min: 1
Reshaped array:
[[1 2]
 [3 4]
 [5 6]]
Matrix product:
[[19 22]
 [43 50]]
Random array:
[[0.37454012 0.95071431 0.73199394]
 [0.59865848 0.15601864 0.15599452]]
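The broadcasting behaviour described above can be illustrated with a short sketch; the array values here are illustrative and not part of the training material.

import numpy as np

# A (3, 3) matrix and a (3,) row vector
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])

# Broadcasting stretches the row across every row of the matrix,
# so no explicit loop or tiling is needed.
print(matrix + row)
# [[11 22 33]
#  [14 25 36]
#  [17 28 39]]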
2.1.2 Pandas
Pandas is a powerful and flexible open-source data analysis and manipulation library for Python. It introduces two primary data structures: Series (one-dimensional) and DataFrame (two-dimensional), which are designed for handling structured data with ease. Pandas excels at importing data from various file formats, cleaning and transforming datasets, and performing complex operations such as grouping, merging, and pivoting. Its intuitive syntax and rich set of functions make it a favorite among data scientists for tasks ranging from exploratory data analysis to feature engineering. With Pandas, you can filter, aggregate, and visualize data efficiently, making it possible to gain insights quickly from large datasets. The library also integrates seamlessly with other data science tools like NumPy, Matplotlib, and Scikit-learn, enabling end-to-end workflows within the Python ecosystem. Whether you are working with time series, categorical data, or numerical data, Pandas provides the tools necessary to manipulate and analyze data effectively. Its robust handling of missing data, powerful indexing, and support for custom functions make it indispensable for modern data analysis. Mastery of Pandas is essential for anyone looking to work with real-world data in Python.
import pandas as pd

# Create a DataFrame from a dictionary
data = {
    'Age': [25, 30, 35, 28, 22],
    'Salary': [50000, 60000, 70000, 65000, 45000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']
}
df = pd.DataFrame(data)

# Handling missing data
df.loc[2, 'Salary'] = None
print("\nWith missing value:\n", df)
print("\nFilled missing salaries:")
print(df['Salary'].fillna(df['Salary'].mean()))
Figure 1: Output of the comprehensive Pandas example.
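The grouping and merging operations mentioned in the text above are not shown in the listing; the following brief sketch, using an illustrative employee table of the same shape, shows how they could look.

import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing'],
    'Salary': [50000, 60000, 70000, 65000, 45000]
})

# Grouping: average salary per department
print(df.groupby('Department')['Salary'].mean())

# Merging: attach a department location table
locations = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'Marketing'],
    'Location': ['Block A', 'Block B', 'Block C', 'Block D']
})
print(df.merge(locations, on='Department', how='left'))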
2.1.3 Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is the foundation of most Python plotting libraries and is highly customizable, allowing users to create a wide variety of plots, from simple line graphs to complex 3D visualizations. Matplotlib's object-oriented API gives fine-grained control over every element of a plot, including axes, labels, legends, and colors. This flexibility makes it suitable for both quick data exploration and publication-quality graphics. The library integrates well with NumPy and Pandas, enabling seamless plotting of data stored in arrays and DataFrames. Matplotlib supports multiple output formats, such as PNG, PDF, and SVG, and can be used in interactive environments like Jupyter Notebooks. Its extensive documentation and active community make it accessible to beginners while offering advanced features for experienced users. Whether you are visualizing trends, distributions, or relationships in your data, Matplotlib provides the tools to communicate your findings effectively. Mastery of Matplotlib is essential for any data scientist or analyst who needs to present data-driven insights visually.
import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 2 * np.pi, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Plot both curves on a single axes object
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y1, label='sin(x)', color='blue', linewidth=2)
ax.plot(x, y2, label='cos(x)', color='red', linestyle='--', linewidth=2)
ax.legend()

# Save and show the plot
plt.tight_layout()
plt.savefig('sine_cosine_plot.png')
plt.show()
Listing 4: Comprehensive Matplotlib Example
import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 2 * np.pi, 100)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Create subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 6))

# First subplot
axs[0].plot(x, y_sin, 'b-', label='sin(x)')
axs[0].set_title('Sine Function')
axs[0].set_ylabel('sin(x)')
axs[0].grid(True)
axs[0].legend()

# Second subplot
axs[1].plot(x, y_cos, 'r-', label='cos(x)')
axs[1].set_title('Cosine Function')
axs[1].set_xlabel('x')
axs[1].set_ylabel('cos(x)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()
# Sample data
categories = ['Category A', 'Category B', 'Category C', 'Category D', 'Category E']
values = [22, 35, 14, 28, 30]
2.1.4 Seaborn
Seaborn is a powerful Python data visualization library based on Matplotlib that provides a high-level interface for creating attractive and informative statistical graphics. It simplifies the process of generating complex visualizations by offering built-in themes, color palettes, and functions for visualizing distributions, relationships, and categorical data. Seaborn integrates seamlessly with Pandas DataFrames, making it easy to plot data directly from tabular structures. The library includes advanced features such as regression plots, heatmaps, pair plots, and facet grids, which allow for multi-dimensional data exploration. Seaborn's emphasis on statistical visualization helps users quickly identify patterns, trends, and outliers in their data. Its concise syntax and sensible defaults make it accessible to beginners, while its flexibility and customization options cater to advanced users. Whether you are performing exploratory data analysis or preparing publication-quality figures, Seaborn provides the tools to visualize your data effectively. Mastery of Seaborn is invaluable for data scientists and analysts who need to communicate insights through compelling graphics.
import matplotlib.pyplot as plt
import seaborn as sns

# For all Seaborn plots, we'll use a style that's more compact
sns.set(style="whitegrid", font_scale=0.9)

# Load the built-in example datasets
tips = sns.load_dataset("tips")
iris = sns.load_dataset("iris")


def create_box_plot():
    # Box plot of total bill by day of the week
    plt.figure(figsize=(7, 5))
    sns.boxplot(x="day", y="total_bill", data=tips)
    plt.title("Total Bill Distribution by Day")
    plt.tight_layout()
    plt.savefig("box_plot.png")
    print(tips.head(3).to_string(index=False))


def create_pair_plot():
    # Pair plot of the iris dataset, colored by species
    sns.pairplot(iris, hue="species")
    plt.savefig("pair_plot.png")

    # Minimal text output - just stats instead of raw data
    print("Pair plot saved. Summary statistics of iris dataset (first 2 columns):")
    print(iris.describe()[['sepal_length', 'sepal_width']].round(2).to_string())


# Execute the functions
create_box_plot()
create_pair_plot()
Listing 7: Box Plot with Seaborn
Figure 5: A box plot showing total bill distribution by day of the week
Figure 6: A pair plot showing pairwise relationships between features in the iris
dataset, colored by species
2.2 Machine Learning with Scikit-learn
Scikit-learn is a robust and widely used machine learning library in Python that provides simple and efficient tools for data mining and data analysis. It supports a wide range of supervised and unsupervised learning algorithms, including classification, regression, clustering, and dimensionality reduction. Scikit-learn is built on top of NumPy, SciPy, and Matplotlib, ensuring seamless integration with the broader scientific Python ecosystem. The library's consistent API and comprehensive documentation make it accessible to both beginners and experts. Scikit-learn emphasizes ease of use, performance, and reproducibility, allowing users to quickly build and evaluate models with minimal code. It includes utilities for preprocessing data, selecting features, tuning hyperparameters, and validating models through cross-validation. Whether you are building a simple linear regression model or a complex ensemble classifier, Scikit-learn provides the tools necessary to implement and assess machine learning solutions. Its active community and frequent updates ensure that it remains at the forefront of machine learning research and practice. Mastery of Scikit-learn is essential for anyone pursuing a career in data science or machine learning.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Train/test split (30% of the 150 samples held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a logistic regression classifier
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate the model
print("Accuracy:", round(accuracy_score(y_test, y_pred), 2))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix heatmap
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues')
plt.show()
Output
Accuracy: 0.98
Classification Report:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        16
           1       1.00      0.95      0.97        18
           2       0.95      1.00      0.97        11
    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
Figure 7: A confusion matrix heatmap showing prediction results
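The text above also mentions validating models through cross-validation. A minimal sketch of how this could be done for the same iris classifier is shown below; the fold count and pipeline are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pipeline so scaling is refit inside every fold, avoiding data leakage
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validated accuracy
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))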
Chapter 3
Data Visualization
3.1 Introduction to Data Visualization
Data visualization is crucial for understanding patterns, trends, and relationships in data. It
transforms raw data into graphical formats, making complex information more accessible and
easier to interpret. During the MSME Data Science training, we explored various visualization
techniques using several powerful Python libraries. Visualizations not only help in identifying
key insights quickly but also support better communication of findings to both technical and
non-technical audiences.
We primarily used libraries such as Matplotlib, Seaborn, and Plotly to create a wide range of
charts and plots. Matplotlib, one of the most fundamental libraries, allowed us to build basic
plots like line graphs, bar charts, and scatter plots with high customizability. Seaborn, built on
top of Matplotlib, simplified the process of creating attractive and informative statistical
graphics. It introduced advanced visualizations like heatmaps, violin plots, and pair plots, which
helped in performing exploratory data analysis more effectively. Plotly, an interactive graphing
library, enabled us to create dynamic and interactive dashboards, providing a more engaging
experience when exploring data.
Throughout the training, we learned to choose the appropriate type of visualization depending on
the nature of the data and the insights we wanted to highlight. For example, scatter plots were
effective for showing relationships between two variables, while histograms helped in
understanding the distribution of a dataset. Box plots were used to detect outliers and understand
data spread, and heatmaps provided a clear visual of correlations between multiple variables.
Overall, gaining hands-on experience with different visualization techniques enhanced our
ability to draw meaningful conclusions from datasets. It also emphasized the importance of
clarity, color schemes, labeling, and storytelling in presenting data effectively.
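The listings in this chapter use Matplotlib and Seaborn; an interactive Plotly chart of the kind mentioned above can be produced with only a few lines. The sketch below uses Plotly Express and its built-in iris sample dataset purely as an illustration.

import plotly.express as px

# Built-in sample dataset shipped with Plotly Express
iris = px.data.iris()

# Interactive scatter plot: hover, zoom, and pan work out of the box
fig = px.scatter(iris, x="sepal_width", y="sepal_length",
                 color="species", title="Iris Sepal Dimensions")
fig.show()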
3.1.1 Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and
interactive visualizations.
import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 2 * np.pi, 100)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Create subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 6))

# First subplot
axs[0].plot(x, y_sin, 'b-', label='sin(x)')
axs[0].set_title('Sine Function')
axs[0].set_ylabel('sin(x)')
axs[0].grid(True)
axs[0].legend()

# Second subplot
axs[1].plot(x, y_cos, 'r-', label='cos(x)')
axs[1].set_title('Cosine Function')
axs[1].set_xlabel('x')
axs[1].set_ylabel('cos(x)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()
import matplotlib.pyplot as plt

# Sample data
categories = ['Category A', 'Category B', 'Category C', 'Category D', 'Category E']
values = [22, 35, 14, 28, 30]

# Create a bar chart
plt.figure(figsize=(8, 5))
plt.bar(categories, values)
plt.title('Bar Chart of Sample Categories')

plt.tight_layout()
plt.show()
Listing 11: Bar Plot with Matplotlib
Figure 10: A bar chart showing values for five different categories
3.1.2 Seaborn
Seaborn is a statistical data visualization library built on top of Matplotlib, providing a high-level interface for drawing attractive statistical graphics.
import matplotlib.pyplot as plt
import seaborn as sns
Figure 11: A box plot showing total bill distribution by day of the week
import numpy as np
import pandas as pd

# Sample data with three numeric columns (values are illustrative)
df = pd.DataFrame(np.random.randn(100, 3), columns=['A', 'B', 'C'])
correlation = df.corr()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap', fontsize=14)
plt.tight_layout()
plt.show()
Output
Sample correlation values:
     A     B     C
A  1.00  0.10  0.05
B  0.10  1.00  0.15
C  0.05  0.15  1.00
Figure 13: A grid of pairwise relationships between features in the iris dataset,
colored by species
Chapter 4
Machine Learning Basics
4.1 Introduction to Machine Learning
Machine learning has revolutionized the field of data science by enabling computers to learn patterns from data without explicit programming. This branch of artificial intelligence has transformed how organizations make decisions, how products are developed, and how scientific discoveries are made. At its core, machine learning encompasses several paradigms, including supervised learning (where models learn from labeled data), unsupervised learning (discovering hidden patterns in unlabeled data), and reinforcement learning (learning through interaction with an environment). The fundamental process begins with data collection and preprocessing, followed by feature engineering to transform raw data into meaningful inputs. Models are then trained using algorithms that iteratively adjust parameters to minimize prediction errors. Scikit-learn, Python's premier machine learning library, provides accessible implementations of these algorithms.
Supervised learning can be broadly categorized into two types: classification and regression.
In classification problems, the outputs are categorical labels. For example, a model might be
trained to classify emails as "spam" or "not spam," or to recognize handwritten digits from
images. In regression problems, the outputs are continuous values. An example would be
predicting the price of a house based on its features like size, location, and number of
bedrooms.
Overall, supervised learning forms the foundation of many real-world applications of machine
learning, including recommendation systems, fraud detection, and medical diagnosis, making
it an essential skill for any data scientist.
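As a concrete illustration of the regression setting described above, the following sketch fits a simple linear regression with Scikit-learn on synthetic data; the data and variable names are illustrative and do not reproduce the exact values reported in the output below.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly 3x plus 4, with noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + 4 + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model and inspect the learned parameters
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Coefficient:", round(model.coef_[0], 4))
print("Intercept:", round(model.intercept_, 4))
print("Mean Squared Error:", round(mean_squared_error(y_test, y_pred), 4))
print("R² Score:", round(r2_score(y_test, y_pred), 4))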
Output
Coefficient: 3.1259
Intercept: 3.8691
Mean Squared Error: 0.8183
R² Score: 0.8561
Figure 14: A scatter plot with actual data points and the linear regression line
4.2.2 Classification with Decision Trees
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Synthetic two-feature dataset (the original listing's data source was not shown)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier
clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix heatmap
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.show()

# Predict for each point in the mesh to visualize the decision boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
plt.show()
Output
Accuracy: 0.8333
Classification Report:
              precision    recall  f1-score   support
           0       0.85      0.80      0.82        45
           1       0.82      0.87      0.84        45
    accuracy                           0.83        90
   macro avg       0.83      0.83      0.83        90
weighted avg       0.83      0.83      0.83        90
Figure 16: A decision boundary plot showing classified data points
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic clustered data (4 blobs, 2 features); the original data source was not shown
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means with 4 clusters and report the results
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
print("Silhouette Score:", round(silhouette_score(X, labels), 4))
print("Cluster Centers:\n", kmeans.cluster_centers_)

# Evaluate different cluster counts with the elbow and silhouette methods
inertia = []
silhouette_scores = []
k_range = range(2, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
    labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(X, labels))

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'ro-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Listing 17: K-Means Clustering Example
Output
Silhouette Score: 0.7573
Cluster Centers:
[[ 7.09369362  8.95412693]
 [-9.48413448  2.23470412]
 [ 1.76173574 -9.32901144]
 [ 3.6913933   3.64307413]]
Figure 17: A scatter plot showing data points colored by cluster
assignment with centroids marked
Figure 18: Two plots showing the elbow method and silhouette scores for determining the optimal number of clusters
Chapter 5
Capstone Project: Fake News Detection
ABSTRACT
In the era of rapid digital communication, the spread of fake news has emerged as a
significant threat to public trust, societal stability, and informed decision-making.
The urgent need for efficient and reliable automated detection systems has become
more evident than ever before. This internship project focuses on the design and
implementation of a comprehensive fake news detection system by leveraging the
FakeNewsNet dataset, a widely recognized benchmark for fake news research.
The project workflow begins with an extensive data preprocessing phase, which
includes cleaning the text data, removing stop words, handling missing values, and
standardizing the textual information to ensure consistency. Following
preprocessing, the Term Frequency–Inverse Document Frequency (TF-IDF)
technique is applied to transform the textual data into meaningful numerical feature
representations that can be utilized by machine learning models.
Capstone Project: Fake News Detection
5.1 Introduction
5.1.1 Overview
5.1.2 Objectives
To deploy the most accurate and efficient model through a user-friendly web
interface, enabling real-time fake news classification accessible to end-users.
The scope of this project is centered on the classification of news articles based
exclusively on their headlines. This decision is motivated by data availability and the
need for rapid prediction suitable for real-time applications. Although full-text analysis
could provide deeper contextual and semantic understanding, it is intentionally
excluded to maintain the focus on fast and lightweight classification models. However,
this approach introduces certain limitations. Headline-based models may miss nuanced
details contained within full articles, leading to potential misclassification in complex
cases. Additionally, the dataset employed may carry inherent biases, such as source-
specific language patterns, which could affect model generalization. The deployed
model is static, meaning it does not continuously learn from new data; periodic
retraining is required to adapt to emerging linguistic trends and new types of
misinformation. These constraints should be considered when interpreting the results
and planning future enhancements.
5.2 Literature Review
The field of fake news detection has attracted significant research interest, given the
profound societal risks associated with misinformation. Multiple methodologies have
been proposed over the years, each with varying degrees of complexity and
effectiveness. This chapter presents a comprehensive overview of key approaches, from
early statistical techniques to modern hybrid systems.
Initial efforts in fake news detection primarily focused on statistical analysis and
linguistic feature engineering. Researchers employed stylometric techniques, such as
analyzing the frequency of n-grams (word pairs or triples), readability scores, and
psycholinguistic markers, to distinguish between genuine and fabricated news articles
[1,2]. These methods leveraged observable writing style differences, assuming that fake
news exhibits distinct linguistic patterns compared to credible journalism. While
statistical and lexical approaches offered valuable insights and were relatively easy to
implement, they were inherently limited by their shallow representation of text. They
often failed to capture deeper semantic nuances, making them less effective against
sophisticated, well-crafted fake news articles that mimic authentic writing styles.
The next wave of research introduced machine learning algorithms, which significantly
improved fake news detection capabilities. Traditional classifiers such as Support
Vector Machines (SVM), Random Forests, Decision Trees, and Naive Bayes were
trained on engineered text features, including TF-IDF vectors and word embeddings.
These models provided better scalability and generalization compared to handcrafted
rule-based systems. With the advent of deep learning, more advanced architectures like
Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM)
networks were employed to automatically learn hierarchical and sequential
representations from raw text [3,4]. Deep learning models excelled in capturing
complex linguistic structures and contextual dependencies, achieving state-of-the-art
performance across many benchmark datasets.
Although numerous studies have achieved notable success, most existing models
assume access to either the full article text or supplementary social context information.
In fast-paced environments, especially in the age of breaking news and micro-content
(e.g., Twitter, headline feeds), such resources are not always accessible. Consequently,
there is a critical need for models that can operate effectively on limited data, such as
headlines alone. This project addresses this gap by focusing exclusively on headline-
based classification, aiming to develop lightweight, rapid detection systems suitable for
integration into real-time news monitoring pipelines.
5.3 Methodology
This chapter details the methodology adopted for the development of the fake news
detection system, covering dataset characteristics, preprocessing steps, feature
engineering, model training, hyperparameter optimization, and evaluation strategies.
Effective preprocessing is crucial for enhancing model performance. The raw text
undergoes the following transformations:
1. Data Cleaning: Null values, duplicate records, and irrelevant entries are removed
to ensure data integrity.
3. Tokenization: Headlines are split into individual tokens (words) using the
Natural Language Toolkit (NLTK).
5. Stop-word Removal: Common English words (e.g., 'the', 'and', 'is') that contribute little to semantic meaning are filtered out; a short sketch of these steps appears after this list.
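A minimal sketch of these preprocessing steps using NLTK is shown below; the sample headline and helper function are illustrative, and the project's actual pipeline may differ in detail.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below
nltk.download('punkt')
nltk.download('stopwords')

def preprocess_headline(headline):
    # Basic cleaning: lowercase and strip non-alphabetic characters
    headline = re.sub(r'[^a-zA-Z\s]', ' ', headline.lower())
    # Tokenization
    tokens = word_tokenize(headline)
    # Stop-word removal
    stop_words = set(stopwords.words('english'))
    return [t for t in tokens if t not in stop_words]

print(preprocess_headline("The actress reveals her wedding date!"))
# ['actress', 'reveals', 'wedding', 'date']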
The preprocessed dataset is split into a training set (80%) and a testing set (20%). The
following machine learning models are trained and evaluated:
Decision Tree: A non-parametric model that splits data based on Gini impurity,
with depth constraints applied to enhance generalization.
Decision Tree: Maximum depth values considered are {10, 20, None}.
Naive Bayes: Laplace smoothing parameter (α) values examined are {0.5, 1.0,
1.5}.
Cross-validation ensures that the selected hyperparameters generalize well to unseen
data.
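Put together, the feature extraction, splitting, and hyperparameter search described above could look roughly like the sketch below; the tiny in-memory dataset stands in for the FakeNewsNet headlines and is purely illustrative.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Illustrative stand-in for the headline data: 'headline' text with a binary
# 'label' (0 = fake, 1 = real); the real dataset is loaded from disk instead
fake = ["kardashian reveals shocking wedding date"] * 50
real = ["government announces new economic policy"] * 50
df = pd.DataFrame({"headline": fake + real, "label": [0] * 50 + [1] * 50})

# TF-IDF features from the headlines
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["headline"])
y = df["label"]

# 80/20 train/test split, as described in the methodology
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Grid search over the Naive Bayes smoothing parameter with 5-fold cross-validation
grid = GridSearchCV(MultinomialNB(), param_grid={"alpha": [0.5, 1.0, 1.5]}, cv=5)
grid.fit(X_train, y_train)

print("Best alpha:", grid.best_params_["alpha"])
print("Test accuracy:", round(grid.score(X_test, y_test), 2))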
The trained models are evaluated using multiple performance metrics to ensure
comprehensive assessment:
Recall: Assesses the proportion of true positive predictions among all actual
positive instances.
Results and Discussion
The dataset comprises 20,000 real headlines and 20,000 fake headlines.
Model                      Accuracy   Precision   Recall   F1-Score
Random Forest              0.88       0.87        0.88     0.87
Multinomial Naive Bayes    0.78       0.77        0.78     0.77

Table 4.1: Comparison of Model Performance
Classification Report:
Class      Precision   Recall   F1-Score   Support
0 (Fake)   0.75        0.42     0.54       1131
1 (Real)   0.84        0.95     0.89       3509

Classification Report:
Class      Precision   Recall   F1-Score   Support

Classification Report:
Class      Precision   Recall   F1-Score   Support
0 (Fake)   0.68        0.51     0.58       1131
1 (Real)   0.85        0.92     0.89       3509

Classification Report:
Class      Precision   Recall   F1-Score   Support
Confusion Matrix
Figure 4.2: Confusion Matrix for Random Forest Model
Figure 4.3: Confusion Matrix for Naive Bayes Model
4.1 ROC Curves
Figure 4.4: ROC Curves for the Logistic Regression, Random Forest, Naive Bayes, and Decision Tree models
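ROC curves such as those referenced above can be produced with Scikit-learn's roc_curve and auc utilities. The sketch below uses a small synthetic binary problem as a stand-in for the headline classifiers.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Small synthetic binary problem standing in for the headline classifiers
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class, then the ROC curve and its area
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance level")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()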
The bar chart illustrates the Top 20 Features ranked by their TF-IDF (Term Frequency-Inverse Document Frequency) scores, highlighting the most influential terms contributing to fake news classification. A sketch of how such a ranking can be computed appears after the observations below.
"kardashian" has the highest TF-IDF score among all features, indicating that it
is the most distinctive and impactful word in the corpus for separating fake and
real news headlines.
Other dominant terms include "new," "star," "jennif," "jenner," "season,"
and "award," which are also highly influential.
Many of the top-ranked words (e.g., "kardashian," "jenner," "kim," "meghan,"
"justin") are associated with celebrity culture and entertainment news,
suggesting that fake news articles in the dataset often target or mention popular
public figures and celebrity events.
Words like "babi," "wed," "say," "reveal," and "date" point toward topics
involving personal life events such as weddings, births, and revelations—
common themes that fake news outlets may exploit to attract readers.
Interestingly, the presence of shortened forms or truncated words like "princ"
(likely "prince"), "celebr" (celebrity), and "markl" (Markle) indicates that
headline texts sometimes feature partial words due to preprocessing or inherent
text styles.
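A ranking like the one discussed above can be computed directly from a fitted TF-IDF vectorizer by summing each term's weight over the corpus; the sketch below demonstrates this on a tiny illustrative set of headlines.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny illustrative corpus of headlines (the real dataset is far larger)
headlines = [
    "kardashian reveals wedding date",
    "jenner stuns at award season premiere",
    "new star joins hit season",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(headlines)

# Sum each term's TF-IDF weight over the corpus and rank the terms
terms = np.array(vectorizer.get_feature_names_out())
scores = np.asarray(X.sum(axis=0)).ravel()
top = np.argsort(scores)[::-1][:20]
for term, score in zip(terms[top], scores[top]):
    print(f"{term}: {score:.3f}")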
Chapter 6
Conclusion
The MSME Data Science training provided a comprehensive understanding of Python
programming, data analysis, visualization, and machine learning concepts. Through
hands-on examples and exercises, I gained practical experience with essential libraries
and techniques used in the field of data science.
Visualization skills acquired through libraries like Matplotlib, Seaborn, and Plotly
enabled me to better communicate analytical results. I learned how to select the
appropriate visualization techniques based on the data type and the message I wanted to
convey. Effective data storytelling is an essential aspect of data science, helping
stakeholders and decision-makers quickly understand findings and take action.
One of the most valuable experiences from the training was completing practical
projects that simulated real-world problems. These projects involved complete
workflows, from data collection and cleaning to model building and evaluation. They
helped strengthen my ability to think critically, approach problems systematically, and
apply the right tools and techniques at each stage of a data science project.