
ACKNOWLEDGEMENT

It is with immense gratitude that I take this opportunity to express my heartfelt thanks
to all those who have helped and supported me during the course of my six-week
internship at MSME.

First and foremost, I would like to express my deep sense of gratitude to Dr.
Ilavarasan E, Professor and Head, Department of Computer Science and Engineering,
Puducherry Technological University, for his unwavering support, valuable guidance,
and consistent encouragement throughout this internship. His expertise, insightful
suggestions, and continuous motivation have been pivotal in enriching my learning
experience and helping me complete this internship successfully.

I am also deeply thankful to Dr. Thenmozhi and Dr. Salini for their excellent
coordination, timely support, and constant encouragement during the internship period.
Their guidance and support were crucial in helping me navigate through challenges and
maintain focus on my learning objectives.

I would also like to sincerely thank my Class Advisor and Faculty Advisor for their
continuous mentorship and for providing a strong academic foundation that greatly
aided me during the internship. Their advice and support have been instrumental in
shaping my professional and personal growth.

I extend my sincere appreciation to all the faculty members of the Department of Computer Science and Engineering, Puducherry Technological University, for
imparting knowledge and skills that have been of immense value during this internship.
I am also thankful to the team at MSME for providing me with a supportive and
stimulating work environment. Their guidance and professional insights allowed me to
practically apply the knowledge gained during my academic studies and to develop
new skills that will benefit me in my future endeavors.

This internship has been a significant learning experience for me, and I am truly
grateful to everyone who contributed to making it a rewarding and enriching
experience.
Table of Contents

• Bonafide Certificate
• Internship Certificate
• Institution Profile
• Acknowledgement
• Abstract
1. Introduction
   1.1 Overview of Data Science
   1.2 Training Objectives
   1.3 Internship Outcome
2. Python for Data Science
   2.1 Introduction to Python
        2.1.1 NumPy
        2.1.2 Pandas
        2.1.3 Matplotlib
        2.1.4 Seaborn
   2.2 Machine Learning with Scikit-learn
3. Data Visualization
   3.1 Introduction to Data Visualization
        3.1.1 Matplotlib
        3.1.2 Seaborn
4. Machine Learning Basics
   4.1 Introduction to Machine Learning
   4.2 Supervised Learning
        4.2.1 Linear Regression
        4.2.2 Classification with Decision Trees
   4.3 K-Means Clustering
5. Capstone Project: Fake News Detection
   5.1 Introduction
        5.1.1 Overview
        5.1.2 Objectives
        5.1.3 Scope and Limitations
   5.2 Literature Review
        5.2.1 Statistical and Linguistic Analysis
        5.2.2 Machine Learning Approaches
        5.2.3 Hybrid and Contextual Methods
        5.2.4 Gap Analysis
   5.3 Methodology
        5.3.1 Dataset Description
        5.3.2 Data Preprocessing
        5.3.3 Feature Extraction
        5.3.4 Model Training
        5.3.5 Hyperparameter Tuning
        5.3.6 Evaluation Metrics
   5.4 Results and Discussion
6. Conclusion
LIST OF FIGURES

Figure 1: Output of the comprehensive Pandas example

Figure 2: A sine wave function plotted on a coordinate system

Figure 3: Sine and cosine functions in separate subplots

Figure 4: A bar chart showing values for five different categories

Figure 5: A box plot showing total bill distribution by day of the week

Figure 6: A pair plot showing pairwise relationships between features

Figure 7: A confusion matrix heatmap showing prediction results

Figure 8: A sine wave function plotted on a coordinate system

Figure 9: Sine and cosine functions in separate subplots

Figure 10: A bar chart showing values for five different categories

Figure 11: A box plot showing total bill distribution by day of the week

Figure 12: A correlation heatmap showing relationships between variables

Figure 13: A grid of pairwise relationships between features

Figure 14: A scatter plot with actual data points and the linear regression line

Figure 15: A confusion matrix heatmap showing prediction results

Figure 16: A decision boundary plot showing classified datapoints

Figure 17: A scatter plot showing data points colored by cluster assignment

Figure 18: Two plots showing the elbow method and silhouette scores
Chapter 1
Introduction
1.1 Overview of Data Science
Data Science is an interdisciplinary field that combines statistics, computer science, and domain expertise to extract meaningful insights from data. This report documents my learning journey during the MSME Data Science training program, focusing on Python programming, data visualization, and machine learning concepts.

1.2 Training Objectives


• Understanding Python programming for data analysis
• Learning essential data science libraries
• Mastering data visualization techniques
• Introduction to machine learning concepts
• Practical project implementation

1.3 Internship Outcome

During the internship, I applied the concepts learned throughout the training to real-
world projects. I developed skills in data cleaning, exploratory data analysis, and
model building using machine learning algorithms. Additionally, I gained hands-on
experience with popular libraries such as Pandas, NumPy, Matplotlib, Seaborn, and
Scikit-learn. The internship enhanced my problem-solving abilities, improved my
coding practices, and provided a deeper understanding of how data science solutions
are applied in industry settings.

Chapter 2
Python for Data Science
2.1 Introduction to Python
Python is a high-level, interpreted programming language that has become the language of choice for data science and analytics. Its simple syntax, readability, and vast ecosystem of libraries make it accessible to beginners and powerful for professionals. Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming, which allows users to write code in a style that best suits their problem. In data science, Python is valued for its ability to handle data manipulation, statistical analysis, and machine learning tasks with ease. The language’s dynamic typing and automatic memory management simplify the development process, while its extensive standard library and third-party packages provide tools for everything from web development to scientific computing. Python’s interactive shells, such as IPython and Jupyter Notebook, further enhance productivity by allowing for rapid prototyping and visualization. Whether you are cleaning data, building predictive models, or visualizing results, Python offers a flexible and efficient environment for all stages of the data science workflow.

Python’s relevance in data science is also driven by its integration capabilities with
big data platforms, databases, and cloud services. Libraries like PySpark allow
Python to process massive datasets in distributed environments, while connectors
to SQL, NoSQL, and cloud APIs make it easy to pull data from various sources.
Furthermore, the combination of Python with machine learning frameworks such
as TensorFlow, PyTorch, and Scikit-learn has enabled the development of state-of-
the-art models in artificial intelligence. As industries continue to generate more
data, Python’s adaptability, scalability, and vibrant ecosystem position it as an
essential skill for any aspiring data scientist.
# Variables and data types
integer_var = 42
float_var = 3.1415
string_var = "Data Science"
boolean_var = True

# Lists and dictionaries
my_list = [1, 2, 3, 4, 5]
my_dict = {'name': 'Alice', 'age': 25, 'field': 'Data Science'}

# Functions and control flow
def greet(name):
    if name:
        return f"Hello, {name}!"
    else:
        return "Hello, World!"

for i in range(3):
    print(greet(my_dict['name']))

# List comprehensions
squared = [x**2 for x in my_list]
print("Squared values:", squared)

# Exception handling
try:
    result = 10 / 0
except ZeroDivisionError:
    result = None
    print("Cannot divide by zero.")
print("Result:", result)
Listing 1: Comprehensive Python Basics Example

Output
Hello, Alice!
Hello, Alice!
Hello, Alice!
Squared values: [1, 4, 9, 16, 25]
Cannot divide by zero.
Result: None

2.1.1 NumPy
NumPy, which stands for Numerical Python, is one of the foundational libraries
for scientific computing in Python. It provides support for large, multi-dimensional
arrays and matrices, along with a vast collection of high-level mathematical func-
tions to operate on these arrays efficiently. NumPy arrays are more compact and
faster than Python’s built-in lists, making them ideal for numerical computations and data analysis tasks. The library is widely used in data science, machine learning, and engineering applications due to its speed and versatility. With NumPy, you can perform a variety of operations such as element-wise arithmetic, linear algebra, statistical analysis, and reshaping of data structures. Its broadcasting feature allows for operations between arrays of different shapes, which is particularly useful in data manipulation and transformation.
NumPy also integrates seamlessly with other libraries like Pandas, Matplotlib,
and Scikit-learn, forming the backbone of the Python data science ecosystem.
Whether you are handling simple arrays or complex mathematical
computations, NumPy provides the tools necessary to work efficiently and
effectively with numerical data. Mastery of NumPy is essential for anyone
aspiring to become proficient in data science or scientific programming with
Python.
import numpy as np

# Creating 1D and 2D arrays
arr1 = np.array([10, 20, 30, 40, 50])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# Array properties
print("Shape of arr1:", arr1.shape)
print("Shape of arr2:", arr2.shape)
print("Data type of arr1:", arr1.dtype)

# Element-wise operations
sum_arr = arr1 + 5
product_arr = arr1 * 2

# Slicing and indexing
print("First three elements of arr1:", arr1[:3])
print("Element at row 1, col 2 of arr2:", arr2[1, 2])

# Mathematical functions
mean_val = np.mean(arr1)
std_val = np.std(arr1)
max_val = np.max(arr2)
min_val = np.min(arr2)

# Reshaping arrays
reshaped = arr2.reshape(3, 2)

# Matrix multiplication
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
mat_product = np.dot(mat1, mat2)

# Random number generation (seeded so the output below is reproducible)
np.random.seed(42)
rand_arr = np.random.rand(2, 3)

print("Sum array:", sum_arr)
print("Product array:", product_arr)
print("Mean:", mean_val, "Std Dev:", std_val)
print("Max:", max_val, "Min:", min_val)
print("Reshaped array:\n", reshaped)
print("Matrix product:\n", mat_product)
print("Random array:\n", rand_arr)
Listing 2: Comprehensive NumPy Example

Output
Shape of arr1: (5,)
Shape of arr2: (2, 3)
Data type of arr1: int64
First three elements of arr1: [10 20 30]
Element at row 1, col 2 of arr2: 6
Sum array: [15 25 35 45 55]
Product array: [ 20  40  60  80 100]
Mean: 30.0 Std Dev: 14.142135623730951
Max: 6 Min: 1
Reshaped array:
 [[1 2]
 [3 4]
 [5 6]]
Matrix product:
 [[19 22]
 [43 50]]
Random array:
 [[0.37454012 0.95071431 0.73199394]
 [0.59865848 0.15601864 0.15599452]]

2.1.2 Pandas
Pandas is a powerful and flexible open-source data analysis and manipulation library for Python. It introduces two primary data structures: Series (one-dimensional) and DataFrame (two-dimensional), which are designed for handling structured data with ease. Pandas excels at importing data from various file formats, cleaning and transforming datasets, and performing complex operations such as grouping, merging, and pivoting. Its intuitive syntax and rich set of functions make it a favorite among data scientists for tasks ranging from exploratory data analysis to feature engineering. With Pandas, you can filter, aggregate, and visualize data efficiently, making it possible to gain insights quickly from large datasets. The library also integrates seamlessly with other data science tools like NumPy, Matplotlib, and Scikit-learn, enabling end-to-end workflows within the Python ecosystem. Whether you are working with time series, categorical data, or numerical data, Pandas provides the tools necessary to manipulate and analyze data effectively. Its robust handling of missing data, powerful indexing, and support for custom functions make it indispensable for modern data analysis. Mastery of Pandas is essential for anyone looking to work with real-world data in Python.
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 22],
    'Salary': [50000, 60000, 70000, 65000, 45000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']
}
df = pd.DataFrame(data)

# Display basic info
print("DataFrame head:")
print(df.head())
print("\nInfo:")
df.info()

# Filtering and selection
it_employees = df[df['Department'] == 'IT']
print("\nIT Employees:\n", it_employees)

# Grouping and aggregation
avg_salary = df.groupby('Department')['Salary'].mean()
print("\nAverage Salary by Department:\n", avg_salary)

# Adding and modifying columns
df['Experience'] = [3, 8, 12, 5, 2]
df['Seniority'] = df['Experience'].apply(lambda x: 'Senior' if x >= 8 else 'Junior')
print("\nDataFrame with Experience and Seniority:\n", df)

# Handling missing data
df.loc[2, 'Salary'] = None
print("\nWith missing value:\n", df)
print("\nFilled missing salaries:")
print(df['Salary'].fillna(df['Salary'].mean()))

Listing 3: Comprehensive Pandas Example

Figure 1: Output of the comprehensive Pandas example.

2.1.3 Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is the foundation of most Python plotting libraries and is highly customizable, allowing users to create a wide variety of plots, from simple line graphs to complex 3D visualizations. Matplotlib’s object-oriented API gives fine-grained control over every element of a plot, including axes, labels, legends, and colors. This flexibility makes it suitable for both quick data exploration and publication-quality graphics. The library integrates well with NumPy and Pandas, enabling seamless plotting of data stored in arrays and DataFrames. Matplotlib supports multiple output formats, such as PNG, PDF, and SVG, and can be used in interactive environments like Jupyter Notebooks. Its extensive documentation and active community make it accessible to beginners while offering advanced features for experienced users. Whether you are visualizing trends, distributions, or relationships in your data, Matplotlib provides the tools to communicate your findings effectively. Mastery of Matplotlib is essential for any data scientist or analyst who needs to present data-driven insights visually.
import matplotlib.pyplot as plt
import numpy as np

# Generate data for plotting
x = np.linspace(0, 2 * np.pi, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Plot sine and cosine
ax.plot(x, y1, label='sin(x)', color='blue', linewidth=2)
ax.plot(x, y2, label='cos(x)', color='red', linestyle='--', linewidth=2)

# Add title and labels
ax.set_title('Sine and Cosine Functions', fontsize=16)
ax.set_xlabel('x (radians)', fontsize=14)
ax.set_ylabel('y', fontsize=14)

# Add grid, legend, and annotations
ax.grid(True, alpha=0.3)
ax.legend(loc='upper right')
ax.annotate('Peak', xy=(np.pi/2, 1), xytext=(2, 1.2),
            arrowprops=dict(facecolor='black', shrink=0.05))

# Save and show the plot
plt.tight_layout()
plt.savefig('sine_cosine_plot.png')
plt.show()
Listing 4: Comprehensive Matplotlib Example

Figure 2: A sine wave function plotted on a coordinate system

import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 2*np.pi, 100)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Create subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 6))

# First subplot
axs[0].plot(x, y_sin, 'b-', label='sin(x)')
axs[0].set_title('Sine Function')
axs[0].set_ylabel('sin(x)')
axs[0].grid(True)
axs[0].legend()

# Second subplot
axs[1].plot(x, y_cos, 'r-', label='cos(x)')
axs[1].set_title('Cosine Function')
axs[1].set_xlabel('x')
axs[1].set_ylabel('cos(x)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()

Listing 5: Multiple Plots in Matplotlib

Figure 3: Sine and cosine functions in separate subplots

import matplotlib.pyplot as plt
import numpy as np

# Sample data
categories = ['Category A', 'Category B', 'Category C', 'Category D', 'Category E']
values = [22, 35, 14, 28, 30]

# Create a bar plot
plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, color='steelblue')

# Customize the plot
plt.xlabel('Categories', fontsize=12)
plt.ylabel('Values', fontsize=12)
plt.title('Bar Plot of Categories vs Values', fontsize=14)
plt.xticks(rotation=45)
plt.grid(True, axis='y', alpha=0.3)

# Add value labels on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 1,
             f'{height}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
Listing 6: Bar Plot with Matplotlib

Figure 4: A bar chart showing values for five different categories

2.1.4 Seaborn
Seaborn is a powerful Python data visualization library based on Matplotlib that provides a high-level interface for creating attractive and informative statistical graphics. It simplifies the process of generating complex visualizations by offering built-in themes, color palettes, and functions for visualizing distributions, relationships, and categorical data. Seaborn integrates seamlessly with Pandas DataFrames, making it easy to plot data directly from tabular structures. The library includes advanced features such as regression plots, heatmaps, pair plots, and facet grids, which allow for multi-dimensional data exploration. Seaborn’s emphasis on statistical visualization helps users quickly identify patterns, trends, and outliers in their data. Its concise syntax and sensible defaults make it accessible to beginners, while its flexibility and customization options cater to advanced users. Whether you are performing exploratory data analysis or preparing publication-quality figures, Seaborn provides the tools to visualize your data effectively. Mastery of Seaborn is invaluable for data scientists and analysts who need to communicate insights through compelling graphics.
import matplotlib.pyplot as plt
import seaborn as sns

# For all Seaborn plots, we'll use a style that's more compact
sns.set(style="whitegrid", font_scale=0.9)

# Box Plot with Seaborn (combined version to avoid duplication)
def create_box_plot():
    print("Generating box_plot_tips.jpg ...")
    # Load sample dataset
    tips = sns.load_dataset('tips')

    # Create a more compact figure
    plt.figure(figsize=(8, 5))
    # Use a simpler boxplot with fewer details
    ax = sns.boxplot(x='day', y='total_bill', data=tips, palette="Set3")

    # Customize appearance to be more compact
    plt.title('Box Plot of Total Bill by Day', fontsize=12)
    plt.xlabel('Day', fontsize=10)
    plt.ylabel('Total Bill ($)', fontsize=10)
    plt.xticks(fontsize=9)
    plt.yticks(fontsize=9)

    # Format y-axis to reduce space
    from matplotlib.ticker import FormatStrFormatter
    ax.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))

    # Tight layout to maximize space usage
    plt.tight_layout()
    plt.savefig('box_plot_tips.jpg', dpi=100)
    plt.close()

    # Display minimal output to avoid overflow
    print("Box plot saved. First 3 rows of tips dataset:")
    print(tips.head(3).to_string(index=False))

# Pair Plot with Seaborn (combined version to avoid duplication)
def create_pair_plot():
    print("Generating iris_pairplot.jpg ...")
    # Load the iris dataset
    iris = sns.load_dataset('iris')

    # Create a pair plot with minimal elements
    # Use a smaller figure size and simpler markers
    g = sns.pairplot(iris, hue='species', height=1.8, aspect=1,
                     plot_kws={'s': 15, 'edgecolor': 'none'},
                     diag_kind='kde')

    # Adjust layout to be compact (no suptitle, which takes extra space)
    plt.tight_layout()
    plt.savefig('iris_pairplot.jpg', dpi=100)
    plt.close()

    # Minimal text output - just stats instead of raw data
    print("Pair plot saved. Summary statistics of iris dataset (first 2 columns):")
    print(iris.describe()[['sepal_length', 'sepal_width']].round(2).to_string())

# Execute the functions
create_box_plot()
create_pair_plot()
Listing 7: Box Plot and Pair Plot with Seaborn

Figure 5: A box plot showing total bill distribution by day of the week

Figure 6: A pair plot showing pairwise relationships between features in the iris
dataset, colored by species

2.2 Machine Learning with Scikit-learn
Scikit-learn is a robust and widely-used machine learning library in Python that provides simple and efficient tools for data mining and data analysis. It supports a wide range of supervised and unsupervised learning algorithms, including classification, regression, clustering, and dimensionality reduction. Scikit-learn is built on top of NumPy, SciPy, and Matplotlib, ensuring seamless integration with the broader scientific Python ecosystem. The library’s consistent API and comprehensive documentation make it accessible to both beginners and experts. Scikit-learn emphasizes ease of use, performance, and reproducibility, allowing users to quickly build and evaluate models with minimal code. It includes utilities for preprocessing data, selecting features, tuning hyperparameters, and validating models through cross-validation. Whether you are building a simple linear regression model or a complex ensemble classifier, Scikit-learn provides the tools necessary to implement and assess machine learning solutions. Its active community and frequent updates ensure that it remains at the forefront of machine learning research and practice. Mastery of Scikit-learn is essential for anyone pursuing a career in data science or machine learning.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate the model
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix visualization
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Listing 8: Comprehensive Scikit-learn Example

Output
Accuracy: 0.98
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.95      0.97        18
           2       0.95      1.00      0.97        11

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

Figure 7: A confusion matrix heatmap showing prediction results

Chapter 3
Data Visualization
3.1 Introduction to Data Visualization
Data visualization is crucial for understanding patterns, trends, and relationships in data. It
transforms raw data into graphical formats, making complex information more accessible and
easier to interpret. During the MSME Data Science training, we explored various visualization
techniques using several powerful Python libraries. Visualizations not only help in identifying
key insights quickly but also support better communication of findings to both technical and
non-technical audiences.

We primarily used libraries such as Matplotlib, Seaborn, and Plotly to create a wide range of
charts and plots. Matplotlib, one of the most fundamental libraries, allowed us to build basic
plots like line graphs, bar charts, and scatter plots with high customizability. Seaborn, built on
top of Matplotlib, simplified the process of creating attractive and informative statistical
graphics. It introduced advanced visualizations like heatmaps, violin plots, and pair plots, which
helped in performing exploratory data analysis more effectively. Plotly, an interactive graphing
library, enabled us to create dynamic and interactive dashboards, providing a more engaging
experience when exploring data.

Throughout the training, we learned to choose the appropriate type of visualization depending on
the nature of the data and the insights we wanted to highlight. For example, scatter plots were
effective for showing relationships between two variables, while histograms helped in
understanding the distribution of a dataset. Box plots were used to detect outliers and understand
data spread, and heatmaps provided a clear visual of correlations between multiple variables.

Overall, gaining hands-on experience with different visualization techniques enhanced our
ability to draw meaningful conclusions from datasets. It also emphasized the importance of
clarity, color schemes, labeling, and storytelling in presenting data effectively.
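Since Plotly is mentioned above but no Plotly listing appears later in this report, the following is a small illustrative sketch (assuming the plotly package is installed) of an interactive chart built from the same tips dataset used in the Seaborn examples; it is not code from the training itself.

import plotly.express as px
import seaborn as sns

# Load the same tips dataset used in the Seaborn examples
tips = sns.load_dataset('tips')

# Build an interactive scatter plot; hovering over points shows the underlying values
fig = px.scatter(tips, x='total_bill', y='tip', color='day',
                 title='Interactive Scatter Plot of Tips vs Total Bill')
fig.show()  # opens an interactive figure in a browser or notebook

Unlike the static Matplotlib figures, the resulting chart supports zooming and hover tooltips, which is what makes Plotly suited to interactive dashboards.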

3.1.1 Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and
interactive visualizations.
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create a line plot
plt.figure(figsize=(8, 5))
plt.plot(x, y, 'b-', linewidth=2, label='sin(x)')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.xlabel('x', fontsize=12)
plt.ylabel('sin(x)', fontsize=12)
plt.title('Sine Wave Plot', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Listing 9: Basic Line Plot with Matplotlib

Figure 8: A sine wave function plotted on a coordinate system

import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 2*np.pi, 100)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Create subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 6))

# First subplot
axs[0].plot(x, y_sin, 'b-', label='sin(x)')
axs[0].set_title('Sine Function')
axs[0].set_ylabel('sin(x)')
axs[0].grid(True)
axs[0].legend()

# Second subplot
axs[1].plot(x, y_cos, 'r-', label='cos(x)')
axs[1].set_title('Cosine Function')
axs[1].set_xlabel('x')
axs[1].set_ylabel('cos(x)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()

Listing 10: Multiple Plots in Matplotlib

Figure 9: Sine and cosine functions in separate subplots

import matplotlib.pyplot as plt
import numpy as np

# Sample data
categories = ['Category A', 'Category B', 'Category C', 'Category D', 'Category E']
values = [22, 35, 14, 28, 30]

# Create a bar plot
plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, color='steelblue')

# Customize the plot
plt.xlabel('Categories', fontsize=12)
plt.ylabel('Values', fontsize=12)
plt.title('Bar Plot of Categories vs Values', fontsize=14)
plt.xticks(rotation=45)
plt.grid(True, axis='y', alpha=0.3)

# Add value labels on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 1,
             f'{height}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
Listing 11: Bar Plot with Matplotlib

Figure 10: A bar chart showing values for five different categories

3.1.2 Seaborn
Seaborn is a statistical data visualization library built on top of Matplotlib, providing a high-level interface for drawing attractive statistical graphics.
import matplotlib.pyplot as plt
import seaborn as sns

# Load sample dataset
tips = sns.load_dataset('tips')

# Create a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=tips)
plt.title('Box Plot of Total Bill by Day', fontsize=14)
plt.xlabel('Day', fontsize=12)
plt.ylabel('Total Bill ($)', fontsize=12)
plt.grid(True, axis='y', alpha=0.3)
plt.show()

print("First few rows of the tips dataset:")
print(tips.head())

Listing 12: Box Plot with Seaborn

Figure 11: A box plot showing total bill distribution by day of the week

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Create a correlation matrix
data = pd.DataFrame(np.random.randn(10, 6),
                    columns=['A', 'B', 'C', 'D', 'E', 'F'])
correlation = data.corr()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap', fontsize=14)
plt.tight_layout()
plt.show()

print("Sample correlation values:")
print(correlation.round(2).iloc[:3, :3])
Listing 13: Heatmap with Seaborn

Output
Sample correlation values:
      A     B     C
A  1.00  0.10  0.05
B  0.10  1.00  0.15
C  0.05  0.15  1.00

Figure 12: A correlation heatmap showing relationships between variables

import matplotlib.pyplot as plt
import seaborn as sns

# Load the iris dataset
iris = sns.load_dataset('iris')

# Create a pair plot
sns.pairplot(iris, hue='species', markers=['o', 's', 'D'])
plt.suptitle('Pair Plot of Iris Dataset', y=1.02, fontsize=16)
plt.show()

print("First few rows of the iris dataset:")
print(iris.head())
print("\nSummary statistics of iris dataset:")
print(iris.describe().round(2))

Listing 14: Pair Plot with Seaborn

Figure 13: A grid of pairwise relationships between features in the iris dataset,
colored by species

Chapter 4
Machine Learning Basics
4.1 Introduction to Machine Learning
Machine learning has revolutionized the field of data science by enabling computers to learn patterns from data without explicit programming. This branch of artificial intelligence has transformed how organizations make decisions, products are developed, and scientific discoveries are made. At its core, machine learning encompasses several paradigms including supervised learning (where models learn from labeled data), unsupervised learning (discovering hidden patterns in unlabeled data), and reinforcement learning (learning through interaction with an environment). The fundamental process begins with data collection and preprocessing, followed by feature engineering to transform raw data into meaningful inputs. Models are then trained using algorithms that iteratively adjust parameters to minimize prediction errors. Scikit-learn, Python’s premier machine learning library, provides an intuitive interface to implement these concepts with pre-built algorithms like decision trees, support vector machines, and neural networks. The library excels in both classification tasks (assigning categories to observations) and regression (predicting continuous values). Modern machine learning extends beyond traditional algorithms to include ensemble methods that combine multiple models for improved accuracy, and deep learning, which uses artificial neural networks with multiple layers to learn hierarchical representations. What makes machine learning particularly powerful is its ability to generalize from training data to make predictions on unseen examples. This capability depends on careful model selection, hyperparameter tuning, and robust evaluation using metrics appropriate to the problem domain. Cross-validation techniques help assess model performance and prevent overfitting, where models memorize training data rather than learning generalizable patterns. As the field continues to evolve, automated machine learning (AutoML) tools are making these techniques more accessible, while researchers push boundaries with techniques like transfer learning, federated learning, and explainable AI that address limitations of traditional approaches.
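To make the cross-validation idea above concrete, here is a minimal sketch (not part of the original training material) that scores a small decision tree on the iris dataset with 5-fold cross-validation in Scikit-learn.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on four folds, validate on the held-out fold, repeat five times
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=42), X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))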

4.2 Supervised Learning


Supervised learning involves training a model on labeled data to make predictions or
decisions. In this approach, the dataset consists of input-output pairs, where each input is
associated with a correct output, or label. The goal of the model is to learn the mapping from
inputs to outputs so that it can accurately predict the label for new, unseen data. During
training, the model adjusts its internal parameters based on the difference between its
predictions and the actual labels, using techniques such as gradient descent and loss
minimization.

Supervised learning can be broadly categorized into two types: classification and regression.
In classification problems, the outputs are categorical labels. For example, a model might be
trained to classify emails as "spam" or "not spam," or to recognize handwritten digits from
images. In regression problems, the outputs are continuous values. An example would be
predicting the price of a house based on its features like size, location, and number of
bedrooms.

Overall, supervised learning forms the foundation of many real-world applications of machine
learning, including recommendation systems, fraud detection, and medical diagnosis, making
it an essential skill for any data scientist.

4.2.1 Linear Regression


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Coefficient: {model.coef_[0][0]:.4f}")
print(f"Intercept: {model.intercept_[0]:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R² Score: {r2:.4f}")

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual data')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression line')
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Linear Regression: Actual vs Predicted', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Listing 15: Linear Regression Example

Output
Coefficient: 3.1259
Intercept: 3.8691
Mean Squared Error: 0.8183
R² Score: 0.8561

Figure 14: A scatter plot with actual data points and the linear regression line
4.2.2 Classification with Decision Trees
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

# Generate synthetic data
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

# Visualize the decision boundary
plt.figure(figsize=(10, 6))
# Create a mesh grid
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Predict for each point in the mesh
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k', marker='o')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Decision Tree Classification with Decision Boundary')
plt.show()
Listing 16: Decision Tree Classification

Output
Accuracy: 0.8333

Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.80      0.82        45
           1       0.82      0.87      0.84        45

    accuracy                           0.83        90
   macro avg       0.83      0.83      0.83        90
weighted avg       0.83      0.83      0.83        90

Figure 15: A confusion matrix heatmap showing prediction results

Figure 16: A decision boundary plot showing classified data points

4.3 K-Means Clustering


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Apply K-means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Calculate silhouette score
silhouette_avg = silhouette_score(X, y_kmeans)
print(f"Silhouette Score: {silhouette_avg:.4f}")
print(f"Cluster Centers:\n{kmeans.cluster_centers_}")

# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Centroids')
plt.title('K-means Clustering', fontsize=14)
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Elbow method to find optimal K
inertia = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
    labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(X, labels))

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'ro-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Listing 17: K-Means Clustering Example

Output
Silhouette Score: 0.7573
Cluster Centers:
[[ 7.09369362  8.95412693]
 [-9.48413448  2.23470412]
 [ 1.76173574 -9.32901144]
 [ 3.6913933   3.64307413]]

Figure 17: A scatter plot showing data points colored by cluster
assignment with centroids marked

Figure 18: Two plots showing the elbow method and silhouette scores for determining the optimal number of clusters

Chapter 5
Capstone Project: Fake News Detection

ABSTRACT
In the era of rapid digital communication, the spread of fake news has emerged as a
significant threat to public trust, societal stability, and informed decision-making.
The urgent need for efficient and reliable automated detection systems has become
more evident than ever before. This internship project focuses on the design and
implementation of a comprehensive fake news detection system by leveraging the
FakeNewsNet dataset, a widely recognized benchmark for fake news research.

The project workflow begins with an extensive data preprocessing phase, which
includes cleaning the text data, removing stop words, handling missing values, and
standardizing the textual information to ensure consistency. Following
preprocessing, the Term Frequency–Inverse Document Frequency (TF-IDF)
technique is applied to transform the textual data into meaningful numerical feature
representations that can be utilized by machine learning models.

Several supervised machine learning algorithms are explored and implemented, including Logistic Regression, Decision Tree, Random Forest, and Naive Bayes
classifiers. Each model is rigorously trained, with an emphasis on fine-tuning
hyperparameters to maximize performance. Cross-validation strategies are employed
to ensure that the models generalize well to unseen data and avoid overfitting.
Comprehensive evaluation of the models is conducted using a variety of
performance metrics, such as accuracy, precision, recall, F1-score, confusion
matrices, and Receiver Operating Characteristic (ROC) curves. These evaluation
measures provide a thorough understanding of each model’s strengths and
limitations, enabling a comparative analysis to determine the most effective
approach for fake news detection.

Additionally, this report documents my contributions during my six-week internship at MSME, detailing the technical skills acquired, the challenges encountered, and the
solutions devised throughout the project. Through this internship, I have gained valuable hands-on experience in applying data science and machine learning techniques to a real-world problem.

5.1 Introduction
5.1.1 Overview

The digital revolution has significantly altered the landscape of information dissemination, making it faster, more accessible, and more diverse than ever before.
Social media platforms, online news portals, and instant messaging applications have
empowered individuals to access information within seconds. However, alongside
these advancements, there has been an alarming rise in the spread of misinformation
and disinformation, collectively referred to as fake news. Fake news refers to
deliberately fabricated or misleading information presented as legitimate news, often
with the intent to deceive readers, manipulate public opinion, or generate
sensationalism for political or financial gains. The widespread propagation of such
content has serious societal implications; it can undermine democratic institutions,
incite unnecessary panic, exacerbate social divisions, and weaken the credibility of
authentic news outlets. Therefore, there is an urgent need for developing automated,
accurate, and scalable fake news detection systems to preserve the integrity of
information and protect public discourse.

5.1.2 Objectives

The primary objectives of this internship project are as follows:

• To design and develop an end-to-end pipeline for the automated detection of fake news, utilizing the widely recognized FakeNewsNet dataset.

• To conduct a comparative analysis of multiple machine learning classification algorithms, assessing their effectiveness in detecting fake news based solely on headline text.

• To deploy the most accurate and efficient model through a user-friendly web interface, enabling real-time fake news classification accessible to end-users.

• To document the methodology, challenges encountered, and key insights gained during the course of the internship, thereby contributing to the broader academic and professional understanding of machine learning applications in the area of fake news detection.
5.1.3 Scope and Limitations

The scope of this project is centered on the classification of news articles based
exclusively on their headlines. This decision is motivated by data availability and the
need for rapid prediction suitable for real-time applications. Although full-text analysis
could provide deeper contextual and semantic understanding, it is intentionally
excluded to maintain the focus on fast and lightweight classification models. However,
this approach introduces certain limitations. Headline-based models may miss nuanced
details contained within full articles, leading to potential misclassification in complex
cases. Additionally, the dataset employed may carry inherent biases, such as source-
specific language patterns, which could affect model generalization. The deployed
model is static, meaning it does not continuously learn from new data; periodic
retraining is required to adapt to emerging linguistic trends and new types of
misinformation. These constraints should be considered when interpreting the results
and planning future enhancements.

5.2 Literature Review
The field of fake news detection has attracted significant research interest, given the
profound societal risks associated with misinformation. Multiple methodologies have
been proposed over the years, each with varying degrees of complexity and
effectiveness. This chapter presents a comprehensive overview of key approaches, from
early statistical techniques to modern hybrid systems.

5.2.1 Statistical and Linguistic Analysis

Initial efforts in fake news detection primarily focused on statistical analysis and
linguistic feature engineering. Researchers employed stylometric techniques, such as
analyzing the frequency of n-grams (word pairs or triples), readability scores, and
psycholinguistic markers, to distinguish between genuine and fabricated news articles
[1,2]. These methods leveraged observable writing style differences, assuming that fake
news exhibits distinct linguistic patterns compared to credible journalism. While
statistical and lexical approaches offered valuable insights and were relatively easy to
implement, they were inherently limited by their shallow representation of text. They
often failed to capture deeper semantic nuances, making them less effective against
sophisticated, well-crafted fake news articles that mimic authentic writing styles.

5.2.2 Machine Learning Approaches

The next wave of research introduced machine learning algorithms, which significantly
improved fake news detection capabilities. Traditional classifiers such as Support
Vector Machines (SVM), Random Forests, Decision Trees, and Naive Bayes were
trained on engineered text features, including TF-IDF vectors and word embeddings.
These models provided better scalability and generalization compared to handcrafted
rule-based systems. With the advent of deep learning, more advanced architectures like
Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM)
networks were employed to automatically learn hierarchical and sequential
representations from raw text [3,4]. Deep learning models excelled in capturing
complex linguistic structures and contextual dependencies, achieving state-of-the-art
performance across many benchmark datasets.

5.2.3 Hybrid and Contextual Methods

Recognizing the limitations of text-only approaches, researchers began integrating social and contextual data into fake news detection systems. Hybrid models combine
textual analysis with metadata, such as user engagement patterns (likes, shares,
comments), source credibility ratings, temporal posting behaviors, and network
propagation structures. These hybrid systems leverage the broader ecosystem
surrounding a piece of news, resulting in significant improvements in detection
accuracy [5,6]. For instance, articles originating from low-credibility sources or
receiving abnormal engagement patterns are more likely to be flagged as fake.
However, the reliance on social context poses practical challenges, especially in real-
time applications where such metadata may not be readily available.

5.2.4 Gap Analysis

Although numerous studies have achieved notable success, most existing models
assume access to either the full article text or supplementary social context information.
In fast-paced environments, especially in the age of breaking news and micro-content
(e.g., Twitter, headline feeds), such resources are not always accessible. Consequently,
there is a critical need for models that can operate effectively on limited data, such as
headlines alone. This project addresses this gap by focusing exclusively on headline-
based classification, aiming to develop lightweight, rapid detection systems suitable for
integration into real-time news monitoring pipelines.

5.3 Methodology
This chapter details the methodology adopted for the development of the fake news
detection system, covering dataset characteristics, preprocessing steps, feature
engineering, model training, hyperparameter optimization, and evaluation strategies.

5.3.1 Dataset Description

The project utilizes the FakeNewsNet dataset, a curated collection of approximately 40,000 news headlines, evenly distributed between real and fake labels. The real news
headlines are sourced from reputable mainstream news outlets, while the fake news
samples are collected from fact-checking organizations that verify and classify news
stories. Each data entry consists of the headline text and an associated binary label (real
or fake). The dataset is sufficiently diverse, encompassing a variety of domains such as
politics, health, and entertainment, thereby providing a solid foundation for developing
generalized models.
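The report does not reproduce the data-loading code; a typical first step would look like the sketch below, where the file name and column names are hypothetical placeholders rather than the project's actual files.

import pandas as pd

# Hypothetical file and column names; the actual FakeNewsNet export used in the project is not specified here
df = pd.read_csv('fakenewsnet_headlines.csv')   # assumed columns: 'title' and 'label'

print(df.shape)                     # roughly 40,000 rows, as described above
print(df['label'].value_counts())   # real vs fake class balance
print(df.head())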

5.3.2 Data Preprocessing

Effective preprocessing is crucial for enhancing model performance. The raw text undergoes the following transformations (a minimal code sketch follows the list):

1. Data Cleaning: Null values, duplicate records, and irrelevant entries are removed to ensure data integrity.

2. Normalization: Text is converted to lowercase to maintain consistency, and all punctuation marks and numeric characters are stripped out.

3. Tokenization: Headlines are split into individual tokens (words) using the Natural Language Toolkit (NLTK).

4. Lemmatization: Each token is reduced to its base or dictionary form, improving the ability of models to recognize variations of the same word.

5. Stop-word Removal: Common English words (e.g., 'the', 'and', 'is') that contribute little to semantic meaning are filtered out.
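As referenced above, the following is a minimal sketch of such a preprocessing pipeline, assuming NLTK and its punkt, stopwords, and wordnet resources are available; it illustrates the steps rather than reproducing the project's exact code.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_headline(text):
    # Normalization: lowercase, then strip punctuation and digits
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    # Tokenization
    tokens = word_tokenize(text)
    # Stop-word removal and lemmatization
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return ' '.join(tokens)

print(preprocess_headline("Breaking: 5 Celebrities Reveal Their Secrets!"))
# e.g. -> "breaking celebrity reveal secret"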

5.3.3 Feature Extraction


The processed text is vectorized using the Term Frequency–Inverse Document
Frequency (TF-IDF) technique, which quantifies the importance of words relative to the
entire corpus. TF-IDF helps in emphasizing discriminative words while diminishing the
impact of commonly occurring but less informative terms. To strike a balance between
feature richness and computational efficiency, the vectorizer is configured to extract a
maximum of 5,000 features.
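A configuration along these lines can be expressed with Scikit-learn's TfidfVectorizer; the snippet below is an illustrative sketch with placeholder headlines, not the project's actual feature-extraction code.

from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder list standing in for the preprocessed headlines
headlines = ["celebrity reveal wedding date",
             "government announces new health policy"]

# Cap the vocabulary at 5,000 features, as described above
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(headlines)

print("Feature matrix shape:", X_tfidf.shape)
print("Sample vocabulary terms:", list(vectorizer.get_feature_names_out())[:5])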

5.3.4 Model Training

The preprocessed dataset is split into a training set (80%) and a testing set (20%). The following machine learning models are trained and evaluated (a brief training sketch follows the list):

• Logistic Regression: A linear classifier utilizing an L2 penalty (Ridge regularization) to prevent overfitting. Regularization strength is tuned through cross-validation.

• Decision Tree: A non-parametric model that splits data based on Gini impurity, with depth constraints applied to enhance generalization.

• Random Forest: An ensemble learning method aggregating predictions from 100 decision trees, employing bootstrap sampling and feature bagging techniques.

• Multinomial Naive Bayes: A probabilistic classifier based on Bayes' theorem, well-suited for text classification tasks, with Laplace smoothing to handle zero-frequency issues.
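The project's training code is not included in this report; the sketch below shows, under the stated 80/20 split, how the four classifiers listed above could be trained and compared on TF-IDF features with Scikit-learn. The tiny corpus and labels are placeholders, and the hyperparameter values are illustrative defaults rather than the tuned settings.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Placeholder corpus and labels standing in for the preprocessed FakeNewsNet headlines
headlines = ["celebrity reveals shocking wedding secret", "government announces new health policy",
             "miracle cure doctors hate", "city council approves annual budget",
             "star spotted with alien baby", "university publishes climate study",
             "you will not believe this trick", "central bank raises interest rates"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]   # 0 = fake, 1 = real

X_tfidf = TfidfVectorizer(max_features=5000).fit_transform(headlines)
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels, test_size=0.2, random_state=42, stratify=labels)

# Illustrative default hyperparameters; the tuned values come from the grid search in Section 5.3.5
models = {
    'Logistic Regression': LogisticRegression(penalty='l2', max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=20, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Multinomial Naive Bayes': MultinomialNB(alpha=1.0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.4f}")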

5.3.5 Hyperparameter Tuning

Hyperparameter optimization is conducted through exhaustive grid search across predefined parameter grids:

• Logistic Regression: Regularization strength C values tested are {0.01, 0.1, 1, 10}.

• Decision Tree: Maximum depth values considered are {10, 20, None}.

• Random Forest: Number of estimators varied among {50, 100, 200}.

• Naive Bayes: Laplace smoothing parameter (α) values examined are {0.5, 1.0, 1.5}.

Cross-validation ensures that the selected hyperparameters generalize well to unseen
data.
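Such a grid search can be set up with Scikit-learn's GridSearchCV; the sketch below tunes only the Random Forest grid from the list above on placeholder data, and the other models would be tuned analogously.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Placeholder data standing in for the TF-IDF training matrix and labels
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=42)

# Exhaustive search over the estimator counts listed above, with 5-fold cross-validation
param_grid = {'n_estimators': [50, 100, 200]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='f1')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated F1 score:", round(grid.best_score_, 4))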

5.3.6 Evaluation Metrics

The trained models are evaluated using multiple performance metrics to ensure comprehensive assessment (a short computation sketch follows the list):

• Accuracy: Measures the overall correctness of the model.

• Precision: Evaluates the proportion of true positive predictions among all positive predictions.

• Recall: Assesses the proportion of true positive predictions among all actual positive instances.

• F1-Score: Provides a harmonic mean of precision and recall, particularly useful when classes are imbalanced.

• ROC-AUC Score: Reflects the model’s capability to distinguish between classes across different threshold settings, with higher scores indicating better discrimination ability.
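For reference, all of these metrics are available in sklearn.metrics; the short sketch below shows how they could be computed from a model's predictions, using illustrative placeholder arrays rather than project results.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Illustrative placeholder labels and predictions (1 = real, 0 = fake), not project results
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted probability of the "real" class

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))            # harmonic mean of precision and recall
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))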

5.4 Results and Discussion

5.4.1 Dataset Statistics

The dataset comprises 20,000 real headlines and 20,000 fake headlines.

Figure 4.1 visualizes the class distribution.

Figure 4.1: Real vs Fake News Distribution

5.4.2 Model Performance


Model                       Accuracy   Precision   Recall   F1-Score
Logistic Regression           0.85       0.84       0.85      0.84
Decision Tree                 0.81       0.80       0.81      0.80
Random Forest                 0.88       0.87       0.88      0.87
Multinomial Naive Bayes       0.78       0.77       0.78      0.77

Table 4.1: Comparison of Model Performance

Logistic Regression Performance


• Accuracy: 0.8250
• Precision: 0.8158
• Recall: 0.8250
• F1 Score: 0.8063

Classification Report:
Class       Precision   Recall   F1-Score   Support
0 (Fake)       0.75       0.42      0.54       1131
1 (Real)       0.84       0.95      0.89       3509

Decision Tree Performance


• Accuracy: 0.7761
• Precision: 0.7759
• Recall: 0.7761
• F1 Score: 0.7760

Classification Report:
Class       Precision   Recall   F1-Score   Support
0 (Fake)       0.54       0.54      0.54       1131
1 (Real)       0.85       0.85      0.85       3509

Random Forest Performance


• Accuracy: 0.8213
• Precision: 0.8106
• Recall: 0.8213
• F1 Score: 0.8119

Classification Report:
Class       Precision   Recall   F1-Score   Support
0 (Fake)       0.68       0.51      0.58       1131
1 (Real)       0.85       0.92      0.89       3509

Naive Bayes Performance


• Accuracy: 0.8310
• Precision: 0.8249
• Recall: 0.8310
• F1 Score: 0.8117

Classification Report:
Class       Precision   Recall   F1-Score   Support
0 (Fake)       0.78       0.42      0.55       1131
1 (Real)       0.84       0.96      0.90       3509

Confusion Matrix

Figure 4.2: Confusion Matrix for Random Forest Model

Figure 4.3: Confusion Matrix for Logistic Regression Model

Figure 4.4: Confusion Matrix for Naive Bayes Model

Figure 4.5: Confusion Matrix for Decision Tree Model

5.4.3 ROC Curves

Figure 4.6: ROC Curve for Logistic Regression Model
Figure 4.7: ROC Curve for Random Forest Model
Figure 4.8: ROC Curve for Naive Bayes Model
Figure 4.9: ROC Curve for Decision Tree Model
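The plotting code behind these ROC figures is not reproduced in the report; a curve of this kind can be generated with Scikit-learn's roc_curve as in the sketch below, shown here for a single classifier on placeholder data.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder binary-classification data standing in for the TF-IDF features and labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()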

The bar chart illustrates the Top 20 Features ranked by their TF-IDF (Term Frequency-Inverse Document Frequency) scores, highlighting the most influential terms contributing to fake news classification (a sketch of how such a ranking can be computed follows the list).

• "kardashian" has the highest TF-IDF score among all features, indicating that it is the most distinctive and impactful word in the corpus for separating fake and real news headlines.

• Other dominant terms include "new," "star," "jennif," "jenner," "season," and "award," which are also highly influential.

• Many of the top-ranked words (e.g., "kardashian," "jenner," "kim," "meghan," "justin") are associated with celebrity culture and entertainment news, suggesting that fake news articles in the dataset often target or mention popular public figures and celebrity events.

• Words like "babi," "wed," "say," "reveal," and "date" point toward topics involving personal life events such as weddings, births, and revelations, common themes that fake news outlets may exploit to attract readers.

• Interestingly, the presence of shortened forms or truncated words like "princ" (likely "prince"), "celebr" (celebrity), and "markl" (Markle) indicates that headline texts sometimes feature partial words due to preprocessing or inherent text styles.
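As referenced above, a ranking of this kind could be produced along the following lines; the sketch uses a placeholder corpus and ranks terms by mean TF-IDF weight, which may differ from the exact scoring used to build the report's chart.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus standing in for the preprocessed headlines
headlines = ["kardashian reveal new season", "jenner award show date",
             "star wed reveal babi", "new celebr say reveal"]

vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(headlines)

# Rank terms by their mean TF-IDF weight across the corpus
mean_scores = np.asarray(X_tfidf.mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
top = np.argsort(mean_scores)[::-1][:20]

plt.figure(figsize=(10, 6))
plt.barh([terms[i] for i in top][::-1], mean_scores[top][::-1], color='steelblue')
plt.xlabel('Mean TF-IDF score')
plt.title('Top 20 Features by TF-IDF Score')
plt.tight_layout()
plt.show()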

Chapter 6
Conclusion
The MSME Data Science training provided a comprehensive understanding of Python
programming, data analysis, visualization, and machine learning concepts. Through
hands-on examples and exercises, I gained practical experience with essential libraries
and techniques used in the field of data science.

Key takeaways from the training include:


• Proficiency in Python programming for data analysis
• Expertise in data manipulation using NumPy and Pandas
• Skills in creating informative visualizations with Matplotlib and Seaborn
• Understanding of machine learning algorithms and their applications
• Practical experience in implementing end-to-end data science projects

The training emphasized the importance of a strong programming foundation, particularly in Python, to solve real-world data problems efficiently. Learning to
manipulate data using libraries like Pandas and NumPy allowed me to handle large
datasets, clean missing values, transform variables, and prepare data for modeling with
confidence. These skills are crucial for any data scientist aiming to derive accurate
insights from complex datasets.

Visualization skills acquired through libraries like Matplotlib, Seaborn, and Plotly
enabled me to better communicate analytical results. I learned how to select the
appropriate visualization techniques based on the data type and the message I wanted to
convey. Effective data storytelling is an essential aspect of data science, helping
stakeholders and decision-makers quickly understand findings and take action.

The introduction to machine learning concepts, including supervised and unsupervised learning, opened up new perspectives on how predictive models are built and evaluated.
Implementing algorithms such as linear regression, decision trees, and K-means clustering provided me with a hands-on understanding of both the theoretical and practical aspects of machine learning.

One of the most valuable experiences from the training was completing practical
projects that simulated real-world problems. These projects involved complete
workflows, from data collection and cleaning to model building and evaluation. They
helped strengthen my ability to think critically, approach problems systematically, and
apply the right tools and techniques at each stage of a data science project.

