
Jayalakshmi Institute of Technology

NH 7, Salem Main Rd, T. Kanigarahalli, Thoppur,
Dharmapuri, Tamil Nadu 636 352.
(Approved by AICTE - New Delhi, Affiliated to Anna University - Chennai)

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

DATA EXPLORATION AND VISUALIZATION
LAB MANUAL
1.A) Install Python as a Data Analysis and Visualization Tool
Aim:
To install Python together with the core data analysis and visualization libraries.
Algorithm:
Step 1: Install Python.
Step 2: Install a Python package manager (if not already installed):
Python ships with pip, the package installer, but you can update it to the latest version by running: python -m pip install --upgrade pip
Step 3: Set up a virtual environment (recommended): Create a virtual environment to manage the dependencies for your project: python -m venv myenv
Activate the virtual environment: myenv\Scripts\activate (Windows) or source myenv/bin/activate (Linux/macOS)
Step 4: Install data analysis libraries:
Pandas: For data manipulation and analysis. pip install pandas
NumPy: Often used with Pandas for numerical operations. pip install numpy
Step 5: Install visualization libraries:
Matplotlib: A basic plotting library for creating static, animated, and interactive visualizations. pip install matplotlib
Seaborn: Built on Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. pip install seaborn
Step 6: Verify the installation: Check that the libraries are installed correctly by opening a Python interpreter and importing them:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step 7: Optional: Install Jupyter Notebook (for an interactive environment):
If you prefer using Jupyter Notebook for writing and running Python code interactively, install it using: pip install jupyter
Step 8: Optional: Install additional tools:
Depending on your needs, you might also want to install other libraries such as SciPy for scientific computations, Plotly for interactive plots, or Statsmodels for statistical modeling, as shown below.
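For example, these optional libraries can be installed in one command (a minimal sketch; the exact set depends on your project):
pip install scipy plotly statsmodels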
Program:
pip install numpy
pip install pandas
pip install seaborn
pip install matplotlib

# Import the libraries and print their versions to verify the installations
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)

Output:
Pandas version: 2.1.4
NumPy version: 1.26.4
Matplotlib version: 3.7.1
Seaborn version: 0.13.1

Result:
Thus the Python program to install and verify the data analysis and visualization tools has been successfully verified.
1.B) Install the Pandas Package in Python and execute a program for simple DataFrame attributes
Aim:
To install the Pandas package in Python and execute a program for simple DataFrame attributes.
Algorithm:
Step 1: Install the Pandas Package
To install the Pandas package, you will need to use the Python package manager, pip. Open your
terminal or command prompt and run the following command:pip install pandas
Step 2: Verify the installation and create a simple DataFrame. To create a DataFrame and explore its attributes:
Import the Pandas library: Begin by importing the pandas library.
Create a DataFrame: Use the pd.DataFrame() function to create a simple DataFrame.
Explore DataFrame attributes: Access various attributes such as head(), shape, columns, index, and dtypes.

Program:

import pandas as pd

# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame:")
print(df)

# Display DataFrame attributes
print("\nDataFrame Attributes:")
print(f"Shape: {df.shape}")          # Shape of the DataFrame
print(f"Columns: {df.columns}")      # Column names
print(f"Index: {df.index}")          # Index
print(f"Data Types:\n{df.dtypes}")   # Data types of each column

# Descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe(include='all'))

# Info about the DataFrame
print("\nDataFrame Info:")
df.info()
Output:
DataFrame:
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston

DataFrame Attributes:
Shape: (4, 3)
Columns: Index(['Name', 'Age', 'City'], dtype='object')
Index: RangeIndex(start=0, stop=4, step=1)

Data Types:
Name object
Age int64
City object
dtype: object
Descriptive Statistics:
Name Age City
count 4 4.000000 4
unique 4 NaN 4
top Alice NaN New York
freq 1 NaN 1
mean NaN 26.250000 NaN
std NaN 4.349329 NaN
min NaN 22.000000 NaN
25% NaN 23.500000 NaN
50% NaN 25.500000 NaN
75% NaN 28.250000 NaN
max NaN 32.000000 NaN
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    4 non-null      object
 1   Age     4 non-null      int64
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes

Result:
Thus the Python program to install the Pandas package and execute a program for simple DataFrame attributes has been successfully verified.
2. Create a program using NumPy package functions and a 2D or 3D array to perform simple matrix operations
Aim:
To create a program using NumPy package functions and a 2D or 3D array to perform simple matrix operations.
Algorithm:
Step 1: Addition/Subtraction:
Check if both matrices have the same dimensions.
Add/subtract corresponding elements of both matrices.
Step 2: Multiplication:
Ensure that the number of columns in the first matrix equals the number of rows in the second matrix. Multiply each element of the rows of the first matrix by the corresponding elements of the columns of the second matrix and sum them.
Step 3: Transpose: Convert the rows of the matrix into columns and vice versa.
Program:
import numpy as np

# 2D Array Example
print("2D Array Operations")

# Create two 2D arrays


matrix_a = np.array([[1, 2, 3], [4, 5, 6]])
matrix_b = np.array([[7, 8, 9], [10, 11, 12]])
print("Matrix A:")
print(matrix_a)
print("\nMatrix B:")
print(matrix_b)

# 1. Matrix Addition
matrix_addition = matrix_a + matrix_b
print("\nMatrix Addition (A + B):")
print(matrix_addition)

# 2. Matrix Multiplication (Dot Product)


matrix_multiplication = np.dot(matrix_a, matrix_b.T) # Transpose B for valid multiplication
print("\nMatrix Multiplication (A . B^T):")
print(matrix_multiplication)

# 3. Element-wise Operation (Multiplication)


elementwise_multiplication = matrix_a * matrix_b
print("\nElement-wise Multiplication (A * B):")
print(elementwise_multiplication)

# 4. Matrix Transposition
matrix_transpose = np.transpose(matrix_a)
print("\nMatrix Transposition (Transpose of A):")
print(matrix_transpose)

# 3D Array Example
print("\n3D Array Operations")

# Create a 3D array
array_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]], [[13, 14, 15], [16,
17, 18]]])
print("3D Array:")
print(array_3d)
# 5. Sum along axis 0
sum_axis_0 = np.sum(array_3d, axis=0)
print("\nSum along axis 0:")
print(sum_axis_0)

# 6. Sum along axis 1


sum_axis_1 = np.sum(array_3d, axis=1)
print("\nSum along axis 1:")
print(sum_axis_1)

# 7. Sum along axis 2


sum_axis_2 = np.sum(array_3d, axis=2)
print("\nSum along axis 2:")
print(sum_axis_2)

Output:
2D Array Operations
Matrix A:
[[1 2 3]
[4 5 6]]

Matrix B:
[[ 7 8 9]
[10 11 12]]

Matrix Addition (A + B):


[[ 8 10 12]
[14 16 18]]

Matrix Multiplication (A . B^T):


[[ 50 68]
[122 167]]

Element-wise Multiplication (A * B):


[[ 7 16 27]
[40 55 72]]

Matrix Transposition (Transpose of A):


[[1 4]
[2 5]
[3 6]]

3D Array Operations
3D Array:
[[[ 1 2 3]
[ 4 5 6]]
[[ 7 8 9]
[10 11 12]]
[[13 14 15]
[16 17 18]]]

Sum along axis 0:


[[21 24 27]
[30 33 36]]

Sum along axis 1:


[[ 5 7 9]
[17 19 21]
[29 31 33]]

Sum along axis 2:


[[ 6 15]
[24 33]
[42 51]]
Result:
Thus the Python program using NumPy package functions and a 2D or 3D array to perform simple matrix operations has been successfully verified.
3. To combine NumPy and Pandas DataFrames to create a dataset and perform the following:
A) Color variation for each column of data

Aim:
To combine NumPy and Pandas DataFrames to create a dataset and apply a color variation to each column of data.

Algorithm:
1. Generate a dataset using NumPy.
2. Convert the dataset to a Pandas DataFrame.
3. Apply a color variation based on the values (see the Styler sketch below).
4. Display the styled DataFrame, or plots of each column, with color variation.
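A minimal sketch of the styled-DataFrame variant mentioned in steps 3 and 4, assuming any numeric DataFrame df (the program below instead visualizes each column as a differently colored histogram):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
# Apply a per-column color gradient; this renders in HTML-capable environments such as Jupyter
styled = df.style.background_gradient(cmap='Blues', axis=0)
styled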

Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Step 1: Create a Dataset using NumPy and Pandas
# Generate random data using NumPy
np.random.seed(0) # For reproducibility
data = {
'Age': np.random.randint(20, 60, size=100),
'Income': np.random.randint(30000, 120000, size=100),
'Expenses': np.random.randint(10000, 50000, size=100),
'Savings': np.random.randint(5000, 20000, size=100)
}

# Create a DataFrame from the generated data


df = pd.DataFrame(data)

# Display the first few rows of the DataFrame


print("DataFrame:")
print(df.head())

# Step 2: Visualize Data with Color Variations

# Set up the matplotlib figure


plt.figure(figsize=(12, 10))

# Plot histograms with color variations for each column


for i, column in enumerate(df.columns):
    plt.subplot(2, 2, i + 1)
    sns.histplot(df[column], kde=True,
                 color=sns.color_palette("husl", len(df.columns))[i])
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

OUTPUT:
DataFrame:
Age Income Expenses Savings
0 20 54777 43391 11797
1 23 43824 42232 10637
2 23 32418 18962 13448
3 59 42843 30435 16400
4 29 108778 44009 16471

Result:
Thus the Python program combining NumPy and Pandas DataFrames to create a dataset and apply a color variation to each column of data has been successfully verified.
3. To combine NumPy and Pandas DataFrames to create a dataset and perform the following:
B) Highlight max and min values with output

Aim:
To combine NumPy and Pandas DataFrames to create a dataset and highlight the max and min values in the output.

Algorithm:
1. Generate a dataset using NumPy.
2. Convert the dataset to a Pandas DataFrame.
3. Define and apply a function to highlight the maximum and minimum values.
4. Display the output with highlighted max and min values.

PROGRAM:
import numpy as np
import pandas as pd

# Step 1: Create a NumPy array


np.random.seed(42) # For reproducible results
data = np.random.randint(1, 100, size=(10, 5)) # Create an array of random
integers

# Step 2: Convert the NumPy array into a Pandas DataFrame


columns = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame


print("Original DataFrame:")
print(df)

# Step 3: Highlight the maximum and minimum values


def highlight_max_min(s):
    is_max = s == s.max()
    is_min = s == s.min()
    # Yellow for the column maximum, light blue for the minimum
    return ['background-color: yellow' if v
            else 'background-color: lightblue' if m
            else '' for v, m in zip(is_max, is_min)]

# Apply the function to the DataFrame


styled_df = df.style.apply(highlight_max_min, axis=0)

# Display the styled DataFrame


styled_df
OUTPUT:
Original DataFrame:
A B C D E
0 52 93 15 72 61
1 21 83 87 75 75
2 88 24 3 22 53
3 2 88 30 38 2
4 64 60 21 33 76
5 58 22 89 49 91
6 59 42 92 60 80
7 15 62 62 47 62
8 51 55 64 3 51
9 7 21 73 39 18
[Styled DataFrame output: the same table with each column's maximum highlighted in yellow and its minimum in light blue]

Result:
Thus the Python program to combine NumPy and Pandas DataFrames to create a dataset and highlight the max and min values in the output has been successfully verified.
3. To combine NumPy and Pandas DataFrames to create a dataset and perform the following:
C) Generate a background gradient color variation

Aim:
To combine NumPy and Pandas DataFrames to create a dataset and generate a background gradient color variation.

Algorithm:
1. Generate a dataset using NumPy.
2. Convert the dataset to a Pandas DataFrame.
3. Apply a background gradient color variation.
4. Background gradient: The background_gradient function from Pandas is used to apply color gradients based on the values in each column.
5. Colormap: The cmap argument specifies the color map to be used. In this example, we use the viridis colormap, which ranges from dark blue for low values to yellow for high values. Other colormaps (like 'coolwarm', 'plasma', etc.) can also be used.
6. Scaling: Each column's values are scaled between the minimum and maximum values of that column, and a gradient is applied based on this range.
PROGRAM:
import numpy as np
import pandas as pd

# Step 1: Create a NumPy array


np.random.seed(42) # For reproducible results
data = np.random.randint(1, 100, size=(10, 5)) # Create an array of random
integers

# Step 2: Convert the NumPy array into a Pandas DataFrame


columns = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame


print("Original DataFrame:")
print(df)

# Step 3: Generate a background gradient color variation


styled_df = df.style.background_gradient(cmap='viridis')
# Display the styled DataFrame with gradient
styled_df

OUTPUT:
Original DataFrame:
A B C D E
0 52 93 15 72 61
1 21 83 87 75 75
2 88 24 3 22 53
3 2 88 30 38 2
4 64 60 21 33 76
5 58 22 89 49 91
6 59 42 92 60 80
7 15 62 62 47 62
8 51 55 64 3 51
9 7 21 73 39 18

[Styled DataFrame output: the same table with a column-wise viridis background gradient, from dark blue (low values) to yellow (high values)]

Result:
Thus the Python program to combine NumPy and Pandas DataFrames to create a dataset and generate a background gradient color variation has been successfully verified.
4. Explore a multivariable dataset, perform any four data cleaning methods, and visualize a bar chart.

Aim:
To explore a multivariable dataset, perform any four data cleaning methods, and visualize a bar chart.

Algorithm:
1. Input: A multivariable dataset (DataFrame).
2. Handling missing data:
For each numerical column, fill missing values with the median.
For each categorical column, fill missing values with the mode.
3. Remove duplicates:
Check for and drop duplicate rows.
4. Convert data types: Ensure appropriate data types for each column (e.g., convert salary to integer).
5. Handle outliers:
Define outlier thresholds and cap values as necessary (see the sketch after this list).
6. Visualization: Group the data by a relevant variable (e.g., city) and plot a bar chart to compare the results (e.g., average salary by city).
Visualize the data: Create a bar chart to visualize some aspect of the cleaned data.
Use this algorithm to structure the workflow for cleaning and visualization.
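Step 5 (outlier handling) is not implemented in the program below; a minimal sketch of capping with the common 1.5*IQR rule, assuming a numeric column name col in a DataFrame df:

import pandas as pd

def cap_outliers_iqr(df: pd.DataFrame, col: str) -> pd.DataFrame:
    # Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return df

For example, cap_outliers_iqr(df, 'fare') would cap the extreme fares in the Titanic dataset used below.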
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset


# For this example, let's use a built-in dataset from seaborn
df = sns.load_dataset('titanic')

# Display the first few rows of the dataset


print("First few rows of the dataset:")
print(df.head())

# 1. Handle missing values - Fill missing 'age' values with the median
df['age'] = df['age'].fillna(df['age'].median())

# 2. Drop columns with too many missing values


# Let's say we drop columns with more than 500 missing values
df.dropna(thresh=len(df) - 500, axis=1, inplace=True)

# 3. Remove duplicates
df.drop_duplicates(inplace=True)

# 4. Convert categorical variables into dummy/indicator variables


df = pd.get_dummies(df, drop_first=True)

# Display the cleaned dataset


print("\nCleaned Dataset:")
print(df.head())
# Visualization: Bar chart of the number of survivors by class
sns.set(style="whitegrid")
plt.figure(figsize=(8, 6))
sns.barplot(x='class', y='survived', data=sns.load_dataset('titanic'))
plt.title('Survival Rate by Class')
plt.ylabel('Survival Rate')
plt.xlabel('Class')
plt.show()

OUTPUT:

First few rows of the dataset:


survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third

who adult_male deck embark_town alive alone


0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True

Cleaned Dataset:
survived pclass age sibsp parch fare adult_male alone sex_male
\
0 0 3 22.0 1 0 7.2500 True False True
1 1 1 38.0 1 0 71.2833 False False False
2 1 3 26.0 0 0 7.9250 False True False
3 1 1 35.0 1 0 53.1000 False False False
4 0 3 35.0 0 0 8.0500 True True True

embarked_Q embarked_S class_Second class_Third who_man who_woman \


0 False True False True True False
1 False False False False False True
2 False True False True False True
3 False True False False False True
4 False True False True True False

embark_town_Queenstown embark_town_Southampton alive_yes


0 False True False
1 False False True
2 False True True
3 False True True
4 False True False
Result:
Thus the Python program to explore a multivariable dataset, perform four data cleaning methods, and visualize a bar chart has been successfully verified.
5. Explore using seaborn to load a dataset with three variables (Username, Tweet, Location) of tweet comments for the #Jallikattu Protest tag.
A) Perform a scatter plot of tweets from different locations

Aim:
To explore using seaborn to load a dataset with three variables (Username, Tweet, Location) of tweet comments for the #Jallikattu Protest tag and perform a scatter plot of tweets from different locations.

Algorithm:
1. Input: A dataset containing tweet data (Username, Tweet, Location) for the #JallikattuProtest.
2. Preprocessing: Use value_counts() on the Location column to calculate the count of tweets for each location.
3. Scatter plot creation: Plot the number of tweets on the y-axis and the corresponding locations on the x-axis.
4. Customization: Customize the plot with labels, colors, and marker size to enhance visualization.
5. Output: Display a scatter plot that shows the distribution of tweets from different locations.

PROGRAM:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Simulate the Dataset


data = {
'Username': ['user1', 'user2', 'user3', 'user4', 'user5', 'user6', 'user7',
'user8'],
'Tweet': [
'We support #Jallikattu at @Marina',
'#SaveTNfarmers protest @Chennai',
'#Jallikattu is our right! #TamilNadu',
'Proud of #Jallikattu culture @SaveTNfarmers',
'#SaveTNfarmers and #Jallikattu go hand in hand',
'@Marina #TamilNadu #Jallikattu',
'Protect our culture #Jallikattu #TamilNadu',
'#Jallikattu @Chennai #TamilNadu'],
'Location': ['Chennai', 'Madurai', 'Coimbatore', 'Chennai', 'Madurai',
'Trichy', 'Coimbatore', 'Chennai']}
df = pd.DataFrame(data)
# Display the DataFrame
print("Tweet DataFrame:")
print(df)
# Step 2: Perform a scatterplot using different location tweets
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Username', y='Location', data=df, hue='Location', s=100)
# Add title and labels
plt.title('Scatterplot of Tweets by Location')
plt.xlabel('Username')
plt.ylabel('Location')
plt.show()

OUTPUT
Tweet DataFrame:
Username Tweet Location
0 user1 We support #Jallikattu at @Marina Chennai
1 user2 #SaveTNfarmers protest @Chennai Madurai
2 user3 #Jallikattu is our right! #TamilNadu Coimbatore
3 user4 Proud of #Jallikattu culture @SaveTNfarmers Chennai
4 user5 #SaveTNfarmers and #Jallikattu go hand in hand Madurai
5 user6 @Marina #TamilNadu #Jallikattu Trichy
6 user7 Protect our culture #Jallikattu #TamilNadu Coimbatore
7 user8 #Jallikattu @Chennai #TamilNadu Chennai

Result:
Thus the Python program to explore, using seaborn, a dataset with three variables (Username, Tweet, Location) of tweet comments for the #Jallikattu Protest tag and perform a scatter plot of tweets by location has been successfully verified.
5. Explore using seaborn to load a dataset with three variables (Username, Tweet, Location) of tweet comments for the #Jallikattu Protest tag.
B) Perform a bubble chart for # and @ tags

Aim:
To explore using seaborn to load a dataset with three variables (Username, Tweet, Location) of tweet comments for the #Jallikattu Protest tag and perform a bubble chart for # and @ tags.

Algorithm:
1. Input: A dataset containing tweet data (Username, Tweet, Location) for the #JallikattuProtest, with hashtags and mentions in the Tweet column.
2. Extract tags: Define functions to extract hashtags (#) and mentions (@) from each tweet using regular expressions.
3. Count occurrences: For both hashtags and mentions, count their occurrences and store the frequency.
4. Visualize with a bubble chart:
Use a scatter plot (e.g., Matplotlib's scatter) to draw a bubble chart, where:
The x-axis represents unique hashtags or mentions.
The size of each bubble represents the frequency (i.e., the count) of that hashtag or mention.
Customize the chart with labels, titles, and bubble sizes.
PROGRAM
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from collections import Counter

# Step 1: Simulate the Dataset


data = {
'Username': ['user1', 'user2', 'user3', 'user4', 'user5', 'user6', 'user7',
'user8'],
'Tweet': [
'We support #Jallikattu at @Marina',
'#SaveTNfarmers protest @Chennai',
'#Jallikattu is our right! #TamilNadu',
'Proud of #Jallikattu culture @SaveTNfarmers',
'#SaveTNfarmers and #Jallikattu go hand in hand',
'@Marina #TamilNadu #Jallikattu',
'Protect our culture #Jallikattu #TamilNadu',
'#Jallikattu @Chennai #TamilNadu'
],
'Location': ['Chennai', 'Madurai', 'Coimbatore', 'Chennai', 'Madurai',
'Trichy', 'Coimbatore', 'Chennai']
}

df = pd.DataFrame(data)

# Display the DataFrame


print("Tweet DataFrame:")
print(df)
# Step 2: Extract hashtags and mentions
hashtags = []
mentions = []

for tweet in df['Tweet']:


hashtags.extend(re.findall(r'#\w+', tweet))
mentions.extend(re.findall(r'@\w+', tweet))

# Count occurrences of hashtags and mentions


hashtag_counts = Counter(hashtags)
mention_counts = Counter(mentions)

# Combine hashtags and mentions into one DataFrame for visualization


tags = {**hashtag_counts, **mention_counts}
tag_data = pd.DataFrame(list(tags.items()), columns=['Tag', 'Count'])

# Step 3: Perform Bubble Chart for # and @ tag


plt.figure(figsize=(12, 8))
plt.scatter(tag_data['Tag'], tag_data['Count'], s=tag_data['Count']*500,
alpha=0.6, edgecolors="w", linewidth=2)

# Add titles and labels


plt.title('Bubble Chart for Hashtags and Mentions')
plt.xlabel('Tag')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
OUTPUT:
Tweet DataFrame:

Username Tweet Location


0 user1 We support #Jallikattu at @Marina Chennai
1 user2 #SaveTNfarmers protest @Chennai Madurai
2 user3 #Jallikattu is our right! #TamilNadu Coimbatore
3 user4 Proud of #Jallikattu culture @SaveTNfarmers Chennai
4 user5 #SaveTNfarmers and #Jallikattu go hand in hand Madurai
5 user6 @Marina #TamilNadu #Jallikattu Trichy
6 user7 Protect our culture #Jallikattu #TamilNadu Coimbatore
7 user8 #Jallikattu @Chennai #TamilNadu Chennai
Result:
Thus the Python program to explore, using seaborn, a dataset with three variables (Username, Tweet, Location) of tweet comments for the #Jallikattu Protest tag and perform a bubble chart for # and @ tags has been successfully verified.
6. Create a pie chart for student result analysis by using a pie plot in Python. Plot segregation will be Distinction (greater than or equal to 8.5 CGPA) and First Class (greater than 6.5 CGPA).

AIM:
To create a pie chart for student result analysis by using a pie plot in Python, with segregation into Distinction (CGPA >= 8.5) and First Class (CGPA > 6.5).

Algorithm:
1. Input: A dataset of students with CGPA scores.
2. Categorization:
Classify students into "Distinction" if their CGPA is greater than or equal to 8.5.
Classify students into "First Class" if their CGPA is greater than 6.5 but less than 8.5.
(An alternative categorization sketch follows this list.)
3. Count categories:
Count the number of students in each category (Distinction and First Class).
4. Plotting:
Use Matplotlib's pie() function to create a pie chart with slices for each category.
Customize the chart with labels, colors, and percentage annotations.
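An alternative sketch of step 2 using numpy.select instead of pd.cut, assuming the same df with a CGPA column as in the program below:

import numpy as np
conditions = [
    df['CGPA'] >= 8.5,
    (df['CGPA'] > 6.5) & (df['CGPA'] < 8.5)
]
categories = ['Distinction', 'First Class']
# np.select picks the first matching label for each row
df['Category'] = np.select(conditions, categories, default='Other')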
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Create a simulated dataset


data = {
'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace',
'Hannah', 'Ivy', 'Jack'],
'CGPA': [9.0, 7.5, 8.7, 6.8, 8.0, 7.0, 9.1, 8.2, 8.9, 6.7]
}

df = pd.DataFrame(data)

# Step 2: Categorize students based on CGPA
# bins=[6.5, 8.5, 10] with right=False gives [6.5, 8.5) -> First Class
# and [8.5, 10) -> Distinction
df['Category'] = pd.cut(df['CGPA'], bins=[6.5, 8.5, 10],
                        labels=['First Class', 'Distinction'], right=False)

# Step 3: Calculate the count for each category


category_counts = df['Category'].value_counts()

# Step 4: Plot the pie chart


plt.figure(figsize=(8, 8))
plt.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%',
startangle=140, colors=['#66b3ff','#99ff99'])
plt.title('Student Result Analysis')
plt.show()
OUTPUT:
[Pie chart: Student Result Analysis, showing the Distinction and First Class percentages]
Result:
Thus the Python program to create a pie chart for student result analysis with segregation into Distinction and First Class has been successfully verified.
7. Create a lollipop chart for a festival shopping dataset of your own (20 rows and 5 to 10 columns).

Aim:
To create a lollipop chart for a festival shopping dataset of your own.

Algorithm:
1. Input: A dataset with festival shopping details, including total sales per item.
2. Sort data (optional): Sort the dataset by total sales to make the chart visually organized.
3. Create the lollipop chart: Use Matplotlib's stem() to plot items on the x-axis and total sales on the y-axis.
Customize marker size, colors, and line format for better visualization.
4. Customize the chart: Add axis labels, rotate the x-axis labels for readability, and set a chart title.
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Create a simulated dataset (20 rows, 5 columns)


np.random.seed(42)
data = {
'Item': [f'Item {i+1}' for i in range(20)],
'Price': np.random.randint(100, 1000, size=20),
'Quantity': np.random.randint(1, 10, size=20),
'Discount (%)': np.random.randint(0, 30, size=20),
'Total Sales': np.random.randint(500, 5000, size=20)
}

df = pd.DataFrame(data)
# Display the DataFrame
print("Festival Shopping Dataset:")
print(df)
# Step 2: Sort the dataset based on 'Total Sales' for better visualization
df_sorted = df.sort_values(by='Total Sales', ascending=False)

# Step 3: Create the lollipop chart


plt.figure(figsize=(12, 8))
# Note: the old use_line_collection=True argument was removed in Matplotlib 3.8
plt.stem(df_sorted['Item'], df_sorted['Total Sales'], basefmt=" ")
# Customize the plot
plt.title('Lollipop Chart for Festival Shopping - Total Sales per Item')
plt.xlabel('Item')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.grid(True)
# Show the plot
plt.show()
OUTPUT:
Festival Shopping Dataset:
Item Price Quantity Discount (%) Total Sales
0 Item 1 202 6 8 1275
1 Item 2 535 5 6 4514
2 Item 3 960 2 17 534
3 Item 4 370 8 3 3652
4 Item 5 206 6 24 2455
5 Item 6 171 2 27 2085
6 Item 7 800 5 13 4443
7 Item 8 120 1 17 3573
8 Item 9 714 6 25 1521
9 Item 10 221 9 8 3961
10 Item 11 566 1 25 3113
11 Item 12 314 3 20 4343
12 Item 13 430 7 1 2000
13 Item 14 558 4 19 661
14 Item 15 187 9 27 4797
15 Item 16 472 3 14 2481
16 Item 17 199 5 27 1495
17 Item 18 971 3 6 3842
18 Item 19 763 7 11 4298
19 Item 20 230 5 28 1775

Result:
Thus the Python program to create a lollipop chart for a festival shopping dataset of your own (20 rows and 5 to 10 columns) has been successfully verified.
8. To perform the following data transformation techniques on your own dataset (20 rows and 5 columns).
A) Removing null values (NaN)

Aim:
To perform the following data transformation technique on your own dataset: removing null values (NaN).

Algorithm:
1.Input: A dataset with potential null values (NaN).
2.Identify Null Values:Use pd.DataFrame.isnull() or pd.DataFrame.isna() to identify
null values in the dataset.
3.Remove Null Values:Use pd.DataFrame.dropna() to remove:
Rows with any null values by default.
Columns with any null values by specifying axis=1.
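A minimal sketch of the column-wise variant from step 3, assuming a DataFrame df (the program below drops rows instead):

# Drop every column that contains at least one NaN
df_no_nan_cols = df.dropna(axis=1)
# Or keep only columns with at least 15 non-null values
df_thresh = df.dropna(axis=1, thresh=15)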

PROGRAM:
import pandas as pd
import numpy as np

# Create a DataFrame with 20 rows and 5 columns, with some NaN values
data = {
'A': np.random.randint(1, 100, 20),
'B': np.random.randint(1, 100, 20),
'C': np.random.choice([np.nan, 50, 60, 70], 20),
'D': np.random.choice([np.nan, 80, 90], 20),
'E': np.random.randint(1, 100, 20)
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Remove rows with any NaN values


df_cleaned = df.dropna()

print("\nDataFrame after removing rows with NaN values:")


print(df_cleaned)

OUTPUT:
Original DataFrame:
A B C D E
0 62 8 50.0 80.0 43
1 40 88 70.0 90.0 29
2 85 63 NaN 90.0 36
3 80 11 60.0 NaN 13
4 82 81 70.0 90.0 32
5 53 8 70.0 90.0 71
6 24 35 50.0 80.0 59
7 26 35 60.0 80.0 86
8 89 33 60.0 NaN 28
9 60 5 NaN 90.0 66
10 41 41 60.0 90.0 42
11 29 28 NaN 90.0 45
12 15 7 60.0 NaN 62
13 45 73 50.0 NaN 57
14 65 72 60.0 80.0 6
15 89 12 NaN NaN 28
16 71 34 NaN 90.0 28
17 9 33 50.0 90.0 44
18 88 48 60.0 NaN 84
19 1 23 60.0 90.0 30

DataFrame after removing rows with NaN values:


A B C D E
0 62 8 50.0 80.0 43
1 40 88 70.0 90.0 29
4 82 81 70.0 90.0 32
5 53 8 70.0 90.0 71
6 24 35 50.0 80.0 59
7 26 35 60.0 80.0 86
10 41 41 60.0 90.0 42
14 65 72 60.0 80.0 6
17 9 33 50.0 90.0 44
19 1 23 60.0 90.0 30

Result:
Thus the Python program to perform the data transformation technique on your own dataset (20 rows and 5 columns), A) removing null values (NaN), has been successfully verified.
8. Write a Python program to perform the following data transformation techniques on your own dataset (20 rows and 5 columns).
B) Drop columns

Aim:
To write a Python program to perform the following data transformation technique on your own dataset (20 rows and 5 columns): drop columns.

Algorithm:
1.Input: A dataset (DataFrame) and a list of column names to drop.
2.Identify Columns: Determine which columns need to be removed based on business rules or
analysis requirements.
3.Drop Columns:Use pd.DataFrame.drop(columns=<list_of_columns>) to remove
specified columns from the dataset.
PROGRAM:
import pandas as pd
import numpy as np

# Create a DataFrame with 20 rows and 5 columns


data = {
'A': np.random.randint(1, 100, 20),
'B': np.random.randint(1, 100, 20),
'C': np.random.randint(1, 100, 20),
'D': np.random.randint(1, 100, 20),
'E': np.random.randint(1, 100, 20)
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop columns 'B' and 'D'


df_dropped = df.drop(columns=['B', 'D'])
print("\nDataFrame after dropping columns 'B' and 'D':")
print(df_dropped)
OUTPUT:
Original DataFrame:
A B C D E
0 62 24 54 69 20
1 75 79 87 98 96
2 92 59 96 70 71
3 89 32 97 86 52
4 62 96 1 11 33
5 97 88 19 16 40
6 1 52 2 97 39
7 27 62 53 73 82
8 62 58 44 59 1
9 77 52 90 70 11
10 3 12 32 80 92
11 70 39 70 93 57
12 72 2 32 3 89
13 27 3 68 20 50
14 9 56 55 59 23
15 62 81 75 36 31
16 37 59 56 19 94
17 97 2 17 90 42
18 51 2 38 67 99
19 44 92 24 19 7

DataFrame after dropping columns 'B' and 'D':


A C E
0 62 54 20
1 75 87 96
2 92 96 71
3 89 97 52
4 62 1 33
5 97 19 40
6 1 2 39
7 27 53 82
8 62 44 1
9 77 90 11
10 3 32 92
11 70 70 57
12 72 32 89
13 27 68 50
14 9 55 23
15 62 75 31
16 37 56 94
17 97 17 42
18 51 38 99
19 44 24 7

Result:
Thus the Python program to perform the data transformation technique on your own dataset (20 rows and 5 columns), B) drop columns, has been successfully verified.
8. To perform the following data transformation techniques on your own dataset (20 rows and 5 columns).
C) Merging database-style DataFrames

Aim:
To perform the following data transformation technique on your own dataset (20 rows and 5 columns): merging database-style DataFrames.

Algorithm:
1.Input: Two datasets (DataFrames) and a common key (column) for merging.
2.Identify Merge Key: Determine the column(s) that will be used as the join key.
3.Choose Merge Type:
Inner Join: how='inner' (only matching rows).
Left Join: how='left' (all rows from the left DataFrame).
Right Join: how='right' (all rows from the right DataFrame).
Outer Join: how='outer' (all rows from both DataFrames).
4.MergeDataFrames:Use pd.merge(df1, df2, on='common_column', how='join_type')
to perform the merge.
5.Output:The merged DataFrame containing combined data based on the specified join type.
PROGRAM:
import pandas as pd
import numpy as np

# Create the first DataFrame


data1 = {
'ID': np.arange(1, 21),
'Name': [f'Name{i}' for i in range(1, 21)],
'Age': np.random.randint(20, 50, 20)
}
df1 = pd.DataFrame(data1)

# Create the second DataFrame


data2 = {
'ID': np.arange(1, 16), # Note: Only 15 IDs to simulate some missing data
'Salary': np.random.randint(30000, 70000, 15),
'Department': [f'Department{i % 3}' for i in range(15)]
}
df2 = pd.DataFrame(data2)

print("DataFrame 1:")
print(df1)

print("\nDataFrame 2:")
print(df2)

# Merge DataFrames using an inner join on 'ID'


df_inner = pd.merge(df1, df2, on='ID', how='inner')

# Merge DataFrames using a left join on 'ID'


df_left = pd.merge(df1, df2, on='ID', how='left')
# Merge DataFrames using a right join on 'ID'
df_right = pd.merge(df1, df2, on='ID', how='right')

# Merge DataFrames using an outer join on 'ID'


df_outer = pd.merge(df1, df2, on='ID', how='outer')

print("\nInner Join (only matching IDs):")


print(df_inner)

print("\nLeft Join (all IDs from df1):")


print(df_left)

print("\nRight Join (all IDs from df2):")


print(df_right)

print("\nOuter Join (all IDs from both DataFrames):")


print(df_outer)

OUTPUT:
DataFrame 1:
ID Name Age
0 1 Name1 47
1 2 Name2 35
2 3 Name3 45
3 4 Name4 35
4 5 Name5 44
5 6 Name6 39
6 7 Name7 47
7 8 Name8 36
8 9 Name9 21
9 10 Name10 20
10 11 Name11 35
11 12 Name12 49
12 13 Name13 31
13 14 Name14 24
14 15 Name15 24
15 16 Name16 46
16 17 Name17 42
17 18 Name18 28
18 19 Name19 28
19 20 Name20 22

DataFrame 2:
ID Salary Department
0 1 45151 Department0
1 2 31154 Department1
2 3 34499 Department2
3 4 36295 Department0
4 5 42183 Department1
5 6 59299 Department2
6 7 42874 Department0
7 8 62711 Department1
8 9 35539 Department2
9 10 32557 Department0
10 11 68360 Department1
11 12 46482 Department2
12 13 32200 Department0
13 14 32961 Department1
14 15 51357 Department2

Inner Join (only matching IDs):


ID Name Age Salary Department
0 1 Name1 47 45151 Department0
1 2 Name2 35 31154 Department1
2 3 Name3 45 34499 Department2
3 4 Name4 35 36295 Department0
4 5 Name5 44 42183 Department1
5 6 Name6 39 59299 Department2
6 7 Name7 47 42874 Department0
7 8 Name8 36 62711 Department1
8 9 Name9 21 35539 Department2
9 10 Name10 20 32557 Department0
10 11 Name11 35 68360 Department1
11 12 Name12 49 46482 Department2
12 13 Name13 31 32200 Department0
13 14 Name14 24 32961 Department1
14 15 Name15 24 51357 Department2

Left Join (all IDs from df1):


ID Name Age Salary Department
0 1 Name1 47 45151.0 Department0
1 2 Name2 35 31154.0 Department1
2 3 Name3 45 34499.0 Department2
3 4 Name4 35 36295.0 Department0
4 5 Name5 44 42183.0 Department1
5 6 Name6 39 59299.0 Department2
6 7 Name7 47 42874.0 Department0
7 8 Name8 36 62711.0 Department1
8 9 Name9 21 35539.0 Department2
9 10 Name10 20 32557.0 Department0
10 11 Name11 35 68360.0 Department1
11 12 Name12 49 46482.0 Department2
12 13 Name13 31 32200.0 Department0
13 14 Name14 24 32961.0 Department1
14 15 Name15 24 51357.0 Department2
15 16 Name16 46 NaN NaN
16 17 Name17 42 NaN NaN
17 18 Name18 28 NaN NaN
18 19 Name19 28 NaN NaN
19 20 Name20 22 NaN NaN

Right Join (all IDs from df2):


ID Name Age Salary Department
0 1 Name1 47 45151 Department0
1 2 Name2 35 31154 Department1
2 3 Name3 45 34499 Department2
3 4 Name4 35 36295 Department0
4 5 Name5 44 42183 Department1
5 6 Name6 39 59299 Department2
6 7 Name7 47 42874 Department0
7 8 Name8 36 62711 Department1
8 9 Name9 21 35539 Department2
9 10 Name10 20 32557 Department0
10 11 Name11 35 68360 Department1
11 12 Name12 49 46482 Department2
12 13 Name13 31 32200 Department0
13 14 Name14 24 32961 Department1
14 15 Name15 24 51357 Department2

Outer Join (all IDs from both DataFrames):


ID Name Age Salary Department
0 1 Name1 47 45151.0 Department0
1 2 Name2 35 31154.0 Department1
2 3 Name3 45 34499.0 Department2
3 4 Name4 35 36295.0 Department0
4 5 Name5 44 42183.0 Department1
5 6 Name6 39 59299.0 Department2
6 7 Name7 47 42874.0 Department0
7 8 Name8 36 62711.0 Department1
8 9 Name9 21 35539.0 Department2
9 10 Name10 20 32557.0 Department0
10 11 Name11 35 68360.0 Department1
11 12 Name12 49 46482.0 Department2
12 13 Name13 31 32200.0 Department0
13 14 Name14 24 32961.0 Department1
14 15 Name15 24 51357.0 Department2
15 16 Name16 46 NaN NaN
16 17 Name17 42 NaN NaN
17 18 Name18 28 NaN NaN
18 19 Name19 28 NaN NaN
19 20 Name20 22 NaN NaN

Result:
Thus the Python program to perform the data transformation technique on your own dataset (20 rows and 5 columns), C) merging database-style DataFrames, has been successfully verified.
9. To perform the DataFrame merge function (inner, left and outer join) using a simple dataset.

Aim:
To perform the DataFrame merge function (inner, left and outer join) using a simple dataset.

Algorithm:
1. Input: Two datasets (DataFrames) and a common key (column) for merging.
2. Identify merge key: Determine the column(s) to be used as the join key.
3. Choose merge type:
Inner join: how='inner' (only matching rows from both DataFrames).
Left join: how='left' (all rows from the left DataFrame).
Outer join: how='outer' (all rows from both DataFrames).
4. Merge DataFrames: Use pd.merge(df1, df2, on='common_column', how='join_type') to perform the merge.
5. Output: The merged DataFrame containing combined data based on the specified join type.
PROGRAM:
import pandas as pd

# Create the first DataFrame


data1 = {
'ID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45]
}
df1 = pd.DataFrame(data1)

# Create the second DataFrame


data2 = {
'ID': [3, 4, 5, 6, 7],
'Salary': [70000, 80000, 90000, 100000, 110000],
'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales']
}
df2 = pd.DataFrame(data2)
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
# Inner Join: Only matching rows in both DataFrames
df_inner = pd.merge(df1, df2, on='ID', how='inner')
# Left Join: All rows from df1 and matching rows from df2
df_left = pd.merge(df1, df2, on='ID', how='left')
# Outer Join: All rows from both DataFrames
df_outer = pd.merge(df1, df2, on='ID', how='outer')
print("\nInner Join (only matching IDs):")
print(df_inner)
print("\nLeft Join (all IDs from df1):")
print(df_left)
print("\nOuter Join (all IDs from both DataFrames):")
print(df_outer)
OUTPUT:
DataFrame 1:
ID Name Age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie 35
3 4 David 40
4 5 Eva 45

DataFrame 2:
ID Salary Department
0 3 70000 HR
1 4 80000 IT
2 5 90000 Finance
3 6 100000 Marketing
4 7 110000 Sales

Inner Join (only matching IDs):


ID Name Age Salary Department
0 3 Charlie 35 70000 HR
1 4 David 40 80000 IT
2 5 Eva 45 90000 Finance

Left Join (all IDs from df1):


ID Name Age Salary Department
0 1 Alice 25 NaN NaN
1 2 Bob 30 NaN NaN
2 3 Charlie 35 70000.0 HR
3 4 David 40 80000.0 IT
4 5 Eva 45 90000.0 Finance

Outer Join (all IDs from both DataFrames):


ID Name Age Salary Department
0 1 Alice 25.0 NaN NaN
1 2 Bob 30.0 NaN NaN
2 3 Charlie 35.0 70000.0 HR
3 4 David 40.0 80000.0 IT
4 5 Eva 45.0 90000.0 Finance
5 6 NaN NaN 100000.0 Marketing
6 7 NaN NaN 110000.0 Sales

Result:
Thus the Python program to perform the DataFrame merge function (inner, left and outer join) using a simple dataset has been successfully verified.
10. Explore a simple dataset and perform transformation techniques such as data deduplication, replacing values, handling missing data, and backward and forward filling.

Aim:
To explore a simple dataset and perform transformation techniques such as data deduplication, replacing values, handling missing data, and backward and forward filling.

Algorithm:
1.DataDeduplication:
Input: A dataset (DataFrame).
Process: Use df.drop_duplicates() to remove duplicate rows.
Output: A DataFrame without duplicates.
2.Replace Values:
Input: A DataFrame and a dictionary of replacements.
Process: Use df.replace({column_name: {old_value: new_value}}) to replace
specific values.
Output: A DataFrame with replaced values.
3.Handling Missing Data:
Input: A DataFrame with missing values.
Process: Use df.dropna() to remove rows with missing values.
Output: A DataFrame without missing data.
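Step 2 describes df.replace with a nested mapping; the program below uses fillna for the same cleanup. A minimal sketch of the replace style, with illustrative values, assuming the same df:

# Replace a specific value in one column using a {column: {old: new}} mapping
df_renamed = df.replace({'Name': {'Bob': 'Robert'}})
# Replace a value across the whole DataFrame
df_adjusted = df.replace(60000, 65000)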
Program
import pandas as pd
import numpy as np

# Create a sample DataFrame


data = {
'ID': [1, 2, 2, 4, 5, 6, 7, 7, 9],
'Name': ['Alice', 'Bob', 'Bob', 'David', 'Eva', np.nan, 'George', 'George',
'Ivy'],
'Age': [25, 30, 30, 40, np.nan, 50, np.nan, 50, 60],
'Salary': [50000, 60000, 60000, np.nan, 70000, 80000, 80000, np.nan, 90000]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Data Deduplication: Remove duplicate rows


df_dedup = df.drop_duplicates()
print("\nDataFrame after Deduplication:")
print(df_dedup)

# Replace Values: Replace NaN with 'Unknown' in 'Name', 0 in 'Age' and 'Salary'
df_replaced = df.copy()
df_replaced['Name'] = df_replaced['Name'].fillna('Unknown')
df_replaced['Age'] = df_replaced['Age'].fillna(0)
df_replaced['Salary'] = df_replaced['Salary'].fillna(0)
print("\nDataFrame after Replacing Missing Values:")
print(df_replaced)

# Handling Missing Data: Drop rows with any missing values


df_dropped = df.dropna()
print("\nDataFrame after Dropping Rows with Missing Values:")
print(df_dropped)
# Backward Filling: Fill missing values with the next value in the column
# (fillna(method='bfill') is deprecated in recent pandas; bfill() is the current idiom)
df_filled_backward = df.bfill()
print("\nDataFrame after Backward Filling Missing Values:")
print(df_filled_backward)

# Forward Filling: Fill missing values with the previous value in the column
df_filled_forward = df.ffill()
print("\nDataFrame after Forward Filling Missing Values:")
print(df_filled_forward)
output:
Original DataFrame:
ID Name Age Salary
0 1 Alice 25.0 50000.0
1 2 Bob 30.0 60000.0
2 2 Bob 30.0 60000.0
3 4 David 40.0 NaN
4 5 Eva NaN 70000.0
5 6 NaN 50.0 80000.0
6 7 George NaN 80000.0
7 7 George 50.0 NaN
8 9 Ivy 60.0 90000.0

DataFrame after Deduplication:


ID Name Age Salary
0 1 Alice 25.0 50000.0
1 2 Bob 30.0 60000.0
3 4 David 40.0 NaN
4 5 Eva NaN 70000.0
5 6 NaN 50.0 80000.0
6 7 George NaN 80000.0
7 7 George 50.0 NaN
8 9 Ivy 60.0 90000.0

DataFrame after Replacing Missing Values:


ID Name Age Salary
0 1 Alice 25.0 50000.0
1 2 Bob 30.0 60000.0
2 2 Bob 30.0 60000.0
3 4 David 40.0 0.0
4 5 Eva 0.0 70000.0
5 6 Unknown 50.0 80000.0
6 7 George 0.0 80000.0
7 7 George 50.0 0.0
8 9 Ivy 60.0 90000.0

DataFrame after Dropping Rows with Missing Values:


ID Name Age Salary
0 1 Alice 25.0 50000.0
1 2 Bob 30.0 60000.0
2 2 Bob 30.0 60000.0
8 9 Ivy 60.0 90000.0

DataFrame after Backward Filling Missing Values:


ID Name Age Salary
0 1 Alice 25.0 50000.0
1 2 Bob 30.0 60000.0
2 2 Bob 30.0 60000.0
3 4 David 40.0 70000.0
4 5 Eva 50.0 70000.0
5 6 George 50.0 80000.0
6 7 George 50.0 80000.0
7 7 George 50.0 90000.0
8 9 Ivy 60.0 90000.0

DataFrame after Forward Filling Missing Values:


ID Name Age Salary
0 1 Alice 25.0 50000.0
1 2 Bob 30.0 60000.0
2 2 Bob 30.0 60000.0
3 4 David 40.0 60000.0
4 5 Eva 40.0 70000.0
5 6 Eva 50.0 80000.0
6 7 George 50.0 80000.0
7 7 George 50.0 80000.0
8 9 Ivy 60.0 90000.0

Result:
Thus the Python program to explore a simple dataset and perform transformation techniques such as data deduplication, replacing values, handling missing data, and backward and forward filling has been successfully verified.
11. To perform hypothesis testing using the stats library on your own dataset: explore the T-test.

Aim:
To perform hypothesis testing using the stats library on your own dataset and explore the T-test.

Algorithm:

Step 1: Define the null hypothesis (H0) and the alternative hypothesis (H1).
H0: μ1 = μ2 (The means of the two groups are equal)
H1: μ1 ≠ μ2 (The means of the two groups are not equal)
Step 2: Set the significance level (alpha). Commonly, alpha = 0.05.
Step 3: Calculate the T-statistic and P-value using:
t_statistic, p_value = stats.ttest_ind(sample1, sample2)
Step 4: Compare the P-value with alpha.
If P-value < alpha: Reject H0 (there is a significant difference)
If P-value ≥ alpha: Fail to reject H0 (no significant difference)
Output: Present T-statistic, P-value, and conclusion.
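The algorithm above uses the standard two-sample t-test, which assumes equal variances. When that assumption is doubtful, Welch's t-test is the usual variant; a minimal sketch with the same SciPy call, assuming two array-like samples sample1 and sample2 as in Step 3:

from scipy import stats
# equal_var=False switches ttest_ind to Welch's t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)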

Program:
# To install the scipy library
pip install scipy


import numpy as np
import pandas as pd
from scipy import stats

# Create a simple dataset


np.random.seed(0) # For reproducibility
# Sample data for two independent groups
group1 = np.random.normal(loc=50, scale=10, size=30) #Mean=50,Std Dev=10,n=30
group2 = np.random.normal(loc=55, scale=10, size=30)#Mean=55,Std Dev=10,n=30
# Convert to DataFrame for easier visualization
data = {
'Group1': group1,
'Group2': group2
}
df = pd.DataFrame(data)

print("Dataset:")
print(df.head())

# Perform T-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print("\nT-test Results:")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpret the results


alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two groups.")

output:
Dataset:
Group1 Group2
0 67.640523 56.549474
1 54.001572 58.781625
2 59.787380 46.122143
3 72.408932 35.192035
4 68.675580 51.520879

T-test Results:
T-statistic: 0.8897019207505096
P-value: 0.3773014533943507
Fail to reject the null hypothesis: There is no significant difference between
the two groups.

Result:
Thus the Python program to perform hypothesis testing using the stats library on your own dataset and explore the T-test has been successfully verified.
12. Explore and visualize your own dataset/data frame and perform numerical summaries and spread level.
A) Split floating values into two columns from a single variable
Aim:
To explore and visualize your own dataset/data frame and perform numerical summaries and spread level: split floating values into two columns from a single variable.

Algorithm:
Step 1: Generate a synthetic dataset with multiple columns, including at least one floating-point
column.
Step 2: Use descriptive statistics methods to get numerical summaries (mean, median, std, etc.).
Step 3: Split the floating-point column into two separate columns based on its value (e.g., integer and
decimal parts).
Step 4: Visualize the distributions of the columns using histograms or box plots.
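An alternative sketch of step 3 using numpy.modf, which splits an array into its fractional and integral parts in one call (assuming a float column Value in df, as in the program below):

import numpy as np
frac, integ = np.modf(df['Value'].to_numpy())
df['Decimal_Part'] = np.round(frac, 2)
df['Integer_Part'] = integ.astype(int)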

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample DataFrame


np.random.seed(42) # For reproducibility

# Generate sample data


data = {
    'Value': np.random.uniform(10.0, 100.0, size=20)  # Random floating values between 10 and 100
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Split the 'Value' column into 'Integer_Part' and 'Decimal_Part'


df['Integer_Part'] = df['Value'].astype(int)
df['Decimal_Part'] = (df['Value'] - df['Integer_Part']).round(2)

print("\nDataFrame with Split Columns:")


print(df)

# Visualize the data


plt.figure(figsize=(12, 6))

# Plot original values


plt.subplot(1, 2, 1)
sns.histplot(df['Value'], kde=True, color='blue')
plt.title('Distribution of Original Values')

# Plot integer and decimal parts


plt.subplot(1, 2, 2)
sns.histplot(df['Integer_Part'], kde=True, color='green', label='Integer Part')
sns.histplot(df['Decimal_Part'], kde=True, color='red', label='Decimal Part')
plt.title('Distribution of Integer and Decimal Parts')
plt.legend()

plt.tight_layout()
plt.show()

# Numerical Summaries
print("\nNumerical Summaries:")
print(df.describe())

# Spread Level: Calculate the range (max - min) for each part
range_integer = df['Integer_Part'].max() - df['Integer_Part'].min()
range_decimal = df['Decimal_Part'].max() - df['Decimal_Part'].min()

print("\nRange of Integer Part:", range_integer)


print("Range of Decimal Part:", range_decimal)

output:
Original DataFrame:
Value
0 43.708611
1 95.564288
2 75.879455
3 63.879264
4 24.041678
5 24.039507
6 15.227525
7 87.955853
8 64.100351
9 73.726532
10 11.852604
11 97.291887
12 84.919838
13 29.110520
14 26.364247
15 26.506406
16 37.381802
17 57.228079
18 48.875052
19 36.210623

DataFrame with Split Columns:


Value Integer_Part Decimal_Part
0 43.708611 43 0.71
1 95.564288 95 0.56
2 75.879455 75 0.88
3 63.879264 63 0.88
4 24.041678 24 0.04
5 24.039507 24 0.04
6 15.227525 15 0.23
7 87.955853 87 0.96
8 64.100351 64 0.10
9 73.726532 73 0.73
10 11.852604 11 0.85
11 97.291887 97 0.29
12 84.919838 84 0.92
13 29.110520 29 0.11
14 26.364247 26 0.36
15 26.506406 26 0.51
16 37.381802 37 0.38
17 57.228079 57 0.23
18 48.875052 48 0.88
19 36.210623 36 0.21

Numerical Summaries:
Value Integer_Part Decimal_Part
count 20.000000 20.000000 20.000000
mean 51.193206 50.700000 0.493500
std 27.691818 27.554921 0.331413
min 11.852604 11.000000 0.040000
25% 26.470866 26.000000 0.225000
50% 46.291831 45.500000 0.445000
75% 74.264763 73.500000 0.857500
max 97.291887 97.000000 0.960000

Range of Integer Part: 86


Range of Decimal Part: 0.9199999999999999

Result:
Thus the Python program to explore and visualize your own dataset/data frame and perform numerical summaries and spread level, splitting floating values into two columns from a single variable, has been successfully verified.
12. Explore and visualize your own dataset/data frame and perform numerical summaries and spread level.
B) Perform descriptive analysis
Aim:
To explore and visualize your own dataset/data frame and perform numerical summaries and spread level: perform descriptive analysis.
Algorithm:
Step 1: Generate a synthetic dataset with at least 20 rows and 3-5 columns, including both numerical
and categorical data.
Step 2: Use descriptive statistics methods (like describe(), mean(), median(), etc.) to summarize
numerical data.
Step 3: Visualize the distributions of numerical columns using histograms or box plots.
Step 4: Visualize relationships between variables using scatter plots or pair plots.
PROGRAM:
#1. Create a Sample Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample DataFrame


np.random.seed(42) # For reproducibility

data = {
    'Age': np.random.randint(18, 70, size=50),  # Random ages between 18 and 70
    'Salary': np.random.uniform(30000, 100000, size=50),  # Random salaries between 30k and 100k
    'Department': np.random.choice(['HR', 'Engineering', 'Marketing'], size=50)  # Random departments
}
df = pd.DataFrame(data)

print("Dataset:")
print(df.head())
#2. Perform Descriptive Analysis
# Descriptive statistics for numerical variables
numerical_summary = df.describe()
print("\nNumerical Summary:")
print(numerical_summary)

# Frequency count for categorical variables


categorical_summary = df['Department'].value_counts()
print("\nCategorical Summary:")
print(categorical_summary)

# Spread level: Range of numerical variables


range_age = df['Age'].max() - df['Age'].min()
range_salary = df['Salary'].max() - df['Salary'].min()
print("\nRange of Age:", range_age)
print("Range of Salary:", range_salary)
#3. Visualize the Data
# Visualizations
plt.figure(figsize=(14, 6))

# Histogram for Age


plt.subplot(1, 2, 1)
sns.histplot(df['Age'], kde=True, color='skyblue')
plt.title('Distribution of Age')

# Histogram for Salary


plt.subplot(1, 2, 2)
sns.histplot(df['Salary'], kde=True, color='salmon')
plt.title('Distribution of Salary')

plt.tight_layout()
plt.show()

# Box plots for numerical data by Department


plt.figure(figsize=(14, 6))

# Boxplot for Salary by Department


sns.boxplot(x='Department', y='Salary', data=df)
plt.title('Salary Distribution by Department')

plt.show()

# Pairplot for numerical features


sns.pairplot(df[['Age', 'Salary']])
plt.suptitle('Pairplot of Age and Salary', y=1.02)
plt.show()
OUTPUT:
Dataset:
Age Salary Department
0 56 97594.242315 Marketing
1 69 86587.814368 Engineering
2 46 51322.963842 HR
3 32 36837.047980 Marketing
4 60 77896.311856 Engineering

Numerical Summary:
Age Salary
count 50.00000 50.000000
mean 43.82000 63883.002511
std 15.05187 22118.144018
min 19.00000 30386.548199
25% 32.25000 43766.624704
50% 41.50000 62736.529135
75% 56.00000 84208.756689
max 69.00000 99082.085562
Categorical Summary:
Department
Marketing 23
HR 15
Engineering 12
Name: count, dtype: int64

Range of Age: 50
Range of Salary: 68695.53736338405
Result:
Thus the Python program to explore and visualize your own dataset/data frame and perform numerical summaries and spread level through descriptive analysis has been successfully verified.
12. Explore and visualize your own dataset/data frame and perform numerical summaries and spread level.
C) Perform a percentage table, both row and column.
Aim:
To explore and visualize your own dataset/data frame and perform numerical summaries and spread level: perform a percentage table for both rows and columns.
ALGORITHM:
Step 1: Generate a synthetic dataset with at least 20 rows and 3-5 columns, including categorical data.
Step 2: Perform numerical summaries to get counts of each category.
Step 3: Calculate the percentage for each category in both rows and columns (an alternative crosstab sketch follows this list).
Step 4: Visualize the distribution of categorical data.
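When the raw data has one row per observation (rather than the pre-aggregated counts used in the program below), pandas.crosstab can produce the same tables directly; a minimal sketch assuming columns Department and Age_Group in df:

import pandas as pd
# normalize='index' yields row percentages; normalize='columns' yields column percentages
row_pct = pd.crosstab(df['Department'], df['Age_Group'], normalize='index') * 100
col_pct = pd.crosstab(df['Department'], df['Age_Group'], normalize='columns') * 100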
Program:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Department': ['HR', 'HR', 'Engineering', 'Engineering', 'Marketing',
'Marketing'],
'Age_Group_18_25': [5, 10, 8, 12, 7, 6],
'Age_Group_26_35': [15, 20, 25, 30, 22, 18],
'Age_Group_36_50': [10, 5, 15, 10, 10, 20],
'Age_Group_51_70': [0, 0, 2, 3, 1, 4]
}
df = pd.DataFrame(data)
# Set 'Department' as the index
df.set_index('Department', inplace=True)
print("Original DataFrame:")
print(df)
# Calculate row and column percentages
row_percentages = df.div(df.sum(axis=1), axis=0) * 100
column_percentages = df.div(df.sum(axis=0), axis=1) * 100
print("\nRow Percentages:")
print(row_percentages)
print("\nColumn Percentages:")
print(column_percentages)
output:
Original DataFrame:
Age_Group_18_25 Age_Group_26_35 Age_Group_36_50 \
Department
HR 5 15 10
HR 10 20 5
Engineering 8 25 15
Engineering 12 30 10
Marketing 7 22 10
Marketing 6 18 20
Age_Group_51_70
Department
HR 0
HR 0
Engineering 2
Engineering 3
Marketing 1
Marketing 4

Row Percentages:
Age_Group_18_25 Age_Group_26_35 Age_Group_36_50 \
Department
HR 16.666667 50.000000 33.333333
HR 28.571429 57.142857 14.285714
Engineering 16.000000 50.000000 30.000000
Engineering 21.818182 54.545455 18.181818
Marketing 17.500000 55.000000 25.000000
Marketing 12.500000 37.500000 41.666667

Age_Group_51_70
Department
HR 0.000000
HR 0.000000
Engineering 4.000000
Engineering 5.454545
Marketing 2.500000
Marketing 8.333333

Column Percentages:
Age_Group_18_25 Age_Group_26_35 Age_Group_36_50 \
Department
HR 10.416667 11.538462 14.285714
HR 20.833333 15.384615 7.142857
Engineering 16.666667 19.230769 21.428571
Engineering 25.000000 23.076923 14.285714
Marketing 14.583333 16.923077 14.285714
Marketing 12.500000 13.846154 28.571429

Age_Group_51_70
Department
HR 0.0
HR 0.0
Engineering 20.0
Engineering 30.0
Marketing 10.0
Marketing 40.0

Result:
Thus the Python program to explore and visualize your own dataset/data frame and perform numerical summaries and spread level with row and column percentage tables has been successfully verified.
13. Perform time series analysis and apply various visualization methods to an internet traffic time dataset. (Create your own data with a minimum of 5 columns and 20 rows.)
Aim:
To perform time series analysis and apply various visualization methods to an internet traffic time dataset. (Create your own data with a minimum of 5 columns and 20 rows.)
Algorithm:
Step 1: Generate a synthetic dataset with at least 20 rows and 5 columns, including a timestamp.
Step 2: Convert the timestamp to a datetime format for analysis.
Step 3: Calculate basic statistics (mean, max, min) for the traffic metrics.
Step 4: Visualize the data using line plots, bar plots, and scatter plots (a rolling-mean smoothing sketch follows this list).
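A minimal smoothing sketch, assuming the datetime-indexed df with a Page_Views column built in the program below:

# 7-day rolling average to expose the underlying trend
df['Page_Views_7d'] = df['Page_Views'].rolling(window=7).mean()
df[['Page_Views', 'Page_Views_7d']].plot(title='Page Views with 7-day Rolling Mean')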
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import autocorrelation_plot

# Create a sample time series dataset


np.random.seed(42) # For reproducibility

# Generate date range


dates = pd.date_range(start='2024-01-01', periods=20, freq='D')

# Generate synthetic data


data = {
'Date': dates,
'Page_Views': np.random.poisson(lam=200, size=20)+np.arange(20) * 5,
# Increasing trend
'Unique_Visitors': np.random.poisson(lam=100, size=20) + np.arange(20) * 3,
# Increasing trend
'New_Signups': np.random.poisson(lam=20, size=20) + np.random.normal(0, 5,
size=20).astype(int),
# Slightly noisy trend
'Session_Duration': np.random.uniform(100,300,size=20)-np.arange(20)* 2,
# Decreasing trend
'Bounce_Rate': np.random.uniform(20, 60, size=20) + np.sin(np.linspace(0,
3.14, 20)) * 10
# Seasonal component
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

print("Dataset:")
print(df.head())

# Time Series Analysis and Visualization

# Plot Time Series Data


plt.figure(figsize=(14, 10))
# Page Views
plt.subplot(3, 2, 1)
plt.plot(df.index, df['Page_Views'], marker='o', color='blue')
plt.title('Page Views Over Time')
plt.xlabel('Date')
plt.ylabel('Page Views')

# Unique Visitors
plt.subplot(3, 2, 2)
plt.plot(df.index, df['Unique_Visitors'], marker='o', color='green')
plt.title('Unique Visitors Over Time')
plt.xlabel('Date')
plt.ylabel('Unique Visitors')

# New Signups
plt.subplot(3, 2, 3)
plt.plot(df.index, df['New_Signups'], marker='o', color='orange')
plt.title('New Signups Over Time')
plt.xlabel('Date')
plt.ylabel('New Signups')

# Session Duration
plt.subplot(3, 2, 4)
plt.plot(df.index, df['Session_Duration'], marker='o', color='red')
plt.title('Session Duration Over Time')
plt.xlabel('Date')
plt.ylabel('Session Duration (seconds)')

# Bounce Rate
plt.subplot(3, 2, 5)
plt.plot(df.index, df['Bounce_Rate'], marker='o', color='purple')
plt.title('Bounce Rate Over Time')
plt.xlabel('Date')
plt.ylabel('Bounce Rate (%)')

# Autocorrelation Plot for Page Views


plt.subplot(3, 2, 6)
autocorrelation_plot(df['Page_Views'])
plt.title('Autocorrelation of Page Views')

plt.tight_layout()
plt.show()

# Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe())

# Seasonal Decomposition
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose Page Views


result = seasonal_decompose(df['Page_Views'], model='additive', period=7)  # Weekly seasonal pattern
result.plot()
plt.show()
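
As an optional extension (a sketch, not part of the recorded output), a rolling mean smooths the
day-to-day noise so the upward trend in Page_Views is easier to see; it reuses the df defined above:

# 5-day rolling mean to smooth daily noise in Page_Views
rolling = df['Page_Views'].rolling(window=5).mean()

plt.figure(figsize=(10, 4))
plt.plot(df.index, df['Page_Views'], marker='o', alpha=0.5, label='Raw Page Views')
plt.plot(df.index, rolling, color='black', linewidth=2, label='5-Day Rolling Mean')
plt.title('Page Views with 5-Day Rolling Mean')
plt.xlabel('Date')
plt.ylabel('Page Views')
plt.legend()
plt.tight_layout()
plt.show()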
output:
Dataset:
Page_Views Unique_Visitors New_Signups Session_Duration
Date
2024-01-01 195 103 17 287.345998
2024-01-02 215 87 22 125.504189
2024-01-03 194 82 24 164.213270
2024-01-04 219 106 19 116.694704
2024-01-05 236 123 20 276.938724

Bounce_Rate
Date
2024-01-01 55.483457
2024-01-02 52.840141
2024-01-03 48.926675
2024-01-04 28.122861
2024-01-05 32.604629
Descriptive Statistics:
Page_Views Unique_Visitors New_Signups Session_Duration Bounce_Rate
count 20.000000 20.000000 20.000000 20.000000 20.000000
mean 244.250000 128.250000 20.800000 193.278187 44.313501
std 29.572169 23.763916 6.708988 60.400430 11.431342
min 194.000000 82.000000 6.000000 94.620554 28.122861
25% 221.250000 107.500000 17.750000 135.832926 33.501020
50% 243.000000 130.500000 20.000000 194.830227 49.401129
75% 265.250000 146.750000 24.250000 248.604233 53.304734
max 297.000000 162.000000 34.000000 287.345998 63.296567

Result:
Thus the python program to perform time series analysis and apply various visualization methods to an
internet traffic time dataset (own data with a minimum of 5 columns and 20 rows) has been successfully
verified.
14. Perform EDA for Water quality dataset. All attributes are numeric variables and they are listed below:
aluminium - dangerous if greater than 2.8
ammonia - dangerous if greater than 32.5
arsenic - dangerous if greater than 0.01
barium - dangerous if greater than 2
cadmium - dangerous if greater than 0.005

Aim:
To perform EDA for a water quality dataset in which all attributes are numeric variables.
Algorithm:

Step 1: Generate a synthetic dataset with specified attributes (e.g., aluminium, ammonia, etc.).
Step 2: Calculate descriptive statistics to summarize the dataset.
Step 3: Identify samples exceeding dangerous levels for each contaminant.
Step 4: Visualize the data using histograms, box plots, and scatter plots.
Step 5: Present numerical summaries and visualizations.

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample water quality DataFrame


np.random.seed(42) # For reproducibility

# Generate sample data


data = {
'aluminium': np.random.uniform(0, 5, size=50),
# Random values between 0 and 5
'ammonia': np.random.uniform(0, 50, size=50),
# Random values between 0 and 50
'arsenic': np.random.uniform(0, 0.02, size=50),
# Random values between 0 and 0.02
'barium': np.random.uniform(0, 3, size=50),
# Random values between 0 and 3
'cadmium': np.random.uniform(0, 0.01, size=50)
# Random values between 0 and 0.01
}
df = pd.DataFrame(data)

print("Dataset:")
print(df.head())

# Perform Descriptive Statistics


print("\nDescriptive Statistics:")
print(df.describe())

# Check for dangerous levels


danger_thresholds = {
'aluminium': 2.8,
'ammonia': 32.5,
'arsenic': 0.01,
'barium': 2,
'cadmium': 0.005
}

# Identify rows with dangerous levels


dangerous_levels = df[(df['aluminium'] > danger_thresholds['aluminium']) |
(df['ammonia'] > danger_thresholds['ammonia']) |
(df['arsenic'] > danger_thresholds['arsenic']) |
(df['barium'] > danger_thresholds['barium']) |
(df['cadmium'] > danger_thresholds['cadmium'])]

print("\nRows with Dangerous Levels:")


print(dangerous_levels)

# Visualizations
plt.figure(figsize=(14, 12))

# Histograms
plt.subplot(3, 2, 1)
sns.histplot(df['aluminium'], kde=True, color='blue')
plt.title('Histogram of Aluminium')

plt.subplot(3, 2, 2)
sns.histplot(df['ammonia'], kde=True, color='green')
plt.title('Histogram of Ammonia')

plt.subplot(3, 2, 3)
sns.histplot(df['arsenic'], kde=True, color='orange')
plt.title('Histogram of Arsenic')

plt.subplot(3, 2, 4)
sns.histplot(df['barium'], kde=True, color='red')
plt.title('Histogram of Barium')

plt.subplot(3, 2, 5)
sns.histplot(df['cadmium'], kde=True, color='purple')
plt.title('Histogram of Cadmium')

plt.tight_layout()
plt.show()

# Boxplots
plt.figure(figsize=(14, 10))

plt.subplot(3, 2, 1)
sns.boxplot(y=df['aluminium'], color='blue')
plt.title('Boxplot of Aluminium')

plt.subplot(3, 2, 2)
sns.boxplot(y=df['ammonia'], color='green')
plt.title('Boxplot of Ammonia')

plt.subplot(3, 2, 3)
sns.boxplot(y=df['arsenic'], color='orange')
plt.title('Boxplot of Arsenic')

plt.subplot(3, 2, 4)
sns.boxplot(y=df['barium'], color='red')
plt.title('Boxplot of Barium')

plt.subplot(3, 2, 5)
sns.boxplot(y=df['cadmium'], color='purple')
plt.title('Boxplot of Cadmium')

plt.tight_layout()
plt.show()
# Scatter plots
plt.figure(figsize=(14, 10))
# Scatter plot between Aluminium and Ammonia
plt.subplot(2, 2, 1)
plt.scatter(df['aluminium'], df['ammonia'], alpha=0.7)
plt.xlabel('Aluminium')
plt.ylabel('Ammonia')
plt.title('Aluminium vs Ammonia')
# Scatter plot between Aluminium and Arsenic
plt.subplot(2, 2, 2)
plt.scatter(df['aluminium'], df['arsenic'], alpha=0.7)
plt.xlabel('Aluminium')
plt.ylabel('Arsenic')
plt.title('Aluminium vs Arsenic')
# Scatter plot between Barium and Cadmium
plt.subplot(2, 2, 3)
plt.scatter(df['barium'], df['cadmium'], alpha=0.7)
plt.xlabel('Barium')
plt.ylabel('Cadmium')
plt.title('Barium vs Cadmium')
plt.tight_layout()
plt.show()
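
The combined boolean filter above flags any sample breaching at least one threshold. To count
exceedances per contaminant separately, a minimal sketch reusing the same df and danger_thresholds:

# Count how many samples exceed the dangerous level for each contaminant
exceed_counts = {col: int((df[col] > limit).sum())
                 for col, limit in danger_thresholds.items()}
print("\nSamples exceeding dangerous levels per contaminant:")
for col, count in exceed_counts.items():
    print(f"{col}: {count} of {len(df)} samples")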
Output:
Dataset:
aluminium ammonia arsenic barium cadmium
0 1.872701 48.479231 0.000629 2.724798 0.006420
1 4.753572 38.756641 0.012728 0.718686 0.000841
2 3.659970 46.974947 0.006287 0.434685 0.001616
3 2.993292 44.741368 0.010171 1.468358 0.008986
4 0.780093 29.894999 0.018151 2.956951 0.006064

Descriptive Statistics:
aluminium ammonia arsenic barium cadmium
count 50.000000 50.000000 50.000000 50.000000 50.000000
mean 2.229620 24.721879 0.009566 1.552091 0.005161
std 1.444416 15.342076 0.005951 0.870897 0.003091
min 0.102922 0.276106 0.000139 0.049763 0.000051
25% 0.918835 10.843701 0.004998 0.725709 0.002389
50% 2.180244 25.413211 0.008445 1.636463 0.005726
75% 3.249275 38.559728 0.015833 2.259806 0.007405
max 4.849549 49.344347 0.019436 2.956951 0.009730

Rows with Dangerous Levels:


aluminium ammonia arsenic barium cadmium
0 1.872701 48.479231 0.000629 2.724798 0.006420
1 4.753572 38.756641 0.012728 0.718686 0.000841
2 3.659970 46.974947 0.006287 0.434685 0.001616
3 2.993292 44.741368 0.010171 1.468358 0.008986
4 0.780093 29.894999 0.018151 2.956951 0.006064
5 0.779973 46.093712 0.004986 0.726166 0.000092
6 0.290418 4.424625 0.008208 2.016407 0.001015
7 4.330881 9.799143 0.015111 2.284859 0.006635
8 3.005575 2.261364 0.004576 0.712913 0.000051
9 3.540363 16.266517 0.001540 2.184649 0.001608
10 0.102922 19.433864 0.005795 1.103349 0.005487
11 4.849549 13.567452 0.003224 1.896917 0.006919
12 4.162213 41.436875 0.018594 1.900589 0.006520
13 1.061696 17.837666 0.016162 1.607324 0.002243
14 0.909125 14.046725 0.012668 0.270869 0.007122
15 0.917023 27.134804 0.017429 2.505907 0.002372
16 1.521211 7.046211 0.016073 0.962340 0.003254
17 2.623782 40.109849 0.003731 0.559556 0.007465
18 2.159725 3.727532 0.017851 0.122325 0.006496
19 1.456146 49.344347 0.010787 1.772679 0.008492
20 3.059264 38.612238 0.016149 2.032693 0.006576
21 0.697469 9.935784 0.017922 0.049763 0.005683
23 1.831809 40.773071 0.002201 0.679487 0.003677
24 2.280350 35.342867 0.004559 1.935518 0.002652
25 3.925880 36.450358 0.008542 0.523099 0.002440
26 0.998369 38.563517 0.016360 2.072813 0.009730
27 2.571172 3.702233 0.017215 1.160206 0.003931
28 2.962073 17.923286 0.000139 2.810190 0.008920
29 0.232252 5.793453 0.010215 0.412563 0.006311
30 3.037724 43.155171 0.008348 1.023199 0.007948
31 0.852621 31.164906 0.004442 0.340421 0.005026
32 0.325258 16.544901 0.002397 2.774081 0.005769
33 4.744428 3.177918 0.006752 2.632018 0.004925
34 4.828160 15.549116 0.018858 0.773825 0.001952
35 4.041987 16.259166 0.006464 1.979952 0.007225
36 1.523069 36.480309 0.010376 2.451667 0.002808
37 0.488361 31.877874 0.014060 1.665602 0.000243
38 3.421165 44.360637 0.007273 1.588952 0.006455
39 2.200762 23.610746 0.019436 0.725557 0.001771
40 0.610191 5.979712 0.019249 0.279308 0.009405
41 2.475885 35.662239 0.005036 2.691647 0.009539
42 0.171943 38.039252 0.009945 2.701254 0.009149
43 4.546602 28.063860 0.006018 1.899304 0.003702
44 1.293900 38.548359 0.005697 1.017089 0.000155
45 3.312611 24.689780 0.000738 1.047629 0.009283
46 1.558555 26.136641 0.012191 2.177867 0.004282
47 2.600340 21.377051 0.010054 2.691331 0.009667
48 2.733551 1.270956 0.001030 2.661259 0.009636
49 0.924272 5.394571 0.005573 2.339627 0.008530
Result:
Thus the python program to perform EDA for the water quality dataset (all attributes numeric, with
dangerous levels at aluminium > 2.8, ammonia > 32.5, arsenic > 0.01, barium > 2, and cadmium > 0.005)
has been successfully verified.
15. Perform EDA on a map using a map dataset to find the nearest Sports Shop from your
location with a mouse rollover effect

Aim:
To perform EDA on a map using a map dataset to find the nearest sports shop from your
location, with a mouse rollover effect.

Algorithm:
Step 1: Load the dataset containing sports shop information.
Step 2: Define the user's location.
Step 3: Calculate distances to each shop using Haversine formula or similar.
Step 4: Create a folium map centered at the user's location.
Step 5: Add markers for each shop with a mouse rollover effect to display shop details.
Step 6: Display the map.

Program:
import pandas as pd
import folium
from geopy.distance import geodesic

# Sample Data: Sports Shops


data = {
    'Shop': ['Sports Shop A', 'Sports Shop B', 'Sports Shop C', 'Sports Shop D'],
    'Latitude': [40.748817, 40.749825, 40.750788, 40.751850],
    'Longitude': [-73.985428, -73.987456, -73.982542, -73.984750]
}
shops_df = pd.DataFrame(data)

# Your location
your_location = (40.748817, -73.985428)  # Example coordinates (latitude, longitude)

# Create the base map


m = folium.Map(location=your_location, zoom_start=14)

# Add your location to the map


folium.Marker(
    location=your_location,
    popup='Your Location',
    icon=folium.Icon(color='blue', icon='info-sign')
).add_to(m)

# Add sports shops to the map


for _, row in shops_df.iterrows():
    folium.Marker(
        location=(row['Latitude'], row['Longitude']),
        popup=folium.Popup(f"{row['Shop']}", parse_html=True),
        tooltip=row['Shop'],  # shown on mouse rollover
        icon=folium.Icon(color='red', icon='info-sign')
    ).add_to(m)

# Calculate the nearest sports shop


def find_nearest_shop(your_location, shops_df):
    nearest_shop = None
    min_distance = float('inf')

    for _, row in shops_df.iterrows():
        shop_location = (row['Latitude'], row['Longitude'])
        distance = geodesic(your_location, shop_location).km
        if distance < min_distance:
            min_distance = distance
            nearest_shop = row['Shop']

    return nearest_shop, min_distance

nearest_shop, distance = find_nearest_shop(your_location, shops_df)

print(f"Nearest Sports Shop: {nearest_shop}")


print(f"Distance: {distance:.2f} km")

# Save map to HTML


m.save('sports_shops_map.html')
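
A sketch (reusing shops_df, your_location, and the imports above) that rebuilds the shop markers so
the rollover tooltip also shows the distance from your location:

# Markers whose rollover tooltip includes the distance from your location
for _, row in shops_df.iterrows():
    shop_location = (row['Latitude'], row['Longitude'])
    distance_km = geodesic(your_location, shop_location).km
    folium.Marker(
        location=shop_location,
        tooltip=f"{row['Shop']} - {distance_km:.2f} km away",  # shown on mouse rollover
        icon=folium.Icon(color='red', icon='info-sign')
    ).add_to(m)

m.save('sports_shops_map_with_distances.html')  # hypothetical output file name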

output:
Nearest Sports Shop: Sports Shop A
Distance: 0.00 km

Result:
Thus the python program to perform EDA on a map to find the nearest sports shop from your location
with a mouse rollover effect has been successfully verified.
17.Perform EDA for Price of petroleum products in India from the year 2013 to 2023. (Create
dataset with minimum 5 columns and 20 rows.)
Aim:
To perform EDA for prices of petroleum products in India from 2013 to 2023 (dataset created with a
minimum of 5 columns and 20 rows).

Algorithm:
Step 1: Create a dataset with at least 5 columns and 20 rows.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics.
Step 5: Visualize data (line plots, bar charts).
Step 6: Analyze results and interpret findings.

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Create a sample dataset
np.random.seed(42) # For reproducibility
# Define the data
data = {
'Year': np.repeat(np.arange(2013, 2024), 12),
'Month': np.tile(np.arange(1, 13), 11),
'Petrol_Price': np.random.uniform(60,120,size=132)+np.linspace(0,10,132),
'Diesel_Price': np.random.uniform(50,100,size=132)+np.linspace(0, 8,132),
'LPG_Price':np.random.uniform(400,1000,size=132)+np.linspace(0,50,132)}
# Create a DataFrame
df = pd.DataFrame(data)
# Display basic information
print("Dataset:")
print(df.head())
# Perform Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe())
# Plotting
# Line plots for Petrol, Diesel, and LPG prices over time
plt.figure(figsize=(14, 8))
# Petrol Price
plt.subplot(3, 1, 1)
sns.lineplot(data=df, x='Month', y='Petrol_Price', hue='Year', marker='o')
plt.title('Monthly Petrol Prices (2013-2023)')
plt.xlabel('Month')
plt.ylabel('Price (INR per liter)')

# Diesel Price
plt.subplot(3, 1, 2)
sns.lineplot(data=df, x='Month', y='Diesel_Price', hue='Year', marker='o')
plt.title('Monthly Diesel Prices (2013-2023)')
plt.xlabel('Month')
plt.ylabel('Price (INR per liter)')
# LPG Price
plt.subplot(3, 1, 3)
sns.lineplot(data=df, x='Month', y='LPG_Price', hue='Year', marker='o')
plt.title('Monthly LPG Prices (2013-2023)')
plt.xlabel('Month')
plt.ylabel('Price (INR per cylinder)')
plt.tight_layout()
plt.show()

# Additional Analysis: Average Prices by Year


df_yearly = df.groupby('Year').mean().reset_index()
plt.figure(figsize=(14, 8))
# Average Petrol Price by Year
plt.subplot(1, 3, 1)
sns.barplot(data=df_yearly, x='Year', y='Petrol_Price', palette='Blues')
plt.title('Average Petrol Prices by Year')
plt.xlabel('Year')
plt.ylabel('Average Price (INR per liter)')
# Average Diesel Price by Year
plt.subplot(1, 3, 2)
sns.barplot(data=df_yearly, x='Year', y='Diesel_Price', palette='Greens')
plt.title('Average Diesel Prices by Year')
plt.xlabel('Year')
plt.ylabel('Average Price (INR per liter)')
# Average LPG Price by Year
plt.subplot(1, 3, 3)
sns.barplot(data=df_yearly, x='Year', y='LPG_Price', palette='Reds')
plt.title('Average LPG Prices by Year')
plt.xlabel('Year')
plt.ylabel('Average Price (INR per cylinder)')
plt.tight_layout()
plt.show()
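
Since every row carries a Year and a Month, the prices can also be pivoted into a year-by-month matrix
for a heatmap view; a minimal sketch reusing the df above:

# Pivot petrol prices into a Year x Month matrix and plot as a heatmap
petrol_matrix = df.pivot_table(index='Year', columns='Month', values='Petrol_Price')

plt.figure(figsize=(12, 6))
sns.heatmap(petrol_matrix, cmap='YlOrRd', cbar_kws={'label': 'INR per liter'})
plt.title('Petrol Prices by Year and Month')
plt.show()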

output:
Dataset:
Year Month Petrol_Price Diesel_Price LPG_Price
0 2013 1 82.472407 55.993268 926.423843
1 2013 2 117.119194 66.941827 844.842850
2 2013 3 104.072308 97.267623 818.972803
3 2013 4 96.148517 66.343353 822.635489
4 2013 5 69.666462 76.183806 617.221408

Descriptive Statistics:
Year Month Petrol_Price Diesel_Price LPG_Price
count 132.000000 132.000000 132.000000 132.000000 132.000000
mean 2018.000000 6.500000 93.579911 79.327571 721.103925
std 3.174324 3.465203 18.221675 15.216369 174.006893
min 2013.000000 1.000000 61.998428 52.638240 428.223814
25% 2015.000000 3.750000 77.945617 65.872682 576.909566
50% 2018.000000 6.500000 92.986088 80.956321 729.372830
75% 2021.000000 9.250000 110.516472 91.065617 863.529239
max 2023.000000 12.000000 124.480392 107.380555 1042.394688


Result:
Thus the python program to perform EDA for prices of petroleum products in India from 2013 to 2023
(dataset with a minimum of 5 columns and 20 rows) has been successfully verified.
18. Explore and visualize women empowerment in India in 2025 and compare every five years from 2010.
(Create dataset with minimum 5 columns and 20 rows.)
Aim:
To explore and visualize women empowerment in India in 2025 and compare every five years from 2010,
using a dataset created with a minimum of 5 columns and 20 rows.

Algorithm:
Step 1: Create a dataset with at least 5 columns and 20 rows.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics.
Step 5: Visualize data (line plots, bar charts).
Step 6: Analyze results and interpret findings.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a synthetic dataset


np.random.seed(42) # For reproducibility
# Generate data
years = np.arange(2010, 2026, 1)
data = {
'Year': np.repeat(years, 5),
'Metric': ['Literacy_Rate', 'Workforce_Participation', 'Higher_Education',
'Political_Representation', 'Health_Care_Access'] * len(years),
'Value': np.concatenate([
np.random.uniform(50, 80, len(years)) + np.linspace(0, 10, len(years)),
# Literacy Rate
np.random.uniform(20, 50, len(years)) + np.linspace(0, 10, len(years)),
# Workforce Participation
np.random.uniform(10, 30, len(years)) + np.linspace(0, 10, len(years)),
# Higher Education
np.random.uniform(5, 20, len(years)) + np.linspace(0, 5, len(years)),
# Political Representation
np.random.uniform(60, 90, len(years)) + np.linspace(0, 5, len(years))
# Health Care Access
])
}

# Create DataFrame
df = pd.DataFrame(data)
# Display basic information
print("Dataset:")
print(df.head(20)) # Display first 20 rows

# Perform Descriptive Statistics


print("\nDescriptive Statistics:")
print(df.groupby('Metric').describe())
# Plotting

plt.figure(figsize=(14, 10))

# Line plot for each metric


metrics = df['Metric'].unique()
colors = ['blue', 'green', 'orange', 'red', 'purple']

for i, metric in enumerate(metrics):
    plt.subplot(3, 2, i + 1)
    metric_data = df[df['Metric'] == metric]
    sns.lineplot(data=metric_data, x='Year', y='Value', marker='o',
                 color=colors[i])
    plt.title(f'{metric.replace("_", " ").title()} (2010-2025)')
    plt.xlabel('Year')
    plt.ylabel('Value (%)')

plt.tight_layout()
plt.show()
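
The aim asks for a comparison at five-year intervals, which the line plots above only show implicitly.
A minimal sketch, reusing the df above, filters to 2010, 2015, 2020, and 2025 and draws a grouped bar
chart:

# Grouped bars comparing every metric at five-year intervals
five_year = df[df['Year'].isin([2010, 2015, 2020, 2025])]

plt.figure(figsize=(12, 6))
sns.barplot(data=five_year, x='Metric', y='Value', hue='Year', palette='viridis')
plt.title('Women Empowerment Metrics at Five-Year Intervals')
plt.xlabel('Metric')
plt.ylabel('Value (%)')
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()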

output:
Dataset:
Year Metric Value
0 2010 Literacy_Rate 61.236204
1 2010 Workforce_Participation 79.188096
2 2010 Higher_Education 73.293152
3 2010 Political_Representation 69.959755
4 2010 Health_Care_Access 57.347226
5 2011 Literacy_Rate 58.013169
6 2011 Workforce_Participation 55.742508
7 2011 Higher_Education 80.651951
8 2011 Political_Representation 73.366784
9 2011 Health_Care_Access 77.242177
10 2012 Literacy_Rate 57.284201
11 2012 Workforce_Participation 86.430629
12 2012 Higher_Education 82.973279
13 2012 Political_Representation 65.036840
14 2012 Health_Care_Access 64.788082
15 2013 Literacy_Rate 65.502135
16 2013 Workforce_Participation 29.127267
17 2013 Higher_Education 36.409360
18 2013 Political_Representation 34.291684
19 2013 Health_Care_Access 30.736874

Descriptive Statistics:
Year \
count mean std min 25% 50%
Metric
Health_Care_Access 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Higher_Education 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Literacy_Rate 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Political_Representation 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Workforce_Participation 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Value \
75% max count mean std
Metric
Health_Care_Access 2021.25 2025.0 16.0 45.426186 27.309181
Higher_Education 2021.25 2025.0 16.0 43.194516 27.522326
Literacy_Rate 2021.25 2025.0 16.0 47.597035 23.782723
Political_Representation 2021.25 2025.0 16.0 43.815057 25.184779
Workforce_Participation 2021.25 2025.0 16.0 42.272576 26.929909
\
min 25% 50% 75%
Metric
Health_Care_Access 8.106150 23.396187 34.876154 68.440044
Higher_Education 10.939743 19.657972 34.586850 68.239293
Literacy_Rate 14.830159 26.416322 48.557479 62.302686
Political_Representation 9.011743 25.785765 35.588632 66.267569
Workforce_Participation 8.994054 23.865402 30.892074 65.744496

max
Metric
Health_Care_Access 91.273275
Higher_Education 85.065909
Literacy_Rate 85.536882
Political_Representation 87.463843
Workforce_Participation 87.138110

Result:
Thus the python program to explore and visualize women empowerment in India in 2025, comparing every
five years from 2010 with a created dataset of a minimum of 5 columns and 20 rows, has been
successfully verified.
19.Perform EDA and Visualization for COVID-19 dataset.
A) State wise Bar chart

Aim:
To perform EDA and visualization for a COVID-19 dataset with a state-wise bar chart.

Algorithm:
Step 1: Create a dataset with state names and their corresponding COVID-19 statistics.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics.
Step 5: Visualize data (state-wise bar chart).
Step 6: Analyze results and interpret findings.

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a synthetic COVID-19 dataset


np.random.seed(42) # For reproducibility

# Define states and generate synthetic data


states = ['State_A', 'State_B', 'State_C', 'State_D', 'State_E']
data = {
'State': states,
'Total_Cases': np.random.randint(10000, 500000, size=len(states)),
'Total_Deaths': np.random.randint(1000, 50000, size=len(states)),
'Total_Recovered': np.random.randint(5000, 450000, size=len(states))
}

# Create DataFrame
df = pd.DataFrame(data)
# Display basic information
print("Dataset:")
print(df)

# Perform Descriptive Statistics


print("\nDescriptive Statistics:")
print(df.describe())

# Plotting: State-wise Bar Chart for Total Cases


plt.figure(figsize=(10, 6))
sns.barplot(x='State', y='Total_Cases', data=df, palette='viridis')
plt.title('Total COVID-19 Cases by State')
plt.xlabel('State')
plt.ylabel('Total Cases')
plt.xticks(rotation=45)
plt.show()
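
To compare cases, deaths, and recoveries side by side for each state, the wide table can be melted to
long form first; a minimal sketch reusing the df above:

# Reshape to long form so all three metrics share one grouped bar chart
df_long = df.melt(id_vars='State',
                  value_vars=['Total_Cases', 'Total_Deaths', 'Total_Recovered'],
                  var_name='Metric', value_name='Count')

plt.figure(figsize=(12, 6))
sns.barplot(x='State', y='Count', hue='Metric', data=df_long, palette='muted')
plt.title('COVID-19 Cases, Deaths and Recoveries by State')
plt.xticks(rotation=45)
plt.show()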
Output:
Dataset:
State Total_Cases Total_Deaths Total_Recovered
0 State_A 131958 45732 92498
1 State_B 156867 12284 379871
2 State_C 141932 7265 393468
3 State_D 375838 17850 180203
4 State_E 269178 38194 196335

Descriptive Statistics:
Total_Cases Total_Deaths Total_Recovered
count 5.000000 5.000000 5.000000
mean 215154.600000 24265.000000 248475.000000
std 105378.307183 16796.917247 132284.115617
min 131958.000000 7265.000000 92498.000000
25% 141932.000000 12284.000000 180203.000000
50% 156867.000000 17850.000000 196335.000000
75% 269178.000000 38194.000000 379871.000000
max 375838.000000 45732.000000 393468.000000


Result:
Thus the python program to perform EDA and visualization for the COVID-19 dataset with a state-wise
bar chart has been successfully verified.
19.Perform EDA and Visualization for COVID-19 dataset.
B) Recovered from COVID-19 District wise Bar chart

Aim:
To perform EDA and visualization for a COVID-19 dataset with a district-wise bar chart of cases
recovered from COVID-19.

Algorithm:
Step 1: Create a dataset with district names and their corresponding recovered COVID-19 statistics.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics.
Step 5: Visualize data (district-wise bar chart).
Step 6: Analyze results and interpret findings.

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a synthetic COVID-19 dataset with district-wise data


np.random.seed(42) # For reproducibility

# Define districts and generate synthetic data


districts = ['District_A', 'District_B', 'District_C', 'District_D',
'District_E']
data = {
'District': districts,
'Recovered_Cases': np.random.randint(5000, 200000, size=len(districts)),
'Total_Cases': np.random.randint(10000, 250000, size=len(districts)),
'Total_Deaths': np.random.randint(1000, 50000, size=len(districts))
}

# Create DataFrame
df = pd.DataFrame(data)

# Display basic information


print("Dataset:")
print(df)

# Perform Descriptive Statistics


print("\nDescriptive Statistics:")
print(df.describe())

# Plotting: District-wise Bar Chart for Recovered Cases


plt.figure(figsize=(10, 6))
sns.barplot(x='District', y='Recovered_Cases', data=df, palette='viridis')
plt.title('COVID-19 Recovered Cases by District')
plt.xlabel('District')
plt.ylabel('Recovered Cases')
plt.xticks(rotation=45)
plt.show()
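
A derived recovery-rate column (recovered cases as a share of total cases) is often more informative
than raw counts; a minimal sketch reusing the df above (the data is synthetic, so rates above 100% can
occur):

# Recovery rate = recovered cases as a percentage of total cases
df['Recovery_Rate'] = df['Recovered_Cases'] / df['Total_Cases'] * 100

plt.figure(figsize=(10, 6))
sns.barplot(x='District', y='Recovery_Rate', data=df, palette='crest')
plt.title('COVID-19 Recovery Rate by District (%)')
plt.xlabel('District')
plt.ylabel('Recovery Rate (%)')
plt.xticks(rotation=45)
plt.show()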

OUTPUT:
Dataset:
District Recovered_Cases Total_Cases Total_Deaths
0 District_A 126958 120268 38194
1 District_B 151867 217892 22962
2 District_C 136932 64886 48191
3 District_D 108694 147337 45131
4 District_E 124879 223458 17023

Descriptive Statistics:
Recovered_Cases Total_Cases Total_Deaths
count 5.000000 5.000000 5.000000
mean 129866.000000 154768.200000 34300.200000
std 15933.867814 67132.702837 13715.672156
min 108694.000000 64886.000000 17023.000000
25% 124879.000000 120268.000000 22962.000000
50% 126958.000000 147337.000000 38194.000000
75% 136932.000000 217892.000000 45131.000000
max 151867.000000 223458.000000 48191.000000


Result:
Thus the python program to perform EDA and visualization for the COVID-19 dataset with a district-wise
bar chart of recovered cases has been successfully verified.
19.Perform EDA and Visualization for COVID-19 dataset.
C) Descriptive analysis for different age groups. (Create dataset with minimum 5 columns and 20 rows.)

Aim:
To perform EDA and visualization for a COVID-19 dataset with descriptive analysis for different age groups.
Algorithm:
Step 1: Create a dataset with age groups and their corresponding COVID-19 statistics.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics for each age group.
Step 5: Visualize data (e.g., bar charts for confirmed cases, recoveries, and deaths).
Step 6: Analyze results and interpret findings.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a synthetic COVID-19 dataset with age group data


np.random.seed(42) # For reproducibility

# Define age groups and generate synthetic data


age_groups = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70',
'71-80', '81+']
data = {
'Age_Group': np.random.choice(age_groups, size=20),
'Total_Cases': np.random.randint(1000, 50000, size=20),
'Total_Deaths': np.random.randint(10, 5000, size=20),
'Recovered_Cases': np.random.randint(500, 20000, size=20),
'New_Cases': np.random.randint(50, 3000, size=20),
'New_Deaths': np.random.randint(5, 300, size=20)
}

# Create DataFrame
df = pd.DataFrame(data)

# Display basic information


print("Dataset:")
print(df.head())

# Perform Descriptive Statistics by Age Group


print("\nDescriptive Statistics by Age Group:")
age_group_stats = df.groupby('Age_Group').agg({
'Total_Cases': ['mean', 'sum', 'std'],
'Total_Deaths': ['mean', 'sum', 'std'],
'Recovered_Cases': ['mean', 'sum', 'std'],
'New_Cases': ['mean', 'sum', 'std'],
'New_Deaths': ['mean', 'sum', 'std']
}).reset_index()
print(age_group_stats)
# Visualization: Total Cases and Deaths by Age Group
plt.figure(figsize=(12, 6))
sns.barplot(x='Age_Group', y='Total_Cases', data=df, palette='viridis',
            errorbar=None)

plt.title('Total COVID-19 Cases by Age Group')


plt.xlabel('Age Group')
plt.ylabel('Total Cases')
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(12, 6))
sns.barplot(x='Age_Group', y='Total_Deaths', data=df, palette='rocket',
            errorbar=None)

plt.title('Total COVID-19 Deaths by Age Group')


plt.xlabel('Age Group')
plt.ylabel('Total Deaths')
plt.xticks(rotation=45)
plt.show()
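
A case-fatality-style ratio per age group can be derived from the aggregated sums; a minimal sketch
reusing the df above:

# Case-fatality-style ratio: total deaths as a percentage of total cases per age group
fatality = df.groupby('Age_Group')[['Total_Cases', 'Total_Deaths']].sum()
fatality['Fatality_Rate_%'] = fatality['Total_Deaths'] / fatality['Total_Cases'] * 100
print("\nFatality rate by age group:")
print(fatality.sort_values('Fatality_Rate_%', ascending=False))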

output:

Dataset:
Age_Group Total_Cases Total_Deaths Recovered_Cases New_Cases New_Deaths
0 61-70 18568 2898 1521 429 235
1 31-40 20769 2445 12153 542 45
2 71-80 29693 610 11305 1230 32
3 41-50 7396 2373 13917 2112 139
4 61-70 28480 2071 8489 114 205

Descriptive Statistics by Age Group:


Age_Group Total_Cases Total_Deaths \
mean sum std mean sum
0 11-20 18720.500000 37441 19559.280674 3557.5 7115
1 21-30 39603.500000 79207 4002.931488 518.0 1036
2 31-40 12258.000000 24516 12036.371629 1909.0 3818
3 41-50 17925.000000 71700 17957.845101 2030.5 8122
4 51-60 3727.500000 7455 1171.675936 2809.5 5619
5 61-70 24568.666667 73706 5275.975865 2340.0 7020
6 71-80 16667.200000 83336 10608.236361 1892.6 9463

Recovered_Cases New_Cases \
std mean sum std mean sum
0 559.321464 10857.000000 21714 2585.182392 1088.000000 2176
1 377.595021 11113.000000 22226 1302.490691 2095.000000 4190
2 758.018469 6407.000000 12814 8126.071129 1643.500000 3287
3 1356.603234 7109.750000 28439 5117.947204 1619.000000 6476
4 1717.562372 8948.500000 17897 1158.948014 2538.000000 5076
5 483.345632 7833.333333 23500 6011.377906 653.333333 1960
6 887.934007 8546.400000 42732 6278.818902 1131.400000 5657
New_Deaths
std mean sum std
0 933.380951 246.00 492 21.213203
1 739.633693 224.00 448 67.882251
2 1557.756239 132.50 265 123.743687
3 794.150699 63.25 253 55.608003
4 214.960461 176.50 353 58.689863
5 679.850229 159.00 477 106.714573
6 397.607596 117.80 589 105.103283

Result:
Thus the python program to perform EDA and visualization for the COVID-19 dataset with descriptive
analysis for different age groups has been successfully verified.
20.A) Perform EDA for Bus Ticket Booking

Aim:
To perform EDA for a bus ticket booking dataset.

Algorithm:
Step 1: Create a dataset with relevant attributes.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics for the dataset.
Step 5: Visualize data (e.g., bar charts, pie charts).
Step 6: Analyze results and interpret findings.

PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a synthetic bus ticket booking dataset


np.random.seed(42) # For reproducibility

# Define data
num_rows = 50
dates = pd.date_range(start='2024-01-01', periods=num_rows, freq='D')
data = {
'Booking_ID': np.arange(1, num_rows + 1),
'Date': np.random.choice(dates, size=num_rows),
'Bus_ID': np.random.choice(['Bus_01', 'Bus_02', 'Bus_03', 'Bus_04'],
size=num_rows),
'Passenger_ID': np.random.randint(1000, 5000, size=num_rows),
'Seat_No': np.random.randint(1, 50, size=num_rows),
'Booking_Status': np.random.choice(['Booked', 'Cancelled', 'Completed'],
size=num_rows),
'Amount': np.random.uniform(100, 500, size=num_rows).round(2),
'Travel_Distance': np.random.randint(10, 500, size=num_rows)
}

# Create DataFrame
df = pd.DataFrame(data)

# Display basic information


print("Dataset:")
print(df.head())

# Check for missing values and data types


print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)
# Perform Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe(include='all'))

# Booking Status Distribution


plt.figure(figsize=(8, 5))
sns.countplot(x='Booking_Status', data=df, palette='viridis')
plt.title('Distribution of Booking Status')
plt.xlabel('Booking Status')
plt.ylabel('Count')
plt.show()

# Total Revenue by Bus


plt.figure(figsize=(10, 6))
bus_revenue = df[df['Booking_Status'] ==
'Completed'].groupby('Bus_ID')['Amount'].sum().reset_index()
sns.barplot(x='Bus_ID', y='Amount', data=bus_revenue, palette='plasma')
plt.title('Total Revenue by Bus')
plt.xlabel('Bus ID')
plt.ylabel('Total Revenue')
plt.show()

# Average Travel Distance by Booking Status


plt.figure(figsize=(10, 6))
sns.boxplot(x='Booking_Status', y='Travel_Distance', data=df,
palette='coolwarm')
plt.title('Travel Distance by Booking Status')
plt.xlabel('Booking Status')
plt.ylabel('Travel Distance (km)')
plt.show()

# Correlation Matrix
plt.figure(figsize=(10, 8))
correlation = df[['Amount', 'Travel_Distance']].corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
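
Beyond revenue, the share of bookings per bus that ended in each status is a useful operational metric;
a minimal sketch reusing the df above:

# Percentage of bookings in each status, per bus
status_by_bus = pd.crosstab(df['Bus_ID'], df['Booking_Status'], normalize='index') * 100
print("\nBooking status share per bus (%):")
print(status_by_bus.round(1))

status_by_bus.plot(kind='bar', stacked=True, figsize=(10, 5), colormap='viridis')
plt.title('Booking Status Share by Bus')
plt.ylabel('Share of Bookings (%)')
plt.xticks(rotation=0)
plt.show()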
OUTPUT:
Dataset:
Booking_ID Date Bus_ID Passenger_ID Seat_No Booking_Status \
0 1 2024-02-08 Bus_01 2076 40 Cancelled
1 2 2024-01-29 Bus_02 1791 22 Cancelled
2 3 2024-01-15 Bus_04 4993 27 Completed
3 4 2024-02-12 Bus_04 3264 35 Cancelled
4 5 2024-01-08 Bus_03 1763 1 Completed
Amount Travel_Distance
0 353.74 125
1 372.28 84
2 312.37 122
3 279.11 465
4 321.16 429

Missing Values:
Booking_ID 0
Date 0
Bus_ID 0
Passenger_ID 0
Seat_No 0
Booking_Status 0
Amount 0
Travel_Distance 0
dtype: int64

Data Types:
Booking_ID int64
Date datetime64[ns]
Bus_ID object
Passenger_ID int64
Seat_No int64
Booking_Status object
Amount float64
Travel_Distance int64
dtype: object

Descriptive Statistics:
Booking_ID Date Bus_ID Passenger_ID Seat_No \
count 50.00000 50 50 50.00000 50.00000
unique NaN NaN 4 NaN NaN
top NaN NaN Bus_02 NaN NaN
freq NaN NaN 18 NaN NaN
mean 25.50000 2024-01-24 16:19:12 NaN 2985.20000 26.14000
min 1.00000 2024-01-02 00:00:00 NaN 1064.00000 1.00000
25% 13.25000 2024-01-14 06:00:00 NaN 2095.00000 14.25000
50% 25.50000 2024-01-24 00:00:00 NaN 3075.00000 28.00000
75% 37.75000 2024-02-06 18:00:00 NaN 3915.50000 38.50000
max 50.00000 2024-02-19 00:00:00 NaN 4993.00000 49.00000
std 14.57738 NaN NaN 1129.03166 14.20694

Booking_Status Amount Travel_Distance


count 50 50.000000 50.000000
unique 3 NaN NaN
top Booked NaN NaN
freq 20 NaN NaN
mean NaN 327.255600 278.920000
min NaN 132.340000 31.000000
25% NaN 241.567500 163.250000
50% NaN 322.240000 292.000000
75% NaN 423.165000 405.500000
max NaN 498.500000 493.000000
std NaN 107.615938 140.614372

Result:
Thus the python program to perform EDA for bus ticket booking has been successfully verified.
20.B) Perform EDA for Train Ticket Booking

Aim:
To perform EDA for a train ticket booking dataset.
Algorithm:
Step 1: Create a dataset with relevant attributes.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics for the dataset.
Step 5: Visualize data (e.g., bar charts, pie charts).
Step 6: Analyze results and interpret findings.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a synthetic train ticket booking dataset


np.random.seed(42) # For reproducibility

# Define data
num_rows = 50
dates = pd.date_range(start='2024-01-01', periods=num_rows, freq='D')
data = {
'Booking_ID': np.arange(1, num_rows + 1),
'Date': np.random.choice(dates, size=num_rows),
'Train_ID': np.random.choice(['Train_A', 'Train_B', 'Train_C', 'Train_D'],
size=num_rows),
'Passenger_ID': np.random.randint(1000, 5000, size=num_rows),
'Seat_No': np.random.randint(1, 100, size=num_rows),
'Booking_Status': np.random.choice(['Booked', 'Cancelled', 'Completed'],
size=num_rows),
'Amount': np.random.uniform(50, 500, size=num_rows).round(2),
'Travel_Distance': np.random.randint(10, 1000, size=num_rows)
}

# Create DataFrame
df = pd.DataFrame(data)

# Display basic information


print("Dataset:")
print(df.head())

# Check for missing values and data types


print("\nMissing Values:")
print(df.isnull().sum())

print("\nData Types:")
print(df.dtypes)
# Perform Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe(include='all'))

# Booking Status Distribution


plt.figure(figsize=(8, 5))
sns.countplot(x='Booking_Status', data=df, palette='viridis')
plt.title('Distribution of Booking Status')
plt.xlabel('Booking Status')
plt.ylabel('Count')
plt.show()

# Total Revenue by Train


plt.figure(figsize=(10, 6))
train_revenue = df[df['Booking_Status'] ==
'Completed'].groupby('Train_ID')['Amount'].sum().reset_index()
sns.barplot(x='Train_ID', y='Amount', data=train_revenue, palette='plasma')
plt.title('Total Revenue by Train')
plt.xlabel('Train ID')
plt.ylabel('Total Revenue')
plt.show()

# Average Travel Distance by Booking Status


plt.figure(figsize=(10, 6))
sns.boxplot(x='Booking_Status', y='Travel_Distance', data=df,
palette='coolwarm')
plt.title('Travel Distance by Booking Status')

plt.xlabel('Booking Status')
plt.ylabel('Travel Distance (km)')

plt.show()

# Revenue Over Time


plt.figure(figsize=(12, 6))
df['Date'] = pd.to_datetime(df['Date'])
revenue_over_time = df[df['Booking_Status'] ==
'Completed'].groupby('Date')['Amount'].sum().reset_index()
sns.lineplot(x='Date', y='Amount', data=revenue_over_time, marker='o')

plt.title('Revenue Over Time')


plt.xlabel('Date')
plt.ylabel('Total Revenue')
plt.show()
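
Booking volume by day of week can be read straight off the Date column; a minimal sketch reusing the
df above (Date was converted to datetime earlier in the program):

# Count bookings per day of week
df['Day_Of_Week'] = df['Date'].dt.day_name()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

plt.figure(figsize=(10, 5))
sns.countplot(x='Day_Of_Week', data=df, order=day_order, palette='viridis')
plt.title('Train Bookings by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Number of Bookings')
plt.xticks(rotation=30)
plt.show()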
OUTPUT:
Dataset:
Booking_ID Date Train_ID Passenger_ID Seat_No Booking_Status \
0 1 2024-02-08 Train_A 2076 86 Cancelled
1 2 2024-01-29 Train_B 1791 91 Completed
2 3 2024-01-15 Train_D 4993 35 Booked
3 4 2024-02-12 Train_D 3264 65 Completed
4 5 2024-01-08 Train_C 1763 99 Cancelled

Amount Travel_Distance
0 115.20 977
1 270.25 429
2 493.54 431
3 158.92 113
4 352.46 861

Missing Values:
Booking_ID 0
Date 0
Train_ID 0
Passenger_ID 0
Seat_No 0
Booking_Status 0
Amount 0
Travel_Distance 0
dtype: int64

Data Types:
Booking_ID int64
Date datetime64[ns]
Train_ID object
Passenger_ID int64
Seat_No int64
Booking_Status object
Amount float64
Travel_Distance int64
dtype: object

Descriptive Statistics:
Booking_ID Date Train_ID Passenger_ID Seat_No \
count 50.00000 50 50 50.00000 50.000000
unique NaN NaN 4 NaN NaN
top NaN NaN Train_B NaN NaN
freq NaN NaN 18 NaN NaN
mean 25.50000 2024-01-24 16:19:12 NaN 2985.20000 49.460000
min 1.00000 2024-01-02 00:00:00 NaN 1064.00000 1.000000
25% 13.25000 2024-01-14 06:00:00 NaN 2095.00000 28.000000
50% 25.50000 2024-01-24 00:00:00 NaN 3075.00000 46.000000
75% 37.75000 2024-02-06 18:00:00 NaN 3915.50000 76.500000
max 50.00000 2024-02-19 00:00:00 NaN 4993.00000 99.000000
std 14.57738 NaN NaN 1129.03166 29.864463

Booking_Status Amount Travel_Distance


count 50 50.000000 50.000000
unique 3 NaN NaN
top Booked NaN NaN
freq 21 NaN NaN
mean NaN 279.017600 520.620000
min NaN 57.460000 31.000000
25% NaN 158.852500 245.000000
50% NaN 295.470000 480.500000
75% NaN 377.445000 821.250000
max NaN 493.540000 977.000000
std NaN 130.155225 302.010399

Result:
Thus the python program to perform EDA for train ticket booking has been successfully verified.
20.C) Perform EDA for Flight Ticket Booking

Aim:
To perform EDA for a flight ticket booking dataset.
Algorithm:
Step 1: Create a dataset with relevant attributes.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics for the dataset.
Step 5: Visualize data (e.g., bar charts, pie charts).
Step 6: Analyze results and interpret findings.

PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a synthetic flight ticket booking dataset


np.random.seed(42) # For reproducibility

# Define data
num_rows = 50
dates = pd.date_range(start='2024-01-01', periods=num_rows, freq='D')
data = {
'Booking_ID': np.arange(1, num_rows + 1),
'Date': np.random.choice(dates, size=num_rows),
'Flight_ID': np.random.choice(['Flight_101', 'Flight_102', 'Flight_103',
'Flight_104'], size=num_rows),
'Passenger_ID': np.random.randint(1000, 5000, size=num_rows),
'Seat_No': np.random.randint(1, 200, size=num_rows),
'Booking_Status': np.random.choice(['Booked', 'Cancelled', 'Completed'],
size=num_rows),
'Amount': np.random.uniform(100, 1000, size=num_rows).round(2),
'Travel_Distance': np.random.randint(100, 5000, size=num_rows),
'Class': np.random.choice(['Economy', 'Business', 'First'], size=num_rows)
}

# Create DataFrame
df = pd.DataFrame(data)

# Display basic information


print("Dataset:")
print(df.head())

# Check for missing values and data types


print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)

# Perform Descriptive Statistics


print("\nDescriptive Statistics:")
print(df.describe(include='all'))

# Booking Status Distribution


plt.figure(figsize=(8, 5))
sns.countplot(x='Booking_Status', data=df, palette='viridis')
plt.title('Distribution of Booking Status')
plt.xlabel('Booking Status')
plt.ylabel('Count')
plt.show()

# Total Revenue by Flight


plt.figure(figsize=(10, 6))
flight_revenue = df[df['Booking_Status'] ==
'Completed'].groupby('Flight_ID')['Amount'].sum().reset_index()
sns.barplot(x='Flight_ID', y='Amount', data=flight_revenue, palette='plasma')
plt.title('Total Revenue by Flight')
plt.xlabel('Flight ID')
plt.ylabel('Total Revenue')
plt.show()

# Average Travel Distance by Booking Status


plt.figure(figsize=(10, 6))
sns.boxplot(x='Booking_Status', y='Travel_Distance', data=df,
palette='coolwarm')
plt.title('Travel Distance by Booking Status')
plt.xlabel('Booking Status')
plt.ylabel('Travel Distance (km)')
plt.show()

# Revenue by Class
plt.figure(figsize=(10, 6))
class_revenue = df[df['Booking_Status'] ==
'Completed'].groupby('Class')['Amount'].sum().reset_index()
sns.barplot(x='Class', y='Amount', data=class_revenue, palette='magma')
plt.title('Total Revenue by Class')
plt.xlabel('Class')
plt.ylabel('Total Revenue')
plt.show()

# Revenue Over Time


plt.figure(figsize=(12, 6))
df['Date'] = pd.to_datetime(df['Date'])
revenue_over_time = df[df['Booking_Status'] ==
'Completed'].groupby('Date')['Amount'].sum().reset_index()
sns.lineplot(x='Date', y='Amount', data=revenue_over_time, marker='o')
plt.title('Revenue Over Time')
plt.xlabel('Date')
plt.ylabel('Total Revenue')
plt.show()
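
Average fare per travel class rounds out the analysis; a minimal sketch reusing the df above:

# Average ticket amount per travel class
avg_fare = df.groupby('Class')['Amount'].mean().reset_index()

plt.figure(figsize=(8, 5))
sns.barplot(x='Class', y='Amount', data=avg_fare, palette='magma')
plt.title('Average Ticket Amount by Class')
plt.xlabel('Class')
plt.ylabel('Average Amount')
plt.show()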

OUTPUT:
Dataset:
Booking_ID Date Flight_ID Passenger_ID Seat_No Booking_Status \
0 1 2024-02-08 Flight_101 2076 104 Booked
1 2 2024-01-29 Flight_102 1791 35 Booked
2 3 2024-01-15 Flight_104 4993 193 Booked
3 4 2024-02-12 Flight_104 3264 101 Completed
4 5 2024-01-08 Flight_103 1763 175 Cancelled

Amount Travel_Distance Class


0 704.92 2657 Economy
1 785.46 198 Business
2 313.87 2300 Economy
3 755.39 3061 First
4 431.00 4533 First

Missing Values:
Booking_ID 0
Date 0
Flight_ID 0
Passenger_ID 0
Seat_No 0
Booking_Status 0
Amount 0
Travel_Distance 0
Class 0
dtype: int64

Data Types:
Booking_ID int64
Date datetime64[ns]
Flight_ID object
Passenger_ID int64
Seat_No int64
Booking_Status object
Amount float64
Travel_Distance int64
Class object
dtype: object

Descriptive Statistics:
Booking_ID Date Flight_ID Passenger_ID Seat_No \
count 50.00000 50 50 50.00000 50.000000
unique NaN NaN 4 NaN NaN
top NaN NaN Flight_102 NaN NaN
freq NaN NaN 18 NaN NaN
mean 25.50000 2024-01-24 16:19:12 NaN 2985.20000 95.540000
min 1.00000 2024-01-02 00:00:00 NaN 1064.00000 1.000000
25% 13.25000 2024-01-14 06:00:00 NaN 2095.00000 42.500000
50% 25.50000 2024-01-24 00:00:00 NaN 3075.00000 94.000000
75% 37.75000 2024-02-06 18:00:00 NaN 3915.50000 140.750000
max 50.00000 2024-02-19 00:00:00 NaN 4993.00000 193.000000
std 14.57738 NaN NaN 1129.03166 58.362768

Booking_Status Amount Travel_Distance Class


count 50 50.000000 50.00000 50
unique 3 NaN NaN 3
top Booked NaN NaN Business
freq 21 NaN NaN 18
mean NaN 554.685800 2733.78000 NaN
min NaN 108.280000 116.00000 NaN
25% NaN 314.820000 1324.25000 NaN
50% NaN 615.740000 2902.00000 NaN
75% NaN 754.882500 3887.50000 NaN
max NaN 943.060000 4993.00000 NaN
std NaN 263.451305 1478.65742 NaN

Result:
Thus the python program to perform EDA for flight ticket booking has been successfully verified.
