Dev Lab Manual
1.A)Install the Data Analysis and Visualization Tools in Python
Aim:
To install the Data Analysis and Visualization tools in Python and verify the installations.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Print versions to verify installations
print("Pandasversion:", pd.__version__)
print("NumPy version:",np.version__)
print("Matplotlib version:", matplotlib.version)
# Correctly access the versionfrom thematplotlibmodule
print("Seabornversion:",sns.version)
Output:
Pandas version: 2.1.4
NumPy version: 1.26.4
Matplotlib version: 3.7.1
Seaborn version: 0.13.1
Result:
Thus the python program to install the Data Analysis and Visualization tools in Python has been successfully verified.
1.B)Install Pandas Package in Python and execute the Program for simple DataFrame attributes
Aim:
To install the Pandas Package in Python and execute the Program for simple DataFrame attributes
Algorithm:
Step 1: Install the Pandas Package
To install the Pandas package, you will need to use the Python package manager, pip. Open your terminal or command prompt and run the following command: pip install pandas
Step 2: Verify the Installation and Create a Simple DataFrame
To create a DataFrame and explore its attributes:
Import the Pandas Library: Begin by importing the pandas library.
Create a DataFrame: Use the pd.DataFrame() function to create a simple DataFrame.
Explore DataFrame Attributes: Access attributes such as shape, columns, index, and dtypes, and preview the data with head().
Program:
import pandas as pd
# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame:")
print(df)
# Display DataFrame attributes
print("\nDataFrame Attributes:")
# Shape of the DataFrame
print(f"Shape: {df.shape}")
# Column names
print(f"Columns: {df.columns}")
# Index
print(f"Index: {df.index}")
# Data types of each column
print(f"Data Types:\n{df.dtypes}")
# Descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe(include='all'))
# Info about the DataFrame
print("\nDataFrame Info:")
df.info()
Output:
DataFrame:
Name Age City
0 Alice 24 New York
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
3 David 32 Houston
DataFrame Attributes:
Shape: (4, 3)
Columns: Index(['Name', 'Age', 'City'], dtype='object')
Index: RangeIndex(start=0, stop=4, step=1)
Data Types:
Name object
Age int64
City object
dtype: object
Descriptive Statistics:
Name Age City
count 4 4.000000 4
unique 4 NaN 4
top Alice NaN New York
freq 1 NaN 1
mean NaN 26.250000 NaN
std NaN 4.349329 NaN
min NaN 22.000000 NaN
25% NaN 23.500000 NaN
50% NaN 25.500000 NaN
75% NaN 28.250000 NaN
max NaN 32.000000 NaN
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Result:
Thus the python program to implement the Install Pandas Package in Python and execute the Program
for simple Data frame attributes has been successfully verified.
2.Create a Program using Numpy package functions and 2D or 3D array to perform simple matrix operations
Aim:
To Create a Program using Numpy package functions and 2D or 3D array to perform simple matrix operations
Algorithm:
Step1: Addition/Subtraction:
Check if both matrices have the same dimensions.
Add/Subtract corresponding elements from both matrices.
Step 2: Multiplication:
Ensure that the number of columns in the first matrix equals the number of rows in the second matrix. Multiply each element of the rows of the first matrix by the corresponding elements of the columns of the second matrix and sum them.
Step 3: Transpose: Convert the rows of the matrix into columns and vice versa.
Program:
import numpy as np
# 2D Array Example
print("2D Array Operations")
# Create two 2x3 matrices (values taken from the output below)
matrix_a = np.array([[1, 2, 3], [4, 5, 6]])
matrix_b = np.array([[7, 8, 9], [10, 11, 12]])
print("Matrix A:")
print(matrix_a)
print("Matrix B:")
print(matrix_b)
# 1. Matrix Addition
matrix_addition = matrix_a + matrix_b
print("\nMatrix Addition (A + B):")
print(matrix_addition)
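# 2. Matrix Subtraction and 3. Matrix Multiplication (assumed reconstruction of
# the steps the numbered comments imply; a hedged sketch, not the original code)
matrix_subtraction = matrix_a - matrix_b
print("\nMatrix Subtraction (A - B):")
print(matrix_subtraction)
# A is 2x3, so multiply it by the transpose of B (3x2) to satisfy the
# rows-by-columns rule from Step 2 of the algorithm
matrix_multiplication = matrix_a @ matrix_b.T
print("\nMatrix Multiplication (A x B^T):")
print(matrix_multiplication)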
# 4. Matrix Transposition
matrix_transpose = np.transpose(matrix_a)
print("\nMatrix Transposition (Transpose of A):")
print(matrix_transpose)
# 3D Array Example
print("\n3D Array Operations")
# Create a 3D array
array_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]],
                     [[13, 14, 15], [16, 17, 18]]])
print("3D Array:")
print(array_3d)
# 5. Sum along axis 0
sum_axis_0 = np.sum(array_3d, axis=0)
print("\nSum along axis 0:")
print(sum_axis_0)
Output:
2D Array Operations
Matrix A:
[[1 2 3]
[4 5 6]]
Matrix B:
[[ 7 8 9]
[10 11 12]]
3D Array Operations
3D Array:
[[[ 1 2 3]
[ 4 5 6]]
[[ 7 8 9]
[10 11 12]]
[[13 14 15]
[16 17 18]]]
Result:
Thus the python program using Numpy package functions and 2D or 3D arrays to perform simple matrix operations has been successfully verified.
3.To combine Numpy and Pandas data frame to create dataset and perform the following
A) Color Variation each column data
Aim:
To combine Numpy and Pandas data frame to create a dataset and perform Color Variation on each column's data
Algorithm:
1. Generate a dataset using NumPy
2. Convert the dataset to a Pandas DataFrame.
3. Apply a color variation algorithm based on the values.
4. Display the styled DataFrame with color variation.
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Step 1: Create a Dataset using NumPy and Pandas
# Generate random data using NumPy
np.random.seed(0) # For reproducibility
data = {
'Age': np.random.randint(20, 60, size=100),
'Income': np.random.randint(30000, 120000, size=100),
'Expenses': np.random.randint(10000, 50000, size=100),
'Savings': np.random.randint(5000, 20000, size=100)
}
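# Assumed reconstruction of the missing steps: build the DataFrame and colour
# each column by its values (a seaborn heatmap is one common way to do this;
# the original styling code is not shown)
df = pd.DataFrame(data)
print("DataFrame:")
print(df.head())
plt.figure(figsize=(10, 6))
sns.heatmap(df, cmap='viridis', cbar=True)
plt.title('Color Variation of Each Column')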
plt.tight_layout()
plt.show()
OUTPUT:
DataFrame:
Age Income Expenses Savings
0 20 54777 43391 11797
1 23 43824 42232 10637
2 23 32418 18962 13448
3 59 42843 30435 16400
4 29 108778 44009 16471
Result:
Thus the python program to combine Numpy and Pandas data frames to create a dataset and perform Color Variation on each column's data has been successfully verified.
3.To combine Numpy and Pandas data frame to create dataset and perform the
following B) Highlight Max and Min values with output
Aim:
To combine Numpy and Pandas data frame to create dataset and perform the following
Highlight Max and Min values with output.
Algorithm:
1.Generate a dataset using NumPy.
2.Convert the dataset to a Pandas DataFrame.
3.Use the Styler's highlight_max and highlight_min functions to mark the maximum and minimum values in each column.
4.Display the styled DataFrame.
PROGRAM:
import numpy as np
import pandas as pd
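The program above stops at the imports. A minimal hedged sketch of the remaining steps (the dataset shape and highlight colors are assumptions):
np.random.seed(0)  # For reproducibility
df = pd.DataFrame(np.random.randint(1, 100, size=(10, 5)),
                  columns=['A', 'B', 'C', 'D', 'E'])
print("Original DataFrame:")
print(df)
# Highlight the maximum value in each column in yellow and the minimum in light blue
styled = df.style.highlight_max(color='yellow').highlight_min(color='lightblue')
styled  # In a Jupyter notebook this renders the highlighted table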
Result:
Thus the python program to combine Numpy and Pandas data frame to create dataset and perform
the Highlight Max and Min values with output has been successfully verified
3.To combine Numpy and Pandas data frame to create dataset and perform the following C) To
generate Background gradient color variation
Aim:
To combine Numpy and Pandas data frame to create dataset and perform the following
to generate Background gradient color variation
Algorithm:
1.Generate a dataset using NumPy.
2.Convert the dataset to a Pandas DataFrame.
3.Apply a background gradient color variation using an algorithm.
4.Background Gradient: The background_gradient function from Pandas is used to apply
color gradients based on the values in each column.
5.Colormap: The cmap argument specifies the color map to be used. In this example, we use the
Viridis colormap, which ranges from dark blue for low values to yellow for high values.
Other colormaps (like 'coolwarm', 'plasma', etc.) can also be used.
6.Scaling: Each column’s values are scaled between the minimum and maximum values of that
column, and a gradient is applied based on this range.
PROGRAM:
import numpy as np
import pandas as pd
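The program above stops at the imports. A hedged sketch of the remaining steps, reconstructed to match the output below (with seed 42, random integers from 1 to 100 reproduce the values shown):
np.random.seed(42)  # Reproduces the values in the output
df = pd.DataFrame(np.random.randint(1, 101, size=(10, 5)),
                  columns=['A', 'B', 'C', 'D', 'E'])
print("Original DataFrame:")
print(df)
# Apply a background gradient with the viridis colormap; each column is
# scaled between its own minimum and maximum (algorithm steps 4-6)
styled = df.style.background_gradient(cmap='viridis')
styled  # In a Jupyter notebook this renders the gradient-colored table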
OUTPUT:
Original DataFrame:
A B C D E
0 52 93 15 72 61
1 21 83 87 75 75
2 88 24 3 22 53
3 2 88 30 38 2
4 64 60 21 33 76
5 58 22 89 49 91
6 59 42 92 60 80
7 15 62 62 47 62
8 51 55 64 3 51
9 7 21 73 39 18
A B C D E
0 52 93 15 72 61
1 21 83 87 75 75
2 88 24 3 22 53
3 2 88 30 38 2
4 64 60 21 33 76
5 58 22 89 49 91
6 59 42 92 60 80
7 15 62 62 47 62
8 51 55 64 3 51
9 7 21 73 39 18
Result:
Thus the python program to combine Numpy and Pandas data frames to create a dataset and generate Background gradient color variation has been successfully verified.
4.Explore multivariable dataset, To perform any four data cleaning method and visualize Bar
chart.
Aim:
To explore a multivariable dataset, perform any four data cleaning methods, and visualize a Bar chart.
Algorithm:
1.Input: A multivariable dataset (DataFrame).
2.Handling Missing Data:
For each numerical column, fill missing values with the median.
For each categorical column, fill missing values with the mode.
3.Remove Duplicates:
Check for and drop duplicate rows.
4.Convert Data Types:Ensure appropriate data types for each column (e.g., convert salary to
integer).
5.Handle Outliers:
Define outlier thresholds and cap values as necessary.
6.Visualization:Group the data by a relevant variable (e.g., city) and plot a bar chart to
compare the results (e.g., average salary by city).
Visualize the Data: Create a bar chart to visualize some aspect of the cleaned data.
Use an Algorithm to structure the workflow for cleaning and visualization.
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Titanic dataset that ships with seaborn (assumed; the columns in
# the output below match this dataset)
df = sns.load_dataset('titanic')
# 1. Handle missing values - Fill missing 'age' values with the median
df['age'].fillna(df['age'].median(), inplace=True)
# 3. Remove duplicates
df.drop_duplicates(inplace=True)
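The program above covers only two of the cleaning steps. A hedged sketch of the remaining steps from the algorithm (the column choices are assumptions based on the output below):
# 2. Fill missing categorical values with the mode
df['embarked'].fillna(df['embarked'].mode()[0], inplace=True)
# 4. Convert data types / encode categoricals (the output suggests one-hot encoding of 'sex')
df = pd.get_dummies(df, columns=['sex'], drop_first=True)
# 5. Handle outliers: cap 'fare' at its 99th percentile
df['fare'] = df['fare'].clip(upper=df['fare'].quantile(0.99))
print("Cleaned Dataset:")
print(df.head())
# 6. Bar chart: average fare by passenger class
df.groupby('pclass')['fare'].mean().plot(kind='bar', color='steelblue')
plt.xlabel('Passenger Class')
plt.ylabel('Average Fare')
plt.title('Average Fare by Passenger Class')
plt.show()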
OUTPUT:
Cleaned Dataset:
survived pclass age sibsp parch fare adult_male alone sex_male
\
0 0 3 22.0 1 0 7.2500 True False True
1 1 1 38.0 1 0 71.2833 False False False
2 1 3 26.0 0 0 7.9250 False True False
3 1 1 35.0 1 0 53.1000 False False False
4 0 3 35.0 0 0 8.0500 True True True
Result:
Thus the python program to explore a multivariable dataset, perform four data cleaning methods, and visualize a Bar chart has been successfully verified.
5.Explore using seaborn to load the dataset three variable (Username, Tweet, Location) tweets comment review for #tag Jallikattu Protest.
A) Perform scatter plot using different location tweets
Aim:
To Explore using seaborn to load the dataset three variable (Username, Tweet, Location) tweets comment review for #tag Jallikattu Protest
Algorithm:
1.Input: A dataset containing tweet data (Username, Tweet, Location) for the #JallikattuProtest.
2.Preprocessing:Use value_counts() on the Location column to calculate the count of
tweets for each location.
3.Scatter Plot Creation:Plot the number of tweets on the y-axis and the corresponding locations
on the x-axis.
4.Customization:Customize the plot with labels, colors, and marker size to enhance
visualization.
5.Output:Display a scatter plot that shows the distribution of tweets from different locations.
PROGRAM:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
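The program above stops at the imports. A sketch of the remaining steps; the tweet data is reconstructed from the output below:
data = {
    'Username': ['user1', 'user2', 'user3', 'user4', 'user5', 'user6', 'user7', 'user8'],
    'Tweet': ['We support #Jallikattu at @Marina',
              '#SaveTNfarmers protest @Chennai',
              '#Jallikattu is our right! #TamilNadu',
              'Proud of #Jallikattu culture @SaveTNfarmers',
              '#SaveTNfarmers and #Jallikattu go hand in hand',
              '@Marina #TamilNadu #Jallikattu',
              'Protect our culture #Jallikattu #TamilNadu',
              '#Jallikattu @Chennai #TamilNadu'],
    'Location': ['Chennai', 'Madurai', 'Coimbatore', 'Chennai',
                 'Madurai', 'Trichy', 'Coimbatore', 'Chennai']
}
df = pd.DataFrame(data)
print("Tweet DataFrame:")
print(df)
# Count tweets per location and draw the scatter plot
location_counts = df['Location'].value_counts()
plt.figure(figsize=(8, 5))
sns.scatterplot(x=location_counts.index, y=location_counts.values, s=200, color='crimson')
plt.xlabel('Location')
plt.ylabel('Number of Tweets')
plt.title('Tweet Distribution by Location (#Jallikattu Protest)')
plt.show()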
OUTPUT
Tweet DataFrame:
Username Tweet Location
0 user1 We support #Jallikattu at @Marina Chennai
1 user2 #SaveTNfarmers protest @Chennai Madurai
2 user3 #Jallikattu is our right! #TamilNadu Coimbatore
3 user4 Proud of #Jallikattu culture @SaveTNfarmers Chennai
4 user5 #SaveTNfarmers and #Jallikattu go hand in hand Madurai
5 user6 @Marina #TamilNadu #Jallikattu Trichy
6 user7 Protect our culture #Jallikattu #TamilNadu Coimbatore
7 user8 #Jallikattu @Chennai #TamilNadu Chennai
Result:
Thus the python program to Explore using seaborn to load the dataset three variable( Username,
Tweet, Location) tweets comment review for #tag Jallikattu Protest as Perform scatter plot using
different location tweet has been successfully verified
5.Explore using seaborn to load the dataset three variable( Username, Tweet,
Location) tweets comment review for #tag Jallikattu Protest.
B)Perform Bubble chart for # and @ tag
Aim:
To Explore using seaborn to load the dataset three variable( Username, Tweet, Location)
tweets comment review for #tag Jallikattu Protest To Perform Bubble chart for # and @ tag
Algorithm:
1.input:A dataset containing tweet data (Username, Tweet, Location) for the #JallikattuProtest, with
hashtags and mentions in the Tweet column.
2.Extract Tags:Define functions to extract hashtags (#) and mentions (@) using regular expressions
from each tweet.
3.Count Occurrences:For both hashtags and mentions, count their occurrences and store the frequency.
4.Visualize with Bubble Chart:
Use Seaborn's scatterplot function to plot a bubble chart, where:
The x-axis represents unique hashtags or mentions.
The size of each bubble represents the frequency (i.e., the count) of that hashtag or mention.
Customize the chart with labels, titles, and bubble sizes.
PROGRAM
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from collections import Counter
df = pd.DataFrame(data)
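The data dictionary is not shown; assuming it is the same tweet dataset as in part A, a hedged sketch of the tag extraction and bubble chart:
# Extract all # and @ tags from the tweets and count them
def extract_tags(tweets, pattern):
    tags = []
    for tweet in tweets:
        tags.extend(re.findall(pattern, tweet))
    return Counter(tags)

hashtag_counts = extract_tags(df['Tweet'], r'#\w+')
mention_counts = extract_tags(df['Tweet'], r'@\w+')
tag_counts = hashtag_counts + mention_counts
tag_df = pd.DataFrame({'Tag': list(tag_counts.keys()),
                       'Count': list(tag_counts.values())})
# Bubble chart: bubble size encodes how often each tag occurs
plt.figure(figsize=(10, 6))
sns.scatterplot(data=tag_df, x='Tag', y='Count', size='Count',
                sizes=(100, 1000), legend=False)
plt.title('Bubble Chart of # and @ Tags (#Jallikattu Protest)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()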
Result:
Thus the python program to Perform Bubble chart for # and @ tag has been successfully verified.
6.Create a Pie chart for Student Result Analysis by using pie plot in python.
AIM:
To Create a Pie chart for Student Result Analysis by using pie plot in python. Plot segregation
will be Distinction (Greater than or Equal to 8.5 CGPA) and First Class (Greater than 6.5 CGPA).
Algorithm:
Step 1: Create a dataset of student names and their CGPA values.
Step 2: Segregate students into Distinction (CGPA >= 8.5) and First Class (CGPA > 6.5).
Step 3: Count the number of students in each category.
Step 4: Plot the counts as a pie chart with labels and percentage values.
Program:
df = pd.DataFrame(data)
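The program above is a single line; a minimal hedged sketch with illustrative CGPA values (all names and numbers are assumptions):
import pandas as pd
import matplotlib.pyplot as plt
data = {
    'Name': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8'],
    'CGPA': [9.1, 7.2, 8.6, 6.8, 9.4, 7.9, 8.5, 6.6]
}
df = pd.DataFrame(data)
# Segregate: Distinction (>= 8.5 CGPA) and First Class (> 6.5 and < 8.5 CGPA)
distinction = (df['CGPA'] >= 8.5).sum()
first_class = ((df['CGPA'] > 6.5) & (df['CGPA'] < 8.5)).sum()
plt.pie([distinction, first_class],
        labels=['Distinction (>= 8.5 CGPA)', 'First Class (> 6.5 CGPA)'],
        autopct='%1.1f%%', colors=['gold', 'skyblue'])
plt.title('Student Result Analysis')
plt.show()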
Result:
Thus the python program to Create a Pie chart for Student Result Analysis by using pie plot in
python. Plot segregation will be Distinction and First Class has been successfully verified
7.Create a Lollipop chart for Festival Shopping dataset of your own(20 rows and 5 to 10
columns)
Aim:
To Create a Lollipop chart for Festival Shopping dataset of your own.
Algorithm:
1.Input: A dataset with customer shopping details including total expenditure.
2.Sort Data (optional): Sort the dataset by total expenditure to make the chart visually
organized.
3.Create Lollipop Chart: Use Matplotlib's stem() to plot customer IDs on the x-axis and total
expenditure on the y-axis.
Customize marker size, colors, and line format for better visualization.
4.Customize the Chart:Add axis labels, rotate the x-axis labels for readability, and set a chart
title.
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
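# Assumed reconstruction: a synthetic Festival Shopping dataset with 20 rows
# (the column names are illustrative, except 'Total Sales', which the sort
# step further down requires)
np.random.seed(0)  # For reproducibility
data = {
    'Customer_ID': [f'C{i:02d}' for i in range(1, 21)],
    'Festival': np.random.choice(['Diwali', 'Pongal', 'Christmas'], 20),
    'Items_Bought': np.random.randint(1, 15, 20),
    'Discount_Percent': np.random.uniform(0, 30, 20).round(2),
    'Total Sales': np.random.randint(500, 10000, 20)
}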
df = pd.DataFrame(data)
# Display the DataFrame
print("Festival Shopping Dataset:")
print(df)
# Step 2: Sort the dataset based on 'Total Sales' for better visualization
df_sorted = df.sort_values(by='Total Sales', ascending=False)
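# Hedged sketch of the lollipop chart itself, using Matplotlib's stem()
positions = np.arange(len(df_sorted))
plt.figure(figsize=(12, 6))
plt.stem(positions, df_sorted['Total Sales'])
plt.xticks(positions, df_sorted['Customer_ID'], rotation=45)
plt.xlabel('Customer ID')
plt.ylabel('Total Sales (INR)')
plt.title('Festival Shopping: Total Sales per Customer')
plt.tight_layout()
plt.show()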
Result:
Thus the python program to Create a Lollipop chart for Festival Shopping dataset of your
own(20 rows and 5 to 10 columns) has been successfully verified
8.To perform the following data transformation techniques of your own dataset. (20 Rows and 5 Columns)
A) Removing Null Values (NaN)
Aim:
To perform the following data transformation techniques on your own dataset: Removing Null Values (NaN)
Algorithm:
1.Input: A dataset with potential null values (NaN).
2.Identify Null Values:Use pd.DataFrame.isnull() or pd.DataFrame.isna() to identify
null values in the dataset.
3.Remove Null Values:Use pd.DataFrame.dropna() to remove:
Rows with any null values by default.
Columns with any null values by specifying axis=1.
PROGRAM:
import pandas as pd
import numpy as np
# Create a DataFrame with 20 rows and 5 columns, with some NaN values
data = {
'A': np.random.randint(1, 100, 20),
'B': np.random.randint(1, 100, 20),
'C': np.random.choice([np.nan, 50, 60, 70], 20),
'D': np.random.choice([np.nan, 80, 90], 20),
'E': np.random.randint(1, 100, 20)
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
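The program above only builds the DataFrame; a sketch of the removal step from the algorithm:
# Remove rows containing any null value
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with NaN:")
print(df_dropped_rows)
# Remove columns containing any null value
df_dropped_cols = df.dropna(axis=1)
print("\nDataFrame after dropping columns with NaN:")
print(df_dropped_cols)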
OUTPUT:
Original DataFrame:
A B C D E
0 62 8 50.0 80.0 43
1 40 88 70.0 90.0 29
2 85 63 NaN 90.0 36
3 80 11 60.0 NaN 13
4 82 81 70.0 90.0 32
5 53 8 70.0 90.0 71
6 24 35 50.0 80.0 59
7 26 35 60.0 80.0 86
8 89 33 60.0 NaN 28
9 60 5 NaN 90.0 66
10 41 41 60.0 90.0 42
11 29 28 NaN 90.0 45
12 15 7 60.0 NaN 62
13 45 73 50.0 NaN 57
14 65 72 60.0 80.0 6
15 89 12 NaN NaN 28
16 71 34 NaN 90.0 28
17 9 33 50.0 90.0 44
18 88 48 60.0 NaN 84
19 1 23 60.0 90.0 30
Result:
Thus the python program to perform the following data transformation techniques of your own dataset (20 Rows and 5 Columns), A) Removing Null Values (NaN), has been successfully verified.
8.Write a python To perform the following data transformation techniques of your own dataset.
(20 Rows and 5 Columns)
B)Drop Columns
Aim:
To write a python program to perform the following data transformation techniques on your own dataset (20 Rows and 5 Columns): Drop Columns
Algorithm:
1.Input: A dataset (DataFrame) and a list of column names to drop.
2.Identify Columns: Determine which columns need to be removed based on business rules or
analysis requirements.
3.Drop Columns:Use pd.DataFrame.drop(columns=<list_of_columns>) to remove
specified columns from the dataset.
PROGRAM:
import pandas as pd
import numpy as np
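# Assumed reconstruction: a 20x5 synthetic dataset (the column names are
# illustrative, since the original data dictionary is not shown)
np.random.seed(0)  # For reproducibility
data = {
    'A': np.random.randint(1, 100, 20),
    'B': np.random.randint(1, 100, 20),
    'C': np.random.randint(1, 100, 20),
    'D': np.random.randint(1, 100, 20),
    'E': np.random.randint(1, 100, 20)
}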
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
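A sketch of the drop step itself (the choice of columns is an assumption):
df_dropped = df.drop(columns=['D', 'E'])
print("\nDataFrame after dropping columns ['D', 'E']:")
print(df_dropped)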
Result:
Thus the python To perform the following data transformation techniques of your own dataset.
(20 Rows and 5 Columns) as Drop Columns has been successfully verified
8.perform the following data transformation techniques of your own dataset. (20 Rows and 5 Columns)
C)Merging database style dataframes
Aim:
To perform the following data transformation techniques of your own dataset. (20 Rows and 5
Columns)Merging database style dataframes
Algorithm:
1.Input: Two datasets (DataFrames) and a common key (column) for merging.
2.Identify Merge Key: Determine the column(s) that will be used as the join key.
3.Choose Merge Type:
Inner Join: how='inner' (only matching rows).
Left Join: how='left' (all rows from the left DataFrame).
Right Join: how='right' (all rows from the right DataFrame).
Outer Join: how='outer' (all rows from both DataFrames).
4.Merge DataFrames: Use pd.merge(df1, df2, on='common_column', how='join_type')
to perform the merge.
5.Output:The merged DataFrame containing combined data based on the specified join type.
PROGRAM:
import pandas as pd
import numpy as np
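# Assumed reconstruction matching the shape of the output below
# (the random Age and Salary values will differ from those shown)
np.random.seed(0)
df1 = pd.DataFrame({
    'ID': np.arange(1, 21),
    'Name': [f'Name{i}' for i in range(1, 21)],
    'Age': np.random.randint(20, 50, 20)
})
df2 = pd.DataFrame({
    'ID': np.arange(1, 16),
    'Salary': np.random.randint(30000, 70000, 15),
    'Department': [f'Department{i % 3}' for i in range(15)]
})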
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
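A sketch of the merge itself (an inner join on the common 'ID' column is assumed):
merged = pd.merge(df1, df2, on='ID', how='inner')
print("\nMerged DataFrame (inner join):")
print(merged)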
OUTPUT:
DataFrame 1:
ID Name Age
0 1 Name1 47
1 2 Name2 35
2 3 Name3 45
3 4 Name4 35
4 5 Name5 44
5 6 Name6 39
6 7 Name7 47
7 8 Name8 36
8 9 Name9 21
9 10 Name10 20
10 11 Name11 35
11 12 Name12 49
12 13 Name13 31
13 14 Name14 24
14 15 Name15 24
15 16 Name16 46
16 17 Name17 42
17 18 Name18 28
18 19 Name19 28
19 20 Name20 22
DataFrame 2:
ID Salary Department
0 1 45151 Department0
1 2 31154 Department1
2 3 34499 Department2
3 4 36295 Department0
4 5 42183 Department1
5 6 59299 Department2
6 7 42874 Department0
7 8 62711 Department1
8 9 35539 Department2
9 10 32557 Department0
10 11 68360 Department1
11 12 46482 Department2
12 13 32200 Department0
13 14 32961 Department1
14 15 51357 Department2
Result:
Thus the python program to perform the following data transformation techniques of your own dataset (20 Rows and 5 Columns), Merging database style data frames, has been successfully verified.
9.To perform dataframe merge function (inner, left and outer join) using simple dataset.
Aim:
To perform dataframe merge function (inner, left and outer join) using simple dataset.
Algorithm:
1.Input: Two datasets (DataFrames) and a common key (column) for merging.
2.Identify Merge Key: Determine the column(s) to be used as the join key.
3.Choose Merge Type:
Inner Join: how='inner' (only matching rows from both DataFrames).
Left Join: how='left' (all rows from the left DataFrame).
Outer Join: how='outer' (all rows from both DataFrames).
4.Merge DataFrames: Use pd.merge(df1, df2, on='common_column', how='join_type') to perform the merge.
5.Output:The merged DataFrame containing combined data based on the specified join type.
PROGRAM:
import pandas as pd
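The program above stops at the import. A hedged sketch of the remaining steps; df2 is reconstructed from the output below, while df1 (whose printout is not shown) is assumed:
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
})
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6, 7],
    'Salary': [70000, 80000, 90000, 100000, 110000],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales']
})
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
# Inner join: only IDs present in both DataFrames
print("\nInner Join:")
print(pd.merge(df1, df2, on='ID', how='inner'))
# Left join: all rows from df1, matched where possible
print("\nLeft Join:")
print(pd.merge(df1, df2, on='ID', how='left'))
# Outer join: all rows from both DataFrames
print("\nOuter Join:")
print(pd.merge(df1, df2, on='ID', how='outer'))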
OUTPUT:
DataFrame 2:
ID Salary Department
0 3 70000 HR
1 4 80000 IT
2 5 90000 Finance
3 6 100000 Marketing
4 7 110000 Sales
Result:
Thus the python program to perform dataframe merge function (inner, left and outer join) using simple dataset has been successfully verified.
10.Explore simple dataset and perform Transformation techniques such as data deduplication, Replace values, Handling missing Data, Backward and Forward filling.
Aim:
To Explore simple dataset and perform Transformation techniques such as data deduplication,
Replace values, Handling missing Data, Backward and Forward filling.
Algorithm:
1.Data Deduplication:
Input: A dataset (DataFrame).
Process: Use df.drop_duplicates() to remove duplicate rows.
Output: A DataFrame without duplicates.
2.Replace Values:
Input: A DataFrame and a dictionary of replacements.
Process: Use df.replace({column_name: {old_value: new_value}}) to replace
specific values.
Output: A DataFrame with replaced values.
3.Handling Missing Data:
Input: A DataFrame with missing values.
Process: Use df.dropna() to remove rows with missing values.
Output: A DataFrame without missing data.
Program
import pandas as pd
import numpy as np
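# Reconstructed from the output below: a small dataset with duplicate rows
# and missing values
df = pd.DataFrame({
    'ID': [1, 2, 2, 4, 5, 6, 7, 7, 9],
    'Name': ['Alice', 'Bob', 'Bob', 'David', 'Eva', np.nan, 'George', 'George', 'Ivy'],
    'Age': [25, 30, 30, 40, np.nan, 50, np.nan, 50, 60],
    'Salary': [50000, 60000, 60000, np.nan, 70000, 80000, 80000, np.nan, 90000]
})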
print("Original DataFrame:")
print(df)
# Replace Values: Replace NaN with 'Unknown' in 'Name', 0 in 'Age' and 'Salary'
df_replaced = df.copy()
df_replaced['Name'].fillna('Unknown', inplace=True)
df_replaced['Age'].fillna(0, inplace=True)
df_replaced['Salary'].fillna(0, inplace=True)
print("\nDataFrame after Replacing Missing Values:")
print(df_replaced)
# Forward Filling: Fill missing values with the previous value in the column
df_filled_forward = df.fillna(method='ffill')
print("\nDataFrame after Forward Filling Missing Values:")
print(df_filled_forward)
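# Hedged sketch of the remaining techniques named in the aim:
# Data Deduplication: drop exact duplicate rows
df_deduplicated = df.drop_duplicates()
print("\nDataFrame after Removing Duplicates:")
print(df_deduplicated)
# Backward Filling: fill missing values with the next value in the column
df_filled_backward = df.bfill()
print("\nDataFrame after Backward Filling Missing Values:")
print(df_filled_backward)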
output:
Original DataFrame:
ID Name Age Salary
0 1 Alice 25.0 50000.0
1 2 Bob 30.0 60000.0
2 2 Bob 30.0 60000.0
3 4 David 40.0 NaN
4 5 Eva NaN 70000.0
5 6 NaN 50.0 80000.0
6 7 George NaN 80000.0
7 7 George 50.0 NaN
8 9 Ivy 60.0 90000.0
Result:
Thus the python program to Explore simple dataset and perform Transformation techniques such as data deduplication, Replace values, Handling missing Data, Backward and Forward filling has been successfully verified.
11.To perform hypothesis testing using stats library of your own dataset Explore T test
Aim:
To perform hypothesis testing using stats library of your own dataset Explore T test
Algorithm:
Step 1: Define the null hypothesis (H0) and the alternative hypothesis (H1).
H0: μ1 = μ2 (The means of the two groups are equal)
H1: μ1 ≠ μ2 (The means of the two groups are not equal)
Step 2: Set the significance level (alpha). Commonly, alpha = 0.05.
Step 3: Calculate the T-statistic and P-value using:
t_statistic, p_value = stats.ttest_ind(sample1, sample2)
Step 4: Compare the P-value with alpha.
If P-value < alpha: Reject H0 (there is a significant difference)
If P-value ≥ alpha: Fail to reject H0 (no significant difference)
Output: Present T-statistic, P-value, and conclusion.
Program:
# To install the SciPy library: pip install scipy
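# Assumed reconstruction: imports and synthetic data. With seed 0, the
# Group1 values match the output below; Group2 is assumed to be drawn
# the same way from the continuing random stream.
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(0)  # For reproducibility
df = pd.DataFrame({
    'Group1': 50 + 10 * np.random.randn(100),  # mean 50, standard deviation 10
    'Group2': 50 + 10 * np.random.randn(100)
})
group1, group2 = df['Group1'], df['Group2']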
print("Dataset:")
print(df.head())
# Perform T-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print("\nT-test Results:")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
output:
Dataset:
Group1 Group2
0 67.640523 56.549474
1 54.001572 58.781625
2 59.787380 46.122143
3 72.408932 35.192035
4 68.675580 51.520879
T-test Results:
T-statistic: 0.8897019207505096
P-value: 0.3773014533943507
Fail to reject the null hypothesis: There is no significant difference between
the two groups.
Result:
Thus the python program to perform hypothesis testing using stats library of your own dataset Explore
T test has been successfully verified
12.Explore and visualize of your own dataset/ data frame and perform numerical summaries and
spread level.
A) Floating values into two columns from single variable
Aim:
To explore and visualize your own dataset/data frame and perform numerical summaries and spread level: Floating values into two columns from single variable
Algorithm:
Step 1: Generate a synthetic dataset with multiple columns, including at least one floating-point
column.
Step 2: Use descriptive statistics methods to get numerical summaries (mean, median, std, etc.).
Step 3: Split the floating-point column into two separate columns based on its value (e.g., integer and
decimal parts).
Step 4: Visualize the distributions of the columns using histograms or box plots.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
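# Assumed reconstruction matching the output below: with seed 42,
# np.random.uniform(10, 100, 20) reproduces the 'Value' column shown
np.random.seed(42)
df = pd.DataFrame({'Value': np.random.uniform(10, 100, 20)})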
print("Original DataFrame:")
print(df)
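# Split the floating-point column into integer and decimal parts
# (rounding the decimal part to 2 places matches the summary below)
df['Integer_Part'] = df['Value'].astype(int)
df['Decimal_Part'] = (df['Value'] - df['Integer_Part']).round(2)
# Visualize the distribution of each part (a hedged sketch of the plotting step)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(df['Integer_Part'], kde=True, ax=axes[0], color='steelblue')
axes[0].set_title('Distribution of Integer Part')
sns.histplot(df['Decimal_Part'], kde=True, ax=axes[1], color='salmon')
axes[1].set_title('Distribution of Decimal Part')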
plt.tight_layout()
plt.show()
# Numerical Summaries
print("\nNumerical Summaries:")
print(df.describe())
# Spread Level: Calculate the range (max - min) for each part
range_integer = df['Integer_Part'].max() - df['Integer_Part'].min()
range_decimal = df['Decimal_Part'].max() - df['Decimal_Part'].min()
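# Report the spread of each part
print(f"\nRange of Integer Part: {range_integer}")
print(f"Range of Decimal Part: {range_decimal}")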
output:
Original DataFrame:
Value
0 43.708611
1 95.564288
2 75.879455
3 63.879264
4 24.041678
5 24.039507
6 15.227525
7 87.955853
8 64.100351
9 73.726532
10 11.852604
11 97.291887
12 84.919838
13 29.110520
14 26.364247
15 26.506406
16 37.381802
17 57.228079
18 48.875052
19 36.210623
Numerical Summaries:
Value Integer_Part Decimal_Part
count 20.000000 20.000000 20.000000
mean 51.193206 50.700000 0.493500
std 27.691818 27.554921 0.331413
min 11.852604 11.000000 0.040000
25% 26.470866 26.000000 0.225000
50% 46.291831 45.500000 0.445000
75% 74.264763 73.500000 0.857500
max 97.291887 97.000000 0.960000
Result:
Thus the python program to explore and visualize your own dataset/data frame and perform numerical summaries and spread level (Floating values into two columns from single variable) has been successfully verified.
12. Explore and visualize of your own dataset/ data frame and perform numerical summaries and
spread level.
B) Perform Descriptive Analysis
Aim:
To Explore and visualize of your own dataset/ data frame and perform numerical
summaries and spread level as Perform Descriptive Analysis
Algorithm:
Step 1: Generate a synthetic dataset with at least 20 rows and 3-5 columns, including both numerical
and categorical data.
Step 2: Use descriptive statistics methods (like describe(), mean(), median(), etc.) to summarize
numerical data.
Step 3: Visualize the distributions of numerical columns using histograms or box plots.
Step 4: Visualize relationships between variables using scatter plots or pair plots.
PROGRAM:
#1. Create a Sample Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = {
    'Age': np.random.randint(18, 70, size=50),  # Random ages between 18 and 70
    'Salary': np.random.uniform(30000, 100000, size=50),  # Random salaries between 30k and 100k
    'Department': np.random.choice(['HR', 'Engineering', 'Marketing'], size=50)  # Random departments
}
df = pd.DataFrame(data)
print("Dataset:")
print(df.head())
#2. Perform Descriptive Analysis
# Descriptive statistics for numerical variables
numerical_summary = df.describe()
print("\nNumerical Summary:")
print(numerical_summary)
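# Categorical summary: counts per department (matches the output below)
print("\nCategorical Summary:")
print(df['Department'].value_counts())
# Spread level: range (max - min) of each numerical column
print(f"\nRange of Age: {df['Age'].max() - df['Age'].min()}")
print(f"Range of Salary: {df['Salary'].max() - df['Salary'].min()}")
# Visualize the distributions (a hedged sketch of the plotting step)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(df['Age'], kde=True, ax=axes[0])
axes[0].set_title('Age Distribution')
sns.boxplot(y=df['Salary'], ax=axes[1])
axes[1].set_title('Salary Spread')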
plt.tight_layout()
plt.show()
output:
Numerical Summary:
Age Salary
count 50.00000 50.000000
mean 43.82000 63883.002511
std 15.05187 22118.144018
min 19.00000 30386.548199
25% 32.25000 43766.624704
50% 41.50000 62736.529135
75% 56.00000 84208.756689
max 69.00000 99082.085562
Categorical Summary:
Department
Marketing 23
HR 15
Engineering 12
Name: count, dtype: int64
Range of Age: 50
Range of Salary: 68695.53736338405
Result:
Thus the python program to Explore and visualize of your own dataset/ data frame and perform
numerical summaries and spread level as Perform Descriptive Analysis has been successfully
verified
12.Explore and visualize of your own dataset/ data frame and perform numerical summaries and spread
level.
C) Perform Percentage Table both row and column.
Aim:
To Explore and visualize of your own dataset/ data frame and perform numerical summaries
and spread level as Perform Percentage Table both row and column
ALGORITHM:
Step 1: Generate a synthetic dataset with at least 20 rows and 3-5 columns, including categorical data.
Step 2: Perform numerical summaries to get counts of each category.
Step 3: Calculate percentage for each category in both rows and columns.
Step 4: Visualize the distribution of categorical data.
Program:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Department': ['HR', 'HR', 'Engineering', 'Engineering', 'Marketing',
'Marketing'],
'Age_Group_18_25': [5, 10, 8, 12, 7, 6],
'Age_Group_26_35': [15, 20, 25, 30, 22, 18],
'Age_Group_36_50': [10, 5, 15, 10, 10, 20],
'Age_Group_51_70': [0, 0, 2, 3, 1, 4]
}
df = pd.DataFrame(data)
# Set 'Department' as the index
df.set_index('Department', inplace=True)
print("Original DataFrame:")
print(df)
# Calculate row and column percentages
row_percentages = df.div(df.sum(axis=1), axis=0) * 100
column_percentages = df.div(df.sum(axis=0), axis=1) * 100
print("\nRow Percentages:")
print(row_percentages)
print("\nColumn Percentages:")
print(column_percentages)
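A sketch of the visualization step from the algorithm (the stacked-bar choice is an assumption):
import matplotlib.pyplot as plt
# Stacked bar chart of the row percentages per department
row_percentages.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='viridis')
plt.ylabel('Percentage (%)')
plt.title('Age Group Distribution per Department (Row Percentages)')
plt.tight_layout()
plt.show()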
output:
Original DataFrame:
Age_Group_18_25 Age_Group_26_35 Age_Group_36_50 \
Department
HR 5 15 10
HR 10 20 5
Engineering 8 25 15
Engineering 12 30 10
Marketing 7 22 10
Marketing 6 18 20
Age_Group_51_70
Department
HR 0
HR 0
Engineering 2
Engineering 3
Marketing 1
Marketing 4
Row Percentages:
Age_Group_18_25 Age_Group_26_35 Age_Group_36_50 \
Department
HR 16.666667 50.000000 33.333333
HR 28.571429 57.142857 14.285714
Engineering 16.000000 50.000000 30.000000
Engineering 21.818182 54.545455 18.181818
Marketing 17.500000 55.000000 25.000000
Marketing 12.500000 37.500000 41.666667
Age_Group_51_70
Department
HR 0.000000
HR 0.000000
Engineering 4.000000
Engineering 5.454545
Marketing 2.500000
Marketing 8.333333
Column Percentages:
Age_Group_18_25 Age_Group_26_35 Age_Group_36_50 \
Department
HR 10.416667 11.538462 14.285714
HR 20.833333 15.384615 7.142857
Engineering 16.666667 19.230769 21.428571
Engineering 25.000000 23.076923 14.285714
Marketing 14.583333 16.923077 14.285714
Marketing 12.500000 13.846154 28.571429
Age_Group_51_70
Department
HR 0.0
HR 0.0
Engineering 20.0
Engineering 30.0
Marketing 10.0
Marketing 40.0
Result:
Thus the python program to Explore and visualize of your own dataset/ data frame and perform
numerical summaries and spread level as Perform Percentage Table both row and column has been
successfully verified
13.Perform Time Series Analysis and apply various visualization methods for Internet Traffic Time
Dataset. (Create own data with minimum 5 columns and 20 rows)
Aim:
To create Time Series Analysis and apply various visualization methods for Internet Traffic
Time Dataset. (Create own data with minimum 5 columns and 20 rows)
Algorithm:
Step 1: Generate a synthetic dataset with at least 20 rows and 5 columns, including a timestamp.
Step 2: Convert the timestamp to a datetime format for analysis.
Step 3: Calculate basic statistics (mean, max, min) for the traffic metrics.
Step 4: Visualize the data using line plots, bar plots, and scatter plots.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import autocorrelation_plot
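# Assumed reconstruction: 20 days of synthetic traffic data (the value ranges
# are inferred from the descriptive statistics below; exact values will differ)
np.random.seed(0)
dates = pd.date_range(start='2024-01-01', periods=20, freq='D')
df = pd.DataFrame({
    'Page_Views': np.random.randint(190, 300, 20),
    'Unique_Visitors': np.random.randint(80, 165, 20),
    'New_Signups': np.random.randint(5, 35, 20),
    'Session_Duration': np.random.uniform(90, 290, 20),
    'Bounce_Rate': np.random.uniform(28, 64, 20)
}, index=dates)
df.index.name = 'Date'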
print("Dataset:")
print(df.head())
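# Line plots of each metric over time (first panel; the rest follow below)
plt.figure(figsize=(14, 10))
plt.subplot(3, 2, 1)
plt.plot(df.index, df['Page_Views'], marker='o', color='blue')
plt.title('Page Views Over Time')
plt.xlabel('Date')
plt.ylabel('Page Views')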
# Unique Visitors
plt.subplot(3, 2, 2)
plt.plot(df.index, df['Unique_Visitors'], marker='o', color='green')
plt.title('Unique Visitors Over Time')
plt.xlabel('Date')
plt.ylabel('Unique Visitors')
# New Signups
plt.subplot(3, 2, 3)
plt.plot(df.index, df['New_Signups'], marker='o', color='orange')
plt.title('New Signups Over Time')
plt.xlabel('Date')
plt.ylabel('New Signups')
# Session Duration
plt.subplot(3, 2, 4)
plt.plot(df.index, df['Session_Duration'], marker='o', color='red')
plt.title('Session Duration Over Time')
plt.xlabel('Date')
plt.ylabel('Session Duration (seconds)')
# Bounce Rate
plt.subplot(3, 2, 5)
plt.plot(df.index, df['Bounce_Rate'], marker='o', color='purple')
plt.title('Bounce Rate Over Time')
plt.xlabel('Date')
plt.ylabel('Bounce Rate (%)')
plt.tight_layout()
plt.show()
# Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe())
# Seasonal Decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
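# Decompose one metric into trend, seasonal, and residual components
# (a weekly period is assumed for this short daily series)
decomposition = seasonal_decompose(df['Page_Views'], model='additive', period=7)
decomposition.plot()
plt.show()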
output:
Bounce_Rate
Date
2024-01-01 55.483457
2024-01-02 52.840141
2024-01-03 48.926675
2024-01-04 28.122861
2024-01-05 32.604629
Descriptive Statistics:
Page_Views Unique_Visitors New_Signups Session_Duration Bounce_Rate
count 20.000000 20.000000 20.000000 20.000000 20.000000
mean 244.250000 128.250000 20.800000 193.278187 44.313501
std 29.572169 23.763916 6.708988 60.400430 11.431342
min 194.000000 82.000000 6.000000 94.620554 28.122861
25% 221.250000 107.500000 17.750000 135.832926 33.501020
50% 243.000000 130.500000 20.000000 194.830227 49.401129
75% 265.250000 146.750000 24.250000 248.604233 53.304734
max 297.000000 162.000000 34.000000 287.345998 63.296567
Result:
Thus the python program as To create Time Series Analysis and apply various visualization
methods for Internet Traffic Time Dataset. (Create own data with minimum 5 columns and 20 rows) has
been successfully verified
14.Perform EDA for Water quality dataset. All attributes are numeric variables and they are listed below:
aluminium - dangerous if greater than 2.8
ammonia - dangerous if greater than 32.5
arsenic - dangerous if greater than 0.01
barium - dangerous if greater than 2
cadmium - dangerous if greater than 0.005
Aim:
To Perform EDA for Water quality dataset. All attributes are numeric variables
Algorithm:
Step 1: Generate a synthetic dataset with specified attributes (e.g., aluminium, ammonia, etc.).
Step 2: Calculate descriptive statistics to summarize the dataset.
Step 3: Identify samples exceeding dangerous levels for each contaminant.
Step 4: Visualize the data using histograms, box plots, and scatter plots.
Present numerical summaries and visualizations.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
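# Assumed reconstruction: 50 synthetic samples with seed 42; the value ranges
# are inferred from the descriptive statistics below and bracket the danger levels
np.random.seed(42)
df = pd.DataFrame({
    'aluminium': np.random.uniform(0, 5, 50),
    'ammonia': np.random.uniform(0, 50, 50),
    'arsenic': np.random.uniform(0, 0.02, 50),
    'barium': np.random.uniform(0, 3, 50),
    'cadmium': np.random.uniform(0, 0.01, 50)
})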
print("Dataset:")
print(df.head())
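# Descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe())
# Flag samples above the dangerous thresholds given in the exercise
thresholds = {'aluminium': 2.8, 'ammonia': 32.5, 'arsenic': 0.01,
              'barium': 2.0, 'cadmium': 0.005}
for col, limit in thresholds.items():
    count = (df[col] > limit).sum()
    print(f"{col}: {count} samples above the dangerous level of {limit}")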
# Visualizations
plt.figure(figsize=(14, 12))
# Histograms
plt.subplot(3, 2, 1)
sns.histplot(df['aluminium'], kde=True, color='blue')
plt.title('Histogram of Aluminium')
plt.subplot(3, 2, 2)
sns.histplot(df['ammonia'], kde=True, color='green')
plt.title('Histogram of Ammonia')
plt.subplot(3, 2, 3)
sns.histplot(df['arsenic'], kde=True, color='orange')
plt.title('Histogram of Arsenic')
plt.subplot(3, 2, 4)
sns.histplot(df['barium'], kde=True, color='red')
plt.title('Histogram of Barium')
plt.subplot(3, 2, 5)
sns.histplot(df['cadmium'], kde=True, color='purple')
plt.title('Histogram of Cadmium')
plt.tight_layout()
plt.show()
# Boxplots
plt.figure(figsize=(14, 10))
plt.subplot(3, 2, 1)
sns.boxplot(y=df['aluminium'], color='blue')
plt.title('Boxplot of Aluminium')
plt.subplot(3, 2, 2)
sns.boxplot(y=df['ammonia'], color='green')
plt.title('Boxplot of Ammonia')
plt.subplot(3, 2, 3)
sns.boxplot(y=df['arsenic'], color='orange')
plt.title('Boxplot of Arsenic')
plt.subplot(3, 2, 4)
sns.boxplot(y=df['barium'], color='red')
plt.title('Boxplot of Barium')
plt.subplot(3, 2, 5)
sns.boxplot(y=df['cadmium'], color='purple')
plt.title('Boxplot of Cadmium')
plt.tight_layout()
plt.show()
# Scatter plots
plt.figure(figsize=(14, 10))
# Scatter plot between Aluminium and Ammonia
plt.subplot(2, 2, 1)
plt.scatter(df['aluminium'], df['ammonia'], alpha=0.7)
plt.xlabel('Aluminium')
plt.ylabel('Ammonia')
plt.title('Aluminium vs Ammonia')
# Scatter plot between Aluminium and Arsenic
plt.subplot(2, 2, 2)
plt.scatter(df['aluminium'], df['arsenic'], alpha=0.7)
plt.xlabel('Aluminium')
plt.ylabel('Arsenic')
plt.title('Aluminium vs Arsenic')
# Scatter plot between Barium and Cadmium
plt.subplot(2, 2, 3)
plt.scatter(df['barium'], df['cadmium'], alpha=0.7)
plt.xlabel('Barium')
plt.ylabel('Cadmium')
plt.title('Barium vs Cadmium')
plt.tight_layout()
plt.show()
output:
Dataset:
aluminium ammonia arsenic barium cadmium
0 1.872701 48.479231 0.000629 2.724798 0.006420
1 4.753572 38.756641 0.012728 0.718686 0.000841
2 3.659970 46.974947 0.006287 0.434685 0.001616
3 2.993292 44.741368 0.010171 1.468358 0.008986
4 0.780093 29.894999 0.018151 2.956951 0.006064
Descriptive Statistics:
aluminium ammonia arsenic barium cadmium
count 50.000000 50.000000 50.000000 50.000000 50.000000
mean 2.229620 24.721879 0.009566 1.552091 0.005161
std 1.444416 15.342076 0.005951 0.870897 0.003091
min 0.102922 0.276106 0.000139 0.049763 0.000051
25% 0.918835 10.843701 0.004998 0.725709 0.002389
50% 2.180244 25.413211 0.008445 1.636463 0.005726
75% 3.249275 38.559728 0.015833 2.259806 0.007405
max 4.849549 49.344347 0.019436 2.956951 0.009730
Result:
Thus the python program to Perform EDA for Water quality dataset has been successfully verified.
Aim:
To Perform EDA on map using various map dataset to find the nearest Sports Shop from your
Location with mouse rollover effect
Algorithm:
Step 1: Load the dataset containing sports shop information.
Step 2: Define the user's location.
Step 3: Calculate distances to each shop using Haversine formula or similar.
Step 4: Create a folium map centered at the user's location.
Step 5: Add markers for each shop with a mouse rollover effect to display shop details.
Step 6: Display the map.
Program:
import pandas as pd
import folium
from geopy.distance import geodesic
# Your location
your_location = (40.748817, -73.985428)  # Example coordinates (latitude, longitude)
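The program above stops after defining the user's location. A hedged sketch of the remaining steps (the shop names and coordinates are illustrative; Shop A is placed at the user's location, which matches the 0.00 km result in the output below):
shops = pd.DataFrame({
    'Shop': ['Sports Shop A', 'Sports Shop B', 'Sports Shop C'],
    'Latitude': [40.748817, 40.754932, 40.741895],
    'Longitude': [-73.985428, -73.984016, -73.989308]
})
# Distance from the user's location to each shop
shops['Distance_km'] = shops.apply(
    lambda row: geodesic(your_location, (row['Latitude'], row['Longitude'])).km, axis=1)
nearest = shops.loc[shops['Distance_km'].idxmin()]
print(f"Nearest Sports Shop: {nearest['Shop']}")
print(f"Distance: {nearest['Distance_km']:.2f} km")
# Folium map centred on the user; tooltip= provides the mouse rollover effect
m = folium.Map(location=your_location, zoom_start=15)
for _, row in shops.iterrows():
    folium.Marker(
        location=(row['Latitude'], row['Longitude']),
        tooltip=f"{row['Shop']} ({row['Distance_km']:.2f} km)"
    ).add_to(m)
m.save('sports_shops_map.html')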
output:
Nearest Sports Shop: Sports Shop A
Distance: 0.00 km
Result:
Thus the python program to perform EDA on map using various map dataset to find the nearest Sports Shop from your Location with mouse rollover effect has been successfully verified.
17.Perform EDA for Price of petroleum products in India from the year 2013 to 2023. (Create
dataset with minimum 5 columns and 20 rows.)
Aim:
To create the EDA for Price of petroleum products in India from the year 2013 to 2023. (Create
dataset with minimum 5 columns and 20 rows.)
Algorithm:
Step 1: Create a dataset with at least 5 columns and 20 rows.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics.
Step 5: Visualize data (line plots, bar charts).
Step 6: Analyze results and interpret findings.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Create a sample dataset
np.random.seed(42) # For reproducibility
# Define the data
data = {
'Year': np.repeat(np.arange(2013, 2024), 12),
'Month': np.tile(np.arange(1, 13), 11),
'Petrol_Price': np.random.uniform(60, 120, size=132) + np.linspace(0, 10, 132),
'Diesel_Price': np.random.uniform(50, 100, size=132) + np.linspace(0, 8, 132),
'LPG_Price': np.random.uniform(400, 1000, size=132) + np.linspace(0, 50, 132)}
# Create a DataFrame
df = pd.DataFrame(data)
# Display basic information
print("Dataset:")
print(df.head())
# Perform Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe())
# Plotting
# Line plots for Petrol, Diesel, and LPG prices over time
plt.figure(figsize=(14, 8))
# Petrol Price
plt.subplot(3, 1, 1)
sns.lineplot(data=df, x='Month', y='Petrol_Price', hue='Year', marker='o')
plt.title('Monthly Petrol Prices (2013-2023)')
plt.xlabel('Month')
plt.ylabel('Price (INR per liter)')
# Diesel Price
plt.subplot(3, 1, 2)
sns.lineplot(data=df, x='Month', y='Diesel_Price', hue='Year', marker='o')
plt.title('Monthly Diesel Prices (2013-2023)')
plt.xlabel('Month')
plt.ylabel('Price (INR per liter)')
# LPG Price
plt.subplot(3, 1, 3)
sns.lineplot(data=df, x='Month', y='LPG_Price', hue='Year', marker='o')
plt.title('Monthly LPG Prices (2013-2023)')
plt.xlabel('Month')
plt.ylabel('Price (INR per cylinder)')
plt.tight_layout()
plt.show()
output:
Dataset:
Year Month Petrol_Price Diesel_Price LPG_Price
0 2013 1 82.472407 55.993268 926.423843
1 2013 2 117.119194 66.941827 844.842850
2 2013 3 104.072308 97.267623 818.972803
3 2013 4 96.148517 66.343353 822.635489
4 2013 5 69.666462 76.183806 617.221408
Descriptive Statistics:
Year Month Petrol_Price Diesel_Price LPG_Price
count 132.000000 132.000000 132.000000 132.000000 132.000000
mean 2018.000000 6.500000 93.579911 79.327571 721.103925
std 3.174324 3.465203 18.221675 15.216369 174.006893
min 2013.000000 1.000000 61.998428 52.638240 428.223814
25% 2015.000000 3.750000 77.945617 65.872682 576.909566
50% 2018.000000 6.500000 92.986088 80.956321 729.372830
75% 2021.000000 9.250000 110.516472 91.065617 863.529239
max 2023.000000 12.000000 124.480392 107.380555 1042.394688
Result:
Thus the python program the EDA for Price of petroleum products in India from the year 2013 to 2023.
(Create dataset with minimum 5 columns and 20 rows.) has been successfully verified
18.Explore and visualize women empowerment in India 2025 and compare every five year from 2010.
(Create dataset with minimum 5 columns and 20 rows.) With dataset
Aim:
To visualize women empowerment in India 2025 and compare every five year from 2010.
(Create dataset with minimum 5 columns and 20 rows.) With dataset
Algorithm:
Step 1: Create a dataset with at least 5 columns and 20 rows.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics.
Step 5: Visualize data (line plots, bar charts).
Step 6: Analyze results and interpret findings.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
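# Assumed reconstruction: one random value per metric per year, 2010-2025
# (16 values per metric, matching the counts in the output below)
np.random.seed(42)
years = np.arange(2010, 2026)
metrics = ['Literacy_Rate', 'Workforce_Participation', 'Higher_Education',
           'Political_Representation', 'Health_Care_Access']
data = {
    'Year': np.repeat(years, len(metrics)),
    'Metric': metrics * len(years),
    'Value': np.random.uniform(5, 95, len(years) * len(metrics))
}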
# Create DataFrame
df = pd.DataFrame(data)
# Display basic information
print("Dataset:")
print(df.head(20)) # Display first 20 rows
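# Descriptive statistics per metric (matches the grouped summary in the output)
print("\nDescriptive Statistics:")
print(df.groupby('Metric').describe())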
plt.figure(figsize=(14, 10))
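# Compare every five years from 2010 (a grouped bar chart is assumed)
subset = df[df['Year'].isin([2010, 2015, 2020, 2025])]
sns.barplot(data=subset, x='Metric', y='Value', hue='Year')
plt.title('Women Empowerment in India: Five-Year Comparison (2010-2025)')
plt.xticks(rotation=30)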
plt.tight_layout()
plt.show()
output:
Dataset:
Year Metric Value
0 2010 Literacy_Rate 61.236204
1 2010 Workforce_Participation 79.188096
2 2010 Higher_Education 73.293152
3 2010 Political_Representation 69.959755
4 2010 Health_Care_Access 57.347226
5 2011 Literacy_Rate 58.013169
6 2011 Workforce_Participation 55.742508
7 2011 Higher_Education 80.651951
8 2011 Political_Representation 73.366784
9 2011 Health_Care_Access 77.242177
10 2012 Literacy_Rate 57.284201
11 2012 Workforce_Participation 86.430629
12 2012 Higher_Education 82.973279
13 2012 Political_Representation 65.036840
14 2012 Health_Care_Access 64.788082
15 2013 Literacy_Rate 65.502135
16 2013 Workforce_Participation 29.127267
17 2013 Higher_Education 36.409360
18 2013 Political_Representation 34.291684
19 2013 Health_Care_Access 30.736874
Descriptive Statistics:
Year \
count mean std min 25% 50%
Metric
Health_Care_Access 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Higher_Education 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Literacy_Rate 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Political_Representation 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Workforce_Participation 16.0 2017.5 4.760952 2010.0 2013.75 2017.5
Value \
75% max count mean std
Metric
Health_Care_Access 2021.25 2025.0 16.0 45.426186 27.309181
Higher_Education 2021.25 2025.0 16.0 43.194516 27.522326
Literacy_Rate 2021.25 2025.0 16.0 47.597035 23.782723
Political_Representation 2021.25 2025.0 16.0 43.815057 25.184779
Workforce_Participation 2021.25 2025.0 16.0 42.272576 26.929909
\
min 25% 50% 75%
Metric
Health_Care_Access 8.106150 23.396187 34.876154 68.440044
Higher_Education 10.939743 19.657972 34.586850 68.239293
Literacy_Rate 14.830159 26.416322 48.557479 62.302686
Political_Representation 9.011743 25.785765 35.588632 66.267569
Workforce_Participation 8.994054 23.865402 30.892074 65.744496
max
Metric
Health_Care_Access 91.273275
Higher_Education 85.065909
Literacy_Rate 85.536882
Political_Representation 87.463843
Workforce_Participation 87.138110
Result:
Thus the python program the Explore and visualize women empowerment in India 2025 and
compare every five year from 2010. (Create dataset with minimum 5 columns and 20 rows.) With
dataset has been successfully verified
19.Perform EDA and Visualization for COVID-19 dataset.
A) State wise Bar chart
Aim:
To Perform EDA and Visualization for COVID-19 dataset to State wise Bar chart
Algorithm:
Step 1: Create a dataset with state names and their corresponding COVID-19 statistics.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics.
Step 5: Visualize data (state-wise bar chart).
Step 6: Analyze results and interpret findings.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
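# Reconstruction: the five values per column are recoverable from the
# descriptive statistics in the output; the state names and the row
# pairing across columns are assumptions
data = {
    'State': ['Maharashtra', 'Kerala', 'Tamil Nadu', 'Karnataka', 'Delhi'],
    'Total_Cases': [375838, 269178, 156867, 141932, 131958],
    'Total_Deaths': [45732, 38194, 17850, 12284, 7265],
    'Total_Recovered': [393468, 379871, 196335, 180203, 92498]
}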
# Create DataFrame
df = pd.DataFrame(data)
# Display basic information
print("Dataset:")
print(df)
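# Descriptive statistics (matches the output below)
print("\nDescriptive Statistics:")
print(df.describe())
# State-wise bar chart of total cases (a hedged sketch of the plotting step)
plt.figure(figsize=(10, 6))
sns.barplot(x='State', y='Total_Cases', data=df, palette='viridis')
plt.title('COVID-19 Total Cases by State')
plt.xlabel('State')
plt.ylabel('Total Cases')
plt.show()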
OUTPUT:
Descriptive Statistics:
Total_Cases Total_Deaths Total_Recovered
count 5.000000 5.000000 5.000000
mean 215154.600000 24265.000000 248475.000000
std 105378.307183 16796.917247 132284.115617
min 131958.000000 7265.000000 92498.000000
25% 141932.000000 12284.000000 180203.000000
50% 156867.000000 17850.000000 196335.000000
75% 269178.000000 38194.000000 379871.000000
max 375838.000000 45732.000000 393468.000000
Result:
Thus the python program as Perform EDA and Visualization for COVID-19 dataset to State wise
Bar chart has been successfully verified
19.Perform EDA and Visualization for COVID-19 dataset.
B) Recovered from COVID-19 District wise Bar chart
Aim:
To Perform EDA and Visualization for COVID-19 dataset to Recovered from COVID-19 District wise
Bar chart
Algorithm:
Step 1: Create a dataset with district names and their corresponding recovered COVID-19 statistics.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics.
Step 5: Visualize data (district-wise bar chart).
Step 6: Analyze results and interpret findings.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
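# Reconstructed from the output below
data = {
    'District': ['District_A', 'District_B', 'District_C', 'District_D', 'District_E'],
    'Recovered_Cases': [126958, 151867, 136932, 108694, 124879],
    'Total_Cases': [120268, 217892, 64886, 147337, 223458],
    'Total_Deaths': [38194, 22962, 48191, 45131, 17023]
}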
# Create DataFrame
df = pd.DataFrame(data)
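# Sketch of the remaining steps: printouts and the district-wise bar chart
print("Dataset:")
print(df)
print("\nDescriptive Statistics:")
print(df.describe())
plt.figure(figsize=(10, 6))
sns.barplot(x='District', y='Recovered_Cases', data=df, palette='crest')
plt.title('Recovered COVID-19 Cases by District')
plt.xlabel('District')
plt.ylabel('Recovered Cases')
plt.show()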
OUTPUT:
Dataset:
District Recovered_Cases Total_Cases Total_Deaths
0 District_A 126958 120268 38194
1 District_B 151867 217892 22962
2 District_C 136932 64886 48191
3 District_D 108694 147337 45131
4 District_E 124879 223458 17023
Descriptive Statistics:
Recovered_Cases Total_Cases Total_Deaths
count 5.000000 5.000000 5.000000
mean 129866.000000 154768.200000 34300.200000
std 15933.867814 67132.702837 13715.672156
min 108694.000000 64886.000000 17023.000000
25% 124879.000000 120268.000000 22962.000000
50% 126958.000000 147337.000000 38194.000000
75% 136932.000000 217892.000000 45131.000000
max 151867.000000 223458.000000 48191.000000
Result:
Thus the python program the EDA and Visualization for COVID-19 dataset to Recovered from COVID-19
District wise Bar chart has been successfully verified
19.Perform EDA and Visualization for COVID-19 dataset.
C) Descriptive analysis for different age group.(Create dataset with minimum 5 columns and 20 rows.)
Aim:
To Perform EDA and Visualization for COVID-19 dataset to Descriptive analysis for different age group.
Algorithm:
Step 1: Create a dataset with age groups and their corresponding COVID-19 statistics.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics for each age group.
Step 5: Visualize data (e.g., bar charts for confirmed cases, recoveries, and deaths).
Step 6: Analyze results and interpret findings.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
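# Assumed reconstruction: 20 random rows across several age groups (the exact
# values will differ from the output, which reflects a similar random draw)
np.random.seed(0)
age_groups = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']
data = {
    'Age_Group': np.random.choice(age_groups, 20),
    'Total_Cases': np.random.randint(5000, 30000, 20),
    'Total_Deaths': np.random.randint(500, 3000, 20),
    'Recovered_Cases': np.random.randint(1500, 14000, 20),
    'New_Cases': np.random.randint(100, 2600, 20),
    'New_Deaths': np.random.randint(30, 250, 20)
}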
# Create DataFrame
df = pd.DataFrame(data)
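# Descriptive analysis for each age group (a grouped describe() matches
# the style of the output below)
print("Dataset:")
print(df.head())
print("\nDescriptive Statistics by Age Group:")
print(df.groupby('Age_Group').describe())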
plt.figure(figsize=(12, 6))
sns.barplot(x='Age_Group', y='Total_Deaths', data=df, palette='rocket', ci=None)
plt.title('Total Deaths by Age Group')
plt.show()
output:
Dataset:
Age_Group Total_Cases Total_Deaths Recovered_Cases New_Cases New_Deaths
0 61-70 18568 2898 1521 429 235
1 31-40 20769 2445 12153 542 45
2 71-80 29693 610 11305 1230 32
3 41-50 7396 2373 13917 2112 139
4 61-70 28480 2071 8489 114 205
Recovered_Cases New_Cases \
std mean sum std mean sum
0 559.321464 10857.000000 21714 2585.182392 1088.000000 2176
1 377.595021 11113.000000 22226 1302.490691 2095.000000 4190
2 758.018469 6407.000000 12814 8126.071129 1643.500000 3287
3 1356.603234 7109.750000 28439 5117.947204 1619.000000 6476
4 1717.562372 8948.500000 17897 1158.948014 2538.000000 5076
5 483.345632 7833.333333 23500 6011.377906 653.333333 1960
6 887.934007 8546.400000 42732 6278.818902 1131.400000 5657
New_Deaths
std mean sum std
0 933.380951 246.00 492 21.213203
1 739.633693 224.00 448 67.882251
2 1557.756239 132.50 265 123.743687
3 794.150699 63.25 253 55.608003
4 214.960461 176.50 353 58.689863
5 679.850229 159.00 477 106.714573
6 397.607596 117.80 589 105.103283
Result:
Thus the python program to perform EDA and Visualization for COVID-19 dataset on descriptive analysis for different age group has been successfully verified.
20.A)Perform EDA for Bus Ticket Booking
Aim:
To write a python program to Perform EDA for Bus Ticket Booking
Algorithm:
Step 1: Create a dataset with relevant attributes.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics for the dataset.
Step 5: Visualize data (e.g., bar charts, pie charts).
Step 6: Analyze results and interpret findings.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Define data
num_rows = 50
dates = pd.date_range(start='2024-01-01', periods=num_rows, freq='D')
data = {
'Booking_ID': np.arange(1, num_rows + 1),
'Date': np.random.choice(dates, size=num_rows),
'Bus_ID': np.random.choice(['Bus_01', 'Bus_02', 'Bus_03', 'Bus_04'],
size=num_rows),
'Passenger_ID': np.random.randint(1000, 5000, size=num_rows),
'Seat_No': np.random.randint(1, 50, size=num_rows),
'Booking_Status': np.random.choice(['Booked', 'Cancelled', 'Completed'],
size=num_rows),
'Amount': np.random.uniform(100, 500, size=num_rows).round(2),
'Travel_Distance': np.random.randint(10, 500, size=num_rows)
}
# Create DataFrame
df = pd.DataFrame(data)
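# Basic EDA printouts (a sketch of the steps the output below implies)
print("Dataset:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)
print("\nDescriptive Statistics:")
print(df.describe(include='all'))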
# Correlation Matrix
plt.figure(figsize=(10, 8))
correlation = df[['Amount', 'Travel_Distance']].corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
OUTPUT:
Dataset:
Booking_ID Date Bus_ID Passenger_ID Seat_No Booking_Status \
0 1 2024-02-08 Bus_01 2076 40 Cancelled
1 2 2024-01-29 Bus_02 1791 22 Cancelled
2 3 2024-01-15 Bus_04 4993 27 Completed
3 4 2024-02-12 Bus_04 3264 35 Cancelled
4 5 2024-01-08 Bus_03 1763 1 Completed
Amount Travel_Distance
0 353.74 125
1 372.28 84
2 312.37 122
3 279.11 465
4 321.16 429
Missing Values:
Booking_ID 0
Date 0
Bus_ID 0
Passenger_ID 0
Seat_No 0
Booking_Status 0
Amount 0
Travel_Distance 0
dtype: int64
Data Types:
Booking_ID int64
Date datetime64[ns]
Bus_ID object
Passenger_ID int64
Seat_No int64
Booking_Status object
Amount float64
Travel_Distance int64
dtype: object
Descriptive Statistics:
Booking_ID Date Bus_ID Passenger_ID Seat_No \
count 50.00000 50 50 50.00000 50.00000
unique NaN NaN 4 NaN NaN
top NaN NaN Bus_02 NaN NaN
freq NaN NaN 18 NaN NaN
mean 25.50000 2024-01-24 16:19:12 NaN 2985.20000 26.14000
min 1.00000 2024-01-02 00:00:00 NaN 1064.00000 1.00000
25% 13.25000 2024-01-14 06:00:00 NaN 2095.00000 14.25000
50% 25.50000 2024-01-24 00:00:00 NaN 3075.00000 28.00000
75% 37.75000 2024-02-06 18:00:00 NaN 3915.50000 38.50000
max 50.00000 2024-02-19 00:00:00 NaN 4993.00000 49.00000
std 14.57738 NaN NaN 1129.03166 14.20694
Result:
Thus the python program to Perform EDA for Bus Ticket Booking has been successfully verified.
20.B)Perform EDA for Train Ticket Booking
Aim:
To write a python program to Perform EDA for Train Ticket Booking
Algorithm:
Step 1: Create a dataset with relevant attributes.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics for the dataset.
Step 5: Visualize data (e.g., bar charts, pie charts).
Step 6: Analyze results and interpret findings.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Define data
num_rows = 50
dates = pd.date_range(start='2024-01-01', periods=num_rows, freq='D')
data = {
'Booking_ID': np.arange(1, num_rows + 1),
'Date': np.random.choice(dates, size=num_rows),
'Train_ID': np.random.choice(['Train_A', 'Train_B', 'Train_C', 'Train_D'],
size=num_rows),
'Passenger_ID': np.random.randint(1000, 5000, size=num_rows),
'Seat_No': np.random.randint(1, 100, size=num_rows),
'Booking_Status': np.random.choice(['Booked', 'Cancelled', 'Completed'],
size=num_rows),
'Amount': np.random.uniform(50, 500, size=num_rows).round(2),
'Travel_Distance': np.random.randint(10, 1000, size=num_rows)
}
# Create DataFrame
df = pd.DataFrame(data)
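# Basic EDA printouts (a sketch of the steps the output below implies)
print("Dataset:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())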
print("\nData Types:")
print(df.dtypes)
# Perform Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe(include='all'))
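# Travel distance by booking status (a boxplot is assumed here; the axis
# labels below fit this choice)
plt.figure(figsize=(10, 6))
sns.boxplot(x='Booking_Status', y='Travel_Distance', data=df)
plt.title('Travel Distance by Booking Status')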
plt.xlabel('Booking Status')
plt.ylabel('Travel Distance (km)')
plt.show()
OUTPUT:
Amount Travel_Distance
0 115.20 977
1 270.25 429
2 493.54 431
3 158.92 113
4 352.46 861
Missing Values:
Booking_ID 0
Date 0
Train_ID 0
Passenger_ID 0
Seat_No 0
Booking_Status 0
Amount 0
Travel_Distance 0
dtype: int64
Data Types:
Booking_ID int64
Date datetime64[ns]
Train_ID object
Passenger_ID int64
Seat_No int64
Booking_Status object
Amount float64
Travel_Distance int64
dtype: object
Descriptive Statistics:
Booking_ID Date Train_ID Passenger_ID Seat_No \
count 50.00000 50 50 50.00000 50.000000
unique NaN NaN 4 NaN NaN
top NaN NaN Train_B NaN NaN
freq NaN NaN 18 NaN NaN
mean 25.50000 2024-01-24 16:19:12 NaN 2985.20000 49.460000
min 1.00000 2024-01-02 00:00:00 NaN 1064.00000 1.000000
25% 13.25000 2024-01-14 06:00:00 NaN 2095.00000 28.000000
50% 25.50000 2024-01-24 00:00:00 NaN 3075.00000 46.000000
75% 37.75000 2024-02-06 18:00:00 NaN 3915.50000 76.500000
max 50.00000 2024-02-19 00:00:00 NaN 4993.00000 99.000000
std 14.57738 NaN NaN 1129.03166 29.864463
Result:
Thus the python program to Perform EDA for Train Ticket Booking has been successfully verified.
20.C)Perform EDA for Flight Ticket Booking
Aim:
To write a python program to Perform EDA for Flight Ticket Booking
Algorithm:
Step 1: Create a dataset with relevant attributes.
Step 2: Load the dataset into a DataFrame.
Step 3: Clean the dataset (if necessary).
Step 4: Perform descriptive statistics for the dataset.
Step 5: Visualize data (e.g., bar charts, pie charts).
Step 6: Analyze results and interpret findings.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Define data
num_rows = 50
dates = pd.date_range(start='2024-01-01', periods=num_rows, freq='D')
data = {
'Booking_ID': np.arange(1, num_rows + 1),
'Date': np.random.choice(dates, size=num_rows),
'Flight_ID': np.random.choice(['Flight_101', 'Flight_102', 'Flight_103',
'Flight_104'], size=num_rows),
'Passenger_ID': np.random.randint(1000, 5000, size=num_rows),
'Seat_No': np.random.randint(1, 200, size=num_rows),
'Booking_Status': np.random.choice(['Booked', 'Cancelled', 'Completed'],
size=num_rows),
'Amount': np.random.uniform(100, 1000, size=num_rows).round(2),
'Travel_Distance': np.random.randint(100, 5000, size=num_rows),
'Class': np.random.choice(['Economy', 'Business', 'First'], size=num_rows)
}
# Create DataFrame
df = pd.DataFrame(data)
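# Basic EDA printouts (a sketch of the steps the output below implies)
print("Dataset:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)
print("\nDescriptive Statistics:")
print(df.describe(include='all'))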
# Revenue by Class
plt.figure(figsize=(10, 6))
class_revenue = df[df['Booking_Status'] ==
'Completed'].groupby('Class')['Amount'].sum().reset_index()
sns.barplot(x='Class', y='Amount', data=class_revenue, palette='magma')
plt.title('Total Revenue by Class')
plt.xlabel('Class')
plt.ylabel('Total Revenue')
plt.show()
OUTPUT:
Dataset:
Booking_ID Date Flight_ID Passenger_ID Seat_No Booking_Status \
0 1 2024-02-08 Flight_101 2076 104 Booked
1 2 2024-01-29 Flight_102 1791 35 Booked
2 3 2024-01-15 Flight_104 4993 193 Booked
3 4 2024-02-12 Flight_104 3264 101 Completed
4 5 2024-01-08 Flight_103 1763 175 Cancelled
Missing Values:
Booking_ID 0
Date 0
Flight_ID 0
Passenger_ID 0
Seat_No 0
Booking_Status 0
Amount 0
Travel_Distance 0
Class 0
dtype: int64
Data Types:
Booking_ID int64
Date datetime64[ns]
Flight_ID object
Passenger_ID int64
Seat_No int64
Booking_Status object
Amount float64
Travel_Distance int64
Class object
dtype: object
Descriptive Statistics:
Booking_ID Date Flight_ID Passenger_ID Seat_No \
count 50.00000 50 50 50.00000 50.000000
unique NaN NaN 4 NaN NaN
top NaN NaN Flight_102 NaN NaN
freq NaN NaN 18 NaN NaN
mean 25.50000 2024-01-24 16:19:12 NaN 2985.20000 95.540000
min 1.00000 2024-01-02 00:00:00 NaN 1064.00000 1.000000
25% 13.25000 2024-01-14 06:00:00 NaN 2095.00000 42.500000
50% 25.50000 2024-01-24 00:00:00 NaN 3075.00000 94.000000
75% 37.75000 2024-02-06 18:00:00 NaN 3915.50000 140.750000
max 50.00000 2024-02-19 00:00:00 NaN 4993.00000 193.000000
std 14.57738 NaN NaN 1129.03166 58.362768
Result:
Thus the python program to Perform EDA for Flight Ticket Booking has been successfully verified.