Dsa 1
Read data from text files, Excel and the web, and explore various commands for
descriptive analytics on the Iris data set.
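import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the Iris data set bundled with scikit-learn into a DataFrame
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["species"] = iris.target_names[iris.target]
# First rows and structure (info() prints directly and returns None)
print("First 5 rows of the dataset:")
print(iris_df.head())
print("\nDataset Information:")
print(iris_df.info())
# Descriptive statistics
print("\nSummary Statistics:")
print(iris_df.describe())
print("\nSpecies Count:")
print(iris_df["species"].value_counts())
# Pairplot visualization
sns.pairplot(iris_df, hue="species", markers=["o", "s", "D"])
plt.suptitle("Pairplot of Iris Dataset", y=1.02)
plt.show()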
Output:
First 5 rows of the dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
Summary Statistics:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000
Species Count:
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
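The task statement also mentions text files, Excel and the web as data sources, while the listing above loads Iris from scikit-learn. A minimal sketch of the other three routes follows. The file name iris_sample.txt and the helpers read_excel_sheet and read_csv_from_web are hypothetical names chosen for illustration; the sample rows are the first two Iris rows from the output above. Only the text-file route is executed here, since the Excel and web routes need a workbook (plus an engine such as openpyxl) and a network connection.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a real text file: two Iris rows from the output above.
sample = pd.DataFrame(
    {"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]}
)

# Text file: write a tab-separated file, then read it back with read_csv.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "iris_sample.txt")  # assumed file name
    sample.to_csv(txt_path, sep="\t", index=False)
    from_text = pd.read_csv(txt_path, sep="\t")


def read_excel_sheet(path, sheet_name=0):
    """Excel: read one sheet of a workbook (requires an engine, e.g. openpyxl)."""
    return pd.read_excel(path, sheet_name=sheet_name)


def read_csv_from_web(url):
    """Web: read_csv also accepts an http(s) URL in place of a local path."""
    return pd.read_csv(url)


print(from_text)
```

Once a DataFrame is loaded by any of these routes, the same head(), info(), describe() and value_counts() calls apply unchanged.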
8. Use the diabetes data set from UCI and the Pima Indians Diabetes data set to perform
univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
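import pandas as pd
from scipy.stats import skew, kurtosis
# Assumption: the data set is saved locally as "diabetes.csv" with a header row.
# The file name and column below are placeholders; change them to match your copy.
data = pd.read_csv("diabetes.csv")
column = "DiabetesPedigreeFunction"  # assumed numeric column to analyze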
# Frequency
frequency = data[column].value_counts()
print("Frequency:\n", frequency)
# Mean
mean = data[column].mean()
print(f"Mean: {mean:.2f}")
# Median
median = data[column].median()
print(f"Median: {median:.2f}")
# Mode
mode = data[column].mode()[0]
print(f"Mode: {mode:.2f}")
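# Variance (pandas default: sample estimator, ddof=1)
variance = data[column].var()
print(f"Variance: {variance:.2f}")
# Standard Deviation (pandas default: sample estimator, ddof=1)
std_dev = data[column].std()
print(f"Standard Deviation: {std_dev:.2f}")
# Skewness (scipy default: biased moment estimator)
skewness = skew(data[column])
print(f"Skewness: {skewness:.2f}")
# Kurtosis (scipy default: excess kurtosis, so a normal distribution gives 0)
kurt = kurtosis(data[column])
print(f"Kurtosis: {kurt:.2f}")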
Output:
Standard Deviation: 0.33
Skewness: 1.92
Kurtosis: 5.55
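To make the last four measures less of a black box, the sketch below recomputes them from central moments with NumPy only. The helper name moment_stats is hypothetical. It is written to mirror the library defaults used in the listing (sample variance and standard deviation with ddof=1, as in pandas; biased skewness and excess kurtosis, as in scipy.stats), and it is an illustration of the formulas rather than a replacement for those library calls.

```python
import numpy as np


def moment_stats(values):
    """Hypothetical helper: univariate measures computed from central moments."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = x - x.mean()           # deviations from the mean
    m2 = np.mean(dev ** 2)       # biased 2nd central moment
    m3 = np.mean(dev ** 3)       # biased 3rd central moment
    m4 = np.mean(dev ** 4)       # biased 4th central moment
    variance = dev @ dev / (n - 1)   # sample variance (ddof=1)
    return {
        "mean": x.mean(),
        "variance": variance,
        "std_dev": np.sqrt(variance),
        "skewness": m3 / m2 ** 1.5,      # 0 for symmetric data
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis, 0 for a normal
    }


stats = moment_stats([1, 2, 3, 4, 5])
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

For the symmetric toy input 1 to 5 the skewness is 0 and the excess kurtosis is -1.30 (flatter than a normal distribution). A positive skewness together with a large positive kurtosis, as in the output above, points to a right-skewed column with a heavy upper tail.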