Dsa 1

The document provides a detailed analysis of the Iris dataset using Python, including data loading, descriptive statistics, and visualizations such as pairplots and boxplots. It also covers univariate analysis on the Pima Indians Diabetes dataset, calculating metrics like mean, median, mode, variance, standard deviation, skewness, and kurtosis for each feature. The analysis includes frequency counts and summary statistics for the dataset, highlighting the distribution of various attributes.

Uploaded by

pratikkokate88

7. Reading data from text files, Excel and the web and exploring various commands for
doing descriptive analytics on the Iris data set

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the Iris dataset from sklearn
iris = datasets.load_iris()

# Convert to a pandas DataFrame
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target  # Add species column (numeric)
# Convert the numeric codes to categorical labels
iris_df['species'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Display first 5 rows
print("First 5 rows of the dataset:")
print(iris_df.head())

# Dataset summary information
# (info() prints its report directly and returns None, so print() also emits "None")
print("\nDataset Information:")
print(iris_df.info())

# Descriptive statistics
print("\nSummary Statistics:")
print(iris_df.describe())

# Count of each species
print("\nSpecies Count:")
print(iris_df['species'].value_counts())

# Pairplot visualization
sns.pairplot(iris_df, hue="species", markers=["o", "s", "D"])
plt.suptitle("Pairplot of Iris Dataset", y=1.02)
plt.show()

# Boxplot for feature distribution
plt.figure(figsize=(10, 6))
sns.boxplot(data=iris_df, orient="h")
plt.title("Feature Distribution of Iris Dataset")
plt.show()

Output:
First 5 rows of the dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None

Summary Statistics:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000

Species Count:
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
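
The title of this exercise mentions text files, Excel and the web, while the listing above loads Iris from scikit-learn. The following is only a minimal sketch of the file-based route, not part of the original program. The in-memory CSV text, the helper name load_iris_text, the file name iris.xlsx and the example.com URL are all hypothetical placeholders chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a text file on disk (a few Iris-like rows).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def load_iris_text(source):
    """Read comma-separated text from a path, URL or file-like object."""
    return pd.read_csv(source)


# Text source: read_csv accepts a file-like object as well as a path.
sample_df = load_iris_text(io.StringIO(csv_text))
print(sample_df.describe())
print(sample_df["species"].value_counts())

# Excel source (hypothetical file name; needs an Excel engine such as openpyxl):
# excel_df = pd.read_excel("iris.xlsx")

# Web source (hypothetical URL; read_csv also accepts http/https URLs):
# web_df = pd.read_csv("https://example.com/iris.csv")
```

Once loaded, a DataFrame from any of these sources supports the same head(), info(), describe() and value_counts() calls used above.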
8. Use the diabetes data set from UCI and the Pima Indians Diabetes data set to perform
univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.

import pandas as pd
import numpy as np
from scipy.stats import skew, kurtosis

# URL of the Pima Indians Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

# Column names for the dataset (the CSV file has no header row)
column_names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"
]

# Load the dataset
df = pd.read_csv(url, header=None, names=column_names)

# Display the first few rows of the dataset
print("Dataset Head:")
print(df.head())

# Function to perform univariate analysis
def univariate_analysis(data, column):
    print(f"\nUnivariate Analysis for {column}:")
    print("---------------------------------")

    # Frequency
    frequency = data[column].value_counts()
    print("Frequency:\n", frequency)

    # Mean
    mean = data[column].mean()
    print(f"Mean: {mean:.2f}")

    # Median
    median = data[column].median()
    print(f"Median: {median:.2f}")

    # Mode (the first value is used if there are several modes)
    mode = data[column].mode()[0]
    print(f"Mode: {mode:.2f}")

    # Variance (pandas default: sample variance, ddof=1)
    variance = data[column].var()
    print(f"Variance: {variance:.2f}")

    # Standard Deviation (pandas default: sample standard deviation, ddof=1)
    std_dev = data[column].std()
    print(f"Standard Deviation: {std_dev:.2f}")

    # Skewness (scipy default: biased, population estimator)
    skewness = skew(data[column])
    print(f"Skewness: {skewness:.2f}")

    # Kurtosis (scipy default: excess kurtosis, biased estimator)
    kurt = kurtosis(data[column])
    print(f"Kurtosis: {kurt:.2f}")

# Perform univariate analysis for each column
for column in df.columns:
    univariate_analysis(df, column)

Output (excerpt; the first three lines are the end of the DiabetesPedigreeFunction block):
Standard Deviation: 0.33
Skewness: 1.92
Kurtosis: 5.55

Univariate Analysis for Age:
---------------------------------
Frequency:
Age
22 72
21 63
25 48
24 46
23 38
28 35
26 33
27 32
29 29
31 24
41 22
30 21
37 19
42 18
33 17
36 16
38 16
32 16
45 15
34 14
46 13
40 13
43 13
39 12
35 10
44 8
50 8
51 8
52 8
58 7
54 6
47 6
49 5
60 5
53 5
57 5
48 5
63 4
66 4
55 4
62 4
59 3
56 3
65 3
67 3
61 2
69 2
72 1
81 1
64 1
70 1
68 1
Name: count, dtype: int64
Mean: 33.24
Median: 29.00
Mode: 22.00
Variance: 138.30
Standard Deviation: 11.76
Skewness: 1.13
Kurtosis: 0.63

Univariate Analysis for Outcome:
---------------------------------
Frequency:
Outcome
0 500
1 268
Name: count, dtype: int64
Mean: 0.35
Median: 0.00
Mode: 0.00
Variance: 0.23
Standard Deviation: 0.48
Skewness: 0.63
Kurtosis: -1.60
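
The Outcome figures above can be cross-checked from the frequency table alone (500 zeros and 268 ones). The sketch below is an illustration rather than part of the original program: it rebuilds the column from those counts and computes skewness and excess kurtosis from central moments, which is what scipy's skew and kurtosis return with their default arguments. The helper name moment_stats is hypothetical. pandas' own skew() and kurt() apply a bias correction, so they come out slightly different on the same data.

```python
import numpy as np
import pandas as pd

# Rebuild the Outcome column from the frequency table: 500 zeros, 268 ones.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")


def moment_stats(series):
    """Return (population skewness, population excess kurtosis)."""
    x = series.to_numpy(dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment
    m3 = np.mean(centered ** 3)  # third central moment
    m4 = np.mean(centered ** 4)  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


pop_skew, pop_kurt = moment_stats(outcome)

print(f"Mean: {outcome.mean():.2f}")
print(f"Variance (ddof=1): {outcome.var():.2f}")
print(f"Standard Deviation (ddof=1): {outcome.std():.2f}")
print(f"Skewness (population): {pop_skew:.2f}")
print(f"Kurtosis (population, excess): {pop_kurt:.2f}")

# Bias-corrected estimators from pandas, for comparison.
print(f"pandas skew(): {outcome.skew():.4f}")
print(f"pandas kurt(): {outcome.kurt():.4f}")
```

The first five printed values match the Outcome block above (0.35, 0.23, 0.48, 0.63 and -1.60), which confirms that the listing mixes sample-based variance and standard deviation with population-based skewness and kurtosis.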