25 - Assignment10.ipynb - Colaboratory


4/3/24, 6:25 PM Assignment4.ipynb - Colaboratory

Name : Chaitrali Ghule


Roll no.: 25
PRN : 12210688

Scan the dataset and give the inference as:
a. List down the features and their types (e.g., numeric, nominal) available in the dataset.
b. Create a histogram for each feature in the dataset to illustrate the feature distributions.
c. Create a boxplot for each feature in the dataset.
d. Compare distributions and identify outliers.

import pandas as pd #Data Manipulation
import numpy as np #Numerical Operations
import matplotlib.pyplot as plt #Data Visualisation
import seaborn as sns #Statistical Data Visualisation
import warnings #Ignore Warnings

warnings.filterwarnings("ignore")

df = pd.read_csv("iris.csv")

Data Preprocessing

df.shape

(150, 6)

df.isnull().sum()

Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
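
The dtypes reported by df.info() map onto the "numeric" versus "nominal" labels that part (a) of the task asks for. The following is a minimal sketch of that mapping, not part of the original notebook: the helper name `feature_types` and the three-row `sample` frame are assumptions used so the block runs without iris.csv.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small stand-in frame with the same columns as the dataset (Id omitted).
# It is only here so the sketch runs without iris.csv.
sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})


def feature_types(frame):
    """Label each column 'numeric' or 'nominal' based on its dtype."""
    return {
        col: "numeric" if is_numeric_dtype(frame[col]) else "nominal"
        for col in frame.columns
    }


types = feature_types(sample)
for name, kind in types.items():
    print(f"{name}: {kind}")
```

Applied to the full frame, the same call would label Id as numeric as well, even though it is only a row identifier and not a measured feature.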

# Filtering out the data of each species


setosa = df[df['Species'] == 'Iris-setosa']
versicolor = df[df['Species'] == 'Iris-versicolor']
virginica = df[df['Species'] == 'Iris-virginica']

Basic Statistical Details

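# MEAN (numeric_only=True skips the non-numeric Species column)
print("Mean for Setosa =" ,"\n",setosa.mean(numeric_only=True),"\n")
print("Mean for Versicolor =" , "\n" ,versicolor.mean(numeric_only=True),"\n")
print("Mean for Virginica =" ,"\n",virginica.mean(numeric_only=True),"\n")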

Mean for Setosa =


Id 25.500
SepalLengthCm 5.006
SepalWidthCm 3.418
PetalLengthCm 1.464
PetalWidthCm 0.244
dtype: float64

Mean for Versicolor =


Id 75.500
SepalLengthCm 5.936
SepalWidthCm 2.770
PetalLengthCm 4.260
PetalWidthCm 1.326
dtype: float64

Mean for Virginica =


Id 125.500
SepalLengthCm 6.588
SepalWidthCm 2.974
PetalLengthCm 5.552
PetalWidthCm 2.026
dtype: float64

# STANDARD DEVIATION (numeric_only=True skips the non-numeric Species column)
print("Standard Deviation for setosa " ,"\n", setosa.std(numeric_only=True) ,"\n")
print("Standard Deviation for versicolor " , "\n" ,versicolor.std(numeric_only=True),"\n")
print("Standard Deviation for virginica " ,"\n",virginica.std(numeric_only=True),"\n")

Standard Deviation for setosa


Id 14.577380
SepalLengthCm 0.352490
SepalWidthCm 0.381024
PetalLengthCm 0.173511
PetalWidthCm 0.107210
dtype: float64

Standard Deviation for versicolor


Id 14.577380
SepalLengthCm 0.516171
SepalWidthCm 0.313798
PetalLengthCm 0.469911
PetalWidthCm 0.197753
dtype: float64

Standard Deviation for virginica


Id 14.577380
SepalLengthCm 0.635880
SepalWidthCm 0.322497
PetalLengthCm 0.551895
PetalWidthCm 0.274650
dtype: float64
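
The per-species means and standard deviations printed above can also be collected into one table with a single groupby call. This is a minimal sketch under stated assumptions: the four-row `sample` frame and its values are made up for illustration, and the names `sample` and `summary` do not appear in the notebook.

```python
import pandas as pd

# Made-up stand-in data: two groups with two measurements each.
sample = pd.DataFrame({
    "Species": ["a", "a", "b", "b"],
    "PetalLengthCm": [1.0, 2.0, 4.0, 6.0],
})

# One row per species, one column per statistic.
# std uses the sample standard deviation (ddof=1), the pandas default.
summary = sample.groupby("Species")["PetalLengthCm"].agg(["mean", "std"])
print(summary)
```

With the real data, replacing `sample` by `df` would give the same figures as the separate print statements, side by side for comparison.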

g2 = df.groupby('Species')
iris_setosa = g2.get_group('Iris-setosa')
iris_setosa.describe()

       Id        SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  50.00000  50.00000       50.000000     50.000000      50.00000
mean   25.50000  5.00600        3.418000      1.464000       0.24400
std    14.57738  0.35249        0.381024      0.173511       0.10721
min    1.00000   4.30000        2.300000      1.000000       0.10000
25%    13.25000  4.80000        3.125000      1.400000       0.20000
50%    25.50000  5.00000        3.400000      1.500000       0.20000
75%    37.75000  5.20000        3.675000      1.575000       0.30000
max    50.00000  5.80000        4.400000      1.900000       0.60000


VISUALISATIONS
# SCATTER PLOT : Relationship between sepal length and petal length.
sns.scatterplot(data=df, x='SepalLengthCm', y='PetalLengthCm', hue='SepalWidthCm', palette='viridis')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Petal Length (cm)')
plt.title('Sepal Length and Petal Length')
plt.show()

# Relationship between petal length and petal width
sns.scatterplot(data=df, x='PetalLengthCm', y='PetalWidthCm', hue='Species', palette='Set2')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Petal Length and Petal Width')
plt.show()


q3 = np.percentile(df['PetalLengthCm'],75)
print("Q3 is: ",q3)

Q3 is: 5.1

q2 = np.percentile(df['PetalLengthCm'],50)
print("Q2 is: ",q2)

Q2 is: 4.35

q1 = np.percentile(df['PetalLengthCm'],25)
print("Q1 is: ",q1)

Q1 is: 1.6

IQR = q3-q1  # interquartile range spans Q1 to Q3, not Q2 to Q3
print("IQR is: ",round(IQR, 4))

IQR is: 3.5

import matplotlib.pyplot as plt
df.boxplot(column = ['PetalLengthCm'])
plt.show()

df.drop(columns='Id', inplace=True)  # in-place drop returns None, so nothing is assigned

plt.figure(figsize=(10, 6))
sns.boxplot(data=df.drop(columns='Species'), orient='h', fliersize=5, linewidth=1)
plt.title('Box Plot with Outliers for Iris Dataset')
plt.xlabel('Feature Value')
plt.ylabel('Features')
plt.show()


Insights from the box plots:

1. Only the sepal width feature shows outliers.
2. Petal length has the largest interquartile range (IQR).
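
The points a box plot draws as outliers are conventionally those lying more than 1.5 × IQR below Q1 or above Q3, where IQR = Q3 - Q1. The sketch below applies that rule to a short made-up list. The function names `tukey_fences` and `find_outliers` and the sample values are assumptions for illustration, not part of the notebook.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1  # interquartile range: Q3 minus Q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall strictly outside the fences."""
    arr = np.asarray(values, dtype=float)
    lower, upper = tukey_fences(arr, k)
    return arr[(arr < lower) | (arr > upper)]


# Made-up data with one obviously extreme value.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
lower, upper = tukey_fences(data)
outliers = find_outliers(data)
print("fences:", lower, upper)
print("outliers:", outliers)
```

Calling `find_outliers(df['SepalWidthCm'])` on the real data would list the individual sepal width values behind the first insight.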

SETOSA
setosa = setosa.drop(columns='Id')  # reassign instead of modifying a filtered slice in place

plt.figure(figsize=(10, 6))
sns.boxplot(data=setosa, orient='h', fliersize=5, linewidth=1)
plt.title('Box Plot with Outliers for Iris Dataset')
plt.xlabel('Feature Value')
plt.ylabel('Features')
plt.show()


Insights from the box plots (Setosa species only):

1. Only the petal width and petal length features show outliers.
2. Sepal width has the largest interquartile range (IQR).

q3 = np.percentile(setosa['PetalLengthCm'],75)
print("Q3 is: ",q3)
q2 = np.percentile(setosa['PetalLengthCm'],50)
print("Q2 is: ",q2)
q1 = np.percentile(setosa['PetalLengthCm'],25)
print("Q1 is: ",q1)
IQR = q3-q1  # interquartile range spans Q1 to Q3
print("IQR is: ",round(IQR, 4))
# Outlier fences; the names avoid shadowing the built-in min() and max()
lower_fence = q1-1.5*IQR
upper_fence = q3+1.5*IQR
print("lower outlier fence is: ",round(lower_fence, 4))
print("upper outlier fence is: ",round(upper_fence, 4))

Q3 is: 1.5750000000000002
Q2 is: 1.5
Q1 is: 1.4
IQR is: 0.175
lower outlier fence is: 1.1375
upper outlier fence is: 1.8375

VIRGINICA
virginica = virginica.drop(columns='Id')  # reassign instead of modifying a filtered slice in place
plt.figure(figsize=(10, 6))
sns.boxplot(data=virginica, orient='h', fliersize=5, linewidth=1)
plt.title('Box Plot with Outliers for Iris Dataset')
plt.xlabel('Feature Value')
plt.ylabel('Features')
plt.show()

q3 = np.percentile(virginica['PetalLengthCm'],75)
print("Q3 is: ",q3)
q2 = np.percentile(virginica['PetalLengthCm'],50)
print("Q2 is: ",q2)
q1 = np.percentile(virginica['PetalLengthCm'],25)
print("Q1 is: ",q1)
IQR = q3-q1  # interquartile range spans Q1 to Q3
print("IQR is: ",round(IQR, 4))
# Outlier fences; the names avoid shadowing the built-in min() and max()
lower_fence = q1-1.5*IQR
upper_fence = q3+1.5*IQR
print("lower outlier fence is: ",round(lower_fence, 4))
print("upper outlier fence is: ",round(upper_fence, 4))

Q3 is: 5.875
Q2 is: 5.55
Q1 is: 5.1
IQR is: 0.775
lower outlier fence is: 3.9375
upper outlier fence is: 7.0375

VERSICOLOR
versicolor = versicolor.drop(columns='Id')  # reassign instead of modifying a filtered slice in place
plt.figure(figsize=(10, 6))
sns.boxplot(data=versicolor, orient='h', fliersize=5, linewidth=1)
plt.title('Box Plot with Outliers for Iris Dataset')
plt.xlabel('Feature Value')
plt.ylabel('Features')
plt.show()


q3 = np.percentile(versicolor['PetalLengthCm'],75)
print("Q3 is: ",q3)
q2 = np.percentile(versicolor['PetalLengthCm'],50)
print("Q2 is: ",q2)
q1 = np.percentile(versicolor['PetalLengthCm'],25)
print("Q1 is: ",q1)
IQR = q3-q1  # interquartile range spans Q1 to Q3
print("IQR is: ",round(IQR, 4))
# Outlier fences; the names avoid shadowing the built-in min() and max()
lower_fence = q1-1.5*IQR
upper_fence = q3+1.5*IQR
print("lower outlier fence is: ",round(lower_fence, 4))
print("upper outlier fence is: ",round(upper_fence, 4))

Q3 is: 4.6
Q2 is: 4.35
Q1 is: 4.0
IQR is: 0.6
lower outlier fence is: 3.1
upper outlier fence is: 5.5


# Species is non-numeric, so restrict the correlation to the numeric columns
correlation_matrix = df.corr(numeric_only=True)
# Set up the matplotlib figure
plt.figure(figsize=(8, 6))
# Draw the annotated heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
# Rotate the tick labels for better readability
plt.yticks(rotation=0)
plt.xticks(rotation=90)
# Set the title
plt.title('Correlation Heatmap of Iris Dataset Features')
plt.show()
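
A heatmap shows every pairwise correlation at once; the same matrix can also be ranked to see which pair of features moves together most strongly. The following is a sketch on made-up numbers, not the notebook's results. The frame `sample`, its column names, and the list `pairs` are assumptions for illustration.

```python
import pandas as pd

# Made-up data: y tracks x closely, z does not.
sample = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.1, 5.9, 8.0],
    "z": [4.0, 1.0, 3.0, 2.0],
    "label": ["a", "b", "a", "b"],
})

# Pearson correlation over the numeric columns only.
corr = sample.corr(numeric_only=True)

# Visit each unordered pair once, then sort by absolute correlation.
cols = list(corr.columns)
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
]
pairs.sort(key=lambda item: abs(item[2]), reverse=True)

for a, b, r in pairs:
    print(f"{a} vs {b}: {r:.2f}")
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.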

from matplotlib import pyplot as plt
df['SepalWidthCm'].plot(kind='hist', bins=20, title='SepalWidthCm', edgecolor='black')
plt.gca().spines[['top', 'right',]].set_visible(False)

INSIGHTS
Average petal length for each species

average_petal_length_setosa = df[df['Species'] == 'Iris-setosa']['PetalLengthCm'].mean()


average_petal_length_versicolor = df[df['Species'] == 'Iris-versicolor']['PetalLengthCm'].mean()
average_petal_length_virginica = df[df['Species'] == 'Iris-virginica']['PetalLengthCm'].mean()
print("Setosa: {:.2f}cm".format(average_petal_length_setosa))
print("Versicolor: {:.2f}cm".format(average_petal_length_versicolor))
print("Virginica: {:.2f}cm".format(average_petal_length_virginica))

Setosa: 1.46cm
Versicolor: 4.26cm
Virginica: 5.55cm
