25 - Assignment10.ipynb - Colaboratory
Scan the dataset and give the inference as:
a. List down the features and their types (e.g., numeric, nominal) available in the dataset.
b. Create a histogram for each feature in the dataset to illustrate the feature distributions.
c. Create a boxplot for each feature in the dataset.
d. Compare distributions and identify outliers.
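import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
df = pd.read_csv("iris.csv")
df.shape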
(150, 6)
df.isnull().sum()
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
https://colab.research.google.com/drive/1hf6Z7J-tNyENcnGuykxI1f157LcWO-JJ#scrollTo=FzG6bcfkkWKe&printMode=true 1/13
4/3/24, 6:25 PM Assignment4.ipynb - Colaboratory
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
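The Dtype column above answers part (a) of the task: the four float64 measurements are numeric, Species (object) is nominal, and Id is an integer row identifier rather than a real feature. A minimal sketch of that classification follows. It assumes a tiny hand-made frame in place of iris.csv; the helper name `feature_types` and the sample rows are illustrative, not part of the original notebook.

```python
import pandas as pd

def feature_types(frame):
    """Label each column 'numeric' or 'nominal' from its pandas dtype."""
    return {
        col: "numeric" if pd.api.types.is_numeric_dtype(frame[col]) else "nominal"
        for col in frame.columns
    }

# Tiny stand-in for iris.csv with the same column names (values are illustrative).
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

types = feature_types(sample)
for name, kind in types.items():
    # Id is reported as numeric by dtype, but it only identifies the row.
    print(f"{name}: {kind}")
```

The dtype check is only a first pass: an integer-coded category would also be reported as numeric, so the result still needs a quick manual review.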
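# Split the data by species; the per-species frames are reused in the cells below.
# 'Iris-versicolor' and 'Iris-virginica' are assumed to follow the 'Iris-setosa' label pattern.
g2 = df.groupby('Species')
setosa = g2.get_group('Iris-setosa').copy()
versicolor = g2.get_group('Iris-versicolor').copy()
virginica = g2.get_group('Iris-virginica').copy()
# STANDARD DEVIATION (numeric columns only; Species is a string column)
print("Standard Deviation for setosa", "\n", setosa.std(numeric_only=True), "\n")
print("Standard Deviation for versicolor", "\n", versicolor.std(numeric_only=True), "\n")
print("Standard Deviation for virginica", "\n", virginica.std(numeric_only=True), "\n")
setosa.describe()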
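The per-species statistics above can also be produced in one call instead of one print per species. The sketch below assumes a small illustrative frame (the values are made up for the example) and uses `groupby(...).agg(...)` to get the mean and sample standard deviation of every feature for every species.

```python
import pandas as pd

# Illustrative stand-in for the Iris frame (values chosen for easy checking).
sample = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
    "PetalLengthCm": [1.4, 1.6, 4.0, 5.0],
    "PetalWidthCm": [0.2, 0.4, 1.0, 1.5],
})

# One row per species; columns form a (feature, statistic) MultiIndex.
summary = sample.groupby("Species").agg(["mean", "std"])
print(summary)

# Individual values are addressed with a (feature, statistic) tuple.
setosa_mean = summary.loc["Iris-setosa", ("PetalLengthCm", "mean")]
print("Setosa mean petal length:", setosa_mean)
```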
VISUALISATIONS
# SCATTER PLOT : Relationship between sepal length and petal length.
sns.scatterplot(data=df, x='SepalLengthCm', y='PetalLengthCm', hue='SepalWidthCm', palette='viridis')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Petal Length (cm)')
plt.title('Sepal Length and Petal Length')
plt.show()
q3 = np.percentile(df['PetalLengthCm'],75)
print("Q3 is: ",q3)
Q3 is: 5.1
q2 = np.percentile(df['PetalLengthCm'],50)
print("Q2 is: ",q2)
Q2 is: 4.35
q1 = np.percentile(df['PetalLengthCm'],25)
print("Q1 is: ",q1)
Q1 is: 1.6
# The interquartile range is Q3 - Q1 (Q2 is the median and is not part of it).
IQR = q3-q1
print("IQR is: ",IQR)
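The quartile steps above feed Tukey's rule: with IQR = Q3 - Q1, a value is flagged as an outlier when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. A minimal sketch of that rule as a reusable helper is shown here; the function name `iqr_bounds` and the sample array are assumptions made for illustration.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (q1, q3, iqr, lower, upper) using Tukey's k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr

# Illustrative values: eight evenly spaced points and one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])
q1, q3, iqr, lower, upper = iqr_bounds(data)

# Keep only the values outside the [lower, upper] fences.
outliers = data[(data < lower) | (data > upper)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower, upper)
print("outliers:", outliers)
```

On the notebook's data the same helper could be called as `iqr_bounds(df['PetalLengthCm'])`.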
import matplotlib.pyplot as plt
df.boxplot(column = ['PetalLengthCm'])
plt.show()
df.drop(columns='Id', inplace=True)  # inplace=True returns None, so the result is not assigned
plt.figure(figsize=(10, 6))
sns.boxplot(data=df.drop(columns='Species'), orient='h', fliersize=5, linewidth=1)
plt.title('Box Plot with Outliers for Iris Dataset')
plt.xlabel('Feature Value')
plt.ylabel('Features')
plt.show()
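Part (c) of the task asks for a boxplot for each feature. The combined plot above puts all features on one shared axis, which can compress the features with a small range. The sketch below draws one boxplot per feature on its own axis instead. It assumes a small illustrative frame in place of the real data and selects the non-interactive Agg backend so it runs without a display; in a notebook that line can be dropped and `plt.show()` called at the end.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the four Iris measurements.
sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5],
})

features = list(sample.columns)
fig, axes = plt.subplots(1, len(features), figsize=(12, 3))
for ax, name in zip(axes, features):
    # By default, points beyond 1.5 * IQR from the box are drawn as fliers.
    ax.boxplot(sample[name].to_numpy())
    ax.set_title(name)
    ax.set_xticks([])
fig.tight_layout()
```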
SETOSA
setosa.drop(columns='Id', inplace=True)
plt.figure(figsize=(10, 6))
sns.boxplot(data=setosa, orient='h', fliersize=5, linewidth=1)
plt.title('Box Plot with Outliers for Iris Setosa')
plt.xlabel('Feature Value')
plt.ylabel('Features')
plt.show()
1. Only the petal width and petal length features have outliers.
2. Sepal width has the highest interquartile range (IQR).
q3 = np.percentile(setosa['PetalLengthCm'],75)
print("Q3 is: ",q3)
q2 = np.percentile(setosa['PetalLengthCm'],50)
print("Q2 is: ",q2)
q1 = np.percentile(setosa['PetalLengthCm'],25)
print("Q1 is: ",q1)
# IQR is Q3 - Q1 (not Q3 - Q2); rounding hides floating-point noise
IQR = round(q3-q1, 4)
print("IQR is: ",IQR)
# Named lower/upper to avoid shadowing the built-in min() and max()
lower = round(q1-1.5*IQR, 4)
upper = round(q3+1.5*IQR, 4)
print("minimum outlier value is: ",lower)
print("maximum outlier value is: ",upper)
Q3 is: 1.5750000000000002
Q2 is: 1.5
Q1 is: 1.4
IQR is: 0.175
minimum outlier value is: 1.1375
maximum outlier value is: 1.8375
VIRGINICA
virginica.drop(columns='Id', inplace=True)
plt.figure(figsize=(10, 6))
sns.boxplot(data=virginica, orient='h', fliersize=5, linewidth=1)
plt.title('Box Plot with Outliers for Iris Virginica')
plt.xlabel('Feature Value')
plt.ylabel('Features')
plt.show()
q3 = np.percentile(virginica['PetalLengthCm'],75)
print("Q3 is: ",q3)
q2 = np.percentile(virginica['PetalLengthCm'],50)
print("Q2 is: ",q2)
q1 = np.percentile(virginica['PetalLengthCm'],25)
print("Q1 is: ",q1)
# IQR is Q3 - Q1 (not Q3 - Q2); rounding hides floating-point noise
IQR = round(q3-q1, 4)
print("IQR is: ",IQR)
# Named lower/upper to avoid shadowing the built-in min() and max()
lower = round(q1-1.5*IQR, 4)
upper = round(q3+1.5*IQR, 4)
print("minimum outlier value is: ",lower)
print("maximum outlier value is: ",upper)
Q3 is: 5.875
Q2 is: 5.55
Q1 is: 5.1
IQR is: 0.775
minimum outlier value is: 3.9375
maximum outlier value is: 7.0375
VERSICOLOR
versicolor.drop(columns='Id', inplace=True)
plt.figure(figsize=(10, 6))
sns.boxplot(data=versicolor, orient='h', fliersize=5, linewidth=1)
plt.title('Box Plot with Outliers for Iris Versicolor')
plt.xlabel('Feature Value')
plt.ylabel('Features')
plt.show()
q3 = np.percentile(versicolor['PetalLengthCm'],75)
print("Q3 is: ",q3)
q2 = np.percentile(versicolor['PetalLengthCm'],50)
print("Q2 is: ",q2)
q1 = np.percentile(versicolor['PetalLengthCm'],25)
print("Q1 is: ",q1)
# IQR is Q3 - Q1 (not Q3 - Q2); rounding hides floating-point noise
IQR = round(q3-q1, 4)
print("IQR is: ",IQR)
# Named lower/upper to avoid shadowing the built-in min() and max()
lower = round(q1-1.5*IQR, 4)
upper = round(q3+1.5*IQR, 4)
print("minimum outlier value is: ",lower)
print("maximum outlier value is: ",upper)
Q3 is: 4.6
Q2 is: 4.35
Q1 is: 4.0
IQR is: 0.6
minimum outlier value is: 3.1
maximum outlier value is: 5.5
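The same quartile cell is repeated for each species with only the frame name changed. A hedged sketch of how the repetition could be folded into one loop over a `groupby` is shown below; the helper name `tukey_bounds`, the `report` frame, and the sample values are all illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

def tukey_bounds(series, k=1.5):
    """Return the (lower, upper) outlier fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values: the first group contains one extreme measurement.
sample = pd.DataFrame({
    "Species": ["Iris-setosa"] * 5 + ["Iris-versicolor"] * 5,
    "PetalLengthCm": [1.0, 2.0, 3.0, 4.0, 50.0, 4.0, 4.1, 4.2, 4.3, 4.4],
})

rows = []
for species, group in sample.groupby("Species"):
    values = group["PetalLengthCm"]
    lower, upper = tukey_bounds(values)
    # Count the measurements that fall outside the fences for this species.
    n_out = int(((values < lower) | (values > upper)).sum())
    rows.append({"Species": species, "lower": lower, "upper": upper, "n_outliers": n_out})

report = pd.DataFrame(rows).set_index("Species")
print(report)
```

Computing the fences per species matters here because the petal lengths of the three species occupy different ranges, so fences taken from the pooled data would not match any single group.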
# Correlate numeric columns only; the string-valued Species column cannot be correlated
correlation_matrix = df.corr(numeric_only=True)
# Set up the matplotlib figure
plt.figure(figsize=(8, 6))
# Draw the annotated heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
# Rotate the tick labels for better readability
plt.yticks(rotation=0)
plt.xticks(rotation=90)
# Set the title
plt.title('Correlation Heatmap of Iris Dataset Features')
plt.show()
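A heatmap shows every pair twice plus a diagonal of 1.0s, so it can help to list each feature pair once, ranked by correlation. The sketch below does that under the assumption of a small illustrative frame; `numeric_only=True` needs pandas 1.5 or later, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the Iris frame, including the string Species column.
sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica"],
})

# Pearson correlation over the numeric columns only.
corr = sample.corr(numeric_only=True)

# Keep the strict upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```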
from matplotlib import pyplot as plt
df['SepalWidthCm'].plot(kind='hist', bins=20, title='SepalWidthCm', edgecolor='black')
plt.gca().spines[['top', 'right',]].set_visible(False)
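Part (b) of the task asks for a histogram of each feature, while the cell above covers SepalWidthCm only. A minimal sketch that repeats the same styling for every numeric column follows. It assumes a small illustrative frame and the non-interactive Agg backend; in a notebook the backend line can be dropped and `plt.show()` called at the end.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the Iris frame.
sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica"],
})

# Histograms only make sense for the numeric features.
numeric = sample.select_dtypes("number")
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, name in zip(axes.ravel(), numeric.columns):
    ax.hist(numeric[name].to_numpy(), bins=5, edgecolor="black")
    ax.set_title(name)
    # Hide the top and right borders, as in the single-feature cell.
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
fig.tight_layout()
```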
INSIGHTS
Average petal length for each species:
Setosa: 1.46 cm
Versicolor: 4.26 cm