Ass 10 DSBDL
Aim: Data Visualization III
Download the Iris flower dataset or any other dataset into a DataFrame
(e.g., https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference
as:
1. List down the features and their types (e.g., numeric, nominal) available in the
dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature
distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.
Introduction:
The Iris dataset is often called the "Hello World" of data science: anyone starting a career in
data science or machine learning is likely to practice basic ML algorithms on this famous
dataset. It contains five columns: Petal Length, Petal Width, Sepal Length, Sepal Width and
Species Type.
Iris is a genus of flowering plants. Researchers measured these features on different iris
flowers and recorded them digitally.
Histograms
Histograms show the distribution of the data in a column. They can be used for
univariate as well as bivariate analysis.
Seaborn's distplot is used for a univariate set of observations and visualizes it through
a histogram, i.e. it takes only one variable, so we choose one particular column of the
dataset. (In seaborn 0.11 and later, distplot is deprecated in favour of histplot and displot.)
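To make concrete what a histogram computes, the following minimal sketch bins one column and counts the values in each bin. The petal lengths are a small hand-typed illustrative sample (not the full dataset), and the bin edges are an assumption chosen only to keep the example readable.

```python
import numpy as np

# Illustrative petal lengths in cm (hand-typed sample, not the full dataset).
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9])

# Assumed bin edges: three bins of width 2 cm covering 1 cm to 7 cm.
bin_edges = np.array([1.0, 3.0, 5.0, 7.0])

# np.histogram returns how many values fall into each bin.
# Every bin is half-open [left, right) except the last, which includes its right edge.
counts, edges = np.histogram(petal_length, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} cm: {count} flowers")
```

Passing the same array and edges to plt.hist(petal_length, bins=bin_edges) draws these counts as bars.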
Box Plots
A box plot displays the distribution of numeric data in terms of quartiles, optionally grouped
by a categorical variable. The line inside the box marks the median. The span from the lower
whisker to the bottom of the box covers the first quarter of the data. From the bottom of the
box (the first quartile, Q1) to the median lies the second quarter. From the median to the top
of the box (the third quartile, Q3) lies the third quarter, and from the top of the box to the
upper whisker lies the last quarter. Points drawn beyond the whiskers are potential outliers.
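The quantities a box plot draws can also be computed directly. The sketch below assumes the common 1.5 x IQR rule for the whisker limits (this is matplotlib's default, whis=1.5, but the source does not state which rule it uses). The function name box_plot_stats and the sample values are hypothetical, chosen for illustration only.

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Return quartiles, whisker fences and outliers for a 1-D sample.

    Assumes the 1.5 x IQR convention: points outside
    [Q1 - whis * IQR, Q3 + whis * IQR] are flagged as outliers.
    """
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    is_outlier = (values < lower_fence) | (values > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": values[is_outlier].tolist(),
    }

# Illustrative sepal widths in cm (hand-typed sample with one extreme value).
sample = [2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 4.4]
stats = box_plot_stats(sample)
for name, value in stats.items():
    print(name, value)
```

Running the same function on each numeric column gives a numeric counterpart to the boxes, which helps when comparing distributions and identifying outliers.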
Attribute Information about data set:
-> sepal length in cm
-> sepal width in cm
-> petal length in cm
-> petal width in cm
-> class:
Iris Setosa
Iris Versicolour
Iris Virginica
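The feature types listed above can be checked programmatically. This is a minimal sketch, assuming a Kaggle-style CSV layout with the column names Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm and Species. The helper feature_types is hypothetical, and the few rows are typed in for illustration rather than being the full dataset.

```python
import io
import pandas as pd

# A few illustrative rows in the assumed layout (not the full dataset).
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
"""
data = pd.read_csv(io.StringIO(csv_text))

def feature_types(df):
    """Label each column 'numeric' or 'nominal' based on its pandas dtype."""
    return {
        col: "numeric" if pd.api.types.is_numeric_dtype(df[col]) else "nominal"
        for col in df.columns
    }

# Id is a row identifier rather than a feature, so it is dropped first.
types = feature_types(data.drop(columns=["Id"]))
for name, kind in types.items():
    print(f"{name}: {kind}")
```

With the real file, replacing io.StringIO(csv_text) by the file path should give the same labels: four numeric measurements and one nominal class column.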
Summary Statistics:
               Min  Max  Mean  SD    Class Correlation
sepal length:  4.3  7.9  5.84  0.83   0.7826
sepal width:   2.0  4.4  3.05  0.43  -0.4194
petal length:  1.0  6.9  3.76  1.76   0.9490 (high!)
petal width:   0.1  2.5  1.20  0.76   0.9565 (high!)
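2. Create a histogram for each feature in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
# Assumption: a local Iris.csv with columns Id, SepalLengthCm, SepalWidthCm,
# PetalLengthCm, PetalWidthCm and Species (the file name is hypothetical)
data = pd.read_csv("Iris.csv")
x = data.SepalLengthCm
plt.hist(x)  # default bin count
plt.xlabel("Sepal_Length_cm")
plt.ylabel("Count")
plt.show()
x = data.SepalWidthCm
plt.hist(x)
plt.xlabel("Sepal_Width_cm")
plt.ylabel("Count")
plt.show()
x = data.PetalLengthCm
plt.hist(x)
plt.xlabel("Petal_Length_cm")
plt.ylabel("Count")
plt.show()
x = data.PetalWidthCm
plt.hist(x)
plt.xlabel("Petal_Width_cm")
plt.ylabel("Count")
plt.show()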
3. Create a box plot for each feature in the dataset.
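# removing Id column
new_data = data.drop(columns=["Id"])
print(new_data.head())
# boxplot() draws one box per numeric column; the non-numeric Species column is skipped
new_data.boxplot()
plt.show()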
Conclusion: Thus, we have studied data visualization on the Iris dataset using
histograms and box plots.