TITLE
Assignment No. 10
Aim:
Compute summary statistics and visualize the features of the Iris dataset (or any other
dataset) using histograms and boxplots.
Prerequisites
Learning Objectives
Learning Outcome:
• Students will be able to compute summary statistics on the features of a dataset and
plot histograms and boxplots of those features.
Theory:
Data analysis is a process of inspecting, cleansing, transforming, and
modelling data with the goal of discovering useful information, informing
conclusions, and supporting decision-making. Data analysis has multiple facets and
approaches, encompassing diverse techniques under a variety of names, while being
used in different business, science, and social science domains.
A data set (or dataset) is a collection of data. Most commonly a data set corresponds
to the contents of a single database table, or a single statistical data matrix, where
every column of the table represents a particular variable, and each row corresponds
to a given member of the data set in question.
The Iris dataset contains four features (length and width of sepals and petals) for 50
samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor),
150 samples in total. These measurements were used to create a linear discriminant model
to classify the species. The dataset is often used in data mining, classification and
clustering examples and to test algorithms.
Attribute Information:
-> sepal length in cm
-> sepal width in cm
-> petal length in cm
-> petal width in cm
-> class:
Iris Setosa
Iris Versicolor
Iris Virginica
Summary statistic:
Mean, standard deviation, regression, sample size determination and hypothesis
testing are fundamental methods of data analysis. The measures defined here each
summarize the values of a single feature.
Mean: The sum of all the data entries divided by the number of entries.
Range: The difference between the maximum and minimum data entries in the
set.
Range = (Max. data entry) – (Min. data entry)
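As a small illustration, the mean and range can be computed directly from their definitions. The sketch below uses five sepal-length values (in cm) chosen only as sample input, not the full dataset; the function names are illustrative.

```python
def mean(values):
    """Sum of all entries divided by the number of entries."""
    return sum(values) / len(values)


def value_range(values):
    """Difference between the largest and smallest entries."""
    return max(values) - min(values)


# Sample input: a few sepal-length measurements in cm (not the full dataset).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]

print(f"mean  = {mean(sepal_length):.2f} cm")
print(f"range = {value_range(sepal_length):.2f} cm")
```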
Standard deviation:
The standard deviation measures the variability, and hence the consistency, of a sample
or population. It is the square root of the variance and has the same unit as the data.
In most real-world applications, consistency is a great advantage. In
statistical data analysis, less variation is often better.
Variance: The average of the squared deviations from the mean.
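The variance and standard deviation can be sketched in the same way. The version below divides by the number of entries n (the "population" form, which matches the definition as an average); it is shown next to the sample form from Python's statistics module, which divides by n - 1. The input values and function names are illustrative assumptions.

```python
import math
import statistics


def population_variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def population_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))


# Sample input: a few sepal-length measurements in cm (not the full dataset).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]

print(f"variance (population) = {population_variance(sepal_length):.4f}")
print(f"SD (population)       = {population_sd(sepal_length):.4f}")
# The sample versions divide by n - 1, so they are slightly larger.
print(f"variance (sample)     = {statistics.variance(sepal_length):.4f}")
print(f"SD (sample)           = {statistics.stdev(sepal_length):.4f}")
```

Which form to report depends on whether the data is treated as the whole population or as a sample drawn from it.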
Percentile: Let p be any integer between 0 and 100. The pth percentile of a data set
is the data value at which p percent of the values in the data set are less than or
equal to this value.
• How to calculate percentiles: Use the following steps for calculating percentiles
for small data sets.
• Step 1: Sort the data in ascending order (from smallest to largest)
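A minimal sketch of a percentile calculation for a small data set follows. Only the sorting step comes from the text above; the remaining steps use the nearest-rank convention, which is an assumption chosen because it matches the definition of the pth percentile given earlier. Other conventions (for example linear interpolation between neighbouring values) can give slightly different results. The function name is illustrative.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the smallest data value with at least p percent of values <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)                      # Step 1: sort ascending
    # Assumed convention: rank = ceil(p/100 * n), with rank 1 as the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]                      # ranks are 1-based


# Sample input: a few sepal-length measurements in cm (not the full dataset).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]

for p in (0, 25, 50, 75, 100):
    print(f"{p:3d}th percentile = {percentile_nearest_rank(sepal_length, p)}")
```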
Summary Statistics:
               Min (cm)  Max (cm)  Mean (cm)  SD (cm)  Class Correlation
sepal length:  4.3       7.9       5.84       0.83      0.7826
sepal width:   2.0       4.4       3.05       0.43     -0.4194
petal length:  1.0       6.9       3.76       1.76      0.9490 (high!)
petal width:   0.1       2.5       1.20       0.76      0.9565 (high!)
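A table of this kind can be produced with a few lines of code. The sketch below runs on a small illustrative subset of nine rows (three per species), so its numbers will not match the full-dataset values in the table. It also assumes that "Class Correlation" means the Pearson correlation between a feature and the class coded as 0, 1, 2; that coding and all names in the code are assumptions made for illustration.

```python
import math
import statistics

FEATURES = ("sepal length", "sepal width", "petal length", "petal width")

# Illustrative subset: three rows per species, measurements in cm.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]

# Assumed numeric coding of the class, used only for the correlation column.
CLASS_CODE = {"setosa": 0, "versicolor": 1, "virginica": 2}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(rows):
    """Return min, max, mean, sample SD and class correlation per feature."""
    codes = [CLASS_CODE[row[4]] for row in rows]
    summary = {}
    for i, name in enumerate(FEATURES):
        column = [row[i] for row in rows]
        summary[name] = {
            "min": min(column),
            "max": max(column),
            "mean": statistics.fmean(column),
            "sd": statistics.stdev(column),
            "class_corr": pearson(column, codes),
        }
    return summary


summary = summarize(ROWS)
print(f"{'feature':<14}{'Min':>6}{'Max':>6}{'Mean':>7}{'SD':>7}{'Corr':>9}")
for name, s in summary.items():
    print(f"{name:<14}{s['min']:>6.1f}{s['max']:>6.1f}"
          f"{s['mean']:>7.2f}{s['sd']:>7.2f}{s['class_corr']:>9.4f}")
```

To reproduce the table itself, ROWS would be replaced by all 150 rows of the dataset loaded from a file.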
Box Plot:
A boxplot summarizes the distribution of the data. It shows the minimum, first
quartile (Q1), median, third quartile (Q3) and maximum, and it makes outliers easy to see.
The interquartile range (IQR = Q3 - Q1) is the length of the box, which covers the middle
50% of the data.
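The quantities drawn in a boxplot can be computed as follows. This is a sketch under two assumptions that the text does not state: quartiles are taken with the "inclusive" method of Python's statistics.quantiles, and a point counts as an outlier when it lies more than 1.5 × IQR beyond the box, a common convention. The input values are hypothetical sepal widths in cm.

```python
import statistics


def boxplot_stats(values):
    """Return the quartiles, IQR, whisker ends and outliers of a data set."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed outlier rule: beyond 1.5 * IQR from the box edges.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": outliers,
    }


# Hypothetical sepal-width values in cm, including two extreme points.
sepal_width = [2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.2, 3.4, 4.4]

stats = boxplot_stats(sepal_width)
for key, value in stats.items():
    print(f"{key:>12}: {value}")
```

Different quartile conventions move Q1 and Q3 slightly, so the set of flagged outliers can differ between tools.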
Histogram:
A histogram divides the range of the data into intervals (bins) and draws one bar per bin
whose height is the number of values that fall in it.
Both histograms and box plots are used to explore and present the data in an easy and
understandable manner. Histograms are preferred for judging the underlying probability
distribution of the data. Box plots, on the other hand, are more useful for comparing
several data sets. They are less detailed than histograms and take up less space.
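The counting behind a histogram can be sketched without any plotting library, as below. The equal-width binning, the function names and the input values (hypothetical petal lengths in cm) are assumptions made for illustration. The second function shows one possible way to draw both plots with matplotlib, a third-party library that is imported only when that function is called.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    lo, hi = min(values), max(values)
    counts = [0] * bins
    if hi == lo:                      # all values identical: one bin holds all
        counts[0] = len(values)
        return counts
    width = (hi - lo) / bins
    for v in values:
        # The maximum value belongs to the last bin, hence the clamp.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def plot_histogram_and_boxplot(values, bins=5):
    """Draw a histogram and a boxplot side by side (requires matplotlib)."""
    import matplotlib.pyplot as plt  # third-party; imported lazily

    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
    ax_hist.hist(values, bins=bins)
    ax_hist.set_title("Histogram")
    ax_box.boxplot(values)
    ax_box.set_title("Box plot")
    return fig


# Hypothetical petal-length values in cm.
petal_length = [1.3, 1.4, 1.4, 1.5, 4.5, 4.7, 4.9, 5.1, 5.9, 6.0]

counts = histogram_counts(petal_length, bins=5)
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```

With matplotlib installed, calling plot_histogram_and_boxplot(petal_length) and then matplotlib.pyplot.show() displays the two plots next to each other, which makes the difference in level of detail easy to compare.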