Iris Dataset
Allan Lao
2023-09-26
Iris Dataset in R
The iris dataset is a built-in dataset in R that contains measurements of 4 attributes (in centimeters) for 150 flowers: 50 from each of 3 different
species.
To explore the dataset, we can describe it statistically or visualize it using charts.
Load the Iris Dataset
Since the iris dataset is built in, we simply need to load it and use it.
data(iris)
Explore the Structure of the dataset
The first step is to examine the data structure to determine its size, number of columns, and other attributes. The order in which you look at these is
up to the analyst.
Structure
The structure of the dataset
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str() shows the structure of the dataset: the number of observations (records) and variables, as well as each variable's data type. There are 150 rows of records in
the iris dataset with 5 columns. Note that the Species variable has a data type of Factor.
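To make the Factor note concrete, here is a minimal sketch that inspects the column classes and the levels behind Species. The variable names col_classes and species_levels are only illustrative.

```r
# Class of every column: four numeric measurements and one factor
col_classes <- sapply(iris, class)
print(col_classes)

# A factor stores integer codes plus a fixed set of labels (levels)
species_levels <- levels(iris$Species)
print(species_levels)
print(nlevels(iris$Species))

# The underlying integer codes for the first few rows
print(head(as.integer(iris$Species)))
```

The integer codes are the same 1 1 1 ... values that str() prints after the level names.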
The dimension
dim(iris)
## [1] 150 5
The names of the columns
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
To take a glimpse at the first 4 rows:
head(iris,4)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
Optionally, you may also check the last 6 records:
tail(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
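Both head() and tail() accept a row count as their second argument (6 when omitted). The sketch below shows the calls used above next to plain row indexing, which selects the same records. The names first_four and last_six are illustrative.

```r
# First 4 rows, as returned by head(iris, 4)
first_four <- head(iris, 4)

# Last 6 rows; 6 is the default, written out here for clarity
last_six <- tail(iris, 6)

# The same records selected by explicit row indexing
first_four_idx <- iris[1:4, ]
last_six_idx <- iris[(nrow(iris) - 5):nrow(iris), ]

print(first_four)
print(last_six)
```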
Describe the Iris Dataset using Statistical tools
Now, let's use some statistics to describe the dataset.
The descriptive statistics summary
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
For each of the numeric variables we can see the following information:
Min: The minimum value.
1st Qu: The value of the first quartile (25th percentile).
Median: The median value.
Mean: The mean value.
3rd Qu: The value of the third quartile (75th percentile).
Max: The maximum value.
For the only categorical variable in the dataset (Species) we see a frequency count of each value:
setosa: This species occurs 50 times.
versicolor: This species occurs 50 times.
virginica: This species occurs 50 times.
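The values in the summary can also be computed one at a time. The following is a small sketch for Sepal.Length and Species using base R functions. The variable names are illustrative.

```r
x <- iris$Sepal.Length

# Minimum, quartiles, and maximum in one call (names: 0%, 25%, 50%, 75%, 100%)
sepal_quartiles <- quantile(x)
print(sepal_quartiles)

# Arithmetic mean
sepal_mean <- mean(x)
print(sepal_mean)

# Frequency count for the categorical variable
species_counts <- table(iris$Species)
print(species_counts)
```

The results should agree with the Sepal.Length and Species columns of the summary() output above.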
Visualize the Iris Dataset
The plot() function is the generic function for plotting R objects.
plot(iris)
Plotting the entire dataset produces a scatterplot matrix that provides a glimpse of the relationships between its variables. The panel directly below the Sepal.Length label shows Sepal.Width on the y-axis
and Sepal.Length on the x-axis.
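As a numeric complement to the scatterplot matrix, one option (not part of the original walkthrough) is to compute pairwise Pearson correlations for the four measurements. This is only a sketch, and the names measurements and cor_matrix are illustrative.

```r
# Keep only the four numeric columns; Species is a factor and is left out
measurements <- iris[, 1:4]

# Pearson correlation for every pair of measurements
cor_matrix <- cor(measurements)
print(round(cor_matrix, 2))
```

Values close to 1 or -1 correspond to panels where the points fall near a straight line.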
Plot quantitative variables
plot(iris$Sepal.Length) #Quantitative
Plot 2 quantitative variables
plot(iris$Sepal.Width, iris$Sepal.Length,
col=factor(iris$Species), # factor codes 1 to 3 pick the point colors
main='Sepal Length vs Width',
xlab='Sepal Width',
ylab='Sepal Length',
pch=19)
legend(x = "topleft", text.font = 4,
text.col = "blue",
pch=19, # same symbol as the plotted points
col = seq_along(levels(iris$Species)), # one color per species, matching the plot
legend=levels(iris$Species))
Plotting a Factor variable
The plot() function automatically detects the type of variable and determines the appropriate chart to use by default. For a factor, it draws a bar chart of the counts per level.
plot(iris$Species)
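For a factor, the same kind of chart can be built explicitly by counting the levels with table() and passing the counts to barplot(). This is a minimal sketch. The title and axis labels are illustrative choices.

```r
# Count how many flowers belong to each species
species_counts <- table(iris$Species)
print(species_counts)

# Draw the counts as a bar chart
barplot(species_counts,
        main = "Flowers per Species",
        xlab = "Species",
        ylab = "Count")
```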
Next, we will use a histogram to see how the data is spread across a range of values. Here we look at the distribution of Sepal Length.
hist(iris$Sepal.Length,
col='steelblue',
main='Histogram',
xlab='Length',
ylab='Frequency')
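To see the numbers behind the bars, hist() can be called with plot = FALSE, which returns the bin boundaries and counts without drawing anything. A small sketch follows. The name h is illustrative.

```r
# Compute the histogram without plotting it
h <- hist(iris$Sepal.Length, plot = FALSE)

# Bin boundaries chosen by the default rule
print(h$breaks)

# Number of observations falling in each bin
print(h$counts)
```

The counts add up to the 150 observations in the dataset.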
A box plot shows a five-number summary: the minimum, the 25th percentile, the median, the 75th percentile, and the maximum. It is thus
useful for visualizing the spread of the data and deriving inferences accordingly.
Using boxplot(), we can compare the distribution of sepal length across species.
boxplot(Sepal.Length~Species,
data=iris,
main='Sepal Length by Species',
xlab='Species',
ylab='Sepal Length',
col='steelblue',
border='black')
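The numbers behind each box can be inspected with fivenum(), which returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum). The sketch below applies it per species. The names by_species and five_numbers are illustrative. Note that the hinges can differ slightly from the quartiles reported by summary(), and that boxplot() by default draws points more than 1.5 times the box height beyond the box as separate outliers, so a whisker may stop short of the true minimum or maximum.

```r
# Split the sepal lengths into one vector per species
by_species <- split(iris$Sepal.Length, iris$Species)

# Five-number summary for each species, one column per species
five_numbers <- sapply(by_species, fivenum)
rownames(five_numbers) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(five_numbers)
```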