Module 2: R Assignment
Anuraag K. Macha
ALY6010: Probability Theory and Introductory Statistics
Dr. Thomas Goulding
06/01/24
Introduction
This analysis focuses on the Iris dataset from the UCI Machine Learning Repository,
utilizing ggplot2 and psych packages in R to generate descriptive statistics and visualizations.
The dataset includes measurements of sepal length, sepal width, petal length, and petal width for
three species of Iris flowers. The goal is to understand the dataset's overall structure, compare
measurements across species, and visualize key relationships and distributions.
Data Analysis
To gain an overview of the dataset, descriptive statistics were produced using the
describe function from the psych package. This provided insights into the mean, standard
deviation, minimum, maximum, and sample size (N) for each variable, detailed in the three-line
table below:
Variable             Mean    Std. Dev.   Minimum   Maximum   N
Sepal Length (cm)    5.84    0.83        4.3       7.9       150
Sepal Width (cm)     3.05    0.43        2.0       4.4       150
Petal Length (cm)    3.76    1.76        1.0       6.9       150
Petal Width (cm)     1.20    0.76        0.1       2.5       150
The average sepal length of the Iris flowers is 5.84 cm with a standard deviation of 0.83 cm.
Sepal width has an average of 3.05 cm and a standard deviation of 0.43 cm.
Petal length is the most variable measurement, with an average of 3.76 cm and a standard
deviation of 1.76 cm. Petal width has an average of 1.20 cm and a standard deviation of 0.76 cm,
which is large relative to its mean and reflects the varied petal sizes among the different
species.
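The summary table can be reproduced with a few lines of R. The sketch below is a minimal example under stated assumptions: R's built-in iris data frame stands in for the UCI file (the two copies are known to differ in a couple of rows, so the second decimal may not match the table exactly), and the helper name summarize_column is illustrative rather than taken from the submitted script. It computes the five reported columns in base R and calls psych::describe only when the psych package is installed.

```r
# Assumption: the built-in `iris` data frame stands in for the UCI download.
measurements <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Hypothetical helper: the five statistics reported in the table, for one column.
summarize_column <- function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x), n = length(x))
}

# sapply() returns a statistics-by-variable matrix; t() puts one variable per row.
summary_table <- round(t(sapply(measurements, summarize_column)), 2)
print(summary_table)

# psych::describe() reports these columns along with median, skew, kurtosis, and others.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(measurements))
}
```

Computing the statistics by hand alongside describe is a quick way to confirm which columns of the describe output feed the table.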
Next, descriptive statistics were generated by group, specifically by species, to observe
how these measurements varied across the three species of Iris flowers. This made it possible
to compare sepal and petal dimensions among Iris setosa, Iris versicolor, and Iris virginica.
Figure 1: Descriptive statistics by species
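A grouped summary of this kind might be produced as follows. This is a sketch rather than the submitted script: it assumes the built-in iris data frame, uses base R's aggregate for the per-species means and standard deviations, and calls psych::describeBy only if psych is available.

```r
# Assumption: built-in `iris`; Species holds setosa, versicolor, and virginica.
# Mean of every measurement within each species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# Standard deviation within each species, for comparing spread.
species_sds <- aggregate(. ~ Species, data = iris, FUN = sd)
print(species_sds)

# psych::describeBy() gives the full describe() output once per group.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describeBy(iris[, 1:4], group = iris$Species))
}
```

The aggregate output is compact enough to read at a glance, while describeBy is the closer match to the descriptive statistics shown in Figure 1.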
Three types of visualizations were then created using ggplot2. First, a scatter plot of sepal
length versus petal length was produced, with a linear regression line added by the geom_smooth
function and a straight line drawn with geom_abline. The plot shows a positive relationship
between the two variables: flowers with longer sepals tend to have longer petals.
Figure 2: Scatter Plot of Sepal Length vs Petal Length
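A scatter plot along these lines could be built as sketched below. Several choices here are assumptions: the built-in iris data frame is used, sepal length is placed on the x-axis, and the geom_abline line is drawn from the coefficients of a separate lm fit, since the report does not state which slope and intercept were used.

```r
library(ggplot2)

# Assumption: built-in `iris`, sepal length on x and petal length on y.
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["Sepal.Length"]]

scatter_plot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(colour = Species)) +
  # Regression line fitted by ggplot2 itself.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") +
  # The same line drawn from the lm() coefficients, dashed so the overlap is visible.
  geom_abline(intercept = intercept, slope = slope,
              linetype = "dashed", colour = "red") +
  labs(x = "Sepal length (cm)", y = "Petal length (cm)")

print(scatter_plot)

# A numeric check on the direction and strength of the relationship.
cat("Pearson correlation:",
    round(cor(iris$Sepal.Length, iris$Petal.Length), 2), "\n")
```

Drawing the geom_abline line from the lm coefficients makes the two layers coincide, which is a simple way to confirm that geom_smooth is fitting the intended model.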
Second, a jitter plot was created to show the distribution of petal length across different
species, using geom_jitter to reduce overplotting and provide a clearer view of data density. As
Figure 3 below shows, each species has a distinct range of petal lengths, with a few outliers.
Figure 3: Jitter Plot of Petal Length by Species
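The jitter plot might be produced as in the following sketch, which again assumes the built-in iris data frame. The jitter width, transparency, and seed are illustrative values, not settings taken from the report.

```r
library(ggplot2)

jitter_plot <- ggplot(iris, aes(x = Species, y = Petal.Length, colour = Species)) +
  # Jitter horizontally only, so petal lengths keep their true values;
  # the seed makes the random offsets reproducible.
  geom_jitter(position = position_jitter(width = 0.2, height = 0, seed = 42),
              alpha = 0.7) +
  labs(x = "Species", y = "Petal length (cm)") +
  theme(legend.position = "none")

print(jitter_plot)

# Minimum and maximum petal length per species, to check the separation numerically.
petal_ranges <- tapply(iris$Petal.Length, iris$Species, range)
print(petal_ranges)
```

Setting height = 0 matters here: vertical jitter would distort the measured petal lengths, whereas horizontal jitter only spreads points within each species column.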
Lastly, a boxplot of sepal length by species was generated using geom_boxplot to compare the
central tendency and spread of sepal lengths among the species and to detect potential
outliers. Figure 4 below shows that the species Iris virginica has one outlier.
Figure 4: Boxplot of Sepal Length by Species
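The boxplot of sepal length by species, and the outlier it flags, can be checked with a short sketch like the one below. It assumes the built-in iris data frame and uses base R's boxplot.stats to list the flagged values; boxplot.stats works from hinges while geom_boxplot works from quantiles, so the two can disagree in borderline cases.

```r
library(ggplot2)

box_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  # Points beyond 1.5 x IQR from the box are drawn individually, here in red.
  geom_boxplot(outlier.colour = "red") +
  labs(x = "Species", y = "Sepal length (cm)")

print(box_plot)

# List the values flagged by the 1.5 x IQR rule within each species.
outliers <- tapply(iris$Sepal.Length, iris$Species,
                   function(x) boxplot.stats(x)$out)
print(outliers)
```

Listing the flagged values next to the plot makes it easy to see which observation sits outside the whiskers, rather than reading it off the figure.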
Conclusion
The analysis of the Iris dataset combined descriptive statistics with visualizations. The
descriptive statistics revealed substantial variation in sepal and petal dimensions across
the three species. The scatter plot, jitter plot, and boxplot created using ggplot2 illustrated
the positive relationship between sepal and petal length, the distinct petal-length
distributions of the species, and a potential outlier in Iris virginica. Together, these results
clarify the differences and relationships among the dataset's key variables across species.
Appendix
The written and executed R commands are included in the R script file that was submitted
alongside this file.