Lecture 2 Data Presentation
Data Presentation
A. Frequency Table
The table() function is used to generate frequency tables, which count the occurrences
of unique values in a vector or factor. Here's the basic syntax:
table(x)
Where x is the vector or factor for which you want to create the frequency table. For
example, suppose you have a vector x:
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")
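Calling table() on this vector returns the count of each unique value. A minimal sketch using the vector above (the name counts is an illustrative assumption, not from the original):

```r
# Vector from the example above
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")

# Count how many times each unique value occurs
counts <- table(x)
print(counts)
# x
# A B C 
# 4 3 3 
```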
You can also use table() with multiple variables to create contingency tables. For
instance:
y <- c("M", "F", "F", "M", "M", "F", "M", "F", "F", "M")
table(x, y)
This will produce a contingency table showing the frequencies of combinations of values
of x and y.
   y
x   F M
  A 2 2
  B 2 1
  C 1 2
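Row and column totals are often useful alongside the counts. As a sketch, base R's addmargins() appends them to a contingency table (the names tab and tab_with_totals are illustrative):

```r
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")
y <- c("M", "F", "F", "M", "M", "F", "M", "F", "F", "M")

# Two-way table of counts
tab <- table(x, y)

# Append a "Sum" row and a "Sum" column holding the totals
tab_with_totals <- addmargins(tab)
print(tab_with_totals)
```

The bottom-right cell of the result is the total number of observations.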
To compute percentages in a frequency table in R, you can use the prop.table() function
along with table(). Here's how you can do it:
# Create a vector
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")
# Compute frequencies
Frequency <- table(x)
# Compute percentages
Percentage <- prop.table(Frequency) * 100
# Combine frequencies and percentages
freq_table <- cbind(Frequency, Percentage)
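To inspect the result, print the combined table. The sketch below repeats the steps above and rounds the percentages to one decimal place; the rounding is an illustrative choice, not part of the original example.

```r
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")

Frequency <- table(x)                       # counts per category
Percentage <- prop.table(Frequency) * 100   # share of each category, in percent

# One row per category, one column each for the count and the percentage
freq_table <- cbind(Frequency, Percentage = round(Percentage, 1))
print(freq_table)
```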
We can also present the data graphically, for example using a bar plot. The barplot()
function is used to create bar plots, also known as bar charts or bar graphs. Bar plots are
used to visualize the distribution of categorical data by representing the frequencies or
proportions of different categories using rectangular bars. The basic syntax of the
barplot() function is:
barplot(height, ...)
Where:
• height: a vector or matrix of values giving the heights of the bars.
• ...: additional arguments that control the appearance of the bar plot, such as colors,
axis labels, titles, etc.
For example, suppose you have a vector height representing the heights of bars you want
to plot:
height <- c(10, 20, 15, 25, 30)
barplot(height)
This will create a simple bar plot with bars of different heights.
You can also customize the appearance of the bar plot using additional arguments. For
example, you can specify the names of the bars using the names.arg argument:
bar_names <- c("Category 1", "Category 2", "Category 3", "Category 4", "Category 5")
# Use the height vector defined above
barplot(height, names.arg = bar_names, col = "blue", main = "Bar Plot Example",
        xlab = "Categories", ylab = "Frequency")
This will create a bar plot with custom bar names, blue bars, a main title, and axis labels.
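Because table() returns named counts, a frequency table can be passed straight to barplot(). A small sketch reusing the vector x from the frequency table section (the colour, title, and axis labels are illustrative):

```r
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")
counts <- table(x)

# barplot() takes the bar labels from the table's names
# and (invisibly) returns the x positions of the bar midpoints
midpoints <- barplot(counts, col = "steelblue", main = "Frequency of x",
                     xlab = "Value", ylab = "Frequency")
```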
B. ggplot2 Graphics
The Grammar of Graphics helps us build graphical representations from different visual
elements. This grammar allows us to communicate about plot components. The Grammar
of Graphics was created by Leland Wilkinson and was adapted by Hadley Wickham.
Installing ggplot2
So let us begin by first installing this package using the R function install.packages(),
and then loading it with library():
install.packages('ggplot2')
library(ggplot2)
This guide will use the ‘Iris’ dataset and ‘Motor trend car road tests’ dataset.
The iris dataset contains measurements (in centimetres) of four features for 50 flowers
from each of three species. Because it is a built-in dataset in R, we can load it using the
following command:
data(iris)
If you wish to quickly summarize the dataset, use the summary() function and it will
summarize each variable in the dataset.
summary(iris)
A ggplot2 plot is built from three basic elements: Plot = Data + Aesthetics + Geometry.
1. Scatter Plot
Now we will start this tutorial with a scatter plot. To plot it, we will be using the
geom_point() function. Here we will plot the Sepal length variable on the x-axis and
the petal length variable on the y axis.
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length))+geom_point()
#We can plot different shapes for different species by using the following command:
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species, shape=Species))+
geom_point()
ggplot2 provides a variety of geoms for plotting different kinds of graphs, such as:
• Scatter Plot: To plot individual points, use geom_point
• Bar Charts: For drawing bars, use geom_bar
• Histograms: For drawing binned values, use geom_histogram
• Line Charts: To plot lines, use geom_line
• Polygons: To draw arbitrary shapes, use geom_polygon
• Creating Maps: Use geom_map for drawing polygons in the shape of a map;
the map outlines can be obtained with the map_data() function
• Creating Patterns: Use geom_smooth for showing simple
trends or approximations
Points and smoothed lines can be plotted together for the same x and y variables, but
with different colours for each geom.
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species))+
geom_point(color = "blue") + geom_smooth(color = "red")
2. Bar Plot
We will plot a bar chart of the number of cars for each value of gear in the mtcars
dataset using the following command:
ggplot(mtcars, aes(x = gear)) + geom_bar()
Using the coord_flip() command, you can interchange the x-axis and y-axis:
ggplot(mtcars, aes(x = gear)) + geom_bar() + coord_flip()
Statistical Transformations
Many different statistical transformations are supported by ggplot2. For more control,
we can call the stat_ functions directly. For example, here we make a scatter plot of
horsepower vs mpg and then use stat_summary() to draw a dashed line through the mean
mpg at each horsepower value.
ggplot(mtcars, aes(hp, mpg)) + geom_point(color = "blue") +
stat_summary(fun = "mean", geom = "line", linetype = "dashed")
3. Histogram
A histogram is used to show the frequency distribution of a continuous (or discrete
numeric) variable by grouping its values into bins.
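As a sketch, a histogram of mpg from mtcars can be drawn with geom_histogram(). The number of bins (10), the colours, and the labels are illustrative choices, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

# Histogram of miles per gallon; bins sets the number of bars
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  labs(title = "Distribution of mpg", x = "Miles per gallon", y = "Count")
print(p)
```

Changing bins (or setting binwidth instead) changes how coarse the picture of the distribution is, so it is worth trying a few values.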
4. Box Plot
Similarly, we can use the geom_boxplot() command for plotting a box plot. We will
plot mpg vs cyl. Here mpg is a continuous variable, while cyl is categorical, so before
plotting we convert the variable cyl to a factor.
If we want to change the boundary colour of the boxplot, we have to use the
scale_color_manual() function with the hex values of colours of our choice.
cyl_factor <- as.factor(mtcars$cyl)
ggplot(mtcars, aes(x= cyl_factor, y=mpg,color = cyl_factor)) + geom_boxplot()+
scale_color_manual(values = c("#3a0ca3", "#c9184a", "#3a5a40"))
5. Pie Chart
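The pie chart shows proportions as parts of a whole. In ggplot2, a pie chart is a stacked
bar chart drawn in polar coordinates. The command below shows each cylinder group's
share of the total mpg: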
ggplot(mtcars, aes(x="", y=mpg, fill=cyl_factor)) +
geom_bar(stat="identity", width=1) + coord_polar("y", start=0)
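The command above stacks the raw mpg values. If the goal is instead the share of cars in each cylinder group, one option is to count the cars first and plot the counts. A sketch (the data frame name cyl_counts and the labels are illustrative assumptions):

```r
library(ggplot2)

# Count the cars in each cylinder group; columns are cyl (factor) and Freq
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))

# A pie chart is a single stacked bar drawn in polar coordinates
p <- ggplot(cyl_counts, aes(x = "", y = Freq, fill = cyl)) +
  geom_col(width = 1) +
  coord_polar(theta = "y", start = 0) +
  labs(x = NULL, y = NULL, fill = "Cylinders")
print(p)
```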
6. Contour Plot
ggplot2 can generate a 2D density contour plot with geom_density_2d, or with
geom_density_2d_filled for filled contour bands, as used below. You only need
to provide your data frame with the x and y values inside aes.
ggplot(mtcars, aes(mpg, hp)) + geom_density_2d_filled(show.legend = FALSE)+
coord_cartesian(expand = FALSE) + labs(x = "mpg")
You can also make a scatter plot with contour lines. First, add
the points using geom_point, and then add geom_density_2d.
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point() + geom_density_2d()
7. Customization in ggplot2 in R
We can do a lot with ggplot2. Let’s explore it in the following sections:
Plot Titles
You can add a title, a subtitle, a caption, and a tag for your visualization when using
ggplot2. There are two methods for adding titles: ggtitle and the labs function. The
former is only for titles and subtitles, but the latter allows for the addition of tags and
captions.
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) + geom_boxplot()+
ggtitle("mpg by cyl")
Similarly, you can add a subtitle the same way you added the title, using the subtitle
argument of the ggtitle() or labs() function:
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) + geom_boxplot()+
ggtitle("mpg by cyl", subtitle = "Subtitle of the plot")
Horizontal justification, hjust, controls the alignment of the title: 0 is left, 0.5 is
centre, and 1 is right. Similarly, vjust controls the vertical alignment.
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg))+geom_boxplot()+ggtitle("mpg by cyl")+
theme(plot.title = element_text(hjust = 1, size = 16, face = "bold"))
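The labs() function mentioned above can set the title, subtitle, caption, and tag in a single call, along with the axis labels. A sketch with placeholder text (all label strings are illustrative):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(title = "mpg by number of cylinders",   # main title
       subtitle = "Subtitle of the plot",      # shown below the title
       caption = "Source: mtcars dataset",     # shown below the plot
       tag = "A",                              # shown in the top-left corner by default
       x = "Cylinders", y = "Miles per gallon")
print(p)
```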
C. Lattice Graphics
There are several ways to make graphs in R. One approach is a system called lattice
graphics. The first step for using lattice is to load the lattice package using the check box
in the Packages tab or using the following command:
require(lattice)
Let’s switch to a more interesting data set from the Health Evaluation and Linkage to
Primary Care study. The HELP study was a clinical trial for adult inpatients recruited
from a detoxification unit. Patients with no primary care physician were randomized to
receive a multidisciplinary assessment and a brief motivational intervention or usual care,
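with the goal of linking them to primary medical care. The data are available as the
HELPrct data frame in the mosaicData package. After loading that package, you can find
out more about this data using R’s help.
require(mosaicData)
?HELPrct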
Histograms: histogram()
A histogram displays a distribution using area: where there is more area, there is more
data. Rectangles are used to indicate how much data is in each of several “bins”. The
result is a picture that shows a rough “shape” of the distribution.
Histograms are made with histogram(), using the formula interface. The y component of
the formula is empty since we let R compute the heights of the bars for us.
histogram(~ age, data=HELPrct, nint=20) # nint = 20 gives approx. 20 bars
We can use a conditioning variable to give us separate histograms for each sex.
histogram(~ age | sex, data=HELPrct, nint=20)
Boxplots: bwplot()
Boxplots are made with bwplot(). When the categorical variable is placed on the
horizontal axis, long category names can run into each other. We could fix that
run-together text by using abbreviated names or rotating the labels 45 or 90 degrees.
Instead of those solutions, we can also just reverse the roles of the horizontal and
vertical axes.
bwplot(substance ~ age, data=HELPrct)
Scatterplots: xyplot()
Scatterplots are made with xyplot(). The formula interface is very natural for this. Just
remember that the “y variable” comes first. (Its label is also farther left on the plot if that
helps you remember.)
Even better (for this example), we can use the groups argument to indicate the different
species using different symbols on the same panel.
xyplot(Sepal.Length ~ Sepal.Width, groups=Species, data=iris)
There are several ways to save plots, but the easiest is probably the following:
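One scripted option, shown here as a sketch, is to open a graphics device such as pdf(), print the plot, and close the device with dev.off(). The file name and dimensions are illustrative assumptions. Note that lattice plots must be printed explicitly when they are created inside scripts or functions.

```r
library(lattice)

# Build the plot object without drawing it yet
p <- xyplot(Sepal.Length ~ Sepal.Width, groups = Species, data = iris)

# Write the plot to a PDF file in the session's temporary directory
out_file <- file.path(tempdir(), "iris_scatter.pdf")
pdf(out_file, width = 7, height = 5)  # size in inches
print(p)                              # draws the plot on the open device
dev.off()                             # closes the device and finishes the file
```

Replacing pdf() with png() (and giving the size in pixels) produces a bitmap image instead.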
There are lots of arguments that control how these plots look. Here are just a few
examples.
auto.key
It would be useful to have a legend for the previous plot. auto.key=TRUE turns on
a simple legend. (There are ways to have more control, if you need it.)
xyplot(Sepal.Length ~ Sepal.Width, groups=Species, data=iris, auto.key=TRUE)
alpha, cex
Sometimes it is nice to have elements of a plot be partly transparent. When such elements
overlap, they get darker, showing us where data are “piling up.” Setting the alpha
argument to a value between 0 and 1 controls the degree of transparency: 1 is completely
opaque, 0 is invisible. The cex argument controls “character expansion” and can be used
to make the plotting “characters” larger or smaller by specifying the scaling ratio.
xyplot(Sepal.Length ~ Sepal.Width, groups=Species, data=iris,
auto.key=list(columns=3), alpha=.5, cex=1.3)
trellis.par.set()
Default settings for lattice graphics are set using trellis.par.set(). Don’t like the default
font sizes? You can change to a 7-point (base) font using
trellis.par.set(fontsize=list(text=7)) # base size for text is 7 point
Nearly every feature of a lattice plot can be controlled: fonts, colors, symbols, line
thicknesses, etc. Rather than describe them all here, we’ll mention only that groups
of these settings can be collected into a theme. show.settings() will show you what the
theme looks like.
trellis.par.set(theme=col.whitebg()) # a theme in the lattice package
show.settings()
The Current Population Survey (CPS) is used to supplement census information between
census years. The CPS data frame consists of a random sample of persons from the CPS,
with information on wages and other characteristics of the workers, including sex,
number of years of education, years of work experience, occupational status, region of
residence and union membership.
head(HELPrct, 3)
Categorical variables are often summarized in a table. R can make a table for a categorical
variable using xtabs().
xtabs(~ sex, HELPrct)
sex
female   male
   107    346
xtabs(~ substance, HELPrct)
substance
alcohol cocaine  heroin
    177     152     124
Alternatively, we can use table() and prop.table() to make tables of counts, proportions,
or percentages.
with(HELPrct, table(sex))
sex
female   male
   107    346
with(HELPrct, prop.table(table(sex)))
sex
   female      male
0.2362031 0.7637969
with(HELPrct, prop.table(table(sex))*100)
sex
  female     male
23.62031 76.37969
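We can make a cross table (also called a contingency table or a two-way table) by
putting two variables in the formula.
xtabs(~ sex + substance, HELPrct)
        substance
sex      alcohol cocaine heroin
  female      36      41     30
  male       141     111     94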
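Proportions within a cross table can be computed with the margin argument of prop.table(). The sketch below rebuilds the table from the counts shown above with rbind(), so it runs without the HELPrct data; the object names sex_by_substance and row_pct are illustrative.

```r
# Enter the counts row-wise; the names become the row and column labels
sex_by_substance <- as.table(rbind(
  female = c(alcohol = 36, cocaine = 41, heroin = 30),
  male   = c(alcohol = 141, cocaine = 111, heroin = 94)
))
names(dimnames(sex_by_substance)) <- c("sex", "substance")

# margin = 1 computes proportions within each row (each sex);
# margin = 2 would compute them within each column (each substance)
row_pct <- round(prop.table(sex_by_substance, margin = 1) * 100, 1)
print(row_pct)
```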
NW W
67 467
Replacing rbind() with cbind() will allow you to give the data column-wise instead.
Just as bar charts are used to display the distribution of one categorical variable, mosaic
plots can do the same for cross tables. mosaic() (from the vcd package) is not a
lattice plot, but it does use a similar formula interface.
require(vcd) # load the visualizing categorical data package
mosaic(~ sex + substance, HELPrct)
Or we can send our own hand-made table (although the output isn’t quite as nice without
some extra effort we won’t discuss just now):
mosaic(mycrosstable)
Barcharts can also be used to display two-way tables. First we convert the cross-table to a
data frame. Then we can use this data frame for plotting.
HELP <- as.data.frame(xtabs(~ sex + substance, data=HELPrct)); HELP
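One way to draw the plot, sketched with lattice's barchart(): as.data.frame() on a table produces a Freq column holding the counts, which goes on the left side of the formula. The table is rebuilt here from the counts shown earlier so that the example runs without HELPrct; the names tab and HELP_counts are illustrative.

```r
library(lattice)

# Two-way table of counts, entered row-wise
tab <- as.table(rbind(
  female = c(alcohol = 36, cocaine = 41, heroin = 30),
  male   = c(alcohol = 141, cocaine = 111, heroin = 94)
))
names(dimnames(tab)) <- c("sex", "substance")

# Long format: one row per sex/substance combination, counts in Freq
HELP_counts <- as.data.frame(tab)

# Side-by-side bars for each substance, grouped by sex, starting at zero
p <- barchart(Freq ~ substance, groups = sex, data = HELP_counts,
              auto.key = TRUE, origin = 0, ylab = "Count")
print(p)
```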
R Examples
The commands below are illustrated with the data sets iris and CPS. To apply these in
other situations, you will need to substitute the name of your data frame and the
variables in it.
Instructions: Document your code clearly, including comments explaining each step
and the rationale behind it. Provide clear and informative titles, labels, and legends for
each visualization, and organize your code into logical sections to improve readability.
1. Use the "iris" dataset available in R as the dataset for this exercise. It contains
measurements of sepal length, sepal width, petal length, petal width, and species
of iris flowers.
4. Plot a box plot of petal width ("Petal.Width") for each species of iris to compare
the distribution of petal widths across different species and interpret.
5. Visualize the relationship between sepal width ("Sepal.Width") and petal width
("Petal.Width") using a scatter plot with color-coded points for each species of
iris and interpret.
6. Create a bar plot showing the average sepal length ("Sepal.Length") for each
species of iris to compare sepal lengths across different species and interpret.
7. Explore any additional relationships or patterns in the data that you find
interesting using ggplot2.
Set B:
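1. Calculate the natural logarithm (log base e) and base 10 logarithm of 12,345.
Note that the number must be typed without the comma. With the comma, R treats 345 as
a second argument (the base) and returns the logarithm of 12 to base 345:
log(12,345)
[1] 0.4252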
2. Install and load the mosaic package. Make sure the lattice package is also loaded (no
need to install it, it is already installed).
Here are some other packages you may like to install as well.
3. Enter the following small data set in an Excel or Google spreadsheet and import
the data into RStudio.
You can import directly from Google. From Excel, save the file as a csv and import
that (as a text file) into RStudio. Name the data frame JunkData.
4. What is the average (mean) width of the sepals in the iris data set?
5. Determine the average (mean) sepal width for each of the three species in the iris
data set.
6. The Jordan8687 data set (in the fastR package) contains the number of points
Michael Jordan scored in each game of the 1986–87 season.
7. Cuckoos lay their eggs in the nests of other birds. Is the size of cuckoo eggs different
in different host species nests? The cuckoo data set (in fastR) contains data from a
study attempting to answer this question.
a. When were these data collected? (Use ?cuckoo to get information about the
data set.)
d. Calculate the mean length of the eggs for each host species.
e. What do you think? Does it look like the size differs among the different
host species? Refer to your R output as you answer this question. (We’ll
learn formal methods to investigate this later in the semester.)
8. The Utilities2 data set in the mosaic package contains a number of variables about
the utilities bills at a residence in Minnesota over a number of years. Since the
number of days in a billing cycle varies from month to month, variables like
gasbillpday (elecbillpday, etc.) contain the gas bill (electric bill, etc.) divided by
the number of days in the billing cycle.
b) What does type='l' do? Make your plot with and without it. Which is
easier to read in this situation?
c) What happens if we replace type='l' with type='b'?
Your first try probably won’t give you what you expect. The reason is that
month is coded using numbers, so R treats it as numerical data. We want to
treat it as categorical data. To do this in R use factor(month) in place of
month. R calls categorical data a factor.
f) Make any other plot you like using this data. Include both a copy of your
plot and a discussion of what you can learn from it.
9. The table below is from a study of nighttime lighting in infancy and eyesight (later
in life).
There are several ways to do this. You could use R as a calculator to do the
arithmetic. You can save some typing if you use the function
prop.table(). See ?prop.table for documentation. If you just want row and
column totals added to the table, see mar_table() in the vcd package.
c) Make a mosaic plot for this data. What does this plot reveal?