Lecture 2 Data Presentation

This document provides an overview of data presentation techniques in R, focusing on frequency tables, graphical representations using base R and ggplot2, and the lattice graphics system. It covers various plotting functions, including bar plots, histograms, scatter plots, and box plots, along with customization options for visualizations. Additionally, it introduces the ggplot2 package, its components, and how to create complex visualizations using different geometries and statistical transformations.

Uploaded by

aprillynsalipot
Lecture 2:

Data Presentation

A. Frequency Table

The table() function is used to generate frequency tables, which count the occurrences
of unique values in a vector or factor. Here's the basic syntax:
table(x)

Where x is the vector or factor for which you want to create the frequency table. For
example, suppose you have a vector x:
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")

You can use table() to generate a frequency table:


table(x)

This will produce:


x
A B C
4 3 3

You can also use table() with multiple variables to create contingency tables. For
instance:
y <- c("M", "F", "F", "M", "M", "F", "M", "F", "F", "M")
table(x, y)

This will produce a contingency table showing the frequencies of combinations of values
of x and y.
   y
x   F M
  A 2 2
  B 2 1
  C 1 2

To compute percentages in a frequency table in R, you can use the prop.table() function
along with table(). Here's how you can do it:
# Create a vector
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")

# Compute frequencies
Frequency <- table(x)

# Compute percentages
Percentage <- prop.table(Frequency) * 100

# Combine frequencies and percentages
freq_table <- cbind(Frequency, Percentage)

# Print the result


freq_table
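
The same prop.table() approach extends to the two-way table of x and y shown earlier. The following is a minimal sketch rather than part of the original lecture code: the margin argument of prop.table() selects row (1) or column (2) percentages, and addmargins() appends totals. The names tab, row_pct, and with_totals are illustrative.

```r
# Vectors from the examples above
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")
y <- c("M", "F", "F", "M", "M", "F", "M", "F", "F", "M")

# Two-way table of counts
tab <- table(x, y)

# Row percentages: each row of the result sums to 100 (up to rounding)
row_pct <- round(prop.table(tab, margin = 1) * 100, 1)
print(row_pct)

# Counts with row and column totals appended
with_totals <- addmargins(tab)
print(with_totals)
```

Using margin = 2 instead would give column percentages, and omitting margin gives percentages of the grand total.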

We can also present the data graphically, for example using a bar plot. The barplot()
function is used to create bar plots, also known as bar charts or bar graphs. Bar plots are
used to visualize the distribution of categorical data by representing the frequencies or
proportions of different categories using rectangular bars. The basic syntax of the
barplot() function is:
barplot(height, ...)

Where:

• height: a numeric vector or matrix containing the heights of the bars.

• ...: additional arguments that control the appearance of the bar plot, such as colors,
axis labels, titles, etc.

For example, suppose you have a vector height representing the heights of bars you want
to plot:
height <- c(10, 20, 15, 25, 30)
barplot(height)

This will create a simple bar plot with bars of different heights.

You can also customize the appearance of the bar plot using additional arguments. For
example, you can specify the names of the bars using the names.arg argument:
bar_names <- c("Category 1", "Category 2", "Category 3", "Category 4", "Category 5")
barplot(height, names.arg = bar_names, col = "blue", main = "Bar Plot Example",
        xlab = "Categories", ylab = "Frequency")

This will create a bar plot with custom bar names, blue bars, a main title, and axis labels.
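
Because table() returns named counts, a frequency table can be passed directly to barplot(). The sketch below is one possible way to connect the two sections, assuming the vector x from the frequency table example; the colour, title, and the use of text() to label the bars are illustrative choices.

```r
# Vector from the frequency table example
x <- c("A", "B", "A", "C", "B", "A", "A", "B", "C", "C")
counts <- table(x)

# barplot() draws one bar per category and returns the bar midpoints
bar_mids <- barplot(counts,
                    col = "steelblue",
                    main = "Frequency of x",
                    xlab = "Category",
                    ylab = "Frequency",
                    ylim = c(0, max(counts) + 1))

# Write each count just above its bar
text(x = bar_mids, y = as.vector(counts), labels = as.vector(counts), pos = 3)
```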

Base R provides many plotting functions. Some of the most commonly used ones include:


1. plot(): This function is used to create various types of plots, including scatter
plots, line plots, and more.
2. hist(): This function generates histograms, which are used to represent the
distribution of numerical data.
3. barplot(): As mentioned earlier, this function creates bar plots, which are useful
for visualizing the frequencies or proportions of categorical data.
4. boxplot(): This function produces box-and-whisker plots, which are effective for
visualizing the distribution of numerical data and identifying outliers.
5. pie(): This function generates pie charts, which are useful for displaying
proportions of a whole.
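
As a rough illustration of the functions listed above, the following sketch calls each of them on the built-in mtcars dataset. The choice of variables, titles, and labels is an assumption made for this example only.

```r
data(mtcars)

# Scatter plot: fuel economy against horsepower
plot(mtcars$hp, mtcars$mpg,
     main = "mpg vs hp", xlab = "Horsepower", ylab = "Miles per gallon")

# Histogram of a numeric variable; the returned object holds the bin counts
h <- hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Box plots of mpg for each number of cylinders (formula interface)
b <- boxplot(mpg ~ cyl, data = mtcars,
             main = "mpg by cyl", xlab = "Cylinders", ylab = "Miles per gallon")

# Pie chart of the number of cars in each cylinder group
cyl_counts <- table(mtcars$cyl)
pie(cyl_counts, labels = paste(names(cyl_counts), "cylinders"),
    main = "Cars by number of cylinders")
```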

B. ggplot2 Graphics

ggplot2 is a popular data visualization package in the R programming language. It was
developed by Hadley Wickham and is based on the principles of the “Grammar of
Graphics,” which provides a systematic and structured approach to creating and
understanding data visualizations. ggplot2 allows users to create a wide variety of
high-quality and customizable statistical graphs, making it a valuable tool for data
exploration and presentation.

The Grammar of Graphics helps us build graphical representations from different visual
elements. This grammar allows us to communicate about plot components. The Grammar
of Graphics was created by Leland Wilkinson and was adapted by Hadley Wickham.

A ggplot is made up of a few basic components:

Data: The raw data that you want to plot.


Geometries geom_: The geometric shapes used to visualize the data.
Aesthetics aes(): Aesthetics pertaining to the geometric and statistical objects,
like colour, size, shape, location, and transparency
Scales scale_: includes a set of values for each aesthetic mapping in the plot
Statistical transformations stat_: calculates the different data values used in
the plot.
Coordinate system coord_: used to organize the geometric objects by mapping
data coordinates
Facets facet_: a grid of plots is displayed for groups of data.
Visual themes theme(): The overall visual elements of a plot, like grids & axes,
background, fonts, and colours.
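
To make the list of components concrete, here is a minimal sketch that uses one function from each family in a single plot of the built-in iris data. The particular colours, limits, and theme are arbitrary choices for illustration, and the package must already be installed.

```r
library(ggplot2)

p <- ggplot(iris,                                    # Data
            aes(x = Sepal.Length, y = Petal.Length,  # Aesthetics
                colour = Species)) +
  geom_point(alpha = 0.7) +                          # Geometry
  stat_smooth(method = "lm", formula = y ~ x,        # Statistical transformation
              se = FALSE) +
  scale_colour_manual(values = c("#1b9e77", "#d95f02", "#7570b3")) +  # Scale
  coord_cartesian(xlim = c(4, 8)) +                  # Coordinate system
  facet_wrap(~ Species) +                            # Facets
  theme_minimal()                                    # Visual theme

print(p)
```

Each `+` adds one component, so any line after the first can be removed or swapped without rewriting the rest of the plot.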

Installing ggplot2
So let us begin by first installing this package using the R function install.packages(),
and then loading it with library():
install.packages('ggplot2')
library(ggplot2)

This guide uses two built-in datasets: iris and mtcars (the Motor Trend car road tests).

The iris dataset contains measurements of four features (in centimetres) for 50 flowers
from each of three distinct species, 150 flowers in total. Because it is a built-in
dataset in R, we can load it using the following command:
data(iris)
If you wish to quickly summarize the dataset, use the summary() function and it will
summarize each variable in the dataset.
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

A ggplot2 plot is built from three basic elements: Plot = Data + Aesthetics + Geometry.

The following are the essential elements of any plot:

Data: The data frame to be plotted.
Aesthetics: Maps variables to x and y in a graph. It can also control the colour, size,
and shape of points, the height of bars, etc.
Geometry: Defines the type of graphic, i.e., scatter plot, bar plot, jitter plot, etc.

1. Scatter Plot
Now we will start this tutorial with a scatter plot. To plot it, we will be using the
geom_point() function. Here we will plot the Sepal length variable on the x-axis and
the petal length variable on the y axis.
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length))+geom_point()

#Let us set the colour to species by using this syntax:


ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species))+geom_point()

#We can plot different shapes for different species by using the following command:
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species, shape=Species))+
geom_point()

ggplot2 provides a variety of geoms for plotting different kinds of graphs, such as:
• Scatter Plot: To plot individual points, use geom_point
• Bar Charts: For drawing bars, use geom_bar
• Histograms: For drawing binned values, geom_histogram
• Line Charts: To plot lines, use geom_line
• Polygons: To draw arbitrary shapes, use geom_polygon
• Creating Maps: Use geom_map for drawing polygons in the shape of a map
by using the map_data() function
• Smoothed Trends: Use geom_smooth to show simple trends or
approximations

A variety of geometries can be added to a plot, allowing you to build complex
visualizations that display multiple elements of your data. For example,
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species))+geom_point() +
geom_smooth()

Points and smoothed lines can be plotted together for the same x and y variables, but
with different colours for each geom.
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species))+
geom_point(color = "blue") + geom_smooth(color = "red")

# color aesthetic defined only for a particular geom_point layer


ggplot(iris, aes(x=Sepal.Length, y=Petal.Length)) +
geom_point(aes(col = Species)) +geom_smooth(se = FALSE)

2. Bar Plot
We will plot a bar chart of the number of cars for each value of gear in the mtcars
dataset using the following command:
ggplot(mtcars, aes(x = gear)) +geom_bar()

Using the coord_flip() command, you can interchange the x-axis and y-axis:
ggplot(mtcars, aes(x = gear)) +geom_bar()+coord_flip()

Statistical Transformations
Many different statistical transformations are supported by ggplot2. For more control,
we can call the stat_ functions directly. For example, here we make a scatter plot of
mpg against horsepower and then use stat_summary() to draw a line through the mean
mpg at each horsepower value.
ggplot(mtcars, aes(hp, mpg)) + geom_point(color = "blue") +
stat_summary(fun = "mean", geom = "line", linetype = "dashed")
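
stat_summary() can also be used with other geoms. The sketch below is an illustrative variation, not taken from the lecture: it draws the mean mpg for each number of cylinders as bars, and cross-checks the values with tapply(). It assumes a ggplot2 version that accepts the fun argument, as the example above does.

```r
library(ggplot2)

# Bars show the mean mpg within each cylinder group
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  stat_summary(fun = mean, geom = "bar", fill = "grey70") +
  stat_summary(fun = mean, geom = "point", size = 3) +
  labs(x = "Number of cylinders", y = "Mean miles per gallon")

print(p)

# Cross-check the bar heights with base R
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(round(group_means, 2))
```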

3. Histogram
A histogram shows the frequency distribution of a continuous variable by grouping its
values into bins.

Using the geom_histogram() command, we can create a simple histogram:


ggplot(mtcars,aes(x=mpg)) + geom_histogram()
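
Called without arguments, geom_histogram() falls back to 30 bins and prints a message suggesting that a better value be chosen. A minimal sketch of setting the bins explicitly follows; the bin width of 2.5 and the colours are arbitrary choices for illustration.

```r
library(ggplot2)

# binwidth sets the width of each bin; boundary aligns the bin edges
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5, boundary = 10,
                 fill = "grey70", colour = "black") +
  labs(title = "Distribution of mpg", x = "Miles per gallon", y = "Count")

print(p)
```

The bins argument (for example bins = 10) is an alternative when the number of bars matters more than their exact width.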

4. Box Plot
A box plot displays the distribution of the data, including its skewness, using the
median and quartiles.

Similarly, we can use the geom_boxplot() command for plotting a box plot. We will
plot mpg vs cyl. Here mpg is a continuous variable, while cyl is stored as a number but
takes only a few distinct values (4, 6, and 8) and is best treated as categorical. So
before plotting, we convert the variable cyl to a factor.

So, we will use the following command to plot the graph:


ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) + geom_boxplot()

If we want to change the boundary colour of the boxplot, we have to use the
scale_color_manual() function with the hex values of colours of our choice.
cyl_factor <- as.factor(mtcars$cyl)
ggplot(mtcars, aes(x= cyl_factor, y=mpg,color = cyl_factor)) + geom_boxplot()+
scale_color_manual(values = c("#3a0ca3", "#c9184a", "#3a5a40"))

5. Pie Chart
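The pie chart shows proportions as parts of a whole. ggplot2 has no dedicated pie chart
geom, so a stacked bar is drawn in polar coordinates instead. In the command below, each
slice is the total mpg of the cars in one cylinder group (cyl_factor is the factor
created in the box plot section).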
ggplot(mtcars, aes(x="", y=mpg, fill=cyl_factor)) +
geom_bar(stat="identity", width=1) + coord_polar("y", start=0)
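
If the slices should instead represent the number of cars in each cylinder group, one option is to let geom_bar() count the rows. This is a sketch under that assumption, not the lecture's own example; the legend title is an illustrative choice.

```r
library(ggplot2)

# With no y aesthetic, geom_bar() counts the rows in each fill group
p <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(fill = "Cylinders", x = NULL, y = NULL)

print(p)

# The slice sizes correspond to these counts
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```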

6. Contour Plot
ggplot2 can generate a 2D density contour plot with geom_density_2d (contour lines) or
geom_density_2d_filled (filled bands). You only need to provide your data frame with the
x and y values inside aes.
ggplot(mtcars, aes(mpg, hp)) + geom_density_2d_filled(show.legend = FALSE)+
coord_cartesian(expand = FALSE) + labs(x = "mpg")

You can also make a scatter plot with contour lines. First, add the points using
geom_point, and then add geom_density_2d.
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point() + geom_density_2d()

7. Customization in ggplot2 in R
We can do a lot with ggplot2. Let’s explore it in the following sections:

Plot Titles
You can add a title, a subtitle, a caption, and a tag for your visualization when using
ggplot2. There are two methods for adding titles: ggtitle and the labs function. The
former is only for titles and subtitles, but the latter allows for the addition of tags and
captions.
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) + geom_boxplot()+
ggtitle("mpg by cyl")

The same title can be added with the labs function:
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) + geom_boxplot() +
labs(title = "mpg by cyl")

Similarly, you can add a subtitle the same way you added the title, but with the subtitle
argument of the ggtitle() or labs() function:
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) + geom_boxplot()+
ggtitle("mpg by cyl", subtitle = "Subtitle of the plot")

Horizontal alignment, or hjust, controls the alignment of the title (0 = left,
0.5 = centre, 1 = right). Similarly, vjust controls the vertical alignment. The example
below right-aligns the title.
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg))+geom_boxplot()+ggtitle("mpg by cyl")+
theme(plot.title = element_text(hjust = 1, size = 16, face = "bold"))
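
The following sketch puts the pieces of this section together: a title, subtitle, caption, and tag set in a single labs() call, with the title and subtitle centred through theme(). All label texts are placeholders chosen for this example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(title = "mpg by cyl",
       subtitle = "Subtitle of the plot",
       caption = "Data: mtcars (built into R)",
       tag = "A",
       x = "Number of cylinders",
       y = "Miles per gallon") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))

print(p)
```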

Lattice Graphics

There are several ways to make graphs in R. One approach is a system called lattice
graphics. The first step for using lattice is to load the lattice package using the check box
in the Packages tab or using the following command:
require(lattice)

Lattice plots make use of a formula interface:
plotname(y ~ x | z, data=dataname, groups=grouping_variable, ...)

• Here are the names of several lattice plots:


– histogram (for histograms)
– bwplot (for boxplots)
– xyplot (for scatter plots)
– qqmath (for quantile-quantile plots)
• x is the name of the variable that is plotted along the horizontal (x) axis.
• y is the name of the variable that is plotted along the vertical (y) axis. (For some
plots, this slot is empty because R computes these values from the values of x.)
• z is a conditioning variable used to split the plot into multiple subplots called
panels.
• grouping_variable is used to display different groups differently (different colors
or symbols, for example) within the same panel.
• There are many additional arguments to these functions that let you control just
how the plots look. (But we’ll focus on the basics for now.)
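
As a quick illustration of how the four slots of the formula interface fit together, here is a minimal sketch using the built-in mtcars data. The choice of variables is an assumption for this example: mpg is y, hp is x, cyl is the conditioning variable z, and am (0 = automatic, 1 = manual) is the grouping variable.

```r
library(lattice)

# y ~ x | z with groups: one panel per cylinder count,
# and different symbols for the two transmission types within each panel
p <- xyplot(mpg ~ hp | factor(cyl), data = mtcars,
            groups = am,
            auto.key = list(columns = 2),
            xlab = "Horsepower", ylab = "Miles per gallon")

print(p)
```

Lattice functions return a plot object; inside scripts and functions it has to be printed explicitly for the plot to appear.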
Histograms: histogram()

Let’s switch to a more interesting data set from the Health Evaluation and Linkage to
Primary Care study. The HELP study was a clinical trial for adult inpatients recruited
from a detoxification unit. Patients with no primary care physician were randomized to
receive a multidisciplinary assessment and a brief motivational intervention or usual care,
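with the goal of linking them to primary medical care. The HELPrct data frame is provided
by the mosaicData package, which is loaded along with the mosaic package. You can find
out more about this data using R’s help.
require(mosaic) # also loads mosaicData, which contains HELPrct
?HELPrct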

Histograms display a distribution using the important idea that

AREA = relative frequency

So, where there is more area, there is more data. For a histogram, rectangles are used to
indicate how much data is in each of several “bins”. The result is a picture that shows a
rough “shape” of the distribution.

The y component of the formula is empty since we let R compute the heights of the bars
for us.
histogram(~ age, data=HELPrct, n=20) #n = 20 gives approx. 20 bars

We can use a conditioning variable to give us separate histograms for each sex.
histogram(~ age | sex, data=HELPrct, n=20)

We can even condition on two things at once:


histogram(~ age | substance + sex, data=HELPrct, n=20)

Density plots: densityplot()

Density plots are smoother versions of histograms.


densityplot(~ age | substance + sex, data=HELPrct, n=20)

If we want to get really fancy, we can do both at once:


histogram(~age|substance+sex,data=HELPrct,n=20, type='density',
panel=function(...){ panel.histogram(...); panel.densityplot(...)})

Boxplots: bwplot()

Boxplots are made pretty much the same way as histograms:


bwplot(~ age, data=HELPrct)

We can use conditioning as we did for histograms:


bwplot(~ age | substance, data=HELPrct)

But there are better ways to do this.


bwplot(age ~ substance, data=HELPrct)

This is improved, but the substance names may run into each other. We could fix that
run-together text by using abbreviated names or rotating the labels 45 or 90 degrees.
Instead of those solutions, we can also just reverse the roles of the horizontal and
vertical axes.
bwplot(substance ~ age, data=HELPrct)
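
For completeness, rotating the axis labels can be sketched with the scales argument. The example below uses the built-in iris data as a stand-in so that it runs without extra packages; the 45 degree angle is an arbitrary choice.

```r
library(lattice)

# rot = 45 rotates the tick labels on the x axis by 45 degrees
p <- bwplot(Sepal.Length ~ Species, data = iris,
            scales = list(x = list(rot = 45)))

print(p)
```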

We can combine this with conditioning if we like:


bwplot(substance ~ age | sex, data=HELPrct)

Scatterplots: xyplot()

Scatterplots are made with xyplot(). The formula interface is very natural for this. Just
remember that the “y variable” comes first. (Its label is also farther left on the plot if that
helps you remember.)

xyplot(Sepal.Length ~ Sepal.Width, data=iris)

Again, we can use conditioning to make a panel for each species.


xyplot(Sepal.Length ~ Sepal.Width | Species, data=iris)

Even better (for this example), we can use the groups argument to indicate the different
species using different symbols on the same panel.
xyplot(Sepal.Length ~ Sepal.Width, groups=Species, data=iris)

Saving Your Plots

There are several ways to save plots, but the easiest is probably the following:

1. In the Plots tab, click the “Export” button.


2. Copy the image to the clipboard using right click.
3. Go to your Word document and paste in the image.
4. Resize or reposition your image in Word as needed.
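
Plots can also be saved from code by opening a graphics device, printing the plot, and closing the device. The sketch below writes a PDF to a temporary folder; the file name and dimensions are placeholders, and other devices such as png() follow the same pattern.

```r
library(lattice)

# Hypothetical output location for this example
out_file <- file.path(tempdir(), "iris_scatter.pdf")

pdf(out_file, width = 7, height = 5)   # open the device (size in inches)
print(xyplot(Sepal.Length ~ Sepal.Width, groups = Species, data = iris,
             auto.key = TRUE))         # lattice plots must be printed
dev.off()                              # close the device to finish the file

file.exists(out_file)
```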

A Few Bells and Whistles

There are lots of arguments that control how these plots look. Here are just a few
examples.
auto.key

It would be useful to have a legend for the previous plot. auto.key=TRUE turns on
a simple legend. (There are ways to have more control, if you need it.)
xyplot(Sepal.Length ~ Sepal.Width, groups=Species, data=iris, auto.key=TRUE)

alpha, cex

Sometimes it is nice to have elements of a plot be partly transparent. When such elements
overlap, they get darker, showing us where data are “piling up.” Setting the alpha
argument to a value between 0 and 1 controls the degree of transparency: 1 is completely
opaque, 0 is invisible. The cex argument controls “character expansion” and can be used
to make the plotting “characters” larger or smaller by specifying the scaling ratio.
xyplot(Sepal.Length ~ Sepal.Width, groups=Species, data=iris,
auto.key=list(columns=3), alpha=.5, cex=1.3)

main, sub, xlab, ylab

xyplot(Sepal.Length ~ Sepal.Width, groups=Species, data=iris,
main="Some Iris Data",
sub="(R. A. Fisher analyzed this data in 1936)",
xlab="sepal width (cm)",
ylab="sepal length (cm)",
alpha=.5,
auto.key=list(columns=3))

trellis.par.set()

Default settings for lattice graphics are set using trellis.par.set(). Don’t like the default
font sizes? You can change to a 7-point (base) font using
trellis.par.set(fontsize=list(text=7)) # base size for text is 7 point

Nearly every feature of a lattice plot can be controlled: fonts, colors, symbols, line
thicknesses, etc. Rather than describe them all here, we’ll mention only that groups
of these settings can be collected into a theme. show.settings() will show you what the
theme looks like.
trellis.par.set(theme=col.whitebg()) # a theme in the lattice package
show.settings()

trellis.par.set(theme=col.abd()) # a theme in the abd package


show.settings()

trellis.par.set(theme=col.mosaic()) # a theme in the mosaic package


show.settings()

trellis.par.set(theme=col.mosaic(bw=TRUE)) #black & white version of previous theme


show.settings()

trellis.par.set(theme=col.mosaic()) # back to the mosaic theme


trellis.par.set(fontsize=list(text=9)) # and back to a larger font
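
trellis.par.set() changes the settings for every plot drawn afterwards. When only a single plot should look different, the same kind of settings can be passed through the par.settings argument of the plotting function. This is a minimal sketch; the font size and plotting symbols are arbitrary choices.

```r
library(lattice)

# Settings given in par.settings apply to this plot only
p <- xyplot(Sepal.Length ~ Sepal.Width, groups = Species, data = iris,
            auto.key = list(columns = 3),
            par.settings = list(fontsize = list(text = 7),
                                superpose.symbol = list(pch = c(1, 2, 3))))

print(p)
```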

Tabulating Categorical Data

The Current Population Survey (CPS) is used to supplement census information between
census years. The CPS data frame consists of a random sample of persons from the CPS,
with information on wages and other characteristics of the workers, including sex,
number of years of education, years of work experience, occupational status, region of
residence and union membership. The examples in this section, however, use the HELPrct
data frame introduced above.

head(HELPrct, 3)

Making Frequency and Contingency Tables with xtabs()

Categorical variables are often summarized in a table. R can make a table for a categorical
variable using xtabs().
xtabs(~ sex, HELPrct)

sex
female male
107 346

xtabs(~ substance, HELPrct)

substance
alcohol cocaine  heroin
    177     152     124

Alternatively, we can use table() and prop.table() to make tables of counts, proportions,
or percentages.
with(HELPrct, table(sex))

sex
female male
107 346

with(HELPrct, prop.table(table(sex)))

sex
female male
0.2362031 0.7637969

with(HELPrct, prop.table(table(sex))*100)

sex
female male
23.62031 76.37969

We can make a cross-table (also called a contingency table or a two-way table)
summarizing this data with xtabs(). This is often a more useful view of data with two
categorical variables.
xtabs(~ sex + substance, HELPrct)

        substance
sex      alcohol cocaine heroin
  female      36      41     30
  male       141     111     94

Entering Tables by Hand

Because categorical data is so easy to summarize in a table, often the frequency or
contingency tables are given instead. You can enter these tables manually as follows:
myrace <- c( NW=67, W=467 ) # c for combine or concatenate
myrace

NW W
67 467

mycrosstable <- rbind( # bind row-wise
  NW = c(clerical=15, const=3, manag=6, manuf=11, other=5, prof=7, sales=3, service=17),
  W = c(82,17,49,57,63,98,35,66) # no need to repeat the column names
)
mycrosstable

   clerical const manag manuf other prof sales service
NW       15     3     6    11     5    7     3      17
W        82    17    49    57    63   98    35      66

Replacing rbind() with cbind() will allow you to give the data column-wise instead.
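
To illustrate, the sketch below enters the sex by substance counts shown earlier in both ways and checks that the two results agree. The object names by_row, by_col, and totals are illustrative; as.table() and addmargins() are used only to append row and column totals.

```r
# Row-wise entry: one named vector per row
by_row <- rbind(
  female = c(alcohol = 36, cocaine = 41, heroin = 30),
  male   = c(141, 111, 94)
)

# Column-wise entry: one named vector per column
by_col <- cbind(
  alcohol = c(female = 36, male = 141),
  cocaine = c(41, 111),
  heroin  = c(30, 94)
)

# Both approaches should produce the same matrix
print(all.equal(by_row, by_col))

# Convert to a table and append row and column totals
totals <- addmargins(as.table(by_row))
print(totals)
```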

Graphing Categorical Data

The lattice function barchart() can display these tables as barcharts.


barchart(xtabs(~ sex, HELPrct))

barchart(xtabs(~ sex, HELPrct), horizontal=FALSE) # vertical bars

Just as bar charts are used to display the distribution of one categorical variable, mosaic
plots can do the same for cross tables. mosaic() (from the vcd package) is not a
lattice plot, but it does use a similar formula interface.
require(vcd) # load the visualizing categorical data package
mosaic(~ sex + substance, HELPrct)

Or we can send our own hand-made table (although the output isn’t quite as nice without
some extra effort we won’t discuss just now):
mosaic(mycrosstable)

Barcharts can also be used to display two-way tables. First we convert the cross-table to a
data frame. Then we can use this data frame for plotting.
HELP <- as.data.frame(xtabs(~ sex + substance, data=HELPrct)); HELP

     sex substance Freq
1 female   alcohol   36
2   male   alcohol  141
3 female   cocaine   41
4   male   cocaine  111
5 female    heroin   30
6   male    heroin   94

barchart(Freq ~ sex, groups=substance, data=HELP)
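
By default, barchart() draws the bars for the groups side by side. A stacked version with a legend can be sketched as follows. To keep the example self-contained, the counts from the table above are typed in by hand as a data frame named help_counts, which is a hypothetical name used only here.

```r
library(lattice)

# Counts from the sex by substance table, entered by hand
help_counts <- data.frame(
  sex       = factor(rep(c("female", "male"), times = 3)),
  substance = factor(rep(c("alcohol", "cocaine", "heroin"), each = 2)),
  Freq      = c(36, 141, 41, 111, 30, 94)
)

# stack = TRUE stacks the substances within each bar;
# auto.key adds a legend for the groups
p <- barchart(Freq ~ sex, groups = substance, data = help_counts,
              stack = TRUE, auto.key = list(columns = 3))

print(p)
```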

R Examples

The commands below are illustrated with the data sets iris and HELPrct. To apply these in
other situations, you will need to substitute the name of your data frame and the
variables in it.

answer <- 42                             Store the value 42 in a variable named answer.
log(123); log10(123); sqrt(123)          Take the natural logarithm, base 10 logarithm, or
                                         square root of 123.
x <- c(1,2,3)                            Make a variable containing the values 1, 2, and 3
                                         (in that order).
data(iris)                               (Re)load the data set iris.
summary(iris$Sepal.Length)               Summarize the distribution of the Sepal.Length
                                         variable in the iris data frame.
summary(iris)                            Summarize each variable in the iris data frame.
str(iris)                                A different way to summarize the iris data frame.
head(iris)                               First few rows of the data frame iris.
require(Hmisc); require(abd)             Load packages. (This can also be done by checking
                                         boxes in the Packages tab.)
summary(Sepal.Length~Species,
  data=iris, fun=favstats)               Compute favorite statistics of Sepal.Length for
                                         each Species. [requires Hmisc]
histogram(~Sepal.Length|Species, iris)   Histogram of Sepal.Length conditioned on Species.
bwplot(Sepal.Length~Species, iris)       Boxplot of Sepal.Length conditioned on Species.
xyplot(Sepal.Length~Sepal.Width|
  Species, iris)                         Scatterplot of Sepal.Length by Sepal.Width with
                                         separate panels for each Species.
xtabs(~sex, HELPrct)                     Frequency table of the variable sex.
barchart(xtabs(~sex, HELPrct))           Make a barchart from the table.
xtabs(~sex + substance, HELPrct)         Cross tabulation of sex and substance.
mosaic(~sex + substance, HELPrct)        Make a mosaic plot.
xtData <- as.data.frame(
  xtabs(~sex + substance, HELPrct))      Save cross table information as xtData.
barchart(Freq~sex, data=xtData,
  groups=substance)                      Use xtData to make a grouped (side-by-side) bar
                                         chart.
sum(x); mean(x); median(x);
  var(x); sd(x); quantile(x)             Sum, mean, median, variance, standard deviation,
                                         quantiles of x.

EXERCISES

Set A: Exploring Data Visualization with ggplot2

Instructions: Document your code clearly, including comments explaining each step
and the rationale behind it. Provide clear and informative titles, labels, and legends for
each visualization, and organize your code into logical sections to improve readability.

1. Use the "iris" dataset available in R as the dataset for this exercise. It contains
measurements of sepal length, sepal width, petal length, petal width, and species
of iris flowers.

2. Explore the distribution of sepal length ("Sepal.Length") using a histogram and
interpret.

3. Create a scatter plot of sepal length ("Sepal.Length") versus petal length
("Petal.Length") to visualize the relationship between these two variables and
interpret.

4. Plot a box plot of petal width ("Petal.Width") for each species of iris to compare
the distribution of petal widths across different species and interpret.

5. Visualize the relationship between sepal width ("Sepal.Width") and petal width
("Petal.Width") using a scatter plot with color-coded points for each species of
iris and interpret.

6. Create a bar plot showing the average sepal length ("Sepal.Length") for each
species of iris to compare sepal lengths across different species and interpret.

7. Explore any additional relationships or patterns in the data that you find
interesting using ggplot2.

Set B:

1. Calculate the natural logarithm (log base e) and base 10 logarithm of 12,345.

What happens if you leave the comma in this number?

log(12,345)

[1] 0.4252

2. Install and load the mosaic package. Make sure the lattice package is also loaded (no
need to install it, it is already installed).

Here are some other packages you may like to install as well.

• Hmisc (Frank Harrell’s miscellaneous utilities),

• vcd (visualizing categorical data),

• fastR (Foundations and Applications of Statistics), and

• abd (Analysis of Biological Data).

3. Enter the following small data set in an Excel or Google spreadsheet and import
the data into RStudio.

You can import directly from Google. From Excel, save the file as a csv and import
that (as a text file) into RStudio. Name the data frame JunkData.

4. What is the average (mean) width of the sepals in the iris data set?
5. Determine the average (mean) sepal width for each of the three species in the iris
data set.

6. The Jordan8687 data set (in the fastR package) contains the number of points
Michael Jordan scored in each game of the 1986–87 season.

a) Make a histogram of this data. Add an appropriate title.

b) How would you describe the shape of the distribution?

c) In approximately what percentage of his games did Michael Jordan score
fewer than 20 points? More than 50? (You may want to add
breaks=seq(0,70,by=5) to your command to neaten up the bins.)

7. Cuckoos lay their eggs in the nests of other birds. Is the size of cuckoo eggs different
in different host species nests? The cuckoo data set (in fastR) contains data from a
study attempting to answer this question.

a. When were these data collected? (Use ?cuckoo to get information about the
data set.)

b. What are the units on the length measurements?

c. Make side-by-side boxplots of the length of the eggs by species.

d. Calculate the mean length of the eggs for each host species.

e. What do you think? Does it look like the size differs among the different
host species? Refer to your R output as you answer this question. (We’ll
learn formal methods to investigate this later in the semester.)

8. The Utilities2 data set in the mosaic package contains a number of variables about
the utilities bills at a residence in Minnesota over a number of years. Since the
number of days in a billing cycle varies from month to month, variables like
gasbillpday (elecbillpday, etc.) contain the gas bill (electric bill, etc.) divided by
the number of days in the billing cycle.

a) Make a scatter plot of gasbillpday vs. monthsSinceY2K using the command

xyplot(gasbillpday ~ monthsSinceY2K, data=Utilities2, type='l') #the letter

What pattern(s) do you see?

b) What does type='l' do? Make your plot with and without it. Which is
easier to read in this situation?
c) What happens if we replace type='l' with type='b'?

d) Make a scatter plot of gasbillpday by month. What do you notice?

e) Make side-by-side boxplots of gasbillpday by month using the Utilities2
data frame. What do you notice?

Your first try probably won’t give you what you expect. The reason is that
month is coded using numbers, so R treats it as numerical data. We want to
treat it as categorical data. To do this in R use factor(month) in place of
month. R calls categorical data a factor.

f) Make any other plot you like using this data. Include both a copy of your
plot and a discussion of what you can learn from it.

9. The table below is from a study of nighttime lighting in infancy and eyesight (later
in life).

a) Recreate the table in RStudio.

b) What percent of the subjects slept with a nightlight as infants?

There are several ways to do this. You could use R as a calculator to do the
arithmetic. You can save some typing if you use the function
prop.table(). See ?prop.table for documentation. If you just want row and
column totals added to the table, see mar_table() in the vcd package.

c) Make a mosaic plot for this data. What does this plot reveal?
