Data Visualization With R (2019-02-14)
R is:
A programming language used for statistical computing and data
visualization.
Open source and freely available under the GNU General Public
License.
Supported by the R Project for Statistical Computing
Download latest version of R via:
https://www.r-project.org/
Working with R is easier with RStudio.
Download RStudio via:
https://www.rstudio.com/products/rstudio/download/
Basic R Data Modes
Numeric
e.g. 3.14, -543.6, 0.2, 10., etc.
Integer
e.g. 3, -544, 0, 10, etc.
Complex
e.g. -17 + 8i
Logical
e.g. TRUE, FALSE
Character
e.g. “Hello”, “To be or not to be”, “ljfakl;r#”, etc.
While other modes of data exist, they are far less common than those listed above.
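The modes above can be checked directly in R. The following is a minimal sketch; the list name `values` is only illustrative. Note that a bare whole number such as 3 is stored as a double, so the `L` suffix is used here to request an integer.

```r
# One example value per basic mode (the list name is illustrative).
values <- list(
  numeric   = 3.14,
  integer   = 3L,        # the L suffix requests an integer; a bare 3 is a double
  complex   = -17 + 8i,
  logical   = TRUE,
  character = "Hello"
)

# typeof() reports how each value is stored internally.
types <- sapply(values, typeof)
print(types)
```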
Common R Objects
Name        Dimensions  Contents                                Example
Vector      1           • Series of values                      12.2  9.6 -4.8  2.5
                        • Single data mode
Matrix      2           • Values stored in rows and columns     12.2  9.6 -4.8  2.5
                        • All values of the same data mode       8.3 -7.6  9.3 -2.7
                                                                -4.4 17.7 14.7 -6.9
                                                                 1.7  4.5 53.4  5.2
List        1           • Series of values                      Denver 73.4 TRUE
                        • Allows multiple modes
Data Frame  2           • Values stored in rows and columns     Denver 73.4 TRUE
                        • Different columns may have            Topeka 49.8 FALSE
                          different data modes
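As a rough illustration, the four object types can be built from the example values in the table. The object names and the data frame column names (`city`, `value`, `flag`) are assumptions made for this sketch.

```r
# Vector: one dimension, a single data mode.
v <- c(12.2, 9.6, -4.8, 2.5)

# Matrix: rows and columns, a single data mode (first two rows of the example).
m <- matrix(c(12.2,  9.6, -4.8,  2.5,
               8.3, -7.6,  9.3, -2.7),
            nrow = 2, byrow = TRUE)

# List: one dimension, mixed data modes allowed.
l <- list("Denver", 73.4, TRUE)

# Data frame: rows and columns, each column with its own data mode.
# Column names are hypothetical.
df <- data.frame(city  = c("Denver", "Topeka"),
                 value = c(73.4, 49.8),
                 flag  = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)

str(df)
```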
read.csv("bigfile.csv",
header = TRUE,
sep = ",",
colClasses = c("character", "character", "logical", "numeric"))
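To try the same arguments without a file called bigfile.csv, the sketch below writes a small temporary CSV and reads it back. The file contents and column names are made up for illustration only.

```r
# Write a tiny CSV to a temporary file (contents are illustrative).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,active,score",
             "001,alpha,TRUE,1.5",
             "002,beta,FALSE,2.25"),
           csv_path)

# colClasses sets the mode of each column, in order.
dat <- read.csv(csv_path,
                header = TRUE,
                sep = ",",
                colClasses = c("character", "character", "logical", "numeric"))

str(dat)
```

Reading `id` as character keeps its leading zeros, which would be lost if R guessed the column to be numeric.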
Unsure about how to use a command?
Type a question mark (?), followed by the name of the
command (without any arguments defined).
For example, to learn about the “par” command, which is used
for formatting various aspects of plots and other data
visualizations:
?par()
Creating New Objects
List of R Packages:
https://cran.r-project.org/web/packages/available_packages_by_name.html
Over 13,700 packages listed (as of 8 February 2019)
Examples of R packages:
dplyr: manipulates data by taking subsets, summarizing, rearranging and joining data sets.
tidyr: reformats layouts of data sets to make them more compatible with R.
lubridate: simplifies working with dates and times.
oce: analysis of oceanographic data.
AMR: antimicrobial resistance analysis
WDI: used for downloading World Development Indicators data from the World Bank
Accessing Data Sets
Data sets may be imported from the Internet or from a computer directory.
R can import data in a wide variety of formats; some of the more common are:
• Excel
• CSV
• TXT
• SPSS
• Access
• SQL Server
• Minitab
R also includes a set of standard data sets (for you to practice/play with)
To list available datasets, type the following command in R or RStudio:
library(help="datasets")
A more complete description of many of these may be found on the following site:
https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
For the examples in this presentation, we use data sets already included within R.
How to Specify Elements in Objects
Element positions are often indicated using index numbers.
Index numbers start at 1 and increase from there, unlike some programming languages (such as Python) where indices start at 0.
An index of “1” refers to:
The 1st element in a vector or list
The 1st row or column of a matrix or data frame
When specifying an element in a matrix or data frame, always indicate the row followed by the
column.
e.g. mtcars[7, 4] refers to the element in the 7th row and 4th column of the data object mtcars.
Colons can be used to indicate ranges of index numbers
e.g. 3:9 indicates the indices from 3 through 9.
Negative signs indicate the indices of elements to be excluded
e.g. mtcars[7, -4] indicates everything in the 7th row of mtcars except what’s in the 4th column.
Names can be assigned to elements in lists or to rows and columns in data frames.
mtcars["Valiant","hp"] will go to the row for the Plymouth Valiant and retrieve its horsepower (hp).
mtcars$hp will retrieve the horsepower (hp) column from mtcars.
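The indexing rules above can be tried on the built-in mtcars data set. This is a short sketch; the variable names on the left are only illustrative.

```r
# Row 7, column 4 (the 4th column of mtcars is hp).
one_value <- mtcars[7, 4]

# A range of rows: hp for rows 3 through 9.
hp_range <- mtcars[3:9, 4]

# Row 7 with the 4th column excluded.
row_without_hp <- mtcars[7, -4]

# Indexing by row and column names.
valiant_hp <- mtcars["Valiant", "hp"]

# The $ operator retrieves a whole column by name.
hp_column <- mtcars$hp

print(one_value)
print(row_without_hp)
```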
Let’s try this out.
Please open up the “RStudio” application
Examples
Question: What value(s) would be retrieved from the list “Orlando” by the
following command?
Orlando[5]
Note: R is case-sensitive.
Therefore, in this command, the “O” in “Orlando” must be capitalized.
If it were not capitalized (i.e., orlando[5]), you would get the following
error statement:
Error: object 'orlando' not found
Orlando[2:4]
Orlando[c(1, 5, 6)]
Orlando[c(1, 4:6)]
Orlando["population"]
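The same kinds of commands can be tried on any named list. The list below is hypothetical: its name, element names, and values are placeholders invented for this sketch, not the contents of the Orlando list.

```r
# Hypothetical list with six named elements (placeholder values only).
city <- list(name = "Example City", state = "XX", founded = 1900,
             coastal = FALSE, population = 1000, area = 10.5)

fifth       <- city[5]             # a one-element list holding the 5th element
two_to_four <- city[2:4]           # elements 2 through 4
picked      <- city[c(1, 5, 6)]    # elements 1, 5 and 6
mixed       <- city[c(1, 4:6)]     # element 1, then elements 4 through 6
by_name     <- city["population"]  # selection by element name

# Double brackets return the value itself rather than a one-element list.
pop_value <- city[["population"]]

print(mixed)
```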
Note:
When executing this command, you would get the following message:
Examples
Question: What value(s) would be retrieved from the data frame
“New_England” by the following command?
New_England[3, 4]
New_England[3, 4:5]
New_England["Vermont", 1]
New_England["Vermont", "Population"]
New_England[ , "Per_Capita_Income"]
New_England$Square_Miles
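To experiment with these patterns, a stand-in data frame can be built from R's built-in state.x77 matrix. This is an assumption made for illustration: the object name `ne` and its three columns (Population, Income, Area) are not the New_England data frame used in the questions, so the column positions and names differ.

```r
# Stand-in data frame built from the built-in state.x77 matrix.
ne_states <- c("Connecticut", "Maine", "Massachusetts",
               "New Hampshire", "Rhode Island", "Vermont")
ne <- as.data.frame(state.x77[ne_states, c("Population", "Income", "Area")])

cell       <- ne[3, 2]                    # 3rd row, 2nd column
row_slice  <- ne[3, 2:3]                  # 3rd row, columns 2 through 3
by_row     <- ne["Vermont", 1]            # row by name, column by position
by_both    <- ne["Vermont", "Population"] # row and column by name
one_column <- ne[, "Income"]              # every row of one column
with_dollar <- ne$Area                    # the same idea using $

print(row_slice)
```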
Example: using a dataset named mtcars (containing data about cars), extract
those records where mileage is greater than 20 mpg and either of the following
is true: the engine has more than 4 cylinders or more than 100 hp.
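One possible way to express this extraction uses subset() with the logical operators & (and) and | (or). This is a sketch rather than the only approach, and the object name `picked` is illustrative.

```r
# Keep rows where mpg > 20 AND (more than 4 cylinders OR more than 100 hp).
# The parentheses matter: & binds more tightly than |.
picked <- subset(mtcars, mpg > 20 & (cyl > 4 | hp > 100))

print(picked[, c("mpg", "cyl", "hp")])
```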
Example: Arrange the mtcars data set in descending order by horsepower (hp).
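A minimal sketch of this ordering with base R: order() returns the row positions that sort a column, and those positions are then used as row indices. The object name `mtcars_by_hp` is illustrative.

```r
# order() gives row positions sorted by hp; decreasing = TRUE puts the largest first.
mtcars_by_hp <- mtcars[order(mtcars$hp, decreasing = TRUE), ]

head(mtcars_by_hp[, c("hp", "disp", "mpg")])
```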
Command Arguments
plot(mtcars$disp, X coordinates
mtcars$hp) Y coordinates
Jazzing up your plot
Let’s start by introducing “par”.
par is a command for specifying graphical parameters
Type “par()” to get a listing of the current values assigned to all of your par settings
Explanations of par settings are available at http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/par.html
Command Arguments
mar=c(5, 4, 3, 2), Margins on bottom, left, top and right sides of plot area in number of lines of text.
oma=c(0,0,0,0), Outer margins on bottom, left, top and right sides of plot area in number of lines of text.
col.lab="darkorange2", Color for x- and y-axis labels (using standard R color set)
col.axis="darkorange2", Color for axis annotation (using standard R color set)
font.lab = 2, Font setting for x- and y-labels (“2” indicates bold type)
cex.main=1.2, Scaling factor for size of main title (relative to default value)
cex.axis=0.9, Scaling factor for size of axis annotations (relative to default value)
cex.lab=0.9, Scaling factor for size of x- and y-axis labels (relative to default value)
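Put together, the settings listed above might be applied as in the following sketch. The axis labels and title passed to plot() are assumptions added for illustration.

```r
# par() returns the previous values of the settings it changes,
# so they can be restored afterwards.
old_par <- par(mar = c(5, 4, 3, 2),
               oma = c(0, 0, 0, 0),
               col.lab = "darkorange2",
               col.axis = "darkorange2",
               font.lab = 2,
               cex.main = 1.2,
               cex.axis = 0.9,
               cex.lab = 0.9)

# Axis labels and title are illustrative.
plot(mtcars$disp, mtcars$hp,
     xlab = "Displacement (cu. in.)",
     ylab = "Horsepower (hp)",
     main = "Horsepower vs. Displacement")

current_mar <- par("mar")  # confirm the margin setting now in effect
par(old_par)               # restore the earlier settings
```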
First, create a linear model (lm) based upon the data in the plot.
Object being created, Command, Arguments
trend <- lm(hp~disp, Indicate that power (hp) is a function of displacement (disp)
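The remaining arguments of the call above are not shown, so the following is only a sketch of how a trend line could be fitted and drawn. The `data = mtcars` argument and the abline() styling are assumptions.

```r
# Scatter plot of horsepower against displacement.
plot(mtcars$disp, mtcars$hp)

# Fit a linear model: hp as a function of disp.
trend <- lm(hp ~ disp, data = mtcars)

# abline() accepts a fitted model and draws its regression line on the plot.
abline(trend, col = "blue", lwd = 2)

print(coef(trend))
```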
subset(mtcars,hp==max(hp) | disp==max(disp))
Take a subset of mtcars for which hp matches the maximum hp value or disp matches the maximum disp value
The result:
Show the cars with the highest power and
displacement
Draw attention to the desired points by surrounding them with gold diamonds
Command Arguments
points(c(301, 472), The x-coordinates for the maximum hp and maximum disp points, respectively
c(335,205), The y-coordinates for the maximum hp and maximum disp points, respectively
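Rather than typing the coordinates by hand, they can be taken from the subset shown earlier. This is a sketch: the symbol choice (pch = 5, an open diamond) and the size and line-width values are assumptions about how gold diamonds might be drawn.

```r
plot(mtcars$disp, mtcars$hp)

# Rows holding the maximum hp or the maximum disp.
extremes <- subset(mtcars, hp == max(hp) | disp == max(disp))

# Surround those points with gold diamonds.
points(extremes$disp, extremes$hp,
       pch = 5,      # open diamond symbol
       cex = 2,      # symbol size relative to the default
       lwd = 2,      # line width of the symbol outline
       col = "gold")

print(extremes[, c("disp", "hp")])
```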
Command arguments: x-coordinates, y-coordinates, text, text color
As a final touch, let’s add some gridlines
Command Arguments
grid(NULL, Number of cells in the x-direction (“NULL” aligns with existing ticks/numbers)
NULL, Number of cells in the y-direction (“NULL” aligns with existing ticks/numbers)
This data can be extracted into a new object and plotted, as follows:
Object being created, Source, Desired Columns
plot(take_4)
Resulting Array of Maps
Using a programming loop
Suppose that you wanted to generate and simultaneously display 3 graphs using the mtcars
data set, as follows:
weight versus horsepower, for 4-cylinder cars.
weight versus horsepower, for 6-cylinder cars.
weight versus horsepower, for 8-cylinder cars.
1. Create a vector object with the unique cyl (the number of cylinders) values from the
mtcars data.
par(mfrow=c(3,1),mar=c(4,3,3,2), cex=0.6)
End loop }
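The individual loop steps are not all shown above, so the following is a sketch of how the three plots could be generated. The object names follow those used elsewhere in this presentation where possible; the axis labels are assumptions.

```r
# Unique cylinder counts, in increasing order.
cyl_num <- sort(unique(mtcars$cyl))

# Three rows and one column of plots in the same plotting area.
par(mfrow = c(3, 1), mar = c(4, 3, 3, 2), cex = 0.6)

for (cyl_count in cyl_num) {
  # Keep only the cars with the current number of cylinders.
  mtcars_sub <- subset(mtcars, cyl == cyl_count)
  plot(mtcars_sub$wt, mtcars_sub$hp,
       main = paste(cyl_count, "Cylinders"),
       xlab = "Weight (1000 lbs)",   # illustrative label
       ylab = "Horsepower (hp)")     # illustrative label
}
```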
Plots generated using programming loop
Bubble Charts
Bubble charts build upon traditional x-y plots by incorporating multi-sized
and multi-colored “bubbles” (symbols) to convey additional information.
Source: https://www.flickr.com/photos/jawspeak/5944275063
Create a bubble chart in R
Using the mtcars data set, let’s create a bubble chart that:
1. Shows horsepower (hp) as a function of the time required to go a quarter
mile (qsec).
2. Scales the size of the symbols to reflect the engine displacements of the
cars.
3. Colors the symbols to reflect the number of cylinders in the cars.
4. Includes legends indicating the meanings of the symbol colors and sizes.
Let’s start by clearing the existing contents and settings from the plot area, as
follows:
dev.off()
How big should the bubbles be?
You want the bubble’s area to be proportional to the engine
displacement.
However, in R, the bubble’s size is specified by giving the radius.
The radius of the bubble is proportional to the square root of its area.
Establish a new object that reflects the appropriate radius of the
bubble for each car model.
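Since the area of a circle is pi times the radius squared, the radius that gives an area equal to the displacement is the square root of displacement divided by pi. A minimal sketch, with `bubble_radius` as an assumed object name:

```r
# area = pi * r^2, so r = sqrt(area / pi).
# The bubble area is set equal to the engine displacement.
bubble_radius <- sqrt(mtcars$disp / pi)

head(round(bubble_radius, 2))
```

Only the relative sizes matter here, because the plotting command rescales the radii to fit the plot.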
Program an R loop that will assign the appropriate color to each car
model in the mtcars data and list the colors in a vector object.
Loop to assign bubble colors
1. Create object
bubble_colors <- NULL Object bubble_colors initially has no contents
4. End loop
}
Note: the HTML color codes (e.g. "#1842ec50") all have “50” appended to the end. This is a
hexadecimal alpha (opacity) value: 50 in hexadecimal is 80 out of 255, or roughly 31% opaque. The
resulting transparency prevents the bubbles from obscuring each other on the bubble plot.
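Steps 2 and 3 of the loop are not shown above, so the following is only a sketch of how such a loop might look. The color assigned to each cylinder count is taken from the legend used later in this presentation; the variable `car_color` is an assumption.

```r
# 1. Create an empty object to hold the colors.
bubble_colors <- NULL

# 2. Loop over every car (row) in mtcars.
for (i in 1:nrow(mtcars)) {
  # 3. Pick a color from the number of cylinders and append it.
  if (mtcars$cyl[i] == 4) {
    car_color <- "#1eec1850"
  } else if (mtcars$cyl[i] == 6) {
    car_color <- "#1842ec50"
  } else {
    car_color <- "#ec351850"
  }
  bubble_colors <- c(bubble_colors, car_color)
# 4. End loop
}

table(bubble_colors)
```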
Plot the data
Command Arguments
set")
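The full plotting command is not shown above. One way to draw the bubbles is the symbols() function from base graphics, sketched below under several assumptions: the choice of symbols() itself, the `inches` scaling value, the outline color, and the axis labels and title. The colors are looked up with a named vector here instead of a loop.

```r
# Radius proportional to the square root of displacement.
bubble_radius <- sqrt(mtcars$disp / pi)

# Color lookup by cylinder count (codes taken from the legend in this presentation).
cyl_colors <- c("4" = "#1eec1850", "6" = "#1842ec50", "8" = "#ec351850")
bubble_colors <- unname(cyl_colors[as.character(mtcars$cyl)])

# symbols() draws one circle per car; inches sets the size of the largest circle.
symbols(mtcars$qsec, mtcars$hp,
        circles = bubble_radius,
        inches = 0.3,
        bg = bubble_colors,   # fill color
        fg = "gray30",        # outline color
        xlab = "Quarter-mile time (s)",
        ylab = "Horsepower (hp)",
        main = "Horsepower vs. Quarter-Mile Time")
```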
Here’s what we have so far
Let’s add a colors legend
We will use the legend command to create a legend, as in an
earlier example.
Command Arguments
legend(21.5, 250,
fill=c("#1eec1850","#1842ec50","#ec351850"),
legend=c("4 cylinders","6 cylinders","8 cylinders"),
bg="azure1",
text.col="black")
In the above command, the fill argument creates boxes with the specified
colors next to the legend text.
Add a bubble size legend
1. Install and load a new package that allows the addition of shapes to the
plot.
install.packages("plotrix")
library(plotrix)
draw.circle(22,325,0.22,nv=100,border=NULL,col=NULL,lty=1,lwd=1)
draw.circle(22.75,325,0.11,nv=100,border=NULL,col=NULL,lty=1,lwd=1)
3. Label the size legend
text(22, 355, "Displacement", col="black")
head(mtcars, 15)
Making heat maps with R
The command used for creating heat maps works only with data
matrices (not data frames).
Is the mtcars data set a data matrix?
Check using this command:
str(mtcars)
No, mtcars is a data frame, not a data matrix.
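A data frame whose columns are all numeric can be converted with as.matrix(). The sketch below assumes that the heat map command is heatmap() from the stats package, which this section does not name, and uses `scale = "column"` as an illustrative setting.

```r
# Confirm the starting type, then convert.
is.matrix(mtcars)
mtcars_matrix <- as.matrix(mtcars)

# scale = "column" standardizes each variable so that columns with
# large values (such as disp) do not dominate the colors.
heatmap(mtcars_matrix, scale = "column")
```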
curve(roller, Function to graph
0, Minimum x value (i.e. start graphing at x = 0)
100, Maximum x value (i.e. stop graphing at x = 100)
Supplementary Visualizations
2. Tree Maps
Suppose that,
On the left side of your plotting area, you wanted these plots, and
On the right side, you wanted a bar graph showing the mileage for
each model of car from the mtcars data set (in order of increasing
mileage).
3, Have 3 rows.
2, Have 2 columns.
byrow=FALSE) Do not fill the cells by row (i.e. fill the cells by column,
starting with the first column)
layout(plot_space)
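The first argument of the matrix command is not shown above. The sketch below assumes the values c(1, 2, 3, 4, 4, 4), which place plots 1 to 3 in the left column and let plot 4 fill the entire right column.

```r
# Each cell holds the number of the plot that will occupy it.
# Filled by column: the left column gets 1, 2, 3; the right column is all 4.
plot_space <- matrix(c(1, 2, 3, 4, 4, 4),
                     3,
                     2,
                     byrow = FALSE)
print(plot_space)

layout(plot_space)

# layout.show() outlines the regions so the arrangement can be checked.
layout.show(4)
```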
Repeat the 3 original plots
1. Start with the 3 plots from the earlier exercise (you already know
how these work)
for (i in 1:length(cyl_num)) {
cyl_count <- cyl_num_ordered[i]
main_title <- paste(cyl_count,"Cylinders")
mtcars_sub <- subset(mtcars,mtcars$cyl==cyl_count)
plot(mtcars_sub$wt,mtcars_sub$hp, main=main_title)
}
The 3 plots will appear on the left side of the plotting area.
Add the 4th plot
1. Order your data in the sequence in which you would like it to appear:
Plot the mileage (mpg) data from the mtcars data set
barplot(mtcars$mpg,
main = "Mileage (mpg)", Main title
cex.names = 0.5) Scale the size of the labels (relative to the default size)
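A sketch of the ordering step followed by the bar plot. The object name `mtcars_ordered`, the use of row names as bar labels, and the `las = 2` setting (which rotates the labels so that they fit) are assumptions.

```r
# Order the rows by increasing mileage.
mtcars_ordered <- mtcars[order(mtcars$mpg), ]

# One bar per car model, labelled with the model name.
barplot(mtcars_ordered$mpg,
        names.arg = rownames(mtcars_ordered),
        main = "Mileage (mpg)",
        las = 2,           # draw labels perpendicular to the axis
        cex.names = 0.5)   # scale the label size
```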
And here’s how it looks…
Tree Maps
Tree maps represent data as a series of colored rectangles.
Rectangles are grouped to represent hierarchical relationships.
Rectangles are sized and colored, respectively, to illustrate associated data
values.
By User:GBoshouwers (MagnaView Designer Pro) [Public domain], via Wikimedia Commons
Create a Tree Map using R
To make them easier to understand, rename the 4th and 5th columns of the
data.
colnames(tree_data)[4:5] <- c("Population", "Area (sq.km)")
Create a Tree Map using R
plot(faithful$eruptions,
faithful$waiting,
xlim=c(xmin,xmax),
ylim=c(ymin,ymax),
xlab="Eruption Duration (min)",
ylab="Interval Between Eruptions (min)",
main="Sequence of Old Faithful Eruptions")
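The axis limits xmin, xmax, ymin and ymax are not defined in the command above. One way to set them is to round the data ranges outward, as in this sketch; the rounding choice is an assumption. The built-in faithful data set records both variables in minutes.

```r
# Round the data ranges outward to whole numbers for the axis limits.
xmin <- floor(min(faithful$eruptions))
xmax <- ceiling(max(faithful$eruptions))
ymin <- floor(min(faithful$waiting))
ymax <- ceiling(max(faithful$waiting))

plot(faithful$eruptions,
     faithful$waiting,
     xlim = c(xmin, xmax),
     ylim = c(ymin, ymax),
     xlab = "Eruption Duration (min)",
     ylab = "Interval Between Eruptions (min)",
     main = "Sequence of Old Faithful Eruptions")
```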
Creating a Connected Scatter Plot using R
Now, add some lines tracing the data points in the sequence in which they were recorded. However:
• Do not add all of the lines, as the plot will become too congested (and therefore difficult to read).
• Alternate between differently colored lines to make it easier to follow from one line to the next.
Therefore, let’s plot only the first 10 lines, and alternate between red and green lines.
The matrix will continue adding values of first “red” and then
“green” until all 10 positions in the matrix are filled
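The commands for this step are not shown, so the following is a sketch of one way to do it. It relies on R recycling the two color names to fill a 10-position matrix and uses segments() to join consecutive points; the object names and the choice of segments() are assumptions.

```r
plot(faithful$eruptions, faithful$waiting)

n_lines <- 10

# The two values are recycled: red, green, red, green, ... until 10 positions are filled.
line_colors <- matrix(c("red", "green"), nrow = n_lines, ncol = 1)

# Join each of the first 10 points to the point recorded after it.
for (i in 1:n_lines) {
  segments(faithful$eruptions[i],     faithful$waiting[i],
           faithful$eruptions[i + 1], faithful$waiting[i + 1],
           col = line_colors[i, 1])
}
```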
Function Package Reference Materials
lm stats http://www.inside-r.org/r-doc/stats/lm
manipulate manipulate https://support.rstudio.com/hc/en-us/articles/200551906-Interactive-Plotting-with-Manipulate
Standard R Colors
• http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
http://goo.gl/forms/gcrJ1OSi5m