R Programming Unit-3 Complete Notes
Unit 3 – tibble
Contents:
• Advanced Graphics:
Advanced plotting using Trellis; ggplot2; Lattice; examples that present panels of
scatterplots using xyplot(); simple use of xyplot()
Unit 3
Advanced Graphics:
Advanced plotting using Trellis
Lattice is an add-on package that implements Trellis graphics (originally developed for S and S-PLUS)
in R. It is a powerful and elegant high-level data visualization system, with an emphasis on multivariate
data, that is sufficient for typical graphics needs and flexible enough to handle most nonstandard
requirements. This section covers the basics of lattice and gives pointers to further resources.
lattice provides a high-level system for statistical graphics that is independent of traditional R graphics.
It is modeled on the Trellis suite in S-PLUS and implements most of its features; in fact, lattice can
be considered an implementation of the general principles of Trellis graphics.
It uses the grid package as the underlying implementation engine, and thus inherits many of its
features by default.
Trellis displays are defined by the type of graphic and the roles different variables play in it. Each
display type is associated with a corresponding high-level function (histogram, densityplot, etc.).
The possible roles depend on the type of display, but typical ones are:
• Primary variables: those that define the primary display (e.g., gcsescore in the previous examples).
• Conditioning variables: divide the data into subgroups, each of which is presented in a separate panel
(e.g., score in the last two examples).
• Grouping variables: subgroups are contrasted within panels by superposing the corresponding displays
(e.g., gender in the last example).
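These roles can be sketched with the built-in iris data set (the gcsescore, score, and gender examples referred to above are not included in these notes, so Species stands in for both the conditioning and the grouping variable):

```r
library(lattice)

# Primary variables: Petal.Length and Sepal.Length define the scatterplot.
# Conditioning variable: Species splits the data into one panel per species.
xyplot(Sepal.Length ~ Petal.Length | Species, data = iris)

# Grouping variable: the same subgroups are instead superposed within a
# single panel, distinguished by plotting symbol and colour.
xyplot(Sepal.Length ~ Petal.Length, groups = Species, data = iris,
       auto.key = TRUE)
```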
ggplot2
The ggplot2 package in R, based on the Grammar of Graphics, is a free, open-source, and
easy-to-use visualization package, and one of the most widely used in R. It was written by
Hadley Wickham.
A plot is built up from several layers. The building blocks of the grammar of graphics are:
• Data: the data set itself
• Aesthetics: how the data map onto aesthetic attributes such as the x-axis, y-axis, color, fill,
size, labels, alpha, shape, line width, and line type
• Geometries: how the data are displayed, using points, lines, histograms, bars, or boxplots
• Facets: display subsets of the data in rows and columns of panels
• Statistics: binning, smoothing, and descriptive or intermediate summaries
• Coordinates: the space between data and display, using Cartesian, fixed, or polar coordinates and axis limits
• Themes: the non-data ink (titles, fonts, background, and so on)
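As a sketch of how these layers combine, the following builds a plot from the built-in iris data set, one layer at a time:

```r
library(ggplot2)

ggplot(iris, aes(x = Petal.Length, y = Sepal.Length,
                 colour = Species)) +  # data + aesthetics
  geom_point() +                       # geometry: points
  geom_smooth(method = "lm") +         # statistics: a linear smoother per group
  facet_wrap(~ Species) +              # facets: one panel per species
  theme_minimal()                      # theme: non-data appearance
```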
Lattice
The lattice package was written by Deepayan Sarkar. It provides better defaults than base-R
graphics and adds the ability to display multivariate relationships. The package supports the
creation of trellis graphs: graphs that display a variable, or the relationship between variables,
conditioned on one or more other variables.
The typical format is:
graph_type(formula, data=)
library(lattice)
xyplot(Sepal.Length ~ Petal.Length,
data = iris)
Reading Files with readr
readr supports the following file formats with these read_*() functions:
• read_csv(): comma-separated values (CSV) files
• read_tsv(): tab-separated values (TSV) files
• read_delim(): delimited files (CSV and TSV are important special cases)
• read_fwf(): fixed-width files
• read_table(): whitespace-separated files
• read_log(): web log files
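All of these functions share essentially the same interface; only the delimiter handling differs. As a small sketch (using I() to pass literal data instead of a file name, which requires readr 2.0 or later):

```r
library(readr)

# Tab-separated input; read_tsv() fixes the delimiter to a tab.
read_tsv(I("A\tB\n1\tx\n2\ty"))

# The same data with a semicolon delimiter; read_delim() takes any
# single-character delimiter via the delim argument.
read_delim(I("A;B\n1;x\n2;y"), delim = ";")
```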
File Headers
The first line in a comma-separated file is not always the column names; that information might be
available elsewhere, outside the file. If you do not want to interpret the first line as column names,
you can use the option col_names = FALSE:
read_csv(
file = "data/data.csv",
col_names = FALSE
)
With col_names = FALSE, the header in data/data.csv is interpreted as part of the data, and because
the header consists of strings, read_csv infers that all the column types are strings. If we did not
have the header, for example, if we had the file data/data-no-header.csv:
1, a, a, 1.2
2, b, b, 2.1
3, c, c, 13.0
then we would get the same data frame as before, except that the names would be autogenerated:
read_csv(
file = "data/data-no-header.csv",
col_names = FALSE
)
If you have data in a file without a header, but you do not want the autogenerated names, you can
provide column names to the col_names option:
read_csv(
file = "data/data-no-header.csv",
col_names = c("X", "Y", "Z", "W")
)
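Since the data/ files used in these notes are not included, the same behaviour can be reproduced with an inline string (again using I(), available in readr 2.0 and later):

```r
library(readr)

# Headerless data equivalent to data/data-no-header.csv, with custom names.
read_csv(
  I("1, a, a, 1.2\n2, b, b, 2.1\n3, c, c, 13.0"),
  col_names = c("X", "Y", "Z", "W")
)
```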
Column Types
When read_csv parses a file, it infers the type of each column. This inference can be slow or,
worse, incorrect. If you know a priori what the types should be, you can specify them using the
col_types option. If you do this, read_csv will not guess at the types. It will, however, replace
values that it cannot parse as the specified type with NA.
By default, read_csv guesses, so we could make this explicit using the type specification "????":
read_csv(
file = "data/data.csv",
col_types = "????"
)
The results of the guesses are double for columns A and D and character for columns B and C. If we
wanted to make this explicit, we could use "dccd":
read_csv(
file = "data/data.csv",
col_types = "dccd"
)
If you want an integer type for column A, you can use "iccd":
read_csv(
file = "data/data.csv",
col_types = "iccd"
)
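The same specification can be tried without the data file by supplying inline data that mirrors data/data.csv:

```r
library(readr)

# i = integer, c = character, d = double; one letter per column.
read_csv(
  I("A,B,C,D\n1,a,a,1.2\n2,b,b,2.1\n3,c,c,13.0"),
  col_types = "iccd"
)
```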
If you only want to fix the types of some columns, and let read_csv guess the rest, you can use the
cols() function:
read_csv(
file = "data/data.csv",
col_types = cols(D = col_character())
)
Most of the col_ functions do not take any arguments, but they are affected by the locale parameter
in the same way that the string specifications are. For factors and for date, time, and datetime types,
however, you have more control over the format through the col_ functions: you can use their
arguments to specify how read_csv should parse dates and how it should construct factors.
For factors, you can explicitly set the levels. If you do not, the column parser will set the levels
in the order it sees the different strings in the column. For example, in data/data.csv the strings in
columns B and C are in the order a, b, and c. By default, the two columns will be interpreted as
characters, but if we specify that C should be a factor, we get one where the levels are a, b, and c,
in that order.
my_data <- read_csv(
file = "data/data.csv",
col_types = cols(C = col_factor())
)
If we want the levels in a different order, we can give col_factor() a levels argument.
my_data <- read_csv(
file = "data/data.csv",
col_types = cols(
C = col_factor(levels = c("c", "b", "a"))
)
)
my_data$C
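Using inline data in place of data/data.csv, the reordered levels can be inspected with levels():

```r
library(readr)

my_data <- read_csv(
  I("A,B,C,D\n1,a,a,1.2\n2,b,b,2.1\n3,c,c,13.0"),
  col_types = cols(C = col_factor(levels = c("c", "b", "a")))
)

# The levels come out in the order we specified, not the order seen.
levels(my_data$C)   # "c" "b" "a"
```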
There are also shortcuts for frequently used date and time formats.
As we saw earlier, you can set the date and time format using the locale() function. If you do not,
the default codes are %AD for dates and %AT for times (there is no locale() argument for datetimes).
These codes specify YMD and H:M/H:M:S formats, respectively, but are more relaxed in matching
the patterns. The date parser, for example, allows different separators: for dates, both “1975-02-15”
and “1975/02/15” will be read as February the 15th, 1975, and for times, both “18:00” and “6:00 pm”
denote six o’clock in the evening.
In the following text, I give a few examples. I will use the functions parse_date, parse_time, and
parse_datetime rather than read_csv with column type specifications. These functions are used by
read_csv when you specify a date, time, or datetime column type, but using read_csv for the
examples would be unnecessarily verbose. Each takes a vector of string representations of dates and
times. For more examples, you can read the function documentation: ?col_datetime. Parsing time is
simplest; there is not much variation in how time points are written. The main difference is whether
you use a 24-hour or a 12-hour clock. The %R and %T codes expect 24-hour clocks and differ in
whether seconds are included or not.
parse_time(c("18:00"), format = "%R")
parse_time(c("18:00:30"), format = "%T")
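The relaxed default date code can be seen with parse_date, and parse_datetime handles ISO-8601-style input:

```r
library(readr)

parse_date("1975-02-15")               # default %AD format
parse_date("1975/02/15")               # different separator, same date
parse_datetime("1975-02-15 18:00:30")  # ISO-8601-style datetime
```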
Space-separated Columns
The preceding functions all read delimiter-separated columns: they expect a single character to
separate one column from the next. If the argument trim_ws is TRUE, they ignore whitespace around
the values; this argument is TRUE by default for read_csv, read_csv2, and read_tsv, but FALSE for
read_delim. The functions read_table and read_table2 take a different approach and separate columns
by one or more spaces. The simpler of the two is read_table2: it expects any sequence of whitespace
to separate columns. Consider
read_table2(
"A  B  C  D
   1  2  3  4
   15 16 17 18"
)
The header names are separated by two spaces. The first data line has spaces before the first value
because the string is indented the way it is; between the columns, there are also two spaces. The
second data line again has several spaces before the first value, but this time only a single space
between the columns. If we used a delimiter character to specify that a space should separate
columns, we would have to have exactly the same number of spaces between each column. The
read_table function instead reads the data as fixed-width columns. It uses the whitespace in the file
to figure out the widths of the columns; after this, each line is split into substrings matching those
widths and assigned to the corresponding columns.
read_table(
"
A    B   C  D
121  xyz 14 15
 22  abc 24 25
"
)
Here the columns are aligned, and the rows are interpreted as we might expect. Aligned, here, means
that the spaces between columns sit at the same positions in every row. If you do not have spaces at
the same locations in all rows, columns will be merged:
read_table(
"
A    B     C D
121  xyz 14 15
 22  abc 24 25
"
)
Here, the header C is at the position that should separate columns C and D, and these columns are
therefore merged. If you have spaces in all rows, but data between them in some columns only, you
will get an error. For example, your data might look like this:
read_table(
"
A    B       C  D
121  xyz   x 14 15
 22  abc     24 25
"
)
where the x in the first data line sits between two all-space columns. If you need more specialized
fixed-width files, you might want to consider the read_fwf function; see its documentation for
details: ?read_fwf. The read_table and read_table2 functions take many of the same arguments as the
delimiter-based parsers, so you can, for example, specify column types and set the locale in the same
way as before.
Not part of the core Tidyverse (the packages loaded when you load the tidyverse package) is readxl.
Its read_excel function does exactly what it says on the tin: it reads Excel spreadsheets into R. Its
interface is similar to the functions in readr; where it differs is in Excel-specific options, such as
which sheet to read. Such options are clearly only needed when reading Excel files.
tibble
A tibble is the Tidyverse version of a data frame; its class vector, c("tbl_df", "tbl", "data.frame"),
includes data.frame. This means that generic functions, if not specialized for the other classes, will
use the data.frame version, and this, in turn, means that you can often use tibbles in functions that
expect data frames. It does not mean that you can always use tibbles as a replacement for a data
frame. If you run into this problem, you can translate a tibble into a data frame using as.data.frame():
y <- as.data.frame(x)
y
You can create a tibble from vectors using the tibble() function:
x <- tibble(
x = 1:100,
y = x^2,
z = y^2
)
x
Two things to notice here: when you print a tibble, you only see the first ten lines. This is because
the tibble has enough lines that printing all of them would flood the console. If a tibble has more
than 20 rows, you will only see the first ten; if it has fewer, you will see all the rows. You can
change how many lines you see using the n option to print():
print(x, n = 2)
If a tibble has more columns than your console can show, only some will be printed. You can change
the number of characters printed per line using the width option to print().
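Both options can be sketched on a tibble built from scratch:

```r
library(tibble)

x <- tibble(x = 1:100, y = x^2)

print(x, n = 2)       # print only the first two rows
print(x, width = 20)  # limit the printed output to 20 characters wide
```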
Indexing Tibbles
You can index a tibble in much the same way as you can index a data frame. You can extract a
column using single-bracket indexing ([]), either by name or by index:
x <- read_csv(file = "data/data.csv")
y <- as.data.frame(x)
x["A"]
The result is a tibble or data frame, respectively, containing a single column. If you use double
brackets ([[]]), you will get the vector contained in a column rather than a tibble/data frame:
x[["A"]]
You will also get the underlying vector of a column if you use $-indexing:
x$A
You cannot do this using single brackets ([]). You can extract a subset of rows and columns if you
use two indices.
For example, you can get the first two rows in the first two columns using [1:2,1:2]:
x[1:2,1:2]
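The indexing forms above can be compared side by side on a small tibble:

```r
library(tibble)

x <- tibble(A = 1:3, B = c("a", "b", "c"))

x["A"]       # single brackets: a one-column tibble
x[["A"]]     # double brackets: the underlying vector 1 2 3
x$A          # $-indexing: also the underlying vector
x[1:2, 1:2]  # first two rows of the first two columns, as a tibble
```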