R Programming Unit-3 Complete Notes

Unit 3: Advanced Graphics, Importing Data – readr, Representing Tables – tibble

Contents:

• Advanced Graphics:
Advanced plotting using Trellis; ggplot2, Lattice, Examples that Present Panels of
Scatterplots using xyplot(), Simple Use of xyplot()

• Importing Data- readr:


Functions for Reading Data, File Headers, Column Types, String-based Column Type
Specification, Function-based Column Type Specification, Parsing Time and Dates,
Space-separated Columns, Functions for Writing Data

• Representing Tables – tibble:


Creating Tibbles, Indexing Tibbles

Advanced Graphics:
Advanced plotting using Trellis
Lattice is an add-on package that implements Trellis graphics (originally developed for S and S-PLUS)
in R. It is a powerful and elegant high-level data visualization system, with an emphasis on multivariate
data, that is sufficient for typical graphics needs and flexible enough to handle most nonstandard
requirements. This section covers the basics of lattice and gives pointers to further resources.
lattice provides a high-level system for statistical graphics that is independent of traditional R graphics.
It is modeled on the Trellis suite in S-PLUS and implements most of its features; in fact, lattice can
be considered an implementation of the general principles of Trellis graphics.
It uses the grid package as the underlying implementation engine, and thus inherits many of its
features by default.
Trellis displays are defined by the type of graphic and the role different variables play in it. Each
display type is associated with a corresponding high-level function (histogram, densityplot, etc.).
The possible roles depend on the type of display, but typical ones are:
primary variables: those that define the primary display (e.g., the x and y variables of a scatterplot).
conditioning variables: divide the data into subgroups, each of which is presented in a different panel.
grouping variables: subgroups are contrasted within panels by superposing the corresponding displays
(e.g., distinguishing the groups by color).
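
All three roles can be sketched in a single call. A minimal example using the built-in mtcars data (chosen here for illustration; the roles map onto any data set):

library(lattice)
# primary variables: mpg against wt define the scatterplot itself
# conditioning variable: cyl splits the data into one panel per value
# grouping variable: am is contrasted within each panel by plotting symbol/color
xyplot(mpg ~ wt | factor(cyl),
       groups = factor(am),
       data = mtcars,
       auto.key = TRUE)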

ggplot2
The ggplot2 package in R, based on the Grammar of Graphics, is a free, open-source, and
easy-to-use visualization package widely used in R. Written by Hadley Wickham, it is one of the
most powerful visualization packages available.
A ggplot2 plot is governed by several layers, the building blocks of the grammar of graphics:
• Data: the data set itself
• Aesthetics: mappings from the data onto aesthetic attributes such as the x-axis, y-axis, color, fill,
size, labels, alpha, shape, line width, and line type
• Geometries: how the data are displayed, e.g., as points, lines, histograms, bars, or boxplots
• Facets: display subsets of the data in rows and columns of panels
• Statistics: binning, smoothing, and other descriptive or intermediate transformations
• Coordinates: the space between data and display, e.g., Cartesian, fixed, or polar coordinates and axis limits
• Themes: the non-data ink of the plot (background, grid lines, fonts, and so on)
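
As a minimal sketch of how these layers combine in one plot (using the built-in iris data):

library(ggplot2)
ggplot(data = iris,                             # data layer
       aes(x = Petal.Length, y = Sepal.Length,  # aesthetics layer
           color = Species)) +
  geom_point() +                                # geometry layer
  geom_smooth(method = "lm") +                  # statistics layer (smoothing)
  facet_wrap(~ Species) +                       # facet layer
  theme_minimal()                               # theme layer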

Lattice
The lattice package was written by Deepayan Sarkar. It provides better defaults than base-R
graphics and adds the ability to display multivariate relationships. The package supports the
creation of trellis graphs: graphs that display a variable, or the relationship between variables,
conditioned on one or more other variables.
The typical format is:
graph_type(formula, data=)
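
For example, with the built-in iris data, the same formula interface drives several display types:

library(lattice)
histogram(~ Sepal.Length | Species, data = iris)   # histogram, conditioned on Species
densityplot(~ Sepal.Length, data = iris)           # kernel density plot
bwplot(Species ~ Sepal.Length, data = iris)        # box-and-whisker plot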

Scatter Plots in the Lattice Package


The xyplot() function can be used to create a scatter plot in R using the lattice package. The iris
dataset is perfectly suited for this example.
library(lattice)

xyplot(Sepal.Length ~ Petal.Length,
       data = iris)

We can also create plots in multiple panels based on groups.


xyplot(Sepal.Length ~ Petal.Length | Species,
group = Species,
data = iris,
type = c("p", "smooth"),
scales = "free")

Examples that Present Panels of Scatterplots using xyplot()


xyplot(): Scatter plot
R function: The R function xyplot() is used to produce bivariate scatter plots or time-series plots. The
simplified format is as follows:
xyplot(y ~ x, data)
Data set: iris
my_data <- iris
head(my_data)

Basic scatter plot: y ~ x


# Default plot
xyplot(Sepal.Length ~ Petal.Length, data = my_data)
# Color by groups
xyplot(Sepal.Length ~ Petal.Length, group = Species,
data = my_data, auto.key = TRUE)
xyplot(Sepal.Length ~ Petal.Length, data = my_data,
       type = c("p", "g", "smooth"),
       xlab = "Petal length", ylab = "Sepal length")

Multiple panels by groups: y ~ x | group


xyplot(Sepal.Length ~ Petal.Length | Species,
group = Species, data = my_data,
type = c("p", "smooth"),
scales = "free")

cloud(): 3D scatter plot

The cloud() function produces a 3D scatter plot using the formula z ~ x * y:

my_data <- iris
cloud(Sepal.Length ~ Petal.Length * Petal.Width,
      data = my_data)

Simple use of xyplot

The xyplot() function of the lattice package builds a scatterplot for each level of a factor
automatically, making it easy to compare several categories.

Importing Data- readr:


Functions for Reading Data
readr is part of the core tidyverse, so you can load it together with the rest of the tidyverse, or on its own:
library(tidyverse)
library(readr)

readr supports the following file formats with these read_*() functions:
read_csv(): comma-separated values (CSV) files
read_tsv(): tab-separated values (TSV) files
read_delim(): delimited files (CSV and TSV are important special cases)
read_fwf(): fixed-width files
read_table(): whitespace-separated files
read_log(): web log files
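
These functions also accept a literal string in place of a file path, which is convenient for quick experiments:

library(readr)
# comma-separated values supplied inline
read_csv("a,b,c
1,2,3
4,5,6")
# read_delim() with an explicit delimiter
read_delim("a|b|c\n1|2|3", delim = "|")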

File Headers
The first line of a comma-separated file is not always the column names; that information might be
available elsewhere, outside the file. If you do not want the first line interpreted as column names,
you can use the option col_names = FALSE.
read_csv(
file = "data/data.csv",
col_names = FALSE
)
Since the data/data.csv file has a header, that header is interpreted as part of the data, and because the
header consists of strings, read_csv infers that all the column types are strings. If we did not have the
header, for example, if we had the file data/data-no-header.csv:
1, a, a, 1.2
2, b, b, 2.1
3, c, c, 13.0
then we would get the same data frame as before, except that the names would be autogenerated:
read_csv(
file = "data/data-no-header.csv",
col_names = FALSE
)
If you have data in a file without a header, but you do not want the autogenerated names, you can
provide column names to the col_names option:
read_csv(
file = "data/data-no-header.csv",
col_names = c("X", "Y", "Z", "W")
)
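
The same behavior can be seen without the data/ files by supplying the text inline:

library(readr)
# explicit names for a header-less table
read_csv("1,a,a,1.2
2,b,b,2.1
3,c,c,13.0", col_names = c("X", "Y", "Z", "W"))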

Column Types
When read_csv parses a file, it infers the type of each column. This inference can be slow or, worse,
incorrect. If you know a priori what the types should be, you can specify them using the col_types
option. If you do, read_csv will not guess the types. It will, however, replace values that it cannot
parse as the specified type with NA.
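
A small sketch of this NA behavior, using inline text instead of a file:

library(readr)
# column A is declared integer; the value "x" cannot be parsed as an integer,
# so it becomes NA (read_csv() also records the failure, see problems())
d <- read_csv("A,B\n1,a\nx,b", col_types = "ic")
d$A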

String-based Column Type Specification


In the simplest string specification format, you provide a string with the same length as you
have columns, where each character in the string specifies the type of one column. The characters
specifying the different types are:
c = character, i = integer, d = double, l = logical, n = number, f = factor,
D = date, t = time, T = datetime, ? = guess, _ or - = skip the column
By default, read_csv guesses, so we could make this explicit using the type specification "????":
read_csv(
file = "data/data.csv",
col_types = "????"
)

The results of the guesses are double for columns A and D and character for columns B and C. If we
wanted to make this explicit, we could use "dccd".

read_csv(
file = "data/data.csv",
col_types = "dccd"
)
If you want an integer type for column A, you can use "iccd":
read_csv(
file = "data/data.csv",
col_types = "iccd"
)

Function-based Column Type Specification

If you are like me, you might find it hard to remember the single-character codes for the different types.
If so, you can use longer type names that you specify using function calls. These functions have names
that start with col_, so you can use autocomplete to get a list of them. The types you can specify using
functions are the same as those you can specify using characters, and the functions are:
col_character(), col_integer(), col_double(), col_logical(), col_number(), col_factor(),
col_date(), col_time(), col_datetime(), col_guess(), and col_skip()
You need to wrap the function-based type specifications in a call to cols.


read_csv(
file = "data/data.csv",
col_types = cols(A = col_integer())
)

read_csv(
file = "data/data.csv",
col_types = cols(D = col_character())
)
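
Several specifications can be combined in one cols() call; cols() also accepts a .default argument that covers all remaining columns:

read_csv(
file = "data/data.csv",
col_types = cols(
A = col_integer(),
D = col_double(),
.default = col_character()
)
)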

Most of the col_ functions do not take any arguments, but they are affected by the locale parameter the
same way that the string specifications are. For factor, date, time, and datetime types, however, you
have more control over the format using the col_ functions: you can use arguments to these functions
to specify how read_csv should parse dates and how it should construct factors.
For factors, you can explicitly set the levels. If you do not, the column parser will set the levels
in the order it sees the different strings in the column. For example, in data/data.csv the strings in
columns B and C are in the order a, b, and c.
By default, the two columns will be interpreted as characters, but if we specify that C should be a
factor, we get one where the levels are a, b, and c, in that order.
my_data <- read_csv(
file = "data/data.csv",
col_types = cols(C = col_factor())
)

If we want the levels in a different order, we can give col_factor() a levels argument.
my_data <- read_csv(
file = "data/data.csv",
col_types = cols(
C = col_factor(levels = c("c", "b", "a"))
)
)
my_data$C

We can also make factors ordered using the ordered argument:


my_data <- read_csv(
file = "data/data.csv",
col_types = cols(
B = col_factor(ordered = TRUE),
C = col_factor(levels = c("c", "b", "a"))
)
)
my_data$B

Parsing Time and Dates


The most complex types to read (or write) are dates and times (and datetimes), simply because these are
written in so many different ways. You can specify the format that dates and datetimes are in using a
string with codes that indicate how the time information is represented. The most important codes are:
%Y = 4-digit year, %y = 2-digit year, %m = month number, %b / %B = abbreviated / full month name,
%d = day of the month, %H = hour (24-hour clock), %I = hour (12-hour clock), %M = minutes,
%S = seconds, %p = AM/PM, %z = time zone offset
There are shortcuts for frequently used formats:
%D = %m/%d/%y, %F = %Y-%m-%d, %R = %H:%M, %T = %H:%M:%S
As we saw earlier, you can set the date and time format using the locale() function. If you do not, the
default codes will be %AD for dates and %AT for times (there is no locale() argument for datetimes).
These codes specify YMD and H:M/H:M:S formats, respectively, but are more relaxed in matching
the patterns. The date parser, for example, will allow different separators: both "1975-02-15" and
"1975/02/15" will be read as February 15th, 1975, and for times, both "18:00" and "6:00 pm" will be
read as six o'clock in the evening.
In the following, I give a few examples. I will use the functions parse_date, parse_time, and
parse_datetime rather than read_csv with column type specifications. These functions are used by
read_csv when you specify a date, time, or datetime column type, but using read_csv for the
examples would be unnecessarily verbose. Each takes a vector of string representations of dates and
times. For more examples, you can read the function documentation: ?col_datetime. Parsing times is
simplest; there is not much variation in how time points are written. The main difference is
whether you use a 24-hour or a 12-hour clock. The %R and %T codes expect 24-hour clocks and
differ in whether seconds are included or not.
parse_time(c("18:00"), format = "%R")
parse_time(c("18:00:30"), format = "%T")
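
parse_date and parse_datetime work the same way:

parse_date("1975-02-15", format = "%Y-%m-%d")
parse_date("02/15/1975", format = "%m/%d/%Y")
parse_date("1975/02/15")                  # the relaxed default %AD accepts this too
parse_datetime("1975-02-15 18:00:30")     # ISO 8601 datetimes need no format argument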

Space-separated Columns
The preceding functions all read delimiter-separated columns. They expect a single character to
separate one column from the next. If the argument trim_ws is true, they ignore whitespace; this
argument is true by default for read_csv, read_csv2, and read_tsv, but false for read_delim. The
functions read_table and read_table2 take a different approach and separate columns by one or more
spaces. The simpler of the two is read_table2. It expects any sequence of whitespace to separate
columns. Consider
read_table2(
"A  B  C  D
   1  2  3  4
   15 16 17 18"
)
The header names are separated by two spaces. The first data line has spaces before the first value
because the string is indented; between its columns, there are also two spaces. The second data
line again has several spaces before the first value, but this time only a single space between the
columns. If we used a delimiter character to specify that a space should separate the columns, we
would have to have exactly the same number of spaces between each column. The read_table
function instead reads the data as fixed-width columns. It uses the whitespace in the file to figure out
the width of the columns. After this, each line is split into segments that match the widths of the
columns and assigned to those columns.
read_table(
"
A    B   C  D
121  xyz 14 15
22   abc 24 25
"
)
Here the columns are aligned, and the rows are interpreted as we might expect. Aligned, here, means
that we have spaces at the same positions between the columns in all rows. If you do not have spaces
at the same location in all rows, columns will be merged.
read_table(
"
A    B     C D
121  xyz 14 15
22   abc 24 25
"
)
Here, the header C is at the position that should separate columns C and D, and these columns are
therefore merged. If you have spaces at the same position in all rows but data between them in some
rows only, you will get an error. For example, your data might look like this:
read_table(
"
A    B      C  D
121  xyz  x 14 15
22   abc    24 25
"
)
where the x in the first data line sits between two all-space columns. If you need more specialized
fixed-width files, you might want to consider the read_fwf function; see its documentation for
details: ?read_fwf. The read_table and read_table2 functions take many of the same arguments as the
delimiter-based parsers, so you can, for example, specify column types and set the locale in the same
way as before.
The readxl package is not part of the core tidyverse (the packages loaded when you load tidyverse),
but its read_excel function does exactly what it says on the tin: it reads Excel spreadsheets into R. Its
interface is similar to the functions in readr; where it differs is in Excel-specific options, such as
which sheet to read, which are clearly only needed when reading Excel files.
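
A sketch of reading a spreadsheet with readxl ("spreadsheet.xlsx" is a hypothetical file name used for illustration):

library(readxl)
excel_sheets("spreadsheet.xlsx")                      # list the sheet names first
my_data <- read_excel("spreadsheet.xlsx", sheet = 1)  # read the first sheet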

Functions for Writing Data


Writing data to a file is more straightforward than reading it, because we already have the data in the
correct types and do not need to deal with different formats. With readr's writing functions, we have
fewer options for formatting the output: for example, we cannot give the functions a locale(), and we
cannot specify date and time formatting (dates and times are written in ISO 8601, which is what the
reading functions use as their default). We can, however, use different functions to get different
delimiters. The functions are write_delim(), write_csv(), write_csv2(), and write_tsv(), and, for
formats that Excel can read, write_excel_csv() and write_excel_csv2(). The difference between
write_csv() and write_excel_csv(), and between write_csv2() and write_excel_csv2(), is that the
Excel functions include a UTF-8 byte order mark so Excel knows that the file is UTF-8 encoded.
The first argument to these functions is the data we want to write, and the second is the path to the file
we want to write to. If this file has the suffix .gz, .bz2, or .xz, the output is automatically compressed. I
will not list all the arguments for these functions here, but you can read their documentation from the
R console. The argument you are most likely to use is col_names which, if true, makes the function
write the column names as the first line of the output, and if false, does not. If you use write_delim(),
you might also want to specify the delimiter character using the delim argument. By default it is a
single space, so if you write to a file using write_delim() with the default options, you get the data in a
format that you can read back using read_table2(). The delimiter characters and decimal points for
write_csv(), write_csv2(), and write_tsv() are the same as for the corresponding read functions.
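
A small round trip, sketched with a tibble built in memory:

library(readr)
my_data <- tibble::tibble(A = 1:3, B = c("a", "b", "c"))
write_csv(my_data, "my-data.csv")        # writes a header line by default
write_csv(my_data, "my-data.csv.gz")     # the .gz suffix triggers compression
read_csv("my-data.csv")                  # reading it back recovers the table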

Representing Tables – tibble:


Creating Tibbles
Tidyverse functions that create tabular data will create tibbles rather than data frames. For example,
when we use read_csv to read a file into memory, the result is a tibble:
x <- read_csv(file = "data/data.csv")
The table that read_csv() creates has several super-classes, but the last is data.frame.
class(x)

This means that generic functions, if not specialized in the other classes, will use the data.frame
version, and this, in turn, means that you can often use tibbles in functions that expect data frames. It
does not mean that you can always use tibbles as a replacement for a data frame. If you run into this
problem, you can translate a tibble into a data frame using as.data.frame():
y <- as.data.frame(x)
y

You can create a tibble from vectors using the tibble() function:
x <- tibble(
x = 1:100,
y = x^2,
z = y^2
)
X

Two things to notice here: when you print a tibble, you only see the first ten lines. This is because the
tibble has enough lines that it will flood the console if you print all of them. If a tibble has more than
20 rows, you will only see the first ten. If it has fewer, you will see all the rows. You can change how
many lines you will see using the n option to print():
print(x, n = 2)

If a tibble has more columns than your console can show, only some will be printed. You can change
the number of characters it will print using the width option to print():

print(x, n = 2, width = 15)

Indexing Tibbles
You can index a tibble in much the same way as you can index a data frame. You can extract a
column using single-bracket indexing ([]), either by name or by position:
x <- read_csv(file = "data/data.csv")

y <- as.data.frame(x)
x["A"]
y["A"]
The result is a tibble or a data.frame, respectively, containing a single column. If you use double
brackets ([[]]), you will get the vector contained in a column rather than a tibble/data frame:

x[["A"]]

You will also get the underlying vector of a column if you use $-indexing:

x$A

Using [] you can extract more than one column:
x[c("A", "C")]

You cannot do this using [[]]. You can extract a subset of rows and columns if you use two indices.
For example, you can get the first two rows of the first two columns using [1:2, 1:2]:
x[1:2, 1:2]
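
One difference between the two is worth noting here: with two indices and a single column, a tibble still returns a (one-column) tibble, while a data frame drops down to a plain vector:

x[1:2, "A"]    # a 2-by-1 tibble
y[1:2, "A"]    # a plain vector, because y is a data.frame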
