R Cheat Sheet PDF
2:6    An integer sequence: 2 3 4 5 6
RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • [email protected] Learn more at web page or vignette • package version • Updated: 3/15
Types
Converting between common data types in R. Can always go from a higher value in the table to a lower value.

Function       Example values                     Description
as.logical     TRUE, FALSE, TRUE                  Boolean values (TRUE or FALSE).
as.numeric     1, 0, 1                            Integers or floating point numbers.
as.character   '1', '0', '1'                      Character strings. Generally preferred to factors.
as.factor      '1', '0', '1', levels: '0', '1'    Character strings with preset levels. Needed for some statistical models.

Matrices
m <- matrix(x, nrow = 3, ncol = 3)    Create a matrix from x.
m[2, ]     Select a row.
m[ , 1]    Select a column.
t(m)       Transpose.
m %*% n    Matrix multiplication.

Strings    Also see the stringr package.
paste(x, y, sep = ' ')       Join multiple vectors together.
paste(x, collapse = ' ')     Join elements of a vector together.
grep(pattern, x)             Find regular expression matches in x.
gsub(pattern, replace, x)    Replace matches in x with a string.
nchar(x)                     Number of characters in a string.
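The conversions, matrix selections and string helpers listed above can be tried together in a short base R session. This is only a sketch: the vectors `flags` and `words` and the 3 x 3 matrix are made-up inputs chosen for illustration.

```r
# Type conversion, moving down the table: logical -> numeric -> character -> factor
flags <- c(TRUE, FALSE, TRUE)        # made-up input
nums  <- as.numeric(flags)           # 1 0 1
chars <- as.character(nums)          # "1" "0" "1"
fct   <- as.factor(chars)            # levels are sorted: "0" "1"

# Matrices are filled column by column
m <- matrix(1:9, nrow = 3, ncol = 3)
second_row   <- m[2, ]               # 2 5 8
first_column <- m[ , 1]              # 1 2 3
transposed   <- t(m)

# String helpers
words   <- c('alpha', 'beta', 'gamma')   # made-up input
joined  <- paste(words, collapse = ' ')  # one string
hits    <- grep('^b', words)             # index of the element starting with b
swapped <- gsub('a', 'A', words)         # replace every a
lengths <- nchar(words)
```

Note that `grep()` returns positions rather than the matching strings; `grep(pattern, x, value = TRUE)` returns the strings themselves.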
Lists
l <- list(x = 1:5, y = c('a', 'b'))
A list is a collection of elements which can be of different types.
l[[2]]    Second element of l.
l[1]      New list with only the first element.
l$x       Element named x.
l['y']    New list with only element named y.

Factors
factor(x)              Turn a vector into a factor. Can set the levels of the factor and the order.
cut(x, breaks = 4)     Turn a numeric vector into a factor by 'cutting' into sections.

Maths Functions
log(x)        Natural log.                       sum(x)         Sum.
exp(x)        Exponential.                       mean(x)        Mean.
max(x)        Largest element.                   median(x)      Median.
min(x)        Smallest element.                  quantile(x)    Percentage quantiles.
round(x, n)   Round to n decimal places.         rank(x)        Rank of elements.
signif(x, n)  Round to n significant figures.    var(x)         The variance.
cor(x, y)     Correlation.                       sd(x)          The standard deviation.

Statistics
lm(y ~ x, data=df)     Linear model.
glm(y ~ x, data=df)    Generalised linear model.
summary                Get more detailed information out of a model.
t.test(x, y)           Perform a t-test for difference between means.
pairwise.t.test        Perform pairwise t-tests between group levels, with correction for multiple testing.
prop.test              Test for a difference between proportions.
aov                    Analysis of variance.

Data Frames    Also see the dplyr package.
df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))
A special case of a list where all elements are the same length.

x   y
1   a
2   b
3   c

List subsetting:      df$x    df[[2]]
Matrix subsetting:    df[ , 2]    df[2, 2]

Understanding a data frame
View(df)    See the full data frame.
head(df)    See the first 6 rows.
nrow(df)    Number of rows.
ncol(df)    Number of columns.
dim(df)     Number of rows and columns.
cbind       Bind columns.
rbind       Bind rows.

Variable Assignment
> a <- 'apple'
> a
[1] 'apple'

The Environment
ls()               List all variables in the environment.
rm(list = ls())    Remove all variables from the environment.
You can use the environment panel in RStudio to browse variables in your environment.

Distributions
            Random Variates   Density Function   Cumulative Distribution   Quantile
Normal      rnorm             dnorm              pnorm                     qnorm
Poisson     rpois             dpois              ppois                     qpois
Binomial    rbinom            dbinom             pbinom                    qbinom
Uniform     runif             dunif              punif                     qunif

Plotting
plot(x)       Values of x in order.
plot(x, y)    Values of x against y.
hist(x)       Histogram of x.

Dates    See the lubridate package.
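A short sketch of list and data frame subsetting, followed by a linear model, may make the entries above more concrete. It reuses the `l` and `df` definitions from the sheet; the model is fitted on R's built-in `cars` dataset, which is an illustrative choice and not part of the original sheet.

```r
# List subsetting: [[ ]] and $ return the element, [ ] returns a smaller list
l <- list(x = 1:5, y = c('a', 'b'))
second  <- l[[2]]          # the character vector c('a', 'b')
first   <- l[1]            # still a list, holding only x
by_name <- l$x

# Data frame subsetting works both list-style and matrix-style
df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))
col_x <- df$x
col_y <- df[[2]]
cell  <- df[2, 2]
shape <- dim(df)           # rows first, then columns

# Distribution helpers follow the r/d/p/q naming scheme
median_z <- qnorm(0.5)     # quantile function of the standard normal
half     <- pnorm(0)       # cumulative distribution at 0

# Linear model on the built-in cars dataset (illustrative choice)
fit <- lm(dist ~ speed, data = cars)
coef_names <- names(coef(fit))
```

Calling `summary(fit)` afterwards prints the coefficient table, residual summary and R-squared for the fitted model.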
Advanced R Cheat Sheet
Created by: Arianne Colton and Sean Chen

Environments

Environment Basics
Environment: data structure that powers lexical scoping. It has two components: a set of name-to-value bindings and a reference to a parent environment.
1. Empty environment: ultimate ancestor of all environments
   • Parent: none
   • Access with: emptyenv()
2. Base environment: environment of the base package
   • Parent: empty environment
   • Access with: baseenv()
3. Global environment: the interactive workspace that you normally work in
   • Parent: environment of last attached package
   • Access with: globalenv()
4. Current environment: environment that R is currently working in (may be any of the above and others)
   • Parent: depends on which environment is current
   • Access with: environment()

Search Path
Search path: mechanism to look up objects, particularly functions.
• Access with: search(), which lists all parents of the global environment (see Figure 1)
• Access any environment on the search path: as.environment('package:base')

Figure 2 - Package Attachment
library(reshape2); search()
'.GlobalEnv' 'package:reshape2' ... 'Autoloads' 'package:base'
NOTE: Autoloads: special environment used for saving memory by only loading package objects (like big datasets) when needed

Binding Names to Values
Assignment: act of binding (or rebinding) a name to a value in an environment.
1. <- (Regular assignment arrow): always creates a variable in the current environment
2. <<- (Deep assignment arrow): modifies an existing variable found by walking up the parent environments
Warning: If <<- doesn't find an existing variable, it will create one in the global environment.

Function Environments
1. Enclosing environment: the environment where the function is created. It determines how the function finds values.
   • Enclosing environment never changes, even if the function is moved to a different environment.
   • Access with: environment(func1)
2. Binding environment: every environment in which a name is bound to the function. It determines how the function itself is found.
3. Execution environment: a new environment created to host each function call.
   • Two related environments:
     I. Enclosing environment of the function (the parent of the execution environment)
     II. Calling environment of the function (reachable with parent.frame())
   • Execution environment is thrown away once the function has completed.
4. Calling environment: the environment where the function was called.
   • Access with: parent.frame(), called from inside the function
   • Dynamic scoping:
     • About: look up variables in the calling environment rather than in the enclosing environment
     • Usage: most useful for developing functions that aid interactive data analysis
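The interplay of enclosing, execution and calling environments with the `<<-` arrow is easiest to see in a closure. The sketch below uses hypothetical helper names (`make_counter`, `where_called`, `wrapper`) that do not appear in the sheet.

```r
# Each call to make_counter() creates a fresh execution environment.
# The inner function keeps it alive as its enclosing environment.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1    # rebinds count in the enclosing environment
    count
  }
}

counter <- make_counter()
first_call  <- counter()
second_call <- counter()

# environment() on a function returns its enclosing environment
enclosing <- environment(counter)
stored    <- get('count', envir = enclosing)

# parent.frame() returns the calling environment
where_called <- function() parent.frame()
wrapper <- function() {
  here <- environment()              # wrapper's execution environment
  identical(where_called(), here)    # where_called was called from here
}
calling_matches <- wrapper()

# The base environment's parent is the empty environment
base_parent_is_empty <- identical(parent.env(baseenv()), emptyenv())
```

Because `count` lives in the enclosing environment rather than the execution environment of the inner function, its value survives between calls.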
RStudio® is a trademark of RStudio, Inc. • CC BY Arianne Colton, Sean Chen • [email protected] • 844-448-1212 • rstudio.com Updated: 2/16
Data Structures

      Homogeneous      Heterogeneous
1d    Atomic vector    List
2d    Matrix           Data frame
nd    Array

Note: R has no 0-dimensional or scalar types. Individual numbers or strings are actually vectors of length one, NOT scalars.

Human readable description of any R data structure:
str(variable)

Every Object has a mode and a class
1. Mode: represents how an object is stored in memory
   • 'type' of the object from R's point of view
   • Access with: typeof()
2. Class: represents the object's abstract type
   • 'type' of the object from R's object-oriented programming point of view
   • Access with: class()

                                typeof()             class()
strings or vector of strings    character            character
numbers or vector of numbers    double or integer    numeric or integer
list                            list                 list
data.frame                      list                 data.frame

Factors
1. Factors are built on top of integer vectors using two attributes:
   class(x) -> 'factor'
   levels(x) # defines the set of allowed values
2. Useful when you know the possible values a variable may take, even if you don't see all values in a given dataset.
Warning on Factor Usage:
1. Although factors look and often behave like character vectors, they are actually integers. Be careful when treating them like strings.
2. In R versions before 4.0.0, most data loading functions automatically convert character vectors to factors. (Use argument stringsAsFactors = FALSE to suppress this behavior; from R 4.0.0 this is the default.)

Object Oriented (OO) Field Guide

Object Oriented Systems
R has three object oriented systems:
1. S3 is a very casual system. It has no formal definition of classes. It implements generic function OO.
   • Generic-function OO: a special type of function called a generic function decides which method to call.
     Example: drawRect(canvas, 'blue')
     Language: R
   • Message-passing OO: messages (methods) are sent to objects and the object determines which function to call.
     Example: canvas.drawRect('blue')
     Language: Java, C++, and C#
2. S4 works similarly to S3, but is more formal. Two major differences to S3:
   • Formal class definitions: describe the representation and inheritance for each class, and has special helper functions for defining generics and methods.
   • Multiple dispatch: generic functions can pick methods based on the class of any number of arguments, not just one.
3. Reference classes are very different from S3 and S4:
   • Implements message-passing OO: methods belong to classes, not functions.
   • Notation: $ is used to separate objects and methods, so method calls look like canvas$drawRect('blue').

S3
1. About S3:
   • R's first and simplest OO system
   • Only OO system used in the base and stats packages
   • Methods belong to functions, not to objects or classes.
2. Notation:
   • generic.class()
     mean.Date()    Date method for the generic mean()
3. Useful 'Generic' Operations
   • Get all methods that belong to the 'mean' generic:
     - methods('mean')
   • List all generics that have a method for the 'Date' class:
     - methods(class = 'Date')
4. S3 objects are usually built on top of lists, or atomic vectors with attributes.
   • Factor and data frame are S3 classes
   • Useful operations:
     Check if object is an S3 object                    is.object(x) & !isS4(x) or pryr::otype()
     Check if object inherits from a specific class     inherits(x, 'classname')
     Determine class of any object                      class(x)

Base Type (C Structure)
R base types: the internal C-level types that underlie the above OO systems.
• Includes: atomic vectors, list, functions, environments, etc.
• Useful operation: Determine if an object is a base type (Not S3, S4 or RC): is.object(x) returns FALSE
• Internal representation: C structure (or struct) that includes:
  • Contents of the object
  • Memory Management Information
  • Type
    - Access with: typeof()
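To illustrate the generic.class() naming scheme and the S3 checks described above, here is a minimal sketch built around a hypothetical class called `money`. The class, its constructor `new_money` and the generic `describe` are invented for this example.

```r
# Constructor: an S3 object is a base object plus a class attribute
new_money <- function(amount, currency = 'USD') {
  structure(amount, currency = currency, class = 'money')
}

# Method for the existing generic format(), named generic.class
format.money <- function(x, ...) {
  sprintf('%.2f %s', as.numeric(x), attr(x, 'currency'))
}

# A new generic: UseMethod() picks the method from the class of x
describe <- function(x, ...) UseMethod('describe')
describe.money   <- function(x, ...) paste('money:', format(x))
describe.default <- function(x, ...) paste('base type:', typeof(x))

price <- new_money(12.5)

price_text   <- format(price)                    # dispatches to format.money
price_desc   <- describe(price)                  # dispatches to describe.money
vector_desc  <- describe(1:3)                    # falls back to describe.default
is_s3        <- is.object(price) & !isS4(price)  # TRUE for S3 objects
is_money     <- inherits(price, 'money')
storage_type <- typeof(price)                    # the underlying base type
plain_is_obj <- is.object(1:3)                   # FALSE: a plain base type
```

The method belongs to the generic function, not to the object: `price` carries only data and a class attribute, and `describe()` decides which implementation runs.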
Functions

Function Basics
Functions: objects in their own right
All R functions have three parts:
body()           code inside the function
formals()        list of arguments which controls how you can call the function
environment()    "map" of the location of the function's variables (see "Enclosing Environment")

Every operation is a function call
• +, for, if, [, $, { …
• x + y is the same as `+`(x, y)
Note: the backtick (`) lets you refer to functions or variables that have otherwise reserved or illegal names.

Lexical Scoping
What is Lexical Scoping?
• Looks up value of a symbol. (see "Enclosing Environment")
• findGlobals() - lists all the external dependencies of a function
  f <- function() x + 1
  codetools::findGlobals(f)
  > '+' 'x'
  environment(f) <- emptyenv()

Function Arguments
Arguments: passed by reference and copied on modify
1. Arguments are matched first by exact name (perfect matching), then by prefix matching, and finally by position.
2. Check if an argument was supplied: missing()
   i <- function(a, b) {
     missing(a)    # returns TRUE or FALSE
   }
3. Lazy evaluation: since x is not used, stop('This is an error!') never gets evaluated.
   f <- function(x) {
     10
   }
   f(stop('This is an error!')) -> 10
4. Force evaluation
   f <- function(x) {
     force(x)
     10
   }
5. Default arguments evaluation
   f <- function(x = ls()) {
     a <- 1
     x
   }
   f() -> 'a' 'x'    ls() evaluated inside f
   f(ls())           ls() evaluated in global environment

Return Values
• Last expression evaluated or explicit return(). Only use explicit return() when returning early.

Primitive Functions
What are Primitive Functions?
1. Call C code directly with .Primitive() and contain no R code
   print(sum) :
   > function (..., na.rm = FALSE) .Primitive('sum')
2. formals(), body(), and environment() are all NULL
3. Only found in base package
4. More efficient since they operate at a low level

Infix Functions
What are Infix Functions?
1. Function name comes in between its arguments, like + or -
2. All user-created infix functions must start and end with %.
   `%+%` <- function(a, b) paste0(a, b)
   'new' %+% 'string'
3. Useful way of providing a default value in case the output of another function is NULL:
   `%||%` <- function(a, b) if (!is.null(a)) a else b
   function_that_might_return_null() %||% default_value

Replacement Functions
What are Replacement Functions?
1. Act like they modify their arguments in place, and have the special name `xxx<-`
2. Actually create a modified copy. Can use pryr::address() to find the memory address of the underlying object
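The sketch below exercises lazy evaluation, forcing, user-defined infix operators and a replacement function in one place. The names `lazy`, `eager` and the replacement function `second<-` are hypothetical and chosen only for illustration.

```r
# Lazy evaluation: the argument is never used, so stop() never runs
lazy <- function(x) 10
lazy_result <- lazy(stop('never evaluated'))

# force() evaluates the argument immediately, so the error surfaces
eager <- function(x) {
  force(x)
  10
}
eager_result <- tryCatch(
  eager(stop('evaluated now')),
  error = function(e) conditionMessage(e)
)

# User-defined infix functions start and end with %
`%+%`  <- function(a, b) paste0(a, b)
`%||%` <- function(a, b) if (!is.null(a)) a else b
joined   <- 'new' %+% 'string'
fallback <- NULL %||% 'default'
kept     <- 'value' %||% 'default'

# Replacement function: named `xxx<-`, takes x and value, returns the modified copy
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
v <- 1:5
second(v) <- 10L    # shorthand for v <- `second<-`(v, value = 10L)

# missing() reports whether an argument was supplied
was_missing <- (function(a, b) missing(a))()
```

The replacement call only looks like in-place modification: R evaluates `second<-` on a copy and rebinds `v` to the result.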
Subsetting
Subsetting returns a copy of the original data, NOT copy-on modified

Simplifying vs. Preserving Subsetting
1. Simplifying subsetting
   • Returns the simplest possible data structure that can represent the output
2. Preserving subsetting
   • Keeps the structure of the output the same as the input.
   • When you use drop = FALSE, it's preserving

              Simplifying           Preserving
Vector        x[[1]]                x[1]
List          x[[1]]                x[1]
Factor        x[1:4, drop = T]      x[1:4]
Array         x[1, ] or x[, 1]      x[1, , drop = F] or x[, 1, drop = F]
Data frame    x[, 1] or x[[1]]      x[, 1, drop = F] or x[1]

Data Frame Subsetting
Data Frame: possesses the characteristics of both lists and matrices. If you subset with a single vector, they behave like lists; if you subset with two vectors, they behave like matrices.
1. Subset with a single vector: Behave like lists
   df1[c('col1', 'col2')]
2. Subset with two vectors: Behave like matrices
   df1[, c('col1', 'col2')]
The results are the same in the above examples; however, results are different if subsetting with only one column. (see below)
1. Behave like matrices
   str(df1[, 'col1']) -> int [1:3]
   • Result: the result is a vector
2. Behave like lists
   str(df1['col1']) -> 'data.frame'
   • Result: the result remains a data frame of 1 column

Examples
1. Lookup tables (character subsetting)
   x <- c('m', 'f', 'u', 'f', 'f', 'm', 'm')
   lookup <- c(m = 'Male', f = 'Female', u = NA)
   lookup[x]
   >      m        f    u        f        f      m      m
   > 'Male' 'Female'   NA 'Female' 'Female' 'Male' 'Male'
   unname(lookup[x])
   > 'Male' 'Female' NA 'Female' 'Female' 'Male' 'Male'
2. Matching and merging by hand (integer subsetting)
   Lookup table which has multiple columns of information:
   grades <- c(1, 2, 2, 3, 1)
   info <- data.frame(
     grade = 3:1,
     desc = c('Excellent', 'Good', 'Poor'),
     fail = c(F, F, T)
   )
   First Method
   id <- match(grades, info$grade)
   info[id, ]
   Second Method
   rownames(info) <- info$grade
   info[as.character(grades), ]
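The lookup-table and match-by-hand examples, together with the simplifying versus preserving distinction, can be run end to end as follows. This is a sketch: the data frame `df1` is a made-up stand-in, since the sheet never defines it.

```r
# 1. Lookup table via character subsetting
x <- c('m', 'f', 'u', 'f', 'f', 'm', 'm')
lookup <- c(m = 'Male', f = 'Female', u = NA)
labels <- unname(lookup[x])

# 2. Matching and merging by hand
grades <- c(1, 2, 2, 3, 1)
info <- data.frame(
  grade = 3:1,
  desc  = c('Excellent', 'Good', 'Poor'),
  fail  = c(FALSE, FALSE, TRUE)
)

# First method: integer positions from match()
id <- match(grades, info$grade)
by_match <- info[id, ]

# Second method: row names used as a character index
rownames(info) <- info$grade
by_name <- info[as.character(grades), ]

# Simplifying vs. preserving on a data frame (df1 is a made-up example)
df1 <- data.frame(col1 = 1:3, col2 = c('a', 'b', 'c'))
simplified <- df1[, 'col1']                 # plain integer vector
preserved  <- df1[, 'col1', drop = FALSE]   # one-column data frame
list_style <- df1['col1']                   # also a one-column data frame
```

Both methods return one row of `info` per element of `grades`, in the same order as `grades`.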
String manipulation with stringr : : CHEAT SHEET
The stringr package provides a set of internally consistent tools for working with character strings, i.e. sequences of characters surrounded by quotation marks.

str_which(string, pattern)    Find the indexes of strings that contain a pattern match. str_which(fruit, "a")
str_count(string, pattern)    Count the number of matches in a string. str_count(fruit, "a")
str_locate(string, pattern)   Locate the positions of pattern matches in a string. Also str_locate_all. str_locate(fruit, "a")

str_subset(string, pattern)   Return only the strings that contain a pattern match. str_subset(fruit, "b")
str_extract(string, pattern)  Return the first pattern match found in each string, as a vector. Also str_extract_all to return every pattern match. str_extract(fruit, "[aeiou]")
str_match(string, pattern)    Return the first pattern match found in each string, as a matrix with a column for each ( ) group in pattern. Also str_match_all. str_match(sentences, "(a|the) ([^ ]+)")

str_pad(string, width, side = c("left", "right", "both"), pad = " ")    Pad strings to constant width. str_pad(fruit, 17)
str_trunc(string, width, side = c("right", "left", "center"), ellipsis = "...")    Truncate the width of strings, replacing content with ellipsis. str_trunc(fruit, 3)
str_trim(string, side = c("both", "left", "right"))    Trim whitespace from the start and/or end of a string. str_trim(fruit)
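A small session shows what these functions return. The sketch assumes the stringr package is installed and uses a made-up three-element vector `fruits` in place of the package's built-in `fruit` data so that every result is easy to check by eye.

```r
library(stringr)

fruits <- c("apple", "banana", "kiwi")   # made-up input

which_a     <- str_which(fruits, "a")          # indexes of strings containing "a"
count_a     <- str_count(fruits, "a")          # matches per string
with_b      <- str_subset(fruits, "b")         # only the matching strings
first_vowel <- str_extract(fruits, "[aeiou]")  # first match in each string
where_a     <- str_locate(fruits, "a")         # matrix with start and end columns

padded  <- str_pad(fruits, 8)       # pads on the left by default
short   <- str_trunc(fruits, 5)     # width includes the ellipsis
trimmed <- str_trim("  pear  ")
```

Strings without a match give `NA` in `str_extract()` and `str_locate()`, which keeps the output aligned with the input vector.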
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at stringr.tidyverse.org • Diagrams from @LVaudor ! • stringr 1.2.0 • Updated: 2017-10
Need to Know
Pattern arguments in stringr are interpreted as regular expressions after any special characters have been parsed.

In R, you write regular expressions as strings, sequences of characters surrounded by quotes ("") or single quotes ('').

Some characters cannot be represented directly in an R string. These must be represented as special characters, sequences of characters that have a specific meaning, e.g.

Special Character    Represents
\\                   \
\"                   "
\n                   new line
Run ?"'" to see a complete list

Because of this, whenever a \ appears in a regular expression, you must write it as \\ in the string that represents the regular expression.

Use writeLines() to see how R views your string after all special characters have been parsed.
writeLines("\\.")
# \.
writeLines("\\ is a backslash")
# \ is a backslash

INTERPRETATION
Patterns in stringr are interpreted as regexes. To change this default, wrap the pattern in one of:

regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, dotall = FALSE, ...)
Modifies a regex to ignore cases, match end of lines as well as end of strings, allow R comments within regexes, and/or to have . match everything including \n.
str_detect("I", regex("i", TRUE))

fixed()    Matches raw bytes but will miss some characters that can be represented in multiple ways (fast).
str_detect("\u0130", fixed("i"))

coll()    Matches raw bytes and will use locale specific collation rules to recognize characters that can be represented in multiple ways (slow).
str_detect("\u0130", coll("i", TRUE, locale = "tr"))

boundary()    Matches boundaries between characters, line_breaks, sentences, or words.
str_split(sentences, boundary("word"))

Regular Expressions
Regular expressions, or regexps, are a concise language for describing patterns in strings.

MATCH CHARACTERS    see <- function(rx) str_view_all("abc ABC 123\t.!?\\(){}\n", rx)

string         regexp            matches                                      example
(type this)    (to mean this)    (which matches this)
a (etc.)       a (etc.)          a (etc.)                                     see("a")
\\.            \.                .                                            see("\\.")
\\!            \!                !                                            see("\\!")
\\?            \?                ?                                            see("\\?")
\\\\           \\                \                                            see("\\\\")
\\(            \(                (                                            see("\\(")
\\)            \)                )                                            see("\\)")
\\{            \{                {                                            see("\\{")
\\}            \}                }                                            see("\\}")
\\n            \n                new line (return)                            see("\\n")
\\t            \t                tab                                          see("\\t")
\\s            \s                any whitespace (\S for non-whitespaces)      see("\\s")
\\d            \d                any digit (\D for non-digits)                see("\\d")
\\w            \w                any word character (\W for non-word chars)   see("\\w")
\\b            \b                word boundaries                              see("\\b")
               [:digit:] (1)     digits                                       see("[:digit:]")
               [:alpha:] (1)     letters                                      see("[:alpha:]")
               [:lower:] (1)     lowercase letters                            see("[:lower:]")
               [:upper:] (1)     uppercase letters                            see("[:upper:]")
               [:alnum:] (1)     letters and numbers                          see("[:alnum:]")
               [:punct:] (1)     punctuation                                  see("[:punct:]")
               [:graph:] (1)     letters, numbers, and punctuation            see("[:graph:]")
               [:space:] (1)     space characters (i.e. \s)                   see("[:space:]")
               [:blank:] (1)     space and tab (but not new line)             see("[:blank:]")
               .                 every character except a new line            see(".")
(1) Many base R functions require classes to be wrapped in a second set of [ ], e.g. [[:digit:]]

ALTERNATES    alt <- function(rx) str_view_all("abcde", rx)
regexp     matches         example
ab|d       or              alt("ab|d")
[abe]      one of          alt("[abe]")
[^abe]     anything but    alt("[^abe]")
[a-c]      range           alt("[a-c]")

ANCHORS    anchor <- function(rx) str_view_all("aaa", rx)
regexp    matches            example
^a        start of string    anchor("^a")
a$        end of string      anchor("a$")

LOOK AROUNDS    look <- function(rx) str_view_all("bacad", rx)
regexp     matches            example
a(?=c)     followed by        look("a(?=c)")
a(?!c)     not followed by    look("a(?!c)")
(?<=b)a    preceded by        look("(?<=b)a")
(?<!b)a    not preceded by    look("(?<!b)a")

QUANTIFIERS    quant <- function(rx) str_view_all(".a.aa.aaa", rx)
regexp     matches              example
a?         zero or one          quant("a?")
a*         zero or more         quant("a*")
a+         one or more          quant("a+")
a{n}       exactly n            quant("a{2}")
a{n, }     n or more            quant("a{2,}")
a{n, m}    between n and m      quant("a{2,4}")

GROUPS    ref <- function(rx) str_view_all("abbaab", rx)
Use parentheses to set precedence (order of evaluation) and create groups
regexp     matches            example
(ab|d)e    sets precedence    alt("(ab|d)e")

Use an escaped number to refer to and duplicate parentheses groups that occur earlier in a pattern. Refer to each group by its order of appearance
string         regexp            matches                   example
(type this)    (to mean this)    (which matches this)      (the result is the same as ref("abba"))
\\1            \1 (etc.)         first () group, etc.      ref("(a)(b)\\2\\1")
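The escaping rule and several of the pattern types above can be checked with base R alone, which also shows the doubled brackets that base functions need for character classes. This is a sketch on made-up inputs; the look-ahead call passes `perl = TRUE` because R's default regular expression engine does not support look arounds.

```r
# In R source, "\\." is a two-character string: a backslash and a dot
dot_pattern <- "\\."
pattern_length <- nchar(dot_pattern)

x <- c("abc", "a.c", "123", "ab12")   # made-up input

literal_dot  <- grepl("\\.", x)                # escaped dot matches only a real dot
any_char     <- grepl("a.c", x)                # unescaped dot matches any character
has_digit    <- grepl("[[:digit:]]", x)        # class wrapped in a second set of [ ]
only_letters <- grepl("^[[:alpha:]]+$", x)     # anchors plus a quantifier

# Quantifier: runs of two or more a's
s <- ".a.aa.aaa"
runs <- regmatches(s, gregexpr("a{2,}", s))[[1]]

# Look ahead: replace an a only when it is followed by c
lookahead <- gsub("a(?=c)", "A", "bacad", perl = TRUE)

# Back references: (a)(b)\2\1 is the same as abba
backref <- grepl("(a)(b)\\2\\1", "abbaab")
```

Running `writeLines(dot_pattern)` prints `\.`, the regular expression as the regex engine receives it.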
Apply functions with purrr : : CHEAT SHEET

Apply Functions
Map functions apply a function iteratively to each element of a list or vector.

map(.x, .f, …)    Apply a function to each element of a list or vector. map(x, is.logical)
map2(.x, .y, .f, …)    Apply a function to pairs of elements from two lists, vectors. map2(x, y, sum)
pmap(.l, .f, …)    Apply a function to groups of elements from list of lists, vectors. pmap(list(x, y, z), sum, na.rm = TRUE)
invoke_map(.f, .x = list(NULL), …, .env=NULL)    Run each function in a list. Also invoke. l <- list(var, sd); invoke_map(l, x = 1:9)
lmap(.x, .f, ...)    Apply function to each list-element of a list or vector.
imap(.x, .f, ...)    Apply .f to each element of a list or vector and its index.

OUTPUT
map(), map2(), pmap(), imap and invoke_map each return a list. Use a suffixed version to return the results as a specific type of flat vector, e.g. map2_chr, pmap_lgl, etc.
Use walk, walk2, and pwalk to trigger side effects. Each returns its input invisibly.

function    returns
map         list
map_chr     character vector
map_dbl     double (numeric) vector
map_dfc     data frame (column bind)
map_dfr     data frame (row bind)
map_int     integer vector
map_lgl     logical vector
walk        triggers side effects, returns the input invisibly

SHORTCUTS - within a purrr function:
"name" becomes function(x) x[["name"]], e.g. map(l, "a") extracts a from each element of l
~ .x becomes function(x) x, e.g. map(l, ~ 2 +.x) becomes map(l, function(x) 2 + x )
~ .x .y becomes function(.x, .y) .x .y, e.g. map2(l, p, ~ .x +.y ) becomes map2(l, p, function(l, p) l + p )
~ ..1 ..2 etc becomes function(..1, ..2, etc) ..1 ..2 etc, e.g. pmap(list(a, b, c), ~ ..3 + ..1 - ..2) becomes pmap(list(a, b, c), function(a, b, c) c + a - b)

Work with Lists

FILTER LISTS
pluck(.x, ..., .default=NULL)    Select an element by name or index, pluck(x,"b"), or its attribute with attr_getter. pluck(x,"b",attr_getter("n"))
keep(.x, .p, …)    Select elements that pass a logical test. keep(x, is.na)
discard(.x, .p, …)    Select elements that do not pass a logical test. discard(x, is.na)
compact(.x, .p = identity)    Drop empty elements. compact(x)
head_while(.x, .p, …)    Return head elements until one does not pass. Also tail_while. head_while(x, is.character)

SUMMARISE LISTS
every(.x, .p, …)    Do all elements pass a test? every(x, is.character)
some(.x, .p, …)    Do some elements pass a test? some(x, is.character)
has_element(.x, .y)    Does a list contain an element? has_element(x, "foo")
detect(.x, .f, ..., .right=FALSE, .p)    Find first element to pass. detect(x, is.character)
detect_index(.x, .f, ..., .right = FALSE, .p)    Find index of first element to pass. detect_index(x, is.character)
vec_depth(x)    Return depth (number of levels of indexes). vec_depth(x)

TRANSFORM LISTS
modify(.x, .f, ...)    Apply function to each element. Also map, map_chr, map_dbl, map_dfc, map_dfr, map_int, map_lgl. modify(x, ~.+ 2)
modify_at(.x, .at, .f, ...)    Apply function to elements by name or index. Also map_at. modify_at(x, "b", ~.+ 2)
modify_if(.x, .p, .f, ...)    Apply function to elements that pass a test. Also map_if. modify_if(x, is.numeric,~.+2)
modify_depth(.x,.depth,.f,...)    Apply function to each element at a given level of a list. modify_depth(x, 1, ~.+ 2)

RESHAPE LISTS
flatten(.x)    Remove a level of indexes from a list. Also flatten_chr, flatten_dbl, flatten_dfc, flatten_dfr, flatten_int, flatten_lgl. flatten(x)
transpose(.l, .names = NULL)    Transposes the index order in a multi-level list. transpose(x)

JOIN (TO) LISTS
append(x, values, after = length(x))    Add to end of list. append(x, list(d = 1))
prepend(x, values, before = 1)    Add to start of list. prepend(x, list(d = 1))
splice(…)    Combine objects into a list, storing S3 objects as sub-lists. splice(x, y, "foo")

WORK WITH LISTS
array_tree(array, margin = NULL)    Turn array into list. Also array_branch. array_tree(x, margin = 3)
cross2(.x, .y, .filter = NULL)    All combinations of .x and .y. Also cross, cross3, cross_df. cross2(1:3, 4:6)
set_names(x, nm = x)    Set the names of a vector/list directly or with a function. set_names(x, c("p", "q", "r")) set_names(x, tolower)

Reduce Lists
reduce(.x, .f, ..., .init)    Apply function recursively to each element of a list or vector. Also reduce_right, reduce2, reduce2_right. reduce(x, sum)
accumulate(.x, .f, ..., .init)    Reduce, but also return intermediate results. Also accumulate_right. accumulate(x, sum)

Modify function behavior
compose()     Compose multiple functions.
lift()        Change the type of input a function takes. Also lift_dl, lift_dv, lift_ld, lift_lv, lift_vd, lift_vl.
rerun()       Rerun expression n times.
negate()      Negate a predicate function (a pipe friendly !)
partial()     Create a version of a function that has some args preset to values.
safely()      Modify func to return list of results and errors.
quietly()     Modify function to return list of results, output, messages, warnings.
possibly()    Modify function to return default value whenever an error occurs (instead of error).
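The map family, the formula shortcuts, the filter helpers and the reducers can be compared side by side in a short session. The sketch below assumes the purrr package is installed; the lists `x` and `mixed` and the wrapper `safe_log` are made up for illustration.

```r
library(purrr)

x <- list(a = 1:3, b = 4:6, c = 7:9)   # made-up input

# map() returns a list; suffixed versions return flat vectors
as_list <- map(x, sum)
sums    <- map_dbl(x, sum)                       # named double vector

# Formula shortcuts: ~ .x + .y becomes function(.x, .y) .x + .y
paired  <- map2_dbl(1:3, 4:6, ~ .x + .y)
grouped <- pmap_dbl(list(1:3, 4:6, 7:9), ~ ..3 + ..1 - ..2)

# Filter helpers take a predicate function
mixed <- list(1, 'a', 2, 'b')                    # made-up input
chars <- keep(mixed, is.character)
nums  <- discard(mixed, is.character)

# reduce() collapses to one value, accumulate() keeps the intermediate results
total   <- reduce(1:4, `+`)
running <- accumulate(1:4, `+`)

# possibly() swaps an error for a default value
safe_log <- possibly(log, otherwise = NA_real_)
bad_input <- safe_log('x')
good_input <- safe_log(1)
```

Choosing `map_dbl()` over `map()` also acts as a check: it fails if any element of the result is not a single number.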
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at purrr.tidyverse.org • purrr 0.2.3 • Updated: 2017-09
Nested Data "cell" contents
List Column Workflow Nested data frames use a list column, a list that is stored as a
column vector of a data frame. A typical workflow for list columns:
A nested data frame stores
1 2 3
Sepal.L Sepal.W Petal.L Petal.W
individual tables within the 5.1 3.5 1.4 0.2 Make a list Work with Simplify
cells of a larger, organizing 4.9 3.0 1.4 0.2 column list columns the list
4.7 3.2 1.3 0.2
column
S.L S.W P.L P.W
table. 4.6 3.1 1.5 0.2 Species S.L S.W P.L P.W 5.1 3.5 1.4 0.2
Call:
lm(S.L ~ ., df)
setosa 5.1 3.5 1.4 0.2 4.9 3.0 1.4 0.2
5.0 3.6 1.4 0.2 Coefs:
setosa 4.9 3.0 1.4 0.2 4.7 3.2 1.3 0.2
(Int) S.W P.L P.W
n_iris$data[[1]] setosa 4.7 3.2 1.3 0.2 4.6 3.1 1.5 0.2 2.3 0.6 0.2 0.2
setosa 4.6 3.1 1.5 0.2
nested data frame Sepal.L Sepal.W Petal.L Petal.W versi 7.0 3.2 4.7 1.4
Species data S.L S.W P.L P.W
7.0 3.2 4.7 1.4
Species data model Call: Species beta
setos <tibble [50x4]> lm(S.L ~ ., df)
versi 6.4 3.2 4.5 1.5 setosa <tibble [50x4]> <S3: lm> setos 2.35
Species data 7.0 3.2 4.7 1.4 versi <tibble [50x4]> 6.4 3.2 4.5 1.5
versi 6.9 3.1 4.9 1.5 versi <tibble [50x4]> <S3: lm> Coefs: versi 1.89
virgini <tibble [50x4]> 6.9 3.1 4.9 1.5
setosa <tibble [50 x 4]> 6.4 3.2 4.5 1.5 versi 5.5 2.3 4.0 1.3 virgini <tibble [50x4]> <S3: lm> (Int) S.W P.L P.W virgini 0.69
5.5 2.3 4.0 1.3 1.8 0.3 0.9 -0.6
versicolor <tibble [50 x 4]> 6.9 3.1 4.9 1.5 virgini 6.3 3.3 6.0 2.5
virginica <tibble [50 x 4]> 5.5 2.3 4.0 1.3 virgini 5.8 2.7 5.1 1.9 S.L S.W P.L P.W
virgini 7.1 3.0 5.9 2.1 Call:
n_iris 6.5 2.8 4.6 1.5 6.3 3.3 6.0 2.5
lm(S.L ~ ., df)
virgini 6.3 2.9 5.6 1.8 5.8 2.7 5.1 1.9
n_iris$data[[2]] 7.1 3.0 5.9 2.1 Coefs:
(Int) S.W P.L P.W
6.3 2.9 5.6 1.8 0.6 0.3 0.9 -0.1
Sepal.L Sepal.W Petal.L Petal.W n_iris <- iris %>% mod_fun <- function(df) b_fun <- function(mod)
Use a nested data frame to: 6.3 3.3 6.0 2.5 group_by(Species) %>% lm(Sepal.Length ~ ., data = df) coefficients(mod)[[1]]
5.8 2.7 5.1 1.9 nest()
• preserve relationships 7.1 3.0 5.9 2.1 m_iris <- n_iris %>% m_iris %>% transmute(Species,
between observations and 6.3 2.9 5.6 1.8 mutate(model = map(data, mod_fun)) beta = map_dbl(model, b_fun))
subsets of data 6.5 3.0 5.8 2.2
n_iris$data[[3]]
• manipulate many sub-tables 1. MAKE A LIST COLUMN - You can create list columns with functions in the tibble and dplyr packages, as well as tidyr’s nest()
at once with the purrr functions map(), map2(), or pmap().
tibble::tribble(…) tibble::tibble(…) dplyr::mutate(.data, …) Also transmute()
Makes list column when needed Saves list input as list columns Returns list col when result returns list.
Use a two step process to create a nested data frame: tribble( ~max, ~seq, max seq tibble(max = c(3, 4, 5), seq = list(1:3, 1:4, 1:5)) mtcars %>% mutate(seq = map(cyl, seq))
1. Group the data frame into groups with dplyr::group_by() 3, 1:3, 3 <int [3]>
with one row per group S.L S.W P.L P.W 5, 1:5)
5 <int [5]>
tibble::enframe(x, name="name", value="value") dplyr::summarise(.data, …)
5.1 3.5 1.4 0.2 Converts multi-level list to tibble with list cols Returns list col when result is wrapped with list()
Species S.L S.W P.L P.W
setosa 5.1 3.5 1.4 0.2
Species
setosa
S.L S.W P.L P.W
5.1 3.5 1.4 0.2
4.9
4.7
3.0 1.4
3.2 1.3
0.2
0.2
enframe(list('3'=1:3, '4'=1:4, '5'=1:5), 'max', 'seq') mtcars %>% group_by(cyl) %>%
setosa 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 4.6 3.1 1.5 0.2 summarise(q = list(quantile(mpg)))
setosa 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 5.0 3.6 1.4 0.2
setosa 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2
n_iris <- iris %>% group_by(Species) %>% nest()
[Figure: grouped iris rows nested into a Species / data table with one <tibble [50x4]> per species]
2. WORK WITH LIST COLUMNS - Use the purrr functions map(), map2(), and pmap() to apply a function that returns a result element-wise to the cells of a list column. walk(), walk2(), and pwalk() work the same way, but are called for their side effects and return their input invisibly.
purrr::map(.x, .f, ...) Apply .f element-wise to .x as .f(.x)
n_iris %>% mutate(n = map(data, dim))
[Figure: map(data, fun, …) calls fun(<tibble [50x4]>, …) on each cell and returns result 1, result 2, result 3]
purrr::map2(.x, .y, .f, ...) Apply .f element-wise to .x and .y as .f(.x, .y)
m_iris %>% mutate(n = map2(data, model, list))
[Figure: map2(data, model, fun, …) calls fun(<tibble [50x4]>, <S3: lm>, …) on each pair of cells and returns result 1, result 2, result 3]
pmap(list( , , ), fun, …)
m_iris %>% mutate(n = pmap(list(data, model, data), list))
[Figure: pmap(list(data, model, funs), fun, …) calls fun(<tibble [50x4]>, <S3: lm>, coef / AIC / BIC, …) on each set of cells and returns result 1, result 2, result 3]
Unnest a nested data frame with unnest():
n_iris %>% unnest()
tidyr::unnest(data, ..., .drop = NA, .id=NULL, .sep=NULL) Unnests a nested data frame.
[Figure: nested Species / data table expanded back into rows of Species, S.L, S.W, P.L, P.W]
3. SIMPLIFY THE LIST COLUMN (into a regular column)
Use the purrr functions map_lgl(), map_int(), map_dbl(), map_chr(), as well as tidyr’s unnest() to reduce a list column into a regular column.
purrr::map_lgl(.x, .f, ...) Apply .f element-wise to .x, return a logical vector
n_iris %>% transmute(n = map_lgl(data, is.matrix))
purrr::map_int(.x, .f, ...) Apply .f element-wise to .x, return an integer vector
n_iris %>% transmute(n = map_int(data, nrow))
purrr::map_dbl(.x, .f, ...) Apply .f element-wise to .x, return a double vector
n_iris %>% transmute(n = map_dbl(data, nrow))
purrr::map_chr(.x, .f, ...) Apply .f element-wise to .x, return a character vector
n_iris %>% transmute(n = map_chr(data, nrow))
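The three steps above (make a list column, work with it, simplify it) can be sketched end to end as follows. This is a minimal illustration, not the cheat sheet's own code: fit_model is a hypothetical helper, and unnest(data) names the column explicitly because newer tidyr releases expect it.

```r
library(dplyr)
library(tidyr)
library(purrr)

# 1. Make a list column: one row per species, other columns stored in `data`.
n_iris <- iris %>% group_by(Species) %>% nest() %>% ungroup()

# 2. Work with the list column: fit one model per nested table.
# fit_model is a hypothetical helper, not part of purrr.
fit_model <- function(df) lm(Sepal.Length ~ Sepal.Width, data = df)
m_iris <- n_iris %>% mutate(model = map(data, fit_model))

# 3. Simplify: pull one value out of each cell into a regular column.
s_iris <- m_iris %>%
  mutate(n = map_int(data, nrow),
         slope = map_dbl(model, ~ coef(.x)[["Sepal.Width"]]))

# Unnesting restores one row per original observation.
flat <- n_iris %>% unnest(data)
```

map_int() and map_dbl() are used here instead of map() so that n and slope come back as ordinary atomic columns rather than list columns.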
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at purrr.tidyverse.org • purrr 0.2.3 • Updated: 2017-09
Data Import : : CHEAT SHEET
R’s tidyverse is built around tidy data stored in tibbles, which are enhanced data frames.
The front side of this sheet shows how to read text files into R with readr.
The reverse side shows how to create tibbles with tibble and to layout tidy data with tidyr.

OTHER TYPES OF DATA
Try one of the following packages to import other types of files
• haven - SPSS, Stata, and SAS files
• readxl - excel files (.xls and .xlsx)
• DBI - databases
• jsonlite - json
• xml2 - XML
• httr - Web APIs
• rvest - HTML (Web Scraping)

Save Data
Save x, an R object, to path, a file path, as:
Comma delimited file
write_csv(x, path, na = "NA", append = FALSE, col_names = !append)
File with arbitrary delimiter
write_delim(x, path, delim = " ", na = "NA", append = FALSE, col_names = !append)
CSV for excel
write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)
String to file

Read Tabular Data - These functions share the common arguments:
read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = interactive())
Comma Delimited Files
read_csv("file.csv")
To make file.csv run: write_file(x = "a,b,c\n1,2,3\n4,5,NA", path = "file.csv")
Semi-colon Delimited Files
read_csv2("file2.csv")
write_file(x = "a;b;c\n1;2;3\n4;5;NA", path = "file2.csv")
Files with Any Delimiter
read_delim("file.txt", delim = "|")
write_file(x = "a|b|c\n1|2|3\n4|5|NA", path = "file.txt")
Fixed Width Files
read_fwf("file.fwf", col_positions = c(1, 3, 5))
write_file(x = "a b c\n1 2 3\n4 5 NA", path = "file.fwf")
Tab Delimited Files
read_tsv("file.tsv") Also read_table().
write_file(x = "a\tb\tc\n1\t2\t3\n4\t5\tNA", path = "file.tsv")
[Figure: each example file (rows a,b,c / 1,2,3 / 4,5,NA in the given delimiter) is read into a table with columns A, B, C and rows 1 2 3 and 4 5 NA]

USEFUL ARGUMENTS
Example file
write_file("a,b,c\n1,2,3\n4,5,NA","file.csv")
f <- "file.csv"
Skip lines
read_csv(f, skip = 1)
No header
read_csv(f, col_names = FALSE)
Read in a subset
read_csv(f, n_max = 1)
Provide header
read_csv(f, col_names = c("x", "y", "z"))
Missing Values

Data types
readr functions guess the types of each column and convert types when appropriate (but will NOT convert strings to factors automatically).
A message shows the type of each column in the result.
## Parsed with column specification:
## cols(
## age = col_integer(),
## sex = col_character(),
## earn = col_double()
## )
age is an integer, sex is a character, earn is a double (numeric)
1. Use problems() to diagnose problems.
x <- read_csv("file.csv"); problems(x)
2. Use a col_ function to guide parsing.
• col_guess() - the default
• col_character()
• col_double(), col_euro_double()
• col_datetime(format = "") Also col_date(format = ""), col_time(format = "")
• col_factor(levels, ordered = FALSE)
• col_integer()
• col_logical()
• col_number(), col_numeric()
• col_skip()
x <- read_csv("file.csv", col_types = cols(
A = col_double(),
B = col_logical(),
C = col_factor()))
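As a hedged sketch of how the reading arguments and col_ functions combine, the snippet below writes the sample file to a temporary path (an assumption made here so nothing is left in the working directory) and reads it back several ways. The column types chosen are illustrative only.

```r
library(readr)

# Hypothetical scratch file holding the sample data used in the examples.
f <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,NA"), f)

# Guide parsing with explicit col_ functions instead of relying on guessing.
x <- read_csv(f, col_types = cols(
  a = col_integer(),
  b = col_double(),
  c = col_character()
))

# Read only the first data row; cols() keeps guessing on but stays quiet.
first_row <- read_csv(f, n_max = 1, col_types = cols())

# Skip the header line and provide column names instead.
renamed <- read_csv(f, skip = 1, col_names = c("x", "y", "z"),
                    col_types = cols())

# problems() returns a tibble describing any parsing failures.
problems(x)
```

With the default na = c("", "NA"), the literal NA in the last row is read as a missing value even though column c is parsed as character.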
[Figure: a large table shown two ways. tibble display prints the first rows (9 audi a4 quattro 1.8, 10 audi a4 quattro 2.0) followed by "# ... with 224 more rows, and 3 more variables: year <int>, cyl <int>, trans <chr>". data frame display prints rows 156 to 166 and ends with "[ reached getOption("max.print") -- omitted 68 rows ]"]
• Control the default appearance with options:
options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf)
• View full data set with View() or glimpse()
• Revert to data frame with as.data.frame()

CONSTRUCT A TIBBLE IN TWO WAYS
tibble(…) Construct by columns.
tibble(x = 1:3, y = c("a", "b", "c"))
tribble(…) Construct by rows.
tribble( ~x, ~y,
1, "a",
2, "b",
3, "c")
Both make this tibble:
A tibble: 3 × 2
      x y
  <int> <chr>
1     1 a
2     2 b
3     3 c
as_tibble(x, …) Convert data frame to tibble.
enframe(x, name = "name", value = "value") Convert named vector to a tibble
is_tibble(x) Test whether x is a tibble.

gather(table4a, `1999`, `2000`, key = "year", value = "cases")
gather() moves column names into a key column, gathering the column values into a single value column.
[Figure: table4a (country, 1999, 2000) gathered into country, year, cases]
spread(table2, type, count)
spread() moves the unique values of a key column into the column names, spreading the values of a value column across the new columns.
[Figure: table2 (country, year, type, count) spread into country, year, cases, pop]

separate(table3, rate, into = c("cases", "pop"))
[Figure: table3 (country, year, rate), with rate values such as 0.7K/19M, separated into country, year, cases, pop]
separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
Separate each cell in a column to make several rows. Also separate_rows_().
separate_rows(table3, rate)
[Figure: table3 (country, year, rate) with each rate cell split into two rows]
unite(data, col, ..., sep = "_", remove = TRUE)
Collapse cells across several columns to make a single column.
unite(table5, century, year, col = "year", sep = "")
[Figure: table5 (country, century, year) united into country, year, e.g. Afghan 19 99 becomes Afghan 1999]

Handle Missing Values
drop_na(data, ...) Drop rows containing NA’s in … columns.
drop_na(x, x2)
fill(data, ..., .direction = c("down", "up")) Fill in NA’s in … columns with most recent non-NA values.
fill(x, x2)
replace_na(data, replace = list(), ...) Replace NA’s by column.
replace_na(x, list(x2 = 2))
[Figure: x has columns x1, x2 with rows A 1, B NA, C NA, D 3, E NA. drop_na keeps A 1 and D 3; fill gives A 1, B 1, C 1, D 3, E 3; replace_na gives A 1, B 2, C 2, D 3, E 2]

Expand Tables - quickly create tables with combinations of values
complete(data, ..., fill = list())
Adds to the data missing combinations of the values of the variables listed in …
complete(mtcars, cyl, gear, carb)
expand(data, ...)
Create new tibble with all possible combinations of the values of the variables listed in …
expand(mtcars, cyl, gear, carb)
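The reshaping and missing-value helpers above can be tried on small hand-made tibbles. The sketch below is illustrative only: t4a and t3 are hypothetical stand-ins for table4a and table3, using plain numbers rather than values like "0.7K".

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for table4a: one column per year.
t4a <- tibble(country = c("A", "B"), `1999` = c(1, 3), `2000` = c(2, 4))

# gather() stacks the year columns; spread() reverses it.
long <- gather(t4a, `1999`, `2000`, key = "year", value = "cases")
wide <- spread(long, year, cases)

# Hypothetical stand-in for table3: two values packed into one cell.
t3 <- tibble(country = c("A", "B"), rate = c("1/10", "3/30"))
split_cols <- separate(t3, rate, into = c("cases", "pop"))
split_rows <- separate_rows(t3, rate)

# Missing-value helpers on a small table.
x <- tibble(x1 = c("A", "B", "C"), x2 = c(1, NA, 3))
dropped  <- drop_na(x, x2)
filled   <- fill(x, x2)
replaced <- replace_na(x, list(x2 = 2))
```

gather() and spread() are kept here because they are the functions this sheet documents; later tidyr releases also offer pivot_longer() and pivot_wider() for the same job.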
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at tidyverse.org • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2017-01
Data Transformation with dplyr : : CHEAT SHEET
dplyr functions work with pipes and expect tidy data. In tidy data: each variable is in its own column, and each observation, or case, is in its own row. With pipes, x %>% f(y) becomes f(x, y).

Summarise Cases
summarise(.data, …) Compute table of summaries.
summarise(mtcars, avg = mean(mpg))
Count number of rows in each group defined by the variables in … Also tally().
count(iris, Species)
VARIATIONS
summarise_all() - Apply funs to every column.
summarise_at() - Apply funs to specific columns.
summarise_if() - Apply funs to all cols of one type.

Group Cases
Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results.
mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg))

Manipulate Cases
EXTRACT CASES
Row functions return a subset of rows as a new table.
filter(.data, …) Extract rows that meet logical criteria.
filter(iris, Sepal.Length > 7)
sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select fraction of rows.
sample_frac(iris, 0.5, replace = TRUE)
sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame())
slice(.data, …) Select rows by position.
slice(iris, 10:15)
top_n(x, n, wt) Select and order top n entries (by group if grouped data).
top_n(iris, 5, Sepal.Width)
Logical and boolean operators to use with filter(): <, >, <=, >=, is.na(), !is.na(), %in%, !, |, &, xor()
See ?base::Logic and ?Comparison for help.
arrange(.data, …) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
arrange(mtcars, mpg)

Manipulate Variables
EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.
pull(.data, var = -1) Extract column values as a vector. Choose by name or index.
pull(iris, Sepal.Length)
Use these helpers with select(), e.g. select(iris, starts_with("Sepal"))
contains(match)
ends_with(match)
num_range(prefix, range)
one_of(…)
:, e.g. mpg:cyl
-, e.g. -Species
MAKE NEW VARIABLES
These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back).
mutate(.data, …) Compute new column(s).
mutate(mtcars, gpm = 1/mpg)
transmute(.data, …) Compute new column(s), drop others.
transmute(mtcars, gpm = 1/mpg)
mutate_all(faithful, funs(log(.), log2(.)))
mutate_if(iris, is.numeric, funs(log(.)))
mutate_at(.tbl, .cols, .funs, …) Apply funs to specific columns.
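A short sketch of how these verbs fit together, using the same built-in data sets as the examples above. The object names (by_cyl, long_sepals, and so on) are made up for illustration.

```r
library(dplyr)

# Group, then summarise: one row per value of cyl.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg))

# Extract cases: rows that meet a logical criterion.
long_sepals <- filter(iris, Sepal.Length > 7)

# Arrange cases from high to low with desc().
ordered <- arrange(mtcars, desc(mpg))

# Make new variables: mutate() keeps other columns, transmute() drops them.
with_gpm <- mutate(mtcars, gpm = 1 / mpg)
only_gpm <- transmute(mtcars, gpm = 1 / mpg)

# Extract a variable as a plain vector.
sepal_len <- pull(iris, Sepal.Length)
```

Each verb takes a table as its first argument and returns a new table (or, for pull(), a vector), which is what lets them chain with %>%.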
vectorized function
MISC
dplyr::between() - x >= left & x <= right
dplyr::near() - safe == for floating point numbers
dplyr::case_when() - multi-case if_else()
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()
dplyr::recode() - Vectorized switch()
dplyr::recode_factor() - Vectorized switch() for factors

summary function
sd() - standard deviation
var() - variance

Row Names
Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.
rownames_to_column() Move row names into col.
a <- rownames_to_column(iris, var = "C")
column_to_rownames() Move col into row names.
column_to_rownames(a, var = "C")

[Figure: example tables x (A, B, C) with rows a t 1, b u 2, c v 3, and y (A, B, D) with rows a t 3, b u 2, d w 1]
Use bind_cols() to paste tables beside each other as they are.
Use by = c("col1", "col2", …) to specify one or more common columns to match on.
left_join(x, y, by = "A")
Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
left_join(x, y, by = c("C" = "D"))
Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))

EXTRACT ROWS
Use a "Filtering Join" to filter one table against the rows of another.
semi_join(x, y, by = NULL, …) Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED.
anti_join(x, y, by = NULL, …) Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED.
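The joins and row-name helpers can be checked on the two small example tables from the figures. This is a sketch under the assumption that x and y hold exactly those rows; the result names (joined, kept, dropped, cars) are hypothetical.

```r
library(dplyr)
library(tibble)

# The example tables x and y from the join figures.
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1,
                stringsAsFactors = FALSE)

# Mutating join: keep every row of x, add matching columns from y.
# B exists in both tables but is not a key, so it gets .x / .y suffixes.
joined <- left_join(x, y, by = "A")

# Filtering joins: keep or drop rows of x depending on a match in y.
kept    <- semi_join(x, y, by = "A")
dropped <- anti_join(x, y, by = "A")

# Move row names into a column and back again.
cars <- rownames_to_column(mtcars, var = "model")
back <- column_to_rownames(cars, var = "model")
```

Running semi_join() and anti_join() before a mutating join is a cheap way to see which rows will and will not find a partner.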
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2017-03
R Markdown : : CHEAT SHEET File path to output document
based slides, Notebooks, and more.
Workflow
1 Open a new .Rmd file at File ▶ New File ▶ R Markdown. Use the wizard that opens to pre-populate the file with a template
2 Write document by editing template
Parameters
Parameterize your documents to reuse with different inputs (e.g., data, values, etc.)
1. Add parameters · Create and set parameters in the header as sub-values of params
---
params:
  n: 100
  d: !r Sys.Date()
---
2. Call parameters · Call parameter values in code as params$<name>
Today’s date is `r params$d`
3. Set parameters · Set values with Knit with parameters or the params argument of render():
render("doc.Rmd", params = list(n = 1, d = as.Date("2015-01-01")))
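The parameter workflow can be sketched as a small script that writes a parameterized document and renders it with overridden values. The file location and the guard around render() are assumptions made here so the snippet can run even where rmarkdown or pandoc is not installed.

```r
# A minimal parameterized document, written to a hypothetical scratch file.
rmd <- c(
  "---",
  "params:",
  "  n: 100",
  "  d: !r Sys.Date()",
  "---",
  "",
  "Today's date is `r params$d` and n is `r params$n`."
)
doc <- file.path(tempdir(), "doc.Rmd")
writeLines(rmd, doc)

# Render with new parameter values only if rmarkdown and pandoc are available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(
    doc,
    output_format = "html_document",
    params = list(n = 1, d = as.Date("2015-01-01")),
    quiet = TRUE
  )
}
```

Values passed through params replace the defaults from the YAML header for that render only; the .Rmd file itself is unchanged.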
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at rmarkdown.rstudio.com • rmarkdown 1.6 • Updated: 2016-02
Pandoc’s Markdown
Write with syntax on the left to create effect on right (after render)
Plain text
End a line with two spaces
to start a new paragraph.
*italics* and **bold**
`verbatim code`
sub/superscript^2^~2~
~~strikethrough~~
escaped: \* \_ \\
endash: --, emdash: ---
equation: $A = \pi*r^{2}$
equation block:
$$E = mc^{2}$$
> block quote
# Header1 {#anchor}
## Header 2 {#css_id}
### Header 3 {.css_class}
#### Header 4
##### Header 5
###### Header 6
<!--Text comment-->
\textbf{Tex ignored in HTML}
<em>HTML ignored in pdfs</em>
<http://www.rstudio.com>
[link](www.rstudio.com)
Jump to [Header 1](#anchor)
image:
* unordered list
    + sub-item 1
    + sub-item 2
        - sub-sub-item 1
* item 2
    Continued (indent 4 spaces)
1. ordered list
2. item 2
    i) sub-item 1
        A. sub-sub-item 1
(@) A list whose numbering
continues after
(@) an interruption
Term 1
: Definition 1

Set render options with YAML
When you render, R Markdown
1. runs the R code, embeds results and text into .md file with knitr
2. then converts the .md file into the finished format with pandoc
Set a document’s default output format in the YAML header:
---
output: html_document
---
# Body
output value creates
html_document html
pdf_document pdf (requires Tex )
word_document Microsoft Word (.docx)
odt_document OpenDocument Text
rtf_document Rich Text Format
md_document Markdown
github_document Github compatible markdown
ioslides_presentation ioslides HTML slides
slidy_presentation slidy HTML slides
beamer_presentation Beamer pdf slides (requires Tex)
Customize output with sub-options (listed to the right). Indent the output value 2 spaces and its sub-options 4 spaces:
---
output:
  html_document:
    code_folding: hide
    toc_float: TRUE
---
# Body
html tabsets
Use tabset css class to place sub-headers into tabs
# Tabset {.tabset .tabset-fade .tabset-pills}
## Tab 1
text 1
## Tab 2
text 2
### End tabset

sub-option description (one X per supporting output format: html, pdf, word, odt, rtf, md, github, ioslides, slidy, beamer)
citation_package The LaTeX package to process citations, natbib, biblatex or none X X X
code_folding Let readers toggle the display of R code, "none", "hide", or "show" X
colortheme Beamer color theme to use X
css CSS file to use to style document X X X
dev Graphics device to use for figure output (e.g. "png") X X X X X X X
duration Add a countdown timer (in minutes) to footer of slides X
fig_caption Should figures be rendered with captions? X X X X X X X
fig_height, fig_width Default figure height and width (in inches) for document X X X X X X X X X X
highlight Syntax highlighting: "tango", "pygments", "kate", "zenburn", "textmate" X X X X X
includes File of content to place in document (in_header, before_body, after_body) X X X X X X X X
incremental Should bullets appear one at a time (on presenter mouse clicks)? X X X
keep_md Save a copy of .md file that contains knitr output X X X X X X
keep_tex Save a copy of .tex file that contains knitr output X X
latex_engine Engine to render latex, "pdflatex", "xelatex", or "lualatex" X X
lib_dir Directory of dependency files to use (Bootstrap, MathJax, etc.) X X X
mathjax Set to local or a URL to use a local/URL version of MathJax to render equations X X X
md_extensions Markdown extensions to add to default definition or R Markdown X X X X X X X X X X
number_sections Add section numbering to headers X X
pandoc_args Additional arguments to pass to Pandoc X X X X X X X X X X
preserve_yaml Preserve YAML front matter in final document? X
reference_docx docx file whose styles should be copied when producing docx output X
self_contained Embed dependencies into the doc X X X
slide_level The lowest heading level that defines individual slides X
smaller Use the smaller font size in the presentation? X
smart Convert straight quotes to curly, dashes to em-dashes, … to ellipses, etc. X X X
template Pandoc template to use when rendering file X X X X X
theme Bootswatch or Beamer theme to use for page X X
toc Add a table of contents at start of document X X X X X X X
toc_depth The lowest level of headings to add to table of contents X X X X X X
toc_float Float the table of contents to the left of the main content X
snippets to quickly use Navigate Open in Export Delete Delete
common blocks of code. recent plots window plot plot all plots
RStudio recognizes that files named app.R,
server.R, ui.R, and global.R belong to a shiny app Jump to function in file Change file type GUI Package manager lists every installed package
Create Upload Delete Rename Change
folder file file file directory
Install Update Create reproducible package
Run Choose Publish to Manage Path to displayed directory Packages Packages library for your project
app location to shinyapps.io publish Working Maximize,
view app or server accounts Directory minimize panes
Press Up to see Drag pane A File browser keyed to your working directory.
command history boundaries Click on file or directory name to open. Click to load package with Package Delete
library(). Unclick to detach version from
Highlighted
Package Writing
line shows Stop Shiny Publish to shinyapps.io, Refresh
where app rpubs, RSConnect, …
execution has
paused File > New Project > View(<data>) opens spreadsheet like view of data set
New Directory > R Package
Run commands in Examine variables Select function Step through Step into and Resume Quit debug Turn project into package,
environment where in executing in traceback to code one line out of functions execution mode Enable roxygen documentation with
execution has paused environment debug at a time to run Tools > Project Options > Build Tools
Roxygen guide at Filter rows by value Sort by Search
Help > Roxygen Quick Reference or value range values for value
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at www.rstudio.com • RStudio IDE 0.99.832 • Updated: 2016-01
1 LAYOUT Windows/Linux Mac 4 WRITE CODE Windows /Linux Mac WHY RSTUDIO SERVER PRO?
Move focus to Source Editor Ctrl+1 Ctrl+1 Attempt completion Tab or Ctrl+Space Tab or Cmd+Space RSP extends the open source server with a
Move focus to Console Ctrl+2 Ctrl+2 Navigate candidates Up/Down Up/Down commercial license, support, and more:
Accept candidate Enter, Tab, or Right Enter, Tab, or Right
Move focus to Help Ctrl+3 Ctrl+3 • open and run multiple R sessions at once
Show History Ctrl+4 Ctrl+4 Dismiss candidates Esc Esc
Undo Ctrl+Z Cmd+Z
• tune your resources to improve performance
Show Files Ctrl+5 Ctrl+5 • edit the same project at the same time as others
Show Plots Ctrl+6 Ctrl+6 Redo Ctrl+Shift+Z Cmd+Shift+Z
Show Packages Ctrl+7 Ctrl+7 Cut Ctrl+X Cmd+X • see what you and others are doing on your server
Show Environment Ctrl+8 Ctrl+8 Copy Ctrl+C Cmd+C • switch easily from one version of R to a different version
Show Git/SVN Ctrl+9 Ctrl+9 Paste Ctrl+V Cmd+V • integrate with your authentication, authorization, and audit practices
Show Build Ctrl+0 Ctrl+0 Select All Ctrl+A Cmd+A Download a free 45 day evaluation at
Delete Line Ctrl+D Cmd+D www.rstudio.com/products/rstudio-server-pro/
2 RUN CODE Windows/Linux Mac Select Shift+[Arrow] Shift+[Arrow]
Search command history Ctrl+Up Cmd+Up Select Word Ctrl+Shift+Left/Right Option+Shift+Left/Right 5 DEBUG CODE Windows/Linux Mac
Navigate command history Up/Down Up/Down Select to Line Start Alt+Shift+Left Cmd+Shift+Left Toggle Breakpoint Shift+F9 Shift+F9
Move cursor to start of line Home Cmd+Left Select to Line End Alt+Shift+Right Cmd+Shift+Right Execute Next Line F10 F10
Move cursor to end of line End Cmd+Right Select Page Up/Down Shift+PageUp/Down Shift+PageUp/Down Step Into Function Shift+F4 Shift+F4
Change working directory Ctrl+Shift+H Ctrl+Shift+H Select to Start/End Shift+Alt+Up/Down Cmd+Shift+Up/Down Finish Function/Loop Shift+F6 Shift+F6
Interrupt current command Esc Esc Delete Word Left Ctrl+Backspace Ctrl+Opt+Backspace Continue Shift+F5 Shift+F5
Clear console Ctrl+L Ctrl+L Delete Word Right Option+Delete Stop Debugging Shift+F8 Shift+F8
Quit Session (desktop only) Ctrl+Q Cmd+Q Delete to Line End Ctrl+K
Restart R Session Ctrl+Shift+F10 Cmd+Shift+F10 Delete to Line Start Option+Backspace Windows/Linux
6 VERSION CONTROL Mac
Run current line/selection Ctrl+Enter Cmd+Enter Indent Tab (at start of line) Tab (at start of line) Show diff Ctrl+Alt+D Ctrl+Option+D
Run current (retain cursor) Alt+Enter Option+Enter Outdent Shift+Tab Shift+Tab Commit changes Ctrl+Alt+M Ctrl+Option+M
Run from current to end Ctrl+Alt+E Cmd+Option+E Yank line up to cursor Ctrl+U Ctrl+U Scroll diff view Ctrl+Up/Down Ctrl+Up/Down
Run the current function Ctrl+Alt+F Cmd+Option+F Yank line after cursor Ctrl+K Ctrl+K Stage/Unstage (Git) Spacebar Spacebar
Source a file
definition Ctrl+Alt+G Cmd+Option+G Insert yanked text Ctrl+Y Ctrl+Y Stage/Unstage and move to next Enter Enter
Source the current file Ctrl+Shift+S Cmd+Shift+S Insert <- Alt+- Option+-
Source with echo Ctrl+Shift+Enter Cmd+Shift+Enter Insert %>% Ctrl+Shift+M Cmd+Shift+M
7 MAKE PACKAGES Windows/Linux Mac
Show help for function F1 F1
3 NAVIGATE CODE Windows /Linux Mac Build and Reload Ctrl+Shift+B Cmd+Shift+B
Show source code F2 F2
Goto File/Function Ctrl+. Ctrl+. Load All (devtools) Ctrl+Shift+L Cmd+Shift+L
New document Ctrl+Shift+N Cmd+Shift+N
Fold Selected Alt+L Cmd+Option+L Test Package (Desktop) Ctrl+Shift+T Cmd+Shift+T
New document (Chrome) Ctrl+Alt+Shift+N Cmd+Shift+Opt+N
Unfold Selected Shift+Alt+L Cmd+Shift+Option+L Test Package (Web) Ctrl+Alt+F7 Cmd+Opt+F7
Open document Ctrl+O Cmd+O
Fold All Alt+O Cmd+Option+O Check Package Ctrl+Shift+E Cmd+Shift+E
Save document Ctrl+S Cmd+S
Unfold All Shift+Alt+O Cmd+Shift+Option+O Close document Ctrl+W Cmd+W Document Package Ctrl+Shift+D Cmd+Shift+D
Go to line Shift+Alt+G Cmd+Shift+Option+G Close document (Chrome) Ctrl+Alt+W Cmd+Option+W
Jump to Shift+Alt+J Cmd+Shift+Option+J Close all documents Ctrl+Shift+W Cmd+Shift+W 8 DOCUMENTS AND APPS Windows/Linux Mac
Switch to tab Ctrl+Shift+. Ctrl+Shift+. Extract function Ctrl+Alt+X Cmd+Option+X Preview HTML (Markdown, etc.) Ctrl+Shift+K Cmd+Shift+K
Previous tab Ctrl+F11 Ctrl+F11 Extract variable Ctrl+Alt+V Cmd+Option+V Knit Document (knitr) Ctrl+Shift+K Cmd+Shift+K
Next tab Ctrl+F12 Ctrl+F12 Reindent lines Ctrl+I Cmd+I Compile Notebook Ctrl+Shift+K Cmd+Shift+K
First tab Ctrl+Shift+F11 Ctrl+Shift+F11 (Un)Comment lines Ctrl+Shift+C Cmd+Shift+C Compile PDF (TeX and Sweave) Ctrl+Shift+K Cmd+Shift+K
Last tab Ctrl+Shift+F12 Ctrl+Shift+F12 Reflow Comment Ctrl+Shift+/ Cmd+Shift+/ Insert chunk (Sweave and Knitr) Ctrl+Alt+I Cmd+Option+I
Navigate back Ctrl+F9 Cmd+F9 Reformat Selection Ctrl+Shift+A Cmd+Shift+A Insert code section Ctrl+Shift+R Cmd+Shift+R
Navigate forward Ctrl+F10 Cmd+F10 Select within braces Ctrl+Shift+E Ctrl+Shift+E Re-run previous region Ctrl+Shift+P Cmd+Shift+P
Jump to Brace Ctrl+P Ctrl+P Show Diagnostics Ctrl+Shift+Alt+P Cmd+Shift+Opt+P Run current document Ctrl+Alt+R Cmd+Option+R
Select within Braces Ctrl+Shift+Alt+E Ctrl+Shift+Option+E Transpose Letters Ctrl+T
Run from start to current line Ctrl+Alt+B Cmd+Option+B
Use Selection for Find Ctrl+F3 Cmd+E Move Lines Up/Down Alt+Up/Down Option+Up/Down
Run the current code section Ctrl+Alt+T Cmd+Option+T
Find in Files Ctrl+Shift+F Cmd+Shift+F Copy Lines Up/Down Shift+Alt+Up/Down Cmd+Option+Up/Down
Run previous Sweave/Rmd code Ctrl+Alt+P Cmd+Option+P
Find Next Win: F3, Linux: Ctrl+G Cmd+G Add New Cursor Above Ctrl+Alt+Up Ctrl+Option+Up
Find Previous W: Shift+F3, L: Ctrl+Shift+G Cmd+Shift+G Add New Cursor Below Ctrl+Alt+Down Ctrl+Option+Down Run the current chunk Ctrl+Alt+C Cmd+Option+C
Jump to Word Ctrl+Left/Right Option+Left/Right Move Active Cursor Up Ctrl+Alt+Shift+Up Ctrl+Option+Shift+Up Run the next chunk Ctrl+Alt+N Cmd+Option+N
Jump to Start/End Ctrl+Up/Down Cmd+Up/Down Move Active Cursor Down Ctrl+Alt+Shift+Down Ctrl+Opt+Shift+Down Sync Editor & PDF Preview Ctrl+F8 Cmd+F8
Toggle Outline Ctrl+Shift+O Cmd+Shift+O Find and Replace Ctrl+F Cmd+F Previous plot Ctrl+Alt+F11 Cmd+Option+F11
Use Selection for Find Ctrl+F3 Cmd+E Next plot Ctrl+Alt+F12 Cmd+Option+F12
Replace and Find Ctrl+Shift+J Cmd+Shift+J Show Keyboard Shortcuts Alt+Shift+K Option+Shift+K
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at www.rstudio.com • RStudio IDE 0.1.0 • Updated: 2017-09
Data Visualization with ggplot2 : : CHEAT SHEET
Basics Geoms Use a geom function to represent data points, use the geom’s aesthetic properties to represent variables.
Each function returns a layer.
GRAPHICAL PRIMITIVES TWO VARIABLES
ggplot2 is based on the grammar of graphics, the idea
that you can build every graph from the same a <- ggplot(economics, aes(date, unemploy)) continuous x , continuous y continuous bivariate distribution
components: a data set, a coordinate system, b <- ggplot(seals, aes(x = long, y = lat)) h <- ggplot(diamonds, aes(carat, price))
e <- ggplot(mpg, aes(cty, hwy))
and geoms—visual marks that represent data points. a + geom_blank()
e + geom_label(aes(label = cty), nudge_x = 1, h + geom_bin2d(binwidth = c(0.25, 500))
(Useful for expanding limits) nudge_y = 1, check_overlap = TRUE) x, y, label, x, y, alpha, color, fill, linetype, size, weight
F M A alpha, angle, color, family, fontface, hjust,
b + geom_curve(aes(yend = lat + 1,
lineheight, size, vjust
+ = xend=long+1,curvature=z)) - x, xend, y, yend,
alpha, angle, color, curvature, linetype, size e + geom_jitter(height = 2, width = 2)
h + geom_density2d()
x, y, alpha, colour, group, linetype, size
x, y, alpha, color, fill, shape, size
data geom coordinate plot a + geom_path(lineend="butt", linejoin="round", h + geom_hex()
x=F·y=A system linemitre=1)
x, y, alpha, colour, fill, size
e + geom_point(), x, y, alpha, color, fill, shape,
x, y, alpha, color, group, linetype, size size, stroke
To display values, map variables in the data to visual a + geom_polygon(aes(group = group))
e + geom_quantile(), x, y, alpha, color, group,
properties of the geom (aesthetics) like size, color, and x x, y, alpha, color, fill, group, linetype, size linetype, size, weight
continuous function
and y locations. i <- ggplot(economics, aes(date, unemploy))
b + geom_rect(aes(xmin = long, ymin=lat, xmax=
F M A long + 1, ymax = lat + 1)) - xmax, xmin, ymax, e + geom_rug(sides = "bl"), x, y, alpha, color, i + geom_area()
ymin, alpha, color, fill, linetype, size x, y, alpha, color, fill, linetype, size
+ =
linetype, size
a + geom_ribbon(aes(ymin=unemploy - 900, e + geom_smooth(method = lm), x, y, alpha, i + geom_line()
ymax=unemploy + 900)) - x, ymax, ymin, color, fill, group, linetype, size, weight x, y, alpha, color, group, linetype, size
data geom coordinate plot alpha, color, fill, group, linetype, size
x=F·y=A system
color = F e + geom_text(aes(label = cty), nudge_x = 1, i + geom_step(direction = "hv")
size = A nudge_y = 1, check_overlap = TRUE), x, y, label, x, y, alpha, color, group, linetype, size
alpha, angle, color, family, fontface, hjust,
LINE SEGMENTS lineheight, size, vjust
common aesthetics: x, y, alpha, color, linetype, size
b + geom_abline(aes(intercept=0, slope=1)) visualizing error
Complete the template below to build a graph. b + geom_hline(aes(yintercept = lat)) df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
required b + geom_vline(aes(xintercept = long)) discrete x , continuous y j <- ggplot(df, aes(grp, fit, ymin = fit-se, ymax = fit+se))
ggplot (data = <DATA> ) + f <- ggplot(mpg, aes(class, hwy))
b + geom_segment(aes(yend=lat+1, xend=long+1)) j + geom_crossbar(fatten = 2)
<GEOM_FUNCTION> (mapping = aes( <MAPPINGS> ), x, y, ymax, ymin, alpha, color, fill, group, linetype,
b + geom_spoke(aes(angle = 1:1155, radius = 1)) f + geom_col(), x, y, alpha, color, fill, group,
stat = <STAT> , position = <POSITION> ) + Not
linetype, size size
<COORDINATE_FUNCTION> + required,
sensible j + geom_errorbar(), x, ymax, ymin, alpha, color,
f + geom_boxplot(), x, y, lower, middle, upper, group, linetype, size, width (also
<FACET_FUNCTION> + defaults
supplied ONE VARIABLE continuous ymax, ymin, alpha, color, fill, group, linetype, geom_errorbarh())
<SCALE_FUNCTION> + shape, size, weight
c <- ggplot(mpg, aes(hwy)); c2 <- ggplot(mpg)
j + geom_linerange()
<THEME_FUNCTION> f + geom_dotplot(binaxis = "y", stackdir = x, ymin, ymax, alpha, color, group, linetype, size
c + geom_area(stat = "bin")
"center"), x, y, alpha, color, fill, group
x, y, alpha, color, fill, linetype, size j + geom_pointrange()
ggplot(data = mpg, aes(x = cty, y = hwy)) Begins a plot f + geom_violin(scale = "area"), x, y, alpha, color, x, y, ymin, ymax, alpha, color, fill, group, linetype,
that you finish by adding layers to. Add one geom c + geom_density(kernel = "gaussian")
fill, group, linetype, size, weight shape, size
function per layer.
x, y, alpha, color, fill, group, linetype, size, weight
aesthetic mappings data geom
c + geom_dotplot()
maps
qplot(x = cty, y = hwy, data = mpg, geom = “point") x, y, alpha, color, fill data <- data.frame(murder = USArrests$Murder,
Creates a complete plot with given data, geom, and discrete x , discrete y state = tolower(rownames(USArrests)))
mappings. Supplies many useful defaults. c + geom_freqpoly() x, y, alpha, color, group, g <- ggplot(diamonds, aes(cut, color)) map <- map_data("state")
linetype, size k <- ggplot(data, aes(fill = murder))
last_plot() Returns the last plot g + geom_count(), x, y, alpha, color, fill, shape, k + geom_map(aes(map_id = state), map = map)
c + geom_histogram(binwidth = 5) x, y, alpha,
ggsave("plot.png", width = 5, height = 5) Saves last plot color, fill, linetype, size, weight size, stroke + expand_limits(x = map$long, y = map$lat),
as a 5 in x 5 in file named "plot.png" in the working directory. map_id, alpha, color, fill, linetype, size
Matches file type to file extension. c2 + geom_qq(aes(sample = hwy)) x, y, alpha,
color, fill, linetype, size, weight
THREE VARIABLES
seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2)); l <- ggplot(seals, aes(long, lat))
discrete l + geom_contour(aes(z = z))
l + geom_raster(aes(fill = z), hjust=0.5, vjust=0.5,
d <- ggplot(mpg, aes(fl)) x, y, z, alpha, colour, group, linetype,
interpolate=FALSE)
size, weight x, y, alpha, fill
d + geom_bar()
x, alpha, color, fill, linetype, size, weight l + geom_tile(aes(fill = z)), x, y, alpha, color, fill,
linetype, size, width
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at http://ggplot2.tidyverse.org • ggplot2 3.1.0 • Updated: 2018-12
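As a minimal sketch of the layered template on this page, the following starts a plot from the built-in mpg data and adds one geom function per layer. The particular choice of layers (points plus a linear smooth) is an illustrative assumption, not something the sheet prescribes.

```r
library(ggplot2)

# Begin a plot: data plus aesthetic mappings, no layers yet
p <- ggplot(data = mpg, aes(x = cty, y = hwy))

# Add one geom function per layer
p <- p +
  geom_point(alpha = 0.5) +               # layer 1: raw observations
  geom_smooth(method = "lm", se = FALSE)  # layer 2: fitted straight line

# Inspect the data that the first layer will draw
point_data <- layer_data(p, 1)
```

Printing `p` draws the plot; `ggsave()` would then write the last plot to a file whose type follows the file extension.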
Stats An alternative way to build a layer Scales Coordinate Systems Faceting
A stat builds new variables to plot (e.g., count, prop). Scales map data values to the visual values of an r <- d + geom_bar() Facets divide a plot into
fl cty cyl aesthetic. To change a mapping, add a new scale. r + coord_cartesian(xlim = c(0, 5))
subplots based on the
xlim, ylim
values of one or more
(n <- d + geom_bar(aes(fill = fl)))
+ =
x ..count..
The default cartesian coordinate system discrete variables.
aesthetic prepackaged scale-specific r + coord_fixed(ratio = 1/2)
scale_ to adjust scale to use arguments ratio, xlim, ylim
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
data stat geom coordinate plot Cartesian coordinates with fixed aspect ratio
x = x ·
system n + scale_fill_manual( between x and y units
y = ..count.. values = c("skyblue", "royalblue", "blue", "navy"), r + coord_flip()
t + facet_grid(cols = vars(fl))
Visualize a stat by changing the default stat of a geom limits = c("d", "e", "p", "r"), breaks = c("d", "e", "p", "r"), xlim, ylim
facet into columns based on fl
name = "fuel", labels = c("D", "E", "P", "R")) Flipped Cartesian coordinates
function, geom_bar(stat="count") or by using a stat t + facet_grid(rows = vars(year))
r + coord_polar(theta = "x", direction=1 )
facet into rows based on year
function, stat_count(geom="bar"), which calls a default range of title to use in labels to use breaks to use in theta, start, direction
values to include legend/axis in legend/axis legend/axis
geom to make a layer (equivalent to a geom function). in mapping Polar coordinates t + facet_grid(rows = vars(year), cols = vars(fl))
Use ..name.. syntax to map stat variables to aesthetics. r + coord_trans(ytrans = "sqrt")
facet into both rows and columns
xtrans, ytrans, limx, limy
t + facet_wrap(vars(fl))
GENERAL PURPOSE SCALES Transformed cartesian coordinates. Set xtrans and wrap facets into a rectangular layout
geom to use stat function geom mappings ytrans to the name of a window function.
Use with most aesthetics
i + stat_density2d(aes(fill = ..level..), Set scales to let axis limits vary across facets
scale_*_continuous() - map continuous values to visual ones π + coord_quickmap()
geom = "polygon") 60
variable created by stat scale_*_discrete() - map discrete values to visual ones π + coord_map(projection = "ortho", t + facet_grid(rows = vars(drv), cols = vars(fl),
lat
scale_*_identity() - use data values as visual ones orientation=c(41, -74, 0)) projection, orientation, scales = "free")
xlim, ylim x and y axis limits adjust to individual facets
c + stat_bin(binwidth = 1, origin = 10)
scale_*_manual(values = c()) - map discrete values to long
Themes
Set legend type for each aesthetic: colorbar, legend, or
ggplot() + stat_function(aes(x = -3:3), n = 99, fun = o + scale_fill_gradient2(low="red", high="blue", none (no legend)
dnorm, args = list(sd=0.5)) x | ..x.., ..y.. mid = "white", midpoint = 25) n + scale_fill_discrete(name = "Title",
labels = c("A", "B", "C", "D", "E"))
e + stat_identity(na.rm = TRUE) r + theme_bw()
r + theme_classic() Set legend title and labels with a scale function.
o + scale_fill_gradientn(colours=topo.colors(6)) White background
ggplot() + stat_qq(aes(sample=1:100), dist = qt, Also: rainbow(), heat.colors(), terrain.colors(), with grid lines r + theme_light()
dparam=list(df=5)) sample, x, y | ..sample.., ..theoretical..
Zooming
cm.colors(), RColorBrewer::brewer.pal() r + theme_gray()
r + theme_linedraw()
e + stat_sum() x, y, size | ..n.., ..prop.. Grey background
(default theme) r + theme_minimal()
e + stat_summary(fun.data = "mean_cl_boot") SHAPE AND SIZE SCALES Minimal themes
r + theme_dark()
r + theme_void()
Without clipping (preferred)
h + stat_summary_bin(fun.y = "mean", geom = "bar") p <- e + geom_point(aes(shape = fl, size = cyl)) dark for contrast
p + scale_shape() + scale_size() Empty theme t + coord_cartesian(
e + stat_unique() xlim = c(0, 100), ylim = c(10, 20))
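The equivalence between a geom function and a stat function described in the Stats section can be sketched as follows. The object names `by_geom` and `by_stat` are illustrative; the comparison uses `layer_data()` to look at the variables the stat computes.

```r
library(ggplot2)

d <- ggplot(mpg, aes(fl))

# The same layer, built two ways
by_geom <- d + geom_bar(stat = "count")   # geom function with its default stat
by_stat <- d + stat_count(geom = "bar")   # stat function with its default geom

# stat_count() creates a `count` variable: one value per level of fl
counts_geom <- layer_data(by_geom)$count
counts_stat <- layer_data(by_stat)$count
```

Both layers produce the same counts, so either form can be used; the stat form is convenient when the computed variable, not the shape, is the point of the layer.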
p + scale_shape_manual(values = c(3:7))
With clipping (removes unseen data points)
t + xlim(0, 100) + ylim(10, 20)
p + scale_radius(range = c(1,6))
p + scale_size_area(max_size = 6) t + scale_x_continuous(limits = c(0, 100)) +
scale_y_continuous(limits = c(0, 100))
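The difference between zooming without and with clipping can be checked numerically. This sketch uses a histogram rather than the scatter plot `t` above (an assumption made so that the effect shows up in the computed counts), and the limits 20 to 30 are arbitrary.

```r
library(ggplot2)

h <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 5)

# Without clipping: only the view changes, all rows still reach the stat
zoomed <- h + coord_cartesian(xlim = c(20, 30))

# With clipping: rows outside the limits are dropped before binning
clipped <- h + xlim(20, 30)

n_zoomed  <- sum(layer_data(zoomed)$count)
n_clipped <- sum(suppressWarnings(layer_data(clipped))$count)  # ggplot2 warns about removed rows
```

`n_zoomed` equals the number of rows in mpg, while `n_clipped` is smaller, which is why `coord_cartesian()` is the safer choice when a stat is involved.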
Data Transformation with data.table : : CHEAT SHEET
Basics Manipulate columns with j Group according to by
data.table is an extremely fast and memory efficient package
for transforming data in R. It works by converting R’s native a a a dt[, j, by = .(a)] – group rows by
EXTRACT
data frame objects into data.tables with new and enhanced values in specified columns.
functionality. The basics of working with data.tables are: dt[, c(2)] – extract columns by number. Prefix
column numbers with “-” to drop. dt[, j, keyby = .(a)] – group and
dt[i, j, by] simultaneously sort rows by values
in specified columns.
Take data.table dt, b c b c dt[, .(b, c)] – extract columns by name.
subset rows using i COMMON GROUPED OPERATIONS
and manipulate columns with j,
grouped according to by. dt[, .(c = sum(b)), by = a] – summarize rows within groups.
data.tables are also data frames – functions that work with data SUMMARIZE dt[, c := sum(b), by = a] – create a new column and compute rows
frames therefore also work with data.tables. within groups.
a x dt[, .(x = sum(a))] – create a data.table with new
columns based on the summarized values of rows.
dt[, .SD[1], by = a] – extract first row of groups.
CC BY SA Erik Petrovski • www.petrovski.dk • Learn more with the data.table homepage or vignette • data.table version 1.11.8 • Updated: 2019-01
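A small worked example of the `dt[i, j, by]` form and the grouped operations listed above may help. The table contents here are made up for illustration.

```r
library(data.table)

dt <- data.table(a = c(1, 1, 2, 2), b = 1:4)

# i subsets rows, j computes on columns, by groups the computation
filtered_sums <- dt[b > 1, .(total = sum(b)), by = a]

# Summarize rows within groups: one row per value of a
sums <- dt[, .(c = sum(b)), by = a]

# := adds a column by reference; each row receives its group's sum
dt[, c := sum(b), by = a]

# First row of each group
first_rows <- dt[, .SD[1], by = a]
```

Note the difference between the two middle calls: `.( )` in `j` returns a new, smaller data.table, while `:=` modifies `dt` in place and keeps every row.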
UNIQUE ROWS
unique(dt, by = c("a", "b")) – extract unique
BIND
Apply function to cols.
a b a b a b a b a b rbind(dt_a, dt_b) – combine rows of two
1 2 1 2 rows based on columns specified in “by”. + = data.tables.
2 2 2 2 Leave out “by” to use all columns. APPLY A FUNCTION TO MULTIPLE COLUMNS
1 2
a b a b dt[, lapply(.SD, mean), .SDcols = c("a", "b")] –
uniqueN(dt, by = c("a", "b")) – count the number of unique rows
1 4 2 5 apply a function – e.g. mean(), as.character(),
based on columns specified in “by”. a b x y a b x y cbind(dt_a, dt_b) – combine columns
2 5 which.max() – to columns specified in .SDcols
of two data.tables.
3 6 with lapply() and the .SD symbol. Also works
+ = with groups.
RENAME COLUMNS
a a a_m cols <- c("a")
a b x y setnames(dt, c("a", "b"), c("x", "y")) – rename 1 1 2 dt[, paste0(cols, "_m") := lapply(.SD, mean),
columns. .SDcols = cols] – apply a function to specified
Reshape a data.table
2 2 2
3 3 2 columns and assign the result with suffixed
variable names to the original data.
SET KEYS RESHAPE TO WIDE FORMAT
setkey(dt, a, b) – set keys to enable fast repeated lookup in
specified columns using “dt[.(value), ]” or for merging without id y a b id a_x a_z b_x b_z dcast(dt, Sequential rows
specifying merging columns using “dt_a[dt_b]”. A x 1 3 A 1 2 3 4 id ~ y,
A z 2 4 B 1 2 3 4
B x 1 3
value.var = c("a", "b")) ROW IDS
B z 2 4
dt[, c := 1:.N, by = b] – within groups, compute a
Combine data.tables
a b a b c
Reshape a data.table from long to wide format. 1 a 1 a 1 column with sequential row IDs.
2 a 2 a 2
dt A data.table. 3 b 3 b 1
JOIN id ~ y Formula with a LHS: ID columns containing IDs for
multiple entries. And a RHS: columns with values to
LAG & LEAD
a b x y a b x dt_a[dt_b, on = .(b = y)] – join spread in column headers.
1 c 3 b 3 b 3 data.tables on rows with equal values. value.var Columns containing values to fill into cells. dt[, c := shift(a, 1), by = b] – within groups,
2 a + 2 c = 1 c 2
a
1
b
a
a
1
b
a
c
NA duplicate a column with rows lagged by
3 b 1 a 2 a 1 2 a 2 a 1 specified amount.
RESHAPE TO LONG FORMAT 3 b 3 b NA
4 b 4 b 3
a b c x y z a b c x dt_a[dt_b, on = .(b = y, c > z)] – id a_x a_z b_x b_z id y a b melt(dt, 5 b 5 b 4 dt[, c := shift(a, 1, type = "lead"), by = b] –
1 c 7 3 b 4 3 b 4 3 join data.tables on rows with within groups, duplicate a column with rows
+ = id.vars = c("id"),
A 1 2 3 4 A 1 1 3
2 a 5 2 c 5 1 c 5 2 equal and unequal values. B 1 2 3 4 B 1 1 3 leading by specified amount.
3 b 6 1 a 8 NA a 8 1 A 2 2 4 measure.vars = patterns("^a", "^b"),
B 2 2 4 variable.name = "y",
value.name = c("a", "b"))
ROLLING JOIN read & write files
a id date b id date a id date b
Reshape a data.table from wide to long format.
1 A 01-01-2010 + 1 A 01-01-2013 = 2 A 01-01-2013 1 dt A data.table. IMPORT
2 A 01-01-2012 1 B 01-01-2013 2 B 01-01-2013 1 id.vars ID columns with IDs for multiple entries.
3 A 01-01-2014 measure.vars Columns containing values to fill into cells (often in fread("file.csv") – read data from a flat file such as .csv or .tsv into R.
1 B 01-01-2010
pattern form).
2 B 01-01-2012
variable.name, Names of new columns for variables and values fread("file.csv", select = c("a", "b")) – read specified columns from a
value.name derived from old headers. flat file into R.
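The two reshape calls can be tried end to end on the small table drawn in the diagrams. This is a sketch that assumes the four-row layout shown there (ids A and B, y values x and z).

```r
library(data.table)

dt <- data.table(
  id = c("A", "A", "B", "B"),
  y  = c("x", "z", "x", "z"),
  a  = c(1, 2, 1, 2),
  b  = c(3, 4, 3, 4)
)

# Long to wide: one row per id, one column per value.var/y combination
wide <- dcast(dt, id ~ y, value.var = c("a", "b"))

# Wide to long: patterns() groups the a_* and b_* columns into two value columns
long <- melt(
  wide,
  id.vars       = "id",
  measure.vars  = patterns("^a", "^b"),
  variable.name = "y",
  value.name    = c("a", "b")
)
```

After the round trip, `y` holds the position of each column within its pattern group (1, 2) rather than the original labels x and z, which matches the long table in the diagram.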
dt_a[dt_b, on = .(id = id, date = date), roll = TRUE] – join
data.tables on matching rows in id columns but only keep the most
recent preceding match with the left data.table according to date
columns. “roll = -Inf” reverses direction. EXPORT
fwrite(dt, "file.csv") – write data to a flat file from R.
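The rolling join described above can be reproduced with the dates from the diagram. This is a sketch: storing the dates as `Date` values in year-month-day order is an assumption about how the diagram's day-month-year strings would be held in practice.

```r
library(data.table)

dt_a <- data.table(
  a    = c(1, 2, 3, 1, 2),
  id   = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2010-01-01", "2012-01-01", "2014-01-01",
                   "2010-01-01", "2012-01-01"))
)
dt_b <- data.table(
  b    = c(1, 1),
  id   = c("A", "B"),
  date = as.Date(c("2013-01-01", "2013-01-01"))
)

# roll = TRUE: for each row of dt_b, take the most recent preceding row of dt_a
prev_match <- dt_a[dt_b, on = .(id = id, date = date), roll = TRUE]

# roll = -Inf: reverse direction, take the next following row of dt_a instead
next_match <- dt_a[dt_b, on = .(id = id, date = date), roll = -Inf]
```

With `roll = TRUE` both ids pick up the 2012 row (a = 2). With `roll = -Inf` id A picks up the 2014 row (a = 3) and id B gets NA, because dt_a has no later row for B.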
Machine Learning Modelling in R : : CHEAT SHEET
Supervised & Unsupervised Learning Meta-Algorithm, Time Series & Model Validation
Introduction • "center"
method getParamSet(learner=)
trafo
• "scale" lower=-2,upper=2,trafo=function(x) 10^x
• "standardize" "classif.qda"
• "range" range=c(0,1) Logical
LogicalVector CharacterVector DiscreteVector
mergeSmallFactorLevels(task=,cols=,min.perc=)
train(learner=,task=) makeTuneControl<type>()
WrappedModel • Grid(resolution=10L)
summarizeColumns(obj=) obj • Random(maxit=100)
• MBO(budget=)
• Irace(n.instances=)
capLargeValues dropFeatures getLearnerModel() • CMAES Design GenSA
removeConstantFeatures summarizeLevels
predict(object=,task=,newdata=) tuneParams(learner=,task=,resampling=,
measures=,par.set=,control=)
pred
View(pred)
makeClassifTask(data=,target=)
as.data.frame(pred)
A B C
positive Quickstart
makeRegrTask(data=,target=)
0 63 100 performance(pred=,measures=)
listMeasures() library(mlbench)
makeMultilabelTask(data=,target=)
• acc auc bac ber brier[.scaled] f1 fdr fn data(Soybean)
A B C fnr fp fpr gmean multiclass[.au1u .aunp .aunu soy = createDummyFeatures(Soybean,target="Class")
.brier] npv ppv qsr ssr tn tnr tp tpr wkappa tsk = makeClassifTask(data=soy,target="Class")
makeClusterTask(data=) • arsq expvar kendalltau mae mape medae ho = makeResampleInstance("Holdout",tsk)
medse mse msle rae rmse rmsle rrse rsq sae tsk.train = subsetTask(tsk,ho$train.inds[[1]])
spearmanrho sse tsk.test = subsetTask(tsk,ho$test.inds[[1]])
makeSurvTask(data=,target= • db dunn G1 G2 silhouette
c("time","event")) • multilabel[.f1 .subset01 .tpr .ppv
.acc .hamloss]
• mcp meancosts
• cindex
makeCostSensTask(data=,costs=) • featperc timeboth timepredict timetrain
A B lrn = makeLearner("classif.xgboost",nrounds=10)
cv = makeResampleDesc("CV",iters=5)
• calculateConfusionMatrix(pred=) res = resample(lrn,tsk.train,cv,acc)
task • calculateROCMeasures(pred=)
• weights=
• blocking=
makeResampleDesc(method=,...,stratify=)
method ps = makeParamSet(makeNumericParam("eta",0,1),
• "CV" iters= makeNumericParam("lambda",0,200),
makeLearner(cl=,predict.type=,...,par.vals=) • "LOO" iters= makeIntegerParam("max_depth",1,20))
• "RepCV" tc = makeTuneControlMBO(budget=100)
reps= folds= tr = tuneParams(lrn,tsk.train,cv5,acc,ps,tc)
• cl= "classif.xgboost" • "Subsample" lrn = setHyperPars(lrn,par.vals=tr$x)
"regr.randomForest" "cluster.kmeans" iters= split= eta lambda max_depth
• predict.type="response" • "Bootstrap" iters=
"prob" • "Holdout" split=
"se" stratify
• View(listLearners()) cv2
• View(listLearners(task)) cv3 cv5 cv10 hout
• View(listLearners("classif",
properties=c("prob", "factors"))) resample()
"classif" crossval() repcv() holdout() subsample()
"prob" "factors" bootstrapOOB() bootstrapB632() bootstrapB632plus()
• getLearnerProperties()
function(required_parameters=,optional_parameters=)
Configuration Feature Extraction Visualization Wrappers
configureMlr()
• show.info Wrapper 3, etc.
TRUE filterFeatures(task=,method=, generateThreshVsPerfData(obj=,measures=)
• on.learner.error "stop" perc=,abs=,threshold=) Wrapper 2
"warn" Wrapper 1
"quiet" "stop" • plotThreshVsPerf(obj)
• on.learner.warning Learner
"warn" "quiet" "warn" perc= abs= ThreshVsPerfData
• on.par.without.desc threshold= • plotROCCurves(obj)
"stop" "warn" "quiet" "stop"
• on.par.out.of.bounds method ThreshVsPerfData
"stop" "warn" "quiet" "stop" "randomForestSRC.rfsrc" measures=list(fpr,tpr) makeDummyFeaturesWrapper(learner=)
• on.measure.not.applicable "anova.test" "carscore" "cforest.importance" makeImputeWrapper(learner=,classes=,cols=)
"stop" "warn" "quiet" "stop" "chi.squared" "gain.ratio" "information.gain" makePreprocWrapper(learner=,train=,predict=)
• show.learner.output "kruskal.test" "linear.correlation" "mrmr" "oneR" makePreprocWrapperCaret(learner=,...)
TRUE "permutation.importance" "randomForest.importance" • plotResiduals(obj=) makeRemoveConstantFeaturesWrapper(learner=)
• on.error.dump "randomForestSRC.rfsrc" "randomForestSRC.var.select" Prediction BenchmarkResult
on.learner.error "stop" TRUE "rank.correlation" "relief"
"symmetrical.uncertainty" "univariate.model.score"
getMlrOptions() "variance" makeOverBaggingWrapper(learner=)
generateLearningCurveData(learners=,task=, makeSMOTEWrapper(learner=)
resampling=,percs=,measures=) makeUndersampleWrapper(learner=)
makeWeightedClassesWrapper(learner=)
Parallelization selectFeatures(learner=,task=
resampling=,measures=,control=)
• plotLearningCurve(obj=)
LearningCurveData
control makeCostSensClassifWrapper(learner=)
parallelMap
makeCostSensRegrWrapper(learner=)
makeCostSensWeightedPairsWrapper(learner=)
generateFilterValuesData(task=,method=)
• makeFeatSelControlExhaustive(max.features=)
parallelStart(mode=,cpus=,level=) max.features • plotFilterValues(obj=)
• makeMultilabelBinaryRelevanceWrapper(learner=)
• mode makeFeatSelControlRandom(maxit=,prob=,
makeMultilabelClassifierChainsWrapper(learner=)
• "local" mapply max.features=) FilterValuesData
makeMultilabelDBRWrapper(learner=)
• "multicore" prob maxit
makeMultilabelNestedStackingWrapper(learner=)
parallel::mclapply
• makeMultilabelStackingWrapper(learner=)
• "socket" makeFeatSelControlSequential(method=,maxit=,
• "mpi" max.features=,alpha=,beta=) generateHyperParsEffectData(tune.result=)
parallel::makeCluster parallel::clusterMap method "sfs"
• "BatchJobs" "sbs" "sffs"
"sfbs" alpha • plotHyperParsEffect(hyperpars.effec makeBaggingWrapper(learner=)
BatchJobs::batchMap
makeConstantClassWrapper(learner=)
• cpus beta t.data=,x=,y=,z=)
makeDownsampleWrapper(learner=,dw.perc=)
• level "mlr.benchmark"
• makeFeatSelControlGA(maxit=,max.features=,mu=, HyperParsEffectData makeFeatSelWrapper(learner=,resampling=,control=)
"mlr.resample" "mlr.selectFeatures"
lambda=,crossover.rate=,mutation.rate=) makeFilterWrapper(learner=,fw.perc=,fw.abs=,
"mlr.tuneParams" "mlr.ensemble"
• plotOptPath(op=) fw.threshold=)
<obj>$opt.path <obj> makeMultiClassWrapper(learner=)
parallelStop()
mu tuneResult featSelResult makeTuneWrapper(learner=,resampling=,par.set=,
lambda crossover.rate • plotTuneMultiCritResult(res=) control=)
Imputation mutation.rate
impute(obj=,target=,cols=,dummy.cols=,dummy.type=)
selectFeatures FeatSelResult
generatePartialDependenceData(obj=,input=) Nested Resampling
fsr tsk obj
tsk = subsetTask(tsk,features=fsr$x) input
• obj= • plotPartialDependence(obj=)
• target=
• cols= PartialDependenceData
• dummy.cols=
• dummy.type=
classes
"numeric"
dummy.classes cols
Benchmarking • resample benchmark
• makeTuneWrapper
benchmark(learners=,tasks=,resamplings=,measures=) • plotBMRBoxplots(bmr=) makeFeatSelWrapper
cols classes • plotBMRSummary(bmr=)
cols=list(V1=imputeMean()) V1 • plotBMRRanksAsBarChart(bmr=)
imputeMean()
Intro Define
INSTALLATION
The keras R package uses the Python keras library.
Compile Fit Evaluate Predict
Keras is a high-level neural networks API You can install all the prerequisites directly from R.
developed with a focus on enabling fast • Model • Batch size
• Sequential • Optimiser • Epochs • Evaluate • classes https://keras.rstudio.com/reference/install_keras.html
experimentation. It supports multiple back-
ends, including TensorFlow, CNTK and Theano. model • Loss • Validation • Plot • probability
library(keras) See ?install_keras
• Multi-GPU • Metrics split
install_keras() for GPU instructions
TensorFlow is a lower level mathematical model
library for building deep neural network This installs the required libraries in an Anaconda
architectures. The keras R package makes it https://keras.rstudio.com The “Hello, World!” environment or virtual environment 'r-tensorflow'.
easy to use Keras and TensorFlow in R. https://www.manning.com/books/deep-learning-with-r of deep learning
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at keras.rstudio.com • keras 2.1.2 • Updated: 2017-12
More layers Preprocessing
CONVOLUTIONAL LAYERS ACTIVATION LAYERS SEQUENCE PREPROCESSING Keras TensorFlow
layer_conv_1d() 1D, e.g.
temporal convolution
layer_activation(object, activation)
Apply an activation function to an output
pad_sequences()
Pads each sequence to the same length (length of Pre-trained models
the longest sequence)
layer_activation_leaky_relu() Keras applications are deep learning models
layer_conv_2d_transpose() Leaky version of a rectified linear unit skipgrams() that are made available alongside pre-trained
Transposed 2D (deconvolution) Generates skipgram word pairs weights. These models can be used for
α layer_activation_parametric_relu() prediction, feature extraction, and fine-tuning.
layer_conv_2d() 2D, e.g. spatial Parametric rectified linear unit make_sampling_table() application_xception()
convolution over images Generates word rank-based probabilistic sampling xception_preprocess_input()
layer_activation_thresholded_relu() table Xception v1 model
Thresholded rectified linear unit
layer_conv_3d_transpose()
Transposed 3D (deconvolution) layer_activation_elu() TEXT PREPROCESSING application_inception_v3()
layer_conv_3d() 3D, e.g. spatial Exponential linear unit inception_v3_preprocess_input()
text_tokenizer() Text tokenization utility Inception v3 model, with weights pre-trained
convolution over volumes
on ImageNet
fit_text_tokenizer() Update tokenizer internal
layer_conv_lstm_2d() vocabulary
Convolutional LSTM DROPOUT LAYERS application_inception_resnet_v2()
save_text_tokenizer(); load_text_tokenizer() inception_resnet_v2_preprocess_input()
layer_separable_conv_2d() layer_dropout() Inception-ResNet v2 model, with weights
Depthwise separable 2D Save a text tokenizer to an external file
Applies dropout to the input trained on ImageNet
layer_upsampling_1d() texts_to_sequences();
layer_spatial_dropout_1d() texts_to_sequences_generator() application_vgg16(); application_vgg19()
layer_upsampling_2d() layer_spatial_dropout_2d()
layer_upsampling_3d() Transforms each text in texts to sequence of integers VGG16 and VGG19 models
layer_spatial_dropout_3d()
Upsampling layer Spatial 1D to 3D version of dropout texts_to_matrix(); sequences_to_matrix() application_resnet50() ResNet50 model
layer_zero_padding_1d() Convert a list of sequences into a matrix
layer_zero_padding_2d() application_mobilenet()
layer_zero_padding_3d() RECURRENT LAYERS text_one_hot() One-hot encode text to word indices mobilenet_preprocess_input()
Zero-padding layer mobilenet_decode_predictions()
layer_simple_rnn() text_hashing_trick()
Fully-connected RNN where the output mobilenet_load_model_hdf5()
layer_cropping_1d() Converts a text to a sequence of indexes in a fixed-
layer_cropping_2d() is to be fed back to input MobileNet model architecture
size hashing space
layer_cropping_3d()
Cropping layer layer_gru() text_to_word_sequence()
Gated recurrent unit - Cho et al Convert text to a sequence of words (or tokens) ImageNet is a large database of images with
POOLING LAYERS
layer_cudnn_gru() labels, extensively used for deep learning
layer_max_pooling_1d() Fast GRU implementation backed IMAGE PREPROCESSING
layer_max_pooling_2d() by CuDNN imagenet_preprocess_input()
layer_max_pooling_3d() image_load() Loads an image into PIL format. imagenet_decode_predictions()
Maximum pooling for 1D to 3D layer_lstm() Preprocesses a tensor encoding a batch of
Long Short-Term Memory unit - flow_images_from_data() images for ImageNet, and decodes predictions
layer_average_pooling_1d() Hochreiter 1997 flow_images_from_directory()
layer_average_pooling_2d()
layer_average_pooling_3d()
Average pooling for 1D to 3D
layer_cudnn_lstm()
Fast LSTM implementation backed
Generates batches of augmented/normalized data
from images and labels, or a directory Callbacks
by CuDNN A callback is a set of functions to be applied at
layer_global_max_pooling_1d() image_data_generator() Generate minibatches of
image data with real-time data augmentation. given stages of the training procedure. You can
layer_global_max_pooling_2d() use callbacks to get a view on internal states
LOCALLY CONNECTED LAYERS
layer_global_max_pooling_3d() and statistics of the model during training.
Global maximum pooling fit_image_data_generator() Fit image data
layer_locally_connected_1d() generator internal statistics to some sample data callback_early_stopping() Stop training when
layer_global_average_pooling_1d() layer_locally_connected_2d() a monitored quantity has stopped improving
layer_global_average_pooling_2d() Similar to convolution, but weights are not generator_next() Retrieve the next item callback_learning_rate_scheduler() Learning
layer_global_average_pooling_3d() shared, i.e. different filters for each patch rate scheduler
Global average pooling image_to_array(); image_array_resize()
callback_tensorboard() TensorBoard basic
image_array_save() 3D array representation visualizations
Data Science in Spark with Sparklyr : : CHEAT SHEET
Intro Data Science Toolchain with Spark + sparklyr Using
sparklyr is an R interface for Apache Spark™,
it provides a complete dplyr backend and the Import Tidy
Understand
Communicate
sparklyr
Transform Visualize
option to query directly using Spark SQL • Export an R • dplyr verb Transformer Collect data into • Collect data A brief example of a data analysis using
statement. With sparklyr, you can orchestrate DataFrame • Direct Spark function R for plotting into R Apache Spark, R and sparklyr in local mode
distributed machine learning using either • Read a file SQL (DBI) • Share plots,
Spark’s MLlib or H2O Sparkling Water. • Read existing • SDF function Wrangle Model documents, library(sparklyr); library(dplyr); library(ggplot2);
Hive table (Scala API) • Spark MLlib and apps library(tidyr);
Starting with version 1.044, RStudio Desktop, Install Spark locally
R for Data Science, Grolemund & Wickham
• H2O Extension set.seed(100)
Server and Pro include integrated support for
the sparklyr package. You can create and spark_install("2.0.1") Connect to local version
manage connections to Spark clusters and local Getting Started sc <- spark_connect(master = "local")
Spark instances from inside the IDE.
LOCAL MODE (No cluster required) ON A YARN MANAGED CLUSTER
RStudio Integrates with sparklyr
1. Install a local version of Spark: 1. Install RStudio Server or RStudio Pro on import_iris <- copy_to(sc, iris, "spark_iris",
Open connection log Disconnect spark_install ("2.0.1") one of the existing nodes, preferably an overwrite = TRUE)
2. Open a connection edge node Copy data to Spark memory
sc <- spark_connect (master = "local") 2. Locate path to the cluster’s Spark Home
Directory, it normally is “/usr/lib/spark” partition_iris <- sdf_partition( Partition
import_iris,training=0.5, testing=0.5) data
3. Open a connection
Open the ON A MESOS MANAGED CLUSTER
Spark UI spark_connect(master="yarn-client",
version = "1.6.2", spark_home = sdf_register(partition_iris,
1. Install RStudio Server or Pro on one of the
existing nodes [Cluster’s Spark path]) c("spark_iris_training","spark_iris_test"))
Preview
Spark & Hive Tables 1K rows
2. Locate path to the cluster’s Spark directory
Create a hive metadata for each partition
3. Open a connection
Cluster Deployment spark_connect(master="[mesos URL]",
version = "1.6.2", spark_home = ON A SPARK STANDALONE CLUSTER
tidy_iris <- tbl(sc,"spark_iris_training") %>%
select(Species, Petal_Length, Petal_Width)
[Cluster’s Spark path]) 1. Install RStudio Server or RStudio Pro on
MANAGED CLUSTER Spark ML
Worker Nodes one of the existing nodes or a server in the
Cluster Manager Decision Tree
same LAN Model
Driver Node model_iris <- tidy_iris %>%
USING LIVY (Experimental) 2. Install a local version of Spark:
fd 1. The Livy REST application should be spark_install (version = "2.0.1")
ml_decision_tree(response="Species",
features=c("Petal_Length","Petal_Width"))
YARN running on the cluster 3. Open a connection
fd or
Mesos 2. Connect to the cluster spark_connect(master="spark:// test_iris <- tbl(sc,"spark_iris_test") Create
sc <- spark_connect(method = "livy", host:port", version = "2.0.1", reference to
fd master = "http://host:port") spark_home = spark_home_dir()) pred_iris <- sdf_predict(
Spark table
model_iris, test_iris) %>%
Bring data back
STAND ALONE CLUSTER Worker Nodes
Tuning Spark collect
into R memory
for plotting
Driver Node pred_iris %>%
fd EXAMPLE CONFIGURATION IMPORTANT TUNING PARAMETERS with defaults inner_join(data.frame(prediction=0:2,
lab=model_iris$model.parameters$labels)) %>%
fd config <- spark_config()
config$spark.executor.cores <- 2
•
•
spark.yarn.am.cores • spark.executor.instances
spark.yarn.am.memory 512m • spark.executor.extraJavaOptions ggplot(aes(Petal_Length, Petal_Width, col=lab)) +
config$spark.executor.memory <- "4G" • spark.network.timeout 120s • spark.executor.heartbeatInterval 10s geom_point()
fd sc <- spark_connect (master="yarn-client", • spark.executor.memory 1g • sparklyr.shell.executor-memory
config = config, version = "2.0.1") • spark.executor.cores 1 • sparklyr.shell.driver-memory spark_disconnect(sc) Disconnect
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • sparklyr 0.5 • Updated: 2016-12
Reactivity Visualize & Communicate Model (MLlib)
COPY A DATA FRAME INTO SPARK SPARK SQL COMMANDS DOWNLOAD DATA TO R MEMORY ml_decision_tree(my_table,
sdf_copy_to(sc, iris, "spark_iris") r_table <- collect(my_table) response = "Species", features =
DBI::dbWriteTable(sc, "spark_iris", iris)
plot(Petal_Width~Petal_Length, data=r_table)
sdf_copy_to(sc, x, name, memory, repartition, c("Petal_Length", "Petal_Width"))
DBI::dbWriteTable(conn, name, dplyr::collect(x)
overwrite) value) Download a Spark DataFrame to an R DataFrame ml_als_factorization(x, user.column = "user",
sdf_read_column(x, column) rating.column = "rating", item.column = "item",
IMPORT INTO SPARK FROM A FILE FROM A TABLE IN HIVE Returns contents of a single column to R rank = 10L, regularization.parameter = 0.1, iter.max = 10L,
Arguments that apply to all functions: my_var <- tbl_cache(sc, name= ml.options = ml_options())
sc, name, path, options = list(), repartition = 0, "hive_iris") SAVE FROM SPARK TO FILE SYSTEM ml_decision_tree(x, response, features, max.bins = 32L, max.depth
memory = TRUE, overwrite = TRUE Arguments that apply to all functions: x, path = 5L, type = c("auto", "regression", "classification"), ml.options =
tbl_cache(sc, name, force = TRUE)
CSV spark_read_csv( header = TRUE, Loads the table into memory spark_write_csv( header = TRUE, ml_options()) Same options for: ml_gradient_boosted_trees
columns = NULL, infer_schema = TRUE, CSV
delimiter = ",", quote = "\"", escape = "\\", ml_generalized_linear_regression(x, response, features,
delimiter = ",", quote = "\"", escape = "\\", my_var <- dplyr::tbl(sc,
charset = "UTF-8", null_value = NULL) intercept = TRUE, family = gaussian(link = "identity"), iter.max =
charset = "UTF-8", null_value = NULL) name= "hive_iris")
dplyr::tbl(sc, …) JSON spark_write_json(mode = NULL) 100L, ml.options = ml_options())
JSON spark_read_json()
Creates a reference to the table PARQUET spark_write_parquet(mode = NULL) ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x),
PARQUET spark_read_parquet() without loading it into memory compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha =
Wrangle Reading & Writing from Apache Spark (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
ml_linear_regression(x, response, features, intercept = TRUE,
SPARK SQL VIA DPLYR VERBS ML TRANSFORMERS tbl_cache
sdf_copy_to alpha = 0, lambda = 0, iter.max = 100L, ml.options = ml_options())
Translates into Spark SQL statements ft_binarizer(my_table,input.col="Petal_Length", dplyr::tbl
dplyr::copy_to Same options for: ml_logistic_regression
output.col="petal_large", DBI::dbWriteTable
my_table <- my_var %>% ml_multilayer_perceptron(x, response, features, layers, iter.max =
threshold=1.2)
filter(Species=="setosa") %>% 100, seed = sample(.Machine$integer.max, 1), ml.options =
sample_n(10) Arguments that apply to all functions: ml_options())
x, input.col = NULL, output.col = NULL spark_read_<fmt>
sdf_collect ml_naive_bayes(x, response, features, lambda = 0, ml.options =
DIRECT SPARK SQL COMMANDS dplyr::collect File
ft_binarizer(threshold = 0.5) ml_options())
Assigned values based on threshold sdf_read_column System
my_table <- DBI::dbGetQuery( sc , "SELECT * ml_one_vs_rest(x, classifier, response, features, ml.options =
spark_write_<fmt>
FROM iris LIMIT 10") ft_bucketizer(splits) ml_options())
Extensions
DBI::dbGetQuery(conn, statement) Numeric column to discretized column
ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
ft_discrete_cosine_transform(inverse
Create an R package that calls the full Spark API & ml_random_forest(x, response, features, max.bins = 32L,
SCALA API VIA SDF FUNCTIONS = FALSE)
provide interfaces to Spark packages. max.depth = 5L, num.trees = 20L, type = c("auto", "regression",
Time domain to frequency domain
sdf_mutate(.data) CORE TYPES "classification"), ml.options = ml_options())
Works like dplyr mutate function ft_elementwise_product(scaling.col)
spark_connection() Connection between R and the ml_survival_regression(x, response, features, intercept =
Element-wise product between 2 cols
sdf_partition(x, ..., weights = NULL, seed = Spark shell process TRUE,censor = "censor", iter.max = 100L, ml.options = ml_options())
sample (.Machine$integer.max, 1)) ft_index_to_string() spark_jobj() Instance of a remote Spark object
Index labels back to label as strings ml_binary_classification_eval(predicted_tbl_spark, label, score,
sdf_partition(x, training = 0.5, test = 0.5) spark_dataframe() Instance of a remote Spark metric = "areaUnderROC")
sdf_register(x, name = NULL) ft_one_hot_encoder() DataFrame object
Continuous to binary vectors ml_classification_eval(predicted_tbl_spark, label, predicted_lbl,
Gives a Spark DataFrame a table name CALL SPARK FROM R metric = "f1")
sdf_sample(x, fraction = 1, replacement = ft_quantile_discretizer(n.buckets=5L) invoke() Call a method on a Java object
Continuous to binned categorical ml_tree_feature_importance(sc, model)
TRUE, seed = NULL) invoke_new() Create a new object by invoking a
values
sdf_sort(x, columns) constructor
Sorts by >=1 columns in ascending order ft_sql_transformer(sql) invoke_static() Call a static method on an object sparklyr
sdf_with_unique_id(x, id = "id") ft_string_indexer( params = NULL) is an R
Column of labels into a column of label MACHINE LEARNING EXTENSIONS
sdf_predict(object, newdata) indices. ml_options() interface
ml_create_dummy_variables()
Spark DataFrame with predicted values for
ft_vector_assembler() ml_model()
ml_prepare_dataframe()
Combine vectors into single row-vector
ml_prepare_response_features_intercept()
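The one-line descriptions of ft_binarizer and ft_bucketizer above are terse, so the sketch below mimics both rules in base R, with no Spark connection. It assumes Spark's documented behaviour: the binarizer maps values strictly above the threshold to 1, and the bucketizer uses left-closed intervals with the last one also closed on the right. The helpers binarize() and bucketize() and the sample values are hypothetical and are not part of sparklyr.

```r
# Base R illustration only; nothing here talks to Spark.
petal_length <- c(1.0, 1.2, 1.4, 4.7, 6.0)

# Binarizer rule (assumed): values strictly above the threshold become 1, others 0.
binarize <- function(x, threshold = 0.5) as.numeric(x > threshold)
petal_large <- binarize(petal_length, threshold = 1.2)

# Bucketizer rule (assumed): 0-based index of the interval
# [splits[i], splits[i + 1]) containing the value; the last interval
# also includes its upper bound.
bucketize <- function(x, splits) {
  findInterval(x, splits, rightmost.closed = TRUE) - 1
}
petal_bucket <- bucketize(petal_length, splits = c(0, 2, 5, 7))

print(data.frame(petal_length, petal_large, petal_bucket))
```

On a real cluster the same transformations would run through ft_binarizer() and ft_bucketizer() on a Spark DataFrame; this local version only shows which value lands where.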
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • sparklyr 0.5 • Updated: 2016-12
Cheat Sheet

> string <- c("Hiphopopotamus", "Rhymenoceros", "time for bottomless lyrics")
> pattern <- "t.m"

[[:digit:]] or \\d   Digits; [0-9]
\\D                  Non-digits; [^0-9]
[[:lower:]]          Lower-case letters; [a-z]
[[:upper:]]          Upper-case letters; [A-Z]
[[:alpha:]]          Alphabetic characters; [A-Za-z]
[[:alnum:]]          Alphanumeric characters; [A-Za-z0-9]
\\w                  Word characters; [A-Za-z0-9_]
\\W                  Non-word characters
[[:xdigit:]]         Hexadec. digits; [0-9A-Fa-f]
[[:blank:]]          Space and tab
[[:space:]] or \\s   Space, tab, vertical tab, newline, form feed, carriage return
\\S                  Not space; [^[:space:]]
[[:punct:]]          Punctuation characters; !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[[:graph:]]          Graphical characters; [[:alnum:][:punct:]]
[[:print:]]          Printable characters; [[:alnum:][:punct:]\\s]
[[:cntrl:]]          Control characters; \n, \r etc.

grep(pattern, string)
  [1] 1 3
grep(pattern, string, value = TRUE)
  [1] "Hiphopopotamus"
  [2] "time for bottomless lyrics"
grepl(pattern, string)
  [1] TRUE FALSE TRUE
stringr::str_detect(string, pattern)
  [1] TRUE FALSE TRUE

regexpr(pattern, string)
  find starting position and length of first match
gregexpr(pattern, string)
  find starting position and length of all matches
stringr::str_locate(string, pattern)
  find starting and end position of first match
stringr::str_locate_all(string, pattern)
  find starting and end position of all matches

regmatches(string, regexpr(pattern, string))
  extract first match [1] "tam" "tim"
regmatches(string, gregexpr(pattern, string))
  extract all matches, outputs a list
  [[1]] "tam" [[2]] character(0) [[3]] "tim" "tom"
stringr::str_extract(string, pattern)
  extract first match [1] "tam" NA "tim"
stringr::str_extract_all(string, pattern)
  extract all matches, outputs a list
stringr::str_extract_all(string, pattern, simplify = TRUE)
  extract all matches, outputs a matrix
stringr::str_match(string, pattern)
  extract first match + individual character groups
stringr::str_match_all(string, pattern)
  extract all matches + individual character groups

sub(pattern, replacement, string)
  replace first match
gsub(pattern, replacement, string)
  replace all matches
stringr::str_replace(string, pattern, replacement)
  replace first match
stringr::str_replace_all(string, pattern, replacement)
  replace all matches

strsplit(string, pattern) or stringr::str_split(string, pattern)
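The base R calls listed above can be run end to end on the sheet's own `string` and `pattern`. The sketch below does that; the replacement "X" is an arbitrary choice for illustration. It also shows a pitfall with letter ranges: `[A-z]` spans the punctuation that sits between "Z" and "a" in ASCII, so it is not equivalent to `[A-Za-z]`. `perl = TRUE` is used there so the range follows ASCII order regardless of locale.

```r
string <- c("Hiphopopotamus", "Rhymenoceros", "time for bottomless lyrics")
pattern <- "t.m"

hits        <- grep(pattern, string)                          # indices of matching elements
first_match <- regmatches(string, regexpr(pattern, string))   # first match, matching elements only
all_matches <- regmatches(string, gregexpr(pattern, string))  # list with one entry per element
first_repl  <- sub(pattern, "X", string)                      # replace the first match only
all_repl    <- gsub(pattern, "X", string)                     # replace every match

print(hits)
print(first_match)
print(all_repl)

# Range pitfall: "_" lies between "Z" and "a", so [A-z] matches it.
underscore_in_A_to_z  <- grepl("[A-z]", "_", perl = TRUE)
underscore_in_letters <- grepl("[A-Za-z]", "_", perl = TRUE)
print(c(underscore_in_A_to_z, underscore_in_letters))
```

Note that regmatches() with regexpr() silently drops elements without a match, whereas stringr::str_extract() keeps them as NA.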
Cheat Sheet
the predictors with the preProc option.
function is used again.
• Set the random number seed prior to calling train repeatedly to get the same resamples across calls.
• Use the train option na.action = na.pass if you will be imputing missing data. Also, use this option when predicting new data containing missing values.

To pass options to the underlying model function, you can pass them to train via the ellipses:
train(y ~ ., data = dat, method = "rf",
      # options to `randomForest`:
      importance = TRUE)

train(x = x, y = y, method = "glmnet",
      preProc = c("center", "scale"),
      tuneGrid = grid)

Many train options can be specified using the trainControl function:
train(y ~ ., data = dat, method = "cubist",
      trControl = trainControl(<options>))

Resampling Options
trainControl is used to choose a resampling method:
trainControl(method = <method>, <options>)

Random Search
For tuning, train can also generate random tuning parameter combinations over a wide range. tuneLength controls the total number of combinations to evaluate. To use random search:
trainControl(search = "random")
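The tip about setting the seed before each call to train is easier to trust once you see why it works. The base R sketch below is not caret code: make_resamples() is a hypothetical stand-in for whatever resampling a tuning function does internally, here a plain bootstrap of row indices.

```r
# Hypothetical helper: draw bootstrap resample indices, standing in for
# the resampling performed inside a model-tuning call.
make_resamples <- function(n, times = 3) {
  lapply(seq_len(times), function(i) sample.int(n, n, replace = TRUE))
}

set.seed(42)
first_call <- make_resamples(n = 20)

set.seed(42)                           # same seed again before the second call
second_call <- make_resamples(n = 20)

third_call <- make_resamples(n = 20)   # no reseeding: the generator has moved on

print(identical(first_call, second_call))
print(identical(first_call, third_call))
```

Models are compared on exactly the same resamples only when the seed is reset immediately before each call, which is what the bullet recommends.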
north = TRUE,
scale = 5)

Proportional Symbols
propSymbolsLayer(x = mtq, var = "myvar",
                 inches = 0.1, symbols = "circle")

Colorized Proportional Symbols (relative data)
propSymbolsChoroLayer(x = mtq, var = "myvar",
                      var2 = "myvar2")

Colorized Proportional Symbols (qualitative data)
propSymbolsTypoLayer(x = mtq, var = "myvar",
                     var2 = "myvar2")

Double Proportional Symbols
propTrianglesLayer(x = mtq, var1 = "myvar",
                   var2 = "myvar2")

OpenStreetMap Basemap (see rosm package)
tiles <- getTiles(x = mtq, type = "osm")
tilesLayer(tiles)

Links layers can be used by *LinkLayer().

Polygons to Borders
mtq_border <- getBorders(x = mtq)
Borders layers can be used by discLayer() function

Polygons to Pencil Lines
mtq_pen <- getPencilLayer(x = mtq)

Classification

Legends
legendChoro()

Figure Dimensions
Get figure dimensions based on the dimension ratio of a spatial object, figure margins and output resolution.
f_dim <- getFigDim(x = sf_obj, width = 500,
                   mar = c(0,0,0,0))
png("fig.png", width = 500, height = f_dim[2])
par(mar = c(0,0,0,0))
plot(sf_obj, col = "#729fcf")
dev.off()
[Figure: default ratio vs. controlled ratio; a / b == f_dim[1] / f_dim[2]]
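The relation a / b == f_dim[1] / f_dim[2] from the Figure Dimensions entry can be checked with plain arithmetic. The sketch below is not the getFigDim() implementation: it assumes zero margins, ignores output resolution, and uses a made-up bounding box. fig_height() is a hypothetical helper.

```r
# Hypothetical bounding box of a spatial object (map units are arbitrary).
bbox <- c(xmin = 0, ymin = 0, xmax = 30, ymax = 20)

# With zero margins the figure keeps the bounding-box ratio:
# width / height == (xmax - xmin) / (ymax - ymin)
fig_height <- function(bbox, width) {
  ratio <- (bbox[["xmax"]] - bbox[["xmin"]]) / (bbox[["ymax"]] - bbox[["ymin"]])
  round(width / ratio)
}

f_dim <- c(500, fig_height(bbox, width = 500))
print(f_dim)
```

With non-zero margins the plotting region shrinks, so the real function also has to account for them before applying the ratio.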
column (cbind) or rows (rbind).
h2o.splitFrame: Split an existing H2O dataset according to user-specified ratios.
h2o.merge: Merges 2 H2OFrames.
h2o.arrange: Sorts an H2OFrame by columns.

MISSING DATA HANDLING
h2o.impute: Impute a column of data using the mean, median, or mode.
h2o.insertMissingValues: Replaces a user-specified fraction of entries in an H2O dataset with missing values.
h2o.na_omit: Remove Rows With NAs.

ELEMENT INDEX SELECTION
h2o.which: True Condition’s Row Numbers

CONDITIONAL VALUE SELECTION
h2o.ifelse: Apply conditional statements to numeric vectors in an H2O Frame.

h2o.cumprod: Vector of the cumulative products of the elements of the argument.
h2o.cumsum: Vector of the cumulative sums of the elements of the argument.

PRECISION
h2o.round: Round values to the specified number of decimal places. The default is 0.
h2o.signif: Round values to the specified number of significant digits.

apply: Apply a function over an H2O parsed data object (an array) margins.

GROUP BY AGGREGATION
h2o.group_by: Apply an aggregate function to each group of an H2O dataset.

TABULATION
h2o.table: Use the cross-classifying factors to build a table of counts at each combination of factor levels.
H2O.ai. • CC BY SA Juan Telleria Ruiz de Aguirre • [email protected] • http://docs.h2o.ai/ • Learn more at H2O.ai documentation webpage • package version 3.18.0.11 • Updated: 2018-06
h2o :: CHEAT SHEET

Data Modeling

MODEL TRAINING: SUPERVISED LEARNING
h2o.deeplearning: Perform Deep Learning Neural Networks on an H2OFrame.
h2o.gbm: Build Gradient Boosted Regression Trees or Classification Trees.
h2o.glm: Fit a Generalized Linear Model, specified by a response variable, a set of predictors, and the error distribution.
h2o.naiveBayes: Compute Naive Bayes classification probabilities on an H2O Frame.
h2o.randomForest: Perform Random Forest Classification on an H2O Frame.
h2o.xgboost: Build an Extreme Gradient Boosted Model using the XGBoost backend.
h2o.stackedEnsemble: Build a stacked ensemble (aka. Super Learner) using the specified H2O base learning algorithms.
h2o.automl: Automates the Supervised Machine Learning Model Training Process: Automatically Trains and Cross-validates a set of Models, and trains a Stacked Ensemble.

MODEL TRAINING: UNSUPERVISED LEARNING
h2o.prcomp: Perform Principal Components Analysis on the given H2O Frame.
h2o.kmeans: Perform k-means Clustering on the given H2O Frame.
h2o.anomaly: Detect anomalies in a H2O Frame using a H2O Deep Learning Model with Auto-Encoding.
h2o.deepfeatures: Extract the non-linear features from a H2O Frame using a H2O Deep Learning Model.
h2o.glrm: Builds a Generalized Low Rank Decomposition of an H2O Frame.
h2o.svd: Singular value decomposition of an H2O Frame using the power method.
h2o.word2vec: Trains a word2vec model on a String column of an H2O data frame.

SURVIVAL MODELS: TIME-TO-EVENT
h2o.coxph: Trains a Cox Proportional Hazards Model (CoxPH) on an H2O Frame.

GRID SEARCH
h2o.grid: Efficient method to build multiple models with different hyperparameters.
h2o.getGrid: Get a grid object from H2O distributed K/V store.

MODEL SCORING
h2o.predict: Obtain predictions from various fitted H2O model objects.
h2o.scoreHistory: Get Model Score History.

MODEL METRICS
h2o.make_metrics: Given predicted values (target for regression, class-1 probabilities, or binomial or per-class probabilities for multinomial), compute a model metrics object.

GENERAL MODEL HELPER
h2o.performance: Evaluate the predictive performance of a Supervised Learning Regression or Classification Model via various metrics. Set xval = TRUE for retrieving the training cross-validation metrics.

REGRESSION MODEL HELPER
h2o.mse: Display the mean squared error calculated from “Predicted Responses” and “Actual (Reference) Responses”. Set xval = TRUE for retrieving the cross-validation MSE.

CLASSIFICATION MODEL HELPERS
h2o.accuracy: Get Model Accuracy metric.
h2o.auc: Retrieve the AUC (area under ROC curve). Set xval = TRUE for retrieving the cross-validation AUC.
h2o.confusionMatrix: Display prediction errors for classification data (“Predicted” vs “Reference : Real Values”).
h2o.hit_ratio_table: Retrieve the Hit Ratios. Set xval = TRUE for retrieving the cross-validation Hit Ratio.

CLUSTERING MODEL HELPER
h2o.betweenss: Get the between cluster Sum of Squares.
h2o.centers: Retrieve the Model Centers.

PREDICTOR VARIABLE IMPORTANCE
h2o.varimp: Retrieve the variable importance.
h2o.varimp_plot: Plot Variable Importances.

Data Munging

GENERAL COLUMN MANIPULATION
is.na: Display missing elements.

FACTOR LEVEL MANIPULATIONS
h2o.levels: Display a list of the unique values found in a categorical data column.
h2o.relevel: Reorders levels of an H2O factor, similarly to standard R's relevel.
h2o.setLevels: Set Levels of H2O Factor.

NUMERIC COLUMN MANIPULATIONS
h2o.cut: Convert H2O Numeric Data to Factor by breaking it into Intervals.

CHARACTER COLUMN MANIPULATIONS
h2o.strsplit: “String Split”: Splits the given factor column on the input split.
h2o.tolower: Convert the characters of a character vector to lower case.
h2o.toupper: Convert the characters of a character vector to upper case.
h2o.trim: “Trim spaces”: Remove leading and trailing white space.
h2o.gsub: Match a pattern & replace all instances (occurrences) of the matched pattern with the replacement string globally.
h2o.sub: Match a pattern & replace the first instance (occurrence) of the matched pattern with the replacement string.

DATE MANIPULATIONS
h2o.month: Convert Milliseconds to Months in H2O Datasets (Scale: 0 to 11).
h2o.year: Convert Milliseconds to Years in H2O Datasets, indexed starting from 1900.
h2o.day: Convert Milliseconds to Day of Month in H2O Datasets (Scale: 1 to 31).
h2o.hour: Convert Milliseconds to Hour of Day in H2O Datasets (Scale: 0 to 23).
h2o.dayOfWeek: Convert Milliseconds to Day of Week in a H2OFrame (Scale: 0 to 6).

MATRIX OPERATIONS
%*%: Multiply two conformable matrices.
t: Returns the transpose of an H2OFrame.

Cluster Operations

H2O KEY VALUE STORE ACCESS
h2o.assign: Assign H2O hex.keys to R objects.
h2o.getFrame: Get H2O dataset Reference.
h2o.getModel: Get H2O model reference.
h2o.ls: Display a list of object keys in the running instance of H2O.
h2o.rm: Remove specified H2O Objects from the H2O server, but not from the R environment.
h2o.removeAll: Remove All H2O Objects from the H2O server, but not from the R environment.

H2O MODEL IMPORT / EXPORT
h2o.loadModel: Load H2OModel from disk.
h2o.saveModel: Save H2OModel object to disk.
h2o.download_pojo: Download the Scoring POJO (Plain Old Java Object) of an H2O Model.
h2o.download_mojo: Download the model in MOJO format.

H2O CLUSTER CONNECTION
h2o.init: Connect to a running H2O instance using all CPUs on the host.
h2o.shutdown: Shut down the specified H2O instance. All data on the server will be lost!

H2O CLUSTER INFORMATION
h2o.clusterInfo: Display the name, version, uptime, total nodes, total memory, total cores and health of a cluster running H2O.
h2o.clusterStatus: Retrieve information on the status of the cluster running H2O.

H2O LOGGING
h2o.clearLog: Clear all H2O R command and error response logs from the local disk.
h2o.downloadAllLogs: Download all H2O log files to the local disk.
h2o.logAndEcho: Write a message to the H2O Java log file and echo it back.
h2o.openLog: Open existing logs of H2O R POST commands and error responses on disk.
h2o.getLogPath: Get the file path for the H2O R command and error response logs.
h2o.startLogging: Begin logging H2O R POST commands and error responses.
h2o.stopLogging: Stop logging H2O R POST commands and error responses.
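The scales quoted for the date functions (months 0 to 11, years counted from 1900, weekdays 0 to 6) match the conventions of base R's POSIXlt fields, which makes a convenient local analogue. The sketch below runs no H2O code. The two timestamps are arbitrary examples, and which weekday H2O maps to 0 is not stated in the sheet, so treat the weekday line as showing the POSIXlt convention only (0 = Sunday).

```r
# Milliseconds since 1970-01-01 00:00:00 UTC (example values).
ms <- c(0, 1528675200000)

# Convert to seconds, then to a broken-down time in UTC.
ct <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
lt <- as.POSIXlt(ct, tz = "UTC")

parts <- data.frame(
  month = lt$mon,    # 0 to 11
  year  = lt$year,   # years since 1900
  day   = lt$mday,   # 1 to 31
  hour  = lt$hour,   # 0 to 23
  wday  = lt$wday    # 0 to 6, with 0 = Sunday in POSIXlt
)
print(parts)
```

The second timestamp is 2018-06-11 00:00 UTC, so its month field is 5 and its year field is 118, which is a reminder to add 1 and 1900 respectively before reporting these values.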
Data & Variable Transformation with sjmisc Cheat Sheet

sjmisc complements dplyr, and helps with data transformation tasks and recoding variables.
sjmisc works together seamlessly with dplyr and pipes. All functions are designed to support labelled data.

Design Philosophy
The design of sjmisc functions follows the tidyverse-approach: first argument is always the data (either a data frame or vector), followed by variable names to be processed by the functions.
The returned object for each function equals the type of the data-argument.

Vector input
• If the data-argument is a vector, functions return a vector.
rec(mtcars$carb, rec = "1,2=1; 3,4=2; else=3")

Data frame input
• If the data-argument is a data frame, functions return a data frame.
rec(mtcars, carb, rec = "1,2=1; 3,4=2; else=3")

...-ellipses Argument
Apply functions to a single variable, selected variables or to a complete data frame.
Variable selection is powered by select(): Separate variables with comma, or use select-helpers to select variables, e.g. ?rec:
rec(mtcars, one_of(c("gear", "carb")),
    rec = "min:3=1; 4:max=2")
rec(mtcars, gear, carb, rec = "min:3=1; 4:max=2")

Descriptives and Summaries
Most of the sjmisc functions (including recode-functions) also work on grouped data frames:
library(dplyr)
efc %>%
  group_by(e16sex, c172code) %>%
  frq(e42dep)

Frequency Tables
frq(x, ..., sort.frq = c("none", "asc", "desc"), weight.by = NULL, auto.grp)
Print frequency tables of (labelled) vectors. Uses variable labels as table header.
data(efc); frq(efc, e42dep, c161sex)
Use this data set in examples!

flat_table(data, ..., margin = c("counts", "cell", "row", "col"), digits = 2,
           show.values = FALSE)
Print contingency tables of (labelled) vectors. Uses value labels.
flat_table(efc, e42dep, c172code, e16sex)

count_na(x, ...)
Print frequency table of tagged NA values.
library(haven); x <- labelled(c(1:3, tagged_na("a", "a", "z")), labels =
  c("Refused" = tagged_na("a"), "N/A" = tagged_na("z")))
count_na(x)

Descriptive Summary
descr(x, ..., max.length = NULL)
Descriptive summary of data frames, including variable labels in output.
descr(efc, contains("cop"), max.length = 20)

Finding Variables in a Data Frame
Use find_var() to search for variables by names, value or variable labels. Returns vector/data frame.
find_var(efc, pattern = "cop", out = "df")
# variables with "level" in names and value labels
find_var(efc, "level", search = "name_value")

Recode and Transform Variables
Recode functions add a suffix to new variables, so original variables are preserved. By default, original input data frame and new created variables are returned. Use append = FALSE to return the recoded variables only.

rec(x, ..., rec, as.num = TRUE, var.label = NULL, val.labels = NULL,
    append = TRUE, suffix = "_r")
Recode values, return result as numeric, character or categorical (factor).
rec(mtcars, carb, rec = "1,2=1; 3,4=2; else=3")

dicho(x, ..., dich.by = "median", as.num = FALSE, var.label = NULL,
      val.labels = NULL, append = TRUE, suffix = "_d")
Dichotomise variable by median, mean or specific value.
dicho(mtcars, disp)

split_var(x, ..., n, as.num = FALSE, val.labels = NULL, var.label = NULL,
          inclusive = FALSE, append = TRUE, suffix = "_g")
Split variable into equal sized groups. Unlike dplyr::ntile(), does not split original categories into different values (see examples in ?split_var).
split_var(mtcars, mpg, disp, n = 3)

group_var(x, ..., size = 5, as.num = TRUE, right.interval = FALSE, n = 30,
          append = TRUE, suffix = "_gr")
Split variable into groups with equal value range, or into a max. # of groups (value range per group is adjusted to match # of groups).
group_var(mtcars, mpg, disp, size = 5)
group_var(mtcars, mpg, size = "auto", n = 4)

std(x, ..., robust = "sd", include.fac = FALSE, append = TRUE, suffix = "_z")
Z-standardise variables. Also center().
std(efc, e17age, c160age)

recode_to(x, ..., lowest = 0, highest = -1, append = TRUE, suffix = "_r0")
recode_to(mtcars$gear)

Summarise Variables and Cases
The summary functions mostly mimic base R equivalents, but are designed to work together with pipes and dplyr.

row_sums(x, ..., na.rm = TRUE, var = "rowsums", append = FALSE)
Row sums of data frames.
row_sums(efc, c82cop1:c90cop9)

row_means(x, ..., n, var = "rowmeans", append = FALSE)
Row means, for at least n valid (non-NA) values.
row_means(efc, c82cop1:c90cop9, n = 7)

row_count(x, ..., count, var = "rowcount", append = FALSE)
Row-wise count # of values in data frames. Also col_count().
row_count(efc, c82cop1:c90cop9, count = 2)

Other Useful Functions
add_columns() and replace_columns() to combine data frames, but either replace or preserve existing columns.
set_na() and replace_na() to convert regular into missing values, or vice versa. replace_na() also replaces specific tagged NA values only.
remove_var() and var_rename() to remove variables from data frames, or rename variables.
group_str() to group similar string values. Useful for variables with similar, but not identically
merge_df() to full join data frames and preserve value and variable labels.
to_long() to gather multiple columns in data frames from wide into long format.

Use with %>% and dplyr
# use sjmisc-functions in pipes
mtcars %>% select(gear, carb) %>%
  rec(rec = "min:3=1; 4:max=2")
# use sjmisc-function inside mutate
mtcars %>% select(gear, carb) %>% mutate(
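The recode pattern used with rec() on mtcars$carb ("1,2=1; 3,4=2; else=3") is compact, so it may help to see the same mapping spelled out in base R. This is only an equivalent written by hand for this one pattern, not how sjmisc implements rec(), and it leaves out the labelling and suffix handling.

```r
carb <- mtcars$carb

# Pattern "1,2=1; 3,4=2; else=3" written as nested ifelse() calls:
#   1 or 2 -> 1, 3 or 4 -> 2, anything else -> 3
carb_r <- ifelse(carb %in% c(1, 2), 1,
          ifelse(carb %in% c(3, 4), 2, 3))

# Cross-tabulate original against recoded values to check the mapping.
print(table(original = carb, recoded = carb_r))
```

The cross-table should show each original value in exactly one recoded column, which is a quick sanity check worth running after any recode.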
CC BY Daniel Lüdecke [email protected] github.com/strengejacke Learn more with browseVignettes("sjmisc") sjmisc 2.7.0 02/18
randomizr :: CHEAT SHEET

Two Arm Trials

Simple random assignment is like flipping coins for each unit separately.

simple_ra(N = 100, prob = 0.5)

Complete random assignment allocates a fixed number of units to each condition.

complete_ra(N = 100, m = 50)
complete_ra(N = 100, prob = 0.5)

Block random assignment conducts complete random assignment separately for groups of units.

blocks <- rep(c("A", "B", "C"),
              c(50, 100, 200))
# defaults to half of each block
block_ra(blocks = blocks)
# can change with block_m
block_ra(blocks = blocks,
         block_m = c(20, 30, 40))

Cluster random assignment allocates whole groups of units to conditions together.

clusters <- rep(letters, times = 1:26)
cluster_ra(clusters = clusters)

Block and cluster random assignment conducts cluster random assignment separately for groups of clusters.

clusters <- rep(letters, times = 1:26)
blocks <- rep(paste0("block_", 1:5),
              c(15, 40, 65, 90, 141))
block_and_cluster_ra(blocks = blocks,
                     clusters = clusters)

Multi Arm Trials

Set the number of arms with num_arms or with conditions.

complete_ra(N = 100, num_arms = 3)
complete_ra(N = 100, conditions = c("control",
            "placebo", "treatment"))

The *_each arguments in randomizr functions specify design parameters for each arm separately.

complete_ra(N = 100, m_each = c(20, 30, 50))
complete_ra(N = 100,
            prob_each = c(0.2, 0.3, 0.5))

If the design is the same for all blocks, use prob_each:

blocks <- rep(c("A", "B", "C"),
              c(50, 100, 200))
block_ra(blocks = blocks,
         prob_each = c(.1, .1, .8))

If the design is different in different blocks, use block_m_each or block_prob_each:

block_m_each <- rbind(c(10, 20, 20),
                      c(30, 50, 20),
                      c(50, 75, 75))
block_ra(blocks = blocks,
         block_m_each = block_m_each)

block_prob_each <- rbind(c(.1, .1, .8),
                         c(.2, .2, .6),
                         c(.3, .3, .4))
block_ra(blocks = blocks,
         block_prob_each = block_prob_each)

If conditions is numeric, the output will be numeric. If conditions is not numeric, the output will be a factor with levels in the order provided to conditions.

complete_ra(N = 100, conditions = -2:2)
complete_ra(N = 100, conditions = c("A", "B"))

Declaration

Learn about assignment procedures by “declaring” them with declare_ra()

declaration <-
  declare_ra(N = 100, m_each = c(30, 30, 40))
declaration # print design information

Conduct a random assignment:

conduct_ra(declaration)

Obtain observed condition probabilities (useful for inverse probability weighting if probabilities of assignment are not constant)

Z <- conduct_ra(declaration)
obtain_condition_probabilities(declaration, Z)

Sampling

All assignment functions have sampling analogues: Sampling is identical to a two arm trial where the treatment group is sampled.

Assignment               Sampling
simple_ra()              simple_rs()
complete_ra()            complete_rs()
block_ra()               strata_rs()
cluster_ra()             cluster_rs()
block_and_cluster_ra()   strata_and_cluster_rs()
declare_ra()             declare_rs()
conduct_ra()             draw_rs()

Stata

A Stata version of randomizr is available, with the same arguments but different syntax:

ssc install randomizr
set obs 100
complete_ra, m(50)
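The difference between simple, complete, and block random assignment is easiest to see by counting treated units. The base R sketch below reproduces the three ideas without randomizr; it is an illustration of the definitions given in this sheet, not the package's implementation, and the seed and variable names are arbitrary.

```r
set.seed(1)
N <- 100

# Simple: an independent coin flip per unit, so the treated count varies by draw.
Z_simple <- rbinom(N, size = 1, prob = 0.5)

# Complete: exactly m units are treated, in random order.
m <- 50
Z_complete <- sample(rep(c(1, 0), times = c(m, N - m)))

# Block: complete assignment (half treated) carried out within each block.
blocks <- rep(c("A", "B", "C"), c(50, 100, 200))
Z_block <- ave(seq_along(blocks), blocks, FUN = function(idx) {
  n <- length(idx)
  sample(rep(c(1, 0), times = c(n %/% 2, n - n %/% 2)))
})

print(sum(Z_simple))                 # varies around 50
print(sum(Z_complete))               # always exactly 50
print(tapply(Z_block, blocks, sum))  # exactly half of each block
```

Fixing the counts, overall or per block, removes the chance imbalance that coin flipping allows, which is the reason to prefer complete or block assignment when group sizes matter.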
randomizr is part of the DeclareDesign suite of packages for designing, implementing, and analyzing social science research designs.
CC BY SA Alex Coppock • [email protected] • declaredesign.org • Learn more at randomizr.declaredesign.org • package version 0.16.0 • Updated: 2018-06